IMPUTATION WITH ONE PHASED REFERENCE PANEL (PLUS VARIANT FILTERING)
This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis after flexibly removing a subset of sites from the reference panel.
The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:
./impute2 \
-filt_rules_l 'eur.maf<0.01' 'afr.maf<=0.05' \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.annot.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
The main novelty here is the use of the -filt_rules_l option. This option works by defining "filtering rules" that combine annotation categories (here, eur.maf and afr.maf) with comparison operators (< and <=) and cutoff values (0.01 and 0.05). Each annotation name appears as a column header on the first line of the -l file, above a column of numeric values (one for each site in the reference panel) that determine whether a given site should be filtered from the reference set. In this example, the filtering rules tell IMPUTE2 to ignore reference variants with minor allele frequency less than 1% in a European panel OR less than or equal to 5% in an African panel. (Filtering rules are always applied in 'OR' fashion: a site is removed if it satisfies any one of the rules.)
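To get a feel for what a pair of rules would remove before starting a run, the 'OR' logic can be mimicked on a legend file with awk. The following is only a rough sketch and not IMPUTE2's own implementation: the toy legend file, its values, the names of its first four columns, and the count_filtered helper are all made up for illustration.

```shell
# Toy legend file (made-up sites and frequencies, for illustration only).
legend=$(mktemp)
cat > "$legend" <<'EOF'
id position a0 a1 eur.maf afr.maf
rs1 20400010 A G 0.005 0.200
rs2 20400020 C T 0.150 0.050
rs3 20400030 G A 0.300 0.100
rs4 20400040 T C 0.010 0.060
EOF

# Hypothetical helper: count the sites that satisfy
# 'eur.maf<0.01' OR 'afr.maf<=0.05'.
count_filtered() {
  awk '
    # Header line: remember which column holds each annotation name.
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data lines: a site counts if EITHER rule is satisfied.
    $(col["eur.maf"]) < 0.01 || $(col["afr.maf"]) <= 0.05 { n++ }
    END { print n + 0 }
  ' "$1"
}

n_filtered=$(count_filtered "$legend")
echo "$n_filtered"   # prints 2 (rs1 matches the first rule, rs2 the second)
rm -f "$legend"
```

Note that rs4 is kept: its eur.maf of 0.010 is not strictly less than 0.01, and its afr.maf of 0.060 exceeds 0.05.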
You can make your own filtering rules by adding numeric annotation columns to a reference legend (-l) file, or you can use the annotations that we provide in some of our reference panel download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1000 Genomes Phase 1 integrated variant reference panel.
USAGE GUIDELINES FOR FILTERING RULES: Our main motivation in creating the -filt_rules_l option was to provide a fast and easy way of reducing the computational burden of large, sequence-based reference panels. A principled way to do this is to remove the reference SNPs that are expected to provide the least power in an imputation-based association analysis. We suggest that the rarest SNPs in a dataset fall into this category, both because there is generally less power to detect these under many study designs and because such SNPs are often harder to impute, which further diminishes the real power for detection. So, one simple approach is to use a minor allele frequency filtering rule (e.g., 'eur.maf<0.01') for MAF annotations from a population like the one being studied.
This example tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome (IMPUTE2 assumes there is only one chromosome per input file, and that all input files in a single run come from the same chromosome). Applying the program to a much larger region (say, a whole chromosome or the whole genome) requires running many such jobs with different values of the -int parameter, usually in parallel on a computing cluster. For more details about how to do this, see here.
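As a rough illustration of splitting a larger region into jobs, the sketch below prints one pair of -int boundaries per chunk. The print_chunks helper, the 5 Mb chunk size, and the coordinates are assumptions chosen for the example, and the sketch assumes that consecutive, non-overlapping intervals are what you want; it does not launch IMPUTE2 or submit anything to a cluster.

```shell
# Hypothetical helper: print one "-int <lower> <upper>" pair per chunk
# covering the positions from start to end (both inclusive).
print_chunks() {
  start=$1
  end=$2
  size=$3
  lo=$start
  while [ "$lo" -le "$end" ]; do
    hi=$((lo + size - 1))
    # Clip the final chunk so it does not run past the requested end.
    if [ "$hi" -gt "$end" ]; then
      hi=$end
    fi
    echo "-int $lo $hi"
    lo=$((hi + 1))
  done
}

# Example: cover a 12 Mb stretch with 5 Mb chunks (illustrative numbers).
chunks=$(print_chunks 20000001 32000000 5000000)
echo "$chunks"
# prints:
# -int 20000001 25000000
# -int 25000001 30000000
# -int 30000001 32000000
```

Each printed pair would take the place of the "-int 20.4e6 20.5e6" line in the example command, with one job per pair; each job would also need its own output location so that results are not overwritten.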
How to use example commands
All of the data files in the example command above are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "cd impute_v2.X.Y_i386/" on the command line.
Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit 'enter' to start the run.)
Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
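If you want to convince yourself that the two ways of typing a command are equivalent, a small experiment like the following may help. It uses the standard printf utility as a stand-in for the program (an assumption made so the sketch runs without IMPUTE2 installed) and passes it the same arguments once across several lines and once on a single line.

```shell
# Arguments split across lines with trailing backslashes.
multi=$(printf '%s\n' \
  -int 20.4e6 20.5e6 \
  -Ne 20000)

# The same arguments on one line, separated by spaces.
single=$(printf '%s\n' -int 20.4e6 20.5e6 -Ne 20000)

# The shell hands identical argument lists to the program in both cases.
if [ "$multi" = "$single" ]; then
  echo "identical arguments"
fi
```

One practical caveat: the backslash must be the very last character on its line. A space after it stops the shell from joining the lines.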
You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.