Example command: Imputation with one phased reference panel (pre-phasing)

This is the most common genotype imputation scenario: we want to use a panel of reference haplotypes to impute SNPs that were not typed in a study. Here, we show how to perform this task via pre-phasing, which is an approach that speeds up the imputation process by splitting it into two steps: (i) statistically phase the study genotypes; (ii) impute from the reference panel into the estimated study haplotypes.

The following commands show how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

Step 1: Pre-phasing

./impute2 \
-prephase_g \
-m ./Example/example.chr22.map \
-g ./Example/example.chr22.study.gens \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.prephasing.impute2

Step 2: Imputation into pre-phased haplotypes

./impute2 \
-use_prephased_g \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-known_haps_g ./Example/example.chr22.prephasing.impute2_haps \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.one.phased.impute2

Comments

Pre-phasing is a useful technique for speeding up an imputation run, but it is even more useful if you want to impute a single study dataset from multiple reference panels (e.g., successive updates to the reference haplotypes released by the 1000 Genomes Project). In that situation, you can perform the pre-phasing step just once and save the estimated haplotypes; you can then use the same study haplotypes to perform the imputation step with each new reference panel.
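
As a rough sketch of this workflow, the loop below reuses one pre-phased haplotype file for several reference panels. The panel prefixes (panelA, panelB) and the output names derived from them are hypothetical placeholders, not files shipped with IMPUTE2; the flags simply mirror Step 2 above. By default the script only prints each command so it can be inspected first; setting DRY_RUN=0 would execute them instead.

```shell
#!/bin/sh
# Sketch: impute one pre-phased study dataset from several reference panels.
# PANELS holds hypothetical panel prefixes; replace them with your own files.
PANELS="${PANELS:-./Example/panelA ./Example/panelB}"
STUDY_HAPS="./Example/example.chr22.prephasing.impute2_haps"  # from Step 1
DRY_RUN="${DRY_RUN:-1}"  # 1 = print commands only, 0 = run them

impute_from_panel() {
    panel="$1"
    # Same arguments as Step 2; only the panel files and output name change.
    set -- ./impute2 -use_prephased_g \
        -m ./Example/example.chr22.map \
        -h "${panel}.haps" \
        -l "${panel}.legend" \
        -known_haps_g "$STUDY_HAPS" \
        -strand_g ./Example/example.chr22.study.strand \
        -int 20.4e6 20.5e6 \
        -Ne 20000 \
        -o "${panel}.imputed.impute2"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# The pre-phasing step is not repeated: every panel reuses STUDY_HAPS.
for p in $PANELS; do
    impute_from_panel "$p"
done
```

The pre-phasing command from Step 1 is deliberately absent from the loop, since its output is computed once and shared by every run.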

If you are using IMPUTE2 for both the pre-phasing and subsequent imputation, it is important to use the same values of the -int parameter in both steps.

The -prephase_g flag activates the features needed for pre-phasing. Most importantly, it tells the program to estimate and print phased haplotypes at the SNPs included in the -g file; the haplotypes are written to a file named "[-o]_haps", where [-o] is the name supplied for the main output file. These haplotypes include SNPs in the buffer regions that flank the main region specified via -int, which helps prevent edge effects in downstream imputation runs.

It is possible to include a reference panel in the pre-phasing step, and this may improve the phasing quality. See here for an example of this kind of analysis (note that the linked example is missing the -prephase_g flag). To expedite the pre-phasing in this scenario, the program will not impute reference-only variants when -prephase_g is active, although you can override this behavior with the -os option.

You can use the -strand_g option in either the pre-phasing or downstream imputation step, but you should not use it in both. Strand alignment is not usually necessary when you just want to phase a dataset, but it is important when that dataset will be combined with a reference panel in a downstream analysis, as in this case.

Note that the file supplied to the -known_haps_g argument in the imputation step is the estimated haplotypes file from the pre-phasing step ("[-o]_haps"). Also note that the -use_prephased_g flag must be provided when imputing into pre-phased haplotypes.
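
One way to keep the two steps consistent is to define the shared settings once and derive the haplotype file name from the pre-phasing output name. The sketch below is only illustrative: the variable and function names (INT_LO, PREPHASE_OUT, prephase_cmd, and so on) are our own, and the script prints the two commands from this page rather than running them.

```shell
#!/bin/sh
# Shared settings, defined once so both steps use identical values.
INT_LO="20.4e6"
INT_HI="20.5e6"
NE="20000"
MAP="./Example/example.chr22.map"
PREPHASE_OUT="./Example/example.chr22.prephasing.impute2"

# IMPUTE2 writes the phased study haplotypes to "[-o]_haps".
STUDY_HAPS="${PREPHASE_OUT}_haps"

# Step 1: pre-phasing command (printed, not executed).
prephase_cmd() {
    echo ./impute2 -prephase_g \
        -m "$MAP" \
        -g ./Example/example.chr22.study.gens \
        -int "$INT_LO" "$INT_HI" \
        -Ne "$NE" \
        -o "$PREPHASE_OUT"
}

# Step 2: imputation into the haplotypes produced by Step 1.
impute_cmd() {
    echo ./impute2 -use_prephased_g \
        -m "$MAP" \
        -h ./Example/example.chr22.1kG.haps \
        -l ./Example/example.chr22.1kG.legend \
        -known_haps_g "$STUDY_HAPS" \
        -strand_g ./Example/example.chr22.study.strand \
        -int "$INT_LO" "$INT_HI" \
        -Ne "$NE" \
        -o ./Example/example.chr22.one.phased.impute2
}

prephase_cmd
impute_cmd
```

Because both functions read the same INT_LO and INT_HI variables, changing the interval in one place changes it for both steps.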

This example tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome (IMPUTE2 assumes there is only one chromosome per input file, and that all input files in a single run come from the same chromosome). Applying the program to a much larger region—say, a whole chromosome or the whole genome—requires running many such jobs with different values of the -int parameter, usually in parallel on a computing cluster. For more details about how to do this, see here.
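
The splitting itself can be sketched with a small shell function that prints one pair of -int bounds per chunk. The 5 Mb chunk size and the 12 Mb region in the usage lines are illustrative assumptions rather than values taken from this page, and the function name print_chunks is our own.

```shell
#!/bin/sh
# print_chunks START END SIZE
# Prints "lower upper" bounds for consecutive, non-overlapping chunks
# that together cover positions START..END (inclusive).
print_chunks() {
    start="$1"; end="$2"; size="$3"
    lo="$start"
    while [ "$lo" -le "$end" ]; do
        hi=$((lo + size - 1))
        # The last chunk is clipped to the end of the region.
        if [ "$hi" -gt "$end" ]; then
            hi="$end"
        fi
        echo "$lo $hi"
        lo=$((hi + 1))
    done
}

# Example: a hypothetical 12 Mb region split into 5 Mb chunks.
# Each printed line corresponds to the -int argument of one independent job.
print_chunks 1 12000000 5000000 | while read -r lo hi; do
    echo "-int $lo $hi"
done
```

With these inputs the loop prints three lines, the last of which covers the shorter final chunk from 10000001 to 12000000.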

How to use example commands

All of the data files in the example command above are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "cd impute_v2.X.Y_i386/" on the command line.
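
A quick way to confirm that you are in the right directory is to check that the files named in the example commands can be found from there. The helper below is a sketch of our own (check_example_files is not part of IMPUTE2); it only tests for the paths that appear in the commands on this page.

```shell
#!/bin/sh
# check_example_files: print any file used by the example commands that
# is missing relative to the current directory; print nothing if all exist.
check_example_files() {
    for f in \
        ./impute2 \
        ./Example/example.chr22.map \
        ./Example/example.chr22.study.gens \
        ./Example/example.chr22.study.strand \
        ./Example/example.chr22.1kG.haps \
        ./Example/example.chr22.1kG.legend
    do
        [ -e "$f" ] || echo "missing: $f"
    done
}

if [ -z "$(check_example_files)" ]; then
    echo "All example inputs found; this looks like the main download directory."
else
    check_example_files
    echo "Try changing into the unpacked IMPUTE2 directory first."
fi
```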

Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to press Enter to start the run.)

Note that most lines in the example command end with the '\' character. The backslash is not an argument to IMPUTE2; it is the shell's line-continuation character, which means "keep reading the next line as part of a single command." We use it to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
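
The equivalence can be checked directly in the terminal. The short sketch below uses echo in place of the real program, so nothing is actually run; it compares a multi-line argument list against the same list on one line.

```shell
#!/bin/sh
# The same argument list written across several lines with '\' ...
multi=$(echo ./impute2 \
    -int 20.4e6 20.5e6 \
    -Ne 20000)

# ... and on a single line.
single=$(echo ./impute2 -int 20.4e6 20.5e6 -Ne 20000)

# The shell joins continued lines before running the command,
# so both forms produce the same argument list.
if [ "$multi" = "$single" ]; then
    echo "identical"
fi
```

One practical caveat: the '\' must be the very last character on its line. If a space follows it, the shell no longer treats the next line as a continuation.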

You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.