Example command: Phasing with a reference panel

PHASING WITH A REFERENCE PANEL

Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the -phase option.

Here, we extend a basic phasing analysis to incorporate a phased reference panel. Population-based phasing methods work by pooling linkage disequilibrium information across individuals, so adding a panel of high-quality haplotypes can improve phasing accuracy.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
-phase \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 20000 \
-o ./Example/example.chr22.phasing.impute2

Comments

The -o file is always reserved for imputation output, so the phased haplotypes in this example are written to a separate file named ./Example/example.chr22.phasing.impute2_haps; the _haps suffix is added automatically. The format of this output file is explained in the IMPUTE2 documentation on output file formats.
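As a minimal sketch of the naming rule above, the following shell lines derive the haplotype file name from the -o prefix and show the first few lines of the file if it exists. The variable names are ours and are not part of IMPUTE2.

```shell
# Prefix passed to -o in the example command.
out_prefix=./Example/example.chr22.phasing.impute2

# IMPUTE2 appends _haps to the -o prefix for the phased haplotypes.
haps_file="${out_prefix}_haps"

if [ -f "$haps_file" ]; then
  # Peek at the first few phased SNPs.
  head -n 3 "$haps_file"
else
  echo "not found (run the example command first): $haps_file"
fi
```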

The reference panel in this example includes SNPs that are not present in the -g file. IMPUTE2 can simultaneously impute these untyped SNPs and phase the typed SNPs in that file, but it will not phase the untyped SNPs. As a result, the main output file (./Example/example.chr22.phasing.impute2) will include estimated genotypes for all study and reference SNPs, whereas the phased haplotype output file (./Example/example.chr22.phasing.impute2_haps) will include only the SNPs from the -g file. We decided not to have the program produce haplotypes at reference-panel-only SNPs because the computation needed to provide good estimates is much greater than that needed either to phase just the input genotypes or to impute the untyped SNPs without phasing them. If you really want to try phasing the untyped SNPs as well, please contact us.

If you don't care about imputing the reference-panel-only SNPs into your study data (i.e., you just want to phase the original genotypes), you can substantially speed up the inference by adding "-os 2" to the command line. This tells the program to "output SNPs of type 2", which are ones with input data in both the reference and study panels. By implicitly telling the program not to output other kinds of SNPs (e.g., those typed only in the reference panel), you allow it to avoid wasting calculations that won't contribute to the final output.
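For concreteness, this is the example command with -os 2 added; all other arguments are unchanged. The guard around the call is our addition, so that the snippet prints the command instead of failing when the impute2 executable is not in the current directory.

```shell
# Arguments from the example command, plus -os 2 to restrict output
# to SNPs with data in both the reference and study panels.
set -- \
  -phase \
  -m ./Example/example.chr22.map \
  -h ./Example/example.chr22.1kG.haps \
  -l ./Example/example.chr22.1kG.legend \
  -g ./Example/example.chr22.study.gens \
  -strand_g ./Example/example.chr22.study.strand \
  -int 20.4e6 20.5e6 \
  -Ne 20000 \
  -os 2 \
  -o ./Example/example.chr22.phasing.impute2

if [ -x ./impute2 ]; then
  ./impute2 "$@"
else
  # Not in the download directory: show what would have been run.
  echo "impute2 not found here; would run: ./impute2 $*"
fi
```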

Here we have used the -strand_g option to provide a strand file to the program. This file tells IMPUTE2 how to align the allele coding between the study genotypes (-g file) and the reference haplotypes (-h and -l files). You must always align the allele codings across your input datasets, either before running IMPUTE2 or during a run with the strand alignment options described in the IMPUTE2 documentation.

In our experience this phasing procedure works well for SNP chip data, but it may have statistical convergence issues in datasets with high marker density, such as those that result from resequencing studies of population samples. If you would like to phase that kind of dataset, please contact us for suggestions about how to improve the quality of inference.

This example tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome. IMPUTE2 assumes that each input file contains only one chromosome and that all input files in a single run come from the same chromosome. Applying the program to a much larger region (say, a whole chromosome or the whole genome) requires running many such jobs with different values of the -int parameter, usually in parallel on a computing cluster. For more details about how to do this, see the IMPUTE2 documentation on analyzing whole chromosomes.
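The chunking idea described above can be sketched as a small shell function that prints one command per interval. This is only an illustration: the region bounds, the chunk size, and the per-chunk output naming scheme are placeholder assumptions, not recommendations from the IMPUTE2 authors, and the function prints the commands rather than running them.

```shell
# Arguments shared by every chunk (taken from the example command).
common_args="-phase -m ./Example/example.chr22.map -h ./Example/example.chr22.1kG.haps -l ./Example/example.chr22.1kG.legend -g ./Example/example.chr22.study.gens -strand_g ./Example/example.chr22.study.strand -Ne 20000"

# Print one impute2 command per non-overlapping chunk of a region.
# Usage: print_chunk_commands REGION_START REGION_END CHUNK_SIZE
print_chunk_commands() {
  chunk_start=$1
  region_end=$2
  chunk_size=$3
  while [ "$chunk_start" -le "$region_end" ]; do
    chunk_end=$((chunk_start + chunk_size - 1))
    # Truncate the last chunk at the end of the region.
    if [ "$chunk_end" -gt "$region_end" ]; then
      chunk_end=$region_end
    fi
    # Hypothetical naming scheme: one output prefix per chunk.
    echo "./impute2 $common_args -int $chunk_start $chunk_end -o ./Example/example.chr22.phasing.${chunk_start}_${chunk_end}.impute2"
    chunk_start=$((chunk_end + 1))
  done
}

# Placeholder values: a 1 Mb region split into 500 kb chunks.
print_chunk_commands 20000001 21000000 500000
```

Each printed line could then be submitted as a separate job on a cluster. The sketch does not address how phased haplotypes from neighboring chunks are joined.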

We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please contact us.

How to use example commands

All of the data files in the example command above are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the download instructions, you can reach the appropriate directory by typing "cd impute_v2.X.Y_i386/" on the command line.
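One way to confirm that you are in the right directory is to check for the executable and the input files named in the example command. The helper below is a hypothetical convenience of ours, not something shipped with IMPUTE2.

```shell
# Report any file from the example command that is not reachable
# from the current directory; succeed only if none are missing.
check_example_files() {
  missing=0
  for f in \
    ./impute2 \
    ./Example/example.chr22.map \
    ./Example/example.chr22.1kG.haps \
    ./Example/example.chr22.1kG.legend \
    ./Example/example.chr22.study.gens \
    ./Example/example.chr22.study.strand
  do
    if [ ! -e "$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

if check_example_files; then
  echo "all example inputs found"
else
  echo "run this from the main download directory"
fi
```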

Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit 'enter' to start the run.)

Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
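The equivalence can be checked with any command. The sketch below uses printf rather than impute2 so that it runs anywhere; note that the '\' must be the very last character on its line, with no trailing spaces after it.

```shell
# The same command written across several lines with '\' ...
multi_line=$(printf '%s %s %s\n' one \
  two \
  three)

# ... and on a single line.
single_line=$(printf '%s %s %s\n' one two three)

if [ "$multi_line" = "$single_line" ]; then
  echo "identical: $multi_line"
fi
```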

You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see the IMPUTE2 options documentation.