IMPUTE2

IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009:

B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]

IMPUTE2 also includes features that were introduced in other publications, which you can find here.

The figure below shows the most common scenario in which imputation is used: unobserved genotypes (red question marks) in a set of study individuals are imputed (or predicted) using a set of reference haplotypes and genotypes from a SNP chip.

[Figure: Imputation Scenario A]

Getting Started

IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.

You should begin by downloading the program from here. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.

Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on Examples shows how to use the most common IMPUTE2 functions. We suggest that you work through these examples and try to understand what the elements of each command are doing. If you don't understand something or would like to know if the program can perform a function that isn't listed, you can read our FAQ or submit a question to our mail list.

When you have learned the basic functionality of the program, you can use several features of this website to prepare your own analysis:


What's New?

New release (9 Dec 2013)

We have released a new version of the 1000 Genomes Phase 1 haplotypes. These are an updated version of the haplotypes released on 16 Sept 2013. There was a small problem with the strand of the Illumina OMNI data we used as the scaffold. 730 SNPs across the genome were not aligned to the + strand of the human genome reference. This does not affect the phasing of the haplotypes, but does affect downstream imputation, especially if these SNPs were genotyped directly in the study being imputed. The new haplotypes were not re-phased. We just switched the strand of the 730 affected SNPs.

The new haplotypes are available here.

New release (16 Sept 2013)

We have released a new version of the 1000 Genomes Phase 1 haplotypes. The haplotypes were phased using a new version of SHAPEIT2 that can handle genotype likelihoods and genotypes available from microarrays on the same samples. Using a set of validation genotypes at SNPs and biallelic indels, we have been able to show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants.

The new haplotypes are available here.

New software release (04 Jan 2013)

We have just released IMPUTE v2.3.0, which includes a number of new features and minor bug fixes. One valuable new function is a simple and robust approach for merging reference panels; for example, it is easy to combine 1,000 Genomes haplotypes with population-specific sequence data to capture the strength of both reference sets. We have also written detailed documentation for the concordance tables printed at the end of most IMPUTE2 runs.

Paper on "pre-phasing" study genotypes for faster imputation

We recently published an article called "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing" in Nature Genetics. This paper describes a strategy ("pre-phasing") for efficient genotype imputation with large reference panels. By reducing the computational burden of imputation, pre-phasing makes imputation-based studies feasible for groups with limited computing power, and it also makes it easier to re-impute existing GWAS datasets as more informative reference panels become available. You can learn more about pre-phasing with IMPUTE2 here.

Latest 1,000 Genomes Phase I reference panel

In March 2012, the 1,000 Genomes Project released a powerful reference panel known as "Phase I version 3". In August 2012, we modified this panel by excluding variants with only one copy of the minor allele (singletons) across all 1,092 individuals. Singleton variants are difficult to impute, yet they make up ~20% of all variants in the reference panel; removing them makes imputation faster without hurting the power for association mapping. You can download either the original reference panel or the modified version (which is labeled "macGT1" for "minor allele count greater than one") here.

Paper on imputation strategies for ancestrally diverse reference panels

We published an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. This paper describes our strategy for achieving high accuracy with ancestrally diverse reference panels, especially at low-frequency variants and in admixed study cohorts: we supply a cosmopolitan set of reference haplotypes to IMPUTE2, which can automatically find the most useful ones for each study individual with the help of the tuning parameter -k_hap. You can read more about the results that support this strategy in the article, and we provide practical suggestions for applying it here.

Pre-phasing with SHAPEIT

IMPUTE2's pre-phasing approach now works with phased haplotypes from SHAPEIT, a highly accurate phasing algorithm that can handle mixtures of unrelateds, duos, and trios. Details are available here. We highly recommend using SHAPEIT to infer the haplotypes underlying your study genotypes, then passing these to IMPUTE2 for imputation as shown in the second step of this example.


Download IMPUTE2

IMPUTE2 is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.

Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please send a message to our mail list.

The latest software release is v2.3.0. We support only the most recent version.

Platform File
Linux (x86_64) Static Executable impute_v2.3.0_x86_64_static.tgz
Linux (x86_64) Dynamic Executable impute_v2.3.0_x86_64_dynamic.tgz
Mac OSX Intel impute_v2.3.0_MacOSX_Intel.tgz
Linux i686 Dynamic Executable impute_v2.3.0_i686.tgz
Windows MS-DOS (Intel) impute_v2.3.0_Windows.tgz
Solaris 5.10 impute_v2.3.0_Solaris5.10.tar.gz

To unpack the files on a Linux computer, use a command like this:

tar -zxvf impute_v2.X.Y_i386.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files. We show how to perform various kinds of analyses with the example files here.
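
As a small illustration of the naming rule just described, the sketch below derives the directory name from an archive name with shell parameter expansion. The archive name is the placeholder used above, not a real release file, and the variable names are our own.

```shell
# Placeholder archive name from the text above; substitute the file you downloaded.
archive="impute_v2.X.Y_i386.tgz"

# Strip the '.tgz' suffix to get the directory that tar is expected to create.
install_dir="${archive%.tgz}"

# Print the two commands rather than running them, so this is safe to try anywhere.
echo "Unpack with:  tar -zxvf $archive"
echo "Then move to: cd $install_dir/"
```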


Download Reference Data

IMPUTE2 can use publicly available reference datasets, such as haplotypes from major sequencing projects, as well as customized reference panels, such as SNP genotypes from a fine-mapping study. If you would like to download a public dataset, just click the relevant link below, which will take you to a page with background information and download options for that dataset.

The latest reference panels are from 1,000 Genomes Phase 1. Note that the 1,000 Genomes Phase 1 integrated variant set is meant to be used with IMPUTE version 2.2.0 or later.

Link to download page NCBI build Haplotype release date Release status
1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) b37 Dec 2013

1000 Genomes Phase I integrated haplotypes (produced using SHAPEIT2) b37 Sep 2013

1000 Genomes Phase I integrated variant set b37 Mar 2012 Includes chrX; updated 24 Aug 2012
1000 Genomes Phase I (interim) b37 Jun 2011 Includes chrX; updated 19 Apr 2012
1000 Genomes (2010 interim) b37 Dec 2010
1000 Genomes Pilot + HapMap 3 b36 Jun 2010 / Feb 2009
1000 Genomes Pilot b36 Jun 2010
HapMap 3 (release #2) b36 Feb 2009 Includes chrX
HapMap 2 (release #24) b36 Oct 2008
HapMap 2 (release #22) b36 Jan 2008
HapMap 2 (release #21) b35 Jul 2006


Using Multi-Population Reference Panels

Overview

Human genetic variation resources, like those produced by HapMap 3 and the 1,000 Genomes Project, capture a broad cross-section of human genetic diversity: detailed variation data have now been collected from a variety of sampling locations in Africa, Asia, Europe, and the Americas. Large sequencing projects are actively expanding these datasets to include additional populations and deeper sampling within populations. These public databases provide powerful reference panels for genotype imputation studies.

In this context, one important question is how to choose a reference panel that will produce high imputation accuracy in a population of interest. The answer is seldom obvious because human populations have experienced complex demographic histories with many migration and mixture events. Consequently, it can be hard to decide which reference haplotypes should be used in a particular study.

We have proposed a simple and universal solution to this problem: we provide all available reference haplotypes to IMPUTE2, then let the software choose a "custom" reference panel for each individual to be imputed. There are several advantages to this approach:

Practical suggestions

There are a few program settings that you should be aware of when using IMPUTE2 with an ancestrally diverse reference panel:

How does it work?

As explained above, we believe that the best way to use IMPUTE2 with modern reference panels is to provide all available haplotypes to the program and let it choose which ones to use. Here, we explain how this approach works.

IMPUTE2 does not use population labels or other genome-wide measures of relatedness between individuals, either for the reference haplotypes or the individuals being imputed. Instead, it looks for reference haplotypes that share high sequence identity with the haplotypes of a particular study individual. These haplotypes constitute a "custom" reference panel that can be used to impute missing genotypes in the individual of interest.

This process is largely insensitive to the ancestral composition of the reference panel: as long as the panel contains haplotypes that share segments of recent common ancestry with individuals in a study, IMPUTE2 can find the shared segments and use them to impute missing alleles. Consequently, the reference panel does not need to be restricted to haplotypes that "match" the ancestry of the study individuals—it can also include other kinds of haplotypes:

Expert users will note that the model underlying IMPUTE2 is formally designed to represent genetic variation in a single population. This might imply that the method would have trouble using reference panels that include populations with different linkage disequilibrium patterns, nucleotide diversity levels, and allele frequency spectra. However, we have found that IMPUTE2 is extremely adaptable: it can find segments of shared ancestry in multi-population reference panels despite its simple model of human populations, and it is largely robust to changes in its model parameters. Imputation accuracy might theoretically be improved by more detailed modeling of population relationships (for example, the population labels that IMPUTE2 ignores might sometimes be informative), but we believe that our approach captures most of the potential accuracy in an efficient way.

Published results

We published our work supporting these ideas in an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. Please cite this paper and the original IMPUTE2 paper when using IMPUTE2 with multi-population reference panels like those from the 1,000 Genomes Project.


Examples

This section provides some example commands that illustrate typical applications of IMPUTE2. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). Detailed explanations are provided at each link below.

Run type Description
Imputation with one phased reference panel Basic scenario in which most people will use IMPUTE2.
Imputation with one phased reference panel (pre-phasing) As above, but with pre-phasing functionality to speed up the analysis.
Imputation with one phased reference panel (chromosome X) Basic imputation scenario applied to human chromosome X, which requires special program options.
Imputation with one phased reference panel (plus variant filtering) Basic imputation scenario with flexible filtering of reference panel variants.
Imputation with one unphased reference panel Basic imputation scenario adapted to unphased reference genotypes.
Imputation with two phased reference panels Extended functionality for imputing from multiple reference panels defined on different sets of variants.
Imputation with two phased reference panels (merge reference panels) Merge reference panels defined on different sets of variants and use combined panel for imputation.
Imputation with one phased and one unphased reference panel Specialized method for combining reference panels of different types.
Imputation with one phased and one unphased reference panel, with additional options As above, but illustrating a variety of options that can be used to customize the behavior of IMPUTE2.
Phasing Methodology for inferring haplotypes from unphased genotypes.
Phasing with a reference panel Phasing analysis aided by reference haplotypes.

How to use example commands

All of the data files in the example commands below are included in the Example/ directory that comes with the IMPUTE2 software download. You should run the command from the main download directory, which is the one that contains the impute2 executable. For example, if you just downloaded a software package named impute_v2.X.Y_i386.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "cd impute_v2.X.Y_i386/" on the command line.

Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit Enter to start the run.)

Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.
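
The equivalence of the two forms can be checked directly. In the sketch below, echo stands in for impute2 so that nothing is actually run, and the variable names are illustrative. Note that the backslash must be the very last character on its line: a trailing space after it breaks the continuation.

```shell
# The same command written two ways; 'echo' stands in for impute2.
multi_line=$(echo impute2 \
 -int 20.4e6 20.5e6 \
 -Ne 20000)

single_line=$(echo impute2 -int 20.4e6 20.5e6 -Ne 20000)

# The shell joins continued lines before splitting them into arguments,
# so both forms yield the same argument list.
if [ "$multi_line" = "$single_line" ]; then
  echo "identical"
fi
```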

You do not have to run IMPUTE2 exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.

Most of the examples below include the string "-int 20.4e6 20.5e6", which tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome. IMPUTE2 assumes there is only one chromosome per input file, and that all input files in a single run come from the same chromosome. Applying the program to a much larger region—say, a whole chromosome or the whole genome—requires running many such jobs with different values of the -int parameter, usually in parallel on a computing cluster. For more details about how to do this, see here.
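
One way to generate the -int values for such a set of jobs is sketched below. This is only an illustration: the function name, the region boundaries, and the 5 Mb chunk size are assumptions chosen for the example, and the sketch prints the arguments rather than launching IMPUTE2.

```shell
# Print one "-int <lower> <upper>" argument pair per chunk.
# Arguments: first position, last position, chunk size (all in base pairs).
print_chunks() {
  start=$1
  end=$2
  size=$3
  lower=$start
  while [ "$lower" -le "$end" ]; do
    upper=$((lower + size - 1))
    # Clip the final chunk at the end of the region.
    if [ "$upper" -gt "$end" ]; then
      upper=$end
    fi
    echo "-int $lower $upper"
    lower=$((upper + 1))
  done
}

# Hypothetical region: positions 20,000,001-35,000,000 in 5 Mb chunks.
print_chunks 20000001 35000000 5000000
```

Each printed line can be appended to an otherwise identical impute2 command, with a distinct -o file name per chunk so that the jobs do not overwrite one another's output.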


Imputation with one phased reference panel

This is the most common genotype imputation scenario: we want to impute untyped SNPs in a study dataset from a panel of reference haplotypes.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2



Imputation with one phased reference panel (pre-phasing)

This is the most common genotype imputation scenario: we want to use a panel of reference haplotypes to impute SNPs that were not typed in a study. Here, we show how to perform this task via pre-phasing, which is an approach that speeds up the imputation process by splitting it into two steps: (i) statistically phase the study genotypes; (ii) impute from the reference panel into the estimated study haplotypes.

The following commands show how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

Step 1: Pre-phasing

./impute2 \
 -prephase_g \
 -m ./Example/example.chr22.map \
 -g ./Example/example.chr22.study.gens \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.prephasing.impute2


Step 2: Imputation into pre-phased haplotypes

./impute2 \
 -use_prephased_g \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -known_haps_g ./Example/example.chr22.prephasing.impute2_haps \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2 \
 -phase
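
The two steps are linked by a file name: judging from the commands above, Step 2 reads the Step 1 output prefix with "_haps" appended. The sketch below makes that link explicit and checks for the file before Step 2 is attempted. The variable names are our own, and the check is a convenience rather than part of IMPUTE2.

```shell
# Output prefix passed to -o in Step 1 (taken from the example above).
prephase_out="./Example/example.chr22.prephasing.impute2"

# Step 2 reads the Step 1 prefix with "_haps" appended, as the file
# names in the two example commands indicate.
prephased_haps="${prephase_out}_haps"

# Only proceed to Step 2 once Step 1 has produced its haplotype file.
if [ -f "$prephased_haps" ]; then
  echo "Found $prephased_haps; ready for Step 2 (-known_haps_g)."
else
  echo "Missing $prephased_haps; run Step 1 first."
fi
```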

Comments

Step 1: Chromosome X Pre-phasing

./impute2 \
 -prephase_g \
 -chrX \
 -m ./Example/chrX/example.chrX.map \
 -g ./Example/chrX/example.chrX.study.gen \
 -sample_g ./Example/chrX/example.chrX.study.sample \
 -int 10.3e6 10.7e6 \
 -Ne 20000 \
 -o ./Example/chrX/example.chrX.prephasing.impute2


Step 2: Imputation into pre-phased chromosome X haplotypes

./impute2 \
 -use_prephased_g \
 -chrX \
 -m ./Example/chrX/example.chrX.map \
 -h ./Example/chrX/example.chrX.reference.hap \
 -l ./Example/chrX/example.chrX.reference.legend \
 -known_haps_g ./Example/chrX/example.chrX.prephasing.impute2_haps \
 -int 10.3e6 10.7e6 \
 -Ne 20000 \
 -o ./Example/chrX/example.chrX.one.phased.impute2 \
 -phase



Imputation with one phased reference panel (chromosome X)

This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis on chromosome X, which requires special treatment due to the hemizygosity of males. (This example and the files in our download packages focus on the non-pseudoautosomal part of chromosome X.)

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -chrX \
 -m ./Example/chrX/example.chrX.map \
 -h ./Example/chrX/example.chrX.reference.hap \
 -l ./Example/chrX/example.chrX.reference.legend \
 -g ./Example/chrX/example.chrX.study.gen \
 -sample_g ./Example/chrX/example.chrX.study.sample \
 -int 10.3e6 10.7e6 \
 -Ne 20000 \
 -o ./Example/chrX/example.chrX.one.phased.impute2

Comments
File formats for chromosome X

Among human chromosomes, chromosome X is unique in that it is diploid (two copies) in females but hemizygous (one copy) in males. To deal with chromosome X data, IMPUTE2 requires that you use the -chrX flag and make some small changes to the input file formats.


Imputation with one phased reference panel (plus variant filtering)

This example provides a twist on the common scenario of imputing untyped SNPs in a study dataset from a panel of reference haplotypes. Here, we want to perform the analysis after flexibly removing a subset of sites from the reference panel.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -filt_rules_l 'eur.maf<0.01' 'afr.maf<=0.05' 'TYPE==LOWCOV' \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.annot.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.impute2


Imputation with one unphased reference panel

It is not necessary for the reference panel to be phased: IMPUTE2 can do the phasing internally while accounting for the phase uncertainty. To use an unphased reference panel, simply replace the -h and -l files with a -g_ref file.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.unphased.impute2


Imputation with two phased reference panels

It is sometimes helpful to use multiple reference panels to impute genotypes in a single study. For example, we previously recommended combining reference haplotypes from the 1,000 Genomes Pilot Project and HapMap 3: the first set provided extensive coverage of polymorphisms in the genome, while the second set provided greater sample size at a subset of SNPs. We no longer recommend that you use this hybrid reference panel because the 1,000 Genomes Project has generated even richer reference sets (which you can download here), but some investigators may have additional reference data that could be used in this way.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
    ./Example/example.chr22.hm3.haps \
 -l ./Example/example.chr22.1kG.legend \
    ./Example/example.chr22.hm3.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.two.phased.impute2


Imputation with two phased reference panels (merge reference panels)

Many investigators have access to multiple reference panels that could inform their imputation analyses. For example, they might want to supplement the 1,000 Genomes haplotypes (which can be downloaded here) with dedicated sequencing data from a study population.

If you have two panels that have been phased and put into IMPUTE2's reference format (legend/haplotype file pairs), you can ask the program to merge them internally and impute your study genotypes by entering the following command, which uses example data that come with the program download:

./impute2 \
 -merge_ref_panels \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
    ./Example/example.chr22.hm3.haps \
 -l ./Example/example.chr22.1kG.legend \
    ./Example/example.chr22.hm3.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.two.phased.impute2


Imputation with one phased and one unphased reference panel

Sometimes it is useful to combine a phased reference panel with an unphased reference panel when imputing genotypes in a study. For example, Howie et al. (2009) considered a hybrid reference panel that included phased haplotypes from HapMap and unphased genotypes from population controls typed on multiple SNP chips (they referred to this configuration as "Scenario B"). By using the genetic information in both panels simultaneously, IMPUTE2 can achieve a better combination of accuracy and coverage than it would with either panel alone.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.one.phased.one.unphased.impute2


Imputation with one phased and one unphased reference panel, with additional options

Here we perform the same basic analysis as in this example, but we use a number of additional options to modify the behavior of IMPUTE2.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -exclude_snps_g_ref ./Example/example.chr22.reference.snp.exclusions \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -align_by_maf_g \
 -sample_g ./Example/example.study.samples \
 -exclude_samples_g ./Example/example.study.sample.exclusions \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -k 100 \
 -burnin 5 \
 -iter 20 \
 -pgs \
 -no_sample_qc_info \
 -o_gz \
 -o ./Example/example.chr22.complicated.impute2


Phasing

Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the -phase option.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -phase \
 -m ./Example/example.chr22.map \
 -g ./Example/example.chr22.study.gens \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.phasing.impute2

Comments

We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.


Phasing with a reference panel

Although IMPUTE2 was originally designed to impute missing genotypes, it can also be used for a classical phasing analysis in which we want to infer the haplotypes underlying a set of observed genotypes. This functionality is activated via the -phase option.

Here, we extend a basic phasing analysis to incorporate a phased reference panel. Population-based phasing methods work by pooling linkage disequilibrium information across individuals, so adding a panel of high-quality haplotypes can improve phasing accuracy.

The following command shows how to run this kind of analysis with IMPUTE2, using the example data that come with the program download:

./impute2 \
 -phase \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 20000 \
 -o ./Example/example.chr22.phasing.impute2

Comments

We have not yet posted instructions for how to reattach phased haplotypes across successive chunks along a chromosome. If you want to try this approach to phasing a whole chromosome, please send a message to our mail list.


Program Options

These links explain the command-line arguments that can be used to control IMPUTE2.

Option type Description
Required arguments The program will not run if these are not supplied.
Input file options A list of possible input files, with formatting requirements.
Output file options Naming conventions and options for controlling format of output files.
Basic options Options for controlling how the program processes input data.
Strand alignment options Options for aligning allele coding across data files.
Filtering options Options for controlling the filters that get applied to input data.
MCMC options Options for controlling the MCMC algorithm.
Pre-phasing options Options that facilitate pre-phasing and subsequent imputation.
Panel merging options Options for merging a pair of reference panels.
Chromosome X options Options for analyzing chromosome X data.
Expert options Options to be used by experts only.



Required arguments

This table shows the input arguments that you must supply in order for IMPUTE2 to run. These are just the minimum requirements; the program will not do anything useful unless you also supply other input options and/or data files.

Flag Default Description
-g <file>
REQUIRED unless -known_haps_g provided
none File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO.

If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option.
-m <file>
REQUIRED
none Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)").

All of our reference panel download packages come with appropriate recombination map files.
-int <lower> <upper>
REQUIRED
none Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes.

IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the -allow_large_regions flag.
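
Before submitting a job, it can be handy to check how wide an interval is, particularly when the boundaries are written in exponential notation. The sketch below uses awk for the arithmetic. The function name is hypothetical, and treating "7 Mb" as exactly 7,000,000 bp measured as upper minus lower is an assumption; IMPUTE2 may draw the boundary slightly differently.

```shell
# Report the width of an analysis interval and whether it exceeds 7 Mb.
# awk handles boundaries given in either long form or exponential notation.
check_interval() {
  awk -v lower="$1" -v upper="$2" 'BEGIN {
    width = upper - lower
    limit = 7000000   # assumed meaning of "7 Mb"
    if (width > limit)
      printf "%d bp: needs -allow_large_regions\n", width
    else
      printf "%d bp: ok\n", width
  }'
}

check_interval 5.42e6 10.42e6
check_interval 20400000 20500000
```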



Input file options

This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems). In all of these files, it is important that SNPs appear in base pair position order, from lowest to highest. It is also crucial that all SNP positions come from the same genome assembly (e.g., NCBI Build 37) so the program can combine information across input files.
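
The ordering and duplicate-position requirements can be checked before a run. The sketch below is one possible check, not a tool shipped with IMPUTE2. It assumes a legend-style file with a header line and the base pair position in the second column, as in the -l format described in the table below. The function name and the toy values are made up for illustration.

```shell
# Check that the position column is strictly increasing, i.e. sorted
# from lowest to highest with no duplicate positions.
# Assumes line 1 is a header and column 2 holds the base pair position.
check_positions() {
  awk 'NR > 1 {
    if (seen && $2 + 0 <= prev) {
      printf "line %d: position %s does not follow %s\n", NR, $2, prev
      bad = 1
    }
    prev = $2 + 0
    seen = 1
  }
  END { exit (bad ? 1 : 0) }' "$1"
}

# Toy legend file with made-up values, written to a temporary location.
toy=$(mktemp)
printf 'rsID position a0 a1\nrs1 100 A G\nrs2 250 C T\nrs3 250 G A\n' > "$toy"

if check_positions "$toy"; then
  echo "positions are in order"
else
  echo "positions need fixing"
fi
rm -f "$toy"
```

For other input files the position may sit in a different column, so the column index in the awk program would need to be adjusted accordingly.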

Flag Default Description
-g <file>
REQUIRED unless -known_haps_g provided
none File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO.

If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option.
-m <file>
REQUIRED
none Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)").

All of our reference panel download packages come with appropriate recombination map files.
-h <file 1> <file 2> none File of known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each -h file must be provided with a corresponding legend file. We provide formatted haplotypes from the HapMap Project and the 1,000 Genomes Project in our reference panel download packages.

In IMPUTE2, it is possible to specify two -h files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names.
-l <file 1> <file 2> none Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). We provide legend files for data from the HapMap Project and the 1,000 Genomes Project in our reference panel download packages.

When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order—i.e., the file with more SNPs comes first.
-g_ref <file> none File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file).
-known_haps_g <file> none File containing known haplotypes for the study cohort. The format is the same as the output format from IMPUTE2's -phase option: five header columns (as in the -g file) followed by two columns (haplotypes) per individual. Allowed values in the haplotype columns are 0, 1, and ?.

If your study dataset is fully phased, you can replace the -g file with a -known_haps_g file. This will cause IMPUTE2 to perform haploid imputation, although it will still report diploid imputation probabilities in the main output file. If any genotypes are missing, they can be marked as '? ?' (two question marks separated by one space) in the input file. (The program does not allow just one allele from a diploid genotype to be missing.) If the reference panels are also phased, IMPUTE2 will perform a single, fast imputation step rather than its standard MCMC module—this is how the program imputes into pre-phased GWAS haplotypes.

The -known_haps_g file can also be used to specify study genotypes that are "partially" phased, in the sense that some genotypes are phased relative to a fixed reference point while others are not. We anticipate that this will be most useful when trying to phase resequencing data onto a scaffold of known haplotypes. To mark a known genotype as unphased, place an asterisk immediately after each allele, with no space between the allele (0/1) and the asterisk (*); e.g., "0* 1*" for a heterozygous genotype of unknown phase.
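
To make the allele coding rules for a -known_haps_g file concrete, here is a small Python sketch that classifies each individual's pair of haplotype columns as phased, unphased, or missing. It follows the rules stated above (values 0, 1, and ?; '? ?' for a missing genotype; an asterisk after each allele for unknown phase). Treating a pair with only one asterisk as an error is our reading of "after each allele", and the function names and example row are hypothetical.

```python
PHASED = {"0", "1"}
UNPHASED = {"0*", "1*"}


def classify_genotype(first, second):
    """Classify one individual's two haplotype entries at a SNP."""
    if first == "?" and second == "?":
        return "missing"
    if first in PHASED and second in PHASED:
        return "phased"
    if first in UNPHASED and second in UNPHASED:
        return "unphased"
    # Covers a single missing allele, a single asterisk, and bad symbols.
    raise ValueError(f"invalid haplotype pair: {first!r} {second!r}")


def classify_row(line):
    """Classify every individual on one line of a -known_haps_g file.

    The first five fields are header columns; the rest come in pairs.
    """
    fields = line.split()
    haps = fields[5:]
    if len(haps) % 2 != 0:
        raise ValueError("expected two haplotype columns per individual")
    return [classify_genotype(haps[i], haps[i + 1])
            for i in range(0, len(haps), 2)]


if __name__ == "__main__":
    # Made-up row with four individuals.
    print(classify_row("snp1 rs1 1000 A G 0 1 0* 1* ? ? 1 1"))
```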



Output file options

The options in this table control the format and naming conventions of output files printed by IMPUTE2.

Flag Default Description
-o <file> ./test.impute2 Name of main output file. Follows the same format as the -g file.
-i <file> [-o]_info Name of SNP-wise information file with one line per SNP and a single header line at the beginning. This file always contains the following columns (header tags shown in parentheses):

1. SNP identifier from -g file (snp_id)
2. rsID (rs_id)
3. base pair position (position)
4. expected frequency of allele coded '1' in the -o file (exp_freq_a1)
5. measure of the observed statistical information associated with the allele frequency estimate (info) [details]
6. average certainty of best-guess genotypes (certainty)
7. internal "type" assigned to SNP (type)

Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX. IMPUTE2 assigns every SNP an internal "type" which reflects the combination of input datasets that include data for that SNP; here, X gives the type, which takes values in {0,1,2}. You can learn how the program determines SNP types here.

For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes and the best-guess imputed genotypes, where the input genotypes at that SNP have been masked internally and then imputed as if the SNP were of type X; similarly, r2_typeX is the squared correlation between input and masked/imputed genotypes at a SNP.

The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns are useful for post-hoc quality control; we will soon explain how we use them in our section on Best Practices for Imputation.
-r <file> [-o]_summary Name of log file that records a summary of the screen output.
-w <file> [-o]_warnings Name of file that records warnings generated by IMPUTE2.
-os <int> <int> ... 0 1 2 3 "Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., "-os 0 1 2 3".
-o_gz
Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large.
-outdp <int> 3 Specifies the number of decimal places to use for reporting genotype probabilities in the main output file.
-no_snp_qc_info
Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in the -i file.
-no_sample_qc_info
Suppresses printing of per-sample quality control metrics file. The default is to print a file named "[-i]_by_sample".
-phase
IMPUTE2 always implicitly phases the study genotypes (-g file), and this flag tells the program to print the best-guess haplotypes that result from the phasing process. In addition to the standard imputation output file, the program also prints a separate haplotype file named "[-o]_haps". This file contains the same five header columns as the standard output, along with two columns (haplotypes) per individual, in the same order they appear in the main output.

In addition to this "best-guess" haplotype file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties occur in a file named "[-o]_haps_confidence". In this file, homozygotes are represented by * characters and heterozygotes are represented by numbers between 0.5 and 1.0; this is the estimated probability that the phasing between the current heterozygote and the previous heterozygote (upstream) is correct. By convention, the first heterozygous SNP in each individual for a given analysis region is assigned a phasing certainty of 1.0.

As illustrated by our example commands, it is possible to use the -phase option to produce haplotypes without the use of a reference panel; i.e., to perform a classical phasing analysis.
-pgs
"Predict Genotyped SNPs": Tells the program to replace the input genotypes from the -g file with imputed genotypes in the -o file (applies to Type 2 SNPs only).
-pgs_miss
Unlike -pgs, which replaces all input genotypes with imputed genotypes, this option tells the program to replace only the missing genotypes at typed SNPs. That is, any input genotype whose maximum probability exceeds the -call_thresh will simply be reprinted in the -o file, whereas input genotypes that fall below the calling threshold will be imputed in the output.

WARNING: This is an appealing option that will "fill in" sporadically missing genotypes in your input data. However, it is possible that this could cause subtle problems in downstream association testing. We therefore suggest that you use caution when applying this option.


Details about 'info' metric

IMPUTE2 reports an information metric in the fifth column of its -i file. This metric is similar to the r-squared metrics reported by other programs like MaCH and Beagle. Although each of these metrics is defined differently, they tend to be correlated.

Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).

Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
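
A post-imputation filter of this kind can be sketched in a few lines of Python. The example below assumes the -i file is whitespace-delimited with the header tags listed in the output options (snp_id, rs_id, position, exp_freq_a1, info, certainty, type); the function name, the 0.3 cutoff, and the data rows are illustrative only and do not come from a real run.

```python
def filter_by_info(lines, threshold):
    """Return rows (as dicts keyed by header tag) whose 'info' value is
    at least the threshold. Rows with the undefined value -1 are dropped
    by any non-negative threshold."""
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    kept = []
    for fields in body:
        record = dict(zip(header, fields))
        if float(record["info"]) >= threshold:
            kept.append(record)
    return kept


if __name__ == "__main__":
    # Made-up rows in the column order described for the -i file.
    example = [
        "snp_id rs_id position exp_freq_a1 info certainty type",
        "--- rs1 100 0.120 0.950 0.980 0",
        "--- rs2 200 0.010 0.210 0.990 0",
        "--- rs3 300 0.400 -1 1.000 2",
    ]
    for record in filter_by_info(example, threshold=0.3):
        print(record["rs_id"], record["info"])
```

Looking up the column by its header tag rather than by index keeps the sketch working when the optional info_typeX, concord_typeX, and r2_typeX columns are present.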

We define our info metric and compare it against other metrics in a review paper that we recently published. If you have questions, please read that material first, then send a message to our mail list if anything is still unclear.


Basic options

These options control some basic processing that the program does to prepare input data for inference.

Flag Default Description
-int <lower> <upper>
REQUIRED
none Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes.

IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the -allow_large_regions flag.
-buffer <int> 250 kb Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int option. SNPs in the buffer regions inform the inference but do not appear in output files (unless you activate the -include_buffer_in_output flag).

Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. Larger buffers may improve accuracy for low-frequency variants (since such variants tend to reside on long haplotype backgrounds) at the cost of longer running times.
-allow_large_regions
Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here.
-include_buffer_in_output
Tells the program to include SNPs from the -buffer region in all output files. The main reason for using this option is to preserve the buffer information for downstream imputation, e.g. when pre-phasing a GWAS dataset.
-Ne <int> 20000 "Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to guide its model of linkage disequilibrium patterns. When most imputation runs were conducted with reference panels from HapMap Phase 2, we suggested values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT.

Modern imputation analyses typically involve reference panels with greater ancestral diversity, which can make it hard to determine the "ideal" -Ne value for a particular study. Fortunately, we have found that imputation accuracy is highly robust to different -Ne values; within each of several human populations, we have obtained nearly identical accuracy levels for values between 10000 and 25000. We suggest setting -Ne to 20000 in the majority of modern imputation analyses.
-call_thresh <float> 0.9 Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing.

NOTE: This threshold applies only to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions.
-nind <int> # of indiv in -g file Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set -nind 5. This option is useful for debugging and test runs.
-verbose
Print detailed output about the progress of imputation. By default, IMPUTE2 prints only the number of the current MCMC iteration when performing imputation, but this flag tells it to print more detailed updates.
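
The rule described for -call_thresh can be written out as a short Python sketch. This is only an illustration of the stated rule (use the most probable genotype if its probability exceeds the threshold, otherwise treat the genotype as missing), not IMPUTE2's internal code. It assumes each individual's genotype is given as three probabilities, and the function name is hypothetical.

```python
def call_genotype(probs, call_thresh=0.9):
    """Return the index (0, 1, or 2) of the most probable genotype if its
    probability strictly exceeds call_thresh, otherwise None (missing)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] > call_thresh else None


if __name__ == "__main__":
    # Made-up probability triples for three individuals at one SNP.
    for triple in [(0.02, 0.97, 0.01), (0.5, 0.5, 0.0), (0.0, 0.0, 0.0)]:
        print(triple, call_genotype(triple))
```

The comparison is strict because the description says the probability must exceed the threshold; a triple of all zeros is therefore always treated as missing.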



Strand alignment options

In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you. If you do not know exactly how your data were processed or what these options are doing, you should try to locate the original strand information or send a message to our mail list for assistance.

NOTE: IMPUTE2 will automatically align the strand between panels whenever it can do so unambiguously; e.g., flipping A/C in Panel 2 to match G/T in the reference. The options below pertain to variants where this is not possible, e.g. because an A/T SNP cannot be aligned by label alone.

NOTE: We currently assume that all phased reference files have already been aligned to the '+' strand of the human genome reference sequence, which is true of the files that we distribute; hence, the options here pertain only to study genotype files (like the -g and -known_haps_g files) and unphased reference files (i.e., a -g_ref file).

Flag Default Description
-strand_g <file> none File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space.

The ordering of the SNPs in this file does not matter (by contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). We provide model strand files in the Example/ directory that comes with the software download.
-strand_g_ref <file> none Same as -strand_g, but applies to the -g_ref file.
-align_by_maf_g
Activates the program's internal strand alignment procedure for the -g file (AKA Panel 2; for details about the panel nomenclature used here, see the overview). The strand is aligned to the alleles in reference Panel 0, if present, otherwise to reference Panel 1. This option pertains only to A/T and C/G SNPs, which it aligns such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele.

NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to align the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others.

NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The best way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured.

This flag replaces -fix_strand_g as of IMPUTE v2.2.
-align_by_maf_g_ref
Similar to -align_by_maf_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if Panel 0 was not provided (i.e., if you did not supply -l and -h files).

NOTE: Just as -align_by_maf_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over aligning the strand by MAF.

NOTE: As with -align_by_maf_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%.

This flag replaces -fix_strand_g_ref as of IMPUTE v2.2.
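
The distinction between SNPs that can be aligned by allele labels alone and those that cannot (A/T and C/G) can be illustrated with a short Python sketch. It is a simplified model written for this explanation, not the program's alignment procedure: it compares unordered allele pairs only and ignores allele frequencies and the 0/1 coding order. The function name and return labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def compare_alleles(study, reference):
    """Compare a study SNP's allele pair with the reference pair.

    Returns one of:
      'same'      - labels match and the strand is unambiguous
      'flip'      - labels match only after complementing the study alleles
      'ambiguous' - A/T or C/G SNP: labels match on either strand
      'mismatch'  - labels cannot be reconciled by a strand flip
    """
    study_set, ref_set = set(study), set(reference)
    flipped_set = {COMPLEMENT[allele] for allele in study}
    if study_set == ref_set:
        return "ambiguous" if flipped_set == ref_set else "same"
    if flipped_set == ref_set:
        return "flip"
    return "mismatch"


if __name__ == "__main__":
    print(compare_alleles(("A", "C"), ("G", "T")))  # the A/C vs G/T case above
    print(compare_alleles(("A", "T"), ("A", "T")))  # cannot be aligned by label
```

SNPs reported as 'ambiguous' are the ones for which a strand file or the -align_by_maf options would be needed.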



Filtering options

The options in this table affect the way that the program filters the input data. Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.

Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref.

Flag Default Description
-filt_rules_l <str> <str> ... none This option provides flexible variant filtering in the reference panel via "filter rules", which are based on annotation columns in a -l file. Each column should be labeled by a contiguous string (no whitespace) describing its contents. For example, the Example/ directory in the software download packages includes a file named example.chr22.1kG.annot.legend that contains columns named eur.maf and afr.maf and TYPE.

To filter variants based on the numeric annotation values in the -l file, you should combine a column string with a cutoff value and one of these six comparison operators: < <= > >= == != . For example, writing -filt_rules_l 'eur.maf<0.05' on the command line would tell the program to remove any variants with eur.maf values less than 0.05 from the reference panel. You can include an arbitrary number of filtering strings after the -filt_rules_l option, in which case the filtering conditions will be applied in 'or' fashion: if any condition is true, the variant will be removed.

It is very important that you enclose each filtering string in single quotes, as shown above. Otherwise, the command-line shell may interpret symbols like < and > as redirection operators. There should be no white space within the single quotes.

You can develop annotations yourself and add them to the -l file, or you can use the annotations that we provide in some of our reference download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1,000 Genomes Phase 1 integrated variant reference panel.

For an illustration of using -filt_rules_l in practice, see this example command.
-exclude_snps_g <file> none List of SNPs to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g file), their rsIDs (second column of -g file), or their base pair positions (third column of -g file). Excluded SNPs will be treated as if they had not been present in the genotypes file, and they will not be shown in the output unless you use the -impute_excluded option.
-exclude_snps_g_ref <file> none Same as -exclude_snps_g, but applies to the -g_ref file.
-impute_excluded
Specifies that SNPs excluded from the study dataset via the -exclude_snps_g option should be imputed and included in the output file. When this flag is not activated, excluded SNPs are simply ignored.
-include_snps <file> none List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the reference SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file).

This option does not have any effect on SNPs in the -g file.
-sample_g <file> none File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first two columns are necessary, but they must be present and labeled "ID_1" and "ID_2".

NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option, or if you are analyzing chromosome X data via the -chrX option.
-sample_g_ref <file> none Same as -sample_g, but applies to the -g_ref file.
-exclude_samples_g <file> none List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is REQUIRED if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "[-o]_samples", where -o is the name of the main output file.

NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option.
-exclude_samples_g_ref <file> none Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with -exclude_samples_g.
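
The 'or' semantics described for -filt_rules_l can be previewed outside the program. The Python sketch below parses filter strings such as 'eur.maf<0.05' and reports whether a variant would be removed under the rule that any true condition removes it. It mimics the behavior described above for numeric annotations only and is not IMPUTE2's own parser; the function names and annotation values are illustrative.

```python
import operator
import re

OPERATORS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}
# Two-character operators are listed first so '<=' is not read as '<'.
RULE = re.compile(r"^(.+?)(<=|>=|==|!=|<|>)(.+)$")


def parse_rule(rule):
    """Split a filter string into (column, comparison function, cutoff)."""
    match = RULE.match(rule)
    if match is None:
        raise ValueError(f"cannot parse filter rule: {rule!r}")
    column, symbol, cutoff = match.groups()
    return column, OPERATORS[symbol], float(cutoff)


def is_removed(annotations, rules):
    """True if any rule holds for this variant ('or' semantics)."""
    for rule in rules:
        column, compare, cutoff = parse_rule(rule)
        if compare(float(annotations[column]), cutoff):
            return True
    return False


if __name__ == "__main__":
    # Made-up annotation values for one variant.
    variant = {"eur.maf": "0.10", "afr.maf": "0.01"}
    print(is_removed(variant, ["eur.maf<0.05"]))
    print(is_removed(variant, ["eur.maf<0.05", "afr.maf<0.05"]))
```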



MCMC options

IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.

Flag Default Description
-iter <int> 30 Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly, although increasing -k generally leads to greater improvements for a fixed computational cost.
-burnin <int> 10 Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for unphased individuals during each of the first [-burnin] iterations, but these iterations do not contribute to the final imputation probabilities. We have found that 10 burn-in iterations is enough to ensure good results in a variety of different datasets.
-k <int> 80 Number of haplotypes (in the reference or study data) to use as templates when phasing observed genotypes. Increasing this value will lead to higher accuracy at the cost of longer running times, which scale quadratically with -k. The default value should be sufficient for most analyses.
-k_hap <int> 500 Number of reference haplotypes to use as templates when imputing missing genotypes. As a rule of thumb, you should set -k_hap to the number of reference haplotypes that you expect to be useful for your study population. If this value is less than the total number of haplotypes in your reference panel, IMPUTE2 will choose a "custom" set of -k_hap haplotypes each time it imputes missing alleles in a study haplotype.

If all of your reference haplotypes have similar ancestry to the subjects in your study, each haplotype is potentially useful for imputation, so the best accuracy can be achieved by setting -k_hap to the total number of reference haplotypes. Using smaller values will decrease the running time linearly while incurring a slight loss of accuracy.

Conversely, we now recommend running IMPUTE2 with large reference panels containing haplotypes of diverse ancestry. (For more details, see here.) In this context, our rule of thumb suggests setting -k_hap to be smaller than the total size of the reference panel. Imputation accuracy is robust to different values of -k_hap within a sensible range, so it should usually be sufficient to choose a value by intuition. When in doubt, we suggest that you err on the side of making -k_hap too large, since we often find that diverse reference panels contain more useful haplotypes than one might expect.

As of software version 2.3.0, -k_hap can accept two values when you are imputing from two reference panels—for example, '-k_hap 500 200'. In this context, the first value is the number of haplotypes to be chosen from Panel 0 and the second value is the number to be chosen from Panel 1. This flexibility can be useful when merging reference panels.



Pre-phasing options

You can greatly speed up your imputation through a process called "pre-phasing". The idea of this approach is to first phase your GWAS genotypes, then use the estimated GWAS haplotypes to impute untyped variants from a reference panel. The options in this table activate the corresponding functionality in IMPUTE2. You can see how these options are applied in this example command.

Flag Default Description
-prephase_g
Tells IMPUTE2 to phase the genotypes in the -g file. The estimated haplotypes are printed to a dedicated output file named "[-o]_haps", where [-o] is the name supplied for the main output file. To avoid edge effects in downstream imputation, IMPUTE2 will extend the estimated haplotypes into the buffer regions that flank the main region specified via -int.
-use_prephased_g
Tells IMPUTE2 to perform imputation with pre-phased GWAS haplotypes, which must be supplied via a -known_haps_g file. This file will often be produced by a pre-phasing run that used -prephase_g on the same imputation interval (-int), although it may also come from a different phasing algorithm like SHAPEIT, which can print haplotypes in -known_haps_g format.

We now recommend using SHAPEIT for pre-phasing and IMPUTE2 for downstream imputation.
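
The two-step workflow described in this table can be outlined as a pair of command lines. The Python sketch below only assembles and prints the argument lists; it does not run anything. The executable name "impute2" and all file names are placeholders, and only flags documented on this page are used. The name of the intermediate haplotype file follows the "[-o]_haps" convention described for -prephase_g.

```python
import shlex


def haps_file_for(out_prefix):
    """Name of the haplotype file written by a -prephase_g run."""
    return out_prefix + "_haps"


def prephase_command(gen_file, map_file, lower, upper, out_prefix,
                     exe="impute2"):
    """Step 1: phase the study genotypes in the -g file."""
    return [exe, "-prephase_g", "-m", map_file, "-g", gen_file,
            "-int", str(lower), str(upper), "-o", out_prefix]


def impute_command(haps_file, map_file, hap_ref, legend_ref,
                   lower, upper, out_file, exe="impute2"):
    """Step 2: impute into the pre-phased haplotypes from step 1."""
    return [exe, "-use_prephased_g", "-m", map_file,
            "-h", hap_ref, "-l", legend_ref,
            "-known_haps_g", haps_file,
            "-int", str(lower), str(upper), "-o", out_file]


if __name__ == "__main__":
    # Placeholder file names; the same interval is used in both steps.
    step1 = prephase_command("study.gen", "map.txt", 5420000, 10420000,
                             "study.prephased")
    step2 = impute_command(haps_file_for("study.prephased"), "map.txt",
                           "ref.hap", "ref.legend", 5420000, 10420000,
                           "study.imputed")
    for command in (step1, step2):
        print(" ".join(shlex.quote(arg) for arg in command))
```

Keeping the commands as argument lists makes it straightforward to reuse one pre-phasing result with several reference panels by changing only the -h and -l entries in the second step.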



Panel merging options

These options allow IMPUTE2 to efficiently combine two reference panels typed on partially overlapping sets of variants.

Flag Default Description
-merge_ref_panels
Tells the program to combine information across two reference panels using the approach described here.
-merge_ref_panels_output_ref <file> none Activates -merge_ref_panels and tells the program to store the merged panel in two output files: a legend file named <file>.legend and a haplotype file named <file>.hap.
-merge_ref_panels_output_gen <file> none Activates -merge_ref_panels and tells the program to store the merged panel in .gen format in an output file named <file>.gen.


NOTE: If you want IMPUTE2 to print a merged reference panel with buffer regions included, you should use one of the last two options together with the -include_buffer_in_output flag.

NOTE: You can see an example run that uses -merge_ref_panels here.


Chromosome X options

These options facilitate the analysis of genotype data from human chromosome X.

Flag Default Description
-chrX
Specifies that this is an analysis of chromosome X data. This flag changes the model parameters by automatically reducing the -Ne value by 25%, and it allows the -g file to include a mixture of diploid females and hemizygous males.

When using the -chrX option, it is essential to provide a -sample_g file with a column named 'sex', since this tells the program which individuals are males and which are females. More details on the file formats for chromosome X analysis are available here, and you can see an example run here.
-Xpar
Specifies that the current dataset comes from a pseudoautosomal region (PAR) of chromosome X, where both males and females are diploid. When used together with -chrX, this flag will reduce -Ne by 25% but otherwise run the analysis in the same way as on the autosomes.



Expert options

The options in this table are meant for experts only. Don't use them unless you know what you are doing!

Flag Default Description
-seed <int> random Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option.
-no_warn
Turns warnings off, so that the -w file does not get printed.
-fill_holes
Turns on the "hole-filling" function, which allows SNPs that are typed in the -g file but not in the lowest reference panel to contribute to the inference.
-no_remove
Prevents the program from discarding SNPs whose alleles cannot be aligned across panels. Such SNPs will be retained in the output, but they will not be used for inference.



Best Practices for Imputation

IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.

Pre-imputation filtering of study genotypes

Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.

Variant position matching across input files

When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by looking at their positions on the chromosome (as opposed, say, to their rsIDs). If two or more variants have the same position—perhaps because one is a SNP and one is an overlapping INDEL—then these variants are matched across panels based on their allele labels.

It is important to note that genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. In order to obtain high-quality results from IMPUTE2, you must make sure that the variant positions in your input files are mapped to the same coordinate system, or "assembly".

Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please send a message to our mail list.

Strand alignment between study and reference data

It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.
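
The two scenarios in this paragraph can be told apart with a simple allele comparison. The Python sketch below is only an illustration under stated assumptions: the function name classify_strand and its return labels are hypothetical, it handles single-base alleles (A, C, G, T) only, and it does not reproduce how IMPUTE2 itself checks strand.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_strand(study_alleles, ref_alleles):
    """Compare the allele pair of one SNP in study and reference data.

    Returns "ambiguous", "same", "flipped" or "mismatch" (hypothetical labels).
    """
    study = set(study_alleles)
    ref = set(ref_alleles)
    flipped = {COMPLEMENT[a] for a in study}
    if study == flipped:
        # A/T and C/G SNPs carry the same labels on both strands, so only
        # allele frequencies can reveal a strand flip.
        return "ambiguous"
    if study == ref:
        return "same"
    if flipped == ref:
        return "flipped"
    return "mismatch"
```

For example, classify_strand(("A", "G"), ("T", "C")) reports a flip, while an A/T SNP is reported as ambiguous and would need a frequency comparison.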

Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the options described here. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.

Choosing a reference panel

Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy can increase accuracy at low-frequency variants, and it avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population. You can read our paper on this strategy here, learn about practical ways of applying it here, and download state-of-the-art reference haplotypes here.

If you have collected a custom reference panel for your study population (say, exome-wide or genome-wide sequencing data), you can combine it with the 1000 Genomes data to maximize accuracy and genomic coverage at the same time. To learn how IMPUTE2 does this, see here.

Genome-wide imputation

It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide a few mechanisms to help with this process:

  1. IMPUTE2 includes command-line parameters that can be used to split the genome into discrete chunks for parallel analysis on a computing cluster. These parameters allow flexible partitioning of the genome with minimal manipulation of input files. See here for suggestions on how to use this functionality.
  2. IMPUTE2 is an efficient imputation method, but it still requires substantial computing time to process the whole genome in a large number of individuals. We have recently developed an approach called "pre-phasing" that greatly reduces the computational burden of imputation while sacrificing only a little accuracy; you can read more about the approach here. We now recommend this as the standard way of performing genome-wide imputation, although we still prefer the original IMPUTE2 MCMC algorithm for maximizing accuracy in smaller regions.
  3. Sequence-based reference panels contain large numbers of rare and low-frequency variants, which can drive up the computational cost of imputation. When computing power is limited, it may be desirable to remove some of these variants (e.g., those with very low frequencies in the population of interest) before running imputation. To facilitate this process, we have added the -filt_rules_l option, which can flexibly remove reference variants based on command-line input to an IMPUTE2 run. You can see an example application of this approach and some guidelines for using it here.

Post-imputation filtering

It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.

Association testing

We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.

Follow-up imputation of putative associations

Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:

Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.


Pre-Phasing GWAS

Improvements in sequencing and genotyping technologies have rapidly increased the amount of reference data that can be used to impute untyped SNPs in association studies. Larger reference panels improve the power and resolution of imputation-based association mapping, but they also increase the computational burden of imputation. To help offset this cost, we have developed an extension of the IMPUTE2 methodology.

The basic idea is to "pre-phase" your study genotypes to produce best-guess haplotypes, then impute into these estimated haplotypes in a separate program run. By contrast, the original IMPUTE2 method integrates over the unknown phase of your study data during the course of an imputation analysis. Pre-phasing leads to a small loss of accuracy because it ignores the estimation uncertainty in the study haplotypes, but it makes imputation much faster. This speedup is especially important because modern reference collections (such as those from the 1000 Genomes Project) are frequently updated and expanded, so that many investigators would benefit from "re-imputing" their datasets following each reference panel update. The pre-phasing step needs to be performed just once per study dataset, so re-imputing is computationally cheap.

For these reasons, we now recommend pre-phasing as the standard approach for genotype imputation in genome-wide association studies, with the original IMPUTE2 algorithm reserved for maximizing accuracy in more targeted analyses. Pre-phasing is implemented through three program options: -prephase_g, -use_prephased_g, and -known_haps_g. The best way to learn how to use this approach is by example.

We recommend performing the pre-phasing step with an accurate phasing method called SHAPEIT2 (details here and here), then imputing into the estimated GWAS haplotypes with IMPUTE2.

If you use this functionality in your study, please remember to cite our article about pre-phasing in GWAS and the original IMPUTE2 article.


Analyzing Whole Chromosomes

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the wall-clock time and limiting the amount of memory needed for each run.

We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.

The -int parameter provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5-Mb regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on, all without changing the input files. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
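
As a rough illustration of this chunking scheme, the following Python sketch generates adjacent, non-overlapping -int arguments. The helper name chunk_intervals and the 12-Mb chromosome length are hypothetical; the 5-Mb chunk size follows the example in the text, and the last chunk is simply truncated at the chromosome end.

```python
def chunk_intervals(chrom_length, chunk_size=5_000_000):
    """Yield (start, end) pairs that cover positions 1..chrom_length."""
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        yield start, end
        start = end + 1

# Print one -int argument per chunk for a hypothetical 12-Mb chromosome.
for start, end in chunk_intervals(12_000_000):
    print(f"-int {start} {end}")
```

Each printed argument can be added to an otherwise identical IMPUTE2 command, since the input files do not change between chunks.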

Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:

cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)


Merging Reference Panels

Problem statement

Modern genotyping and sequencing technologies are generating a variety of reference datasets that can be used for genotype imputation in association studies. Combining reference panels from different populations can often improve imputation accuracy (e.g., see Howie et al. 2011), but it is not clear how best to merge panels that are genotyped at different sets of variants.

Howie et al. 2009 proposed a solution for the special case where one reference panel contains a subset of the variants in another reference panel. We previously released a combined 1000 Genomes + HapMap 3 panel that takes advantage of this framework, and it was also used in the WTCCC2 studies.

Many association studies are now using the latest 1000 Genomes data to drive their genotype imputation, but they may also have sequenced additional individuals from the population being studied. It makes sense to combine these resources in order to use all available reference information, but in this case each reference panel will contain many variants that are not found in the other; that is, the "hierarchical" variant framework of Howie et al. 2009 no longer applies.

With this in mind, we have devised a new strategy for combining reference panels created by different sequencing or genotyping studies.

Our approach

There are many possible ways to merge two reference panels. We are exploring several of these options, but we decided to start with the simple approach depicted in the figure below. The top panel of this figure shows two reference panels and a GWAS cohort; you can think of the rows as individuals and the columns as positions along the genome. Each vertical line represents a genotyped variant in a given panel, and each reference panel includes variants that are not found in the other.

[Figure: Merging reference panels]

We impute the untyped variants in this figure in three steps:

  1. Impute the variants that are specific to Panel 0 (red) into Panel 1 (blue). Variants shown in grey do not inform the imputation.
  2. Impute the variants that are specific to Panel 1 (blue) into Panel 0 (red). Variants shown in grey do not inform the imputation.
  3. Now that we have imputed the two reference panels up to the union of their variants, treat the imputed haplotypes as known (i.e., take the best-guess haplotypes) and impute the GWAS cohort in the usual way.
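
The bookkeeping behind these three steps can be sketched with Python sets. This only shows which variants are imputed into which panel; it says nothing about the imputation itself, and the function name merge_plan and its dictionary keys are hypothetical.

```python
def merge_plan(panel0_positions, panel1_positions):
    """Work out which variants each merging step has to impute."""
    p0, p1 = set(panel0_positions), set(panel1_positions)
    return {
        "shared": sorted(p0 & p1),              # typed in both panels
        "impute_into_panel1": sorted(p0 - p1),  # Step 1: specific to Panel 0
        "impute_into_panel0": sorted(p1 - p0),  # Step 2: specific to Panel 1
        "merged": sorted(p0 | p1),              # union used in Step 3
    }
```

After Steps 1 and 2, both panels are defined at every position in the "merged" list, which is the variant set imputed into the GWAS cohort in Step 3.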

This process can be performed with IMPUTE2 (version 2.3 and later) in a streamlined way: all you have to do is add the -merge_ref_panels flag to the command line. You can see a working example command here.

Practical considerations

Using pre-phased study data

The -merge_ref_panels flag works with both unphased study genotypes (-g file) and pre-phased study haplotypes (-known_haps_g file).

Parameter settings

For finer control of the merging step, you can supply two values to -k_hap on the command line—for example, '-k_hap 500 200'. This setting tells IMPUTE2 to use 500 haplotypes from Panel 0 and 200 haplotypes from Panel 1. These values should reflect the number of haplotypes in each panel that you expect to be useful for imputation in the study population, which could be less than the total number if either panel is multi-ethnic.

Reference panel ordering

The order in which you supply the reference panels on the command line should not affect the accuracy of imputation from the merged panel: inside the program, the calculations are completely symmetric. One practical limitation is that only the first legend file in an IMPUTE2 command is allowed to have more than four columns. The 1000 Genomes legend files we distribute typically have more than four columns, so if you are using these files it makes sense to provide the 1000 Genomes panel before your other panel on the command line.

Printing the merged panel

By default, IMPUTE2 does not print the merged reference panel (the outcome of Steps 1 and 2 above); the merging is done internally, and the output shows only the imputed genotypes for the study cohort. If you want the program to output the merged panel, you can replace -merge_ref_panels with one of two options:

If you want to merge two reference panels without imputing into a study dataset (i.e., to skip Step 3 above), you should use one of these two options and omit the study data (-g file or -known_haps_g file) from your IMPUTE2 command.

Normally, these options print the merged reference panel within the region specified by the -int argument. If you want to include the buffer regions in the output, you should add the -include_buffer_in_output flag to your command line statement.

Publication and citation

Our approach for merging reference panels has not yet been published outside this website. We have tested the method on realistic datasets, and it has performed well in all of our analyses. We are actively working to document our work on this approach and to compare it with other strategies; we aim to report the results of these experiments and the details of our methodology as soon as possible.

In the meantime, we are happy to answer thoughtful questions and to hear about your experiences with this new functionality. If you would like to send comments, please do so through our mail list.


Imputation Concordance Tables

What is a concordance table?

Every run of IMPUTE2 produces a concordance table, except under certain settings that are not commonly used. A concordance table shows the results of an internal cross-validation that the program performs automatically. For this analysis, IMPUTE2 masks the genotypes of one variant at a time in the study data (Panel 2), then imputes the masked genotypes with information from the reference data and nearby study variants. The imputed genotypes are then compared with the original genotypes to evaluate the quality of the imputation. The results are summarized in a table like the one below:

[Figure: Concordance table]

If you are interested in the results of this experiment at a given variant, you can find this information in the _info file printed by IMPUTE2. The concord_typeX column shows the concordance between input genotypes and best-guess imputed genotypes at each variant, while the r2_typeX column gives the squared correlation between input genotypes and expected genotypes (or "dosages") from imputation. Note that the cross-validation cannot be performed at variants that were not provided in a Panel 2 input file (-g or -known_haps_g), so reference-only variants are assigned values of -1 in the _info file. To learn more about the format of this output file, see here.

How are concordance tables made?

Only variants with input data from a -g or -known_haps_g file are masked and imputed in this analysis. When a -known_haps_g file is provided, all input genotypes are treated as being true. When a -g file is provided, we make hard genotype calls by applying a threshold (default = 0.9) to the maximum value in each input probability triple. For example, a genotype with P(G=0,1,2) = (0.03, 0.95, 0.02) would be called as a '1' (heterozygous), while a genotype with P(G=0,1,2) = (0.1, 0.7, 0.2) would be left uncalled and omitted from the concordance calculations.
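
The calling rule for -g input can be written as a small Python function. This is a minimal sketch rather than the program's own code: the name hard_call is hypothetical, and treating a maximum probability exactly equal to the threshold as a call is an assumption.

```python
def hard_call(probs, threshold=0.9):
    """Return 0, 1 or 2 if the largest probability reaches the threshold.

    probs is a (P(G=0), P(G=1), P(G=2)) triple; None means "left uncalled".
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None
```

With the two examples from the text, hard_call((0.03, 0.95, 0.02)) gives 1 and hard_call((0.1, 0.7, 0.2)) gives None.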

The genotype probabilities from imputation are used somewhat differently. In the first three columns of the table, we assign each imputed genotype to a bin (Interval) based on its maximum posterior probability. Then, for each bin, we report the number of imputed genotypes whose masked input genotypes passed the calling threshold (#Genotypes). We then convert the imputed probabilities to 'best-guess' genotypes: for each posterior probability triple, we select the genotype with the highest value, regardless of magnitude. Finally, we compare the input genotype calls with the best-guess imputed genotypes and report the concordance (%Concordance) within each bin.

In the last three columns of the table, we again bin the imputed genotypes based on their maximum posterior probabilities, but this time the binning is cumulative: the bin at the bottom of the table includes only genotypes that were confidently imputed (max prob >= 0.9), while each bin above includes all genotypes that pass a more lenient certainty threshold. These thresholds are shown in the fourth column (Interval). The fifth column (%Called) shows the percentage of imputed genotypes that pass a given probability threshold, where the denominator is the total number of imputed genotypes for which hard calls are available in the input data. The sixth column (%Concordance) shows the percentage of imputed genotypes in a given bin that match the masked input genotypes.
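
The two halves of the table can be reproduced from a list of masked genotypes with a short Python sketch. It is an illustration under stated assumptions, not IMPUTE2's implementation: the function name concordance_table, the ten bins of width 0.1, and the input layout (one hard call plus one imputed probability triple per genotype, with uncalled input genotypes already removed) are all hypothetical.

```python
def concordance_table(pairs, n_bins=10):
    """Build per-bin and cumulative concordance rows.

    pairs: list of (input_call, imputed_probs) with input_call in {0, 1, 2}.
    Returns (per_bin, cumulative), each a list with one tuple per bin:
      per_bin:    (lower bound, #Genotypes, %Concordance)
      cumulative: (threshold,   %Called,    %Concordance)
    """
    counts = [0] * n_bins
    matches = [0] * n_bins
    for call, probs in pairs:
        best = max(range(3), key=lambda g: probs[g])    # best-guess genotype
        b = min(int(probs[best] * n_bins), n_bins - 1)  # bin by max probability
        counts[b] += 1
        matches[b] += (best == call)

    total = sum(counts)
    per_bin, cumulative = [], []
    for b in range(n_bins):
        pct = 100.0 * matches[b] / counts[b] if counts[b] else 0.0
        per_bin.append((b / n_bins, counts[b], pct))
        n = sum(counts[b:])   # genotypes at or above this threshold
        m = sum(matches[b:])  # of those, how many match the input call
        cumulative.append((b / n_bins,
                           100.0 * n / total if total else 0.0,
                           100.0 * m / n if n else 0.0))
    return per_bin, cumulative
```

In this layout, the first cumulative row (threshold 0.0) includes every genotype, so its %Concordance is the overall concordance of the cross-validation.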

What can I do with a concordance table?

We can learn a couple of things from this kind of analysis. First, the results can alert us to problems in the imputation: if the concordance between imputed and input genotypes is abnormally low, it may indicate that something went wrong in the analysis or input files. A useful summary statistic is the number in the upper right-hand corner of the table, which gives the overall concordance from the cross-validation. This number should typically be around 95%; it may be lower in certain populations or regions of the genome, but if it is much lower then you may need to double-check the analysis. If you are worried about your results, please send a message with details of your analysis (including a _summary output file from IMPUTE2) to our mail list.

Concordance tables can also be used to predict the general quality of imputed genotypes at SNPs where we do not know the true genotypes. SNPs on GWAS microarrays tend to be easier to impute than untyped SNPs of the same frequency, so the cross-validation results may be somewhat optimistic, but they are often useful for relative comparisons—say, between different parameter settings of IMPUTE2.

Finally, the per-variant results of the cross-validation in the output _info file can help identify poorly genotyped SNPs and strand flips. For example, an input SNP that has a low concord_typeX value (implying that the imputed genotypes do not agree with the original genotypes) and a high info_typeX value (implying that the imputation is confident) might be worth investigating or removing from subsequent imputation runs.
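
A screen of this kind might look like the Python sketch below. The column names concord_type0 and info_type0 are the X = 0 instances of the columns named in the text; the rs_id key, the function name flag_suspect_variants, and both cutoff values are hypothetical and should be tuned to your own data. Rows are assumed to have been read from the _info file into dictionaries beforehand.

```python
def flag_suspect_variants(rows, max_concord=0.8, min_info=0.9):
    """Return IDs of input SNPs imputed confidently but discordantly.

    rows: iterable of dicts with keys "rs_id", "concord_type0", "info_type0".
    The cutoffs are illustrative defaults, not recommended values.
    """
    flagged = []
    for row in rows:
        concord = float(row["concord_type0"])
        info = float(row["info_type0"])
        if concord == -1 or info == -1:
            continue  # reference-only variant: no cross-validation result
        if concord < max_concord and info > min_info:
            flagged.append(row["rs_id"])
    return flagged
```

Variants returned by this function are candidates for manual inspection (for example, a strand check) rather than automatic removal.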

Multiple reference panels

If you provide two reference panels to IMPUTE2, the program will perform the cross-validation in two different ways. First it will use only a single reference panel (Panel 0) to mimic Type 0 SNPs, and then it will use both reference panels together (Panels 0 and 1) to mimic Type 1 SNPs. In this case, IMPUTE2 will print two concordance tables—one for each type of reference SNP. Note that the same masked study genotypes are used to evaluate accuracy in both cases; the only difference is how much reference data we allow the program to see when imputing the masked genotypes.

Where can I find the concordance table?

The concordance table is printed at the end of an IMPUTE2 run. One copy is printed to STDOUT, and another copy is printed in the _summary output file.


Scripts

The following scripts are designed to help with various parts of an IMPUTE2 analysis. We provide them in the hope that they will be useful, but we do not offer software support for them, and we cannot guarantee that they will work on your data due to inconsistencies in file formats, assumptions, etc. If you want to use one of these scripts, we suggest that you first read through the code to understand how it works.

All of these scripts are released under the GNU General Public License. Each script will print a list of command line options if you run it with no arguments.

Script name                  Function
vcf2impute_legend_haps.pl    Convert a phased VCF file into reference panel format: one legend file and one haplotypes file.
vcf2impute_gen.pl            Convert a phased or unphased VCF file into genotype file format (.gen).



FAQ

Our FAQ has moved to this Google document.


References

[1]   J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]

[2]   B. N. Howie, P. Donnelly, and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]

[3]   J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511 [Restricted Access PDF] [Supplementary Material]

[4]   B. Howie, J. Marchini, and M. Stephens (2011) Genotype imputation with thousands of genomes. G3: Genes, Genomics, Genetics 1(6): 457-470 [Open Access Article] [Supplementary Material]

[5]   B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 44(8): 955-959 [Restricted Access PDF]


Contributors

The following people developed the methodology and software for IMPUTE2:

Bryan Howie, Jonathan Marchini


Mail List

If you have a question about IMPUTE2, please send a message to our mailing list:

http://www.jiscmail.ac.uk/OXSTATGEN

You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.

IMPORTANT: If you are having a problem with the software, please include the following details in your e-mail; otherwise, we may not be able to diagnose the problem.

  1. The version number of IMPUTE2 and the type of computer you are using to run it (e.g., "IMPUTE v2.2.2 on Mac OS X 10.6").
  2. Any log files and/or screen output from the program; e.g., the "_summary" output file.
  3. For difficult problems like memory access errors (e.g., "segmentation faults"), we may need you to send data files that show the problem. These files should ideally be small, and we can provide suggestions if you are not allowed to share your actual data.