IMPUTE2

IMPUTE version 2 (also known as "IMPUTE2") is a genotype imputation and phasing program based on ideas from Howie et al. (2009). Please click on the links below to download the software or learn how to use it.

Page last updated 05 Dec 2012.


Home
Getting Started
What's New?
Download IMPUTE2
Download Reference Data
Example Commands
Program Options
Best Practices for Imputation
Analyzing Whole Chromosomes
Pre-Phasing GWAS
FAQ
Citing IMPUTE2
Registration and Updates
References
Contributors
Contact Information


Getting Started (top)

IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.

You should begin by downloading the program from the Download IMPUTE2 section. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.

Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on Example Commands shows how to use the most common IMPUTE2 functions. We suggest that you work through these examples and try to understand what the elements of each command are doing. If you don't understand something or would like to know if the program can perform a function that isn't listed, please feel free to contact us.

Once you understand the basic functionality of the program, you can use several features of this website to prepare your own analysis:
  • Learn about best practices for imputation (see Best Practices for Imputation).

  • Download reference data that you can use to impute genotypes in your study (see Download Reference Data).

  • Look through a complete list of program options (see Program Options).

  • Browse our frequently asked questions (see FAQ).


What's New? (top)

Paper on "pre-phasing" study genotypes for faster imputation

We recently published an article called "Fast and accurate genotype imputation in genome-wide association studies through pre-phasing" in Nature Genetics. This paper describes a strategy ("pre-phasing") for efficient genotype imputation with large reference panels. By reducing the computational burden of imputation, pre-phasing makes imputation-based studies feasible for groups with limited computing power, and it also makes it easier to re-impute existing GWAS datasets as more informative reference panels become available. You can learn more about pre-phasing with IMPUTE2 here.

Modified version of 1,000 Genomes Phase I reference panel

Earlier this year (March 2012), the 1,000 Genomes Project released a powerful reference panel known as "Phase I v3". We recently (August 2012) modified this panel by excluding variants with only one copy of the minor allele (singletons) across all 1,092 individuals. Singleton variants are difficult to impute, yet they make up ~20% of all variants in the reference panel; removing them makes imputation faster without hurting the power for association mapping. You can download either the original reference panel or the modified version (which is labeled "macGT1" for "minor allele count greater than one") here.

Paper on imputation strategies for ancestrally diverse reference panels

A few months ago we published an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. This paper describes our strategy for achieving high accuracy with ancestrally diverse reference panels, especially at low-frequency variants and in admixed study cohorts: we supply a cosmopolitan set of reference haplotypes to IMPUTE2, which can automatically find the most useful ones for each study individual with the help of the tuning parameter -k_hap. You can read more about the results that support this strategy in the article, and we provide practical suggestions for applying it here.

Pre-phasing with SHAPEIT

IMPUTE2's pre-phasing approach now works with phased haplotypes from SHAPEIT, a highly accurate phasing algorithm that can handle mixtures of unrelateds, duos, and trios. Details are available here. We highly recommend using SHAPEIT to infer the haplotypes underlying your study genotypes, then passing these to IMPUTE2 for imputation as shown in the second step of this example.


Download IMPUTE2 (top)

IMPUTE2 is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.

Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please contact us.

Platform                                      File
Linux (x86_64) Static Executable -- v2.2.2    impute_v2.2.2_x86_64_static.tgz
Linux (x86_64) Dynamic Executable -- v2.2.2   impute_v2.2.2_x86_64_dynamic.tgz
Mac OS X Intel -- v2.2.2                      impute_v2.2.2_MacOSX_Intel.tgz
Windows MS-DOS (Intel) -- v2.2.2              impute_v2.2.2_Windows_Intel.tgz
Linux (i686) Dynamic Executable -- v2.2.2     impute_v2.2.2_i686_dynamic.tgz
Linux (x86_64) Dynamic Executable             impute_v2.1.2_x86_64_dynamic.tgz
Linux (x86_64) Static Executable              impute_v2.1.2_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)   impute_v2.1.2_SuSE9.3_x86_64_static.tgz
Linux (i386) Dynamic Executable               impute_v2.1.2_i386_dynamic.tgz
Mac OS X Intel                                impute_v2.1.2_MacOSX_Intel.tgz
Solaris 5.10 (AMD Opteron)                    impute_v2.1.2_Solaris5.10_Opteron.tgz
Windows MS-DOS (Intel)                        impute_v2.1.2_Windows_Intel.tgz

To unpack the files on a Linux computer, use a command like

tar -zxvf impute_v2.X.Y_i386.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files. We show how to perform various kinds of analyses with the example files in the Example Commands section.


Download Reference Data (top)

IMPUTE2 can use publicly available reference datasets, such as haplotypes from major sequencing projects, as well as customized reference panels, such as SNP genotypes from a fine-mapping study. If you would like to download a public dataset, just click the relevant link below, which will take you to a page with background information and download options for that dataset.

The two latest reference panels are from 1,000 Genomes Phase I. The "interim" panel contains ~37 M SNPs, while the "integrated" panel contains ~39 M SNPs, INDELs, and SVs. We have also created a version of the integrated panel that omits singleton variants to reduce the size to ~31 M SNPs, INDELs, and SVs. Note that the 1,000 Genomes Phase I integrated variant set is meant to be used with IMPUTE version 2.2.0 or later.
Link to download page                         NCBI build  Haplotype release date  Release status
1000 Genomes Phase I integrated variant set   b37         Mar 2012                Includes chrX; updated 24 Aug 2012
1000 Genomes Phase I (interim)                b37         Jun 2011                Includes chrX; updated 19 Apr 2012
1000 Genomes (2010 interim)                   b37         Dec 2010
1000 Genomes Pilot + HapMap 3                 b36         Jun 2010 / Feb 2009
1000 Genomes Pilot                            b36         Jun 2010
HapMap 3 (release #2)                         b36         Feb 2009                Includes chrX
HapMap 2 (release #24)                        b36         Oct 2008
HapMap 2 (release #22)                        b36         Jan 2008
HapMap 2 (release #21)                        b35         Jul 2006


Example Commands (top)

This section provides some example commands that illustrate typical applications of IMPUTE2. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). Detailed explanations are provided at each link below.
Run type                                                                              Description
Imputation with one phased reference panel                                            Basic scenario in which most people will use IMPUTE2.
Imputation with one phased reference panel (pre-phasing)                              As above, but with pre-phasing functionality to speed up the analysis.
Imputation with one phased reference panel (chromosome X)                             Basic imputation scenario applied to human chromosome X, which requires special program options.
Imputation with one phased reference panel (plus variant filtering)                   Basic imputation scenario with flexible filtering of reference panel variants.
Imputation with one unphased reference panel                                          Basic imputation scenario adapted to unphased reference genotypes.
Imputation with two phased reference panels                                           Extended functionality for imputing from multiple reference panels defined on different sets of variants.
Imputation with one phased and one unphased reference panel                           Specialized method for combining reference panels of different types.
Imputation with one phased and one unphased reference panel, with additional options  As above, but showcasing a variety of options that can be used to customize the behavior of IMPUTE2.
Phasing                                                                               Methodology for inferring haplotypes from unphased genotypes.
Phasing with a reference panel                                                        Phasing analysis aided by reference haplotypes.
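
As a rough illustration of the first run type, the sketch below prints (rather than runs) a basic imputation command for one genomic interval. The -int option is described on this page; the -m (genetic map), -h (reference haplotypes), -l (reference legend), -g (study genotypes), and -o (output file) options are stated here from general knowledge of the program and should be checked against the Program Options section. Every file name is a hypothetical placeholder, not a file from the Example/ directory.

```shell
# Path to the executable; override by setting IMPUTE2 in the environment.
IMPUTE2=${IMPUTE2:-./impute2}

# Print a basic one-panel imputation command for a single interval.
# $1 = interval start (bp), $2 = interval end (bp), $3 = output file.
# All input file names below are hypothetical placeholders.
impute_cmd() {
    echo "$IMPUTE2" \
        -m genetic_map.txt \
        -h reference.haps \
        -l reference.legend \
        -g study.gens \
        -int "$1" "$2" \
        -o "$3"
}

# Show the command for a hypothetical 100-kb interval.
impute_cmd 20400000 20500000 study.chunk.impute2
```

Printing the command first makes it easy to inspect each element before committing computing time; piping the output to sh would execute it.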


Program Options (top)

These links explain the command-line arguments that can be used to control IMPUTE2.
Option type               Description
Required arguments        The program will not run if these are not supplied.
Input file options        A list of possible input files, with formatting requirements.
Output file options       Naming conventions and options for controlling format of output files.
Basic options             Options for controlling how the program processes input data.
Strand alignment options  Options for aligning allele coding across data files.
Filtering options         Options for controlling the filters that get applied to input data.
MCMC options              Options for controlling the MCMC algorithm.
Pre-phasing options       Options that facilitate pre-phasing and subsequent imputation.
Chromosome X options      Options for analyzing chromosome X data.
Expert options            Options to be used by experts only.


Best Practices for Imputation (top)

IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.

PRE-IMPUTATION FILTERING OF STUDY GENOTYPES

Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.

VARIANT POSITION MATCHING ACROSS INPUT FILES

When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by looking at their positions on the chromosome (as opposed, say, to their rsIDs). It is important to note that genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. In order to obtain high-quality results from IMPUTE2, you must make sure that the variant positions in your input files are mapped to the same coordinate system, or "assembly".

Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please contact us.
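
Tools like liftOver typically take positions in BED format, so the first step is usually to export the study positions. The sketch below is one way to do that, under two assumptions that are not stated on this page: the study file is whitespace-delimited with the variant identifier in column 2 and the base-pair position in column 3, and the chromosome label is supplied by the user. The function name gen_to_bed is hypothetical.

```shell
# Write one BED line per variant: chromosome, 0-based start, 1-based end, name.
# $1 = study file (assumed layout: identifier in column 2, position in column 3)
# $2 = chromosome label to write in the BED file, e.g. chr22
gen_to_bed() {
    awk -v chrom="$2" '{ printf "%s\t%d\t%d\t%s\n", chrom, $3 - 1, $3, $2 }' "$1"
}

# Typical use (file names are placeholders; the chain file comes from UCSC):
#   gen_to_bed study.b36.gens chr22 > study.b36.bed
#   liftOver study.b36.bed <chain file> study.b37.bed study.unmapped.bed
```

Variants that land in the unmapped file have no position in the target assembly and would need to be dropped from the study data before imputation.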

STRAND ALIGNMENT BETWEEN STUDY AND REFERENCE DATA

It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.

Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the options described here. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.
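
The A/T scenario described above is hard to detect from allele labels alone, because the alleles of A/T and C/G variants read the same on both strands. A simple precaution is to list these variants so that they can be checked against assay information or allele frequencies. The sketch below assumes a whitespace-delimited study file with the identifier in column 2, the position in column 3, and the two alleles in columns 4 and 5; that layout and the function name flag_ambiguous are assumptions made for illustration.

```shell
# List variants whose two alleles are complementary (A/T or C/G).
# $1 = study file (assumed layout: id in column 2, position in column 3,
#      alleles in columns 4 and 5)
flag_ambiguous() {
    awk '{
        a = toupper($4); b = toupper($5)
        if ((a == "A" && b == "T") || (a == "T" && b == "A") ||
            (a == "C" && b == "G") || (a == "G" && b == "C"))
            print $2, $3, a "/" b
    }' "$1"
}

# Typical use (file name is a placeholder):
#   flag_ambiguous study.gens > ambiguous_variants.txt
```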

CHOOSING A REFERENCE PANEL

Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy increases accuracy and avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population; you can read more about the strategy here, and you can download state-of-the-art reference haplotypes here.

GENOME-WIDE IMPUTATION

It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide a few mechanisms to help with this process:
  1. IMPUTE2 includes command-line parameters that can be used to split the genome into discrete chunks for parallel analysis on a computing cluster. These parameters allow flexible partitioning of the genome with minimal manipulation of input files. See the Analyzing Whole Chromosomes section for suggestions on how to use this functionality.

  2. IMPUTE2 is an efficient imputation method, but it still requires substantial computing time to process the whole genome in a large number of individuals. We have recently developed an approach called "pre-phasing" that greatly reduces the computational burden of imputation while sacrificing only a little accuracy; you can read more about the approach in the Pre-Phasing GWAS section. We now recommend this as the standard way of performing genome-wide imputation, although we still prefer the original IMPUTE2 MCMC algorithm for maximizing accuracy in smaller regions.

  3. Sequence-based reference panels contain large numbers of rare and low-frequency variants, which can drive up the computational cost of imputation. When computing power is limited, it may be desirable to remove some of these variants (e.g., those with very low frequencies in the population of interest) before running imputation. To facilitate this process, we have added the -filt_rules_l option, which can flexibly remove reference variants based on command-line input to an IMPUTE2 run. You can see an example application of this approach and some guidelines for using it here.

POST-IMPUTATION FILTERING

It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.

ASSOCIATION TESTING

We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.

FOLLOW-UP IMPUTATION OF PUTATIVE ASSOCIATIONS

Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:
  • In contrast to the pre-phasing approach that we recommend for genome-wide imputation, we suggest using the standard IMPUTE2 MCMC algorithm for follow-up imputation. This method takes longer to run in each region, but it should lead to higher accuracy (especially at low-frequency variants) and remain computationally feasible when run on a limited portion of the genome.

  • If time permits, the overall accuracy may be improved by increasing the value of the -k parameter.

  • If time permits, the accuracy at low-frequency variants may be improved by increasing the size of the -buffer region, for example from the default value of 250 kb to 1000 kb (1 Mb).

Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.


Analyzing Whole Chromosomes (top)

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.

We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.

The -int parameter provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5-Mb regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on, all without changing the input files. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
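
The interval arithmetic described above can be sketched as a small shell function that prints one -int argument pair per chunk. The chromosome length and chunk size are supplied by the caller, and the function name chunk_intervals is hypothetical; the final chunk is clipped at the chromosome end, which is a convenience of this sketch rather than a requirement stated here.

```shell
# Print one "-int START END" argument pair per chunk of a chromosome.
# $1 = chromosome length in bp, $2 = chunk size in bp.
chunk_intervals() {
    chrom_len=$1
    chunk_size=$2
    start=1
    while [ "$start" -le "$chrom_len" ]; do
        end=$((start + chunk_size - 1))
        # Clip the final chunk at the end of the chromosome.
        if [ "$end" -gt "$chrom_len" ]; then
            end=$chrom_len
        fi
        echo "-int $start $end"
        start=$((end + 1))
    done
}

# Example: a hypothetical 12-Mb chromosome split into 5-Mb chunks.
chunk_intervals 12000000 5000000
# prints:
#   -int 1 5000000
#   -int 5000001 10000000
#   -int 10000001 12000000
```

Each printed line can be appended to an otherwise identical IMPUTE2 command, giving one job per chunk with no change to the input files.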

Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:

cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)
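
With many chunks, listing every file by hand is error-prone, and a shell wildcard would sort chunk10 before chunk2. The sketch below concatenates chunks in numeric order and stops if one is missing. It assumes the file naming pattern used in the example above; the function name combine_chunks and the choice to treat a missing chunk as an error are assumptions of this sketch.

```shell
# Concatenate per-chunk output files in numeric chunk order.
# $1 = file prefix (e.g. chr16), $2 = number of chunks.
# Reads PREFIX_chunk1.impute2 ... PREFIX_chunkN.impute2 and
# writes PREFIX_chunkAll.impute2 in the current directory.
combine_chunks() {
    prefix=$1
    n_chunks=$2
    out="${prefix}_chunkAll.impute2"
    : > "$out"    # start from an empty output file
    i=1
    while [ "$i" -le "$n_chunks" ]; do
        chunk="${prefix}_chunk${i}.impute2"
        if [ ! -f "$chunk" ]; then
            echo "missing chunk file: $chunk" >&2
            return 1
        fi
        cat "$chunk" >> "$out"
        i=$((i + 1))
    done
}

# Typical use, matching the three-chunk example above:
#   combine_chunks chr16 3
```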

We are working on a set of prototype scripts that will (i) partition any input dataset into chunks and (ii) submit the imputation jobs for these chunks to a computing cluster. We will post these to the website as soon as they are ready.


Pre-Phasing GWAS (top)

Improvements in sequencing and genotyping technologies have rapidly increased the amount of reference data that can be used to impute untyped SNPs in association studies. Larger reference panels improve the power and resolution of imputation-based association mapping, but they also increase the computational burden of imputation. To help offset this cost, we have developed an extension of the IMPUTE2 methodology.

The basic idea is to "pre-phase" your study genotypes to produce best-guess haplotypes, then impute into these estimated haplotypes in a separate program run. By contrast, the original IMPUTE2 method integrates over the unknown phase of your study data during the course of an imputation analysis. Pre-phasing leads to a small loss of accuracy since the estimation uncertainty in the study haplotypes is ignored, but this allows for very fast imputation. This speedup is especially important because modern reference collections (such as those from the 1,000 Genomes Project) are frequently updated and expanded, so that many investigators would benefit from "re-imputing" their datasets following each reference panel update. The pre-phasing step needs to be performed just once per study dataset, so re-imputing is computationally cheap.

For these reasons, we now recommend pre-phasing as the standard approach for genotype imputation in genome-wide association studies, with the original IMPUTE2 algorithm reserved for maximizing accuracy in more targeted analyses. Pre-phasing is implemented through three program options: -prephase_g, -use_prephased_g, and -known_haps_g. The best way to learn how to use this approach is by example. You can also learn from this outdated document and working examples.
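
As a rough sketch of how the three options fit together, the functions below print (rather than run) the two commands of a pre-phasing workflow for one interval. The -prephase_g, -use_prephased_g, -known_haps_g, and -int options are named on this page; the -m, -h, -l, -g, and -o options are stated from general knowledge of the program and should be checked against the Program Options section. All file names are hypothetical placeholders, including the haplotype file passed to step 2, which stands for whatever file step 1 actually wrote.

```shell
# Path to the executable; override by setting IMPUTE2 in the environment.
IMPUTE2=${IMPUTE2:-./impute2}

# Step 1: pre-phase the study genotypes (this sketch passes no reference panel).
# $1 = interval start (bp), $2 = interval end (bp), $3 = output file.
prephase_cmd() {
    echo "$IMPUTE2" -prephase_g -m genetic_map.txt -g study.gens \
        -int "$1" "$2" -o "$3"
}

# Step 2: impute into the haplotypes estimated in step 1.
# $1 = interval start, $2 = interval end,
# $3 = pre-phased haplotype file from step 1, $4 = output file.
impute_prephased_cmd() {
    echo "$IMPUTE2" -use_prephased_g -m genetic_map.txt \
        -h reference.haps -l reference.legend \
        -known_haps_g "$3" \
        -int "$1" "$2" -o "$4"
}

# Show both commands for a hypothetical first 5-Mb chunk.
prephase_cmd 1 5000000 study.chunk1.prephased
impute_prephased_cmd 1 5000000 study.chunk1.haps study.chunk1.impute2
```

Only step 2 depends on the reference panel, so after a panel update it is the only command that needs to be re-run.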

We recommend performing the pre-phasing step with an accurate phasing method called SHAPEIT (details here and here), then imputing into the estimated GWAS haplotypes with IMPUTE2.

If you use this functionality in your study, please remember to cite our article about pre-phasing in GWAS.


FAQ (top)

Our FAQ has moved to this Google document.


Citing IMPUTE2 (top)

If you use IMPUTE2 in a published manuscript, please cite Howie et al. 2009 (PLoS Genetics).

If you use IMPUTE2 with a multi-population reference panel (such as the 1,000 Genomes or HapMap 3 "ALL" panels), please also cite Howie et al. 2011 (G3: Genes, Genomes, Genetics).

If you use our pre-phasing approach, please also cite Howie et al. 2012 (Nature Genetics).


Registration and Updates (top)

If you would like to receive e-mails about updates to this software, please fill out the registration form.


References (top)

[1]   J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]

[2]   B. N. Howie, P. Donnelly and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article] [Supplementary Material]

[3]   J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511 [Restricted Access PDF] [Supplementary Material]

[4]   B. Howie, J. Marchini, and M. Stephens (2011) Genotype imputation with thousands of genomes. G3: Genes, Genomes, Genetics 1(6): 457-470 [Open Access Article] [Supplementary Material]

[5]   B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, and G. R. Abecasis (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing. Nature Genetics 44(8): 955-959 [Restricted Access PDF]


Contributors (top)

The following people developed the methodology and software for IMPUTE2:

Bryan Howie, Jonathan Marchini


Contact Information (top)

If you have a question about IMPUTE2, please send a message to our mailing list:

http://www.jiscmail.ac.uk/OXSTATGEN

You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.

IMPORTANT: If you are having a problem with the software, please include the following details in your e-mail; otherwise, we may not be able to diagnose the problem.
  1. The version number of IMPUTE2 and the type of computer you are using to run it; e.g., "IMPUTE v2.2.2 on Mac OSX 10.6"

  2. Any log files and/or screen output from the program; e.g., the "_summary" output file.

  3. For difficult problems like memory access errors (e.g., "segmentation faults"), we may need you to send data files that show the problem. These files should ideally be small, and we can provide suggestions if you are not allowed to share your actual data.