IMPUTE2

IMPUTE version 2 (also known as "IMPUTE2") is a genotype imputation and phasing program based on ideas from Howie et al. (2009). Please click on the links below to download the software or learn how to use it.

Page last updated November 23, 2011.


Home
Getting Started
What's New?
Download IMPUTE2
Download Reference Data
Example Commands
Program Options
Best Practices for Imputation
Analyzing Whole Chromosomes
Pre-Phasing GWAS
FAQ
Citing IMPUTE2
Registration and Updates
References
Contributors
Contact Information


Getting Started

IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.

You should begin by downloading the program from the Download IMPUTE2 section. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.

Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on Example Commands shows how to use the most common IMPUTE2 functions. We suggest that you work through these examples and try to understand what the elements of each command are doing. If you don't understand something or would like to know if the program can perform a function that isn't listed, please feel free to contact us.

Once you understand the basic functionality of the program, you can use several features of this website to prepare your own analysis:
  • Learn about best practices for imputation (see Best Practices for Imputation).

  • Download reference data that you can use to impute genotypes in your study (see Download Reference Data).

  • Look through a complete list of program options (see Program Options).

  • Browse our frequently asked questions (see FAQ).


What's New?

Integrated Phase I haplotype release from the 1000 Genomes Project

The 1000 Genomes Project recently (October 2011) released an updated set of genome-wide reference haplotypes. This is the official haplotype release from Phase I of the project; it includes >40 million autosomal variants (SNPs, INDELs, SVs) typed in 1,092 people from around the world. These haplotypes are a powerful resource for imputation, and we currently recommend them as a state-of-the-art genome-wide reference panel. You can download the data from the Download Reference Data section. Note that these reference haplotypes are meant to be used with IMPUTE version 2.2 or later.

Beta release of IMPUTE version 2.2

You can now download a limited release of IMPUTE v2.2 (beta). This release includes several new features, which you can read about here. We are planning to release version 2.2.0 in the very near future; the official release will include functionality for imputation on chromosome X and executables for a variety of computing platforms.


Download IMPUTE2

IMPUTE2 is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.

Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please contact us.

Platform                                          File
Linux (x86_64) Static Executable -- v2.2 (beta)   impute_v2.2_beta_x86_64_static.tgz
Windows MS-DOS (Intel) -- v2.2 (beta)             impute_v2.2_beta_Windows_Intel.tgz
Mac OS X Intel -- v2.2 (beta)                     impute_v2.2_beta_MacOSX_Intel.tgz
Linux (x86_64) Dynamic Executable                 impute_v2.1.2_x86_64_dynamic.tgz
Linux (x86_64) Static Executable                  impute_v2.1.2_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)       impute_v2.1.2_SuSE9.3_x86_64_static.tgz
Linux (i386) Dynamic Executable                   impute_v2.1.2_i386_dynamic.tgz
Mac OS X Intel                                    impute_v2.1.2_MacOSX_Intel.tgz
Solaris 5.10 (AMD Opteron)                        impute_v2.1.2_Solaris5.10_Opteron.tgz
Windows MS-DOS (Intel)                            impute_v2.1.2_Windows_Intel.tgz

To unpack the files on a Linux computer, use a command like

tar -zxvf impute_v2.X.Y_i386.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files. The Example Commands section shows how to perform various kinds of analyses with the example files.
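
The following shell sketch is one way to confirm that an unpacked directory has the layout described above. The check_unpacked helper is hypothetical (it is not part of the download), and the archive name is the same placeholder used in the tar command.

```shell
# check_unpacked DIR
# Report any of the three expected items that are missing from DIR.
# (Hypothetical helper for illustration; not shipped with IMPUTE2.)
check_unpacked() {
    cu_dir=$1
    cu_missing=0
    [ -x "$cu_dir/impute2" ] || { echo "missing executable: $cu_dir/impute2"; cu_missing=1; }
    [ -f "$cu_dir/LICENCE" ] || { echo "missing file: $cu_dir/LICENCE"; cu_missing=1; }
    [ -d "$cu_dir/Example" ] || { echo "missing directory: $cu_dir/Example"; cu_missing=1; }
    return "$cu_missing"
}

archive="impute_v2.X.Y_i386.tgz"      # placeholder name; use your actual download
dir=$(basename "$archive" .tgz)       # directory name = archive name minus '.tgz'
if check_unpacked "$dir"; then
    echo "$dir looks complete"
fi
```

If anything is reported missing, the archive may not have been unpacked in the current directory.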


Download Reference Data

IMPUTE2 can use publicly available reference datasets, such as haplotypes from major sequencing projects, as well as customized reference panels, such as SNP genotypes from a fine-mapping study. If you would like to download a public dataset, just click the relevant link below, which will take you to a page with background information and download options for that dataset. We currently recommend using the integrated variant set haplotype release from Phase I of the 1000 Genomes Project, which can be found at the first link below (note that this reference panel is meant to be used with IMPUTE version 2.2 or later).

Link to download page                         NCBI build   Haplotype release date
1000 Genomes Phase I integrated variant set   b37          Oct 2011
1000 Genomes Phase I (interim)                b37          Jun 2011
1000 Genomes (2010 interim)                   b37          Dec 2010
1000 Genomes Pilot + HapMap 3                 b36          Jun 2010 / Feb 2009
1000 Genomes Pilot                            b36          Jun 2010
HapMap 3 (release #2)                         b36          Feb 2009
HapMap 2 (release #24)                        b36          Oct 2008
HapMap 2 (release #22)                        b36          Jan 2008
HapMap 2 (release #21)                        b35          Jul 2006


Example Commands

This section provides some example commands that illustrate typical applications of IMPUTE2. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). Detailed explanations are provided at each link below.

Run type                                                                                Description
Imputation with one phased reference panel                                              Basic scenario in which most people will use IMPUTE2.
Imputation with one phased reference panel (pre-phasing)                                As above, but with pre-phasing functionality to speed up the analysis.
Imputation with one unphased reference panel                                            Basic imputation scenario adapted to unphased reference genotypes.
Imputation with two phased reference panels                                             Extended functionality for imputing from multiple reference panels defined on different sets of variants.
Imputation with one phased and one unphased reference panel                             Specialized method for combining reference panels of different types.
Imputation with one phased and one unphased reference panel, with additional options    As above, but showcasing a variety of options that can be used to customize the behavior of IMPUTE2.
Phasing                                                                                 Methodology for inferring haplotypes from unphased genotypes.
Phasing with a reference panel                                                          Phasing analysis aided by reference haplotypes.


Program Options

These links explain the command-line arguments that can be used to control IMPUTE2.

Option type                Description
Required arguments         The program will not run if these are not supplied.
Input file options         A list of possible input files, with formatting requirements.
Output file options        Naming conventions and options for controlling format of output files.
Basic options              Options for controlling how the program processes input data.
Strand alignment options   Options for aligning allele coding across data files.
Filtering options          Options for controlling the filters that get applied to input data.
MCMC options               Options for controlling the MCMC algorithm.
Expert options             Options to be used by experts only.


Best Practices for Imputation

IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.

PRE-IMPUTATION FILTERING OF STUDY GENOTYPES

Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.

VARIANT POSITION MATCHING ACROSS INPUT FILES

When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by looking at their positions on the chromosome (as opposed, say, to their rsIDs). It is important to note that genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. In order to obtain high-quality results from IMPUTE2, you must make sure that the variant positions in your input files are mapped to the same coordinate system, or "assembly".
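
To make position-based matching concrete, here is a minimal shell sketch that lists the positions shared by two files. It assumes plain whitespace-delimited text with the position in a known column. The shared_positions helper, the toy files, and the IDs are made up for illustration; this is not a description of how IMPUTE2 is implemented internally.

```shell
# shared_positions FILE_A COL_A FILE_B COL_B
# Print each position in FILE_B (column COL_B) that also occurs in
# FILE_A (column COL_A). Variant IDs are ignored on purpose.
shared_positions() {
    awk -v ca="$2" -v cb="$4" '
        NR == FNR { seen[$ca] = 1; next }   # first file: remember its positions
        ($cb in seen) { print $cb }         # second file: report the matches
    ' "$1" "$3"
}

# Toy data: the two files use different IDs for the same sites.
tmp=$(mktemp -d)
printf 'rs1 100\nrs2 250\nrs3 400\n'       > "$tmp/reference.txt"
printf 'snpA 250\nsnpB 300\nsnpC 400\n'    > "$tmp/study.txt"
shared_positions "$tmp/reference.txt" 2 "$tmp/study.txt" 2   # prints 250 and 400
rm -r "$tmp"
```

In the toy data the sites at 250 and 400 are matched even though their IDs differ, which is why both files must use the same coordinate system.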

Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please contact us.
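
As a rough sketch of the preparation step for liftOver, the snippet below converts 1-based SNP positions into BED intervals, which are 0-based and half-open. The positions_to_bed helper, the file names, and the two-column "ID position" input layout are assumptions made for this example; the liftOver call is left as a comment because it needs the UCSC program and a chain file.

```shell
# positions_to_bed CHROM
# Read "ID POSITION" lines on stdin and write BED lines
# "chrCHROM <start> <end> ID", with start = POSITION - 1 and end = POSITION.
positions_to_bed() {
    awk -v chrom="$1" '{ printf "chr%s\t%d\t%d\t%s\n", chrom, $2 - 1, $2, $1 }'
}

# Toy input with made-up IDs and positions on chromosome 22.
printf 'rs_example1 1000\nrs_example2 2500\n' | positions_to_bed 22

# With a real BED file, the conversion from hg18 (b36) to hg19 (b37) would look like:
#   liftOver study.b36.bed hg18ToHg19.over.chain study.b37.bed unmapped.bed
```

Variants written to the unmapped file have no position in the new assembly and would need to be dropped before imputation.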

STRAND ALIGNMENT BETWEEN STUDY AND REFERENCE DATA

It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.

Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the strand alignment options listed under Program Options. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.
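
The two scenarios described above can be illustrated with a small shell sketch. Flipping a strand means complementing each allele (A with T, C with G). A/T and C/G SNPs are the awkward case, because a flip leaves the same pair of allele labels. Both helpers below are hypothetical illustrations, not IMPUTE2 functionality.

```shell
# flip_strand ALLELES
# Complement every allele letter in the argument (A<->T, C<->G).
flip_strand() {
    printf '%s\n' "$1" | tr 'ACGT' 'TGCA'
}

# is_strand_ambiguous ALLELE1 ALLELE2
# Succeed when the two alleles are complements of each other (A/T or C/G),
# so that the allele labels alone cannot reveal a strand flip.
is_strand_ambiguous() {
    [ "$(flip_strand "$1")" = "$2" ]
}

flip_strand "A/G"                     # prints T/C
if is_strand_ambiguous A T; then
    echo "A/T: strand cannot be inferred from allele labels"
fi
```

For ambiguous SNPs, assay information from the vendor is the reliable source; allele frequencies can only support a guess.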

CHOOSING A REFERENCE PANEL

Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy increases accuracy and avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population; you can read more about the strategy here, and you can download state-of-the-art reference haplotypes here.

GENOME-WIDE IMPUTATION

It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide a couple of mechanisms to help with this process:
  1. IMPUTE2 includes command-line parameters that can be used to split the genome into discrete chunks for parallel analysis on a computing cluster. These parameters allow flexible partitioning of the genome with minimal manipulation of input files. See the Analyzing Whole Chromosomes section for suggestions on how to use this functionality.

  2. IMPUTE2 is an efficient imputation method, but it still requires substantial computing time to process the whole genome in a large number of individuals. We have recently developed an approach called "pre-phasing" that greatly reduces the computational burden of imputation while sacrificing only a little accuracy; you can read more about the approach in the Pre-Phasing GWAS section. We now recommend this as the standard way of performing genome-wide imputation, although we still prefer the default IMPUTE2 algorithm for analyzing smaller regions.

POST-IMPUTATION FILTERING

It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.

ASSOCIATION TESTING

We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.

FOLLOW-UP IMPUTATION OF PUTATIVE ASSOCIATIONS

Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:
  • In contrast to the pre-phasing approach that we recommend for genome-wide imputation, we suggest using the standard IMPUTE2 MCMC algorithm for follow-up imputation. This method takes longer to run in each region, but it should lead to higher accuracy (especially at low-frequency variants) and remain computationally feasible when run on a limited portion of the genome.

  • If time permits, the overall accuracy may be improved by increasing the value of the -k parameter.

  • If time permits, the accuracy at low-frequency variants may be improved by increasing the size of the -buffer region, say from the default value of 250 kb to 1000 kb (1 Mb).

Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.
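
As a sketch of how these follow-up settings might be combined, the snippet below only prints a command line rather than running the program. It assumes that -buffer takes its value in kb. The INPUT_ARGS placeholder, the followup_cmd helper, the interval, and the -k value of 120 are all illustrative choices, not recommendations.

```shell
# Hypothetical placeholder: replace with the input and output arguments
# you already use for this dataset (see Program Options).
INPUT_ARGS="<your usual input and output arguments>"

# followup_cmd START END K BUFFER_KB
# Print (do not run) a follow-up imputation command for one region.
followup_cmd() {
    echo "./impute2 $INPUT_ARGS -int $1 $2 -k $3 -buffer $4"
}

# Illustrative region and settings only; -buffer 1000 corresponds to 1 Mb.
followup_cmd 20000000 22000000 120 1000
```

Printing the command first makes it easy to review the settings for each region before submitting any jobs.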


Analyzing Whole Chromosomes

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2. However, we prefer to split each chromosome into smaller chunks for analysis, both because the program produces higher accuracy over short genomic regions and because imputing a chromosome in chunks is a good computational strategy: the chunks can be imputed in parallel on multiple computer processors, thereby decreasing the real computing time and limiting the amount of memory needed for each run.

We therefore recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.
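
The arithmetic behind this limit can be sketched in shell. The sketch assumes that the buffer is counted on both sides of the analysis interval, that -buffer is given in kb, and that 1 Mb means 1,000,000 bp; the text above does not spell these details out, so treat the helpers as hypothetical.

```shell
LIMIT_BP=7000000   # 7 Mb, taking 1 Mb = 1,000,000 bp (assumption)

# region_span_bp START END BUFFER_KB
# Interval length plus one buffer on each side, in base pairs.
region_span_bp() {
    echo $(( $2 - $1 + 1 + 2 * $3 * 1000 ))
}

# exceeds_limit START END BUFFER_KB
# Succeed when the total span is longer than LIMIT_BP.
exceeds_limit() {
    [ "$(region_span_bp "$1" "$2" "$3")" -gt "$LIMIT_BP" ]
}

# report START END BUFFER_KB: one-line summary for a planned run.
report() {
    if exceeds_limit "$1" "$2" "$3"; then
        echo "-int $1 $2 -buffer $3: $(region_span_bp "$1" "$2" "$3") bp, would need -allow_large_regions"
    else
        echo "-int $1 $2 -buffer $3: $(region_span_bp "$1" "$2" "$3") bp, within the limit"
    fi
}

report 1 5000000 250    # 5 Mb chunk with the default buffer
report 1 7000000 250    # 7 Mb chunk with the default buffer
```

Under these assumptions a 5-Mb chunk with the default buffer spans 5.5 Mb, comfortably inside the limit.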

The -int parameter provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5-Mb regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on, all without changing the input files. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
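
The chunking scheme described above can be generated with a short shell loop. The chunk_intervals helper is hypothetical, and the 12-Mb chromosome length is a toy value chosen to keep the output short.

```shell
# chunk_intervals CHROM_LENGTH_BP CHUNK_BP
# Print one "-int START END" argument pair per chunk, covering positions
# 1..CHROM_LENGTH_BP with adjacent, non-overlapping intervals.
chunk_intervals() {
    ci_length=$1
    ci_chunk=$2
    ci_start=1
    while [ "$ci_start" -le "$ci_length" ]; do
        ci_end=$(( ci_start + ci_chunk - 1 ))
        if [ "$ci_end" -gt "$ci_length" ]; then
            ci_end=$ci_length          # last chunk may be shorter
        fi
        echo "-int $ci_start $ci_end"
        ci_start=$(( ci_end + 1 ))
    done
}

# Toy example: a 12-Mb region split into 5-Mb chunks.
chunk_intervals 12000000 5000000
# -int 1 5000000
# -int 5000001 10000000
# -int 10000001 12000000
```

Each printed line can be appended to an otherwise identical impute2 command, so the input files stay the same across runs.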

Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:

cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)
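
With many chunks, listing every file by hand becomes error-prone, and a shell wildcard such as chr16_chunk*.impute2 sorts "chunk10" before "chunk2". The sketch below concatenates chunks in numerical order instead. The concat_chunks helper and the numbered file naming are assumptions that follow the example above.

```shell
# concat_chunks PREFIX N_CHUNKS OUTPUT
# Concatenate PREFIX1.impute2 ... PREFIXN.impute2 into OUTPUT in numerical
# order. Fails without writing OUTPUT if any chunk file is missing.
concat_chunks() {
    cc_prefix=$1
    cc_n=$2
    cc_out=$3
    cc_i=1
    : > "$cc_out.tmp"
    while [ "$cc_i" -le "$cc_n" ]; do
        cc_file="${cc_prefix}${cc_i}.impute2"
        if [ ! -f "$cc_file" ]; then
            echo "missing chunk: $cc_file" >&2
            rm -f "$cc_out.tmp"
            return 1
        fi
        cat "$cc_file" >> "$cc_out.tmp"
        cc_i=$(( cc_i + 1 ))
    done
    mv "$cc_out.tmp" "$cc_out"
}

# Toy demonstration with 11 one-line chunk files in a scratch directory.
demo=$(mktemp -d)
n=1
while [ "$n" -le 11 ]; do
    echo "results from chunk $n" > "$demo/chr16_chunk$n.impute2"
    n=$(( n + 1 ))
done
concat_chunks "$demo/chr16_chunk" 11 "$demo/chr16_chunkAll.impute2"
tail -n 2 "$demo/chr16_chunkAll.impute2"   # chunks 10 and 11 come last
rm -r "$demo"
```

Stopping on a missing chunk avoids silently producing a whole-chromosome file with a gap in it.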

We are working on a set of prototype scripts that will (i) partition any input dataset into chunks and (ii) submit the imputation jobs for these chunks to a computing cluster. We will post these to the website as soon as they are ready.


Pre-Phasing GWAS

Improvements in sequencing and genotyping technologies have rapidly increased the amount of reference data that can be used to impute untyped SNPs in association studies. Larger reference panels improve the power and resolution of imputation-based association mapping, but they also increase the computational burden of imputation. To help offset this cost, we have developed an extension of the IMPUTE2 methodology.

The basic idea is to "pre-phase" your study genotypes to produce best-guess haplotypes, then impute into these estimated haplotypes in a separate program run. By contrast, the original IMPUTE2 method integrates over the unknown phase of your study data during the course of an imputation analysis. Pre-phasing leads to a small loss of accuracy, since the estimation uncertainty in the study haplotypes is ignored, but it allows for very fast imputation. This speedup is especially important because modern reference collections (such as those from the 1000 Genomes Project) are frequently updated and expanded, so that many investigators would benefit from "re-imputing" their datasets following each reference panel update. The pre-phasing step needs to be performed just once per study dataset, so re-imputing is computationally cheap.
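
The "phase once, impute many times" workflow can be outlined as a dry run that only prints command lines. The flag names -prephase_g, -use_prephased_g and -known_haps_g are given here from memory of the IMPUTE2 option list and should be treated as assumptions to confirm under Program Options. The placeholders, file name, and panel labels are hypothetical.

```shell
# NOTE: the three pre-phasing flag names below are assumptions; check them
# against Program Options before use. Nothing here actually runs impute2.
STUDY_ARGS="<study genotype and map arguments>"   # hypothetical placeholder
PHASED_HAPS="study.prephased.haps"                # hypothetical file name

# Step 1: estimate best-guess haplotypes (done once per study dataset).
prephase_cmd() {
    echo "./impute2 -prephase_g $STUDY_ARGS -int $1 $2"
}

# Step 2: impute into those haplotypes (repeated for each reference panel).
impute_prephased_cmd() {
    echo "./impute2 -use_prephased_g -known_haps_g $PHASED_HAPS <$3 reference arguments> -int $1 $2"
}

prephase_cmd 1 5000000
for panel in panel_old panel_new; do              # hypothetical panel labels
    impute_prephased_cmd 1 5000000 "$panel"
done
```

The loop shows why re-imputation is cheap: only the second step is repeated when a new reference panel is released.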

For these reasons, we now recommend pre-phasing as the standard approach for genotype imputation in genome-wide association studies, with the original IMPUTE2 algorithm reserved for maximizing accuracy in more targeted analyses. We have written a dedicated document (soon to be updated) that describes the pre-phasing process in detail and is integrated with some working examples; you can download it here.

A publication on this methodology is forthcoming.


FAQ

Q: What -Ne value should I use when there is more than one population in the reference panel?
A: You can find a complete answer here. The quick answer is ~20000.

Q: Why haven't you responded to my e-mail?
A: We go out of our way to respond promptly to queries about IMPUTE2. If you wrote to us and haven't heard back yet, the most likely reason is that we are too busy to reply immediately. There are a few things that you can do to improve the chances of receiving a fast response:
  • As you can see, we have put a lot of effort and accumulated wisdom into this website; please take a moment to see if your question is already answered in the documentation. We can tell when people have contacted us without reading the manual.

  • Please be as specific as possible in your question (this will be easier if you've read the documentation). Sometimes open-ended questions are unavoidable, but these naturally take longer to answer.

  • If the program is doing something you don't understand (e.g., crashing), it will be much easier for us to diagnose the problem if you send a copy of the screen output with your e-mail. Conveniently, IMPUTE2 automatically writes an output log file, so you can just attach this file to your message.

  • We are only human, and sometimes e-mails slip through the cracks, especially during busy times or holidays. If you haven't heard back from us after a week or so, feel free to e-mail again to check on the status of things -- we really do appreciate periodic reminders.


Citing IMPUTE2

If you use IMPUTE2 in a published manuscript, please cite Howie et al. (2009, PLoS Genetics).

If you use our strategy for imputing from multi-population reference panels, please also cite our paper on that work (forthcoming).

If you use our pre-phasing approach, please also cite our paper on that work (forthcoming).


Registration and Updates

If you would like to receive e-mails about updates to this software, please fill out the registration form.


References

[1] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913

[2] B. N. Howie, P. Donnelly and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529

[3] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11: 499-511


Contributors

The following people developed the methodology and software for IMPUTE2:

Bryan Howie, Jonathan Marchini

(Bryan promises he will have a new webpage up soon, but you will have to live with this outdated link for now.)


Contact Information

If you have any questions regarding the use of IMPUTE2, please send an e-mail to both of the following people:

Dr. Bryan Howie (bhowie <at> uchicago <dot> edu).
Dr. Jonathan Marchini (marchini <at> stats <dot> ox <dot> ac <dot> uk).

It is a good idea to include a copy of the screen output (which is printed to the -r file) with your e-mail to help us identify any problems. If the program is not acting as expected with your data, it is very helpful if you can also send us a small, working example dataset that illustrates the problem.