IMPUTE2

IMPUTE version 2 (also known simply as IMPUTE2) is a genotype imputation and phasing program based on ideas from Howie et al. (2009). Please click on the links below to download the software or learn how to use it.

Page last updated Dec 8, 2010.

We frequently add new features to the website, so please check back if you are actively using the software.

Home

Getting Started

What's New?

Contributors

Download

Overview

Using IMPUTE2 with Public Reference Data

Example Runs

Program Options

Analyzing Whole Chromosomes

Pre-Phasing GWAS

QC and Troubleshooting

FAQ

Registration and Updates

References

Contact Information

Getting Started

IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics.

You should begin by downloading the program from here. You will need to choose the link that matches your computing platform and then follow the instructions for opening the download package.

Once you have done this, you will be ready to try some example analyses on the test data that are provided with the download. The section on Example Runs shows how to use the most common IMPUTE2 functions; we suggest that you work through these examples and try to understand what the elements of each command-line call are doing. If you don't understand something, feel free to contact us.

After you have tried the example analyses, you can explore the rest of the website to see what IMPUTE2 can do. If you are not sure how certain options work, or would like to know if the program can perform a function that isn't listed, we are happy to advise.

What's New?

IMPUTE v2.1.2 (released Oct 1, 2010) includes the following new features:

We have now compiled IMPUTE2 on a greater variety of systems.
Due to speed optimizations in v2.1.0, we have decided to change the default -k value from 40 to 80. This parameter controls the running time vs. accuracy tradeoff by adjusting the rigor of IMPUTE2's phasing algorithm. The new default value leads to meaningful increases in accuracy while keeping the running time reasonable for large datasets.
As explained in the section on Analyzing Whole Chromosomes, IMPUTE2 now prevents the analysis of regions longer than 7 Mb unless the -allow_large_regions flag has been activated. This change is intended to guide beginners while allowing more experienced users to selectively turn off the safety feature.
We have added extensive functionality to support "pre-phasing" of GWAS datasets to speed up multiple rounds of imputation. You can find more details about the philosophy and implementation of the pre-phasing approach here.
As part of the pre-phasing module, it is now possible to impute into datasets that consist of phased haplotypes rather than unphased genotypes. More details are provided in the section on Pre-Phasing GWAS Datasets.
In addition to imputing into completely phased or completely unphased datasets, IMPUTE2 can now handle partially phased datasets by supplementing the main genotypes file ( -g ) with a -known_haps_g file; details here.
We have improved the program's error reporting to deal with common problems, as well as fixing a couple of small bugs from v2.1.0.
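The 7 Mb limit described in the list above means that a whole chromosome is normally processed as a series of shorter -int intervals. The following is a minimal sketch of how such intervals might be generated. The function name print_intervals, the 5 Mb chunk size, and the 12 Mb example length are illustrative assumptions, not values taken from IMPUTE2.

```shell
#!/bin/sh
# Print one "-int <lower> <upper>" pair per chunk of a chromosome.
# The chunk size and chromosome length are illustrative assumptions,
# not IMPUTE2 defaults.
print_intervals() {
  chr_length=$1   # chromosome length in base pairs
  chunk_size=$2   # chunk length in base pairs (keep it well under 7 Mb)
  start=1
  while [ "$start" -le "$chr_length" ]; do
    end=$((start + chunk_size - 1))
    if [ "$end" -gt "$chr_length" ]; then
      end=$chr_length   # the last chunk may be shorter than the others
    fi
    echo "-int $start $end"
    start=$((end + 1))
  done
}

# Example: a hypothetical 12 Mb region split into 5 Mb chunks.
print_intervals 12000000 5000000
```

Each printed pair can be pasted into an IMPUTE2 command in place of the -int arguments used in the example runs.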

New data

We recently posted the latest haplotypes from the 1,000 Genomes Project, which were released in June 2010. There are 120 CEU haplotypes, 120 CHB+JPT haplotypes, and 118 YRI haplotypes in the new dataset.

You can download the official release haplotypes or a set of haplotypes tailored to work with HapMap 3 below; we recommend using the latter (1,000 Genomes + HapMap 3) reference set for most imputation tasks.

New strategies

We have been working hard to show how IMPUTE2 can use large reference panels with diverse ancestry to improve the imputation of rare alleles and eliminate the need to choose which haplotypes will form the reference set. You may have seen us talk about this work; it is not yet published, but we have written a short summary of our ideas, results, and motivations here.

We have also used these ideas to inform the packaging of the 1,000 Genomes and HapMap 3 reference sets; you can download haplotypes that fit IMPUTE2's reference panel philosophy here.

Contributors

The following people developed the methodology and software for IMPUTE2:

Bryan Howie, Jonathan Marchini

Download

IMPUTE v2 is freely available for academic use only. Please see the LICENCE here; a copy is also included with the package.

Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please contact us.

Platform	File
Linux (x86_64) Dynamic Executable	impute_v2.1.2_x86_64_dynamic.tgz
Linux (x86_64) Static Executable	impute_v2.1.2_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)	impute_v2.1.2_SuSE9.3_x86_64_static.tgz
Linux (i386) Dynamic Executable	impute_v2.1.2_i386_dynamic.tgz
Mac OS X Intel	impute_v2.1.2_MacOSX_Intel.tgz
Mac OS X PowerPC	impute_v2.1.2_MacOSX_PowerPC_dynamic.tgz
Solaris 5.10 (AMD Opteron)	impute_v2.1.2_Solaris5.10_Opteron.tgz
Windows MS-DOS (Intel)	impute_v2.1.2_Windows_Intel.tgz

To unpack the files on a Linux computer, use a command like

tar -zxvf impute_v2.X.Y_i386.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files and program calls.
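As a rough sketch of the unpacking step, the snippet below derives the directory name that the archive is expected to create and lists its contents. The archive name is one of the files from the table above; the checks around tar are our own additions so that the snippet does nothing harmful if the archive is absent.

```shell
#!/bin/sh
# Derive the directory that unpacking is expected to create:
# the archive name minus its '.tgz' suffix.
archive=impute_v2.1.2_x86_64_static.tgz
dir=$(basename "$archive" .tgz)
echo "Expected directory: $dir"

# Unpack and list the contents only if the archive is actually present.
if [ -f "$archive" ]; then
  tar -zxvf "$archive"
  ls "$dir"   # should show impute2, LICENCE, and Example/
fi
```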

Overview

The figure below provides a schematic overview of what IMPUTE2 does. In short, it uses a fine-scale recombination map and a densely genotyped reference panel to "fill in" missing genotypes in a study dataset, which might consist of cases and controls typed on a commercial SNP chip. By estimating the genotypes of SNPs that were not in the original study data, imputation allows a much larger set of SNPs to be tested for association. This can increase both the power to detect association signals and the signal resolution near a causal variant.

Imputation scenarios and program nomenclature

The next two figures illustrate the common imputation scenarios that IMPUTE2 is designed to handle. These figures introduce the nomenclature used by the program to label panels and SNPs, including "special" SNPs that do not fit into the standard imputation framework.

SCENARIO A: ONE REFERENCE PANEL

This is the imputation setup that most people are familiar with: a reference panel containing a dense set of SNPs is used to impute missing genotypes in a study dataset that has been typed at a sparser set of SNPs. IMPUTE2 refers to the reference data as Panel 0 or Panel 1 (for phased and unphased reference panels, respectively) and to the study data as Panel 2. These labels serve as a convenient shorthand in the program's screen output.

IMPUTE2 labels SNPs by the panels in which they have been genotyped. Each label denotes a specific functional role. In the figure above, SNPs that have data only in the reference panel are labeled Type 0 or Type 1 (for phased and unphased reference panels, respectively), whereas SNPs that have genotypes in the study dataset are labeled Type 2. Type 2 SNPs dictate which reference panel haplotypes should be "copied" for each individual; then, the reference panel alleles at Type 0/1 SNPs are used to fill in that individual's missing genotypes.

There is one novelty in the way that IMPUTE2 treats Scenario A. In the figure, one of the SNPs that is labeled as Type 2 has data in Panel 2 but not in the reference panel. Most imputation methods ignore these kinds of SNPs since they are hard to model. For example, IMPUTE v1 labels these as Type 3 SNPs, and it does not impute them or use them to inform the inference. By contrast, IMPUTE2 uses a novel approach to model the missing reference panel alleles, thereby allowing it to gain information from the study genotypes at such SNPs. This feature highlights one of the guiding principles of IMPUTE2: to increase imputation accuracy by using as much of the information in the data as possible.

SCENARIO B: TWO REFERENCE PANELS

Another novel feature of IMPUTE2 is the ability to combine two reference panels containing different sets of SNPs in a single imputation analysis. In the figure below, the first reference panel is called Panel 0, the second reference panel is called Panel 1, and the study dataset is called Panel 2. It is common for each successive panel (0,1,2) to be genotyped at a subset of the SNPs in the previous panel. For example, Panel 0 might comprise haplotypes from the 1,000 Genomes Project, which captures nearly all common SNPs in the genome; Panel 1 might comprise haplotypes from HapMap Phase 3, which surveys a subset of common SNPs; and Panel 2 might be a set of cases and controls genotyped on a commercial SNP chip. IMPUTE2 assumes that the sets of SNPs in Panels 0-2 follow this hierarchical scheme, although it can handle certain exceptions, as discussed below.

In imputation Scenario B, SNPs are labeled as follows:

Type 0 SNPs have data in Panel 0 only and are used for imputation.
Type 1 SNPs have data in Panel 1 and are used for imputation. They may or may not have data in Panel 0; if not, IMPUTE2 will simulate the Panel 0 alleles with its hole-filling function.
Type 2 SNPs have data in Panel 2 and Panel 1, and are used to determine which reference panel haplotypes will be copied. They may or may not have data in Panel 0; if not, IMPUTE2 will simulate the Panel 0 alleles with its hole-filling function.
Type 3 SNPs have data in Panel 2 only*. These SNPs do not fit easily into the imputation model, so their genotypes cannot inform the inference.

*In the figure above, the Type 3 SNP on the left is a special case: it has data in Panel 0 and Panel 2, but not in Panel 1. By default, IMPUTE2 ignores the Panel 0 data at such SNPs, thereby converting them to Type 3. However, this behavior can be changed by activating the -pgs flag, which causes the program to keep the Panel 0 data and impute the Panel 2 genotypes (effectively converting the SNP to Type 0). In the near future, we will extend the IMPUTE2 algorithm to handle these SNPs more naturally, but for now they should not be a big issue.
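The labeling rules for Scenario B can be summarized as a small decision function. This is only a sketch of our reading of the list and footnote above, not code from IMPUTE2: the function name snp_type and its 0/1 arguments (presence of data in Panels 0, 1, and 2, plus whether -pgs is active) are hypothetical.

```shell
#!/bin/sh
# Classify a SNP by the panels in which it has data (Scenario B).
# Arguments: in_panel0 in_panel1 in_panel2 [pgs], each 0 or 1.
# Hypothetical helper for illustration; not part of IMPUTE2.
snp_type() {
  in0=$1; in1=$2; in2=$3; pgs=${4:-0}
  if [ "$in2" -eq 1 ] && [ "$in1" -eq 1 ]; then
    echo 2            # study data plus Panel 1: guides haplotype copying
  elif [ "$in2" -eq 1 ]; then
    # Study data without Panel 1: Type 3 by default. With -pgs and
    # Panel 0 data, the SNP is effectively treated as Type 0.
    if [ "$in0" -eq 1 ] && [ "$pgs" -eq 1 ]; then echo 0; else echo 3; fi
  elif [ "$in1" -eq 1 ]; then
    echo 1            # Panel 1 data, no study data: imputed
  elif [ "$in0" -eq 1 ]; then
    echo 0            # Panel 0 only: imputed
  else
    echo none         # no data in any panel
  fi
}

snp_type 1 0 1      # Panel 0 + Panel 2, default behavior: prints 3
snp_type 1 0 1 1    # same SNP with -pgs active: prints 0
```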

Using IMPUTE2 with Public Reference Data

IMPUTE2 can use customized reference panels (e.g., SNP genotypes from a fine-mapping study) as well as publicly available reference datasets. There are a variety of reference panels to choose from:

Reference Set	NCBI Genome build	Description
1000 Genomes August haplotypes	Build 37	These consist of three panels of haplotypes denoted EUR (European haplotypes), AFR (African haplotypes) and ASN (Asian haplotypes). There are 566 EUR haplotypes at 11,572,677 SNPs. There are 348 AFR haplotypes at 16,514,846 SNPs. There are 388 ASN haplotypes at 10,524,588 SNPs. We have also supplied recombination maps in Build 37 co-ordinates.
Combined 1000 Genomes low-coverage pilot haplotypes + HapMap3 haplotypes	Build 36	We combined the 1000 Genomes low-coverage haplotypes with HapMap3. This provides a panel that is dense (i.e., many SNPs from the 1000 Genomes panel) and deep (i.e., many haplotypes at the SNPs in HapMap3). IMPUTE v2 can handle hierarchical reference panels to get the advantages of both at the same time.
1000 Genomes low-coverage pilot haplotypes	Build 36	The 1000 Genomes haplotypes from the low-coverage pilot, released in June 2010. We have only made the CEU and YRI haplotypes available. This dataset has been superseded by the 1000 Genomes August haplotypes.
HapMap3 haplotypes	Build 36	HapMap3 haplotypes
HapMap2 haplotypes	Build 36&35	These haplotype sets can be found on the IMPUTE v1 webpage.

1,000 Genomes August haplotypes -- NCBI Build 37

Haplotypes + legend files	Recombination maps (Build 37)
EUR.1000Genomes.Dec2010.haplotypes.tgz [547 Mb], AFR.1000Genomes.Dec2010.haplotypes.tgz [528 Mb], ASN.1000Genomes.Dec2010.haplotypes.tgz [341 Mb]	genetic_maps_b37.tgz

HapMap 3 + 1,000 Genomes low-coverage pilot haplotypes (filtered) -- NCBI Build 36

--HapMap 3 files are from release #2 (Feb 2009)
--1,000 Genomes files are from the low-coverage pilot genotypes released Mar 2010; phased haplotypes released Jun 2010

We combined the reference haplotypes from the 1,000 Genomes Project and HapMap Phase 3. The 1,000 Genomes dataset provides wide coverage of the genome, in that it contains many more SNPs than the HapMap (with high enrichment for rare mutations), while HapMap 3 provides deep coverage, in that it contains a greater sampling of chromosomes from human populations. We have designed IMPUTE2 to integrate these wide and deep panels into a single analysis framework, as shown in this example.

To download the data needed to impute from a combined HapMap 3 + 1,000 Genomes reference panel, please click the appropriate link under the Download packages heading below:

Haplotype, legend, sample, and genetic map files

These downloads contain the data needed to impute genotypes using reference panels from HapMap 3 and the 1,000 Genomes Project. Each dataset includes the latest haplotypes from the 1,000 Genomes panel of interest, along with all available HapMap 3 haplotypes, except those present in the relevant 1,000 Genomes panel. We remove these duplicate haplotypes so that the two datasets can be combined without causing "double counting" of haplotypes during imputation. Both sets of haplotypes have also been filtered to remove SNPs with apparent quality issues.

When using these combined panels, you should set the -Ne argument to 15000, as explained here.

To see an example command that combines HapMap 3 and 1,000 Genomes haplotypes in a single imputation analysis, go here.

To see our rationale for using all HapMap 3 haplotypes together, rather than focusing on population-matched subsets, go here.

To learn more about our scheme for filtering out low-quality SNPs, go here.

If you prefer unfiltered 1,000 Genomes haplotypes, you can download them from here; similarly, you can download unfiltered HapMap 3 haplotypes from here.

Download packages (warning: large files)

[CEU]

[YRI]

[CHB+JPT (coming soon)]

NOTE: When combining datasets in an imputation analysis, you should always take great care to ensure that they have been aligned to the same strand convention. In this case, we have already aligned the HapMap 3 and 1,000 Genomes data to the '+' strand of the human reference sequence, and we have removed SNPs with unresolvable strand flips between panels. Consequently, you just need to make sure that your dataset is correctly aligned before imputing from the combined panel.

While we prefer the reference panels linked above, we recognize that some people may want to download the original, unfiltered HapMap 3 and 1,000 Genomes datasets. These can be obtained below:

1,000 Genomes haplotypes (unfiltered) -- NCBI Build 36

--1,000 Genomes files are from Pilot 1 genotypes released Mar 2010; phased haplotypes released Jun 2010

Haplotype, legend, sample, and genetic map files

These downloads contain the data needed to impute genotypes using reference panels from the 1,000 Genomes Project. The files are unfiltered, in the sense that we have not modified them from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs.

When using one of these panels, you should set the -Ne argument to the population-specific value suggested here.

Download packages (warning: large files)

[CEU]

[YRI]

[CHB+JPT (coming soon)]

HapMap 3 haplotypes (unfiltered) -- NCBI Build 36

--HapMap 3 files are from release #2 (Feb 2009)

Haplotype, legend, sample, and genetic map files

These downloads contain the data needed to impute genotypes using reference panels from HapMap Phase 3. The files are unfiltered, in the sense that we have only modified them minimally from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs. In HapMap 3, the most common problem is that an allele will "drop out" of the genotyping assay, thereby making every individual appear homozygous for the same allele.

When using this combined panel, you should set the -Ne argument to 15000, as explained here.

Download packages (warning: large files)

[ALL PANELS]

You can also download HapMap Phase 2 haplotypes in the format used by IMPUTE2; to access them, please click here.

We are continually working to distribute the most up-to-date and comprehensive reference datasets available. We will post them here in IMPUTE2 format as we process them.

Example Runs

This section provides some example runs that illustrate typical applications of IMPUTE2. All of the data files used in these command-line calls are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable).

Note that, within each command box below, most lines end with the '\' character. This is not actually part of the command -- it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split each example command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window (so, for example, you should be able to directly paste these commands into the terminal and hit 'enter' to make them run), but it would be equivalent to put all of the arguments on a single line, separated by spaces.
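To illustrate the line-continuation notation, the short sketch below uses echo as a stand-in for the real program, so that it can be run without any data files. It shows that the multi-line and single-line forms produce exactly the same argument list.

```shell
#!/bin/sh
# 'echo' stands in for the real program so that nothing is executed.
multi=$(echo ./impute2 \
  -int 20.4e6 20.5e6 \
  -Ne 11418)

single=$(echo ./impute2 -int 20.4e6 20.5e6 -Ne 11418)

# The shell joins continued lines before splitting the command into words,
# so both forms yield the same command.
if [ "$multi" = "$single" ]; then
  echo "identical: $single"
fi
```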

ONE PHASED REFERENCE PANEL

This is the canonical imputation scenario: a single reference panel composed of known haplotypes with no missing alleles.

./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 11418 \
-o ./Example/example.chr22.one.phased.impute2

We can also switch from a "wide" 1,000 Genomes reference panel to a "deep" HapMap 3 reference panel. The HapMap panel contains fewer SNPs, but the larger sample size means that the imputation of these SNPs will be more accurate, especially for SNPs with low minor allele frequencies.

./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.hm3.haps \
-l ./Example/example.chr22.hm3.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 11418 \
-o ./Example/example.chr22.one.phased.impute2

With IMPUTE2, there is no need to choose between the "wide" and "deep" analyses shown above. We show how to use both kinds of reference panels in a single imputation run below.

ONE UNPHASED REFERENCE PANEL

It is not necessary for the reference panel to be phased: IMPUTE2 can do the phasing internally while correctly accounting for the phase uncertainty. To use an unphased reference panel, simply replace the -h and -l files with a -g_ref file.

./impute2 \
-m ./Example/example.chr22.map \
-g_ref ./Example/example.chr22.reference.gens \
-strand_g_ref ./Example/example.chr22.reference.strand \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 11418 \
-o ./Example/example.chr22.one.unphased.impute2

TWO PHASED REFERENCE PANELS

This is the scenario that we think most people should be using for imputation at present: two phased reference panels, one of which contains roughly a subset of the SNPs in the other. Concretely, the 1,000 Genomes Project is already producing haplotypes with near-complete ascertainment of common SNPs, whereas HapMap Phase 3 includes a larger number of individuals who have been genotyped at a subset of these SNPs. Combining these reference datasets in a single imputation analysis yields extensive coverage of the genome (via the 1,000 Genomes SNPs) and increased accuracy at a subset of SNPs (those typed in HapMap 3). Note that the haplotype dataset containing more SNPs (here, the 1,000 Genomes haplotypes) should always be provided first on the command line.

./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
./Example/example.chr22.hm3.haps \
-l ./Example/example.chr22.1kG.legend \
./Example/example.chr22.hm3.legend \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-int 20.4e6 20.5e6 \
-Ne 11418 \
-o ./Example/example.chr22.two.phased.impute2

ONE PHASED REFERENCE PANEL, ONE UNPHASED REFERENCE PANEL

Sometimes it is useful to combine a set of publicly available haplotypes (e.g., from HapMap or the 1,000 Genomes Project) with an unphased reference dataset (e.g., genotypes from a SNP chip). The following command shows how to do this.
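A sketch of such a command, assembled from options used in the other examples on this page: -h and -l supply the phased panel, while -g_ref and -strand_g_ref supply the unphased one. The output file name is an assumption, and the guard around the call is our own addition so that the snippet only runs IMPUTE2 when the executable is present.

```shell
#!/bin/sh
# Sketch only: the -o prefix below is a hypothetical name.
# All input files are the ones shipped in the Example/ directory.
set -- \
  -m ./Example/example.chr22.map \
  -h ./Example/example.chr22.1kG.haps \
  -l ./Example/example.chr22.1kG.legend \
  -g_ref ./Example/example.chr22.reference.gens \
  -strand_g_ref ./Example/example.chr22.reference.strand \
  -g ./Example/example.chr22.study.gens \
  -strand_g ./Example/example.chr22.study.strand \
  -int 20.4e6 20.5e6 \
  -Ne 11418 \
  -o ./Example/example.chr22.one.phased.one.unphased.impute2

# Run from the main download directory; otherwise just show the command.
if [ -x ./impute2 ]; then
  ./impute2 "$@"
else
  echo "impute2 not found; would run: ./impute2 $*"
fi
```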

PHASING WITHOUT A REFERENCE PANEL

In addition to imputation, IMPUTE2 can also be used for highly accurate phasing. This command shows how you can use the -phase flag to perform a classical phasing analysis. Note that no strand alignment is needed in this example since we are using only one data panel. However, it may be important to align the strand at this stage if you intend to use the phased haplotypes for downstream imputation.

./impute2 \
-phase \
-m ./Example/example.chr22.map \
-g ./Example/example.chr22.study.gens \
-int 20.4e6 20.5e6 \
-Ne 11418 \
-o ./Example/example.chr22.phasing.impute2

The -o file is always reserved for imputation output, so the phased haplotypes in this example get printed to a file named ./Example/example.chr22.phasing.impute2_haps, where the _haps suffix is added automatically. The format of this output file is explained here.
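As a quick sanity check on a phased output file, you can count the individuals it contains. The sketch below relies on the layout given for -known_haps_g files under Program Options: five header columns followed by two haplotype columns per individual. The function name count_individuals and the tiny input file are made up for illustration.

```shell
#!/bin/sh
# Count individuals in a haplotype file laid out as five header columns
# followed by two haplotype columns per individual.
# Hypothetical helper; only the first line of the file is inspected.
count_individuals() {
  awk 'NR == 1 { print (NF - 5) / 2; exit }' "$1"
}

# Tiny made-up file with two SNPs and three individuals (not real data).
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
snp1 rs1 20400100 A G 0 1 1 1 0 0
snp2 rs2 20400200 C T 1 0 0 0 1 1
EOF
count_individuals "$tmp"   # prints 3
rm -f "$tmp"
```

After the phasing run above, the same function could be pointed at ./Example/example.chr22.phasing.impute2_haps.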

A MORE COMPLICATED EXAMPLE

The preceding example runs use only a small fraction of the options that are available in IMPUTE2. Here we return to the ONE PHASED REFERENCE PANEL, ONE UNPHASED REFERENCE PANEL scenario to show how a broader range of options might be used.

./impute2 \
-m ./Example/example.chr22.map \
-h ./Example/example.chr22.1kG.haps \
-l ./Example/example.chr22.1kG.legend \
-g_ref ./Example/example.chr22.reference.gens \
-strand_g_ref ./Example/example.chr22.reference.strand \
-exclude_snps_g_ref ./Example/example.chr22.reference.snp.exclusions \
-g ./Example/example.chr22.study.gens \
-strand_g ./Example/example.chr22.study.strand \
-fix_strand_g \
-sample_g ./Example/example.study.samples \
-exclude_samples_g ./Example/example.study.sample.exclusions \
-int 20.4e6 20.5e6 \
-Ne 11418 \
-k 100 \
-burnin 5 \
-iter 20 \
-pgs \
-no_sample_qc_info \
-o_gz \
-o ./Example/example.chr22.complicated.impute2

The command above contains several options that were not used in previous example runs:

The -exclude_snps_g_ref option specifies a few SNPs to remove from the -g_ref file, using different types of SNP IDs. These might be SNPs that failed QC testing, for example.
The -fix_strand_g option tells the program to use its strand alignment procedure to make the allele coding in the -g file match the coding in the -l file. However, the -strand_g option takes precedence over -fix_strand_g, and in this case all of the genotyped SNPs have explicit alignments in the strand file, so the flag has no effect.
This run includes both a -sample_g file and an -exclude_samples_g file. The sample file tells IMPUTE2 which samples in the -g file are which, and the exclusions file tells it the IDs of samples that should be removed from the analysis. These might be individuals who showed systematic data quality problems on a genome-wide SNP chip, for example.
Here we have increased -k from its default value of 80 to 100. This will increase the imputation accuracy, but it will also increase IMPUTE2's running time. In this example we have tried to offset the increased running time by decreasing the -burnin value from 10 (default) to 5 and the -iter value from 30 (default) to 20.
The -pgs flag tells the program to "predict genotyped SNPs"; that is, to replace the original study genotypes with LD-based imputed genotypes in the output file.
The -no_sample_qc_info flag suppresses the output file that shows quality control metrics for each individual in the -g file.
The -o_gz flag specifies that the main output file should be compressed by the gzip algorithm; this is useful if you are running lots of jobs that produce large output files.

Program Options

The following tables describe the command-line options that can be used to control IMPUTE2. Many of these options are similar to options in IMPUTE v1 (and earlier versions) but there are some key differences in how these options are handled by IMPUTE2 -- these are noted in green.

Input data files

This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems). In all of these files, it is important that SNPs appear in base pair position order, from lowest to highest. It is also crucial that all SNP positions come from the same genome build (e.g., NCBI Build 36) so the program can combine information across input files.

Flag	Default	Description
-g <file> REQUIRED unless -known_haps_g provided	none	File containing genotypes for a study cohort that we want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO.
-known_haps_g <file>	none	File containing known haplotypes for the study cohort. The format is the same as the output format from IMPUTE2's -phase option: five header columns (as in the -g file) followed by two columns (haplotypes) per individual. Allowed values in the haplotype columns are 0, 1, and ?. If your study dataset is fully phased, you can replace the -g file with a -known_haps_g file. This will cause IMPUTE2 to perform haploid imputation, although it will still report diploid imputation probabilities in the main output file. If any genotypes are missing, they can be marked as ? ? in the input file. (The program does not allow just one allele from a diploid genotype to be missing.) If the reference panels are also phased, IMPUTE2 will perform a single, fast imputation step rather than its standard MCMC module. The -known_haps_g file can also be used to specify study genotypes that are "partially" phased, in the sense that some genotypes are phased relative to a fixed reference point while others are not. We anticipate that this will be most useful when trying to phase resequencing data onto a scaffold of known haplotypes. This functionality is not yet fully documented, but you are welcome to contact us if this seems useful to you.
-m <file> REQUIRED	none	Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)").
-h <file 1> <file 2>	none	File of known haplotypes, with one row per SNP and one column per haplotype. In IMPUTE2, it is possible to specify two known haplotypes files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 -- no other values are allowed.
-l <file 1> <file 2>	none	Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). When using two known haplotypes files with IMPUTE2, you must supply the corresponding legend files in the same order -- i.e., the file with more SNPs comes first.
-g_ref <file>	none	File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file).
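Because the program requires SNPs in base pair order and can have trouble with duplicate positions, it may help to check a legend file before running. The sketch below assumes the four-column legend layout described above (header line, position in column 2). The function name check_positions and the sample file are hypothetical.

```shell
#!/bin/sh
# Check that positions (column 2) in a legend file are strictly increasing.
# Line 1 is treated as the header. Hypothetical helper for illustration.
check_positions() {
  awk 'NR > 2 && $2 + 0 < prev  { print "out of order at line " NR; bad = 1; exit }
       NR > 2 && $2 + 0 == prev { print "duplicate position at line " NR; bad = 1; exit }
       NR > 1 { prev = $2 + 0 }
       END { if (!bad) print "ok" }' "$1"
}

# Tiny made-up legend file (not real data).
tmp=$(mktemp)
printf 'rsID position a0 a1\nrs1 100 A G\nrs2 200 C T\n' > "$tmp"
check_positions "$tmp"   # prints ok
rm -f "$tmp"
```

The same idea applies to -g files, with the position column adjusted to match that format.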

Basic options

These options control some basic processing that the program does to prepare input data for inference.

Flag	Default	Description
-int <lower> <upper> REQUIRED	none	Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes. IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses.
-buffer <int>	250 kb	Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int flag. SNPs in the buffer regions inform the inference but do not appear in output files. Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval.
-allow_large_regions		Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here.
-Ne <int>	14000	"Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to train its population model. As a starting point, we suggest values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT. When combining reference panels, we suggest taking the average of the panel-specific Ne values, weighted by the number of chromosomes in each panel; e.g., for a CEU+YRI+CHB+JPT panel in HapMap Phase II data, the Ne would be [(120 * 11418) + (120 * 17469) + (180 * 14269)] / (120 + 120 + 180) ≈ 14369. For larger and more complicated reference panels where this calculation would become tedious (e.g., datasets that include all HapMap 3 panels), setting Ne to 15000 should be fine. In our experience, imputation accuracy is quite insensitive to the exact Ne value when using a large reference panel with diverse ancestry.
-call_thresh <float>	0.9	Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing. NOTE: This threshold only applies to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions.
-nind <int>	# of indiv in -g file	Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set -nind 5. This option is useful for debugging and test runs.
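
The weighted-average rule for -Ne described in the table above can be written out as a short calculation. The Python sketch below is purely illustrative: the function name weighted_ne is hypothetical and is not part of IMPUTE2, and the inputs are the HapMap Phase II chromosome counts and panel-specific Ne values quoted in the -Ne description.

```python
def weighted_ne(panels):
    """Average panel-specific Ne values, weighted by chromosome count.

    `panels` is a list of (number_of_chromosomes, Ne) pairs.
    """
    total_chromosomes = sum(n for n, _ in panels)
    weighted_sum = sum(n * ne for n, ne in panels)
    return weighted_sum / total_chromosomes


# Values quoted in the -Ne description: CEU, YRI, CHB+JPT
hapmap2 = [(120, 11418), (120, 17469), (180, 14269)]
print(round(weighted_ne(hapmap2)))  # 14369
```

The rounded result is the value that would be passed on the command line as -Ne.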

Strand alignment options

In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you. If you do not know exactly how your data were processed or what these options are doing, you should try to locate the original strand information or contact us for assistance.

NOTE: After applying the strand alignment options below, the program will discard any SNPs that have conflicting alleles across panels (e.g., A/T in the reference haplotypes and A/C in the study genotypes).

NOTE: We currently assume that all phased input files have already been aligned to the '+' strand of the human genome reference sequence, which is true of the files that we distribute; hence, the options here pertain only to unphased genotype files (i.e., the -g and -g_ref files).

Flag	Default	Description
-strand_g <file>	none	File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP, and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space. The ordering of the SNPs in this file does not matter (in contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). Some model strand files are included in the Example/ directory that comes with the software download. NOTE: This flag replaces the -s flag from versions prior to v2.1.0.
-strand_g_ref <file>	none	Same as -strand_g, but applies to the -g_ref file. NOTE: This flag replaces the -s_ref flag from versions prior to v2.1.0.
-fix_strand_g		Activates the program's internal strand alignment procedure for the -g file (Panel 2). The strand is aligned to the alleles in Panel 0, if present, otherwise to Panel 1. The strand is aligned deterministically where possible (e.g., flipping A/C in Panel 2 to match G/T in the reference) and by allele frequency otherwise (at A/T and C/G SNPs, whose alignment cannot be resolved by labels alone); in the latter case, the program codes the alleles such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele. NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to "fix" the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others. NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The only way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured. NOTE: This flag replaces the -fix_strand flag from versions prior to v2.1.0.
-fix_strand_g_ref		Similar to -fix_strand_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if this panel is not present. NOTE: Just as -fix_strand_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over the internal strand-fixing procedure. NOTE: As with -fix_strand_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%. NOTE: This flag replaces the -fix_strand_ref flag from versions prior to v2.1.0.
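
The distinction drawn above between SNPs that can be aligned from their allele labels (e.g., A/C versus G/T) and those that cannot (A/T and C/G) can be illustrated with a small Python sketch. This is not IMPUTE2's internal procedure; the function names are hypothetical, and the code only shows why labels alone are insufficient at A/T and C/G SNPs. The last helper writes one line in the two-column layout described for the -strand_g file.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(allele_a, allele_b):
    """True for A/T and C/G SNPs, whose two alleles are complements."""
    return COMPLEMENT[allele_a] == allele_b


def infer_strand(study_alleles, ref_alleles):
    """Return '+', '-', or None using allele labels only.

    '+'  : study alleles already match the reference alleles
    '-'  : study alleles match the reference after complementing
    None : ambiguous (A/T, C/G) or conflicting alleles across panels
    Allele order within each pair is ignored in this sketch.
    """
    if is_strand_ambiguous(*study_alleles):
        return None  # labels alone cannot resolve the strand
    study, ref = set(study_alleles), set(ref_alleles)
    if study == ref:
        return "+"
    if {COMPLEMENT[a] for a in study} == ref:
        return "-"
    return None  # e.g. A/C in one panel and A/T in the other


def format_strand_line(position, strand):
    """One strand-file line: position and strand separated by one space."""
    return f"{position} {strand}"


print(infer_strand(("A", "C"), ("G", "T")))  # -
print(infer_strand(("A", "T"), ("A", "T")))  # None
```

SNPs for which this kind of check returns None are the ones that need either original assay information or an allele-frequency comparison, with the caveats given for -fix_strand_g.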

Filtering options

The options in this table affect the way that the program filters the input data (mainly the -g and -g_ref files). Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.

Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref. This convention causes some of the flags in IMPUTE v2.1.0 to have different names than the equivalent flags in earlier versions.

Flag	Default	Description
-exclude_snps_g <file>	none	List of SNPs to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g file), their rsIDs (second column of -g file), or their base pair positions (third column of -g file). Excluded SNPs will be treated as if they had not been present in the genotypes file, and they will not be shown in the output unless you use the -impute_excluded option.
-exclude_snps_g_ref <file>	none	Same as -exclude_snps_g, but applies to the -g_ref file.
-impute_excluded		Specifies that SNPs excluded from the study dataset via the -exclude_snps_g option should be imputed and included in the output file. When this flag is not activated, excluded SNPs are simply ignored.
-include_snps <file>	none	List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the Type 0 and Type 1 SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file). This option does not have any effect on Type 2 and Type 3 SNPs.
-sample_g <file>	none	File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first three columns are necessary, and only the first two columns are used by IMPUTE2 (i.e., the third column can have dummy values, and subsequent columns do not affect the algorithm). NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option.
-sample_g_ref <file>	none	Same as -sample_g, but applies to the -g_ref file.
-exclude_samples_g <file>	none	List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is REQUIRED if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "[-o]_samples". NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option.
-exclude_samples_g_ref <file>	none	Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with -exclude_samples_g.
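
To make the SNP exclusion-list convention concrete, the following Python sketch checks whether a SNP would be matched by a single-column list of identifiers. It is only an illustration: the function names are hypothetical, the matching rule (a SNP is matched if any of its first three columns appears in the list) is our reading of the description above, and the columns are assumed to be whitespace-separated.

```python
def load_id_list(path):
    """Read a single-column text file of identifiers into a set."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def is_excluded(genotype_line, excluded_ids):
    """Check the first three columns of a genotype-file line against the list.

    Columns are SNP ID, rsID, and base pair position; all three are
    compared as strings, so positions in the list must be plain integers.
    """
    snp_id, rs_id, position = genotype_line.split()[:3]
    return bool({snp_id, rs_id, position} & excluded_ids)


# Made-up genotype line with one individual (three probabilities)
line = "SNP1 rs123 1000 A G 1 0 0"
print(is_excluded(line, {"rs123"}))  # True
print(is_excluded(line, {"rs999"}))  # False
```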

MCMC options

IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.

Flag	Default	Description
-iter <int>	30	Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly.
-burnin <int>	10	Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for the observed genotypes during each of the first [-burnin] iterations, but these iterations do not contribute to the final imputation probabilities. We have found that 10 burn-in iterations is enough to ensure good results in a variety of different datasets.
-k <int>	80	Maximum number of copying states to use for diploid phasing updates. Setting this value higher will lead to higher accuracy at the cost of longer running times. The default is a good starting point, but this can be increased if there is more time available to run the program. In our experience, the method reaches maximal accuracy near a -k value of 100, so we suggest that there is little point in using more intensive settings, even for very large datasets.
-k_hap <int>	500	Maximum number of copying states to use for haploid imputation updates. The default setting should be sufficient for most applications.

Output files

The options in this table control the format and naming conventions of output files printed by IMPUTE2.

Flag	Default	Description
-o <file>	./test.impute2	Name of main output file. Follows the same format as the -g file.
-i <file>	[-o]_info	Name of SNP-wise information file with one line per SNP and a single header line at the beginning; versions of IMPUTE prior to v2.1.0 did not print the header. This file always contains the following columns (header tags shown in parentheses): (1) SNP identifier from -g file (snp_id); (2) rsID (rs_id); (3) base pair position (position); (4) expected frequency of allele coded '1' in the -o file (exp_freq_a1); (5) measure of the observed statistical information associated with the allele frequency estimate (info); (6) average certainty of best-guess genotypes (certainty); (7) internal "type" assigned to SNP (type), a column that did not exist prior to v2.1.0. Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX, where X takes values in {0,1,2}. For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes (after applying the -call_thresh) and the best-guess imputed genotypes obtained by masking the input genotypes one SNP at a time and pretending the SNP is of type X; similarly, r2_typeX is the squared correlation between input and imputed genotypes. The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns did not exist prior to v2.1.0. They are useful for post-hoc quality control, as we explain in the section on QC and troubleshooting.
-r <file>	[-o]_summary	Name of file that records a summary of the screen output.
-w <file>	[-o]_warnings	Name of file that records warnings generated by IMPUTE2.
-os <int> <int> ...	0 1 2 3	"Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., "-os 0 1 2 3".
-o_gz		Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large.
-outdp <int>	3	Specifies the number of decimal places to use for reporting genotype probabilities in the main output file.
-no_snp_qc_info		Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in -i file.
-no_sample_qc_info		Suppresses printing of per-sample quality control metrics file. The default is to print a file named "[-i]_by_sample".
-phase		IMPUTE2 always implicitly phases the study dataset (-g file), and this flag tells the program to print the best-guess haplotypes that result from the phasing process. In addition to the standard imputation output file, the program prints a separate haplotypes file named "[-o]_haps". This file contains the same five header columns as the standard output, along with two columns (haplotypes) per individual, in the same order they appear in the main output. Alongside this "best-guess" haplotypes file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties are written to a file named "[-o]_haps_confidence", which contains one line per individual and one column per SNP in the phased haplotypes file. Homozygotes are represented by * characters and heterozygotes by numbers between 0.5 and 1.0; each number is the estimated probability that the phasing between the current heterozygote and the previous heterozygote (to the left) is correct. By convention, the leftmost heterozygous SNP in each individual is assigned a phasing certainty of 1.0. As shown in the examples section, it is possible to use the -phase option to produce haplotypes without the use of a reference panel; i.e., to perform a classical phasing analysis.
-pgs		"Predict Genotyped SNPs": Tells the program to replace the input genotypes from the -g file with imputed genotypes at Type 2 SNPs in the output file.
-pgs_miss		Unlike -pgs, this option tells the program to replace only missing genotypes with imputed genotypes. That is, any input genotype whose maximum probability exceeds the -call_thresh will simply be reprinted in the output file, whereas input genotypes that fall below the calling threshold will be imputed in the output. WARNING: This is an appealing option that promises to simply "fill in" sporadically missing genotypes in your input data. However, we think that following this procedure and then testing the SNPs for association could cause subtle problems. We are investigating these issues, but in the meantime we suggest that you only use this option with caution; using it naively may lead to bad results, and you do so at your own risk.
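
Because the -i file has a header line with the tags listed above, it is straightforward to read with a short script. The Python sketch below is a hedged illustration rather than a recommended QC procedure: the function names are hypothetical, the columns are assumed to be whitespace-separated, the example rows are made up, and the cutoff of 0.5 is an arbitrary value chosen only to show the mechanics.

```python
import io


def read_info_file(handle):
    """Yield one dict per SNP from an -i file that starts with a header line."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))


def low_info_snps(records, min_info):
    """Return rs_id values whose 'info' column falls below min_info."""
    return [rec["rs_id"] for rec in records if float(rec["info"]) < min_info]


# Made-up rows in the documented column order
example = io.StringIO(
    "snp_id rs_id position exp_freq_a1 info certainty type\n"
    "snp1 rs1 1000 0.250 0.950 0.990 0\n"
    "snp2 rs2 2000 0.010 0.300 0.900 0\n"
)
print(low_info_snps(read_info_file(example), min_info=0.5))  # ['rs2']
```

For a real run, the io.StringIO object would be replaced by an open file handle on the [-o]_info file.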

Options not intended for general use

The options in this table are meant for experts only. Don't use them unless you know what you are doing!

Flag	Default	Description
-seed <int>	random	Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option.
-no_warn		Turns warnings off, so that the -w file does not get printed.
-no_fill		Turns hole-filling off, so that SNPs included in the -g file but not in the lowest reference panel cannot contribute to the inference.
-no_remove		Prevents the program from discarding SNPs whose alleles cannot be aligned across panels.

Analyzing Whole Chromosomes (top)

In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2, but it is better to split a chromosome into smaller chunks for analysis. One important reason for this is that the population-genetic approximation used by IMPUTE2 works best over short genomic distances. The approximation works by modeling local genealogies, and the superior accuracy afforded by this model may diminish if there is too much recombination in the region.

Consequently, we recommend using the program on regions of ~5 Mb or shorter, and versions from v2.1.2 onward will throw an error if the analysis interval plus buffer region is longer than 7 Mb. People who have good reasons to impute a longer region in a single run can override this behavior with the -allow_large_regions flag.

Splitting a chromosome into smaller chunks is often a good computational strategy anyway, since it allows the chunks to be imputed separately on multiple computer processors. This decreases the effective computing time and limits the amount of RAM needed for each run.

The -int option provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5-megabase-pair regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
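
The interval pattern described above can be generated programmatically. The following Python sketch is one possible way to do it; the function name chunk_intervals and the chromosome length of 12,000,000 bp are made up for illustration, while the 5 Mb chunk size follows the example in the text.

```python
def chunk_intervals(chrom_length, chunk_size=5_000_000):
    """Return non-overlapping, adjacent (start, end) pairs covering 1..chrom_length."""
    intervals = []
    start = 1
    while start <= chrom_length:
        end = min(start + chunk_size - 1, chrom_length)
        intervals.append((start, end))
        start = end + 1
    return intervals


# Hypothetical 12 Mb chromosome split into 5 Mb analysis intervals
for start, end in chunk_intervals(12_000_000):
    print(f"-int {start} {end}")
# -int 1 5000000
# -int 5000001 10000000
# -int 10000001 12000000
```

Each printed pair would be supplied to a separate IMPUTE2 run; the final chunk is simply shorter when the chromosome length is not a multiple of the chunk size.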

Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to synthesize your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:

cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

Here, "chr16_chunkX.impute2" is an output file for one chunk of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three chunks to satisfy the approximation used by IMPUTE2.)
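
When a chromosome is split into ten or more chunks, listing the files by hand or by alphabetical order can place chunk 10 before chunk 2. The Python sketch below concatenates chunk files in numeric order instead. It is an illustration under stated assumptions: the function names are hypothetical, and the files are assumed to follow the "chr16_chunkX.impute2" naming pattern used in the example above.

```python
import os
import re
import shutil


def chunk_number(path):
    """Extract the integer N from a file name like 'chr16_chunkN.impute2'."""
    match = re.search(r"chunk(\d+)", os.path.basename(path))
    if match is None:
        raise ValueError(f"no chunk number in {path!r}")
    return int(match.group(1))


def concatenate_chunks(chunk_paths, combined_path):
    """Append chunk files to combined_path in ascending chunk order."""
    with open(combined_path, "wb") as out:
        for path in sorted(chunk_paths, key=chunk_number):
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)


# Example call (file names are placeholders):
# concatenate_chunks(
#     ["chr16_chunk1.impute2", "chr16_chunk2.impute2", "chr16_chunk10.impute2"],
#     "chr16_chunkAll.impute2",
# )
```

The result is the same as the cat command above, provided the chunks were imputed from adjacent, non-overlapping intervals.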

Pre-Phasing GWAS (top)

The proliferation of cheap DNA sequencing technologies has greatly increased the rate at which reference panels for genotype imputation can be generated. In this context, many GWAS investigators would like to re-impute their datasets as larger reference datasets become available. Imputation is still relatively computer-intensive when performed genome-wide, so we have been working on ways to speed up the inference in this context.

The basic idea is to phase your GWAS genotypes once, then re-impute from the phased haplotypes as new reference panels come online. We have written a dedicated document that describes this process in great detail and is integrated with some working examples; you can download it all here.

QC and Troubleshooting (top)

We will soon be posting suggestions for making sure that IMPUTE2 has run successfully, detecting common problems, and processing the output files prior to association analysis. Stay tuned...

FAQ (top)

Q: What -Ne value should I use when there is more than one population in the reference panel?

A: You can find a complete answer here. The quick answer is ~15000.

Q: Why haven't you responded to my e-mail?

A: We go out of our way to respond promptly to queries about IMPUTE2. If you wrote to us and haven't heard back yet, the most likely reason is that we are too busy to reply immediately. There are a few things that you can do to improve the chances of receiving a fast response:

As you can see, we have put a lot of effort and accumulated wisdom into this website; please take a moment to see if your question is already answered in the documentation. We can tell when people have contacted us without reading the manual.
Please be as specific as possible in your question (this will be easier if you've read the documentation). Sometimes open-ended questions are unavoidable, but these naturally take longer to answer.
If the program is doing something you don't understand (e.g., crashing), it will be much easier for us to diagnose the problem if you send a copy of the screen output with your e-mail. Conveniently, IMPUTE2 automatically writes an output log to the -r file, so you can just attach this file to your message.
We are only human, and sometimes e-mails slip through the cracks, especially during busy times or holidays. If you haven't heard back from us after a week or so, feel free to e-mail again to check on the status of things -- we really do appreciate periodic reminders.

Registration and Updates (top)

If you would like to receive e-mails about updates to this software, please fill out the registration form.

References (top)

[1] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]

[2] B. N. Howie, P. Donnelly and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article]

Contact Information (top)

If you have any questions regarding the use of IMPUTE2, please send an e-mail to both of the following people:

Dr. Bryan Howie ( bhowie <at> uchicago <dot> edu).
Dr. Jonathan Marchini ( marchini <at> stats <dot> ox <dot> ac <dot> uk).

It is a good idea to include a copy of the screen output (which is printed to the -r file) with your e-mail to help us identify any problems.