IMPUTE v2

IMPUTE version 2 (also known simply as IMPUTE2) is a genotype imputation program based on ideas from Howie et al. (2009). Please click on the links below to download the software or learn how to use it.

Page last updated Aug 5, 2010.

We will be adding a number of features to the website in the near future, so please check back if you are actively using the software.


New data!

We have now posted the latest haplotypes from the 1,000 Genomes Project, which were released in June 2010. There are 120 CEU haplotypes, 120 CHB+JPT haplotypes, and 118 YRI haplotypes in the new dataset.

You can download the official release haplotypes or a set of haplotypes tailored to work with HapMap 3 below; we recommend using the latter (1,000 Genomes + HapMap 3) reference set for most imputation tasks.


New strategies!

We have been working hard to show how IMPUTE2 can use large reference panels with diverse ancestry to improve the imputation of rare alleles and eliminate the need to choose which haplotypes will form the reference set. You may have seen us talk about this work; it is not yet published, but we have written a short summary of our ideas, results, and motivations here.

We have also used these ideas to inform the packaging of the 1,000 Genomes and HapMap 3 reference sets; you can download haplotypes that fit IMPUTE2's reference panel philosophy here.


IMPORTANT:

The population-genetic approximation used by IMPUTE2 is only valid over short genomic distances -- if you use the program on too large a region, the quality of the inference will diminish. While the definition of "short" depends on various characteristics of a dataset, in general the program should only be used on regions of 5 Mb or shorter.

If you need to impute a longer region (e.g., a whole chromosome in a genome-wide association study), we provide simple command-line options for splitting chromosomes into smaller chunks for analysis: this process is described in the section on Analyzing Whole Chromosomes.
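
As a rough illustration of the chunking idea, the sketch below prints one candidate -int argument per 5 Mb chunk. The helper name print_chunks and the 12 Mb chromosome length are hypothetical, and the exact treatment of interval endpoints by the program is not stated here, so the boundaries should be read as an assumption. The sketch only prints arguments; it does not run IMPUTE2.

```shell
# Hypothetical helper: print one "-int <lower> <upper>" line per chunk.
# Arguments: chromosome length in bp, chunk size in bp (default 5 Mb).
print_chunks() {
  chr_len=$1
  chunk=${2:-5000000}
  start=1
  while [ "$start" -le "$chr_len" ]; do
    end=$((start + chunk - 1))
    # Clip the final chunk to the end of the chromosome.
    if [ "$end" -gt "$chr_len" ]; then end=$chr_len; fi
    echo "-int $start $end"
    start=$((end + 1))
  done
}

# Example with an assumed chromosome length of 12 Mb.
print_chunks 12000000
```

Each printed line could then be pasted into a separate impute2 command for the same chromosome.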



Home
Overview
What's New?
Coming Soon
Contributors
Download
Using IMPUTE2 with Public Reference Data
Example Runs
Program Options
Analyzing Whole Chromosomes
QC and Troubleshooting
Filling Reference Panel Holes
FAQ
Registration and Updates
References
Contact Information


Overview

The figure below provides a schematic overview of what IMPUTE2 does. In short, it uses a fine-scale recombination map and a densely genotyped reference panel to "fill in" missing genotypes in a study dataset, which might consist of cases and controls typed on a commercial SNP chip. By estimating the genotypes of SNPs that were not in the original study data, imputation allows a much larger set of SNPs to be tested for association. This can increase both the power to detect association signals and the signal resolution near a causal variant.



Imputation scenarios and program nomenclature

The next two figures illustrate the common imputation scenarios that IMPUTE2 is designed to handle. These figures introduce the nomenclature used by the program to label panels and SNPs, including "special" SNPs that do not fit into the standard imputation framework.


SCENARIO A: ONE REFERENCE PANEL

This is the imputation setup that most people are familiar with: a reference panel containing a dense set of SNPs is used to impute missing genotypes in a study dataset that has been typed at a sparser set of SNPs. IMPUTE2 refers to the reference data as Panel 0 or Panel 1 (for phased and unphased reference panels, respectively) and to the study data as Panel 2. These labels serve as a convenient shorthand in the program's screen output.




IMPUTE2 labels SNPs by the panels in which they have been genotyped. Each label denotes a specific functional role. In the figure above, SNPs that have data only in the reference panel are labeled Type 0 or Type 1 (for phased and unphased reference panels, respectively), whereas SNPs that have genotypes in the study dataset are labeled Type 2. Type 2 SNPs dictate which reference panel haplotypes should be "copied" for each individual; then, the reference panel alleles at Type 0/1 SNPs are used to fill in that individual's missing genotypes.

There is one novelty in the way that IMPUTE2 treats Scenario A. In the figure, one of the SNPs that is labeled as Type 2 has data in Panel 2 but not in the reference panel. Most imputation methods ignore these kinds of SNPs since they are hard to model. For example, IMPUTE v1 labels these as Type 3 SNPs, and it does not impute them or use them to inform the inference. By contrast, IMPUTE2 uses a novel approach to model the missing reference panel alleles, thereby allowing it to gain information from the study genotypes at such SNPs; we describe this method for filling in "holes" in the reference panel here. This feature highlights one of the guiding principles of IMPUTE2: to increase imputation accuracy by using as much of the information in the data as possible.
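
The Scenario A labeling rule can be sketched with two lists of base pair positions. This is only an illustration of the description above, not the program's internal logic: the helper name label_snps and the toy positions are hypothetical, and a phased reference panel (Type 0) is assumed.

```shell
# Hypothetical position lists (one base pair position per line).
ref=$(mktemp); study=$(mktemp)
printf '%s\n' 100 200 300 400 > "$ref"
printf '%s\n' 200 400 500 > "$study"

# Label each position following the Scenario A description:
# Type 2 = has study genotypes; Type 0 = phased reference panel only.
label_snps() {
  awk 'NR == FNR { in_study[$1] = 1; next }    # first file: study positions
       !($1 in in_study) { print $1, "Type0" } # reference-only SNPs
       END { for (p in in_study) print p, "Type2" }' "$2" "$1" | sort -n
}

label_snps "$ref" "$study"
rm -f "$ref" "$study"
```

Position 500 appears only in the study list and is still labeled Type 2, matching the reference panel "hole" case described above.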


SCENARIO B: TWO REFERENCE PANELS

Another novel feature of IMPUTE2 is the ability to combine two reference panels containing different sets of SNPs in a single imputation analysis. In the figure below, the first reference panel is called Panel 0, the second reference panel is called Panel 1, and the study dataset is called Panel 2. It is common for each successive panel (0,1,2) to be genotyped at a subset of the SNPs in the previous panel. For example, Panel 0 might comprise haplotypes from the 1,000 Genomes Project, which captures nearly all common SNPs in the genome; Panel 1 might comprise haplotypes from HapMap Phase 3, which surveys a subset of common SNPs; and Panel 2 might be a set of cases and controls genotyped on a commercial SNP chip. IMPUTE2 assumes that the sets of SNPs in Panels 0-2 follow this hierarchical scheme, although it can handle certain exceptions, as discussed below.




In imputation Scenario B, SNPs are labeled as follows:

*In the figure above, the Type 3 SNP on the left is a special case: it has data in Panel 0 and Panel 2, but not in Panel 1. By default, IMPUTE2 ignores the Panel 0 data at such SNPs, thereby converting them to Type 3. However, this behavior can be changed by activating the -pgs flag, which causes the program to keep the Panel 0 data and impute the Panel 2 genotypes (effectively converting the SNP to Type 0). In the near future, we will extend the IMPUTE2 algorithm to handle these SNPs more naturally, but for now they should not be a big issue.


What's New?

IMPUTE v2.1.0 includes a number of new features. These include:


Coming Soon

We plan to add the following features to IMPUTE2 in the near future:


Contributors

The following people developed the methodology and software for IMPUTE2:

Bryan Howie, Jonathan Marchini


Download

Pre-compiled IMPUTE2 binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please contact us.

Platform                            File
Linux (x86_64) Static Executable    impute_v2.1.0_x86_64_static.tgz
Mac OS X Intel                      impute_v2.1.0_MacOSX_Intel.tgz

To unpack the files on a Linux computer, use a command like

tar -zxvf impute_v2.X.Y_i386.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called impute2, a LICENCE file, and an Example/ directory that contains example data files and program calls.


Using IMPUTE2 with Public Reference Data

IMPUTE2 can use customized reference panels (e.g., SNP genotypes from a fine-mapping study) as well as publicly available reference datasets. In the latter category, we currently recommend using a combination of reference haplotypes from the 1,000 Genomes Project and HapMap Phase 3. The 1,000 Genomes dataset provides wide coverage of the genome, in that it contains many more SNPs than the HapMap (with high enrichment for rare mutations), while HapMap 3 provides deep coverage, in that it contains a greater sampling of chromosomes from human populations. We have designed IMPUTE2 to integrate these wide and deep panels into a single analysis framework, as shown in this example.

To download the data needed to impute from a combined HapMap 3 + 1,000 Genomes reference panel, please click the appropriate link under the Download packages heading below:


HapMap 3 + 1,000 Genomes haplotypes (filtered) -- NCBI Build 36

   --HapMap 3 files are from release #2 (Feb 2009)
   --1,000 Genomes files are from Pilot 1 genotypes released Mar 2010; phased haplotypes released Jun 2010


Haplotype, legend, sample, and genetic map files

These downloads contain the data needed to impute genotypes using reference panels from HapMap 3 and the 1,000 Genomes Project. Each dataset includes the latest haplotypes from the 1,000 Genomes panel of interest, along with all available HapMap 3 haplotypes, except those present in the relevant 1,000 Genomes panel. We remove these duplicate haplotypes so that the two datasets can be combined without causing "double counting" of haplotypes during imputation. Both sets of haplotypes have also been filtered to remove SNPs with apparent quality issues.

To see an example command that combines HapMap 3 and 1,000 Genomes haplotypes in a single imputation analysis, go here.

To see our rationale for using all HapMap 3 haplotypes together, rather than focusing on population-matched subsets, go here.

To learn more about our scheme for filtering out low-quality SNPs, go here.

If you prefer unfiltered 1,000 Genomes haplotypes, you can download them from here; similarly, you can download unfiltered HapMap 3 haplotypes from here.


Download packages (warning: large files)

 [CEU]

 [YRI]

 [CHB+JPT (coming soon)]



NOTE: When combining datasets in an imputation analysis, you should always take great care to ensure that they have been aligned to the same strand convention. In this case, we have already aligned the HapMap 3 and 1,000 Genomes data to the '+' strand of the human reference sequence, and we have removed SNPs with unresolvable strand flips between panels. Consequently, you just need to make sure that your dataset is correctly aligned before imputing from the combined panel.

While we prefer the reference panels linked above, we recognize that some people may want to download the original, unfiltered HapMap 3 and 1,000 Genomes datasets. These can be obtained below:


1,000 Genomes haplotypes (unfiltered) -- NCBI Build 36

   --1,000 Genomes files are from Pilot 1 genotypes released Mar 2010; phased haplotypes released Jun 2010

Haplotype, legend, sample, and genetic map files

These downloads contain the data needed to impute genotypes using reference panels from the 1,000 Genomes Project. The files are unfiltered, in the sense that we have not modified them from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs.


Download packages (warning: large files)

 [CEU]

 [YRI]

 [CHB+JPT (coming soon)]



HapMap 3 haplotypes (unfiltered) -- NCBI Build 36

   --HapMap 3 files are from release #2 (Feb 2009)

Haplotype, legend, sample, and genetic map files

These downloads contain the data needed to impute genotypes using reference panels from HapMap Phase 3. The files are unfiltered, in the sense that we have only modified them minimally from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs. In HapMap 3, the most common problem is that an allele will "drop out" of the genotyping assay, thereby making every individual appear homozygous for the same allele.


Download packages (warning: large files)

 [ALL PANELS]



You can also download HapMap Phase 2 haplotypes in the format used by IMPUTE2; to access them, please click here.

We are continually working to distribute the most up-to-date and comprehensive reference datasets available. We will post them here in IMPUTE2 format as we process them.


Example Runs

This section provides some example runs that illustrate typical applications of IMPUTE2. All of the data files used in these command-line calls are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable).

Note that, within each command box below, most lines end with the '\' character. This is not actually part of the command -- it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split each example command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window (so, for example, you should be able to directly paste these commands into the terminal and hit 'enter' to make them run), but it would be equivalent to put all of the arguments on a single line, separated by spaces.
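
The equivalence of the two forms can be checked with a small, self-contained sketch. It uses printf in place of the impute2 executable, and the file name map.txt is a placeholder rather than a file from the download.

```shell
# The same command written across several lines and on a single line.
multi=$(printf '%s ' \
 -m map.txt \
 -int 20.4e6 20.5e6)
single=$(printf '%s ' -m map.txt -int 20.4e6 20.5e6)

# Both forms pass identical arguments, so the outputs match.
if [ "$multi" = "$single" ]; then echo "identical"; fi
```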


ONE PHASED REFERENCE PANEL

This is the canonical imputation scenario: a single reference panel comprised of known haplotypes with no missing alleles.

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -o ./Example/example.chr22.one.phased.impute2

We can also switch from a "wide" 1,000 Genomes reference panel to a "deep" HapMap 3 reference panel. The HapMap panel contains fewer SNPs, but the larger sample size means that the imputation of these SNPs will be more accurate, especially for SNPs with low minor allele frequencies.

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.hm3.haps \
 -l ./Example/example.chr22.hm3.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -o ./Example/example.chr22.one.phased.impute2

With IMPUTE2, there is no need to choose between the "wide" and "deep" analyses shown above. We show how to use both kinds of reference panels in a single imputation run below.


ONE UNPHASED REFERENCE PANEL

It is not necessary for the reference panel to be phased: IMPUTE2 can do the phasing internally while correctly accounting for the phase uncertainty. To use an unphased reference panel, simply replace the -h and -l files with a -g_ref file.

./impute2 \
 -m ./Example/example.chr22.map \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -o ./Example/example.chr22.one.unphased.impute2


TWO PHASED REFERENCE PANELS

This is the scenario that we think most people should be using for imputation at present: two phased reference panels, one of which contains roughly a subset of the SNPs in the other. Concretely, the 1,000 Genomes Project is already producing haplotypes with near-complete ascertainment of common SNPs, whereas HapMap Phase 3 includes a larger number of individuals who have been genotyped at a subset of these SNPs. Combining these reference datasets in a single imputation analysis yields extensive coverage of the genome (via the 1,000 Genomes SNPs) and increased accuracy at a subset of SNPs (those typed in HapMap 3). Note that the haplotype dataset containing more SNPs (here, the 1,000 Genomes haplotypes) should always be provided first on the command line.

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
    ./Example/example.chr22.hm3.haps \
 -l ./Example/example.chr22.1kG.legend \
    ./Example/example.chr22.hm3.legend \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -o ./Example/example.chr22.two.phased.impute2
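
One way to decide which panel goes first is to compare the number of SNPs in the two legend files. The sketch below is an assumption-laden helper, not part of IMPUTE2: the name order_legends and the toy legend files are hypothetical, and the SNP count is taken to be the number of lines minus the header line.

```shell
# Hypothetical helper: print two legend files with the larger one first.
# SNP count is taken as the number of lines minus the header line.
order_legends() {
  n1=$(($(wc -l < "$1") - 1))
  n2=$(($(wc -l < "$2") - 1))
  if [ "$n1" -ge "$n2" ]; then echo "$1 $2"; else echo "$2 $1"; fi
}

# Toy legend files standing in for real panels.
dir=$(mktemp -d)
printf 'rsID position a0 a1\nrs1 100 A C\n' > "$dir/small.legend"
printf 'rsID position a0 a1\nrs1 100 A C\nrs2 200 G T\n' > "$dir/big.legend"
order_legends "$dir/small.legend" "$dir/big.legend"
```

The printed order could be used for the -l arguments, with the matching -h files supplied in the same order.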


ONE PHASED REFERENCE PANEL, ONE UNPHASED REFERENCE PANEL

Sometimes it is useful to combine a set of publicly available haplotypes (e.g., from HapMap or the 1,000 Genomes Project) with an unphased reference dataset (e.g., genotypes from a SNP chip). The following command shows how to do this.

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -o ./Example/example.chr22.one.phased.one.unphased.impute2


PHASING WITHOUT A REFERENCE PANEL

Besides imputation, IMPUTE2 can be used for highly accurate phasing. This command shows how to use the -phase flag to perform a classical phasing analysis. Note that no strand alignment is needed in this example since we are using only one data panel.

./impute2 \
 -phase \
 -m ./Example/example.chr22.map \
 -g ./Example/example.chr22.study.gens \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -o ./Example/example.chr22.phasing.impute2


A MORE COMPLICATED EXAMPLE

The preceding example runs use only a small fraction of the options that are available in IMPUTE2. Here we return to the ONE PHASED REFERENCE PANEL, ONE UNPHASED REFERENCE PANEL scenario to show how a broader range of options might be used.

./impute2 \
 -m ./Example/example.chr22.map \
 -h ./Example/example.chr22.1kG.haps \
 -l ./Example/example.chr22.1kG.legend \
 -g_ref ./Example/example.chr22.reference.gens \
 -strand_g_ref ./Example/example.chr22.reference.strand \
 -exclude_snps_g_ref ./Example/example.chr22.reference.snp.exclusions \
 -g ./Example/example.chr22.study.gens \
 -strand_g ./Example/example.chr22.study.strand \
 -fix_strand_g \
 -sample_g ./Example/example.study.samples \
 -exclude_samples_g ./Example/example.study.sample.exclusions \
 -int 20.4e6 20.5e6 \
 -Ne 11418 \
 -k 60 \
 -burnin 5 \
 -iter 20 \
 -pgs \
 -no_sample_qc_info \
 -o_gz \
 -o ./Example/example.chr22.complicated.impute2

The command above contains several options that were not used in previous example runs:


Program Options

The following tables describe the command-line options that can be used to control IMPUTE2. Many of these options are similar to options in IMPUTE v1 (and earlier versions) but there are some key differences in how these options are handled by IMPUTE2 -- these are noted in green.

Input data files

This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems). In all of these files, it is important that SNPs appear in base pair position order, from lowest to highest. It is also crucial that all SNP positions come from the same genome build (e.g., NCBI Build 36) so the program can combine information across input files.

Flag Default Description
-g <file> (REQUIRED) none File containing genotypes for a study cohort in which we want to impute untyped SNPs. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO.
-m <file> (REQUIRED) none Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)").
-h <file 1> <file 2> none File of known haplotypes, with one row per SNP and one column per haplotype. In IMPUTE2, it is possible to specify two known haplotypes files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 -- no other values are allowed.
-l <file 1> <file 2> none Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). When using two known haplotypes files with IMPUTE2, you must supply the corresponding legend files in the same order -- i.e., the file with more SNPs comes first.
-g_ref <file> none File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file).
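
Because the program identifies SNPs by base pair position and expects them in increasing order, it can help to screen a file before running an analysis. The following is a minimal sketch for a legend-style file (header line, position in column 2); the helper name check_positions and the toy data are hypothetical.

```shell
# Hypothetical check: report positions that are out of order or duplicated.
# Assumes a legend-style file with a header line and position in column 2.
check_positions() {
  awk 'NR > 1 {
         if (NR > 2 && $2 + 0 < prev) print "out of order:", $2
         if (NR > 2 && $2 + 0 == prev) print "duplicate:", $2
         prev = $2 + 0
       }' "$1"
}

legend=$(mktemp)
printf 'rsID position a0 a1\nrs1 100 A C\nrs2 300 G T\nrs3 300 A G\nrs4 200 C T\n' > "$legend"
check_positions "$legend"
```

An empty result suggests the positions are strictly increasing; any printed line points to a SNP worth inspecting.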


Basic options

These options control some basic processing that the program does to prepare input data for inference.

Flag Default Description
-int <lower> <upper> (REQUIRED) none Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes. IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses.
-buffer <int> 250 kb Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int flag. SNPs in the buffer regions inform the inference but do not appear in output files. Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval.
-Ne <int> 14000 "Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to train its population model. As a starting point, we suggest values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT.

When combining reference panels, we suggest taking the average of the panel-specific Ne values, weighted by the number of chromosomes in each panel; e.g., for a CEU+YRI+CHB+JPT panel in HapMap Phase II data, the Ne would be [(120 * 11418) + (120 * 17469) + (180 * 14269)] / (120 + 120 + 180) = 14369 (rounded to the nearest integer).
-call_thresh <float> 0.9 Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing.

NOTE: This threshold only applies to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions.
-nind <int> # of indiv in -g file Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set -nind 5. This option is useful for debugging and test runs.
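
The weighted average suggested for -Ne when combining panels can be computed with a short helper. This is a sketch only; the function name weighted_ne is hypothetical, and the numbers are the HapMap Phase II values quoted in the -Ne description.

```shell
# Weighted average of panel-specific Ne values.
# Arguments come in pairs: <number of chromosomes> <Ne> ...
weighted_ne() {
  echo "$@" | awk '{
    for (i = 1; i < NF; i += 2) { num += $i * $(i + 1); den += $i }
    printf "%.0f\n", num / den
  }'
}

# HapMap Phase II example from the -Ne description: CEU, YRI, CHB+JPT.
weighted_ne 120 11418 120 17469 180 14269   # prints 14369
```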


Strand alignment options

In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you. If you do not know exactly how your data were processed or what these options are doing, you should try to locate the original strand information or contact us for assistance.

NOTE: After applying the strand alignment options below, the program will discard any SNPs that have conflicting alleles across panels (e.g., A/T in the reference haplotypes and A/C in the study genotypes).

NOTE: We currently assume that all phased input files have already been aligned to the '+' strand of the human genome reference sequence, which is true of the files that we distribute; hence, the options here pertain only to unphased genotype files (i.e., the -g and -g_ref files).

Flag Default Description
-strand_g <file> none File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP, and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space. The ordering of the SNPs in this file does not matter (by contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). Some model strand files are included in the Example/ directory that comes with the software download.

NOTE: This flag replaces the -s flag from versions prior to v2.1.0.
-strand_g_ref <file> none Same as -strand_g, but applies to the -g_ref file.

NOTE: This flag replaces the -s_ref flag from versions prior to v2.1.0.
-fix_strand_g Activates the program's internal strand alignment procedure for the -g file (Panel 2). The strand is aligned to the alleles in Panel 0, if present, otherwise to Panel 1. The strand is aligned deterministically where possible (e.g., flipping A/C in Panel 2 to match G/T in the reference) and by allele frequency otherwise (at A/T and C/G SNPs, whose alignment cannot be resolved by labels alone); in the latter case, the program codes the alleles such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele.

NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to "fix" the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others.

NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The only way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured.

NOTE: This flag replaces the -fix_strand flag from versions prior to v2.1.0.
-fix_strand_g_ref Similar to -fix_strand_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if this panel is not present.

NOTE: Just as -fix_strand_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over the internal strand-fixing procedure.

NOTE: As with -fix_strand_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%.

NOTE: This flag replaces the -fix_strand_ref flag from versions prior to v2.1.0.
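
To make the strand logic concrete, the sketch below complements a pair of alleles and flags the A/T and C/G cases whose strand cannot be resolved from allele labels alone. It illustrates the idea only and is not the program's alignment procedure; the helper names flip_alleles and is_ambiguous are hypothetical.

```shell
# Complement the alleles of a SNP, as when moving from '-' to '+' strand.
flip_alleles() {
  echo "$1" | tr 'ACGT' 'TGCA'
}

# A/T and C/G SNPs look the same after flipping, so labels alone
# cannot resolve their strand.
is_ambiguous() {
  case "$1" in
    A/T|T/A|C/G|G/C) echo yes ;;
    *) echo no ;;
  esac
}

flip_alleles "A/C"     # prints T/G
is_ambiguous "A/T"     # prints yes
is_ambiguous "A/C"     # prints no
```

An A/C SNP flips to T/G, which matches a G/T reference SNP as a set of alleles, whereas an A/T SNP flips to T/A and so gives no information about strand.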


Filtering options

The options in this table affect the way that the program filters the input data (mainly the -g and -g_ref files). Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.

Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref. This convention causes some of the flags in IMPUTE v2.1.0 to have different names than the equivalent flags in earlier versions.

Flag Default Description
-exclude_snps_g <file> none List of SNPs to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g file), their rsIDs (second column of -g file), or their base pair positions (third column of -g file). Excluded SNPs will be treated as if they had not been present in the genotypes file, and they will not be shown in the output unless you use the -impute_excluded option.
-exclude_snps_g_ref <file> none Same as -exclude_snps_g, but applies to the -g_ref file.
-impute_excluded Specifies that SNPs excluded from the study dataset via the -exclude_snps_g option should be imputed and included in the output file. When this flag is not activated, excluded SNPs are simply ignored.
-include_snps <file> none List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the Type 0 and Type 1 SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file). This option does not have any effect on Type 2 and Type 3 SNPs.
-sample_g <file> none File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first three columns are necessary, and only the first two columns are used by IMPUTE2 (i.e., the third column can have dummy values, and subsequent columns do not affect the algorithm).

NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option.
-sample_g_ref <file> none Same as -sample_g, but applies to the -g_ref file.
-exclude_samples_g <file> none List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is REQUIRED if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "[-o]_samples".

NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option.
-exclude_samples_g_ref <file> none Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with -exclude_samples_g.
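
As a rough illustration of how an -exclude_samples_g list relates to the -sample_g file, the Python sketch below checks that every ID to be excluded appears in one of the first two columns of a sample file, and then formats the single-column list. The sample IDs, the function names, and the assumption that the sample file starts with two header lines are all illustrative; none of this is part of IMPUTE2 itself.

```python
def sample_ids(sample_lines, n_header_lines=2):
    """Collect the IDs found in the first two columns of a sample file.

    n_header_lines is an assumption about the sample file layout; adjust it
    to match your own file.
    """
    ids = set()
    for line in sample_lines[n_header_lines:]:
        fields = line.split()
        if len(fields) >= 2:
            ids.update(fields[:2])
    return ids


def exclusion_list(sample_lines, ids_to_exclude):
    """Return the text of a single-column exclusion list.

    Raises ValueError if an ID is absent from the sample file, since such an
    ID could not be matched to any individual.
    """
    known = sample_ids(sample_lines)
    unknown = [i for i in ids_to_exclude if i not in known]
    if unknown:
        raise ValueError("IDs not found in sample file: " + ", ".join(unknown))
    return "".join(i + "\n" for i in ids_to_exclude)


# Hypothetical sample file contents (all IDs are made up).
example_sample_lines = [
    "ID_1 ID_2 missing",
    "0 0 0",
    "fam1 ind1 0",
    "fam2 ind2 0",
    "fam3 ind3 0",
]

# Text that could be saved to a file and passed to -exclude_samples_g.
print(exclusion_list(example_sample_lines, ["ind2"]), end="")
```

Checking the IDs before running the program is a design choice of this sketch: a misspelled ID is caught up front rather than discovered after a long run.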


MCMC options

IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.

Flag Default Description
-iter <int> 30 Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly.
-burnin <int> 10 Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for the observed genotypes during each of the first [-burnin] iterations, but these iterations do not contribute to the final imputation probabilities. We have found that 10 burn-in iterations is enough to ensure good results in a variety of different datasets.
-k <int> 42 Maximum number of copying states to use for diploid phasing updates. Setting this value higher will lead to higher accuracy at the cost of longer running times. The default is a good starting point, but this can be increased if there is more time available to run the program. In our experience, the method reaches maximal accuracy near a -k value of 100, so we suggest that there is little point in using more intensive settings, even for very large datasets.
-k_hap <int> 500 Maximum number of copying states to use for haploid imputation updates. The default setting should be sufficient for most applications.
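
To make the relationship between -iter and -burnin concrete: because -iter counts all iterations, including burn-in, the number of iterations that contribute to the final imputation probabilities is the difference between the two values. The helper below is a hypothetical Python illustration of that arithmetic, and its sanity check is this sketch's own rather than a documented IMPUTE2 behavior.

```python
def contributing_iterations(n_iter=30, n_burnin=10):
    """Number of MCMC iterations left after discarding burn-in.

    The defaults mirror the documented defaults for -iter and -burnin.
    """
    if n_burnin >= n_iter:
        # Sanity check added for this sketch: with no iterations left after
        # burn-in, nothing would contribute to the final probabilities.
        raise ValueError("burn-in must be smaller than the total number of iterations")
    return n_iter - n_burnin


print(contributing_iterations())        # documented defaults: 30 total, 10 burn-in
print(contributing_iterations(50, 10))  # a longer, hypothetical run
```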


Output files

The options in this table control the format and naming conventions of output files printed by IMPUTE2.

Flag Default Description
-o <file> ./test.impute2 Name of main output file. Follows the same format as the -g file.
-i <file> [-o]_info Name of SNP-wise information file with one line per SNP and a single header line at the beginning; versions of IMPUTE prior to v2.1.0 did not print the header. This file always contains the following columns (header tags shown in parentheses):

1. SNP identifier from -g file (snp_id)
2. rsID (rs_id)
3. base pair position (position)
4. expected frequency of allele coded '1' in the -o file (exp_freq_a1)
5. measure of the observed statistical information associated with the allele frequency estimate (info)
6. average certainty of best-guess genotypes (certainty)
7. internal "type" assigned to SNP (type) -- column did not exist prior to v2.1.0

Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX, where X takes values in {0,1,2}. For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes (after applying the -call_thresh) and the best-guess imputed genotypes obtained by masking the input genotypes one SNP at a time and pretending the SNP is of type X; similarly, r2_typeX is the squared correlation between input and imputed genotypes. The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns did not exist prior to v2.1.0. They are useful for post-hoc quality control, as we explain in the section on QC and troubleshooting.
-r <file> [-o]_summary Name of file that records a summary of the screen output.
-w <file> [-o]_warnings Name of file that records warnings generated by IMPUTE2.
-os <int> <int> ... 0 1 2 3 "Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., "-os 0 1 2 3".
-o_gz Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large.
-outdp <int> 3 Specifies the number of decimal places to use for reporting genotype probabilities in the main output file.
-no_snp_qc_info Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in -i file.
-no_sample_qc_info Suppresses printing of per-sample quality control metrics file. The default is to print a file named "[-i]_by_sample".
-phase IMPUTE2 always implicitly phases the study dataset (-g file), and this flag tells the program to print the best-guess haplotypes that result from the phasing process. In addition to the standard imputation output file, the program also prints a separate haplotypes file named "[-o]_haps". This file contains two columns (haplotypes) per individual, in the same order they appear in the main output.

In addition to this "best-guess" haplotypes file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties are written to a file named "[-o]_haps_confidence". This file contains one line per individual and one column per SNP in the phased haplotypes file. Homozygotes are represented by * characters and heterozygotes are represented by numbers between 0.5 and 1.0; each number is the estimated probability that the phasing between the current heterozygote and the previous heterozygote (to the left) is correct. By convention, the leftmost heterozygous SNP in each individual is assigned a phasing certainty of 1.0.

As shown in the examples section, it is possible to use the -phase option to produce haplotypes without the use of a reference panel; i.e., to perform a classical phasing analysis.
-pgs "Predict Genotyped SNPs": Tells the program to replace the input genotypes from the -g file with imputed genotypes at Type 2 SNPs in the output file.
-pgs_miss Unlike -pgs, this option tells the program to replace only missing genotypes with imputed genotypes. That is, any input genotype whose maximum probability exceeds the -call_thresh will simply be reprinted in the output file, whereas input genotypes that fall below the calling threshold will be imputed in the output.

WARNING: This is an appealing option that promises to simply "fill in" sporadically missing genotypes in your input data. However, we think that following this procedure and then testing the SNPs for association could cause subtle problems. We are investigating these issues, but in the meantime we suggest that you only use this option with great caution; using it naively may lead to bad results, and you do so at your own risk.
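
Returning to the -i file described earlier in this table, the Python sketch below shows one way its columns could be read and used to select SNPs by their "info" value. It assumes the file is whitespace-delimited with the header tags listed above; the example rows, the function names, and the cutoff of 0.5 are made up for illustration and are not recommendations.

```python
def parse_info_lines(lines):
    """Parse the lines of an -i file into dicts keyed by the header tags."""
    header = lines[0].split()
    records = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        records.append(dict(zip(header, fields)))
    return records


def snps_passing_info(records, min_info):
    """Return the rsIDs of SNPs whose 'info' value is at least min_info."""
    return [r["rs_id"] for r in records if float(r["info"]) >= min_info]


# Made-up example content using the seven standard columns.
example_info_lines = [
    "snp_id rs_id position exp_freq_a1 info certainty type",
    "snp1 rs0001 1000 0.120 0.950 0.990 0",
    "snp2 rs0002 2000 0.030 0.210 0.870 0",
    "snp3 rs0003 3000 0.450 1.000 1.000 2",
]

records = parse_info_lines(example_info_lines)
print(snps_passing_info(records, min_info=0.5))
```

Keying each record by header tag rather than column position means the same code would still work if the optional info_typeX, concord_typeX, and r2_typeX columns are present.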


Options not intended for general use

The options in this table are meant for experts only. Don't use them unless you know what you are doing!

Flag Default Description
-seed <int> random Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option.
-no_warn Turns warnings off, so that the -w file does not get printed.
-no_fill Turns hole-filling off, so that SNPs included in the -g file but not in the lowest reference panel cannot contribute to the inference.
-no_remove Prevents the program from discarding SNPs whose alleles cannot be aligned across panels.


Analyzing Whole Chromosomes

In principle, it is possible to estimate genotypes across an entire chromosome in a single run of IMPUTE2, but it is better to split a chromosome into smaller chunks for analysis. One important reason for this is that the population-genetic approximation underlying IMPUTE2 is only valid over short genomic distances, which means that you may get poor results if you try to impute a large region in a single run -- consequently, you should only impute regions of 5 Mb or shorter in any given run unless you know exactly what you are doing.

Splitting a chromosome into smaller pieces is often a good computational strategy anyway, since it allows the pieces to be imputed separately on multiple computer processors. This decreases the effective computing time and limits the amount of RAM needed for each run.

The -int option provides an easy way to break a chromosome into smaller chunks for analysis by IMPUTE2. For example, if we wanted to split a chromosome into 5 Mb regions for analysis, we could specify "-int 1 5000000" for the first run of the algorithm, "-int 5000001 10000000" for the second run, and so on. IMPUTE2 uses an internal buffer region of 250 kb on either side of the analysis interval to prevent edge effects; this means that data outside the region bounded by -int will contribute to the inference, but only SNPs inside that region will appear in the output. In this way, you can specify non-overlapping, adjacent intervals and obtain uniformly high-quality imputation. (Note: to change the size of the internal buffer region, use the -buffer option.)
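
The interval arithmetic described above can be sketched in Python as follows. The function name and the 12 Mb chromosome length are hypothetical, chosen only to keep the printed list short; the sketch produces the pairs of numbers to pass to -int and does not run IMPUTE2.

```python
def chunk_intervals(chrom_length_bp, chunk_size_bp=5_000_000):
    """Return adjacent, non-overlapping (start, end) pairs covering a chromosome.

    Each pair can be passed to -int for one run. The last interval is capped
    at the chromosome length supplied by the caller.
    """
    intervals = []
    start = 1
    while start <= chrom_length_bp:
        end = min(start + chunk_size_bp - 1, chrom_length_bp)
        intervals.append((start, end))
        start = end + 1
    return intervals


# Hypothetical 12 Mb region split into chunks of at most 5 Mb.
for start, end in chunk_intervals(12_000_000):
    print(f"-int {start} {end}")
```

Because the buffer region is handled internally by the program, the intervals here are deliberately not padded or overlapped.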

Once you have split a chromosome into multiple chunks and imputed them separately, the IMPUTE2 output format makes it easy to combine your results into a single whole-chromosome file. On Linux-based systems, you can simply type a command like this:

cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2

Here, "chr16_chunkX.impute2" is an output file for a subregion of chromosome 16, and "chr16_chunkAll.impute2" is a combined output file that contains results for the entire chromosome. (Note that chr16 would typically need to be split into more than three subregions to satisfy the approximation used by IMPUTE2.)
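
When there are many chunks, it can help to script the concatenation so that the files are joined in numerical order (a plain alphabetical sort would place chunk10 before chunk2). The Python sketch below is one way to do this, assuming file names that follow the "chr16_chunkX.impute2" pattern used above; the function names are hypothetical.

```python
import os
import re
import shutil


def chunk_number(path):
    """Extract the integer that follows 'chunk' in a file name."""
    match = re.search(r"chunk(\d+)", os.path.basename(path))
    if match is None:
        raise ValueError(f"no chunk number in {path!r}")
    return int(match.group(1))


def concatenate_chunks(chunk_paths, combined_path):
    """Append chunk files to combined_path in increasing chunk order.

    Returns the list of paths in the order they were written.
    """
    ordered = sorted(chunk_paths, key=chunk_number)
    with open(combined_path, "wb") as out:
        for path in ordered:
            with open(path, "rb") as chunk:
                shutil.copyfileobj(chunk, out)
    return ordered


# Ordering check on file names only; no files are read here.
names = ["chr16_chunk10.impute2", "chr16_chunk2.impute2", "chr16_chunk1.impute2"]
print(sorted(names, key=chunk_number))
```

Calling concatenate_chunks(names, "chr16_chunkAll.impute2") on real output files would then produce the same kind of combined file as the cat command.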


QC and Troubleshooting

We will soon be posting suggestions for making sure that IMPUTE2 has run successfully, detecting common problems, and processing the output files prior to association analysis. Stay tuned...


Filling Reference Panel Holes

This is a new function in IMPUTE v2.1.0. We will provide details about the procedure soon.


FAQ

FAQ coming soon.


Registration and Updates

If you would like to receive e-mails about updates to this software, please fill out the registration form.


References

[1] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article]

[2] B. N. Howie, P. Donnelly and J. Marchini (2009) A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics 5(6): e1000529 [Open Access Article]


Contact Information

If you have any questions regarding the use of IMPUTE2, please send an e-mail to both of the following people:

Dr. Bryan Howie (bhowie <at> uchicago <dot> edu).
Dr. Jonathan Marchini (marchini <at> stats <dot> ox <dot> ac <dot> uk).

It is a good idea to include a copy of the screen output (which is printed to the -r file) with your e-mail to help us identify any problems.