IMPUTE version 2 (also known simply as IMPUTE2)
is a genotype imputation and phasing program based on ideas from
Howie et al. (2009).
Please click on the links below to download the software or learn how to use it.
Page last updated Oct 1, 2010. We frequently add new features to the website, so please check back if you are actively using the software. |
IMPUTE2 is a computer program for phasing
observed genotypes and imputing missing genotypes. Most people use
just a couple of the program's basic functions, but we have also built
up a collection of specialized and powerful options. If you are new to
IMPUTE2, or indeed to phasing and imputation in general, we
suggest that you start by learning the basics.
|
IMPUTE v2.1.2 (released Oct 1, 2010) includes the following new features:
|
The following people developed the methodology and software for IMPUTE2:
|
Pre-compiled IMPUTE2 binaries and example files can be downloaded from the
links below. For Linux machines, the dynamic binaries are smaller but may not work
on some machines due to gcc library compatibility issues; if the dynamic version
doesn't work for you, please try the static version. If you have any problems getting
the program to work on your machine or would like to request an executable for a
platform not shown here, please contact us.
|
Platform | File |
Linux (x86_64) Dynamic Executable | impute_v2.1.2_x86_64_dynamic.tgz |
Linux (x86_64) Static Executable | impute_v2.1.2_x86_64_static.tgz |
Linux (x86_64) Static Executable (SuSE 9.3) | impute_v2.1.2_SuSE9.3_x86_64_static.tgz |
Linux (i386) Dynamic Executable | impute_v2.1.2_i386_dynamic.tgz |
Mac OS X Intel | impute_v2.1.2_MacOSX_Intel.tgz |
Solaris 5.10 (AMD Opteron) | impute_v2.1.2_Solaris5.10_Opteron.tgz |
Windows MS-DOS (Intel) | impute_v2.1.2_Windows_Intel.tgz |
To unpack a downloaded archive, use a command of the form: tar -zxvf impute_v2.X.Y_i386.tgz |
The figure below provides a schematic overview of what IMPUTE2 does.
In short, it uses a fine-scale recombination map and a densely genotyped
reference panel to "fill in" missing genotypes in a study dataset, which
might consist of cases and controls typed on a commercial SNP chip.
By estimating the genotypes of SNPs that were not in the original study
data, imputation allows a much larger set of SNPs to be tested for
association. This can increase both the power to detect association signals
and the signal resolution near a causal variant.
|
IMPUTE2 can use customized reference panels (e.g., SNP genotypes from a fine-mapping study)
as well as publicly available reference datasets. In the latter category, we currently recommend using
a combination of reference haplotypes from the 1,000 Genomes Project and HapMap Phase 3. The 1,000
Genomes dataset provides wide coverage of the genome, in that it contains many more SNPs than the
HapMap (with high enrichment for rare mutations), while HapMap 3 provides deep coverage, in
that it contains a greater sampling of chromosomes from human populations. We have designed
IMPUTE2 to integrate these wide and deep panels into a single analysis framework, as shown in
this example.
|
Haplotype, legend, sample, and genetic map files
These downloads contain the data needed to impute genotypes using reference panels from HapMap 3 and the 1,000 Genomes Project. Each dataset includes the latest haplotypes from the 1,000 Genomes panel of interest, along with all available HapMap 3 haplotypes, except those present in the relevant 1,000 Genomes panel. We remove these duplicate haplotypes so that the two datasets can be combined without causing "double counting" of haplotypes during imputation. Both sets of haplotypes have also been filtered to remove SNPs with apparent quality issues. When using these combined panels, you should set the

To see an example command that combines HapMap 3 and 1,000 Genomes haplotypes in a single imputation analysis, go here. To see our rationale for using all HapMap 3 haplotypes together, rather than focusing on population-matched subsets, go here. To learn more about our scheme for filtering out low-quality SNPs, go here.

If you prefer unfiltered 1,000 Genomes haplotypes, you can download them from here; similarly, you can download unfiltered HapMap 3 haplotypes from here. |
Download packages (warning: large files)
[CEU] [YRI] [CHB+JPT (coming soon)] |
NOTE: When combining datasets in an imputation analysis, you should always take great care
to ensure that they have been aligned to the same strand convention. In this case, we have already
aligned the HapMap 3 and 1,000 Genomes data to the '+' strand of the human reference sequence, and
we have removed SNPs with unresolvable strand flips between panels. Consequently, you just need to
make sure that your dataset is correctly aligned before imputing from
the combined panel.
While we prefer the reference panels linked above, we recognize that some people may want to download the original, unfiltered HapMap 3 and 1,000 Genomes datasets. These can be obtained below: |
1,000 Genomes haplotypes (unfiltered) -- NCBI Build 36
--1,000 Genomes files are from Pilot 1 genotypes released Mar 2010; phased haplotypes released Jun 2010 |
Haplotype, legend, sample, and genetic map files
These downloads contain the data needed to impute genotypes using reference panels from the 1,000 Genomes Project. The files are unfiltered, in the sense that we have not modified them from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs. When using one of these panels, you should set the |
Download packages (warning: large files)
[CEU] [YRI] [CHB+JPT (coming soon)] |
HapMap 3 haplotypes (unfiltered) -- NCBI Build 36
--HapMap 3 files are from release #2 (Feb 2009) |
Haplotype, legend, sample, and genetic map files
These downloads contain the data needed to impute genotypes using reference panels from HapMap Phase 3. The files are unfiltered, in the sense that we have only modified them minimally from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs. In HapMap 3, the most common problem is that an allele will "drop out" of the genotyping assay, thereby making every individual appear homozygous for the same allele. When using this combined panel, you should set the |
Download packages (warning: large files)
[ALL PANELS] |
You can also download HapMap Phase 2 haplotypes in the format used by IMPUTE2;
to access them, please click
here.
We are continually working to distribute the most up-to-date and comprehensive reference datasets available. We will post them here in IMPUTE2 format as we process them. |
This section provides some example runs that illustrate typical applications of
IMPUTE2. All of the data files used in these command-line calls are
included in the Example/ directory that comes with the software download.
You should run the commands from the main download directory (i.e., the one
that contains the impute2 executable).
|
./impute2 \ |
./impute2 \ |
./impute2 \ |
./impute2 \ |
./impute2 \ |
./impute2 \ |
./impute2 \ |
The following tables describe the command-line options that can be used to control
IMPUTE2. Many of these options are similar to options in IMPUTE v1
(and earlier versions) but there are some key differences in how these options are
handled by IMPUTE2 -- these are noted in green.
|
Input data files
This table explains the formatting requirements for input data files that can be
supplied to IMPUTE2. Some of these files allow more than one ID per SNP,
but the program identifies SNPs internally by their base pair positions (which means
that duplicate SNPs at a single position can cause problems).
In all of these files, it is important that SNPs appear in
base pair position order, from lowest to highest. It is also crucial that all SNP
positions come from the same genome build (e.g., NCBI Build 36) so the program can
combine information across input files.
|
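The ordering requirement above is easy to check before a run. The following Python sketch is not part of IMPUTE2; it assumes whitespace-delimited input and takes the position column index as an argument (for example, the second column of a legend file, which has a header line). The function names are hypothetical.

```python
def positions_from_lines(lines, column, has_header):
    """Extract integer base pair positions from whitespace-delimited lines."""
    rows = [ln.split() for ln in lines if ln.strip()]
    if has_header:
        rows = rows[1:]
    return [int(row[column]) for row in rows]


def check_positions(positions):
    """Return (is_sorted, duplicates) for a list of base pair positions."""
    is_sorted = all(a <= b for a, b in zip(positions, positions[1:]))
    seen, duplicates = set(), []
    for pos in positions:
        if pos in seen and pos not in duplicates:
            duplicates.append(pos)
        seen.add(pos)
    return is_sorted, duplicates


# Toy legend file (made-up SNPs): one duplicate position, one out of order.
legend = [
    "rsID position a0 a1",
    "rs1 100 A G",
    "rs2 250 C T",
    "rs3 250 C G",
    "rs4 200 A T",
]
positions = positions_from_lines(legend, column=1, has_header=True)
print(check_positions(positions))  # (False, [250])
```

A file that fails either check should be sorted or de-duplicated before it is passed to the program.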
Flag | Default | Description |
-g <file>
REQUIRED unless |
none | File containing genotypes for a study cohort that we want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. |
-known_haps_g <file> | none |
File containing known haplotypes for the study cohort. The
format is the same as the output format from IMPUTE2's
If your study dataset is fully phased, you can replace the The |
-m <file> (REQUIRED) | none | Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). |
-h <file 1> <file 2> | none | File of known haplotypes, with one row per SNP and one column per haplotype. In IMPUTE2, it is possible to specify two known haplotypes files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 -- no other values are allowed. |
-l <file 1> <file 2> | none | Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). When using two known haplotypes files with IMPUTE2, you must supply the corresponding legend files in the same order -- i.e., the file with more SNPs comes first. |
-g_ref <file> | none | File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file). |
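As an illustration of how the three columns of the -m file relate to each other, the sketch below parses a map and recomputes the genetic map column from the positions and rates. It is not IMPUTE2 code. It assumes that the third column is cumulative from the first row and that each rate applies to the interval up to the next position, which is our reading of the column descriptions above. The function names and toy values are hypothetical.

```python
def parse_map(lines):
    """Parse a three-column recombination map with a single header line."""
    header, *body = [ln.split() for ln in lines if ln.strip()]
    if len(header) != 3:
        raise ValueError("expected a three-column header line")
    rows = []
    for fields in body:
        if len(fields) != 3:
            raise ValueError("expected three columns: %r" % (fields,))
        rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
    return rows


def cumulative_cm(rows):
    """Recompute genetic map positions (cM) from positions (bp) and rates (cM/Mb)."""
    total = rows[0][2]  # start from the first row's stated map position
    out = [total]
    for (pos, rate, _), (next_pos, _, _) in zip(rows, rows[1:]):
        total += rate * (next_pos - pos) / 1e6  # cM/Mb times Mb
        out.append(total)
    return out


# Toy map with made-up values.
toy_map = [
    "position COMBINED_rate(cM/Mb) Genetic_Map(cM)",
    "1000000 2.0 0.0",
    "1500000 1.0 1.0",
    "2500000 0.5 2.0",
]
rows = parse_map(toy_map)
print(cumulative_cm(rows))  # [0.0, 1.0, 2.0]
```

Comparing the recomputed values with the third column is a quick way to spot a map whose columns have been shifted or whose rows are out of order.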
These options control some basic processing that the program does to
prepare input data for inference.
|
Flag | Default | Description |
-int <lower> <upper> (REQUIRED) | none | Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g.,
|
-buffer <int> | 250 kb | Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int flag. SNPs in the buffer regions inform the inference but do not appear in output files. Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. |
-allow_large_regions | Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here. | |
-Ne <int> | 14000 | "Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to train its population model. As a starting point, we suggest values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT.
When combining reference panels (e.g., a CEU+YRI+CHB+JPT panel in HapMap Phase II data), we suggest taking the average of the panel-specific Ne values, weighted by the number of chromosomes in each panel. For larger and more complicated reference panels where this calculation would become tedious (e.g., datasets that include all HapMap 3 panels), setting Ne to 15000 should be fine. In our experience, imputation accuracy is quite insensitive to the exact Ne value when using a large reference panel with diverse ancestry. |
-call_thresh <float> | 0.9 | Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing.
NOTE: This threshold only applies to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions. |
-nind <int> |
# of indiv in |
Number of individuals from the
-g file
to include in the analysis. For example, to impute only the first five individuals, set
|
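Two of the options in this table involve simple arithmetic that can be sketched in a few lines of Python. The first function computes a chromosome-weighted average of panel-specific Ne values, as suggested for the -Ne option. The second mimics the rule described for -call_thresh. Neither is IMPUTE2's own code. The chromosome counts in the example (120, 120, and 180) are our assumption about the HapMap Phase II panels, and the strict "greater than" comparison is our reading of "exceeds the threshold".

```python
def weighted_ne(panels):
    """Average Ne over panels, weighted by chromosomes per panel.

    panels: list of (Ne, number_of_chromosomes) pairs.
    """
    total_chromosomes = sum(n for _, n in panels)
    return sum(ne * n for ne, n in panels) / total_chromosomes


def call_genotype(probs, call_thresh=0.9):
    """Return the index (0, 1, or 2) of the called genotype, or None if missing.

    probs: the three genotype probabilities for one individual at one SNP.
    """
    best = max(range(3), key=lambda i: probs[i])
    return best if probs[best] > call_thresh else None


# Suggested Ne values from the table; chromosome counts are assumed.
panels = [(11418, 120), (17469, 120), (14269, 180)]  # CEU, YRI, CHB+JPT
print(round(weighted_ne(panels)))  # 14369

print(call_genotype((0.02, 0.95, 0.03)))  # 1
print(call_genotype((0.50, 0.40, 0.10)))  # None
```

With different panel sizes, only the second element of each pair needs to change.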
In any imputation analysis, it is absolutely essential
that all panels have their allele codings aligned to a fixed reference (usually the
human genome reference sequence).
The options in this table are meant to help align the allele codings in your input data
files, but you should not assume that the program will do all the work for you.
If you do not know exactly how your data were processed or what these
options are doing, you should try to locate the original strand information or
contact us for assistance.
|
Flag | Default | Description |
-strand_g <file> | none | File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP, and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space. The ordering of the SNPs in this file does not matter (by contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). Some model strand files are included in the Example/ directory that comes with the software download.
NOTE: This flag replaces the -s flag from versions prior to v2.1.0. |
-strand_g_ref <file> | none | Same as -strand_g, but applies to the -g_ref file.
NOTE: This flag replaces the -s_ref flag from versions prior to v2.1.0. |
-fix_strand_g | Activates the program's internal strand alignment procedure for the -g file (Panel 2). The strand is aligned to the alleles in Panel 0, if present, otherwise to Panel 1. The strand is aligned deterministically where possible (e.g., flipping A/C in Panel 2 to match G/T in the reference) and by allele frequency otherwise (at A/T and C/G SNPs, whose alignment cannot be resolved by labels alone); in the latter case, the program codes the alleles such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele.
NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to "fix" the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others.
NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The only way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured.
NOTE: This flag replaces the -fix_strand flag from versions prior to v2.1.0. |
|
-fix_strand_g_ref | Similar to -fix_strand_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if this panel is not present.
NOTE: Just as -fix_strand_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over the internal strand-fixing procedure.
NOTE: As with -fix_strand_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%.
NOTE: This flag replaces the -fix_strand_ref flag from versions prior to v2.1.0. |
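To make the distinction between deterministic and ambiguous strand alignment concrete, here is a minimal Python sketch that classifies one SNP by comparing its allele labels in a study panel and a reference panel. It only illustrates the label logic described in this table; it is not the procedure implemented inside IMPUTE2, and the function name and return labels are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_strand(study, reference):
    """Classify a SNP by allele labels alone.

    study, reference: (allele0, allele1) tuples with values in {A, C, G, T}.
    Returns "same", "flip", "ambiguous", or "incompatible".
    """
    study_set = set(study)
    ref_set = set(reference)
    flipped = {COMPLEMENT[a] for a in study}
    if ref_set != study_set and ref_set != flipped:
        return "incompatible"  # alleles cannot be matched on either strand
    if study_set == flipped:
        return "ambiguous"     # A/T or C/G: labels look the same on both strands
    return "same" if study_set == ref_set else "flip"


print(classify_strand(("A", "C"), ("G", "T")))  # flip
print(classify_strand(("A", "G"), ("A", "G")))  # same
print(classify_strand(("A", "T"), ("A", "T")))  # ambiguous
print(classify_strand(("A", "C"), ("A", "G")))  # incompatible
```

SNPs in the "ambiguous" class are the ones for which allele frequencies or the original assay information are needed, as the notes above explain.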
The options in this table affect the way that the program filters the input data
(mainly the -g and -g_ref files).
Some of the options provide direct control over which samples and SNPs get included
in the analysis, while others set rules for how the program should behave when faced
with certain filtering choices. These options are designed to make filtering more
flexible, so that it is easy to apply any desired set of filters to a single
underlying genotype file.
|
Flag | Default | Description |
-exclude_snps_g <file> | none |
List of SNPs to exclude from the
|
|
none | Same as -exclude_snps_g, but applies to the -g_ref file. |
-impute_excluded |
Specifies that SNPs excluded from the study dataset via the
|
|
-include_snps <file> | none | List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the Type 0 and Type 1 SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file). This option does not have any effect on Type 2 and Type 3 SNPs. |
-sample_g <file> | none | File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first three columns are necessary, and only the first two columns are used by IMPUTE2 (i.e., the third column can have dummy values, and subsequent columns do not affect the algorithm).
NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option. |
-sample_g_ref <file> | none | Same as -sample_g, but applies to the -g_ref file. |
-exclude_samples_g <file> | none |
List of samples to exclude from the
-g file.
The list should take the form of a single column of identifiers in a text file. The samples can be identified
by the IDs in either of the first two columns of the
-sample_g file,
which is REQUIRED
if you want to use this option. Excluded samples will be treated as if they had not been present in the
genotypes file, and the program will re-print the original sample list, minus the excluded samples,
to a file named
" NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option. |
-exclude_samples_g_ref <file> | none |
Same as
-exclude_samples_g,
but applies to the
-g_ref file.
One difference is that the program will not print a filtered list of
-g_ref
samples like the one that gets printed with
|
IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase
reconstructions for observed genotypes. The options in this table control the algorithm.
|
The options in this table control the format and naming conventions of
output files printed by IMPUTE2.
|
Flag | Default | Description |
-o <file> | ./test.impute2 | Name of main output file. Follows the same format as the -g file. |
-i <file> | [-o]_info | Name of SNP-wise information file with one line per SNP and a single header line at the beginning; versions of IMPUTE prior to v2.1.0 did not print the header. This file always contains the following columns (header tags shown in parentheses):
1. SNP identifier from -g file (snp_id)
2. rsID (rs_id)
3. base pair position (position)
4. expected frequency of allele coded '1' in the -o file (exp_freq_a1)
5. measure of the observed statistical information associated with the allele frequency estimate (info)
6. average certainty of best-guess genotypes (certainty)
7. internal "type" assigned to SNP (type) -- column did not exist prior to v2.1.0
Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX, where X takes values in {0,1,2}. For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes (after applying the |
-r <file> | [-o]_summary | Name of file that records a summary of the screen output. |
-w <file> | [-o]_warnings | Name of file that records warnings generated by IMPUTE2. |
-os <int> <int> ... | 0 1 2 3 |
"Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling
is discussed in the Overview). By default, all imputed and genotyped
SNPs are included in the output, i.e.,
" |
-o_gz | Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large. | |
-outdp <int> | 3 | Specifies the number of decimal places to use for reporting genotype probabilities in the main output file. |
-no_snp_qc_info | Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in -i file. | |
-no_sample_qc_info |
Suppresses printing of per-sample quality control metrics file. The default is to print a file named
" |
|
-phase |
IMPUTE2 always implicitly phases the study dataset
( In addition to this "best-guess" haplotypes file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties occur in a file named " As shown in the examples section, it is possible to use the |
|
-pgs |
"Predict Genotyped SNPs": Tells the program to replace the input genotypes from the
|
|
-pgs_miss |
Unlike
WARNING: This is an appealing option that promises to simply "fill in" sporadically missing genotypes in your input data. However, we think that following this procedure and then testing the SNPs for association could cause subtle problems. We are investigating these issues, but in the meantime we suggest that you only use this option with caution; using it naively may lead to bad results, and you do so at your own risk. |
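As an example of working with the SNP-wise information file (the -i file described in the table above), the Python sketch below reads the header line to locate columns by their tags and lists the SNPs whose info value meets a chosen cutoff. It assumes a whitespace-delimited file with the v2.1.0-style header. The function name is hypothetical, and the cutoff of 0.4 is an arbitrary illustration rather than a recommendation from this page.

```python
def filter_by_info(lines, min_info):
    """Return rs_id values for SNPs whose 'info' value is at least min_info."""
    header = lines[0].split()
    id_col = header.index("rs_id")
    info_col = header.index("info")
    kept = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if float(fields[info_col]) >= min_info:
            kept.append(fields[id_col])
    return kept


# Toy info file with made-up values.
info_lines = [
    "snp_id rs_id position exp_freq_a1 info certainty type",
    "SNP1 rs1 1000 0.12 0.95 0.99 0",
    "SNP2 rs2 2000 0.45 0.30 0.70 0",
    "SNP3 rs3 3000 0.08 1.00 1.00 2",
]
print(filter_by_info(info_lines, 0.4))  # ['rs1', 'rs3']
```

Looking columns up by header tag rather than by fixed index keeps the sketch working when the optional info_typeX, concord_typeX, and r2_typeX columns are present.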
The options in this table are meant for experts only. Don't use them unless you
know what you are doing!
|
Flag | Default | Description |
-seed <int> | random | Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option. |
-no_warn | Turns warnings off, so that the -w file does not get printed. | |
-no_fill | Turns hole-filling off, so that SNPs included in the -g file but not in the lowest reference panel cannot contribute to the inference. | |
-no_remove | Prevents the program from discarding SNPs whose alleles cannot be aligned across panels. |
In principle, it is possible to impute genotypes across an entire chromosome in a single run of
IMPUTE2, but it is better to split a chromosome into smaller chunks for analysis.
One important reason for this is that the population-genetic
approximation used by IMPUTE2
works best over short genomic distances. The approximation works by modeling local
genealogies, and the superior accuracy afforded by this model
may diminish if there is too much recombination in the
region.
|
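A minimal Python sketch of this chunking strategy is given below. It splits a region into consecutive, non-overlapping intervals suitable for the -int flag and checks each one against the 7 Mb limit mentioned under -allow_large_regions. The 5 Mb chunk size and the function names are our own illustrative choices, and the limit check assumes that the buffer is added on both sides of the interval.

```python
def chunk_intervals(start, end, chunk_size):
    """Split [start, end] (in bp) into consecutive (lower, upper) intervals."""
    chunks = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        chunks.append((lower, upper))
        lower = upper + 1
    return chunks


def exceeds_limit(lower, upper, buffer_kb=250, limit_mb=7):
    """True if the interval plus a buffer on each side is larger than the limit."""
    span = (upper - lower + 1) + 2 * buffer_kb * 1000
    return span > limit_mb * 1_000_000


for lower, upper in chunk_intervals(1, 12_000_000, 5_000_000):
    print(lower, upper, exceeds_limit(lower, upper))
# 1 5000000 False
# 5000001 10000000 False
# 10000001 12000000 False
```

Each (lower, upper) pair would be passed to a separate run via -int, with the buffer regions supplying the overlap between neighboring chunks.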
cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2 |
The proliferation of cheap DNA sequencing technologies has greatly
increased the rate at which reference panels for genotype imputation
can be generated. In this context, many GWAS investigators would like
to re-impute their datasets as larger reference datasets become
available. Imputation is still relatively computer-intensive when
performed genome-wide, so we have been working on ways to speed up the
inference in this context.
|
We will soon be posting suggestions for making sure that IMPUTE2 has run successfully,
detecting common problems, and processing the output files prior to association analysis.
Stay tuned...
|
|
Q: What |
A: You can find a complete answer here. The quick answer is
|
Q: Why haven't you responded to my e-mail? |
A: We go out of our way to
respond promptly to queries about IMPUTE2. If you wrote
to us and haven't heard back yet, the most likely reason is
that we are too busy to reply immediately. There are a few
things that you can do to improve the chances of receiving a
fast response:
|
If you would like to receive e-mails about updates to this software,
please fill out the
registration form.
|
[1]
J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007)
A new multipoint method for genome-wide association studies via imputation of genotypes.
Nature Genetics 39: 906-913
|
If you have any
questions regarding the use of IMPUTE2, please send an e-mail to both of the
following people:
|