IMPUTE version 2 (also known simply as IMPUTE2) is a genotype imputation and phasing program based on ideas from Howie et al. (2009). Please click on the links below to download the software or learn how to use it.
Page last updated Dec 8, 2010. We frequently add new features to the website, so please check back if you are actively using the software. |
IMPUTE2 is a computer program for phasing observed genotypes and imputing missing genotypes. Most people use just a couple of the program's basic functions, but we have also built up a collection of specialized and powerful options. If you are new to IMPUTE2, or indeed to phasing and imputation in general, we suggest that you start by learning the basics. |
IMPUTE v2.1.2 (released Oct 1, 2010) includes the following new features: |
The following people developed the methodology and software for IMPUTE2: |
IMPUTE v2 is available free of charge for academic use only. Please see the LICENCE, which is also included with the package. |
Platform | File |
Linux (x86_64) Dynamic Executable | impute_v2.1.2_x86_64_dynamic.tgz |
Linux (x86_64) Static Executable | impute_v2.1.2_x86_64_static.tgz |
Linux (x86_64) Static Executable (SuSE 9.3) | impute_v2.1.2_SuSE9.3_x86_64_static.tgz |
Linux (i386) Dynamic Executable | impute_v2.1.2_i386_dynamic.tgz |
Mac OS X Intel | impute_v2.1.2_MacOSX_Intel.tgz |
Mac OS X PowerPC | impute_v2.1.2_MacOSX_PowerPC_dynamic.tgz |
Solaris 5.10 (AMD Opteron) | impute_v2.1.2_Solaris5.10_Opteron.tgz |
Windows MS-DOS (Intel) | impute_v2.1.2_Windows_Intel.tgz |
tar -zxvf impute_v2.X.Y_i386.tgz |
The figure below provides a schematic overview of what IMPUTE2
does. In short, it uses a fine-scale recombination map and a densely
genotyped reference panel to "fill in" missing genotypes in a study
dataset, which might consist of cases and controls typed on a
commercial SNP chip. By estimating the genotypes of SNPs that were not
in the original study data, imputation allows a much larger set of SNPs
to be tested for association. This can increase both the power to
detect association signals and the signal resolution near a causal
variant. |
IMPUTE2 can use customized reference panels (e.g., SNP genotypes from a fine-mapping study) as well as publicly available reference datasets. There are a variety of reference panels to choose from: |
Reference Set | NCBI Genome build | Description |
1000 Genomes August haplotypes | Build 37 | These consist of three panels of haplotypes denoted EUR (European haplotypes), AFR (African haplotypes) and ASN (Asian haplotypes). There are 566 EUR haplotypes at 11,572,677 SNPs, 348 AFR haplotypes at 16,514,846 SNPs, and 388 ASN haplotypes at 10,524,588 SNPs. We have also supplied recombination maps in Build 37 coordinates. |
Combined 1000 Genomes low-coverage pilot haplotypes + HapMap3 haplotypes | Build 36 | We combined the 1000 Genomes low-coverage haplotypes with HapMap3. This provides a panel that is both dense (i.e., lots of SNPs in the 1000 Genomes panel) and deep (i.e., lots of haplotypes at SNPs in HapMap3). IMPUTE v2 can handle hierarchical reference panels to get the advantages of both at the same time. |
1000 Genomes low-coverage pilot haplotypes | Build 36 | The 1000 Genomes haplotypes from the low-coverage pilot, released in June 2010. We have only made the CEU and YRI haplotypes available. This dataset has been superseded by the 1000 Genomes August haplotypes. |
HapMap3 haplotypes | Build 36 | HapMap3 haplotypes |
HapMap2 haplotypes | Build 36 & 35 | These haplotype sets can be found on the IMPUTE v1 webpage. |
Haplotypes + legend files | Recombination maps (Build 37) |
EUR.1000Genomes.Dec2010.haplotypes.tgz [547 Mb], AFR.1000Genomes.Dec2010.haplotypes.tgz [528 Mb], ASN.1000Genomes.Dec2010.haplotypes.tgz [341 Mb] | genetic_maps_b37.tgz |
Haplotype, legend, sample, and genetic map files
These downloads contain the data needed to impute genotypes using reference panels from HapMap 3 and the 1,000 Genomes Project. Each dataset includes the latest haplotypes from the 1,000 Genomes panel of interest, along with all available HapMap 3 haplotypes, except those present in the relevant 1,000 Genomes panel. We remove these duplicate haplotypes so that the two datasets can be combined without causing "double counting" of haplotypes during imputation. Both sets of haplotypes have also been filtered to remove SNPs with apparent quality issues. When using these combined panels, you should set the
To see an example command that combines HapMap 3 and 1,000 Genomes haplotypes in a single imputation analysis, go here. To see our rationale for using all HapMap 3 haplotypes together, rather than focusing on population-matched subsets, go here. To learn more about our scheme for filtering out low-quality SNPs, go here. If you prefer unfiltered 1,000 Genomes haplotypes, you can download them from here; similarly, you can download unfiltered HapMap 3 haplotypes from here. |
Download packages
(warning: large files) [CEU] [YRI] [CHB+JPT (coming soon)] |
NOTE: When combining datasets in an imputation analysis, you should always take great care to ensure that they have been aligned to the same strand convention. In this case, we have already aligned the HapMap 3 and 1,000 Genomes data to the '+' strand of the human reference sequence, and we have removed SNPs with unresolvable strand flips between panels. Consequently, you just need to make sure that your dataset is correctly aligned before imputing from the combined panel.
While we prefer the reference panels linked above, we recognize that some people may want to download the original, unfiltered HapMap 3 and 1,000 Genomes datasets. These can be obtained below: |
1,000 Genomes haplotypes (unfiltered) -- NCBI Build 36 --1,000 Genomes files are from Pilot 1 genotypes released Mar 2010; phased haplotypes released Jun 2010 |
Haplotype, legend, sample, and genetic map files
These downloads contain the data needed to impute genotypes using reference panels from the 1,000 Genomes Project. The files are unfiltered, in the sense that we have not modified them from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs. When using one of these panels, you should set the |
Download packages
(warning: large files) [CEU] [YRI] [CHB+JPT (coming soon)] |
HapMap 3 haplotypes (unfiltered) -- NCBI Build 36 --HapMap 3 files are from release #2 (Feb 2009) |
Haplotype, legend, sample, and genetic map files
These downloads contain the data needed to impute genotypes using reference panels from HapMap Phase 3. The files are unfiltered, in the sense that we have only modified them minimally from the official release versions. These haplotypes are generally of high quality, but they may contain a small fraction of poorly genotyped SNPs. In HapMap 3, the most common problem is that an allele will "drop out" of the genotyping assay, thereby making every individual appear homozygous for the same allele. When using this combined panel, you should set the |
Download packages
(warning: large files) [ALL PANELS] |
You can also download HapMap Phase 2 haplotypes in the format used by IMPUTE2; to access them, please click here.
We are continually working to distribute the most up-to-date and comprehensive reference datasets available. We will post them here in IMPUTE2 format as we process them. |
This section provides some example runs that illustrate typical applications of IMPUTE2. All of the data files used in these command-line calls are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the impute2 executable). |
./impute2
\ |
./impute2
\ |
./impute2
\ |
./impute2
\ |
./impute2
\ |
./impute2
\ |
./impute2
\ |
The following tables describe the command-line options that can be used to control IMPUTE2. Many of these options are similar to options in IMPUTE v1 (and earlier versions), but there are some key differences in how these options are handled by IMPUTE2 -- these are noted in green. |
Input data files
This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems). In all of these files, it is important that SNPs appear in base pair position order, from lowest to highest. It is also crucial that all SNP positions come from the same genome build (e.g., NCBI Build 36) so the program can combine information across input files. |
Flag | Default | Description |
-g <file> REQUIRED unless | none | File containing genotypes for a study cohort that we want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO. |
-known_haps_g <file> | none | File containing known
haplotypes for the study cohort. The format is the same as the output
format from IMPUTE2's If your study dataset is fully phased, you can replace the The |
-m <file> REQUIRED | none | Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)"). |
-h <file 1> <file 2> | none | File of known haplotypes, with one row per SNP and one column per haplotype. In IMPUTE2, it is possible to specify two known haplotypes files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 -- no other values are allowed. |
-l <file 1> <file 2> | none | Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). When using two known haplotypes files with IMPUTE2, you must supply the corresponding legend files in the same order -- i.e., the file with more SNPs comes first. |
-g_ref <file> | none | File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file). |
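The formatting rules above for legend files (a header line, four columns, alleles in {A,C,G,T}, positions in ascending order with no duplicates) can be checked before a run. The following is a minimal sketch in Python, not part of IMPUTE2; the function name check_legend and the wording of the messages are hypothetical, and the sketch assumes whitespace-separated columns.

```python
VALID_ALLELES = {"A", "C", "G", "T"}

def check_legend(lines):
    """Return a list of problems found in legend-file lines (header first).

    Hypothetical helper: checks the four-column layout, the allele
    alphabet, and ascending, non-duplicated base pair positions.
    """
    problems = []
    rows = [ln.split() for ln in lines if ln.strip()]
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    if len(header) != 4:
        problems.append("header should have four column names")
    prev_pos = None
    for n, fields in enumerate(body, start=2):  # line 1 is the header
        if len(fields) != 4:
            problems.append(f"line {n}: expected 4 columns, found {len(fields)}")
            continue
        rs_id, pos_text, a0, a1 = fields
        if not pos_text.isdigit():
            problems.append(f"line {n}: position is not an integer")
            continue
        pos = int(pos_text)
        if a0 not in VALID_ALLELES or a1 not in VALID_ALLELES:
            problems.append(f"line {n}: alleles must be in A,C,G,T")
        if prev_pos is not None:
            if pos == prev_pos:
                problems.append(f"line {n}: duplicate position {pos}")
            elif pos < prev_pos:
                problems.append(f"line {n}: positions not in ascending order")
        prev_pos = pos
    return problems

if __name__ == "__main__":
    example = ["rsID position a0 a1", "rs1 100 A G", "rs2 200 C T"]
    print(check_legend(example))  # an empty list means no problems were found
```

To check a real file, the lines can be read with open(path) and passed to the function unchanged.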
These options control some basic processing that the program does to prepare input data for inference. |
Flag | Default | Description |
-int <lower> <upper> REQUIRED | none | Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., |
-buffer <int> | 250 kb | Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int flag. SNPs in the buffer regions inform the inference but do not appear in output files. Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. |
-allow_large_regions | | Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here. |
-Ne <int> | 14000 | "Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to train its population model. As a starting point, we suggest values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT.
When combining reference panels, we suggest taking the average of the panel-specific Ne values, weighted by the number of chromosomes in each panel; e.g., for a CEU+YRI+CHB+JPT panel in HapMap Phase II data, the Ne would be
For larger and more complicated reference panels where this calculation would become tedious (e.g., datasets that include all HapMap 3 panels), setting Ne to 15000 should be fine. In our experience, imputation accuracy is quite insensitive to the exact Ne value when using a large reference panel with diverse ancestry. |
-call_thresh <float> | 0.9 | Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing.
NOTE: This threshold only applies to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions. |
-nind <int> | # of indiv in | Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set |
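The calling rule described for -call_thresh can be illustrated with a short Python sketch. This is not IMPUTE2's own code: the function name call_genotype is hypothetical, the three probabilities are assumed to be ordered as the two homozygotes around the heterozygote, and the strict "greater than" comparison is an assumption based on the word "exceeds" in the description.

```python
def call_genotype(probs, threshold=0.9):
    """Return the index (0, 1 or 2) of the called genotype, or None if missing.

    probs is a sequence of three genotype probabilities for one individual
    at one SNP. The most probable genotype is called only if its probability
    is strictly greater than the threshold (an assumption, see the lead-in).
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] > threshold else None

if __name__ == "__main__":
    for triple in [(0.95, 0.05, 0.0), (0.6, 0.4, 0.0), (0.0, 0.02, 0.98)]:
        print(triple, "->", call_genotype(triple))
```

A returned None corresponds to a genotype that would be treated as missing in the input data.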
In any imputation analysis, it is absolutely essential that all panels have their allele codings aligned to a fixed reference (usually the human genome reference sequence). The options in this table are meant to help align the allele codings in your input data files, but you should not assume that the program will do all the work for you. If you do not know exactly how your data were processed or what these options are doing, you should try to locate the original strand information or contact us for assistance. |
Flag | Default | Description |
-strand_g <file> | none | File showing the strand orientation of the SNP allele codings in the -g file, relative to a fixed reference point. Each SNP occupies one line, and the file should have two columns: (i) the base pair position of the SNP, and (ii) the strand orientation ('+' or '-') of the alleles in the genotype file; the columns should be separated by a single space. The ordering of the SNPs in this file does not matter (in contrast to the -g file, which must be sorted by SNP position), and it is okay if some SNPs in the strand file are not present in the genotype file (e.g., due to filtering). Some model strand files are included in the Example/ directory that comes with the software download.
NOTE: This flag replaces the -s flag from versions prior to v2.1.0. |
-strand_g_ref <file> | none | Same as -strand_g, but applies to the -g_ref file.
NOTE: This flag replaces the -s_ref flag from versions prior to v2.1.0. |
-fix_strand_g | | Activates the program's internal strand alignment procedure for the -g file (Panel 2). The strand is aligned to the alleles in Panel 0, if present, otherwise to Panel 1. The strand is aligned deterministically where possible (e.g., flipping A/C in Panel 2 to match G/T in the reference) and by allele frequency otherwise (at A/T and C/G SNPs, whose alignment cannot be resolved by labels alone); in the latter case, the program codes the alleles such that Panel 2 and the alignment reference (Panel 0 or 1) have the same minor allele.
NOTE: This flag can be used in conjunction with the -strand_g option. In that case, the information from the strand file takes precedence, i.e., the program will not try to "fix" the strand of SNPs that have explicit strand info already. This is useful if you have strand information for some SNPs but not others.
NOTE: You should take care when using this option. In particular, it can get the alignment wrong at A/T and C/G SNPs with minor allele frequencies near 50%, which can hurt the inference by distorting the local haplotype patterns. The only way to get the correct alignment at these kinds of SNPs is to track down the original assay and determine which strand was measured.
NOTE: This flag replaces the -fix_strand flag from versions prior to v2.1.0. |
-fix_strand_g_ref | | Similar to -fix_strand_g, but applies to the -g_ref file (Panel 1). In this case the strand is aligned to the alleles in Panel 0, so the flag does not work if this panel is not present.
NOTE: Just as -fix_strand_g can be used in conjunction with -strand_g, this flag can be used in conjunction with the -strand_g_ref option. As before, the strand file takes precedence over the internal strand-fixing procedure.
NOTE: As with -fix_strand_g, you should be careful about using this option to align A/T and C/G SNPs with minor allele frequencies near 50%.
NOTE: This flag replaces the -fix_strand_ref flag from versions prior to v2.1.0. |
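The distinction drawn above between SNPs that can be aligned from their allele labels and A/T or C/G SNPs that cannot is easy to see in code. The Python sketch below is only an illustration of that distinction and is not IMPUTE2's internal procedure; the function name classify_strand and its return labels are hypothetical, the order of the two alleles within each pair is ignored, and the frequency-based step for ambiguous SNPs is deliberately left out.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def classify_strand(study, reference):
    """Classify a study SNP against a reference SNP by allele labels only.

    study and reference are (allele0, allele1) pairs. Returns one of:
    'ambiguous' (A/T or C/G, labels cannot resolve the strand),
    'same'      (alleles already match the reference),
    'flip'      (alleles match after taking the complement),
    'mismatch'  (alleles cannot be reconciled by a strand flip).
    """
    study_set = set(study)
    reference_set = set(reference)
    flipped_set = {COMPLEMENT[a] for a in study}
    if study_set == flipped_set:
        # Complementing A/T gives T/A, and C/G gives G/C: the pair is unchanged.
        return "ambiguous"
    if study_set == reference_set:
        return "same"
    if flipped_set == reference_set:
        return "flip"
    return "mismatch"

if __name__ == "__main__":
    print(classify_strand(("A", "C"), ("G", "T")))  # the A/C versus G/T case from the text
    print(classify_strand(("A", "T"), ("A", "T")))
```

SNPs reported as 'ambiguous' are the ones for which the text recommends tracing the original assay, particularly when the minor allele frequency is near 50%.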
The options in this table affect the way that the program filters the input data (mainly the -g and -g_ref files). Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file. |
Flag | Default | Description |
-exclude_snps_g <file> | none | List of SNPs to exclude from
the |
|
none | Same as -exclude_snps_g, but applies to the -g_ref file. |
-impute_excluded | | Specifies that SNPs excluded from the study dataset via the |
-include_snps <file> | none | List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the Type 0 and Type 1 SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file). This option does not have any effect on Type 2 and Type 3 SNPs. |
-sample_g <file> | none | File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first three columns are necessary, and only the first two columns are used by IMPUTE2 (i.e., the third column can have dummy values, and subsequent columns do not affect the algorithm).
NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option. |
-sample_g_ref <file> | none | Same as -sample_g, but applies to the -g_ref file. |
-exclude_samples_g <file> | none | List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is REQUIRED if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "
NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option. |
-exclude_samples_g_ref <file> | none | Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with |
IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm. |
The options in this table control the format and naming conventions of output files printed by IMPUTE2. |
Flag | Default | Description |
-o <file> | ./test.impute2 | Name of main output file. Follows the same format as the -g file. |
-i <file> | [-o]_info | Name of SNP-wise information file with one line per SNP and a single header line at the beginning; versions of IMPUTE prior to v2.1.0 did not print the header. This file always contains the following columns (header tags shown in parentheses):
1. SNP identifier from -g file (snp_id)
2. rsID (rs_id)
3. base pair position (position)
4. expected frequency of allele coded '1' in the -o file (exp_freq_a1)
5. measure of the observed statistical information associated with the allele frequency estimate (info)
6. average certainty of best-guess genotypes (certainty)
7. internal "type" assigned to SNP (type) -- column did not exist prior to v2.1.0
Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX, where X takes values in {0,1,2}. For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes (after applying the |
-r <file> | [-o]_summary | Name of file that records a summary of the screen output. |
-w <file> | [-o]_warnings | Name of file that records warnings generated by IMPUTE2. |
-os <int> <int> ... | 0 1 2 3 | "Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., " |
-o_gz | | Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large. |
-outdp <int> | 3 | Specifies the number of decimal places to use for reporting genotype probabilities in the main output file. |
-no_snp_qc_info | | Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in the -i file. |
-no_sample_qc_info | | Suppresses printing of the per-sample quality control metrics file. The default is to print a file named " |
-phase | | IMPUTE2 always implicitly phases the study dataset (
In addition to this "best-guess" haplotypes file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties occur in a file named "
As shown in the examples section, it is possible to use the |
-pgs | | "Predict Genotyped SNPs": Tells the program to replace the input genotypes from the |
-pgs_miss | | Unlike
WARNING: This is an appealing option that promises to simply "fill in" sporadically missing genotypes in your input data. However, we think that following this procedure and then testing the SNPs for association could cause subtle problems. We are investigating these issues, but in the meantime we suggest that you only use this option with caution; using it naively may lead to bad results, and you do so at your own risk. |
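The -i file described in this table has a fixed set of leading columns, so it is straightforward to read into a script for downstream checks. The Python sketch below assumes a whitespace-separated file with the header line printed by v2.1.0 and later; the function names read_info and well_imputed are hypothetical, and the cutoff of 0.5 on the info column is an arbitrary illustration, not a recommendation from this page.

```python
def read_info(lines):
    """Parse the lines of an _info file (header first) into a list of dicts.

    Numeric conversion is applied only to the columns that are always
    present; any extra columns are kept as strings.
    """
    rows = [ln.split() for ln in lines if ln.strip()]
    header = rows[0]
    records = []
    for fields in rows[1:]:
        rec = dict(zip(header, fields))
        rec["position"] = int(rec["position"])
        for key in ("exp_freq_a1", "info", "certainty"):
            rec[key] = float(rec[key])
        records.append(rec)
    return records

def well_imputed(records, min_info=0.5):
    """Return the rs_id of every SNP whose info value is at least min_info.

    min_info is a hypothetical cutoff chosen for illustration only.
    """
    return [r["rs_id"] for r in records if r["info"] >= min_info]

if __name__ == "__main__":
    # Made-up rows that only demonstrate the column layout.
    example = [
        "snp_id rs_id position exp_freq_a1 info certainty type",
        "--- rs1 100 0.25 0.95 0.99 0",
        "--- rs2 200 0.10 0.30 0.80 0",
    ]
    print(well_imputed(read_info(example)))
```

The example rows are invented to show the layout and do not come from a real IMPUTE2 run.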
The options in this table are meant for experts only. Don't use them unless you know what you are doing! |
Flag | Default | Description |
-seed <int> | random | Initial seed for random number generator. The seed is set using the system clock unless it is manually overridden with this option. |
-no_warn | | Turns warnings off, so that the -w file does not get printed. |
-no_fill | | Turns hole-filling off, so that SNPs included in the -g file but not in the lowest reference panel cannot contribute to the inference. |
-no_remove | | Prevents the program from discarding SNPs whose alleles cannot be aligned across panels. |
In principle, it is possible to impute genotypes across an entire chromosome in a single run of IMPUTE2, but it is better to split a chromosome into smaller chunks for analysis. One important reason for this is that the population-genetic approximation used by IMPUTE2 works best over short genomic distances. The approximation works by modeling local genealogies, and the superior accuracy afforded by this model may diminish if there is too much recombination in the region. |
cat chr16_chunk1.impute2 chr16_chunk2.impute2 chr16_chunk3.impute2 > chr16_chunkAll.impute2 |
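Splitting a chromosome into chunks amounts to generating a series of non-overlapping base pair intervals, each of which can be passed to a separate run via the -int flag. The Python sketch below shows one way to produce such intervals; the function name chunk_intervals and the default chunk size of 5 Mb are assumptions made for illustration, not values prescribed on this page.

```python
def chunk_intervals(start, end, chunk_size=5_000_000):
    """Return (lower, upper) base pair intervals that cover start..end.

    Intervals are inclusive and do not overlap; the last one is shortened
    so that it stops at end. chunk_size is a hypothetical default.
    """
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals

if __name__ == "__main__":
    # Print the -int arguments for a made-up 12 Mb region.
    for lower, upper in chunk_intervals(1, 12_000_000):
        print(f"-int {lower} {upper}")
```

Because the intervals are produced in ascending order, output files from the corresponding runs can be concatenated in the same order, as in the cat command above.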
The proliferation of cheap DNA sequencing technologies has greatly increased the rate at which reference panels for genotype imputation can be generated. In this context, many GWAS investigators would like to re-impute their datasets as larger reference datasets become available. Imputation is still relatively computer-intensive when performed genome-wide, so we have been working on ways to speed up the inference in this context. |
We will soon be posting suggestions for making sure that IMPUTE2 has run successfully, detecting common problems, and processing the output files prior to association analysis. Stay tuned... |
|
Q: What |
A:
You can find a complete answer here. The quick
answer is |
Q: Why haven't you responded to my e-mail? |
A: We go out of our way to respond promptly to queries about IMPUTE2. If you wrote to us and haven't heard back yet, the most likely reason is that we are too busy to reply immediately. There are a few things that you can do to improve the chances of receiving a fast response: |
If you would like to receive e-mails about updates to this software, please fill out the registration form. |
[1] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913 [Free Access PDF] [Supplementary Material] [News and Views Article] |
If you have any questions regarding the use of IMPUTE2, please send an e-mail to both of the following people: |