IMPUTE v1

IMPUTE is a program for imputing unobserved genotypes in genome-wide case-control studies based on a set of known haplotypes (like the HapMap Phase II haplotypes [2]). The program is designed to work seamlessly with the output of both the genotype calling program CHIAMO [1] and HAPGEN, and to produce output that can be analyzed using the program SNPTEST [2]. An earlier version of IMPUTE was used to carry out genotype imputation as part of the analysis of the 7 genome-wide association studies analyzed by the Wellcome Trust Case-Control Consortium (WTCCC) [3].



Overview (top)


Contributors (top)

The following people have contributed to the development of the methodology and software for IMPUTE.

Jonathan Marchini, Bryan Howie

Download (top)

Pre-compiled versions of the program and example files can be downloaded from the links below. We've supplied both static and dynamic versions of the Linux executables. If you intend to run IMPUTE on a machine running an old kernel then you probably want to use the dynamic version. If you have any problems getting the program to work on your machine please contact us.

NOTE: We have not yet compiled IMPUTE v1 on all of the platforms for which we distribute IMPUTE v0.5 binaries. If you would like to request that a specific platform be prioritized, please send us an e-mail.

Platform                                       File
Linux (x86_64) Static Executable               impute_v1.0.0_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)    impute_v1.0.0_SuSE9.3_x86_64_static.tgz
Linux (x86_64) Dynamic Executable              impute_v1.0.0_x86_64_dynamic.tgz
Linux (i386) Dynamic Executable                impute_v1.0.0_i386_dynamic.tgz
Mac OS X Intel                                 impute_v1.0.0_MacOSX_Intel.tgz
Mac OS X PowerPC                               impute_v1.0.0_MacOSX_PowerPC.tgz
Solaris 5.8 (Sun SPARC)                        impute_v1.0.0_Solaris5.8_SPARC.tgz
Solaris 5.10 (AMD Opteron)                     impute_v1.0.0_Solaris5.10_Opteron.tgz
SLES 10 (Intel Itanium2)                       impute_v1.0.0_SLES10_Itanium2.tgz

Please fill out the registration form to receive emails about updates to this software.

To unpack the files use a command like

tar zxvf impute_vX.X.X_i386.tgz

This will create an executable called impute and a directory /example that contains the example files.

Running IMPUTE (top)

IMPUTE is a command line program. To illustrate its use we have included an example dataset in the directory /example

If you are a new user we suggest you spend some time working with the example files to get used to the input and output file formats, the command line options and flags and the effect they have on the results.

To run the program on the example file use

./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000

This command runs IMPUTE on the example files and specifies that imputation is carried out from position 62,000,000bp to 63,000,000bp i.e. 62Mb to 63Mb.

This will produce the following screen output. We have annotated this output with comments, each marked by an arrow (<----).

bash$ ./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000

IMPUTE v1.0.0
=============

Copyright 2006 Jonathan Marchini
Please see the LICENCE file included with this program for conditions of use.

haplotypes file : example/haplo.txt
    legend file : example/legend.txt
 genotypes file : example/geno.txt
       map file : example/map.txt              <----  list of input and output files

    strand file : example/strand.txt
 exclusion file : NULL
 inclusion file : NULL
    output file : ./out
      info file : ./info
   results file : ./summary
    sample file : NULL

imputation interval : [62000000,63000000]      <---- specifies the region of imputation from -int option
reading genetic map...done
reading haplotypes
 # ind = 120
 # snps read in = 1129
reading genotypes
 # ind = 50
 # SNPs with genotypes read in = 250
reading strand file
 # SNPs in strand file = 250
 # SNPs in imputed region that have had strand assigned = 250

Summary :
122 SNPs in left-hand buffer region
223 SNPs in right-hand buffer region
662 type 1 SNPs will be in output file (type 1 = SNP in haplotype file only)
141 type 2 SNPs will be in output file (type 2 = SNP in haplotype file and genotype file)
27 type 3 SNPs will be in output file (type 3 = SNP in genotype file only)
830 SNPs will be in output file in total
1175 SNPs in total

-using strand file to orientate strand
 --flipped strand at 103 genotyped SNPs out of a total of 204   <---- details of strand alignment
-aligning allele labels of haplotypes and genotypes
-removing non-aligned genotyped SNPs
 --removing 0 genotyped SNPs out of a total of 204

setting weights...done
setting storage space...done
setting mutation matrices...done
setting switch rates...done

Estimated RAM required is 74.115Mb


      n_hap : 120
      n_gen : 50
       nind : 50
   interval : [62000000, 63000000]
     buffer : 250     <--- this is the buffer region (in kb) used on each end of the region to avoid edge effects
         Ne : 11400   <--- this is the Ne value used in the model
call_thresh : 0.900   <--- this is the threshold used to call genotypes from the input genotype file
      theta : 0.18655
      model : 4

 predicting individual [50/50] [forward sweep]  [backward sweep]  [predict]

Breakdown of impution accuracy at SNPs with genotypes in the input file
  This assessment only uses genotypes in input file that are called above threshold of 0.90
  There are 7024 such genotypes in total
  For each of these genotypes the maximum imputed genotype calls are distributed as follows
  Interval  #Genotypes %Concordance         Interval  %Called %Concordance
  [0.0-0.1]          0          0.0         [ >= 0.0]   100.0         95.9
  [0.1-0.2]          0          0.0         [ >= 0.1]   100.0         95.9
  [0.2-0.3]          0          0.0         [ >= 0.2]   100.0         95.9
  [0.3-0.4]          0          0.0         [ >= 0.3]   100.0         95.9
  [0.4-0.5]         32         40.6         [ >= 0.4]   100.0         95.9
  [0.5-0.6]        175         51.4         [ >= 0.5]    99.5         96.1  <--- Imputation accuracy: for genotypes in the input file
  [0.6-0.7]        155         65.8         [ >= 0.6]    97.1         97.3       this says that using a calling
  [0.7-0.8]        163         77.3         [ >= 0.7]    94.8         98.0       threshold of 0.5 99.5% of
  [0.8-0.9]        305         82.3         [ >= 0.8]    92.5         98.5       imputed genotypes would be
  [0.9-1.0]       6194         99.3         [ >= 0.9]    88.2         99.3       called and 96.1% of those are
                                                                                 concordant/correct.
finito   <--- this says 'I am finished' in Italian
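
The right-hand columns of the accuracy table can be derived from the left-hand ones. The following Python sketch is not part of IMPUTE; it simply re-derives the cumulative "%Called" and "%Concordance" figures from the per-interval counts printed above. The function name and data layout are illustrative assumptions, and small differences can arise because the per-interval concordances are already rounded.

```python
# (lower bound of interval, number of genotypes, % concordance), copied from the
# left-hand columns of the screen output above; empty intervals are omitted.
BINS = [
    (0.4, 32, 40.6),
    (0.5, 175, 51.4),
    (0.6, 155, 65.8),
    (0.7, 163, 77.3),
    (0.8, 305, 82.3),
    (0.9, 6194, 99.3),
]


def called_and_concordance(bins, threshold):
    """Return (% called, % concordance) when only genotypes whose interval
    starts at or above `threshold` are called."""
    total = sum(n for _, n, _ in bins)
    kept = [(n, conc) for lower, n, conc in bins if lower >= threshold]
    called = sum(n for n, _ in kept)
    # Number of concordant genotypes, reconstructed from the rounded percentages.
    concordant = sum(n * conc / 100.0 for n, conc in kept)
    return 100.0 * called / total, 100.0 * concordant / called


for t in (0.5, 0.9):
    pct_called, pct_conc = called_and_concordance(BINS, t)
    print(f"threshold {t}: {pct_called:.1f}% called, {pct_conc:.1f}% concordant")
```

The counts sum to the 7024 genotypes reported in the output, and the two thresholds shown should agree with the corresponding table rows up to rounding.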


Here are a few more examples that illustrate how various options and flags can modify the behaviour of IMPUTE.
See below for a full description of the options, input file formats and output file formats.


Example 1 This command uses the internal strand alignment (using the -fix_strand flag) rather than using a strand file (using the -s option). If you run this you should see that the accuracy is very similar to that obtained when using the strand file in the example above.
./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -fix_strand -Ne 11400 -int 62000000 63000000

Example 2 This command differs from the first example in two ways. The -os 2 option specifies that only Type 2 SNPs (SNPs that occur in both the genotype file and the haplotype file) should occur in the output file. The -pgs flag specifies that these SNPs should have their genotypes overwritten with predictions in the output file.
./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000 -pgs -os 2

Example 3 The -exclude_snps option specifies a file that lists SNPs to be excluded from the genotype file. Imputation will be carried out ignoring the data at these SNPs, and these SNPs will not appear in the output. The -impute_excluded flag modifies the behaviour of the -exclude_snps option. It specifies that the excluded SNPs should be imputed i.e. these SNPs will appear in the output file but their genotypes will be overwritten with predictions.
./impute -h example/haplo.txt -l example/legend.txt -g example/geno.txt -m example/map.txt -s example/strand.txt -Ne 11400 -int 62000000 63000000 -exclude_snps example/exclude.txt -impute_excluded

Using IMPUTE with the HapMap2, HapMap3 and 1000 Genomes Project Data (top)

A main use of this program will be imputing genotypes based on the haplotypes from HapMap2, HapMap3 and the 1000 Genomes Project data. To facilitate this use we have prepared these haplotype sets in the format required by IMPUTE for all 22 autosomes. Be careful to make sure your genotype data uses base-pair positions that are matched to the genome build used by the haplotype, rate and strand files. We recommend that genome-wide imputation of genotypes be carried out in relatively small chunks to avoid running out of RAM on your computer. For imputation of the WTCCC dataset we used a chunk size of 7Mb. The imputed chunks were then concatenated together to produce an imputed file for each chromosome. The chunk boundaries can be specified using the -int option. The -buffer option should also be used to avoid edge effects when imputing in relatively small chunks.
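
As a rough illustration of the chunking advice above, the following Python sketch splits a region into 7Mb intervals and prints one IMPUTE command per chunk. It is a sketch under stated assumptions, not a supplied tool: the file names (chr.hap, chr.legend, chr.map, chr.gen, chunk_N.out) are placeholders, the helper names are hypothetical, and the choice of non-overlapping intervals is our assumption so that no SNP appears twice when the chunks are concatenated. Only flags documented on this page are used.

```python
def chunk_intervals(start, end, chunk_size=7_000_000):
    """Split [start, end] into consecutive, non-overlapping intervals."""
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


def impute_commands(region_start, region_end, buffer_kb=250):
    """Build one command line per chunk (file names are placeholders)."""
    template = (
        "./impute -h chr.hap -l chr.legend -m chr.map -g chr.gen "
        "-Ne 11418 -buffer {buf} -int {lo} {hi} -o chunk_{idx}.out"
    )
    return [
        template.format(buf=buffer_kb, lo=lo, hi=hi, idx=idx)
        for idx, (lo, hi) in enumerate(chunk_intervals(region_start, region_end), 1)
    ]


for command in impute_commands(1, 20_000_000):
    print(command)
```

The per-chunk output files can then be concatenated in order to give one imputed file per chromosome.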

1000 Genomes Project (August 2009 CEU haplotypes) - NCBI Build 36
(dbSNP b126)
Polymorphic files - The August 2009 release of phased data from the 1000 Genomes
Project. The file contains the haplotypes, legend files, recombination rates and one
example file.
[CEU]

Strand files Affy500k [These were constructed using these Affymetrix annotation files - Nsp Sty]
Affy6.0 [These files were created using this Affymetrix annotation file - LINK]
Example - the CEU file contains a set of 20 simulated individuals on chromosome 22 (example.gen). Below is an example of imputing these individuals using the CEU panel in the interval 20-25Mb. Note: no strand file is needed as this is simulated data. For real data you would need to either supply a strand file, align the strand of the genotype data to the + strand or use the -fix_strand option.
./impute -h CEU.0908.chr22.hap -l CEU.0908.chr22.legend -m genetic_map_chr22_combined_b36.txt -g example.gen -int 20000000 25000000 -o example.results

HapMap 3 (release 2) haplotypes - NCBI Build 36 (dbSNP b126)
Polymorphic files - Phased haplotypes from release 2 of the HapMap 3
dataset for all the populations : ASW, CEU, CHD, GIH, JPT+CHB, LWK,
MEX, MKK, TSI, YRI and a combined CEU+TSI set. The file contains
the haplotypes, legend files, recombination rates and one example file.
[HM3]

Strand files Affy500k [These were constructed using these Affymetrix annotation files - Nsp Sty]
Affy6.0 [These files were created using this Affymetrix annotation file - LINK]
Example - the HM3 file contains a set of 20 simulated individuals on chromosome 22 (example.gen). Below is an example of imputing these individuals using the CEU+TSI panel in the interval 20-25Mb. Note: no strand file is needed as this is simulated data. For real data you would need to either supply a strand file, align the strand of the genotype data to the + strand or use the -fix_strand option.
./impute -h CEU+TSI.chr22.hap -l hapmap3.r2.b36.chr22.legend -m genetic_map_chr22_combined_b36.txt -g example.gen -int 20000000 25000000 -o example.results

HapMap rel#24 - NCBI Build 36
(dbSNP b126)
Polymorphic files - these files contain SNPs polymorphic in each panel respectively
i.e. the CEU haplotypes only contain data at SNPs that are polymorphic in the CEU panel.
The files contain the haplotypes and associated legend files.
[CEU]
[YRI]
Recombination rate files (nb. these are the same as the rel#22 rates)
[CEU]   [YRI]   [COMBINED]
Strand files Affy500k [These were constructed using these Affymetrix annotation files - Nsp Sty]
Affy6.0 [These files were created using this Affymetrix annotation file - LINK]

HapMap rel#22 - NCBI Build 36 (dbSNP b126)
Polymorphic files - these files contain SNPs polymorphic in each panel respectively
i.e. the CEU haplotypes only contain data at SNPs that are polymorphic in the CEU panel.
The files contain the haplotypes and associated legend files.
[CEU]
[YRI]
[JPT+CHB]
Consensus files - these files contain SNPs that occur in all 3 of the HapMap panels.
There are also files for all combinations of the panels, which are useful for imputation
of admixed individuals.
Single panels     [CEU]   [YRI]   [JPT+CHB]
Pairs of panels   [CEU+CHB+JPT]   [CEU+YRI]   [CHB+JPT+YRI]
Combined panels   [CEU+YRI+CHB+JPT]
[Legend files]

Recombination rate files [CEU]   [YRI]   [COMBINED]
Strand files Affy500k [These were constructed using these Affymetrix annotation files - Nsp Sty]
Affy6.0 [These files were created using this Affymetrix annotation file - LINK]

HapMap rel#21 - NCBI Build 35 (dbSNP b125)
Polymorphic files - these files contain SNPs polymorphic in each panel respectively
i.e. the CEU haplotypes only contain data at SNPs that are polymorphic in the CEU panel.
The files contain the haplotypes and associated legend files.
[CEU]
[YRI]
[JPT+CHB]
Recombination rate files [CEU]    [YRI]    [JPT+CHB]    [COMBINED]
Strand files Affy500k [These files were constructed using these Affymetrix annotation files - Nsp Sty]

Formatting Haploid Sample Genotypes (top)

In the context of certain flags (currently, -haploid and -chrX), IMPUTE can treat some or all of the data in the -g file as phased haplotypes rather than unphased genotypes. In this case the program will use the same basic model for imputation, but it will assume that the input haplotypes are correct rather than following its standard procedure, which is to integrate over all possible phasings of each individual's multilocus genotype. This should improve accuracy when the haplotypes in your study are known with high accuracy, e.g. through genotyped family members or experimental methods that interrogate a single molecule. However, if there is considerable uncertainty about the phase of your data it is preferable to use the standard diploid input format and let IMPUTE account for that uncertainty internally.

To provide phased haplotypes to IMPUTE, you should create a file with the same format used for the standard -g file, with one difference: instead of defining a probability triple (p0, p1, p2), where p0 = Pr(genotype is AA), p1 = Pr(genotype is AB), and p2 = Pr(genotype is BB), you should define a probability triple where p0 = Pr(allele is A), p2 = Pr(allele is B), and p1 is a dummy value. For example, consider a diploid individual with phased haplotypes at four SNP sites, where the first haplotype is A-B-A-B and the second haplotype is A-A-B-B. If we ignored (or didn't know) the phase, this individual's data could be represented by three columns in a standard diploid IMPUTE input file:

1 0 0
0 1 0
0 1 0
0 0 1

In this phase-unknown format, each row corresponds to a different SNP, and the SNPs have diploid genotypes A/A, A/B, A/B, and B/B. If we knew the phase of the underlying haplotypes, we could instead assign three columns to each haplotype (columns 1-3 to the first haplotype and columns 4-6 to the second haplotype):

1 0 0 1 0 0
0 0 1 1 0 0
1 0 0 0 0 1
0 0 1 0 0 1

Here, the first three columns encode the haplotype A-B-A-B and the next three columns encode the haplotype A-A-B-B.
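
The two encodings above can be generated mechanically. The Python sketch below is an illustration only (the function names are our own and the dummy middle value is set to 0, as in the example); it reproduces the phase-unknown and phase-known rows for the haplotypes A-B-A-B and A-A-B-B.

```python
def haploid_triple(allele):
    """Encode one allele as (Pr(allele is A), dummy, Pr(allele is B))."""
    if allele == "A":
        return (1, 0, 0)
    if allele == "B":
        return (0, 0, 1)
    raise ValueError(f"unexpected allele: {allele!r}")


def diploid_triple(allele1, allele2):
    """Encode an unphased genotype as (Pr(AA), Pr(AB), Pr(BB))."""
    n_b = (allele1 == "B") + (allele2 == "B")  # number of B alleles: 0, 1 or 2
    return tuple(1 if i == n_b else 0 for i in range(3))


hap1 = "ABAB"  # first haplotype, one character per SNP
hap2 = "AABB"  # second haplotype

print("phase-unknown (3 columns per individual):")
for a1, a2 in zip(hap1, hap2):
    print(*diploid_triple(a1, a2))

print("phase-known (3 columns per haplotype):")
for a1, a2 in zip(hap1, hap2):
    print(*(haploid_triple(a1) + haploid_triple(a2)))
```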

You must specify the appropriate flags in order for IMPUTE to interpret haploid genotypes correctly. The -haploid flag tells the program that all of the input genotypes are haploid. Conversely, a chromosome X dataset may contain a mixture of haploid (male) and diploid (female) genotypes; these can be represented in a single file, with the ploidy of each column specified by the -sample file (which is required when using the -chrX flag). There are more details on chromosome X imputation below. When using either the -haploid or -chrX flags, IMPUTE will produce an output file in which the individuals (triples of columns) have the same ploidy as in the input file.

X Chromosome Imputation (top)

IMPUTE can carry out imputation of genotypes on the X chromosome but it is slightly more complicated.
There are 3 special flags associated with X chromosome imputation (-chrX, -Xpar and -sample). See the option list below for more details.
The pseudoautosomal (par) and non-pseudoautosomal (non-par) regions of chromosome X are dealt with in slightly different ways.

We have put together a set of files for X chromosome imputation [chrX_files.tgz]. See the included README file for a complete description of the files.

Here is an example of using these files for carrying out imputation in the non-pseudoautosomal region of the X chromosome. The output format is the same as running IMPUTE on the autosomes. Males are reported as having 3 posterior probabilities for each genotype but the heterozygote probability will always be 0. The AA and BB homozygote probabilities for males correspond to the posterior probabilities of carrying the two alleles A and B respectively.

./impute -chrX -h chrX_files/genotypes_chrX_CEU_r21_nr_fwd_non-par_phased_by_snp_no_mono -l chrX_files/genotypes_chrX_CEU_r21_nr_fwd_non-par_legend.txt -m chrX_files/genetic_map_chrX_non-par.txt -s chrX_files/Affy500k_chrX_non-par.strand -g chrX_files/chrX.example.gen -sample chrX_files/chrX.example.sample -Ne 11400 -int 4000000 4100000

Options (top)

Flags
Required/Optional
Default
Description
-h <file>
Required

File containing a set of known haplotypes for the region of interest. The alleles of the haplotypes should be coded as 0 and 1. The format of this input file is one line per SNP and one column per haplotype.
-l <file>
Required
Legend file for the haplotypes file, which gives the rs ID, position and the alleles that are coded as 0 and 1 in the haplotypes file. The alleles should be taken from A, C, G and T. Note that this file needs a header line (see the example file legend.txt for details).
-g <file>
Required
File containing a set of genotypes for the set of individuals. The file format is described in detail on the FILE FORMAT WEBPAGE. The file format is the same as the output format from our genotype calling program CHIAMO.
NOTE 1 : The SNPs MUST appear in base-pair position order (lowest to highest) i.e. the 3rd column of this file must be sorted.
NOTE 2 : Base-pair positions of SNPs must use the same genome build as that used in the haplotype file.
-g_gz
Optional

Specify that the genotype file is gzipped.
-m <file>
Required
Fine-scale recombination map covering the region in which imputation is required. There is one line for each position on the map. The first column contains the base-pair position, the 2nd column contains the recombination rate in cM/Mb to the next point on the map and the 3rd column contains the recombination map position in cM. Note that this file needs a header line (see the example file map.txt for details).
-Ne <int>
Required
Sets the effective population size that scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we highly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese and Japanese populations.
-int <lower> <upper>
Required

Lower and upper boundaries (in base-pair position) of the region in which imputation should be carried out.
-s <file>
Optional
File listing the strand orientation of the SNPs in the genotype file relative to the orientation of the alleles in the haplotypes file. This file is required if the orientation of alleles at SNPs in the haplotype and genotype files does not match up. The file should contain a line for each SNP in the genotype file with two entries: (i) the base-pair position of the SNP, and (ii) the strand (+ or -) of the alleles in the genotype file. SNPs do not have to be in the same order as in the genotypes file, and the file can include SNPs that are not in the genotypes file e.g. if the genotypes file has had some SNPs filtered out. Take a look at the example files for an illustration of the required format.
NOTE : It is critical that the alleles used to code genotypes in the haplotype file and the genotype file match up. If not, then the quality of imputation may decrease substantially. Great care should be taken in constructing a strand file for your data.
NOTE : see the -fix_strand and -no_remove options below which control the internal strand alignment functions.
-fix_strand
Optional

This flag invokes an internal strand alignment at SNPs that occur in both the genotypes and haplotypes files. It is based on the allele labels (at SNPs that are not A/T or G/C), and on discordant allele frequencies (at A/T and G/C SNPs).
-no_remove
Optional

This flag turns off the default removal of all SNPs in the genotype file that are not aligned. The removal of SNPs is carried out after any specified strand file has been applied and after the checks described in the previous option have been applied.
-o <file>
Optional
./out
Name of the main output file that will contain the imputed genotypes. The file has one line per SNP and has exactly the same format as the genotypes file. NB the program will estimate probabilities for all genotypes including those that are known in the genotypes file (this allows an assessment of genotyping errors and imputation of missing data at these SNPs).
-o_gz
Optional

Specify that the output file should be gzipped.
-i <file>
Optional
./info
Name of the file that will contain information measures that describe the accuracy of imputation at each SNP. This file contains one line per SNP that contains the SNP ID, rs ID, position, expected allele frequency of the SNP, a measure of the observed statistical information associated with the estimate of the allele frequency and an alternative confidence score for the SNP (calculated as the average of the maximum posterior probabilities of the imputed genotypes). The information measure and the confidence score will be 1 if the SNP is imputed with high confidence. Both measures decrease towards 0 as imputation confidence decreases.
-r <file>
Optional
./summary
Specify file where a copy of the screen output is written.
-buffer <int>
Optional
250
To avoid edge effects in the imputation the program includes genotypes either side of the interval specified by the -int option. This option specifies the length of the buffer region (in kb) at each end of the interval.
-call_thresh <double>
Optional
0.9
Threshold for calling genotypes in the genotype input file. The genotype with the maximum probability will be used if that probability is above the threshold. Otherwise the genotype will be treated as missing.
-nind <int>
Optional

Specify the number of individuals to impute e.g. to impute just the 1st individual use -nind 1
-exclude_snps <file>
Optional
Exclude a set of genotyped SNPs (i.e. SNPs that occur in the file specified by the -g option) with ID equal to those listed in the file. The IDs can be either the rs ID or the alternate ID given in the first column of the genotype file. These SNPs will not be used for imputation and will not occur in the output files.
-impute_excluded
Optional

This flag modifies the behaviour of the -exclude_snps option. For Type 2 SNPs that have been excluded it places imputed genotypes in the output file.
-os <int>, -include_snps <file>
Optional
1 2 3 (for -os)

The SNPs that are included in the output are controlled by the combination of the -os and -include_snps options.

The -os option controls which types of SNPs are included in the output. There are three types of SNPs
1 = SNPs that occur ONLY in the haplotypes file
2 = SNPs that occur in BOTH the haplotypes and genotypes file
3 = SNPs that occur ONLY in the genotypes file

You can specify more than one type of SNP using the -os option. For example, using -os 1 2 would output all SNPs that occur in the haplotypes file. The default setting is to produce output at all SNPs i.e. -os 1 2 3.

Using -os 2 is useful if all you require is an LD-based estimate of the genotypes at SNPs in the genotypes file, and it can be substantially quicker than the default setting.

The -include_snps option specifies a list of SNPs to be included in the output BUT this list only applies to those SNPs that appear only in the haplotype file i.e the SNPs specified by -os 1. The IDs should be the rsIDs given in the legend file that corresponds to the haplotypes file.
-pgs
Optional

For SNPs that occur in the genotype file the default is now to return these genotypes in the output file rather than their predictions (which was the old default). The -pgs flag (which stands for predict genotyped snps) can be used to specify that the predictions should be written to the output file.
-outdp <int>
Optional
2
Specify the number of decimal places used to report the genotype probabilities.
-chrX
Optional

Specify this flag if you want to impute genotypes on the X chromosome. The haplotype file, legend file, map file and strand file should be set appropriately. A sample file must also be supplied (see -sample below).
-Xpar
Optional

Controls whether you wish to do imputation in the pseudoautosomal or non-pseudoautosomal region of the X chromosome. If the flag is given it specifies that you are working in the pseudoautosomal region. If the flag is absent it specifies that you are working in the non-pseudoautosomal region. Only works when used in conjunction with the -chrX flag.
-sample <file>
Optional

Sample file (see FILE FORMAT WEBPAGE for more details) containing a covariate named 'sex' specifying the sex of all individuals in the genotype file. Males should be coded 1 and females coded 2.
-haploid
Optional

Specify that the -g file contains haplotypes, not diploid genotypes. See above for details about file formatting.
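
To make the three SNP types used by the -os option concrete, here is a small Python sketch that classifies SNPs by whether their base-pair position occurs in the haplotypes file, the genotypes file, or both. This is an illustration under assumptions rather than IMPUTE's code: the positions are made up, the function names are hypothetical, and matching on position alone is a simplification.

```python
def classify_snps(haplotype_positions, genotype_positions):
    """Group base-pair positions into SNP types 1, 2 and 3."""
    hap = set(haplotype_positions)
    gen = set(genotype_positions)
    return {
        1: sorted(hap - gen),  # ONLY in the haplotypes file
        2: sorted(hap & gen),  # in BOTH files
        3: sorted(gen - hap),  # ONLY in the genotypes file
    }


def output_positions(snps_by_type, os_types=(1, 2, 3)):
    """Positions that would be written for a given -os selection."""
    return sorted(p for t in os_types for p in snps_by_type[t])


# Toy positions, invented for this example.
snps = classify_snps(
    haplotype_positions=[100, 200, 300, 400],
    genotype_positions=[200, 400, 500],
)
print(snps)
print(output_positions(snps, os_types=(2,)))    # like -os 2
print(output_positions(snps, os_types=(1, 2)))  # like -os 1 2
```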

FAQ (top)

Q. How do I code missing genotypes in the genotype file?
A. Internally, IMPUTE turns the probabilities in the genotype file into a single genotype by choosing the genotype with the maximum probability if it is greater than the threshold value supplied by the -call_thresh option (default is 0.9). If the threshold isn't reached then the genotype is set to missing. So if you want to force a missing genotype then using (0 0 0) as the set of genotype probabilities will work with the default threshold.
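
The calling rule described in this answer can be sketched in a few lines of Python. This is an illustration, not IMPUTE's source code: the function name is hypothetical, and the behaviour when the maximum probability is exactly equal to the threshold is an assumption (treated here as missing).

```python
def call_genotype(probs, call_thresh=0.9):
    """Return 0, 1 or 2 (for AA, AB, BB) or None if the genotype is missing.

    `probs` is the probability triple (p0, p1, p2) from the genotype file.
    """
    best = max(range(3), key=lambda i: probs[i])
    # Assumption: the maximum must be strictly greater than the threshold.
    return best if probs[best] > call_thresh else None


print(call_genotype((1, 0, 0)))           # called as AA
print(call_genotype((0.05, 0.93, 0.02)))  # called as AB
print(call_genotype((0.5, 0.5, 0)))       # below threshold, missing
print(call_genotype((0, 0, 0)))           # forced missing
```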

Q. How do I create a strand file?
A. The creation of strand files is difficult. You need to work out which strand of the human reference sequence the alleles for each SNP have been expressed against. This will depend on the genotyping chip/method used to measure the genotypes so you will need to refer to the appropriate annotation files for the platform you have used. We have supplied strand files for the Affy 500k chip that work with the build 35 release of the HapMap haplotypes (also available from this website in the correct format for IMPUTE). We are working on supplying strand files for other GWA chips and these will appear on the website. Finally, v0.3.0 introduced some internal checks that attempt to align the strands of the genotype and haplotype files (see above). These checks are particularly useful for the Illumina 300, 550 and 650 chips, which do not have any A/T or G/C SNPs on them, so the strand of the genotype data can be aligned to the strand of the HapMap haplotypes using the allele labels alone.
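
The idea of aligning strand from allele labels alone can be illustrated with a short Python sketch. It is not the check implemented in IMPUTE; the function name is hypothetical and it only shows why A/T and G/C SNPs cannot be resolved this way: complementing their alleles gives back the same pair of labels.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def strand_from_alleles(haplotype_alleles, genotype_alleles):
    """Return '+', '-' or None when the labels are ambiguous or incompatible.

    Each argument is the pair of allele labels at one SNP, e.g. ("A", "G").
    """
    ref = frozenset(haplotype_alleles)
    geno = frozenset(genotype_alleles)
    flipped = frozenset(COMPLEMENT[a] for a in geno)
    if geno == flipped:
        return None  # A/T or C/G SNP: labels look the same on both strands
    if geno == ref:
        return "+"
    if flipped == ref:
        return "-"
    return None  # labels do not match on either strand


print(strand_from_alleles(("A", "G"), ("A", "G")))  # +
print(strand_from_alleles(("A", "G"), ("T", "C")))  # -
print(strand_from_alleles(("A", "T"), ("A", "T")))  # None (ambiguous)
```

A strand file line for an unambiguous SNP would then pair its base-pair position with the returned '+' or '-'.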

Q. Why do I get the message "rs numbers don't  agree"?
A. SNPs from the haplotype and genotype files are aligned on their base-pair position. Once aligned, IMPUTE checks to see if the rs id from the legend file matches the rs id from the genotype file. If they don't match IMPUTE prints the message. The most likely explanation is that the rs ids in the legend file and genotype file were created from different sources i.e. different versions of dbSNP. For example, the legend files available from the IMPUTE website (above) were created from the HapMap project and used the rs ids of the SNPs from dbSNP at the time of that project i.e. over a year ago. The genotype file you use will probably have rs ids from some later version of dbSNP e.g. the annotation file from one of the Affymetrix or Illumina chips. In dbSNP, SNPs with different rs ids can be merged into one SNP if new information indicates that they are the same SNP, so it is possible that SNPs can have the same base-pair position but different rs ids. If you see some of these messages then it is worth querying the mismatching rs ids in dbSNP to check that this is the cause. For example, querying rs7446851 in dbSNP shows that this id was merged with another rs id
http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?searchType=adhoc_search&type=rs&rs=rs7446851. This is the kind of thing you would be looking for.

Q. Why does IMPUTE terminate with the message "terminate called after throwing an instance of 'std::bad_alloc'" or something similar?
A. The most likely cause is that you have tried to run IMPUTE on a whole chromosome and run out of RAM on your computer. See the advice above on how to use IMPUTE on whole chromosomes.

Q. How do I know whether IMPUTE is working as it should and giving good predictions?
A. In v0.5.0 we introduced new screen output that attempts to gauge the accuracy of the imputation using the known genotype data you have supplied using the -g option. IMPUTE predicts all of the data at the genotyped SNPs in a leave-one-out fashion. These predictions are then compared to the supplied genotypes to assess accuracy. The level of accuracy that is obtained will be close to the accuracy obtained at other imputed SNPs that do not occur in the genotype file. This information appears at the end of the screen output. In the example given above (which is real data) the concordance rate and call rate when calling imputed genotypes at a threshold of 0.9 were 99.3% and 88.2% respectively, i.e. a missing data rate of 11.8%.

Version History (top)

1.0.0 19-06-2009 Version 1.0.0 released.
  • Addition of -haploid flag for imputing missing alleles in haploid datasets.

References (top)

[1] J. Marchini, C. Spencer, Y.Y. Teo and P. Donnelly (2007) A Bayesian Hierarchical Mixture Model for Genotype Calling in a multi-cohort study. (in preparation)
[2] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39 : 906-913 [Free Access PDF][Supplementary Material][News and Views Article]
[3] The Wellcome Trust Case Control Consortium (2007) Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls.
Nature 447;661-78. PMID: 17554300 DOI: 10.1038/nature05911

Contact Information (top)

If you have any questions regarding the use of this program please send an email to both the following people

Dr. Bryan Howie (howie <at> stats <dot> ox <dot> ac <dot> uk).
Dr. Jonathan Marchini (marchini <at> stats <dot> ox <dot> ac <dot> uk).

It is a good idea to include a copy of the screen output (in the ./summary file) with your email, as this helps us identify any problems.