Combining reference panels across populations

As of August 2010, we are packaging the HapMap 3 haplotypes as a complete set for imputation: each download contains all available HapMap 3 haplotypes in a single file, except where haplotypes shared with a 1000 Genomes panel have been removed to prevent "double counting". We propose that the full set of haplotypes (with ancestry spanning Africa, Asia, Europe, and the Americas) be used as the reference panel in every imputation analysis, regardless of the ancestry of the individuals being imputed.

This differs from the standard practice, which is to restrict the reference panel to a subset of haplotypes that are thought to be genetically close to the population of interest. There are several reasons why we recommend deviating from this orthodoxy when using IMPUTE2:

The main reason for using a large, diverse reference panel instead of a smaller, targeted one is to improve imputation of rare alleles (e.g., alleles with frequencies of 1-5%). The idea is that alleles that are rare in the population of interest could have drifted to higher frequency in diverged populations; if haplotypes from those populations are included in the reference panel, the panel may carry more copies of those alleles, which might make them easier to impute in your study.
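As a rough illustration of this argument, the sketch below computes how likely it is that an allele is missing from a reference panel altogether, and how many copies the panel is expected to carry. It assumes haplotypes are sampled independently from each population, and the frequencies and panel sizes are hypothetical numbers chosen for illustration, not HapMap 3 values.

```python
from typing import Iterable, Tuple


def prob_absent(freq: float, n_haplotypes: int) -> float:
    """Probability that an allele at frequency `freq` appears in none of
    `n_haplotypes` independently sampled haplotypes: (1 - freq) ** n."""
    return (1.0 - freq) ** n_haplotypes


def expected_copies(groups: Iterable[Tuple[float, int]]) -> float:
    """Expected number of copies of the allele in a panel made of several
    groups, each given as (allele frequency, number of haplotypes)."""
    return sum(freq * n for freq, n in groups)


def prob_absent_pooled(groups: Iterable[Tuple[float, int]]) -> float:
    """Probability that the allele is absent from every group in the panel."""
    result = 1.0
    for freq, n in groups:
        result *= prob_absent(freq, n)
    return result


# Hypothetical targeted panel: 120 haplotypes, allele at 1% frequency.
targeted = [(0.01, 120)]
# Hypothetical diverse panel: the same haplotypes plus 200 haplotypes from a
# diverged population where the allele has drifted to 10% frequency.
diverse = [(0.01, 120), (0.10, 200)]

print(prob_absent_pooled(targeted), expected_copies(targeted))
print(prob_absent_pooled(diverse), expected_copies(diverse))
```

With these made-up numbers the allele is missing from the targeted panel roughly 30% of the time, while the diverse panel almost always contains it. Whether those extra copies sit on haplotypes that are useful for a given study individual is a separate question that this calculation does not address.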

We have tested this idea extensively in the HapMap 3 data, and it appears to work well: in populations from around the world, rare alleles are imputed more accurately from the complete HapMap 3 reference set than from the subset of HapMap 3 haplotypes that would usually be deemed "close enough" for imputation.

Certain aspects of the IMPUTE2 algorithm make it especially suited to using a large, diverse reference panel. IMPUTE2 uses an adaptive algorithm that is based on a model of local genealogies, and this allows it to automatically identify the reference panel haplotypes that will be most useful for imputing a given study individual. Hence, while the method has access to all of the reference panel haplotypes, it will only use a subset of them for each imputation step. This subset will differ between individuals, genomic regions, and even stages of the algorithm (since the method averages over different plausible subsets of reference haplotypes).

There are three benefits to this approach. First, by using a genealogical approximation that automatically eliminates haplotypes that are too diverged to be useful, the method avoids putting too much weight on distantly related haplotypes. Second, by restricting the set of haplotypes that are used for imputation, the algorithm remains fast and accurate, even with very large reference panels. Third, the algorithm can selectively reach across populations in the reference set when there are helpful shared haplotypes, thereby increasing the accuracy of rare allele imputation.

It is not always obvious which reference populations are "close enough" to a study population to be used for imputation, and this question is growing more difficult as more reference datasets become available. IMPUTE2 completely removes the need to worry about choosing a reference set: you can just provide the method with all available reference haplotypes, and it will choose the best ones internally. The potential downsides to this approach -- longer running times and lower accuracy caused by treating a structured reference panel as unstructured -- are mitigated by IMPUTE2's modeling strategy.
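To make the notion of choosing reference haplotypes per individual more concrete, here is a minimal sketch that picks, for one study haplotype, the k reference haplotypes with the fewest mismatches at the typed sites. Counting mismatches (Hamming distance) is used here only as a simple stand-in for the genealogical model described above; the function names, the parameter k, and the toy data are all assumptions for illustration and are not taken from IMPUTE2 itself.

```python
from typing import List, Optional, Sequence


def mismatches(study: Sequence[Optional[int]],
               ref: Sequence[int],
               typed_sites: Sequence[int]) -> int:
    """Count sites (among the typed ones) where the two haplotypes differ."""
    return sum(1 for s in typed_sites if study[s] != ref[s])


def closest_reference_haplotypes(study: Sequence[Optional[int]],
                                 reference: Sequence[Sequence[int]],
                                 typed_sites: Sequence[int],
                                 k: int) -> List[int]:
    """Return the indices of the k reference haplotypes with the fewest
    mismatches to `study`; ties are broken by reference index."""
    scored = sorted(
        (mismatches(study, ref, typed_sites), idx)
        for idx, ref in enumerate(reference)
    )
    return [idx for _, idx in scored[:k]]


# Toy reference panel: four haplotypes over four sites (hypothetical data).
reference = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
]
# Study haplotype typed at sites 0, 2 and 3; site 1 is untyped (None).
study = [0, None, 1, 0]

chosen = closest_reference_haplotypes(study, reference, typed_sites=[0, 2, 3], k=2)
print(chosen)
```

In this toy case the two selected haplotypes disagree at the untyped site, which is the kind of uncertainty an imputation method would average over. Running the selection separately for each individual and region gives a different subset each time, in the spirit of the behaviour described above.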

For these reasons, we believe that IMPUTE2 users should start imputing from the entire set of HapMap 3 reference haplotypes. We also intend to combine the 1000 Genomes reference haplotypes across populations, but we have not yet been able to do so because of technical issues.

If you would still prefer to impute from a subset of HapMap 3 haplotypes, it should be easy to do so: just download the HapMap 3 dataset, which contains a single haplotype file for each chromosome and a sample list that specifies which haplotypes (columns in the haplotype file) come from each constituent HapMap 3 panel. Given this information, you can parse out the columns of interest; let us know if you have any trouble with this.
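One possible way to parse out the columns is sketched below. The file layout is an assumption, so check it against the files you actually download: the sketch supposes that the haplotype file has one row per SNP with whitespace-separated alleles, that the sample list has one row per individual of the form "sample_id panel" with no header line, and that each individual contributes two consecutive haplotype columns in the same order as the sample list. The function names and sample IDs are hypothetical.

```python
from typing import Iterable, Iterator, List, Set


def columns_for_panels(sample_lines: Iterable[str],
                       keep_panels: Set[str],
                       haps_per_sample: int = 2) -> List[int]:
    """Return 0-based haplotype column indices for individuals whose panel
    label (assumed to be the second field) is in `keep_panels`."""
    columns: List[int] = []
    row = 0
    for line in sample_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines without advancing the row counter
        if fields[1] in keep_panels:
            start = row * haps_per_sample
            columns.extend(range(start, start + haps_per_sample))
        row += 1
    return columns


def subset_haplotypes(hap_lines: Iterable[str],
                      columns: List[int]) -> Iterator[str]:
    """Yield each SNP row restricted to the requested haplotype columns."""
    for line in hap_lines:
        alleles = line.split()
        if alleles:
            yield " ".join(alleles[c] for c in columns)


# Toy input standing in for the sample list and one chromosome's haplotypes.
sample_lines = ["ind1 CEU", "ind2 YRI", "ind3 CEU"]
hap_lines = ["0 1 0 0 1 1", "1 1 0 1 0 0"]

cols = columns_for_panels(sample_lines, {"CEU"})
subset = list(subset_haplotypes(hap_lines, cols))
print(cols)
print(subset)
```

For real data, the same functions can be given open file objects instead of lists, writing each yielded row to a new haplotype file. If your sample list has a header line or a different column order, the parsing in `columns_for_panels` would need to be adjusted accordingly.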

The observations described above and the experiments underlying them have not yet been published, although we are actively working to publish them. In the meantime, we are happy to answer questions about this imputation strategy on a case-by-case basis.