As of August 2010, we are packaging the HapMap 3 haplotypes as
a complete set for imputation: each download contains all
available HapMap 3 haplotypes in a single file, except where
haplotypes shared with a 1000 Genomes panel have been removed
to prevent "double counting". We propose that the full set of
haplotypes (which have ancestry spanning Africa, Asia, Europe,
and the Americas) be used as a reference panel in every
imputation analysis, regardless of the ancestry of the
individuals being imputed.
This differs from the standard practice, which is to restrict
the reference panel to a subset of haplotypes that are thought
to be genetically close to the population of interest. There
are several reasons why we recommend deviating from this
orthodoxy when using IMPUTE2:
-
The main reason for using a large, diverse reference panel
instead of a smaller, targeted one is to improve imputation
of rare alleles (e.g., alleles with frequencies of 1-5%).
In theory, a rare mutation in the population of interest
could have drifted to higher frequency in a diverged
population; if haplotypes from that population were
included in the reference panel, the allele would be better
represented there, and it might be easier to impute in your
study.
We have tested this idea extensively in the HapMap 3 data,
and it appears to work well: in populations from around the
world, rare alleles are imputed more accurately from the
complete HapMap 3 reference set than from the subset of
HapMap 3 haplotypes that would usually be deemed "close
enough" for imputation.
-
Certain aspects of the IMPUTE2 algorithm make it
especially suited to using a large, diverse reference
panel. IMPUTE2 uses an adaptive algorithm that is
based on a model of local genealogies, and this allows it to
automatically identify the reference panel haplotypes that
will be most useful for imputing a given study
individual. Hence, while the method has access to all of the
reference panel haplotypes, it will only use a subset of
them for each imputation step. This subset will differ between
individuals, genomic regions, and even stages of the
algorithm (since the method averages over different
plausible subsets of reference haplotypes).
There are three benefits to this approach. First, by using a
genealogical approximation that automatically eliminates
haplotypes that are too diverged to be useful, the method
avoids putting too much weight on distantly related
haplotypes. Second, by restricting the set of haplotypes
that are used for imputation, the algorithm remains fast and
accurate, even with very large reference panels. Third, the
algorithm can selectively reach across populations in the
reference set when there are helpful shared haplotypes,
thereby increasing the accuracy of rare allele imputation.
-
It is not always obvious which reference populations are
"close enough" to a study population to be used for
imputation, and this question is growing more difficult
as more reference datasets become available. IMPUTE2
completely removes the need to worry about choosing a
reference set: you can just provide the method with all
available reference haplotypes, and it will choose the best
ones internally. The potential downsides to this approach --
longer running times and lower accuracy caused by treating a
structured reference panel as unstructured -- are mitigated
by IMPUTE2's modeling strategy.
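As a rough illustration of the second point above, the sketch
below picks, for one study haplotype, the k reference haplotypes
that differ from it at the fewest SNPs. This is a deliberately
simplified stand-in and not IMPUTE2's actual genealogical model:
the Hamming-distance ranking, the function names, and the toy
data are all assumptions made for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length allele sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_reference_haplotypes(study_hap, reference_haps, k):
    """Return indices of the k reference haplotypes nearest to study_hap.

    Assumption: "nearest" means smallest Hamming distance over the region.
    Ties are broken by original order, because sorted() is stable.
    """
    order = sorted(
        range(len(reference_haps)),
        key=lambda i: hamming(study_hap, reference_haps[i]),
    )
    return order[:k]


# Toy data: alleles coded 0/1 at five SNPs in one genomic region.
reference = [
    [0, 0, 1, 1, 0],  # index 0
    [1, 1, 0, 0, 1],  # index 1
    [0, 0, 1, 0, 0],  # index 2
    [1, 0, 0, 0, 1],  # index 3
]
study = [0, 0, 1, 1, 1]

chosen = closest_reference_haplotypes(study, reference, k=2)
print(chosen)  # indices of the two closest reference haplotypes
```

In this toy case the selected subset depends only on the study
haplotype and the region, which mirrors the point that different
individuals and regions can draw on different parts of the same
reference panel.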
For these reasons, we believe that IMPUTE2
users should start imputing from the entire set of HapMap 3
reference haplotypes. We also intend to combine the 1000 Genomes
reference haplotypes across populations, but we have not yet been
able to do so because of technical issues.
If you would prefer to impute from a subset of HapMap 3
haplotypes, you can still do so: just download the
HapMap 3 dataset,
which contains a single haplotype file for each chromosome and
a sample list that specifies which haplotypes (columns in the
haplotype file) come from each constituent HapMap 3
panel. Given this information, it should be easy to parse out
the columns of interest; let us know if you have any trouble
with this.
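A minimal sketch of this column extraction is shown below. It
assumes that each row of the haplotype file is one SNP with
whitespace-separated alleles, and that you have already turned
the sample list into one panel label per haplotype column, in
column order. The function names, the labels, and the toy data
are hypothetical; check the layout of the downloaded files before
adapting it.

```python
def columns_for_panels(column_panels, wanted):
    """Indices of haplotype columns whose panel label is in `wanted`."""
    return [i for i, panel in enumerate(column_panels) if panel in wanted]


def subset_haplotypes(lines, keep):
    """Yield each SNP row reduced to the columns listed in `keep`."""
    for line in lines:
        alleles = line.split()
        if not alleles:
            continue  # skip blank lines
        yield " ".join(alleles[i] for i in keep)


# Hypothetical panel label for each haplotype column, in column order.
column_panels = ["CEU", "CEU", "YRI", "YRI", "JPT", "JPT"]

# Toy haplotype file contents: three SNPs (rows) by six haplotypes (columns).
hap_lines = [
    "0 1 0 0 1 1",
    "1 1 0 1 0 0",
    "0 0 1 1 0 1",
]

keep = columns_for_panels(column_panels, {"CEU", "JPT"})
subset = list(subset_haplotypes(hap_lines, keep))
for row in subset:
    print(row)
```

For a real chromosome file, the same generator can be fed an open
file handle instead of a list, so that rows are processed one at
a time rather than loaded into memory together.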
The observations described above and the experiments
underlying them have not yet been published, although we are
actively working to publish them. In the meantime, we are happy to
answer questions about this imputation strategy on a
case-by-case basis.