Using multi-population reference panels with IMPUTE2

This page explains our preferred approach to using imputation reference panels from modern genotyping and sequencing projects.

Practical suggestions
How does it work?
Published results

Overview

Human genetic variation resources, like those produced by HapMap 3 and the 1,000 Genomes Project, capture a broad cross-section of human genetic diversity: detailed variation data have now been collected from a variety of sampling locations in Africa, Asia, Europe, and the Americas. Large sequencing projects are actively expanding these datasets to include additional populations and deeper sampling within populations. These public databases provide powerful reference panels for genotype imputation studies.

In this context, one important question is how to choose a reference panel that will produce high imputation accuracy in a population of interest. The answer is seldom obvious because human populations have experienced complex demographic histories with many migration and mixture events. Consequently, it can be hard to decide which reference haplotypes should be used in a particular study.

We propose a simple and universal solution to this problem: we provide all available reference haplotypes to IMPUTE2, then let the software choose a "custom" reference panel for each individual to be imputed. There are several advantages to this approach:
  • Investigators do not need to waste time deciding which haplotypes to include in the reference panel. Good results can be obtained in any study population by tuning a single software parameter (-k_hap) with a simple rule of thumb; see below for more details.

  • This strategy works in a variety of human populations. Our group and others have used this approach to successfully impute populations ranging from homogeneous isolates to recent and complex admixtures.

  • IMPUTE2 is often more accurate with an ancestrally inclusive reference panel than with a smaller panel chosen by intuition. This is because individuals from "diverged" populations may still share genomic segments of recent common ancestry, and IMPUTE2 can use this haplotype sharing to improve accuracy. At the same time, the software can ignore haplotypes that are not helpful.

    The benefits of using inclusive reference panels are greatest at low-frequency variants (MAF < 5%), since these variants may be poorly represented in a reference panel from the population of interest (due to sampling effects) but well represented in a panel from a different population (e.g., due to genetic drift).

  • IMPUTE2 can efficiently process large reference panels. You might worry that using all available reference haplotypes would greatly increase the computational burden of imputation, but IMPUTE2 uses an approximation that limits the cost of adding reference haplotypes while maintaining (or improving) accuracy.

Practical suggestions

There are a few program settings that you should be aware of when using IMPUTE2 with an ancestrally diverse reference panel:
  • -k_hap -- This parameter determines how many of the reference haplotypes will be used in the "custom" reference panel for each study individual. The default value is 500, which is a good starting point for modern reference datasets.

    As a rule of thumb, you should set -k_hap to the number of reference haplotypes that you expect to be useful for your study population. For example, suppose you were imputing a Spanish dataset from a reference panel containing 400 Western European haplotypes and 400 African American haplotypes. In this case, you could achieve high accuracy by leaving -k_hap at the default value of 500 since, in any part of the genome, the expected number of reference haplotypes with European ancestry is roughly 400 + 0.2 * (400) = 480. (This calculation assumes that, on average, African American haplotypes have 20% European ancestry.)

    Imputation accuracy is not highly sensitive to -k_hap, which is why this rule of thumb usually provides good results without requiring detailed parameter tuning. If you want advice on the best value for your dataset, please feel free to contact us.

  • -Ne -- This parameter controls the effective population size in the population-genetic model used by IMPUTE2. Different human populations have different effective sizes (as estimated from genetic diversity levels), so it is not obvious how to choose a single -Ne value when using a multi-population reference panel.

    Fortunately, we have found that IMPUTE2 achieves high accuracy across a wide range of -Ne values, with slightly higher accuracy at large values. We therefore recommend a universal -Ne value of 20000, regardless of the study population being imputed or the composition of the reference panel. This will become the default value in our next software release (v2.1.3), but for now you should set it manually.

  • -int -- This command-line option specifies the boundaries of the region to be imputed on the current chromosome, using two numbers. For example, "-int 1 5e6" tells IMPUTE2 to analyze physical positions 1-5,000,000.

    The imputation interval should not be too large, because IMPUTE2's approximation for choosing custom reference panels assumes limited recombination in the region being analyzed, and a large interval weakens that approximation. In theory, it might be desirable to tailor the interval size to the population being imputed (e.g., to use shorter intervals in African populations). In practice, we have found that the exact size of the interval has little effect on imputation accuracy as long as the interval is relatively small (say, < 10 Mb). We therefore recommend choosing the size of the analysis interval for computational convenience, without regard to the ancestry of the study or reference datasets.
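
The settings above can be sketched in a few lines of Python. This is only an illustration, not part of IMPUTE2: the helper names (expected_useful_haplotypes, split_into_intervals, impute2_options) are hypothetical, the ancestry fractions are values you must supply yourself, and the 5 Mb chunk size is an assumption chosen to stay well under the 10 Mb guideline. Only the three options discussed in this section are generated; the input and output file arguments are omitted and must be added for your own setup.

```python
from typing import List, Tuple


def expected_useful_haplotypes(panels: List[Tuple[int, float]]) -> float:
    """Rule of thumb for -k_hap.

    Each tuple is (number of reference haplotypes, assumed average fraction
    of ancestry shared with the study population). The fractions are
    user-supplied assumptions, not values computed by IMPUTE2.
    """
    return sum(n * frac for n, frac in panels)


def split_into_intervals(start: int, end: int,
                         max_size: int = 5_000_000) -> List[Tuple[int, int]]:
    """Split [start, end] into consecutive intervals of at most max_size bp."""
    intervals = []
    lo = start
    while lo <= end:
        hi = min(lo + max_size - 1, end)
        intervals.append((lo, hi))
        lo = hi + 1
    return intervals


def impute2_options(interval: Tuple[int, int], k_hap: int = 500,
                    ne: int = 20000) -> List[str]:
    """Return the -int, -Ne and -k_hap options as a list of argument strings."""
    lo, hi = interval
    return ["-int", str(lo), str(hi), "-Ne", str(ne), "-k_hap", str(k_hap)]


# Worked example from the text: 400 Western European haplotypes plus
# 400 African American haplotypes with an assumed 20% European ancestry.
useful = expected_useful_haplotypes([(400, 1.0), (400, 0.2)])
print(useful)  # 480.0, so the default -k_hap of 500 is adequate

# Options for a hypothetical 12 Mb region, analyzed in chunks of at most 5 Mb.
for interval in split_into_intervals(1, 12_000_000):
    print(" ".join(impute2_options(interval)))
```

Because accuracy is not highly sensitive to -k_hap or to the exact interval size, rough inputs to a calculation like this are usually sufficient.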

How does it work?

As explained above, we believe that the best way to use IMPUTE2 with modern reference panels is to provide all available haplotypes to the program and let it choose which ones to use. Here, we explain how this approach works.

IMPUTE2 does not use population labels or other genome-wide measures of relatedness between individuals, either for the reference haplotypes or the individuals being imputed. Instead, it looks for reference haplotypes that share high sequence identity with the haplotypes of a particular study individual. These haplotypes constitute a "custom" reference panel that can be used to impute missing genotypes in the individual of interest.
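
The idea of choosing a "custom" panel by sequence identity can be illustrated with a toy sketch. This is not IMPUTE2's implementation, which is more involved. It assumes haplotypes are encoded as equal-length lists of 0/1 alleles at shared sites, uses a plain count of mismatches as the measure of identity, and uses hypothetical function names. Note that no population labels appear anywhere in the selection.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length allele vectors differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def choose_custom_panel(study_hap: Sequence[int],
                        reference_haps: List[Sequence[int]],
                        k_hap: int) -> List[int]:
    """Return indices of the k_hap reference haplotypes closest to study_hap.

    Ties are broken by original index because sorted() is stable.
    """
    order = sorted(
        range(len(reference_haps)),
        key=lambda i: hamming_distance(study_hap, reference_haps[i]),
    )
    return order[:k_hap]


# Toy data: one study haplotype and four reference haplotypes.
study = [0, 1, 1, 0, 1]
reference = [
    [0, 1, 1, 0, 1],  # identical to the study haplotype (0 mismatches)
    [1, 0, 0, 1, 0],  # highly diverged (5 mismatches)
    [0, 1, 1, 0, 0],  # 1 mismatch
    [1, 1, 1, 0, 1],  # 1 mismatch
]

print(choose_custom_panel(study, reference, k_hap=2))  # [0, 2]
```

In this toy example the highly diverged haplotype is never selected unless k_hap is large enough to include every reference haplotype, which mirrors the point that unhelpful haplotypes are simply ignored.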

This process is largely insensitive to the ancestral composition of the reference panel: as long as the panel contains haplotypes that share segments of recent common ancestry with individuals in a study, IMPUTE2 can find the shared segments and use them to impute missing alleles. Consequently, the reference panel does not need to be restricted to haplotypes that "match" the ancestry of the study individuals -- it can also include other kinds of haplotypes:
  • Recently admixed haplotypes -- If two or more distinct populations have mixed within the past few hundred years, the resulting admixed population may contain some haplotype segments that are closely related to a population of interest and other segments that are highly diverged. IMPUTE2 can identify the useful segments while ignoring the diverged segments, thereby achieving accurate imputation.

  • Moderately diverged haplotypes -- Even if a set of reference haplotypes comes from a different population than the one you want to impute, it may still provide segments of recent ancestry that can help the imputation. The prevalence of such segments is a complicated function of reference panel size and population history, but in our experience there is often a surprising amount of ancestry sharing between genetically distinct populations.

  • Highly diverged haplotypes -- Reference haplotypes that are highly diverged from your study population are unlikely to be useful for imputation, but such haplotypes are easily identified and ignored by IMPUTE2. In other words, highly diverged reference haplotypes neither help nor hurt imputation accuracy. This is important because the distinction between "moderately" and "highly" diverged populations is not always clear; since it does not hurt to include unhelpful reference haplotypes, we can err on the side of including too many in order to capture more of the moderately diverged ones that improve imputation accuracy.

Expert users will note that the model underlying IMPUTE2 is formally designed to represent genetic variation in a single population. This might imply that the method would have trouble using reference panels that include populations with different linkage disequilibrium patterns, nucleotide diversity levels, and allele frequency spectra. However, we have found that IMPUTE2 is extremely adaptable: it can find segments of shared ancestry in multi-population reference panels despite its simple model of human populations, and it is largely robust to changes in its model parameters. Imputation accuracy might theoretically be improved by more detailed modeling of population relationships (for example, the population labels that IMPUTE2 ignores might sometimes be informative), but we believe that our approach captures most of the potential accuracy in an efficient way.

Published results

We have now published our work supporting these ideas in an article called "Genotype imputation with thousands of genomes" in the open-access journal G3: Genes, Genomes, Genetics. Please cite this paper when using IMPUTE2 with multi-population reference panels like those from the 1,000 Genomes Project.