Merging reference panels

This page explains our suggested procedure for merging multiple reference panels with IMPUTE2.



Problem statement

Modern genotyping and sequencing technologies are generating a variety of reference datasets that can be used for genotype imputation in association studies. Combining reference panels from different populations can often improve imputation accuracy (e.g., see Howie et al. 2011), but it is not clear how best to merge panels that are genotyped at different sets of variants.

Howie et al. 2009 proposed a solution for the special case where one reference panel contains a subset of the variants in another reference panel. We previously released a combined 1,000 Genomes + HapMap 3 panel that takes advantage of this framework, and it was also used in the WTCCC2 studies.

Many association studies are now using the latest 1,000 Genomes data to drive their genotype imputation, but they may also have sequenced additional individuals from the population being studied. It makes sense to combine these resources in order to use all available reference information, but in this case each reference panel will contain many variants that are not found in the other, so the "hierarchical" variant framework of Howie et al. 2009 no longer applies.

With this in mind, we have devised a new strategy for combining reference panels created by different sequencing or genotyping studies.


Our approach

There are many possible ways to merge two reference panels. We are exploring several of these options, but we decided to start with the simple approach depicted in the figure below. The top panel of this figure shows two reference panels and a GWAS cohort; you can think of the rows as individuals and the columns as positions along the genome. Each vertical line represents a genotyped variant in a given panel, and each reference panel includes variants that are not found in the other.

[Figure: Merging reference panels]

We impute the untyped variants in this figure in three steps:
  1. Impute the variants that are specific to Panel 0 (red) into Panel 1 (blue). Variants shown in grey do not inform the imputation.
  2. Impute the variants that are specific to Panel 1 (blue) into Panel 0 (red). Variants shown in grey do not inform the imputation.
  3. Now that we have imputed the two reference panels up to the union of their variants, treat the imputed haplotypes as known (i.e., take the best-guess haplotypes) and impute the GWAS cohort in the usual way.
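
The bookkeeping behind Steps 1 and 2 can be sketched with plain set operations. This is a toy illustration rather than IMPUTE2's internal implementation: variants are reduced to base-pair positions, and all positions and variable names are hypothetical.

```python
# Toy illustration only: each variant is represented by a base-pair position.
panel0_variants = {101, 205, 310, 422}  # hypothetical positions typed in Panel 0
panel1_variants = {101, 250, 310, 500}  # hypothetical positions typed in Panel 1

# Step 1: variants specific to Panel 0 are imputed into Panel 1.
imputed_into_panel1 = panel0_variants - panel1_variants

# Step 2: variants specific to Panel 1 are imputed into Panel 0.
imputed_into_panel0 = panel1_variants - panel0_variants

# After Steps 1 and 2, both panels are defined at the union of their variants,
# which is the variant set used to impute the GWAS cohort in Step 3.
merged_variants = panel0_variants | panel1_variants

# Sanity check: typed plus imputed variants cover the union in each panel.
assert panel0_variants | imputed_into_panel0 == merged_variants
assert panel1_variants | imputed_into_panel1 == merged_variants

print("Imputed into Panel 1:", sorted(imputed_into_panel1))
print("Imputed into Panel 0:", sorted(imputed_into_panel0))
print("Merged variant set:  ", sorted(merged_variants))
```
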
This process can be performed with IMPUTE2 (version 2.3 and later) in a streamlined way: all you have to do is add the -merge_ref_panels flag to the command line. You can see a working example command here.
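
As a rough illustration of what such a command might look like, the Python sketch below assembles an argument list and prints it without running anything. Only -merge_ref_panels and -int are taken from this page; the remaining flags (-m, -h, -l, -g, -o) are assumed to be the usual IMPUTE2 input and output options and should be checked against the program documentation. The function name and all file names are hypothetical.

```python
import shlex


def build_merge_command(map_file, haps, legends, study_gens, interval,
                        out_prefix, impute2="impute2"):
    """Assemble an IMPUTE2 argument list that merges two reference panels.

    Hypothetical helper: it only builds the list and does not run the program.
    """
    if len(haps) != 2 or len(legends) != 2:
        raise ValueError("expected two haplotype files and two legend files")
    start, end = interval
    return [
        impute2,
        "-merge_ref_panels",               # flag described on this page
        "-m", map_file,                    # assumed: recombination map
        "-h", haps[0], haps[1],            # assumed: Panel 0, then Panel 1
        "-l", legends[0], legends[1],      # assumed: legends in the same order
        "-g", study_gens,                  # assumed: GWAS cohort genotypes
        "-int", str(start), str(end),      # imputation interval
        "-o", out_prefix,                  # assumed: output file prefix
    ]


# All file names below are placeholders.
cmd = build_merge_command(
    map_file="chr22.map",
    haps=["panel0.haps", "panel1.haps"],
    legends=["panel0.legend", "panel1.legend"],
    study_gens="study.gens",
    interval=(20400000, 20500000),
    out_prefix="study.merged.impute2",
)
print(shlex.join(cmd))
```

Building the command as a list keeps the two haplotype files and the two legend files in matching order, which is the part most easily mixed up when a second panel is added.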


Extensions for advanced users

For finer control of the merging step, you can supply two values to -k_hap on the command line -- for example, '-k_hap 500 200'. This setting tells IMPUTE2 to use 500 haplotypes from Panel 0 in Step 1 and 200 haplotypes from Panel 1 in Step 2. These values should reflect the number of haplotypes in each panel that are expected to be useful for imputation in the study population, which could be less than the total number if either panel is multi-ethnic.
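
One possible way to choose the two values is to count the haplotypes in each panel that come from populations relevant to the study. The sketch below assumes that per-individual population labels are available for each panel and that each individual contributes two haplotypes (autosomes). The labels, panel sizes, and function name are hypothetical, and counting by label is only one heuristic, not a rule from IMPUTE2.

```python
def count_useful_haplotypes(sample_populations, relevant_populations):
    """Count haplotypes (two per diploid individual) from relevant populations."""
    n_individuals = sum(1 for pop in sample_populations
                        if pop in relevant_populations)
    return 2 * n_individuals


# Hypothetical population labels, one per individual in each panel.
panel0_samples = ["EUR"] * 250 + ["AFR"] * 150 + ["EAS"] * 100
panel1_samples = ["EUR"] * 100 + ["SAS"] * 40

# Hypothetical choice: the study population is European.
relevant = {"EUR"}

k0 = count_useful_haplotypes(panel0_samples, relevant)  # 2 * 250 = 500
k1 = count_useful_haplotypes(panel1_samples, relevant)  # 2 * 100 = 200

# First value applies to Panel 0 (Step 1), second to Panel 1 (Step 2).
k_hap_args = ["-k_hap", str(k0), str(k1)]
print(" ".join(k_hap_args))
```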

By default, IMPUTE2 does not print the merged reference panel (the outcome of Steps 1 and 2 above); the merging is done internally, and the output shows only the imputed genotypes for the GWAS cohort. If you want the program to output the merged panel, you can replace -merge_ref_panels with one of two options:
  • -merge_ref_panels_output_ref -- This option tells the software to merge the two reference panels and print the results in IMPUTE2 reference file format: one legend file and one haplotypes file. See the link for more information.
  • -merge_ref_panels_output_gen -- This option tells the software to merge the two reference panels and print the results in IMPUTE2 .gen file format. Phase information is ignored when creating this file, which can be useful if you want to re-phase the merged reference panel. See the link for more information.
Normally, these options print the merged reference panel within the region specified by the -int argument. If you want to include the buffer regions in the output, you should add the -include_buffer_in_output flag to your command line.
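
The choice among the three merging flags can be summarized in a small helper. This is a minimal sketch for building a command line from a script: the function name and its arguments are hypothetical, while the flag names are the ones described above.

```python
def merge_flags(output=None, include_buffer=False):
    """Return the merging flags for the requested output mode.

    output: None (merge internally, print only the imputed GWAS genotypes),
            "ref" (print merged panel as legend + haplotypes files), or
            "gen" (print merged panel in .gen format, without phase).
    """
    flag_by_output = {
        None: "-merge_ref_panels",
        "ref": "-merge_ref_panels_output_ref",
        "gen": "-merge_ref_panels_output_gen",
    }
    if output not in flag_by_output:
        raise ValueError(f"unknown output mode: {output!r}")
    # The output variants replace -merge_ref_panels rather than adding to it.
    flags = [flag_by_output[output]]
    if include_buffer:
        # Also print the buffer regions outside the -int interval.
        flags.append("-include_buffer_in_output")
    return flags


print(merge_flags())
print(merge_flags("ref"))
print(merge_flags("gen", include_buffer=True))
```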


Publication and citation

Our approach for merging reference panels has not yet been published outside this website. We have tested the method on realistic datasets, and it has performed well in all of our analyses. We are actively documenting this approach and comparing it with other strategies; we aim to report the results of these experiments and the details of our methodology as soon as possible.

In the meantime, we are happy to answer thoughtful questions and to hear about your experiences with this new functionality. If you would like to send comments, please do so through our mailing list.