Best Practices for Imputation
IMPUTE2 includes a rich collection of functions for analyzing genetic datasets, but it is most commonly used to perform genotype imputation in genome-wide association studies. To help investigators perform this kind of analysis, we have condensed the information on this website into a list of current best practices.
PRE-IMPUTATION FILTERING OF STUDY GENOTYPES
Before you perform an imputation run with your study genotypes, you should filter the data to remove low-quality variants and individuals, as these can degrade the accuracy of the final results. Standard GWAS quality control filters are usually sufficient to prepare a dataset for imputation. It may also help to add an imputation-based QC step to the filtering process; we will describe this approach in the near future.
VARIANT POSITION MATCHING ACROSS INPUT FILES
When you provide IMPUTE2 with reference and study data, the program determines which variants are shared across datasets by comparing their positions on the chromosome (rather than, say, their rsIDs). Genomic coordinates change every couple of years as the human genome reference sequence is updated, so a given SNP may have different positions in different datasets. To obtain high-quality results from IMPUTE2, you must make sure that the variant positions in all of your input files are mapped to the same coordinate system, or "assembly".
Genomic assemblies are typically identified by their NCBI build number (e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19"). Our reference data download section shows the assembly to which each reference panel is mapped. If your study genotypes come from a different assembly than your reference panel, you should map the positions in your data to the reference coordinate system by using a tool like the liftOver program from UCSC. If you need help with this step, please contact us.
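One quick sanity check for an assembly mismatch is to measure how many study positions also appear in the reference panel on the same chromosome. The sketch below is a heuristic of our own for illustration, not an IMPUTE2 feature: the function name and the toy positions are hypothetical, and a low overlap only suggests, rather than proves, that the two files use different coordinate systems.

```python
def position_overlap(study_positions, reference_positions):
    """Return the fraction of study positions that also occur in the reference."""
    study = set(study_positions)
    if not study:
        return 0.0
    reference = set(reference_positions)
    return len(study & reference) / len(study)


# Toy positions for a single chromosome (hypothetical values).
reference = [10100, 10250, 10400, 10900, 11300]
study_same_build = [10100, 10400, 11300]
study_other_build = [20311, 20611, 21511]

print(position_overlap(study_same_build, reference))
print(position_overlap(study_other_build, reference))
```

An overlap near zero for a genotyping array whose variants should mostly be present in the panel is a strong hint that the positions need to be remapped before imputation.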
STRAND ALIGNMENT BETWEEN STUDY AND REFERENCE DATA
It is absolutely essential to align your study genotypes to the same strand convention as the reference panel from which you are imputing. Variants that are aligned to different strands may have different alleles (e.g., A/G in one dataset and T/C in another) or the same alleles at disparate frequencies (e.g., A/T in two datasets, where the 'A' allele occurs at 5% frequency in one dataset and 95% frequency in the other), and either of these scenarios can decrease imputation quality.
Most publicly available reference panels are aligned to the '+' strand of the human genome reference sequence, so the goal is to align your genotypes to the same convention. The best way to do this is to obtain assay information from the vendor who provided your genotypes; once you have this information, you can align your genotypes either manually or with the options described here. If you cannot recover the strand alignment from the original assay, you can use other options that tell IMPUTE2 to make educated guesses.
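The two scenarios described above can be told apart by comparing allele pairs at each shared variant. The following is a minimal sketch, assuming biallelic SNPs coded as A/C/G/T; the function name and category labels are hypothetical and do not correspond to IMPUTE2 options. It shows why A/T and C/G variants cannot be resolved from allele codes alone: their complement is the same pair of alleles.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_strand(study_alleles, reference_alleles):
    """Compare a study SNP's allele pair with the reference pair.

    Returns one of "match", "flip", "ambiguous", or "mismatch".
    """
    study = frozenset(study_alleles)
    reference = frozenset(reference_alleles)
    flipped = frozenset(COMPLEMENT[a] for a in study)

    if study != reference and flipped != reference:
        # Alleles disagree on either strand: likely a different variant.
        return "mismatch"
    if study == flipped:
        # A/T or C/G: the pair looks the same on both strands, so the
        # allele codes alone cannot reveal the strand.
        return "ambiguous"
    if study == reference:
        return "match"
    return "flip"


print(classify_strand(("A", "G"), ("T", "C")))
print(classify_strand(("A", "T"), ("A", "T")))
```

Variants in the "ambiguous" category are the ones where strand has to come from assay information or from a comparison of allele frequencies.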
CHOOSING A REFERENCE PANEL
Historically, most GWAS investigators have tried to choose reference panels that match the ancestry of their study samples. We have developed a different approach: first supply IMPUTE2 with a worldwide reference panel, then let the program decide which haplotypes to use for imputation. This strategy increases accuracy and avoids difficult choices about which haplotypes to include in the reference set. We currently recommend this approach for imputing genotypes in any human population; you can read more about the strategy here, and you can download state-of-the-art reference haplotypes here.
GENOME-WIDE IMPUTATION
It can be complicated and computationally demanding to impute thousands of individuals across the entire genome. We provide two mechanisms to help with this process:
-
IMPUTE2 includes command-line parameters that can be used to split the genome into discrete chunks for parallel analysis on a computing cluster. These parameters allow flexible partitioning of the genome with minimal manipulation of input files. See here for suggestions on how to use this functionality.
-
IMPUTE2 is an efficient imputation method, but it still requires substantial computing time to process the whole genome in a large number of individuals. We have recently developed an approach called "pre-phasing" that greatly reduces the computational burden of imputation while sacrificing only a little accuracy; you can read more about the approach here. We now recommend this as the standard way of performing genome-wide imputation, although we still prefer the default IMPUTE2 algorithm for analyzing smaller regions.
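Splitting a chromosome into chunks amounts to generating consecutive, non-overlapping intervals, each of which becomes one job on the cluster. The sketch below illustrates the bookkeeping only. The function name, the 5 Mb chunk size, and the chromosome length are hypothetical, and the `-int <lower> <upper>` formatting reflects our understanding of the interval option, so check it against the program documentation before use.

```python
def make_chunks(first_pos, last_pos, chunk_size):
    """Split [first_pos, last_pos] into consecutive inclusive intervals."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    lower = first_pos
    while lower <= last_pos:
        upper = min(lower + chunk_size - 1, last_pos)
        chunks.append((lower, upper))
        lower = upper + 1
    return chunks


# Hypothetical 12 Mb region split into 5 Mb chunks.
chunks = make_chunks(1, 12_000_000, 5_000_000)
for lower, upper in chunks:
    # Assumed argument format; verify against the IMPUTE2 documentation.
    print(f"-int {lower} {upper}")
```

Because the intervals are defined by coordinates rather than by separate files, the same input files can be reused for every job.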
POST-IMPUTATION FILTERING
It is standard practice to perform additional filtering once a batch of imputation runs has completed, mainly to remove poorly imputed variants that might behave badly in association tests. We are currently preparing some recommendations for this process; we will post them on the website as soon as they are ready.
ASSOCIATION TESTING
We distribute a program called SNPTEST that contains a powerful suite of statistical tests for association between phenotypes and imputed genotypes. You can download the software and read more about its functions at the SNPTEST website.
FOLLOW-UP IMPUTATION OF PUTATIVE ASSOCIATIONS
Once you have performed genome-wide imputation and association testing, you may want to take a closer look at regions with interesting associations. To get the best possible results, we recommend re-imputing this subset of regions with more intensive program settings:
-
In contrast to the pre-phasing approach that we recommend for genome-wide imputation, we suggest using the standard IMPUTE2 MCMC algorithm for follow-up imputation. This method takes longer to run in each region, but it should lead to higher accuracy (especially at low-frequency variants) and remain computationally feasible when run on a limited portion of the genome.
-
If time permits, the overall accuracy may be improved by increasing the value of the -k parameter.
-
If time permits, the accuracy at low-frequency variants may be improved by increasing the size of the -buffer region, say, from the default value of 250 kb to 1000 kb (1 Mb).
Once you have re-imputed each region of interest, you should perform the association tests again to obtain a high-resolution estimate of the association landscape.
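Choosing the regions to re-impute can be sketched as placing a window around each associated variant and merging windows that overlap, so that nearby signals are handled in a single run. This is an illustrative sketch only: the function name, the hit positions, and the 500 kb half-width are hypothetical choices, not recommendations from this page.

```python
def followup_regions(hit_positions, half_width):
    """Merge windows of +/- half_width around each hit into disjoint regions."""
    regions = []
    for pos in sorted(hit_positions):
        lower = max(1, pos - half_width)
        upper = pos + half_width
        if regions and lower <= regions[-1][1]:
            # Overlaps the previous region: extend it instead of starting a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], upper))
        else:
            regions.append((lower, upper))
    return regions


# Hypothetical positions of top association signals on one chromosome.
hits = [1_200_000, 1_500_000, 9_000_000]
for lower, upper in followup_regions(hits, 500_000):
    print(lower, upper)
```

Each resulting region can then be re-imputed with the more intensive settings listed above and passed back through the association tests.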