Best Practices for Imputation
IMPUTE2
includes a rich collection of functions for analyzing genetic datasets,
but it is most commonly used to perform genotype imputation in
genome-wide association studies. To help investigators perform this
kind of analysis, we have condensed the information on this website
into a list of current best practices.
PRE-IMPUTATION FILTERING OF STUDY GENOTYPES
Before you perform an imputation run with your study genotypes, you
should filter the data to remove low-quality variants and individuals,
as these can degrade the accuracy of the final results. Standard GWAS
quality control filters are usually sufficient to prepare a dataset for
imputation. It may also help to add an imputation-based QC step to the
filtering process; we will describe this approach in the near future.
VARIANT POSITION MATCHING ACROSS INPUT FILES
When you provide IMPUTE2 with reference and study data, the
program determines which variants are shared across datasets by looking
at their positions on the chromosome (as opposed, say, to their rsIDs).
Genomic coordinates change every couple of years as the human genome
reference sequence is updated, so a given SNP may have different
positions in different datasets. To obtain high-quality results from
IMPUTE2, you must make sure that the variant positions in all of your
input files are mapped to the same coordinate system, or "assembly".
Genomic assemblies are typically identified by their NCBI build number
(e.g., "b36" or "b37") or their UCSC version (e.g., "hg18" or "hg19").
Our reference data download section
shows the assembly to which each reference panel is mapped. If your
study genotypes come from a different assembly than your reference
panel, you should map the positions in your data to the reference
coordinate system by using a tool like the liftOver program
from UCSC. If you need help with this step, please contact us.
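As a quick sanity check before imputation, it can help to measure how many study positions also appear in the reference panel. The sketch below is illustrative only: the function name and the toy positions are hypothetical, and it assumes you have already read the positions for one chromosome into Python lists. A very low overlap is only a hint that the two datasets may be on different assemblies, not proof.

```python
def position_overlap(study_positions, reference_positions):
    """Return the fraction of study positions that also occur in the reference."""
    study = set(study_positions)
    if not study:
        return 0.0
    reference = set(reference_positions)
    return len(study & reference) / len(study)


# Toy positions for a single chromosome (made up for illustration).
study = [1000, 2500, 4000, 7200]
reference = [1000, 2500, 3100, 4000, 5000, 7200, 9000]

# Crude stand-in for coordinates taken from a different assembly.
shifted = [p + 137 for p in study]

print(position_overlap(study, reference))    # 1.0
print(position_overlap(shifted, reference))  # 0.0
```

If the overlap is unexpectedly low, check the assembly of each file before running IMPUTE2.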
STRAND ALIGNMENT BETWEEN STUDY AND REFERENCE DATA
It is absolutely essential to align your study genotypes to the same
strand convention as the reference panel from which you are imputing.
Variants that are aligned to different strands may have different
alleles (e.g., A/G in one dataset and T/C in another) or the same
alleles at disparate frequencies (e.g., A/T in two datasets, where the
'A' allele occurs at 5% frequency in one dataset and 95% frequency in
the other), and either of these scenarios can decrease imputation
quality.
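The two scenarios above can be illustrated with a small classifier that compares the alleles of one variant in the study data and in the reference panel. This is a minimal sketch, not part of IMPUTE2: the function name and the category labels are hypothetical, and it only handles biallelic SNPs coded as A, C, G, and T. A/T and C/G variants are reported as ambiguous because their alleles look identical on both strands, so allele frequencies or assay information would be needed to resolve them.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_strand(study_alleles, reference_alleles):
    """Classify one biallelic SNP by comparing its two allele sets.

    Returns "ambiguous", "same", "flipped", or "mismatch".
    """
    study = frozenset(study_alleles)
    reference = frozenset(reference_alleles)
    flipped = frozenset(COMPLEMENT[a] for a in study)

    if study == flipped:
        # A/T and C/G SNPs read the same on both strands, so the
        # alleles alone cannot reveal the strand.
        return "ambiguous"
    if study == reference:
        return "same"
    if flipped == reference:
        return "flipped"
    return "mismatch"


print(classify_strand(("A", "G"), ("A", "G")))  # same
print(classify_strand(("A", "G"), ("T", "C")))  # flipped
print(classify_strand(("A", "T"), ("A", "T")))  # ambiguous
print(classify_strand(("A", "G"), ("A", "C")))  # mismatch
```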
Most publicly available reference panels are aligned to the '+' strand
of the human genome reference sequence, so the goal is to align your
genotypes to the same convention. The best way to do this is to obtain
assay information from the vendor who provided your genotypes; once you
have this information, you can align your genotypes either manually or
with the options described here.
If you cannot recover the strand alignment from the original assay, you
can use other
options that tell IMPUTE2 to make educated guesses.
CHOOSING A REFERENCE PANEL
Historically, most GWAS investigators have tried to choose reference
panels that match the ancestry of their study samples. We have
developed a different approach: first supply IMPUTE2 with a
worldwide reference panel, then let the program decide which haplotypes
to use for imputation. This strategy increases accuracy and avoids
difficult choices about which haplotypes to include in the reference
set. We currently recommend this approach for imputing genotypes in any
human population; you can read more about the strategy here, and you
can download state-of-the-art reference haplotypes here.
GENOME-WIDE IMPUTATION
It can be complicated and computationally demanding to impute thousands
of individuals across the entire genome. We provide a couple of
mechanisms to help with this process:
- IMPUTE2 includes command-line parameters that can
be used to split the genome into discrete chunks for parallel analysis
on a computing cluster. These parameters allow flexible partitioning of
the genome with minimal manipulation of input files. See here for suggestions on how to
use this functionality.
- IMPUTE2 is an efficient imputation method, but it
still requires substantial computing time to process the whole genome
in a large number of individuals. We have recently developed an
approach called "pre-phasing" that greatly reduces the computational
burden of imputation while sacrificing only a little accuracy; you can
read more about the approach here. We
now recommend this as the standard way of performing genome-wide
imputation, although we still prefer the default IMPUTE2
algorithm for analyzing smaller regions.
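To make the chunking idea concrete, the sketch below generates non-overlapping intervals that cover a chromosome region, so that each interval can be submitted as a separate job. The function name and the 5 Mb default chunk size are assumptions chosen for illustration; they are not recommendations from this page, and the interval boundaries would still need to be passed to IMPUTE2 through its own command-line parameters.

```python
def genome_chunks(start, end, chunk_size=5_000_000):
    """Yield inclusive (chunk_start, chunk_end) pairs covering [start, end]."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk_start = start
    while chunk_start <= end:
        # The last chunk is truncated at the end of the region.
        chunk_end = min(chunk_start + chunk_size - 1, end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end + 1


# Example: a 12 Mb region split into 5 Mb chunks.
for chunk_start, chunk_end in genome_chunks(1, 12_000_000):
    print(chunk_start, chunk_end)
# 1 5000000
# 5000001 10000000
# 10000001 12000000
```

Because the chunks do not overlap and together cover the whole region, their outputs can be concatenated in order once all jobs have finished.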
POST-IMPUTATION FILTERING
It is standard practice to perform additional filtering once a batch of
imputation runs has completed, mainly to remove poorly imputed variants
that might behave badly in association tests. We are currently
preparing some recommendations for this process; we will post them on
the website as soon as they are ready.
ASSOCIATION TESTING
We distribute a program called SNPTEST that contains a powerful
suite of statistical tests for association between phenotypes and
imputed genotypes. You can download the software and read more about
its functions at the SNPTEST
website.
FOLLOW-UP IMPUTATION OF PUTATIVE ASSOCIATIONS
Once you have performed genome-wide imputation and association testing,
you may want to take a closer look at regions with interesting
associations. To get the best possible results, we recommend
re-imputing this subset of regions with more intensive program
settings:
- In contrast to the pre-phasing
approach that we recommend for genome-wide imputation, we suggest using
the standard IMPUTE2 MCMC algorithm for follow-up imputation.
This method takes longer to run in each region, but it should lead to
higher accuracy (especially at low-frequency variants) and remain
computationally feasible when run on a limited portion of the genome.
- If time permits, the overall accuracy may be improved by
increasing the value of the -k
parameter.
- If time permits, the accuracy at low-frequency variants
may be improved by increasing the size of the -buffer
region—say, from the default value of 250 kb to 1000 kb
(1 Mb).
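The effect of a larger buffer can be seen with some simple arithmetic. The sketch below assumes that the buffer is a flanking region added on each side of the analysis interval; the function name and the example coordinates are hypothetical. It only computes the span of the genome that would be considered, and does not run IMPUTE2.

```python
def buffered_interval(start, end, buffer_kb=250):
    """Return the interval widened by buffer_kb kilobases on each side.

    The lower bound is clamped at position 1.
    """
    buffer_bp = buffer_kb * 1000
    return max(1, start - buffer_bp), end + buffer_bp


# A 5 Mb region of interest with the default and the enlarged buffer.
print(buffered_interval(20_000_000, 25_000_000))        # (19750000, 25250000)
print(buffered_interval(20_000_000, 25_000_000, 1000))  # (19000000, 26000000)
```

Under this assumption, moving from 250 kb to 1000 kb adds 1.5 Mb of flanking data to each region, which is why the larger setting costs more computing time.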
Once you have re-imputed each region of interest, you should perform
the association tests again to obtain a high-resolution estimate of the
association landscape.