Best Practices for Imputation
IMPUTE2
includes a rich collection of functions for analyzing
genetic datasets, but it is most commonly used to perform
genotype imputation in genome-wide association studies. To
help investigators perform this kind of analysis, we have
condensed the information on this website into a list of
current best practices.
PRE-IMPUTATION FILTERING OF STUDY GENOTYPES
Before you perform an imputation run with your study
genotypes, you should filter the data to remove
low-quality variants and individuals, as these can degrade
the accuracy of the final results. Standard GWAS quality
control filters are usually sufficient to prepare a
dataset for imputation. It may also help to add an
imputation-based QC step to the filtering process; we will
describe this approach in the near future.
VARIANT POSITION MATCHING ACROSS INPUT FILES
When you provide IMPUTE2 with reference and study
data, the program determines which variants are shared
across datasets by looking at their positions on the
chromosome (as opposed, say, to their rsIDs). Genomic
coordinates change every couple of years as the human
genome reference sequence is updated, so a given SNP may
have different positions in different datasets. To obtain
high-quality results from IMPUTE2, you must make sure that
the variant positions in your input files are mapped to the
same coordinate system, or "assembly".
Genomic assemblies are typically identified by their NCBI
build number (e.g., "b36" or "b37") or their UCSC version
(e.g., "hg18" or "hg19"). Our reference data download
section shows the assembly to which each reference
panel is mapped. If your study genotypes come from a
different assembly than your reference panel, you should
map the positions in your data to the reference coordinate
system by using a tool like the liftOver
program from UCSC. If you need help with this step, please
contact us.
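As a rough illustration of preparing positions for a coordinate
conversion, the sketch below writes study variants as BED lines,
which is one of the input formats the UCSC liftOver program accepts.
BED uses a 0-based start and an exclusive end, so a 1-based SNP
position p becomes the interval [p-1, p). The function name, the
record layout and the variant IDs are hypothetical and are not part
of IMPUTE2 or liftOver.

```python
def to_bed_lines(variants):
    """Convert (chrom, pos, variant_id) records into BED lines.

    Assumes `pos` is a 1-based position on the study assembly.
    The record layout is an assumption made for this sketch.
    """
    lines = []
    for chrom, pos, variant_id in variants:
        if pos < 1:
            raise ValueError(f"position must be >= 1, got {pos}")
        chrom = str(chrom)
        # UCSC tools expect chromosome names such as "chr1".
        chrom_name = chrom if chrom.startswith("chr") else f"chr{chrom}"
        # BED: 0-based start, exclusive end.
        lines.append(f"{chrom_name}\t{pos - 1}\t{pos}\t{variant_id}")
    return lines


if __name__ == "__main__":
    # Hypothetical variants, for illustration only.
    for line in to_bed_lines([("1", 1000, "snp_a"), ("chr2", 2500, "snp_b")]):
        print(line)
```

After the conversion, the end column of each lifted BED line gives
the 1-based position on the target assembly. Variants that fail to
map should be dropped from the study data before imputation.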
STRAND ALIGNMENT BETWEEN STUDY AND REFERENCE DATA
It is absolutely essential to align your study genotypes
to the same strand convention as the reference panel from
which you are imputing. Variants that are aligned to
different strands may have different alleles (e.g., A/G in
one dataset and T/C in another) or the same alleles at
disparate frequencies (e.g., A/T in two datasets, where
the 'A' allele occurs at 5% frequency in one dataset and
95% frequency in the other), and either of these scenarios
can decrease imputation quality.
Most publicly available reference panels are aligned to
the '+' strand of the human genome reference sequence, so
the goal is to align your genotypes to the same
convention. The best way to do this is to obtain assay
information from the vendor who provided your genotypes;
once you have this information, you can align your
genotypes either manually or with the options described here. If you
cannot recover the strand alignment from the original
assay, you can use other
options that tell IMPUTE2 to make
educated guesses.
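The two scenarios described above can be told apart with a simple
allele comparison. The following is a minimal sketch for biallelic
SNPs only, not the logic IMPUTE2 itself uses; the function name and
return labels are hypothetical. A/T and C/G SNPs are reported as
ambiguous because complementing their alleles gives back the same
pair, so only allele frequencies or assay information can resolve
them.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_strand(study_alleles, ref_alleles):
    """Compare the allele pair of a SNP in the study and reference data.

    Returns one of "same", "flip", "ambiguous" or "mismatch".
    Only A/C/G/T alleles are handled (an assumption of this sketch).
    """
    study = frozenset(a.upper() for a in study_alleles)
    ref = frozenset(a.upper() for a in ref_alleles)
    for allele in study | ref:
        if allele not in COMPLEMENT:
            raise ValueError(f"unsupported allele: {allele!r}")
    flipped = frozenset(COMPLEMENT[a] for a in study)
    if flipped == study:
        # A/T or C/G: alleles look identical on either strand.
        return "ambiguous"
    if study == ref:
        return "same"
    if flipped == ref:
        # e.g. A/G in the study data and T/C in the reference.
        return "flip"
    return "mismatch"


if __name__ == "__main__":
    print(classify_strand(("A", "G"), ("T", "C")))
    print(classify_strand(("A", "T"), ("A", "T")))
```

Variants labelled "flip" can be complemented in the study data;
those labelled "mismatch" usually point to a position or annotation
problem and are worth inspecting by hand.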
CHOOSING A REFERENCE PANEL
Historically, most GWAS investigators have tried to choose
reference panels that match the ancestry of their study
samples. We have developed a different approach: first
supply IMPUTE2 with a worldwide reference panel,
then let the program decide which haplotypes to use for
imputation. This strategy increases accuracy and avoids
difficult choices about which haplotypes to include in the
reference set. We currently recommend this approach for
imputing genotypes in any human population; you can read
more about the strategy here,
and you can download state-of-the-art reference haplotypes
here.
GENOME-WIDE IMPUTATION
It can be complicated and computationally demanding to
impute thousands of individuals across the entire genome.
We provide a few mechanisms to help with this process:
- IMPUTE2 includes command-line parameters
that can be used to split the genome into discrete
chunks for parallel analysis on a computing cluster.
These parameters allow flexible partitioning of the
genome with minimal manipulation of input files. See here for
suggestions on how to use this functionality.
- IMPUTE2 is an efficient imputation method,
but it still requires substantial computing time to
process the whole genome in a large number of
individuals. We have recently developed an approach
called "pre-phasing" that greatly reduces the
computational burden of imputation while sacrificing
only a little accuracy; you can read more about the
approach here. We now
recommend this as the standard way of performing
genome-wide imputation, although we still prefer the
original IMPUTE2 MCMC algorithm for maximizing
accuracy in smaller regions.
- Sequence-based reference panels contain large
numbers of rare and low-frequency variants, which can
drive up the computational cost of imputation. When
computing power is limited, it may be desirable to
remove some of these variants (e.g., those with very
low frequencies in the population of interest) before
running imputation. To facilitate this process, we
have added the -filt_rules_l
option, which can flexibly remove reference variants
based on command-line input to an IMPUTE2 run.
You can see an example application of this approach
and some guidelines for using it here.
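To make the chunking idea concrete, the sketch below splits a
chromosome range into consecutive, non-overlapping intervals that
could each be submitted as a separate cluster job. The 5 Mb default
chunk size and the function name are assumptions made for this
example, not recommendations taken from this page. Each (lower,
upper) pair is the kind of interval one would hand to the IMPUTE2
interval parameter (to our recollection, -int; please confirm the
exact syntax in the program options documentation).

```python
def chunk_intervals(start, end, chunk_size=5_000_000):
    """Split [start, end] into consecutive inclusive intervals.

    Positions are base-pair coordinates; `chunk_size` is a
    hypothetical default chosen for illustration.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    if start > end:
        raise ValueError("start must not exceed end")
    intervals = []
    lower = start
    while lower <= end:
        # The last chunk is truncated at the end of the range.
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


if __name__ == "__main__":
    for lower, upper in chunk_intervals(1, 12_000_000):
        print(lower, upper)
```

Because the intervals do not overlap, the per-chunk output files can
be concatenated in order once all jobs have finished.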
POST-IMPUTATION FILTERING
It is standard practice to perform additional filtering
once a batch of imputation runs has completed, mainly to
remove poorly imputed variants that might behave badly in
association tests. We are currently preparing some
recommendations for this process; we will post them on the
website as soon as they are ready.
ASSOCIATION TESTING
We distribute a program called SNPTEST that
contains a powerful suite of statistical tests for
association between phenotypes and imputed genotypes. You
can download the software and read more about its
functions at the SNPTEST
website.
FOLLOW-UP IMPUTATION OF PUTATIVE ASSOCIATIONS
Once you have performed genome-wide imputation and
association testing, you may want to take a closer look at
regions with interesting associations. To get the best
possible results, we recommend re-imputing this subset of
regions with more intensive program settings:
- In contrast to the pre-phasing
approach that we recommend for genome-wide imputation,
we suggest using the standard IMPUTE2 MCMC
algorithm for follow-up imputation. This method takes
longer to run in each region, but it should lead to
higher accuracy (especially at low-frequency variants)
and remain computationally feasible when run on a
limited portion of the genome.
- If time permits, the overall accuracy may be
improved by increasing the value of the -k
parameter.
- If time permits, the accuracy at low-frequency
variants may be improved by increasing the size of the
-buffer
region—say, from the default value of 250 kb to
1000 kb (1 Mb).
Once you have re-imputed each region of interest, you
should perform the association tests again to obtain a
high-resolution estimate of the association landscape.
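One way to pick the regions to re-impute is to place a window around
each associated variant and merge windows that overlap, so that
nearby signals are imputed together. The sketch below shows this
under stated assumptions: the 500 kb flank and the function name are
hypothetical, and the flank is the analysis window itself, separate
from the -buffer region that IMPUTE2 adds around each interval.

```python
def follow_up_regions(hit_positions, flank=500_000):
    """Build merged inclusive windows around association hits.

    `hit_positions` are base-pair positions on one chromosome.
    `flank` is a hypothetical window half-width, not a recommended
    setting.
    """
    regions = []
    for pos in sorted(hit_positions):
        lower = max(1, pos - flank)
        upper = pos + flank
        if regions and lower <= regions[-1][1] + 1:
            # Overlapping or adjacent: extend the previous window.
            previous_lower, previous_upper = regions[-1]
            regions[-1] = (previous_lower, max(previous_upper, upper))
        else:
            regions.append((lower, upper))
    return regions


if __name__ == "__main__":
    for lower, upper in follow_up_regions([1_000_000, 1_200_000, 5_000_000]):
        print(lower, upper)
```

Each resulting interval can then be re-imputed with the more
intensive settings listed above, such as larger -k and -buffer
values.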