Pre-phasing imputation using SHAPEIT and IMPUTE2
The pre-phasing idea
As haplotype reference panels continue to grow in size (number of
haplotypes and number of SNPs), the process of genotype imputation for
GWAS becomes increasingly computationally intensive. Pre-phasing speeds
up this process by first estimating haplotypes for your GWAS samples,
and then imputing alleles into these haplotypes from a reference
haplotype panel. The phasing of the GWAS samples needs to be done only
once, so when a new haplotype reference panel becomes available only
the imputation step, which is very quick, has to be rerun. A paper
detailing this approach is currently in review [1].
SHAPEIT and IMPUTE2
We currently recommend that pre-phasing based imputation is carried out
using a combination of the programs SHAPEIT and IMPUTE2. There is good
evidence that together these tools provide the highest accuracy, along
with computational efficiency, at all stages of phasing and imputation.
SHAPEIT is a new program for haplotype phasing that has several
desirable properties and features.
• It has been shown to be more accurate at haplotype estimation on
several GWAS scale datasets than IMPUTE2, MACH, BEAGLE and fastPHASE.
• It can phase whole chromosomes at once.
• It can handle GWAS datasets consisting of unrelated individuals,
trios and duos.
• The algorithm has linear complexity in the number of SNPs, the number
of samples and the number of conditioning haplotypes used in each
iteration of the phasing.
• The program can take advantage of multi-threading to decrease
computational time on multi-core machines.
The program is available free for academic use at http://www.shapeit.fr/
A paper detailing the method and its performance is to appear soon in
Nature Methods [2].
IMPUTE2 is a program that carries
out imputation of genotypes from a reference panel of haplotypes. It
can take either unphased or phased GWAS data as input. It has been
shown to be more accurate than other imputation programs [3,4,5].
The program is available free for academic use at
https://mathgen.stats.ox.ac.uk/impute/impute_v2.html
EXAMPLE
To illustrate how SHAPEIT and IMPUTE2 can be used together to carry out
pre-phasing imputation we have put together a set of example files that
can be downloaded from
https://mathgen.stats.ox.ac.uk/impute/SHAPEIT+IMPUTE.pre-phasing.examples.tgz
The files can be unpacked using
tar zxvf SHAPEIT+IMPUTE.pre-phasing.examples.tgz
and consist of
gwas_data_chr10.gen
– a set of genotype data on 100 samples and 23,231 SNPs. This file is
in the format used by IMPUTE2 and SNPTEST (see the documentation of
those programs for more details).
gwas_data_chr10.sample
– associated sample file for the .gen file above.
genetic_map_chr10_combined_b36.txt
– genetic map file in build 36 coordinates.
pilot1.jun2010.b36.CEU.chr10.snpfilt.haps
– set of June 2010 1000 Genomes haplotypes from chr10 in build 36
coordinates.
pilot1.jun2010.b36.CEU.chr10.snpfilt.legend
– associated legend file for the .haps file above.
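Before phasing, it can be worth checking that the unpacked files have the dimensions described above. The following is a minimal sketch, not part of SHAPEIT or IMPUTE2. It assumes the usual .gen layout of five leading columns (SNP ID, rsID, position and the two alleles) followed by three genotype probabilities per sample, and a .sample file with two header lines. The function names are our own.

```shell
# Hypothetical helper functions (not provided by SHAPEIT or IMPUTE2).

# Number of SNPs = number of lines in the .gen file.
count_snps() {
    awk 'END { print NR }' "$1"
}

# Number of samples implied by the first line of the .gen file:
# 5 leading columns, then 3 genotype probabilities per sample (assumed layout).
count_gen_samples() {
    awk 'NR == 1 { print (NF - 5) / 3; exit }' "$1"
}

# Number of samples listed in the .sample file (2 header lines assumed).
count_sample_rows() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Only report if the example files are present in the current directory.
if [ -f gwas_data_chr10.gen ] && [ -f gwas_data_chr10.sample ]; then
    echo "SNPs in .gen:       $(count_snps gwas_data_chr10.gen)"
    echo "Samples in .gen:    $(count_gen_samples gwas_data_chr10.gen)"
    echo "Samples in .sample: $(count_sample_rows gwas_data_chr10.sample)"
fi
```

For the example data the two sample counts should agree with each other and with the 100 samples quoted above.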
Phasing with SHAPEIT
A set of haplotypes can be estimated from the genotype data across the
whole chromosome using the following SHAPEIT command:
./shapeit.v1.r254.dynamic.linux.x86_64 \
    --input-gen gwas_data_chr10.gen gwas_data_chr10.sample \
    --input-map genetic_map_chr10_combined_b36.txt \
    --input-gen-threshold 0.95 \
    --output-max gwas_data_chr10_phased.haps gwas_data_chr10_phased.sample \
    --thread 1
Notes:
(a) the --input-gen-threshold option is used for calling genotypes when
gen/sample files are given as input. For each individual at each SNP,
the program will use the most likely genotype if its probability
exceeds the threshold. Otherwise, the genotype will be considered as
missing. This is not a required option (the default is 0.9). We include
a description here so users are aware that thresholding can occur.
(b) you can increase the value used by the --thread option if you are
running on a multi-core machine and want to take advantage of this.
(c) the effective population size is set at its default value (15,000).
In our experience this parameter does not have a significant effect on
performance, but you may wish to adjust it, using the --effective-size
option, based on our current best estimates of this parameter from
HapMap2 (European 11,418, African 17,469, Asian 14,269).
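The thresholding rule in note (a) can be illustrated with a short script. This is a sketch of the rule as described in the text, not SHAPEIT's own code. It assumes the .gen layout of five leading columns followed by three probabilities per sample, and the function name call_genotypes is hypothetical. How SHAPEIT treats a probability exactly equal to the threshold is not covered here; the sketch uses a strict comparison.

```shell
# call_genotypes THRESHOLD GEN_FILE   (hypothetical helper, illustration only)
# For each SNP line, print the rsID followed by one hard call per sample:
# 0, 1 or 2 (the index of the most likely genotype), or NA when the largest
# probability does not exceed THRESHOLD.
call_genotypes() {
    awk -v t="$1" '{
        out = $2
        for (i = 6; i + 2 <= NF; i += 3) {
            best = 0; p = $i
            if ($(i + 1) > p) { best = 1; p = $(i + 1) }
            if ($(i + 2) > p) { best = 2; p = $(i + 2) }
            out = out " " ((p > t) ? best : "NA")
        }
        print out
    }' "$2"
}

# Show the calls for the first few SNPs if the example file is present.
if [ -f gwas_data_chr10.gen ]; then
    call_genotypes 0.95 gwas_data_chr10.gen | head -n 5
fi
```

Comparing the output at 0.95 with the output at the default of 0.9 shows how many genotypes become missing as the threshold is raised.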
The SHAPEIT command above will produce two output files
gwas_data_chr10_phased.haps
gwas_data_chr10_phased.sample
These files contain the set of estimated haplotypes on the 100 samples
(and the associated sample file) across the whole of chr10. The
haps/sample format, which IMPUTE2 also uses to specify known haplotype
data, is described at http://www.shapeit.fr/pages/hapssample.html
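To get a feel for the phased output, the two estimated haplotypes of a single sample can be pulled out of the .haps file. The sketch below assumes the haps layout of five leading columns (an identifier, rsID, position and the two alleles) followed by two 0/1 allele columns per sample, in the order of the .sample file. The function name sample_haplotypes is our own.

```shell
# sample_haplotypes HAPS_FILE K   (hypothetical helper, illustration only)
# Print position, first-haplotype allele and second-haplotype allele for the
# K-th sample (1-based). With 5 leading columns, sample K occupies columns
# 4 + 2K and 5 + 2K.
sample_haplotypes() {
    awk -v k="$2" '{ print $3, $(4 + 2 * k), $(5 + 2 * k) }' "$1"
}

# Show the start of the first sample's haplotypes if the phased file exists.
if [ -f gwas_data_chr10_phased.haps ]; then
    sample_haplotypes gwas_data_chr10_phased.haps 1 | head -n 5
fi
```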
Imputation with IMPUTE2
The next step is to carry out imputation using the IMPUTE2 program. We
recommend running it in separate chunks of 5Mb across the chromosome.
So to impute the region of chr10 between positions 20,000,000 and
25,000,000 you should use the command
./impute2 \
    -known_haps_g gwas_data_chr10_phased.haps \
    -h pilot1.jun2010.b36.CEU.chr10.snpfilt.haps \
    -l pilot1.jun2010.b36.CEU.chr10.snpfilt.legend \
    -m genetic_map_chr10_combined_b36.txt \
    -int 20000000 25000000 \
    -Ne 15000 \
    -buffer 250 \
    -o gwas_data_chr10_imputed.20-25Mb.gen
A whole chromosome can then be imputed by running separate IMPUTE2
commands on non-overlapping chunks across the chromosome. The resulting
files can easily be concatenated together to create a single imputed
gen file per chromosome.
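The chunked runs and the final concatenation can be scripted. The following is one possible sketch rather than a recommended pipeline. The chromosome length (135,000,000) is an assumed placeholder, the chunk boundaries start at k x 5Mb + 1 so that adjacent chunks share no position, and the output naming scheme and function names are our own. The loop only prints the IMPUTE2 commands; remove the echo to run them.

```shell
# print_chunks CHR_LENGTH CHUNK_SIZE   (hypothetical helper)
# Print "start end" for consecutive non-overlapping chunks covering
# positions 1..CHR_LENGTH.
print_chunks() {
    len=$1; size=$2; s=1
    while [ "$s" -le "$len" ]; do
        e=$((s + size - 1))
        if [ "$e" -gt "$len" ]; then e=$len; fi
        echo "$s $e"
        s=$((s + size))
    done
}

# concat_chunks PREFIX CHR_LENGTH CHUNK_SIZE   (hypothetical helper)
# Write the per-chunk files PREFIX.START-END.gen to stdout in chunk order,
# skipping chunks for which no output file exists.
concat_chunks() {
    print_chunks "$2" "$3" | while read -r start end; do
        f="$1.${start}-${end}.gen"
        if [ -f "$f" ]; then cat "$f"; fi
    done
}

CHR_LENGTH=135000000   # ASSUMPTION: placeholder length, replace with the real value
CHUNK=5000000          # 5Mb chunks, as recommended in the text

# Dry run: print one IMPUTE2 command per chunk.
print_chunks "$CHR_LENGTH" "$CHUNK" | while read -r start end; do
    echo ./impute2 \
        -known_haps_g gwas_data_chr10_phased.haps \
        -h pilot1.jun2010.b36.CEU.chr10.snpfilt.haps \
        -l pilot1.jun2010.b36.CEU.chr10.snpfilt.legend \
        -m genetic_map_chr10_combined_b36.txt \
        -int "$start" "$end" \
        -Ne 15000 \
        -buffer 250 \
        -o "gwas_data_chr10_imputed.${start}-${end}.gen"
done

# Once the real runs have finished:
# concat_chunks gwas_data_chr10_imputed "$CHR_LENGTH" "$CHUNK" > gwas_data_chr10_imputed.gen
```

Concatenating in chunk order, rather than relying on a shell glob, keeps the SNPs in the combined file sorted by position.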
References
[1] B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, G. Abecasis
(2012) Fast and accurate genotype imputation in genome-wide association
studies through pre-phasing (in review)
[2] O. Delaneau, J. Marchini, JF. Zagury (2011) A linear complexity
phasing method for thousands of genomes. Nature Methods (to appear)
[3] B. Howie, P. Donnelly, J. Marchini (2009) A Flexible and Accurate
Genotype Imputation Method for the Next Generation of Genome-Wide
Association Studies. PLoS Genetics 5(6): e1000529
[4] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide
association studies. Nature Reviews Genetics
[5] B. Howie, J. Marchini, M. Stephens (2011) Genotype Imputation with
Thousands of Genomes. G3 doi: 10.1534/g3.111.001198