Pre-phasing imputation using SHAPEIT and IMPUTE2


The pre-phasing idea – As haplotype reference panels continue to grow in size (in both the number of haplotypes and the number of SNPs), genotype imputation for GWAS becomes increasingly computationally intensive. Pre-phasing speeds up this process by first estimating haplotypes for your GWAS samples and then imputing alleles into these haplotypes from a reference haplotype panel. The GWAS samples need to be phased only once, so when a new haplotype reference panel becomes available only the comparatively quick imputation step has to be rerun. A paper detailing this approach is currently in review [1].


SHAPEIT and IMPUTE2


We currently recommend that pre-phasing based imputation is carried out using a combination of the programs SHAPEIT and IMPUTE2. There is good evidence that, used together, these tools provide the most accurate results while remaining computationally efficient at all stages of phasing and imputation.

SHAPEIT is a new program for haplotype phasing that has several desirable properties and features.
•    It has been shown to be more accurate at haplotype estimation on several GWAS-scale datasets than IMPUTE2, MACH, BEAGLE and fastPHASE.
•    It can phase whole chromosomes at once.
•    It can handle GWAS datasets consisting of unrelated samples, trios and duos.
•    The algorithm has linear complexity in the number of SNPs, number of samples and the number of conditioning haplotypes used in each iteration of the phasing.
•    The program can take advantage of multi-threading to decrease computational time on multi-core machines.

The program is available free for academic use at http://www.shapeit.fr/

A paper detailing the method and its performance is to appear soon in Nature Methods [2].

IMPUTE2 is a program that carries out imputation of genotypes from a reference panel of haplotypes. It can take either unphased or phased GWAS data as input. It has been shown to be more accurate than other imputation programs [3,4,5].

The program is available free for academic use at https://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Example

To illustrate how SHAPEIT and IMPUTE2 can be used together to carry out pre-phasing imputation, we have put together a set of example files that can be downloaded from here:

https://mathgen.stats.ox.ac.uk/impute/SHAPEIT+IMPUTE.pre-phasing.examples.tgz

The files can be unpacked using

tar zxvf SHAPEIT+IMPUTE.pre-phasing.examples.tgz

and consist of

gwas_data_chr10.gen – a set of genotype data on 100 samples and 23,231 SNPs. This file is in the format used by IMPUTE2 and SNPTEST (see here for more details)

gwas_data_chr10.sample – associated sample file for the .gen file above

genetic_map_chr10_combined_b36.txt – genetic map file in build 36 (b36) coordinates.

pilot1.jun2010.b36.CEU.chr10.snpfilt.haps – set of June 2010 1000 Genomes haplotypes from chr10 in build 36 coordinates.

pilot1.jun2010.b36.CEU.chr10.snpfilt.legend – associated legend file for the .haps file above

Phasing with SHAPEIT

A set of haplotypes can be estimated from the genotype data across the whole chromosome using the following SHAPEIT command

./shapeit.v1.r254.dynamic.linux.x86_64 \
--input-gen gwas_data_chr10.gen gwas_data_chr10.sample \
--input-map genetic_map_chr10_combined_b36.txt \
--input-gen-threshold 0.95 \
--output-max gwas_data_chr10_phased.haps gwas_data_chr10_phased.sample \
--thread 1

Notes:
(a) The --input-gen-threshold option controls how genotypes are called when gen/sample files are given as input. For each individual at each SNP, the program uses the most likely genotype if its probability exceeds the threshold; otherwise the genotype is treated as missing. This option is not required (the default is 0.9). We describe it here so that users are aware that thresholding can occur.
(b) You can increase the value given to the --thread option if you are running on a multi-core machine and want to take advantage of it.
(c) The effective population size is left at its default value (15,000). In our experience this parameter does not have a significant effect on performance, but you may wish to adjust it, using the --effective-size option, based on our current best estimates from HapMap2 (European 11,418, African 17,469, Asian 14,269).
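
The thresholding rule in note (a) can be illustrated with a small shell sketch. It is not part of SHAPEIT and does not reproduce its internal code; the function name count_below_threshold is hypothetical, and the sketch assumes the .gen layout of five leading columns (SNP ID, rsID, position, allele A, allele B) followed by three genotype probabilities per sample. It reports how many genotypes would be treated as missing at a given threshold, which can help when choosing a value.

```shell
# Hypothetical helper: count genotypes whose largest probability does not
# exceed the threshold (these would be treated as missing under note (a)).
# Assumes 5 leading columns, then 3 probabilities per sample.
count_below_threshold() {
    # $1 = .gen file, $2 = threshold (e.g. 0.95)
    awk -v t="$2" '
        {
            for (i = 6; i + 2 <= NF; i += 3) {
                max = $i + 0
                if ($(i + 1) + 0 > max) max = $(i + 1) + 0
                if ($(i + 2) + 0 > max) max = $(i + 2) + 0
                total++
                if (!(max > t + 0)) missing++
            }
        }
        END { printf "%d %d\n", missing, total }
    ' "$1"
}

# Usage on the example data, if it is present in the current directory.
# Prints "<missing> <total>".
if [ -f gwas_data_chr10.gen ]; then
    count_below_threshold gwas_data_chr10.gen 0.95
fi
```

Comparing the counts at 0.9 and 0.95 gives a rough idea of how much data a stricter threshold would discard before phasing.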

The SHAPEIT command above will produce two output files

gwas_data_chr10_phased.haps
gwas_data_chr10_phased.sample

These files contain the estimated haplotypes for the 100 samples across the whole of chr10, together with the associated sample file. The haps/sample format, which IMPUTE2 also uses to specify known haplotype data, is described at http://www.shapeit.fr/pages/hapssample.html.
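
Before moving on to imputation, it can be worth checking that the two output files agree on the number of samples. The sketch below is an illustration only: the function names are hypothetical, and it assumes the haps/sample layout of five leading columns followed by two haplotype columns per sample in the .haps file, and two header lines followed by one line per sample in the .sample file.

```shell
# Hypothetical helpers for a quick consistency check of SHAPEIT output.

# Samples implied by a .haps file: (columns - 5) / 2 on the first line.
haps_sample_count() {
    awk 'NR == 1 { print (NF - 5) / 2; exit }' "$1"
}

# Samples listed in a .sample file: non-empty lines after the two header lines.
sample_file_count() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Usage on the example output, if it is present in the current directory.
if [ -f gwas_data_chr10_phased.haps ] && [ -f gwas_data_chr10_phased.sample ]; then
    echo "haps file:   $(haps_sample_count gwas_data_chr10_phased.haps) samples"
    echo "sample file: $(sample_file_count gwas_data_chr10_phased.sample) samples"
fi
```

For the example data both counts would be expected to equal 100, the number of samples in the input.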

Imputation with IMPUTE2

The next step is to carry out imputation using the IMPUTE2 program. We recommend running it separately on chunks of the chromosome, each about 5Mb long. For example, to impute the region of chr10 between positions 20,000,000 and 25,000,000 you should use the command

./impute2 \
-known_haps_g gwas_data_chr10_phased.haps \
-h pilot1.jun2010.b36.CEU.chr10.snpfilt.haps \
-l pilot1.jun2010.b36.CEU.chr10.snpfilt.legend \
-m genetic_map_chr10_combined_b36.txt \
-int 20000000 25000000 \
-Ne 15000 \
-buffer 250 \
-o gwas_data_chr10_imputed.20-25Mb.gen

A whole chromosome can then be imputed by running separate IMPUTE2 commands on non-overlapping chunks across each chromosome. The resulting files can easily be concatenated together to create a single imputed gen file per chromosome.
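
The chunking and concatenation described above can be sketched in shell as follows. This is a minimal illustration, not an official script: the function name chunk_intervals and the variable chr_end are hypothetical, chr_end is only an approximate placeholder that you should replace with the last position you want to impute, and each chunk starts one base after the previous one ends so that no SNP falls into two chunks. The IMPUTE2 options are the same as in the command above.

```shell
# Hypothetical helper: print non-overlapping "start end" intervals.
chunk_intervals() {
    # $1 = first position, $2 = last position, $3 = chunk size in bp
    start=$1
    while [ "$start" -le "$2" ]; do
        end=$((start + $3 - 1))
        if [ "$end" -gt "$2" ]; then
            end=$2
        fi
        echo "$start $end"
        start=$((end + 1))
    done
}

# Usage: runs only if the impute2 binary is present in the current directory.
chr_end=135400000   # ASSUMPTION: approximate placeholder, replace with your own value
if [ -x ./impute2 ]; then
    : > gwas_data_chr10_imputed.gen
    chunk_intervals 1 "$chr_end" 5000000 | while read -r lo hi; do
        out="gwas_data_chr10_imputed.${lo}-${hi}.gen"
        ./impute2 \
            -known_haps_g gwas_data_chr10_phased.haps \
            -h pilot1.jun2010.b36.CEU.chr10.snpfilt.haps \
            -l pilot1.jun2010.b36.CEU.chr10.snpfilt.legend \
            -m genetic_map_chr10_combined_b36.txt \
            -int "$lo" "$hi" \
            -Ne 15000 \
            -buffer 250 \
            -o "$out"
        # Append chunks in genomic order; skip any chunk that produced no file.
        if [ -f "$out" ]; then
            cat "$out" >> gwas_data_chr10_imputed.gen
        fi
    done
fi
```

Because the chunks are independent, they could equally be submitted as separate jobs on a cluster and concatenated afterwards in position order.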

References

[1] B. Howie, C. Fuchsberger, M. Stephens, J. Marchini, G. Abecasis (2012) Fast and accurate genotype imputation in genome-wide association studies through pre-phasing (in review)

[2] O. Delaneau, J. Marchini, JF. Zagury (2011) A linear complexity phasing method for thousands of genomes. Nature Methods (to appear)

[3] B. Howie, P. Donnelly, J. Marchini (2009) A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genetics 5(6): e1000529

[4] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics

[5] B. Howie, J. Marchini, M. Stephens (2011) Genotype Imputation with Thousands of Genomes. G3 doi: 10.1534/g3.111.001198