1000 Genomes haplotypes -- Phase I integrated variant set release (SHAPEIT2) in NCBI build 37 (hg19) coordinates

This page was last updated on 16 September 2013.

These files are based on sequence data for 1,092 TGP samples from

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/

This dataset contains genotype likelihoods for 36,820,992 SNPs, 1,384,273 short bi-allelic indels and 14,017 structural variations (SVs).

The haplotypes were phased using a new version of SHAPEIT2 that can handle both genotype likelihoods and genotypes available from SNP microarrays on the same samples. The phasing proceeds in two steps:

(i) First, the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome.
(ii) We then use SHAPEIT2 to phase the sequence data 'onto' this haplotype scaffold.

This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. The approach is described in the following paper:

Olivier Delaneau, Jonathan Marchini and the 1000 Genomes Project Consortium (2013) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. (in review)

Using a set of validation genotypes at SNPs and biallelic indels, we have been able to show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants. The following figure shows the imputation performance of the previous 1000GP haplotypes and this new release.

We have provided two versions of the haplotypes. WARNING: these files are over 3 GB in size.

(i) ALL_1000G_phase1integrated_SHAPEIT2_impute.polymorphic.tgz - haplotypes at all polymorphic sites.
(ii) ALL_1000G_phase1integrated_SHAPEIT2_impute.nosingleton.tgz - haplotypes with singleton sites removed.

These files are gzipped tar archives and contain four kinds of files:

1. *.haplotypes.gz (one file per autosome) -- Phased haplotype file in IMPUTE -h format (compressed by gzip software).

2. *.legend.gz (one file per autosome) -- Legend file in IMPUTE -l format (compressed by gzip software; variant positions in NCBI b37 coordinates). These files have the following columns:

column 1 (id) - variant ID
column 2 (position) - base pair position
column 3 (a0) - allele labeled '0' in the haplotype file. This is the reference allele.
column 4 (a1) - allele labeled '1' in the haplotype file
column 5 (type) - SNP/INDEL/SV denotes type of biallelic variant
column 6 (source) - LOWCOV/EXOME denotes the type of sequencing used to generate the data at the site
columns 7-10 (afr.aaf amr.aaf asn.aaf eur.aaf) - alternate (non-reference) allele frequencies in 4 ancestral groups
columns 11-14 (afr.maf amr.maf asn.maf eur.maf) - minor allele frequencies (MAF) in 4 ancestral groups
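
The column layout above can be read programmatically. The following is a minimal Python sketch, not part of the release: it assumes the legend file is whitespace-delimited and begins with a header line carrying the column names listed above. The function name read_legend and the toy rows (var1, var2 and their values) are made up for illustration only.

```python
import gzip
import io

# Frequency columns as named in the column list above.
FREQUENCY_COLUMNS = [
    "afr.aaf", "amr.aaf", "asn.aaf", "eur.aaf",
    "afr.maf", "amr.maf", "asn.maf", "eur.maf",
]


def read_legend(handle):
    """Yield one dict per variant from an open legend file.

    Assumption: whitespace-delimited text whose first line is a header.
    """
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        row["position"] = int(row["position"])
        for name in FREQUENCY_COLUMNS:
            if name in row:
                row[name] = float(row[name])
        yield row


# Toy data (invented values) standing in for a real *.legend.gz file.
TOY_LEGEND = (
    "id position a0 a1 type source "
    "afr.aaf amr.aaf asn.aaf eur.aaf afr.maf amr.maf asn.maf eur.maf\n"
    "var1 1000 A G SNP LOWCOV 0.10 0.00 0.25 0.00 0.10 0.00 0.25 0.00\n"
    "var2 2000 AT A INDEL LOWCOV 0.90 0.50 0.75 0.60 0.10 0.50 0.25 0.40\n"
)
rows = list(read_legend(io.StringIO(TOY_LEGEND)))
```

For a real file, the same function could be given a text-mode gzip handle, for example gzip.open(path, "rt"), where path is a placeholder for one of the *.legend.gz files.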

Filtering using the legend file

We provide a single worldwide haplotype file per chromosome (rather than splitting the files by population or group) since IMPUTE2 is designed to work with cosmopolitan reference panels via its -k_hap parameter. One way to reduce the computational cost of imputation is to remove certain variants on the command line via the -filt_rules_l option. For example, if you are imputing into a European dataset and want to ignore reference variants that are monomorphic in the 1000 Genomes EUR data, you can include

-filt_rules_l 'eur.maf==0'

in your IMPUTE2 command, which tells the software to ignore variants with a value of zero in the 'eur.maf' column of the legend file. For more details about -k_hap and -filt_rules_l, please see the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html).
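
Before running an imputation, it can be useful to see how many variants a rule such as 'eur.maf==0' would exclude. The Python sketch below imitates that kind of rule on legend rows that have already been parsed into dicts. It is an illustration only, not IMPUTE2's own parser: the helper names parse_rule and apply_rule, the set of comparison operators, and the toy rows are all assumptions made for this example.

```python
import operator

# Two-character operators come first so that "<=" is not mistaken for "<".
# This operator set is an assumption of the sketch, not IMPUTE2's rule syntax.
_OPERATORS = [
    ("==", operator.eq), ("!=", operator.ne),
    ("<=", operator.le), (">=", operator.ge),
    ("<", operator.lt), (">", operator.gt),
]


def parse_rule(rule):
    """Split a rule like 'eur.maf==0' into (column, comparison, threshold)."""
    for symbol, compare in _OPERATORS:
        if symbol in rule:
            column, value = rule.split(symbol, 1)
            return column.strip(), compare, float(value)
    raise ValueError(f"no comparison operator found in rule: {rule!r}")


def apply_rule(rows, rule):
    """Return (kept, removed); a row is removed when it satisfies the rule."""
    column, compare, threshold = parse_rule(rule)
    kept, removed = [], []
    for row in rows:
        target = removed if compare(float(row[column]), threshold) else kept
        target.append(row)
    return kept, removed


# Toy legend rows (invented values), reduced to the columns needed here.
TOY_ROWS = [
    {"id": "var1", "eur.maf": 0.0},
    {"id": "var2", "eur.maf": 0.40},
    {"id": "var3", "eur.maf": 0.0},
]
kept, removed = apply_rule(TOY_ROWS, "eur.maf==0")
```

With the toy rows, var1 and var3 are monomorphic in EUR and end up in removed, while var2 is kept.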

3. genetic_map* (one file per autosome) -- Genetic map file in IMPUTE -m format (physical positions in NCBI b37 coordinates).

4. *.sample -- Text file with sample, population, group, and sex IDs for the individuals in the haplotype files. The group IDs correspond roughly to the continents of sampling, with occasional exceptions. The sex IDs are '1' for males and '2' for females; these are useful when interpreting data from the non-pseudoautosomal region ('nonPAR') of chromosome X. There are two haplotypes per sample, and the column order of haplotypes matches the row order of sample IDs. The population IDs are defined below, with group IDs in square brackets and sample counts in parentheses.

ASW [AFR] (61) - African Ancestry in Southwest US
CEU [EUR] (85) - Utah residents (CEPH) with Northern and Western European ancestry
CHB [ASN] (97) - Han Chinese in Beijing, China
CHS [ASN] (100) - Han Chinese South
CLM [AMR] (60) - Colombian in Medellin, Colombia
FIN [EUR] (93) - Finnish from Finland
GBR [EUR] (89) - British from England and Scotland
IBS [EUR] (14) - Iberian populations in Spain
JPT [ASN] (89) - Japanese in Tokyo, Japan
LWK [AFR] (97) - Luhya in Webuye, Kenya
MXL [AMR] (66) - Mexican Ancestry in Los Angeles, CA
PUR [AMR] (55) - Puerto Rican in Puerto Rico
TSI [EUR] (98) - Toscani in Italia
YRI [AFR] (88) - Yoruba in Ibadan, Nigeria
-----------
TOTAL [AFR=246, AMR=181, ASN=286, EUR=379] (1092)
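
Because there are two haplotypes per sample and the haplotype columns follow the row order of the sample file, the sample on 0-based row i owns 0-based haplotype columns 2i and 2i+1 (so 1,092 samples give 2,184 columns). The Python sketch below illustrates this mapping. It assumes the sample file is whitespace-delimited with a header line naming the columns sample, population, group and sex, and that each line of the haplotype file holds whitespace-separated 0/1 alleles for one variant; the helper names and the toy sample IDs (S1 to S3) are invented for illustration.

```python
import io
from collections import Counter


def read_samples(handle):
    """Return one dict per sample row (assumes a header line is present)."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]


def haplotype_columns(sample_index):
    """0-based haplotype columns for the sample on 0-based row sample_index."""
    return 2 * sample_index, 2 * sample_index + 1


def alleles_for_sample(haplotype_line, sample_index):
    """Pick one sample's two alleles out of a single variant's haplotype line."""
    fields = haplotype_line.split()
    first, second = haplotype_columns(sample_index)
    return fields[first], fields[second]


def group_counts(samples):
    """Tally samples per group ID (for example AFR, AMR, ASN, EUR)."""
    return Counter(sample["group"] for sample in samples)


# Toy inputs (invented) standing in for the *.sample and haplotype files.
TOY_SAMPLES = (
    "sample population group sex\n"
    "S1 GBR EUR 1\n"
    "S2 YRI AFR 2\n"
    "S3 CHB ASN 2\n"
)
TOY_HAPLOTYPE_LINE = "0 1 1 1 0 0"  # one variant, 3 samples x 2 haplotypes

samples = read_samples(io.StringIO(TOY_SAMPLES))
s2_alleles = alleles_for_sample(TOY_HAPLOTYPE_LINE, 1)
```

Run on the real sample file, group_counts offers a quick check against the TOTAL line above.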

A VCF file with these haplotypes is available from the 1000 Genomes Project FTP site here:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/shapeit2_phased_haplotypes/