1,000 Genomes haplotypes -- Phase 3 integrated variant set release in NCBI build 37 (hg19) coordinates
 
UPDATED FILES 12 Oct 2014: So that each variant has a unique ID, we have edited the legend files so that the variant ID is either rsID:position:ref:alt or chr:position:ref:alt. If the site is a structural variant, the variant ID is rsID:position:ref:alt:END or chr:position:ref:alt:END, where END is the end position of the variant.

These files are based on sequence data for 2,504 samples from

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/


We have created a set of files that will work directly with IMPUTE2. The Phase 3 haplotypes contain multi-allelic variants, and we have split each of these into multiple bi-allelic variants. For example, a tri-allelic SNP with 3 alleles (REF, ALT1, ALT2) has been recoded as two bi-allelic SNPs: the first with alleles REF and ALT1, and the second with alleles REF and ALT2. After this recoding, the numbers of sites for each variant type are as follows:

Variant Type              Number of sites
Biallelic_SNP             77,818,332
Multiallelic_SNP          520,275
Biallelic_INDEL           2,982,597
Multiallelic_INDEL        324,022
Biallelic_DEL             32,306
Biallelic_DUP             5,791
Biallelic_INV             100
Biallelic_MNP             1
Multiallelic_CNV          6,210
Biallelic_INS:ME:ALU      12,491
Biallelic_INS:ME:LINE1    2,910
Biallelic_INS:ME:SVA      822
Biallelic_INS:MT          165
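
The recoding of multi-allelic sites described above can be illustrated with a minimal sketch. This is not the pipeline used to build the release. The function name and the example site are hypothetical, and the ID layout follows the rsID:position:ref:alt / chr:position:ref:alt scheme from the update note.

```python
def split_multiallelic(chrom, position, ref, alts, rsid=None):
    """Split one multi-allelic site into bi-allelic (REF, ALT) records.

    Hypothetical helper for illustration only. Each ALT allele gets its own
    record, with an ID of the form rsID:position:ref:alt when an rsID is
    given and chr:position:ref:alt otherwise.
    """
    prefix = rsid if rsid else str(chrom)
    records = []
    for alt in alts:
        records.append({
            "id": f"{prefix}:{position}:{ref}:{alt}",
            "position": position,
            "a0": ref,  # allele coded '0' (REF)
            "a1": alt,  # allele coded '1' (this ALT)
        })
    return records


# Invented tri-allelic SNP on chromosome 1: REF=A, ALT1=C, ALT2=T.
for record in split_multiallelic(1, 1000, "A", ["C", "T"]):
    print(record["id"], record["a0"], record["a1"])
```

The single tri-allelic site yields two records that share a position and REF allele but differ in ALT allele, which is why the ID has to include the alleles to stay unique.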

WARNING: this file is over 12 GB in size: 1000GP_Phase3.tgz

This file is a gzipped tar archive and contains 4 kinds of files :

1. *.hap.gz (one file per autosome) -- Phased haplotype file in IMPUTE -h format (compressed by gzip software).

2. *.legend.gz (one file per autosome) -- Legend file in IMPUTE -l format (compressed by gzip software; variant positions in NCBI b37 coordinates). These files have the following columns

column 1 (id) - variant ID
column 2 (position) - base pair position
column 3 (a0) - allele labeled '0' in .hap file. This is the REF (or reference) allele.
column 4 (a1) - allele labeled '1' in .hap file. This is the ALT (or alternate) allele.
column 5 (TYPE) - SNP/INDEL/SV denotes type of biallelic variant
columns 6-10 (AFR, AMR, EAS, EUR, SAS) - ALT allele frequency in continental groups. The mapping of populations to groups is given below.
column 11 (ALL) - ALT allele frequency across all 2,504 samples

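Variant IDs: Each bi-allelic variant has an ID of the form rsID:position:ref:alt or chr:position:ref:alt, with :END appended for structural variants, as described in the update note at the top of this file.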
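
As a rough illustration of the legend layout described above, the following sketch reads a legend file into Python dictionaries. It assumes the file is whitespace-delimited and begins with a header line that names the eleven columns as listed. The function names, the file path in the comment, and the example row are invented for illustration.

```python
import gzip

# Frequency columns as named in the column list above.
FREQ_COLUMNS = ("AFR", "AMR", "EAS", "EUR", "SAS", "ALL")


def parse_legend(lines):
    """Yield one dict per variant from legend lines.

    Assumption: the first line is a header naming the columns, and the
    fields are separated by whitespace.
    """
    lines = iter(lines)
    header = next(lines).split()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        row["position"] = int(row["position"])
        for name in FREQ_COLUMNS:
            row[name] = float(row[name])
        yield row


def read_legend(path):
    """Stream variants from a gzip-compressed legend file."""
    with gzip.open(path, "rt") as handle:
        yield from parse_legend(handle)


# Invented example content; real files would be read with read_legend(path).
example = [
    "id position a0 a1 TYPE AFR AMR EAS EUR SAS ALL",
    "rs0000001:1000:A:G 1000 A G Biallelic_SNP 0.10 0.05 0.00 0.00 0.02 0.04",
]
for variant in parse_legend(example):
    print(variant["id"], variant["position"], variant["EUR"])
```

Streaming rows with a generator avoids holding a whole chromosome's legend in memory, which matters given the number of sites per file.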

Filtering using the legend file: We provide a single worldwide haplotype file per chromosome (rather than splitting the files by population or group) since IMPUTE2 is designed to work with cosmopolitan reference panels via its -k_hap parameter. One way to reduce the computational cost of imputation is to flexibly remove certain SNPs on the command line via the -filt_rules_l option. For example, if you are imputing into a European dataset and want to ignore reference variants that are monomorphic in the 1,000 Genomes EUR data, you can include " -filt_rules_l 'EUR==0' " in your IMPUTE2 command, which will tell the software to ignore SNPs with values of zero in the 'EUR' column of the legend file. For more details about -k_hap and -filt_rules_l, please see the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html).
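
To make the effect of the 'EUR==0' rule concrete, here is a small sketch that applies the same criterion to legend rows in Python. It only mimics the outcome described above and says nothing about how IMPUTE2 implements -filt_rules_l. The function name and the example rows are invented.

```python
def drop_zero_frequency(rows, group="EUR"):
    """Keep only variants whose ALT allele frequency in `group` is non-zero.

    Mirrors the effect of a rule such as 'EUR==0', which tells IMPUTE2 to
    ignore variants with a value of zero in that legend column.
    """
    return [row for row in rows if float(row[group]) != 0.0]


# Invented legend rows, reduced to the fields needed here.
rows = [
    {"id": "rs0000001:1000:A:G", "EUR": 0.0, "AFR": 0.10},
    {"id": "rs0000002:2000:C:T", "EUR": 0.25, "AFR": 0.00},
]

kept = drop_zero_frequency(rows, group="EUR")
print([row["id"] for row in kept])  # ['rs0000002:2000:C:T']
```

Counting the rows that survive such a filter gives a quick estimate of how much a given rule would shrink the reference panel before running IMPUTE2.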

3. genetic_map* (one file per autosome) -- Genetic map file in IMPUTE -m format (physical positions in NCBI b37 coordinates).

4. 1000GP_Phase3.sample -- Text file with sample ID, population and continental group for the individuals in the haplotype files. There are two haplotypes per sample, and the column order of haplotypes matches the row order of sample IDs. The population IDs are defined below, grouped by continental group.

AMR (Americas)

CLM  Colombian in Medellin, Colombia
MXL  Mexican Ancestry in Los Angeles, California
PEL  Peruvian in Lima, Peru
PUR  Puerto Rican in Puerto Rico

SAS (Southern Asians)

BEB  Bengali in Bangladesh
GIH  Gujarati Indian in Houston, TX
ITU  Indian Telugu in the UK
PJL  Punjabi in Lahore, Pakistan
STU  Sri Lankan Tamil in the UK

EAS (East Asians)

CDX  Chinese Dai in Xishuangbanna, China
CHB  Han Chinese in Beijing, China
CHD  Chinese in Denver, Colorado
CHS  Southern Han Chinese, China
JPT  Japanese in Tokyo, Japan
KHV  Kinh in Ho Chi Minh City, Vietnam

EUR (Europeans)

CEU  Utah residents with Northern and Western European ancestry
IBS  Iberian populations in Spain
FIN  Finnish in Finland
GBR  British in England and Scotland
TSI  Toscani in Italy

AFR (Africans)

ACB  African Caribbean in Barbados
ASW  African Ancestry in Southwest US
ESN  Esan in Nigeria
GWD  Gambian in Western Division, The Gambia
LWK  Luhya in Webuye, Kenya
MSL  Mende in Sierra Leone
YRI  Yoruba in Ibadan, Nigeria
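
The relationship between rows of the sample file and columns of the haplotype files (item 4 above) can be sketched as follows. The sketch assumes that the two haplotypes of the sample in row i (counting from 0, header excluded) occupy adjacent columns 2i and 2i+1 of each .hap file, so that 2,504 samples correspond to 5,008 haplotype columns. The function name and sample IDs are invented.

```python
def haplotype_columns(sample_rows, group):
    """Return 0-based .hap column indices for every sample in `group`.

    sample_rows: (sample_id, population, group) tuples in file order.
    Assumption: sample i owns the adjacent columns 2*i and 2*i + 1.
    """
    columns = []
    for index, (_sample_id, _population, sample_group) in enumerate(sample_rows):
        if sample_group == group:
            columns.extend((2 * index, 2 * index + 1))
    return columns


# Invented sample rows in the order they would appear in the sample file.
samples = [
    ("S1", "GBR", "EUR"),
    ("S2", "YRI", "AFR"),
    ("S3", "FIN", "EUR"),
]

print(haplotype_columns(samples, "EUR"))  # [0, 1, 4, 5]
```

Indices like these could be used to pull a single continental group out of the worldwide haplotype file, although the legend-based filtering described earlier is the route this README suggests for reducing computation.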