README for download package 'ALL_1000G_phase1interim_jun2011_impute.tgz'

The files in this download are based on a 1,000 Genomes Project sequence data freeze from 23 Nov 2010. This callset was released in Jun 2011, and it includes phased SNP haplotypes for 1,094 individuals. The haplotypes were inferred from low-coverage sequence data by the SNPTools software package (Fuli Yu and Yi Wang, Baylor College of Medicine). The original haplotypes can be downloaded as VCF ("variant call format") files from here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/.

UPDATED 06 FEB 2012: The download package now contains haplotypes from chromosome X, and all legend files now include group-level minor allele frequencies that can be used with the -filt_rules_l option in IMPUTE2.

UPDATED 05 MAR 2012: The .hap and .legend files have now been filtered to remove sites that are monomorphic across all 1,094 individuals. The files still include singletons, even though these will typically be very hard to impute. We have also fixed a small problem in the chrX nonPAR genetic map file that arose when mapping physical coordinates across different genome builds.

UPDATED 18 APR 2012: Due to a processing issue, the previous update still included monomorphic sites; these have now been removed.

This download package includes data from the autosomes and chromosome X. The chromosome X files are split into three disjoint regions: the non-pseudoautosomal region (nonPAR) and two pseudoautosomal regions (PAR1 and PAR2). When imputing from the nonPAR reference panel with IMPUTE2, make sure to use the -chrX flag and code your study genotypes appropriately; for more details, please see the section about chromosome X imputation on the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/chromosome_X_options.html).

The autosomal and sex chromosome variant counts are as follows:

Chromosomes 1-22: 37,138,905 SNPs
Chromosome X: 1,369,184 SNPs (nonPAR); 50,842 SNPs (PAR1+PAR2)

This download contains four kinds of files:

1. *impute.hap.gz (one file per autosome, three files for chrX) -- Phased haplotype file in IMPUTE -h format (compressed by gzip software).

2. *impute.legend.gz (one file per autosome, three files for chrX) -- Legend file in IMPUTE -l format (compressed by gzip software; SNP positions in NCBI b37 coordinates).

   The first four columns (variant ID, base pair position, allele labeled '0' in .hap file, and allele labeled '1' in .hap file) are mandatory in this format, and in this download package the '0' allele is always the reference allele. When a SNP was not provided with an ID in the original VCF file, we assigned it an ID of the form "[chr]-[pos]", where [chr] and [pos] are the chromosome and position of the SNP.

   The remaining columns provide additional information about each variant. Columns labeled 'xxx.maf' (where 'xxx' is a lowercase group ID) provide the minor allele frequencies of the different ancestral groups in the corresponding haplotype file, while columns labeled 'xxx.aaf' provide the alternate (non-reference) allele frequencies.

   We provide a single worldwide haplotype file per chromosome (rather than splitting the files by population or group) since IMPUTE2 is designed to work with cosmopolitan reference panels via its -k_hap parameter. One way to reduce the computational cost of imputation is to flexibly remove certain SNPs on the command line via the -filt_rules_l option. For example, if you are imputing into a European dataset and want to ignore reference variants that are monomorphic in the 1,000 Genomes EUR data, you can include " -filt_rules_l 'eur.maf==0' " in your IMPUTE2 command, which will tell the software to ignore SNPs with values of zero in the 'eur.maf' column of the legend file.
   For more details about -k_hap and -filt_rules_l, please see the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html).

3. genetic_map* (one file per autosome, three files for chrX) -- Genetic map file in IMPUTE -m format (physical positions in NCBI b37 coordinates).

4. *.sample -- Text file with sample, population, group, and sex IDs for the individuals in the haplotype files. The group IDs correspond roughly to the continents of sampling, with occasional exceptions. The sex IDs are '1' for males and '2' for females; these are useful when interpreting data from the non-pseudoautosomal region of chromosome X. There are two haplotypes per sample, and the column order of haplotypes matches the row order of sample IDs.

The population IDs are defined below, with group IDs in square brackets and sample counts in parentheses.

ASW [AFR] (61)  - African Ancestry in Southwest US
CEU [EUR] (87)  - Utah residents (CEPH) with Northern and Western European ancestry
CHB [ASN] (97)  - Han Chinese in Beijing, China
CHS [ASN] (100) - Han Chinese South
CLM [AMR] (60)  - Colombian in Medellin, Colombia
FIN [EUR] (93)  - Finnish from Finland
GBR [EUR] (89)  - British from England and Scotland
IBS [EUR] (14)  - Iberian populations in Spain
JPT [ASN] (89)  - Japanese in Tokyo, Japan
LWK [AFR] (97)  - Luhya in Webuye, Kenya
MXL [AMR] (66)  - Mexican Ancestry in Los Angeles, CA
PUR [AMR] (55)  - Puerto Rican in Puerto Rico
TSI [EUR] (98)  - Toscani in Italia
YRI [AFR] (88)  - Yoruba in Ibadan, Nigeria
-----------
TOTAL [AFR=246, AMR=181, ASN=286, EUR=381] (1094)
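
The legend filter and the sample-to-haplotype ordering described above can be previewed with a short script before running IMPUTE2. The Python sketch below is illustrative only and is not part of IMPUTE2 or of this download package. It assumes that each legend file starts with a whitespace-delimited header line naming the columns (for example 'eur.maf'), and that each sample's two haplotypes occupy adjacent columns of the .hap file; the function names are hypothetical.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def count_filtered(legend_lines, column="eur.maf"):
    """Count legend rows that a rule like 'eur.maf==0' would ignore or keep.

    `legend_lines` is any iterable of text lines whose first line is a
    header naming the columns (assumption: whitespace-delimited).
    Returns the tuple (ignored, kept).
    """
    lines = iter(legend_lines)
    header = next(lines).split()
    col = header.index(column)  # raises ValueError if the column is absent
    ignored = kept = 0
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if float(fields[col]) == 0.0:
            ignored += 1
        else:
            kept += 1
    return ignored, kept


def haplotype_columns(sample_row):
    """Map a 0-based .sample row to its two 0-based .hap column indices.

    Assumption: the two haplotypes of each sample sit in adjacent columns,
    in the same order as the rows of the .sample file.
    """
    return 2 * sample_row, 2 * sample_row + 1
```

To use it on a real legend file, pass the handle returned by open_text() to count_filtered(), for instance count_filtered(open_text(path), "eur.maf"), where path points to one of the *impute.legend.gz files. The counts only indicate how many reference SNPs the filter would drop; IMPUTE2 itself applies the rule when given -filt_rules_l.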