README for download package 'ALL_1000G_phase1integrated_v3_impute.tgz'

The files in this download are based on 1,000 Genomes Project sequence data freezes from 23 Nov 2010 (low-coverage whole-genome) and 21 May 2011 (high-coverage exome). This callset includes phased haplotypes for 1,092 individuals. The callset was released Oct 2011, then revised in Feb 2012 and Mar 2012 to remove subsets of variants that were enriched for sequencing artifacts; the current revision is known as "version 3."

The haplotypes were inferred from a combination of low-coverage genome sequence data and high-coverage exome sequence data, and they contain SNPs, short indels, and large deletions. The haplotypes were produced by using BEAGLE (Brian and Sharon Browning, University of Washington) to call and phase genotypes from genotype likelihoods, then refining the BEAGLE haplotype estimates with MaCH/Thunder (Yun Li, University of North Carolina, and Goncalo Abecasis, University of Michigan). The original haplotypes can be downloaded as VCF ("variant call format") files from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/.

This download package includes data from the autosomes and chromosome X. The chromosome X files are split into three disjoint regions: the non-pseudoautosomal region (nonPAR) and two pseudoautosomal regions (PAR1 and PAR2). When imputing from the nonPAR reference panel with IMPUTE2, you should make sure to use the -chrX flag and code your study genotypes appropriately; for more details, please see the section about chromosome X imputation on the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/chromosome_X_options.html).

The files in this download do not include sites that were called as monomorphic across all 1,092 individuals. The files do include singletons, which will typically be very hard to impute.

The autosomal and sex chromosome variant counts are as follows:

Chromosomes 1-22:
  36,648,992 SNPs
  1,380,736 INDELs
  13,805 SVs

Chromosome X:
  1,193,934 SNPs (nonPAR); 46,880 SNPs (PAR1+PAR2)
  55,901 INDELs (nonPAR); 3,221 INDELs (PAR1+PAR2)
  383 SVs (nonPAR); 48 SVs (PAR1+PAR2)

This download contains four kinds of files:

1. *impute.hap.gz (one file per autosome, three files for chrX) -- Phased haplotype file in IMPUTE -h format (compressed by gzip software).

2. *impute.legend.gz (one file per autosome, three files for chrX) -- Legend file in IMPUTE -l format (compressed by gzip software; variant positions in NCBI b37 coordinates).

The first four columns (variant ID, base pair position, allele labeled '0' in .hap file, and allele labeled '1' in .hap file) are mandatory in this format, and in this download package the '0' allele is always the reference allele. When a variant was not provided with an ID in the original VCF file, we assigned it an ID of the form "chr[chr]:[pos]:[type]", where [chr] and [pos] are the chromosome and position of the variant and [type] is 'I' for an insertion polymorphism or 'D' for a deletion polymorphism ([type] is not used for SNPs).

The remaining columns provide additional information about each variant. Columns labeled 'xxx.maf' (where 'xxx' is a lowercase group ID) provide the minor allele frequencies of the different ancestral groups in the corresponding haplotype file, while columns labeled 'xxx.aaf' provide the alternate (non-reference) allele frequencies.

We provide a single worldwide haplotype file per chromosome (rather than splitting the files by population or group) since IMPUTE2 is designed to work with cosmopolitan reference panels via its -k_hap parameter. One way to reduce the computational cost of imputation is to flexibly remove certain SNPs on the command line via the -filt_rules_l option.
For example, if you are imputing into a European dataset and want to ignore reference variants that are monomorphic in the 1,000 Genomes EUR data, you can include "-filt_rules_l 'eur.maf==0'" in your IMPUTE2 command, which will tell the software to ignore SNPs with values of zero in the 'eur.maf' column of the legend file. For more details about -k_hap and -filt_rules_l, please see the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html).

NOTE: Unlike the legend files in some of our previous 1,000 Genomes Phase 1 release datasets, the files in this release do not contain type=SNP/INDEL/SV columns or source=LOWCOV/EXOME columns. We have omitted these columns from the current download because they can trigger a bug in IMPUTE v2.2.2 when -filt_rules_l is active. Legend files that include these columns are available at a separate location on our website (please see the data download page for this release), and we will soon release an updated version of IMPUTE2 that can handle them smoothly.

3. genetic_map* (one file per autosome, three files for chrX) -- Genetic map file in IMPUTE -m format (physical positions in NCBI b37 coordinates).

4. *.sample -- Text file with sample, population, group, and sex IDs for the individuals in the haplotype files.

The group IDs correspond roughly to the continents of sampling, with occasional exceptions. The sex IDs are '1' for males and '2' for females; these are useful when interpreting data from the non-pseudoautosomal region ('nonPAR') of chromosome X. There are two haplotypes per sample, and the column order of haplotypes matches the row order of sample IDs.

The population IDs are defined below, with group IDs in square brackets and sample counts in parentheses.

ASW [AFR] (61)  - African Ancestry in Southwest US
CEU [EUR] (85)  - Utah residents (CEPH) with Northern and Western European ancestry
CHB [ASN] (97)  - Han Chinese in Beijing, China
CHS [ASN] (100) - Han Chinese South
CLM [AMR] (60)  - Colombian in Medellin, Colombia
FIN [EUR] (93)  - Finnish from Finland
GBR [EUR] (89)  - British from England and Scotland
IBS [EUR] (14)  - Iberian populations in Spain
JPT [ASN] (89)  - Japanese in Tokyo, Japan
LWK [AFR] (97)  - Luhya in Webuye, Kenya
MXL [AMR] (66)  - Mexican Ancestry in Los Angeles, CA
PUR [AMR] (55)  - Puerto Rican in Puerto Rico
TSI [EUR] (98)  - Toscani in Italia
YRI [AFR] (88)  - Yoruba in Ibadan, Nigeria
-----------
TOTAL [AFR=246, AMR=181, ASN=286, EUR=379] (1092)
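
The relationship between the *.sample file and the haplotype files (two haplotypes per sample, with haplotype columns in the same order as the sample rows) can be illustrated with a short Python sketch. This is only an illustration, not part of the download: it assumes the sample file is whitespace-delimited with a header row naming the four fields "sample population group sex", and the sample IDs shown are made up. Please check the header of the actual file before relying on these names.

```python
import io
from collections import Counter

# Hypothetical excerpt of a *.sample file. The header names, the
# whitespace-delimited layout, and the sample IDs are all assumptions
# made for this illustration.
EXAMPLE_SAMPLE_TEXT = """\
sample population group sex
S0001 GBR EUR 1
S0002 GBR EUR 2
S0003 YRI AFR 2
"""


def read_samples(handle):
    """Return one dict per individual, preserving the row order of the file."""
    header = handle.readline().split()
    return [dict(zip(header, line.split())) for line in handle if line.strip()]


def haplotype_columns(samples):
    """Map each sample ID to its pair of 0-based haplotype column indices.

    Because there are two haplotypes per sample and the column order follows
    the row order, row i of the sample file owns columns 2*i and 2*i + 1.
    """
    return {s["sample"]: (2 * i, 2 * i + 1) for i, s in enumerate(samples)}


samples = read_samples(io.StringIO(EXAMPLE_SAMPLE_TEXT))
columns = haplotype_columns(samples)

# Tally individuals per group ID, in the spirit of the TOTAL line above.
group_counts = Counter(s["group"] for s in samples)

print(columns["S0003"])     # (4, 5)
print(group_counts["EUR"])  # 2
```

To apply the same functions to the real file, pass an open file handle to read_samples() in place of the io.StringIO object. The per-group tallies should then agree with the TOTAL line of the population table, and the number of haplotype columns in each .hap file should be twice the number of sample rows.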