README for download package 'ALL_1000G_phase1integrated_v3_annotated_legends.tgz'

The files in this download are based on 1,000 Genomes Project sequence data freezes from 23 Nov 2010 (low-coverage whole-genome) and 21 May 2011 (high-coverage exome). This callset was released Oct 2011, then revised in Feb 2012 and Mar 2012 to remove subsets of variants that were enriched for sequencing artifacts; the current revision is known as "version 3." The variants were inferred from a combination of low-coverage genome sequence data and high-coverage exome sequence data, and they include SNPs, short indels, and large deletions. The original data can be downloaded as VCF ("variant call format") files from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/.

This download package includes data from the autosomes and chromosome X. The chromosome X files are split into three disjoint regions: the non-pseudoautosomal region (nonPAR) and two pseudoautosomal regions (PAR1 and PAR2). When imputing from the nonPAR reference panel with IMPUTE2, you should make sure to use the -chrX flag and code your study genotypes appropriately; for more details, please see the section about chromosome X imputation on the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/chromosome_X_options.html).

The files in this download do not include sites that were called as monomorphic across all 1,092 Phase 1 individuals. The files do include singletons, which will typically be very hard to impute. The autosomal and sex chromosome variant counts are as follows:

Chromosomes 1-22:
  36,648,992 SNPs
   1,380,736 INDELs
      13,805 SVs

Chromosome X:
  1,193,934 SNPs (nonPAR);      46,880 SNPs (PAR1+PAR2)
     55,901 INDELs (nonPAR);     3,221 INDELs (PAR1+PAR2)
        383 SVs (nonPAR);           48 SVs (PAR1+PAR2)

This download contains annotated legend files in IMPUTE format. These files have been compressed by gzip software, and all variant positions are in NCBI b37 coordinates.
The first four columns (variant ID, base pair position, allele labeled '0' in .hap file, and allele labeled '1' in .hap file) are mandatory in this format, and in this download package the '0' allele is always the reference allele. When a variant was not provided with an ID in the original VCF file, we assigned it an ID of the form "chr[chr]:[pos]:[type]", where [chr] and [pos] are the chromosome and position of the variant and [type] is 'I' for an insertion polymorphism or 'D' for a deletion polymorphism ([type] is not used for SNPs).

The remaining columns provide additional information about each variant. Columns labeled 'xxx.maf' (where 'xxx' is a lowercase group ID) provide the minor allele frequencies of the different ancestral groups in the corresponding haplotype file, while columns labeled 'xxx.aaf' provide the alternate (non-reference) allele frequencies.

There are also additional annotations describing features of each variant:

  The 'type' column specifies whether a variant is a single-nucleotide polymorphism (SNP), insertion/deletion polymorphism (INDEL), or large structural variant (SV).

  The 'source' column specifies whether a variant was detected by low-coverage shotgun sequencing (LOWCOV), high-coverage exome capture sequencing (EXOME), or both.

  The 'rsq' column provides an estimate of the squared correlation between the called genotypes and the true ones. This is a measure of the confidence of the genotype calls for a particular variant.

NOTE: Legend files with non-numeric annotation columns like 'type' and 'source' can trigger a bug in current versions of IMPUTE2 (v2.2.2 or earlier) when used with the -filt_rules_l option. This will be fixed in the next software release; in the meantime, please use the legend files in the main download package (ALL_1000G_phase1integrated_v3.tgz) if you want to use this option to filter reference variants at run-time.
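As an illustration of the column layout described above, here is a minimal Python sketch for inspecting an annotated legend file outside of IMPUTE2. It rests on several assumptions that this README does not state: that the file is whitespace-delimited, that its first line is a header naming the columns, and that the annotation columns are literally named 'type' and 'eur.maf' ('eur' is only a hypothetical group ID). The function names and the three example records are made up for illustration and are not taken from the download.

```python
import gzip
import io
import re

# Matches IDs of the form chr[chr]:[pos] or chr[chr]:[pos]:[type],
# where [type] is 'I' or 'D' and is absent for SNPs.
SYNTHETIC_ID = re.compile(r"^chr([^:]+):(\d+)(?::([ID]))?$")


def open_legend(path):
    """Open a gzip-compressed legend file for reading as text."""
    return gzip.open(path, "rt")


def read_legend(handle):
    """Yield one dict per variant, keyed by the names in the header line.

    Assumes a whitespace-delimited file whose first line is a header.
    """
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))


def parse_synthetic_id(variant_id):
    """Return (chrom, pos, type) for an assigned ID, or None otherwise.

    The third element is None when the ID has no [type] suffix (SNPs).
    """
    match = SYNTHETIC_ID.match(variant_id)
    if match is None:
        return None
    chrom, pos, kind = match.groups()
    return chrom, int(pos), kind


def select_variants(records, maf_column, min_maf, variant_type="SNP"):
    """Yield records of one variant type whose MAF meets a threshold.

    Assumes every value in maf_column parses as a float.
    """
    for rec in records:
        if rec["type"] == variant_type and float(rec[maf_column]) >= min_maf:
            yield rec


# Made-up records in the assumed layout (not real data from this package).
EXAMPLE = """id position a0 a1 eur.aaf eur.maf type source rsq
rs0000001 10583 G A 0.2100 0.2100 SNP LOWCOV 0.8900
chr1:10611:D 10611 CTG C 0.0100 0.0100 INDEL LOWCOV 0.7000
chr1:13302 13302 C T 0.9000 0.1000 SNP LOWCOV 0.6000
"""

records = list(read_legend(io.StringIO(EXAMPLE)))
common_snps = list(select_variants(records, "eur.maf", 0.05))
print([rec["id"] for rec in common_snps])
print(parse_synthetic_id("chr1:10611:D"))
```

To read a real file, pass the handle returned by open_legend() to read_legend() in place of the in-memory example. This sketch only inspects the legend; it does not filter the matching .hap file and is not a substitute for the -filt_rules_l option.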