README for download package 'ALL_1000G_phase1integrated_v3_impute_macGT1.tgz'

This download package is identical to the one called 'ALL_1000G_phase1integrated_v3_impute.tgz', except that the legend and haplotype files (".legend.gz" and ".hap.gz", respectively) have been restricted to variants with more than one minor allele copy ("macGT1", or "minor allele count greater than 1") across all 1,092 individuals. Minor alleles at singleton variants are generally imputed poorly, yet there are many such variants in sequence-based reference panels, so removing these variants decreases the computational burden of imputation (in this case, by ~20%) without discarding much information.
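The "macGT1" restriction described above can be sketched in a few lines of Python. This is only an illustration of the filtering rule, not the script used to build the package; it assumes each row of a .hap file is a whitespace-separated string of '0'/'1' alleles (one column per haplotype), and the function names and toy rows are made up for the example.

```python
def minor_allele_count(hap_line):
    """Return the minor allele count for one row of an IMPUTE -h file."""
    alleles = hap_line.split()
    zeros = alleles.count("0")
    ones = alleles.count("1")
    # Any symbol other than '0' or '1' is ignored in this sketch.
    return min(zeros, ones)


def passes_mac_gt1(hap_line):
    """True when the variant has more than one copy of its minor allele."""
    return minor_allele_count(hap_line) > 1


# Toy rows with six haplotypes each (made-up data, not from the package).
rows = [
    "0 0 0 1 0 0",  # singleton: '1' is the minor allele, count 1
    "0 1 1 0 0 0",  # minor allele count 2
    "1 1 1 1 1 0",  # singleton: '0' is the minor allele, count 1
]
kept = [row for row in rows if passes_mac_gt1(row)]
```

Here only the second toy row survives, because singletons are dropped regardless of whether the rare allele is coded '0' or '1'.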

The files in this download are based on 1000 Genomes Project sequence data freezes from 23 Nov 2010 (low-coverage whole-genome) and 21 May 2011 (high-coverage exome). This callset includes phased haplotypes for 1,092 individuals. The callset was released Oct 2011, then revised in Feb 2012 and Mar 2012 to remove subsets of variants that were enriched for sequencing artifacts; the current revision is known as "version 3." The haplotypes were inferred from a combination of low-coverage genome sequence data and high-coverage exome sequence data, and they contain SNPs, short indels, and large deletions. The haplotypes were produced by using BEAGLE (Brian and Sharon Browning, University of Washington) to call and phase genotypes from genotype likelihoods, then refining the BEAGLE haplotype estimates with MaCH/Thunder (Yun Li, University of North Carolina, and Goncalo Abecasis, University of Michigan). The original haplotypes can be downloaded as VCF ("variant call format") files from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/.

This download package includes data from the autosomes and chromosome X. The chromosome X files are split into three disjoint regions: the non-pseudoautosomal region (nonPAR) and two pseudoautosomal regions (PAR1 and PAR2). When imputing from the nonPAR reference panel with IMPUTE2, you should make sure to use the -chrX flag and code your study genotypes appropriately; for more details, please see the section about chromosome X imputation on the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/chromosome_X_options.html).

The files in this download do not include sites that were called as monomorphic or as singletons across all 1,092 individuals. The autosomal and sex chromosome variant counts are as follows:

Chromosomes 1-22:
  28,681,763 SNPs
  1,380,133 INDELs
  10,842 SVs

Chromosome X:
  901,830 SNPs (nonPAR); 39,055 SNPs (PAR1+PAR2)
  55,760 INDELs (nonPAR); 3,218 INDELs (PAR1+PAR2)
  377 SVs (nonPAR); 45 SVs (PAR1+PAR2)

This download contains four kinds of files:

1. *impute.hap.gz (one file per autosome, three files for chrX) -- Phased haplotype file in IMPUTE -h format (compressed by gzip software).

2. *impute.legend.gz (one file per autosome, three files for chrX) -- Legend file in IMPUTE -l format (compressed by gzip software; variant positions in NCBI b37 coordinates). The first four columns (variant ID, base pair position, allele labeled '0' in .hap file, and allele labeled '1' in .hap file) are mandatory in this format, and in this download package the '0' allele is always the reference allele. When a variant was not provided with an ID in the original VCF file, we assigned it an ID of the form "chr[chr]:[pos]:[type]", where [chr] and [pos] are the chromosome and position of the variant and [type] is 'I' for an insertion or 'D' for a deletion ([type] is not used for SNPs). The remaining columns provide additional information about each variant. Columns labeled 'xxx.maf' (where 'xxx' is a lowercase group ID) provide the minor allele frequencies within the different ancestral groups in the corresponding haplotype file, while columns labeled 'xxx.aaf' provide the alternate (non-reference) allele frequencies within the same groups.
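As a rough illustration of the legend layout described above, the following Python sketch reads a legend with a header line into dictionaries keyed by column name, and shows how a minor allele frequency relates to an alternate allele frequency for a biallelic variant. It assumes the file is whitespace-delimited with a header row; the column names, IDs, and values in the toy input are made up for the example and should be checked against the actual files.

```python
import io


def parse_legend(handle):
    """Yield one dict per variant, keyed by the names in the header line."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))


def maf_from_aaf(aaf):
    """Minor allele frequency implied by an alternate allele frequency."""
    return min(aaf, 1.0 - aaf)


# Toy legend (made-up IDs, positions, and frequencies; column names assumed).
example = io.StringIO(
    "id position a0 a1 eur.aaf\n"
    "var_a 1000 G A 0.2500\n"
    "chr1:2000:D 2000 CG C 0.9000\n"
)
records = list(parse_legend(example))
eur_maf = [maf_from_aaf(float(r["eur.aaf"])) for r in records]
```

For a real file, the same parser could be given a handle opened with gzip.open(path, "rt") from the Python standard library.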

We provide a single worldwide haplotype file per chromosome (rather than splitting the files by population or group) since IMPUTE2 is designed to work with cosmopolitan reference panels via its -k_hap parameter. One way to reduce the computational cost of imputation is to exclude selected variants on the command line via the -filt_rules_l option. For example, if you are imputing into a European dataset and want to ignore reference variants that are monomorphic in the 1000 Genomes EUR data, you can include " -filt_rules_l 'eur.maf==0' " in your IMPUTE2 command, which will tell the software to ignore variants with values of zero in the 'eur.maf' column of the legend file. For more details about -k_hap and -filt_rules_l, please see the IMPUTE2 website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.html).
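To preview which variants a rule such as 'eur.maf==0' would exclude before running IMPUTE2, a legend file can be scanned directly. The Python sketch below only emulates the effect of that one rule on a few toy legend lines; it is not IMPUTE2's own implementation, and the function name, IDs, and values are hypothetical.

```python
def excluded_by_zero(legend_lines, column="eur.maf"):
    """Return IDs of variants whose value in `column` equals zero."""
    header = legend_lines[0].split()
    col = header.index(column)  # raises ValueError if the column is absent
    excluded = []
    for line in legend_lines[1:]:
        fields = line.split()
        if fields and float(fields[col]) == 0.0:
            excluded.append(fields[0])  # first column is the variant ID
    return excluded


# Toy legend lines (made-up data; column names assumed).
lines = [
    "id position a0 a1 eur.maf",
    "var_a 100 A G 0.0000",
    "var_b 200 C T 0.1200",
    "var_c 300 T TA 0",
]
dropped = excluded_by_zero(lines)
```

In this toy input, var_a and var_c would be ignored during imputation into a European dataset, while var_b would be kept.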

3. genetic_map* (one file per autosome, three files for chrX) -- Genetic map file in IMPUTE -m format (physical positions in NCBI b37 coordinates).

4. *.sample -- Text file with sample, population, group, and sex IDs for the individuals in the haplotype files. The group IDs correspond roughly to the continents of sampling, with occasional exceptions. The sex IDs are '1' for males and '2' for females; these are useful when interpreting data from the non-pseudoautosomal region ('nonPAR') of chromosome X. There are two haplotypes per sample, and the column order of haplotypes matches the row order of sample IDs. The population IDs are defined below, with group IDs in square brackets and sample counts in parentheses.

ASW [AFR] (61) - African Ancestry in Southwest US
CEU [EUR] (85) - Utah residents (CEPH) with Northern and Western European ancestry
CHB [ASN] (97) - Han Chinese in Beijing, China
CHS [ASN] (100) - Han Chinese South
CLM [AMR] (60) - Colombian in Medellin, Colombia
FIN [EUR] (93) - Finnish from Finland
GBR [EUR] (89) - British from England and Scotland
IBS [EUR] (14) - Iberian populations in Spain
JPT [ASN] (89) - Japanese in Tokyo, Japan
LWK [AFR] (97) - Luhya in Webuye, Kenya
MXL [AMR] (66) - Mexican Ancestry in Los Angeles, CA
PUR [AMR] (55) - Puerto Rican in Puerto Rico
TSI [EUR] (98) - Toscani in Italia
YRI [AFR] (88) - Yoruba in Ibadan, Nigeria
-----------
TOTAL [AFR=246, AMR=181, ASN=286, EUR=379] (1092)
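
Because there are two haplotypes per sample and the haplotype columns follow the row order of the .sample file, the columns belonging to a given sample or group can be worked out by index arithmetic. The Python sketch below shows one way to do this, using 0-based indices; the function names and the short toy group list are illustrative assumptions, and reading the group IDs from the .sample file is left out.

```python
def haplotype_columns(sample_row):
    """0-based .hap column indices for the sample at 0-based row `sample_row`."""
    return (2 * sample_row, 2 * sample_row + 1)


def columns_for_group(groups, wanted):
    """Collect .hap column indices for every sample whose group ID matches."""
    columns = []
    for row, group in enumerate(groups):
        if group == wanted:
            columns.extend(haplotype_columns(row))
    return columns


# Toy group IDs in sample-file row order (made up; the real file has 1,092 rows).
groups = ["EUR", "AFR", "EUR"]
eur_columns = columns_for_group(groups, "EUR")

# With 1,092 samples, each .hap row would have 2 * 1092 haplotype columns.
total_haplotypes = 2 * 1092
```

For the toy list, the EUR samples in rows 0 and 2 map to columns 0, 1, 4, and 5.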