Input file options

This table explains the formatting requirements for input data files that can be supplied to IMPUTE2. Some of these files allow more than one ID per SNP, but the program identifies SNPs internally by their base pair positions (which means that duplicate SNPs at a single position can cause problems). In all of these files, it is important that SNPs appear in base pair position order, from lowest to highest. It is also crucial that all SNP positions come from the same genome assembly (e.g., NCBI Build 36) so the program can combine information across input files.
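Because SNPs are identified by position, it can be worth checking a file's position column before running the program. The following is a minimal Python sketch, not part of IMPUTE2: the function name `check_positions` is hypothetical, and it assumes the base pair positions have already been extracted into a list of integers.

```python
def check_positions(positions):
    """Return (out_of_order, duplicates) for a list of base pair positions.

    out_of_order holds (previous, current) pairs where the position decreases;
    duplicates holds positions that repeat on consecutive lines.
    """
    out_of_order = []
    duplicates = []
    for prev, cur in zip(positions, positions[1:]):
        if cur < prev:
            out_of_order.append((prev, cur))
        elif cur == prev:
            duplicates.append(cur)
    return out_of_order, duplicates
```

Both returned lists are empty when the positions are strictly increasing, which is the layout the requirements above call for.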

Flag Default Description
-g <file>
REQUIRED unless -known_haps_g provided
none File containing genotypes for a study cohort that you want to impute or phase. The format of this file is described on our file format webpage and is the same as the output format from our genotype calling program CHIAMO.

If you do not supply a file of unphased genotypes via this argument, you must supply a file of phased study haplotypes via the -known_haps_g option.
-m <file>
REQUIRED
none Fine-scale recombination map for the region to be analyzed. This file should have three columns: physical position (in base pairs), recombination rate between current position and next position in map (in cM/Mb), and genetic map position (in cM). The file should also have a header line with an unbroken character string for each column (e.g., "position COMBINED_rate(cM/Mb) Genetic_Map(cM)").

All of our reference panel download packages come with appropriate recombination map files.
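As an illustration of the -m layout described above, here is a minimal Python sketch that reads a three-column map file with a header line. It is not part of IMPUTE2: the function name `read_map` is hypothetical, the sample values are made up, and skipping blank lines is an assumption of this sketch rather than documented program behavior.

```python
import io


def read_map(handle):
    """Parse a recombination map: a header line, then position, rate (cM/Mb), cM."""
    header = handle.readline().split()
    if len(header) != 3:
        raise ValueError("header must contain exactly three unbroken strings")
    rows = []
    for line_no, line in enumerate(handle, start=2):
        fields = line.split()
        if not fields:
            continue  # tolerate blank lines (an assumption of this sketch)
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        rows.append((int(fields[0]), float(fields[1]), float(fields[2])))
    return header, rows


# Made-up values, for illustration only.
sample = io.StringIO(
    "position COMBINED_rate(cM/Mb) Genetic_Map(cM)\n"
    "10000 2.0 0.00\n"
    "20000 1.0 0.02\n"
)
header, rows = read_map(sample)
```

With a real file, `open(path)` would take the place of the `io.StringIO` object.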
-h <file 1> <file 2>
none File of known haplotypes, with one row per SNP and one column per haplotype. All alleles must be coded as 0 or 1, and each -h file must be accompanied by a corresponding legend file (see -l). We provide formatted haplotypes from the HapMap Project and the 1000 Genomes Project in our reference panel download packages.

In IMPUTE2, it is possible to specify two -h files. In this case, the file with more SNPs should be provided first (in the <file 1> position) and the file with fewer SNPs should be provided second (in the <file 2> position), with a single space separating the file names.
-l <file 1> <file 2>
none Legend file(s) with information about the SNPs in the -h file(s). Each file should have four columns: rsID, physical position (in base pairs), allele 0, and allele 1. The last two columns specify the alleles underlying the 0/1 coding in the corresponding -h file; these alleles can take values in {A,C,G,T}. Each legend file should also have a header line with an unbroken character string for each column (e.g., "rsID position a0 a1"). We provide legend files for data from the HapMap Project and the 1000 Genomes Project in our reference panel download packages.

When using two -h files with IMPUTE2, you must supply the corresponding legend files in the same order -- i.e., the file with more SNPs comes first.
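To make the relationship between a -h file and its legend concrete, the following Python sketch checks that the two agree and translates the 0/1 coding back into letter alleles. It is an illustration only, not IMPUTE2 code: the function names (`read_legend`, `read_haps`, `decode_haps`) are hypothetical, the sample rows are made up, and the requirement that the two files have the same number of SNP rows is inferred from the "one row per SNP" description above.

```python
VALID_ALLELES = {"A", "C", "G", "T"}


def read_legend(lines):
    """Parse legend lines: a header, then rsID, position, allele 0, allele 1."""
    header, *body = [ln.split() for ln in lines if ln.strip()]
    if len(header) != 4:
        raise ValueError("legend header must have four columns")
    snps = []
    for fields in body:
        if len(fields) != 4:
            raise ValueError(f"bad legend row: {fields}")
        rsid, pos, a0, a1 = fields
        if a0 not in VALID_ALLELES or a1 not in VALID_ALLELES:
            raise ValueError(f"{rsid}: alleles must be one of A, C, G, T")
        snps.append((rsid, int(pos), a0, a1))
    return snps


def read_haps(lines):
    """Parse haplotype lines: one row per SNP, one 0/1 column per haplotype."""
    rows = [ln.split() for ln in lines if ln.strip()]
    for row in rows:
        if any(allele not in ("0", "1") for allele in row):
            raise ValueError("haplotype alleles must be coded as 0 or 1")
    if len({len(row) for row in rows}) > 1:
        raise ValueError("every SNP row must have the same number of haplotypes")
    return rows


def decode_haps(legend, haps):
    """Replace each 0/1 code with the letter allele named in the legend."""
    if len(legend) != len(haps):
        raise ValueError("legend and haplotype files must have the same number of SNPs")
    return [
        [a0 if code == "0" else a1 for code in row]
        for (_, _, a0, a1), row in zip(legend, haps)
    ]


# Made-up rows, for illustration only.
legend = read_legend(["rsID position a0 a1", "rs1 100 A G", "rs2 250 C T"])
haps = read_haps(["0 1 1 0", "1 1 0 0"])
decoded = decode_haps(legend, haps)
```

Here `decoded` holds one list of letter alleles per SNP, in the same column order as the haplotype file.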
-g_ref <file>
none File containing unphased genotypes to use as a reference panel for imputation. This file should follow the same format as the -g file. A -g_ref file can be used as the lone reference panel for imputation, or it can be combined with a single -h file to create a two-tiered reference panel (in the latter case, the -g_ref file should contain roughly a subset of the SNPs in the -h file).
-known_haps_g <file>
none File containing known haplotypes for the study cohort. The format is the same as the output format from IMPUTE2's -phase option: five header columns (as in the -g file) followed by two columns (haplotypes) per individual. Allowed values in the haplotype columns are 0, 1, and ?.

If your study dataset is fully phased, you can replace the -g file with a -known_haps_g file. This will cause IMPUTE2 to perform haploid imputation, although it will still report diploid imputation probabilities in the main output file. If any genotypes are missing, they can be marked as '? ?' (two question marks separated by one space) in the input file. (The program does not allow just one allele from a diploid genotype to be missing.) If the reference panels are also phased, IMPUTE2 will perform a single, fast imputation step rather than its standard MCMC module -- this is how the program imputes into pre-phased GWAS haplotypes.

The -known_haps_g file can also be used to specify study genotypes that are "partially" phased, in the sense that some genotypes are phased relative to a fixed reference point while others are not. We anticipate that this will be most useful when trying to phase resequencing data onto a scaffold of known haplotypes. To mark a known genotype as unphased, place an asterisk immediately after each allele, with no space between the allele (0/1) and the asterisk (*); e.g., " 0* 1*" for a heterozygous genotype of unknown phase.
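The allele conventions for a -known_haps_g file can be summarized in a short Python sketch that classifies each individual's genotype on one line as phased, unphased, or missing. This is an illustration, not IMPUTE2 code: the function names are hypothetical, the sample line is made up, the five header columns are passed through without interpretation, and rejecting a genotype with an asterisk on only one allele is an assumption based on the instruction to place an asterisk after each allele.

```python
def parse_allele(token):
    """Split a token such as '0', '1', '0*' or '1*' into (allele, starred)."""
    starred = token.endswith("*")
    core = token[:-1] if starred else token
    if core not in ("0", "1"):
        raise ValueError(f"invalid allele token: {token!r}")
    return int(core), starred


def parse_known_haps_line(line):
    """Return (header_columns, genotypes) for one SNP line.

    Each genotype is ('phased', (a, b)), ('unphased', (a, b)) or ('missing', None).
    """
    fields = line.split()
    header, tokens = fields[:5], fields[5:]
    if len(header) < 5 or len(tokens) % 2 != 0:
        raise ValueError("expected five header columns plus two columns per individual")
    genotypes = []
    for first, second in zip(tokens[0::2], tokens[1::2]):
        if "?" in (first, second):
            # Missing genotypes must be written as '? ?'; a single '?' is not allowed.
            if first != second:
                raise ValueError("only one allele of a genotype is missing")
            genotypes.append(("missing", None))
            continue
        (a, a_star), (b, b_star) = parse_allele(first), parse_allele(second)
        if a_star != b_star:
            # Assumption of this sketch: both alleles carry the asterisk, or neither.
            raise ValueError("asterisk must follow both alleles or neither")
        genotypes.append(("unphased" if a_star else "phased", (a, b)))
    return header, genotypes


# Made-up line: one phased, one missing and one unphased genotype.
header, genotypes = parse_known_haps_line("snp1 rs1 1000 A G 0 1 ? ? 0* 1*")
```

Running a check like this over every line is one way to catch half-missing genotypes before the file reaches the program.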