-----------------------------------
 Calling sequence genotypes with IMPUTE2
-----------------------------------

This document gives a brief introduction to calling genotypes from resequencing data using the IMPUTE2 software.

MAIN ASSUMPTIONS:

1. Before using IMPUTE2, some effort has already been made to remove false variants from the data.

2. Only biallelic variants are included in the input files.


BASIC INPUT FILES:

1. Genotype likelihoods file (command-line argument -g). This file should be formatted according to the specifications described at http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html#Genotype_File_Format. In brief, SNPs are rows and individuals are triples of columns, and all columns are separated by single spaces. There are five header columns: SNP ID 1, SNP ID 2, position, allele coded '0', allele coded '1'. After the header columns, each successive triple of columns (6-8, 9-11, 12-14, etc.) gives the likelihoods of genotypes 0/0, 0/1, and 1/1 (respectively) for an individual -- e.g., column 6 is Pr(G = 0/0) for the first individual.
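
The column layout of the genotype likelihoods file can be illustrated with a short Python sketch. This is not part of IMPUTE2; the names GenRecord and parse_gen_line and the example line are hypothetical and exist only to show how the five header columns and the per-individual triples line up.

```python
from typing import List, NamedTuple, Tuple


class GenRecord(NamedTuple):
    snp_id_1: str
    snp_id_2: str
    position: int
    allele0: str  # allele coded '0' (header column 4)
    allele1: str  # allele coded '1' (header column 5)
    likelihoods: List[Tuple[float, ...]]  # one (0/0, 0/1, 1/1) triple per individual


def parse_gen_line(line: str) -> GenRecord:
    """Split one line of a genotype likelihoods file into header fields and triples."""
    fields = line.split()
    if len(fields) < 5 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 header columns followed by triples of likelihoods")
    values = [float(x) for x in fields[5:]]
    # Columns 6-8 belong to the first individual, 9-11 to the second, and so on.
    triples = [tuple(values[i:i + 3]) for i in range(0, len(values), 3)]
    return GenRecord(fields[0], fields[1], int(fields[2]), fields[3], fields[4], triples)


# Made-up line with two individuals (illustrative values only).
example = "SNP1 rs123 10500123 A G 0.98 0.02 0 0.1 0.8 0.1"
record = parse_gen_line(example)
print(record.position, record.likelihoods[0])
```

The sketch assumes that positions are written as plain integers and that every line carries a complete triple for each individual.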

2. Known haplotypes/genotypes file (command-line argument -known_haps_g). This file uses the same five header columns as the genotype likelihoods file; subsequently, each individual is represented by a pair of columns (one for each haplotype). Acceptable values for the haplotype alleles are '?', '0', '1', '0*', and '1*', with the following meanings:

'?' -- No known haplotype data; default to genotype likelihood data, if present; otherwise, consider this allele entirely missing.

'0' -- Known allele of the type specified in header column 4; phase known.

'1' -- Known allele of the type specified in header column 5; phase known.

'0*' -- Known allele of the type specified in header column 4; phase unknown.

'1*' -- Known allele of the type specified in header column 5; phase unknown.

The program currently assumes that all individuals are diploid, so not every combination of the codes above is allowed for the two alleles of an individual at a SNP. Valid examples for one individual at one SNP are '? ?', '0 1', '1 0', '0* 1*', and '1* 0*'; by contrast, '0 1*' and '1 ?' are not valid.

Neither the SNPs nor the individuals in this file need be the same as those in the genotype likelihoods file: the program lines up the SNPs by their base-pair positions (column 3), and it uses the sample files described below to line up the individuals. For example, if there are some individuals in the likelihoods file that do not have known haplotype information, they can be omitted from the -known_haps_g file as long as the appropriate sample files are provided.

Individuals that are present in the known haplotypes file but not the likelihoods file will be ignored. SNPs that are present in the known haplotypes file but not the likelihoods file will be used for inference; any individuals who are missing data at such SNPs will have their genotypes imputed. Individuals and SNPs that are present in the likelihoods file but not the known haplotypes file will be dealt with in the usual way for uncertain sequence data.
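
As an illustration of the haplotype codes described above, the following Python sketch checks whether a pair of codes is acceptable for one individual at one SNP. It assumes a rule inferred from the examples in this document, namely that both alleles must belong to the same class (both missing, both phased, or both unphased); IMPUTE2's own validation may differ. The function names are hypothetical.

```python
def haplotype_code_class(code: str) -> str:
    """Map a haplotype allele code to its class: missing, phased, or unphased."""
    if code == "?":
        return "missing"
    if code in ("0", "1"):
        return "phased"
    if code in ("0*", "1*"):
        return "unphased"
    raise ValueError(f"unrecognized haplotype allele code: {code!r}")


def is_valid_pair(first: str, second: str) -> bool:
    """Assumed rule: the two alleles of a diploid individual share one class."""
    return haplotype_code_class(first) == haplotype_code_class(second)


# Pairs taken from the examples in this document.
for pair in [("?", "?"), ("0", "1"), ("0*", "1*"), ("0", "1*"), ("1", "?")]:
    print(pair, is_valid_pair(*pair))
```

Under this assumed rule, homozygous pairs such as '0 0' or '1* 1*' are also accepted, although the document does not list them explicitly.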

3. Sample file for genotype likelihoods (command-line argument -sample_g). This file should be formatted according to the specifications described at http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html#Sample_File_Format_. Only the first three columns are required for use with IMPUTE2, and the values in the third column are not important for the task described in this document. The number of individuals in this file (i.e., the number of lines after the two header lines) must equal the number of individuals in the likelihoods file (i.e., (N - 5) / 3, where N is the number of columns in the likelihoods file).

4. Sample file for known haplotypes (command-line argument -sample_known_haps_g). This file should follow the same format described above. The number of individuals in this file (i.e., the number of lines after the two header lines) must equal the number of individuals in the known haplotypes file (i.e., (N - 5) / 2, where N is the number of columns in the known haplotypes file).
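
The two counting rules for sample files, (N - 5) / 3 for the likelihoods file and (N - 5) / 2 for the known haplotypes file, can be checked before a run with a sketch like the one below. The function names and the in-memory example data are hypothetical; in practice the lines would be read from the actual files.

```python
from typing import List


def individuals_in_data_line(line: str, columns_per_individual: int) -> int:
    """Return (N - 5) / columns_per_individual for one line of a data file."""
    extra = len(line.split()) - 5
    if extra < 0 or extra % columns_per_individual != 0:
        raise ValueError("column count does not match the expected layout")
    return extra // columns_per_individual


def individuals_in_sample_file(lines: List[str]) -> int:
    """Count the non-empty lines that follow the two header lines."""
    return sum(1 for line in lines[2:] if line.strip())


# Made-up data for two individuals (illustrative only).
gen_line = "SNP1 rs123 10500123 A G 0.98 0.02 0 0.1 0.8 0.1"
haps_line = "SNP1 rs123 10500123 A G 0 1 0* 1*"
sample_lines = ["ID_1 ID_2 missing", "0 0 0", "ind1 ind1 0", "ind2 ind2 0"]

n_sample = individuals_in_sample_file(sample_lines)
print(individuals_in_data_line(gen_line, 3) == n_sample)
print(individuals_in_data_line(haps_line, 2) == n_sample)
```

The same sample lines are reused for both checks only to keep the example short; the two data files normally have their own sample files and may contain different individuals.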

5. Genetic map file (command-line argument -m). This file specifies the recombination rates that are used to fit the IMPUTE2 model. Files in the appropriate format can be downloaded from here: https://mathgen.stats.ox.ac.uk/wtccc-software/recombination_rates/genetic_map_b36_combined.tgz. (Other genome builds are available upon request.)


BASIC COMMAND-LINE OPTIONS:

1. -int: specifies the analysis interval, which can be used to split whole-chromosome data files into smaller chunks for analysis; takes two arguments (the start and end of the interval, given as base-pair positions)

2. -buffer: specifies the length of the buffer region (in kbp) to add to each side of the analysis interval to prevent edge effects; default = 250 (kbp)

3. -Ne: effective population size, which is used to scale the recombination rates in the IMPUTE2 model; guideline values are 11500 for Europeans, 14000 for East Asians, and 17500 for sub-Saharan Africans; for datasets with a mixture of ancestries, a value of 15000 seems to work well; default = 14000

4. -k: parameter that controls the rigor of the phasing/imputation approximation; accuracy improves with larger k, but computation increases quadratically; default = 80

5. -iter: total number of MCMC iterations to perform (including burn-in; see below), where each iteration updates the haplotypes for every individual in the dataset; default = 30

6. -burnin: number of the -iter iterations to discard as burn-in; default = 10

7. -prob_g: flag that tells the program to treat the values in the -g file probabilistically (i.e., as likelihoods), rather than using the default approach of thresholding them at 0.9

8. -pgs_prob: flag that tells the program to "predict genotyped SNPs" (pgs) for the probabilistic genotype data; this will cause the likelihoods in the input file to be replaced by posterior genotype probabilities in the output file, which has the same format

9. -o: string specifying the name of the main output file, which will contain the genotype posterior probabilities; auxiliary files that add suffixes to this string will also be printed
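
To illustrate how -int can split a chromosome into chunks, the following Python sketch generates consecutive, non-overlapping intervals and prints the matching -int arguments. The function name, the 5 Mb chunk size, and the 12 Mb region are arbitrary assumptions for the example, not recommendations; the sketch also assumes that both interval endpoints are inclusive, which this document does not state.

```python
from typing import List, Tuple


def analysis_intervals(start_bp: int, end_bp: int, chunk_bp: int) -> List[Tuple[int, int]]:
    """Split [start_bp, end_bp] into consecutive chunks of at most chunk_bp positions."""
    if chunk_bp <= 0 or end_bp < start_bp:
        raise ValueError("need a positive chunk size and start_bp <= end_bp")
    intervals = []
    lower = start_bp
    while lower <= end_bp:
        upper = min(lower + chunk_bp - 1, end_bp)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


# Hypothetical 12 Mb region split into 5 Mb chunks.
for lower, upper in analysis_intervals(1, 12_000_000, 5_000_000):
    print(f"-int {lower} {upper}")
```

Each printed pair would be passed to a separate IMPUTE2 run, with -buffer adding flanking sequence on both sides of the chunk.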


WORKING EXAMPLE:

In the directory named Examples/ that came with this download package, you can run the following command-line call to see IMPUTE2 in action:

./impute2 \
  -g pilot1_b36_chr10_ceu_example.gen \
  -sample_g pilot1_b36_ceu_example.sample_list \
  -known_haps_g hapmap3_r2_b36_chr10_ceu_example.known_haps \
  -sample_known_haps_g hapmap3_r2_b36_ceu_example.sample_list \
  -m genetic_map_chr10_combined_b36.txt \
  -int 10.5e6 10.7e6 \
  -buffer 50 \
  -Ne 11500 \
  -iter 10 \
  -burnin 3 \
  -k 30 \
  -prob_g \
  -pgs_prob \
  -o test_run.impute2

This test run will call genotypes in the 200 kb interval [10.5 Mb, 10.7 Mb] on chromosome 10, with a 50 kb buffer region to prevent edge effects. The MCMC parameters (-k, -iter, and -burnin) are somewhat smaller than we would normally use in order to make this example run quickly. The posterior genotype probabilities will appear in the file 'test_run.impute2'. (Note that IMPUTE2 does not currently modify the known haplotypes with sequence data, so the genotypes that were provided in the -known_haps_g file will be unaltered in the output.)
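
The arithmetic behind these settings can be written out in a short Python sketch: the region covered once the buffer is added to each side of the interval, and the number of MCMC iterations that remain after burn-in. The function names are hypothetical, and the sketch only restates the definitions of -int, -buffer, -iter, and -burnin given above; it does not reproduce IMPUTE2's internal handling.

```python
from typing import Tuple


def buffered_region(start_bp: int, end_bp: int, buffer_kb: int) -> Tuple[int, int]:
    """Extend the analysis interval by buffer_kb kilobases on each side."""
    pad = buffer_kb * 1000
    return max(0, start_bp - pad), end_bp + pad


def retained_iterations(total_iterations: int, burnin: int) -> int:
    """Iterations left after discarding burn-in (-iter includes the burn-in)."""
    if burnin >= total_iterations:
        raise ValueError("burn-in must be smaller than the total number of iterations")
    return total_iterations - burnin


# Values from the example command: -int 10.5e6 10.7e6, -buffer 50, -iter 10, -burnin 3.
region = buffered_region(10_500_000, 10_700_000, 50)
kept = retained_iterations(10, 3)
print(region, kept)
```

With the example settings, the buffered region runs from 10.45 Mb to 10.75 Mb and 7 iterations remain after burn-in.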
