Basic options

These options control some basic processing that the program does to prepare input data for inference.

Flag Default Description
-int <lower> <upper> (REQUIRED) none Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The boundaries can be expressed either in long form (e.g., -int 5420000 10420000) or in exponential notation (e.g., -int 5.42e6 10.42e6). This option is particularly useful for restricting test jobs to small regions or splitting whole-chromosome analyses into manageable chunks, as discussed in the section on analyzing whole chromosomes.

IMPUTE2 requires that you specify an analysis interval in order to prevent accidental whole-chromosome analyses. If you want to impute a region larger than 7 Mb (which is not generally recommended), you must activate the -allow_large_regions flag.
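As a rough illustration of splitting a chromosome into chunks, the following Python sketch generates consecutive, non-overlapping -int arguments. The helper names (chunk_intervals, int_args) and the 5 Mb chunk size are assumptions made for this example, not part of IMPUTE2; the only constraint taken from the text above is that each chunk stays below the 7 Mb limit.

```python
def chunk_intervals(start, end, chunk_size=5_000_000):
    """Split [start, end] (base-pair positions, inclusive) into
    consecutive intervals of at most chunk_size base pairs.

    chunk_size is a hypothetical choice for this sketch; it only needs
    to stay under the 7 Mb limit described above.
    """
    if start > end:
        raise ValueError("start must not exceed end")
    intervals = []
    lower = start
    while lower <= end:
        upper = min(lower + chunk_size - 1, end)
        intervals.append((lower, upper))
        lower = upper + 1
    return intervals


def int_args(intervals):
    """Format each interval as an -int argument in long form."""
    return [f"-int {lo} {hi}" for lo, hi in intervals]


if __name__ == "__main__":
    for arg in int_args(chunk_intervals(1, 12_000_000)):
        print(arg)
```

Each printed line can then be appended to an otherwise identical command line, one job per chunk.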
-buffer <int> 250 kb Length of buffer region (in kb) to include on each side of the analysis interval specified by the -int option. SNPs in the buffer regions inform the inference but do not appear in output files (unless you activate the -include_buffer_in_output flag).

Using a buffer region helps prevent imputation quality from deteriorating near the edges of the analysis interval. Larger buffers may improve accuracy for low-frequency variants (since such variants tend to reside on long haplotype backgrounds) at the cost of longer running times.
-allow_large_regions Allows the analysis of regions larger than 7 Mb. If this flag is not activated and the analysis interval plus buffer region exceeds 7 Mb, the program will quit with an error. The rationale for this flag is described here.
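The size check described for -allow_large_regions can be approximated before launching a job. The sketch below is only a pre-flight estimate under two assumptions that the text does not confirm: that the buffer is counted on both sides of the interval, and that the interval length is measured as upper minus lower. The function name is hypothetical and the program's own check may differ in detail.

```python
LIMIT_BP = 7_000_000  # 7 Mb limit quoted in the option descriptions


def needs_allow_large_regions(lower, upper, buffer_kb=250):
    """Estimate whether the interval plus buffers exceeds 7 Mb.

    Assumptions (not confirmed by the documentation): the buffer is
    added on both sides, and interval length is upper - lower.
    """
    span_bp = (upper - lower) + 2 * buffer_kb * 1000
    return span_bp > LIMIT_BP


if __name__ == "__main__":
    # The 5 Mb example interval with default 250 kb buffers
    print(needs_allow_large_regions(5_420_000, 10_420_000))
```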
-include_buffer_in_output Tells the program to include SNPs from the -buffer region in all output files. The main reason for using this option is to preserve the buffer information for downstream imputation, e.g. when pre-phasing a GWAS dataset.
-Ne <int> 20000 "Effective size" of the population (commonly denoted as Ne in the population genetics literature) from which your dataset was sampled. This parameter scales the recombination rates that IMPUTE2 uses to guide its model of linkage disequilibrium patterns. When most imputation runs were conducted with reference panels from HapMap Phase 2, we suggested values of 11418 for imputation from HapMap CEU, 17469 for YRI, and 14269 for CHB+JPT.

Modern imputation analyses typically involve reference panels with greater ancestral diversity, which can make it hard to determine the "ideal" -Ne value for a particular study. Fortunately, we have found that imputation accuracy is highly robust to different -Ne values; within each of several human populations, we have obtained nearly identical accuracy levels for values between 10000 and 25000. We suggest setting -Ne to 20000 in the majority of modern imputation analyses.
-call_thresh <float> 0.9 Threshold for calling genotypes in the -g file. For each individual at each SNP, the program will use the genotype with the maximum probability if that probability exceeds the threshold; otherwise, the genotype will be treated as missing.

NOTE: This threshold applies only to input genotypes. If you want to apply a calling threshold to IMPUTE2's output probabilities, you will have to do it yourself. However, it is usually not a good idea to treat imputation output this way; see the webpage of our association-testing software SNPTEST for better suggestions.
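The thresholding rule for input genotypes can be sketched in a few lines of Python. The function name and the representation of one individual at one SNP as a triple of genotype probabilities are assumptions made for this illustration; the rule itself (use the most probable genotype only if its probability exceeds the threshold, otherwise treat it as missing) follows the description of -call_thresh above.

```python
def call_genotype(probs, threshold=0.9):
    """Return the index (0, 1, or 2) of the most probable genotype if
    its probability strictly exceeds the threshold, else None (missing).

    probs is assumed to be a triple of genotype probabilities for one
    individual at one SNP.
    """
    best = max(range(3), key=lambda i: probs[i])
    return best if probs[best] > threshold else None


if __name__ == "__main__":
    print(call_genotype((0.95, 0.05, 0.0)))  # confident call
    print(call_genotype((0.5, 0.4, 0.1)))    # treated as missing
```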
-nind <int> number of individuals in the -g file Number of individuals from the -g file to include in the analysis. For example, to impute only the first five individuals, set -nind 5. This option is useful for debugging and test runs.
-verbose Print detailed output about the progress of imputation. By default, IMPUTE2 prints only the number of the current MCMC iteration when performing imputation, but this flag tells it to print more detailed updates.