MCMC options

IMPUTE2 uses an MCMC algorithm to integrate over the space of possible phase reconstructions for observed genotypes. The options in this table control the algorithm.

| Flag | Default | Description |
|---|---|---|
| -iter <int> | 30 | Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy slightly, although increasing -k generally leads to greater improvements for a fixed computational cost. |
| -burnin <int> | 10 | Number of MCMC iterations to discard as burn-in. The algorithm samples new haplotypes for unphased individuals during each of the first -burnin iterations, but these iterations do not contribute to the final imputation probabilities. We have found that 10 burn-in iterations are enough to ensure good results in a variety of different datasets. |
| -k <int> | 80 | Number of haplotypes (in the reference or study data) to use as templates when phasing observed genotypes. Increasing this value will lead to higher accuracy at the cost of longer running times, which scale quadratically with -k. The default value should be sufficient for most analyses. |
| -k_hap <int> | 500 | Number of reference haplotypes to use as templates when imputing missing genotypes. As a rule of thumb, you should set -k_hap to the number of reference haplotypes that you expect to be useful for your study population. If this value is less than the total number of haplotypes in your reference panel, IMPUTE2 will choose a "custom" set of -k_hap haplotypes each time it imputes missing alleles in a study haplotype. |

If all of your reference haplotypes have similar ancestry to the subjects in your study, each haplotype is potentially useful for imputation, so the best accuracy can be achieved by setting -k_hap to the total number of reference haplotypes. Using smaller values will decrease the running time linearly while incurring a slight loss of accuracy.
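The scaling statements above (quadratic in -k, linear in -k_hap) can be turned into rough relative-cost estimates. The Python sketch below is only an illustration: it assumes running time is exactly proportional to the square of -k for phasing and to -k_hap for imputation, with everything else held fixed. The function names are hypothetical and are not part of IMPUTE2, and the results are idealized ratios, not measured timings.

```python
def phasing_cost_ratio(k_new: int, k_old: int = 80) -> float:
    """Relative cost of changing -k, ASSUMING time grows as k squared."""
    if k_new <= 0 or k_old <= 0:
        raise ValueError("-k values must be positive")
    return (k_new / k_old) ** 2


def imputation_cost_ratio(k_hap_new: int, k_hap_old: int = 500) -> float:
    """Relative cost of changing -k_hap, ASSUMING time grows linearly."""
    if k_hap_new <= 0 or k_hap_old <= 0:
        raise ValueError("-k_hap values must be positive")
    return k_hap_new / k_hap_old


if __name__ == "__main__":
    # Doubling -k from the default of 80 to 160.
    print(phasing_cost_ratio(160))
    # Halving -k_hap from the default of 500 to 250.
    print(imputation_cost_ratio(250))
```

Under these assumptions, doubling -k multiplies the phasing cost by four, while halving -k_hap halves the imputation cost, which is why reducing -k_hap is the cheaper knob to turn when the reference panel is large.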

Conversely, we now recommend running IMPUTE2 with large reference panels containing haplotypes of diverse ancestry. (For more details, see here.) In this context, our rule of thumb suggests setting -k_hap to be smaller than the total size of the reference panel. Imputation accuracy is robust to different values of -k_hap within a sensible range, so it should usually be sufficient to choose a value by intuition. When in doubt, we suggest that you err on the side of making -k_hap too large, since we often find that diverse reference panels contain more useful haplotypes than one might expect.

As of software version 2.3.0, -k_hap can accept two values when you are imputing from two reference panels -- for example, '-k_hap 500 200'. In this context, the first value is the number of haplotypes to be chosen from Panel 0 and the second value is the number to be chosen from Panel 1. This flexibility can be useful when merging reference panels.
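To show how the four options fit together, here is a minimal Python sketch that assembles them into command-line tokens and reports how many iterations contribute to the final probabilities (-iter minus -burnin, which is 20 with the defaults). The helper names are hypothetical, and the validation rules (for example, requiring -burnin to be smaller than -iter) are assumptions about sensible usage, not a description of checks that IMPUTE2 itself performs.

```python
from typing import List, Sequence


def sampling_iterations(iters: int = 30, burnin: int = 10) -> int:
    """Iterations that contribute to the final imputation probabilities."""
    return iters - burnin


def mcmc_args(iters: int = 30, burnin: int = 10, k: int = 80,
              k_hap: Sequence[int] = (500,)) -> List[str]:
    """Build the MCMC flags as a token list (hypothetical helper)."""
    # Assumption: at least one iteration should remain after burn-in.
    if not 0 <= burnin < iters:
        raise ValueError("-burnin must be non-negative and smaller than -iter")
    if k <= 0:
        raise ValueError("-k must be positive")
    # One value, or two values (Panel 0, Panel 1) as of version 2.3.0.
    if len(k_hap) not in (1, 2) or any(v <= 0 for v in k_hap):
        raise ValueError("-k_hap takes one or two positive values")
    return ["-iter", str(iters), "-burnin", str(burnin),
            "-k", str(k), "-k_hap", *[str(v) for v in k_hap]]


if __name__ == "__main__":
    # Defaults from the table above.
    print(" ".join(mcmc_args()))
    # Two reference panels: 500 haplotypes from Panel 0, 200 from Panel 1.
    print(" ".join(mcmc_args(k_hap=(500, 200))))
    print(sampling_iterations())
```

The returned tokens would be appended to the rest of an IMPUTE2 command line (input files, interval, output path), which this sketch deliberately leaves out.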