There are a few program settings that you should be aware of when using IMPUTE2 with an ancestrally diverse reference panel:
-k_hap -- This parameter determines how many of the reference haplotypes will be used in the "custom" reference panel for each study individual. The default value is 500, which is a good starting point for modern reference datasets.
As a rule of thumb, you should set -k_hap to the number of reference haplotypes that you expect to be useful for your study population. For example, suppose you were imputing a Spanish dataset from a reference panel containing 400 Western European haplotypes and 400 African American haplotypes. In this case, you could achieve high accuracy by leaving -k_hap at the default value of 500 since, in any part of the genome, the expected number of reference haplotypes with European ancestry is roughly 400 + 0.2 * (400) = 480. (This calculation assumes that, on average, African American haplotypes have 20% European ancestry.)
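The arithmetic behind this rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name, the panel layout, and the ancestry labels are hypothetical, and the 20% European ancestry figure is the same assumption used in the example above.

```python
def expected_useful_haplotypes(panel, target_ancestry):
    """Sum, over reference groups, the haplotype count multiplied by the
    assumed average fraction of the target ancestry in that group."""
    return sum(count * fractions.get(target_ancestry, 0.0)
               for count, fractions in panel)


# Hypothetical panel mirroring the Spanish-dataset example in the text.
# Each entry is (number of haplotypes, assumed average ancestry fractions).
panel = [
    (400, {"european": 1.0}),                  # Western European haplotypes
    (400, {"european": 0.2, "african": 0.8}),  # African American haplotypes
]

expected = expected_useful_haplotypes(panel, "european")
print(round(expected))  # 480, which is below the default -k_hap of 500
```

Because accuracy is not highly sensitive to -k_hap, an estimate like this only needs to be roughly right; it is a way to check whether the default of 500 is in a sensible range, not a tuning procedure.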
Imputation accuracy is not highly sensitive to -k_hap, which is why this rule of thumb usually provides good results without requiring detailed parameter tuning. If you want advice on the best value for your dataset, please feel free to contact us.
-Ne -- This parameter controls the effective population size in the population-genetic model used by IMPUTE2. Different human populations have different effective sizes (as estimated from genetic diversity levels), so it is not obvious how to choose a single -Ne value when using a multi-population reference panel.
Fortunately, we have found that IMPUTE2 achieves high accuracy across a wide range of -Ne values, with slightly higher accuracy at large values. We therefore recommend a universal -Ne value of 20000, regardless of the study population being imputed or the composition of the reference panel. This will become the default value in our next software release (v2.1.3), but for now you should set it manually.
-int -- This command-line option specifies the boundaries of the region to be imputed on the current chromosome, using two numbers. For example, "-int 1 5e6" tells IMPUTE2 to analyze physical positions 1-5,000,000.
The imputation interval should not be too large because this weakens IMPUTE2's approximation for choosing custom reference panels, which is based on an assumption of limited recombination in the region being analyzed. In theory, it might be desirable to tailor the interval size to the population being imputed (e.g., to use shorter intervals in African populations), but in practice, we have found that the exact size of the interval has little effect on imputation accuracy as long as the interval is relatively small (say, < 10 Mb). We therefore recommend that the size of the analysis interval be chosen for computational convenience, without regard to the ancestry of the study or reference datasets.
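As an illustration of choosing intervals for computational convenience, the following Python sketch splits a region into consecutive chunks and formats the matching -int arguments. The helper names are hypothetical, the 5 Mb chunk size is just one choice under the 10 Mb guideline, and the sketch assumes that both -int boundaries are inclusive, as the "-int 1 5e6" example suggests.

```python
def split_into_intervals(start, end, max_size=5_000_000):
    """Return consecutive, non-overlapping (lo, hi) intervals covering
    start..end, each spanning at most max_size positions (assumed inclusive)."""
    if start > end or max_size < 1:
        raise ValueError("need start <= end and max_size >= 1")
    intervals = []
    lo = start
    while lo <= end:
        hi = min(lo + max_size - 1, end)
        intervals.append((lo, hi))
        lo = hi + 1
    return intervals


def int_option(lo, hi):
    """Format one interval as the -int option followed by its two numbers."""
    return ["-int", str(lo), str(hi)]


# Hypothetical 12 Mb region split into chunks of at most 5 Mb.
chunks = split_into_intervals(1, 12_000_000)
for lo, hi in chunks:
    print(" ".join(int_option(lo, hi)))
# -int 1 5000000
# -int 5000001 10000000
# -int 10000001 12000000
```

Each printed line would be appended to an otherwise identical IMPUTE2 command, so the chunks can be run independently, for example as separate jobs on a cluster.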