Output file options

The options in this table control the format and naming conventions of output files printed by IMPUTE2.

Flag Default Description
-o <file> ./test.impute2 Name of main output file. Follows the same format as the -g file.
-i <file> [-o]_info Name of SNP-wise information file with one line per SNP and a single header line at the beginning. This file always contains the following columns (header tags shown in parentheses):

1. SNP identifier from -g file (snp_id)
2. rsID (rs_id)
3. base pair position (position)
4. expected frequency of allele coded '1' in the -o file (exp_freq_a1)
5. measure of the observed statistical information associated with the allele frequency estimate (info); see "Details about 'info' metric" below
6. average certainty of best-guess genotypes (certainty)
7. internal "type" assigned to SNP (type)

Depending on the command-line options invoked, there may also be columns labeled info_typeX, concord_typeX, and r2_typeX. IMPUTE2 assigns every SNP an internal "type" which reflects the combination of input datasets that include data for that SNP; here, X gives the type, which takes values in {0,1,2}. You can learn how the program determines SNP types here.

For SNPs that have genotypes in the -g file, concord_typeX is the concordance between the input genotypes and the best-guess imputed genotypes, where the input genotypes at that SNP have been masked internally and then imputed as if the SNP were of type X; similarly, r2_typeX is the squared correlation between input and masked/imputed genotypes at a SNP.

The info_typeX column is the same information metric used in column 5, but here it is applied to genotypes that have been imputed from pseudo-type X SNPs in the leave-one-out masking experiment. These columns are useful for post-hoc quality control; we will soon explain how we use them in our section on Best Practices for Imputation.
-r <file> [-o]_summary Name of log file that records a summary of the screen output.
-w <file> [-o]_warnings Name of file that records warnings generated by IMPUTE2.
-os <int> <int> ... 0 1 2 3 "Output SNPs": specifies the SNP types that will be printed to the output file (SNP labeling is discussed in the Overview). By default, all imputed and genotyped SNPs are included in the output, i.e., "-os 0 1 2 3".
-o_gz Specifies that the main output file should be compressed by the gzip utility; this also applies to some non-standard output files that can become large.
-outdp <int> 3 Specifies the number of decimal places to use for reporting genotype probabilities in the main output file.
-no_snp_qc_info Suppresses printing of info_typeX, concord_typeX, and r2_typeX columns in the -i file.
-no_sample_qc_info Suppresses printing of the per-sample quality control metrics file. The default is to print a file named "[-i]_by_sample".
-phase IMPUTE2 always implicitly phases the study genotypes (-g file), and this flag tells the program to print the best-guess haplotypes that result from the phasing process. In addition to the standard imputation output file, the program also prints a separate haplotype file named "[-o]_haps". This file contains the same five header columns as the standard output, along with two columns (haplotypes) per individual, in the same order they appear in the main output.

In addition to this "best-guess" haplotype file, the program also prints the certainty that each successive pair of heterozygous SNPs is correctly phased. These certainties are written to a file named "[-o]_haps_confidence". In this file, homozygotes are represented by * characters and heterozygotes are represented by numbers between 0.5 and 1.0; each number is the estimated probability that the phasing between the current heterozygote and the previous heterozygote (upstream) is correct. By convention, the first heterozygous SNP in each individual for a given analysis region is assigned a phasing certainty of 1.0.

As illustrated by our example commands, it is possible to use the -phase option to produce haplotypes without the use of a reference panel; i.e., to perform a classical phasing analysis.
-pgs "Predict Genotyped SNPs": Tells the program to replace the input genotypes from the -g file with imputed genotypes in the -o file (applies to Type 2 SNPs only).
-pgs_miss Unlike -pgs, which replaces all input genotypes with imputed genotypes, this option tells the program to replace only the missing genotypes at typed SNPs. That is, any input genotype whose maximum probability exceeds the -call_thresh value will simply be reprinted in the -o file, whereas input genotypes that fall below the calling threshold will be imputed in the output.

WARNING: This is an appealing option that will "fill in" sporadically missing genotypes in your input data. However, it is possible that this could cause subtle problems in downstream association testing. We therefore suggest that you use caution when applying this option.
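
The -pgs_miss rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the table, not IMPUTE2's internal code. The function name is_reprinted is hypothetical, each input genotype is assumed to be a triple of genotype probabilities, and the threshold is passed in explicitly (0.9 below is just an example value).

```python
def is_reprinted(genotype_probs, call_thresh):
    """Return True if an input genotype would be reprinted under -pgs_miss.

    Hypothetical helper: a genotype is treated as "called" when its largest
    probability exceeds the calling threshold; otherwise it counts as missing
    and would be imputed in the output.
    """
    return max(genotype_probs) > call_thresh


# Made-up genotype triples, for illustration only.
examples = [
    (0.0, 1.0, 0.0),  # confidently called heterozygote
    (0.0, 0.0, 0.0),  # assumed encoding of a fully missing genotype
    (0.5, 0.5, 0.0),  # ambiguous call
]
decisions = [is_reprinted(g, call_thresh=0.9) for g in examples]
for probs, keep in zip(examples, decisions):
    print(probs, "reprinted" if keep else "imputed")
```

Under this sketch only the first genotype is kept as-is; the other two fall below the threshold and would be filled in by imputation.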


Details about 'info' metric

IMPUTE2 reports an information metric in the fifth column of its -i file. This metric is similar to the r-squared metrics reported by other programs like MaCH and Beagle. Although each of these metrics is defined differently, they tend to be correlated.

Our metric typically takes values between 0 and 1, where values near 1 indicate that a SNP has been imputed with high certainty. The metric can occasionally take negative values when the imputation is very uncertain, and we automatically assign a value of -1 when the metric is undefined (e.g., because it wasn't calculated).

Investigators often use the info metric to remove poorly imputed SNPs from their association testing results. There is no universal cutoff value for post-imputation SNP filtering; various groups have used cutoffs of 0.3 and 0.5, for example, but the right threshold for your analysis may differ. One way to assess different info thresholds is to see whether they produce sensible Q-Q plots, although we emphasize that Q-Q plots can look bad for many reasons besides your post-imputation filtering scheme.
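
As a rough sketch of such a filter, the Python below reads a -i file by its header tags and keeps SNPs whose info value meets a chosen cutoff. It assumes the file is whitespace-delimited, and it treats the value -1 as "undefined" and excludes it. The helper names read_info_rows and passes_info_filter are hypothetical, the example rows are made up, and the 0.3 cutoff is only one of the values mentioned above, not a recommendation.

```python
import io


def read_info_rows(handle):
    """Yield one dict per SNP, keyed by the header tags of the -i file."""
    header = handle.readline().split()
    for line in handle:
        fields = line.split()
        if fields:  # skip blank lines
            yield dict(zip(header, fields))


def passes_info_filter(row, min_info):
    """Keep a SNP only if its info metric is defined and at least min_info."""
    info = float(row["info"])
    if info == -1.0:  # -1 marks an undefined metric
        return False
    return info >= min_info


# Made-up file contents, for illustration only.
example = io.StringIO(
    "snp_id rs_id position exp_freq_a1 info certainty type\n"
    "--- rs1 1000 0.120 0.950 0.990 0\n"
    "--- rs2 2000 0.030 0.210 0.870 0\n"
    "--- rs3 3000 0.000 -1 1.000 0\n"
    "--- rs4 4000 0.450 0.300 0.910 2\n"
)
kept = [row["rs_id"] for row in read_info_rows(example)
        if passes_info_filter(row, min_info=0.3)]
print(kept)
```

To apply the same filter to a real file, open it with open(path) in place of the io.StringIO object; the cutoff should be chosen for your own analysis.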

We define our info metric and compare it against other metrics in a review paper that we recently published. If you have questions, please read that material first, then contact us if anything is still unclear.