Filtering options


The options in this table affect the way that the program filters the input data. Some of the options provide direct control over which samples and SNPs get included in the analysis, while others set rules for how the program should behave when faced with certain filtering choices. These options are designed to make filtering more flexible, so that it is easy to apply any desired set of filters to a single underlying genotype file.

Some of these options apply to the dataset as a whole while others apply only to specific panels. The flag name for each panel-specific option ends in the command-line symbol for the file on which it operates; e.g., to exclude SNPs from the -g file you should use -exclude_snps_g, and to exclude SNPs from the -g_ref file you should use -exclude_snps_g_ref.

Flag Default Description
-filt_rules_l <str> <str> ... none This option provides flexible variant filtering in the reference panel via "filter rules", which are based on annotation columns in a -l file. Each column should be labeled by a contiguous string (no whitespace) describing its contents. For example, the Example/ directory in the software download packages includes a file named example.chr22.1kG.annot.legend that contains columns named eur.maf and afr.maf.

To filter variants based on the numeric annotation values in the -l file, you should combine a column string with a cutoff value and one of these six comparison operators: < <= > >= == != . For example, writing -filt_rules_l 'eur.maf<0.05' on the command line would tell the program to remove any variants with eur.maf values less than 0.05 from the reference panel. You can include an arbitrary number of filtering strings after the -filt_rules_l option, in which case the filtering conditions will be applied in 'or' fashion: if any condition is true, the variant will be removed.

It is very important that you enclose each filtering string in single quotes, as shown above. Otherwise, the shell may interpret symbols like < and > as redirection operators. There should be no whitespace within the single quotes.
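As a rough illustration of the rule syntax and the 'or' behavior described above, here is a minimal Python sketch. It is not the program's own code: the function names parse_rule and is_removed are hypothetical, and the sketch assumes that each variant's annotations are already available as a dictionary of numeric values keyed by column name.

```python
import operator
import re

# The six comparison operators listed above, mapped to Python functions.
OPS = {
    "<=": operator.le,
    ">=": operator.ge,
    "==": operator.eq,
    "!=": operator.ne,
    "<": operator.lt,
    ">": operator.gt,
}

# Two-character operators come first so that "<=" is not read as "<".
RULE_RE = re.compile(r"^(\S+?)(<=|>=|==|!=|<|>)(\S+)$")


def parse_rule(rule):
    """Split a rule such as 'eur.maf<0.05' into (column, comparison, cutoff)."""
    match = RULE_RE.match(rule)
    if match is None:
        raise ValueError(f"malformed filter rule: {rule!r}")
    column, symbol, cutoff = match.groups()
    return column, OPS[symbol], float(cutoff)


def is_removed(annotations, rules):
    """Return True if ANY rule holds for this variant ('or' semantics)."""
    return any(
        compare(annotations[column], cutoff)
        for column, compare, cutoff in map(parse_rule, rules)
    )
```

Under these assumptions, a variant with eur.maf of 0.10 and afr.maf of 0.01 would be removed by the pair of rules 'eur.maf<0.05' 'afr.maf<0.05', because the second condition is true even though the first is not.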

You can develop annotations yourself and add them to the -l file, or you can use the annotations that we provide in some of our reference download packages. For example, we have included continent-level minor allele frequencies in the legend files for the 1000 Genomes Phase 1 integrated variant reference panel.

For an illustration of using -filt_rules_l in practice, see this example command.
-exclude_snps_g <file> none List of SNPs to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g file), their rsIDs (second column of -g file), or their base pair positions (third column of -g file). Excluded SNPs will be treated as if they had not been present in the genotypes file, and they will not be shown in the output unless you use the -impute_excluded option.
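The matching behavior described for the exclusion list can be sketched in Python as follows. This is only an illustration of the stated rule (a SNP is excluded if its SNP ID, rsID, or base pair position appears in the list); the function names parse_id_list and keep_line are hypothetical, and comparing positions as plain strings is an assumption of the sketch rather than a documented detail.

```python
def parse_id_list(lines):
    """Collect a single column of identifiers into a set, skipping blank lines."""
    return {line.strip() for line in lines if line.strip()}


def keep_line(genotype_line, excluded):
    """Return False if any of the first three columns is in the exclusion set.

    The first three columns are taken to be SNP ID, rsID, and base pair
    position, as described for the -g file above.
    """
    snp_id, rsid, position = genotype_line.split()[:3]
    return not ({snp_id, rsid, position} & excluded)
```

For example, with a list containing rs123 and 20500, a genotype line beginning "SNP7 rs999 20500" would be dropped because its position matches, while a line beginning "SNP8 rs555 30000" would be kept.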
-exclude_snps_g_ref <file> none Same as -exclude_snps_g, but applies to the -g_ref file.
-impute_excluded Specifies that SNPs excluded from the study dataset via the -exclude_snps_g option should be imputed and included in the output file. When this flag is not activated, excluded SNPs are simply ignored.
-include_snps <file> none List of reference-panel-only SNPs to impute. If you do not want the program to impute all of the reference SNPs in the region you are analyzing, you can use this list to specify a subset of SNPs to impute; all other SNPs will be ignored unless they have data in the -g file. The list should take the form of a single column of identifiers in a text file. The SNPs can be identified by their SNP IDs (first column of -g_ref file), their rsIDs (second column of -g_ref file or first column of -l file), or their base pair positions (third column of -g_ref file or second column of -l file).

This option does not have any effect on SNPs in the -g file.
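To make the selection rule concrete, the following Python sketch shows one way to read it: a reference-panel SNP is retained if it appears in the inclusion list or if it has data in the -g file, and is ignored otherwise. The function name snps_to_analyze is hypothetical, and the sketch assumes that SNPs are compared by a single shared identifier.

```python
def snps_to_analyze(reference_snps, study_snps, include_list):
    """Return the reference SNPs that are retained under an inclusion list.

    reference_snps: identifiers in the reference panel, in order
    study_snps:     identifiers that have data in the -g file
    include_list:   identifiers read from the -include_snps file
    """
    study = set(study_snps)
    wanted = set(include_list)
    # SNPs with study data are unaffected by the list; reference-only
    # SNPs are retained only when they appear in the list.
    return [snp for snp in reference_snps if snp in study or snp in wanted]
```

With reference SNPs rs1 through rs4, study data for rs2 only, and an inclusion list containing rs3, this sketch retains rs2 and rs3 and ignores rs1 and rs4.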
-sample_g <file> none File of sample IDs for the individuals in the -g file; should follow the format described here. Only the first two columns are necessary, but they must be present and labeled "ID_1" and "ID_2".

NOTE: Currently, the only reason to provide a sample file is if you want to exclude some individuals via the -exclude_samples_g option, or if you are analyzing chromosome X data via the -chrX option.
-sample_g_ref <file> none Same as -sample_g, but applies to the -g_ref file.
-exclude_samples_g <file> none List of samples to exclude from the -g file. The list should take the form of a single column of identifiers in a text file. The samples can be identified by the IDs in either of the first two columns of the -sample_g file, which is REQUIRED if you want to use this option. Excluded samples will be treated as if they had not been present in the genotypes file, and the program will re-print the original sample list, minus the excluded samples, to a file named "[-o]_samples", where -o is the name of the main output file.
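The sample exclusion behavior can be sketched in Python as shown below. This is a simplified illustration, not the program's implementation: the function names filter_samples and samples_output_name are hypothetical, and the sketch only checks the two column labels mentioned for the -sample_g file, passing every other line through unless one of its first two columns matches the exclusion list.

```python
def filter_samples(sample_lines, excluded):
    """Drop samples whose ID_1 or ID_2 value is in the exclusion set.

    The header line is checked for the required labels and always kept.
    """
    lines = [line.rstrip("\n") for line in sample_lines if line.strip()]
    if not lines or lines[0].split()[:2] != ["ID_1", "ID_2"]:
        raise ValueError('first two columns must be labeled "ID_1" and "ID_2"')
    kept = [lines[0]]
    for line in lines[1:]:
        id_1, id_2 = line.split()[:2]
        if id_1 not in excluded and id_2 not in excluded:
            kept.append(line)
    return kept


def samples_output_name(output_file):
    """Build the "[-o]_samples" file name from the main output file name."""
    return output_file + "_samples"
```

For instance, if the main output file were named study.chr22.impute2 (a made-up name), the filtered sample list in this sketch would be written to study.chr22.impute2_samples.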

NOTE: Part of the IMPUTE2 algorithm involves pooling information across the individuals in your study dataset. Samples with systematically aberrant genotypes (due, e.g., to degraded assay DNA) can confuse this part of the model; you should take care to identify such samples ahead of time and exclude them either manually or with this option.
-exclude_samples_g_ref <file> none Same as -exclude_samples_g, but applies to the -g_ref file. One difference is that the program will not print a filtered list of -g_ref samples like the one that gets printed with -exclude_samples_g.