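SNPTEST is a program for the analysis of single SNP association in genome-wide studies. The tests implemented include frequentist and Bayesian tests of association for binary (case-control) and quantitative phenotypes.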
The program is designed to work seamlessly with the output of our genotype imputation software IMPUTE [1] and the programs QCTOOL and GTOOL. This program was used in the analysis of the 7 genome-wide association studies carried out by the Wellcome Trust Case Control Consortium (WTCCC) [2]. Much of the theory behind the implemented tests is described in this paper [3].
SNPTEST's features are illustrated below through a number of examples that use the datasets provided with the software in the directory example/. These files contain data at 200 SNPs on 1000 individuals, split into a control cohort and a case cohort. The datasets can be used to try out the tests using both binary (case-control) and quantitative phenotypes.
The latest version of SNPTEST is v2.5.2. Changes in this release include bug fixes and enhancements as documented here. To get started, download a prebuilt binary for your platform from the download page and run an example command.
To contact us, please use the OXSTATGEN mailing list; see here for details.
The following people contributed to the design and development of SNPTEST:
A new set of model-fitting code, activated using method newml, has been developed for case/control phenotypes in SNPTEST v2.5. It behaves broadly like method ml, but supports new features:
Note: method newml currently only supports frequentist additive model tests (frequentist 1).
SNPTEST is free to use for academic purposes only. Please see the LICENCE file, which is also included with the package.
Pre-compiled versions of the program and example files can be downloaded from the links below. For Linux, you should use the dynamically linked version unless you run into trouble. On some systems, library incompatibilities cause problems, so we have provided two statically linked versions as well. If you have any problems getting the program to work on your machine, please contact us.
Please fill out the registration form to receive emails about updates to this software.
To unpack the files, use a command like
This will create a folder called snptest_v2.5.1_linux_x86_64_dynamic/ containing an executable snptest_v2.5.1 and an example/ directory containing the example files. To see a list of options available in SNPTEST, cd into the directory and type
Version  File
v2.5.2 Mac OS X  snptest_v2.5.2_MacOSX_x86_64.tgz^{*}
v2.5.2 Ubuntu 12.04 (x86-64)  snptest_v2.5.2_linux_x86_64_dynamic.tgz^{*} snptest_v2.5.2_linux_x86_64_static.tgz
v2.5.2 CentOS6.5 (x86-64)  snptest_v2.5.2_CentOS6.5_x86_64_dynamic.tgz^{*} snptest_v2.5.2_CentOS6.5_x86_64_static.tgz
v2.5.2 CentOS5 (x86-64)  snptest_v2.5.2_CentOS5_x86_64_dynamic.tgz^{*} snptest_v2.5.2_CentOS5_x86_64_static.tgz
v2.5.1 Mac OS X  snptest_v2.5.1_MacOSX_x86_64.tgz^{*}
v2.5.1 Ubuntu 12.04 (x86-64)  snptest_v2.5.1_linux_x86_64_dynamic.tgz^{*} snptest_v2.5.1_linux_x86_64_static.tgz
v2.5.1 CentOS6.5 (x86-64)  snptest_v2.5.1_CentOS6.5_x86_64_dynamic.tgz^{*} snptest_v2.5.1_CentOS6.5_x86_64_static.tgz
v2.5.1 CentOS5 (x86-64)  snptest_v2.5.1_CentOS5_x86_64_dynamic.tgz^{*} snptest_v2.5.1_CentOS5_x86_64_static.tgz
v2.5 Linux (x86_64)  snptest_v2.5_linux_x86_64_dynamic.tgz snptest_v2.5_linux_x86_64_static.tgz snptest_v2.5_scientificlinux_x86_64_static.tgz
v2.5 Mac OS X  snptest_v2.5_MacOSX_x86_64.tgz
v2.4.1 Linux (x86_64) (statically linked)  snptest_v2.4.1_Linux_x86_64_static.tgz snptest_v2.4.1_Linux_x86_64_static2.tgz
v2.4.1 Linux (x86_64) (dynamically linked)  snptest_v2.4.1_Linux_x86_64.tgz
v2.4.1 Linux (i686)  snptest_v2.4.1_Linux_i686_dynamic.tgz snptest_v2.4.1_Linux_i686_static.tgz
v2.4.1 Mac OS X 10.4-10.7.3 Intel  snptest_v2.4.1_MacOSX_Intel.tgz
SNPTEST allows the analysis of multiple cohorts of individuals. The data for each cohort is stored in two files. The first file (the genotype file) stores the genotype data for the cohort. The second file (the sample file) stores the IDs and associated covariate and phenotype information of the individuals in the cohort. For the example datasets included with the software, the sample and genotype files for each cohort have the suffixes .sample and .gen respectively. The file format is described on the FILE FORMAT WEBPAGE.
When using multiple cohorts SNPTEST assumes that
Several file formats are supported:
GEN format will be used if the filename extension is .gen or .gen.gz, or if the extension is otherwise unrecognised. The format is described on the FILE FORMAT WEBPAGE. In addition, SNPTEST v2.3.0 and above support GEN files with an additional column containing chromosome information; this column must be the first column in the file.
BGEN (binary GEN) format will be used if the filename extension is .bgen. BGEN files are designed to have a file size similar to or smaller than gzipped GEN files, but to support faster loading and seeking of individual SNPs. More information on using BGEN files and on converting GEN files to BGEN files can be found on the BGEN file format website and the QCTOOL website. Support for the BGEN format was added in v2.2.0.
As of v2.5.1, SNPTEST supports PLINK binary format (BED) files, described here and here. (SNPTEST only understands the SNP-major version of these files, which begins with the three bytes 0x6c, 0x1b and 0x01, not the sample-major version. Most BED files are in SNP-major format.) A few points to note are:
VCF format (version 4.0, 4.1 or 4.2) will be assumed if the filename extension is .vcf or .vcf.gz. VCF is more complicated than GEN format and there are a few points to bear in mind.
Sample files must be in the format described on the FILE FORMAT WEBPAGE. However, SNPTEST supports arbitrary (non-whitespace) string values in discrete covariate columns (of type "D"). These are mapped internally to covariate levels. The default missing value for samples is now the two-character string "NA".
In SNPTEST v2.5 a few changes have been made to the output file format, described below.
Metadata reflecting the options used is now written to the top of the file protected by a '#' comment character. For example, here is the metadata from the output for an example command:
# Analysis: "SNPTEST analysis, started 2013-05-21 15:38:16"
# started: 2013-05-21 15:38:16
#
# Analysis properties:
# data cohort1.gen cohort1.sample (user-supplied)
# frequentist 1 (user-supplied)
# log /tmp/log (user-supplied)
# method newml (user-supplied)
# o /tmp/snptest.out (user-supplied)
# pheno bin2 (user-supplied)
We have found this feature useful in keeping track of different analyses run using SNPTEST. (You can give the analysis a different name using the analysis_name option.)
SNPTEST v2.5 and above support comma-separated and tab-separated files in addition to the default space-separated files. The output format is chosen based on the filename extension: .csv for comma-separated files, .tsv for tab-separated files, and anything else for space-separated files.
It's also possible to write gzipped output files: add the .gz extension to the filename to get this behaviour.
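As a rough illustration of the conventions above, the following Python sketch reads a results file back in. It picks the delimiter from the filename extension, handles a trailing .gz, and separates the '#' metadata lines from the data rows. The helper name read_snptest_output is hypothetical and is not part of SNPTEST; the sketch assumes only what is described above (a header line followed by one line per SNP).

```python
import csv
import gzip


def read_snptest_output(path):
    """Return (metadata_lines, rows) from a results file (hypothetical helper)."""
    # A trailing .gz means the file is gzip-compressed.
    is_gz = path.endswith(".gz")
    opener = gzip.open if is_gz else open
    base = path[:-3] if is_gz else path

    metadata, data_lines = [], []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith("#"):
                metadata.append(line)      # metadata protected by '#'
            elif line.strip():
                data_lines.append(line)    # header line and per-SNP rows

    if not data_lines:
        return metadata, []
    if base.endswith(".csv"):
        rows = list(csv.DictReader(data_lines))
    elif base.endswith(".tsv"):
        rows = list(csv.DictReader(data_lines, delimiter="\t"))
    else:
        # Default: space-separated columns.
        header = data_lines[0].split()
        rows = [dict(zip(header, l.split())) for l in data_lines[1:]]
    return metadata, rows
```

All values are returned as strings; converting columns such as p-values to numbers is left to the caller, since "NA" entries need to be handled first.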
SNPTEST v2.5 and above support output to a database instead of a flat file using the odb option. Currently the SQLite embedded database is supported. (SQLite databases are entirely contained in a single file, and don't require the use of special server software.) For example, the command
A major motivation for this feature is that large flat files like the ones SNPTEST outputs can be difficult to work with: in particular, rows are not indexed, and the large number of columns can make viewing particular fields awkward. The snptest.sqlite database above has indices, which make it easy to find data by position or rsid, and queries can be adjusted to select the desired columns.
A rough guide to the database schema produced by the above example command is as follows.
Table or view  Description 

Variant  Stores a list of variants (SNPs and indels) used by the analysis. Variants are considered the same if they have the same chromosome, position and alleles. (Where a variant has several identifiers, these are stored in the VariantIdentifier table.) 
TestAnalysis  This table contains the main analysis results and has one column for each variable SNPTEST computes. 
TestAnalysisView  This is a convenience view which links the Variant and TestAnalysis tables. This view closely resembles the results of a traditional flat file output. 
AnalysisView  A view which shows analyses that have been stored in the database. 
EntityDataView  A view of metadata about analyses, analogous to the metadata example above. 
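The schema above can be explored from Python's standard sqlite3 module. The sketch below lists the tables and views in a database file and fetches the first few rows of TestAnalysisView. The function names are hypothetical, and because the exact column names depend on the analysis that was run, the sketch selects all columns rather than assuming any.

```python
import sqlite3


def list_tables_and_views(db_path):
    """Return (type, name) pairs for every table and view in the file."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT type, name FROM sqlite_master "
            "WHERE type IN ('table', 'view') ORDER BY name"
        )
        return cur.fetchall()
    finally:
        conn.close()


def head_of_view(db_path, view="TestAnalysisView", limit=5):
    """Return the first few rows of a view as dictionaries keyed by column."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        # Identifiers cannot be bound as parameters, so quote the name.
        query = 'SELECT * FROM "{}" LIMIT ?'.format(view.replace('"', '""'))
        return [dict(row) for row in conn.execute(query, (limit,))]
    finally:
        conn.close()
```

Calling list_tables_and_views("snptest.sqlite") first is a cheap way to confirm which of the tables and views described above are actually present before writing more specific queries.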
There are a few things to bear in mind when outputting to a database.
The simplest use of SNPTEST is to calculate data summaries for each SNP, i.e. genotype counts, allele frequencies, SNP missing data proportions and odds ratios. This is specified using the summary_stats_only option.
NOTE: within each command box below, most lines end with the '\' character. This is not actually part of the command; it is shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split each example command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window (so, for example, you should be able to paste these commands directly into the terminal and hit 'enter' to make them run), but it would be equivalent to put all of the arguments on a single line, separated by spaces.
For example, the command
produces a file ./example/ex.out which contains the data summaries for all 200 SNPs across the two cohorts. Note how the cohorts are specified by placing the relevant genotype and sample files after the data option in the command. For each cohort, the name of the genotype file should be followed by its associated sample file. At most 18 cohorts can be specified.
The o option specifies the output file, i.e. ./example/ex.out. This file contains a line for each SNP, and there is a header line which specifies the contents of each column.
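The ordering rules just described (genotype file followed by its sample file for each cohort, then the output file) can be sketched as a small Python helper that assembles an argument list. Several things here are assumptions rather than facts from the text: the helper name is hypothetical, the leading dash on each option reflects how the options are assumed to be spelled on the command line, the executable name is taken from the unpacking instructions, and the cohort2 file names are guessed by analogy with cohort1.

```python
import shlex


def summary_stats_command(executable, cohorts, out_path):
    """Build an argument list for a summary-statistics run (hypothetical helper).

    cohorts is a list of (genotype_file, sample_file) pairs, one per cohort.
    """
    if not 1 <= len(cohorts) <= 18:
        raise ValueError("between 1 and 18 cohorts can be specified")
    args = [executable, "-data"]            # assumed spelling: dash + option name
    for gen_file, sample_file in cohorts:
        args += [gen_file, sample_file]     # genotype file, then its sample file
    args += ["-o", out_path, "-summary_stats_only"]
    return args


cmd = summary_stats_command(
    "./snptest_v2.5.1",
    [
        ("./example/cohort1.gen", "./example/cohort1.sample"),
        ("./example/cohort2.gen", "./example/cohort2.sample"),  # assumed names
    ],
    "./example/ex.out",
)
print(" ".join(shlex.quote(a) for a in cmd))
```

The resulting list can be passed to subprocess.run once the spelling of the options has been checked against the program's own help output.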
The following table gives a description of each of the entries in the output file.
id 
SNP ID (taken from input files) 
rsid 
RS ID of the SNP (taken from input files) 
chromosome 
A 2-letter chromosome identifier (if SNPTEST can determine it) or the value NA. See the section on chromosomes. 
pos 
Base pair position of the SNP 
allele_A allele_B 
The two alleles at the SNP. allele_A is coded 0 and allele_B is coded 1. 
average_maximum_posterior_call 
The average maximum posterior probability across all individuals in the sample that are used for the test at each SNP. This is a measure of how much uncertainty there is at each SNP. Samples excluded will be (a) those excluded using the exclude_samples option, (b) samples with a missing phenotype or covariate relevant to the test, (c) samples without genotypes if the method threshold option is used, (d) samples where the sum of the genotype probabilities is less than 0.1. 
info 
A measure of the observed statistical information for the estimate of allele frequency of the SNP using all individuals in the sample that are used for the test at each SNP. This measure has a maximum value of 1, which indicates perfect information. Samples excluded will be (a) those excluded using the exclude_samples option, (b) samples with a missing phenotype or covariate relevant to the test, (c) samples without genotypes if the method threshold option is used, (d) samples where the sum of the genotype probabilities is less than the value set by the option total_prob_limit (default 0.1). 
cohort_1_AA cohort_1_AB cohort_1_BB cohort_1_NULL 
Counts of AA, AB, BB and NULL genotypes in the 1st cohort. See Note below which details exactly how genotype counts are calculated in SNPTEST v2. 
cohort_2_AA cohort_2_AB cohort_2_BB cohort_2_NULL  Counts of AA, AB, BB and NULL genotypes for the 2nd cohort (see details above). Subsequent cohorts will be included in a similar way. See Note below which details exactly how genotype counts are calculated in SNPTEST v2. 
all_AA all_AB all_BB all_NULL all_total  Counts of AA, AB, BB and NULL thresholded genotypes, as well as the total number of samples considered, across all cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2. 
all_maf 
Minor allele frequency (MAF) combined across all cohorts. 
missing_data_proportion 
The proportion of missing data across all cohorts. 
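To make the average_maximum_posterior_call entry concrete, here is a minimal Python sketch that averages the largest genotype probability over the samples that pass the probability-sum limit. It covers only exclusion rule (d) from the table; the function name is hypothetical, and whether SNPTEST applies any renormalisation before taking the maximum is not stated here, so none is applied.

```python
def average_maximum_posterior_call(genotype_probs, total_prob_limit=0.1):
    """Mean over samples of max(P(AA), P(AB), P(BB)).

    genotype_probs is a list of (P(AA), P(AB), P(BB)) tuples, one per sample.
    Samples whose probabilities sum to less than total_prob_limit are skipped.
    """
    maxima = [max(p) for p in genotype_probs if sum(p) >= total_prob_limit]
    if not maxima:
        return float("nan")  # no usable samples at this SNP
    return sum(maxima) / len(maxima)


probs = [
    (1.0, 0.0, 0.0),   # confidently called AA
    (0.1, 0.8, 0.1),   # uncertain, most likely AB
    (0.0, 0.0, 0.0),   # missing genotype, excluded by the limit
]
print(average_maximum_posterior_call(probs))
```

Values close to 1 indicate that most samples have one genotype with nearly all of the probability mass.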
If a test for a binary phenotype is being carried out then the following additional fields are included:
controls_AA controls_AB controls_BB controls_NULL  Counts of AA, AB, BB and NULL genotypes across all control cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2. 
cases_AA cases_AB cases_BB cases_NULL  Counts of AA, AB, BB and NULL genotypes across all case cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2. 
cases_maf controls_maf 
Minor allele frequencies (MAF) in the controls and cases across all cohorts. 
het_OR het_OR_lower het_OR_upper 
Estimated odds ratios and lower and upper 95% confidence limits for the heterozygote genotype AB versus the (baseline) AA genotype. 
hom_OR hom_OR_lower hom_OR_upper 
Estimated odds ratios and lower and upper 95% confidence limits for the homozygote genotype BB versus the (baseline) AA genotype. 
all_OR all_OR_lower all_OR_upper 
Estimated allelic odds ratios and lower and upper 95% confidence limits for the B allele versus the (baseline) A allele. 
NOTE : Odds ratios and their confidence limits are set to NA if they cannot be calculated.
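The sketch below shows one standard way to compute such odds ratios and 95% limits from genotype counts, using the Woolf (log-scale) interval. This is an assumption for illustration: the text does not say which interval SNPTEST itself uses, and the function names are hypothetical. Returning None when a cell count is zero mirrors the NA convention in the note above.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.959964):
    """Odds ratio (a*d)/(b*c) with a Woolf (log-scale) 95% interval.

    a, b: exposed / baseline counts in cases; c, d: the same in controls.
    Returns (estimate, lower, upper), or None if any count is zero.
    """
    if min(a, b, c, d) <= 0:
        return None
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


def het_odds_ratio(cases, controls):
    """AB versus baseline AA; cases/controls are (AA, AB, BB) counts."""
    return odds_ratio_with_ci(cases[1], cases[0], controls[1], controls[0])


def allelic_odds_ratio(cases, controls):
    """B allele versus baseline A allele, counting two alleles per sample."""
    case_a, case_b = 2 * cases[0] + cases[1], cases[1] + 2 * cases[2]
    ctrl_a, ctrl_b = 2 * controls[0] + controls[1], controls[1] + 2 * controls[2]
    return odds_ratio_with_ci(case_b, case_a, ctrl_b, ctrl_a)
```

For example, allelic_odds_ratio((25, 50, 25), (49, 42, 9)) compares 100 B and 100 A alleles in cases with 60 B and 140 A alleles in controls.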
See the section on frequentist tests for association for further columns that are output when performing association tests.
SNPTEST tries to include the 'right' set of samples in computation of genotype counts, NULL call counts, allele frequencies and info measures. To avoid confusion the rules SNPTEST uses to determine samples to include are as follows:
NOTE (1): the behaviour of NULL call counts has changed in v2.5. In previous versions, NULL call counts would only reflect samples that had high enough genotype probability to be included in the association test (i.e. those passing the limit set by total_prob_limit, default 0.1) but whose genotype call probabilities summed to less than one. In v2.5, NULL call counts additionally include all those samples that have non-missing phenotype (and, where relevant, non-missing covariates) but have missing genotypes or genotype probabilities too low to be included in the analysis.
NOTE (2): prior to v2.4, NULL call counts would in addition reflect samples whose phenotype and/or covariate information was missing.
You should notice that SNPTEST produces some screen output when run. Information about which data files were specified, the tests selected, the numbers of SNPs, the total number of cases and the total number of controls, information about the covariates and phenotypes in the sample files and information about individuals and SNPs selected for exclusion is all written to the screen. Also, information about the progress of the program is written to the screen. Warning and/or error messages may also be shown. Incorrect use of the options or input files with the wrong format may cause the program to terminate. The screen output can be used to identify any problems that lead to the termination. The flag printids can be used to print the SNP IDs of each SNP as it is processed which can be useful to identify where problems occur.
For example, the command
Welcome to SNPTEST
© University of Oxford 2008-2013
https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html
Read LICENCE file for conditions of use.
==============
Data Files :
gen files : cohort1.gen
sample files : cohort1.sample
Tests :
frequentist : 1
method newml
reading sample exclusion lists
Inspecting data (this may take some time)...
Sample and exclusions summary :
Number of individuals in : (cohort 1) 500
Reading sample files :
Summary of covariates and phenotypes
# discrete variables : 3
cov1 : type = D (Discrete covariate)
cov2 : type = D (Discrete covariate)
sex : type = D (Discrete covariate)
# continuous variables : 2
cov3 : type = C (Continuous covariate)
cov4 : type = C (Continuous covariate)
# phenotypes : 4
pheno1 : type = P (Continuous phenotype)
pheno2 : type = P (Continuous phenotype)
bin1 : type = B (Binary phenotype)
bin2 : type = B (Binary phenotype)
Covariate summary :
cov1 : missing levels
1 0(244) 1(255)
cov2 : missing levels
1 0(10) 1(76) 2(150) 3(164) 4(76) 5(23)
cov3 : missing min max mean variance
(unnormalised): 1 -3.2702 3.8310 0.0703 1.0131
(normalised): 1 -3.3189 3.7364 0.0000 1.0000
(histogram not reproduced)
cov4 : missing min max mean variance
(unnormalised): 1 -2.8552 3.1769 0.0324 0.8858
(normalised): 1 -3.0681 3.3411 0.0000 1.0000
(histogram not reproduced)
sex : missing levels
2 female(237) male(261)
Phenotype summary :
pheno1 : missing min max mean variance
(unnormalised): 1 -1.0766 5.2884 2.1386 1.4532
(normalised): 1 -2.6672 2.6129 0.0000 1.0000
(histogram not reproduced)
pheno2 : missing min max mean variance
(unnormalised): 1 -2.5428 3.7000 -0.0028 1.0025
(normalised): 1 -2.5369 3.6982 0.0000 1.0000
(histogram not reproduced)
bin1 : missing levels
1 1(499)
bin2 : missing levels
1 0(236) 1(263)
Phenotype being used : bin2
Data Summaries :
number of SNPs = (unknown)
Data with missing genotype data threshold and exclusion list applied :
cohort1.gen : 500
Analyzing Data :
PerVariantComputationManager: using the following computations:
> NewMLSinglePhenotypeTest with regression design:
phenotype baseline genotype
0.00 1.00 ?
1.00 1.00 ?
0.00 1.00 ?
0.00 1.00 ?
0.00 1.00 ?
0.00 ~ 1.00 ?
1.00 1.00 ?
NA 1.00 ?
0.00 1.00 ?
0.00 1.00 ?
> GenotypeCountComputation( all )
> InfoMeasureComputation( all )
> GenotypeCountComputation( cases )
> InfoMeasureComputation( cases )
> GenotypeCountComputation( cohort_1 )
> InfoMeasureComputation( cohort_1 )
> GenotypeCountComputation( controls )
> InfoMeasureComputation( controls )
scanning... read chunk [1 of (unknown)]... done.
scanning... read chunk [2 of (unknown)]... done.
scanning... read chunk [3 of (unknown)]... done.
scanning... no more data.
finito
Three options control frequentist testing for association: pheno, frequentist and method. These, and two related options for continuous phenotypes, are described below.
pheno <name> 
This specifies which phenotype you wish to test. The <name> should match one of the phenotypes in the sample file. If the phenotype in the sample file is binary (B) then a case-control test is carried out. If the phenotype in the sample file is continuous (P) then a quantitative trait test (i.e. an F-test for a linear model) is carried out. See the FILE FORMAT WEBPAGE for more details about how to specify a phenotype in the sample file. If no phenotype is specified then the first phenotype in the sample file is used. 
frequentist <t1>...<tn> 
This option controls the model you wish to test at each SNP versus a model of no association. The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. When using this option the output file will have a column for each test that contains the p-value for the test as well as estimates of the model parameters (betas) and their standard errors. SNPTEST codes allele_A as 0 and allele_B as 1, and this defines the meaning of the betas and their standard errors. For example, when using the additive model the beta estimates the increase in log-odds that can be attributed to each copy of allele_B. When a model cannot be fitted to the data the p-value is set to 1. 
quantile_normalise_phenotypes 
(This option applies to continuous phenotypes only). Quantile normalize continuous phenotypes. This is done AFTER samples have been excluded. 
use_raw_phenotypes 
(This option applies to continuous phenotypes only). By default continuous phenotypes are mean centered and scaled to have variance 1. This feature can be turned off with this option. 
The method option controls the way genotype uncertainty is taken into account when carrying out association tests. The options are listed in the table below.
method threshold 
Use thresholded genotypes. The calling threshold is controlled by the flag call_thresh. The default calling threshold is 0.9. This is the same as the default option in previous versions. 
method expected 
Use expected genotype counts (aka genotype dosages). 
method score 
Use a missing data likelihood score test. This is equivalent to the proper option in previous versions, except that if the score test experiences problems at a SNP (usually due to a rare SNP and/or high uncertainty) then method em is used for this SNP. 
method ml 
Use multiple Newton-Raphson iterations to estimate the parameters in the missing data likelihood for the model. 
method em 
Use an EM algorithm to estimate the parameters in the missing data likelihood for the model. 
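The difference between the threshold and expected methods can be illustrated on a single sample's genotype probabilities. This is a minimal Python sketch with hypothetical function names; in particular, treating a probability exactly equal to the calling threshold as a call is an assumption, since the text does not say whether the comparison is strict.

```python
def threshold_call(probs, call_thresh=0.9):
    """Thresholded genotype for probs = (P(AA), P(AB), P(BB)).

    Returns 0, 1 or 2 (copies of allele_B) if the most likely genotype
    reaches call_thresh, otherwise None to represent a NULL call.
    """
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= call_thresh else None


def expected_dosage(probs):
    """Expected number of copies of allele_B: P(AB) + 2 * P(BB)."""
    return probs[1] + 2 * probs[2]


for probs in [(0.95, 0.05, 0.0), (0.1, 0.8, 0.1), (0.0, 0.3, 0.7)]:
    print(probs, threshold_call(probs), expected_dosage(probs))
```

The second and third samples become NULL calls under thresholding at 0.9, but still contribute information through their dosages under the expected method.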
There are two other options that control how the imputed genotypes are treated.
renorm 
The methods described above to deal with genotype uncertainty were developed for use with imputed SNPs, for which the genotype probabilities sum to 1. If probabilistic genotype calls from an algorithm like CHIAMO are used then the probabilities might sum to less than one, and any leftover probability is the probability of a NULL call. The renorm option renormalizes the genotype probabilities to sum to 1. The default is not to renormalize the probabilities unless the method expected option is chosen, in which case renormalization is automatically turned on. 
total_prob_limit <x> 
There is an internal lower limit set on the sum of genotype probabilities. The default is 0.1. If this threshold is not met then that genotype is not included in the test. This protects against SNPs with a high proportion of NULL genotypes. 
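The interaction of these two options can be sketched in a few lines of Python. The function name is hypothetical and the order of operations (apply the limit first, then renormalise) is an assumption made for illustration.

```python
def prepare_probs(probs, renorm=False, total_prob_limit=0.1):
    """Return genotype probabilities for analysis, or None if excluded.

    probs is (P(AA), P(AB), P(BB)); any shortfall from 1 is the
    probability of a NULL call.
    """
    total = sum(probs)
    if total < total_prob_limit:
        return None                      # too little probability mass
    if renorm:
        return tuple(p / total for p in probs)
    return tuple(probs)


print(prepare_probs((0.4, 0.3, 0.1)))               # left as is, sums to 0.8
print(prepare_probs((0.4, 0.3, 0.1), renorm=True))  # rescaled to sum to 1
print(prepare_probs((0.02, 0.03, 0.0)))             # below the limit
```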
The statistical details of the Frequentist tests implemented are given in this pdf.
If score, ml or em are chosen as the method when using a frequentist test then a relative information measure will be calculated at each SNP. This will be reported in a column ending in _info. The statistical details of these information measures are given in this pdf.
From SNPTEST v2.5, the naming convention used for columns of the output file that contain results of statistical tests is
Alternatively, the use_long_column_naming_scheme option can be used to produce names similar to those output by SNPTEST v2.4 and below:
<test_type>  frequentist or bayesian 
<genetic_model>  add, dom, rec, gen or het 
<summary_measure>  One of pvalue, info, beta_X, se_X or log10_bf depending on the column 
<phenotype_name(s)>  The name (or names if mpheno is used) of the phenotypes used in the test. 
<covariate_name(s)>  The name (or names) of the covariates being conditioned upon in the test 
The following example carries out a case-control test for the binary phenotype named bin1.
The p-value for the test is given in the column bin1_frequentist_add_pvalue. Parameter estimates and their standard errors are given in the columns labelled bin1_frequentist_add_beta_1 and bin1_frequentist_add_se_1.
The following example carries out a quantitative trait test for the continuous phenotype named pheno1.
The p-value for the test is given in the column pheno1_frequentist_add_pvalue. Parameter estimates and their standard errors are given in the columns labelled pheno1_frequentist_add_beta_1 and pheno1_frequentist_add_se_1.
The Bayesian tests are specified by the bayesian option, in a similar way to the use of the frequentist option. The statistical details of the Bayesian tests implemented are given in this pdf.
bayesian <t1>...<tn> 
This option controls the model you wish to test at each SNP versus a model of no association. The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. When using this option the output file will have a column for each test that contains the log10 Bayes factor for the test as well as posterior mean estimates of the model parameters (betas) and their standard errors. SNPTEST codes allele_A as 0 and allele_B as 1, and this defines the meaning of the betas and their standard errors. For example, when using the additive model the beta estimates the increase in log-odds that can be attributed to each copy of allele_B. A Bayes factor will always be calculated at a SNP. 
The method option is also used to control the way the Bayesian models are fit, but not all options are valid.
The table below gives a description of the linear predictor of the logistic regression used, the form of the priors used on the model parameters, the default priors used in SNPTEST and the command line option that can be used to change the priors.
Model  Linear predictor  Priors  Default  Coding  Command line option
Additive  log(p_{i}/(1 − p_{i})) = µ + βG_{i}  µ~N(a_{0}, a_{1}^{2}), β~N(b_{0}, b_{1}^{2})  a_{0}=0, a_{1}=1, b_{0}=0, b_{1}=0.2  G_{i} is the additive coding of the SNP, i.e. AA -> 0, AB -> 1, BB -> 2.  prior_add a_{0} a_{1} b_{0} b_{1}
Dominant  log(p_{i}/(1 − p_{i})) = µ + βD_{i}  µ~N(a_{0}, a_{1}^{2}), β~N(b_{0}, b_{1}^{2})  a_{0}=0, a_{1}=1, b_{0}=0, b_{1}=0.5  D_{i} is the dominant coding of the SNP, i.e. AA -> 0, AB -> 1, BB -> 1.  prior_dom a_{0} a_{1} b_{0} b_{1}
Recessive  log(p_{i}/(1 − p_{i})) = µ + βR_{i}  µ~N(a_{0}, a_{1}^{2}), β~N(b_{0}, b_{1}^{2})  a_{0}=0, a_{1}=1, b_{0}=0, b_{1}=0.5  R_{i} is the recessive coding of the SNP, i.e. AA -> 0, AB -> 0, BB -> 1.  prior_rec a_{0} a_{1} b_{0} b_{1}
General  log(p_{i}/(1 − p_{i})) = µ + βG_{i} + qH_{i}  µ~N(a_{0}, a_{1}^{2}), β~N(b_{0}, b_{1}^{2}), q~N(c_{0}, c_{1}^{2})  a_{0}=0, a_{1}=1, b_{0}=0, b_{1}=0.2, c_{0}=0, c_{1}=0.5  G_{i} is the additive coding of the SNP, i.e. AA -> 0, AB -> 1, BB -> 2. H_{i} is the heterozygote coding of the SNP, i.e. AA -> 0, AB -> 1, BB -> 0.  prior_gen a_{0} a_{1} b_{0} b_{1} c_{0} c_{1}
Heterozygote  log(p_{i}/(1 − p_{i})) = µ + βH_{i}  µ~N(a_{0}, a_{1}^{2}), β~N(b_{0}, b_{1}^{2})  a_{0}=0, a_{1}=1, b_{0}=0, b_{1}=0.5  H_{i} is the heterozygote coding of the SNP, i.e. AA -> 0, AB -> 1, BB -> 0.  prior_het a_{0} a_{1} b_{0} b_{1}
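The genotype codings and linear predictors in the table above can be written out directly. The following Python sketch is only an illustration of the table, with hypothetical names; it evaluates the linear predictor for given parameter values and does not perform any model fitting.

```python
import math

# Genotype codings taken from the table above.
CODINGS = {
    "additive":     {"AA": 0, "AB": 1, "BB": 2},   # G_i
    "dominant":     {"AA": 0, "AB": 1, "BB": 1},   # D_i
    "recessive":    {"AA": 0, "AB": 0, "BB": 1},   # R_i
    "heterozygote": {"AA": 0, "AB": 1, "BB": 0},   # H_i
}


def linear_predictor(genotype, mu, beta, model="additive", q=0.0):
    """Log-odds of being a case for one individual under the chosen model."""
    if model == "general":
        # The general model combines the additive and heterozygote codings.
        return (mu
                + beta * CODINGS["additive"][genotype]
                + q * CODINGS["heterozygote"][genotype])
    return mu + beta * CODINGS[model][genotype]


def case_probability(genotype, mu, beta, model="additive", q=0.0):
    """Inverse logit of the linear predictor, i.e. p_i."""
    eta = linear_predictor(genotype, mu, beta, model, q)
    return 1.0 / (1.0 + math.exp(-eta))


for g in ("AA", "AB", "BB"):
    print(g, linear_predictor(g, mu=0.0, beta=0.2),
          round(case_probability(g, mu=0.0, beta=0.2), 4))
```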
In SNPTEST v2 there is a new option to specify the use of t-distribution priors on the genetic effects. The fatter tails of the t-distribution allow more flexibility in specifying beliefs about the size of the genetic effects. This feature is controlled by the following two options.
t_prior 
Specifies the use of t-distribution priors on the genetic effects. Effectively, this option modifies the priors described in the table above, i.e. the mean and variance of the t-distributions are specified by the options given in the table above, but the normal distribution is replaced by the t-distribution. NOTE: a t-distribution is only used for the genetic effects, i.e. the parameters β and q in the models above. For example, bayesian add t_prior would specify the linear predictor log(p_{i}/(1 − p_{i})) = µ + βG_{i} and the priors would be µ~N(a_{0}, a_{1}^{2}) and β~t(b_{0}, b_{1}^{2}, df = 3). 
t_df <x> 
The degrees of freedom parameter of the t-distribution. The default value is 3. When this parameter is set very large the prior converges to the normal distribution prior. 
The following example calculates a Bayes factor under the Bayesian additive model for the binary phenotype named bin1, using the default priors.
The Bayesian tests for quantitative traits are carried out using the conjugate prior formulation of the linear model using either thresholded genotypes (method threshold) or the expected genotypes (method expected). The model is most easily explained through an example. For an additive model the formulation is y_{i} = βG_{i} + e_{i}, e_{i} ~ N(0, σ^{2}),
where
y_{i} = the residual phenotype for the ith individual. The residual phenotype is calculated by subtracting off a baseline term and estimates of any specified covariates.
G_{i} = an additive coding for the thresholded or expected genotype of the ith individual.
σ^{2} = the error variance of the model.
This model is compared to the model y_{i} = e_{i}, e_{i} ~ N(0, σ^{2}).
We use a Normal Inverse Gamma (NIG) prior on the effects β and σ^{2}. This prior has the form β | σ^{2} ~ N(m_{β}, V_{β}σ^{2}), σ^{2} ~ IG(a, b).
This makes it clear that the prior variance on β is specified in terms of the fraction (V_{β}) of the error variance.
It can be shown that the expected noncentrality parameter for the F-test when fitting the above linear model is approximately Np(1 − p)2β^{2}/σ^{2}, where β and σ^{2} are the true values under the alternative model, p is the allele frequency of the SNP and 2N is the total sample size. This can be usefully compared to the noncentrality parameter for the case-control test, which is approximately Np(1 − p)β^{2}, assuming N cases and N controls; here β is the log-odds ratio parameter of a logistic regression model. So, if we are happy to put a N(0, 0.2^{2}) prior on β for a binary trait, we might reasonably put the same prior on √2β/σ in the model above, i.e. β ∼ N(0, 0.02σ^{2}). In the context of the NIG prior used in SNPTEST v2 this would mean setting m_{β}=0 and V_{β} = 0.02.
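The step from the binary-trait prior to V = 0.02 can be spelled out as follows. This is only the arithmetic implied by the argument above, written in LaTeX with \beta for the genetic effect: matching the two noncentrality parameters shows that \sqrt{2}\beta/\sigma in the linear model plays the role of the log-odds ratio, and rescaling the N(0, 0.2^2) prior gives the stated variance.

```latex
N p(1-p)\,\frac{2\beta^{2}}{\sigma^{2}}
  = N p(1-p)\left(\frac{\sqrt{2}\,\beta}{\sigma}\right)^{2},
\qquad
\frac{\sqrt{2}\,\beta}{\sigma} \sim N\!\left(0,\ 0.2^{2}\right)
\;\Longrightarrow\;
\beta \sim N\!\left(0,\ \frac{0.2^{2}}{2}\,\sigma^{2}\right)
      = N\!\left(0,\ 0.02\,\sigma^{2}\right).
```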
By default all quantitative phenotypes are centered and scaled to have zero mean and unit variance before analysis. This places all the quantitative phenotypes on a comparable scale. Since most genetic effects will be very small in GWAS, it is reasonable to assume that the error variance σ^{2} will be close to 1. Thus using an IG(3,2) prior for σ^{2}, which has mean 1 and variance 1, will produce reasonably robust results. The centering and scaling can be turned off with the use_raw_phenotypes flag. In this case the prior on the error variance σ^{2} should be specified to take this into account.
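The claim that IG(3,2) has mean 1 and variance 1 can be checked with the standard moment formulas for an inverse gamma distribution. The sketch below assumes the shape and scale parametrisation IG(a, b), which is the one consistent with the stated moments; the function name is hypothetical.

```python
def inverse_gamma_mean_var(a, b):
    """Mean and variance of IG(a, b) with shape a and scale b.

    mean = b / (a - 1), defined for a > 1
    var  = b**2 / ((a - 1)**2 * (a - 2)), defined for a > 2
    """
    if a <= 2:
        raise ValueError("the variance is only finite for a > 2")
    mean = b / (a - 1)
    var = b ** 2 / ((a - 1) ** 2 * (a - 2))
    return mean, var


print(inverse_gamma_mean_var(3, 2))  # (1.0, 1.0)
```

The same function can be used to pick a and b when use_raw_phenotypes is set and the error variance is expected to be on a different scale.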
The following example uses this model to analyze the phenotype pheno1. This produces a log_{10} Bayes Factor in the output file.
Model name  Model  Priors  Command line options needed
Additive  y_{i} = βG_{i} + e_{i}, e_{i} ~ N(0, σ^{2})  β~N(b_{0}, V_{β}σ^{2}), σ^{2} ~ IG(a,b)  prior_qt_mean_b b_{0} prior_qt_V_b V_{β} prior_qt_a a prior_qt_b b
Dominant  y_{i} = βD_{i} + e_{i}, e_{i} ~ N(0, σ^{2})  β~N(b_{0}, V_{β}σ^{2}), σ^{2} ~ IG(a,b)  prior_qt_mean_b b_{0} prior_qt_V_b V_{β} prior_qt_a a prior_qt_b b
Recessive  y_{i} = βR_{i} + e_{i}, e_{i} ~ N(0, σ^{2})  β~N(b_{0}, V_{β}σ^{2}), σ^{2} ~ IG(a,b)  prior_qt_mean_b b_{0} prior_qt_V_b V_{β} prior_qt_a a prior_qt_b b
General  y_{i} = βG_{i} + qH_{i} + e_{i}, e_{i} ~ N(0, σ^{2})  β~N(b_{0}, V_{β}σ^{2}), q~N(b_{1}, V_{q}σ^{2}), σ^{2} ~ IG(a,b)  prior_qt_mean_b b_{0} prior_qt_V_b V_{β} prior_qt_mean_q b_{1} prior_qt_V_q V_{q} prior_qt_a a prior_qt_b b
Heterozygote  y_{i} = βH_{i} + e_{i}, e_{i} ~ N(0, σ^{2})  β~N(b_{0}, V_{β}σ^{2}), σ^{2} ~ IG(a,b)  prior_qt_mean_b b_{0} prior_qt_V_b V_{β} prior_qt_a a prior_qt_b b
The option mean_bf is used to average over a set of Bayesian models. This can be used for both binary and quantitative phenotype tests. This option does not currently work with the mpheno option.
mean_bf <w1>...<wn> 
Specifies that a log10 Bayes factor should be calculated for a weighted average over the models specified by bayesian, with weights given by <w1>...<wn>. For example, bayesian 1 4 mean_bf 9 1 would calculate a Bayes factor for a weighted average of the additive and general models where the additive model is given weight 9 and the general model is given weight 1. The log10 Bayes factor will be written in a column with the label mean_bf. 
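One natural reading of the weighted average is sketched below in Python: the weights are normalised to sum to one, the Bayes factors are averaged on the natural scale, and the result is reported as log10. That the weights are normalised in this way is an assumption, as the text does not say so, and the function name is hypothetical.

```python
import math


def log10_mean_bf(log10_bfs, weights):
    """log10 of sum_i w_i * BF_i / sum_i w_i, given per-model log10 BFs."""
    if not weights or len(log10_bfs) != len(weights):
        raise ValueError("need one weight per Bayes factor")
    total = sum(weights)
    # Factor out the largest log10 BF so that 10 ** x cannot overflow.
    top = max(log10_bfs)
    scaled = sum(w / total * 10 ** (x - top)
                 for x, w in zip(log10_bfs, weights))
    return top + math.log10(scaled)


# Additive model weighted 9, general model weighted 1, as in the example above.
print(log10_mean_bf([2.5, 3.0], [9, 1]))
```

Because the average is taken on the natural scale, a single model with a very large Bayes factor can dominate the result even when its weight is small.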
A Bayesian test for association of a SNP with multiple quantitative phenotypes can be carried out with the mpheno option.
The model we use is the Bayesian Multivariate Linear model which is specified by
where (y_{i1},...,y_{iq}) is the vector of the q residual phenotypes measured on the ith individual. The residual phenotype is calculated by subtracting off a baseline term and the estimated effects of any specified covariates. Further we assume that each of these phenotypes has been centered and scaled to have zero mean and unit variance. Also, G_{i} is the coded version of the SNP genotype for the ith individual.
We use the conjugate prior for this model. This is an inverse Wishart IW(c,Q) prior on the error covariance matrix Σ and a matrix normal (N) prior on the vector of parameters
A. P. Dawid (1981) Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika 68:265-274.
This distribution has the property that the covariance matrix of (β_{1},...,β_{q}) - M is given by VΣ. By a similar argument to that used above when discussing how to set the priors for a single quantitative phenotype, we recommend setting V=0.02 and M = (0,...,0). Since the phenotypes have been centered and scaled, we also recommend placing an IW(6,4I_{q}) prior on Σ, where I_{q} is the (q×q) identity matrix. The centering and scaling can be turned off with the use_raw_phenotypes flag.
The fit of the full model (M_{1}) in which (β_{1},...,β_{q}) are unconstrained is compared to the fit of the null model (M_{0}) in which (β_{1},...,β_{q}) = 0. The Bayes factor calculated then has the form
The following example uses this model to analyze the phenotypes pheno1 and pheno2 jointly. This produces a log_{10} Bayes Factor in the output file.
NOTE: the inverse Wishart prior is set with the options prior_mqt_c <c> and prior_mqt_Q <Q>. This specifies an IW(c,QI_{q}) prior.
SNPTEST v2.5.1 includes support for testing categorical traits using a multinomial logistic regression likelihood. This extends the logistic regression implemented for binary traits to multiple categories. This feature is currently considered experimental and this page provides initial documentation on its use.
To specify a multinomial trait you must:
Parameters in the multinomial model can be thought of as forming a matrix (β_{ij}), where β_{ij} is the effect size for predictor j (i.e. the jth column of the design matrix) on non-baseline outcome level i. SNPTEST internally renumbers these parameters as β_{k}, k = 1, ..., K, where K = (number of non-baseline outcome levels) × (number of predictors). To allow parameter identification, the output contains columns named in the following way:
To avoid cluttering the output, corresponding standard errors and other columns are simply identified by number, e.g. the column containing standard errors for the ith parameter is named
For example, suppose the column 'bin3' contains a phenotype with levels control, case1 and case2. The command
fits a multinomial logistic regression at each SNP with a single additive genetic effect parameter, using "control" as the baseline outcome. SNPTEST will output the following columns relevant to the parameters:
| column | description |
| --- | --- |
| frequentist_add_beta_1:add/bin3=case1 | Effect size parameter (β_{1}) for outcome case1 relative to control. |
| frequentist_add_beta_3:add/bin3=case2 | Effect size parameter (β_{3}) for outcome case2 relative to control. |
| frequentist_add_se_1 | Standard error for β_{1}. |
| frequentist_add_se_3 | Standard error for β_{3}. |
| frequentist_add_cov_1,3 | Covariance between the two parameters. |
| frequentist_add_wald_pvalue_1 | Wald test p-value for β_{1} (based on the effect size and standard error). |
| frequentist_add_wald_pvalue_3 | Wald test p-value for β_{3}. |
Important: the particular order or numbering of parameters may change in future.
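The Wald test p-values are described as being based on the effect size and its standard error. A minimal Python sketch of such a calculation is given below, assuming the usual two-sided test that refers beta/se to a standard normal distribution (equivalently, its square to a chi-squared distribution with 1 degree of freedom). The function name is hypothetical and the exact computation inside SNPTEST may differ.

```python
import math


def wald_pvalue(beta, se):
    """Two-sided Wald p-value for one parameter under a normal approximation."""
    if se <= 0:
        raise ValueError("the standard error must be positive")
    z = beta / se
    # 2 * (1 - Phi(|z|)) written with the complementary error function.
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For instance, an effect size of 0.196 with standard error 0.1 gives z = 1.96 and a p-value close to 0.05.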
Similarly, the command
will fit a model with both additive ('add') and heterozygote ('het') parameters, with effect sizes columns named as follows.
| column | description |
| --- | --- |
| frequentist_add_beta_1:add/bin3=case1 | Effect size parameter (β_{1}) for additive effect on outcome case1. |
| frequentist_add_beta_2:het/bin3=case1 | Effect size parameter (β_{2}) for heterozygote effect on outcome case1. |
| frequentist_add_beta_4:add/bin3=case2 | Effect size parameter (β_{4}) for additive effect on outcome case2. |
| frequentist_add_beta_5:het/bin3=case2 | Effect size parameter (β_{5}) for heterozygote effect on outcome case2. |
Similarly, numbered columns for the standard errors, covariances and Wald test p-values will be output.
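The numbering in the two examples can be reproduced with the following Python sketch. It rests on an assumption inferred only from the gaps in the numbering (1 and 3 in the first example; 1, 2, 4 and 5 in the second): each non-baseline outcome level is taken to carry one additional baseline parameter that is numbered after its genetic predictors. The function name and this layout are hypothetical, and, as noted above, the order or numbering of parameters may change.

```python
def multinomial_beta_columns(phenotype, outcome_levels, predictors,
                             prefix="frequentist_add_beta"):
    """Reproduce the effect size column names shown in the examples above."""
    # Assumption: one extra (baseline) parameter per outcome level, numbered last.
    per_level = len(predictors) + 1
    columns = []
    for i, level in enumerate(outcome_levels):
        for j, predictor in enumerate(predictors):
            k = i * per_level + j + 1  # 1-based parameter number
            columns.append(f"{prefix}_{k}:{predictor}/{phenotype}={level}")
    return columns
```

Calling it with phenotype "bin3", outcome levels case1 and case2, and predictors add and het returns the four column names listed in the second table.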
There are several options that control how covariates and/or SNPs can be conditioned upon in order to carry out a test of association. These options work with both the Frequentist and Bayesian association tests.
cov_names <name_1> ... <name_n> 
Condition upon the covariates in the sample files with names name_1,...., name_n. 
cov_all 
Condition upon all the covariates in the sample files. 
cov_all_discrete 
Condition upon all the discrete covariates (D) in the sample files. 
cov_all_continuous 
Condition upon all the continuous covariates (C) in the sample files. 
condition_on <snp_1> <model_1> ... <snp_n> <model_n> 
Condition upon a list of SNPs with IDs given by snp_1,...,snp_n. For each SNP a list of models can be supplied; the choices are add, dom, rec, het, or gen. Here "gen" is shorthand for "add het", i.e. condition on additive and heterozygote dosages. If no model is supplied, the default "add" is used. These covariates are internally added to the sample file as continuous (type C) covariates and appear in the covariate summary in the screen output. 
Conditioning upon one (or more) covariate means that the test of association being carried out is testing for a genetic effect over and above that explained by the covariate(s). Discrete covariates are added into the model as factors i.e. a different baseline term for each level of the factor is fitted.
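To illustrate what adding a discrete covariate "as a factor" means, the Python sketch below expands a covariate into one indicator column per level, so that each level can receive its own baseline term. The function name is hypothetical, and the exact coding used inside SNPTEST (for example, whether one level is absorbed into a common intercept) is not specified here, so this is an assumption.

```python
def factor_columns(values):
    """Expand a discrete covariate into one 0/1 indicator column per level."""
    levels = sorted(set(values))
    # One row per individual, one column per level of the factor.
    rows = [[1 if value == level else 0 for level in levels] for value in values]
    return levels, rows
```

Each individual has exactly one indicator set to 1, so the fitted coefficient for that column acts as the baseline for the individual's group.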
If a single discrete (D) covariate is conditioned upon then this is equivalent to a Mantel-Haenszel test. This is a test for a common genetic effect where each group is allowed to have its own baseline effect. Here is an example of conditioning upon the binary covariate called cov1 in the sample files.
This produces an output file ./example/ex.out which contains a column with header bin2_frequentist_add_cov1_pvalue that contains the p-values for the test conditional on the covariate.
For association studies it has become popular to use eigenvectors from a PCA to code for unobserved population structure. This is carried out in SNPTEST by setting the eigenvectors as continuous (C) covariates in the sample file and then conditioning upon these covariates. Here is an example of conditioning upon the two continuous covariates called cov3 and cov4 in the sample files.
In regions where an association has been found it is often desirable to carry out a test conditioning upon the most associated SNP to look for secondary signals of association, which may highlight allelic heterogeneity or possibly a haplotype effect in the region. This can be carried out in SNPTEST using the condition_on option. A list of SNPs can be specified along with the coding to be applied to those SNPs. The following example carries out a test of association conditional upon the SNPs with IDs RSID_10 and RSID_20. The SNP RSID_10 is coded as an additive effect while SNP RSID_20 is coded as a general effect.
The p-values from this command appear in a column labelled bin1_frequentist_add_RSID_10:additive_dosage_RSID_20:additive_dosage_RSID_20:heterozygote_dosage_score_pvalue.
A summary of the conditioned-on dosages appears in the main covariate summary in the screen output.
For SNPs that do not have a useful ID, the syntax condition_on position=chr:xxxx (or condition_on position=xxxx if chromosome information is missing) can be used, where chr:xxxx is the chromosome and position of the SNP to be conditioned on. (position can be shortened to pos if desired.)
The range option can be used to analyze only those SNPs whose base-pair position lies within a given set of intervals. The following example only carries out tests on SNPs within the intervals [20000,30000] and [40000,50000].
In a range specification the start or end of the range can be omitted. For example, the syntax range 50000 will restrict to all SNPs with position 50000 or above.
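The interval logic can be sketched in Python as follows. This is not SNPTEST code: the function name and the representation of an interval as a (start, end) pair, with None standing for an omitted bound, are choices made for this illustration. Intervals are treated as closed, matching the [20000,30000] notation and the "50000 or above" wording.

```python
def in_ranges(position, ranges):
    """Return True if position falls inside any (start, end) interval.

    Either bound may be None, meaning that side of the interval is open-ended.
    """
    for start, end in ranges:
        above_start = start is None or position >= start
        below_end = end is None or position <= end
        if above_start and below_end:
            return True
    return False
```

A SNP at position 25000 passes the filter [(20000, 30000), (40000, 50000)], while a SNP at position 35000 does not.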
The snpid option can be used to specify a list of specific SNPs to analyze. The following example only carries out tests at SNPs with IDs RSID_4 and SNPID_7.
The exclude_snps option can be used to specify a file containing a list of SNPs that should be excluded from the analysis. The IDs in the file can be the SNP IDs (first column of the genotype file) or RS IDs (second column of the genotype file). For example, the file ./example/snps.list contains a list of the SNP IDs for the first 10 SNPs in the example data files. To exclude these SNPs from the analysis we can use
You should notice that the screen output reports that it has read in 10 SNP IDs and that the output file does not contain output for these SNPs.
The exclude_samples option can be used to specify a file containing a list of individuals that should be excluded from the analysis. The IDs in the file should be the ID that appears in the first column of the sample files. For example, the file ./example/samples.list contains a list of the IDs for the first 10 individuals in the example data files. To exclude these individuals from the analysis we can use
You should notice that the screen output reports that it has read in 10 sample IDs and that these individuals were excluded.
The miss_thresh option can be used to exclude individuals whose proportion of missing data exceeds some level. The missing data proportion of each individual is specified in the 3rd column of the sample file. For example, to specify a maximum missing data proportion of 1% use
You should notice that the screen output reports that it has read in 10 SNP IDs and that the number of individuals included after the missing data threshold and exclusion list have been applied is less than the original number of individuals in the raw files.
The overlap option can be used when multiple .gen files with differing sets of SNPs are supplied with the data option. This option will find the intersection of the SNPs in all the .gen files and test these SNPs. A restriction is that all .gen files must have SNPs ordered by position. If this is not the case a warning will be given. In the following example the files cohort1.gen and cohort2_partial.gen, which have an overlap of 100 SNPs, are combined together.
When carrying out a statistical test that conditions on covariates or uses a quantitative phenotype, any individual with at least one missing value of a covariate or phenotype will be excluded from the test. The default code for missing covariates or phenotypes in the sample files is NA (see FILE FORMAT). The option missing_code can be used to specify a list of comma-separated alphanumeric codes that will be interpreted as missing values. For example, the syntax missing_code NA,999 will treat any value equal to 999 or NA in the sample files as missing.
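The exclusion rule can be sketched in Python as below. The function names and the dictionary representation of a sample file row are hypothetical. Values are compared as text, which is an assumption: under it, a code of 999 would not match a field written as 999.0.

```python
def parse_missing_codes(option_value="NA"):
    """Split a comma-separated missing_code value into a set of codes."""
    return {code.strip() for code in option_value.split(",") if code.strip()}


def complete_cases(rows, columns, missing_codes):
    """Keep only the rows with no missing value in any of the given columns."""
    return [row for row in rows
            if all(row[column] not in missing_codes for column in columns)]
```

With the codes NA and 999, an individual whose phenotype is recorded as 999 or whose covariate is recorded as NA is dropped from the test.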
SNPTEST v2.5 and above includes specific support for testing for association on the sex chromosomes. Both X and Y chromosomes are supported but we focus the discussion on the X chromosome here. There are a few complexities to bear in mind when testing on the X chromosome:
When using method newml for case/control traits, SNPTEST ignores samples with missing sex and assumes a model of full X inactivation by default. The command
SNPTEST will ignore samples with unspecified sex as well as males that are coded wrongly. By default, sex information is taken from a column named "sex" in the sample file, and males are coded in the input file in the same way as homozygote females. The sex_column and haploid_genotype_coding options can be used to adjust this behaviour.
SNPTEST reads chromosome information from the input files and understands "X" or "0X" in the input data to be the non-pseudoautosomal part of the X chromosome, "Y" or "0Y" to be the Y chromosome, and "XY" to be the pseudoautosomal loci on the X and Y chromosomes. (The pseudoautosomal regions are treated like autosomes.)
If chromosome data is not present in the input files, use the assume_chromosome option to specify the chromosome.
By default, gender information must be supplied in a column called 'sex' in the sample file. This can be adjusted using the sex_column option. Currently, SNPTEST understands M or MALE to indicate a male sample and F or FEMALE to indicate a female sample. For compatibility with IMPUTE, SNPTEST also permits encoding males as 1 and females as 2.
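The accepted sex codes listed above can be summarised in a small Python sketch. The function name and the returned labels are hypothetical, and treating every other value as unspecified sex is an assumption made for this illustration.

```python
MALE_CODES = {"M", "MALE", "1"}
FEMALE_CODES = {"F", "FEMALE", "2"}


def parse_sex(value):
    """Map a sample file sex entry to 'male', 'female' or None (unspecified)."""
    token = str(value).strip()
    if token in MALE_CODES:
        return "male"
    if token in FEMALE_CODES:
        return "female"
    # Assumption: anything else is treated as unspecified sex.
    return None
```

Samples for which such a function returns None correspond to the samples with unspecified sex that are ignored when testing on the X chromosome.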
To allow for heterogeneity between males and females, or to allow for incomplete inactivation in females, the stratify_on option can be used. For example, the command
Note: when using stratify_on, it is usually correct to specify the same variables to cov_names to allow for a different baseline between strata.
SNPTEST v2.5 includes a new option stratify_on which performs an association test stratified over levels of a given discrete covariate, i.e. fitting a different effect parameter in each stratum. (Currently this option only applies when using method newml.) Possible uses for this option might be
For example, the command
Note: when using stratify_on, you should (almost) always specify the same variables to cov_names to allow for a different baseline between strata. In case-control settings it almost never makes sense to stratify effects but not baseline parameters.
When using stratify_on, in addition to the p-value and other columns, SNPTEST will output one effect size parameter and one standard error for each level of the covariate. For example, in the above command cov1 has two levels 0 and 1, and SNPTEST outputs variables with the following names:
| Name | Value |
| --- | --- |
| bin2_cov1_frequentist_add_newml_beta_1:genotype/cov1=0 | Effect size for the stratum with cov1 = 0 |
| bin2_cov1_frequentist_add_newml_beta_1:genotype/cov1=1 | Effect size for the stratum with cov1 = 1 |
| bin2_cov1_frequentist_add_newml_se_1:genotype/cov1=0 | Standard error of effect size for the stratum with cov1 = 0 |
| bin2_cov1_frequentist_add_newml_se_1:genotype/cov1=1 | Standard error of effect size for the stratum with cov1 = 1 |
| bin2_cov1_frequentist_add_newml_degrees_of_freedom | Degrees of freedom in likelihood ratio test (here equal to 2) |
| bin2_cov1_frequentist_add_newml_pvalue | p-value from likelihood ratio test. |
By default, SNPTEST will refuse to test a variant if any stratum contains fewer than 100 individuals. This limit can be adjusted using the lower_sample_limit option.
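The stratum size rule can be sketched in Python as follows. The function name is hypothetical; the default of 100 is the limit quoted above.

```python
from collections import Counter


def strata_large_enough(stratum_labels, lower_sample_limit=100):
    """Return True if every stratum has at least lower_sample_limit samples."""
    counts = Counter(stratum_labels)
    return all(count >= lower_sample_limit for count in counts.values())
```

A variant with 150 samples in one stratum and 99 in the other would be skipped under the default limit, but tested if the limit were lowered to 50.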
SNPTEST v2.5 computes two types of info measure.
The IMPUTE info measure, which reflects the information in imputed genotypes relative to the information if only the allele frequency were known. It can be written as
The info measure takes the value 1 if all genotypes are completely certain, and the value 0 if the genotype probabilities for each sample are completely uncertain in Hardy-Weinberg proportions (i.e. they equal (1-θ)^{2}, 2θ(1-θ), θ^{2}). It is also possible for info to drop below zero.
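The behaviour described above can be reproduced with the Python sketch below. It assumes the form info = 1 - Σ_i (f_i - e_i^2) / (2nθ(1-θ)), where for each sample with genotype probabilities (p_0, p_1, p_2) the expected allele count is e_i = p_1 + 2p_2, f_i = p_1 + 4p_2, and θ = Σ_i e_i / (2n). This form is an assumption made for the illustration, chosen because it has the three properties stated above; the function name is hypothetical, and all samples are treated as diploid with probabilities summing to one.

```python
def impute_info(genotype_probs):
    """Info measure from per-sample genotype probabilities (p0, p1, p2)."""
    n = len(genotype_probs)
    expected = [p1 + 2.0 * p2 for _, p1, p2 in genotype_probs]         # e_i
    second_moment = [p1 + 4.0 * p2 for _, p1, p2 in genotype_probs]    # f_i
    theta = sum(expected) / (2.0 * n)
    if theta <= 0.0 or theta >= 1.0:
        # The ratio is undefined for a monomorphic variant.
        return None
    uncertainty = sum(f - e * e for e, f in zip(expected, second_moment))
    return 1.0 - uncertainty / (2.0 * n * theta * (1.0 - theta))
```

With fully certain genotypes every f_i - e_i^2 is zero, so the measure is 1. With every sample set to (0.25, 0.5, 0.25) it is 0, and with every sample set to (0.5, 0, 0.5) it is -1.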
Info is usually computed as if assuming all samples are diploid and that the genotype probabilities for each sample sum to one. This is what IMPUTE computes, and also what SNPTEST computes when you use a method other than newml.
The assumptions of diploidy and that probabilities sum to one are generally applicable to imputed, autosomal SNPs. They may break down for typed SNPs (where missing probability data is possible) and for variants on the sex chromosomes. To deal with this, when using method newml only, SNPTEST currently makes two modifications to the above. Firstly, missing probability data is filled in using the expected distribution given θ and the assumption of Hardy-Weinberg equilibrium. This modification implies that completely missing individuals contribute 1/n to the info measure, and in fact that
When using method newml, SNPTEST will also output columns named ..._impute_info which reflect the traditional computation outlined above.
For some methods, SNPTEST also computes an association test info measure which reflects the relative information about the parameter of interest; see this pdf for details.
| Option and value(s) | Description |
| --- | --- |
| hwe | This will produce an output file with columns that contain the p-values for an exact test of HWE in each cohort. If a test for a binary phenotype is carried out then HWE for all the case individuals and all the control individuals are also reported. |
| chunk <x> | The program works by reading in, analyzing and writing output for chunks of the data at a time. This option is included to control the maximum amount of RAM used by the program at any one time. The default chunk size is 100 SNPs. |
| log <filename> | Copy all screen output to the specified log file. |
| printids | Print out each variant to the screen and/or log file before analysing it. (This is useful for debugging problems with data.) |
| lower_sample_limit <n> | By default, SNPTEST will refuse to run a regression if there are fewer than 100 samples in the design matrix (or, when using stratify_on, if there are fewer than 100 samples in any stratum). This option can be used to alter this limit. |
Q : My sample file looks fine but SNPTEST says it is malformed  why?
Up to v2.5, SNPTEST would fall over on files that have Windows-style line endings (CR LF) but are used on platforms with UNIX line endings (LF), or vice versa. The solution is to convert the line endings to LFs using either the dos2unix command or a text editor.
From v2.5.1, SNPTEST should cope with files with either line ending convention.
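If neither dos2unix nor a suitable text editor is to hand, the same conversion can be done with a few lines of Python. This is a minimal sketch; the function names and file paths are placeholders.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows-style CR LF line endings with UNIX-style LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(source_path, destination_path):
    """Write a copy of source_path that uses LF line endings."""
    # Binary mode stops Python from translating line endings itself.
    with open(source_path, "rb") as source:
        data = source.read()
    with open(destination_path, "wb") as destination:
        destination.write(to_unix_line_endings(data))
```

Writing to a new file rather than overwriting the original keeps the unconverted sample file available for comparison.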
Q : SNPTEST does not produce a p-value at my SNP.
SNPTEST sometimes fails to fit the association model at a variant.
In this case it tries to produce an indication of the reason for failure in the comment column.
Possible reasons are:
Q : I get the error "igamc underflow error" printed to the screen. What does this mean?
This error occurs at SNPs where a very small p-value from a chi-squared test needs to be calculated. SNPTEST uses the CPROB library to carry out this calculation, and the library reports an underflow error when this occurs. In this case it returns a p-value of 0. This usually occurs when the signal of association is very strong and can sometimes indicate problems with the data. To identify which SNPs this occurs at you can use the printids flag.
If you have a question about SNPTEST, please send a message to our mailing list:
You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated email folder if you don't want them all landing in your inbox.
If you are having a problem with the software, please try to include the following details in your email (otherwise we may be unable to help):
For difficult problems like memory access errors (e.g. "segmentation faults") we may further ask you to send data files that show the problem. These should generally be small and we can provide suggestions if you are not allowed to share your actual data.
| Version | Date | Details |
| --- | --- | --- |
| 2.4.1 | 03/07/2012 | Bug fix release. |
| 2.4.0 | 13/04/2012 | Minor release. |
| 2.3.0 | 16/12/2011 | This release can be found here. |
| 2.2.0 | 07/12/2010 | This release can be found here. This is a substantial update on the previous version that implements a number of new features |
| 2.1.1 | 01/04/2010 | Minor update. This release can be found here. |
| 2.1.0 | 19/03/2010 | This is a major change to SNPTEST from previous versions. Please read the following carefully |
| 1.1.5 | 28/05/2008 | This release can be found here. |
[1] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics 39:906-913.
[2] The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447:661-678. PMID: 17554300 DOI: 10.1038/nature05911
[3] J. Marchini and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics.