SNPTEST v1.0.1

SNPTEST is a program for the analysis of single SNP association in genome-wide studies. Both binary (case-control) and quantitative phenotypes are catered for. The program is designed to work seamlessly with the output of the genotype calling program CHIAMO [1], the genotype imputation program IMPUTE [2] and the program GTOOL. This program was used in the analysis of the 7 genome-wide association studies carried out by the Wellcome Trust Case-Control Consortium (WTCCC) [3].

SNPTEST has many different features which are illustrated below through a number of different examples that use the datasets provided with the software in the directory /example. These files contain data at 100 SNPs on 2000 individuals that are split into a control cohort and a case cohort. These datasets can be used to try out the tests using both binary (case-control) and quantitative phenotypes.

Download
Version History
Input File Formats
Data Summaries
Calculating Missing Data Rates
Screen Output
Excluding SNPs/Individuals
Testing for Hardy-Weinberg Equilibrium
Basic Association Tests
Tests that condition upon covariates
Bayesian Tests of Association (Bayes Factors)
Taking account of genotype uncertainty
Other Options
References
Contact Information

Download

Pre-compiled versions of the program and example files can be downloaded from the links below. We've supplied both static and dynamic versions of the Linux executables. If you intend to run SNPTEST on a machine running an old kernel then you probably want to use the dynamic version. If you have any problems getting the program to work on your machine please contact me.

Platform                              File
Linux (x86_64) Static Executable      snptest_v1.0.1_x86_64_static.tgz
Linux (x86_64) Dynamic Executable     snptest_v1.0.1_x86_64_dynamic.tgz
Linux (i386) Static Executable        snptest_v1.0.1_i386_static.tgz
Linux (i386) Dynamic Executable       snptest_v1.0.1_i386_dynamic.tgz
Mac OS X 10.4 (Intel)                 snptest_v1.0.1_MacOSX_10_Intel.tgz
Solaris 5.8 (Sun SPARC)               snptest_v1.0.1_Solaris5.8_SPARC_static.tgz
Solaris 5.10 (Sun SPARC)              snptest_v1.0.1_Solaris5.10_SPARC_static.tgz

Please fill out the registration form to receive emails about updates to this software.

To unpack the files, use a command like


tar zxvf snptest_vX.X.X_i386.tgz

This will create an executable called snptest and a directory /example that contains the example files.

Version History

Version  Date        Changes
0.9.1    07-06-2007  First version.
0.9.2    21-06-2007  Small change to convergence diagnostics in logistic regression.
0.9.3    26-06-2007  -exclude_snps bugfix so that either of the SNP IDs can be used to exclude SNPs.
1.0.0    17-07-2007
  • Support for Quantitative Trait tests (both Frequentist and Bayesian tests).
  • Heterozygote tests added.
  • Changes to the way that tests conditional on covariates are specified.
  • Change made to be in line with the amended FILE FORMAT.
1.0.1    18-07-2007  Bug fix to -create_miss option.

Input File Formats

SNPTEST allows the analysis of multiple cohorts of individuals. The data for each cohort are stored in two files. The first file (the sample file) stores the IDs and associated covariate and phenotype information of the individuals in the cohort. The second file (the genotype file) stores the genotype data for the cohort. For the example datasets included with the software, the sample and genotype files for each cohort have the suffixes .sample and .gen respectively. There is a FILE FORMAT WEBPAGE with more details of the file formats.

Data Summaries

The simplest use of SNPTEST is to calculate data summaries for each SNP (genotype counts, allele frequencies, SNP missing data proportions and odds ratios). For example, the command

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out

produces a file ./example/ex.out which contains the data summaries for all 100 SNPs across the two cohorts. Note how the cohorts were specified as either case or control cohorts by placing the relevant genotype and sample files after the -cases and -controls options in the command. For each cohort the name of the genotype file should be followed by its associated sample file. There is no limit to the number of case and control cohorts that can be specified.

The -o option specifies the output file, here ./example/ex.out. This file contains a line for each SNP, preceded by a header line which specifies the contents of each column (see here). The file contains genotype counts for each cohort; for example, controls_1_AA, controls_1_AB, controls_1_BB and controls_1_NULL are the counts of AA, AB, BB and NULL genotypes in the controls_1 cohort. Genotype counts are called using a default threshold of 0.9 (see below). Genotype counts and minor allele frequencies (MAF) in the combined controls, combined cases and combined across all cohorts are also given. The proportion of missing data across all cohorts is given in the column with header missing_data_proportion. Estimated odds ratios are also reported. The columns het_OR, het_OR_lower and het_OR_upper are the estimated odds ratio and lower and upper 95% confidence limits for the heterozygote genotype AB versus the (baseline) AA genotype. The columns hom_OR, hom_OR_lower and hom_OR_upper are the estimated odds ratio and lower and upper 95% confidence limits for the homozygote genotype BB versus the (baseline) AA genotype. The columns all_OR, all_OR_lower and all_OR_upper are the estimated allelic odds ratio and lower and upper 95% confidence limits for the B allele versus the (baseline) A allele. Odds ratios and their confidence limits are set to -1 if they cannot be calculated.
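To illustrate how the odds ratio columns relate to the genotype counts, here is a minimal Python sketch. It assumes the standard Woolf interval on the log odds ratio, which may not be the method SNPTEST itself uses; the function name odds_ratio and all counts are hypothetical. Following the convention above, -1 is returned when a ratio cannot be calculated.

```python
import math

def odds_ratio(case_alt, case_base, ctrl_alt, ctrl_base, z=1.96):
    """Return (OR, lower, upper) for a 2x2 table, or (-1, -1, -1) if undefined.

    Assumption: Woolf's 95% interval on the log odds ratio. The method
    SNPTEST uses is not stated in this document.
    """
    cells = (case_alt, case_base, ctrl_alt, ctrl_base)
    if any(c <= 0 for c in cells):
        return (-1.0, -1.0, -1.0)
    est = (case_alt * ctrl_base) / (case_base * ctrl_alt)
    se = math.sqrt(sum(1.0 / c for c in cells))
    return (est, est * math.exp(-z * se), est * math.exp(z * se))

# Hypothetical (AA, AB, BB) counts for the combined cases and controls.
cases = (400, 450, 150)
controls = (500, 400, 100)

het_or = odds_ratio(cases[1], cases[0], controls[1], controls[0])  # AB vs AA
hom_or = odds_ratio(cases[2], cases[0], controls[2], controls[0])  # BB vs AA

# Allelic table: every individual contributes two alleles.
case_a, case_b = 2 * cases[0] + cases[1], 2 * cases[2] + cases[1]
ctrl_a, ctrl_b = 2 * controls[0] + controls[1], 2 * controls[2] + controls[1]
all_or = odds_ratio(case_b, case_a, ctrl_b, ctrl_a)                # B vs A

print(het_or, hom_or, all_or)
```

With these made-up counts the three point estimates are 1.40625, 1.875 and 1.4.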

Calculating Missing Data Rates

The third column of the sample files contains the missing data proportion for each individual. This can be useful for filtering out individuals with high missing data rates (see below). The -create_miss option can be used to calculate the missing data rates needed to make the sample files. For example, to calculate the missing data rates for the first control cohort use the command

./snptest -create_miss ./example/controls.gen -o ./example/ex.out

This creates an output file ./example/ex.out with the missing data proportion for each individual in the specified genotype file. The proportions are based on calling genotypes at the default calling threshold of 0.9 (see below for details of how to change the threshold). Multiple files can be specified using the -create_miss option, which is useful if the genotype data have been stored in a separate file for each chromosome.

Screen Output

You should notice that SNPTEST produces some screen output when run. Information about which data files were specified, the tests selected, the numbers of SNPs, the total number of cases and the total number of controls, information about the covariates and phenotypes in the sample files and information about individuals and SNPs selected for exclusion is all written to the screen. Also, information about the progress of the program is written to the screen. Warning and/or error messages may also be shown. Incorrect use of the options or input files with the wrong format may cause the program to terminate. The screen output can be used to identify any problems that lead to the termination. The flag -printids can be used to print the SNP IDs of each SNP as it is processed which can be useful to identify where problems occur.

Excluding SNPs and/or Individuals

Excluding SNPs

The -exclude_snps option can be used to specify a file containing a list of SNPs that should be excluded from the analysis. The IDs in the file can be the SNP IDs (first column of the genotype file) or RS IDs (second column of the genotype file). For example, the file ./example/snps.list contains a list of the SNP IDs for the first 10 SNPs in the example data files. To exclude these SNPs from the analysis we can use

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -exclude_snps ./example/snps.list

You should notice that the screen output reports that it has read in 10 SNP IDs and that the output file does not contain output for these SNPs.

Alternatively, the program can be run on a single SNP using the command line option -snpid. For example, to run SNPTEST on the SNP with a SNP ID of 61 we can use

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -snpid 61

Excluding Individuals

The -exclude_samples option can be used to specify a file containing a list of individuals that should be excluded from the analysis. The IDs in the file should be the ID that appears in the first column of the sample files. For example, the file ./example/samples.list contains a list of the IDs for the first 10 individuals in the example data files. To exclude these individuals from the analysis we can use

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -exclude_samples ./example/samples.list

You should notice that the screen output reports that it has read in 10 sample IDs and that these individuals were excluded.

The -miss_thresh option can be used to exclude individuals whose proportion of missing data exceeds some level. The missing data proportion of each individual is specified in the 3rd column of the sample file. For example, to specify a maximum missing data proportion of 1% use

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -miss_thresh 0.01

You should notice that the screen output reports that the number of individuals included after the missing data threshold and exclusion list have been applied is less than the original number of individuals in the raw files.
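The combined effect of an exclusion list and a missing data threshold can be sketched as below. This is a hypothetical illustration rather than SNPTEST's implementation: samples are represented as (ID, missing proportion) pairs, standing in for the first and third columns of the sample file, and filter_individuals is an invented name.

```python
def filter_individuals(samples, miss_thresh=None, excluded_ids=()):
    """Return the IDs of individuals that remain in the analysis.

    samples: iterable of (sample_id, missing_proportion) pairs.
    An individual is dropped if its ID is in the exclusion list or if its
    missing proportion exceeds miss_thresh (when a threshold is given).
    """
    excluded = set(excluded_ids)
    kept = []
    for sample_id, missing in samples:
        if sample_id in excluded:
            continue
        if miss_thresh is not None and missing > miss_thresh:
            continue
        kept.append(sample_id)
    return kept

# Hypothetical sample records.
samples = [("ind1", 0.000), ("ind2", 0.020), ("ind3", 0.010), ("ind4", 0.005)]
kept = filter_individuals(samples, miss_thresh=0.01, excluded_ids=["ind4"])
print(kept)
```

In this example ind2 is removed by the 1% threshold and ind4 by the exclusion list, leaving ind1 and ind3.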

Testing For Hardy-Weinberg Equilibrium (HWE)

Tests for HWE can be included in the output for each SNP by adding the -hwe flag. For example,

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -hwe

will produce an output file ./example/ex.out with columns that contain the p-values for an exact test of HWE in each control cohort, the combined set of control cohorts, each case cohort and the combined set of case cohorts.
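For readers unfamiliar with exact tests of HWE, the Python sketch below shows one common formulation: conditional on the allele counts, the probability of every possible heterozygote count is computed, and the p-value is the sum of the probabilities that are no larger than that of the observed table. Whether SNPTEST uses exactly this definition of the p-value is an assumption; the function name hwe_exact_pvalue is hypothetical.

```python
import math

def hwe_exact_pvalue(n_aa, n_ab, n_bb):
    """Exact HWE p-value from genotype counts (sketch, see assumptions above)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # log P(het | n, n_a) under HWE.
        aa = (n_a - het) // 2
        bb = (n_b - het) // 2
        return (het * math.log(2.0)
                + math.lgamma(n + 1)
                - math.lgamma(aa + 1) - math.lgamma(het + 1) - math.lgamma(bb + 1)
                + math.lgamma(n_a + 1) + math.lgamma(n_b + 1)
                - math.lgamma(2 * n + 1))

    # Possible heterozygote counts share the parity of the allele counts.
    probs = {het: math.exp(log_prob(het))
             for het in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    observed = probs[n_ab]
    # Small relative tolerance so that ties are not lost to rounding.
    p_value = sum(p for p in probs.values() if p <= observed * (1.0 + 1e-9))
    return min(1.0, p_value)

print(hwe_exact_pvalue(25, 50, 25))  # counts at their HWE expectation
print(hwe_exact_pvalue(50, 0, 50))   # no heterozygotes at all
```

The first call gives a p-value close to 1 and the second an extremely small one.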

Basic Association Tests

Case-Control Tests

Standard frequentist case-control tests of association for additive, dominant, recessive, general and heterozygote models can be carried out using the -frequentist option. For example, the following command can be used to carry out tests for these five models on the example datasets.

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -frequentist 1 2 3 4 5

The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. There is no need to specify all five models as in the example above; for instance, the option -frequentist 1 4 would only test for association under the additive and general models. The additive model is the Cochran-Armitage test for an additive genetic effect. The dominant and recessive models are specified using the AA genotype as the baseline genotype. The general model is the standard 2-df test of association.

The output file ./example/ex.out contains all of the summary information about each SNP as described above. The p-values for the five tests are given in the columns frequentist_add, frequentist_dom, frequentist_rec, frequentist_gen and frequentist_het. When a model cannot be fitted to the data the p-value is set to -1.
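As a rough guide to what the additive and general tests compute, here is a Python sketch of the Cochran-Armitage trend test and the 2-df Pearson chi-square test on a 2x3 genotype table. The genotype weights used for the dominant, recessive and heterozygote codings are an assumption made for illustration, and the function names are hypothetical. A value of -1 is returned when a statistic cannot be computed, mirroring the convention above.

```python
import math

def cochran_armitage(cases, controls, weights=(0, 1, 2)):
    """Trend test on (AA, AB, BB) counts. Returns (chi2, p) with 1 df."""
    r_total, s_total = sum(cases), sum(controls)
    n_total = r_total + s_total
    if n_total == 0:
        return (-1.0, -1.0)
    cols = [r + s for r, s in zip(cases, controls)]
    t = sum(w * (r * s_total - s * r_total)
            for w, r, s in zip(weights, cases, controls))
    spread = (n_total * sum(w * w * c for w, c in zip(weights, cols))
              - sum(w * c for w, c in zip(weights, cols)) ** 2)
    variance = r_total * s_total / n_total * spread
    if variance <= 0:
        return (-1.0, -1.0)
    chi2 = t * t / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    return (chi2, math.erfc(math.sqrt(chi2 / 2.0)))

def general_test(cases, controls):
    """Pearson chi-square test on the 2x3 table. Returns (chi2, p) with 2 df."""
    n_total = sum(cases) + sum(controls)
    cols = [r + s for r, s in zip(cases, controls)]
    chi2 = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for observed, col in zip(row, cols):
            if n_total == 0 or row_total * col == 0:
                return (-1.0, -1.0)
            expected = row_total * col / n_total
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 2 degrees of freedom.
    return (chi2, math.exp(-chi2 / 2.0))

# Hypothetical (AA, AB, BB) counts.
cases, controls = (400, 450, 150), (500, 400, 100)
additive = cochran_armitage(cases, controls)
dominant = cochran_armitage(cases, controls, weights=(0, 1, 1))   # assumed coding
recessive = cochran_armitage(cases, controls, weights=(0, 0, 1))  # assumed coding
general = general_test(cases, controls)
print(additive, dominant, recessive, general)
```

With 0/1 weights the trend statistic reduces to the usual chi-square statistic for the corresponding 2x2 table, which is why one function can serve several codings in this sketch.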

NOTE : when carrying out a case-control test you must separate the cases and controls into at least two cohorts.

Quantitative Trait Tests

Tests of SNP association with a quantitative phenotype can be carried out using the -qt option. This option carries out an F-test of association at a SNP. The -frequentist option is used to specify the coding of the genotypes at each SNP (see above). The phenotype(s) of each individual must appear in the sample file (see FILE FORMAT WEBPAGE). By default, the tests will use the first phenotype in the sample files. You should use the -pheno option to specify which phenotype you wish to test. For example, the following command can be used to carry out tests for the five different models using the 2nd phenotype in the example datasets.

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -qt -pheno 2 -frequentist 1 2 3 4 5

NOTE : When using the -qt option the data can be specified as a single control cohort or as multiple cohorts. If a single cohort is specified then use the -controls option to specify it. If the data are specified as multiple case and control cohorts the case-control status is ignored and the specified phenotype is used instead.
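To give a feel for the quantitative trait test under the additive coding, the sketch below computes the F statistic for a simple linear regression of the phenotype on the genotype score (0, 1 or 2 copies of the B allele). This is an assumption about the form of the test, not a description of SNPTEST's code; the function name is hypothetical and the p-value, which requires the F distribution with 1 and N-2 degrees of freedom, is not computed.

```python
import math

def additive_f_statistic(genotypes, phenotypes):
    """F statistic (1 and N-2 df) for phenotype regressed on genotype score.

    Returns -1.0 if the statistic cannot be computed.
    """
    n = len(genotypes)
    if n < 3 or n != len(phenotypes):
        return -1.0
    mean_g = sum(genotypes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    sxy = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotypes, phenotypes))
    if sxx == 0 or syy == 0:
        return -1.0  # no variation in the genotypes or in the phenotype
    r_squared = sxy * sxy / (sxx * syy)
    if r_squared >= 1.0:
        return math.inf  # perfect fit
    return (n - 2) * r_squared / (1.0 - r_squared)

# Hypothetical called genotypes (as B allele counts) and phenotypes.
genotypes = [0, 1, 2, 0, 1, 2]
phenotypes = [1.0, 2.0, 3.5, 1.5, 2.5, 3.0]
f_stat = additive_f_statistic(genotypes, phenotypes)
print(f_stat)
```

For these values the statistic is 128/3, about 42.67.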

Covariates

The -cov option can be used to carry out an association test conditional upon a covariate. For example, to carry out an additive test of association conditional upon the 2nd covariate in the sample file use

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -frequentist 1 -cov 2

This produces an output file ./example/ex.out which contains a column with header frequentist_add_cov_1 that contains the p-values for the test based on the covariate.

The type of test carried out depends upon the type of covariate specified in the sample files. If a covariate is of type 1 then a Mantel-Haenszel test is carried out. This is a test for a common genetic effect where each group is allowed to have its own baseline effect. If a covariate is of type 2 then a test is carried out in each group specified by the covariate and the test statistics are combined to produce an overall test statistic. This tests the hypothesis that there is no genetic effect in any of the groups specified by the covariate. If any of the covariates are of type 3 (this specifies a continuous covariate) a test for a genetic effect over and above that explained by the continuous covariate is carried out (using a maximum likelihood ratio test for a logistic regression model).

The -cov option can be used together with the -qt option to carry out Quantitative Trait tests of association conditional upon covariates.

In addition there is an option -cov_all which can be used to carry out a test conditional upon all the continuous (type 3) covariates in the sample file.

See the FILE FORMAT WEBPAGE for more details of how to specify the type of covariate in the sample file. More details of these tests can be found in [2].

Bayesian Tests (Bayes Factors)

Bayes Factors for the five standard genetic models (additive, dominant, recessive, general and heterozygote) can also be calculated using the -bayesian option. These tests are described in detail in [2]. For example, the command

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -bayesian 1 2 3 4 5

produces an output file with columns bayesian_add, bayesian_dom, bayesian_rec, bayesian_gen and bayesian_het. These contain -log10 Bayes Factors for the additive, dominant, recessive, general and heterozygote models versus a null model of no association. The priors on the model parameters can be changed but are set at their default settings. Please contact us if you wish to learn how to vary these priors.

Bayes Factors for Quantitative Traits have also been implemented but at present have not been documented. Please contact us if you would like to use these options.

NOTE : at present there is no facility to calculate Bayes Factors conditional upon the covariates in the sample files. If you would find this useful we are more likely to implement it if we get lots of requests for it!

Taking Genotype Uncertainty Into Account

Changing the calling threshold

The default setting of the program is to use a threshold to call genotypes as AA, AB, BB or NULL. Both the Frequentist and Bayesian tests of association will be carried out on these thresholded genotypes by default. The genotypes are called by assigning the genotype with the maximum probability if it is greater than the calling threshold; otherwise a NULL genotype is called. The default threshold is 0.9. The threshold can be altered using the -call_thresh option. For example, to produce a set of basic tests based on a threshold of 0.95 use the command

./snptest -cases ./example/cases.gen ./example/cases.sample  -controls ./example/controls.gen ./example/controls.sample  -o ./example/ex.out -call_thresh 0.95 -frequentist 1 2 3 4
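The calling rule described above can be sketched in Python as follows. This is a minimal illustration, not SNPTEST's code; call_genotype and the example probabilities are hypothetical.

```python
from collections import Counter

def call_genotype(probs, threshold=0.9):
    """Call AA, AB or BB if the largest probability is greater than the
    threshold, otherwise call NULL. probs is a (pAA, pAB, pBB) triple."""
    labels = ("AA", "AB", "BB")
    best = max(range(3), key=lambda k: probs[k])
    return labels[best] if probs[best] > threshold else "NULL"

# Hypothetical genotype probabilities for four individuals at one SNP.
individuals = [
    (0.97, 0.03, 0.00),
    (0.05, 0.93, 0.02),
    (0.00, 0.08, 0.92),
    (0.50, 0.50, 0.00),
]
counts_090 = Counter(call_genotype(p) for p in individuals)
counts_095 = Counter(call_genotype(p, threshold=0.95) for p in individuals)
print(counts_090)
print(counts_095)
```

Raising the threshold from 0.9 to 0.95 turns two of the three confident calls into NULL genotypes in this example, which shows how a stricter threshold trades call certainty against missing data.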

Frequentist Tests

The simplest way of taking the uncertainty of the genotype data into account in a Frequentist association test is to use the -exp option. This option uses tests based on the expected genotype counts (in addition to the standard thresholded genotype counts). The sums of the genotype probabilities from the case and control cohorts are used to create a 2x3 expected genotype table to which the standard test statistics are applied. For example, the command

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -frequentist 1 -exp

will produce an output file containing the p-value for the additive test applied to the expected genotype counts (frequentist_add_exp column). The output file will also contain SNP summary information based on the expected genotype counts. The relevant columns will have the suffix _exp. The -exp option works together with the -cov and -qt options. This option does not properly take the genotype uncertainty into account as it treats the expected genotypes as if they were known and does not allow for uncertainty in the overall genotype counts.
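The construction of the 2x3 expected genotype table can be sketched as below. The function name expected_counts and the probabilities are hypothetical, and the sketch assumes the probabilities for each individual are held as (AA, AB, BB) triples.

```python
def expected_counts(cohort):
    """Sum the genotype probabilities over the individuals in a cohort.

    cohort: list of (pAA, pAB, pBB) triples, one per individual.
    Returns the expected numbers of AA, AB and BB genotypes.
    """
    totals = [0.0, 0.0, 0.0]
    for probs in cohort:
        for k in range(3):
            totals[k] += probs[k]
    return totals

# Hypothetical genotype probabilities at one SNP.
case_probs = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.25, 0.75)]
control_probs = [(0.9, 0.1, 0.0), (0.0, 1.0, 0.0)]

# Rows are cases and controls, columns are AA, AB and BB.
table = [expected_counts(case_probs), expected_counts(control_probs)]
print(table)
```

Unlike thresholded calls, every individual contributes to the table, but the entries are then treated as if they were observed counts, which is the limitation noted above.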

The -proper option can be used to take the uncertainty of the genotypes into account completely. This option implements a statistical test based on a missing data likelihood (see [2] for precise details). At present this option does not work together with the -cov or -qt options and only the basic association tests are implemented. For example, the command
 
./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -frequentist 1 -proper

produces an output file with a column labeled frequentist_add_proper which contains the p-value for the additive score test.

Bayesian Tests

The genotype uncertainty can be taken into account in the calculation of the Bayes Factors by sampling genotype counts based on the genotype probabilities and averaging the resulting Bayes Factors. The -nsamp option specifies the number of samples of genotypes that should be used. The default is 0, which means that no sampling is carried out and no sampling-based Bayes Factor is produced. The number of samples needed to produce stable results depends on the amount of uncertainty that exists in the genotypes. We recommend this be set to at least 100 and that the user investigates the stability of the Bayes Factors by varying the number of samples. Sampling genotype counts in this way is computationally intensive and will only produce a different result from the thresholded calls if there are a reasonable number of individuals with a considerable amount of genotype uncertainty. The -certainty_thresh option can be used to specify at which SNPs the sampling is carried out. The average genotype certainty is calculated at each SNP by averaging the maximum genotype probability for each individual. A sampling-based Bayes Factor is then calculated only at SNPs that have an average genotype certainty above the specified level (default is 0.9). SNPs that do not exceed the threshold have the thresholded genotype Bayes Factor reported. More details can be found in [2]. For example, the command

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -bayesian 1 -nsamp 100 -certainty_thresh 0.95

produces a file with a column bayesian_add_samp which contains the sample-averaged -log10 Bayes Factor for the additive model.
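Two ingredients of this procedure, the average genotype certainty and the sampling of genotype counts, can be sketched in Python as below. The Bayes Factor calculation itself is not shown because the priors are not documented here. The function names are hypothetical, and the sketch assumes that each probability triple sums to 1.

```python
import random

def average_certainty(snp):
    """Mean over individuals of the maximum genotype probability."""
    return sum(max(probs) for probs in snp) / len(snp)

def sample_genotype_counts(snp, rng):
    """Draw one genotype per individual from its (pAA, pAB, pBB) triple
    and return the resulting [AA, AB, BB] counts."""
    counts = [0, 0, 0]
    for probs in snp:
        k = rng.choices(range(3), weights=probs)[0]
        counts[k] += 1
    return counts

# Hypothetical genotype probabilities for three individuals at one SNP.
snp = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.0, 0.2, 0.8)]
certainty = average_certainty(snp)

rng = random.Random(1)  # fixed seed so that a run can be repeated
n_samp = 100            # plays the role of the -nsamp value
samples = [sample_genotype_counts(snp, rng) for _ in range(n_samp)]
print(certainty, samples[:3])
```

A Bayes Factor would then be computed for each sampled table and the results averaged. Rerunning with a different seed or a larger n_samp is one way to check the stability recommended above.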

Other Options

The program works by reading in, analyzing and writing output for chunks of the data at a time. The default chunk size is 100 SNPs. To analyze the data in chunks of size 10 use

./snptest -cases ./example/cases.gen ./example/cases.sample -controls ./example/controls.gen ./example/controls.sample -o ./example/ex.out -chunk 10

This option is included to control the maximum amount of RAM used by the program at any one time.
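The idea of chunked processing can be illustrated with a short Python sketch. It is a generic pattern and not a description of SNPTEST's internals; read_in_chunks is a hypothetical name and the input here is a plain list standing in for the lines of a genotype file.

```python
from itertools import islice

def read_in_chunks(lines, chunk_size=100):
    """Yield successive lists of at most chunk_size items, so that only
    one chunk needs to be held in memory at a time."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

# 25 stand-in SNP records processed in chunks of 10.
records = ["snp_%d" % i for i in range(25)]
chunk_sizes = [len(chunk) for chunk in read_in_chunks(records, chunk_size=10)]
print(chunk_sizes)
```

A smaller chunk size lowers peak memory use at the cost of more read and write cycles.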

References

[1] J. Marchini, C. Spencer, Y.Y. Teo and P. Donnelly (2007) A Bayesian Hierarchical Mixture Model for Genotype Calling in a multi-cohort study. (in preparation)
[2] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39 : 906-913
[3] The Wellcome Trust Case Control Consortium (2007) Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447 : 661-678. PMID: 17554300 DOI: 10.1038/nature05911

Contact Information

If you have any questions regarding the use of this program please send an email to Dr Jonathan Marchini (marchini <at> stats <dot> ox <dot> ac <dot> uk)