SNPTEST v2

SNPTEST is a program for the analysis of single SNP association in genome-wide studies. The features implemented include:

Binary (case-control) phenotypes, single and multiple quantitative phenotypes
Bayesian and Frequentist tests
Ability to condition upon an arbitrary set of covariates
Various methods for dealing with imputed SNPs.

The program is designed to work seamlessly with the output of the genotype calling program CHIAMO [1], the genotype imputation program IMPUTE [2] and the program GTOOL. SNPTEST was used in the analysis of the 7 genome-wide association studies carried out by the Wellcome Trust Case-Control Consortium (WTCCC) [3].

SNPTEST has many different features which are illustrated below through a number of different examples that use the datasets provided with the software in the directory /example. These files contain data at 200 SNPs on 1000 individuals that are split into a control cohort and a case cohort. These datasets can be used to try out the tests using both binary (case-control) and quantitative phenotypes.

New features in v2 and changes from v1

The input files are now specified using the -data option, which allows multiple cohorts to be specified at once. The -cases and -controls options have been removed and the new sample file format (see below) allows binary phenotypes, so that case-control tests are carried out by specifying the relevant phenotype in the sample file.
There is a new option -method that is used to specify the method used to fit the chosen model. The new options give better results at SNPs that are rare and/or have high genotype uncertainty.
The -exp_counts option has been removed. The genotype frequencies reported in the output file will be thresholded genotype counts if -method threshold is selected and will be expected counts otherwise. Only those individuals included in the tests will contribute to the counts.
The info and average_maximum_posterior_call measures are now calculated using only those samples used in the test at each SNP.
The Bayesian tests now account for genotype uncertainty and can allow covariates in the tests.
Bayesian Binary Trait tests now have an option to use a t-distribution prior on the genetic effect parameters. This allows more flexibility in specifying the prior beliefs about the genetic effect sizes. See the options -t_prior and -t_df in the section on Bayesian Tests.
There are now Bayesian tests for quantitative traits.
There is now an option -mean_bf that calculates the weighted mean of the Bayes factors across the range of models specified. This 'model averaging' feature allows a range of models to be tested at the same time. See the section on Bayesian Tests.
There is now a Bayesian test for multiple quantitative phenotypes.
Automatic detection of .gz files. If any of the .gen or .sample files are gzipped then this will be detected and the data read in from these files. The -gen_gz option has been removed.
The statistical details of the implemented tests (Frequentist and Bayesian) and the information measures are described in a pdf.

Download

SNPTEST is free to use for academic purposes only. Please see the LICENCE, which is also included with the package.

Pre-compiled versions of the program and example files can be downloaded from the links below. We've supplied both static and dynamic versions of the Linux executables. If you intend to run SNPTEST on a machine running an old kernel then you probably want to use the dynamic version. If you have any problems getting the program to work on your machine please contact us.

Platform	File
Linux (x86_64) Static Executable	snptest_v2.1.1_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)	snptest_v2.1.1_SuSE9.3_x86_64_static.tgz
Linux (x86_64) Dynamic Executable	snptest_v2.1.1_x86_64_dynamic.tgz
Linux (i386) Static Executable	snptest_v2.1.1_i386_static.tgz
Linux (i386) Dynamic Executable	snptest_v2.1.1_i386_dynamic.tgz
Mac OS X 10.4-10.6 Intel	snptest_v2.1.1_MacOSX_10.5_Intel.tgz
Mac OS X (PowerPC)	snptest_v2.1.1_MacOSX_PowerPC.tgz
Solaris 5.8 (Sun SPARC)	snptest_v2.1.1_Solaris5.8_SPARC.tar.gz
Solaris 5.10 (AMD Opterons)	snptest_v2.1.1_Solaris5.10_Opteron.tgz
SLES 10 (Intel Itanium2)	snptest_v2.1.1_SLES10_Itanium2.tgz
Windows MS-DOS (Intel)	snptest_v2.1.1_Windows_Intel.tgz

Please fill out the registration form to receive emails about updates to this software.

To unpack the files use a command like

tar zxvf snptest_vX.X.X_i386.tgz

This will create an executable called snptest and a directory /example that contains the example files. To see a list of options available in SNPTEST type

./snptest -help

Input File Formats

SNPTEST allows the analysis of multiple cohorts of individuals. The data for each cohort is stored in two files. The first file (the genotype file) stores the genotype data for the cohort. The second file (the sample file) stores the IDs and associated covariate and phenotype information of the individuals in each cohort. For the example datasets included with the software the sample and genotype files for each of these cohorts have the suffixes .sample and .gen respectively. The file format is described on a NEW FILE FORMAT WEBPAGE. The format of the sample files used by SNPTEST v2 is slightly different from the format used by SNPTEST v1. The difference is in how the covariates and phenotypes are denoted in the second line of the sample file.

NOTE 1 : when using multiple cohorts SNPTEST assumes that

each cohort has data at the same set of SNPs and that these SNPs are stored in the same order in each of the cohort genotype files.
the sample files for each cohort have exactly the same set of covariates and phenotypes and these occur in the same order in the files.

NOTE 2 : if any of the .gen or .sample files are gzipped then this will be detected automatically by the program.

Data Summaries

The simplest use of SNPTEST is to calculate data summaries for each SNP (genotype counts, allele frequencies, SNP missing data proportions and odds ratios).

NOTE : within each command box below, most lines end with the '\' character. This is not actually part of the command -- it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split each example command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window (so, for example, you should be able to directly paste these commands into the terminal and hit 'enter' to make them run), but it would be equivalent to put all of the arguments on a single line, separated by spaces.
For example, the command

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out

produces a file ./example/ex.out which contains the data summaries for all 200 SNPs across the two cohorts. Note how the cohorts were specified by placing the relevant genotype and sample files after the -data option in the command. For each cohort the name of the genotype file should be followed by its associated sample file. There is a limit of 18 cohorts that can be specified.

The -o option specifies the output file, i.e. ./example/ex.out. This file contains a line for each SNP and there is a header line which specifies the contents of each column. The following table gives a description of each of the entries in this file.

id	SNP ID (taken from input files)
rsid	RS ID of the SNP (taken from input files)
pos	Base pair position of the SNP
allele_A allele_B	The two alleles at the SNP. allele_A is coded 0 and allele_B is coded 1.
average_maximum_posterior_call	The average maximum posterior probability across all individuals in the sample that are used for the test at each SNP. This is a measure of how much uncertainty there is at each SNP. Samples excluded will be (a) those excluded using the -exclude_samples option, (b) samples with a missing phenotype or covariate relevant to the test, (c) samples without genotypes if the -method threshold option is used, (d) samples where the sum of the genotype probabilities is less than the value set by the option -total_prob_limit (default 0.1).
info	A measure of the observed statistical information for the estimate of allele frequency of the SNP using all individuals in the sample that are used for the test at each SNP. This measure has a maximum value of 1, which indicates perfect information. Samples excluded will be (a) those excluded using the -exclude_samples option, (b) samples with a missing phenotype or covariate relevant to the test, (c) samples without genotypes if the -method threshold option is used, (d) samples where the sum of the genotype probabilities is less than the value set by the option -total_prob_limit (default 0.1).
cohort_1_AA cohort_1_AB cohort_1_BB cohort_1_NULL	Counts of AA, AB, BB and NULL genotypes in the 1st cohort. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
cohort_2_AA cohort_2_AB cohort_2_BB cohort_2_NULL	Counts of AA, AB, BB and NULL genotypes for the 2nd cohort (see details above). Subsequent cohorts will be included in a similar way. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
all_AA all_AB all_BB all_NULL	Counts of AA, AB, BB and NULL genotypes across all cohorts. See Note below which details exactly how genotype counts are calculated in SNPTEST v2.
all_maf	Minor allele frequency (MAF) combined across all cohorts.
missing_data_proportion	The proportion of missing data across all cohorts.

NOTE ON HOW GENOTYPE COUNTS, MINOR ALLELE FREQUENCIES AND MISSING DATA PROPORTIONS ARE CALCULATED IN SNPTEST v2
If no association tests are specified or -method threshold is specified then thresholded genotype counts are reported. Otherwise, expected counts are given. The expected count for a genotype is the sum of the probabilities of that genotype across all individuals in the sample. If individuals are explicitly excluded then they will not be included in the genotype counts in any way. When testing for association, if an individual has at least one missing phenotype or missing covariate that is needed for the test then their genotype will be called as NULL in the genotype counts. Samples where the sum of the genotype probabilities is less than the value set by the option -total_prob_limit (default 0.1) will also be counted as NULL at each SNP. This way of calculating genotype counts has changed from v1 to v2.
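
To make the difference between thresholded and expected counts concrete, here is a minimal Python sketch. It is not SNPTEST's own code: the function name genotype_counts, the use of >= when comparing against the calling threshold, and the choice to leave the NULL count untouched by leftover probability in the expected case are assumptions made for illustration.

```python
def genotype_counts(probs, method="threshold", call_thresh=0.9, total_prob_limit=0.1):
    """Count genotypes from per-individual (p_AA, p_AB, p_BB) probabilities.

    Illustrative only; the exact rules used by SNPTEST may differ.
    """
    labels = ("AA", "AB", "BB")
    counts = {"AA": 0.0, "AB": 0.0, "BB": 0.0, "NULL": 0.0}
    for p in probs:
        # Samples with too little total probability are counted as NULL.
        if sum(p) < total_prob_limit:
            counts["NULL"] += 1
            continue
        if method == "threshold":
            # Call the most probable genotype only if it reaches the threshold.
            best = max(range(3), key=lambda g: p[g])
            if p[best] >= call_thresh:
                counts[labels[best]] += 1
            else:
                counts["NULL"] += 1
        else:
            # Expected counts: add each genotype probability to its total.
            for label, prob in zip(labels, p):
                counts[label] += prob
    return counts


probs = [(1.0, 0.0, 0.0), (0.95, 0.05, 0.0), (0.6, 0.4, 0.0), (0.0, 0.0, 0.05)]
print(genotype_counts(probs, method="threshold"))
print(genotype_counts(probs, method="expected"))
```

With these four individuals the thresholded counts call two AA genotypes and two NULLs, whereas the expected counts spread the third individual's probability over AA and AB.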

If a test for a binary phenotype is being carried out then the following additional fields are included

controls_AA controls_AB controls_BB controls_NULL	Counts of AA, AB, BB and NULL genotypes across all controls. See Note above which details exactly how genotype counts are calculated in SNPTEST v2.
cases_AA cases_AB cases_BB cases_NULL	Counts of AA, AB, BB and NULL genotypes across all cases. See Note above which details exactly how genotype counts are calculated in SNPTEST v2.
cases_maf controls_maf	Minor allele frequencies (MAF) in the cases and controls across all cohorts.
het_OR het_OR_lower het_OR_upper	Estimated odds ratios and lower and upper 95% confidence limits for the heterozygote genotype AB versus the (baseline) AA genotype.
hom_OR hom_OR_lower hom_OR_upper	Estimated odds ratios and lower and upper 95% confidence limits for the homozygote genotype BB versus the (baseline) AA genotype.
all_OR all_OR_lower all_OR_upper	Estimated allelic odds ratios and lower and upper 95% confidence limits for the B allele versus the (baseline) A allele.

NOTE : Odds ratios and their confidence limits are set to -1 if they cannot be calculated.
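
The formula SNPTEST uses for the confidence limits is not stated here. Purely as an illustration, the following Python sketch computes an allelic odds ratio (B allele versus A allele) from genotype counts, using the standard Woolf interval on the log odds ratio, and returns -1 for all three values when a cell is empty. The function name allelic_odds_ratio and the choice of interval are assumptions.

```python
import math


def allelic_odds_ratio(case_counts, control_counts, z=1.96):
    """Return (OR, lower, upper) for allele B versus allele A.

    case_counts and control_counts are (AA, AB, BB) genotype counts.
    Uses the Woolf log odds ratio interval (an assumption, not
    necessarily what SNPTEST does). Returns -1 values if undefined.
    """
    case_a = 2 * case_counts[0] + case_counts[1]
    case_b = 2 * case_counts[2] + case_counts[1]
    ctrl_a = 2 * control_counts[0] + control_counts[1]
    ctrl_b = 2 * control_counts[2] + control_counts[1]
    cells = (case_a, case_b, ctrl_a, ctrl_b)
    if min(cells) <= 0:
        return (-1.0, -1.0, -1.0)
    log_or = math.log((case_b * ctrl_a) / (case_a * ctrl_b))
    se = math.sqrt(sum(1.0 / c for c in cells))
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))


print(allelic_odds_ratio((100, 200, 100), (225, 150, 25)))
```

In this made-up example the cases carry 400 A and 400 B alleles and the controls carry 600 A and 200 B alleles, so the allelic odds ratio is 3.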

Screen Output

You should notice that SNPTEST produces some screen output when run. Information about which data files were specified, the tests selected, the numbers of SNPs, the total number of cases and the total number of controls, information about the covariates and phenotypes in the sample files and information about individuals and SNPs selected for exclusion is all written to the screen. Also, information about the progress of the program is written to the screen. Warning and/or error messages may also be shown. Incorrect use of the options or input files with the wrong format may cause the program to terminate. The screen output can be used to identify any problems that lead to the termination. The flag -printids can be used to print the SNP IDs of each SNP as it is processed which can be useful to identify where problems occur.

Excluding SNPs and/or Individuals

Excluding SNPs

The -exclude_snps option can be used to specify a file containing a list of SNPs that should be excluded from the analysis. The IDs in the file can be the SNP IDs (first column of the genotype file) or RS IDs (second column of the genotype file). For example, the file ./example/snps.list contains a list of the SNP IDs for the first 10 SNPs in the example data files. To exclude these SNPs from the analysis we can use

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-exclude_snps ./example/snps.list

You should notice that the screen output reports that it has read in 10 SNP IDs and that the output file does not contain output for these SNPs.

Alternatively, the program can be run on a single SNP using the command line option -snpid. For example, to run SNPTEST on the SNP with a SNP ID of 61 we can use

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-snpid 61

Excluding Individuals

The -exclude_samples option can be used to specify a file containing a list of individuals that should be excluded from the analysis. The IDs in the file should be the ID that appears in the first column of the sample files. For example, the file ./example/samples.list contains a list of the IDs for the first 10 individuals in the example data files. To exclude these individuals from the analysis we can use

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-exclude_samples ./example/samples.list

You should notice that the screen output reports that it has read in 10 sample IDs and that these individuals were excluded.

The -miss_thresh option can be used to exclude individuals whose proportion of missing data exceeds a given threshold. The missing data proportion of each individual is specified in the 3rd column of the sample file. For example, to specify a maximum missing data proportion of 1% use

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-miss_thresh 0.01

You should notice that the screen output reports that the number of individuals included after the missing data threshold and exclusion list have been applied is less than the original number of individuals in the raw files.

Excluding individuals with missing covariate or phenotype values

When carrying out a statistical test that conditions on covariates or uses a quantitative phenotype, any individual with at least one missing value of a covariate or phenotype will be excluded from the test. The default code for missing covariates or phenotypes in the sample files is -9 (see FILE FORMAT). The option -missing_code can be used to specify a non-default numeric value for the missing data code in the sample files. For example, use -missing_code -999 to specify that the value -999 has been used in the sample files.
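
The exclusion rule can be pictured with a short Python sketch. This is only a model of the behaviour described above, not SNPTEST's implementation: the dictionary layout, the "id" key and the function name samples_used are hypothetical.

```python
def samples_used(rows, columns, missing_code=-9.0):
    """Return the IDs of samples with no missing value in `columns`.

    rows is a list of dicts holding one sample each (hypothetical layout).
    A sample is dropped if any needed column equals the missing code.
    """
    kept = []
    for row in rows:
        if any(row[c] == missing_code for c in columns):
            continue  # at least one needed value is missing
        kept.append(row["id"])
    return kept


rows = [
    {"id": "s1", "pheno1": 1.2, "cov3": 0.1},
    {"id": "s2", "pheno1": -9, "cov3": 0.3},
    {"id": "s3", "pheno1": 0.4, "cov3": -999},
]
print(samples_used(rows, ["pheno1", "cov3"]))                     # default code -9
print(samples_used(rows, ["pheno1", "cov3"], missing_code=-999))  # like -missing_code -999
```

Note that changing the missing code changes which values are treated as missing: with -999 as the code, a value of -9 is read as a genuine observation.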

Testing For Hardy-Weinberg Equilibrium (HWE)

Tests for HWE can be included in the output for each SNP by adding the -hwe flag. For example,

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-hwe

will produce an output file ./example/ex.out with columns that contain the p-values for an exact test of HWE in each cohort. If a test for a binary phenotype is carried out then HWE p-values for all the case individuals and for all the control individuals are also reported.

Frequentist Association Tests

There are 3 options that control Frequentist testing for association (-pheno, -frequentist and -method).

-pheno <name>

This specifies which phenotype you wish to test. The <name> should match one of the phenotypes in the sample file. If the phenotype in the sample file is binary (B) then a case-control test is carried out. If the phenotype in the sample file is continuous (P) then a quantitative trait test (i.e. F-test for a linear model) is carried out. See FILE FORMAT WEBPAGE for more details about how to specify a phenotype in the sample file. If no phenotype is specified then the first phenotype in the sample file is used.

-frequentist <t1>...<tn>

This option controls the model you wish to test at each SNP versus a model of no association. The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. When using this option the output file will have a column for each test that contains the p-value for the test as well as estimates of the model parameters (betas) and their standard errors. SNPTEST codes allele_A as 0 and allele_B as 1 and this defines the meaning of the betas and their standard errors. For example, when using the additive model with a binary phenotype the beta estimates the increase in log-odds that can be attributed to each copy of allele_B. When a model cannot be fitted to the data the p-value is set to -1.
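
Since the additive beta for a binary phenotype is on the log-odds scale, it can be converted to a per-allele odds ratio after the run. The sketch below is a hypothetical post-processing helper, not part of SNPTEST: it assumes a normal approximation for the 95% interval and treats a p-value of -1 as "model not fitted".

```python
import math


def additive_odds_ratio(beta, se, pvalue, z=1.96):
    """Convert an additive-model beta (log-odds per copy of allele_B)
    into an odds ratio with an approximate 95% interval.

    Returns None when the p-value is -1, i.e. the model was not fitted.
    """
    if pvalue == -1:
        return None
    return {
        "or": math.exp(beta),
        "lower": math.exp(beta - z * se),
        "upper": math.exp(beta + z * se),
    }


print(additive_odds_ratio(beta=math.log(2), se=0.1, pvalue=1e-6))
print(additive_odds_ratio(beta=0.0, se=0.0, pvalue=-1))
```

The first call corresponds to a doubling of the odds of being a case for each copy of allele_B.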

Dealing with genotype uncertainty (the -method option)

The -method option controls the way genotype uncertainty is taken into account when carrying out association tests. The options are listed in the table below.

-method threshold	Use thresholded genotypes. The calling threshold is controlled by the flag -call_thresh. The default calling threshold is 0.9. This is the same as the default option in previous versions.
-method expected	Use expected genotype counts.
-method score	Use a missing data likelihood score test. This is equivalent to the -proper option in previous versions, except that if the score test experiences problems at a SNP (usually due to a rare SNP and/or high uncertainty) then -method em is used for this SNP.
-method ml	Use multiple Newton-Raphson iterations to estimate the parameters in the missing data likelihood for the model.
-method em	Use an EM algorithm to estimate the parameters in the missing data likelihood for the model.

There are two other options that control how the imputed genotypes are treated.

-renorm

The genotype probabilities for a given genotype are not required to sum to 1. One reason they may not is that probabilistic genotype calls from an algorithm like CHIAMO are used. In this case the probabilities might sum to less than one and any left over probability is the probability of a NULL call. The -renorm option renormalizes the genotype probabilities to sum to 1. The default is not to renormalize the probabilities unless the -method expected option is chosen, in which case it is automatically turned on. If the probabilities do not sum to 1 for a given genotype, then when using the score, ml or em options the tests will naturally accommodate the situation, i.e. the genotype will be given less overall weight in the analysis than other genotypes whose probabilities do sum to 1.

-total_prob_limit <x>

There is an internal lower limit set on the sum of genotype probabilities. The default is 0.1. If this threshold is not met then that genotype is not included in the test. This protects against SNPs with a high proportion of NULL genotypes.
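
The combined effect of -renorm and -total_prob_limit on one genotype can be sketched in a few lines of Python. This is an illustration of the behaviour described above under simple assumptions (the function name prepare_probs is hypothetical, and the limit is applied before renormalization), not SNPTEST's actual code.

```python
def prepare_probs(p, renorm=False, total_prob_limit=0.1):
    """Apply the total-probability limit and optional renormalization
    to one (p_AA, p_AB, p_BB) triple.

    Returns None when the genotype would be left out of the test.
    """
    total = sum(p)
    if total < total_prob_limit:
        return None  # too little information: treated as excluded
    if renorm:
        return tuple(x / total for x in p)  # rescale to sum to 1
    return tuple(p)


print(prepare_probs((0.6, 0.2, 0.0)))               # left as is
print(prepare_probs((0.6, 0.2, 0.0), renorm=True))  # rescaled to sum to 1
print(prepare_probs((0.03, 0.02, 0.0)))             # below the limit
```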

The statistical details of the Frequentist tests implemented are given in this pdf.

Information measure

If score, ml or em is chosen as the method when using a frequentist test then a relative information measure will be calculated at each SNP. This will be reported in a column ending in _info. The statistical details of these information measures are given in this pdf.

Output column naming convention

The naming convention for the columns of the output file that contain the results of the statistical tests is

<phenotype_name(s)>_<test_type>_<genetic_model>_<covariate_name(s)>_<summary_measure>

<phenotype_name(s)>	The name (or names if -mpheno is used) of the phenotypes used in the test.
<test_type>	frequentist or bayesian
<genetic_model>	add, dom, rec, gen or het
<covariate_name(s)>	The name (or names) of the covariates being conditioned upon in the test
<summary_measure>	One of pvalue, info, beta_X, se_X or log10_bf depending on the column

Example 1 - Case-Control Test

The following example carries out a case-control test for the binary phenotype named bin1.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1

The p-value for the test is given in the column bin1_frequentist_add_pvalue. Parameter estimates and their standard errors are given in the columns labeled bin1_frequentist_add_beta_1 and bin1_frequentist_add_se_1.
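
These column names follow the naming convention given earlier. When post-processing output files it can be handy to build the names programmatically; the Python helper below is a hypothetical convenience, not part of SNPTEST. It assumes that the model codes 1 to 5 map to add, dom, rec, gen and het in that order, and that multiple phenotype or covariate names are joined with underscores.

```python
# Assumed mapping from -frequentist / -bayesian codes to column labels.
MODEL_NAMES = {1: "add", 2: "dom", 3: "rec", 4: "gen", 5: "het"}


def result_column(phenotypes, test_type, model_code, summary, covariates=()):
    """Build an output column name of the form
    <phenotype(s)>_<test_type>_<model>_<covariate(s)>_<summary>.
    """
    parts = list(phenotypes)
    parts += [test_type, MODEL_NAMES[model_code]]
    parts += list(covariates)
    parts.append(summary)
    return "_".join(parts)


print(result_column(["bin1"], "frequentist", 1, "pvalue"))
print(result_column(["bin1"], "frequentist", 1, "beta_1"))
print(result_column(["bin2"], "frequentist", 1, "pvalue", covariates=["cov1"]))
```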

Example 2 - Quantitative Trait Test

The following example carries out a quantitative trait test for the quantitative phenotype named pheno1.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-method score \
-frequentist 1 \
-pheno pheno1

The p-value for the test is given in the column pheno1_frequentist_add_pvalue. Parameter estimates and their standard errors are given in the columns labeled pheno1_frequentist_add_beta_1 and pheno1_frequentist_add_se_1.

Covariates

There are several options that control how covariates can be conditioned upon in order to carry out a test of association.

-cov_names <name_1> ... <name_n>	Condition upon the covariates in the sample files with names name_1,...., name_n.
-cov_all	Condition upon all the covariates in the sample files.
-cov_all_discrete	Condition upon all the discrete covariates (D) in the sample files.
-cov_all_continuous	Condition upon all the continuous covariates (C) in the sample files.

Conditioning upon one (or more) covariate means that the test of association being carried out is testing for a genetic effect over and above that explained by the covariate(s). Discrete covariates are added into the model as factors i.e. a different baseline term for each level of the factor is fitted.

Example 1 - Mantel-Haenszel Test

If a single Discrete (D) covariate is conditioned upon then this is equivalent to a Mantel-Haenszel test. This is a test for a common genetic effect where each group is allowed to have its own baseline effect. Here is an example of conditioning upon the binary covariate called cov1 in the sample files.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin2 \
-cov_names cov1

This produces an output file ./example/ex.out which contains a column with header bin2_frequentist_add_cov1_pvalue that contains the p-values for the test based on the covariate.

Example 2 - Conditioning on covariates that code for population structure

For association studies it has become popular to use eigenvectors from a principal components analysis (PCA) to code for unobserved population structure. This is carried out in SNPTEST by setting the eigenvectors as Continuous (C) covariates in the sample file and then conditioning upon these covariates. Here is an example of conditioning upon the two continuous covariates called cov3 and cov4 in the sample files.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-frequentist 1 \
-method score \
-pheno bin1 \
-cov_names cov3 cov4

Bayesian Tests (Bayes Factors)

The Bayesian tests are specified by the -bayesian option, in a similar way to the use of the -frequentist option. The statistical details of the Bayesian tests implemented are given in this pdf.

-bayesian <t1>...<tn>

This option controls the model you wish to test at each SNP versus a model of no association. The five different models are coded as 1=Additive, 2=Dominant, 3=Recessive, 4=General and 5=Heterozygote. When using this option the output file will have a column for each test that contains the log10 Bayes Factor for the test as well as posterior mean estimates of the model parameters (betas) and their standard errors. SNPTEST codes allele_A as 0 and allele_B as 1 and this defines the meaning of the betas and their standard errors. For example, when using the additive model with a binary phenotype the beta estimates the increase in log-odds that can be attributed to each copy of allele_B. A Bayes factor will always be calculated at a SNP.

The -method option is also used to control the way the Bayesian models are fit, but not all options are valid.

If the phenotype is binary then the only options that work are threshold, expected, score and ml. The score option uses a single Newton-Raphson iteration to estimate the mode of the posterior while the ml option uses multiple iterations.
If the phenotype is quantitative then the only options that work are threshold and expected.

Priors for Binary Trait models

The table below gives a description of the linear predictor of the logistic regression used, the form of the priors used on the model parameters, the default priors used in SNPTEST and the command line option that can be used to change the priors.

Model	Linear Predictor	Priors	Default	Coding	Command line option
Additive	log(p_i/(1-p_i)) = µ + ßG_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.2	G_i is the additive coding of the SNP i.e. AA -> 0, AB ->1, BB -> 2.	-prior_add a₀ a₁ b₀ b₁
Dominant	log(p_i/(1-p_i)) = µ + ßD_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.5	D_i is the dominant coding of the SNP i.e. AA -> 0, AB -> 1, BB -> 1.	-prior_dom a₀ a₁ b₀ b₁
Recessive	log(p_i/(1-p_i)) = µ + ßR_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.5	R_i is the recessive coding of the SNP i.e. AA -> 0, AB -> 0, BB -> 1.	-prior_rec a₀ a₁ b₀ b₁
General	log(p_i/(1-p_i)) = µ + ßG_i + qH_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²) q~N(c₀, c₁²)	a₀=0, a₁=1 b₀=0, b₁=0.2 c₀=0, c₁=0.5	G_i is the additive coding of the SNP i.e. AA -> 0, AB -> 1, BB -> 2. H_i is the heterozygote coding of the SNP i.e. AA -> 0, AB -> 1, BB -> 0.	-prior_gen a₀ a₁ b₀ b₁ c₀ c₁
Heterozygote	log(p_i/(1-p_i)) = µ + ßH_i	µ~N(a₀, a₁²) ß~N(b₀, b₁²)	a₀=0, a₁=1 b₀=0, b₁=0.5	H_i is the heterozygote coding of the SNP i.e. AA -> 0, AB ->1, BB -> 0.	-prior_het a₀ a₁ b₀ b₁
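
The genotype codings and linear predictors in the table can be written out as a small Python sketch. It simply restates the table; the names CODINGS, linear_predictor and case_probability are hypothetical, and the parameter values in the usage lines are arbitrary.

```python
import math

# Genotype codings taken from the table above.
CODINGS = {
    "additive":     {"AA": 0, "AB": 1, "BB": 2},
    "dominant":     {"AA": 0, "AB": 1, "BB": 1},
    "recessive":    {"AA": 0, "AB": 0, "BB": 1},
    "heterozygote": {"AA": 0, "AB": 1, "BB": 0},
}


def linear_predictor(model, genotype, mu, beta, q=0.0):
    """Return log(p/(1-p)) for one individual under the chosen model."""
    if model == "general":
        # General model: additive term plus a heterozygote term.
        return (mu
                + beta * CODINGS["additive"][genotype]
                + q * CODINGS["heterozygote"][genotype])
    return mu + beta * CODINGS[model][genotype]


def case_probability(eta):
    """Invert the logit link to get the probability of being a case."""
    return 1.0 / (1.0 + math.exp(-eta))


eta = linear_predictor("additive", "BB", mu=0.0, beta=0.2)
print(eta, case_probability(eta))
print(linear_predictor("general", "AB", mu=0.0, beta=0.2, q=0.5))
```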

t-distribution priors

In SNPTEST v2 there is a new option to specify the use of t-distribution priors on the genetic effects. The fatter tails of the t-distribution allow more flexibility in specifying beliefs about the size of the genetic effects. This feature is controlled by the following two options.

-t_prior

Specifies the use of t-distribution priors on the genetic effects. Effectively, this option modifies the priors described in the table above, i.e. the mean and variance of the t-distributions are specified by the options given in the table above, but the normal distribution is replaced by the t-distribution. NOTE : a t-distribution is only used for the genetic effects, i.e. the parameters ß and q in the models above. For example, -bayesian 1 -t_prior would specify the linear predictor log(p_i/(1-p_i)) = µ + ßG_i and the priors would be µ~N(a₀, a₁²) and ß~t(b₀, b₁², df = 3).

-t_df <x>

The degrees of freedom parameter of the t-distribution. The default value is 3. When this parameter is set very large the prior converges to the normal distribution prior.

Example - Bayesian Case-Control Test

The following example calculates an additive model Bayes Factor for the binary phenotype named bin1 using the default priors.

./snptest \
-data ./example/cohort1.gen ./example/cohort1.sample ./example/cohort2.gen ./example/cohort2.sample \
-o ./example/ex.out \
-bayesian 1 \
-method score \
-pheno bin1

Bayesian Quantitative Trait models and priors

The Bayesian tests for quantitative traits are carried out using the conjugate prior formulation of the linear model using either thresholded genotypes (-method threshold) or the expected genotypes (-method expected). The model is most easily explained through an example. For an additive model the formulation is

y_i = ßG_i + e_i, e_i ~ N(0, σ²), where

y_i = the residual phenotype for the ith individual. The residual phenotype is calculated by subtracting off a baseline term and estimates of any specified covariates.
G_i = the thresholded or expected genotype of the ith individual.
σ² = the error variance of the model.

This model is compared to the model

y_i = e_i, e_i ~ N(0, σ²).

We use a Normal Inverse Gamma (NIG) prior on the effects ß and σ². This prior has the form

σ²~ IG(a,b) and ß ~ N(m_ß, V_ßσ²)

One way to set this prior is as follows :

It can be shown that the expected non-centrality parameter for the F-test when fitting the above linear model is approximately 2Np(1 − p)ß²/σ², where ß and σ² are the true values of the alternative model, p is the allele frequency of the SNP and 2N is the total sample size. This can be usefully compared to the non-centrality parameter for the case-control test, which is approximately Np(1 − p)ß², assuming N cases and N controls, where ß is the log-odds ratio parameter of a logistic regression model. So, if we are happy to put a N(0, 0.2²) prior on ß for a binary trait we might reasonably put the same prior on √2ß/σ in the model above, i.e. ß ∼ N(0, 0.02σ²).

In the context of the NIG prior used in SNPTEST v2 this would mean setting m_ß=0 and V_ß = 0.02.
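
The arithmetic behind V_ß = 0.02 can be written out explicitly. The block below only restates the scaling argument made in the text, with β standing for ß; it adds no new assumptions beyond those already stated.

```latex
\frac{\sqrt{2}\,\beta}{\sigma} \sim N\!\left(0,\ 0.2^{2}\right)
\;\Longrightarrow\;
\beta \sim N\!\left(0,\ \frac{0.2^{2}}{2}\,\sigma^{2}\right)
      = N\!\left(0,\ 0.02\,\sigma^{2}\right),
\qquad\text{so}\qquad
m_{\beta} = 0, \quad V_{\beta} = \frac{0.04}{2} = 0.02 .
```

Matching this to the NIG form ß ~ N(m_ß, V_ßσ²) gives the values quoted above.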

The results should be reasonably robust to changes in the parameters of the Inverse Gamma prior (a and b), especially since most genetic effects will be very small in GWAS and SNPTEST is focused on this situation. We recommend setting a and b so that the mean of the prior distribution is equal to the total phenotype variance. If a dominant, recessive, heterozygote or general model is selected then a similar argument can be used.

The prior parameters of these models are controlled by the set of options in the following table

-prior_qt_mean_b	m_ß
-prior_qt_mean_q	m_q
-prior_qt_V_b	V_ß
-prior_qt_V_q	V_q
-prior_qt_a	a
-prior_qt_b	b

NOTE : there are no default values for these parameters. You MUST specify them manually in order to use the Bayesian Quantitative Trait models.

Model averaging option

The option -mean_bf is used to average over a set of Bayesian models. This can be used for both binary and quantitative phenotype tests.

-mean_bf <w1>...<wn>

Specifies that a log10 Bayes factor is calculated for a weighted average over the models specified by -bayesian, with weights given by <w1>...<wn>. For example, -bayesian 1 4 -mean_bf 9 1 would calculate a Bayes factor for a weighted average of the additive and general models where the additive model is given weight 9 and the general model is given weight 1. The log10 Bayes factor will be written in a column with the label mean_bf.
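
As a rough illustration of model averaging, the Python sketch below combines per-model log10 Bayes factors into a weighted mean on the Bayes factor scale. It assumes that the weights are normalized to sum to 1 before averaging; the exact calculation used by SNPTEST is described in the statistical details pdf, and the function name mean_log10_bf is hypothetical.

```python
import math


def mean_log10_bf(log10_bfs, weights):
    """Return log10 of the weighted mean of Bayes factors.

    The mean is taken on the Bayes factor scale, not the log scale.
    Weights are normalized to sum to 1 (an assumption).
    """
    total = sum(weights)
    # Factor out the largest value to avoid overflow for big Bayes factors.
    m = max(log10_bfs)
    s = sum((w / total) * 10 ** (x - m) for x, w in zip(log10_bfs, weights))
    return m + math.log10(s)


# Additive model log10 BF = 1, general model log10 BF = 2, weights 9 and 1:
# the mean Bayes factor is 0.9 * 10 + 0.1 * 100 = 19.
print(mean_log10_bf([1.0, 2.0], [9, 1]))
```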

Other Options

-chunk <x>	The program works by reading in, analyzing and writing output for chunks of the data at a time. This option is included to control the maximum amount of RAM used by the program at any one time. The default chunk size is 100 SNPs.
-nowarn	Turns off the printing of warnings to the screen.

FAQ

Q : I get the error "igamc underflow error" printed to the screen. What does this mean?

This error occurs at SNPs where a very small p-value from a chi-squared test needs to be calculated. SNPTEST uses the CPROB library to carry this out, and the library reports an underflow error when this occurs. In this case it returns a p-value of 0. This usually occurs when the signal of association is very strong and can sometimes indicate problems with the data. To identify which SNPs this occurs at you can use the -printids flag.

Version History

2.1.1	01-04-2010	Minor update
2.1.0	19-03-2010	This is a major change to SNPTEST from previous versions. Please read the following carefully.
The file format used by this version has been modified (NEW FILE FORMAT).
I have changed type 1,2,3 covariates to types D=discrete, C=continuous in the sample file.
Binary phenotypes now need to be specified in the sample files by using a column of 1's and 0's (1=case and 0=control). The column should be labelled B. Quantitative phenotypes should be labelled P. Look at the sample files example/*.sample for examples.
The -cases and -controls flags have been replaced by the -data option, i.e. all cohorts should be specified by this option. You can specify multiple gen and sample files but you no longer divide them up into cases and controls.
There is no longer a -qt flag. To specify the phenotype you use -pheno <name>. The name should match the column you want to use from the sample file. SNPTEST runs logistic regression or linear regression depending on the type of phenotype you select.
There are some changes to the output and the header line of the output file. They are straightforward: some of the column names have changed, and you get a few extra columns of output if you use a binary phenotype.
The -cov_names flag has been added so that you can specify covariates by their name, e.g. -cov_names Gender will condition on the covariate named Gender.
Multiple covariates can now be specified, e.g. -cov_names 1 3 will condition on covariates 1 and 3, and it does not matter if they are of different types.
There are now 3 flags that allow you to specify groups of covariates: (i) -cov_all_continuous conditions on all continuous covariates, (ii) -cov_all_discrete conditions on all discrete covariates, (iii) -cov_all conditions on all covariates.
If no association tests are specified or -method threshold is specified then thresholded genotype counts are reported. Otherwise, expected counts are given. The expected count for a genotype is the sum of the probabilities across all individuals in the sample. If individuals are explicitly excluded then they will not be included in the genotype counts in any way. When testing for association, if an individual has at least one missing phenotype or missing covariate that is needed for the test then their genotype will be called as NULL in the genotype counts. Samples where the sum of the genotype probabilities is less than 0.1 will also be counted as NULL at each SNP.
The -exp_counts flag has been removed.
There is a new option -method that is used to specify the method used to fit the chosen model. The new options give better results at SNPs that are rare and/or have high genotype uncertainty.
The Bayesian tests now account for genotype uncertainty and can allow covariates in the tests.
Bayesian Binary Trait tests now have an option to use a t-distribution prior on the genetic effect parameters. This allows more flexibility in specifying the prior beliefs about the genetic effect sizes. See options -t_prior and -t_df in the section on Bayesian Tests.
There are now Bayesian tests for quantitative traits.
There is now an option -mean_bf that calculates the weighted mean of the Bayes factors across the range of models specified. This 'model averaging' feature allows a range of models to be tested at the same time. See the section on Bayesian Tests.
There is now a Bayesian test for multiple quantitative phenotypes.
1.1.5	28-05-2008	This release can be found here

References

[1] J. Marchini, C. Spencer, Y.Y. Teo and P. Donnelly (2007) A Bayesian Hierarchical Mixture Model for Genotype Calling in a multi-cohort study. (in preparation)
[2] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39 : 906-913
[3] The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447 : 661-678. PMID: 17554300. DOI: 10.1038/nature05911