CHIAMO

CHIAMO is a program for calling genotypes from the Affymetrix 500K Mapping chip. The program allows for multiple cohorts with potentially different intensity characteristics, which can otherwise lead to elevated false-positive rates in genome-wide studies. The underlying model has a hierarchical structure that allows for correlation between the parameters of each cohort. For more details see [1]. The output files produced by CHIAMO feed directly into the programs SNPTEST [2] and IMPUTE [2]. CHIAMO was used to call genotypes for the 7 genome-wide association studies carried out by the Wellcome Trust Case-Control Consortium (WTCCC) [3].




Contributors

The following people have contributed to the development of the methodology and software for CHIAMO.

Chris Spencer, Jonathan Marchini, Peter Donnelly, YY Teo

Download

Pre-compiled versions of the program and example files can be downloaded for the following platforms.

Platform                                      File
Linux (x86_64) Static Executable              chiamo_v0.2.1_x86_64_static.tgz
Linux (x86_64) Static Executable (SuSE 9.3)   chiamo_v0.2.1_SuSE9.3_x86_64_static.tgz
Linux (x86_64) Dynamic Executable             chiamo_v0.2.1_x86_64_dynamic.tgz
Linux (i386) Static Executable                chiamo_v0.2.1_i386_static.tgz
Linux (i386) Dynamic Executable               chiamo_v0.2.1_i386_dynamic.tgz
Mac OS X 10.4.11 (Intel)                      chiamo_v0.2.1_MacOSX_10.4_Intel.tgz
Mac OS X 10.5.1 (Intel)                       chiamo_v0.2.1_MacOSX_10.5_Intel.tgz
Mac OS X (PowerPC)                            chiamo_v0.2.1_MacOSX_PowerPC.tgz
Solaris 5.8 (Sun SPARC)                       chiamo_v0.2.1_Solaris5.8_SPARC.tgz
Solaris 5.10 (AMD Opteron)                    chiamo_v0.2.1_Solaris5.10_Opteron.tgz

Please fill out the registration form to receive emails about updates to this software.

To unpack the files use a command like


tar zxvf chiamo_vX.X.X_i386.tgz

This will create an executable called chiamo and a directory example/ that contains the example files.

Version History


Version  Date        Comments
0.1.0    07-06-2007  First version
0.1.1    20-06-2007  Ability to handle gzip'd and bzip2'd intensity files added
0.2.0    07-09-2007  Significant reductions in the run time of the algorithm through the addition of the -single option and improvements made to the -approx option
0.2.1    22-10-2007  Addition of a LICENCE

Recommended Usage

The CHIAMO algorithm was designed primarily for use with the Affymetrix 500K Mapping chip and the default prior distributions used in the model have been chosen with this type of data in mind. The algorithm could be applied to data from other technologies but we would strongly advise that you contact us to discuss what modifications would need to be made to the algorithm.

The algorithm has two stages. First, the algorithm is run from several semi-random start points and the best solution is found. The program is then run for a second stage where this solution is refined.

The WTCCC data [3] were called with the options -max1 -max2 -nmax 200 -n 0 -b 0, using an allele frequency file derived from CEU HapMap. These frequency files are available from a link below. This set of options implies that we attempt to maximize the posterior from 12 random starts and then choose the best maximum to determine the parameters of the model used to make the genotype calls. We strongly advise you to use a larger number of iterations than we used for the WTCCC data if possible. In addition, substantial speed-ups with little or no reduction in performance can be achieved by using the -single and -approx options.

Grid Approximation
When the number of individuals to be called is large, updates of the genotype calls can be slow as the operation requires iteration over all individuals. However, individuals with the same genotype often have very similar intensities. To exploit this property of the data the -approx flag can be used to apply a discrete approximation to the intensity data, which essentially lumps individuals on top of the nearest cross-hair in the grid. The increase in speed is due to the fact that the update need now only be applied once to each occupied cross-hair. The discrete approximation is always removed before the final iteration of the algorithm, ensuring the final calls are not required to conform to the grid. When applied to the WTCCC data the increase in speed is approximately an order of magnitude with little loss of accuracy.
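The idea of lumping individuals onto grid cross-hairs can be illustrated with a toy sketch. This is not CHIAMO's implementation: the helper name snap_to_grid, the toy intensity values between 0 and 1, and the definition of the grid as a number of divisions per unit on each axis are all assumptions made for illustration.

```shell
# Toy illustration of a grid approximation (not taken from the CHIAMO source).
# Each "x y" intensity pair read from stdin is snapped to the nearest grid
# cross-hair, and the number of individuals on each occupied cross-hair is counted.
snap_to_grid() {
  # $1 = number of grid divisions per unit on each axis (assumed definition)
  awk -v grid="$1" '
    {
      gx = int($1 * grid + 0.5)   # index of nearest cross-hair (values >= 0)
      gy = int($2 * grid + 0.5)
      count[gx " " gy]++
    }
    END { for (k in count) print k, count[k] }   # gx gy individuals
  ' | sort
}

# Five hypothetical individuals; nearby intensities share a cross-hair.
printf '%s\n' '0.11 0.92' '0.12 0.91' '0.49 0.52' '0.51 0.50' '0.90 0.10' |
  snap_to_grid 20
```

In this toy input the five individuals occupy only three cross-hairs, so an update applied once per occupied cross-hair touches three points rather than five. A smaller grid value merges more individuals and is cruder, which matches the trade-off described for the second argument of -approx.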

Single Cohort Approximation
When using CHIAMO to call multiple cohorts, a simple but effective option can be used to increase speed. With the -single flag the algorithm treats separate cohorts as one, negating the need to update the hyper-parameters which couple cohorts. Clearly, treating separate cohorts as one is only advisable if there are only small shifts in the cluster positions across cohorts.

A substantial gain in speed can be achieved by using -single and -approx in conjunction, as the algorithm can then perform updates of the model with more computational efficiency. CHIAMO is run in two stages, during which these options may be applied. We advocate using the first stage to resolve the cluster centres and covariances, and the second to accurately estimate the posterior probability of each genotype call. Therefore, when appropriate, we suggest applying both approximations to the first stage and then dropping the single cohort approximation for the second stage, allowing each cohort to adapt to possible shifts in cluster location, e.g. by using the flags -single 1 -approx 1 20.

Example Datasets

The sub-directory example/ contains some example datasets to test the program. The files consist of data at just 10 SNPs in 9 cohorts. To run the program on this data use

./chiamo -i ./example/cohort1.txt ./example/cohort2.txt ./example/cohort3.txt ./example/cohort4.txt ./example/cohort5.txt ./example/cohort6.txt ./example/cohort7.txt ./example/cohort8.txt ./example/cohort9.txt  -f ./example/freq.txt -max1 -max2 -nmax 200 -n 0 -b 0 -o ./example/output

To run the program with the options that substantially increase the speed of the algorithm use

./chiamo -i ./example/cohort1.txt ./example/cohort2.txt ./example/cohort3.txt ./example/cohort4.txt ./example/cohort5.txt ./example/cohort6.txt ./example/cohort7.txt ./example/cohort8.txt ./example/cohort9.txt  -f ./example/freq.txt -max1 -max2 -nmax 200 -n 0 -b 0 -o ./example/output -single 1 -approx 1 20
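With many cohorts the list of input files can be generated rather than typed out. The following is a minimal sketch, assuming the files follow the cohort<i>.txt naming used in the example directory; cohort_files is a hypothetical helper and not part of CHIAMO. The command is printed rather than executed.

```shell
# Print the paths ./example/cohort1.txt ... ./example/cohortN.txt on one line.
# cohort_files is a hypothetical helper, not part of CHIAMO.
cohort_files() {
  n=$1
  i=1
  files=""
  while [ "$i" -le "$n" ]; do
    files="$files ./example/cohort$i.txt"
    i=$((i + 1))
  done
  echo $files   # unquoted on purpose: drops the leading space
}

# Dry run: print the command; remove the leading "echo" to execute it.
echo ./chiamo -i $(cohort_files 9) -f ./example/freq.txt \
  -max1 -max2 -nmax 200 -n 0 -b 0 -o ./example/output -single 1 -approx 1 20
```

The printed command is the same as the one shown above for the nine example cohorts.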

Options

Flag                          Required/Optional   Default
    Description

-i <file_1> ... <file_n>      Required
    Specifies n input files that contain the normalized intensity data for the n cohorts. The normalized intensity files are created from the raw CEL file data using a program written by Hin-Tak Leung (hin-tak.leung@cimr.cam.ac.uk), which is available from the website http://www.wtccc.org.uk/info/software.shtml

-gz                           Optional
    Specifies that the input intensity files (specified by the -i flag) have been gzipped and have the .gz file extension. The intensity files can be very big, so gzipping the files can save a lot of disk space. There is very little difference in run time if the files have been compressed.

-bz2                          Optional
    Specifies that the input intensity files (specified by the -i flag) have been bzip2-ed and have the .bz2 file extension. The intensity files can be very big, so bzip2-ing the files can save a lot of disk space. There is very little difference in run time if the files have been compressed.

-f <freq_file>                Optional
    File containing allele frequency information for each Affy SNP, e.g. derived from the HapMap data. This information is used as a prior on allele frequency. There should be a line for each SNP in the same order as SNPs in the input file. Each line should have the following 5 entries: RS_ID, position, Affy_A_Allele, Affy_B_Allele, frequency of Affy_A_Allele. The allele frequency files that we have used for the Affy 500K chip are available from this link: Affy500K_Allele_Frequency_Files.tgz

-snps <n> <m>                 Optional
    Run the program from the nth SNP to the mth SNP in the input files. Otherwise the program will run on each SNP sequentially. We recommend that each chromosome be processed in relatively small chunks.

-max1                         Optional
    Attempt to maximize the posterior in stage 1. The default is to use MCMC to obtain a sample from the posterior.

-max2                         Optional
    Attempt to maximize the posterior in stage 2. The default is to use MCMC to obtain a sample from the posterior.

-nmax <int>                   Optional            40
    Number of stage 1 iterations.

-b <int>                      Optional            200
    Number of burn-in iterations in stage 2.

-n <int>                      Optional            1000
    Number of sampling iterations in stage 2.

-o <o_file>                   Optional            ./out
    The program will produce an output file for each cohort with names o_file_1_mcmc, ..., o_file_n_mcmc. The output files will contain a line for each SNP with sets of 3 posterior probabilities, one set per individual, in the same order as individuals appear in the input files. It will also produce a file o_file_params, which contains the final parameter estimates produced by the program, and a file o_file_information, which contains information measures for the calls at each SNP. The format of the output files is designed to be the input file format for the programs SNPTEST and IMPUTE (see the FILE FORMAT WEBPAGE for more details).

-chrX <file>                  Optional
    For chromosome X data the program takes a file containing an indication of the sex of each individual in the same order as the individuals appear in the input files. 1 = male, 2 = female.

-no_null                      Optional
    Do not run with a 4th NULL cluster.

-single <int>                 Optional
    When multiple cohorts are to be called, treat them as a single cohort for stage 1 (use -single 1) or for both stage 1 and stage 2 (use -single 2).

-approx <stages> <grid>       Optional
    The first argument to this flag determines whether to apply the approximation to just stage 1 (set to 1) or to stages 1 and 2 (set to 2). The second argument specifies the density of the grid; the smaller the number, the cruder the approximation and the faster the program runs. Note that when this option is used the algorithm is run for a single update without the approximation, to provide parameter estimates based on the uncondensed data that the approximation uses.
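
The recommendation to process each chromosome in relatively small chunks with -snps can be scripted. The sketch below assumes that -snps takes 1-based, inclusive positions (one reading of "from the nth SNP to the mth SNP", not confirmed against the program). The helper snp_chunks and the per-chunk output prefix are hypothetical, and only a subset of flags is shown. The commands are printed rather than executed.

```shell
# Print "start end" ranges covering SNPs 1..total in chunks of at most `size`.
# snp_chunks is a hypothetical helper; 1-based inclusive ranges are an assumption.
snp_chunks() {
  total=$1
  size=$2
  start=1
  while [ "$start" -le "$total" ]; do
    end=$((start + size - 1))
    if [ "$end" -gt "$total" ]; then
      end=$total
    fi
    echo "$start $end"
    start=$((end + 1))
  done
}

# Dry run for the 10-SNP example data in chunks of 4 SNPs.
# The output prefix output_<start>_<end> is a hypothetical naming scheme.
snp_chunks 10 4 | while read -r start end; do
  echo ./chiamo -i ./example/cohort1.txt ./example/cohort2.txt \
    -max1 -max2 -nmax 200 -n 0 -b 0 \
    -snps "$start" "$end" -o "./example/output_${start}_${end}"
done
```

Writing each chunk to its own output prefix keeps the runs independent, so they can be distributed across machines and the per-chunk output files combined afterwards.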

References

[1] J. Marchini, C. Spencer, Y.Y. Teo and P. Donnelly (2007) A Bayesian Hierarchical Mixture Model for Genotype Calling in a multi-cohort study. (in preparation)
[2] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913
[3] The Wellcome Trust Case Control Consortium (2007) Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678. PMID: 17554300 DOI: 10.1038/nature05911










