GENECLUSTER is a novel program for
detecting association in genome-wide case-control studies based on a
set of known haplotypes (like the HapMap Phase II haplotypes). The
program is designed to work seamlessly with the output of the genotype
calling program CHIAMO
and the simulation program HAPGEN,
and the input of the association analysis program SNPTEST.
|
./GC2 -h ./example/panel.txt -l ./example/legend.txt -g ./example/controls1.gen -m ./example/genetic_map.txt -int 38640000 39110000 -tree_file ./example/tree_file.txt -strand ./example/strand.txt -o ./example/controls1.gc2.txt.gz |
./GC2 -h ./example/panel.txt -l ./example/legend.txt -g ./example/controls2.gen -m ./example/genetic_map.txt -int 38640000 39110000 -tree_file ./example/tree_file.txt -strand ./example/strand.txt -o ./example/controls2.gc2.txt.gz |
./GC2 -h ./example/panel.txt -l ./example/legend.txt -g ./example/cases.gen -m ./example/genetic_map.txt -int 38640000 39110000 -tree_file ./example/tree_file.txt -strand ./example/strand.txt -o ./example/cases.gc2.txt.gz |
./GC3 -int 38640000 39110000 -tree_file example/tree_file.txt -o example/ex.gs -mutation_models 1.0 1.0 0.0 -controls ./example/controls1.gc2.txt.gz ./example/controls2.gc2.txt.gz -cases ./example/cases.gc2.txt.gz |
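The four commands above follow a fixed pattern: one GC2 run per genotype file, then a single GC3 run over all GC2 outputs. The sketch below only assembles the argument lists and executes nothing. The helper names `gc2_command` and `gc3_command` are hypothetical, the file layout is assumed to match the example directory, and the optional prior-weight flag from the example GC3 command is left out.

```python
def gc2_command(gen_file, out_file, lower, upper, example_dir="./example"):
    """Return the GC2 argument list for one genotype file (nothing is run)."""
    d = example_dir
    return [
        "./GC2",
        "-h", f"{d}/panel.txt",
        "-l", f"{d}/legend.txt",
        "-g", gen_file,
        "-m", f"{d}/genetic_map.txt",
        "-int", str(lower), str(upper),
        "-tree_file", f"{d}/tree_file.txt",
        "-strand", f"{d}/strand.txt",
        "-o", out_file,
    ]


def gc3_command(control_outputs, case_outputs, out_file, lower, upper, tree_file):
    """Return the GC3 argument list combining all GC2 outputs (nothing is run)."""
    return [
        "./GC3",
        "-int", str(lower), str(upper),
        "-tree_file", tree_file,
        "-o", out_file,
        "-controls", *control_outputs,
        "-cases", *case_outputs,
    ]


if __name__ == "__main__":
    # Print the command for the case sample from the example above.
    cmd = gc2_command("./example/cases.gen", "./example/cases.gc2.txt.gz",
                      38640000, 39110000)
    print(" ".join(cmd))
```

Each list could be handed to `subprocess.run` once the executables and input files are in place.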
Flags | Required/Optional | Default | Description |
-g <file> |
Required | File containing a set of genotypes for the set of individuals. The file format is described in detail on the FILE FORMAT WEBPAGE and is the same as the output format of our genotype calling program CHIAMO.
NOTE 1: The SNPs MUST appear in base-pair position order (lowest to highest), i.e. the 3rd column of this file must be sorted.
NOTE 2: Base-pair positions of SNPs must use the same genome build as that used in the haplotype file.
NOTE 3: GC2 supports gzipped files, but they must have the extension .gz. |
|
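NOTE 1 can be checked before running GC2. The following is a minimal sketch, assuming a whitespace-delimited genotype file with no header line and the base-pair position in the 3rd column; the function names are hypothetical and not part of GENECLUSTER.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped (.gz extension) text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def positions_sorted(gen_path):
    """Return True if the 3rd column (base-pair position) never decreases."""
    previous = None
    with open_text(gen_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 3:
                continue  # skip blank or truncated lines
            position = int(fields[2])
            if previous is not None and position < previous:
                return False
            previous = position
    return True
```

For example, `positions_sorted("./example/cases.gen")` would report whether the example case file satisfies the ordering requirement.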
-h <file> |
Required |
|
File containing a set of known
haplotypes for the region of interest. The alleles of the haplotypes
should be coded as 0 and 1. The format of this input file is one line
per SNP and one column per haplotype. |
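The layout described above (one line per SNP, one column per haplotype, alleles coded 0 and 1) can be validated with a few lines of code. This is a sketch only; it assumes whitespace-separated columns, and `read_panel` is a hypothetical helper name.

```python
def read_panel(lines):
    """Parse haplotype rows (one per SNP) into lists of 0/1 integers."""
    panel = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if any(field not in ("0", "1") for field in fields):
            raise ValueError(f"line {number}: alleles must be coded 0 or 1")
        if panel and len(fields) != len(panel[0]):
            raise ValueError(f"line {number}: unexpected number of haplotypes")
        panel.append([int(field) for field in fields])
    return panel
```

Passing an open file handle, e.g. `read_panel(open("./example/panel.txt"))`, would give one list per SNP.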
-o <file> |
Required | The GC2 output file, which contains details of the pseudo-genealogies for the study sample based on the reference panel haplotypes. This file should then be passed to GC3 as input. GC2 output files can be quite large, so we recommend naming them with a ".gz" extension so that GC2 gzips them automatically. |
|
-int <lower> <upper> |
Required |
|
Lower and upper boundaries (in base-pair position) of the region in which tests of association are to be carried out. Only positions for which a tree is available will be analysed. |
-l <file> |
Required | |
Legend file for the haplotypes file, which gives the rs ID, position and the alleles that are coded as 0 and 1 in the haplotypes file. The alleles should be taken from A, C, G and T. Note that this file needs a header line (see the example file legend.txt for details). |
-m <file> |
Required | |
Fine-scale recombination map covering the region in which analysis is required. There is one line for each position on the map. The first column contains the base-pair position, the 2nd column contains the recombination rate in cM/Mb to the next point on the map, and the 3rd column contains the recombination map position in cM. Note that this file needs a header line (see the example file map.txt for details). |
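The 2nd and 3rd columns of the map are related: under the usual convention, the cM position of a point equals the previous cM position plus the rate (cM/Mb) times the distance (in Mb) to that point. The sketch below rebuilds the 3rd column from the first two under that assumption, which can serve as a consistency check on a map file; `cumulative_cm` is a hypothetical helper name.

```python
def cumulative_cm(positions, rates_cm_per_mb, start_cm=0.0):
    """Rebuild cumulative map positions (cM) from positions and rates.

    rates_cm_per_mb[i] is the rate between positions[i] and positions[i + 1];
    the rate on the last line is not used.
    """
    cm = [start_cm]
    for i in range(len(positions) - 1):
        distance_mb = (positions[i + 1] - positions[i]) / 1e6
        cm.append(cm[-1] + rates_cm_per_mb[i] * distance_mb)
    return cm
```

For instance, positions 1000000, 2000000 and 4000000 with rates 1.0, 0.5 and 0.0 give map positions 0.0, 1.0 and 2.0 cM.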
-tree_file <file> |
Required | |
A file containing a set of trees for the reference panel haplotypes at locations where an association test is to be performed. Trees for the HapMap haplotypes are available for download at the bottom of this page. All gzipped files must have a ".gz" extension. |
-Ne <int> |
Optional | 11418 | Sets the effective population size that scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we highly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese and Japanese populations. |
-buffer <int> |
Optional |
500 |
To avoid edge effects, the program includes genotypes on either side of the interval specified by the -int flag. This option specifies the length of the buffer region (in kb) at each end of the interval. |
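As a rough illustration of the region from which genotypes are drawn, the buffer can be added to both ends of the -int interval. This is a sketch, not GC2's code; clamping the lower bound at 0 is an assumption, and `buffered_interval` is a hypothetical name.

```python
def buffered_interval(lower, upper, buffer_kb=500):
    """Extend [lower, upper] by buffer_kb kilobases at each end."""
    pad = buffer_kb * 1000  # kb to base pairs
    return max(0, lower - pad), upper + pad
```

With the example interval 38640000 to 39110000 and the default 500 kb buffer, the extended region runs from 38140000 to 39610000.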
-call_thresh <double> |
Optional |
0.9 |
Threshold for calling genotypes in the genotype input file. The genotype with the maximum probability will be used if that probability is above the threshold. Otherwise the genotype will be treated as missing. |
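The thresholding rule can be sketched as follows. The three probabilities are assumed to be ordered as in the genotype file (AA, AB, BB), "above" is read as strictly greater, and `call_genotype` is a hypothetical helper, not a GC2 function.

```python
def call_genotype(probabilities, threshold=0.9):
    """Return the index (0, 1 or 2) of the called genotype, or None if missing."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    # Strict comparison is an assumption about how "above" is interpreted.
    if probabilities[best] > threshold:
        return best
    return None
```

So `call_genotype((0.95, 0.04, 0.01))` calls the first genotype, while `call_genotype((0.5, 0.45, 0.05))` is treated as missing.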
-exclude_snps <file> | Optional | Exclude a set of genotyped SNPs
(i.e. SNPs that occur in the file specified by the -g option) with ID
equal to those listed in the file. The IDs can be either the rs ID or
the alternate ID given in the first column of the genotype file. These
SNPs will not be used for analysis. |
|
-exclude_samples <file> | Optional | Exclude a list of individuals from the analysis. The IDs in the file should be the IDs that appear in the first or second column of the sample file. A sample file must be provided using the -sample flag. |
|
-sample |
Optional |
Sample file (see FILE FORMAT WEBPAGE for more details) containing the unique identifiers for each individual in the genotype file. It is only used when the -exclude_samples flag is used. | |
-strand
<file> |
Optional | File listing the strand orientation of the SNPs in the genotype file relative to the orientation of the alleles in the haplotypes file. This file is required if the orientation of alleles at SNPs in the haplotype and genotype files does not match. The file should contain a line for each SNP in the genotype file with two entries: (i) the base-pair position of the SNP, and (ii) the strand (+ or -) of the alleles in the genotype file. SNPs do not have to be in the same order as in the genotypes file, and the file can include SNPs that are not in the genotypes file, e.g. if the genotypes file has had some SNPs filtered out. Take a look at the example files for an illustration of the required format.
NOTE: It is critical that the alleles used to code genotypes in the haplotype file and the genotype file match. If they do not, the quality of imputation may decrease substantially. Great care should be taken in constructing a strand file for your data. |
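To make the role of the strand file concrete, the sketch below reads position/strand pairs and complements the alleles of SNPs reported on the - strand. It assumes whitespace-separated entries, treats SNPs missing from the file as + (an assumption), and uses the standard base complement; `read_strand` and `align_alleles` are hypothetical names, and GC2's internal handling may differ.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def read_strand(lines):
    """Map base-pair position -> '+' or '-' from strand-file lines."""
    strands = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        position, strand = int(fields[0]), fields[1]
        if strand not in ("+", "-"):
            raise ValueError(f"unknown strand {strand!r} at position {position}")
        strands[position] = strand
    return strands


def align_alleles(position, allele_a, allele_b, strands):
    """Return the genotype-file alleles expressed on the haplotype-file strand."""
    # SNPs absent from the strand file are left unchanged (assumption).
    if strands.get(position, "+") == "-":
        return COMPLEMENT[allele_a], COMPLEMENT[allele_b]
    return allele_a, allele_b
```

A SNP at position 38650000 listed as "-" with alleles A and G would be compared against the haplotype file as T and C.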
Flags | Required/Optional | Default | Description |
-controls <file1> <file2> ... |
Required |
|
The set of GC2 output files produced for each control sample. You must run GC2 on all case and control genotype files (with the same tree, haplotype reference, legend and recombination rates files) before running GC3. |
-cases <file1> <file2>... | Required | |
The set of GC2 output files produced for each case sample. You must run GC2 on all case and control genotype files (with the same tree, haplotype reference, legend and recombination rates files) before running GC3. |
-int <lower> <upper> |
Required |
|
Lower and upper boundaries (in base-pair position) of the region in which tests of association are to be carried out. Only positions for which a tree is available will be analysed. |
-modelA <prior1> <prior2> <prior3> |
Optional |
0.5 0.5 0 | The prior weights on the 1-mutation, 2-mutation and 3-mutation models. Only models with prior weight greater than 0 are implemented, so by default the 3-mutation model, which is computationally intensive, will not be implemented. The prior weights determine the Bayes factor for association under all of the implemented mutation models in the output file. See below for more details.
NOTE 1: It is not important to specify particular values of the prior, only that the models you intend to run have prior weight greater than 0.
NOTE 2: The prior weights do not need to add to 1; GC3 will normalise them to a prior probability distribution. |
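The normalisation in NOTE 2 and its effect on a combined value can be sketched as below. In the two example output rows shown further down this page, the mean_mut column matches the prior-weighted mean of mut1, mut2 and mut3 under the weights 1.0 1.0 0.0 used in the example GC3 command; that this is how GC3 computes the column in general is an assumption. The function names are hypothetical.

```python
def normalise_weights(weights):
    """Scale non-negative prior weights so that they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one prior weight must be greater than 0")
    return [w / total for w in weights]


def weighted_mean(values, weights):
    """Prior-weighted mean of per-model values after normalising the weights."""
    return sum(v * w for v, w in zip(values, normalise_weights(weights)))
```

For example, `weighted_mean([0.6673980344, 0.4651869063, 0.0], [1.0, 1.0, 0.0])` reproduces the first example row's mean_mut value to the printed precision.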
-ntree <n> |
Optional |
1 | The maximum number of trees to use for analysis (if a tree file contains more than n trees). |
-o <file> |
Required | |
The file to which the results should be written. There will also be a file with .aux appended to the filename, which contains a copy of the command-line summary output. |
-tree_file <file> |
Required | The same tree file that was provided to GC2. | |
-rprior <a> <b> |
Optional | See right column |
The parameters of the beta(a,b) penetrance prior. If not specified, they will be set to a = p*50 and b = (1-p)*50, where p is the proportion of controls in your total sample. |
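The default described above is simple arithmetic; a minimal sketch follows, with `default_rprior` as a hypothetical helper name and the scale of 50 taken from the description.

```python
def default_rprior(n_controls, n_cases, scale=50):
    """Default beta(a, b) parameters: a = p * scale, b = (1 - p) * scale."""
    p = n_controls / (n_controls + n_cases)  # proportion of controls
    return p * scale, (1 - p) * scale
```

With 30 controls and 20 cases, p = 0.6, so a is approximately 30 and b is approximately 20.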
location ntree ncontrols ncases alpha beta mut1 mut2 mut3 mean_mut |
38640000 1 30 20 2 3 0.6673980344 0.4651869063 0 0.5662924704 |
38645000 1 30 20 2 3 0.733595907 0.5241572532 0 0.6288765801 |
... |
source("plot.signal.r") |
make.plot(gc.file = "./example/ex.gs",
          control.gc2.files = c("./example/controls1.gc2.txt.gz", "./example/controls2.gc2.txt.gz"),
          case.gc2.files = "./example/cases.gc2.txt.gz",
          legend.file = "./example/legend.txt",
          log.file = "./example/plot.signal.log",
          max.location = 39110000,
          min.location = 38640000,
          panel.file = "./example/panel.txt",
          plot.title = "Example signal plot",
          rates.file = "./example/genetic_map.txt",
          rs.ids = c("rs2708479", "rs1388587"),
          tree.file = "./example/tree_file.txt") |
Flags | Required/Optional | Default | Description |
gc.file |
Required | The GC3 results output file.
Requires the same input value that was provided to the -o
flag for GC3. |
|
control.gc2.files |
Required | A vector of control GC2 output
files. Requires the same input values that were provided to the -controls
flag for GC3. |
|
case.gc2.files |
Required | A vector of case GC2 output
files. Requires the same input values that were provided to the -cases
flag for GC3. |
|
panel.file |
Required | The reference haplotype file that was provided to the -h flag for GC2. |
|
legend.file |
Required | The legend file, for the
reference haplotypes, that was provided to the -l
flag for GC2. |
|
rates.file |
Required | The recombination rates file that was provided to the -m flag for GC2. |
|
tree.file |
Required | The tree file, for the reference
haplotypes, that was provided to the -tree_file
flag for GC3. |
|
min.location |
Required | The physical location of the
left-hand boundary of the region of interest to be plotted. |
|
max.location |
Required | The physical location of the
right-hand boundary of the region to be plotted. |
|
focal.posn |
Optional | The focal position of the signal
plot. The right hand side of the signal plot will display the tree and
details of the best fitting mutations at the focal position. If none is
specified then the focal position will be the position with the largest
2-mutation BF. |
|
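The default focal position can be found from a GC3 results file directly. The sketch below assumes the whitespace-delimited layout with the header shown in the example output (location, ..., mut2, ...) and that the mut2 column holds the 2-mutation BF; `parse_results` and `default_focal_position` are hypothetical names and not part of plot.signal.r.

```python
def parse_results(text):
    """Parse GC3 results text into a list of dicts keyed by the header names."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    # Keep only complete rows; anything else (e.g. "...") is ignored.
    return [dict(zip(header, row)) for row in body if len(row) == len(header)]


def default_focal_position(records):
    """Return the location with the largest value in the mut2 column."""
    best = max(records, key=lambda record: float(record["mut2"]))
    return int(best["location"])
```

Applied to the two example rows above, the default focal position would be 38645000, since its mut2 value (0.5241572532) is the larger of the two.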
log.file |
Optional | A file with the screen output
summary |
|
plot.title |
Optional | Title of the plot. |
|
rs.ids |
Optional | A vector of rs IDs that defines a set of SNPs in the reference panel haplotypes. These SNPs define a set of haplotype backgrounds in the haplotype reference panel, and each panel haplotype will be coloured according to its background in the left plot. |
Platform |
File |
Linux
(x86_64) Static Executable |
genecluster_r84_x86_64.tgz |
Mac
OS X Intel |
genecluster_r84_macosx_intel.tgz |
tar zxvf GC_vX.X.X_i386.tgz |
Polymorphic files - these files contain SNPs polymorphic in each
panel respectively i.e. the CEU haplotypes only contain data at SNPs that are polymorphic in the CEU panel. The files contain the haplotypes and associated legend files. |
[CEU] [YRI] |
Recombination
rate files (nb. these are the same as the rel#22 rates) |
[CEU] [YRI] [COMBINED] |
Strand files | Affy500k [These were constructed using these Affymetrix annotation files - Nsp Sty]
Affy6.0 [These files were created using this Affymetrix annotation file - LINK] |
Tree files - these files contain the data for marginal trees, constructed under a coalescent model, at locations 1kb apart genome-wide. There is one file for each autosomal chromosome and you must extract the file for the appropriate chromosome before analysis. The program, TREESIM, that constructed these trees can be downloaded from Niall Cardin's webpage. | [CEU] [YRI] |