HAPGEN version 2

Updated 16/10/2011 by Zhan Su

HAPGEN2 is an updated version of the program HAPGEN, which simulates case-control datasets at SNP markers. The new version can simulate multiple disease SNPs on a single chromosome, on the assumption that each disease SNP acts independently and is in Hardy-Weinberg equilibrium. We also supply an R package that can simulate interaction between the disease SNPs. We hope to add further facilities to simulate quantitative traits and admixture soon.

The underlying simulation approach is identical to HAPGEN, so it can handle markers in linkage disequilibrium (LD) and simulate datasets over large regions such as whole chromosomes. It simulates haplotypes by conditioning on a reference set of population haplotypes and an estimate of the fine-scale recombination rate across the region, so that the simulated data have the same LD patterns as the reference data. The availability of HapMap3 data allows HAPGEN2 to simulate datasets from a number of populations. See below for details on downloading and using these data.

The disease model is specified through a set of disease-causing SNPs together with their relative risks. The program is designed to work with publicly available files that contain the haplotypes estimated as part of the HapMap or 1000 Genomes projects and the estimated fine-scale recombination map derived from those data. HAPGEN2 is computationally tractable: on a modern desktop it can simulate data for several thousand cases and controls on a whole chromosome at HapMap marker density within minutes.

HAPGEN2 outputs data in the file format used by IMPUTE, IMPUTE2, SNPTEST and GTOOL.

Contributors

The following people have contributed to the development of the methodology and software for HAPGEN2:

Zhan Su, Jonathan Marchini, Peter Donnelly

What's new

The new version of HAPGEN2 can: NOTE :

Download

HAPGEN2 is free to use for academic purposes only. Please see the LICENCE for the package.
Pre-compiled versions of the program and example files can be downloaded from the links below. At the moment binaries are available for only a limited number of platforms. If you would like to run HAPGEN2 on a different platform, please contact me.

Mac OS X Intel hapgen_v2.1.2
Linux(x86_64) Static Executable hapgen_v2.1.2
Mac OS X Intel hapgen_v2.2.0
Linux(x86_64) Static Executable hapgen_v2.2.0

In addition to the basic HAPGEN2 binary, the following R packages are available for simulation under more complex disease models.

SimulatePhenotypes (SimulatePhenotypes_1.0.tar.gz)
Simulates phenotypes for a set of genotype data simulated by HAPGEN2. Currently only simulates discrete phenotypes with interaction between multiple disease SNPs. See below for further details on how to install and use this package.

To install an R package, uncompress it first and then install it, for example:

tar -xzvf SimulatePhenotypes_1.0.tar.gz
R CMD INSTALL SimulatePhenotypes

To load the package, type "library(SimulatePhenotypes)" in R before running any of the functions.

Please fill out the registration form to receive emails about updates to this software.

Version History

2.0.1 11-08-2010 First version made available.

Changes from HAPGEN_v.1.3.0:
  • Ability to simulate multiple disease SNPs
  • Gzip support
  • More verbose command line output and supporting summary file
  • Input file formats fully compatible with IMPUTE
2.0.2 11-09-2010 Changes from HAPGEN_v.2.0.1:
  • Output data files are separated into controls and cases data files.
  • Added new flags -no_gens_output and -no_haps_output
2.0.3 24-09-2010 Changes from HAPGEN_v.2.0.2:
  • Fixed minor, but annoying, bug that outputs a hapgen2.summary file every time the program is run.
2.1.0 23-02-2011 Changes from HAPGEN_v.2.0.x:
  • Control haplotypes are simulated from the Li and Stephens model (they are no longer simulated conditional on the allele frequency at the disease SNPs).
  • Can simulate novel disease haplotypes (i.e. combinations of alleles at the disease SNPs on the same chromosome) that, due to recombination, are not observed in the reference panel.
  • Summary of p-values at each disease SNP under a single-SNP model and for all the SNPs together under a multiple-SNP model.
2.1.1 10-03-2011 Changes from HAPGEN_v.2.1.0:
  • Fixed bug in command line summary that shows -no_haps_output and -no_gens_output are not recognised when they are used.
2.1.2 24-03-2011 Changes from HAPGEN_v.2.1.1:
  • .legend file is written with .haps files
2.2.0 01-04-2011 Changes from HAPGEN_v.2.1.2:
  • Faster performance and reduced RAM usage.
  • New -output_snp_summary option.

Running HAPGEN2

Quick example

Here is some example output from HAPGEN2.

HAPGEN2 is a command line program. To illustrate its use we have made an example dataset. To unpack the files use a command like

tar -zxvf hapgen2.example.gz

This will create a folder called example, which contains a set of example input files required by HAPGEN2.

If example is placed in the same directory as the HAPGEN2 binary then you can run HAPGEN2 with

./hapgen2 -m ./example/ex.map -l ./example/ex.leg -h ./example/ex.haps -o ./example/ex.out -dl 1085679 1 1.5 2.25 2190692 0 2 4 -n 100 100 -t ./example/ex.tags

This will simulate data for 100 control and 100 case individuals at the SNPs specified in the file example/ex.leg, with similar patterns of LD to the haplotypes in example/ex.haps. Two disease SNPs are simulated, at positions 1085679 and 2190692, with heterozygote risks 1.5 and 2, homozygote risks 2.25 and 4, and risk alleles set to 1 and 0 respectively. The results of the simulation are written to ./example/ex.out.haps, ./example/ex.out.sample, ./example/ex.out.gen, ./example/ex.out.tags and ./example/ex.out.summary (from version 2.0.2 onwards the data files are written separately for controls and cases; see the -o option below). See below for a description of the options, input file formats and output file formats.
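When the number of disease SNPs varies between runs, it can be convenient to assemble the command line programmatically. The following is a minimal Python sketch, not part of HAPGEN2: the names DiseaseSnp and build_hapgen2_command are hypothetical helpers introduced here for illustration, and only flags described on this page are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiseaseSnp:
    """One disease SNP as passed to -dl (hypothetical helper type)."""
    position: int      # physical location; must appear in the legend file
    risk_allele: int   # 0 or 1, as coded in the haplotype file
    het_risk: float    # heterozygote relative risk
    hom_risk: float    # homozygote relative risk


def build_hapgen2_command(binary, map_file, legend_file, haps_file, out_prefix,
                          n_controls, n_cases, disease_snps=(), tags_file=None):
    """Return the HAPGEN2 command as a list of arguments.

    Leaving disease_snps empty omits -dl altogether, which this page
    documents (for version 2.0.2 and above) as simulating under the null.
    """
    cmd = [binary, "-m", map_file, "-l", legend_file,
           "-h", haps_file, "-o", out_prefix]
    if disease_snps:
        cmd.append("-dl")
        for snp in disease_snps:
            if snp.risk_allele not in (0, 1):
                raise ValueError(f"risk allele must be 0 or 1, got {snp.risk_allele}")
            # Four numbers per SNP: location, risk allele, het risk, hom risk.
            cmd += [str(snp.position), str(snp.risk_allele),
                    f"{snp.het_risk:g}", f"{snp.hom_risk:g}"]
    # -n takes the number of controls first, then the number of cases.
    cmd += ["-n", str(n_controls), str(n_cases)]
    if tags_file is not None:
        cmd += ["-t", tags_file]
    return cmd


if __name__ == "__main__":
    example = build_hapgen2_command(
        "./hapgen2", "./example/ex.map", "./example/ex.leg",
        "./example/ex.haps", "./example/ex.out", 100, 100,
        [DiseaseSnp(1085679, 1, 1.5, 2.25), DiseaseSnp(2190692, 0, 2, 4)],
        "./example/ex.tags")
    print(" ".join(example))
```

The returned list can be handed to subprocess.run(cmd, check=True) to launch the program; building a list rather than a single string avoids shell quoting problems with file names.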

We recommend using the HapMap or 1000 Genomes data as input for HAPGEN2. Please see below for instructions on downloading and using them.

NOTE : HAPGEN2 sets the random seed of its random number generator using the time of day to the nearest second. You should be aware of this when running multiple simulations using HAPGEN2 as runs that are started very close in time will produce identical results.
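One simple way to avoid identical seeds is to leave more than one second between launches. The sketch below illustrates this in Python; run_staggered is a hypothetical helper, not part of HAPGEN2, and it assumes each command already uses its own output prefix so that runs do not overwrite each other. The launch and sleep parameters exist only so the function can be tried without the binary.

```python
import subprocess
import time


def run_staggered(commands, delay_seconds=2.0,
                  launch=subprocess.Popen, sleep=time.sleep):
    """Start each command in turn, pausing between launches.

    Because the seed is derived from the time of day in seconds, a pause
    of more than one second keeps any two runs from sharing a seed.
    Returns whatever `launch` returns for each command (Popen handles
    by default), in launch order.
    """
    if delay_seconds <= 1.0:
        raise ValueError("delay_seconds must be greater than 1")
    handles = []
    for index, cmd in enumerate(commands):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first run
        handles.append(launch(cmd))
    return handles
```

With the defaults the runs execute concurrently, so the caller would normally wait for them afterwards, for example with `for p in run_staggered(cmds): p.wait()`.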


-h <file>
File of known haplotypes, with one row per SNP and one column per haplotype. Every haplotype file needs a corresponding legend file (see below), and all alleles must be coded as 0 or 1 -- no other values are allowed. See the following section for links to the relevant HapMap and 1000 Genomes files.
-l <file>
Required. A legend file for the SNP markers. This file should have 4 columns with one line for each SNP. The columns should contain an ID for each SNP (e.g. the rs ID of the marker), the base pair position of the SNP, the base represented by 0 and the base represented by 1. The first line of the legend file must be a header line of column labels (the labels themselves are not used by the program). See the example file ex.leg. See the following section for links to the relevant HapMap and 1000 Genomes files.
-m <file>
A file containing the fine-scale recombination rate across the region. This file should have 3 columns with one line for each SNP. The columns should contain physical location, rate in cM/Mb to the right of the marker and the cumulative rate in cM to the left of the marker. A header line containing the column labels is required. See the example file ex.map. See the following section for links to the relevant HapMap and 1000 Genomes files.
-dl <int> <a> <rr1> <rr2> ...
Required. Sets the location, risk allele and relative risks for each disease SNP. For each disease SNP, four numbers are required in the following order:
  1. physical location of SNP, which must be in the legend file supplied to the -l flag
  2. risk allele (0 or 1), the corresponding base can be found in the legend file
  3. heterozygote disease risk
  4. homozygote disease risk
For example, -dl 1085679 1 1.5 2.25 2190692 0 2 4 specifies two disease SNPs, at positions 1085679 and 2190692, with heterozygote risks 1.5 and 2, homozygote risks 2.25 and 4, and risk alleles set to 1 and 0 respectively. There is no limit on the number of disease SNPs. We simulate under a disease model where the disease SNPs are independent, and the haplotypes defined by the disease SNPs are in HWE.
This flag is optional for version 2.0.2 and above: if it is not supplied, all haplotypes are simulated under the null.
-n <int> <int>
Recommended. Sets the number of control and the number of case individuals to simulate. For example -n 100 200 simulates 100 control and 200 case individuals. The default is to generate 1 control and 1 case individual.
-int <int> <int>
Specify the lower and upper boundaries of the region in which you wish to carry out simulation. The default is set to 0 and 500000000.
-o <file>
Required. Output file prefix. For example -o ex.out[.gz] creates the following files for the case data:
  • ex.out.cases.haps[.gz] - A file containing the simulated haplotype data in the same format as the haplotype file supplied to the -h flag.
  • ex.out.legend (from version 2.1.2 onwards) - A legend file with information about the SNPs in the .haps files.
  • ex.out.cases.gen[.gz] - A file containing the simulated genotype data in the file format compatible with SNPTEST, SNPTEST2, IMPUTE, IMPUTE2 and GTOOL.
  • ex.out.cases.sample - A sample file in the file format compatible with SNPTEST2 for the simulated genotype data.
  • ex.out.cases.tags.gen[.gz] - The genotype data limited to the subset of SNPs specified by the file supplied to the -t flag (if applicable).
A similar set of files will be produced for the control data, with the same file names except that cases are replaced by controls.
A summary file, ex.out.[.gz]summary, will also be produced, which summarises the simulation parameters, input files and output files.

  • If the output file prefix has a .gz extension then the *.haps.gz, *.gen.gz and *.tags.gen.gz files will be gzipped.
  • It is possible to suppress some of the output files using the flags -no_gens_output and -no_haps_output (see below).
-output_snp_summary
Optional. Output the p-values and effect size estimates (under a log-additive model test) for each disease SNP and under a joint model for all of the disease SNPs in the simulated genotype data. Note that in version 2.1.x this summary was always produced (with no option to switch it off), but the step is very time consuming and has therefore been made optional from version 2.2.0 onwards.
-no_haps_output
Optional. No haplotype data files, *.haps[.gz], will be output for the case and control data.
-no_gens_output
Optional. No genotype data files, *.gen[.gz], will be output for the case and control data. However, if you have provided an input to the -t flag then the *.tags.gen[.gz] files will still be output.
-t <file>
Optional. SNP subset file. This option allows the user to output data at only a subset of the SNP markers in the simulated dataset, e.g. at a set of tag SNPs. The file should contain the physical locations of the markers to be output, one line per SNP. The physical locations must match those in the legend file. If this option is selected then a .tags.gen output file will be produced that contains the genotype data at the SNPs listed in this file.
-Ne <int>
Optional. Sets the effective population size that scales the fine-scale recombination map for the given population. For example, -Ne 11000 sets the effective population size to 11000. For autosomal chromosomes, we highly recommend the values 11418 for CEPH, 17469 for Yoruban and 14269 for Chinese and Japanese populations.
-theta <real>
Optional. Sets the mutation rate in the model. For example, -theta 10 sets the scaled mutation rate to 10. By default, the mutation rate is set so that the expected number of mutations at a given SNP is equal to 1.
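Several of the options above impose consistency requirements between the input files: the legend needs a header and 4 columns, the haplotype file needs one row per SNP coded as 0 or 1, and every -dl position must appear in the legend. As an illustration only, these could be checked before a long run with a Python sketch along the following lines. The function names are hypothetical, and the sketch assumes whitespace-separated columns; HAPGEN2 performs its own validation, which may differ.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "r")


def read_legend_positions(legend_path):
    """Return the base pair positions from a 4-column legend file."""
    positions = []
    with open_text(legend_path) as handle:
        next(handle, None)  # skip the required header line
        for line_no, line in enumerate(handle, start=2):
            fields = line.split()
            if not fields:
                continue
            if len(fields) != 4:
                raise ValueError(f"legend line {line_no}: expected 4 columns")
            positions.append(int(fields[1]))  # column 2 is the position
    return positions


def check_haplotypes(haps_path, n_snps):
    """Check one row per SNP and 0/1 coding; return the number of haplotypes."""
    n_haps, n_rows = None, 0
    with open_text(haps_path) as handle:
        for line_no, line in enumerate(handle, start=1):
            alleles = line.split()
            if not alleles:
                continue
            if any(a not in ("0", "1") for a in alleles):
                raise ValueError(f"haps line {line_no}: alleles must be 0 or 1")
            if n_haps is None:
                n_haps = len(alleles)
            elif len(alleles) != n_haps:
                raise ValueError(f"haps line {line_no}: inconsistent column count")
            n_rows += 1
    if n_rows != n_snps:
        raise ValueError(f"{n_rows} haplotype rows but {n_snps} legend SNPs")
    return n_haps


def check_disease_positions(disease_positions, legend_positions):
    """Raise if any -dl position is absent from the legend file."""
    missing = sorted(set(disease_positions) - set(legend_positions))
    if missing:
        raise ValueError(f"positions not in legend: {missing}")
```

For the example data, one would call read_legend_positions("example/ex.leg"), pass the length of the result to check_haplotypes("example/ex.haps", ...), and then check the positions 1085679 and 2190692 with check_disease_positions.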

Simulating interaction

The basic HAPGEN2 executable can only simulate multiple independent disease SNPs. However, the function simulateDiscretePhenotypes in the R package SimulatePhenotypes can simulate phenotype data for a set of genotype data under a multiple-SNP interaction disease model. Therefore, one can first run HAPGEN2 under the null (by setting the effect sizes to 1.0 for all SNPs passed to the -dl flag or, if running version 2.0.2 or above, by simply omitting the -dl flag), load the simulated genotype data into R and pass it into the function to simulate the phenotype data. Since the simulation process is stochastic, the number of individuals simulated with case and control phenotypes cannot be controlled. See the help documentation in R for more details on running simulateDiscretePhenotypes.
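Whatever tool is used downstream, the simulated genotype file first has to be read into a numeric form. The Python sketch below illustrates one way to do this; it is not how the R package does it. It assumes the .gen layout used by IMPUTE and SNPTEST: five leading columns (SNP ID, rs ID, position, allele A, allele B) followed by three genotype probabilities per individual. That layout is not described on this page, so please check it against your own output files. The function name parse_gen_line is hypothetical.

```python
def parse_gen_line(line):
    """Convert one .gen line into (position, allele_a, allele_b, dosages).

    Each individual contributes three probabilities, P(AA), P(AB), P(BB);
    the dosage returned here is the expected count of allele B.
    """
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 leading columns")
    _snp_id, _rs_id, position, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("genotype probabilities must come in triplets")
    dosages = []
    for i in range(0, len(probs), 3):
        _p_aa, p_ab, p_bb = probs[i:i + 3]
        dosages.append(p_ab + 2.0 * p_bb)
    return int(position), allele_a, allele_b, dosages
```

Applied line by line to a controls or cases .gen file, this gives one list of dosages per SNP, with one entry per simulated individual.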

In addition, SimulatePhenotypes has functions that allow easy simulation of the two-SNP interaction disease models specified in Marchini et al. [5]. See the help documentation of those functions for more details.

Using HAPGEN2 with the HapMap2, HapMap3 and 1000 Genomes Project Data

A main use of HAPGEN2 will be to simulate genotypes based on the haplotypes from HapMap2, HapMap3 and the 1000 Genomes Project data. In particular, the HapMap3 data allows HAPGEN2 to simulate data for a number of populations and the 1000 Genomes data allows the simulation of high density SNP data. To facilitate this, we have designed HAPGEN2 to use the same input format (haplotype and legend files) as required by IMPUTE, so that it can use the haplotype data that is available from the IMPUTE webpage.

References

[1] Zhan Su, Jonathan Marchini and Peter Donnelly (2011) HAPGEN2: simulation of multiple disease SNPs. Bioinformatics.
[2] Chris C. A. Spencer, Zhan Su, Peter Donnelly and Jonathan Marchini (2009) Designing Genome-Wide Association Studies: Sample Size, Power, Imputation, and the Choice of Genotyping Chip. PLoS Genet 5(5).
[3] J. Marchini, B. Howie, S. Myers, G. McVean and P. Donnelly (2007) A new multipoint method for genome-wide association studies via imputation of genotypes. Nature Genetics 39: 906-913.
[4] The Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447: 661-678. PMID: 17554300 DOI:10.1038/nature05911
[5] J. Marchini, P. Donnelly and L. R. Cardon (2005) Genome-wide strategies for detecting multiple loci influencing complex diseases. Nature Genetics 37: 413-417.

Contact Information

If you have a question, please send an email to our mailing list. You will need to subscribe to the mailing list to do this.

IMPORTANT : If you are having a problem with one of the programs, please include the following details when you email.
(a) the version number of the program and the type of computer you are running the program on e.g. SNPTEST v2.1.0 Mac OSX 10.6
(b) the precise command line(s) you have used
(c) any log file and/or screen output from the program
(d) a copy of your data, if we ask for it. It may sometimes be necessary for us to obtain this, so please be prepared to supply it; otherwise, we may not be able to diagnose the problem.