SHAPEIT is a fast and accurate
method for estimation
of haplotypes (aka phasing) from genotype or
sequencing data. SHAPEIT has primarily been developed by Dr Olivier Delaneau through a collaborative project between the research groups of Prof Jean-Francois Zagury at CNAM and Prof Jonathan Marchini at Oxford. Funding for this project has been received from several sources : CNAM, Peptinov, MRC, Leverhulme, The Wellcome Trust. SHAPEIT has several notable features: - Linear complexity with the number of SNPs and conditioning haplotypes. - Whole chromosome GWAS scale datasets can be phased in a single run. - Phasing individuals with any level of relatedness - Phasing is multi-threaded to tailor computational times to your resources. - Handles X chromosomes phasing. - Phasing using a reference panel (eg.1,000 Genomes) to aid phasing - Ideal for pre-phasing imputation together with IMPUTE2 SHAPEIT is free for academic use only. For commercial use please see here |
Citations
The methods implemented in the SHAPEIT software are
described in the following papers. Please cite one or more of
these papers if you use our software in your study.
- O. Delaneau, J. Marchini, JF. Zagury (2012) A linear complexity phasing method for thousands of genomes. Nat Methods. 9(2):179-81. doi: 10.1038/nmeth.1785 [LINK]
- O. Delaneau, JF. Zagury, J. Marchini (2013) Improved whole chromosome phasing for disease and population genetic studies. Nat Methods. 10(1):5-6. doi: 10.1038/nmeth.2307 [LINK]
- O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 787-696 [LINK]
- J. O'Connell, D. Gurdasani, O. Delaneau, et al. (2014) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics [LINK]
- O. Delaneau, J.
Marchini, The 1000 Genomes Project Consortium (2014)
Integrating sequence and array data to create an improved
1000 Genomes Project haplotype reference panel. Nature
Communications 5 3934 [LINK]
What's new?
The latest software release is v2 (r900)
Download and Licence
SHAPEIT is freely available for
academic use. To see rules for non-academic use see below. A
LICENCE file is also included with each software download.Pre-compiled SHAPEIT binaries and example files can be downloaded from the links below.
The latest software release is v2 (r900).
We support only the most recent version.
Platform | GLIBC |
Type |
File |
---|---|---|---|
Linux (x86_64) | v2.12 |
Static |
shapeit.v2.r904.glibcv2.12.linux.tar.gz |
Linux (x86_64) |
v2.17 |
Static |
shapeit.v2.r904.glibcv2.17.linux.tar.gz |
To unpack the files on a Linux computer, use a command
like this:
tar -zxvf shapeit.v2.r900.glibcv2.17.linux.tar.gz
This will create a directory of the same name as the
downloaded file, minus the '.tgz' suffix. Inside this directory
you will find an executable called
shapeit, a LICENCE file, and an example/ directory that
contains example data files.
Commercial License Agreement
A specific license must be obtained
for any commercial or for-profit organization or for any
web-diffusion purpose. For more information you should contact
both
Prof. Zagury (zagury@cnam.fr) and Prof. Marchini (marchini@stats.ox.ac.uk).
Academic License Agreement
The bioinformatics department of Conservatoire National des Arts et Metiers (CNAM) has developed a new algorithm for a faster computation of hidden Markov models, based on graph representations. This algorithm has been notably applied for the reconstruction of haplotypes from population genotypic data leading to the SHAPEIT software. This algorithm and its applications, including SHAPEIT, are patent pending. The Conservatoire National des Arts et Metiers (CNAM), Prof. Jean-Francois ZAGURY and his group of the bioinformatics department (the developers), give permission for you and your laboratory (Institution) to use SHAPEIT. CNAM and the developers allow researchers at your Institution to copy and modify SHAPEIT for internal, non-profit research purposes, on the following conditions:
The SHAPEIT software remains at your Institution and is not
published, distributed, or otherwise transferred or made
available to other than Institution employees and students
involved in research under your supervision. If you wish to
obtain SHAPEIT for any commercial purposes or for diffusion
through the internet, you will need to execute a separate
licensing agreement with CNAM and pay a fee. This includes, but
is not limited to, using SHAPEIT to provide services to outside
parties for a fee. In that case please contact :
Pr. Zagury, CNAM.
Tel : 33 1 58 80 88 20
Mail : zagury at cnam.fr
The software is distributed under the following conditions of
use
You retain in SHAPEIT and any modifications to SHAPEIT, the
copyright, trademark, or other notices pertaining to SHAPEIT as
provided by CNAM.
You provide the developers with feedback on the use of SHAPEIT
in your research, and that the Developers and CNAM are permitted
to use any information you provide in making changes to the
SHAPEIT software. All bug reports and technical questions shall
be sent to the mail list
here
https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=OXSTATGEN
You acknowledge that the developers, CNAM and its licensees may
develop modifications to SHAPEIT that may be substantially
similar to your modifications of SHAPEIT, and that the
developers, CNAM and its licensees shall not be constrained in
any way by you in CNAM's or its licensees' use or management of
such modifications. You acknowledge the right of the developers
and CNAM to prepare and publish modifications to SHAPEIT that
may be substantially similar or functionally equivalent to your
modifications and improvements, and if you obtain patent
protection for any modification or improvement to SHAPEIT you
agree not to allege or enjoin infringement of your patent by the
Developers, CNAM or by any of CNAM's licensees obtaining
modifications or improvements to SHAPEIT from the CNAM or the
Developers. If utilisation of the SHAPEIT software results in
outcomes which will be published, please specify the version of
SHAPEIT you used and cite one of the following publications.
- O. Delaneau, J. Marchini, JF. Zagury (2012) A linear complexity phasing method for thousands of genomes. Nat Methods. 9(2):179-81. doi: 10.1038/nmeth.1785
- O. Delaneau, JF. Zagury, J. Marchini (2013) Improved whole chromosome phasing for disease and population genetic studies. Nat Methods. 10(1):5-6. doi: 10.1038/nmeth.2307
- O. Delaneau, B. Howie, A. Cox, J-F. Zagury, J. Marchini (2013) Haplotype estimation using sequence reads. American Journal of Human Genetics 93 (4) 787-696
- J. O'Connell, D.
Gurdasani, O. Delaneau, et al. (2014) A general approach for
haplotype phasing across the full spectrum of relatedness. PLoS Genetics
- O. Delaneau, J.
Marchini, The 1000 Genomes Project Consortium (2014)
Integrating sequence and array data to create an improved
1000 Genomes Project haplotype reference panel. Nature
Communications.
Any risk associated with using the SHAPEIT software at your
institution is with you and your Institution. SHAPEIT is
experimental in nature and is made available as a research
courtesy "AS IS," without obligation by CNAM to provide
accompanying services or support.
CNAM AND THE AUTHORS EXPRESSLY DISCLAIM ANY AND ALL WARRANTIES REGARDING THE SOFTWARE, WHETHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES PERTAINING TO MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Contact
To ask a question about SHAPEIT please subscribe to the OXSTATGEN mailing list.
If you experience any problems with SHAPEIT, before emailing the mail list please:
- Make sure you are using the latest version since your problem may have already been fixed
- Check carefully the screen output and the log file. The problem may be reported here since they both contain many details about what is going on.
If the problem persists, ask your question on the OXSTATGEN mailing list and we will be happy to answer!
Getting started
SHAPEIT is primarily a tool for inferring haplotypes from SNP genotypes. It takes as input a set of genotypes and a genetic map, and produces as output, either a single set of estimated haplotypes, or a haplotype graph that encapsulates the uncertainty about the underlying haplotypes. The example dataset (gwas.ped / gwas.map) provided with the SHAPEIT software package contains 1760 SNPs typed in X trios, Y duos and Z unrelateds. To phase this data using default settings, use this command line:
shapeit --input-bed gwas.bed gwas.bim
gwas.fam \
--input-map
genetic_map.txt \
--output-max
gwas.phased.haps gwas.phased.sample
The meaning of the arguments are:
- --input-bed gwas.bed gwas.bim gwas.fam specifies the filenames and the format of the genotypes that need phasing.
- --input-map genetic_map.txt specifies the filename of the genetic map needed to improve phasing quality.
- --output-max gwas.phased.haps gwas.phased.sample specifies the files where to write the haplotypes estimated by SHAPEIT.
Speeding up phasing using multi-threading
We strongly advice to use the multi-threading capabilities of SHAPEIT. For example, to phase the example dataset by taking advantage of a 8-core computer, use this command line:
shapeit --input-bed gwas.bed gwas.bim
gwas.fam \
--input-map
genetic_map.txt \
--output-max
gwas.phased.haps gwas.phased.sample \
--thread
8
More details can be found here.
Short option names
You can specify most options using short option names and their
arguments using implicit file extensions to reduce command line
length. See the option list for more
details. As a short example, the previous command line can now
be rewritten as:
shapeit -B
gwas \
-M
genetic_map.txt \
-O
gwas.phased \
-T
8
This command line will have extactly the same behaviour as the previous one. Note that in the rest of the documentation, we will use this compact form when possible.
Algorithm parameters
Multi-threading (--thread)
You can change the number of threads that SHAPEIT uses. This option was designed to speed up SHAPEIT when running on multi-core computers. For example, if you have a 4-core computer, then you can run the following command:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--thread
4
If there are 1,000 individuals in the dataset, SHAPEIT using 4 threads will phase 4 individuals simultaneously conditional upon the 1000 - 4 = 996 others individuals. It allows to divide the running time of SHAPEIT by almost 4 in this case, so 4 hours becomes 1! This option is recommended only if you have a large number of individuals in your dataset. Note that using a number of threads bigger than the number of available cores will not further decrease the running times.
Iteration number (--burn, --prune and --main)
You can change the default number of MCMC iterations with the following options:
- --burn X : where X is the number of the burn-in iterations. The default value is 7 which is sufficient for most application. Those iterations are used to find a good starting haplotype estimate.
- --prune Y : where Y is the number of iterations of the pruning stage. The default value is 8. Transition probabilities are stored across pruning iterations and used to prune the genotype graphs in order to get more parsimonious graphs.
- --main Z : where Z is the number of main iterations. The default value is 20. The final haplotype estimate is found by averaging across the main iterations.
Hereafter, how to run SHAPEIT with 10 burn-in iterations, 10 pruning iterations and 50 main iterations:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--burn
10 \
--prune
10 \
--main
50
Seed specification (--seed)
Since SHAPEIT is a stochastic algorithm, the only way to reproduce twice exactly the same results is to use the same seed value for the random number generator. You can specify the seed of the random number generator with the --seed option as follows:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--seed
123456789
By default, SHAPEIT initialises the seed by calling the stdlib
function time(NULL). Note that the seed value can be found in
the log file. Note also that you cannot reproduce two
multi-threaded runs even if you specify the seed.
Model parameters
Conditioning states K (--states)
You can increase accuracy of SHAPEIT by increasing the number of conditioning states on which haplotype estimation is based. By default, SHAPEIT uses 100 states per SNP across the dataset which gives good accuracy while maintaining reasonable running times. The complexity of the algorithm is linear with the number of conditioning states, so feel free to increase this number if your computational resources allow it. For instance, if you set this number to 200, it will take times longer than with 100 states. You can set the number of conditioning states with the --states option as follows:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--states
200
Window size W (--window)
Another important parameter in the SHAPEIT2 model is the window size. For GWAS dataset, we recommend a value of 2Mb, which should be fine for most application. By default, SHAPEIT2 uses windows of ~2Mb on average. However, for sequence data, our experiments suggest that a value of 0.5Mb may give better results. The window size can be changed to 0.5Mb by using the --window option as follows:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--window
0.5
Using the graphical model of SHAPEIT version 1
(--model-version1)
We have retained within the software the model from SHAPEIT v1 which uses a representation of the conditioning haplotypes within a graph. To use this model you can use the --model-version1 flag. In this case, the number of conditioning states is controlled by the --states options as described above. Therefore, to run SHAPEIT1 model on the example dataset, just use this command line:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--model-version1
Genetic map
Recombination rates between SNPs are provided via a genetic map that specifies the genetic position of the SNPs in cM. To specify to SHAPEIT to use such a genetic map for phasing, you have to use the option --input-map as follows:
shapeit -B gwas \
--input-map
chr20.gmap.gz \
-O gwas.phased
We recommend that most of the SNPs of your dataset should have
a genetic position specified into this file. For the SNPs that
don't have a genetic position, SHAPEIT internally determines its
genetic position using linear interpolation. If the intersection
between your SNP map (BIM, MAP or GEN file) and your genetic map
(GMAP file) is poor, you should verify that the positions in
both files are from the same build of the Human genome (b36 or
b37 for example).
For genetic map for human populations see here.
If no genetic map is provided, SHAPEIT assumes a constant recombination rates between SNPs. This is unrealistic in human populations and thus may harm accuracy. The default value is 0.0004. You can change this value to 0.001 with the option --rho as follows:
shapeit -B gwas \
-O gwas.phased \
--rho
0.001
Effective population size
To specify the effective population size, use the option --effective-size. The default value for the effective population size is 15,000 which should be fine for most Human populations. A command line example that specifies an effective population size of 11,418:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--effective-size
11418
For information, the effective population sizes estimated for
the HapMap phase II populations are:
- European (CEU) : 11,418
- African (YRI) : 17,469
- Asian (CHB+JPT) : 14,269
Input file options
Filter SNPs and individuals with a poor calling rate
It is recommended to remove SNPs and individuals with a poor calling rate since they slow down computations, increase memory usage and are likely to be error-prone. To remove all SNPs and individuals with a poor call rate, use Plink if your dataset is encoded in a PED/MAP or BED/BIM/FAM format and QCTOOL if it is encoded in the GEN/SAMPLE format.
Note that SHAPEIT throws an error if a completely missing SNP or individual (call rate = 0%) is included in the dataset.
Split the dataset by chromosome
You must split the dataset by chromosomes prior to phasing since SHAPEIT phases only one chromosome at a time. To do so, you can use the following Plink command for example:
for chr in $(seq 1 22); do
plink --file myGwasData \
--chr $chr \
--recode \
--out
myGwasData.chr$chr ;
done
You can also proceed as follows with a GEN/SAMPLE file:
for chr in $(seq 1 22); do
cat myGwasData.gen | awk -v c=$chr '{
if ($1 == c) { print $0 } }' > chr$chr\.unphased.gen ;
done
Input file formats
You can specify the unphased genotypes to SHAPEIT with 4 different file formats:
Plink PED/MAP format:
The file format is described here. Use as follows the --input-ped or -P options to specify unphased genotypes in this format:
shapeit --input-ped
gwas.ped gwas.map \
-M genetic_map.txt
\
-O gwas.phased
In this file format, SHAPEIT considers "0" as missing data. To change the missing data character to "N" for example, use --missing-code options as follows:
shapeit --input-ped gwas.ped gwas.map \
-M genetic_map.txt \
-O gwas.phased \
--missing-code
N
Plink BED/BIM/FAM format:
The file format is described here. Use as follows the --input-bed or -B options to specify unphased genotypes in this format:
shapeit --input-bed
gwas.bed gwas.bim gwas.fam \
-M genetic_map.txt \
-O gwas.phased
Oxford GEN/SAMPLE format:
The file format is described here. Use as follows the --input-gen or -G options to specify unphased genotypes in this format:
shapeit --input-gen
gwas.gen gwas.sample \
-M genetic_map.txt
\
-O gwas.phased
In this file format, you may have some probabilistic genotypes as shown in this small example:
7 SNP1 123 A G 1 0 0 0 1 0 0.1 0.9 0 0 1 0
7 SNP2 456 T C 0 1 0 0 0 1 1 0 0 0 1 0
7 SNP3 789 A T 0.2 0.3 0.5
0 1 0 0 1 0 0 0 1
SHAPEIT does not take into account uncertainty in the genotypes. Instead, it fixes the genotypes when their maximum probability exceeds a given threshold and leaves as missing the others. By default, SHAPEIT uses a threshold of 0.9 which means that the example dataset described above is considered by SHAPEIT as:
7 SNP1 123 A G 1 0 0 0 1 0 0 1 0 0
1 0
7 SNP2 456 T C 0 1 0 0 0 1 1 0 0 0 1 0
7 SNP3 789 A T 0 0 0 0 1 0 0 1 0 0 0 1
Note that IND1 at SNP3 is missing since none of the genotype probabilities is above the threshold of 0.9. You can modify the threshold value to 0.8 for example by adding the option --input-thr to the SHAPEIT command line as follows:
shapeit --input-gen gwas \
-M genetic_map.txt \
-O gwas.phased \
--input-thr
0.8
VCF Variant Call Format:
The file format is described here.
Use as follows the --input-vcf or -V options to
specify unphased genotypes in this format:
shapeit --input-vcf
gwas.vcf \
-M genetic_map.txt
\
-O gwas.phased
Implicit filenames
If all input files have their default file extensions, that is:
- *.ped and *.map for PED/MAP format,
- *.bed, *.bim and *.fam for BED/BIM/FAM format,
- *.gen and *.sample for GEN/SAMPLE format.
- *.vcf for VCF format.
Then, they can be specified to SHAPEIT using the following shorter forms:
shapeit --input-ped
gwas -M genetic_map.txt -O gwas.phased
shapeit --input-bed gwas -M
genetic_map.txt -O gwas.phased
shapeit --input-gen gwas -M
genetic_map.txt -O gwas.phased
shapeit --input-vcf gwas -M
genetic_map.txt -O gwas.phased
Read GZIP or BZIP2 input files
SHAPEIT can read all input files compressed using GZIP or BZIP2 as soon as the correct file extension is specified. SHAPEIT recognises a GZIP file if the corresponding filename is of the form filename.gz and a BZIP2 file if the filename is of the form filename.bz2. Given the correct file extension, SHAPEIT will internally uncompress the file in order to read it. For example, if your GEN and GMAP files are gzipped, you can read them using the command:
shapeit -G gwas.gen.gz
gwas.sample \
-M genetic_map.txt \
-O gwas.phased
Data checks
SHAPEIT automaltically checks the input genotype data prior to phasing. It throws warnings when it detects:
- Individuals or sites with a rate of missing data higher than 5%.
- Monomorphic or singleton SNPs since they are not informative for phasing
And errors when an individual or a site has a missing data rate of 100%. In addition, SHAPEIT generates 2 additional log files to facilitate the detection of any problems in the input data. More specifically, if the log file is myLogFile.log, then the 2 following additional files will be generated:
- myLogFile.ind.mm contains the individual missing rates (N lines).
- myLogFile.snp.mm contains the SNP missing rates and allele frequencies (L lines).
Here is an example of the myLogFile.ind.mm
content:
id missing
HG00098 54
HG00105 0
HG00107 3
HG00115 2
HG00132 1
HG00145 7
HG00153 5
HG00157 0
HG00181 1
...
The first column gives the individual ID and the second the number of sites with missing data the individual contains.
Here is an example of the myLogFile.snp.mm:
id position A B missing count_A_main
count_B_main count_A_founder count_B_founder
SNP20-8948779 9000779 T C 0 132 1930 104 1646
rs6140828 9000854 C T 0 568 1494 464 1286
SNP20-8950466 9002466 C T 0 921 1141 817 933
SNP20-8951666 9003666 G A 2 127 1931 97 1649
SNP20-8956851 9008851 A G 0 5 2057 4 1746
rs8118170 9009163 G A 0 62 2000 49 1701
rs6108236 9009406 C A 0 170 1892 137 1613
SNP20-8957905 9009905 G T 0 155 1907 125 1625
SNP20-8959047 9011047 G A 2 683 1375 562 1184
...
The columns are:
- id: The SNP id (ex: rs id)
- position: The SNP physical position in bp
- A: The first allele (ref allele in VCF)
- B: The second alleles (alt allele in VCF)
- missing: The missing genotype count. That is the
number of samples with missing data at this site.
- count_A_main: The allele A count. The number of haplotypes carrying the A allele.
- count_B_main: The allele B count. The number of haplotypes carrying the B allele.
- count_A_founder: The allele A count in founder
haplotypes only.
- count_B_founder: The allele B count in founder haplotypes only.
In this file, you should have:
N = missing + (count_A_main + count_B_main)/2
The two files can be easily post-processed in R using the following commands:
ind_data <-
read.table("myLogFile.ind.mm", hea=T)
snp_data <- read.table("myLogFile.snp.mm", hea=T)
Importantly, you can generate these two files without performing any phasing. To do so, just use the following SHAPEIT command:
shapeit -check -B gwas --output-log gwas.stats
It will generate the files gwas.stats.ind.mm and gwas.stats.snp.mm.
Subsets of unphased genotypes
Read the genotypes in a window
To only phase the variant sites located in a particular window, use the two following options:
- --input-from : Lower bound of the window in bp. The default value is 0.
- --input-to : Upper bound of the window in bp. The default value is 1e9.
Here is an example to phase sites only within the range [9.1, 9.6] Mb:
shapeit -B gwas \
-M genetic_map.txt \
--input-from
9.1e6 \
--input-to 9.6e6 \
-O gwas.phased
Excluding/Including SNPs
To exclude some particular variant sites from the phasing process, use the option --exclude-snp together with a file listing the site position you want to exclude. Example:
shapeit -B gwas \
-M genetic_map.txt \
--exclude-snp gwas.subset.site \
-O gwas.phased
Where each line of the file gwas.subset.site corresponds to a physical position that has to be excluded. Example:
9000779
9000854
9002466
9003666
9008851
9009163
9009406
...
Similarly, you can use the option --include-snp
to only phase a particular set of variants:
shapeit -B gwas \
-M genetic_map.txt \
--include-snp gwas.subset.site \
-O gwas.phased
Excluding/Including individuals
To exclude individuals from the phasing process, use the option --exclude-ind together with a file listing the IDs of the individuals you want to exclude. Example:
shapeit -B gwas \
-M genetic_map.txt \
--exclude-ind gwas.subset.inds \
-O gwas.phased
Where each line of the file gwas.subset.inds corresponds to an individual that needs to be excluded. Example:
HG00098
HG00105
HG00107
HG00115
HG00132
HG00145
HG00153
HG00157
...
Similarly, you can use the option --include-ind
to only phase a particular set of samples:
shapeit -B gwas \
-M genetic_map.txt \
--include-ind gwas.subset.inds \
-O gwas.phased
Output file options
Log file
Each time SHAPEIT is run, the screen output is copied in a log
file. By default, the name of the log file is defined as being:
shapeit_[date]_[time]_UUID.log
For example, a log file shapeit_23112012_14h11m23s_51b01c25-336f-4eef-98bc-c1c5ac97a4cb.log tells that SHAPEIT was run the 23th of November at 14:11. The UUID (Universal Unique IDentifier) avoids any collision between log filenames. You can specify your own log file name by using the --output-log option:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--output-log
gwas.phased.log
Output best haplotypes in HAPS/SAMPLE format
To output the most likely pairs of haplotypes for each sample in Impute2 format, use the option --output-max:
shapeit -B gwas \
-M genetic_map.txt \
--output-max gwas.phased.haps gwas.phased.sample
Or in short form:
shapeit -B gwas \
-M genetic_map.txt \
-O
gwas.phased
Haplotypes can also be compressed using GZIP or BZIP2 just by adding correct file extension on the filenames:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased.haps.bz2 gwas.phased.sample
Output haplotype graphs to capture phasing uncertainty
To output the haplotype graphs that store phasing uncertainty in a compact way, use the option --output-graph:
shapeit -B gwas \
-M genetic_map.txt \
--output-graph
gwas.graph
Then, you can extract the most likely pairs of haplotypes from the graphs by using:
shapeit -convert
\
--input-graph
gwas.graph \
-O
gwas.phased
This generates the two files gwas.phased.haps and gwas.phased.sample. Alternatively, you can just sample a pair of haplotypes for each indivudual using the --output-sample option:
shapeit -convert \
--input-graph chr20.graph
\
--output-sample
gwas.phased
Subsets of phased haplotypes
SHAPEIT can extract subsets (sites/samples) of haplotypes specified in HAPS/SAMPLE format. This option builds on the following base command line:
shapeit -convert \
--input-haps gwas.phased \
--output-haps
gwas.phased.subset
Then, to exclude/include sites/samples from gwas.phased.haps and gwas.phased.sample, use it in combination of the following options:
- --output-from INT: Output haplotypes only for SNPs with positions >= INT (Default = 0).
- --output-to INT: Output haplotypes only for SNPs with positions <= than INT (Default = 10e9).
- --exclude-ind FILE.inds: Output haplotypes only for individuals not in FILE.
- --include-ind FILE.inds: Output haplotypes only for individuals in FILE.
- --exclude-snp FILE.snps: Output haplotypes only for SNPs not in FILE.
- --include-snp FILE.snps: Output haplotypes only for SNPs in FILE.
For example, this command extracts the haplotypes of the individuals listed in gwas.subset.inds only at sites included in gwas.subset.site:
shapeit -convert \
--input-haps gwas.phased \
--output-haps
gwas.phased.subset \
--include-ind gwas.subset.inds \
--include-ind gwas.subset.site
Where in the files gwas.subset.inds and gwas.subset.site have the same format than those described above in the section "Subsets of unphased genotypes".
Convert the SHAPEIT haplotypes to other formats
SHAPEIT can convert haplotypes in HAPS/SAMPLE format to two
different formats.
To convert haplotypes to the HAP/LEGEND/SAMPLE usually used for
reference panels of haplotypes in Impute2:
shapeit -convert \
--input-haps gwas.phased \
--output-ref gwas.phased.hap gwas.phased.leg gwas.phased.sam
Alternatively, you can convert the haplotypes to VCF format
using:
shapeit -convert \
--input-haps gwas.phased \
--output-vcf gwas.phased.vcf
Prephasing for imputation
A major use of phasing is haplotype estimation of GWAS samples in order to speed up imputation from large reference panel of haplotypes such as 1000 Genomes. The current recommendation is that GWAS samples are first 'pre-phased' using the most accurate method available. The subsequent imputation step (which involves imputing alleles from one set of haplotypes into another set) is fast. As new haplotype reference sets become available imputation can be re-run much more efficiently. The approach we recommend is:
1. Phase the GWAS samples with SHAPEIT
2. Impute non-typed SNPs into SHAPEIT haplotypes with IMPUTE2.
Step1: Alignment of the SNPs
SNP positions in build 37
The most recent 1,000 genomes haplotypes are defined at sites that use build37 coordinate system of the Human Genome. Thus., you have to make sure that your GWAS sites are also in the same build. If it is not the case, you can use the UCSC liftOver tool to perform the conversion to build37 coordinates.
Strand alignmentThis is a crucial step of prephasing/imputation to make sure that the GWAS dataset is well aligned with the reference panel of haplotypes. You can check SNP alignment in two steps with Plink (step1 and step2) or with GTOOL.
You can also check strand alignment using SHAPEIT as described in detail in the section "Using Reference panel".
Step2: Phasing the GWAS samples
Once you GWAS dataset is correctly aligned to the reference panel, we strongly recommend to phase each chromosome in a single run instead of making chunks. It makes the procedure much easier and increase downstream imputation quality. To do so, use the following SHAPEIT command line:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased
Notes for cluster users: Suppose that you
want to prephase your GWAS on a cluster where each node has X
CPU cores. In this case, the approach we recommend is to first
reserve a complete cluster node for each SHAPEIT job, and then
to run each SHAPEIT job with X threads to fully load the
CPU-cores of a node.
Notes for server users: Suppose that you want to prephase your GWAS on a server with Y CPU cores (usually, Y=8,12,16,32,48 or 64). In this case, the approach we recommend is to phase the chromosomes in decreasing size order (the biggest one to the smallest one, that is from chr1 to chr22 for Humans). Then, to use several parallel SHAPEIT jobs (J), each using several threads (T) such that the server is fully loaded (JxT=#CPU-cores). Specifically, we recommend:
To set up this approach, xargs is a really useful Linux command. To use it, you can proceed in 2 steps:
Step1: Generate all the command lines (1 per job):
for i in $(seq 1 22); do
echo "-B gwas.chr$i -M gmap.chr$i\.txt -O
gwas.chr$i\.phased -T T"
>> myCommands.txt
done
Step2: Run all the jobs using xargs:
cat myCommands.txt | xargs -PJ -n8 shapeit &
Then, you have J jobs running in parallel on your server, each one using T threads, and your server is loaded at 100%.
Step3: Imputation of the GWAS samples
Once SHAPEIT has produced haplotype estimates, use IMPUTE2 to impute untyped genotypes with the latest release of the 1000 Genomes haplotypes. On the previous example, this is done for the chunk [9.1, 9.6] Mb with the command:
impute2 -use_prephased_g \
-known_haps_g
gwas.phased.haps \
-h reference.haplotypes.gz
\
-l reference.legend.gz \
-m genetic_map.txt \
-int 9.1e6 9.6e6 \
-Ne 20000 \
-o
gwas.from9.1e6_to9.6e6.imputed
We strongly advice to use the latest IMPUTE2 version available
here.
Several comments on the previous command line:
- The flag -use_prephased_g is required to set IMPUTE2 in the prephasing mode
- Prephased GWAS haplotypes are specified using -known_haps_g
- The option -int 9.1e6 9.6e6 is used to
specify the region to be imputed. If you are imputing across
the genome then you need to split the genome up into small
chunks, and impute each one separately using a separate
job. Please refer to the IMPUTE2 website here
for more information on how to do this.
Note that if you deal with X chromsome imputation, you should
carry out SHAPEIT pre-phasing and IMPUTE2 imputation using the
option --chrX for both.
Using family information
We provide two sets of options for estimating haplotypes when there is relatedness between samples
- Very general functionality to estimate haplotypes in any sized pedigrees
- More focussed functionality that allows phasing of mother-father-child trios and parent-child duos.
Phasing in general pedigrees
In the following paper we investigated the use of SHAPEIT for
phasing related samples.
J. O'Connell, D.
Gurdasani, O. Delaneau, et al. (2014) A general approach for
haplotype phasing across the full spectrum of relatedness.
PLoS Genetics [LINK]
The conclusions of
this paper are that SHAPEIT can be an accurate method
of phasing across a whole spectrum of relatedness, from
explicitely related families or pedigrees, through closely
related samples such as those found in isolated populations,
to unrelated samples. Our new approach has two steps.
Firstly, all the
samples are phased ignoring the pedigree information.
This phasing is very accurate despite not using the pedigree
information. The reason is that the sharing of long-range
haplotypes between related samples actually helps the phasing
within the MCMC algorithm that SHAPEIT implements.
Secondly, we have developed a new method (called duoHMM)
that takes the estimated haplotype from the first step, and
combines it with the known pedigree information to further
improve the phasing. This method will ensure that the haplotypes
are consistent with the pedigree structure, leading to increased
accuracy.
To carry out both these steps internally within SHAPEIT you should simply use the --duohmm. This option will report any Mendel errors in the pedigree. Such genotypes will be set to missing, and then imputed during the phasing. You should be aware of this as it means some genotypes may change between input and output datasets. We have also found it to be (slightly) advantageous to use a larger than default window when large amounts of IBD sharing are present hence we use the -W 5 here
shapeit -B gwas-nomendel \
-M genetic_map.txt \
--duohmm
\
-W 5 \
--output-max gwas-duohmm \
--output-graph
gwas-duohmm.graph
A side product of the duoHMM approach is that it can be
used to detect recombination events and non-mendelian genotyping
errors. If this is the type of analysis that you want to
carryout then we provide a stand alone implementation of the duoHMM
method here.
Phasing of mother-father-child trios and parent-child duos
You can specify parent-child relationships between individuals as in Plink, that is using using the PED/MAP and BED/BIM/FAM file formats. Note that conversely to Plink, SHAPEIT does not use the family ID to build up the pedigrees, which explains why it is important that each sample has an unique ID. You can also use the GEN/SAMPLE format to specify parent-child as long as you add two additional columns in the SAMPLE file that specify the relationships between individuals. These two additional columns need to be called ID_father and ID_mother in the header line of the SAMPLE file in order for SHAPEIT to recognize them. Example:
ID_1 ID_2 missing ID_father ID_mother
0 0 0 0 0
IND1 IND1 0 0 0
IND2 IND2 0 0 0
TRIOF TRIOF 0 0 0
TRIOM TRIOM 0 0 0
TRIOC TRIOC 0 TRIOF TRIOM
DUOM DUOM 0 0 0
DUOC DUOC 0 0 DUOM
Then, SHAPEIT builds internally the mother-father-child trios using the IDs of the father and mother of each individual. Additionally to the basic data checks done on site frequencies and calling rates, SHAPEIT also checks Mendel errors when dealing with family data. To run the checks, do:
shapeit -check
\
-B gwas \
--output-log gwas.checks
It throws warnings when high rates of Mendel errors are
detected at the family or site level and generates 2 log files
to easily spot any problems:
- gwas.checks.ind.me contains the Mendel errors at the individual level (N lines).
- gwas.checks.snp.me contains the Mendel errors
at the SNP level (L lines).
Hereafter an example of gwas.checks.snp.me:
id position mendel total
SNP20-8948779 9000779 0 156
rs6140828 9000854 0 156
SNP20-8950466 9002466 0 156
SNP20-8951666 9003666 0 156
SNP20-8956851 9008851 0 156
rs8118170 9009163 0 156
rs6108236 9009406 0 156
...
The columns are variant site ID and position, number of mendel
errors, total number of duos-trios. If you divide the 3rd column
by the last, you get the Mendel error rate. An example of gwas.checks.ind.me:
id father_id father_mendel mother_id
mother_mendel father_mother_mende famID
HG00098 0
0
0
0
0 0
HG00105 0
0
0
0
0 1
HG00107 0
0
0
0
0 2
...
HG01502 HG01500 1 HG01501
1
2 157
HG01503 0
0
0
0
0 158
HG01504 0
0
0
0
0 158
HG01505 HG01503 0 HG01504
1
1 158
HG01506 0
0
0
0
0 159
HG01507 0
0
0
0
0 159
...
Where the columns are:
- id: The individual id
- father_id: The father id for this individual
- father_mendel: The number of Mendel error when considering only the duo father/child
- mother_id: The mother id for this individual
- mother_mendel: The number of Mendel error when considering only the duo mother/child
- father_mother_mendel: The number of Mendel error when considering the full trio father/mother/child
- famID: An unique id internally assigned by SHAPEIT to
each distinct family (useful to determine how many unrelated
families there are).
These files are quite useful to filter out badly called variant sites and identify spurious trios or duos. To do so, you can use R to open the 2 log files as follows:
ind_mendel <- read.table("gwas.checks.ind.me",
hea=T)
snp_mendel <- read.table("gwas.checks.snp.me:",
hea=T)
Then, you can filter out SNPs according to the Mendel error rate mendel/total. And you can easily identify spurious duos and trios by only examining the father_mother_mendel column; a large value suggesting that there is an error in the father/mother assignment to the child. You can also examine father_mendel and mother_mendel columns to determine if one of them or both were incorrectly set as parent.
As soon as family information is correctly specified, SHAPEIT automatically uses it when running with the standard command line that follows:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased
If you want to disable this feature, use the --noped flag:
shapeit -B gwas \
-M genetic_map.txt \
-O gwas.phased \
--noped
This makes SHAPEIT consider all samples in the data as unrelated.
Phasing with a reference panel
SHAPEIT can use publicly available reference panel of
haplotypes, such as the one provided by the 1,000 Genomes
project, to help phasing. This approach is particularly useful
when phasing typically less than 100 individuals.
Step1: Download the 1,000 Genomes phase 1 reference panel of haplotypes
You can go here to download the 1,000 Genomes phase1 haplotype set in the correct format. File format is described in details here.
Step2: Alignment of the SNPs between the GWAS dataset and the reference panel
Let assume you need to phase a GWAS dataset defined on a set of SNP A using a reference panel of haplotypes defined on a set of SNP B. SHAPEIT can only phase SNPs that are in both sets A and B with the same REF/ALT alleles. Otherwise:
- Case1: It throws an error when SNPs found in both A and B have incompatible allele types (strand issues)
- Case2: It automatically removes SNPs that are in B and not in A
- Case3: It throws an error when it finds SNPs that are
in A but not in B
Assuming you use 1,000 Genomes reference panel; these
constraints should not be a major problem since most of the chip
SNPs are expected to be found. When specifying a reference
panel, SHAPEIT proceeds with several checks in order to make
sure that alignment between SNPs in A and B is
OK. When it encounters any problems; two additional log files
are generated describing those. If the main log file is myLogFile.log,
then the two following files are generated:
- myLogFile.snp.strand that describes in detail all Case1 and Case3 problems found.
- myLogFile.snp.strand.exclude that gives a list of the physical positions of all Case1 and Case3 problems found for easy exclusion using --exclude-snp option.
A data check can be run without carrying out phasing using:
shapeit -check
\
-B gwas \
-M genetic_map.txt \
--input-ref
reference.haplotypes.gz reference.legend.gz reference.sample
\
--output-log
gwas.alignments
A content example of the file gwas.alignments.strand when there is an issue:
type pos main_id main_A main_B
main_flippable ref_id ref_A ref_B ref_flippable
Missing Missing 9065164 SNP20-9013164 G T 1 NA NA NA 1
Missing Missing 9068825 SNP20-9016825 T C 1 NA NA NA 1
Missing Missing 9106084 SNP20-9054084 G T 1 NA NA NA 1
Missing Missing 9226397 SNP20-9174397 C T
1 NA NA NA 1
Missing Missing 9288143 SNP20-9236143 C T 1 NA NA NA 1
Missing Missing 9326235 SNP20-9274235 C T 1 NA NA NA 1
Strand Strand 9342728 SNP20-9290728 G T 1
rs76626604 T C 0
Missing Missing 9505264 SNP20-9453264 A C 1 NA NA NA 1
Missing Missing 9533375 SNP20-9481375 C T 1 NA NA NA 1
This file contains one line for each alignment problem found
between A and B. The columns are the following:
- type: the type of the alignment problem. It can be missing for Case3 problems or strand for Case1 problems
- pos: the physical position of the SNP that has an alignment problem
- main_id: SNP id in the GWAS dataset
- main_A: first allele in the GWAS dataset
- main_B: second allele in the GWAS dataset
- ref_id: SNP id in the reference panel
- ref_A: first allele in the reference panel
- ref_B: second allele in the reference panel
A case1 and case3 are highlighted in red and blue respectively in the above example. The case3 example tells you that SNP20-9174397 at position 9226397 is not in the reference panel. The case1example tells that SNP at position 9342728 has different allele types between the gwas and reference panels. You can input this file into R for analysis using:
ref_data <-
read.table("chr20.alignments.snp.strand", hea=T)
In addition to gwas.alignments.strand, another file gwas.alignments.strand.exclude is provided to easily remove the problematic sites from the analysis. Hereafter the file corresponding to the above example:
9065164
9068825
9106084
9226397
9288143
9326235
9342728
9505264
9533375
This file is basically the second column of the previous file
without including any header. It can be used in all subsequent
SHAPEIT command lines using the --exclude-snp option:
shapeit -check \
-B gwas \
-M genetic_map.txt \
--input-ref
reference.haplotypes.gz reference.legend.gz reference.sample \
--exclude-snp gwas.alignments.strand.exclude
Step3A: Phasing the GWAS dataset using the reference panel of haplotypes
Once solved the site alignment issues between the reference and the GWAS panels, you can then proceed with the phasing in itself. To do so, use this command:
shapeit -B gwas \
-M genetic_map.txt \
--input-ref
reference.haplotypes.gz reference.legend.gz reference.sample \
-O
gwas.phased.with.ref
All standard phasing options described in the algorithm
parameters and model parameters sections are
available if you want to to tweak the phasing step.
If you want to use only a subset of the haplotypes of the reference panel in the phasing process, you can do it by using the two options --exclude-grp and --include-grp. As described here, the SAMPLE file of the reference panel has two additional columns population and group. You can only use a subset of the reference haplotypes by filtering on these columns. For example, if you want to only use the European reference haplotypes (EUR), you have to first generate an inclusion list containing the groups or populations you want to include:
echo "EUR" > group.list
And then, you can run SHAPEIT like this:
shapeit -B gwas \
-M genetic_map.txt \
--input-ref
reference.haplotypes.gz reference.legend.gz reference.sample \
--exclude-snp gwas.alignments.strand.exclude
\
--include-grp group.list \
-O gwas.phased.with.ref
Step3B: Using the reference panel of haplotypes and no MCMC iterations
Alternatively to the previous approaches that relies on MCMC iterations, you can phase each GWAS sample in turn without using any MCMC iteration. To do so, SHAPEIT uses the reference haplotypes as conditioning set and ignores the GWAS haplotypes. We recommend to use this approach only if you have to phase a small number of GWAS sample (typically less than 10). One of the advantage of this approach is that it greatly speeds up the computations. To disable the MCMC algorithm, use the --no-mcmc flag as follows:
shapeit -B gwas \
-M genetic_map.txt \
--input-ref
reference.haplotypes.gz reference.legend.gz reference.sample \
--exclude-snp gwas.alignments.strand.exclude
\
--include-grp group.list \
-O gwas.phased.with.ref \
--no-mcmc
Phasing X chromosomes
Input file formats for X chromosome data
To phase X chromosomes, you need to specify the sex of each
sample:
- A male is coded as "1"
- A female is coded as "2".
In the PED/FAM file format, this is encoded as follows (in purple):
IND247 IND247 0 0 1
1
IND248 IND248 0 0 1 1
IND249 IND249 0 0 2 1
IND250 IND250 0 0 2 1
In the SAMPLE file format, an additional column is needed to specify sex. Itneeds to be called sex or gender in the header as shown below:
ID_1 ID_2 missing sex
0 0 0 0
IND247 IND247 0 1
IND248 IND248 0 1
IND249 IND249 0 2
IND250 IND250 0 2
Note that SHAPEIT considers all individuals as unrelated when phasing X chromosomes. In theory, it is possible to extend the SHAPEIT model to use both information (ploidy and family), but it has not been implemented yet.
Step1: Input data checks
As a first step, SHAPEIT checks for haploid heterozygous
genotypes in males. A haploid heterozygous is a male genotype
that is heterozygous, which is likely an error given the haploid
nature of the male X chromosome. Usually, it is due to incorrect
sex assignment. SHAPEIT will throw warnings when it detects SNPs
or males with high haploid heterozygous proportion (> 1%). Note also that SHAPEIT haploid heterozygous
as missing data. In addition, it generates two
additional log files that help the user to spot any problems. If
the main log file is gwas.phased.log, they are:
- gwas.phased.snp.hh reports the number of haploid heterozygous per variant site.
- gwas.phased.ind.hh reports the number of
haploid heterozygous per sample.
You can run these checks prior to phasing uing:
shapeit -check \
-B gwas.chrX \
--output-log gwas.chrX.log
\
--chrX
An example of the file gwas.phased.snp.hh:
id pos
haploid_hets n_males
rs16985845 10003130
0 20
rs5978365 10006452
1 20
rs5979244 10020945
0 20
rs7880983 10022070
0 20
...
The first columns gives the site ID, the second the physical
position, the third the number of haploid heterozygotes
found and the fourth the total number of males with genotypes at
that site. An example of the file gwas.phased.ind.hh:
id male
haploid_hets n_snp
INDIV_01 0 0
194
INDIV_02 1 2
194
INDIV_03 0 0
194
INDIV_04 1 0
194
...
The columns are individual ID, sex (1 for male), number of
haploid heterozygotes and total number of non-missing sites for
this sample.
Step2.1: Phasing the data without reference panel
To set up SHAPEIT in X chromsome phasing mode, just add the --chrX flag to the command line:
shapeit -B gwas.chrX \
-M genetic_map.chrX.txt \
-O gwas.chrX.phased \
--chrX
It only phases female samples and imputes missing data in male samples.
Step2.2: Phasing the data with a reference panel
When using a reference panels for X chromsomes, you need some adjustments in the SAMPLE file. Specifically, you need to make sure there is a column sex (1=male / 2=female) in the reference sample file:
sample population group sex
HG00096 GBR EUR 1
HG00097 GBR EUR 2
HG00099 GBR EUR 2
HG00100 GBR EUR 2
HG00101 GBR EUR 1
HG00102 GBR EUR 2
...
Then, you can phase the data:
shapeit -B gwas.chrX \
-M genetic_map.chrX.txt \
-O gwas.chrX.phased \
--chrX \
--input-ref
reference.chrX.haplotypes.gz reference.chrX.legend.gz
reference.chrX.sample
Note that SHAPEIT systematically ignores the second haplotype
for males in the reference panel. Note also that it duplicates
the male haplotypes in the output for consistency of the
HAPS/SAMPLE file format.
Read aware phasing
We recently published the following paper which presented an
extension of the SHAPEIT model that takes advantage of
the phase information contained in sequencing reads:
O. Delaneau, B.
Howie, A. Cox, J-F. Zagury, J. Marchini (2013)
Haplotype estimation using sequence reads. American Journal
of Human Genetics 93 (4) 787-696
We showed that this extended model leads to improvements in the
quality of the resulting haplotypes, especially at rare
variants. To incorporate the phase information contained in
sequencing into the phasing process, we need to proceed in two
steps.
Step1: Extract phase informative reads
A phase informative read (PIR) is a sequencing read that
span at least 2 heterozyguous sites. In the following,
we require first that the sequence data is stored in BAM files
and second that the genotype data to be phased is in VCF
format. We developed a small tool, extractPIRs, to
extract the PIRs from BAM files that relies on the samtools
API for efficiency. You can download it from here:
Linux (x86_64) Static Executable | extractPIRs.v1.r68.x86_64.tgz |
Source code |
extractPIRs.tgz |
To illustrate how this process works, lets apply it on the two
following example files provided with the SHAPEIT package:
- sequence.multiple.vcf.gz is the VCF containing the
genotype data to be phased from one chromosome (1,022 variants
x 322 samples).
- NA12891.bam and NA12892.bam contain the sequence data for two particular samples in the VCF file.
The goal here is to improve the phasing quality of the two sample NA12891 and NA12892 using the available PIRs contained in their sequence data. To extract the PIRs contained in the BAM files, you need first to build a text file that lists all the available BAM files together with some informations that extractPIRs needs. Specifically, each row in this text file corresponds to a particular BAM file to be considered and contains three columns that are either TAB or SPACE delimited:
- The first column gives the sample ID corresponding to the BAM file. This sample ID has to match one of the sample ID in the VCF in order for extractPIRs to properly map sequencing reads to samples.
- The second gives the absolute path of the BAM file. This basically tells to extractPIRs where the BAM file is physically located on the computer.
- The third gives the "Reference sequence" (field @SQ
in BAM header) to be considered by extractPIRs. In practice,
it tells to extractPIRs to only consider sequencing
reads that were mapped to a given chromosome. This is required
since phasing works one chromosome at a time, and thus only
requires the PIRs for one chromsome at a time.
For example, we have included the file in the example\ directory bamlist:
NA12891 NA12891.bam 20
NA12892 NA12892.bam 20
Once this is done, you can then run extractPIRs as
follows:
extractPIRs --bam bamlist \
--vcf sequence.multiple.vcf.gz
\
--out
myPIRsList
NOTE : the VCF file
should contain SNPs from only one chromosome.
It will produce a text file myPIRsList that contains
all the PIRs found in the two BAM files i.e. that span at least
two heterozyguous sites in the VCF sequence.multiple.vcf.gz.
This file has a specific format designed to feed SHAPEIT
in a second processing step. It looks like this:
MAP 1022
10000107 T C
10000117 C T
...
10248574
G A
10249534
A G
NA12891 88
910
G
32 912
C 26
841
A
30 842
A 31
...
325
A
30 326
C 27
341
A
28 342
G 30
NA12892 357
450
T
27 452
A
28 453
T 31
207
C
28 208
G 28
...
394
T
15 395
A
23 396
C 27
12
G
24
13
T 27
The first section (in blue) gives
the map of sites onto which PIRs were mapped. It says that there
are 1,022 sites in the map and each line that follows gives the
physical position and the REF/ALT alleles at that site. The next
sections give the PIRs found for each sample. NA12891 (in purple) and NA12892 (in red) have 88 and 357 PIRs,
respectively. Each PIR is described as a line in this file. And
each heterozyguous site covered is described by a triplet (index
of the het in the map, allele observed, base quality). For
instance, the last PIR shown in this file covers two hets: the
12th and the 13th sites in the map with alleles G and T
respectively, that were sequenced with base qualities 24 and 27.
Here, four important points need to be highlighted:
- Only PIRs at Single Nucleotide Polymorphisms (SNP) are considedered by extractPIRs and SHAPEIT. PIRs at short bi-allelic indels are just ignored.
- When a sequencing read contains a nucleotide at a heterozyguous site that does not correspond to any of the REF and ALT alleles specified in the VCF file, this nucleotide is just ignored. In other words, only nucleotides in reads that match the REF/ALT alleles in the VCF are considered.
- You can tell to the program to only consider sequencing read with a mapping quality higher than a given value using the option --read-quality. By default, the minimal mapping quality of a read in order to be considered by extractPIR is 10. You can increase it to 20 by using --read-quality 20 for instance.
- You can tell to the program to only consider bases with a base quality higher than a given value specified using the option --base-quality. By default, the minimal quality of a base in order to be considered by extractPIRs is 13. You can increase it to 20 by using --base-quality 20 for instance.
Here is an example of a custom run of extractPIRs:
extractPIRs --bam bamlist \
--vcf sequence.multiple.vcf.gz
\
--out
myPIRsList \
--base-quality 20 \
--read-quality 20
Step2: phase the VCF using the extracted PIRs
Now that you have extracted the PIRs from your BAM files, let use them to phase the VCF as shown below:
shapeit -assemble
\
--input-vcf sequence.multiple.vcf.gz
\
--input-pir myPIRsList \
-O myHaplotypeData
This phases the genotype data using the standard MCMC
iterative procedure together with the read aware phasing module.
Note the use of the -assemble flag to tell to
SHAPEIT to use the read aware model and the --input-pir
option to specify the set of PIRs to be used (and
previously extracted by extractPIRs). This simple
example uses all paramaters by default, but you can also tune
the MCMC run and the HMM using options described here and here.
Compared to a standard run of SHAPEIT, not much changes on the
screen apart from this section that describes how SHAPEIT used
the PIRs provided:
Assembling reads on graphs [382/382]
* 445 / 445 PIRs used
* 2 / 382 samples covered by PIRs
* 63 / 442 heterozygotes covered by PIRs
* 107 / 146 graph segments covered by PIRs
In brief, it tells you that:
- 445 PIRs out of the 445 provided are used in the phasing process.
- 2 samples out of 382 have PIRs.
- 63 heterozygous genotypes out of 442 possible ones are
covered by PIRs. The number 442 corresponds to the total
number of hets found in the 2 covered samples.
- 107 out of 146 graph segments are covered bt PIRs. The graphs are the compact representations of the space of possible haplotypes used by SHAPEIT.
In the particular case of just a few samples to be phased,
which usually happens in high coverage sequencing studies, you
cannot really use the MCMC procedure and need to use an external
data that captures the LD patterns used for inference. To do so,
you can use the 1000 Genomes haplotypes. Running SHAPEIT
in this context is really similar to what is described here.
We provided a small example of this with the software package:
sequence.single.vcf.gz. This file contains genotype data
on three individuals, labelled NA12878, NA12891 and NA12892. So,
in the above example, we extracted the PIRs for two of these
individuals (NA12891 and NA12892) and stored them in the file myPIRsList.
In a first step, you need to check that this VCF and the
reference panel align well:
shapeit -check
\
--input-vcf
sequence.single.vcf.gz \
-R reference.haplotypes.gz
reference.legend.gz reference.sample \
--output-log
myAlignmentChecks.log
SNPs that do not align well are written to a file, in this
example it is called
myAlignmentChecks.snp.strand.exclude
This file is used in the following command to remove those SNPs
from the analysis.
Then, you can run the actual phasing stage:
shapeit -assemble
\
--input-vcf
sequence.single.vcf.gz \
--input-pir myPIRsList
\
-R
reference.haplotypes.gz reference.legend.gz reference.sample
\
-O myHaplotypeData \
--exclude-snp
myAlignmentChecks.snp.strand.exclude
--no-mcmc
Using phase uncertainty
In this section, we propose two ways of capturing phase
uncertainty. Both of them start from the graph structures that
SHAPEIT can produce as output. So the very first step is to
produce this graph file gwas.phased.graph as
shown here:
shapeit -B gwas \
-M genetic_map.txt \
--output-graph gwas.phased.graph
Method1
A simple approach to propagate uncertainty in downstream analysis is to sample haplotypes R times from the graphs and then to average your downstream analysis over the R generated haplotype sets. To sample 100 haplotype sets for example, you can use the following command:
for r in $(seq 1 100); do
shapeit -convert \
--input-graph gwas.phased.graph \
--output-sample
gwas.phased.S$r \
--seed $r;
done
Don't forget to change the seed to make sure that two samples are generated using different random number sequences. You can also output the haplotypes only for a subset of SNPs/individuals you are interested in with the following options: --output-from, --output-to, --exclude-ind, --include-ind, --exclude-snp and --include-snp. For example, to generate haplotypes only for the individuals in gwas.subset.inds at sites in gwas.subset.inds, use this command instead of the previous one:
shapeit -convert \
--input-graph
gwas.phased.graph \
-O gwas.phased.subset.S$r
\
--exclude-ind gwas.subset.inds \
--exclude-snp
gwas.subset.site
Both files gwas.subset.inds and gwas.subset.site are formatted as described in this section. Note that the graph file format used in SHAPEIT version 2 follows new specifications and requires now just one single binary file to completely store the graphs. In SHAPEIT version 1, the graph file format is different and requires three files. SHAPEIT version 2 can still read this deprecated file format using this command line:
shapeit -convert \
--input-graph
myData.graph.bin myData.graph.bim myData.graph.fam
\
-O myData.phased
Old version of the graph file format is implicitly recognised by SHAPEIT when you specify 3 arguments to --input-graph.
Method2
A bit more sophisticated method relies on internal resampling.
The idea is to extract the most likely pairs of haplotypes with
their respective probability for each sample and defined on
subsets of variants provied by the user. In practice, it
is much faster than the previous method and prevents the need to
write scripts that parse the haplotype sets. In this method, the
first step is to produce a file that describes the different
combinations of variant sites for which the most likely
haplotypes have to be extracted. Herefater an example (gwas.sets):
/home/olivier/gwas.set1 9000779 9009406
/home/olivier/gwas.set2 9009163 9009905
9011322
/home/olivier/gwas.set3 9009163 9011047
This file contains a variable number of columns (min = 3). The
first column gives the filename where to write the results of
the procedure and every other value corresponds to the physical
position of a variant site we wish to consider. For example, the
second line in red tells to
SHAPEIT to extract the most likely pair of haplotypes at
the three variants with positions 9009163, 9009905 and 9011322.
Then to do the job, use the following command:
shapeit -convert \
--input-graph
gwas.phased.graph \
--input-sets gwas.sets
This samples a 1,000 times haplotypes in the graphs, subset
them at the provided variant site combinations, report both:
- For each sample, all the most likely haplotype pair are
reported with their respective probability.The reported
probabilities correspond to frequencies of occurence in the
1,000 samples. These are written in files with *.pairs.gz
extension.
- The frequency of each distinct haplotype in the dataset. It is derived from the haplotype pair probabilities described above. These are written in files with *.freqs.gz extension.
Note that you can change the number of sampling stages with the --replicate option as shown below:
shapeit -convert \
--input-graph
gwas.phased.graph \
--input-sets gwas.sets \
--replicate 500
In this example, 6 files are generated: set1.freqs.gz, set1.pairs.gz, set2.freqs.gz, set2.pairs.gz, set3.freqs.gz and set3.pairs.gz all in the folder /home/olivier. The content of set1.freqs.gz is as follows:
0 11 0.885337
1 01 0.0349776
2 10 0.0546184
3 00 0.0250673
There are 3 columns in this file:
- The haplotype index,
- The haplotype content. A combination of 0s and 1s defining the allelic content of the haplotype.
- The frequency in the dataset. They sum to 1.
The content of the mate file, set1.pairs.gz is as follows:
0 4 HG01881 HG01879 HG01880 0 0.96000 0 1 2
1
0 4 HG01881 HG01879 HG01880 1 0.04000 1 0 3 0
1 4 HG01934 HG01932 HG01933 0 1.00000 1 0
0 0
...
155 3 HG01261 0 HG01260 0 1.00000 0 0 0
156 2 HG02471 0 0 0 1.00000 0 0
157 2 HG02470 0 0 0 1.00000 0 0
158 2 NA18485 0 0 0 1.00000 0 0
159 2 NA18497 0 0 0 1.00000 0 0
160 2 NA18500 0 0 0 1.00000 0 0
161 2 NA18503 0 0 0 1.00000 0 2
162 2 NA18506 0 0 0 0.47000 1 2
162 2 NA18506 0 0 1 0.53000 0 3
...
In purple, it basically tells that NA18506 has two possible haplotype pairs with indexes (1, 2) and (0, 3) in set1.freqs.gz, and with probabilities 0.47 and 0.53. The first column gives the family index, the second the type of family (2=unrelated, 3=duo and 4=trios), the fourth and fifth the parents when relevant, and the sixth the haplotype pair index. In red and blue, two examples of what you get in case of trio and duo, respectively. Here, founder haplotype triplets or quadruplets are reported instead of pairs.
Genotype calling from low coverage sequencing (1000G pipeline)
We recently published the following paper which presents an
extension of the SHAPEIT model that allows to call
genotype from low coverage sequencing data when a chip-derived
haplotype scaffold is available for the same set of samples:
O. Delaneau, J.
Marchini & The 1000 Genomes Consortium (2014)
Integrating sequence and array data to create an improved
1000 Genomes Project haplotype reference panel. Nature
Communications.
We showed that this method is able to produce an improved
panel of haplotypes for the 1000 Genomes project (both for
Phase1 and Phase3) that lead to increased downstream
imputation quality.
Below we illustrate the use of this new functionnality on a
small example (100kb) extracted from the 1000G project phase 1
data. The parameter values we use in this example are the same
as used in 1000 Genomes project, and the values we recommend
for general use. Note that the current pipeline only works for
bi-allelic variants. The pipeline involves first running
Beagle to get a quick initial set of genotype calls. SHAPEIT
then takes these initial calls, together with the phased
haplotype scaffold, and produces a highly accurate set of
calls genotypes and haplotypes on the sequencing data.
Example files
The data files used throughout this example can be downloaded
from here. This includes
the following files
- The Genotype Likelihoods (GLs). A GL is the probability of observing the sequencing reads at a particular variant site given the unknown underlying genotype. To calculate them from sequence data contained in BAM files, you can use software such as samtools, snptools, freebayes or GATK. After having used these software, you should get a VCF file conatining a GL field for each site similarly to the example file GLs.vcf.gz.
- The genetic map. This map is similar the one described here and in the example file map.txt.gz.
- The haplotype scaffold derived from SNP array data. This
data set needs to contain some overlap with the set of
sequence data in term of both samples and variant sites. The
larger the overlap, the better. This data set also needs to be
phased as accurately as possible, so it is important to take
advantage of any useful properties of the data, such as large
sample sizes or family relationships. The scaffold
corresponding to the example data set can be found in scaffold.haps.gz
and scaffold.haps.sample.
Step
1: Define chromosome chunks for Beagle4
First, download the source code of makeBGLCHUNKS from here. Then, compile by
typing make in the terminal. Note that compilation needs a
recent version of boost
libraries. This piece of software builds a file from a
given VCF that contains the chunk coordinates required to run
Beagle 4 in each chunk. Feed makeBGLCHUNKS with GLs.vcf.gz
and two parameters:
- The chunk size (--window) in number of variant sites,
- The overlap size (--overlap) in number of variant sites.
A chunk therefore contains L = W + 2 x O variant sites. The
window and overlap sizes determine the amount of RAM needed by a
single Beagle4 job. Make sure that a chunk spans several
hundreds kilobases (i.e. site density x L). Here is an example
command line to run makeBGLCHUNKS on GLs.vcf.gz:
makeBGLCHUNKS --vcf GLs.vcf.gz --window 700
--overlap 200 --output chunk.coordinates
This produces a file chunk.coordinates with the
following content:
22
20000085 20064615
22 20037290 20099942
This says that the first chunk of chr22 starts at position
20000085 and ends at position 20064615.
Step 2: Run Beagle4
First download Beagle4 from here.
Then run it on all the chunks from step2 as follows:
while read line; do
chr=$(echo $line | awk '{ print $1 }')
from=$(echo $line | awk '{ print $2 }')
to=$(echo $line | awk '{ print $3 }')
ivcf=GLs.vcf.gz
ovcf=output.beagle4.$chr\.$from\.$to
java -Xmx4g -jar b4.r1196.jar gl=$ivcf
out=$ovcf chrom=$chr\:$from\-$to
done < chunk.coordinates
Note that it seems there is an incompatibility between
zlib libraries used in Beagle4 and in BOOST on some platforms.
This involves either the last line of the file being skipped or
a segfault. To fix this issue, you can just recompress the
Beagle4 output files using gzip as follows:
zcat
output.beagle4.20.20000085.20064615.vcf.gz | gzip -c >
tmp.vcf.gz
mv tmp.vcf.gz output.beagle4.20.20000085.20064615.vcf.gz
zcat output.beagle4.20.20037290.20099942.vcf.gz | gzip -c >
tmp.vcf.gz
mv tmp.vcf.gz output.beagle4.20.20037290.20099942.vcf.gz
Step 3: Convert Beagle4 output in SHAPEIT input
You need another C++
program: prepareGenFromBeagle4. Download it from here. And compile
it using make. This program merges all the VCFs
produced by Beagle4 in step 3 and produces proper whole
chromosome input files for SHAPEIT. More specifically, you
get:
- A gen/sample pair of files containing a mixture of GLs and fixed genotypes. All genotypes with beagle4 posteriors above a speficied threshold are fixed. And all genotypes that are not still have GLs as input data onto which SHAPEIT will work.
- A hap/sample pair of files containing the Beagle4 haplotypes in Impute2 format. These haplotypes are used as a starting point for the MCMC used in SHAPEIT.
To convert the 2 VCFs you get out of the Beagle4 runs, proceed
as follows:
prepareGenFromBeagle4 --likelihoods
GLs.vcf.gz\
--posteriors
output.beagle4.22.*.vcf.gz\
--threshold 0.995\
--output
input.shapeit.22
Note All Beagle4 output files for a given chromosome are specified using output.beagle4.20.*.vcf.gz. Moreover, we used 0.995 as the threshold meaning that all genotypes with a posterior above 0.995 are directly fixed and will only need phasing in the SHAPEIT step. Conversely, all genotypes with a posterior below 0.995 are not fixed and will need both calling and phasing. In output, you'll get the 4 files described above with input.shapeit.20 as prefix.
Step 5: Run SHAPEIT
Here is what you need to run SHAPEIT:
- The genetic maps for the chromosome you're phasing
- The chunk boundaries for each chromosome. We advice to use 1Mb chunks for SHAPEIT with 200kb overlap with the flanking chunks. In total, your chunks should be 1.4Mb long.
- The haplotype scaffold derived from SNP arrays on the same
set of samples that has been sequenced (see the Nature
communication paper for more details). The scaffold must be
whole chromosome phased in a previous run of SHAPEIT for
example (See sections of the documentation about phasing SNP
array data) taking advantage of any family data when possible.
- The genotype/GL file preparated in step 4.
- The haplotypes for MCMC initialization prepared in step 4.
Then, run it as follows:
shapeit -call\
--input-gen
input.shapeit.20.gen.gz input.shapeit.20.gen.sample\
--input-init
input.shapeit.20.hap.gz input.shapeit.20.hap.sample\
--input-map map.txt.gz\
--input-scaffold
scaffold.haps.gz scaffold.haps.sample\
--input-thr 1.0\
--thread 8\
--window 0.1\
--states 400\
--states-random 200\
--burn 0\
--run 12\
--prune 4\
--main 20\
--output-max
output.shapeit.20.haps.gz output.shapeit.20.haps.sample\
--output-log
output.shapeit.20.log\
--input-from 20000000\
--input-to 20100000
Some brief descriptions of the arguments:
- -call tells to SHAPEIT to use the genotype calling functionnality
- --input-gen specifies the genotype/GL input data that you get out of the Beagle4 step.
- --input-init specifies the haplotypes that you get out of the Beagle4 step.
- --input-map specifies the genetic map used in the estimation.
- --input-scaffold gives the SNP-array derived haplotype scaffold used by SHAPEIT. It has to be in Impute2 format.
- --thread tells the number of threads to be used in the estimation.
- --window 0.1 says that window of 0.1Mb have to be used for the SHAPEIT2 model. It's much shorter than what we use for SNP array data but gives better accuracy in practice. For sequence data, do not go above 0.5Mb.
- --states 400 says that 400 conditionning haplotypes are used in the estimations. These are chosen to be closely related to the individual being estimated.
- --states-random 200 says that 200 additional conditionnal haplotypes are used in the estimations. Those are randonly chosen which has been shown to improve MCMC mixing.
- --burn 0 says that no burn-in iterations is used since we already start from a good first estimation provided by beagle4.
- --run 12 says that 12 pruning stages are used to make
the haplotype graphs more parsimonious.
- --prune 4 says that only 4 iterations are used in each pruning stage.
- --main 20 says that final estimates are obtained by averaging over 20 iterations.
- --output-max is to specify the output haplotypes/genotypes estimated by SHAPEIT.
- --input-from and --input-to specify the
region to be phased.
Step 6: Ligate SHAPEIT chunks
To ligate together all haplotype chunks produced by SHAPEIT,
we provide a C++
program: ligateHAPLOTYPES. Download it from here, and compile it
using make. You need to run it as follows:
ligateHAPLOTYPES --vcf GLs.vcf.gz\
--scaffold scaffolded_samples.txt\
--chunks s2.chunk1.hap.gz
s2.chunk1.hap.gz s2.chunk1.hap.gz\
--output s2.ligated.hap.gz
This command will basically ligate the haplotype chunks in s2.chunk1.hap.gz, s2.chunk1.hap.gz and s2.chunk1.hap.gz into a unique haplotype file s2.ligated.hap.gz. Phase between successive segments is automatically determined by aligning the heterozyguous genotypes in the overlaps for all samples that are not listed in the file scaffolded_samples.txt. Haplotypes for all samples specified in scaffolded_samples.txt are ligated using simple concatenation.
Options
SHAPEIT can be run in 4 different modes:
- -phase (default) This mode can be used to phase genotype data with or without a reference panel.
- -convert This modes can be used to convert and manipulate the output of SHAPEIT.
- -check This mode can be used to check the validity of the input data.
- -assemble This mode can be used to phase genotype data using phase informative reads.
These different modes are triggeredas follows:
shapeit or shapeit
-phase
//phasing mode
shapeit -convert
//convertion mode
shapeit -check
//check data mode
shapeit -assemble
//phasing mode with PIRs
Basic | |
---|---|
--help | Print help about the -phase mode on screen. |
--version |
Print the version of SHAPEIT. |
--source |
Print the version of each SHAPEIT
module |
--licence |
Print on screen the licence |
In the following tables of options we have indicated which options work with each of the 4 modes.
PH = -phase, CH = -check, CO = -convert, AS = -assemble
Input |
P H |
C H |
C O |
A S |
|
---|---|---|---|---|---|
--input-bed file.bed file.bim file.fam --input-bed file -B file.bed file.bim file.fam -B file |
x |
x |
Unphased genotypes in Plink Binary BED/BIM/FAM
format: * file.bed: binary genotype file * file.bim: SNP map file * file.fam: individual information file Brief description here. Detailed description here. |
||
--input-ped file.ped file.map --input-ped file -P file.ped file.map -P file |
x |
x |
Unphased genotypes in Plink Text PED/MAP format: * file.ped: text genotype file * file.map: SNP map file Brief description here. Detailed description here. |
||
--missing-code [string] | x |
x |
Specifies the character used to encode missing data in
PED/MAP format. The default value is 0. For
example, missing data set as "N" as in GTOOL can be
specified with --missing-code N More details here. |
||
--input-gen file.gen file.sample --input-gen file -G file.gen file.sample -G file |
x |
x |
Unphased genotypes in GEN/SAMPLE format: * file.gen: text genotype file * file.sample: individual information file Brief description here. Detailed description here. |
||
--input-thr [float] | x |
x |
Threshold for hard calling genotypes when
GEN/SAMPLE files are given as input. For each individual
at each SNP, the program will assume that the most
likely genotype is true if its probability exceeds the
threshold. Otherwise, the genotype will be considered as
missing. The default value is 0.9. More details
here. Note that this option has
changed between SHAPEIT1 and SHAPEIT2 and is no more
called --input-gen-threshold. |
||
--input-vcf file.vcf -V file.vcf |
x |
x |
x |
Unphased genotypes in VCF format v4.1. Detailled description of the format is here. |
|
--input-map file.gmap -M file.gmap |
x |
x |
x |
x |
Input genetic map file in Text format with a header
line and 3 fields for each SNP: 1. A physical position in bp 2. A rate in cM/Mb (could be any value) 3. A genetic position in cM Details on the file format can be found here. And details on the usage here. |
--input-ref ref.haps ref.legend ref.sample -R ref.haps ref.legend ref.sample |
x |
x |
x |
x |
Reference panel of haplotypes in the Impute2 file
format. 1. reference.haps: the reference haplotypes 2. reference.legend: the snp map 3. reference.sample: the individual infos Details on the file format can be found here. Reference panels can be downloaded from here. |
--chrX | x |
x |
Specifies to SHAPEIT that the genotypes to be phased come from the non pseudo autosomal region of the X chromosome. SHAPEIT looks therefore at the sex of each individual to determine the ploidy model to use. More details here. | ||
--noped |
x |
x |
Discard all pedigree/family information
in the input files. SHAPEIT treats all samples as
unrelated when this flag is specified. |
||
--input-pir file.pirs |
x |
Sets of Phase Informative reads produced
by extractPIRs. More details here. |
|||
--input-graph file.graph |
x |
Read haplotype graph file generated by a SHAPEIT run with the option --output-graph. More details here and here. | |||
--input-haps file.haps file.sample --input-haps file |
x |
Read haplotypes estimated by a SHAPEIT
run with the option --output-max file.haps
files.sample. More details here. |
|||
--input-sets file.sets |
x |
List of combinations of variant sites for
which pairs of haplotypes have to be produced. More
details here. |
|||
--replicate INT |
x |
Number of haplotype pairs sampled per individual. Default is 1,000. More details here. |
Filtering | P H |
C H |
C O |
A S |
|
---|---|---|---|---|---|
--input-from [int] | x |
x |
x |
x |
Only consider genotypes with physical position above the specified value in the phasing process. Details here. |
--input-to [int] | x |
x |
x |
x |
Only consider genotypes with physical position below the specified value in the phasing process. Details here. |
--output-from [int] | x |
x |
x |
x |
Write haplotypes only for SNPs with position above the specified value. Details here. |
--output-to [int] | x |
x |
x |
x |
Write haplotypes only for SNPs with position below the specified value. Details here. |
--exclude-ind file.exc | x |
x |
x |
x |
Read genotypes for individuals NOT included in the list of IDs file.exc. The provided file contains a line for each individual to exclude (by ID). Details here. |
--include-ind file.inc | x |
x |
x |
x |
Read genotypes only for individuals included in the list of IDs file.inc. The provided file contains a line for each individual to include (by ID). Details here. |
--exclude-snp file.exc | x |
x |
x |
x |
Read genotypes for SNPs NOT included in the list of positions file.exc. The provided file contains a line for each SNP to exclude (by physical position). Details here. |
--include-snp file.inc | x |
x |
x |
x |
Read genotypes for SNPs included in the list of positions file.inc. The provided file contains a line for each SNP to include (by physical position). Details here. |
--exclude-grp file.exc | x |
x |
x |
Read haplotypes in the reference panel specified with --input-ref only for individuals with both group or population IDs NOT included in the list of IDs file.exc. The provided file contains a line for each group or population to exclude (by ID). Details here. | |
--include-grp file.inc | x |
x |
x |
Read haplotypes in the reference panel specified with --input-ref only for individuals with either group or population IDs included in the list of IDs file.inc. The provided file contains a line for each group or population to exclude (by ID). Details here. |
Model |
P H |
C H |
C O |
A S |
|
---|---|---|---|---|---|
--states [int] -S [int] |
x |
x |
To specify the number of conditioning haplotypes to be used during the phasing process. The running time of the algorithm increases only linearly with this number. The default value is 100. More details here. | ||
--window [float] -W [float] |
x |
x |
Mean size of the windows in Mb in which conditioning haplotypes are defined. This parameter has impact on the accuracy. The default value is 2 which is ok for most GWAS application. We advice a 0.5 value when dealing with genotypes derived from sequencing. More details here. | ||
--model-version1 | x |
x |
Use the graphical model to represent the conditioning haplotypes (SHAPEIT v1). More details here. | ||
--effective-size [int] | x |
x |
The effective size of the population you
want to phase. Use the following values: * For European: 11418 * For African: 17469 * For Asian: 14269 * For Mixed sample: between 11418 and 17469, depending on the proportion of each population in the dataset. More details here. |
||
--rho [float] | x |
x |
Constant recombination rate value used when no genetic map has been specified using --input-map. More details here. |
MCMC |
P H |
C H |
C O |
A S |
|
---|---|---|---|---|---|
--burn [int] | x |
x |
The number of burn-in iterations used by the algorithm to reach a good starting point. The default is to use 7 burn-in iterations. More details here. | ||
--prune [int] | x |
x |
The number of pruning iterations used by the algorithm to find a parsimonious graph for each individual. The default is to use 8 pruning iterations. More details here. | ||
--main [int] | x |
x |
The number of iterations used by the algorithm to compute transition probabilities in the haplotype graphs. The default is to use 20 pruning iterations. More details here. | ||
--thread [int] -T [int] |
x |
x |
x |
The number of individuals whose phase is updated in parallel using different threads. If you have 4 cores in your computer, you can use --thread 4 to divide the running times by almost 4. The dafult value is 1. More details here. | |
--seed [int] | x |
x |
x |
x |
The seed of the random number generator. Default is the value given by stdlib function time(NULL). More details here. |
--no-mcmc | x |
x |
Disables the MCMC iterations. This needs a reference panel to be specified. More details here. |
Output |
P H |
C H |
C O |
A S |
|
---|---|---|---|---|---|
--output-max file.haps file.sample --output-max file -O file.haps file.sample -O file |
x |
x |
Provides the most likely pair of haplotypes for each individual in Impute2 format. All extra informations (phenotypes/sex/coveriates) present in the input file are added to the SAMPLE file. More details on the file format specifications can be found here. And more details about this options here. | ||
--output-graph file.hgraph | x |
Provides the haplotype graph for each individual in the sample once the phasing process done. Can be read with SHAPEIT when using the -convert mode. More details about this options here and here. | |||
--output-log file.log -L file.log |
x |
x |
x |
x |
To specify the log file where the screen output is copied. It also gives the prefix of all the files generated by SHAPEIT when checking input data. By default, the file is called shapeit_[date]_[time]_[UUID].log. More details here. |
--output-haps file.haps file.sample |
x |
Haplotypes in Impute2 format produced in
the convert mode of SHAPEIT. More details here. |
|||
--output-vcf file.vcf |
x |
Haplotypes in VCF format produced in the convert mode of SHAPEIT. More details here. | |||
--output-ref file.haps file.leg
file.sample |
x |
Haplotypes in Impute2 reference panel format produced in the convert mode of SHAPEIT. More details here. |
File formats
SHAPEIT reads and outputs various different files with
different formats. Click the links below for details of each
format
Unphased
genotypes (PED/MAP)
Unphased genotypes (BED/BIM/FAM)
Unphased genotypes (GEN/SAMPLE)
Genetic map (GMAP)
Phased haplotypes (HAPS/SAMPLE)
Reference panel (HAP/LEG/SAMPLE)
Plink PED / MAP file format for Unrelateds & Nuclear families
This is a default text file format used by Plink. This format is useful if you want to edit the files in a text editor. However, it requires a substantial HDD usage for large datasets and thus longer running times to parse it. A detailed description of the ped/map file format can be found here. To describe it here briefly, consider the following individuals:
- 2 Unrelateds (UNR1 & UNR2).
- 1 Trio (father=TRIOF, mother=TRIOM & child=TRIOC).
- 1 Duo (parent=DUOP & child=DUOC).
All the individuals were typed on 3 SNPs (SNP1, SNP2 & SNP3) giving the following genetic data:
SNP1 SNP2 SNP3
UNR1 A/A T/C ?/?
UNR2 A/G T/C A/T
TRIOF A/G T/C A/T
TRIOM A/G T/C A/T
TRIOC A/A C/T A/T
DUOP G/A T/C A/A
DUOC A/A T/C A/A
PED file
The PED file describes the individuals and the genetic data. The PED file corresponding to the example dataset is:
FAM1
IND1 0 0 1
0 A A T T 0 0
FAM2
IND2 0 0 1
0 A G T C T A
FAM3 TRIOF
0 0 1
0 A G T C A T
FAM4 TRIOM
0 0 2
0 A G T C A T
FAM5 TRIOC TRIOF TRIOM 1 0 A A C T A T
FAM6
DUOP 0 0 2
0 G A T C A A
FAM7
DUOC DUOP 0 2
0 A A T C A A
This file can be SPACE or TAB delimited. Each line corresponds to a single individual. The first 6 columns are:
- Family ID [string]
- Individual ID [string]
- Father ID [string]
- Mother ID [string]
- Sex [integer]
- Phenotype [float]
Columns 7 & 8 code for the observed alleles at SNP1, columns 9 & 10 code for the observed alleles at SNP2, and so on. Missing data are coded as "0 0" as for example for SNP3 of IND1. This file should have N lines and 2L+6 columns, where N and L are the numbers of individuals and SNPs contained in the dataset respectively.
MAP file
The MAP file describes the SNPs. The MAP file corresponding to the example dataset is:
7 SNP1 0 123
7 SNP2 0 456
7 SNP3 0 789
This file can be SPACE or TAB delimited. Each line corresponds to a SNP. The 4 columns are:
- Chromosome number [integer]
- SNP ID [string]
- SNP genetic position (cM) [float]
- SNP physical position (bp) [integer]
This file should have L lines and 4 columns, where L is the
number of SNPs contained in the dataset.
PED/MAP to BED/BIM/FAM conversion
To convert myPlinkTextData.ped and myPlinkTextData.map in Plink binary format, use Plink as follows:
plink --file myPlinkTextData --make-bed --out myPlinkBinaryData
PED/MAP to GEN/SAMPLE conversion
To convert myPlinkTextData.ped and myPlinkTextData.map in GEN/SAMPLE format, use the software package GTOOL that you can find here as follows:
gtool -P --ped myPlinkTextData.ped --map myPlinkTextData.map --og myGtoolTextData.gen --os myGtoolTextData.sample
Note that GTOOL has more options to tweak the conversion
between formats.
Plink BED / BIM / FAM file format for Unrelateds & Nuclear families
This is a default binary file format used by Plink. This format is very convenient for large datasets since it requires low HDD memory usage. You can find a precise description of the Plink binary format here. If you want to use this format, we advice to first format your data in PED/MAP format and then use Plink to convert them in bed/bim/fam To describe it here briefly, consider the following individuals:
- 2 Unrelateds (UNR1 & UNR2).
- 1 Trio (father=TRIOF, mother=TRIOM & child=TRIOC).
- 1 Duo (parent=DUOP & child=DUOC).
All the individuals were typed on 3 SNPs (SNP1, SNP2 & SNP3) giving the following genetic data:
SNP1 SNP2 SNP3
UNR1 A/A T/C ?/?
UNR2 A/G T/C A/T
TRIOF A/G T/C A/T
TRIOM A/G T/C A/T
TRIOC A/A C/T A/T
DUOP G/A T/C A/A
DUOC A/A T/C A/A
BIM file
The BIM file describes the SNPs. The BIM files corresponding to the example dataset is:
7 SNP1 0 123 A G
7 SNP2 0 456 T C
7 SNP3 0 789 A T
This file is SPACE or TAB delimited. Each line corresponds to a single SNP. The first six columns are:
- Chromosome number [integer]
- SNP ID [string]
- SNP genetic position (cM) [float]
- SNP physical position (bp) [integer]
- First allele [string]
- Second allele [string]
This file should have L lines where L is the number of SNPs in the dataset.
FAM file
The FAM file describes the individuals. The FAM files corresponding to the example dataset is:
FAM1
IND1 0 0 1
0
FAM2
IND2 0 0 1
0
FAM3 TRIOF
0 0 1
0
FAM4 TRIOM
0 0 2
0
FAM5 TRIOC TRIOF TRIOM 1 0
FAM6
DUOP 0 0 2
0
FAM7
DUOC DUOP 0 2
0
This file is basically the 6 first columns of the corresponding PED file. This file is SPACE or TAB delimited. Each line corresponds to a single individual. The first six columns are:
- Family ID [string]
- Individual ID [string]
- Father ID [string]
- Mother ID [string]
- Sex [integer]
- Phenotype [float]
This file should have N lines and 6 columns where N is the
number of individuals.
BED file
A detailed description of the BED format can be found here.
BED/BIM/FAM to PED/MAP
To convert back myPlinkBinaryData.bed, myPlinkBinaryData.bim and myPlinkBinaryData.fam in Plink text format, use Plink as follows:
plink --bfile myPlinkBinaryData --recode
--out myPlinkTextData
Oxford GEN / SAMPLE file format for Unrelateds
This is a default text file format used in the analysis pipeline of the WTCCC genetic studies. This format is now widely used by a variety of tools. A detailed description of the gen/sample file format can be found here. To describe it here briefly, consider 4 unrelateds (UNR1, UNR2, UNR3 & UNR4) that were typed on 3 SNPs (SNP1, SNP2 & SNP3):
SNP1 SNP2 SNP3
UNR1 A/A T/C ?/?
UNR2 A/G C/C A/T
UNR3 A/G T/T A/T
UNR4 A/G T/C T/T
GEN file
The GEN file contains the genetic data and describes the SNPs. The GEN file corresponding to the example dataset is:
7 SNP1 123 A G 1 0 0 0 1 0 0 1 0 0 1 0
7 SNP2 456 T C 0 1 0 0 0 1 1 0 0 0 1 0
7 SNP3 789 A T 0 0 0 0 1 0 0 1 0 0 0 1
This file is SPACE delimited. Each line corresponds to a single SNP. The first 5 columns are:
- Chromosome number [integer]
- SNP ID [string]
- SNP physical position (bp) [integer]
- First allele [string]
- Second allele [string]
Then the successive column triplets (6, 7, 8), (9, 10, 11),
(12, 13, 14) and (15, 16, 17) correspond to the genotypes of
the individuals at the first, second, third and fourth SNP. A
genotype is coded as a triplet (AA, AB, BB). For example "1 0
0" means that the genotype is A/A, and "0 1 0" that the
genotype is A/B. Missing data is coded as "0 0 0" as for IND1
x SNP3. For a single SNP, the genotypes are given in the same
order than in the SAMPLE file (see below). This file
should have L lines and 3N+5 columns where L and N are the
numbers of SNPs and individuals respectively.
SAMPLE file
The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is:
ID_1 ID_2 missing
0 0 0
UNR1 UNR1 0
UNR2 UNR2 0
UNR3 UNR3 0
UNR4 UNR4 0
It is SPACE delimited. The first two lines are header lines that describe the content of the file. Then, each line corresponds to a single individual. The first three columns are:
- 1. First individual ID
- 2. Second individual ID
- 3. Missing data proportion
Additional columns usually describe other information for the individuals as some phenotypes for example or covariates. This file should have N+2 lines and at least 3 columns where N is the number of individuals.
GEN/SAMPLE to PED/MAP
To convert myGtoolTextData.gen and myGtoolTextData.sample in Plink PED/MAP format, use GTOOL as follows:
gtool -G --g myGtoolTextData.gen --s myGtoolTextData.sample --ped myPlinkTextData.ped --map myPlinkTextData.map
Note that genotypes with a probability below the --threshold value are set as missing. GTOOL codes missing genotypes using "N" characters. You can use Plink to recode them using "0" as follows:
plink --file myPlinkTextData --missing-genotype N --output-missing-genotype 0 --recode --out myPlinkTextData2
The genetic map
You can find genetic maps for Human here:
The genetic map must have a header line and the three following SPACE delimited columns:
- The physical position (bp) [integer]
- The recombination rate (cM/Mb) [float]
- The genetic position (cM) [float]
The resulting genetic map file should look like that:
pposition rrate gposition
72765 0.12455 0.00000
94172 0.12458 0.00266
94426 0.12461 0.00269
95949 0.12461 0.00288
98087 0.12460 0.00315
98481 0.12468 0.00320
111955 0.12475 0.00488
113224 0.12476 0.00504
113934 0.12481 0.00513
... ... ...
135443104 6.25739 182.52442
135447971 6.29614 182.55506
135499163 10.89782 183.11294
HAPS / SAMPLE file format for haplotypes
This is a default haplotype text file format used by the analysis pipeline of the WTCCC genetic studies. This format is used for example by IMPUTE2. To describe it briefly, consider 4 Unrelateds (IND1, IND2, IND3 & IND4) that were phased on 3 SNPs (SNP1, SNP2 & SNP3):
SNP1 SNP2 SNP3
IND1 hap1 A T A
IND1 hap2 A C T
IND2 hap1 G C T
IND2 hap2 A T A
IND3 hap1 A T T
IND3 hap2 A C T
IND4 hap1 G T T
IND4 hap2 G C T
SAMPLE file
The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is:
ID_1 ID_2 missing
0 0 0
UNR1 UNR1 0
UNR2 UNR2 0
UNR3 UNR3 0
UNR4 UNR4 0
It is SPACE delimited. The first two lines are header lines that describe the content of the file. Then, each line corresponds to a single individual. The first three columns are:
- First individual ID
- Second individual ID
- Missing data proportion
Additional columns usually describe other information for the individuals as some phenotypes for example or covariates. This file should have N+2 lines and at least 3 columns where N is the number of individuals.
HAPS file
The HAPS file describes the SNPs and contains the haplotypes. The HAPS file corresponding to the example dataset is:
7 SNP1 123 A G 0 0 1 0 0 0 1 1
7 SNP2 456 T C 0 1 1 0 0 1 0 1
7 SNP3 789 A T 0 1 1 0 1 1 1 1
This file is SPACE delimited. Each line corresponds to a single SNP. The first five columns are:
- Chromosome number [integer]
- SNP ID [string]
- SNP Position [integer]
- First allele [string]
- Second allele [string]
Then the successive column pair (6, 7), (8, 9), (10, 11) and (12, 13) corresponds to the alleles carried at the 4 SNPs by each haplotype of a single individual. For example a pair "1 0" means that the first haplotype carries the B allele while the second carries the A allele. The haplotypes are given in the same order than in the SAMPLE file. This file should have L lines and 2N+5 columns, where L and N are the numbers of SNPs and individuals respectively.
HAP / LEGEND / SAMPLE file format for haplotype reference panels
This is a default haplotype text file format used by IMPUTE2 to encode the haplotypes of the 1,000 Genomes Project. Precise details about the file format can be found here. To describe it briefly, consider 4 unrelated individuals (Two white American CEU1 & CEU2, an English GBR1 & a Nigerian YRI1) for which haplotypes are available at 3 SNPs (SNP1, SNP2 & SNP3):
SNP1 SNP2 SNP3
CEU1 hap1 A T A
CEU1 hap2 A C T
CEU2 hap1 G C T
CEU2 hap2 A T A
GBR1 hap1 A T T
GBR1 hap2 A C T
YRI1 hap1 G T T
YRI1 hap2 G C T
SAMPLE file
The SAMPLE file describes the individuals. The minimal SAMPLE file corresponding to the example dataset is:
sample population group sex
CEU1 CEU EUR 1
CEU2 CEU EUR 2
GBR1 GBR EUR 2
YRI1 YRI AFR 2
It is SPACE delimited. The first line is a header line that describe the content of the file. Then, each line corresponds to a single individual. The first four columns are:
- Individual ID
- Group1 ID
- Group2 ID
- Sex (1 for male, 2 for female)
SHAPEIT requires at least 4 columns in the SAMPLE file.
Additional columns can be added but SHAPEIT will ignore them.
This file should have N+1 lines and at least 4 columns where N
is the number of individuals in the reference panel.Each
individual must have unique IDs containing only alphanumeric
characters.
LEGEND file
The LEGEND file describes the SNPs. The minimal LEGEND file corresponding to the example dataset is:
id position a0 a1
SNP1 123 A G
SNP2 456 T C
SNP3 789 A T
This file is SPACE delimited. The first line is a header line that describe the content of the file. Each line corresponds to a single SNP. The first four columns are:
- SNP ID [string]
- SNP Position [integer]
- First allele [string]
- Second allele [string]
SHAPEIT requires at least 4 columns in the SAMPLE file. Additional columns can be added but SHAPEIT will ignore them. This file should have L+1 lines and at least 4 columns where L is the number of SNPs in the reference panel.
HAP file
The HAP file contains the haplotypes. The HAP file corresponding to the example dataset is:
0 0 1 0 0 0 1 1
0 1 1 0 0 1 0 1
0 1 1 0 1 1 1 1
This file is SPACE delimited. Each line corresponds to a single SNP. Each successive column pair (0, 1), (2, 3), (4, 5) and (6, 7) corresponds to the alleles carried at the 4 SNPs by each haplotype of a single individual. For example a pair "1 0" means that the first haplotype carries the B allele while the second carries the A allele as specified in the LEGEND file. The haplotypes are given in the same order than in the SAMPLE file. This file should have L lines and 2N columns, where L and N are the numbers of SNPs and individuals respectively.
Version history
v2.r900 (17 January 2018)
We have adjusted the algorithm to
(a) ensure that when missing genotypes in duos/trios are
imputed as part of the phasing they are Mendel consistent
(b) improve the accuracy of imputation of missing
genotypes.
v2.837 (17 August 2015)
Fixes some bugs when running -assemble mode
v2.790 (16 June 2014)
New functionality for calling genotypes and phasing from
sequencing data (1000GP pipeline)
We have published the following paper on using SHAPEIT2 for
calling genotypes and phasing from sequencing data. This method
is being used within the 1000 Genomes Project.
O. Delaneau, J. Marchini, The 1000 Genomes Project Consortium (2014) Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nature Communications 5 3934 [LINK]
The new
functionality, and the pipeline used for the 1000 Genomes
project, is available here.
Bug fixes
We have fixed the following bugs
- The -assemble mode will now work when some individuals in the PIRs file are not in the genotype file.
- The -convert mode will now work when converting data to the hap/legend/sample format by giving the --output-ref option 3 arguments.
- Monomorphic sites in PLINK BED format files don't now have the alternate allele coded as 0, which implies some incompatibility with Impute2.
v2.778 (24 Mar 2014)
The website for SHAPEIT has been re-designed (hope you
like it) and is now hosted at
https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html
The new version of SHAPEIT (v2 778) contains two new
major features
Functionality for using phase information in sequencing
reads to help phasing.
This method is primarily focused on phasing high-coverage
sequencing datasets, from which genotypes have already been
called. Our approach extracts phase-informative reads
from a set of BAM files, and uses this information to aid the
phasing, within the SHAPEIT approach. This approach
helps especially with phasing low-frequency variants. In fact,
the method can phase singleton sites, that are covered by
phase-informative reads, with high accuracy. The details of the
method have been published here.
O. Delaneau, B.
Howie, A. Cox, J-F. Zagury, J.
Marchini (2013)
Haplotype estimation using sequence reads. American Journal of
Human Genetics 93 (4) 787-696 [LINK]
See here
for more detail instruction on these features.
Functionality for
phasing in general pedigrees.
In the following paper we investigated the use of SHAPEIT for
phasing related samples.
J. O'Connell, D.
Gurdasani, O. Delaneau, et al. (2014) A general approach for
haplotype phasing across the full spectrum of relatedness.
PLoS Genetics [LINK]
The conclusions of
this paper are that SHAPEIT can be an accurate method
of phasing across a whole spectrum of relatedness, from
explicitely related families or pedigrees, through closely
related samples such as those found in isolated populations,
to unrelated samples. Our new approach has two steps.
Firstly, all the
samples are phased ignoring the pedigree information.
This phasing is very accurate despite not using the pedigree
information. The reason is that the sharing of long-range
haplotypes between related samples actually helps the phasing
within the MCMC algorithm that SHAPEIT implements.
Secondly, we have developed a new method (called duoHMM)
that takes the estimated haplotypes from the first step, and
combines them with the known pedigree information, to further
improve the phasing. This method will ensure that the haplotypes
are consistent with the pedigree structure, leading to increased
accuracy. There is a single, simple option (--duohmm)
used to active this method. See here
for more detailed instructions.
A side product of the duoHMM approach is that it can be used to detect recombination events and non-mendelian genotyping errors. For these features we provide standalone software.
v2.644 (4 Dec 2012)
The main improvement in SHAPEIT version 2 resides in
the new model used for the conditioning haplotypes. It is a
generalisation of the "surrogate family" phasing idea that can
now be applied at the whole chromosome scale. It results in
improved accuracy and speed compared to version 1. This new
model is used by default and can be controlled via two
parameters, the average window size (W) and the number of
haplotypes to be considered per window (K). Default values for
these parameters can be changed using the options --window
and --states respectively. The graph representation of
the conditioning haplotypes (the underlying model of version
1) can still be used if you add --version1 flag to the
SHAPEIT command line.
* The SHAPEIT2 model is now used by default with --states
100 and --window 2.
* The SHAPEIT1 model can still be used if the --version1
flag is added to the command line.
* Option --chrX to phase X chromosomes has been
improved: better compatibility with Impute2, more checks for
haploid heterozygous, possibility of using a reference panel.
* Option --input-ref to phase using a reference panel
has been improved: SNP alignment between the study genotypes
and the reference panels is much easier.
* Subsets of the reference panel can be ignored using --exclude-grp
and --include-grp options. It makes things easier if
you want to consider only a subset of the reference
haplotypes.
* MCMC iterations can be discarded when using a reference
panel with --no-mcmc flag. Each individual is phased
in turn using only the reference panel of haplotypes.
* Default number of iterations has been halved from 70
[10B+10P+50M] to 35 [7B+8P+20M].
* Haplotype graphs now requires a single file to be stored,
instead of 3 as in version 1.
* Verbose on the screen has been clarified.
* --version issue has been fixed
* To reduce command line length; (a) many available options
have now a short name (ex: --input-bed can be written as -B),
and (b) file set with common prefix can be specified only with
the prefix (ex: --input-bed file.bed file.bim file.fam can be
written as --input-bed file)
* A new mode has been added: shapeit -check [options].
It contains all functions related to data verification. It
reads all input files and spot any problems related to high
call rates, Mendel error rates or haploid heterozygous.
Summary statistics are given in several log files
automatically generated.
* A new mode has been added: shapeit -convert [options].
It contains useful functions to manipulate SHAPEIT output
files; (a) haplotype graph files generated using --output-graph
and (b) haplotypes generated using --output-max.
v1.ESHG (23 June 2012)
* Major: New option --chrX to phase X
chromosomes.
* Major: New option --input-ref to phase using a
reference panel of haplotypes (1KGP).
* Major: More complete log files with many useful
statistics that can be direclty input in R about missingness
per snp/individual, allele frequencies, mendel errors,
heterozygous haploids, etc ...
* Minor: All input files can be given as Gzipped or Bzipped as
soon as the correct file extension is given (.gz and .bz2).
They will be internally decompressed.
* Minor: All output files can be internally compressed using
Gzip or Bzip2 as soon as the correct file extension is given
(.gz and .bz2).
* Minor: Any information in the input sample file are
propagated into the output sample file as the phenotypes for
example. Idem for PED and FAM files.
* Minor: Verbose on the screen was reduced.
* Minor: Unphased genotype files can be given in a more
parsimonious way. For example --input-bed file.bed file.bim
file.fam can be now written as --input-bed file.
v1.r532:
* Minor: bugfix in trio/duo phasing.
v1.r416:
* Minor: When running on ped/map files at monomorphic SNPs the output files listed the two alleles as the observed allele and 0.This has now been changed so that the observed allele is listed twice.This provides better compatibility with IMPUTE2.
v1.r415:
* Major: Unrelateds are ordered in the same
way in input and output files.
* Minor: Covariates and phenotypes in input sample file are
given in output sample file (still don't work when the option
--output-graph is used).
* Minor: When output window coordinates are not consistent
with input window coordinates, they are automatically fixed
v1.r378:
* Major: SNPs are not uniquely identified
by their ID anymore, but rather by their physical position.
* Minor: Compiled on older Linux version for better
compatibility.
* Minor: Improved initial data checking (calling rate / Mendel
error rate).
* Minor: New data sub-setting options ( --include-ind
--exclude-ind --include-snp --exclude-snp --output-from
--output-to )
* Minor: New testing option to check that data reading is OK (
--test-reading )