duoHMM is a software package for post-processing haplotypes estimated by SHAPEIT. It incorporates pedigree information into the haplotype estimates in a post-hoc manner. This has three functions:
- correcting phasing errors in the haploptypes to make them
consistent with a pedigree structure
- detecting recombination events
- detecting subtle genotyping errors
The remainder of this page demonstrates how to apply this
pipeline to some example data.
If you use duoHMM in your research please cite:
J. O'Connell et al. (2013) A general approach for haplotype
phasing across the full spectrum of relatedness. PLoS
Genetics. 10(4) doi:10.1371/journal.pgen.1004234
NOTE : If you just wish
to use duoHMM for phasing complex pedigrees, this
functionality is integrated into SHAPEIT2 and can be
enabled via the --duohmm flag. See more details here.
Download
Pre-compiled duoHMM binaries and
example files can be downloaded from the links below.
Platform | File |
---|---|
Linux (x86_64) Static Executable | duohmm_v0.1.7.tar.gz |
Contact
To ask a question about duoHMM please subscribe to the OXSTATGEN mailing list.
If you experience any problems with duoHMM, before emailing the mail list please:
- Make sure you are using the latest version since your problem may have already been fixed
- Check carefully the screen output and the log file. The problem may be reported here since they both contain many details about what is going on.
If the problem persists, ask your question on the OXSTATGEN mailing list and we will be happy to answer!
Options
Here is a list of the command line option for duoHMM
Option [short version] |
Description |
---|---|
--haps [-H] | SHAPEIT2 haplotype file |
--input-gen [-M] | genetic map file in SHAPEIT2/Impute format |
--output-hap [-O] |
output pedigree corrected haplotypes this file |
--output-generr [-G] | output possible genotyping errors to this file |
--output-rec [-R] | output recombination map for pedigree to this file |
Examples
Data quality control
We assume your data is in plink binary format with the pedigree relationships specified in the respective plink .fam file as plink provides useful pedigree functionality. Before phasing, genotypes that are inconsistent with Mendelian inheritance should be removed. High rates of Mendel inconsistencies are likely to indicate an incorrect family relationship. The plink website provides details on how to investigate Mendel consistency within families.
Once you are confident that all your family relationships are correct. Simply run this plink command to remove any Mendel inconsistencies:
./plink --noweb \
--bfile
duohmm-example \
--me 1 1 \
--set-me-missing \
--make-bed \
--out
duohmm-example-nomendel
now the data is ready to be input into SHAPEIT2.
Phasing with SHAPEIT2
We simple run SHAPEIT2 using the --noped flag so that the pedigree relationships are completely ignored. We have also found it to be (slightly) advantageous to use a larger than default window when large amounts of IBD sharing are present hence we use the -W 5 here:
./shapeit --noped \
-B
duohmm-example-nomendel \
-M
genetic_map_chr10_combined_b37.txt \
-W 5 \
--output-max duohmm-example \
--output-graph duohmm-example.graph
Integrating pedigree structure with duoHMM
We can use the duoHMM to ensure haplotypes are consistent with pedigree structure, producing more accurate haplotypes. This process is straightforward, simply run the command:
./duohmm -H duohmm-example \
-M
genetic_map_chr10_combined_b37.txt \
-O
duohmm-example-corrected
This produces duohmm-example-corrected.haps which is a
standard SHAPEIT2 format haplotype file that has been
corrected for pedigree structure. An additional benefit is that
a child's haplotypes are now ordered with the paternal haplotype
first and the maternal haplotype second.
NOTE : this feature has
been built into the SHAPEIT software (via the --duohmm
flag). So if you just want to create a set of pedigree-correct
haplotypes then we would advise you to use that option. See here
for more details.
Detecting subtle genotyping errors with duoHMM
Not all genotyping errors will produce a Mendelian inconsistency. duoHMM can identify genotypes that are unlikely within a duo/trio based on patterns of recombination. Likely genotyping errors can be detected by running the command:
./duohmm -H duohmm-example \
-M
genetic_map_chr10_combined_b37.txt \
-G
duohmm-example.generr
The file duohmm-example.generr contains four columns; the individual ID, paternal ID, maternal ID, marker tag, marker position and the probability of error in this proband at this position:
ID father mother RSID1 POS PROB_ERROR ID001 0 ID024 10 2055591 1 ID001 0 ID024 10 2566085 1 ID001 0 ID024 10 4517381 1 ID001 0 ID024 10 9344681 1 ID001 0 ID024 10 20128601 0.999892
Markers with high probability are likely to be erroneous in either the parent or the child (or both), it is not possible to distinguish which individual contained the error.
Detecting recombination crossovers with DuoHMM
Detecting crossover events is more complicated as we need to
integrate over the uncertainty in phase. To do this, we need to
simulate haplotypes from the SHAPEIT2 diploid graph
which is output by the --output-graph argument
(we included this earlier in our phasing example).
We now will:
- Simulate ten haplotype sets from the SHAPEIT2 graph with different seeds
- Calculate the recombination map each meiosis in the haplotype sets using duoHMM
- Use the python script mapavg.py to average over these maps
for i in {1..10};
do
./shapeit -convert --input-graph
duohmm-example.graph --output-sample sim${i} --seed ${i}
./duohmm -H sim${i} -M
genetic_map_chr10_combined_b37.txt -R sim${i}.rec
done
python mapavg.py *.rec > duohmm-example-recombinations.txt
The resulting output in duohmm-example-recombinations.txt has five columns:
CHILD PARENT START END PROB_RECOMBINATION ID011 ID001 6526497 6543063 0.9997400 ID008 ID022 68442048 68482454 0.9941613 ID008 ID022 87537219 87684681 0.9995396 ID008 ID022 89965473 90008047 0.9995075 ID014 ID013 4522299 4575807 0.9721813 ID014 ID013 82866361 82979066 0.9776638
The first two columns are the parent and child involved in the
meisosis, the START and END columns are the regions where a
crossover may have occurred (flanking heterozygote sites on the
parents) and the final column is the probability of a crossover
event occurring.
Note: this routine has been shown to have good specificity,
that is it has low false positive rate when a high probability
threshold is used. The power to detect recombination events is
very dependendant on the structure of the pedigree. For example,
in a nuclear family with >2 siblings most crossover events
should be detectable. Pedigrees with such structure are informative
towards recombination.
We have also demonstrated some power to detect crossovers in uninformative
meisoses, however this is very dependant on how much more
distant IBD sharing is present in a cohort.