duoHMM

duoHMM is a software package for post-processing haplotypes estimated by SHAPEIT. It incorporates pedigree information into the haplotype estimates in a post-hoc manner. This has three functions:

  1. correcting phasing errors in the haploptypes to make them consistent with a pedigree structure
  2. detecting recombination events
  3. detecting subtle genotyping errors

The remainder of this page demonstrates how to apply this pipeline to some example data.

If you use duoHMM in your research please cite:

J. O'Connell et al. (2013) A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genetics. 10(4) doi:10.1371/journal.pgen.1004234

NOTE : If you just wish to use duoHMM for phasing complex pedigrees, this functionality is integrated into SHAPEIT2 and can be enabled via the --duohmm flag. See more details here.


Download

Pre-compiled duoHMM binaries and example files can be downloaded from the links below.

Platform File
Linux (x86_64) Static Executable duohmm_v0.1.7.tar.gz



Contact

To ask a question about duoHMM please subscribe to the OXSTATGEN mailing list.

If you experience any problems with duoHMM, before emailing the mail list please:

  1. Make sure you are using the latest version since your problem may have already been fixed
  2. Check carefully the screen output and the log file. The problem may be reported here since they both contain many details about what is going on.

If the problem persists, ask your question on the OXSTATGEN mailing list and we will be happy to answer!




Options

Here is a list of the command line option for duoHMM

Option [short version]
Description
--haps [-H] SHAPEIT2 haplotype file
--input-gen [-M]  genetic map file in SHAPEIT2/Impute format
--output-hap [-O]
output pedigree corrected haplotypes this file
--output-generr [-G] output possible genotyping errors to this file
--output-rec [-R] output recombination map for pedigree to this file

Examples

Data quality control

We assume your data is in plink binary format with the pedigree relationships specified in the respective plink .fam file as plink provides useful pedigree functionality. Before phasing, genotypes that are inconsistent with Mendelian inheritance should be removed. High rates of Mendel inconsistencies are likely to indicate an incorrect family relationship. The plink website provides details on how to investigate Mendel consistency within families.

Once you are confident that all your family relationships are correct. Simply run this plink command to remove any Mendel inconsistencies:

./plink --noweb \
        --bfile duohmm-example \
        --me 1 1 \
        --set-me-missing \
        --make-bed \
        --out duohmm-example-nomendel

now the data is ready to be input into SHAPEIT2.

Phasing with SHAPEIT2

We simple run SHAPEIT2 using the --noped flag so that the pedigree relationships are completely ignored. We have also found it to be (slightly) advantageous to use a larger than default window when large amounts of IBD sharing are present hence we use the -W 5 here:

./shapeit --noped \
          -B duohmm-example-nomendel \
          -M genetic_map_chr10_combined_b37.txt \
          -W 5 \
          --output-max duohmm-example \
          --output-graph duohmm-example.graph

Integrating pedigree structure with duoHMM

We can use the duoHMM to ensure haplotypes are consistent with pedigree structure, producing more accurate haplotypes. This process is straightforward, simply run the command:

./duohmm -H duohmm-example \
         -M genetic_map_chr10_combined_b37.txt \
         -O duohmm-example-corrected

This produces duohmm-example-corrected.haps which is a standard SHAPEIT2 format haplotype file that has been corrected for pedigree structure. An additional benefit is that a child's haplotypes are now ordered with the paternal haplotype first and the maternal haplotype second.

NOTE : this feature has been built into the SHAPEIT software (via the --duohmm flag). So if you just want to create a set of pedigree-correct haplotypes then we would advise you to use that option. See here for more details.

Detecting subtle genotyping errors with duoHMM

Not all genotyping errors will produce a Mendelian inconsistency. duoHMM can identify genotypes that are unlikely within a duo/trio based on patterns of recombination. Likely genotyping errors can be detected by running the command:

./duohmm -H duohmm-example \
         -M genetic_map_chr10_combined_b37.txt \
         -G duohmm-example.generr

The file duohmm-example.generr contains four columns; the individual ID, paternal ID, maternal ID, marker tag, marker position and the probability of error in this proband at this position:

ID father mother RSID1 POS PROB_ERROR ID001 0 ID024 10 2055591 1 ID001 0 ID024 10 2566085 1 ID001 0 ID024 10 4517381 1 ID001 0 ID024 10 9344681 1 ID001 0 ID024 10 20128601 0.999892

Markers with high probability are likely to be erroneous in either the parent or the child (or both), it is not possible to distinguish which individual contained the error.

Detecting recombination crossovers with DuoHMM

Detecting crossover events is more complicated as we need to integrate over the uncertainty in phase. To do this, we need to simulate haplotypes from the SHAPEIT2 diploid graph which is output by the --output-graph argument (we included this earlier in our phasing example).

We now will:

  1. Simulate ten haplotype sets from the SHAPEIT2 graph with different seeds
  2. Calculate the recombination map each meiosis in the haplotype sets using duoHMM
  3. Use the python script mapavg.py to average over these maps

for i in {1..10};
do
  ./shapeit -convert --input-graph duohmm-example.graph --output-sample sim${i} --seed ${i}
  ./duohmm -H sim${i} -M genetic_map_chr10_combined_b37.txt -R sim${i}.rec
done

python mapavg.py *.rec > duohmm-example-recombinations.txt

The resulting output in duohmm-example-recombinations.txt has five columns:

CHILD PARENT START END PROB_RECOMBINATION ID011 ID001 6526497 6543063 0.9997400 ID008 ID022 68442048 68482454 0.9941613 ID008 ID022 87537219 87684681 0.9995396 ID008 ID022 89965473 90008047 0.9995075 ID014 ID013 4522299 4575807 0.9721813 ID014 ID013 82866361 82979066 0.9776638

The first two columns are the parent and child involved in the meisosis, the START and END columns are the regions where a crossover may have occurred (flanking heterozygote sites on the parents) and the final column is the probability of a crossover event occurring.

Note: this routine has been shown to have good specificity, that is it has low false positive rate when a high probability threshold is used. The power to detect recombination events is very dependendant on the structure of the pedigree. For example, in a nuclear family with >2 siblings most crossover events should be detectable. Pedigrees with such structure are informative towards recombination.

We have also demonstrated some power to detect crossovers in uninformative meisoses, however this is very dependant on how much more distant IBD sharing is present in a cohort.