IMPUTE2

MVNcall

MVNCALL is a program developed for genotype calling and phasing using low-coverage next-generation sequencing reads information. The software assumes a study design where the samples are sequenced and also genotyped in a microarray. The approach assumes that the microarray genotypes have been phased using a haplotype estimation program. We recommend SHAPEIT for this purpose. The phased haplotypes then serve as a haplotype scaffold in MVNCALL. Then, each polymorphic site is phased one at a time onto the haplotype scaffold.

In the figure below illustrates how MVNcall works. The method implements an MCMC algorithm in which unobserved genotypes are iteratively updated. The figures illustrates how a site s is phased onto the haplotype scaffold. At each iteration, the posterior genotypes of an individual (in purple) are estimated given the (i) haplotype scaffold (in blue), (ii) the genotypes at the site at all other individuals (in green), and (iii) the flanking haplotypes of the individual in a given iteration (in orange).

The approach is described in our paper

A. Menelaou and J. Marchini (2013) Genotype calling and phasing using next-generation sequencing reads and a haplotype scaffold. Bioinformatics 29(1):84-91[Link]

MVNCALL has the following properties:

It analyses one site at a time
It is highly parallelizable
It requires low RAM usage.
It can leverage duo/trio information when available
It can analyse multi-allelic variants

What's New?

First software release (30 December 2013)

We have just released the first version MVNcall v1.0

Download MVNcall

MVNcall is freely available for academic use. To see rules for non-academic use, please read the LICENCE file, which is included with each software download.

Pre-compiled MVNcall binaries and example files can be downloaded from the links below. For Linux machines, the dynamic binaries are smaller but may not work on some machines due to gcc library compatibility issues; if the dynamic version doesn't work for you, please try the static version. If you have any problems getting the program to work on your machine or would like to request an executable for a platform not shown here, please send a message to our mail list.

The latest software release is v1.0. We support only the most recent version.

Platform	File
Linux (x86_64) Dynamic Executable	mvncall_v1.0_x86_64_dynamic.tgz

To unpack the files on a Linux computer, use a command like this:

tar -zxvf mvncall_v1.0_i686.tgz

(Other file decompression programs are available for non-Linux computers.) This will create a directory of the same name as the downloaded file, minus the '.tgz' suffix. Inside this directory you will find an executable called mvncall, a LICENCE file, and an Example/ directory that contains example data files. We show how to perform various kinds of analyses with the example files here.

Examples

This section provides some example commands that illustrate typical applications of MVNcall. All of the data files used in these commands are included in the Example/ directory that comes with the software download. You should run the commands from the main download directory (i.e., the one that contains the mvncall executable). Detailed explanations are provided at each link below.

Run type	Description
Imputation with default parameters	Basic scenario in which most people will use MVNcall.
Imputation with model averaging option	As above, but with model averaging which averages over two models.
Imputation with duo/trio related samples	As above, but with related samples.

How to use example commands

All of the data files in the example commands below are included in the example/ directory that comes with the MVNcall software download. You should run the command from the main download directory, which is the one that contains the mvncall executable. For example, if you just downloaded a software package named mvncall_v1.0_MacOSX.intel.tgz and unpacked it according to the directions here, you can reach the appropriate directory by typing "cd mvncall_v1.0_MacsOSX.intel/" on the command line.

Once you have found the right directory, you should be able to run the example command by entering it into a Unix-style terminal window. Depending on the settings of your computer, this may be as simple as highlighting the command text in your web browser, using the browser's Copy command, and then using the Paste command in your terminal window. (You may then need to hit Enter to start the run.)

Note that most lines in the example command end with the '\' character. This is not actually part of the command; it is just a shorthand notation that means "keep reading the next line as part of a single command." We use this notation to split the command over multiple lines so it is easier to read. This is a valid way to enter commands in a Unix-style terminal window, but it would be equivalent to put all of the arguments on a single line, separated by spaces.

You do not have to run MVNcall exactly as in the example. Some of the arguments shown here are optional, and there are many other options that could be added to modify the behavior of the program. For a full list of available options, see here.

Most of the examples below include the string "--int 20400000 20500000", which tells the program to produce results for a 100 kb region (positions 20,400,000-20,500,000) on a single chromosome. MVNcall assumes there is only one chromosome per input file, and that all input files in a single run come from the same chromosome. Applying the program to a much larger region—say, a whole chromosome or the whole genome—requires running many such jobs with different values of the --int parameter, usually in parallel on a computing cluster.

Imputation with DEFAULT PARAMETERS

This is the default scenario using the default parameters (for --iteration, --burn-in, --lambda, --numsnps).

The following command shows how to run this kind of analysis with MVNcall, using the example data that come with the program download:

./mvncall \
--sample-file ./example/example.sample \
--scaffold-file ./example/example.haps \
--glfs ./example/example.vcf \
--int 10906922 10906922 \
--o ./example/example1.vcf

Comments

Here we input the --sample-file which corresponds to the order of the haplotypes in the --scaffold-file. The --glfs file is in vcf format. Care should be taken that the ids in the --sample-file match the ids in the --glfs file.
The --int specifies the interval of sites to be imputed. Only sites that are present in the --glfs sites will be imputed. In the case where a site is present in the --scaffold-file, no imputation is performed, and the site is kept the same as in the --scaffold-file.
The --o provides the phased genotypes at the sites analysed in a vcf format.
The default values for the other parameters are used for the imputation of the Phase 1 of the 1000 Genomes Project. For different study designs those parameters might need to be modified.

Imputation with MODEL AVERAGING OPTION

The number of conditioning haplotypes (--k) is fixed to a single value. In order to run MVNcall with two values of --k, one specified by the --k flag and the other with all available haplotypes, the model-averaging option can be used. The two models are ran simultaneously and the averaged posteriors are used to infer the phased genotypes. This procedure seemed to help the imputation of the Phase 1 1000 Genomes data.

The following commands show how to run this kind of analysis with MVNcall, using the example data that come with the program download:

./mvncall \
--model-avg \
--sample-file ./example/example.sample \
--scaffold-file ./example/example.haps \
--glfs ./example/example.vcf \ --list ./example/example.list \
--k 100 \
-o ./example/example2.vcf

Comments

The --model-avg option runs the model with two values of --k (100 and all haplotypes). For low coverage sequencing data, the model averaging option was preferred since it provided lower genotype discordance rates. The computational time is however increased. We recommend you first test the --model-avg option in a small region and compare the genotype discordance with the model ran on a single value of --k before applying it to larger regions.
Instead of choosing an interval through the --int flag to impute with MVNcall, a list of sites can also be inputted where the physical positions of the sites analysed are inputted, one in each row. Note that no chromosome is required in the list, since MVNcall assumes that a single chromosome is inputted.

Imputation with DUo/TRIO RELATED SAMPLES

This scenario assumes that the samples are related (either duos -parent,offspring-, or trios -both parents,offspring-). We highly recommend to use the known relationships of the samples, if available, since this has shown to significantly help both the imputation and phasing of the samples.

The following command shows how to run this kind of analysis with MVNcall, using the example data that come with the program download:

./mvncall \
--sample-file ./example/example.sample \
--scaffold-file ./example/example.haps \
--glfs ./example/example.vcf \
--int 10000000 11000000 \
--k 200 \
--ped-file ./example/example.ped \
-o ./example/example3.vcf

Comments

The --ped-file should follow this format. Make sure that the ids in the ped file agree with the ped files in --sample-file.
The transmitted haplotypes in the --scaffold-file should be: father transmits the first haplotype to the first haplotype of the offspring, and the mother transmits the first haplotype to the second haplotype of the offspring. This is crucial for running the pedigree related model.

Program Options

Flag	Default	Description
--sample-file <file> REQUIRED	none	File containing the IDs of the samples in the haplotype scaffold. There should be no header in the file. The order of the IDs in the the sample file needs to agree with the order of haplotypes in the scaffold file. An example of a sample file can be found here
--scaffold-file <file> REQUIRED	none	File containing the haplotypes of the samples as specified in the sample file. Each row corresponds to a SNP and each column to a haplotype. The first five columns of the file give information regarding the chromosome, the position, and the alleles. For more details see here.
--glfs <file> REQUIRED	none	The glfs file is a vcf format file that contains the genotype likelihoods of the sites to be analysed on the same set of samples that are in the sample file. The genotype likelihoods can be in GL or PL format. When multiple variants have the same physical position, the output will contain pseudo-positions with +1 from the input position. The option supports .gz format.
--ped-file <file>	none	The ped-file contains the duo/trio relationships between the samples studied. The file must follow this format. When a ped-file is not given, then the individuals are assumed to be unrelated. For more details on the format see here.
--o <file> REQUIRED	none	The output file contains the phased genotypes in a vcf format. For more details see here
--mp <file>	none	The file contains the marginal posterior probabilities of the phased genotypes. For bi-allelic sites, the order is RR-AA-AR-RA where R=reference allele and A=alternative allele. For more than 2 alleles, the order follows the same convention as used in vcf format. At each row the position, the reference and alternative allele are given in the first three columns and the following columns correspond to the marginal probabilities of the individuals at the site. An example of the output file is given here
--int <lower> <upper>	none	Genomic interval to use for inference, as specified by <lower> and <upper> boundaries in base pair position. The option is useful for splitting the chromosome in multiple regions and run them parallel.
--list <file>	none	Instead of using a genomic interval, a list of sites with their genomic positions can be given. Only the positions in the list will be imputed. An example of a list file is given here
--numsnps <int>	50	The number of flanking sites that is going to be used for inference at either site of the site of interest.
--lambda <double>	0.06	The penalty parameter added on the diagonal of the variance-covariance matrix. The higher the value of lambda, the more weight is given to the GLs and less to the LD structure to calculate the posterior probabilities of genotypes. The default value is suggested for low/medium coverage data.
--k <int>	all	The number of conditioning haplotypes to be used for inference. Each haplotype in a given individual, conditions on k haplotypes, which do not need to be the same. Perfect matching is used for choosing the closest haplotypes. We recommend to use different values of k before you decide which value to use. For the 1000 Genomes Phase 1 Project and the Genome of the Netherlands, we use k=100
--model-avg	none	Model average flag will run MVNcall twice: once with the value of k as indicated in --k, and another one with k equal to all haplotypes.
--iteration <int>	100	Total number of MCMC iterations to perform, including burn-in. Increasing the number of iterations may improve accuracy, depending on the number of samples and the depth of coverage.
--burn-in <int>	10	Number of MCMC iterations to discard as burn-in.

File formats

Examples of file formats required as inputs in MVNcall are provided.

Sample file (--sample-file): The order of the samples in the haplotype files are required in all analyses. The sample file contains the IDs of the samples, one in each row. No header file is required.

IND1
IND2
IND3
IND4

Scaffold file (-scaffold-file): The study design assumed by MVNcall includes a phased haplotype scaffold. A haplotype scaffold is built by phasing microarray genotypes on the same set of samples that are sequenced. We recommend SHAPEIT2 for building the haplotype scaffold. The output of SHAPEIT2 includes the HAPS and SAMPLE file formats for haplotypes. The scaffold file has the same format as the SHAPEIT2 HAPS files.

1 SNP1 123 A T 1 0 0 1 0 0
1 SNP2 456 G C 0 0 1 0 1 0
1 SNP3 789 T A 0 1 1 1 0 1

The first five columns include:

Chromosome number <integer>

SNP ID <string>

SNP Position <integer>

First allele <string>

Second allele <string>

The subsequent column pair (6,7), (8,9), (10,11), (12,13) corresponds to the alleles carried at that SNP by each haplotype of the four individuals.

Ped file (--ped-file): This is a space delimited file which includes at least five columns. The five columns correspond to:
- Family ID <string>
- Individual ID <string>
- Father ID <string>
- Mother ID <string>
- Sex <integer>
- Phenotype <float>
FAM1 IND1 0 0 1 0
FAM2 IND2 0 0 2 0
FAM3 IND3 IND1 IND2 2 0
Output file (-o): The output file is in vcf format. The phased genotypes are given using the vcf notation, i.e. 0|0, 0|1, 1|0, 1|1.
Posterior probabilities file (--mp): The output file gives the posterior probabilities of the phased genotypes. For example:

123 A T 1 0 0 0 1 0 0 0 1 0 0 0
456 G C 1 0 0 0 0 1 0 0 1 0 0 0
789 T A 0 0 0 1 0 1 0 0 0 0 1 0

List file (--list):

123
456
789

_____________________________________________________________________________________________________________________________________

References

A. Menelaou and J. Marchini (2013) Genotype calling and phasing using next-generation sequencing reads and a haplotype scaffold. Bioinformatics 29(1):84-91[Link]

Contributors

The following people developed the methodology and software for MVNcall:

Androniki Menelaou, Jonathan Marchini

Mail List

If you have a question about MVNcall, please send a message to our mailing list:

http://www.jiscmail.ac.uk/OXSTATGEN

You will need to subscribe to the mailing list to post a question. The list has low but steady traffic, so you may want to redirect the messages to a dedicated e-mail folder if you don't want them all landing in your inbox.

IMPORTANT: If you are having a problem with the software, please include the following details in your e-mail; otherwise, we may not be able to diagnose the problem.

The version number of MVNcall and the type of computer you are using to run it—e.g., "MVNcall v1.0 on Mac OSX 10.6".
Any log files and/or screen output from the program;
For difficult problems like memory access errors (e.g., "segmentation faults"), we may need you to send data files that show the problem. These files should ideally be small, and we can provide suggestions if you are not allowed to share your actual data.