04 - Variant calling
Variant Calling Software
Once we have our .bam
files for all of our individuals, we can move forward with variant calling. There are many, many variant callers out there, and which one you want to use will depend on the specifics of your data and your planned analyses with those data. Some of the most popular variant callers for short read sequencing data and their features include:
- GATK – a very common analysis pipeline developed by the Broad Institute (i.e., developed for use in humans) that includes the
HaplotypeCaller
module for calling variants within a population - SAMtools/BCFtools
- ANGSD – a software for analyzing next generation sequencing data that has a lot of flexibility and can handle a number of different input types. Most of its methods take genotype uncertainty into account instead of basing the analysis on called genotypes, which is especially useful for low and medium depth data.
- Freebayes – a Bayesian variant caller that uses a different method of variant detection (i.e., haplotype-based variant detection) than other popular software (see their documentation for more details)
If you are interested in reading more about variant calling and the different options, Nielsen et al. (2011) provides a nice review, with a focus on mitigating uncertainty in genotype calling.
Brief aside: genotype likelihoods and genotype probabilites
Often, we are not 100% certain about a base call at a specific site, especially when we have lower read depth. In these situations, we can use genotype probabilities or likelihoods instead of genotype calls, which allow us to quantify our uncertainty (or certainty) in a given genotype and to use that in downstream calculations. For moderate or low sequencing depths, genotype calling based on fixed cutoffs will typically lead to under-calling of heterozygous genotypes. The advantages of the probabilistic methods are that they provide measures of statistical uncertainty when calling genotypes, they lead to higher accuracy of genotype calling, and they provide a natural framework for incorporating information regarding allele frequencies and patterns of LD into our inferences.
Genotype likelihoods are the probability of the observed data (i.e., the reads) given a specific genotype. The likelihoods are presented in threes for diploid data, which corresponds to the likelihood that the true genotype is AA, AB, and BB. All four of the variant callers above can provide genotype likelihoods in the output VCF files. Many downstream methods require called genotypes rather than likelihoods; however, there are more and more methods being developed that will do analyses on likelihoods, so those are beneficial to know about. While likelihoods are necessary for low-depth data, it is also good practice to use them where possible even for higher-depth data, as they allow you to incorporate uncertainty into your downstream analyses instead of ignoring this potential source of error. There is a more in-depth explanation of genotype likelihoods for those interested in the variant calling section of Eric Anderson’s Bioinformatics Handbook.
Calling variants using BCFtools and SAMtools
For today, we’ll be working with the BCFtools/SAMtools pipeline for calling variants and dealing with “hard” genotype calls, as detailed here. To do this, we’ll want to have all of the sorted .bam
files in the same directory, and we’ll also need to know the path to our reference genome. Once we have both of those, we can use the following code to call variants across all of our samples combined:
module load bcftools
bcftools mpileup -Ou -a AD,DP -f reference.fa bamfiles/*.sorted.bam | bcftools call -m -v -Ov -o variants.vcf
To see what each of the flags that we use in this code means (and to see what other options exist), we can check out the BCFtools mpileup
and call
manual pages (https://samtools.github.io/bcftools/bcftools.html#mpileup and https://samtools.github.io/bcftools/bcftools.html#call).
For bcftools mpileup
:
-a
- Annotate the vcf - here we add allelic depth (AD) and genotype depth (DP).-O
- the output type. Here it isu
which means we do not compress the output.-f
- specify the reference genome to call variants against.
For bcftools call:
-v
- output variant sites only - i.e. ignore non-variant parts of the reads-m
- use bcftools multiallelic caller-O
- specify the output type, here it isv
- i.e. vcf; you could also choosez
, which would mean a compressed (zipped) vcf-o
output path
The pipe (|
) in between these two calls allows us to pass the results from the bcftools mpileup
command directly into bcftools call
instead of creating an intermediate file. If we run this code, we will see a warning that only the first 250 reads for each region are examined and we may want to increase this number. To increase the read depth analyzed, we can add a -d 1000
option to increase it to the first 1000 reads.