Variant calling with GATK

Different variant callers disagree a great deal, for single nucleotide polymorphisms (SNPs) and particularly for insertions and deletions (indels). Of the various methods available (samtools, varscan, freebayes, ReadXplorer etc) GATK, by the Broad Institute is the best. The HaplotypeCaller module which performs local de novo assemblies around indels has recently been updated to include non-diploid organisms, so it’s used here but there is little difference for bacterial genomes at least, between HaplotypeCaller and UnifiedGenotyper.

Following conversations at a few conferences and meetings now, most recently the brilliant 2nd ISCB Bioinformatics Symposium at TGAC, I’ve realised that others have had the same issues I had with implementing GATK. This mostly arises in preparation of your data beforehand, so here is a brief run-through of exactly how to call variants from microbial genome short read data using GATK which I hope is useful!:


You’ll need to install samtools, picardtools, GATK, bwa and optionally, vcffilter for this workflow. Picardtools and GATK are simply .jar files so that’s no problem while you probably already have bwa installed, otherwise installation is well documented!

-The workflow

This workflow begins with short read (fastq) files and a fasta reference. First  the reference is prepared, a sequence dictionary is created, short reads are aligned to the reference and read group information provided, resulting sequence alignment map (sam) file sorted and converted to binary alignment map (bam) format, duplicates marked, bam file sorted, indel targets identified, indels realigned and variants called. Simple!

For simplicity an example set of commands are provided here, where the reference is reference.fasta and the short reads are R1.fastq.gz and R2.fastq.gz. You will need to enter the paths and versions of the software being used at each step and your actual file names. Ploidy is set to 1.

See Updated GATK pipeline to HaplotypeCaller + gVCF for implementation of the more recent gVCF workflow. Otherwise HaplotypeCaller alone (below) will work just fine

#Index reference
bwa index reference.fasta

#Sort reference
samtools faidx reference.fasta

#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

The resulting vcf file will contain your variant calls!

Then you can optionally filter the variants:

#Filter variants
~/bin/vcflib/bin/vcffilter -f ‘DP > 9’ -f ‘QUAL > 10’ raw.vcf > filtered.vcf

Or first split the raw.vcf file into SNPs and indels:

#Extract SNPs
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf

#Extract Indels
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf

I also have a neat perl wrapper to automate this workflow over many short read files and would be happy to make this available if people are interested. Please do comment with any questions or issues and I’ll do my best to resolve them!


19 thoughts on “Variant calling with GATK

    1. mattmoorebioinf Post author


      Thanks for the feedback and sorry for the slow response. I’ve been away!

      That line specifies what you want the read group headers of your alignment file to be (it doesn’t matter too much), but there (point 9) gives a detailed explanation of what these contain:

      So yours would, of course, bed different to the example I’ve given! The bwa manual also specifies how to include the read group header at this step. I find this finds less problems than using GATK’s ‘AddorReplaceReadGroups’ module later on.

  1. baocheng

    Hi, it is a clear guideline for variant calling. Is it possible to share the perl script for automating this workflow with many read files?

  2. Annie

    Is the ‘PA01.fasta’ reference used in the #Indel realignment step different to the ‘reference.fasta’ used in the steps before and after? If so, how is it different.


    1. mattmoorebioinf Post author

      Thanks will edit now!

      Also, a postdoc in my group has published a pipeline which takes short reads and a reference, performs all these steps (except it has been updated to use gVCF rather than haplotype caller), can produce a multiple alignment, phylogenetic trees and introduces a novel filter for false positive variants:

      I highly recommend it!

      1. dR_kaOs

        Fantastic help, thank you for sharing it. My script is working.
        One stupid question but I really want to know.. If we process several VCF files, we need to index the reference once or for every VCF file?
        I think once once but I want to confirm.

      2. mattmoorebioinf Post author

        Yes, it just needs to be indexed once! So in my perl wrapper for example I index the fasta reference and then loop over all vcf files for analysis

  3. Pingback: Updated GATK pipeline to HaplotypeCaller + gVCF | approachedinthelimit

  4. Ramani

    Hi, is it necessary to do base recalibration step? I saw it on GATK site that they expect us to do this, any experiences how does it affect for bacterial short read data variant calling ?

    1. mattmoorebioinf Post author

      Hi Ramani,

      Sorry for the slow response, I don’t seem to get notifications for comments.

      It is advisable to use the base calibration step, yes. The sequencer can miscalculate the base quality not taking into account some what proportion of mismatches there are with the reference.

      I’ve missed it out because while you can repeat this step to converge upon potential mismatches it’s best used when the organism you’re working with has a database of known SNPs

  5. lidi

    First, Thank you for sharing this nice job.
    I’m really new in this field am trying to write a script for DAN-seq variant calling according GATK Pipeline….
    Your pipeline is really helpful for me to learn..
    Just I have a question for you: I think the Output for the “Sort Bam file” step is a Bai file(A bam index file ,dedup.bai ). May I know for the next steps how you this file?
    Thanks again

    1. mattmoorebioinf Post author

      Hi and thanks, I hope it’s a helpful guide.

      So that step indexes the bam file, when using bamtools for example this will be required (simply has to be in the same directory). Does that answer your question?

      Best regards,


  6. Ati

    Thank you very much for sharing this nice job.
    I think the output for “Sort Bam file” step is a Bai file (Bam index file). May I know how you use this file for the other steps?


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s