Different variant callers disagree a great deal, for single nucleotide polymorphisms (SNPs) and particularly for insertions and deletions (indels). Of the various methods available (samtools, varscan, freebayes, ReadXplorer etc) GATK, by the Broad Institute is the best. The HaplotypeCaller module which performs local de novo assemblies around indels has recently been updated to include non-diploid organisms, so it’s used here but there is little difference for bacterial genomes at least, between HaplotypeCaller and UnifiedGenotyper.
Following conversations at a few conferences and meetings now, most recently the brilliant 2nd ISCB Bioinformatics Symposium at TGAC, I’ve realised that others have had the same issues I had with implementing GATK. This mostly arises in preparation of your data beforehand, so here is a brief run-through of exactly how to call variants from microbial genome short read data using GATK which I hope is useful!:
You’ll need to install samtools, picardtools, GATK, bwa and optionally, vcffilter for this workflow. Picardtools and GATK are simply .jar files so that’s no problem while you probably already have bwa installed, otherwise installation is well documented!
This workflow begins with short read (fastq) files and a fasta reference. First the reference is prepared, a sequence dictionary is created, short reads are aligned to the reference and read group information provided, resulting sequence alignment map (sam) file sorted and converted to binary alignment map (bam) format, duplicates marked, bam file sorted, indel targets identified, indels realigned and variants called. Simple!
For simplicity an example set of commands are provided here, where the reference is reference.fasta and the short reads are R1.fastq.gz and R2.fastq.gz. You will need to enter the paths and versions of the software being used at each step and your actual file names. Ploidy is set to 1.
See Updated GATK pipeline to HaplotypeCaller + gVCF for implementation of the more recent gVCF workflow. Otherwise HaplotypeCaller alone (below) will work just fine
bwa index reference.fasta
samtools faidx reference.fasta
#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict
#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam
#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt
#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam
#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam
#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf
The resulting vcf file will contain your variant calls!
Then you can optionally filter the variants:
~/bin/vcflib/bin/vcffilter -f ‘DP > 9’ -f ‘QUAL > 10’ raw.vcf > filtered.vcf
Or first split the raw.vcf file into SNPs and indels:
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf
I also have a neat perl wrapper to automate this workflow over many short read files and would be happy to make this available if people are interested. Please do comment with any questions or issues and I’ll do my best to resolve them!