Updated GATK workflow to HaplotypeCaller and gVCF

I’ve updated my GATK workflow to GATK’s joint genotyping genomic VCF (gVCF) workflow, implemented in GATK3.4. I’ll provide the entire workflow here but it’s only the HaplotypeCaller step that is changed from:

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

To:

#Call variants (HaplotypeCaller) and prepare for genotyping

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 –emitRefConfidence GVCF -o raw_gVCF.vcf
and then:

#Genotype GVCFs

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta –variant raw_gVCF.vcf -o raw.vcf

  • Full workflow

Filtering is advisable once variants are called

#Index reference
bwa index reference.fasta

#Sort reference
samtools faidx reference.fasta

#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:someID” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

#Create realignment targets
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller) and prepare for genotyping
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 –emitRefConfidence GVCF -o raw_gVCF.vcf

#Genotype GVCFs
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta –variant raw_gVCF.vcf -o raw.vcf

7 thoughts on “Updated GATK workflow to HaplotypeCaller and gVCF

  1. Pingback: Variant calling with GATK | approachedinthelimit

    1. mattmoorebioinf Post author

      Hi Arup,

      I can look into this further but I suspect it’s due to the picard tools version (1.8.5). In the most up to date versions the commands are totally different! I may update the post to reflect this

      Reply
  2. Samantha

    Thank you for your detailed pipeline.

    Can you explain why you are using HaplotypeCaller instead of UnifiedGenotyper? It seems like UnifiedGenotyper is better suited for haploid organisms.

    Reply
    1. mattmoorebioinf Post author

      Hi Samantha,

      So initially I had to use UnifiedGenotyper because HaplotypeCaller didn’t deal with anything other than diploid organisms. They updated it (I’m not sure from which version) to then include non-diploid organisms including haploid and pooled samples of multiple-ploidy.

      I’ve previously generated simulated reads (with putative SNPs and indels) and compared the major variant callers, the results were basically the same for UnifiedGenotyper and HaplotypeCaller. HaplotypeCaller though should work better for indels due to its de novo assembly around variants (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php).

      Also it seems that GATK will capture large insertions but not necessarily large deletions, it’s always worth checking genes of interest by assembly or from the sam or bam files that a huge chunk is missing anyway.

      Let me know if this answers your question!

      Reply
  3. Na Su

    Dear author:
    Thank you a lot for sharing this detailed protocol for bacteria sequencing data processing. I’m new to this field and was confused about all kinds of software version problems. It seems to me that when you using GATK HaplotypeCaller to do variants calling, you no longer need to do the indel realignment thing (according to GATK’s offical documents https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php). Can you explain this to me? Why you still do this before var calling? And I also want to know that if your variants filtering methods from your original post can still be used to this pipeline? Look forward to your reply, I learned a lot, thanks!

    Reply

Leave a reply to mattmoorebioinf Cancel reply