Updated GATK workflow to HaplotypeCaller and gVCF

I’ve updated my GATK workflow to GATK’s joint genotyping genomic VCF (gVCF) workflow, implemented in GATK3.4. I’ll provide the entire workflow here but it’s only the HaplotypeCaller step that is changed from:

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

To:

#Call variants (HaplotypeCaller) and prepare for genotyping

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 –emitRefConfidence GVCF -o raw_gVCF.vcf
and then:

#Genotype GVCFs

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta –variant raw_gVCF.vcf -o raw.vcf

Full workflow

Filtering is advisable once variants are called

#Index reference
bwa index reference.fasta

#Sort reference
samtools faidx reference.fasta

#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R “@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:someID” reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Sort bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

#Create realignment targets
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller) and prepare for genotyping
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 –emitRefConfidence GVCF -o raw_gVCF.vcf

#Genotype GVCFs
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta –variant raw_gVCF.vcf -o raw.vcf

7 thoughts on “Updated GATK workflow to HaplotypeCaller and gVCF”

Pingback: Variant calling with GATK | approachedinthelimit

Arup Ghosh August 16, 2017 at 5:27 am

I guess in the “Mark duplicates” part you should include the “REMOVE_DUPLICATES=true ” to remove the duplicates.

Reply ↓

mattmoorebioinf Post authorAugust 16, 2017 at 11:23 am

Hi Arup,

I can look into this further but I suspect it’s due to the picard tools version (1.8.5). In the most up to date versions the commands are totally different! I may update the post to reflect this

Reply ↓

Samantha September 7, 2017 at 3:29 pm

Thank you for your detailed pipeline.

Can you explain why you are using HaplotypeCaller instead of UnifiedGenotyper? It seems like UnifiedGenotyper is better suited for haploid organisms.

Reply ↓

mattmoorebioinf Post authorSeptember 18, 2017 at 2:44 pm

Hi Samantha,

So initially I had to use UnifiedGenotyper because HaplotypeCaller didn’t deal with anything other than diploid organisms. They updated it (I’m not sure from which version) to then include non-diploid organisms including haploid and pooled samples of multiple-ploidy.

I’ve previously generated simulated reads (with putative SNPs and indels) and compared the major variant callers, the results were basically the same for UnifiedGenotyper and HaplotypeCaller. HaplotypeCaller though should work better for indels due to its de novo assembly around variants (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php).

Also it seems that GATK will capture large insertions but not necessarily large deletions, it’s always worth checking genes of interest by assembly or from the sam or bam files that a huge chunk is missing anyway.

Let me know if this answers your question!

Reply ↓

Na Su November 13, 2017 at 5:37 am

Dear author:
Thank you a lot for sharing this detailed protocol for bacteria sequencing data processing. I’m new to this field and was confused about all kinds of software version problems. It seems to me that when you using GATK HaplotypeCaller to do variants calling, you no longer need to do the indel realignment thing (according to GATK’s offical documents https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php). Can you explain this to me? Why you still do this before var calling? And I also want to know that if your variants filtering methods from your original post can still be used to this pipeline? Look forward to your reply, I learned a lot, thanks!

Reply ↓

mattmoorebioinf Post authorNovember 13, 2017 at 11:49 am

Hi Na Su,

I’ve just checked that and you’re right, because HaplotypeCaller deals with this you don’t need it in the workflow anymore!

Best,

Matt

Reply ↓

Pingback: Variant calling with GATK | approachedinthelimit
Arup Ghosh August 16, 2017 at 5:27 am

I guess in the “Mark duplicates” part you should include the “REMOVE_DUPLICATES=true ” to remove the duplicates.

Reply ↓
1. mattmoorebioinf Post authorAugust 16, 2017 at 11:23 am
  
  Hi Arup,
  
  I can look into this further but I suspect it’s due to the picard tools version (1.8.5). In the most up to date versions the commands are totally different! I may update the post to reflect this
  
  Reply ↓
Samantha September 7, 2017 at 3:29 pm

Thank you for your detailed pipeline.

Can you explain why you are using HaplotypeCaller instead of UnifiedGenotyper? It seems like UnifiedGenotyper is better suited for haploid organisms.

Reply ↓
1. mattmoorebioinf Post authorSeptember 18, 2017 at 2:44 pm
  
  Hi Samantha,
  
  So initially I had to use UnifiedGenotyper because HaplotypeCaller didn’t deal with anything other than diploid organisms. They updated it (I’m not sure from which version) to then include non-diploid organisms including haploid and pooled samples of multiple-ploidy.
  
  I’ve previously generated simulated reads (with putative SNPs and indels) and compared the major variant callers, the results were basically the same for UnifiedGenotyper and HaplotypeCaller. HaplotypeCaller though should work better for indels due to its de novo assembly around variants (https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php).
  
  Also it seems that GATK will capture large insertions but not necessarily large deletions, it’s always worth checking genes of interest by assembly or from the sam or bam files that a huge chunk is missing anyway.
  
  Let me know if this answers your question!
  
  Reply ↓
Na Su November 13, 2017 at 5:37 am

Dear author:
Thank you a lot for sharing this detailed protocol for bacteria sequencing data processing. I’m new to this field and was confused about all kinds of software version problems. It seems to me that when you using GATK HaplotypeCaller to do variants calling, you no longer need to do the indel realignment thing (according to GATK’s offical documents https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php). Can you explain this to me? Why you still do this before var calling? And I also want to know that if your variants filtering methods from your original post can still be used to this pipeline? Look forward to your reply, I learned a lot, thanks!

Reply ↓
1. mattmoorebioinf Post authorNovember 13, 2017 at 11:49 am
  
  Hi Na Su,
  
  I’ve just checked that and you’re right, because HaplotypeCaller deals with this you don’t need it in the workflow anymore!
  
  Best,
  
  Matt
  
  Reply ↓

approachedinthelimit

Small bugs, big data.

Updated GATK workflow to HaplotypeCaller and gVCF

7 thoughts on “Updated GATK workflow to HaplotypeCaller and gVCF”

Leave a reply to mattmoorebioinf Cancel reply

Share this:

Related

7 thoughts on “Updated GATK workflow to HaplotypeCaller and gVCF”

Leave a reply to mattmoorebioinf Cancel reply