
Updated GATK workflow to HaplotypeCaller and gVCF

I’ve updated my GATK workflow to GATK’s joint genotyping genomic VCF (gVCF) workflow, implemented in GATK 3.4. I’ll provide the entire workflow here, but only the HaplotypeCaller step has changed, from:

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

To:

#Call variants (HaplotypeCaller) and prepare for genotyping

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 --emitRefConfidence GVCF -o raw_gVCF.vcf
and then:

#Genotype GVCFs

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant raw_gVCF.vcf -o raw.vcf
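The point of the gVCF route is joint genotyping: one gVCF per sample, then a single GenotypeGVCFs step over all of them. A hypothetical bash sketch of that pattern follows; the sample names and bam file names are assumptions, and each command is printed with `echo` as a dry run so you can inspect it before executing.

```shell
#!/usr/bin/env bash
# Dry-run sketch of multi-sample joint genotyping with gVCFs.
# Sample names and bam names are placeholders; remove 'echo' to execute.
GATK="$HOME/bin/GATK3.4/GenomeAnalysisTK.jar"
REF="reference.fasta"
SAMPLES="PA01 PA02 PA03"    # placeholder sample names

VARIANT_ARGS=""
for S in $SAMPLES; do
    # Per-sample gVCF call (ploidy 1, as in the commands above)
    echo java -jar "$GATK" -T HaplotypeCaller -R "$REF" \
        -I "${S}.realigned.bam" -ploidy 1 --emitRefConfidence GVCF \
        -o "${S}.raw_gVCF.vcf"
    # Accumulate one --variant flag per sample for the joint step
    VARIANT_ARGS="$VARIANT_ARGS --variant ${S}.raw_gVCF.vcf"
done

# Joint genotyping across all samples at once
echo java -jar "$GATK" -T GenotypeGVCFs -R "$REF" $VARIANT_ARGS -o joint_raw.vcf
```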

Full workflow (filtering is advisable once variants have been called):

#Index reference
bwa index reference.fasta

#Index reference with samtools
samtools faidx reference.fasta

#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Index bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

#Create realignment targets
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller) and prepare for genotyping
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 --emitRefConfidence GVCF -o raw_gVCF.vcf

#Genotype GVCFs
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant raw_gVCF.vcf -o raw.vcf

Small Bugs, Big Data

In 2000, a short 23 years after Fred Sanger first conceived dideoxy chain-termination (Sanger) DNA sequencing, the Human Genome Project reached draft completion. The public project had cost an estimated $2.7 billion, and the late entry of the profit-driven Celera Genomics injected some Hollywood drama. Bill Clinton and Tony Blair jointly announced the draft, attempting to convey the sheer human accomplishment the project represented as well as its medical, ethical and philosophical implications. They also announced that no raw data would be available to patent, that the human genome was public; as a result, Celera’s investors endured the second-largest one-day stock market fall of all time. As messy as the politics had become and as difficult as the project proved, one thing was absolutely clear: biology had become a big science, with big money and big data.

Now, sixteen years after the human genome draft, it is medical and clinical microbiology that is enjoying a revolution in methodology. High-throughput sequencing of microorganisms, with their comparatively tiny genomes, is producing more data than ever before for medical microbiologists seeking to understand their biology, and the substantially more portable Oxford Nanopore sequencing technologies are poised to change how clinical microbiology is done. Microbes, particularly bacteria, threaten public health and, simultaneously, the crops and livestock upon which we depend. Developing countries in particular stand to benefit greatly from fast, efficient, powerful and inexpensive sequence-based microbiological methods in both research and the clinic.

The first bacterial genome, that of Haemophilus influenzae, was completely sequenced in 1995. Sequencing of H. influenzae and the projects of the following years was labour-intensive and required massive six-figure budgets, with entire laboratories dedicated to completing, or ‘closing’, the gaps left in the genome by computational sequence assemblers. From around 2005, ‘second generation’ sequencers allowed a massive increase in throughput while the price decreased dramatically, an oft-stated fact but one that remains worthy of celebration, as the trend has only continued since.

Increasingly sophisticated algorithms for handling raw sequencing data have accompanied the technological advance of the sequencing platforms themselves. The raw output of DNA sequencing, so-called shotgun sequencing, consists of fragments from many copies of a genome, which must be assembled. Early genome assemblers relied upon overlap-layout-consensus (OLC) algorithms, while high-coverage short-read sequencers demanded the development of graph-based de Bruijn algorithms and accompanying heuristics, particularly for larger genomes. These advances in sequencing and computation began to produce ‘draft’ genomes of high enough quality (especially if a closely related, high-quality reference was already available) to eliminate the substantial rate-limiting step of ‘finishing’ genomes in the lab. Armed with a draft genome, and with online tools for most imaginable questions in microbial genomics (for not-too-large datasets), researchers can ask more about their chosen microbe of study than ever before.

Advances are being made not only in tracking pathogens and in unravelling the evolution of antibiotic resistance, population structure and microbial adaptation; sequencing and bioinformatics are also democratising science. Open tools, the soaring popularity of preprints, open data and a thriving community making the most of social media all appear to be promoting collaboration and reducing the outdated, closed-off, paranoid science done in isolation.

Variant calling with GATK

Different variant callers disagree a great deal, for single nucleotide polymorphisms (SNPs) and particularly for insertions and deletions (indels). Of the various methods available (samtools, VarScan, FreeBayes, ReadXplorer, etc.), GATK, by the Broad Institute, is the best. The HaplotypeCaller module, which performs local de novo assembly around indels, has recently been updated to support non-diploid organisms, so it is used here, though for bacterial genomes at least there is little difference between HaplotypeCaller and UnifiedGenotyper.

Following conversations at a few conferences and meetings, most recently the brilliant 2nd ISCB Bioinformatics Symposium at TGAC, I’ve realised that others have had the same issues I had implementing GATK. These mostly arise when preparing your data beforehand, so here is a brief run-through of exactly how to call variants from microbial genome short-read data using GATK, which I hope is useful!

-Dependencies:

You’ll need to install samtools, Picard tools, GATK and bwa, and optionally vcffilter, for this workflow. Picard tools and GATK are simply .jar files, so that’s no problem, and you probably already have bwa installed; otherwise, installation is well documented!
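Before starting, it can save some frustration to confirm everything is reachable. A quick sanity-check sketch follows; the jar paths are assumptions, so adjust them to wherever you installed Picard tools and GATK.

```shell
# check_path: report whether a file exists at the given path
check_path() {
    [ -e "$1" ] && echo "ok: $1" || echo "missing: $1"
}

# Tools expected on PATH
for tool in bwa samtools; do
    command -v "$tool" >/dev/null && echo "ok: $tool" || echo "missing: $tool"
done

# Jar files at their (assumed) install locations
check_path "$HOME/bin/GATK3.3/GenomeAnalysisTK.jar"
check_path "$HOME/bin/picard-tools-1.8.5/SortSam.jar"
```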

-The workflow

This workflow begins with short-read (fastq) files and a fasta reference. First the reference is indexed and a sequence dictionary created; short reads are aligned to the reference with read group information provided; the resulting sequence alignment map (sam) file is sorted and converted to binary alignment map (bam) format; duplicates are marked; the bam file is indexed; indel targets are identified; indels are realigned; and variants are called. Simple!

For simplicity, an example set of commands is provided here, where the reference is reference.fasta and the short reads are R1.fastq.gz and R2.fastq.gz. You will need to enter the paths and versions of the software being used at each step, along with your actual file names. Ploidy is set to 1.

See Updated GATK workflow to HaplotypeCaller and gVCF for an implementation of the more recent gVCF workflow. Otherwise, HaplotypeCaller alone (below) will work just fine.


#Index reference
bwa index reference.fasta

#Index reference with samtools
samtools faidx reference.fasta

#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Index bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

#Create realignment targets
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller)
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

The resulting vcf file will contain your variant calls!

Then you can optionally filter the variants:

#Filter variants
~/bin/vcflib/bin/vcffilter -f 'DP > 9' -f 'QUAL > 10' raw.vcf > filtered.vcf
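If vcffilter isn’t available, the same DP and QUAL thresholds can be applied with a plain awk sketch. This is an illustration rather than a drop-in replacement: it assumes QUAL is in column 6 and a per-site DP annotation appears in the INFO column (column 8).

```shell
# filter_vcf: keep header lines plus records with QUAL > 10 and an
# INFO DP greater than 9, mirroring the vcffilter expression above.
filter_vcf() {
    awk -F'\t' '
        /^#/ { print; next }                       # pass headers through
        {
            dp = 0
            if (match($8, /(^|;)DP=[0-9]+/)) {     # pull DP out of INFO
                dp = substr($8, RSTART, RLENGTH)
                sub(/.*DP=/, "", dp)
            }
            if ($6 + 0 > 10 && dp + 0 > 9) print   # QUAL is column 6
        }' "$1"
}
# Usage: filter_vcf raw.vcf > filtered.vcf
```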

Or first split the raw.vcf file into SNPs and indels:

#Extract SNPs
java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType SNP -o snps.vcf

#Extract Indels
java -jar ~/bin/GATK/GenomeAnalysisTK.jar -T SelectVariants -R reference.fasta -V raw.vcf -selectType INDEL -o indels.vcf

I also have a neat Perl wrapper to automate this workflow over many short-read files and would be happy to make it available if people are interested. Please do comment with any questions or issues and I’ll do my best to resolve them!
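The wrapper itself isn’t shown here, but the core of any such automation is just a loop over read pairs. A minimal bash sketch follows, which prints the per-sample alignment command; the _R1/_R2 file naming convention is an assumption, and the remaining workflow steps would be added to the function in the same way.

```shell
# plan_sample: print the bwa alignment command for one read pair,
# deriving the mate file and sample name from an assumed _R1/_R2
# naming convention (e.g. PA01_R1.fastq.gz / PA01_R2.fastq.gz).
plan_sample() {
    local R1="$1"
    local R2="${R1/_R1/_R2}"            # matching mate file
    local S="${R1%%_R1.fastq.gz}"       # sample name from file prefix
    printf 'bwa mem -R "@RG\\tID:%s\\tPL:ILLUMINA\\tLB:%s\\tSM:%s" reference.fasta %s %s > %s.aln.sam\n' \
        "$S" "$S" "$S" "$R1" "$R2" "$S"
}
# Usage: for R1 in *_R1.fastq.gz; do plan_sample "$R1"; done
```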