Tag Archives: Sequencing

Updated GATK workflow to HaplotypeCaller and gVCF

I’ve updated my GATK workflow to GATK’s joint-genotyping genomic VCF (gVCF) workflow, implemented in GATK 3.4. I’ll provide the entire workflow here, but only the HaplotypeCaller step has changed, from:

java -jar ~/bin/GATK3.3/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o raw.vcf

To:

#Call variants (HaplotypeCaller) and prepare for genotyping

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 --emitRefConfidence GVCF -o raw_gVCF.vcf
and then:

#Genotype GVCFs

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant raw_gVCF.vcf -o raw.vcf
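The advantage of the gVCF route is that per-sample gVCFs can then be genotyped together. A minimal sketch of joint genotyping, assuming two hypothetical per-sample gVCFs (sample1_gVCF.vcf and sample2_gVCF.vcf) produced by the HaplotypeCaller step above:

#Joint genotyping across samples (sample names are illustrative)

java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant sample1_gVCF.vcf --variant sample2_gVCF.vcf -o joint_raw.vcf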

  • Full workflow

Filtering is advisable once variants are called (a filtering sketch follows the workflow below).

#Index reference
bwa index reference.fasta

#Index reference with samtools (faidx)
samtools faidx reference.fasta

#Create sequence dictionary
java -jar ~/bin/picard-tools-1.8.5/CreateSequenceDictionary.jar REFERENCE=reference.fasta OUTPUT=reference.dict

#Align reads and assign read group
bwa mem -R "@RG\tID:FLOWCELL1.LANE1\tPL:ILLUMINA\tLB:test\tSM:PA01" reference.fasta R1.fastq.gz R2.fastq.gz > aln.sam

#Sort sam file
java -jar ~/bin/picard-tools-1.8.5/SortSam.jar I=aln.sam O=sorted.bam SORT_ORDER=coordinate

#Mark duplicates
java -jar ~/bin/picard-tools-version/MarkDuplicates.jar I=sorted.bam O=dedup.bam METRICS_FILE=metrics.txt

#Index bam file
java -jar ~/bin/picard-tools-version/BuildBamIndex.jar INPUT=dedup.bam

#Create realignment targets
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fasta -I dedup.bam -o targetintervals.list

#Indel realignment
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T IndelRealigner -R reference.fasta -I dedup.bam -targetIntervals targetintervals.list -o realigned.bam

#Call variants (HaplotypeCaller) and prepare for genotyping
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fasta -I realigned.bam -ploidy 1 --emitRefConfidence GVCF -o raw_gVCF.vcf

#Genotype GVCFs
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fasta --variant raw_gVCF.vcf -o raw.vcf
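As noted above, filtering the raw calls is advisable. A minimal hard-filtering sketch using GATK's VariantFiltration; the annotations and thresholds here are illustrative only and should be tuned to your own data:

#Hard-filter raw variants (expression and thresholds are illustrative)
java -jar ~/bin/GATK3.4/GenomeAnalysisTK.jar -T VariantFiltration -R reference.fasta -V raw.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0" --filterName "basic_filter" -o filtered.vcf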

Small Bugs, Big Data

In 2000, a short 23 years after Fred Sanger first conceived dideoxy chain-termination (Sanger) DNA sequencing, the Human Genome Project reached draft completion. The public project had cost an estimated $2.7 billion, and the late entry of the profit-driven Celera Genomics injected some Hollywood drama. Bill Clinton and Tony Blair jointly announced the draft and attempted to convey the sheer human accomplishment the project represented, as well as its medical, ethical and philosophical implications. They also announced that no raw data would be available to patent, that is, that the human genome was public, an announcement that saw Celera’s investors suffer the second-largest one-day stock market fall of all time. As messy as the politics had become, and as difficult as the project proved, one thing was absolutely clear: biology had become a big science with big money and big data.

Now, sixteen years after the human genome draft, it is medical and clinical microbiology that is enjoying a revolution in methodology. High-throughput sequencing of microorganisms with comparatively tiny genomes is producing more data than ever before, allowing medical microbiologists to better understand their biology, and the substantially more portable Oxford Nanopore sequencing technologies are poised to change how clinical microbiology is done. Microbes, particularly bacteria, threaten public health and, simultaneously, the crops and livestock upon which we depend. Developing countries in particular stand to benefit greatly from fast, efficient, powerful and inexpensive sequence-based microbiological methods in research and the clinic.

The first bacterial genome, that of Haemophilus influenzae, was completely sequenced in 1995. The sequencing of H. influenzae and the projects of the following years were labour-intensive and required massive six-figure budgets, with entire laboratories dedicated to the completion, or ‘closing’, of gaps left in the genome by the computational sequence-fragment assemblers. From around 2005, ‘second generation’ sequencers allowed a massive increase in throughput while the price decreased dramatically, an oft-repeated fact but one that remains worthy of celebration, as the trend has only continued since.

Increasingly sophisticated algorithms for dealing with raw sequencing data have accompanied the technological advance of sequencing platforms. In so-called shotgun sequencing, the raw output consists of fragments from many copies of a genome, which must then be assembled. Early genome assemblers relied upon overlap-layout-consensus (OLC) algorithms, while high-coverage, shorter-read sequencers demanded the development of graph-based de Bruijn algorithms and accompanying heuristics, particularly for larger genomes. These advances in sequencing and computation began to produce ‘draft’ genomes of high enough quality (especially if a closely related, high-quality reference was already available) to eliminate the substantial rate-limiting step of ‘finishing’ genomes in the lab. Armed with a draft genome and a number of online tools for most imaginable questions in microbial genomics (for not-too-large datasets), researchers can ask more about their chosen microbe of study than ever before.

Advances are being made not only in tracking pathogens and in unravelling the evolution of antibiotic resistance, population structure and adaptation of microbes; sequencing and bioinformatics are also democratising science. Open tools, the soaring popularity of pre-prints, open data and a thriving community making the most of social media all appear to be promoting collaboration and reducing outdated, closed-off, paranoid science done in isolation.