Small Bugs, Big Data

In 2000, a short 23 years after Fred Sanger first conceived dideoxy chain-termination (Sanger) DNA sequencing, the Human Genome Project reached draft completion. The public project had cost an estimated $2.7 billion, and the late arrival of the profit-driven Celera Genomics injected some Hollywood drama. Bill Clinton and Tony Blair jointly announced the draft and attempted to convey the sheer human accomplishment the project represented, as well as its medical, ethical and philosophical implications. They also announced that no raw data would be available to patent, that the human genome was public, which resulted in Celera’s investors enjoying the second largest one-day fall on the stock market of all time. As messy as the politics had become and as difficult as the project proved, one thing was absolutely clear: biology had become a big science with big money and big data.

Now, sixteen years after the human genome draft, it is medical and clinical microbiology that is enjoying a revolution in methodology. High-throughput sequencing of microorganisms, with their comparatively tiny genomes, is producing more data than ever before, helping medical microbiologists to better understand the biology of these organisms. It is also poised to change how clinical microbiology is done, thanks to the substantially more portable Oxford Nanopore sequencing technologies. Microbes, particularly bacteria, threaten public health and, simultaneously, the crops and livestock upon which we depend. Developing countries in particular stand to benefit greatly from fast, efficient, powerful and inexpensive sequence-based microbiological methods in research and the clinic.

The first bacterial genome, that of Haemophilus influenzae, was completely sequenced in 1995. The sequencing of H. influenzae, and the projects of the following years, were labour-intensive and required massive six-figure budgets, with entire laboratories dedicated to completing the genome by ‘closing’ the gaps left by the computational sequence assemblers. From around 2005, ‘second generation’ sequencers allowed a massive increase in throughput while the price decreased dramatically. It is an often repeated fact, but one that remains worth celebrating, as the trend has only continued since.

Increasingly sophisticated algorithms for dealing with raw sequencing data have accompanied the technological advance of sequencing platforms. In so-called shotgun sequencing, many copies of a genome are fragmented, and the reads obtained from those fragments make up the raw output, which must then be assembled. Early genome assemblers relied upon overlap-layout-consensus (OLC) algorithms, while sequencers producing shorter reads at high coverage demanded the development of graph-based de Bruijn algorithms and accompanying heuristics, particularly for larger genomes. These advancements in sequencing and computation began to produce ‘draft’ genomes of high enough quality (especially if a closely related, high-quality reference was already available) to eliminate the substantial rate-limiting step of ‘finishing’ genomes in the lab. Armed with a draft genome and a number of online tools for most imaginable questions in microbial genomics (for datasets that are not too large), researchers can ask more about their chosen microbe of study than ever before.
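To give a rough feel for the de Bruijn idea, here is a minimal Python sketch. It is a toy under stated assumptions, not how any particular assembler works: the function names, the three made-up reads and the choice of k = 4 are all invented for illustration, and it ignores sequencing errors, reverse complements, repeats and coverage, which real assemblers have to handle.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges are k-mers."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer links its (k-1)-mer prefix to its (k-1)-mer suffix.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop rather than loop forever on a cycle
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical error-free reads that overlap one another.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATG")
print(contig)
```

With these toy reads every node has a single successor, so the walk stitches the three reads back into one sequence. In real data, repeats and errors create branches in the graph, and that is where the heuristics come in.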

Advances are being made in tracking pathogens and in unravelling the evolution of antibiotic resistance, the population structure and the adaptation of microbes. Beyond that, sequencing and bioinformatics are having a democratising effect: open tools, the soaring popularity of pre-prints, open data and a thriving community making the most of social media are allowing science to be done in a legitimately new way.

