Microbes are key drivers of biogeochemical cycles on Earth. Despite their importance, our understanding of this `unseen majority' has been largely restricted by traditional culture-dependent approaches. Recent improvements in molecular techniques have seen a departure from this culture-dependent approach in favour of sequencing microbial genomes directly from environmental samples. These community-based sequencing approaches are commonly referred to as metagenomics, community genomics or ecogenomics. Community genomics is becoming increasingly popular, largely driven by the availability of high-throughput, low-cost `next-generation' sequencers. The application of next-generation sequencing to heterogeneous microbial communities presents unique challenges for analysis, necessitating the development of bioinformatics tools appropriate for the task. This thesis explores and describes methods for analysing microbial community sequence data generated on next-generation sequencing platforms, from both targeted sequencing of amplicons and random shotgun sequencing (metagenomics).
Determining the appropriate strategy for both generating and analysing a metagenome depends on the microbial community structure and overarching purpose of the study. Currently, pyrosequencing of 16S rRNA gene amplicons is the cheapest and most direct means of measuring the diversity and abundance of species in an environmental sample. However, this procedure is highly sensitive to the sequencing errors introduced by the Roche 454 pyrosequencing platform, necessitating `denoising' or error-correction of these amplicons. AmpliconNoise and DeNoiser are two such error-correction tools; however, they share similar limitations: both are computationally intensive and assume command-line proficiency. To address these issues, I describe here a novel error-correction algorithm called Acacia. Acacia gains a speedup of one to two orders of magnitude over AmpliconNoise and DeNoiser through the use of alignment heuristics and maximum-likelihood approaches to correcting pyrosequencing error. To ensure user-friendliness, Acacia has a graphical user interface and is implemented in Java, a cross-platform programming language. Unlike DeNoiser and AmpliconNoise, which both use global metrics to compare reads, Acacia uses a conservative approach to correction which maintains small but significant differences between genuinely distinct amplicons. Consequently, benchmarking revealed that Acacia had the highest error-correction specificity overall, introducing fewer errors than AmpliconNoise and DeNoiser.
It is anticipated that with increased read lengths, the Ion Torrent PGM will soon supersede the Roche 454 as the platform of choice for amplicon sequencing. Since the release of the PGM in 2011, several improvements to both PGM sequencing chemistry and signal-processing software have been implemented, but limited research has been conducted to comprehensively evaluate the accuracy of this platform. In order to assess whether the PGM is an appropriate technology for sequencing amplicon libraries, a sound understanding of the error rate and biases of this instrument is required. Factors associated with PGM sequencing error and biases were identified using 15 Ion Torrent re-sequencing datasets spanning a number of variables, including species, kit, chip and machine. Analysis revealed that homopolymer length, cycle number and cycle position have a substantial influence on the mean and variance of the flow-value distribution. As a result, shorter homopolymers are called more accurately than longer ones, earlier flow cycles are less error-prone than later ones, and specific within-cycle positions have half the accuracy of others. Although the PGM read throughput is higher than that of the Roche 454, the PGM has a much higher error rate, which may limit its suitability for amplicon sequencing.
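The link between flow-value variance and homopolymer calling accuracy can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the analysis itself: the flow value for a homopolymer of length n is normally distributed around n, the length is called by rounding to the nearest integer, and the standard deviation grows linearly with n. The function names and the parameters `base_sd` and `sd_per_base` are hypothetical and chosen only for illustration; actual base-calling software is considerably more involved.

```python
import math


def call_homopolymer(flow_value: float) -> int:
    """Call a homopolymer length by rounding the flow value to the nearest
    non-negative integer (simplifying assumption, not the PGM base caller)."""
    return max(0, math.floor(flow_value + 0.5))


def call_accuracy(length: int, base_sd: float = 0.10, sd_per_base: float = 0.05) -> float:
    """Probability that a flow value drawn from Normal(length, sd) rounds back
    to `length`, where sd = base_sd + sd_per_base * length.

    Both spread parameters are hypothetical values, not measured ones.
    """
    sd = base_sd + sd_per_base * length
    # P(|X - length| < 0.5) for X ~ Normal(length, sd)
    return math.erf(0.5 / (sd * math.sqrt(2)))


if __name__ == "__main__":
    # Under this model, accuracy falls as homopolymer length increases.
    for n in range(1, 7):
        print(n, round(call_accuracy(n), 4))
```

Under these assumptions the probability of a correct call decreases monotonically with homopolymer length, which mirrors the qualitative trend described above; cycle number and within-cycle position could be modelled in the same way by letting the spread depend on them.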
The utility of community profiling is demonstrated in my final research chapter, where knowledge of the species diversity guided our strategy for generating and analysing the metagenome. In this study, the primary aim was to recover the genome sequence of a rare and novel methanogen found within a bovine rumen community. As the species occurred at low abundance, enrichment procedures were employed to simplify the community and increase the abundance of the target organism. To ensure both high coverage and resolvability of repetitive sequences within and across species, Illumina paired-end libraries were generated from the enrichment. Analysis of the metagenome necessitated custom binning, assembly and annotation approaches to generate both complete and partial genomes from a mixed microbial cohort.
The bioinformatics community have quickly responded to the computational and statistical challenges raised by community genome sequencing using next-generation technologies, with both application- and platform-specific tools becoming readily available. This thesis describes novel methods, approaches and insights pertaining to microbial community sequencing, advances that form part of the analysis of metagenomes or amplicon datasets in isolation. By leveraging these rapidly maturing workflows, future bioinformatics development can focus on system-based approaches to community sequencing, integrating information gained from metagenomes, amplicons and transcriptomes to provide holistic interpretations of community structure and function.