An analysis of algorithms to estimate the characteristics of the underlying population in Massively Parallel Pyrosequencing data.
Massively Parallel Pyrosequencing (MPP) is a next-generation DNA sequencing technique that is becoming ubiquitous because it is considerably faster and cheaper, and offers higher throughput, than long-established techniques such as Sanger sequencing. The MPP methodology is also much less labor intensive than Sanger sequencing. Indeed, MPP has become a preferred technology in experiments that seek to determine the distinctive genetic variation present in homologous genomic regions. However, a problem arises in the interpretation of the reads derived from an MPP experiment. Specifically, MPP reads are characteristically error prone, which makes it difficult to separate the authentic genomic variation underlying a set of MPP reads from variation that is a consequence of sequencing error. The difficulty of inferring authentic variation is further compounded by the fact that MPP reads are also characteristically short. As a consequence, the correct alignment of an MPP read with respect to the genomic region from which it was derived may not be obvious. To address this, several computational algorithms that seek to correctly align MPP reads and remove non-authentic genetic variation from them have been proposed in the literature. We refer to the removal of non-authentic variation from a set of MPP reads as error correction. Computational algorithms that process MPP data are classified as sequence-space algorithms and flow-space algorithms. Sequence-space algorithms work with MPP sequencing reads as raw data, whereas flow-space algorithms work with MPP flowgrams as raw data. A flowgram is an intermediate product of MPP, which is subsequently converted into a sequencing read. In theory, flow-space computations should produce more accurate results than sequence-space computations. In this thesis, we make a qualitative comparison of the distinct solutions delivered by selected MPP read alignment algorithms. 
Further, we make a qualitative comparison of the distinct solutions delivered by selected MPP error correction algorithms. Our comparisons between different algorithms with the same niche are facilitated by the design of a platform for MPP simulation, PyroSim. PyroSim is designed to encapsulate the error rate that is characteristic of MPP. We implement a selection of sequence-space and flow-space alignment algorithms in a software package, MPPAlign. We derive a quality ranking for the distinct algorithms implemented in MPPAlign through a series of qualitative comparisons. Further, we implement a selection of sequence-space and flow-space error correction algorithms in a software package, MPPErrorCorrect. Similarly, we derive a quality ranking for the distinct algorithms implemented in MPPErrorCorrect through a series of qualitative comparisons. Contrary to the view expressed in the literature, which postulates that flow-space computations are more accurate than sequence-space computations, we find that in general the sequence-space algorithms that we implement outperform the flow-space algorithms. We surmise that flow-space is a more sensitive domain for conducting computations and can only yield consistently good results under stringent quality control measures. In sequence-space, however, we find that base calling, the process that converts flowgrams (flow-space raw data) into sequencing reads (sequence-space raw data), leads to more reliable computations.
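To make the relationship between flow-space and sequence-space data concrete, the following is a minimal sketch of base calling under simplifying assumptions. It is not the procedure used in the thesis or in any particular base caller. It assumes a cyclic nucleotide flow order of TACG and assumes that each flow intensity, rounded to the nearest integer, gives the length of the homopolymer run incorporated in that flow. The name `call_bases` and the example flowgram values are hypothetical.

```python
# Assumption: nucleotides are flowed cyclically in the order T, A, C, G.
FLOW_ORDER = "TACG"


def call_bases(flowgram, flow_order=FLOW_ORDER):
    """Convert a flowgram (list of flow intensities) into a read.

    Assumption: rounding an intensity to the nearest integer yields the
    number of identical bases incorporated during that flow.
    """
    read = []
    for i, intensity in enumerate(flowgram):
        base = flow_order[i % len(flow_order)]
        # Round half up; clamp at zero so noisy negative values call no base.
        run_length = max(0, int(intensity + 0.5))
        read.append(base * run_length)
    return "".join(read)


# Hypothetical flowgram covering two cycles of the flow order.
clean = [1.02, 0.05, 2.10, 0.97, 0.03, 1.48, 0.00, 1.01]
print(call_bases(clean))  # TCCGAG

# A small shift in one intensity (1.48 -> 1.52) changes the called read.
noisy = [1.02, 0.05, 2.10, 0.97, 0.03, 1.52, 0.00, 1.01]
print(call_bases(noisy))  # TCCGAAG
```

In this sketch, a change of 0.04 in a single flow intensity adds a base to a homopolymer run. This illustrates how intensities near a rounding boundary can produce the insertion and deletion errors that make it hard to separate authentic variation from sequencing error.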