## An analysis of algorithms to estimate the characteristics of the underlying population in Massively Parallel Pyrosequencing data

##### Abstract

Massively Parallel Pyrosequencing (MPP) is a next-generation DNA sequencing
technique that is becoming ubiquitous because it is considerably faster and
cheaper, and offers higher throughput, than long-established sequencing
techniques such as Sanger sequencing. The MPP methodology is also
much less labor-intensive than Sanger sequencing.
Indeed, MPP has become a preferred technology in experiments that seek to
determine the distinctive genetic variation present in homologous genomic
regions.
However, interpreting the reads derived from an MPP experiment is
problematic. Specifically, MPP reads are characteristically error-prone,
which makes it difficult to separate the authentic genomic variation
underlying a set of MPP reads from variation that is a consequence of
sequencing error.
The difficulty of inferring authentic variation is compounded by the
fact that MPP reads are also characteristically short. Consequently, the
correct alignment of an MPP read with respect to the genomic region from
which it was derived may not be obvious.
To this end, several computational algorithms that seek to align MPP reads
correctly and to remove non-authentic genetic variation from them have been
proposed in the literature. We refer to the removal of non-authentic variation
from a set of MPP reads as error correction. Computational algorithms that
process MPP data are classified as either sequence-space or flow-space
algorithms. Sequence-space algorithms take MPP sequencing reads as
raw data, whereas flow-space algorithms take MPP flowgrams as raw
data. A flowgram is an intermediate product of MPP that is subsequently
converted into a sequencing read.
In theory, flow-space computations should produce more accurate results
than sequence-space computations.
In this thesis, we make a qualitative comparison of the solutions
delivered by selected MPP read alignment algorithms. We also make a
qualitative comparison of the solutions delivered by selected MPP
error correction algorithms.
To facilitate comparisons between algorithms that address the same task, we
design a platform for MPP simulation, PyroSim. PyroSim
is designed to capture the error rate that is characteristic of MPP.
We implement a selection of sequence-space and flow-space alignment algorithms
in a software package, MPPAlign. We derive a quality ranking
for the distinct algorithms implemented in MPPAlign through a series of
qualitative comparisons.
Further, we implement a selection of sequence-space and flow-space error
correction algorithms in a software package, MPPErrorCorrect. Similarly,
we derive a quality ranking for the distinct algorithms implemented in MPPErrorCorrect
through a series of qualitative comparisons.
Contrary to the view expressed in the literature that flow-space
computations are more accurate than sequence-space computations,
we find that, in general, the sequence-space algorithms that we implement
outperform the flow-space algorithms.
We surmise that flow-space is a more sensitive domain for conducting computations
and yields consistently good results only under stringent quality
control measures. By contrast, we find that base calling,
the process that converts flowgrams (flow-space raw data) into sequencing
reads (sequence-space raw data), leads to more reliable computations in
sequence-space.
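
To make the distinction between flow-space and sequence-space data concrete, the following is a minimal sketch of the base calling step mentioned above. It is an illustration only and is not taken from PyroSim, MPPAlign or MPPErrorCorrect. It assumes a cyclic nucleotide flow order of `TACG` and a naive rounding rule in which each flow intensity is rounded to the nearest integer homopolymer length; the function name `base_call` and the example intensities are hypothetical.

```python
from itertools import cycle

# Assumption: nucleotides are flowed cyclically in this order.
FLOW_ORDER = "TACG"


def base_call(flowgram, flow_order=FLOW_ORDER):
    """Convert a flowgram (flow-space) into a read (sequence-space).

    Naive rule, assumed for illustration: round each flow intensity to the
    nearest non-negative integer and emit that many copies of the flowed base.
    """
    read = []
    for intensity, base in zip(flowgram, cycle(flow_order)):
        run_length = max(0, int(intensity + 0.5))  # round half up
        read.append(base * run_length)
    return "".join(read)


# Hypothetical intensities for the flows T, A, C, G.
clean = [1.02, 0.03, 2.04, 0.97]
noisy = [1.02, 0.03, 2.51, 0.97]  # the C flow drifts past the 2.5 threshold

print(base_call(clean))  # TCCG
print(base_call(noisy))  # TCCCG
```

In this sketch, a small perturbation of a single flow intensity changes the called homopolymer length, which is one way in which sequencing error can appear as spurious variation in a read. A flow-space algorithm would operate on the intensities directly, whereas a sequence-space algorithm would see only the called reads.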