Algorithms and Hardness Results for DNA Physical Mapping, Protein Identification, and Related Problems | |

Authors: | Mark Cieliebak |

Reference: | Ph.D. Thesis No. 15258, 2003. Department of Computer Science, ETH Zurich |

Download: | Postscript (.ps) or PDF (.pdf) or Compressed (.zip) |

Abstract: |
In this thesis, we focus on two applications of digestion experiments,
namely
physical mapping of DNA and protein identification,
and study the computational complexity of combinatorial problems that
arise in this context. Digestion experiments play an important role
in molecular biology. In such experiments, enzymes are used to cleave
DNA molecules or proteins at specific sequence patterns, the restriction
sites. The resulting fragments are used in many different ways
to study the structure of DNA and proteins, respectively.In the DoubleDigest problem, we are given the lengths
of DNA fragments arising from digestion experiments with two enzymes,
and we want to find a physical map of the DNA, i.e., the positions of
the restriction sites of the enzymes along the DNA sequence. DoubleDigest
is known to be NP-hard. We show that the problem is even strongly NP-hard,
even if the two enzymes always cut at disjoint restriction sites. Moreover,
we show that for partial cleavage errors the problem to find solutions
with a minimum number of errors is hard to approximate. In the PartialDigest
problem, we are given DNA fragment lengths arising from digestion experiments
with only one enzyme, and we again ask for a physical map of the DNA.
Neither a proof of NP-hardness nor a polynomial--time algorithm is
known for PartialDigest. We study variations of PartialDigest that
model missing fragments, additional fragments, and erroneous fragment
lengths, and show that these variations are NP-hard, hard to approximate,
and strongly NP-hard, respectively.The EqualSumSubsets problem, where
we are given a set of positive integers and we ask for two subsets
such that their elements add up to the same total, is known to be NP-hard.
EqualSumSubsets can be used to prove NP-hardness for PartialDigest
variants. Motivated by this, we study variations of EqualSumSubsets,
where we, for instance, allow any positive rational factor between
the sums of the two subsets. We give (pseudo-)polynomial algorithms
or (strong) NP--hardness proofs, respectively, for several natural
variations of EqualSumSubsets.In the second part of this thesis, we address the problem of protein identification. The mass fingerprint of a protein
contains the masses of fragments that emerge when digesting the protein.
Mass fingerprints are used, for instance, to search for proteins in
large protein databases, without sequencing them. The MassFinding problem
arises in this context. Here, we are given a mass M and
a protein sequence, and we ask whether there is a fragment of the protein
that has mass M. MassFinding can be solved easily in time
linear in n, the length of the protein sequence. We present
an algorithm that solves the problem even in sublinear time $O(\frac
n{\log n})$. This algorithm uses a data structure that is generated
in a preprocessing step, and that requires only linear storage space.A different approach to identifying a protein is to establish its amino acid sequence ( de novo sequencing).
Here, a fragment of the protein (peptide) is dissociated,
and the masses of the resulting pieces are measured using tandem mass
spectrometry. This yields an MS/MS spectrum of the peptide.
For the case of error--free data, algorithms exist that construct the
amino acid sequence of a peptide from its MS/MS spectrum. We have implemented
a software tool (Audens) that allows for de novo peptide sequencing
even in the case of erronoeous data, and evaluated its performance
on real-life spectra.One problem that arises in the context of Audens is the Decomposition problem,
where we ask whether a given mass can be represented as a sum of amino
acid masses. This problem is known to be NP-hard. We show that Decomposition
can be solved in polynomial time if the number of different amino
acid masses is constant, or if the masses of all but a constant number
of amino acids are polynomially bounded. On the other hand, we show
that if we ask for the minimum or maximum number
of amino acids whose masses add up to the given mass, then no polynomial-time
algorithm can guarantee any constant approximation ratio (unless P
= NP). |

Remarks: |

Note: electronic versions may not always correspond exactly to printed versions.