Untitled Document

In this thesis, we focus on two applications of digestion experiments, namely physical mapping of DNA and protein identification, and study the computational complexity of combinatorial problems that arise in this context. Digestion experiments play an important role in molecular biology. In such experiments, enzymes are used to cleave DNA molecules or proteins at specific sequence patterns, the restriction sites. The resulting fragments are used in many different ways to study the structure of DNA and proteins, respectively.
In the DoubleDigest problem, we are given the lengths of DNA fragments arising from digestion experiments with two enzymes, and we want to find a physical map of the DNA, i.e., the positions of the restriction sites of the enzymes along the DNA sequence. DoubleDigest is known to be NP-hard. We show that the problem is even strongly NP-hard, even if the two enzymes always cut at disjoint restriction sites. Moreover, we show that for partial cleavage errors the problem to find solutions with a minimum number of errors is hard to approximate. In the PartialDigest problem, we are given DNA fragment lengths arising from digestion experiments with only one enzyme, and we again ask for a physical map of the DNA. Neither a proof of NP-hardness nor a polynomial--time algorithm is known for PartialDigest. We study variations of PartialDigest that model missing fragments, additional fragments, and erroneous fragment lengths, and show that these variations are NP-hard, hard to approximate, and strongly NP-hard, respectively.
The EqualSumSubsets problem, where we are given a set of positive integers and we ask for two subsets such that their elements add up to the same total, is known to be NP-hard. EqualSumSubsets can be used to prove NP-hardness for PartialDigest variants. Motivated by this, we study variations of EqualSumSubsets, where we, for instance, allow any positive rational factor between the sums of the two subsets. We give (pseudo-)polynomial algorithms or (strong) NP--hardness proofs, respectively, for several natural variations of EqualSumSubsets.
In the second part of this thesis, we address the problem of protein identification. The mass fingerprint of a protein contains the masses of fragments that emerge when digesting the protein. Mass fingerprints are used, for instance, to search for proteins in large protein databases, without sequencing them. The MassFinding problem arises in this context. Here, we are given a mass M and a protein sequence, and we ask whether there is a fragment of the protein that has mass M. MassFinding can be solved easily in time linear in n, the length of the protein sequence. We present an algorithm that solves the problem even in sublinear time $O(\frac n{\log n})$. This algorithm uses a data structure that is generated in a preprocessing step, and that requires only linear storage space.
A different approach to identifying a protein is to establish its amino acid sequence (de novo sequencing). Here, a fragment of the protein (peptide) is dissociated, and the masses of the resulting pieces are measured using tandem mass spectrometry. This yields an MS/MS spectrum of the peptide. For the case of error--free data, algorithms exist that construct the amino acid sequence of a peptide from its MS/MS spectrum. We have implemented a software tool (Audens) that allows for de novo peptide sequencing even in the case of erronoeous data, and evaluated its performance on real-life spectra.
One problem that arises in the context of Audens is the Decomposition problem, where we ask whether a given mass can be represented as a sum of amino acid masses. This problem is known to be NP-hard. We show that Decomposition can be solved in polynomial time if the number of different amino acid masses is constant, or if the masses of all but a constant number of amino acids are polynomially bounded. On the other hand, we show that if we ask for the minimum or maximum number of amino acids whose masses add up to the given mass, then no polynomial-time algorithm can guarantee any constant approximation ratio (unless P = NP).

	Algorithms and Hardness Results for DNA Physical Mapping, Protein Identification, and Related Problems

Authors:	Mark Cieliebak
Reference:	Ph.D. Thesis No. 15258, 2003. Department of Computer Science, ETH Zurich
Download:	Postscript (.ps) or PDF (.pdf) or Compressed (.zip)
Abstract:	In this thesis, we focus on two applications of digestion experiments, namely physical mapping of DNA and protein identification, and study the computational complexity of combinatorial problems that arise in this context. Digestion experiments play an important role in molecular biology. In such experiments, enzymes are used to cleave DNA molecules or proteins at specific sequence patterns, the restriction sites. The resulting fragments are used in many different ways to study the structure of DNA and proteins, respectively. In the DoubleDigest problem, we are given the lengths of DNA fragments arising from digestion experiments with two enzymes, and we want to find a physical map of the DNA, i.e., the positions of the restriction sites of the enzymes along the DNA sequence. DoubleDigest is known to be NP-hard. We show that the problem is even strongly NP-hard, even if the two enzymes always cut at disjoint restriction sites. Moreover, we show that for partial cleavage errors the problem to find solutions with a minimum number of errors is hard to approximate. In the PartialDigest problem, we are given DNA fragment lengths arising from digestion experiments with only one enzyme, and we again ask for a physical map of the DNA. Neither a proof of NP-hardness nor a polynomial--time algorithm is known for PartialDigest. We study variations of PartialDigest that model missing fragments, additional fragments, and erroneous fragment lengths, and show that these variations are NP-hard, hard to approximate, and strongly NP-hard, respectively. The EqualSumSubsets problem, where we are given a set of positive integers and we ask for two subsets such that their elements add up to the same total, is known to be NP-hard. EqualSumSubsets can be used to prove NP-hardness for PartialDigest variants. Motivated by this, we study variations of EqualSumSubsets, where we, for instance, allow any positive rational factor between the sums of the two subsets. We give (pseudo-)polynomial algorithms or (strong) NP--hardness proofs, respectively, for several natural variations of EqualSumSubsets. In the second part of this thesis, we address the problem of protein identification. The mass fingerprint of a protein contains the masses of fragments that emerge when digesting the protein. Mass fingerprints are used, for instance, to search for proteins in large protein databases, without sequencing them. The MassFinding problem arises in this context. Here, we are given a mass M and a protein sequence, and we ask whether there is a fragment of the protein that has mass M. MassFinding can be solved easily in time linear in n, the length of the protein sequence. We present an algorithm that solves the problem even in sublinear time $O(\frac n{\log n})$. This algorithm uses a data structure that is generated in a preprocessing step, and that requires only linear storage space. A different approach to identifying a protein is to establish its amino acid sequence (de novo sequencing). Here, a fragment of the protein (peptide) is dissociated, and the masses of the resulting pieces are measured using tandem mass spectrometry. This yields an MS/MS spectrum of the peptide. For the case of error--free data, algorithms exist that construct the amino acid sequence of a peptide from its MS/MS spectrum. We have implemented a software tool (Audens) that allows for de novo peptide sequencing even in the case of erronoeous data, and evaluated its performance on real-life spectra. One problem that arises in the context of Audens is the Decomposition problem, where we ask whether a given mass can be represented as a sum of amino acid masses. This problem is known to be NP-hard. We show that Decomposition can be solved in polynomial time if the number of different amino acid masses is constant, or if the masses of all but a constant number of amino acids are polynomially bounded. On the other hand, we show that if we ask for the minimum or maximum number of amino acids whose masses add up to the given mass, then no polynomial-time algorithm can guarantee any constant approximation ratio (unless P = NP).
Remarks: