Noisy Data Make the Partial Digest Problem NP-hard
Authors: Mark Cieliebak, Stephan Eidenbenz, and Paolo Penna
Reference: In "Proceedings of the 3rd Workshop on Algorithms in Bioinformatics (WABI 2003)", Budapest, Hungary, September 2003. Springer, LNBI 2812, pp. 111-123, 2003
Download: Postscript (.ps) or PDF (.pdf)
The problem to find the coordinates of n points on a line such that the pairwise distances of the points form a given multi-set of $n \choose 2$ distances is known as Partial Digest problem, which occurs for instance in DNA physical mapping and de novo sequencing of proteins. Although Partial Digest was -- as a combinatorial problem -- already proposed in the 1930's, its computational complexity is still unknown. In an effort to model real-life data, we introduce two optimization variations of Partial Digest that model two different error types that occur in real-life data. First, we study the computational complexity of a minimization version of Partial Digest in which only a subset of all pairwise distances is given and the rest are lacking due to experimental errors. We show that this variation is NP-hard to solve exactly. This result answers an open question posed by Pevzner (2000). We then study a maximization version of Partial Digest where a superset of all pairwise distances is given, with some additional distances due to inaccurate measurements. We show that this maximization version is NP-hard to approximate to within a factor of $|D|^{\frac{1}{2} -\varepsilon}$ for any $\varepsilon > 0$, where $|D|$ is the number of input distances. This inapproximability result is tight up to low-order terms as we give a trivial approximation algorithm that achieves a matching approximation ratio.

Note: electronic versions may not always correspond exactly to printed versions.