Noisy Data Make the Partial Digest Problem NP-hard
Authors: Mark Cieliebak, Stephan Eidenbenz, and Paolo Penna
Reference: Technical Report 381, ETH Zurich, Department of Computer Science, October 2002
Download: Postscript (.ps) or PDF (.pdf)
The Partial Digest problem -- well-known for its applications in computational biology and for the intriguingly open status of its computational complexity -- asks for the coordinates of n points on a line such that the pairwise distances of the points form a given multi-set of $n \choose 2$ distances. In an effort to model real-life data, we study the computational complexity of a minimization version of Partial Digest, in which only a subset of all pairwise distances is given and the rest are lacking due to experimental errors. We show that this variation is NP-hard to solve exactly, thus making the existence of polynomial-time algorithms for this problem extremely unlikely. Our result answers an open question posed by Pevzner (2000). We then study a maximization version of Partial Digest where a superset of all pairwise distances is given, with some additional distances due to inaccurate measurements. We show that this maximization version is NP-hard to approximate to within a factor of $|D|^{\frac{1}{2} -\epsilon}$ for any $\epsilon > 0$, where $|D|$ is the number of input distances, which implies that polynomial-time algorithms cannot even guarantee to find a solution for the problem that comes close to the optimum. Our inapproximability result is tight up to low-order terms as we give a trivial approximation algorithm that achieves a matching approximation ratio. Our optimization variations model two different error types that occur in real-life data.

Note: electronic versions may not always correspond exactly to printed versions.