Algorithmic Complexity of Protein Identification: Combinatorics of Weighted Strings | |
Authors: | Mark Cieliebak, Thomas Erlebach, Zsuzsanna Lipták, Jens Stoye, and Emo Welzl |
Reference: | Technical Report 361, ETH Zurich, Department of Computer Science, August 2001 |
Download: | Postscript (.ps) or PDF (.pdf) |
Abstract: |
We investigate a problem from computational biology:
Given a constant--size alphabet A with a weight function $\mu:
A \to R^+$, find an efficient data structure
and query algorithm solving the following problem: For a weight $M
\in R^+$ and a string $\sigma$ over A, decide
whether $\sigma$ contains a substring with weight M (One--String
Mass Finding Problem). If the answer is yes, then we may in addition require
a witness, i.e. indices $i \leq j$ such that the substring
beginning at position i and ending at position j has
weight M. We allow preprocessing of the string, and measure efficiency
in two parameters: storage space required for the preprocessed data, and
running time of the query algorithm for given M. We are interested
in data structures and algorithms requiring subquadratic storage space
and sublinear query time, where we measure the input size as the length
of the input string. We present two efficient algorithms: LOOKUP solves
the problem with O(n)$ space and $O(\frac{n}{\log n}\cdot
\log\log n)$ time; INTERVAL solves the problem for binary alphabets
with O(n) space in O(log n) time. We sketch a third
algorithm, CLUSTER, which can be adjusted for a space-time-tradeoff but
for which we do not yet have a resource analysis. We introduce a function
on weighted strings which is closely related to the analysis of algorithms
for the One-String Mass Finding Problem: The number of different submasses
of a weighted string. We present several properties of this function,
including upper and lower bounds. Finally, we introduce two more general
variants of the problem and sketch how algorithms may be extended for
these variants.
|
Remarks: |