Algorithmic Complexity of Protein Identification: Searching in Weighted Strings | |
Authors: | Mark Cieliebak, Thomas Erlebach, Zsuzsanna Lipták, Jens Stoye, and Emo Welzl |
Reference: | In "Proceedings of the 2nd IFIP International Conference on Theoretical Computer Science (TCS 2002): Foundations of Information Technology in the Era of Network and Mobile Computing", Montreal, Canada, September 2002. Ricardo Baeza-Yates, Ugo Montanari, and Nicola Santoro (editors). Kluwer Adacemic Publishers, pp. 143-156, 2002 |
Download: | Postscript (.ps) or PDF (.pdf) |
Abstract: |
We investigate a problem which arises in computational
biology: Given a constant-size alphabet A with a weight function
$\mu: A --> N$, find an efficient data structure
and query algorithm solving the following problem: For a string $\sigma$
over A and a weight $M \in N$, decide
whether $\sigma$ contains a substring with weight M (Mass Finding
Problem). If the answer is yes, then we may in addition require a witness,
i.e., indices $i \leq j$ such that the substring beginning
at position i and ending at position j has weight M.
We allow preprocessing of the string, and measure efficiency in two parameters:
storage space required for the preprocessed data, and running time of
the query algorithm for given M. We are interested in data structures
and algorithms requiring subquadratic storage space and sublinear query
time, where we measure the input size as the length of the input string.
Among others, we present two non-trivial efficient algorithms: LOOKUP
solves the problem with O(n) space and $O(\frac{n}{\log
n}\cdot \log\log n)$ time; INTERVAL solves the problem
for binary alphabets with O(n) storage space in O(log n)
query time. Finally, we introduce other variants of the problem and sketch
how our algorithms may be extended for these variants.
|
Remarks: |