|
ABSTRACT
We are given a collection D of text documents d1,…,dk, with ∑i = n, which may be preprocessed. In the document listing problem, we are given an online query comprising of a pattern string p of length m and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing following which the occurrence listing problem can be solved in time O(m + output) where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated.We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involve performing "local" encodings whereby they are reduced to range query problems on geometric objects --- points and lines --- that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
P. Agarwal and M. van Kreveld. Polygon and connected component intersection searching Algorithmica, 1996, 15, 626-660.
|
| |
2
|
|
| |
3
|
Amir M. Ben-Amram , Omer Berkman , Costas S. Iliopoulos , Kunsoo Park, The subtree max gap problem with application to parallel string covering, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.501-510, January 23-25, 1994, Arlington, Virginia, United States
|
| |
4
|
G. Benson and M. Waterman A Method for Fast Database Search for All k-nucleotide Repeats. Nucleic Acids Research, Vol 22, No. 22, 1994.
|
| |
5
|
|
| |
6
|
|
| |
7
|
P. Ferragina. Personal Communication, 1999.
|
 |
8
|
Paolo Ferragina , Nick Koudas , Divesh Srivastava , S. Muthukrishnan, Two-dimensional substring indexing, Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.282-288, May 2001, Santa Barbara, California, United States
[doi> 10.1145/375551.375610]
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
|
| |
15
|
|
| |
16
|
R. Janardan and M. Lopez. Generalized intersection searching problems. Intl J Comput. Geom. Applications, 3(1993), 39-69.
|
| |
17
|
R. Kolpakov and G. Kucherov. Finding repeats with fixed gap. Technical Report RR-3901, INRIA, March 2000.
|
| |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
R. Hariharan. Personal communication, 1999.
|
| |
22
|
SQL full text indexing. http://larr.unm.edu/owen/SQLBOL70/html/ca --- co15.htm.
|
| |
23
|
Jaak Vilo. Discovering Frequent Patterns from Strings Technical Report, Department of Computer Science P.O.Box 26 FIN-00014. Check at http://citeseer.nj.nec.com/103438.html.
|
| |
24
|
P. Weiner. Linear Pattern Matching Algorithms. Proc. 14th IEEE Symp on Switching and Automata Theory, 1973, 1-11.
|
CITED BY 11
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Amihood Amir , Eran Chencinski , Costas Iliopoulos , Tsvi Kopelowitz , Hui Zhang, Property matching and weighted matching, Theoretical Computer Science, v.395 n.2-3, p.298-310, May, 2008
|
|
|
C. Epifanio , A. Gabriele , F. Mignosi , A. Restivo , M. Sciortino, Languages with mismatches, Theoretical Computer Science, v.385 n.1-3, p.152-166, October, 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|