ACM Home Page
Please provide us with feedback. Feedback
Efficient algorithms for document retrieval problems
Full text PdfPdf (1.17 MB)
Source Symposium on Discrete Algorithms archive
Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms table of contents
San Francisco, California
Pages: 657 - 666  
Year of Publication: 2002
ISBN:0-89871-513-X
Author
S. Muthukrishnan  AT&T Labs --- Research, Florham Park, NJ
Sponsor
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
Publisher
Society for Industrial and Applied Mathematics  Philadelphia, PA, USA
Bibliometrics
Downloads (6 Weeks): 11,   Downloads (12 Months): 63,   Citation Count: 11
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Review this Article  

ABSTRACT

We are given a collection D of text documents d1,…,dk, with ∑i = n, which may be preprocessed. In the document listing problem, we are given an online query comprising of a pattern string p of length m and our goal is to return the set of all documents that contain one or more copies of p. In the closely related occurrence listing problem, we output the set of all positions within the documents where pattern p occurs. In 1973, Weiner [24] presented an algorithm with O(n) time and space preprocessing following which the occurrence listing problem can be solved in time O(m + output) where output is the number of positions where p occurs; this algorithm is clearly optimal. In contrast, no optimal algorithm is known for the closely related document listing problem, which is perhaps more natural and certainly well-motivated.We provide the first known optimal algorithm for the document listing problem. More generally, we initiate the study of pattern matching problems that require retrieving documents matched by the patterns; this contrasts with pattern matching problems that have been studied more frequently, namely, those that involve retrieving all occurrences of patterns. We consider document retrieval problems that are motivated by online query processing in databases, Information Retrieval systems and Computational Biology. We present very efficient (optimal) algorithms for our document retrieval problems. Our approach for solving such problems involve performing "local" encodings whereby they are reduced to range query problems on geometric objects --- points and lines --- that have color. We present improved algorithms for these colored range query problems that arise in our reductions using the structural properties of strings. This approach is quite general and yields simple, efficient, implementable algorithms for all the document retrieval problems in this paper.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
P. Agarwal and M. van Kreveld. Polygon and connected component intersection searching Algorithmica, 1996, 15, 626-660.
 
2
 
3
 
4
G. Benson and M. Waterman A Method for Fast Database Search for All k-nucleotide Repeats. Nucleic Acids Research, Vol 22, No. 22, 1994.
 
5
 
6
 
7
P. Ferragina. Personal Communication, 1999.
8
9
 
10
 
11
 
12
13
 
14
 
15
 
16
R. Janardan and M. Lopez. Generalized intersection searching problems. Intl J Comput. Geom. Applications, 3(1993), 39-69.
 
17
R. Kolpakov and G. Kucherov. Finding repeats with fixed gap. Technical Report RR-3901, INRIA, March 2000.
 
18
 
19
 
20
 
21
R. Hariharan. Personal communication, 1999.
 
22
SQL full text indexing. http://larr.unm.edu/owen/SQLBOL70/html/ca --- co15.htm.
 
23
Jaak Vilo. Discovering Frequent Patterns from Strings Technical Report, Department of Computer Science P.O.Box 26 FIN-00014. Check at http://citeseer.nj.nec.com/103438.html.
 
24
P. Weiner. Linear Pattern Matching Algorithms. Proc. 14th IEEE Symp on Switching and Automata Theory, 1973, 1-11.

CITED BY  11