ACM Home Page
Please provide us with feedback. Feedback
To search or to crawl?: towards a query optimizer for text-centric tasks
Full text PdfPdf (625 KB)
Source International Conference on Management of Data archive
Proceedings of the 2006 ACM SIGMOD international conference on Management of data table of contents
Chicago, IL, USA
SESSION: Querying and web table of contents
Pages: 265 - 276  
Year of Publication: 2006
ISBN:1-59593-434-0
Authors
Panagiotis G. Ipeirotis  New York University
Eugene Agichtein  Microsoft Research
Pranay Jain  Columbia University
Luis Gravano  Columbia University
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 141,   Citation Count: 11
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1142473.1142504
What is a DOI?

ABSTRACT

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or 'crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, 2003.
 
3
E. Agichtein, P. G. Ipeirotis, and L. Gravano. Modeling query-based access to text databases. In WebDB, 2003.
 
4
M. Banko, E. Brill, S. Dumais, and J. Lin. AskMSR: Question answering using the World-Wide Web. In Symp. on Mining Answers from Texts and KBases, 2002.
 
5
M. K. Bergman. The Deep Web: Surfacing hidden value. Journal of Electronic Publishing, 7(1), Aug. 2001.
 
6
C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/¿mlearn/MLRepository.html
 
7
8
9
10
11
 
12
S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2002.
13
 
14
15
16
17
 
18
F. Chung and L. Lu. Connected components in random graphs with given degree sequences. Annals of Combinatorics, 6:125--145, 2002.
 
19
W. W. Cohen. Learning trees and rules with set-valued features. In AAAI, 1996.
 
20
W. W. Cohen and Y. Singer. Learning to query the web. In AAAI Workshop on Internet-Based Information Systems, 1996.
 
21
22
 
23
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
24
25
 
26
L. Gravano, P. G. Ipeirotis, and M. Sahami. Query- vs. crawling-based classification of searchable web databases. IEEE Data Eng. Bull., 25(1), 2002.
27
 
28
 
29
 
30
 
31
P. G. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In VLDB, 2002.
 
32
33
 
34
M. Mitzenmacher. Dynamic models for file sizes and double Pareto distributions. Internet Mathematics, 1(3):305--334, 2004.
 
35
M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Review E, 64(2):1--17, 2001.
36
 
37
 
38
S. M. Ross. Introduction to Probability Models. Academic Press, 8th ed., 2002.
39
 
40
V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, Sept. 1998.
 
41
 
42
R. Yangarber and R. Grishman. NYU: Description of the Proteus/PET system as used for MUC-7. In MUC-7, 1998.
 
43
G. K. Zipf. Human Behavior and the Principle of Least Effort. 1949.

CITED BY  11

Collaborative Colleagues:
Panagiotis G. Ipeirotis: colleagues
Eugene Agichtein: colleagues
Pranay Jain: colleagues
Luis Gravano: colleagues