|
ABSTRACT
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or 'crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this paper, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated cost-model parameters. Overall, our approach helps predict the most appropriate execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, 2003.
|
| |
3
|
E. Agichtein, P. G. Ipeirotis, and L. Gravano. Modeling query-based access to text databases. In WebDB, 2003.
|
| |
4
|
M. Banko, E. Brill, S. Dumais, and J. Lin. AskMSR: Question answering using the World-Wide Web. In Symp. on Mining Answers from Texts and KBases, 2002.
|
| |
5
|
M. K. Bergman. The Deep Web: Surfacing hidden value. Journal of Electronic Publishing, 7(1), Aug. 2001.
|
| |
6
|
C. L. Blake and C. J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/¿mlearn/MLRepository.html
|
| |
7
|
|
 |
8
|
|
 |
9
|
|
 |
10
|
Jamie Callan , Margaret Connell , Aiqun Du, Automatic discovery of language models for text databases, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.479-490, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
11
|
James P. Callan , Zhihong Lu , W. Bruce Croft, Searching distributed collections with inference networks, Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, p.21-28, July 09-13, 1995, Seattle, Washington, United States
[doi> 10.1145/215206.215328]
|
| |
12
|
S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2002.
|
 |
13
|
|
| |
14
|
|
 |
15
|
|
 |
16
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States
|
 |
17
|
|
| |
18
|
F. Chung and L. Lu. Connected components in random graphs with given degree sequences. Annals of Combinatorics, 6:125--145, 2002.
|
| |
19
|
W. W. Cohen. Learning trees and rules with set-valued features. In AAAI, 1996.
|
| |
20
|
W. W. Cohen and Y. Singer. Learning to query the web. In AAAI Workshop on Internet-Based Information Systems, 1996.
|
| |
21
|
|
 |
22
|
Robert B. Doorenbos , Oren Etzioni , Daniel S. Weld, A scalable comparison-shopping agent for the World-Wide Web, Proceedings of the first international conference on Autonomous agents, p.39-48, February 05-08, 1997, Marina del Rey, California, United States
[doi> 10.1145/267658.267666]
|
| |
23
|
R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973.
|
 |
24
|
Oren Etzioni , Michael Cafarella , Doug Downey , Stanley Kok , Ana-Maria Popescu , Tal Shaked , Stephen Soderland , Daniel S. Weld , Alexander Yates, Web-scale information extraction in knowitall: (preliminary results), Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988687]
|
 |
25
|
|
| |
26
|
L. Gravano, P. G. Ipeirotis, and M. Sahami. Query- vs. crawling-based classification of searchable web databases. IEEE Data Eng. Bull., 25(1), 2002.
|
 |
27
|
|
| |
28
|
|
| |
29
|
|
| |
30
|
|
| |
31
|
P. G. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. In VLDB, 2002.
|
| |
32
|
|
 |
33
|
|
| |
34
|
M. Mitzenmacher. Dynamic models for file sizes and double Pareto distributions. Internet Mathematics, 1(3):305--334, 2004.
|
| |
35
|
M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Review E, 64(2):1--17, 2001.
|
 |
36
|
|
| |
37
|
|
| |
38
|
S. M. Ross. Introduction to Probability Models. Academic Press, 8th ed., 2002.
|
 |
39
|
|
| |
40
|
V. N. Vapnik. Statistical Learning Theory. Wiley-Interscience, Sept. 1998.
|
| |
41
|
|
| |
42
|
R. Yangarber and R. Grishman. NYU: Description of the Proteus/PET system as used for MUC-7. In MUC-7, 1998.
|
| |
43
|
G. K. Zipf. Human Behavior and the Principle of Least Effort. 1949.
|
CITED BY 11
|
|
|
|
|
|
|
|
|
|
|
Pedro DeRose , Warren Shen , Fei Chen , AnHai Doan , Raghu Ramakrishnan, Building structured web community portals: a top-down, compositional, and incremental approach, Proceedings of the 33rd international conference on Very large data bases, September 23-27, 2007, Vienna, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Fei Chen , Byron J. Gao , AnHai Doan , Jun Yang , Raghu Ramakrishnan, Optimizing complex extraction programs over evolving text data, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|
|
Xiaoyong Chai , Ba-Quy Vuong , AnHai Doan , Jeffrey F. Naughton, Efficiently incorporating user feedback into information extraction and integration programs, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
|
|