|
ABSTRACT
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Agichtein, E. 2005. Scaling information extraction to large document collections. IEEE Data Eng. Bull. 28, 4 (Dec.), 3--10.
|
 |
2
|
|
| |
3
|
Agichtein, E. and Gravano, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE'03).
|
| |
4
|
Agichtein, E., Ipeirotis, P. G., and Gravano, L. 2003. Modeling query-based access to text databases. In Proceedings of the Sixth International Workshop on the Web and Databases (WebDB'03). 87--92.
|
 |
5
|
|
| |
6
|
Baayen, R. H. 2006. Word Frequency Distributions. Springer, Berlin, Germany.
|
| |
7
|
Banko, M., Brill, E., Dumais, S., and Lin, J. 2002. AskMSR: Question answering using the World Wide Web. In Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.
|
| |
8
|
Bergman, M. K. 2001. The deep Web: Surfacing hidden value. J. Electron. Publish. 7, 1 (Aug.).
|
| |
9
|
Blake, C. L. and Merz, C. J. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
|
| |
10
|
|
 |
11
|
|
| |
12
|
Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. 2007. Structured querying of Web text data: A technical challenge. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR'07). 225--234.
|
 |
13
|
|
 |
14
|
Jamie Callan , Margaret Connell , Aiqun Du, Automatic discovery of language models for text databases, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.479-490, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
 |
15
|
James P. Callan , Zhihong Lu , W. Bruce Croft, Searching distributed collections with inference networks, Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, p.21-28, July 09-13, 1995, Seattle, Washington, United States
[doi> 10.1145/215206.215328]
|
 |
16
|
|
| |
17
|
|
 |
18
|
|
 |
19
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States
|
 |
20
|
|
| |
21
|
Chung, F. and Lu, L. 2002. Connected components in random graphs with given degree sequences. Ann. Combin. 6, 125--145.
|
| |
22
|
Cohen, W. W. 1996. Learning trees and rules with set-valued features. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), Eighth Conference on Innovative Applications of Artificial Intelligence (IAAI-96). 709--716.
|
| |
23
|
Cohen, W. W. and Singer, Y. 1996. Learning to query the Web. In Proceedings of the AAAI Workshop on Internet-Based Information Systems. 16--25.
|
| |
24
|
Cunningham, H. 2006. Information extraction, automatic. In Encyclopedia of Language and Linguistics, 2nd ed. Elsevier Science, Amsterdam, The Netherlands.
|
| |
25
|
|
 |
26
|
Robert B. Doorenbos , Oren Etzioni , Daniel S. Weld, A scalable comparison-shopping agent for the World-Wide Web, Proceedings of the first international conference on Autonomous agents, p.39-48, February 05-08, 1997, Marina del Rey, California, United States
[doi> 10.1145/267658.267666]
|
| |
27
|
|
 |
28
|
Oren Etzioni , Michael Cafarella , Doug Downey , Stanley Kok , Ana-Maria Popescu , Tal Shaked , Stephen Soderland , Daniel S. Weld , Alexander Yates, Web-scale information extraction in knowitall: (preliminary results), Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988687]
|
 |
29
|
|
| |
30
|
Gravano, L., Ipeirotis, P. G., and Sahami, M. 2002. Query- vs. crawling-based classification of searchable Web databases. IEEE Data Eng. Bull. 25, 1 (Mar.), 43--50.
|
| |
31
|
|
| |
32
|
|
 |
33
|
Panagiotis G. Ipeirotis , Eugene Agichtein , Pranay Jain , Luis Gravano, To search or to crawl?: towards a query optimizer for text-centric tasks, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142504]
|
| |
34
|
|
 |
35
|
Zachary G. Ives , Daniela Florescu , Marc Friedman , Alon Levy , Daniel S. Weld, An adaptive query execution system for data integration, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.299-310, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
36
|
Jain, A., Doan, A., and Gravano, L. 2007. SQL queries over unstructured text databases (poster paper). In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE'07).
|
| |
37
|
Jones, R. 2005. Learning to extract entities from labeled and unlabeled text. Ph.D. dissertation. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.
|
 |
38
|
|
| |
39
|
|
 |
40
|
Volker Markl , Vijayshankar Raman , David Simmen , Guy Lohman , Hamid Pirahesh , Miso Cilimdzic, Robust query processing through progressive optimization, Proceedings of the 2004 ACM SIGMOD international conference on Management of data, June 13-18, 2004, Paris, France
[doi> 10.1145/1007568.1007642]
|
 |
41
|
|
 |
42
|
|
| |
43
|
Mitzenmacher, M. 2004. Dynamic models for file sizes and double pareto distributions. Internet Math. 1, 3, 305--334.
|
| |
44
|
Newman, M. E. J., Strogatz, S. H., and Watts, D. J. 2001. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E (Statistical, Nonlinear, and Soft Matter Physics) 64, 2 (Aug.), 026118 (1--17).
|
 |
45
|
|
| |
46
|
|
| |
47
|
|
 |
48
|
|
| |
49
|
|
| |
50
|
|
| |
51
|
Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7).
|
| |
52
|
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.
|
|