ACM Home Page
Please provide us with feedback. Feedback
Towards a query optimizer for text-centric tasks
Full text PdfPdf (1.10 MB)
Source
ACM Transactions on Database Systems (TODS) archive
Volume 32 ,  Issue 4  (November 2007) table of contents
Article No. 21  
Year of Publication: 2007
ISSN:0362-5915
Authors
Panagiotis G. Ipeirotis  New York University, New York, NY
Eugene Agichtein  Emory University, Atlanta, GA
Pranay Jain  Columbia University, New York, NY
Luis Gravano  Columbia University, New York, NY
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 22,   Downloads (12 Months): 163,   Citation Count: 5
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1292609.1292611
What is a DOI?

ABSTRACT

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Agichtein, E. 2005. Scaling information extraction to large document collections. IEEE Data Eng. Bull. 28, 4 (Dec.), 3--10.
2
 
3
Agichtein, E. and Gravano, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE'03).
 
4
Agichtein, E., Ipeirotis, P. G., and Gravano, L. 2003. Modeling query-based access to text databases. In Proceedings of the Sixth International Workshop on the Web and Databases (WebDB'03). 87--92.
5
 
6
Baayen, R. H. 2006. Word Frequency Distributions. Springer, Berlin, Germany.
 
7
Banko, M., Brill, E., Dumais, S., and Lin, J. 2002. AskMSR: Question answering using the World Wide Web. In Proceedings of 2002 AAAI Spring Symposium on Mining Answers from Texts and Knowledge Bases.
 
8
Bergman, M. K. 2001. The deep Web: Surfacing hidden value. J. Electron. Publish. 7, 1 (Aug.).
 
9
Blake, C. L. and Merz, C. J. 1998. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html.
 
10
11
 
12
Cafarella, M. J., Re, C., Suciu, D., and Etzioni, O. 2007. Structured querying of Web text data: A technical challenge. In Proceedings of the Third Biennial Conference on Innovative Data Systems Research (CIDR'07). 225--234.
13
14
15
16
 
17
18
19
20
 
21
Chung, F. and Lu, L. 2002. Connected components in random graphs with given degree sequences. Ann. Combin. 6, 125--145.
 
22
Cohen, W. W. 1996. Learning trees and rules with set-valued features. In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), Eighth Conference on Innovative Applications of Artificial Intelligence (IAAI-96). 709--716.
 
23
Cohen, W. W. and Singer, Y. 1996. Learning to query the Web. In Proceedings of the AAAI Workshop on Internet-Based Information Systems. 16--25.
 
24
Cunningham, H. 2006. Information extraction, automatic. In Encyclopedia of Language and Linguistics, 2nd ed. Elsevier Science, Amsterdam, The Netherlands.
 
25
26
 
27
28
29
 
30
Gravano, L., Ipeirotis, P. G., and Sahami, M. 2002. Query- vs. crawling-based classification of searchable Web databases. IEEE Data Eng. Bull. 25, 1 (Mar.), 43--50.
 
31
 
32
33
 
34
35
 
36
Jain, A., Doan, A., and Gravano, L. 2007. SQL queries over unstructured text databases (poster paper). In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE'07).
 
37
Jones, R. 2005. Learning to extract entities from labeled and unlabeled text. Ph.D. dissertation. Carnegie Mellon University, School of Computer Science, Pittsburgh, PA.
38
 
39
40
41
42
 
43
Mitzenmacher, M. 2004. Dynamic models for file sizes and double pareto distributions. Internet Math. 1, 3, 305--334.
 
44
Newman, M. E. J., Strogatz, S. H., and Watts, D. J. 2001. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E (Statistical, Nonlinear, and Soft Matter Physics) 64, 2 (Aug.), 026118 (1--17).
45
 
46
 
47
48
 
49
 
50
 
51
Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the Seventh Message Understanding Conference (MUC-7).
 
52
Zipf, G. K. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Reading, MA.


Collaborative Colleagues:
Panagiotis G. Ipeirotis: colleagues
Eugene Agichtein: colleagues
Pranay Jain: colleagues
Luis Gravano: colleagues