Building query optimizers for information extraction: the SQoUT project
Source
ACM SIGMOD Record archive
Volume 37, Issue 4 (December 2008)
COLUMN: Special section on managing information extraction
Pages 28-34  
Year of Publication: 2009
ISSN:0163-5808
Authors
Alpa Jain  Columbia University
Panagiotis Ipeirotis  New York University
Luis Gravano  Columbia University
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1519103.1519108

ABSTRACT

Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives, and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the user-specified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
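
As a rough illustration of the idea in the abstract, the sketch below enumerates execution strategies as combinations of one retrieval, one extraction, and one join alternative, then picks the cheapest strategy that satisfies user-specified quality and time requirements. It is a toy model under stated assumptions, not the SQoUT optimizer: every option name, every time and quality number, and the rule that times add while qualities multiply are hypothetical and chosen only to make the example concrete.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Optional, Tuple


@dataclass(frozen=True)
class Option:
    """One alternative for a single step (all values are illustrative)."""
    name: str
    time: float     # estimated execution cost, arbitrary units
    quality: float  # estimated fraction of good output retained, in [0, 1]


# Hypothetical alternatives for each of the three steps named in the abstract.
RETRIEVAL = [Option("scan_all_documents", 100.0, 1.0),
             Option("keyword_filtered_scan", 30.0, 0.8)]
EXTRACTION = [Option("strict_extractor", 50.0, 0.9),
              Option("lenient_extractor", 20.0, 0.7)]
JOIN = [Option("full_join", 40.0, 0.9),
        Option("early_stop_join", 15.0, 0.75)]


@dataclass(frozen=True)
class Plan:
    steps: Tuple[str, ...]
    time: float
    quality: float


def enumerate_plans() -> Iterator[Plan]:
    """Yield every combination of one option per step.

    Assumed toy cost model: step times add up, step qualities multiply.
    """
    for combo in product(RETRIEVAL, EXTRACTION, JOIN):
        total_time = sum(o.time for o in combo)
        total_quality = 1.0
        for o in combo:
            total_quality *= o.quality
        yield Plan(tuple(o.name for o in combo), total_time, total_quality)


def choose_plan(min_quality: float, max_time: float) -> Optional[Plan]:
    """Return the fastest plan meeting both requirements, or None if none does."""
    feasible = [p for p in enumerate_plans()
                if p.quality >= min_quality and p.time <= max_time]
    return min(feasible, key=lambda p: p.time) if feasible else None


if __name__ == "__main__":
    print(choose_plan(min_quality=0.6, max_time=200.0))
```

With these made-up numbers, loosening the quality requirement lets the chooser switch to cheaper retrieval or extraction alternatives, while a strict requirement can leave no feasible plan at all. A real optimizer would estimate time and quality from statistics about the text database and the extraction systems rather than from fixed constants.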



