|
ABSTRACT
Text documents often embed data that is structured in nature. This structured data is increasingly exposed using information extraction systems, which generate structured relations from documents, introducing an opportunity to process expressive, structured queries over text databases. This paper discusses our SQoUT1 project, which focuses on processing structured queries over relations extracted from text databases. We show how, in our extraction-based scenario, query processing can be decomposed into a sequence of basic steps: retrieving relevant text documents, extracting relations from the documents, and joining extracted relations for queries involving multiple relations. Each of these steps presents different alternatives and together they form a rich space of possible query execution strategies. We identify execution efficiency and output quality as the two critical properties of a query execution, and argue that an optimization approach needs to consider both properties. To this end, we take into account the userspecified requirements for execution efficiency and output quality, and choose an execution strategy for each query based on a principled, cost-based comparison of the alternative execution strategies.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
E. Agichtein and L. Gravano. Querying text databases for efficient information extraction. In ICDE, 2003.
|
| |
3
|
|
| |
4
|
|
| |
5
|
W. Cohen and A. McCallum. Information extraction from the World Wide Web (tutorial). In KDD, 2003.
|
| |
6
|
|
| |
7
|
D. Downey, O. Etzioni, and S. Soderland. A probabilistic model of redundancy in information extraction. In IJCAI, 2005.
|
 |
8
|
Oren Etzioni , Michael Cafarella , Doug Downey , Stanley Kok , Ana-Maria Popescu , Tal Shaked , Stephen Soderland , Daniel S. Weld , Alexander Yates, Web-scale information extraction in knowitall: (preliminary results), Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988687]
|
| |
9
|
|
 |
10
|
Panagiotis G. Ipeirotis , Eugene Agichtein , Pranay Jain , Luis Gravano, To search or to crawl?: towards a query optimizer for text-centric tasks, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142504]
|
 |
11
|
|
| |
12
|
A. Jain, A. Doan, and L. Gravano. Optimizing SQL queries over text databases. In ICDE, 2008.
|
 |
13
|
|
| |
14
|
A. Jain, P. G. Ipeirotis, A. Doan, and L. Gravano. Join optimization of information extraction output: Quality matters! In ICDE, 2009. To appear.
|
| |
15
|
A. Jain and D. Srivastava. Exploring a few good tuples from text databases. In ICDE, 2009. To appear.
|
| |
16
|
|
| |
17
|
M. Paşca, D. Lin, J. Bigham, A. Lifchits, and A. Jain. Organizing and searching the world wide web of facts - step one: The one-million fact extraction challenge. In WWW, 2007.
|
| |
18
|
|
| |
19
|
F. Reiss, S. Raghavan, R. Krishnamurthy, H. Zhu, and S. Vaithyanathan. An algebraic approach to rule-based information extraction. In ICDE, 2008.
|
| |
20
|
|
 |
21
|
Victor S. Sheng , Foster Provost , Panagiotis G. Ipeirotis, Get another label? improving data quality and data mining using multiple, noisy labelers, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
[doi> 10.1145/1401890.1401965]
|
|