ACM Home Page
Please provide us with feedback. Feedback
A quality-aware optimizer for information extraction
Full text PdfPdf (1.16 MB)
Source
ACM Transactions on Database Systems (TODS) archive
Volume 34 ,  Issue 1  (April 2009) table of contents
Article No. 5  
Year of Publication: 2009
ISSN:0362-5915
Authors
Alpa Jain  Columbia University, New York
Panagiotis G. Ipeirotis  New York University, New York
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 39,   Downloads (12 Months): 210,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1508857.1508862
What is a DOI?

ABSTRACT

A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and show how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts accurately the output quality and selects the fastest execution plan that satisfies the output quality restrictions.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Adamic, L. A. and Huberman, B. A. 2002. Zipf's law and the Internet. Glottometrics 3, 143--150.
 
2
3
 
4
Agichtein, E. and Gravano, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE'03).
 
5
Agichtein, E., Ipeirotis, P. G., and Gravano, L. 2003. Modeling query-based access to text databases. In Proceedings of the 6th International Workshop on the Web and Databases, (WebDB'03), 87--92.
6
 
7
8
 
9
Cafarella, M. J., Re, C., Suciu, D., Etzioni, O., and Banko, M. 2007. Structured querying of Web text: A technical challenge. In 3rd Biennial Conference on Innovative Data Systems Research (CIDR'07).
10
 
11
Cohen, W. W. 1995. Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning (ICML'95), 115--123.
 
12
Cohen, W. W. 2004. Minorthird: Methods for identifying names and ontological relations in text using heuristics for inducing regularities from data. http://minorthird.sourceforge.net.
 
13
Downey, D., Etzioni, O., and Soderland, S. 2005. A probabilistic model of redundancy in information extraction. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), 1034--1041.
 
14
Egan, J. P. 1975. Signal Detection Theory and ROC Analysis. Academic Press.
 
15
Erdreich, L. S. and Lee, E. T. 1981. Use of relative operating characteristic analysis in epidemiology. Amer. J. Epidemiol. 114, 5, 649--662.
16
 
17
Fawcett, T. 2003. ROC graphs: Notes and practical considerations for data mining researchers. Tech. rep., HPL-2003-4, HP Labs.
 
18
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. 2003. Bayesian Data Analysis, 2nd ed. Chapman and Hall/CRC.
 
19
Goldstein, M., Morris, S., and Yen, G. G. 2004. Problems with fitting to the power-law distribution. The Eur. Phys. J. B - Condensed Matter Complex Syst. 41, 2, 255--258.
 
20
 
21
 
22
 
23
Hiyakumoto, L., Lita, L. V., and Nyberg, E. 2005. Multi-Strategy information extraction for question answering. In Proceedings of the International Florida Artificial Intelligence Research Society Conference (FLAIRS'05), 678--683.
24
25
26
 
27
Jain, A., Doan, A., and Gravano, L. 2007a. Optimizing SQL queries over text databases. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE'08). 636--645.
 
28
Jain, A., Doan, A., and Gravano, L. 2007b. SQL queries over unstructured text databases. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE'07).
29
30
 
31
 
32
Newman, M. E. J. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Phys. 46, 5, 323--351.
 
33
 
34
 
35
36
 
37
Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7).


Collaborative Colleagues:
Alpa Jain: colleagues
Panagiotis G. Ipeirotis: colleagues