|
ABSTRACT
A large amount of structured information is buried in unstructured text. Information extraction systems can extract structured relations from the documents and enable sophisticated, SQL-like queries over unstructured text. Information extraction systems are not perfect and their output has imperfect precision and recall (i.e., contains spurious tuples and misses good tuples). Typically, an extraction system has a set of parameters that can be used as “knobs” to tune the system to be either precision- or recall-oriented. Furthermore, the choice of documents processed by the extraction system also affects the quality of the extracted relation. So far, estimating the output quality of an information extraction task has been an ad hoc procedure, based mainly on heuristics. In this article, we show how to use Receiver Operating Characteristic (ROC) curves to estimate the extraction quality in a statistically robust way and show how to use ROC analysis to select the extraction parameters in a principled manner. Furthermore, we present analytic models that reveal how different document retrieval strategies affect the quality of the extracted relation. Finally, we present our maximum likelihood approach for estimating, on the fly, the parameters required by our analytic models to predict the runtime and the output quality of each execution plan. Our experimental evaluation demonstrates that our optimization approach predicts accurately the output quality and selects the fastest execution plan that satisfies the output quality restrictions.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Adamic, L. A. and Huberman, B. A. 2002. Zipf's law and the Internet. Glottometrics 3, 143--150.
|
| |
2
|
|
 |
3
|
|
| |
4
|
Agichtein, E. and Gravano, L. 2003. Querying text databases for efficient information extraction. In Proceedings of the 19th IEEE International Conference on Data Engineering (ICDE'03).
|
| |
5
|
Agichtein, E., Ipeirotis, P. G., and Gravano, L. 2003. Modeling query-based access to text databases. In Proceedings of the 6th International Workshop on the Web and Databases, (WebDB'03), 87--92.
|
 |
6
|
|
| |
7
|
|
 |
8
|
|
| |
9
|
Cafarella, M. J., Re, C., Suciu, D., Etzioni, O., and Banko, M. 2007. Structured querying of Web text: A technical challenge. In 3rd Biennial Conference on Innovative Data Systems Research (CIDR'07).
|
 |
10
|
Surajit Chaudhuri , Rajeev Motwani , Vivek Narasayya, Random sampling for histogram construction: how much is enough?, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.436-447, June 01-04, 1998, Seattle, Washington, United States
|
| |
11
|
Cohen, W. W. 1995. Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning (ICML'95), 115--123.
|
| |
12
|
Cohen, W. W. 2004. Minorthird: Methods for identifying names and ontological relations in text using heuristics for inducing regularities from data. http://minorthird.sourceforge.net.
|
| |
13
|
Downey, D., Etzioni, O., and Soderland, S. 2005. A probabilistic model of redundancy in information extraction. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), 1034--1041.
|
| |
14
|
Egan, J. P. 1975. Signal Detection Theory and ROC Analysis. Academic Press.
|
| |
15
|
Erdreich, L. S. and Lee, E. T. 1981. Use of relative operating characteristic analysis in epidemiology. Amer. J. Epidemiol. 114, 5, 649--662.
|
 |
16
|
Oren Etzioni , Michael Cafarella , Doug Downey , Stanley Kok , Ana-Maria Popescu , Tal Shaked , Stephen Soderland , Daniel S. Weld , Alexander Yates, Web-scale information extraction in knowitall: (preliminary results), Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
[doi> 10.1145/988672.988687]
|
| |
17
|
Fawcett, T. 2003. ROC graphs: Notes and practical considerations for data mining researchers. Tech. rep., HPL-2003-4, HP Labs.
|
| |
18
|
Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. 2003. Bayesian Data Analysis, 2nd ed. Chapman and Hall/CRC.
|
| |
19
|
Goldstein, M., Morris, S., and Yen, G. G. 2004. Problems with fitting to the power-law distribution. The Eur. Phys. J. B - Condensed Matter Complex Syst. 41, 2, 255--258.
|
| |
20
|
|
| |
21
|
|
| |
22
|
|
| |
23
|
Hiyakumoto, L., Lita, L. V., and Nyberg, E. 2005. Multi-Strategy information extraction for question answering. In Proceedings of the International Florida Artificial Intelligence Research Society Conference (FLAIRS'05), 678--683.
|
 |
24
|
Panagiotis G. Ipeirotis , Eugene Agichtein , Pranay Jain , Luis Gravano, To search or to crawl?: towards a query optimizer for text-centric tasks, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
[doi> 10.1145/1142473.1142504]
|
 |
25
|
|
 |
26
|
Zachary G. Ives , Daniela Florescu , Marc Friedman , Alon Levy , Daniel S. Weld, An adaptive query execution system for data integration, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.299-310, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
27
|
Jain, A., Doan, A., and Gravano, L. 2007a. Optimizing SQL queries over text databases. In Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE'08). 636--645.
|
| |
28
|
Jain, A., Doan, A., and Gravano, L. 2007b. SQL queries over unstructured text databases. In Proceedings of the 23rd IEEE International Conference on Data Engineering (ICDE'07).
|
 |
29
|
|
 |
30
|
|
| |
31
|
|
| |
32
|
Newman, M. E. J. 2005. Power laws, Pareto distributions and Zipf's law. Contemporary Phys. 46, 5, 323--351.
|
| |
33
|
Marius Paşca , Dekang Lin , Jeffrey Bigham , Andrei Lifchits , Alpa Jain, Names and similarities on the web: fact extraction in the fast lane, Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, p.809-816, July 17-18, 2006, Sydney, Australia
[doi> 10.3115/1220175.1220277]
|
| |
34
|
|
| |
35
|
|
 |
36
|
Victor S. Sheng , Foster Provost , Panagiotis G. Ipeirotis, Get another label? improving data quality and data mining using multiple, noisy labelers, Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 24-27, 2008, Las Vegas, Nevada, USA
[doi> 10.1145/1401890.1401965]
|
| |
37
|
Yangarber, R. and Grishman, R. 1998. NYU: Description of the Proteus/PET system as used for MUC-7. In Proceedings of the 7th Message Understanding Conference (MUC-7).
|
|