ABSTRACT
Rule-based information extraction is a process by which structured objects are extracted from text based on user-defined rules. The compositional nature of rule-based information extraction also allows rules to be expressed over previously extracted objects. Such extraction is inherently uncertain, due to the varying precision associated with the rules used in a specific extraction task. Quantifying this uncertainty is crucial for querying the extracted objects in probabilistic databases, and for improving the recall of extraction tasks that use compositional rules. In this paper, we provide a probabilistic framework for handling the uncertainty in rule-based information extraction. Specifically, for each extraction task, we build a parametric exponential model of uncertainty that captures the interaction between the different rules, as well as the compositional nature of the rules; the exponential form of our model follows from maximum-entropy considerations. We also give model-decomposition techniques that make the learning algorithms scalable to large numbers of rules and constraints. Experiments over multiple real-world extraction tasks confirm that our approach yields accurate probability estimates with only a small performance overhead. Moreover, our framework supports incremental pay-as-you-go improvements in the accuracy of probability estimates as new rules, data, or constraints are added.
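As a generic illustration of why maximum-entropy considerations lead to an exponential form, the standard textbook argument can be sketched as follows. This is not the paper's specific model: the feature functions f_i, the target values c_i, and the parameters \theta_i are assumed notation introduced only for this sketch.

```latex
\begin{align*}
\max_{p}\quad & -\sum_{x} p(x)\,\log p(x) \\
\text{s.t.}\quad & \sum_{x} p(x)\, f_i(x) = c_i, \quad i = 1,\dots,k, \qquad \sum_{x} p(x) = 1 \\
\Longrightarrow\quad & p_\theta(x) = \frac{1}{Z(\theta)} \exp\!\Big(\sum_{i=1}^{k} \theta_i f_i(x)\Big),
\qquad Z(\theta) = \sum_{x} \exp\!\Big(\sum_{i=1}^{k} \theta_i f_i(x)\Big)
\end{align*}
```

Under this reading, x would range over joint outcomes of the extraction rules and each f_i would encode one constraint, for example an indicator tied to a rule's observed precision. The parameters \theta_i arise as the Lagrange multipliers of the constraints and are what a learning algorithm would fit.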