Uncertainty management in rule-based information extraction systems
Source
International Conference on Management of Data
Proceedings of the 35th SIGMOD international conference on Management of data
Providence, Rhode Island, USA
SESSION: Research session 3: information extraction
Pages 101-114  
Year of Publication: 2009
ISBN:978-1-60558-551-2
Authors
Eirinaios Michelakis  University of California at Berkeley, Berkeley, CA, USA
Rajasekar Krishnamurthy  IBM Almaden Research Center, San Jose, CA, USA
Peter J. Haas  IBM Almaden Research Center, San Jose, CA, USA
Shivakumar Vaithyanathan  IBM Almaden Research Center, San Jose, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1559845.1559858

ABSTRACT

Rule-based information extraction is a process by which structured objects are extracted from text based on user-defined rules. The compositional nature of rule-based information extraction also allows rules to be expressed over previously extracted objects. Such extraction is inherently uncertain, due to the varying precision associated with the rules used in a specific extraction task. Quantifying this uncertainty is crucial for querying the extracted objects in probabilistic databases, and for improving the recall of extraction tasks that use compositional rules. In this paper, we provide a probabilistic framework for handling the uncertainty in rule-based information extraction. Specifically, for each extraction task, we build a parametric exponential model of uncertainty that captures the interaction between the different rules, as well as the compositional nature of the rules; the exponential form of our model follows from maximum-entropy considerations. We also give model-decomposition techniques that make the learning algorithms scalable to large numbers of rules and constraints. Experiments over multiple real-world extraction tasks confirm that our approach yields accurate probability estimates with only a small performance overhead. Moreover, our framework supports incremental pay-as-you-go improvements in the accuracy of probability estimates as new rules, data, or constraints are added.
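
To make the idea of an exponential (maximum-entropy) model over rule outputs concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the model or learning algorithm of the paper, which the abstract does not spell out. The two rules, the hand-labeled toy data, and the functions `predict` and `fit_maxent` are all hypothetical. Each candidate annotation is described by which rules fired on it, and the probability that it is correct takes the exponential form exp(score) / (1 + exp(score)), where the score is a weighted sum of rule indicators. The weights are fit by plain gradient ascent on the log-likelihood.

```python
import math

# Hypothetical labeled data: ((rule1_fired, rule2_fired), annotation_is_correct).
# Rule 1 alone is right 3 times out of 4, rule 2 alone 1 time out of 2,
# and both together 5 times out of 6.
EXAMPLES = (
    [((1, 0), 1)] * 3 + [((1, 0), 0)] * 1
    + [((0, 1), 1)] * 1 + [((0, 1), 0)] * 1
    + [((1, 1), 1)] * 5 + [((1, 1), 0)] * 1
)


def predict(theta, x):
    """Probability that an annotation is correct, given which rules fired.

    theta holds one weight per rule followed by a bias term.
    """
    score = theta[-1] + sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-score))


def fit_maxent(examples, n_rules, lr=0.5, epochs=5000):
    """Fit the exponential model by gradient ascent on the log-likelihood.

    At the optimum the gradient is zero, which is the maximum-entropy
    constraint: expected feature counts under the model match the
    counts observed in the labeled data.
    """
    theta = [0.0] * (n_rules + 1)
    n = len(examples)
    for _ in range(epochs):
        grad = [0.0] * (n_rules + 1)
        for x, y in examples:
            residual = y - predict(theta, x)
            for i, xi in enumerate(x):
                grad[i] += residual * xi
            grad[-1] += residual  # bias feature is always 1
        theta = [t + lr * g / n for t, g in zip(theta, grad)]
    return theta


theta = fit_maxent(EXAMPLES, n_rules=2)
for pattern in [(1, 0), (0, 1), (1, 1)]:
    print(pattern, round(predict(theta, pattern), 3))
```

Because this toy model has as many parameters as distinct rule-firing patterns, the fitted probabilities come out close to the empirical fractions in the data (3/4, 1/2, and 5/6). With more rules than can be given a parameter per combination, the same form interpolates between them, which is where the interaction between rules starts to matter.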


