TEG: a hybrid approach to information extraction
Source: Conference on Information and Knowledge Management
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Washington, D.C., USA
SESSION: KM-3 (knowledge management): knowledge extraction
Pages: 589-596
Year of Publication: 2004
ISBN: 1-58113-874-1
Authors
Benjamin Rosenfeld  Bar-Ilan University, Ramat Gan, ISRAEL
Ronen Feldman  Bar-Ilan University, Ramat Gan, ISRAEL
Moshe Fresko  Bar-Ilan University, Ramat Gan, ISRAEL
Jonathan Schler  Bar-Ilan University, Ramat Gan, ISRAEL
Yonatan Aumann  Bar-Ilan University, Ramat Gan, ISRAEL
Sponsors
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1031171.1031280

ABSTRACT

This paper describes a hybrid statistical and knowledge-based information extraction model, able to extract entities and relations at the sentence level. The model attempts to retain and improve the high accuracy levels of knowledge-based systems while drastically reducing the amount of manual labor by relying on statistics drawn from a training corpus. The implementation of the model, called TEG (Trainable Extraction Grammar), can be adapted to any IE domain by writing a suitable set of rules in an SCFG (Stochastic Context-Free Grammar) based extraction language and training them on an annotated corpus. The system does not contain any purely linguistic components, such as a PoS tagger or parser. We demonstrate the performance of the system on several named entity extraction and relation extraction tasks. The experiments show that our hybrid approach outperforms both purely statistical and purely knowledge-based systems, while requiring orders of magnitude less manual rule writing and a smaller amount of training data. The improvement in accuracy is slight for the named entity extraction tasks and more pronounced for relation extraction.
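
To make the idea of training SCFG rules on an annotated corpus concrete, here is a minimal sketch of relative-frequency estimation of rule probabilities. It is not TEG's rule language or training procedure, which the abstract does not specify; the rule names (`Person`, `Title`, and so on), the toy counts, and the helper functions `estimate_rule_probs` and `derivation_prob` are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict
from math import prod


def estimate_rule_probs(observed_rules):
    """Relative-frequency estimate: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(observed_rules)
    lhs_totals = defaultdict(int)
    for (lhs, _rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}


def derivation_prob(derivation, probs):
    """Probability of a derivation as the product of its rule probabilities.

    Rules never seen in training get probability 0.0 (no smoothing in this sketch).
    """
    return prod(probs.get(rule, 0.0) for rule in derivation)


# Hypothetical rule applications, as if read off an annotated corpus.
observed = [
    ("Person", ("FirstName", "LastName")),
    ("Person", ("FirstName", "LastName")),
    ("Person", ("FirstName", "LastName")),
    ("Person", ("Title", "LastName")),
    ("Title", ("dr",)),
    ("Title", ("mr",)),
]

probs = estimate_rule_probs(observed)

# A derivation is the list of rules used to expand one extracted entity.
titled_person = [
    ("Person", ("Title", "LastName")),
    ("Title", ("dr",)),
]
score = derivation_prob(titled_person, probs)
```

In a sketch like this, the hand-written rules supply the structure (which expansions are allowed), while the corpus supplies only the weights; competing derivations of the same sentence would then be ranked by their scores.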

