SystemT: a system for declarative information extraction
Source
ACM SIGMOD Record archive
Volume 37, Issue 4 (December 2008)
COLUMN: Special section on managing information extraction
Pages 7-13
Year of Publication: 2009
ISSN:0163-5808
Authors
Rajasekar Krishnamurthy  IBM Almaden Research Center
Yunyao Li  IBM Almaden Research Center
Sriram Raghavan  IBM Almaden Research Center
Frederick Reiss  IBM Almaden Research Center
Shivakumar Vaithyanathan  IBM Almaden Research Center
Huaiyu Zhu  IBM Almaden Research Center
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1519103.1519105

ABSTRACT

As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques cannot (i) scale to large data sets or (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks.



