ABSTRACT
As applications within and outside the enterprise encounter increasing volumes of unstructured data, there has been renewed interest in information extraction (IE), the discipline concerned with extracting structured information from unstructured text. Classical IE techniques developed by the NLP community were based on cascading grammars and regular expressions. However, due to the inherent limitations of grammar-based extraction, these techniques are unable to (i) scale to large data sets and (ii) support the expressivity requirements of complex information tasks. At the IBM Almaden Research Center, we are developing SystemT, an IE system that addresses these limitations by adopting an algebraic approach. By leveraging well-understood database concepts such as declarative queries and cost-based optimization, SystemT enables scalable execution of complex information extraction tasks. In this paper, we motivate the SystemT approach to information extraction. We describe our extraction algebra and demonstrate the effectiveness of our optimization techniques in providing orders-of-magnitude reductions in the running time of complex extraction tasks.
CITED BY
Eirinaios Michelakis, Rajasekar Krishnamurthy, Peter J. Haas, Shivakumar Vaithyanathan, Uncertainty management in rule-based information extraction systems, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
David E. Simmen, Frederick Reiss, Yunyao Li, Suresh Thalamati, Enabling enterprise mashups over unstructured text feeds with InfoSphere MashupHub and SystemT, Proceedings of the 35th SIGMOD international conference on Management of data, June 29-July 02, 2009, Providence, Rhode Island, USA
AnHai Doan, Jeffrey F. Naughton, Raghu Ramakrishnan, Akanksha Baid, Xiaoyong Chai, Fei Chen, Ting Chen, Eric Chu, Pedro DeRose, Byron Gao, Chaitanya Gokhale, Jiansheng Huang, Warren Shen, Ba-Quy Vuong, Information extraction challenges in managing unstructured data, ACM SIGMOD Record, v.37 n.4, December 2008