High-performance information extraction with AliBaba
Source
Extending Database Technology; Vol. 360
Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology
Saint Petersburg, Russia
DEMONSTRATION SESSION: Demonstrations: Demo group 2
Pages 1140-1143
Year of Publication: 2009
ISBN: 978-1-60558-422-5
ABSTRACT
A wealth of information is available only in web pages, patents, publications, etc. Extracting information from such sources is challenging, both because of the complex language processing steps typically required and because of the potentially large number of texts that must be analyzed. Furthermore, integrating the extracted data with other sources of knowledge is often mandatory for subsequent analysis. In this demo, we present the AliBaba system for scalable information extraction from biomedical documents. Unlike many other systems, AliBaba performs both entity extraction and relationship extraction, and it graphically visualizes the resulting network of interconnected objects. It leverages the PubMed search engine to select relevant documents. The technical novelty of AliBaba is twofold: (a) its ability to automatically learn language patterns for relationship extraction without an annotated corpus, and (b) its high-performance pattern matching algorithm. We show that a simple yet effective pattern filtering technique drastically improves the runtime of the system without harming its extraction effectiveness. Although AliBaba has been implemented for biomedical texts, its underlying principles should also be applicable in any other domain.