ABSTRACT
Tables are a very common presentation scheme, but few papers address table extraction in text data mining. This paper focuses on mining tables from large-scale HTML texts. We discuss table filtering, recognition, interpretation, and presentation. Heuristic rules and cell similarities are employed to identify tables; the F-measure of table recognition is 86.50%. We also propose an algorithm to capture attribute-value relationships among table cells. Finally, more structured data is extracted and presented.
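For readers unfamiliar with the reported metric, the F-measure combines precision and recall into a single score. The sketch below assumes the balanced form (beta = 1); the abstract does not state which weighting was used, and the function name `f_measure` and its parameters are illustrative rather than taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall.

    Hypothetical helper for illustration; beta = 1 (the balanced
    F-measure) is an assumption, not something the abstract specifies.
    """
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        # Both precision and recall are zero, so the score is zero.
        return 0.0
    return (1 + b2) * precision * recall / denom
```

With beta = 1 this reduces to 2PR / (P + R), so a recognizer is rewarded only when precision and recall are both high.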