A machine learning based approach for table detection on the web
Source
International World Wide Web Conference
Proceedings of the 11th international conference on World Wide Web
Honolulu, Hawaii, USA
SESSION: Extraction and Visualization
Pages: 242 - 250
Year of Publication: 2002
ISBN: 1-58113-449-5
ABSTRACT
Tables are a commonly used presentation scheme, especially for describing relational information. However, table understanding remains an open problem. In this paper, we consider the problem of table detection in web documents. Its potential applications include web mining, knowledge management, and web content summarization and delivery to narrow-bandwidth devices. We describe a machine learning based approach to classify each given table entity as either genuine or non-genuine. Various features reflecting the layout as well as content characteristics of tables are studied. In order to facilitate the training and evaluation of our table classifier, we designed a novel web document table ground truthing protocol and used it to build a large table ground truth database. The database consists of 1,393 HTML files collected from hundreds of different web sites and contains 11,477 leaf TABLE elements, out of which 1,740 are genuine tables. Experiments were conducted using the cross validation method and an F-measure of 95.89% was achieved.
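Two notions in the abstract can be illustrated with a short sketch: "leaf TABLE elements" and the F-measure. The code below is not the authors' implementation. It assumes that a leaf TABLE is a table element containing no nested table, and that the F-measure is the standard balanced harmonic mean of precision and recall; the abstract confirms neither definition. The names `LeafTableCounter`, `count_leaf_tables`, and `f_measure` are hypothetical.

```python
from html.parser import HTMLParser


class LeafTableCounter(HTMLParser):
    """Counts <table> elements that contain no nested <table> (assumed meaning of "leaf")."""

    def __init__(self):
        super().__init__()
        self._has_child = []  # one flag per currently open <table>
        self.leaf_count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag == "table":
            if self._has_child:
                self._has_child[-1] = True  # the enclosing table is not a leaf
            self._has_child.append(False)

    def handle_endtag(self, tag):
        if tag == "table" and self._has_child:
            if not self._has_child.pop():
                self.leaf_count += 1


def count_leaf_tables(html: str) -> int:
    """Return the number of leaf tables in an HTML string."""
    parser = LeafTableCounter()
    parser.feed(html)
    parser.close()
    return parser.leaf_count


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (assumed definition): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For scale, the figures in the abstract imply that genuine tables make up about 15.2% of the leaf TABLE elements (1,740 of 11,477), which is one reason a precision/recall based score is more informative here than plain accuracy.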
CITED BY 19
Jie Tang, Hang Li, Yunbo Cao, Zhaohui Tang, Email data cleaning, Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, August 21-24, 2005, Chicago, Illinois, USA
Ying Liu, Kun Bai, Prasenjit Mitra, C. Lee Giles, Automatic searching of tables in digital libraries, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
Aleksander Pivk, Philipp Cimiano, York Sure, Matjaz Gams, Vladislav Rajkovič, Rudi Studer, Transforming arbitrary tables into logical form with TARTAR, Data & Knowledge Engineering, v.60 n.3, p.567-595, March, 2007
Ying Liu, Kun Bai, Prasenjit Mitra, C. Lee Giles, TableSeer: automatic table metadata extraction and searching in digital libraries, Proceedings of the 2007 conference on Digital libraries, June 18-23, 2007, Vancouver, BC, Canada
Wolfgang Gatterbauer, Paul Bohunsky, Marcus Herzog, Bernhard Krüpl, Bernhard Pollak, Towards domain-independent information extraction from web tables, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada