Identifying table boundaries in digital documents via sparse line detection
Conference on Information and Knowledge Management
Proceeding of the 17th ACM conference on Information and knowledge management
Napa Valley, California, USA
SESSION: KM: information extraction
Pages 1311-1320
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
Ying Liu, The Pennsylvania State University, University Park, PA, USA
Prasenjit Mitra, The Pennsylvania State University, University Park, PA, USA
C. Lee Giles, The Pennsylvania State University, University Park, PA, USA
ABSTRACT
Most prior work on information extraction has focused on extracting information from the text of digital documents. Often, however, the most important information reported in an article is presented in tabular form. If the data in tables can be extracted and stored in a database, it can be queried and joined with other data using database management systems. To prepare a data source for table search, accurate table boundary detection is crucial for the later step of table structure decomposition. Table boundary detection and content extraction are challenging because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection by exploiting the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results compare the performance of the machine learning methods with that of a heuristics-based method and demonstrate the effectiveness of sparse line analysis for table boundary detection.
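The idea of filtering a document down to its sparse lines before looking for table boundaries can be illustrated with a small Python sketch. This is not the method from the paper: the abstract does not define "sparse line", so the rule used here (a wide internal gap, or a line that fills less than half of an assumed page width) and the names `is_sparse_line`, `sparse_runs`, `page_width`, `min_gap`, `max_fill`, and `min_run` are all assumptions made for illustration.

```python
import re


def is_sparse_line(line, page_width=80, min_gap=3, max_fill=0.5):
    """Hypothetical heuristic, NOT the paper's definition of a sparse line.

    A line counts as sparse if it has a run of at least `min_gap` spaces
    between two non-space characters, or if its text covers less than
    `max_fill` of an assumed `page_width` (in characters).
    """
    text = line.strip()
    if not text:
        return False
    # A wide gap inside the line suggests separated table cells.
    has_wide_gap = re.search(r"\S {%d,}\S" % min_gap, text) is not None
    # A short line leaves most of the page width empty.
    fill = len(text) / page_width
    return has_wide_gap or fill < max_fill


def sparse_runs(lines, min_run=2):
    """Group consecutive sparse lines into candidate regions.

    Returns a list of (start, end) index pairs, both inclusive.
    Runs shorter than `min_run` lines are dropped (assumed threshold).
    """
    runs = []
    start = None
    for i, line in enumerate(lines):
        if is_sparse_line(line):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(lines) - start >= min_run:
        runs.append((start, len(lines) - 1))
    return runs


# Made-up sample page: body text, three table-like rows, body text.
sample = [
    "This paragraph of body text runs across nearly the full width of the page here.",
    "Name        Left        Right",
    "alpha       1           2",
    "beta        3           4",
    "Another long sentence of ordinary text follows the table and fills the line.",
]
candidate_regions = sparse_runs(sample)
```

A run of sparse lines found this way is only a candidate region. In the paper, the sparse lines are then labeled with one of eight line label types by a CRF or SVM, a step this sketch does not attempt.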