Identifying table boundaries in digital documents via sparse line detection
Source
Conference on Information and Knowledge Management
Proceedings of the 17th ACM Conference on Information and Knowledge Management
Napa Valley, California, USA
SESSION: KM: information extraction
Pages 1311-1320  
Year of Publication: 2008
ISBN: 978-1-59593-991-3
Authors
Ying Liu, The Pennsylvania State University, University Park, PA, USA
Prasenjit Mitra, The Pennsylvania State University, University Park, PA, USA
C. Lee Giles, The Pennsylvania State University, University Park, PA, USA
Sponsors
ACM: Association for Computing Machinery
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458082.1458255

ABSTRACT

Most prior work on information extraction has focused on extracting information from the text of digital documents. Often, however, the most important information reported in an article is presented in tabular form. If the data reported in tables can be extracted and stored in a database, it can be queried and joined with other data using database management systems. To prepare the data source for table search, accurate table boundary detection is crucial for the later table structure decomposition. Table boundary detection and content extraction are challenging because tabular formats are not standardized across documents. In this paper, we propose a simple but effective preprocessing method that improves table boundary detection by exploiting the sparse-line property of table rows. Our method reduces the table boundary detection problem to a sparse-line analysis problem with much less noise. We design eight line label types and apply two machine learning techniques, Conditional Random Fields (CRF) and Support Vector Machines (SVM), to table boundary detection. The experimental results compare the performance of the machine learning methods with that of a heuristics-based method and demonstrate the effectiveness of sparse-line analysis for table boundary detection.
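
The sparse-line idea can be illustrated with a minimal sketch. The abstract does not give the paper's actual sparseness criteria, its eight line label types, or the CRF/SVM features, so everything below is an assumption for illustration: the function names, the page width of 80 characters, and both thresholds are hypothetical. The sketch treats a line as sparse if it contains a wide internal gap or has few non-space characters for the page width, then groups consecutive sparse lines into candidate regions that a classifier could later label.

```python
import re


def is_sparse_line(line, width=80, min_gap=3, max_fill=0.5):
    """Hypothetical heuristic, not the paper's definition of a sparse line."""
    text = line.strip()
    if not text:
        return False  # blank lines are treated as separators here
    # A run of min_gap or more spaces between two tokens hints at columns.
    gap_pattern = re.compile(r"\S {" + str(min_gap) + r",}\S")
    has_wide_gap = gap_pattern.search(text) is not None
    # Fraction of the assumed page width covered by non-space characters.
    fill = sum(1 for ch in text if not ch.isspace()) / width
    return has_wide_gap or fill < max_fill


def candidate_regions(lines, min_rows=2):
    """Return (first, last) index pairs for runs of consecutive sparse lines."""
    regions = []
    start = None
    for i, line in enumerate(lines):
        if is_sparse_line(line):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_rows:
                regions.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the input.
    if start is not None and len(lines) - start >= min_rows:
        regions.append((start, len(lines) - 1))
    return regions


sample = [
    "This is an ordinary paragraph line that fills most of the available page width ok.",
    "item        qty     price",
    "bolt        12      0.40",
    "Another ordinary paragraph line that also fills most of the available page width.",
]
print(candidate_regions(sample))
```

A filter of this kind only narrows the search space: short headings and paragraph-final lines also pass it, which is why the abstract pairs sparse-line analysis with line labeling by CRF and SVM rather than relying on the filter alone.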

