Automatic categorization of figures in scientific documents
Source
International Conference on Digital Libraries
Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
Chapel Hill, NC, USA
SESSION: Document analysis
Pages: 129 - 138
Year of Publication: 2006
ISBN: 1-59593-354-9
Authors
Xiaonan Lu, The Pennsylvania State University, University Park, Pennsylvania
Prasenjit Mitra, The Pennsylvania State University, University Park, Pennsylvania
James Z. Wang, The Pennsylvania State University, University Park, Pennsylvania
C. Lee Giles, The Pennsylvania State University, University Park, Pennsylvania
ABSTRACT
Figures are an important source of non-textual information in scientific documents. Current digital libraries do not provide users with tools to retrieve documents based on the information available within the figures. We propose an architecture for retrieving documents by integrating figures and other information. The initial step in enabling integrated document search is to categorize figures into a set of pre-defined types. We propose several categories of figures based on their functions in scholarly articles. We have developed a machine-learning-based approach for automatic categorization of figures. Both global features, such as texture, and part features, such as lines, are used in the architecture to discriminate among figure categories. The proposed approach has been evaluated on a testbed document set collected from the CiteSeer scientific literature digital library. Experimental evaluation has demonstrated that our algorithms can produce acceptable results for real-world use. Our tools will be integrated into a scientific document digital library.
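As a rough illustration of how a global feature and a part feature might be combined to categorize figures, here is a minimal sketch. It is not the authors' method: the two features (`texture_energy`, `line_fraction`), the nearest-centroid classifier, the toy images, and the category names "plot" and "photo" are all assumptions made for this example, since the abstract does not specify the actual features, categories, or learning algorithm.

```python
# Hypothetical sketch: one global feature (texture) plus one part feature
# (long straight lines), fed to a nearest-centroid classifier.
# Images are lists of rows of grayscale values in [0.0, 1.0] (0.0 = black).

def texture_energy(img):
    """Global feature: mean absolute difference between horizontal neighbours."""
    diffs = [abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1)]
    return sum(diffs) / len(diffs)

def line_fraction(img, dark=0.5, min_run=0.8):
    """Part feature: share of rows and columns containing a long dark run."""
    def has_long_run(seq):
        best = run = 0
        for value in seq:
            run = run + 1 if value < dark else 0
            best = max(best, run)
        return best >= min_run * len(seq)
    lines = [list(r) for r in img] + [list(c) for c in zip(*img)]
    return sum(has_long_run(s) for s in lines) / len(lines)

def features(img):
    return (texture_energy(img), line_fraction(img))

def train_centroids(labelled):
    """Average the feature vectors of each category (assumed classifier)."""
    sums = {}
    for img, label in labelled:
        f = features(img)
        total, n = sums.get(label, ((0.0, 0.0), 0))
        sums[label] = ((total[0] + f[0], total[1] + f[1]), n + 1)
    return {label: (t[0] / n, t[1] / n) for label, (t, n) in sums.items()}

def classify(img, centroids):
    """Return the category whose centroid is closest in feature space."""
    f = features(img)
    def sq_dist(c):
        return (f[0] - c[0]) ** 2 + (f[1] - c[1]) ** 2
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Toy stand-ins for real figures (purely illustrative).
def make_plot(n=8):
    """White canvas with a dark left axis and a dark bottom axis."""
    return [[0.0 if x == 0 or y == n - 1 else 1.0 for x in range(n)]
            for y in range(n)]

def make_photo(n=8):
    """Busy texture with no long straight dark runs."""
    return [[((x * 7 + y * 3) % 5) / 4 for x in range(n)] for y in range(n)]

if __name__ == "__main__":
    training = [(make_plot(8), "plot"), (make_photo(8), "photo"),
                (make_plot(10), "plot"), (make_photo(10), "photo")]
    centroids = train_centroids(training)
    print(classify(make_plot(12), centroids))
    print(classify(make_photo(12), centroids))
```

The point of the sketch is only that line-like parts and overall texture carry complementary signals: a line plot has low texture and a few long straight runs, while a photograph-like image has high texture and none. A real system would need far richer features and a stronger classifier.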
CITED BY 4
Xiaonan Lu, James Z. Wang, Prasenjit Mitra, C. Lee Giles, Deriving knowledge from figures for digital libraries, Proceedings of the 16th international conference on World Wide Web, May 08-12, 2007, Banff, Alberta, Canada
William Brouwer, Saurabh Kataria, Sujatha Das, Prasenjit Mitra, C. Lee Giles, Segregating and extracting overlapping data points in two-dimensional plots, Proceedings of the 8th ACM/IEEE-CS joint conference on Digital libraries, June 16-20, 2008, Pittsburgh, PA, USA