Text categorization for multi-page documents: a hybrid naive Bayes HMM approach
Source: International Conference on Digital Libraries
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
Roanoke, Virginia, United States
Pages: 11-20
Year of Publication: 2001
ISBN: 1-58113-345-6
Authors
Paolo Frasconi  Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
Giovanni Soda  Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
Alessandro Vullo  Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/379437.379440

ABSTRACT

Text categorization is typically formulated as a concept learning problem where each instance is a single isolated document. In this paper we are interested in a more general formulation where documents are organized as page sequences, as naturally occurring in digital libraries of scanned books and magazines. We describe a method for classifying pages of sequential OCR text documents into one of several assigned categories and suggest that taking into account contextual information provided by the whole page sequence can significantly improve classification accuracy. The proposed architecture relies on hidden Markov models whose emissions are bags of words according to a multinomial word event model, as in the generative portion of the Naive Bayes classifier. Our results on a collection of scanned journals from the Making of America project confirm the importance of using whole page sequences. Empirical evaluation indicates that the error rate (as obtained by running a plain Naive Bayes classifier on isolated pages) can be reduced by roughly half if contextual information is incorporated.
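
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes fully labeled page sequences, one hidden state per category, add-one smoothing, and Viterbi decoding, none of which the abstract specifies. All names (`train`, `emission_logprob`, `viterbi`, the toy labels `front` and `body`) are hypothetical.

```python
import math
from collections import Counter


def train(docs, alpha=1.0):
    """Estimate a toy HMM from labeled documents.

    docs: list of documents, each a list of (tokens, label) pages.
    alpha: add-alpha smoothing constant (an assumption of this sketch).
    """
    labels = sorted({lab for doc in docs for _, lab in doc})
    vocab = sorted({w for doc in docs for toks, _ in doc for w in toks})
    words = {c: Counter() for c in labels}   # per-category word counts
    start = Counter()                        # category of each first page
    trans = {c: Counter() for c in labels}   # category-to-category counts
    for doc in docs:
        prev = None
        for toks, lab in doc:
            words[lab].update(toks)
            if prev is None:
                start[lab] += 1
            else:
                trans[prev][lab] += 1
            prev = lab

    def log_norm(counts, keys):
        # Smoothed log-probabilities over the given outcome set.
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        return {k: math.log((counts[k] + alpha) / total) for k in keys}

    return {
        "labels": labels,
        "start": log_norm(start, labels),
        "trans": {c: log_norm(trans[c], labels) for c in labels},
        "emit": {c: log_norm(words[c], vocab) for c in labels},
    }


def emission_logprob(model, tokens, label):
    # Multinomial bag-of-words score, as in Naive Bayes; unseen words are skipped.
    emit = model["emit"][label]
    return sum(emit[w] for w in tokens if w in emit)


def viterbi(model, pages):
    """Return the most likely category sequence for a list of token lists."""
    labels = model["labels"]
    score = {c: model["start"][c] + emission_logprob(model, pages[0], c)
             for c in labels}
    back = []
    for tokens in pages[1:]:
        new_score, pointers = {}, {}
        for c in labels:
            best = max(labels, key=lambda p: score[p] + model["trans"][p][c])
            pointers[c] = best
            new_score[c] = (score[best] + model["trans"][best][c]
                            + emission_logprob(model, tokens, c))
        score = new_score
        back.append(pointers)
    path = [max(labels, key=lambda c: score[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy training data (invented for illustration only).
toy_doc = [(["contents", "index"], "front"),
           (["story", "text"], "body"),
           (["story", "text"], "body")]
model = train([toy_doc, toy_doc])
print(viterbi(model, [["contents"], ["story"], ["text"]]))
```

In this toy setting, a page containing only unseen words carries no emission evidence, so an isolated-page classifier cannot separate the categories, while the sequence model still assigns a label from the transition structure. That is the kind of contextual information the abstract refers to.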


