Text categorization for multi-page documents: a hybrid naive Bayes HMM approach
Source
International Conference on Digital Libraries
Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries
Roanoke, Virginia, United States
Pages: 11 - 20
Year of Publication: 2001
ISBN: 1-58113-345-6
Authors
Paolo Frasconi, Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
Giovanni Soda, Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
Alessandro Vullo, Department of Systems and Computer Science, University of Florence, 50139 Firenze, Italy
ABSTRACT
Text categorization is typically formulated as a concept learning problem in which each instance is a single isolated document. In this paper we are interested in a more general formulation in which documents are organized as page sequences, as occurs naturally in digital libraries of scanned books and magazines. We describe a method for classifying the pages of sequential OCR text documents into one of several assigned categories, and we suggest that taking into account the contextual information provided by the whole page sequence can significantly improve classification accuracy. The proposed architecture relies on hidden Markov models whose emissions are bags of words generated according to a multinomial word event model, as in the generative portion of the Naive Bayes classifier. Our results on a collection of scanned journals from the Making of America project confirm the importance of using whole page sequences. Empirical evaluation indicates that the error rate obtained by running a plain Naive Bayes classifier on isolated pages can be reduced by roughly half if contextual information is incorporated.
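The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes one hidden state per page category, parameters estimated from labeled page sequences by counting with Laplace smoothing, and Viterbi decoding to label a new sequence. The function names (`train`, `page_loglik`, `isolated_page_labels`, `viterbi`), the categories, and the toy data are all hypothetical. The isolated-page baseline uses a uniform class prior for brevity.

```python
import math
from collections import Counter


def train(sequences, categories, vocab, alpha=1.0):
    """Estimate log-parameters from labeled page sequences by smoothed counting.

    sequences: list of documents, each a list of (category, words) pages.
    Assumption: one hidden state per category (hypothetical simplification).
    """
    init = Counter()
    trans = {c: Counter() for c in categories}
    emit = {c: Counter() for c in categories}
    for doc in sequences:
        prev = None
        for cat, words in doc:
            if prev is None:
                init[cat] += 1          # category of the first page
            else:
                trans[prev][cat] += 1   # page-to-page category transition
            emit[cat].update(w for w in words if w in vocab)
            prev = cat

    def log_norm(counts, keys):
        # Laplace-smoothed relative frequencies, returned as log-probabilities.
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        return {k: math.log((counts[k] + alpha) / total) for k in keys}

    return {
        "init": log_norm(init, categories),
        "trans": {c: log_norm(trans[c], categories) for c in categories},
        "emit": {c: log_norm(emit[c], vocab) for c in categories},
    }


def page_loglik(model, cat, words):
    """Multinomial bag-of-words log-likelihood of one page (OOV words skipped)."""
    e = model["emit"][cat]
    return sum(e[w] for w in words if w in e)


def isolated_page_labels(model, pages, categories):
    """Baseline: label each page on its own (uniform class prior assumed)."""
    return [max(categories, key=lambda c: page_loglik(model, c, words))
            for words in pages]


def viterbi(model, pages, categories):
    """Most likely category sequence for a whole page sequence."""
    scores = {c: model["init"][c] + page_loglik(model, c, pages[0])
              for c in categories}
    back = []
    for words in pages[1:]:
        new_scores, pointers = {}, {}
        for c in categories:
            best_prev = max(categories,
                            key=lambda p: scores[p] + model["trans"][p][c])
            new_scores[c] = (scores[best_prev] + model["trans"][best_prev][c]
                             + page_loglik(model, c, words))
            pointers[c] = best_prev
        scores = new_scores
        back.append(pointers)
    path = [max(categories, key=lambda c: scores[c])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy data, invented for illustration only.
categories = ["article", "ads"]
vocab = ["story", "river", "new", "buy", "soap"]
training = [[
    ("article", ["story", "river"]),
    ("article", ["story", "new"]),
    ("article", ["river", "story"]),
    ("ads", ["buy", "soap"]),
    ("ads", ["buy", "new", "soap"]),
]]
model = train(training, categories, vocab)
pages = [["story", "river"], ["new"], ["story"]]
print(isolated_page_labels(model, pages, categories))
print(viterbi(model, pages, categories))
```

On this toy data the middle page contains only the ambiguous word "new". The isolated-page baseline labels it "ads", while the sequence decoder keeps it "article" because its neighbors are article pages. This is the kind of contextual correction the abstract describes, shown here on invented data rather than on the Making of America collection.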
INDEX TERMS
Primary Classification:
I. Computing Methodologies
I.2 ARTIFICIAL INTELLIGENCE
Additional Classification:
H. Information Systems
H.3 INFORMATION STORAGE AND RETRIEVAL
H.3.1 Content Analysis and Indexing
Subjects: Indexing methods
I. Computing Methodologies
I.7 DOCUMENT AND TEXT PROCESSING
General Terms: Design, Documentation, Experimentation, Human Factors, Management, Measurement, Performance, Theory
Keywords: hidden Markov models, multi-page documents, naive Bayes classifier, text categorization