A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books
Source: International Conference on Digital Libraries
Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries
Chapel Hill, NC, USA
Session: Document analysis
Pages: 109 - 118  
Year of Publication: 2006
ISBN:1-59593-354-9
Authors
Shaolei Feng  University of Massachusetts, Amherst, MA
R. Manmatha  University of Massachusetts, Amherst, MA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1141753.1141776

ABSTRACT

A number of projects are creating searchable digital libraries of printed books, including the Million Book Project, the Google Book project, and similar efforts from Yahoo and Microsoft. Content-based online book retrieval usually requires first converting printed text into machine-readable (e.g., ASCII) text with an optical character recognition (OCR) engine and then running full-text search on the results. Many of these books are old, and a variety of processing steps are required to create an end-to-end system. Changing any step (including the scanning process) can affect OCR performance, so a good automatic statistical evaluation of OCR performance on book-length material is needed. Evaluating OCR performance on an entire book is non-trivial. The only easily obtainable ground truth (the Gutenberg e-texts) must be automatically aligned with the OCR output over the entire length of a book. This may be viewed as equivalent to the problem of aligning two large sequences, with lengths easily reaching a million. The problem is further complicated by OCR errors and by the possibility that large chunks of material are missing from one of the sequences. We propose a hierarchical alignment algorithm based on Hidden Markov Models (HMMs) to align OCR output with the ground truth for books. We believe this is the first work to automatically align a whole book without using any book structure information. The alignment process breaks the problem of aligning two long sequences into the problem of aligning many smaller subsequences, which can be done rapidly and effectively. Experimental results show that our hierarchical alignment approach works very well even when the OCR output has a high recognition error rate. Finally, we use the alignment results to evaluate the performance of a commercial OCR engine over a large dataset of books.
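
The divide-and-align idea can be illustrated with a minimal Python sketch. This is not the HMM-based algorithm from the paper: as a stand-in, it assumes that words occurring exactly once in both the ground truth and the OCR output can serve as anchors, and it aligns the short segments between anchors with the standard library's `difflib.SequenceMatcher`. The function names (`unique_word_anchors`, `count_matches`, `word_accuracy`) and the word-level accuracy measure are hypothetical choices made for illustration only.

```python
from bisect import bisect_left
from collections import Counter
from difflib import SequenceMatcher


def unique_word_anchors(truth, ocr):
    """Return (truth_index, ocr_index) pairs for words that occur exactly
    once in both sequences and appear in the same relative order."""
    truth_counts, ocr_counts = Counter(truth), Counter(ocr)
    ocr_pos = {w: j for j, w in enumerate(ocr) if ocr_counts[w] == 1}
    pairs = [(i, ocr_pos[w]) for i, w in enumerate(truth)
             if truth_counts[w] == 1 and w in ocr_pos]

    # Longest increasing subsequence over the OCR indices keeps only
    # anchors that are monotonic in both sequences.
    tails, tail_idx, prev = [], [], [None] * len(pairs)
    for k, (_, j) in enumerate(pairs):
        pos = bisect_left(tails, j)
        if pos == len(tails):
            tails.append(j)
            tail_idx.append(k)
        else:
            tails[pos] = j
            tail_idx[pos] = k
        prev[k] = tail_idx[pos - 1] if pos > 0 else None

    chain = []
    k = tail_idx[-1] if tail_idx else None
    while k is not None:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]


def count_matches(truth, ocr):
    """Count ground-truth words matched in the OCR output by aligning
    only the short segments that lie between consecutive anchors."""
    anchors = unique_word_anchors(truth, ocr)
    bounds = [(-1, -1)] + anchors + [(len(truth), len(ocr))]
    matched = len(anchors)  # every anchor is itself a matched word
    for (i0, j0), (i1, j1) in zip(bounds, bounds[1:]):
        seg_truth, seg_ocr = truth[i0 + 1:i1], ocr[j0 + 1:j1]
        sm = SequenceMatcher(None, seg_truth, seg_ocr, autojunk=False)
        matched += sum(block.size for block in sm.get_matching_blocks())
    return matched


def word_accuracy(truth_text, ocr_text):
    """Fraction of ground-truth words recovered in the OCR output."""
    truth, ocr = truth_text.split(), ocr_text.split()
    return count_matches(truth, ocr) / len(truth) if truth else 1.0


if __name__ == "__main__":
    truth_text = "the quick brown fox jumps over the lazy dog"
    ocr_text = "the qu1ck brown fox jumps ovcr the lazy dog"
    print(f"word accuracy: {word_accuracy(truth_text, ocr_text):.3f}")
```

Because each segment between anchors is aligned independently, a large chunk missing from one sequence only affects the segment that contains it. A real system would also need anchors that tolerate OCR errors, which is the role the HMM plays in the approach described above.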

