Topic based language models for OCR correction
Source: AND; Vol. 303
Proceedings of the second workshop on Analytics for noisy unstructured text data
Singapore
Pages 107-112  
Year of Publication: 2008
ISBN:978-1-60558-196-5
Authors
Anurag Bhardwaj  University at Buffalo, Amherst, NY
Faisal Farooq  University at Buffalo, Amherst, NY
Huaigu Cao  University at Buffalo, Amherst, NY
Venu Govindaraju  University at Buffalo, Amherst, NY
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1390749.1390766

ABSTRACT

Despite several decades of research in document analysis, recognition of unconstrained handwritten documents is still considered a challenging task. Previous research in this area has shown that word recognizers produce reasonably clean output when used with a restricted lexicon. In the absence of such a lexicon, however, the output of an unconstrained handwritten word recognizer is noisy. The objective of this research is to process noisy recognizer output and eliminate spurious recognition choices using a topic based language model. We construct a topic based language model for every document using manually categorized training data. A topic categorization sub-system based on a Maximum Entropy model is also trained and used to generate the topic distribution of a test document. A given test word image is processed by the recognizer, and its word recognition likelihood is refined by incorporating the topic distribution of the document and the topic based language model probability. The proposed method is evaluated on the publicly available IAM dataset, and experimental results show a significant improvement in word recognition accuracy, from 32% to 40%, on a test set of 4033 word images extracted from 70 handwritten document images.
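
The refinement step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact combination rule, so the code below assumes one plausible form: the recognizer likelihood of each candidate word is multiplied by a document-level word probability obtained by mixing per-topic unigram models with the document's topic distribution. All function names, the smoothing floor, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict

# Hypothetical smoothing value for words missing from a topic model (an assumption).
FLOOR = 1e-6


def topic_lm_probability(word: str,
                         topic_dist: Dict[str, float],
                         topic_lms: Dict[str, Dict[str, float]],
                         floor: float = FLOOR) -> float:
    """Assumed mixture: P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(p_topic * topic_lms.get(topic, {}).get(word, floor)
               for topic, p_topic in topic_dist.items())


def rescore(candidates: Dict[str, float],
            topic_dist: Dict[str, float],
            topic_lms: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Multiply each recognizer likelihood by the topic-mixture probability, then normalize."""
    scores = {word: likelihood * topic_lm_probability(word, topic_dist, topic_lms)
              for word, likelihood in candidates.items()}
    total = sum(scores.values())
    if total == 0:
        return scores
    return {word: score / total for word, score in scores.items()}


def best_word(candidates: Dict[str, float],
              topic_dist: Dict[str, float],
              topic_lms: Dict[str, Dict[str, float]]) -> str:
    """Return the candidate with the highest refined score."""
    rescored = rescore(candidates, topic_dist, topic_lms)
    return max(rescored, key=rescored.get)


# Toy data (made up for illustration only).
# Recognizer likelihoods for one word image: the recognizer slightly prefers "coal".
candidates = {"goal": 0.40, "coal": 0.45, "goat": 0.15}
# Topic distribution of the document, as a topic categorizer might produce it.
topic_dist = {"sports": 0.8, "industry": 0.2}
# Per-topic unigram probabilities.
topic_lms = {
    "sports": {"goal": 0.05, "coal": 0.001},
    "industry": {"goal": 0.002, "coal": 0.04},
}

print(best_word(candidates, topic_dist, topic_lms))  # prints "goal"
```

In this toy case the recognizer alone would choose "coal", but the document's topic distribution favors "sports", so the refined score selects "goal". A product of probabilities is only one way to combine the two sources of evidence; a weighted log-linear combination would be an equally reasonable assumption.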


