Compression of scan-digitized Indian language printed text: a soft pattern matching technique
Source: Document Engineering
Proceedings of the 2003 ACM Symposium on Document Engineering
Grenoble, France
SESSION: Optimizing document format
Pages: 185 - 192
Year of Publication: 2003
ISBN: 1-58113-724-9
Authors
U. Garain  Indian Statistical Institute, India
S. Debnath  Regional Engineering College, West Bengal, India
A. Mandal  Defense Research & Development Organization, Pune, India
B. B. Chaudhuri  Indian Statistical Institute, Kolkata, India
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/958220.958254

ABSTRACT

In this paper, a new compression scheme is presented for Indian Language (IL) textual document images. Because OCR technology for IL scripts is not yet mature enough, bringing these documents into the digital domain requires new techniques that achieve a high degree of compression and also support operations such as document indexing and retrieval. The proposed method is based on symbolic compression, realized with an efficient segmentation-based clustering approach. A soft pattern-matching technique has been implemented using two different feature sets that cooperate with each other to build an efficient prototype library. Experiments have been carried out on documents printed in Devnagari (Hindi) and Bangla, the two most widely used scripts in the Indian subcontinent. Test results show that the proposed technique outperforms several standard methods, such as CCITT Group-4 and JBIG, that are frequently used for compression of document images.
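
The general idea of symbolic compression with a prototype library can be illustrated with a minimal sketch. The abstract does not specify the two feature sets, the thresholds, or the segmentation step, so everything below is an assumption made for illustration: glyphs are taken to be already segmented, non-empty, equal-sized binary bitmaps; the two features are a pixel-mismatch fraction and a row-profile difference; and the names `soft_match`, `build_library`, `pixel_tol`, and `profile_tol` are hypothetical. It is not the authors' method.

```python
from typing import List, Tuple

# A glyph is a tuple of rows, each row a tuple of 0/1 pixels (assumed format).
Glyph = Tuple[Tuple[int, ...], ...]


def pixel_mismatch(a: Glyph, b: Glyph) -> float:
    """Feature 1 (assumed): fraction of pixels that differ between two glyphs."""
    total = sum(len(row) for row in a)
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def profile_mismatch(a: Glyph, b: Glyph) -> float:
    """Feature 2 (assumed): difference in per-row ink counts, normalized by area."""
    total = sum(len(row) for row in a)
    return sum(abs(sum(ra) - sum(rb)) for ra, rb in zip(a, b)) / total


def soft_match(a: Glyph, b: Glyph,
               pixel_tol: float = 0.15, profile_tol: float = 0.15) -> bool:
    """Accept a match only when both features agree within their tolerances."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return False  # this sketch only compares glyphs of identical size
    return (pixel_mismatch(a, b) <= pixel_tol
            and profile_mismatch(a, b) <= profile_tol)


def build_library(glyphs: List[Glyph]) -> Tuple[List[Glyph], List[int]]:
    """Greedy clustering: reuse the first matching prototype, else add a new one.

    Returns the prototype library and, for each input glyph, the index of the
    prototype that represents it.
    """
    library: List[Glyph] = []
    indices: List[int] = []
    for glyph in glyphs:
        for i, proto in enumerate(library):
            if soft_match(glyph, proto):
                indices.append(i)
                break
        else:
            library.append(glyph)
            indices.append(len(library) - 1)
    return library, indices
```

In a scheme of this kind, the compressed document would store each prototype bitmap once, plus a stream of prototype indices and glyph positions, instead of every glyph bitmap. Requiring both features to agree is one simple way for two feature sets to "cooperate"; the paper may combine them differently.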



