ABSTRACT
In some Thai documents, a single text line of a printed page may contain words in both Thai and Roman scripts. For Optical Character Recognition (OCR) of such a page, it is better to first identify the Thai and Roman portions and then apply the OCR system of the respective script to each identified portion. This article proposes an SVM-based method for word-wise identification of printed Roman and Thai scripts within a single line of a document page. The document is first segmented into lines, and the lines are then segmented into character groups (words). The proposed scheme identifies the script of a character group by combining character features obtained from structural shape, profile behavior, component overlapping information, topological properties, the water reservoir concept, and others. In an experiment on 10,000 words, the proposed scheme achieved 99.62% script identification accuracy.
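The two segmentation steps mentioned in the abstract (page to lines, lines to character groups) can be illustrated with a minimal sketch. The abstract does not say how segmentation is performed, so the projection-profile approach below is an assumption, not the authors' method. The function names (`runs`, `segment_lines`, `segment_words`), the `min_gap` threshold, and the toy `page` array are all hypothetical. The image is assumed to be binarized, with 1 for ink and 0 for background. Feature extraction and the SVM classifier are not shown.

```python
def runs(profile, min_gap=1):
    """Return (start, end) spans of non-zero entries in a projection profile.

    Spans separated by fewer than min_gap zero entries are merged, so small
    gaps (e.g. between characters of one word) do not split a span.
    """
    spans = []
    start = None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i                      # a run of ink begins
        elif value == 0 and start is not None:
            spans.append((start, i))       # the run ends before index i
            start = None
    if start is not None:
        spans.append((start, len(profile)))

    merged = []
    for s, e in spans:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)  # gap too small: join with previous span
        else:
            merged.append((s, e))
    return merged


def segment_lines(image):
    """Split a binary page into text lines using the row-wise ink count."""
    row_profile = [sum(row) for row in image]
    return runs(row_profile)


def segment_words(image, line, min_gap=2):
    """Split one text line into character groups using the column-wise ink count.

    min_gap is an assumed threshold: gaps narrower than this many blank
    columns are treated as inter-character rather than inter-word gaps.
    """
    top, bottom = line
    width = len(image[0])
    col_profile = [sum(image[r][c] for r in range(top, bottom)) for c in range(width)]
    return runs(col_profile, min_gap)


# Toy binary page: two text lines separated by one blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
]

lines = segment_lines(page)
words_in_first_line = segment_words(page, lines[0], min_gap=2)
print(lines, words_in_first_line)
```

Each resulting span would then be cropped and passed to the script classifier. A single fixed gap threshold is the simplest choice; Thai vowel and tone marks placed above or below the baseline may call for a more careful line-splitting rule in practice.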