CRF-based authors' name tagging for scanned documents
Source
International Conference on Digital Libraries
Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries
Pittsburgh, PA, USA
SESSION: Content from documents
Pages 272-275  
Year of Publication: 2008
ISBN: 978-1-59593-998-2
Authors
Manabu Ohta, Okayama University, Okayama, Japan
Atsuhiro Takasu, National Institute of Informatics, Tokyo, Japan
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 17,   Downloads (12 Months): 73,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1378889.1378935

ABSTRACT

Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. Those who create metadata for digital libraries would therefore benefit from an automatic method for extracting such bibliographic data from printed documents. In this paper, we describe an automatic author name tagger for academic articles that have been scanned and marked up by optical character recognition (OCR). The method uses conditional random fields (CRFs) to label the unsegmented character strings in authors' blocks as belonging to either an author name or a delimiter. We applied the tagger to Japanese academic articles. In our experiments, it correctly labeled more than 99% of the author name strings, compared with under 96% for our previous tagger based on a hidden Markov model (HMM).
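
The labeling step described in the abstract can be illustrated with a minimal sketch of linear-chain decoding over characters. This is not the authors' implementation: the delimiter set, the two labels "A" (author) and "D" (delimiter), the hand-set scores, and the sample names are all assumptions made for illustration. A real CRF tagger would learn its weights from labeled authors' blocks and would likely use richer features than a single character test.

```python
from typing import Dict, List, Tuple

LABELS = ["A", "D"]  # A = author-name character, D = delimiter character

# Hypothetical delimiter characters; the paper's actual features are not given here.
DELIMS = set(",、・ ")

# Hand-set transition scores (stand-ins for learned weights):
# staying in the same label is mildly preferred.
TRANSITION: Dict[Tuple[str, str], float] = {
    ("A", "A"): 0.5, ("D", "D"): 0.5, ("A", "D"): 0.0, ("D", "A"): 0.0,
}


def emission(label: str, ch: str) -> float:
    """Hand-set score for assigning `label` to character `ch`."""
    looks_like_delim = ch in DELIMS
    if label == "D":
        return 2.0 if looks_like_delim else -2.0
    return -2.0 if looks_like_delim else 2.0


def viterbi(chars: str) -> List[str]:
    """Return the highest-scoring label sequence for an unsegmented string."""
    if not chars:
        return []
    score = [{lab: emission(lab, chars[0]) for lab in LABELS}]
    back = []  # back[i][lab] = best previous label for position i + 1
    for ch in chars[1:]:
        prev = score[-1]
        cur, ptr = {}, {}
        for lab in LABELS:
            best_prev = max(LABELS, key=lambda p: prev[p] + TRANSITION[(p, lab)])
            cur[lab] = prev[best_prev] + TRANSITION[(best_prev, lab)] + emission(lab, ch)
            ptr[lab] = best_prev
        score.append(cur)
        back.append(ptr)
    # Trace the best path backwards from the best final label.
    path = [max(LABELS, key=lambda lab: score[-1][lab])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_names(chars: str) -> List[str]:
    """Group consecutive characters labeled "A" into author name strings."""
    names, current = [], ""
    for ch, lab in zip(chars, viterbi(chars)):
        if lab == "A":
            current += ch
        elif current:
            names.append(current)
            current = ""
    if current:
        names.append(current)
    return names


# Made-up authors' block: two names separated by an ideographic comma.
print(extract_names("山田太郎、鈴木一郎"))
```

With these hand-set scores the decoder simply separates delimiter characters from everything else; the benefit of a trained CRF comes from learned weights that can resolve ambiguous characters, such as OCR errors inside a name, using their context.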

