Document digitization lifecycle for complex magazine collection
Source: Document Engineering
Proceedings of the 2005 ACM symposium on Document engineering
Bristol, United Kingdom
SESSION: Techniques for document management and document engineering
Pages: 197 - 206  
Year of Publication: 2005
ISBN: 1-59593-240-2
Authors
Sherif Yacoub  Hewlett-Packard, Spain
John Burns  Hewlett-Packard, Spain
Paolo Faraboschi  Hewlett-Packard, Spain
Daniel Ortega  Hewlett-Packard, Spain
Jose Abad Peiro  Hewlett-Packard, Spain
Vinay Saxena  Hewlett-Packard, USA
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1096601.1096650

ABSTRACT

The conversion of large collections of documents from paper to digital formats that are suitable for electronic archival is a complex multi-phase process. The creation of good-quality images from paper documents is just one phase. To extract the relevant information they contain, with an accuracy that fits the purpose of target applications, an automated document analysis system and a manual verification/review process are needed. The automated system needs to perform a variety of analysis and recognition tasks in order to reach an accuracy level that minimizes the manual correction effort downstream. This paper describes the complete process and the associated technologies, tools, and systems needed for the conversion of a large collection of complex documents and its deployment for online web access to its information-rich content. We used this process to recapture 80 years of Time magazines. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.

