ABSTRACT
The conversion of large collections of documents from paper to digital formats suitable for electronic archival is a complex multi-phase process. The creation of good-quality images from paper documents is just one phase. To extract the relevant information these documents contain, with an accuracy that fits the purpose of the target applications, an automated document analysis system and a manual verification/review process are needed. The automated system needs to perform a variety of analysis and recognition tasks in order to reach an accuracy level that minimizes the manual correction effort downstream. This paper describes the complete process and the associated technologies, tools, and systems needed to convert a large collection of complex documents and deploy its information-rich content for online web access. We used this process to recapture 80 years of Time magazine. The historical collection is scanned, automatically processed by advanced document analysis components to extract articles, manually verified for accuracy, and converted into a form suitable for web access. We discuss the major phases of the conversion lifecycle and the technology developed and tools used for each phase. We also discuss results in terms of recognition accuracy.