ABSTRACT
No existing document image understanding technology, whether experimental or commercially available, can guarantee high accuracy across the full range of documents of interest to industrial and government agency users. Ideally, users should be able to search, access, examine, and navigate among document images as effectively as they can among encoded data files, using familiar interfaces and tools as fully as possible. We are investigating novel algorithms and software tools at the frontiers of document image analysis, information retrieval, text mining, and visualization that will help integrate such document images fully into collections that hold both textual document images and "born digital" documents. Our approaches emphasize "versatility first": that is, methods that work reliably across the broadest possible range of documents.