Book search: indexing the valuable parts
Source
Conference on Information and Knowledge Management
Proceeding of the 2008 ACM workshop on Research advances in large digital book repositories
Napa Valley, California, USA
SESSION: Content representation and discoverability
Pages 53-56
Year of Publication: 2008
ISBN: 978-1-60558-249-8
Authors
Walid Magdy  Cairo Microsoft Innovation Center, Abou Rawash, Egypt
Kareem Darwish  Cairo Microsoft Innovation Center, Abou Rawash, Egypt
Sponsors
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
SIGIR: ACM Special Interest Group on Information Retrieval
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1458412.1458429

ABSTRACT

With massive book digitization efforts underway, effective book retrieval strategies are needed. This paper explores the relative contribution of different parts of digitized, OCR'ed books to retrieval effectiveness. The parts examined are the entire content of books, book headings, book titles, and table of contents entries. Results show that indexing only the headings and titles of books is nearly as effective as indexing their entire contents, which indicates that titles and headings are more valuable than other parts of a book. This is akin to web search, where hypertext and page titles are more valuable to index than the rest of the webpage. In addition, a combination-of-evidence approach further improves retrieval effectiveness over using any single portion of the book in isolation.
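
The idea of indexing book parts separately and then combining the evidence can be illustrated with a small sketch. It is not the authors' system: the abstract does not say which ranking function or combination method the paper uses, so the TF-IDF scoring, the max-normalization, the weighted sum, the field names, and the toy collection `BOOKS` below are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical toy collection. Each book is split into the four parts
# named in the abstract: title, headings, table of contents, full content.
BOOKS = {
    "b1": {"title": "ocr errors in retrieval",
           "headings": "ocr errors correction",
           "toc": "introduction ocr errors",
           "content": "scanned books contain ocr errors that hurt retrieval"},
    "b2": {"title": "cooking for beginners",
           "headings": "soups salads",
           "toc": "soups salads desserts",
           "content": "a recipe book that mentions ocr once"},
    "b3": {"title": "history of printing",
           "headings": "presses type",
           "toc": "presses type paper",
           "content": "printing presses and movable type"},
}


def field_scores(query, field):
    """Score every book for `query` using one field only (simple TF-IDF)."""
    docs = {bid: Counter(parts[field].split()) for bid, parts in BOOKS.items()}
    n = len(docs)
    scores = {}
    for bid, tf in docs.items():
        score = 0.0
        for term in query.split():
            df = sum(1 for d in docs.values() if term in d)
            if df and tf[term]:
                score += (1 + math.log(tf[term])) * math.log(1 + n / df)
        scores[bid] = score
    return scores


def combined_scores(query, weights):
    """Assumed combination of evidence: weighted sum of max-normalized field scores."""
    total = {bid: 0.0 for bid in BOOKS}
    for field, weight in weights.items():
        scores = field_scores(query, field)
        top = max(scores.values()) or 1.0  # avoid dividing by zero
        for bid, score in scores.items():
            total[bid] += weight * score / top
    return total


if __name__ == "__main__":
    # Equal weights are an arbitrary choice for the illustration.
    weights = {"title": 1.0, "headings": 1.0, "toc": 1.0, "content": 1.0}
    combined = combined_scores("ocr errors", weights)
    for bid in sorted(combined, key=combined.get, reverse=True):
        print(bid, round(combined[bid], 3))
```

Setting a field's weight to zero gives the corresponding single-part index, which is how the individual parts could be compared against the combination in a sketch like this one.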



