Pruning long documents for distributed information retrieval
Source Conference on Information and Knowledge Management
Proceedings of the eleventh international conference on Information and knowledge management
McLean, Virginia, USA
SESSION: Information retrieval 1
Pages: 332 - 339  
Year of Publication: 2002
ISBN:1-58113-492-4
Authors
Jie Lu  Carnegie Mellon University, Pittsburgh, PA
Jamie Callan  Carnegie Mellon University, Pittsburgh, PA
Sponsors
SIGMIS: ACM Special Interest Group on Management Information Systems
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 5,   Downloads (12 Months): 36,   Citation Count: 10
DOI: http://doi.acm.org/10.1145/584792.584847

ABSTRACT

Query-based sampling is a method of discovering the contents of a text database by submitting queries to a search engine and observing the documents returned. In prior research, sampled documents were used to build resource descriptions for automatic database selection and to build a centralized sample database for query expansion and result merging. An unstated assumption was that the associated storage costs were acceptable. When sampled documents are long, storage costs can be large. This paper investigates methods of pruning long documents to reduce storage costs. The experimental results demonstrate that building resource descriptions and centralized sample databases from the pruned contents of sampled documents can reduce storage costs by 54-93% while causing only minor losses in the accuracy of distributed information retrieval.
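
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy `search` function, the rule of keeping only the first `max_terms` terms of each document, and every name and parameter below are assumptions made for illustration, since the abstract does not say which pruning methods were evaluated.

```python
import random
from collections import Counter


def search(database, query, k=4):
    """Toy stand-in for a search engine: up to k documents containing the query term."""
    return [doc for doc in database if query in doc.split()][:k]


def prune(doc, max_terms=50):
    """Assumed pruning rule: keep only the first max_terms terms of a document."""
    return " ".join(doc.split()[:max_terms])


def query_based_sample(database, seed_term, num_queries=20, k=4, max_terms=50, rng=None):
    """Sample a database through its search interface, storing pruned documents only."""
    rng = rng or random.Random(0)
    sampled = {}               # original document -> pruned text that is stored
    vocabulary = [seed_term]   # candidate query terms seen so far
    for _ in range(num_queries):
        query = rng.choice(vocabulary)
        for doc in search(database, query, k):
            if doc not in sampled:
                sampled[doc] = prune(doc, max_terms)
                # Later queries are drawn from terms of the stored (pruned) text.
                vocabulary.extend(sampled[doc].split())
    # Resource description: term frequencies over the pruned sample.
    description = Counter(t for text in sampled.values() for t in text.split())
    return sampled, description


def storage_saving(sampled):
    """Fraction of characters saved by storing pruned text instead of full documents."""
    before = sum(len(doc) for doc in sampled)
    after = sum(len(text) for text in sampled.values())
    return 1 - after / before
```

Here `sampled.values()` plays the role of the centralized sample database and `description` the role of the resource description; `storage_saving` reports the kind of reduction the abstract quantifies, although the numbers it produces on toy data say nothing about the paper's results.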

