ABSTRACT
The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .gov data shows that a purely distributed approach would not be feasible for providing search on a .gov portal because of the large number (17,000+) of web sites and the high proportion that do not provide a search interface. An alternative hybrid approach, combining both distributed and centralized techniques, is proposed and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against representatives (highest anchor ranked page (HARP) and anchor weighted sum (AWSUM)) of a family of new selection methods which use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI and also outperformed a method based on Kullback-Leibler Divergence (extended) except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by extended KL Divergence.
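The abstract names HARP and AWSUM but does not define them. The following is a minimal sketch of how the two selection rules might work, read from their names alone: HARP is assumed to rank each site by the position of its highest-ranked page in an anchortext ranking, and AWSUM is assumed to rank sites by the sum of the anchortext scores of their pages. The input format, the function names, the use of the URL host as the site identifier, and the example scores are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def site_of(url):
    """Use the URL host as the site identifier (an assumption of this sketch)."""
    return urlsplit(url).netloc.lower()


def harp(ranked_pages):
    """Order sites by the rank of their highest-ranked page.

    ranked_pages: list of (url, score) pairs, already sorted by descending
    anchortext score for the query (hypothetical input format).
    """
    best_rank = {}
    for rank, (url, _score) in enumerate(ranked_pages):
        # Keep only the first (best) rank seen for each site.
        best_rank.setdefault(site_of(url), rank)
    return sorted(best_rank, key=best_rank.get)


def awsum(ranked_pages):
    """Order sites by the sum of the anchortext scores of their pages."""
    totals = defaultdict(float)
    for url, score in ranked_pages:
        totals[site_of(url)] += score
    # Highest total first; ties keep first-seen order because sorted() is stable.
    return sorted(totals, key=lambda site: -totals[site])


# Made-up scores, for illustration only.
ranked = [
    ("http://www.irs.gov/forms", 9.0),
    ("http://www.nasa.gov/", 5.0),
    ("http://www.nasa.gov/missions", 4.5),
    ("http://www.epa.gov/", 1.0),
]
harp_order = harp(ranked)
awsum_order = awsum(ranked)
```

On this made-up input the two rules disagree about the top site: the HARP sketch puts www.irs.gov first because it owns the single best page, while the AWSUM sketch puts www.nasa.gov first because its two pages together score 9.5 against 9.0.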