| Evaluating sampling methods for uncooperative collections |
| Full text |
Pdf
(376 KB)
|
Source
|
Annual ACM Conference on Research and Development in Information Retrieval
archive
Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
table of contents
Amsterdam, The Netherlands
SESSION: Collection representation in distributed IR
table of contents
Pages: 503 - 510
Year of Publication: 2007
ISBN:978-1-59593-597-7
|
|
Authors
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 3, Downloads (12 Months): 77, Citation Count: 4
|
|
|
ABSTRACT
Many server selection methods suitable for distributed information retrieval applications rely, in the absence of cooperation, on the availability of unbiased samples of documents from the constituent collections. We describe a number of sampling methods which depend only on the normal query-response mechanism of the applicable search facilities. We evaluate these methods on a number of collections typical of a personal metasearch application. Results demonstrate that biases exist for all methods, particularly toward longer documents, and that in some cases these biases can be reduced but not eliminated by choice of parameters.We also introduce a new sampling technique, "multiple queries", which produces samples of similar quality to the best current techniques but with significantly reduced cost.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
| |
3
|
|
 |
4
|
Andrei Broder , Marcus Fontura , Vanja Josifovski , Ravi Kumar , Rajeev Motwani , Shubha Nabar , Rina Panigrahy , Andrew Tomkins , Ying Xu, Estimating corpus size via queries, Proceedings of the 15th ACM international conference on Information and knowledge management, November 06-11, 2006, Arlington, Virginia, USA
[doi> 10.1145/1183614.1183699]
|
 |
5
|
James P. Callan , Zhihong Lu , W. Bruce Croft, Searching distributed collections with inference networks, Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, p.21-28, July 09-13, 1995, Seattle, Washington, United States
[doi> 10.1145/215206.215328]
|
| |
6
|
N. Craswell, D. Hawking, and P. Thistlewaite. Merging results from isolated search engines. In Proc. Australasian Database Conference, 1999.
|
 |
7
|
|
| |
8
|
|
 |
9
|
King-Lup Liu , Adrain Santoso , Clement Yu , Weiyi Meng, Discovering the representative of a search engine, Proceedings of the tenth international conference on Information and knowledge management, October 05-10, 2001, Atlanta, Georgia, USA
[doi> 10.1145/502585.502696]
|
| |
10
|
Open directory project. http://dmoz.org/.
|
 |
11
|
Allison L. Powell , James C. French , Jamie Callan , Margaret Connell , Charles L. Viles, The impact of database selection on distributed searching, Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, p.232-239, July 24-28, 2000, Athens, Greece
[doi> 10.1145/345508.345584]
|
| |
12
|
P. Rusmevichientong, D. M. Pennock, S. Lawrence, and C. L. Giles. Methods for sampling pages uniformly from the world wide web. In Proc. AAAI Fall Symposium on Using Uncertainty Within Computation, 2001.
|
 |
13
|
Milad Shokouhi , Justin Zobel , Falk Scholer , S. M. M. Tahaghoghi, Capturing collection size for distributed non-cooperative retrieval, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 06-11, 2006, Seattle, Washington, USA
[doi> 10.1145/1148170.1148227]
|
 |
14
|
|
 |
15
|
|
|