ABSTRACT
The proliferation of online information resources increases the importance of effective and efficient information retrieval in a multicollection environment. Multicollection searching is cast in three parts: collection selection (also referred to as database selection), query processing, and results merging. In this work, we focus our attention on the evaluation of the first step, collection selection.
In this article, we present a detailed discussion of the methodology that we used to evaluate and compare collection selection approaches, covering both test environments and evaluation measures. We compare the CORI, CVV, and gGLOSS collection selection approaches using six test environments utilizing three document testbeds. We note similar trends in performance among the collection selection approaches, but the CORI approach consistently outperforms the other approaches, suggesting that effective collection selection can be achieved using limited information about each collection.
The contributions of this work are both the assembled evaluation methodology and the application of that methodology to compare collection selection approaches in a standardized environment.
CITED BY 13
Milad Shokouhi, Justin Zobel, Yaniv Bernstein, Distributed text retrieval from overlapping collections, Proceedings of the eighteenth conference on Australasian database, p.141-150, January 30-February 02, 2007, Ballarat, Victoria, Australia
Milad Shokouhi, Justin Zobel, Falk Scholer, S. M. M. Tahaghoghi, Capturing collection size for distributed non-cooperative retrieval, Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, August 06-11, 2006, Seattle, Washington, USA
INDEX TERMS
Primary Classification: H. Information Systems > H.3 INFORMATION STORAGE AND RETRIEVAL > H.3.3 Information Search and Retrieval
Subjects: Selection process
Additional Classification: H. Information Systems > H.3 INFORMATION STORAGE AND RETRIEVAL > H.3.4 Systems and Software
Subjects: Performance evaluation (efficiency and effectiveness)
General Terms: Experimentation, Measurement, Performance
Keywords: Collection selection, database selection, distributed information retrieval, distributed text retrieval, metasearch engine, resource discovery, resource ranking, resource selection, server ranking, server selection, text retrieval