|
ABSTRACT
The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
BARBAR , D. AND CLIFTON, C. 1992. Information brokers: Sharing knowledge in a heterogeneous distributed system. Tech. Rep. MITL-TR-31-92. Matsushita Information Technology Laboratory.
|
| |
2
|
BOWMAN, C. M., DANZIG, P. B., HARDY, D. R., MANBER, U., AND SCHWARTZ, M. F. 1994. Harvest: A scalable, customizable discovery and access system. Tech. Rep. CU-CS-732-94. Dept. Computer Science, Univ. of Colorado, Boulder.
|
 |
3
|
James P. Callan , Zhihong Lu , W. Bruce Croft, Searching distributed collections with inference networks, Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval, p.21-28, July 09-13, 1995, Seattle, Washington, United States
[doi> 10.1145/215206.215328]
|
| |
4
|
CHAMIS, A.Y. 1988. Selection of online databases using switching vocabularies. J. Am. Soc. Inf. Sci. 39, 3, 217-218.
|
| |
5
|
|
 |
6
|
Peter B. Danzig , Jongsuk Ahn , John Noll , Katia Obraczka, Distributed indexing: a scalable mechanism for distributed information retrieval, Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, p.220-229, October 13-16, 1991, Chicago, Illinois, United States
[doi> 10.1145/122860.122883]
|
| |
7
|
DANZIG, P. B., LI, S. -H., AND OBRACZKA, K. 1992. Distributed indexing of autonomous internet services. Comput. Syst. 5, 4, 433-459.
|
| |
8
|
|
| |
9
|
DUDA, A. AND SHELDON, M.A. 1994. Content routing in a network of WAIS servers. In Proceedings of the 14th IEEE International Conference on Distributed Computing Systems (Poznan, Poland, June). IEEE Computer Society Press, Los Alamitos, CA.
|
| |
10
|
FLATER, D. W. AND YESHA, Y. 1993. An information retrieval system for network resources. In Proceedings of the International Workshop on Next Generation Information Technologies and Systems (June).
|
 |
11
|
James C. French , Allison L. Powell , Charles L. Viles , Travis Emmitt , Kevin J. Prey, Evaluating database selection techniques: a testbed and experiment, Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, p.121-129, August 24-28, 1998, Melbourne, Australia
[doi> 10.1145/290941.290976]
|
| |
12
|
FULLTON, J. AND WARNOCK, A. ET AL. 1993. Release Notes for Free WAIS 0.2.
|
 |
13
|
Luis Gravano , Chen-Chuan K. Chang , Héctor García-Molina , Andreas Paepcke, STARTS: Stanford proposal for Internet meta-searching, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.207-218, May 11-15, 1997, Tucson, Arizona, United States
|
| |
14
|
|
| |
15
|
|
| |
16
|
|
| |
17
|
|
 |
18
|
Luis Gravano , Héctor García-Molina , Anthony Tomasic, The effectiveness of GIOSS for the text database discovery problem, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.126-137, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
19
|
|
| |
20
|
|
| |
21
|
MORRIS, A., DRENTH, H., AND TSENG, G. 1993. The development of an expert system for online company database selection. Expert Syst. 10, 2 (May), 47-60.
|
| |
22
|
|
| |
23
|
NEUMAN, B. C. 1992. The Prospero file system: A global file system based on the virtual system model. Comput. Syst. 5, 4, 407-432.
|
| |
24
|
|
| |
25
|
ORDILLE, J. J. AND MILLER, B. P. 1992. Distributed active catalogs and meta-data caching in descriptive name services. Tech. Rep. 1118. University of Wisconsin at Madison, Madison, WI.
|
| |
26
|
|
| |
27
|
|
| |
28
|
|
| |
29
|
SCHWARTZ, M. F. 1990. A scalable, non-hierarchical resource discovery mechanism based on probabilistic protocols. Tech. Rep. CU-CS-474-90. Department of Computer Science, University of Colorado at Boulder, Boulder, CO.
|
| |
30
|
|
| |
31
|
SCHWARTZ, M. F., EMTAGE, A., KAHLE, B., AND NEUMAN, B. C. 1992. A comparison of Internet resource discovery approaches. Comput. Syst. 5, 4, 461-493.
|
| |
32
|
SELBERG, E. AND ETZIONI, O. 1995. Multi-service search and comparison using the MetaCrawler. In Proceedings of the Fourth International Conference on World-Wide Web (Dec.).
|
| |
33
|
Mark A. Sheldon , Andrzej Duda , Ron Weiss , James W. O'Toole, Jr. , David K. Gifford, Content routing for distributed information servers, Proceedings of the 4th international conference on extending database technology on Advances in database technology, p.109-122, May 1994, Cambridge, United Kingdom
|
| |
34
|
SIMPSON, P. AND ALONSO, R. 1989. Querying a network of autonomous databases. Tech. Rep. CS-TR-202-89. Department of Computer Science, Princeton Univ., Princeton, NJ.
|
 |
35
|
|
| |
36
|
VOORHEES, E. M., GUPTA, N. K., AND JOHNSON-LAIRD, B. 1995. The collection fusion problem. In Proceedings of the Third Conference on Text Retrieval (TREC-3, Mar.).
|
| |
37
|
YAN, T. W. AND GARC A-MOLINA, H. 1995. SIFT--a tool for wide-area information dissemination. In Proceedings of the 1995 USENIX Technical Conference (Jan.). USENIX Assoc., Berkeley, CA, 177-186.
|
| |
38
|
ZAHIR, S. AND CHANG, C. L. 1992. Online-Expert: An expert system for online database selection. J. Am. Soc. Inf. Sci. 43, 5, 340-357.
|
CITED BY 81
|
|
|
|
|
|
|
|
James C. French , Allison L. Powell , Jamie Callan , Charles L. Viles , Travis Emmitt , Kevin J. Prey , Yun Mou, Comparing the performance of database selection algorithms, Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, p.238-245, August 15-19, 1999, Berkeley, California, United States
|
|
|
Allison L. Powell , James C. French , Jamie Callan , Margaret Connell , Charles L. Viles, The impact of database selection on distributed searching, Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, p.232-239, July 24-28, 2000, Athens, Greece
|
|
|
|
|
|
|
|
|
James C. French , Allison L. Powell , Fredric Gey , Natalia Perelman, Exploiting a controlled vocabulary to improve collection selection and retrieval effectiveness, Proceedings of the tenth international conference on Information and knowledge management, October 05-10, 2001, Atlanta, Georgia, USA
|
|
|
|
|
|
Yong Lin , Jian Xu , Ee-Peng Lim , Wee-Keong Ng, ZBroker: a query routing broker for Z39.50 databases, Proceedings of the eighth international conference on Information and knowledge management, p.202-209, November 02-06, 1999, Kansas City, Missouri, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Chunqiang Tang , Zhichen Xu , Sandhya Dwarkadas, Peer-to-peer information retrieval using self-organizing semantic overlay networks, Proceedings of the 2003 conference on Applications, technologies, architectures, and protocols for computer communications, August 25-29, 2003, Karlsruhe, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Matthias Bender , Sebastian Michel , Peter Triantafillou , Gerhard Weikum , Christian Zimmer, Improving collection selection with overlap awareness in P2P search engines, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, August 15-19, 2005, Salvador, Brazil
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Klaus Berberich , Manolis Koubarakis , Christos Tryfonopoulos , Gerhard Weikum , Christian Zimmer, MAPS: approximate publish/subscribe functionality in peer-to-peer networks, Proceedings of the 1st international workshop on Advanced data processing in ubiquitous computing (ADPUC 2006), November 27-December 01, 2006, Melbourne, Australia
|
|
|
Milad Shokouhi , Justin Zobel , Yaniv Bernstein, Distributed text retrieval from overlapping collections, Proceedings of the eighteenth conference on Australasian database, p.141-150, January 30-February 02, 2007, Ballarat, Victoria, Australia
|
|
|
Panagiotis G. Ipeirotis , Eugene Agichtein , Pranay Jain , Luis Gravano, To search or to crawl?: towards a query optimizer for text-centric tasks, Proceedings of the 2006 ACM SIGMOD international conference on Management of data, June 27-29, 2006, Chicago, IL, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Mayank Bawa , Roberto J. Bayardo, Jr. , Rakesh Agrawal, Privacy-preserving indexing of documents on the network, Proceedings of the 29th international conference on Very large data bases, p.922-933, September 09-12, 2003, Berlin, Germany
|
|
|
Jack G. Conrad , Xi S. Guo , Peter Jackson , Monem Meziou, Database selection using actual physical and acquired logical collection resources in a massive domain-specific operational environment, Proceedings of the 28th international conference on Very Large Data Bases, p.71-82, August 20-23, 2002, Hong Kong, China
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Igor Mekterovic , Krešimir Križanovic , Mirta Baranovic, Content-based search using self-organizing peer-to-peer network, Proceedings of the 7th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems, p.50-55, February 20-22, 2008, Cambridge, UK
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Aoying Zhou , Rong Zhang , Weining Qian , Quang Hieu Vu , Tianming Hu, Adaptive indexing for content-based search in P2P systems, Data & Knowledge Engineering, v.67 n.3, p.381-398, December, 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Peter Mork , Ken Smith , Barbara Blaustein , Chris Wolf , Keri Sarver, Facilitating discovery on the private web using dataset digests, Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, November 24-26, 2008, Linz, Austria
|
|
|
|
|
|
|
|
|
Jaime Arguello , Fernando Diaz , Jamie Callan , Jean-Francois Crespo, Sources of evidence for vertical selection, Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, July 19-23, 2009, Boston, MA, USA
|
|
|
|
|
|
|
REVIEW
"Edward Y. Lee : Reviewer"
The idea of a Glossary of Servers Server (GlOSS) has been discussed
in several other publications. This paper expands the idea into three
versions of the implementation as applied to the discovery of
text-source documents available over the In
more...
|