ABSTRACT
The information explosion across the Internet and elsewhere offers access to an increasing number of document collections. In order for users to effectively access these collections, information retrieval (IR) systems must provide coordinated, concurrent, and distributed access. In this article, we explore how to achieve scalable performance in a distributed system for collection sizes ranging from 1GB to 128GB. We implement a fully functional distributed IR system based on a multithreaded version of the Inquery simulation model. We measure performance as a function of system parameters such as client command rate, number of document collections, terms per query, query term frequency, number of answers returned, and command mixture. Our results show that it is important to model both query and document commands because the heterogeneity of commands significantly impacts performance. Based on our results, we recommend simple changes to the prototype and evaluate the changes using the simulator. Because of the significant resource demands of information retrieval, it is not difficult to generate workloads that overwhelm system resources regardless of the architecture. However, under some realistic workloads, we demonstrate system organizations for which response time degrades gracefully as the workload increases and performance scales with the number of processors. This scalable architecture includes a surprisingly small number of brokers through which a large number of clients and servers communicate.