SearchGen: a synthetic workload generator for scientific literature digital libraries and search engines
Source
International Conference on Digital Libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Vancouver, BC, Canada
SESSION: Systems
Pages: 137 - 146
Year of Publication: 2007
ISBN: 978-1-59593-644-8
ABSTRACT
Because web applications are popular and heavily used, understanding their workloads is important for improving the performance of search services. Existing work has typically focused on generic web workloads without emphasizing specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize workloads for both robots and users. We propose the essential ingredients that contribute to these workloads. Among them, we find that access intervals show high variance and thus cannot be predicted well with time-series models. On the other hand, client visiting paths and semantics can be captured well with probabilistic models and Zipf's law. Based on these findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
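The abstract's two modeling ideas, a probabilistic model for visiting paths and Zipf's law for what is requested, can be illustrated with a minimal sketch. It is not SearchGen itself: the page types, the transition probabilities, the Zipf exponent, and the names `zipf_sampler` and `generate_session` are all hypothetical and chosen only for illustration. Access intervals are deliberately left out, since the abstract reports that they are hard to predict.

```python
import random
from bisect import bisect_left
from itertools import accumulate


def zipf_sampler(n_items, exponent=1.0, rng=None):
    """Return a function drawing ranks 1..n_items with P(rank) ~ rank**-exponent."""
    if rng is None:
        rng = random.Random()
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    cdf = list(accumulate(weights))
    total = cdf[-1]

    def draw():
        # Inverse-CDF sampling: find the first rank whose cumulative weight reaches u.
        u = rng.random() * total
        return min(bisect_left(cdf, u), n_items - 1) + 1

    return draw


# Hypothetical page types and transition probabilities (assumed, not from the paper).
TRANSITIONS = {
    "home": {"search": 0.8, "exit": 0.2},
    "search": {"search": 0.3, "document": 0.5, "exit": 0.2},
    "document": {"search": 0.3, "download": 0.3, "exit": 0.4},
    "download": {"search": 0.4, "exit": 0.6},
}


def generate_session(draw_query, rng, max_steps=50):
    """Walk a first-order Markov chain from "home" until "exit" or max_steps requests."""
    state, trace = "home", []
    for _ in range(max_steps):
        if state == "exit":
            break
        # Search requests carry a Zipf-ranked query id; other requests carry None.
        item = draw_query() if state == "search" else None
        trace.append((state, item))
        nxt = TRANSITIONS[state]
        state = rng.choices(list(nxt), weights=list(nxt.values()))[0]
    return trace


if __name__ == "__main__":
    rng = random.Random(42)
    draw_query = zipf_sampler(1000, exponent=1.0, rng=rng)
    for _ in range(3):
        print(generate_session(draw_query, rng))
```

In a real generator, the transition table and the exponent would be estimated from logged traces, and separate parameter sets could be kept for robots and for users.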