SearchGen: a synthetic workload generator for scientific literature digital libraries and search engines
Source
International Conference on Digital Libraries
Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
Vancouver, BC, Canada
SESSION: Systems
Pages: 137 - 146  
Year of Publication: 2007
ISBN:978-1-59593-644-8
Authors
Huajing Li  The Pennsylvania State University, State College, PA
Wang-Chien Lee  The Pennsylvania State University, State College, PA
Anand Sivasubramaniam  The Pennsylvania State University, State College, PA
Lee Giles  The Pennsylvania State University, State College, PA
Sponsors
ACM: Association for Computing Machinery
SIGIR: ACM Special Interest Group on Information Retrieval
SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 8,   Downloads (12 Months): 51,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1255175.1255203

ABSTRACT

Because web applications are popular and heavily used, understanding their workloads is important for improving the performance of search services. Existing work has typically focused on generic web workloads rather than on specific domains. In this paper, we analyze the usage logs of CiteSeer, a scientific literature digital library and search engine, to characterize the workloads generated by both robots and users. We propose the essential ingredients that contribute to these workloads. Among them, we find that access intervals show high variance and thus cannot be predicted well with time-series models. On the other hand, client visiting paths and semantics can be captured well with probabilistic models and Zipf's law. Based on these findings, we propose SearchGen, a synthetic workload generator that outputs traces for scientific literature digital libraries and search engines. A comparison between synthetic workloads and actual logged traces suggests that the synthetic workload fits well.
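The abstract's two modeling ideas, a probabilistic model for client visiting paths and Zipf's law for what is requested, can be illustrated with a minimal sketch. This is not SearchGen itself: the page types, the transition probabilities in TRANSITIONS, the query list in QUERIES, the Zipf exponent, and all function names below are hypothetical values chosen for illustration, and the sketch assumes a first-order Markov chain as the probabilistic model. Access intervals are deliberately not modeled here.

```python
import random

# Hypothetical page types and transition probabilities (not taken from the paper).
TRANSITIONS = {
    "home": {"search": 0.8, "exit": 0.2},
    "search": {"search": 0.3, "document": 0.5, "exit": 0.2},
    "document": {"search": 0.3, "document": 0.2, "download": 0.3, "exit": 0.2},
    "download": {"search": 0.4, "exit": 0.6},
}

# Hypothetical queries, ordered from most popular (rank 1) to least popular.
QUERIES = ["svm", "web caching", "xml", "routing", "clustering"]


def zipf_weights(n, s=1.0):
    """Unnormalized Zipf weights for ranks 1..n: weight(rank) = 1 / rank**s."""
    return [1.0 / rank ** s for rank in range(1, n + 1)]


def generate_session(transitions, start, end, rng, max_steps=50):
    """Walk a first-order Markov chain until `end` is reached or max_steps is hit."""
    path = [start]
    while path[-1] != end and len(path) < max_steps:
        options = transitions[path[-1]]
        states = list(options)
        probs = [options[state] for state in states]
        path.append(rng.choices(states, weights=probs, k=1)[0])
    return path


def generate_trace(num_sessions, transitions, queries, seed=0):
    """Return a list of (session_id, page_type, query) records."""
    rng = random.Random(seed)  # seeded so that a trace is reproducible
    weights = zipf_weights(len(queries))
    trace = []
    for session_id in range(num_sessions):
        for state in generate_session(transitions, "home", "exit", rng):
            # Only search requests carry a query, drawn by Zipf-weighted rank.
            query = None
            if state == "search":
                query = rng.choices(queries, weights=weights, k=1)[0]
            trace.append((session_id, state, query))
    return trace


if __name__ == "__main__":
    for record in generate_trace(3, TRANSITIONS, QUERIES, seed=42):
        print(record)
```

Running the script prints one record per simulated request. In a real generator the transition table and the popularity ranking would be estimated from logged traces rather than written by hand.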


