|
ABSTRACT
Block correlations are common semantic patterns in storage systems. They can be exploited for improving the effectiveness of storage caching, prefetching, data layout, and disk scheduling. Unfortunately, information about block correlations is unavailable at the storage system level. Previous approaches for discovering file correlations in file systems do not scale well enough for discovering block correlations in storage systems.In this article, we propose two algorithms, C-Miner and C-Miner*, that use a data mining technique called frequent sequence mining to discover block correlations in storage systems. Both algorithms run reasonably fast with feasible space requirement, indicating that they are practical for dynamically inferring correlations in a storage system. C-Miner is a direct application of a frequent-sequence mining algorithm with a few modifications; compared with C-Miner, C-Miner* is redesigned for mining block correlations by making concessions for the specific problem of long sequences in storage system traces. Therefore, C-Miner* can discover 7--109% more correlation rules within 2--15 times shorter time than C-Miner. Moreover, we have also evaluated the benefits of block correlation-directed prefetching and data layout through experiments. Our results using real system workloads show that correlation-directed prefetching and data layout can reduce average I/O response time by 12--30% compared to the base case, and 7--25% compared to the commonly used sequential prefetching scheme for most workloads.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Eric Anderson , Michael Hobbs , Kimberly Keeton , Susan Spence , Mustafa Uysal , Alistair C. Veitch, Hippodrome: Running Circles Around Storage Administration, Proceedings of the Conference on File and Storage Technologies, p.175-188, January 28-30, 2002
|
| |
3
|
Anthes, G. H. 2002. Storage virtualization: The next step. Computer World, January 28, 2002, p. 43.
|
| |
4
|
Ari, I., Amer, A., Miller, E., Brandt, S., and Long, D. 2002. Who is more adaptive? ACME: Adaptive caching using multiple experts. In Proceedings of the Workshop on Distributed Data and Structures (WDAS).
|
 |
5
|
|
 |
6
|
Jay Ayres , Jason Flannick , Johannes Gehrke , Tomi Yiu, Sequential PAttern mining using a bitmap representation, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, July 23-26, 2002, Edmonton, Alberta, Canada
[doi> 10.1145/775047.775109]
|
 |
7
|
|
 |
8
|
|
| |
9
|
Cao, P., Felten, E., and Li, K. 1994. Application-controlled file caching policies. In Proceedings of the USENIX Summer 1994 Technical Conference. 171--182.
|
 |
10
|
Pei Cao , Edward W. Felten , Anna R. Karlin , Kai Li, A study of integrated prefetching and caching strategies, Proceedings of the 1995 ACM SIGMETRICS joint international conference on Measurement and modeling of computer systems, p.188-197, May 15-19, 1995, Ottawa, Ontario, Canada
|
 |
11
|
|
| |
12
|
|
| |
13
|
Chen, Z., Zhou, Y., and Li, K. 2003. Eviction-based cache placement for storage caches. In Proceedings of the Conference on 2003 USENIX Annual Technical Conference. 269--282.
|
 |
14
|
Jongmoo Choi , Sam H. Noh , Sang Lyul Min , Yookun Cho, Towards application/file-level characterization of block references: a case for fine-grained buffer management, Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, p.286-295, June 18-21, 2000, Santa Clara, California, United States
|
| |
15
|
Chou, H. and DeWitt, D. 1993. An evaluation of buffer management strategies for relational database systems. In Proceedings of the 19th International Conference on Very Large Data Bases (Dublin, Ireland). 127--141.
|
| |
16
|
Clifton, C. and Gengo, G. 2000. Developing custom intrusion detection filters using data mining. In Proceedings of the 2000 Military Communications International Symposium (MILCOM2000, Los Angeles, CA).
|
 |
17
|
Kenneth M. Curewitz , P. Krishnan , Jeffrey Scott Vitter, Practical prefetching via data compression, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.257-266, May 25-28, 1993, Washington, D.C., United States
|
| |
18
|
Eaton, P. R., Geels, D., and Mori, G. 1999. Clump: Improving file system performance through adaptive optimizations. Go to http://www.citeseer.csail.mit.edu/eatox99clump.html.
|
| |
19
|
EMC Corporation. 1999. Symmetrix 3000 and 5000 Enterprise Storage Systems Product Description Guide. EMC Corporation, Hopkinton, MA. Web site: http://www.emc.com.
|
| |
20
|
Ganger, G. 1995. Systemoriented evaluation of I/O subsystem performance. Tech. rep. CSE-TR-243-95. University of Michigan, Ann Arbor, MI.
|
| |
21
|
|
| |
22
|
|
 |
23
|
Garth A. Gibson , David F. Nagle , Khalil Amiri , Jeff Butler , Fay W. Chang , Howard Gobioff , Charles Hardin , Erik Riedel , David Rochberg , Jim Zelenka, A cost-effective, high-bandwidth storage architecture, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.92-103, October 02-07, 1998, San Jose, California, United States
|
| |
24
|
Griffioen, J. and Appleton, R. 1994. Reducing file system latency using a predictive approach. In Proceedings of the 1994 Summer USENIX Conference.
|
| |
25
|
Griffioen, J. and Appleton, R. 1995. Performance measurements of automatic prefetching. In Proceedings of the International Conference on Parallel and Distributed Computing Systems.
|
| |
26
|
Han, J. 2002. How can data mining help bio-data analysis? In Proceedings of the 2002 Workshop on Data Mining in Bioinformatics (BIOKDD'02, Edmonton, Canada). 1--4.
|
| |
27
|
|
 |
28
|
|
| |
29
|
IBM. 2002. Storage Tank, a distributed storage system. IBM White paper. Web site: http://www.almaden.ibm.com/StorageSystems/file_systems/storage_tank/papers.shtml.
|
| |
30
|
Keeton, K. and Wilkes, J. 2002. Automating data dependability. In Proceedings of 10th ACM-SIGOPS European Workshop.
|
| |
31
|
Kim, J., Choi, J., Kim, J., Noh, S., Min, S., Cho, Y., and Kim, C. 2000. A low-overhead high-performance unified buffer management scheme that exploits sequential and looping references. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI, San Diego, CA). 119--134.
|
 |
32
|
Tracy Kimbrel , Andrew Tomkins , R. Hugo Patterson , Brian Bershad , Pei Cao , Edward W. Felten , Garth A. Gibson , Anna R. Karlin , Kai Li, A trace-driven comparison of algorithms for parallel prefetching and caching, Proceedings of the second USENIX symposium on Operating systems design and implementation, p.19-34, October 29-November 01, 1996, Seattle, Washington, United States
|
| |
33
|
Kroeger, T. M. and Long, D. D. E. 1995. Predicting file-system actions from prior events. In Proceedings of the 1996 USENIX Annual Technical Conference. 319--328.
|
| |
34
|
Kuenning, G. 1994. Design of the SEER predictive caching scheme. In Proceedings of the Workshop on Mobile Computing Systems and Applications.
|
 |
35
|
|
 |
36
|
|
| |
37
|
Lee, W. and Stolfo, S. 1998. Data mining approaches for intrusion detection. In Proceedings of the 7th USENIX Security Symposium (San Antonio, TX).
|
| |
38
|
Lei, H. and Duchamp, D. 1997. An analytical approach to file prefetching. In Proceedings of the 1997 USENIX Annual Technical Conference (Anaheim, CA.)
|
 |
39
|
|
 |
40
|
Tara M. Madhyastha , Garth A. Gibson , Christos Faloutsos, Informed prefetching of collective input/output requests, Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), p.13-es, November 14-19, 1999, Portland, Oregon, United States
[doi> 10.1145/331532.331545]
|
 |
41
|
|
| |
42
|
|
| |
43
|
|
 |
44
|
Todd C. Mowry , Angela K. Demke , Orran Krieger, Automatic compiler-inserted I/O prefetching for out-of-core applications, Proceedings of the second USENIX symposium on Operating systems design and implementation, p.3-17, October 29-November 01, 1996, Seattle, Washington, United States
|
| |
45
|
|
 |
46
|
R. H. Patterson , G. A. Gibson , E. Ginting , D. Stodolsky , J. Zelenka, Informed prefetching and caching, Proceedings of the fifteenth ACM symposium on Operating systems principles, p.79-95, December 03-06, 1995, Copper Mountain, Colorado, United States
|
| |
47
|
Jian Pei , Jiawei Han , Behzad Mortazavi-Asl , Helen Pinto , Qiming Chen , Umeshwar Dayal , Meichun Hsu, PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, Proceedings of the 17th International Conference on Data Engineering, p.215-224, April 02-06, 2001
|
| |
48
|
Pitkow, J. E. and Pirolli, P. 1999. Mining longest repeating subsequences to predict World Wide Web surfing. In Proceedings of the USENIX Symposium on Internet Technologies and Systems.
|
| |
49
|
Ruemmler, C. and Wilkes, J. 1993a. A trace-driven analysis of disk working set sizes. Tech. rep. HPL-OSR-93-23. Hewlett-Packard Laboratories, Palo Alto, CA.
|
| |
50
|
Ruemmler, C. and Wilkes, J. 1993b. UNIX disk access patterns. In Proceedings of the Winter 1993 USENIX Conference.
|
| |
51
|
Salmon, B., Thereska, E., Soules, C. A., and Ganger, G. R. 2003. A two-tiered software architecture for automated tuning of disk layouts. In Proceedings of the First Workshop on Algorithms and Architectures for Self-Managing Systems.
|
| |
52
|
|
| |
53
|
|
| |
54
|
Seifert, A. and Scholl, M. H. 2002. A multi-version cache replacement and prefetching policy for hybrid data delivery environments. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB).
|
| |
55
|
Muthian Sivathanu , Vijayan Prabhakaran , Florentina I. Popovici , Timothy E. Denehy , Andrea C. Arpaci-Dusseau , Remzi H. Arpaci-Dusseau, Semantically-Smart Disk Systems, Proceedings of the 2nd USENIX Conference on File and Storage Technologies, March 31-31, 2003, San Francisco, CA
|
 |
56
|
|
| |
57
|
Smith, B. J. 1978b. A pipelined, shared resource MIMD computer. In Proceedings of International Conference on Parallel Processing. 6--8.
|
 |
58
|
|
| |
59
|
Storage Performance Council. 2004. SPC I/O traces. Web site: http://www.storageperformance.org/.
|
 |
60
|
Carl Tait , Hui Lei , Swarup Acharya , Henry Chang, Intelligent file hoarding for mobile computers, Proceedings of the 1st annual international conference on Mobile computing and networking, p.119-125, November 13-15, 1995, Berkeley, California, United States
[doi> 10.1145/215530.215564]
|
 |
61
|
Andrew Tomkins , R. Hugo Patterson , Garth Gibson, Informed multi-process prefetching and caching, Proceedings of the 1997 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, p.100-114, June 15-18, 1997, Seattle, Washington, United States
|
 |
62
|
|
| |
63
|
|
| |
64
|
|
 |
65
|
|
 |
66
|
J. Wilkes , R. Golding , C. Staelin , T. Sullivan, The HP AutoRAID hierarchical storage system, Proceedings of the fifteenth ACM symposium on Operating systems principles, p.96-108, December 03-06, 1995, Copper Mountain, Colorado, United States
|
| |
67
|
|
| |
68
|
Yan, X., Han, J., and Afshar, R. 2003. CloSpan: Mining closed sequential patterns in large datasets. In Proceedings of the 2003 SIAM International Conference Data Mining (SDM'03, San Francisco, CA).
|
| |
69
|
|
| |
70
|
|
| |
71
|
|
CITED BY 7
|
|
|
|
|
Lei Chang , Tengjiao Wang , Dongqing Yang , Hua Luan , Shiwei Tang, Efficient algorithms for incremental maintenance of closed sequential patterns in large databases, Data & Knowledge Engineering, v.68 n.1, p.68-106, January, 2009
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|