|
ABSTRACT
Minimizing the amount of data that must be stored and managed is a key goal for any storage architecture that purports to be scalable. One way to achieve this goal is to avoid maintaining duplicate copies of the same data. Eliminating redundant data at the source by not writing data which has already been stored not only reduces storage overheads, but can also improve bandwidth utilization. For these reasons, in the face of today's exponentially growing data volumes, redundant data elimination techniques have assumed critical significance in the design of modern storage systems.Intelligent object partitioning techniques identify data that is new when objects are updated, and transfer only these chunks to a storage server. In this article, we propose a new object partitioning technique, called fingerdiff, that improves upon existing schemes in several important respects. Most notably, fingerdiff dynamically chooses a partitioning strategy for a data object based on its similarities with previously stored objects in order to improve storage and bandwidth utilization. We present a detailed evaluation of fingerdiff, and other existing object partitioning schemes, using a set of real-world workloads. We show that for these workloads, the duplicate elimination strategies employed by fingerdiff improve storage utilization on average by 25%, and bandwidth utilization on average by 40% over comparable techniques.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Berlekamp, E. R. 1968. Algebraic Coding Theory. McGraw-Hill, New York.
|
| |
3
|
Blomer, J., Kalfane, M., Karp, R., Karpinski, M., Luby, M., and Zuckerman, D. 1995. An xor-based erasure-resilient coding scheme. Tech. Rep., International Computer Science Institute, Berkeley, California.
|
| |
4
|
|
| |
5
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
| |
6
|
|
| |
7
|
Cederqvist, P. 1992. Version management with cvs. http://www.cvshome.org/docs/manual/.
|
 |
8
|
|
| |
9
|
Douglis, F. and Iyengar, A. 2003. Application-Specific deltaencoding via resemblance detection. In Usenix Annual Technical Conference. 59--72.
|
| |
10
|
Douglis, P. K. F., LaVoie, J., and Tracey, J. M. 2004. Redundancy elimination within large collections of files. In Usenix Annual Technical Conference. 59--72.
|
| |
11
|
|
| |
12
|
Hong, B., Plantenberg, D., Long, D. D. E., and Sivan-Zimet, M. 2004. Duplicate data elimination in a san file system. In Proceedings of the 21st IEEE/12th NASA Goddard Conference on Mass Storage Systems and Technologies (MSST). 301--314.
|
 |
13
|
|
| |
14
|
Jain, N., Dahlin, M., and Tewari, R. 2005. Taper: Tiered approach for eliminating redundancy in replica sychronization. In Proceedings of the 4th Usenix Conference on File and Storage Technologies (FAST).
|
 |
15
|
John Kubiatowicz , David Bindel , Yan Chen , Steven Czerwinski , Patrick Eaton , Dennis Geels , Ramakrishna Gummadi , Sean Rhea , Hakim Weatherspoon , Chris Wells , Ben Zhao, OceanStore: an architecture for global-scale persistent storage, Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, p.190-201, November 2000, Cambridge, Massachusetts, United States
|
 |
16
|
|
 |
17
|
Qin Lv , Pei Cao , Edith Cohen , Kai Li , Scott Shenker, Search and replication in unstructured peer-to-peer networks, Proceedings of the 16th international conference on Supercomputing, June 22-26, 2002, New York, New York, USA
[doi> 10.1145/514191.514206]
|
| |
18
|
Manber, U. 1994. Finding similar files in a large file system. In Usenix Winter Conference. 1--10.
|
 |
19
|
|
| |
20
|
National Institute of Standards and Technology FIPS PUB 180-1. 1995. Secure hash standard.
|
| |
21
|
|
| |
22
|
Policroniades, C. and Pratt, I. 2004. Alternatives for detecting redundancy in storage systems data. In Usenix Annual Technical Conference. 73--86.
|
| |
23
|
|
| |
24
|
Rabin, M. 1981. Fingerprinting by Random Polynomials. Tech. Rep. TR-15-81, Center for Research in Computing Technology, Harvard University.
|
| |
25
|
Rochkind, M. J. 1975. The source code control system. IEEE Trans. Softw. Eng. 1, 4, 364--370.
|
| |
26
|
Shivakumar, N. and García-Molina, H. 1995. SCAM: A copy detection mechanism for digital documents. In Proceedings of the 2nd Annual Conference on the Theory and Practice of Digital Libraries.
|
 |
27
|
|
| |
28
|
|
| |
29
|
W. J. Bolosky, S. Corbin, D. G. and Douceur, J. R. Single instance storage in windows 2000. In Usenix Annual Technical Conference.
|
| |
30
|
|
| |
31
|
You, L. L. and Karamanolis, C. 2004. Evaluation of efficient archival storage techniques. In Proceedings of the 21st IEEE Symposium on Mass Storage Systems and Technologies (MSST).
|
| |
32
|
Ziv, J. and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theor. 23, 3, 337--343.
|
CITED BY
|
Lior Aronovich , Ron Asher , Eitan Bachmat , Haim Bitner , Michael Hirsch , Shmuel T. Klein, The design of a similarity based deduplication system, Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, May 04-April 06, 2009, Haifa, Israel
|
|