Multi-level comparison of data deduplication in a backup scenario
Source: ACM International Conference Proceeding Series
Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference
Haifa, Israel
SESSION: Deduplication
Article No. 8  
Year of Publication: 2009
ISBN: 978-1-60558-623-6
Authors
Dirk Meister, Paderborn Center for Parallel Computing
André Brinkmann, Paderborn Center for Parallel Computing
Sponsors
Mellanox Technologies
Hebrew University of Jerusalem
IBM
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 28,   Downloads (12 Months): 102,   Citation Count: 0
DOI: http://doi.acm.org/10.1145/1534530.1534541

ABSTRACT

Data deduplication systems detect redundancies between data blocks, either to reduce storage needs or to reduce network traffic. One class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks.
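
The chunk-and-match idea can be illustrated with a minimal sketch. It assumes fixed-size chunks and SHA-1 fingerprints; both choices, the 8-byte chunk size, and the function names are illustrative assumptions and are not taken from the paper.

```python
import hashlib

def fixed_size_chunks(data: bytes, chunk_size: int = 8):
    """Split data into consecutive blocks of chunk_size bytes (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def deduplicate(chunks):
    """Store each distinct chunk once, keyed by its SHA-1 fingerprint."""
    store = {}    # fingerprint -> chunk bytes (each unique chunk kept once)
    recipe = []   # ordered fingerprints needed to rebuild the stream
    for chunk in chunks:
        fingerprint = hashlib.sha1(chunk).hexdigest()
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original stream from the chunk store and the recipe."""
    return b"".join(store[fingerprint] for fingerprint in recipe)

# Five 8-byte chunks, of which only two are distinct.
data = b"ABCDEFGH" * 3 + b"12345678" + b"ABCDEFGH"
store, recipe = deduplicate(fixed_size_chunks(data))
stored_bytes = sum(len(chunk) for chunk in store.values())
```

Here the 40-byte stream is kept as 16 bytes of unique chunk data plus a recipe of five fingerprints. A real system would also have to account for the space taken by the fingerprints and the index.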

This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches on real-life user data in a weekly full backup scenario, both at a single point in time and over several weeks.

In addition, we analyze on a microscopic level how small changes affect the deduplication ratio for different file types, both for chunking approaches and for delta encoding. An intuitive assumption is that small semantic changes to documents cause only small modifications in the binary representation of files, which would imply a high deduplication ratio. We show that this assumption does not hold for many important file types and that application-specific chunking can help to further decrease storage capacity demands.
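
How strongly a small change affects deduplication depends on the chunking approach. The sketch below contrasts two commonly used approaches, fixed-size and content-defined chunking, after a one-byte insertion. Whether these are exactly the approaches compared in the paper is an assumption. The window size, the divisor, and the per-window CRC32 (standing in for the rolling fingerprint that production systems typically use) are illustrative choices.

```python
import random
import zlib

WINDOW = 16    # bytes of context that decide a boundary (illustrative value)
DIVISOR = 64   # a boundary is expected roughly every 64 bytes (illustrative value)

def fixed_chunks(data, size=64):
    """Cut the data at fixed offsets, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def content_defined_chunks(data):
    """Cut after every position whose trailing WINDOW bytes hash to a chosen value."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # Recomputed per window for brevity; a rolling hash would avoid the rework.
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == DIVISOR - 1:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def shared_fraction(old_chunks, new_chunks):
    """Fraction of the new chunks that already exist among the old chunks."""
    known = set(old_chunks)
    return sum(chunk in known for chunk in new_chunks) / len(new_chunks)

rng = random.Random(0)
old = bytes(rng.getrandbits(8) for _ in range(8192))
new = b"!" + old   # a one-byte insertion at the front of the file

fixed_shared = shared_fraction(fixed_chunks(old), fixed_chunks(new))
cdc_shared = shared_fraction(content_defined_chunks(old), content_defined_chunks(new))
print(f"fixed-size: {fixed_shared:.2f}, content-defined: {cdc_shared:.2f}")
```

With fixed-size chunks the insertion shifts every later boundary, so essentially no new chunk matches an old one. With content-defined chunks the boundaries follow the content, so only the chunks around the insertion differ. Neither approach helps when a small edit rewrites most of the file's bytes, as can happen with compressed formats.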


