ABSTRACT
Data deduplication systems detect redundancies between data blocks to reduce either storage needs or network traffic. One class of deduplication systems splits the data stream into data blocks (chunks) and then finds exact duplicates of these blocks. This paper compares the influence of different chunking approaches on multiple levels. On a macroscopic level, we compare the chunking approaches using real-life user data in a weekly full backup scenario, both at a single point in time and over several weeks. On a microscopic level, we analyze how small changes affect the deduplication ratio for different file types, for both chunking approaches and delta encoding. An intuitive assumption is that small semantic changes to documents cause only small modifications in the binary representation of files, which would imply a high deduplication ratio. We show that this assumption does not hold for many important file types and that application-specific chunking can help to further decrease storage capacity demands.
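To make the chunk-and-match idea concrete, the following is a minimal sketch and not the implementation evaluated in the paper. It contrasts fixed-size chunking with a simple content-defined variant and measures how much of a slightly modified byte stream still matches chunks of the original. All names (`static_chunks`, `content_defined_chunks`, `dedup_ratio`), the window and mask values, and the use of SHA-1 over a sliding window as a stand-in for a rolling fingerprint are assumptions made for illustration.

```python
import hashlib
import random


def static_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Split data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, mask: int = 0x3F) -> list[bytes]:
    """Cut after every position whose preceding `window` bytes hash to a pattern.

    Assumption: SHA-1 of the window stands in for a rolling fingerprint.
    Boundaries depend only on local content, so they move with the data.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        digest = hashlib.sha1(data[end - window:end]).digest()
        if digest[-1] & mask == 0:  # roughly one cut per 64 positions
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def dedup_ratio(old: bytes, new: bytes, chunker) -> float:
    """Fraction of bytes in `new` covered by chunks already present in `old`."""
    known = {hashlib.sha1(chunk).digest() for chunk in chunker(old)}
    duplicate_bytes = sum(
        len(chunk) for chunk in chunker(new)
        if hashlib.sha1(chunk).digest() in known
    )
    return duplicate_bytes / len(new)


if __name__ == "__main__":
    rng = random.Random(0)
    old = bytes(rng.getrandbits(8) for _ in range(8192))
    new = old[:100] + b"X" + old[100:]  # a one-byte insertion near the start
    print("static chunking:         ", dedup_ratio(old, new, static_chunks))
    print("content-defined chunking:", dedup_ratio(old, new, content_defined_chunks))
```

With fixed-size blocks, a single inserted byte shifts every later block boundary, so almost no later chunk matches. With content-defined boundaries, the cut points after the insertion realign and most chunks are found again. File formats that compress or reorder their contents on save defeat both schemes, which is the kind of effect the microscopic analysis examines.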