ABSTRACT
Content Addressable Storage (CAS) is a data representation technique that partitions a given data-set into non-intersecting units called chunks and then efficiently recognizes chunks that occur multiple times. This allows CAS to eliminate duplicate instances of such chunks, reducing storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains, including enterprise applications, Web-based e-commerce or entertainment services, and highly parallel scientific/engineering applications and simulations. In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, at the cost of reduced error resilience and 14% CAS-related overheads. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
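The chunk-and-deduplicate idea described in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: it assumes fixed-size chunking, SHA-1 digests as chunk identifiers, and an in-memory dictionary as the chunk store, and the function names (`chunk_fixed`, `cas_store`, `cas_restore`, `raw_space_savings`) are hypothetical. The savings figure it computes ignores all CAS metadata, so it does not model the overheads reported above.

```python
import hashlib


def chunk_fixed(data: bytes, chunk_size: int = 1024) -> list:
    """Split data into non-intersecting fixed-size chunks.

    Assumption: fixed-size chunking; the final chunk may be shorter.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def cas_store(data: bytes, chunk_size: int = 1024):
    """Return (store, recipe) for data.

    store  maps a chunk digest to the chunk bytes (each unique chunk kept once).
    recipe is the ordered list of digests needed to rebuild the data.
    Assumption: SHA-1 is used as the content address.
    """
    store = {}
    recipe = []
    for chunk in chunk_fixed(data, chunk_size):
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)  # duplicate chunks are not stored again
        recipe.append(digest)
    return store, recipe


def cas_restore(store: dict, recipe: list) -> bytes:
    """Rebuild the original data by concatenating chunks in recipe order."""
    return b"".join(store[digest] for digest in recipe)


def raw_space_savings(data: bytes, store: dict) -> float:
    """Fraction of bytes saved by deduplication, ignoring metadata overhead."""
    if not data:
        return 0.0
    stored = sum(len(chunk) for chunk in store.values())
    return 1.0 - stored / len(data)


if __name__ == "__main__":
    # Three identical 1 KB chunks followed by one different chunk.
    sample = b"A" * 3072 + b"B" * 1024
    store, recipe = cas_store(sample, chunk_size=1024)
    print(len(recipe), "chunks referenced,", len(store), "chunks stored")
    print("raw savings:", raw_space_savings(sample, store))
```

In this sketch every chunk shared by several positions is stored only once, which is also why losing a single stored chunk can damage every position that references it; the abstract's point about reduced error resilience follows from the same sharing.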