Evaluating the usefulness of content addressable storage for high-performance data intensive applications
Source
Proceedings of the 17th international symposium on High performance distributed computing
Boston, MA, USA
SESSION: Data intensive computing
Pages 35-44  
Year of Publication: 2008
ISBN:978-1-59593-997-5
Authors
Partho Nath, Cisco Systems, Inc., San Jose, CA, USA
Bhuvan Urgaonkar, Pennsylvania State University, University Park, PA, USA
Anand Sivasubramaniam, Pennsylvania State University, University Park, PA, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1383422.1383428

ABSTRACT

Content Addressable Storage (CAS) is a data representation technique that partitions a given data-set into non-intersecting units called chunks and then employs techniques to efficiently recognize chunks that occur multiple times. This allows CAS to eliminate duplicate instances of such chunks, reducing storage space compared to conventional representations of data. CAS is an attractive technique for reducing the storage and network bandwidth needs of performance-sensitive, data-intensive applications in a variety of domains, including enterprise applications, Web-based e-commerce or entertainment services, and highly parallel scientific/engineering applications and simulations.
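
As a rough illustration of the chunk-and-deduplicate idea described above, the following Python sketch splits a byte string into chunks and stores each distinct chunk only once, keyed by a digest of its content. Fixed-size chunking, SHA-1 digests, and the names `ChunkStore`, `put`, `get`, and `stored_bytes` are assumptions made for this sketch; the abstract does not state which chunking scheme or hash function the paper uses.

```python
import hashlib


class ChunkStore:
    """Toy content-addressable store (hypothetical): each distinct chunk is kept once."""

    def __init__(self, chunk_size=1024):
        self.chunk_size = chunk_size  # assumed fixed chunk size, in bytes
        self.chunks = {}              # digest -> chunk bytes

    def put(self, data):
        """Store data; return its recipe, the ordered list of chunk digests."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            # A chunk whose digest is already present is not stored again.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self):
        """Bytes of chunk payload actually held (metadata excluded)."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore(chunk_size=1024)
data = b"A" * 4096 + b"B" * 1024 + b"A" * 2048
recipe = store.put(data)
print(len(data), "bytes in,", store.stored_bytes(), "bytes of chunk payload stored")
```

The recipe grows with the number of chunks even when the chunks themselves are duplicates, which is one source of the CAS-related overhead that the abstract mentions.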

In this paper, we conduct an empirical evaluation of the benefits offered by CAS to a variety of real-world data-intensive applications. The savings offered by CAS depend crucially on (i) the nature of the data-set itself and (ii) the chunk-size that CAS employs. We investigate the impact of both these factors on disk space savings, savings in network bandwidth, and error resilience of data. We find that a chunk-size of 1 KB can provide up to 84% savings in disk space and even higher savings in network bandwidth, while trading off error resilience and incurring 14% CAS-related overhead. Drawing upon lessons learned from our study, we provide insights on (i) the choice of the chunk-size for effective space savings and (ii) the use of selective data replication to counter the loss of error resilience caused by CAS.
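
To make the dependence of savings on chunk-size concrete, the sketch below estimates net disk-space savings for a few chunk sizes on a synthetic, highly repetitive input. Everything here is an assumption for illustration: the function name `space_savings`, fixed-size chunking, SHA-1, and the overhead model (one 20-byte digest per chunk reference). It is not the paper's measurement methodology, and the synthetic data is not one of its workloads.

```python
import hashlib


def space_savings(data, chunk_size, digest_bytes=20):
    """Return (savings, overhead) as fractions of the original size.

    Assumed model: stored size = unique chunk payload + one digest
    per chunk reference (the recipe needed to rebuild the data).
    """
    if not data:
        raise ValueError("data must be non-empty")
    unique = {}    # digest -> chunk length
    n_chunks = 0
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        unique[hashlib.sha1(chunk).digest()] = len(chunk)
        n_chunks += 1
    payload = sum(unique.values())
    metadata = n_chunks * digest_bytes
    savings = 1 - (payload + metadata) / len(data)
    overhead = metadata / len(data)
    return savings, overhead


# Synthetic input: a 1 KB block repeated 64 times (64 KB in total).
block = bytes(range(256)) * 4
data = block * 64

for size in (256, 1024, 4096):
    savings, overhead = space_savings(data, size)
    print(f"chunk size {size:5d} B: savings {savings:.1%}, metadata overhead {overhead:.1%}")
```

Under this model, smaller chunks expose more duplicates but require more digests, so the chunk-size with the best net savings depends on the data.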



