Accelerating large-scale data exploration through data diffusion
Source
High Performance Distributed Computing archive
Proceedings of the 2008 International Workshop on Data-Aware Distributed Computing
Boston, MA, USA
Pages 9-18  
Year of Publication: 2008
ISBN: 978-1-60558-154-5
Authors
Ioan Raicu  University of Chicago, Chicago, IL, USA
Yong Zhao  Microsoft Corporation, Redmond, WA, USA
Ian T. Foster  University of Chicago, Chicago, IL and Argonne National Laboratory, Argonne, IL, USA
Alex Szalay  The Johns Hopkins University, Baltimore, MD, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1383519.1383521

ABSTRACT

Data-intensive applications often require exploratory analysis of large datasets. If analysis is performed on distributed resources, data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. Depending on workload and resource characteristics, this approach can provide the benefits of dedicated hardware without the associated high costs. The approach is reminiscent of cooperative caching, web caching, and peer-to-peer storage systems, but addresses different application demands. Other data-aware scheduling approaches assume dedicated resources, which can be expensive or inefficient if load varies significantly. To explore the feasibility of the data diffusion approach, we have extended the Falkon resource provisioning and task scheduling system to support data caching and data-aware scheduling. Performance results from both micro-benchmarks and a large-scale astronomy application demonstrate that our approach improves performance relative to alternative approaches and provides improved scalability, as aggregate I/O bandwidth scales linearly with the number of data cache nodes.
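
The three mechanisms named in the abstract (dynamic resource acquisition, demand-driven replication, and scheduling close to data) can be illustrated with a minimal Python sketch. This is a toy model written for illustration only, not Falkon's implementation: the class name `ToyDiffusionScheduler`, the parameters `max_nodes`, `tasks_per_node`, and `slack`, and the specific placement policy are all assumptions that do not come from the paper.

```python
from collections import defaultdict


class ToyDiffusionScheduler:
    """Toy model (hypothetical): nodes cache files they read; tasks prefer caching nodes."""

    def __init__(self, max_nodes, tasks_per_node, slack):
        self.max_nodes = max_nodes            # assumed cap on acquired nodes
        self.tasks_per_node = tasks_per_node  # assumed target of pending tasks per node
        self.slack = slack                    # assumed load gap tolerated before replicating
        self.cache = {}                       # node id -> set of cached file names
        self.assigned = defaultdict(int)      # node id -> tasks assigned so far
        self.next_id = 0

    def resize(self, pending_tasks):
        """Acquire nodes as demand grows; release every node when demand is zero."""
        if pending_tasks == 0:
            self.cache.clear()                # released nodes lose their cached data
            self.assigned.clear()
            return
        # Ceiling division, capped at max_nodes.
        wanted = min(self.max_nodes, -(-pending_tasks // self.tasks_per_node))
        while len(self.cache) < wanted:
            self.cache[self.next_id] = set()
            self.next_id += 1

    def _least_loaded(self, nodes):
        return min(nodes, key=lambda n: (self.assigned[n], n))

    def dispatch(self, file_name):
        """Place one task that reads file_name; return (node, cache_hit)."""
        if not self.cache:
            raise RuntimeError("no nodes acquired; call resize() first")
        node = self._least_loaded(self.cache)
        holders = [n for n in self.cache if file_name in self.cache[n]]
        if holders:
            best_holder = self._least_loaded(holders)
            # Stay close to the data unless the holder is far busier than the idlest node.
            if self.assigned[best_holder] - self.assigned[node] <= self.slack:
                node = best_holder
        hit = file_name in self.cache[node]
        self.cache[node].add(file_name)       # a miss leaves a new replica on that node
        self.assigned[node] += 1
        return node, hit


if __name__ == "__main__":
    sched = ToyDiffusionScheduler(max_nodes=4, tasks_per_node=2, slack=2)
    sched.resize(pending_tasks=4)
    for _ in range(5):
        print(sched.dispatch("a.fits"))
```

In this sketch, repeated requests for the same file first go to the node that already caches it; once that node is more than `slack` tasks ahead of the idlest node, the next request is sent elsewhere and creates a second replica. That is one simple way to model replication in response to demand, chosen here for brevity.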


