The quest for scalable support of data-intensive workloads in distributed systems
Source
High Performance Distributed Computing archive
Proceedings of the 18th ACM international symposium on High performance distributed computing
Garching, Germany
SESSION: Data management
Pages 207-216  
Year of Publication: 2009
ISBN:978-1-60558-587-1
Authors
Ioan Raicu  University of Chicago, Chicago, IL, USA
Ian T. Foster  University of Chicago & Argonne National Laboratory, Chicago, IL, USA
Yong Zhao  Microsoft Corporation, Redmond, WA, USA
Philip Little  University of Notre Dame, Notre Dame, IN, USA
Christopher M. Moretti  University of Notre Dame, Notre Dame, IN, USA
Amitabh Chaudhary  University of Notre Dame, Notre Dame, IN, USA
Douglas Thain  University of Notre Dame, Notre Dame, IN, USA
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1551609.1551642

ABSTRACT

Data-intensive applications involving the analysis of large datasets often require large amounts of compute and storage resources, for which data locality can be crucial to high throughput and performance. We propose a "data diffusion" approach that acquires compute and storage resources dynamically, replicates data in response to demand, and schedules computations close to data. As demand increases, more resources are acquired, thus allowing faster response to subsequent requests that refer to the same data; when demand drops, resources are released. This approach can provide the benefits of dedicated hardware without the associated high costs, depending on workload and resource characteristics. To explore the feasibility of data diffusion, we offer both a theoretical and an empirical analysis. We define an abstract model for data diffusion, introduce new scheduling policies with heuristics to optimize real-world performance, and develop a competitive online cache eviction policy. We also offer many empirical experiments to explore the benefits of dynamically expanding and contracting resources based on load, to improve system responsiveness while keeping wasted resources small. We show performance improvements of one to two orders of magnitude over parallel file systems across three diverse workloads, with throughputs approaching 80 Gb/s on a modest cluster of 200 processors. We also compare data diffusion with a best model for active storage, contrasting the pull model found in data diffusion with the push model found in active storage.
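
The idea of scheduling computations close to data can be illustrated with a toy sketch. It is not the authors' scheduler or eviction policy: the class `Worker`, the functions `dispatch` and `simulate`, the plain LRU eviction rule, and the "least-loaded worker" fallback are all assumptions made for illustration, and dynamic acquisition and release of resources is not modeled. Each task names one file; a task goes to a worker that already caches the file when one exists, and otherwise to the least-loaded worker, which pulls the file from persistent storage and caches it.

```python
from collections import OrderedDict


class Worker:
    """Hypothetical compute node with an LRU file cache.

    cache_capacity is counted in files and is assumed to be at least 1.
    """

    def __init__(self, name, cache_capacity):
        self.name = name
        self.cache = OrderedDict()  # file name -> True, least recently used first
        self.cache_capacity = cache_capacity
        self.queued = 0  # number of tasks dispatched to this worker so far

    def run(self, file_name):
        """Run one task; return True on a cache hit, False on a miss."""
        self.queued += 1
        if file_name in self.cache:
            self.cache.move_to_end(file_name)  # refresh the LRU position
            return True
        # Miss: pull the file from persistent storage and cache it locally.
        if len(self.cache) >= self.cache_capacity:
            self.cache.popitem(last=False)  # evict the least recently used file
        self.cache[file_name] = True
        return False


def dispatch(workers, file_name):
    """Prefer a worker that already caches the file, else the least-loaded one."""
    holders = [w for w in workers if file_name in w.cache]
    candidates = holders or workers
    target = min(candidates, key=lambda w: w.queued)
    hit = target.run(file_name)
    return target.name, hit


def simulate(requests, n_workers=2, cache_capacity=2):
    """Dispatch a sequence of file requests; return (worker name, hit) pairs."""
    workers = [Worker(f"w{i}", cache_capacity) for i in range(n_workers)]
    return [dispatch(workers, f) for f in requests]
```

Calling `simulate(["a", "b", "a", "a", "c"])` sends the repeated requests for `a` back to the worker that first fetched it, so only the first access to each file misses. This is a pull model in the sense used above: a file moves to a worker only when a task there asks for it.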


