ACM Home Page
Please provide us with feedback. Feedback
Scalable algorithms for global snapshots in distributed systems
Full text PdfPdf (354 KB)
Source International Conference on Supercomputing archive
Proceedings of the 20th annual international conference on Supercomputing table of contents
Cairns, Queensland, Australia
SESSION: Memory table of contents
Pages: 269 - 277  
Year of Publication: 2006
ISBN:1-59593-282-8
Authors
Rahul Garg  IBM India Research Lab, IIT Delhi, Hauz Khas, New Delhi, India
Vijay K. Garg  The University of Texas at Austin, Austin, TX
Yogish Sabharwal  IBM India Research Lab, IIT Delhi, Hauz Khas, New Delhi, India
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 13,   Downloads (12 Months): 80,   Citation Count: 1
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1183401.1183439
What is a DOI?

ABSTRACT

Existing algorithms for global snapshots in distributed systems are not scalable when the underlying topology is complete. In a network with N processors, these algorithms require O(N) space and O(N) messages per processor. As a result, these algorithms are not efficient in large systems when the logical topology of the communication layer such as MPI is complete. In this paper, we propose three algorithms for global snapshot: a grid-based, a tree-based and a centralized algorithm. The grid-based algorithm uses O(N) space but only O(√N) messages per processor. The tree-based algorithm requires only O(1) space and O(logNlog w) messages per processor where w is the average number of messages in transit per processor. The centralized algorithm requires only O(1) space and O(log w) messages per processor. We also have a matching lower bound for this problem. Our algorithms have applications in checkpointing, detecting stable predicates and implementing synchronizers. We have implemented our algorithms on top of the MPI library on the Blue Gene/L supercomputer. Our experiments confirm that the proposed algorithms significantly reduce the message and space complexity of a global snapshot.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
4
 
5
E. W. Dijkstra and C. S. Scholten. Termination detection for diffusing computations. Information Processing Letters, 11(4):1--4, Aug. 1980.
 
6
 
7
A. D. Kshemkalyani, M. Raynal, and M. Singhal. An introduction to snapshot algorithms in distributed computing. Distributed Systems Engineering, 2(4):224--233, Dec. 1995.
8
 
9
 
10
 
11
M. Spezialetti and P. Kearns. Efficient distributed snapshots. In Proc. of the 6th International Conference on Distributed Computing Systems, pages 382--388, 1986.


Collaborative Colleagues:
Rahul Garg: colleagues
Vijay K. Garg: colleagues
Yogish Sabharwal: colleagues