| Scalable algorithms for global snapshots in distributed systems |
| Full text |
Pdf
(354 KB)
|
| Source
|
International Conference on Supercomputing
archive
Proceedings of the 20th annual international conference on Supercomputing
table of contents
Cairns, Queensland, Australia
Pages: 269 - 277
Year of Publication: 2006
ISBN:1-59593-282-8
|
|
Authors
|
|
Rahul Garg
|
IBM India Research Lab, IIT Delhi, Hauz Khas, New Delhi, India
|
|
Vijay K. Garg
|
The University of Texas at Austin, Austin, TX
|
|
Yogish Sabharwal
|
IBM India Research Lab, IIT Delhi, Hauz Khas, New Delhi, India
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 9, Downloads (12 Months): 80, Citation Count: 1
|
|
|
ABSTRACT
Existing algorithms for global snapshots in distributed systems are not scalable when the underlying topology is complete. In a network with N processors, these algorithms require O(N) space and O(N) messages per processor. As a result, these algorithms are not efficient in large systems when the logical topology of the communication layer such as MPI is complete. In this paper, we propose three algorithms for global snapshot: a grid-based, a tree-based and a centralized algorithm. The grid-based algorithm uses O(N) space but only O(√N) messages per processor. The tree-based algorithm requires only O(1) space and O(logNlog w) messages per processor where w is the average number of messages in transit per processor. The centralized algorithm requires only O(1) space and O(log w) messages per processor. We also have a matching lower bound for this problem. Our algorithms have applications in checkpointing, detecting stable predicates and implementing synchronizers. We have implemented our algorithms on top of the MPI library on the Blue Gene/L supercomputer. Our experiments confirm that the proposed algorithms significantly reduce the message and space complexity of a global snapshot.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
|
 |
3
|
|
 |
4
|
|
| |
5
|
E. W. Dijkstra and C. S. Scholten. Termination detection for diffusing computations. Information Processing Letters, 11(4):1--4, Aug. 1980.
|
| |
6
|
|
| |
7
|
A. D. Kshemkalyani, M. Raynal, and M. Singhal. An introduction to snapshot algorithms in distributed computing. Distributed Systems Engineering, 2(4):224--233, Dec. 1995.
|
 |
8
|
|
| |
9
|
|
| |
10
|
Martin Schulz , Greg Bronevetsky , Rohit Fernandes , Daniel Marques , Keshav Pingali , Paul Stodghill, Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs, Proceedings of the 2004 ACM/IEEE conference on Supercomputing, p.38, November 06-12, 2004
[doi> 10.1109/SC.2004.29]
|
| |
11
|
M. Spezialetti and P. Kearns. Efficient distributed snapshots. In Proc. of the 6th International Conference on Distributed Computing Systems, pages 382--388, 1986.
|
|