ACM Home Page
Please provide us with feedback. Feedback
Algorithms for scalable synchronization on shared-memory multiprocessors
Full text PdfPdf (3.07 MB)
Source ACM Transactions on Computer Systems (TOCS) archive
Volume 9 ,  Issue 1  (February 1991) table of contents
Pages: 21 - 65  
Year of Publication: 1991
ISSN:0734-2071
Authors
John M. Mellor-Crummey  Rice Univ., Houston, TX
Michael L. Scott  Univ. of Rochester, Rochester, NY
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 109,   Downloads (12 Months): 535,   Citation Count: 170
Additional Information:

abstract   references   cited by   index terms   review   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/103727.103729
What is a DOI?

ABSTRACT

Busy-wait techniques are heavily used for mutual exclusion and barrier synchronization in shared-memory parallel programs. Unfortunately, typical implementations of busy-waiting tend to produce large amounts of memory and interconnect contention, introducing performance bottlenecks that become markedly more pronounced as applications scale. We argue that this problem is not fundamental, and that one can in fact construct busy-wait synchronization algorithms that induce no memory or interconnect contention. The key to these algorithms is for every processor to spin on separate locally-accessible flag variables, and for some other processor to terminate the spin with a single remote write operation at an appropriate time. Flag variables may be locally-accessible as a result of coherent caching, or by virtue of allocation in the local portion of physically distributed shared memory. We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock acquisition, independent of the number of processors attempting to acquire the lock. Our algorithm provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction. We also present a new scalable barrier algorithm that generates 0(1) remote references per processor reaching the barrier, and observe that two previously-known barriers can likewise be cast in a form that spins only on locally-accessible flag variables. None of these barrier algorithms requires hardware support beyond the usual atomicity of memory reads and writes. We compare the performance of our scalable algorithms with other software approaches to busy-wait synchronization on both a Sequent Symmetry and a BBN Butterfly. Our principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors. The existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides a case against so-called “dance hall” architectures, in which shared memory locations are equally far from all processors. From the Authors' Abstract


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
2
3
 
4
 
5
 
6
7
 
8
BBN Laboratories. Butterfly parallel processor overview. Tech. Rep. 6148, version 1, BBN Laboratories, Cambridge, Mass, Mar. 1986.
 
9
 
10
BZOOKS, E. D., III. The shared memory hypercube. Parallel Comput. 6 (1988), 235-245.
11
12
13
14
15
 
16
GOTTLIEB, A. Scalability, combining and the NYU ultracomputer. Ohio State University Parallel Computing Workshop, Mar. 1990. Invited Lecture.
 
17
GOTTLIEB, A., GRISHMAN, R., KRUSKAL, C. P., McAuLIFFE, K. P., RUDOLPH, L., AND SNIR, M. The NYU Ultracomputer--Designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.
 
18
 
19
20
 
21
JAYASIMHA, D. N. Distributed synchronizers. In Proceedings of the 1988 Internatwnal Conference on Parallel Processing (Aug. 1988), 23-27.
22
 
23
KUMAR, M., AND PFISTER, G.F. The onset of hot spot contention In Proceedings of the 1986 International Conference on Parallel Processing (1986), 28-34.
24
25
26
 
27
LEE, C.A. Barrier synchronization over multistage interconnection networks. In Proceedrags of the Second IEEE Symposium on Parallel and Distributed Processing (Dallas, Tex, Dec 1990), 130-133.
28
 
29
LUBACHEVSK~, B. An approach to automating the verification of compact parallel coordination programs. I Acta Inf. 21 (1984), 125-169.
 
30
LUBACnEVSKY, B. Synchronization barrier and related tools for shared memory parallel programming. In Proceedings of the 1989 International Converence on Parallel Processing (Aug 1989), II-175-II-179.
31
 
32
MELLOR-CRUMMEY, J.M. Concurrent queues: Practical fetch-and-~ algorithms. Tech. Rep. 229, Computer Science Dept., Univ. of Rochester, Nov. 1987.
 
33
34
 
35
P1596 Working Group of the IEEE Computer Society Microprocessor Standards Committee. SCI (scalable coherent interface): An overview of extended cache-coherence protocols, Feb 5, 1990. Draft 0.59 P1596/Part III-D.
36
 
37
PFISTER, G., BRANTLEY, W. C., GEORGE, D. A., HARVEY, S. L., KLEINFELDER, W. J., McAvLIFFE, K. P., MELTON, T. A., NORTON, V A., AND WEISS, J. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing (Aug. 1985), 764-771.
 
38
PFISTER, G., AND NORTON, V. A. "Hot spot" contention and combining in multistage interconnection networks. IEEE Trans. Comput. C-34, 10 (Oct. 1985), 943 948.
39
 
40
41
 
42
43
44
45
46
47
 
48
TANG, P., AND YEW, P.-C. Processor self-scheduling for multiple-nested parallel loops. In Proceedtngs of the 1986 International Conference on Parallel Processing (Aug. 1986), 528-535.
 
49
TANG, P., AND Y~W, P.-C. Algorithms for distributing hot-spot addressing CSRD Rep. 617, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana- Champaign, Jan. 1987.
 
50
YEW, P.-C. Architecture of the Cedar parallel supercomputer. CSRD Rep 609, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Aug. 1986.
 
51
 
52
ZAHORIAN, J., LAZOWSKA, E. D., AND EAGER, D. L. The effect of scheduling discipline on spin overhead in shared memory parallel processors. Tech. Rep. TR-89-07-03, Computer Science Dept., Univ. of Washington, July 1989.

CITED BY  170


REVIEW

"Roy Levin : Reviewer"

The authors have compiled a competent survey of spin-lock and barrier synchronization algorithms, including some new ones of their own. Two sections of the paper present these algorithms with clear pseudocode. Performance comparisons on a BBN   more...

Collaborative Colleagues:
John M. Mellor-Crummey: colleagues
Michael L. Scott: colleagues