|
ABSTRACT
Busy-wait techniques are heavily used for mutual exclusion and
barrier synchronization in shared-memory parallel programs.
Unfortunately, typical implementations of busy-waiting tend to produce
large amounts of memory and interconnect contention, introducing
performance bottlenecks that become markedly more pronounced as
applications scale. We argue that this problem is not fundamental, and
that one can in fact construct busy-wait synchronization algorithms that
induce no memory or interconnect contention. The key to these algorithms
is for every processor to spin on separate
locally-accessible flag variables,
and for some other processor to terminate the spin with a single remote
write operation at an appropriate time. Flag variables may be
locally-accessible as a result of coherent caching, or by virtue of
allocation in the local portion of physically distributed shared
memory.
We present a new scalable algorithm for spin locks that generates 0(1) remote references per lock
acquisition, independent of the number of processors attempting to
acquire the lock. Our algorithm provides reasonable latency in the
absence of contention, requires only a constant amount of space per
lock, and requires no hardware support other than a swap-with-memory
instruction. We also present a new scalable barrier algorithm that
generates 0(1) remote references per
processor reaching the barrier, and observe that two previously-known
barriers can likewise be cast in a form that spins only on
locally-accessible flag variables. None of these barrier algorithms
requires hardware support beyond the usual atomicity of memory reads and
writes.
We compare the performance of our scalable algorithms with other
software approaches to busy-wait synchronization on both a Sequent
Symmetry and a BBN Butterfly. Our principal conclusion is that
contention due to synchronization need not be a problem
in large-scale shared-memory multiprocessors. The
existence of scalable algorithms greatly weakens the case for costly
special-purpose hardware support for synchronization, and provides a
case against so-called “dance hall” architectures, in which
shared memory locations are equally far from all processors.
—From the Authors' Abstract
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
|
 |
3
|
A. Agarwal , R. Simoni , J. Hennessy , M. Horowitz, An evaluation of directory schemes for cache coherence, Proceedings of the 15th Annual International Symposium on Computer architecture, p.280-298, May 30-June 02, 1988, Honolulu, Hawaii, United States
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
 |
7
|
|
| |
8
|
BBN Laboratories. Butterfly parallel processor overview. Tech. Rep. 6148, version 1, BBN Laboratories, Cambridge, Mass, Mar. 1986.
|
| |
9
|
|
| |
10
|
BZOOKS, E. D., III. The shared memory hypercube. Parallel Comput. 6 (1988), 235-245.
|
 |
11
|
Helen Davis , John Hennessy, Characterizing the synchronization behavior of parallel programs, Proceedings of the ACM/SIGPLAN conference on Parallel programming: experience with applications, languages and systems, p.198-211, July 19-21, 1988, New Haven, Connecticut, United States
|
 |
12
|
|
 |
13
|
Kourosh Gharachorloo , Daniel Lenoski , James Laudon , Phillip Gibbons , Anoop Gupta , John Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, Proceedings of the 17th annual international symposium on Computer Architecture, p.15-26, May 28-31, 1990, Seattle, Washington, United States
|
 |
14
|
James R. Goodman , Mary K. Vernon , Philip J. Woest, Efficient synchronization primitives for large-scale cache-coherent multiprocessors, Proceedings of the third international conference on Architectural support for programming languages and operating systems, p.64-75, April 03-06, 1989, Boston, Massachusetts, United States
|
 |
15
|
|
| |
16
|
GOTTLIEB, A. Scalability, combining and the NYU ultracomputer. Ohio State University Parallel Computing Workshop, Mar. 1990. Invited Lecture.
|
| |
17
|
GOTTLIEB, A., GRISHMAN, R., KRUSKAL, C. P., McAuLIFFE, K. P., RUDOLPH, L., AND SNIR, M. The NYU Ultracomputer--Designing an MIMD shared memory parallel computer. IEEE Trans. Comput. C-32, 2 (Feb. 1983), 175-189.
|
| |
18
|
|
| |
19
|
|
 |
20
|
|
| |
21
|
JAYASIMHA, D. N. Distributed synchronizers. In Proceedings of the 1988 Internatwnal Conference on Parallel Processing (Aug. 1988), 23-27.
|
 |
22
|
Clyde P Kruskal , Larry Rudolph , Marc Snir, Efficient synchronization of multiprocessors with shared memory, Proceedings of the fifth annual ACM symposium on Principles of distributed computing, p.218-228, August 11-13, 1986, Calgary, Alberta, Canada
[doi> 10.1145/10590.10609]
|
| |
23
|
KUMAR, M., AND PFISTER, G.F. The onset of hot spot contention In Proceedings of the 1986 International Conference on Parallel Processing (1986), 28-34.
|
 |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
LEE, C.A. Barrier synchronization over multistage interconnection networks. In Proceedrags of the Second IEEE Symposium on Parallel and Distributed Processing (Dallas, Tex, Dec 1990), 130-133.
|
 |
28
|
|
| |
29
|
LUBACHEVSK~, B. An approach to automating the verification of compact parallel coordination programs. I Acta Inf. 21 (1984), 125-169.
|
| |
30
|
LUBACnEVSKY, B. Synchronization barrier and related tools for shared memory parallel programming. In Proceedings of the 1989 International Converence on Parallel Processing (Aug 1989), II-175-II-179.
|
 |
31
|
|
| |
32
|
MELLOR-CRUMMEY, J.M. Concurrent queues: Practical fetch-and-~ algorithms. Tech. Rep. 229, Computer Science Dept., Univ. of Rochester, Nov. 1987.
|
| |
33
|
|
 |
34
|
|
| |
35
|
P1596 Working Group of the IEEE Computer Society Microprocessor Standards Committee. SCI (scalable coherent interface): An overview of extended cache-coherence protocols, Feb 5, 1990. Draft 0.59 P1596/Part III-D.
|
 |
36
|
|
| |
37
|
PFISTER, G., BRANTLEY, W. C., GEORGE, D. A., HARVEY, S. L., KLEINFELDER, W. J., McAvLIFFE, K. P., MELTON, T. A., NORTON, V A., AND WEISS, J. The IBM research parallel processor prototype (RP3): Introduction and architecture. In Proceedings of the 1985 International Conference on Parallel Processing (Aug. 1985), 764-771.
|
| |
38
|
PFISTER, G., AND NORTON, V. A. "Hot spot" contention and combining in multistage interconnection networks. IEEE Trans. Comput. C-34, 10 (Oct. 1985), 943 948.
|
 |
39
|
|
| |
40
|
|
 |
41
|
|
| |
42
|
|
 |
43
|
|
 |
44
|
|
 |
45
|
|
 |
46
|
|
 |
47
|
M. L. Scott , T. J. LeBlanc , B. D. Marsh, Multi-model parallel programming in psyche, Proceedings of the second ACM SIGPLAN symposium on Principles & practice of parallel programming, p.70-78, March 14-16, 1990, Seattle, Washington, United States
|
| |
48
|
TANG, P., AND YEW, P.-C. Processor self-scheduling for multiple-nested parallel loops. In Proceedtngs of the 1986 International Conference on Parallel Processing (Aug. 1986), 528-535.
|
| |
49
|
TANG, P., AND Y~W, P.-C. Algorithms for distributing hot-spot addressing CSRD Rep. 617, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana- Champaign, Jan. 1987.
|
| |
50
|
YEW, P.-C. Architecture of the Cedar parallel supercomputer. CSRD Rep 609, Center for Supercomputing Research and Development, Univ. of Illinois at Urbana-Champaign, Aug. 1986.
|
| |
51
|
|
| |
52
|
ZAHORIAN, J., LAZOWSKA, E. D., AND EAGER, D. L. The effect of scheduling discipline on spin overhead in shared memory parallel processors. Tech. Rep. TR-89-07-03, Computer Science Dept., Univ. of Washington, July 1989.
|
CITED BY 170
|
|
|
|
|
Alain Kägi , Nagi Aboulenein , Douglas C. Burger , James R. Goodman, Techniques for reducing overheads of shared-memory multiprocessing, Proceedings of the 9th international conference on Supercomputing, p.11-20, July 03-07, 1995, Barcelona, Spain
|
|
|
|
|
|
|
|
|
|
|
|
Maged M. Michael , Michael L. Scott, Simple, fast, and practical non-blocking and blocking concurrent queue algorithms, Proceedings of the fifteenth annual ACM symposium on Principles of distributed computing, p.267-275, May 23-26, 1996, Philadelphia, Pennsylvania, United States
|
|
|
|
|
|
Babak Falsafi , Alvin R. Lebeck , Steven K. Reinhardt , Ioannis Schoinas , Mark D. Hill , James R. Larus , Anne Rogers , David A. Wood, Application-specific protocols for user-level shared memory, Proceedings of the 1994 conference on Supercomputing, p.380-389, December 1994, Washington, D.C., United States
|
|
|
|
|
|
|
|
|
|
|
|
Jae-Heon Yang , James H. Anderson, Fast, scalable synchronization with minimal hardware support, Proceedings of the twelfth annual ACM symposium on Principles of distributed computing, p.171-182, August 15-18, 1993, Ithaca, New York, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Jason Liu , David M. Nicol , King Tan, Lock-free scheduling of logical processes in parallel simulation, Proceedings of the fifteenth workshop on Parallel and distributed simulation, p.22-31, May 15-18, 2001, Lake Arrowhead, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ole Agesen , David Detlefs , Alex Garthwaite , Ross Knippel , Y. S. Ramakrishna , Derek White, An efficient meta-lock for implementing ubiquitous synchronization, ACM SIGPLAN Notices, v.34 n.10, p.207-222, Oct. 1999
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
James H. Anderson , Mark Moir, Using k-exclusion to implement resilient, scalable shared objects (extended abstract), Proceedings of the thirteenth annual ACM symposium on Principles of distributed computing, p.141-150, August 14-17, 1994, Los Angeles, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
J. K. Bennett , S. Dwarkadas , J. Greenwood , E. Speight, Willow: a scalable shared memory multiprocessor, Proceedings of the 1992 ACM/IEEE conference on Supercomputing, p.336-345, November 16-20, 1992, Minneapolis, Minnesota, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bilge Saglam Akgul , Jaehwan Lee , Vincent John Mooney, A system-on-a-chip lock cache with task preemption support, Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems, November 16-17, 2001, Atlanta, Georgia, USA
|
|
|
|
|
|
Vara Ramakrishnan , Isaac D. Scherson , Raghu Subramanian, Efficient techniques for fast nested barrier synchronization, Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, p.157-164, June 24-26, 1995, Santa Barbara, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Sanjeev Kumar , Michael Chu , Christopher J. Hughes , Partha Kundu , Anthony Nguyen, Hybrid transactional memory, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, March 29-31, 2006, New York, New York, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Anja Feldmann , Thomas Gross , David O'Hallaron , Thomas M. Stricker, Subset barrier synchronization on a private-memory parallel system, Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures, p.209-218, June 29-July 01, 1992, San Diego, California, United States
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Babak Falsafi , Alvin R. Lebeck , Steven K. Reinhardt , Ioannis Schoinas , Mark D. Hill , James R. Larus , Anne Rogers , David A. Wood, Application-specific protocols for user-level shared memory, Proceedings of the 1994 ACM/IEEE conference on Supercomputing, November 14-18, 1994, Washington, D.C.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
William N. Scherer, III , Doug Lea , Michael L. Scott, Scalable synchronous queues, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, March 29-31, 2006, New York, New York, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Matteo Monchiero , Gianluca Palermo , Cristina Silvano , Oreste Villa, Power/performance hardware optimization for synchronization intensive applications in MPSoCs, Proceedings of the conference on Design, automation and test in Europe: Proceedings, March 06-10, 2006, Munich, Germany
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Bratin Saha , Ali-Reza Adl-Tabatabai , Anwar Ghuloum , Mohan Rajagopalan , Richard L. Hudson , Leaf Petersen , Vijay Menon , Brian Murphy , Tatiana Shpeisman , Eric Sprangle , Anwar Rohillah , Doug Carmean , Jesse Fang, Enabling scalability and performance in a large scale CMP environment, ACM SIGOPS Operating Systems Review, v.41 n.3, June 2007
|
|
|
|
|
|
|
|
|
Erich M. Nahum , David J. Yates , James F. Kurose , Don Towsley, Performance issues in parallelized network protocols, Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation, p.10-es, November 14-17, 1994, Monterey, California
|
|
|
Ronald C. Unrau , Orran Krieger , Benjamin Gamsa , Michael Stumm, Experiences with locking in a NUMA multiprocessor operating system kernel, Proceedings of the 1st USENIX conference on Operating Systems Design and Implementation, p.11-es, November 14-17, 1994, Monterey, California
|
|
|
|
|
|
|
|
|
Zhen Fang , Lixin Zhang , John B. Carter , Ali Ibrahim , Michael A. Parker, Active memory operations, Proceedings of the 21st annual international conference on Supercomputing, June 17-21, 2007, Seattle, Washington
|
|
|
|
|
|
|
|
|
|
|
|
Jayaram Bobba , Kevin E. Moore , Haris Volos , Luke Yen , Mark D. Hill , Michael M. Swift , David A. Wood, Performance pathologies in hardware transactional memory, ACM SIGARCH Computer Architecture News, v.35 n.2, May 2007
|
|
|
|
|
|
Jack Sampson , Ruben Gonzalez , Jean-Francois Collard , Norman P. Jouppi , Mike Schlansker , Brad Calder, Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers, Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, p.235-246, December 09-13, 2006
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Chen Tian , Vijay Nagarajan , Rajiv Gupta , Sriraman Tallam, Dynamic recognition of synchronization operations for improved data race detection, Proceedings of the 2008 international symposium on Software testing and analysis, July 20-24, 2008, Seattle, WA, USA
|
|
|
|
|
|
Andrea Marongiu , Luca Benini , Mahmut Kandemir, Lightweight barrier-based parallelization support for non-cache-coherent MPSoC platforms, Proceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems, September 30-October 03, 2007, Salzburg, Austria
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Ryan Johnson , Ippokratis Pandis , Nikos Hardavellas , Anastasia Ailamaki , Babak Falsafi, Shore-MT: a scalable storage manager for the multicore era, Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, March 24-26, 2009, Saint Petersburg, Russia
|
|
|
Antonino Tumeo , Christian Pilato , Gianluca Palermo , Fabrizio Ferrandi , Donatella Sciuto, HW/SW methodologies for synchronization in FPGA multiprocessors, Proceeding of the ACM/SIGDA international symposium on Field programmable gate arrays, February 22-24, 2009, Monterey, California, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
John H. Kelm , Daniel R. Johnson , Matthew R. Johnson , Neal C. Crago , William Tuohy , Aqeel Mahesri , Steven S. Lumetta , Matthew I. Frank , Sanjay J. Patel, Rigel: an architecture and scalable programming interface for a 1000-core accelerator, ACM SIGARCH Computer Architecture News, v.37 n.3, June 2009
|
REVIEW
"Roy Levin : Reviewer"
The authors have compiled a competent survey of spin-lock and
barrier synchronization algorithms, including some new ones of their
own. Two sections of the paper present these algorithms with clear
pseudocode. Performance comparisons on a BBN
more...
|