|
ABSTRACT
Efficient synchronization primitives are essential for achieving high performance in fine-grain, shared-memory parallel programs. One function of synchronization primitives is to enable exclusive access to shared data and critical sections of code. This paper makes three contributions. (1) We enumerate the five sources of overhead that locking synchronization primitives can incur. (2) We describe four mechanisms (local spinning, queue-based locking, collocation, and synchronized prefetch) that reduce these synchronization overheads. (3) With detailed simulations, we show the extent to which these four mechanisms can improve the performance of shared-memory programs. We evaluate the space of these mechanisms using seventeen synchronization constructs, which are formed from six base typed of locks (TEST&SET, TEST&TEST&SET, MCS, LH, M, and QOLB). We show that large performance gains (speedups of more than 1.5 for three of five benchmarks) can be achieved if at least three optimizing mechanisms are used simultaneously. We find that QOLB, which incorporates all four mechanisms, outperforms all other primitives (including reactive synchronization) in all cases. Finally, we demonstrate the superior performance of a low-cost implementation of QOLB, which runs on an unmodified cluster of commodity workstations.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
 |
2
|
T. E. Anderson , D. D. Lazowska , H. M. Levy, The performance implications of thread management alternatives for shared-memory multiprocessors, Proceedings of the 1989 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, p.49-60, May 23-26, 1989, Oakland, California, United States
|
| |
3
|
|
| |
4
|
Nanette J. Boden , Danny Cohen , Robert E. Felderman , Alan E. Kulawik , Charles L. Seitz , Jakov N. Seizovic , Wen-King Su, Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, v.15 n.1, p.29-36, February 1995
[doi> 10.1109/40.342015]
|
| |
5
|
Douglas C. Burger and James R. Goodman. Simulation of the SCI Transport Layer on the Wisconsin Wind Tunnel. In Proceedings of the Second International Workshop on SCI-Based High-Performance Low-Cost Computing, March 1995.
|
| |
6
|
|
| |
7
|
Convex Computer Corporation, Richardson, TX. SPPIO00 Systems Overview, 1994.
|
| |
8
|
Travis S. Craig. Building FIFO and Priority-Queueing Spin Locks from Atomic Swap. Technical Report 93-02-02, Department of Computer Science and Engineering, University of Washington, Seattle, WA, February 1993.
|
| |
9
|
Cypress Semiconductor, San Jose, CA. CY7C601 SPARC RISC User's Guide, second edition, 1990.
|
| |
10
|
Joseph A. Fisher. Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Transactions on Computers, (2-30(7):478- 490, July 1981.
|
| |
11
|
Kourosh Gharaehodoo, Sarita V. Adve, Anoop Gupta, John L. Hennessy, and Mark D. Hill. Programming for Different Memory Consisteney Models. Journal of Parallel and Distributed Computing, 15(4):399--407, 1992.
|
 |
12
|
Kourosh Gharachorloo , Daniel Lenoski , James Laudon , Phillip Gibbons , Anoop Gupta , John Hennessy, Memory consistency and event ordering in scalable shared-memory multiprocessors, Proceedings of the 17th annual international symposium on Computer Architecture, p.15-26, May 28-31, 1990, Seattle, Washington, United States
|
 |
13
|
James R. Goodman , Mary K. Vernon , Philip J. Woest, Efficient synchronization primitives for large-scale cache-coherent multiprocessors, Proceedings of the third international conference on Architectural support for programming languages and operating systems, p.64-75, April 03-06, 1989, Boston, Massachusetts, United States
|
 |
14
|
|
| |
15
|
Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. MeAuliffe, Larry Rudolph, and Mare Snir. The NYU Ultraeomputer- Designing an MMD Shared Memory Parallel Computer. IEEE Transactions on Computers, (2-32(2):175-189, February 1983.
|
| |
16
|
|
| |
17
|
International Business Machines, Inc., Poughkeepsie, NY. IBM Systert/360 Principles of Operation, ninth edition, May 1970.
|
 |
18
|
Alain Kägi , Nagi Aboulenein , Douglas C. Burger , James R. Goodman, Techniques for reducing overheads of shared-memory multiprocessing, Proceedings of the 9th international conference on Supercomputing, p.11-20, July 03-07, 1995, Barcelona, Spain
[doi> 10.1145/224538.224540]
|
| |
19
|
Alain Ktigi and James R. Goodman. SofiQOLB: An Ultra-Efficient Synchronization Primitive for Clusters of Commodity Workstations. Technical Report 1327, Computer Sciences Department, University of Wisconsin, Madison, WI, November 1996.
|
| |
20
|
R.E. Kessler and J. L. Sehwartzmeier. CRAY T3D: A New Dimension for Cray Research. In Proceedings of the 38thlEEE Computer Society International Conference (COMPCON), pages 176--182, February I993.
|
 |
21
|
J. Kuskin , D. Ofelt , M. Heinrich , J. Heinlein , R. Simoni , K. Gharachorloo , J. Chapin , D. Nakahira , J. Baxter , M. Horowitz , A. Gupta , M. Rosenblum , J. Hennessy, The Stanford FLASH multiprocessor, Proceedings of the 21ST annual international symposium on Computer architecture, p.302-313, April 18-21, 1994, Chicago, Illinois, United States
|
 |
22
|
|
 |
23
|
Charles E. Leiserson , Zahi S. Abuhamdeh , David C. Douglas , Carl R. Feynman , Mahesh N. Ganmukhi , Jeffrey V. Hill , Daniel Hillis , Bradley C. Kuszmaul , Margaret A. St. Pierre , David S. Wells , Monica C. Wong , Shaw-Wen Yang , Robert Zak, The network architecture of the Connection Machine CM-5 (extended abstract), Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures, p.272-285, June 29-July 01, 1992, San Diego, California, United States
[doi> 10.1145/140901.141883]
|
 |
24
|
|
| |
25
|
Daniel Lenoski , James Laudon , Kourosh Gharachorloo , Wolf-Dietrich Weber , Anoop Gupta , John Hennessy , Mark Horowitz , Monica S. Lam, The Stanford Dash Multiprocessor, Computer, v.25 n.3, p.63-79, March 1992
[doi> 10.1109/2.121510]
|
 |
26
|
|
 |
27
|
|
| |
28
|
|
 |
29
|
|
 |
30
|
|
 |
31
|
Shubhendu S. Mukherjee , Babak Falsafi , Mark D. Hill , David A. Wood, Coherent network interfaces for fine-grain communication, Proceedings of the 23rd annual international symposium on Computer architecture, p.247-258, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
| |
32
|
G. 17. Pfister, W.C. Brantley, D.A. George, S. L, Harvey, W, J, Kleinfelder, K.P. MeAuliffe, E.A. Melton, V. A, Norton, and J. Weiss. The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture. In Proceedings of the 1985 bztenta, tional Conference on Parallel Processing, pages 764-771, August 1985.
|
 |
33
|
Steven K. Reinhardt , Mark D. Hill , James R. Larus , Alvin R. Lebeck , James C. Lewis , David A. Wood, The Wisconsin Wind Tunnel: virtual prototyping of parallel computers, Proceedings of the 1993 ACM SIGMETRICS conference on Measurement and modeling of computer systems, p.48-60, May 10-14, 1993, Santa Clara, California, United States
|
 |
34
|
S. K. Reinhardt , J. R. Larus , D. A. Wood, Tempest and typhoon: user-level shared memory, Proceedings of the 21ST annual international symposium on Computer architecture, p.325-336, April 18-21, 1994, Chicago, Illinois, United States
|
 |
35
|
Steven K. Reinhardt , Robert W. Pfile , David A. Wood, Decoupled hardware support for distributed shared memory, Proceedings of the 23rd annual international symposium on Computer architecture, p.34-43, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
| |
36
|
ROSS Technology, Inc., Austin, TX. SPARC RISC User's Guide: hyperSPARC Edition, third edition, September 1993.
|
 |
37
|
|
 |
38
|
Daniel J. Scales , Kourosh Gharachorloo , Chandramohan A. Thekkath, Shasta: a low overhead, software-only approach for supporting fine-grain shared memory, Proceedings of the seventh international conference on Architectural support for programming languages and operating systems, p.174-185, October 01-04, 1996, Cambridge, Massachusetts, United States
|
| |
39
|
Ioannis Sehoinas, Babak Falsafi, Mark D, Hill, James R, 1,arus, Christopher E. Lukas, Shubhendu S. Mukherjee, Steven K. Relnhardt, Erie Sehnarr, and David A. Wood. Implementlng Fine-Graln Distributed Shared Memory on Commodity SMP Workstations, Technical Report 1307, UWCS, March 1996.
|
 |
40
|
Ioannis Schoinas , Babak Falsafi , Alvin R. Lebeck , Steven K. Reinhardt , James R. Larus , David A. Wood, Fine-grain access control for distributed shared memory, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.297-306, October 05-07, 1994, San Jose, California, United States
|
 |
41
|
|
 |
42
|
|
| |
43
|
IEEE Computer Society. Sealable Coherent interface (SCI), ANSI/ IE Std 1596-1992, August 1993.
|
| |
44
|
|
| |
45
|
Webster. Webster's Seventh Dictionary. 1965,
|
 |
46
|
Steven Cameron Woo , Moriyoshi Ohara , Evan Torrie , Jaswinder Pal Singh , Anoop Gupta, The SPLASH-2 programs: characterization and methodological considerations, Proceedings of the 22nd annual international symposium on Computer architecture, p.24-36, June 22-24, 1995, S. Margherita Ligure, Italy
|
CITED BY 28
|
|
|
|
|
|
|
|
|
|
|
Bilge Saglam Akgul , Jaehwan Lee , Vincent John Mooney, A system-on-a-chip lock cache with task preemption support, Proceedings of the 2001 international conference on Compilers, architecture, and synthesis for embedded systems, November 16-17, 2001, Atlanta, Georgia, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Matteo Monchiero , Gianluca Palermo , Cristina Silvano , Oreste Villa, Power/performance hardware optimization for synchronization intensive applications in MPSoCs, Proceedings of the conference on Design, automation and test in Europe: Proceedings, March 06-10, 2006, Munich, Germany
|
|
|
|
|
|
Yao Guo , Vladimir Vlassov , Raksit Ashok , Richard Weiss , Csaba Andras Moritz, Synchronization coherence: A transparent hardware mechanism for cache coherence and fine-grained synchronization, Journal of Parallel and Distributed Computing, v.68 n.2, p.165-181, February, 2008
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|