Mechanisms for store-wait-free multiprocessors
International Symposium on Computer Architecture
Proceedings of the 34th annual international symposium on Computer architecture
San Diego, California, USA
SESSION: Memory consistency
Pages: 266 - 277
Year of Publication: 2007
ISBN: 978-1-59593-706-3
ABSTRACT
Store misses cause significant delays in shared-memory multiprocessors because of limited store buffering and ordering constraints required for proper synchronization. Today, programmers must choose from a spectrum of memory consistency models that reduce store stalls at the cost of increased programming complexity. Prior research suggests that the performance gap among consistency models can be closed through speculation: enforcing order only when dynamically necessary. Unfortunately, past designs either provide insufficient buffering, replace all stores with read-modify-write operations, and/or recover from ordering violations via impractical fine-grained rollback mechanisms. We propose two mechanisms that, together, enable store-wait-free implementations of any memory consistency model. To eliminate buffer-capacity-related stalls, we propose the scalable store buffer, which places private/speculative values directly into the L1 cache, thereby eliminating the non-scalable associative search of conventional store buffers. To eliminate ordering-related stalls, we propose atomic sequence ordering, which enforces ordering constraints over coarse-grain access sequences while relaxing order among individual accesses. Using cycle-accurate full-system simulation of scientific and commercial applications, we demonstrate that these mechanisms allow the simplified programming of strict ordering while outperforming conventional implementations on average by 32% (sequential consistency), 22% (SPARC total store order), and 9% (SPARC relaxed memory order).
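As a rough illustration of the contrast the abstract draws between a conventional store buffer and one that keeps private values in the L1 cache, the toy Python model below may help. It is only a sketch under stated assumptions, not the paper's hardware design: the class names, the dictionary standing in for the L1 cache, the per-address pending counter, and the unbounded capacity of the second model are all hypothetical simplifications introduced here.

```python
from collections import deque


class ConventionalStoreBuffer:
    """Toy FIFO store buffer: every load searches all buffered entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()  # (addr, value) pairs in program order

    def store(self, addr, value):
        if len(self.entries) >= self.capacity:
            return False  # buffer full: a real core would stall here
        self.entries.append((addr, value))
        return True

    def load(self, addr, memory):
        # Associative search, youngest to oldest, for a matching store.
        for a, v in reversed(self.entries):
            if a == addr:
                return v
        return memory.get(addr, 0)

    def drain_one(self, memory):
        # Make the oldest buffered store visible in shared memory.
        if self.entries:
            addr, value = self.entries.popleft()
            memory[addr] = value


class L1BackedStoreBuffer:
    """Toy model (assumption): private values live in an L1-like dict,
    and a FIFO records store order only; the FIFO is never searched."""

    def __init__(self):
        self.l1 = {}          # addr -> newest private value for this core
        self.pending = {}     # addr -> number of undrained stores
        self.order = deque()  # (addr, value) pairs in program order

    def store(self, addr, value):
        self.l1[addr] = value
        self.pending[addr] = self.pending.get(addr, 0) + 1
        self.order.append((addr, value))
        return True  # capacity limits are deliberately not modeled

    def load(self, addr, memory):
        # Direct indexed lookup; no scan over buffered stores.
        if addr in self.l1:
            return self.l1[addr]
        return memory.get(addr, 0)

    def drain_one(self, memory):
        # Make the oldest store visible, then drop the private copy
        # once no younger store to the same address remains.
        if self.order:
            addr, value = self.order.popleft()
            memory[addr] = value
            self.pending[addr] -= 1
            if self.pending[addr] == 0:
                del self.pending[addr]
                del self.l1[addr]
```

In this sketch, a load in the first class walks the whole buffer, while a load in the second class performs one indexed lookup regardless of how many stores are outstanding. The model ignores coherence, cache-line granularity, and ordering violations, all of which a real design would have to handle.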