ABSTRACT
Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the data is found in another processor's cache rather than in memory. SMPs are optimized for this case by using snooping protocols that broadcast address transactions to all processors. Conversely, directory-based shared-memory systems must indirectly locate the owner and sharers through a directory, resulting in larger average miss latencies.

This paper proposes timestamp snooping, a technique that allows SMPs to i) utilize high-speed switched interconnection networks and ii) exploit physical locality by delivering address transactions to processors and memories without regard to order. Traditional snooping requires physical ordering of transactions. Timestamp snooping works by processing address transactions in a logical order. Logical time is maintained by adding a few bits per address transaction and having network switches perform a handshake to ensure on-time delivery. Processors and memories then reorder transactions based on their timestamps to establish a total order.

We evaluate timestamp snooping with commercial workloads on a 16-processor SPARC system using the Simics full-system simulator. We simulate both an indirect (butterfly) and a direct (torus) network design. For OLTP, DSS, web serving, web searching, and one scientific application, timestamp snooping with the butterfly network runs 6-28% faster than directories, at a cost of 13-43% more link traffic. Similarly, with the torus network, timestamp snooping runs 6-29% faster for 17-37% more link traffic. Thus, timestamp snooping is worth considering when buying more interconnect bandwidth is easier than reducing interconnect latency.
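The reordering step described above, where each processor or memory sorts incoming address transactions by timestamp to reach a common total order, can be illustrated with a minimal Python sketch. It is not the paper's design: the names `AddressTransaction`, `ReorderBuffer`, and `guaranteed_time`, the tie-break on the issuing processor's id, and the use of unbounded integer timestamps (rather than a few bits per transaction) are all assumptions made for illustration. The abstract does not say how ties are broken or how an endpoint learns that no earlier transaction can still arrive; here that knowledge is simply passed in through `advance`.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class AddressTransaction:
    timestamp: int                        # logical time assigned to the transaction
    source: int                           # ASSUMED tie-breaker: id of the issuing processor
    address: int = field(compare=False)   # payload, not part of the ordering
    kind: str = field(compare=False)      # e.g. "GETS" or "GETX" (illustrative labels)


class ReorderBuffer:
    """Hypothetical endpoint buffer: holds transactions that arrived in any
    physical order and releases them in (timestamp, source) order."""

    def __init__(self):
        self._heap = []
        # ASSUMPTION: the network guarantees that no transaction with a
        # timestamp below this value can still arrive at this endpoint.
        self.guaranteed_time = 0

    def receive(self, txn):
        """Accept a transaction regardless of arrival order."""
        heapq.heappush(self._heap, txn)

    def advance(self, new_time):
        """Raise the guaranteed time and return, in logical order, every
        buffered transaction whose timestamp is now strictly below it."""
        self.guaranteed_time = max(self.guaranteed_time, new_time)
        ready = []
        while self._heap and self._heap[0].timestamp < self.guaranteed_time:
            ready.append(heapq.heappop(self._heap))
        return ready


if __name__ == "__main__":
    a = AddressTransaction(5, 1, 0x100, "GETS")
    b = AddressTransaction(3, 2, 0x200, "GETX")
    c = AddressTransaction(5, 0, 0x300, "GETS")

    p, q = ReorderBuffer(), ReorderBuffer()
    for txn in (a, b, c):   # arrival order seen by endpoint p
        p.receive(txn)
    for txn in (c, a, b):   # a different arrival order seen by endpoint q
        q.receive(txn)

    # Both endpoints process the transactions in the same order.
    print([hex(t.address) for t in p.advance(6)])
    print([hex(t.address) for t in q.advance(6)])
```

In this sketch two endpoints that receive the same transactions in different physical orders release them in the same logical order, which is the property a snooping protocol needs. A transaction is held until the guaranteed time has passed its timestamp, so a later arrival with an earlier timestamp cannot be ordered incorrectly.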