ABSTRACT
Historically, processor accesses to memory-mapped device registers have been marked uncachable to ensure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with cachable, coherent memory operations. Using coherence can improve performance by facilitating burst transfers of whole cache blocks and reducing control overheads (e.g., for polling).

This paper begins an exploration of network interfaces (NIs) that use coherence, that is, coherent network interfaces (CNIs), to improve communication performance. We restrict this study to NI/CNIs that reside on coherent memory or I/O buses, to NI/CNIs that are much simpler than processors, and to the performance of fine-grain messaging from user process to user process.

Our first contribution is to develop and optimize two mechanisms that CNIs use to communicate with processors. A cachable device register, derived from cachable control registers [39,40], is a coherent, cachable block of memory used to transfer status, control, or data between a device and a processor. Cachable queues generalize cachable device registers from one cachable, coherent memory block to a contiguous region of cachable, coherent blocks managed as a circular queue.

Our second contribution is a taxonomy and comparison of four CNIs with a more conventional NI. Microbenchmark results show that CNIs can improve the round-trip latency and achievable bandwidth of a small 64-byte message by 37% and 125% respectively on the memory bus and 74% and 123% respectively on a coherent I/O bus. Experiments with five macrobenchmarks show that CNIs can improve performance by 17-53% on the memory bus and 30-88% on the I/O bus.
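
The circular-queue organization that the abstract attributes to cachable queues can be illustrated with a small software-only model. This is a minimal sketch under stated assumptions, not the paper's design: the class name `BlockQueue`, the `head` and `tail` indices, and the one-slot-free full test are hypothetical choices made for illustration, and the model does not represent caches, coherence traffic, or any real device interface. The 64-byte block size is borrowed from the message size quoted in the abstract.

```python
BLOCK_SIZE = 64  # bytes per block (assumption: taken from the 64-byte message size)


class BlockQueue:
    """Software-only model of a contiguous region of blocks used as a circular queue.

    Hypothetical illustration: no caches, coherence, or device behavior is modeled.
    """

    def __init__(self, num_blocks):
        if num_blocks < 2:
            raise ValueError("need at least two blocks")
        self.num_blocks = num_blocks
        # One contiguous region, divided into num_blocks fixed-size blocks.
        self.region = bytearray(num_blocks * BLOCK_SIZE)
        self.head = 0  # index of the next block to dequeue
        self.tail = 0  # index of the next block to fill

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One block stays unused so that full and empty are distinguishable.
        return (self.tail + 1) % self.num_blocks == self.head

    def enqueue(self, payload):
        """Copy one message (at most BLOCK_SIZE bytes) into the tail block."""
        if len(payload) > BLOCK_SIZE:
            raise ValueError("payload larger than one block")
        if self.is_full():
            return False
        start = self.tail * BLOCK_SIZE
        block = bytes(payload).ljust(BLOCK_SIZE, b"\x00")  # zero-pad to a full block
        self.region[start:start + BLOCK_SIZE] = block
        self.tail = (self.tail + 1) % self.num_blocks  # wrap around the region
        return True

    def dequeue(self):
        """Return the head block as bytes, or None if the queue is empty."""
        if self.is_empty():
            return None
        start = self.head * BLOCK_SIZE
        block = bytes(self.region[start:start + BLOCK_SIZE])
        self.head = (self.head + 1) % self.num_blocks  # wrap around the region
        return block
```

In this model a sender would call `enqueue` and a receiver would call `dequeue`; each operation moves exactly one whole block, which loosely mirrors the whole-block transfers the abstract mentions.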
REFERENCES
[1] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie, Donald Yeung, The MIT Alewife machine: architecture and performance, Proceedings of the 22nd annual international symposium on Computer architecture, p.2-13, June 22-24, 1995, S. Margherita Ligure, Italy
[2] A. Agarwal, R. Simoni, J. Hennessy, M. Horowitz, An evaluation of directory schemes for cache coherence, Proceedings of the 15th Annual International Symposium on Computer architecture, p.280-298, May 30-June 02, 1988, Honolulu, Hawaii, United States
[4] M. A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E. W. Felten, J. Sandberg, Virtual memory mapped network interface for the SHRIMP multicomputer, Proceedings of the 21st annual international symposium on Computer architecture, p.142-153, April 18-21, 1994, Chicago, Illinois, United States
[5] Shekhar Borkar, Robert Cohn, George Cox, Thomas Gross, H. T. Kung, Monica Lam, Margie Levine, Brian Moore, Wire Moore, Craig Peterson, Jim Susman, Jim Sutton, John Urbanski, Jon Webb, Supporting systolic and memory communication in iWarp, Proceedings of the 17th annual international symposium on Computer Architecture, p.70-81, May 28-31, 1990, Seattle, Washington, United States
[6] Eric A. Brewer, Frederic T. Chong, Lok T. Liu, Shamik D. Sharma, John D. Kubiatowicz, Remote queues: exposing message queues for optimization and atomicity, Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, p.42-53, June 24-26, 1995, Santa Barbara, California, United States, doi:10.1145/215399.215416
[7] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. J. States, S. Swaminathan, and M. Karplus. Charmm: A program for macromolecular energy, minimization, and dynamics calculation. Journal of Computational Chemistry, 4(187), 1983.
[8] Doug Burger and Sanjay Mehta. Parallelizing Appbt for a Shared-Memory Multiprocessor. Technical Report 1286, Computer Sciences Department, University of Wisconsin-Madison, September 1995.
[9] Satish Chandra, James R. Larus, Anne Rogers, Where is time spent in message-passing and shared-memory programs?, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.61-73, October 05-07, 1994, San Jose, California, United States
[10] Derek Chiou, Boon Seong Ang, Robert Greiner, Arvind, James C. Hoe, Michael J. Beckerle, James E. Hicks, G. Andrew Boughton, START-NG: Delivering Seamless Parallel Computing, Proceedings of the First International Euro-Par Conference on Parallel Processing, p.101-116, August 29-31, 1995
[12] Fred Chong, Shamik Sharma, Eric Brewer, and Joel Saltz. Multiprocessor Runtime Support for Irregular DAGs. In R. Kalia and P. Vashishta, editors, Toward Teraflop Computing and New Grand Challenge Applications. Nova Science Publishers, Inc., 1995.
[13] A. Krishnamurthy, D. E. Culler, A. Dusseau, S. C. Goldstein, S. Lumetta, T. von Eicken, K. Yelick, Parallel programming in Split-C, Proceedings of the 1993 ACM/IEEE conference on Supercomputing, p.262-273, December 1993, Portland, Oregon, United States, doi:10.1145/169627.169724
[15] William J. Dally, Andrew Chien, Stuart Fiske, Waldemar Horwat, John Keen, Michael Larivee, Rich Nuth, Scott Wills, Paul Carrick, and Greg Flyer. The J-Machine: A Fine-Grain Concurrent Computer. In G. X. Ritter, editor, Proc. Information Processing 89. Elsevier North-Holland, Inc., 1989.
[16] Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, Ioannis Schoinas, Mark D. Hill, James R. Larus, Anne Rogers, David A. Wood, Application-specific protocols for user-level shared memory, Proceedings of the 1994 conference on Supercomputing, p.380-389, December 1994, Washington, D.C., United States
[18] PCI Special Interest Group. PCI Local Bus Specification, Revision 2.1, 1995.
[19] John Heinlein, Kourosh Gharachorloo, Scott Dresser, Anoop Gupta, Integration of message passing and shared memory in the Stanford FLASH multiprocessor, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.38-50, October 05-07, 1994, San Jose, California, United States
[20] Mark Heinrich, Jeffrey Kuskin, David Ofelt, John Heinlein, Joel Baxter, Jaswinder Pal Singh, Richard Simoni, Kourosh Gharachorloo, David Nakahira, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, John Hennessy, The performance impact of flexibility in the Stanford FLASH multiprocessor, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.274-285, October 05-07, 1994, San Jose, California, United States
[23] MIPS Technologies Inc. MIPS R10000 Microprocessor User's Manual, 1995.
[24] Sun Microsystems Inc. SPARC MBus Interface Specification, April 1991.
[27] John Kubiatowicz, David Chaiken, Anant Agarwal, Closing the window of vulnerability in multiphase memory transactions, Proceedings of the fifth international conference on Architectural support for programming languages and operating systems, p.274-284, October 12-15, 1992, Boston, Massachusetts, United States
[28] J. Kuskin, D. Ofelt, M. Heinrich, J. Heinlein, R. Simoni, K. Gharachorloo, J. Chapin, D. Nakahira, J. Baxter, M. Horowitz, A. Gupta, M. Rosenblum, J. Hennessy, The Stanford FLASH multiprocessor, Proceedings of the 21st annual international symposium on Computer architecture, p.302-313, April 18-21, 1994, Chicago, Illinois, United States
[29] Charles E. Leiserson, Zahi S. Abuhamdeh, David C. Douglas, Carl R. Feynman, Mahesh N. Ganmukhi, Jeffrey V. Hill, Daniel Hillis, Bradley C. Kuszmaul, Margaret A. St. Pierre, David S. Wells, Monica C. Wong, Shaw-Wen Yang, Robert Zak, The network architecture of the Connection Machine CM-5 (extended abstract), Proceedings of the fourth annual ACM symposium on Parallel algorithms and architectures, p.272-285, June 29-July 01, 1992, San Diego, California, United States, doi:10.1145/140901.141883
[30] Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta, John Hennessy, Mark Horowitz, Monica S. Lam, The Stanford Dash Multiprocessor, Computer, v.25 n.3, p.63-79, March 1992, doi:10.1109/2.121510
[31] Lok Tin Liu and David E. Culler. Evaluation of the Intel Paragon on Active Message Communication. In Proceedings of Intel Supercomputer Users Group Conference, June 1995.
[33] Meiko World Inc. Computing Surface 2: Overview Documentation Set, 1993.
[35] Shubhendu S. Mukherjee, Shamik D. Sharma, Mark D. Hill, James R. Larus, Anne Rogers, Joel Saltz, Efficient support for irregular applications on distributed-memory machines, Proceedings of the fifth ACM SIGPLAN symposium on Principles and practice of parallel programming, p.68-79, July 19-21, 1995, Santa Barbara, California, United States
[36] Robert W. Pfile. Typhoon-Zero Implementation: The Vortex Module. Technical report, Computer Sciences Department, University of Wisconsin-Madison, 1995.
[37] Steven K. Reinhardt. Tempest Interface Specification (Revision 1.2.1). Technical Report 1267, Computer Sciences Department, University of Wisconsin-Madison, February 1995.
[38] S. K. Reinhardt, J. R. Larus, D. A. Wood, Tempest and typhoon: user-level shared memory, Proceedings of the 21st annual international symposium on Computer architecture, p.325-336, April 18-21, 1994, Chicago, Illinois, United States
[39] Steven K. Reinhardt, Robert W. Pfile, and David Wood. Typhoon-0: Hardware Support for Distributed Shared Memory on a Network of Workstations. In Workshop on Scalable Shared-Memory Multiprocessors, 1995.
[40] Steven K. Reinhardt, Robert W. Pfile, David A. Wood, Decoupled hardware support for distributed shared memory, Proceedings of the 23rd annual international symposium on Computer architecture, p.34-43, May 22-24, 1996, Philadelphia, Pennsylvania, United States
[42] SPARC Technology Business. UltraSPARC-I User's Manual, Revision 1.0, September 1995.
[44] Thinking Machines Corporation. The Connection Machine CM-5 Technical Summary, 1991.
[45] Thorsten von Eicken, David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, Active messages: a mechanism for integrated communication and computation, Proceedings of the 19th annual international symposium on Computer architecture, p.256-266, May 19-21, 1992, Queensland, Australia
CITED BY 10
Michael Schlansker, Nagabhushan Chitlur, Erwin Oertli, Paul M. Stillwell, Jr., Linda Rankin, Dennis Bradford, Richard J. Carter, Jayaram Mudigonda, Nathan Binkert, Norman P. Jouppi, High-performance ethernet-based communications for future multi-core processors, Proceedings of the 2007 ACM/IEEE conference on Supercomputing, November 10-16, 2007, Reno, Nevada
Henry Wong, Anne Bracy, Ethan Schuchman, Tor M. Aamodt, Jamison D. Collins, Perry H. Wang, Gautham Chinya, Ankur Khandelwal Groen, Hong Jiang, Hong Wang, Pangaea: a tightly-coupled IA32 heterogeneous chip multiprocessor, Proceedings of the 17th international conference on Parallel architectures and compilation techniques, October 25-29, 2008, Toronto, Ontario, Canada