|
ABSTRACT
Data access costs contribute significantly to the execution time of applications with complex data structures. As the latency of memory accesses becomes high relative to processor cycle times, application performance is increasingly limited by memory performance. In some situations it may be reasonable to trade increased computation costs for reduced memory costs. The contributions of this paper are three-fold: we provide a detailed analysis of the memory performance of a set of seven, memory-intensive benchmarks; we describe Computation Regrouping, a general, source-level approach to improving the overall performance of these applications by improving temporal locality to reduce cache and TLB miss ratios (and thus memory stall times); and we demonstrate significant performance improvements from applying Computation Regrouping to our suite of seven benchmarks. With Computation Regrouping, we observe an average speedup of 1.97, with individual speedups ranging from 1.26 to 3.03. Most of this improvement comes from eliminating memory stall time.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
A. W. Appel , J. R. Ellis , K. Li, Real-time concurrent collection on stock multiprocessors, Proceedings of the ACM SIGPLAN 1988 conference on Programming Language design and Implementation, p.11-20, June 20-24, 1988, Atlanta, Georgia, United States
|
| |
2
|
|
| |
3
|
|
| |
4
|
|
 |
5
|
David Callahan , Ken Kennedy , Allan Porterfield, Software prefetching, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.40-52, April 08-11, 1991, Santa Clara, California, United States
|
| |
6
|
|
 |
7
|
Steve Carr , Kathryn S. McKinley , Chau-Wen Tseng, Compiler optimizations for improving data locality, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.252-262, October 05-07, 1994, San Jose, California, United States
|
 |
8
|
Siddhartha Chatterjee , Alvin R. Lebeck , Praveen K. Patnala , Mithuna Thottethodi, Recursive array layouts and fast parallel matrix multiplication, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, p.222-231, June 27-30, 1999, Saint Malo, France
[doi> 10.1145/305619.305645]
|
 |
9
|
Trishul M. Chilimbi , Mark D. Hill , James R. Larus, Cache-conscious structure layout, Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, p.1-12, May 01-04, 1999, Atlanta, Georgia, United States
|
 |
10
|
A. Krishnamurthy , D. E. Culler , A. Dusseau , S. C. Goldstein , S. Lumetta , T. von Eicken , K. Yelick, Parallel programming in Split-C, Proceedings of the 1993 ACM/IEEE conference on Supercomputing, p.262-273, December 1993, Portland, Oregon, United States
[doi> 10.1145/169627.169724]
|
| |
11
|
|
| |
12
|
M. Frigo and S. Johnson. FFTW: An adaptive software architecture for the FFT. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 1381--1384, May 1998
|
| |
13
|
|
 |
14
|
Somnath Ghosh , Margaret Martonosi , Sharad Malik, Precise miss analysis for program transformations with caches of arbitrary associativity, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.228-239, October 02-07, 1998, San Jose, California, United States
|
 |
15
|
|
| |
16
|
|
| |
17
|
H. Han and C.-W. Tseng. Improving locality for adaptive irregular scientific codes. Technical Report CS-TR-4039, University of Maryland, College Park, September 1999
|
| |
18
|
M. Kandemir , A. Choudhary , J. Ramanujam , P. Banerjee, Improving locality using loop and data transformations in an integrated framework, Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture, p.285-297, November 1998, Dallas, Texas, United States
|
| |
19
|
M. Karlsson, F. Dahlgren, and P. Stenstrom. A Prefetching Technique for Irregular Accesses to Linked Data Structures. In Proceedings of the Sixth Annual Symposium on High Performance Computer Architecture, pages 206--217, January 2000
|
 |
20
|
Induprakas Kodukula , Nawaaz Ahmed , Keshav Pingali, Data-centric multi-level blocking, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.346-357, June 16-18, 1997, Las Vegas, Nevada, United States
|
 |
21
|
Induprakas Kodukula , Nawaaz Ahmed , Keshav Pingali, Data-centric multi-level blocking, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.346-357, June 16-18, 1997, Las Vegas, Nevada, United States
|
 |
22
|
Monica D. Lam , Edward E. Rothberg , Michael E. Wolf, The cache performance and optimizations of blocked algorithms, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.63-74, April 08-11, 1991, Santa Clara, California, United States
|
| |
23
|
|
| |
24
|
S. Leung and J. Zahorjan. Optimizing Data Locality by Array Restructuring. Technical Report UW-CSE-95-09-01, University of Washington Dept. of Computer Science and Engineering, September 1995
|
 |
25
|
|
| |
26
|
J. W. Manke and J. Wu. Data-Intensive System Benchmark Suite Analysis and Specification. Atlantic Aerospace Electronics Corp., June 1999
|
| |
27
|
V. Pingali. Memory performance of complex data structures: Characterization and optimization. Master's thesis, University of Utah, August 2001
|
| |
28
|
|
 |
29
|
|
| |
30
|
E. S. Roberts and M. T. Vandevoorde. WorkCrews: An Abstraction for Controlling Parallelism. Technical Report SRC-042, Digital Systems Research Center, April 1989
|
 |
31
|
|
| |
32
|
|
| |
33
|
Silicon Graphics Inc. SpeedShop User's Guide. 1996
|
| |
34
|
F. Somenzi. CUDD: CU Decision Diagram Package Release 2.3.1, 2001
|
 |
35
|
|
| |
36
|
|
| |
37
|
Marco Zagha , Brond Larson , Steve Turner , Marty Itzkowitz, Performance analysis using the MIPS R10000 performance counters, Proceedings of the 1996 ACM/IEEE conference on Supercomputing (CDROM), p.16-es, January 01-01, 1996, Pittsburgh, Pennsylvania, United States
[doi> 10.1145/369028.369059]
|
|