ACM Home Page
Please provide us with feedback. Feedback
An experimental comparison of cache-oblivious and cache-conscious programs
Full text PdfPdf (464 KB)
Source ACM Symposium on Parallel Algorithms and Architectures archive
Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures table of contents
San Diego, California, USA
SESSION: Cache-oblivious/cache-aware algorithms table of contents
Pages: 93 - 104  
Year of Publication: 2007
ISBN:978-1-59593-667-7
Authors
Kamen Yotov  Cornell University
Tom Roeder  Cornell University
Keshav Pingali  Cornell University
John Gunnels  IBM T. J. Watson Research Center
Fred Gustavson  IBM T. J. Watson Research Center
Sponsors
ACM: Association for Computing Machinery
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 16,   Downloads (12 Months): 144,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1248377.1248394
What is a DOI?

ABSTRACT

Cache-oblivious algorithms have been advanced as a way of circumventing some of the difficulties of optimizing applications to take advantage of the memory hierarchy of modern microprocessors. These algorithms are based on the divide-and-conquer paradigm -- each division step creates sub-problems of smaller size, and when the working set of a sub-problem fits in some level of the memory hierarchy, the computations in that sub-problem can be executed without suffering capacity misses at that level. In this way, divide-and-conquer algorithms adapt automatically to all levels of the memory hierarchy; in fact, for problems like matrix multiplication, matrix transpose, and FFT, these recursive algorithms are optimal to within constant factors for some theoretical models of the memory hierarchy.

An important question is the following: how well do carefully tuned cache-oblivious programs perform compared to carefully tuned cache-conscious programs for the same problem? Is there a price for obliviousness, and if so, how much performance do we lose? Somewhat surprisingly, there are few studies in the literature that have addressed this question.

This paper reports the results of such a study in the domain of dense linear algebra. Our main finding is that in this domain, even highly optimized cache-oblivious programs perform significantly worse than corresponding cacheconscious programs. We provide insights into why this is so, and suggest research directions for making cache-oblivious algorithms more competitive.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Basic Linear Algebra Routines (BLAS). http://www.netlib.org/blas.
 
2
 
3
 
4
L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78--101, 1966.
 
5
 
6
Gianfranco Bilardi, 2005. Personal communication.
 
7
8
9
10
11
12
 
13
 
14
J. J. Dongarra, F. Gustavson, and A. Karp. Implementing linear algebra algorithms for dense matrices on a vector pipeline machine. SIAM Review, 26(1):91--112, 1984.
 
15
Matteo Frigo, 2005. Personal communication.
 
16
17
 
18
Jia Guo, María Jesús Garzarán, and David Padua. The power of Belady's algorithm in register allocation for long basic blocks. In Proc. 16th International Workshop in Languages and Parallel Computing, pages 374--390, 2003.
 
19
20
 
21
Piyush Kumar. Cache-oblivious algorithms. In Lecture Notes in Computer Science 2625. Springer-Verlag, 1998.
22
 
23
 
24
 
25
 
26
Clint Whaley. personal communication, 2005.
 
27
R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.
 
28
 
29
Kamen Yotov, Xiaoming Li, Gang Ren, Maria Garzaran, David Padua, Keshav Pingali, and Paul Stodghill. Is search really necessary to generate high-performance BLAS? Proceedings of the IEEE, 93(2), 2005.


Collaborative Colleagues:
Kamen Yotov: colleagues
Tom Roeder: colleagues
Keshav Pingali: colleagues
John Gunnels: colleagues
Fred Gustavson: colleagues