|
ABSTRACT
This paper presents a multicore-cache model that reflects the reality that multicore processors have both per-processor private (L1) caches and a large shared (L2) cache on chip. We consider a broad class of parallel divide-and-conquer algorithms and present a new on-line scheduler, CONTROLLED-PDF, that is competitive with the standard sequential scheduler in the following sense. Given any dynamically unfolding computation DAG from this class of algorithms, the cache complexity on the multicore-cache model under our new scheduler is within a constant factor of the sequential cache complexity for both L1 and L2, while the time complexity is within a constant factor of the sequential time complexity divided by the number of processors p. These are the first such asymptotically-optimal results for any multicore model. Finally, we show that a separator-based algorithm for sparse-matrix-dense-vector-multiply achieves provably good cache performance in the multicore-cache model, as well as in the well-studied sequential cache-oblivious model.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
www.sun.com/processors/UltraSPARC-T1/, 2007.
|
| |
2
|
www.tilera.com, 2007.
|
| |
3
|
Intel shows off 80-core processor. www.news.com/2100-1006_3-6158181.html, 2007.
|
| |
4
|
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3), 2002. Springer.
|
 |
5
|
A. Aggarwal , B. Alpern , A. Chandra , M. Snir, A model for hierarchical memory, Proceedings of the nineteenth annual ACM symposium on Theory of computing, p.305-314, January 1987, New York, New York, United States
[doi> 10.1145/28395.28428]
|
 |
6
|
|
| |
7
|
|
| |
8
|
B. Alpern, L. Carter, E. Feig, and T. Selker. The uniform memory hierachy model of computation. Algorth-mica, 12(2/3), 1994. Springer.
|
 |
9
|
Luiz André Barroso , Kourosh Gharachorloo , Robert McNamara , Andreas Nowatzyk , Shaz Qadeer , Barton Sano , Scott Smith , Robert Stets , Ben Verghese, Piranha: a scalable architecture based on single-chip multiprocessing, Proceedings of the 27th annual international symposium on Computer architecture, p.282-293, June 2000, Vancouver, British Columbia, Canada
|
 |
10
|
Michael A. Bender , Gerth Stølting Brodal , Rolf Fagerberg , Riko Jacob , Elias Vicari, Optimal sparse matrix dense vector multiplication in the I/O-model, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
[doi> 10.1145/1248377.1248391]
|
 |
11
|
Michael A. Bender , Jeremy T. Fineman , Seth Gilbert , Bradley C. Kuszmaul, Concurrent cache-oblivious b-trees, Proceedings of the seventeenth annual ACM symposium on Parallelism in algorithms and architectures, July 18-20, 2005, Las Vegas, Nevada, USA
[doi> 10.1145/1073970.1074009]
|
 |
12
|
|
 |
13
|
|
 |
14
|
Guy E. Blelloch , Phillip B. Gibbons , Girija J. Narlikar , Yossi Matias, Space-efficient scheduling of parallelism with synchronization variables, Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, p.12-23, June 23-25, 1997, Newport, Rhode Island, United States
[doi> 10.1145/258492.258494]
|
 |
15
|
Robert D. Blumofe , Matteo Frigo , Christopher F. Joerg , Charles E. Leiserson , Keith H. Randall, An analysis of dag-consistent distributed shared-memory algorithms, Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures, p.297-308, June 24-26, 1996, Padua, Italy
[doi> 10.1145/237502.237574]
|
 |
16
|
|
 |
17
|
Ernie Chan , Enrique S. Quintana-Orti , Gregorio Quintana-Orti , Robert van de Geijn, Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
[doi> 10.1145/1248377.1248397]
|
 |
18
|
Shimin Chen , Phillip B. Gibbons , Michael Kozuch , Vasileios Liaskovitis , Anastassia Ailamaki , Guy E. Blelloch , Babak Falsafi , Limor Fix , Nikos Hardavellas , Todd C. Mowry , Chris Wilkerson, Scheduling threads for constructive cache sharing on CMPs, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
[doi> 10.1145/1248377.1248396]
|
 |
19
|
|
| |
20
|
Alexandra Fedorova , Margo Seltzer , Christoper Small , Daniel Nussbaum, Performance of multithreaded chip multiprocessors and implications for operating system design, Proceedings of the annual conference on USENIX Annual Technical Conference, p.26-26, April 10-15, 2005, Anaheim, CA
|
| |
21
|
|
 |
22
|
|
| |
23
|
M. T. Goodrich, M. Nelson, and N. Sitchinava. Sorting in parallel external-memory multicores. Technical report, U.C. Irvine, 2007.
|
| |
24
|
Lance Hammond , Benedict A. Hubbert , Michael Siu , Manohar K. Prabhu , Michael Chen , Kunle Olukotun, The Stanford Hydra CMP, IEEE Micro, v.20 n.2, p.71-84, March 2000
[doi> 10.1109/40.848474]
|
| |
25
|
|
| |
26
|
R. J. Lipton and R. E. Tarjan. A separator theorem for planar graphs. SIAM Journal on Applied Mathematics, 36(2), 1979.
|
| |
27
|
G. J. Narlikar. Scheduling threads for low space requirement and good locality. Theory of Computing Systems, 35(2), 2002. Springer.
|
 |
28
|
Basem A. Nayfeh , Lance Hammond , Kunle Olukotun, Evaluation of design alternatives for a multiprocessor microprocessor, Proceedings of the 23rd annual international symposium on Computer architecture, p.67-77, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
| |
29
|
V. Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4), 1969. Springer.
|
 |
30
|
|
|