|
ABSTRACT
We compare the number of cache misses M1 for running a computation on a single processor with cache size C1 to the total number of misses Mp for the same computation when using p processors or threads and a shared cache of size Cp. We show that for any computation, and with an appropriate (greedy) parallel schedule, if Cp ≥ C1 + pd then Mp ≤ M1. The depth d of the computation is the length of the critical path of dependences. This gives the perhaps surprising result that for sufficiently parallel computations the shared cache need only be an additive size larger than the single-processor cache, and gives some theoretical justification for designing machines with shared caches.We model a computation as a DAG and the sequential execution as a depth first schedule of the DAG. The parallel schedule we study is a parallel depth-first schedule (PDF schedule) based on the sequential one. The schedule is greedy and therefore work-efficient. Our main results assume the Ideal Cache model, but we also present results for other more realistic cache models.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
U. A. Acar, G. E. Blelloch, and R. D. Blumofe. The data locality of work stealing. Theory of Computing Systems, 35(3):321--347, 2002.
|
 |
2
|
|
 |
3
|
Lars Arge , Michael A. Bender , Erik D. Demaine , Bryan Holland-Minkley , J. Ian Munro, Cache-oblivious priority queue and graph algorithm applications, Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, May 19-21, 2002, Montreal, Quebec, Canada
[doi> 10.1145/509907.509950]
|
| |
4
|
L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese.
|
 |
5
|
Luiz André Barroso , Kourosh Gharachorloo , Robert McNamara , Andreas Nowatzyk , Shaz Qadeer , Barton Sano , Scott Smith , Robert Stets , Ben Verghese, Piranha: a scalable architecture based on single-chip multiprocessing, Proceedings of the 27th annual international symposium on Computer architecture, p.282-293, June 2000, Vancouver, British Columbia, Canada
|
| |
6
|
|
| |
7
|
L. A. Belady. A study of replacment algorithms for virtual storage computers. IBM Systems Journal, 5(2):78--101, 1966.
|
 |
8
|
|
 |
9
|
Guy E. Blelloch , Phillip B. Gibbons , Girija J. Narlikar , Yossi Matias, Space-efficient scheduling of parallelism with synchronization variables, Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, p.12-23, June 23-25, 1997, Newport, Rhode Island, United States
[doi> 10.1145/258492.258494]
|
 |
10
|
Robert D. Blumofe , Matteo Frigo , Christopher F. Joerg , Charles E. Leiserson , Keith H. Randall, An analysis of dag-consistent distributed shared-memory algorithms, Proceedings of the eighth annual ACM symposium on Parallel algorithms and architectures, p.297-308, June 24-26, 1996, Padua, Italy
[doi> 10.1145/237502.237574]
|
 |
11
|
|
| |
12
|
Y.-Y. Chen, J.-K. Peir, and C.-T. King. Performance of shared caches on multithreaded architectures. Journal of Information Science and Engineering, 14(2):499--514, 1998.
|
 |
13
|
|
| |
14
|
|
| |
15
|
Lance Hammond , Benedict A. Hubbert , Michael Siu , Manohar K. Prabhu , Michael Chen , Kunle Olukotun, The Stanford Hydra CMP, IEEE Micro, v.20 n.2, p.71-84, March 2000
[doi> 10.1109/40.848474]
|
| |
16
|
|
| |
17
|
|
 |
18
|
|
| |
19
|
R. Kalla, B. Sinharoy, and J. Tendler. Simultaneous multi-threading implementation in POWER5. In 15th IEEE Hot Chips, Aug. 2003.
|
| |
20
|
D. T. Marr, F. Binns, D. L. Hill, G. Hinton, D. A. Koufaty, J. A. Miller, and M. Upton. Hyper-threading technology architecture and microarchitecture, white paper. Intel Technical Journal, 6(1), Feb. 2002.
|
 |
21
|
|
| |
22
|
G. J. Narlikar. Scheduling threads for low space requirement and good locality. Theory of Computing Systems, 35(2):151--187, 2002.
|
 |
23
|
|
 |
24
|
|
 |
25
|
|
 |
26
|
|
| |
27
|
|
| |
28
|
J. M. Tendler, S. Dodson, S. Fields, H. Le, and B. Sinharoy. Power4 system microarchitecture, technical white paper. Technical Report 20, IBM Server Group, Oct. 2001.
|
 |
29
|
|
| |
30
|
|
| |
31
|
|
 |
32
|
|
CITED BY 9
|
|
|
|
|
Vasileios Liaskovitis , Shimin Chen , Phillip B. Gibbons , Anastassia Ailamaki , Guy E. Blelloch , Babak Falsafi , Limor Fix , Nikos Hardavellas , Michael Kozuch , Todd C. Mowry , Chris Wilkerson, Parallel depth first vs. work stealing schedulers on CMP architectures, Proceedings of the eighteenth annual ACM symposium on Parallelism in algorithms and architectures, July 30-August 02, 2006, Cambridge, Massachusetts, USA
|
|
|
Guy E. Blelloch , Rezaul A. Chowdhury , Phillip B. Gibbons , Vijaya Ramachandran , Shimin Chen , Michael Kozuch, Provably good multicore cache performance for divide-and-conquer algorithms, Proceedings of the nineteenth annual ACM-SIAM symposium on Discrete algorithms, p.501-510, January 20-22, 2008, San Francisco, California
|
|
|
Shimin Chen , Phillip B. Gibbons , Michael Kozuch , Vasileios Liaskovitis , Anastassia Ailamaki , Guy E. Blelloch , Babak Falsafi , Limor Fix , Nikos Hardavellas , Todd C. Mowry , Chris Wilkerson, Scheduling threads for constructive cache sharing on CMPs, Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, June 09-11, 2007, San Diego, California, USA
|
|
|
|
|
|
|
|
|
Layali Rashid , Wessam M. Hassanein , Moustafa A. Hammad, Exploiting multithreaded architectures to improve the hash join operation, Proceedings of the 9th workshop on MEmory performance: DEaling with Applications, systems and architecture, p.46-53, October 26-26, 2008, Toronto, Canada
|
|
|
|
|
|
|
|