|
ABSTRACT
The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Andrew W. Appel and Zhong Shao. Empirical and analytic study of stack versus heap cost for languages with closures. Journal of Functional Programming, 6(1):47-74, 1996.
|
 |
2
|
|
| |
3
|
|
| |
4
|
Robert D. Blumofe , Christopher F. Joerg , Bradley C. Kuszmaul , Charles E. Leiserson , Keith H. Randall , Yuli Zhou, Cilk: an efficient multithreaded runtime system, Journal of Parallel and Distributed Computing, v.37 n.1, p.55-69, Aug. 25, 1996
[doi> 10.1006/jpdc.1996.0107]
|
| |
5
|
Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceed* ings of the 35th Annual Symposium on Foundations of Corn* puter Science (FOCS), pages 356-368, Santa Fe, New Mexico, November 1994.
|
 |
6
|
|
 |
7
|
Guang-Ien Cheng , Mingdong Feng , Charles E. Leiserson , Keith H. Randall , Andrew F. Stark, Detecting data races in Cilk programs that use locks, Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, p.298-309, June 28-July 02, 1998, Puerto Vallarta, Mexico
[doi> 10.1145/277651.277696]
|
| |
8
|
Cilk-5.1 (Beta 1) Reference Manual. Available on the Internet from http://theory.lcs .mit .edu/'cilk.
|
| |
9
|
|
 |
10
|
David E. Culler , Anurag Sah , Klaus E. Schauser , Thorsten von Eicken , John Wawrzynek, Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine, Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, p.164-175, April 08-11, 1991, Santa Clara, California, United States
|
 |
11
|
|
 |
12
|
|
 |
13
|
|
| |
14
|
|
| |
15
|
R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416-429, March 1969.
|
| |
16
|
Dirk Grunwald. Heaps o' stacks: Time and space efficient threads without operating system support. Technical Report CU-CS-750-94, University of Colorado, November 1994.
|
 |
17
|
|
| |
18
|
|
 |
19
|
|
| |
20
|
Leslie Lamport. How to make a multiprocessor computer that correctly executes multiprocess programs. IEEE 7~nsactions on Computers, C-28(9):690-691, September 1979.
|
| |
21
|
Phillip Lisiecki and Alberto Medina. Personal communication.
|
| |
22
|
|
| |
23
|
Robert C. Miller. A type-checking preprocessor for Cilk 2, a multithreaded C language. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, May 1995.
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
|
 |
28
|
|
CITED BY 71
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Guang-Ien Cheng , Mingdong Feng , Charles E. Leiserson , Keith H. Randall , Andrew F. Stark, Detecting data races in Cilk programs that use locks, Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, p.298-309, June 28-July 02, 1998, Puerto Vallarta, Mexico
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Kunal Agrawal , Yuxiong He , Wen Jing Hsu , Charles E. Leiserson, Adaptive scheduling with parallelism feedback, Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, March 29-31, 2006, New York, New York, USA
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Christopher J. Hughes , Radek Grzeszczuk , Eftychios Sifakis , Daehyun Kim , Sanjeev Kumar , Andrew P. Selle , Jatin Chhugani , Matthew Holliman , Yen-Kuang Chen, Physical simulation for animation and visual effects: parallelization and characterization for chip multiprocessors, ACM SIGARCH Computer Architecture News, v.35 n.2, May 2007
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Vincent Danjean , Roland Gillard , Serge Guelton , Jean-Louis Roch , Thomas Roche, Adaptive loops with kaapi on multicore and grid: applications in symmetric cryptography, Proceedings of the 2007 international workshop on Parallel symbolic computation, July 27-28, 2007, London, Ontario, Canada
|
|
|
|
|
|
|
|
|
|
|
|
Pieter Bellens , Josep M. Perez , Felipe Cabarcas , Alex Ramirez , Rosa M. Badia , Jesus Labarta, CellSs: Scheduling techniques to better exploit memory hierarchy, Scientific Programming, v.17 n.1-2, p.77-95, January 2009
|
|
|
Milind Kulkarni , Martin Burtscher , Rajeshkar Inkulu , Keshav Pingali , Calin Casçaval, How much parallelism is there in irregular applications?, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
|
|
|
Tasuku Hiraishi , Masahiro Yasugi , Seiji Umatani , Taiichi Yuasa, Backtracking-based load balancing, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
|
|
|
|
|
|
Abdallah Deeb I. Al Zain , Kevin Hammond , Jost Berthold , Phil Trinder , Greg Michaelson , Mustafa Aswad, Low-pain, high-gain multicore programming in Haskell: coordinating irregular symbolic computations on multicore architectures, Proceedings of the 4th workshop on Declarative aspects of multicore programming, January 20-20, 2009, Savannah, GA, USA
|
|
|
|
|
|
|
|
|
Aydin Buluç , Jeremy T. Fineman , Matteo Frigo , John R. Gilbert , Charles E. Leiserson, Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, August 11-13, 2009, Calgary, AB, Canada
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Daniel Spoonhower , Guy E. Blelloch , Phillip B. Gibbons , Robert Harper, Beyond nested parallelism: tight bounds on work-stealing overheads for parallel futures, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, August 11-13, 2009, Calgary, AB, Canada
|
|
|
Jason Ansel , Cy Chan , Yee Lok Wong , Marek Olszewski , Qin Zhao , Alan Edelman , Saman Amarasinghe, PetaBricks: a language and compiler for algorithmic choice, ACM SIGPLAN Notices, v.44 n.6, June 2009
|
|
|
Matteo Frigo , Pablo Halpern , Charles E. Leiserson , Stephen Lewin-Berlin, Reducers and other Cilk++ hyperobjects, Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures, August 11-13, 2009, Calgary, AB, Canada
|
|