Scheduling dynamic parallelism on accelerators
Source
Conference On Computing Frontiers
Proceedings of the 6th ACM conference on Computing frontiers
Ischia, Italy
SESSION: Advanced architectures 2
Pages 161-170  
Year of Publication: 2009
ISBN:978-1-60558-413-3
Authors
Filip Blagojevic  Lawrence Berkeley National Lab, Berkeley, USA
Costin Iancu  Lawrence Berkeley National Lab, Berkeley, USA
Katherine Yelick  Lawrence Berkeley National Lab, Berkeley, USA
Matthew Curtis-Maury  NetApp, Raleigh, USA
Dimitrios S. Nikolopoulos  Virginia Tech, Blacksburg, USA
Benjamin Rose  Virginia Tech, Blacksburg, USA
Sponsors
ACM: Association for Computing Machinery
SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1531743.1531769

ABSTRACT

Resource management on accelerator-based systems is complicated by the disjoint nature of the main CPU and accelerator, which involves separate memory hierarchies, different degrees of parallelism, and a relatively high cost of communicating between them. For applications with irregular parallelism, where work is dynamically created based on other computations, the accelerators may both consume and produce work. To maintain load balance, the accelerators hand work back to the CPU to be scheduled. In this paper we consider multiple approaches for such scheduling problems and use the Cell BE system to demonstrate the different schedulers and the trade-offs between them. Our evaluation is done with both microbenchmarks and two bioinformatics applications (PBPI and RAxML). Our baseline approach uses a standard Linux scheduler on the CPU, possibly with more than one process per CPU. We then consider the addition of cooperative scheduling to the Linux kernel and a user-level work-stealing approach. The two cooperative approaches are able to decrease SPE idle time by 30% and 70%, respectively, relative to the baseline scheduler. In both cases we believe the changes required to application-level codes, e.g., a program written with MPI processes that use accelerator-based compute nodes, are reasonable, although the kernel-level approach provides more generality and ease of implementation, but often less performance than the work-stealing approach.
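
The abstract names user-level work stealing but gives no implementation details, so the following is only a generic sketch of the idea and not the authors' scheduler. It assumes a single-threaded, round-robin simulation with no Cell BE specifics. The names `run_work_stealing` and `spawn_children` and the toy workload are hypothetical. Each worker runs the newest task in its own queue; an idle worker steals the oldest task from the fullest queue; a finished task may create new tasks, which models dynamically created work.

```python
from collections import deque


def run_work_stealing(initial_tasks, num_workers, spawn):
    """Simulate work stealing in round-robin steps; return (executed, steals)."""
    queues = [deque() for _ in range(num_workers)]
    # All initial work starts on worker 0, so the other workers must steal.
    queues[0].extend(initial_tasks)
    executed = []
    steals = 0
    while any(queues):
        for wid in range(num_workers):
            own = queues[wid]
            if not own:
                # Idle worker: steal the oldest task from the fullest queue.
                victim = max(queues, key=len)
                if not victim:
                    continue  # no work anywhere at this step
                own.append(victim.popleft())
                steals += 1
            task = own.pop()  # the owner takes its newest task (LIFO)
            executed.append((wid, task))
            # Dynamic parallelism: a finished task may create new tasks.
            own.extend(spawn(task))
    return executed, steals


def spawn_children(task):
    """Hypothetical workload: a task of size n > 1 splits into two of size n // 2."""
    return [task // 2, task // 2] if task > 1 else []


if __name__ == "__main__":
    executed, steals = run_work_stealing([8], num_workers=4, spawn=spawn_children)
    print(len(executed), "tasks executed,", steals, "steals")
```

Stealing from the opposite end of the victim's queue is a common design choice: the thief takes older, typically larger tasks, while the owner keeps working on its most recently created ones.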

