ABSTRACT
Multi-core architectures with alternative memory subsystem designs have recently emerged. Instead of hardware-managed cache hierarchies, they employ software-managed embedded memory. An open question is which programming and compilation methods can effectively exploit the performance potential of this new class of architectures. Using LU decomposition as a case study, we propose three techniques that, combined, achieve a 27-fold speedup on the IBM Cyclops-64 many-core architecture over the parallel LU implementation from the SPLASH-2 benchmark suite. First, we introduce a method for adaptive load distribution that maximizes load balance among cores; this is important for leveraging the potential of the other two methods. Second, we develop a method for register tiling that determines the optimal data tile parameters and maximizes data reuse within the constraints of the register file size. We show that this method is general and should be applicable well beyond Cyclops-64. Third, we present a register allocation method for register-tiled loop bodies. We evaluate its effect using hand-tuned Cyclops-64 assembly code and observe a 6-fold reduction in load/store operations. We achieve 19.17 and 27.50 GFlops with double-precision floating-point numbers for 700x700 and 1000x1000 matrices, respectively.
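To illustrate the idea of register tiling mentioned above, the following is a minimal sketch in C, not the authors' kernel. It applies the trailing-matrix update that dominates LU decomposition (C -= A * B) twice: once naively and once with a 2x2 register tile. The function names and the 2x2 tile size are assumptions chosen for brevity; the paper derives the tile parameters from the register file size, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Reference update: C (m x n) -= A (m x k) * B (k x n), all row-major.
 * As written, each multiply-subtract reads one element of A and one of B. */
void update_naive(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t p = 0; p < k; p++)
                C[i * n + j] -= A[i * k + p] * B[p * n + j];
}

/* Same update with a 2x2 register tile (illustrative tile size).
 * The four C values stay in local variables for the whole p loop, and
 * each loaded A or B value is reused twice, so one inner iteration
 * performs 4 loads for 4 multiply-subtracts. */
void update_tiled_2x2(size_t m, size_t n, size_t k,
                      const double *A, const double *B, double *C)
{
    /* Simplifying assumption: dimensions are multiples of the tile size. */
    assert(m % 2 == 0 && n % 2 == 0);

    for (size_t i = 0; i < m; i += 2) {
        for (size_t j = 0; j < n; j += 2) {
            /* Load the 2x2 tile of C once. */
            double c00 = C[i * n + j];
            double c01 = C[i * n + j + 1];
            double c10 = C[(i + 1) * n + j];
            double c11 = C[(i + 1) * n + j + 1];

            for (size_t p = 0; p < k; p++) {
                double a0 = A[i * k + p];
                double a1 = A[(i + 1) * k + p];
                double b0 = B[p * n + j];
                double b1 = B[p * n + j + 1];

                c00 -= a0 * b0;
                c01 -= a0 * b1;
                c10 -= a1 * b0;
                c11 -= a1 * b1;
            }

            /* Store the tile back once. */
            C[i * n + j]           = c00;
            C[i * n + j + 1]       = c01;
            C[(i + 1) * n + j]     = c10;
            C[(i + 1) * n + j + 1] = c11;
        }
    }
}
```

Larger tiles increase the reuse of each loaded value but need more registers, which is presumably why the tile parameters have to be chosen against the register file size.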