|
ABSTRACT
The uniform representation of 2-dimensional arrays serially in Morton order (or {\eee} order) supports both their iterative scan with cartesian indices and their divide-and-conquer manipulation as quaternary trees. This data structure is important because it relaxes serious problems of locality and latency, and the tree helps to schedule multi-processing. Results here show how it facilitates algorithms that avoid cache misses and page faults at all levels in hierarchical memory, independently of a specific runtime environment.We have built a rudimentary C-to-C translator that implements matrices in Morton-order from source that presumes a row-major implementation. Early performance from LAPACK's reference implementation of \texttt{dgesv} (linear solver), and all its supporting routines (including \texttt{dgemm} matrix-multiplication) form a successful research demonstration. Its performance predicts improvements from new algebra in back-end optimizers.We also present results from a more stylish \texttt{dgemm} algorithm that takes better advantage of this representation. With only routine back-end optimizations inserted by hand (unfolding the base case and passing arguments in registers), we achieve machine performance exceeding that of the manufacturer-crafted {\tt dgemm} running at 67% of peak flops. And the same code performs similarly on several machines.Together, these results show how existing codes and future block-recursive algorithms can work well together on this matrix representation. Locality is key to future performance, and the new representation has a remarkable impact.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
F. E. Allen, J. Cocke, and K. Kennedy. Reduction of operator strength. In S. W. Muchnick and N. D. Jones, editors, Program Flow Analysis: Theory and Applications, chapter 3.2, pages 79-101. Prentice-Hall, Englewood Cliffs, NJ, 1981.
|
 |
3
|
|
 |
4
|
Jeff Bilmes , Krste Asanovic , Chee-Whye Chin , Jim Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, Proceedings of the 11th international conference on Supercomputing, p.340-347, July 07-11, 1997, Vienna, Austria
[doi> 10.1145/263580.263662]
|
 |
5
|
|
| |
6
|
F. W. Burton, V. J. Kollias, and J. G. Kollias. Real-time raster-to-quadtree and quadtree-to-raster conversion algorithms with modest storage requirements. Angew. Informatik, 4:170-174, 1986.
|
 |
7
|
Siddhartha Chatterjee , Vibhor V. Jain , Alvin R. Lebeck , Shyam Mundhra , Mithuna Thottethodi, Nonlinear array layouts for hierarchical memory systems, Proceedings of the 13th international conference on Supercomputing, p.444-453, June 20-25, 1999, Rhodes, Greece
[doi> 10.1145/305138.305231]
|
 |
8
|
Siddhartha Chatterjee , Alvin R. Lebeck , Praveen K. Patnala , Mithuna Thottethodi, Recursive array layouts and fast parallel matrix multiplication, Proceedings of the eleventh annual ACM symposium on Parallel algorithms and architectures, p.222-231, June 27-30, 1999, Saint Malo, France
[doi> 10.1145/305619.305645]
|
| |
9
|
H. G. Cragon. A historical note on binary tree. SIGARCH Comput. Archit. News, 18(4):3, Dec. 1990.
|
 |
10
|
David E. Culler , Richard M. Karp , David Patterson , Abhijit Sahay , Eunice E. Santos , Klaus Erik Schauser , Ramesh Subramonian , Thorsten von Eicken, LogP: a practical model of parallel computation, Communications of the ACM, v.39 n.11, p.78-85, Nov. 1996
[doi> 10.1145/240455.240477]
|
| |
11
|
|
 |
12
|
|
| |
13
|
E. Elmroth and F. Gustavson. Applying recursion to serial and parallel QR factorization leads to better performance. IBM J. Res. Develop., 44(4):605-624, July 2000. http://www.research.ibm.com/journal/rd/444/elmroth.html
|
 |
14
|
|
| |
15
|
|
 |
16
|
|
| |
17
|
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
| |
21
|
E.-J. Im and K. Yelick. Optimizing sparse matrix vector multimplication on SMPs. In 9th SIAM Conf. on Parallel Processing for Scientific Computing, volume 98 of Proc. in Applied Mathematics, Mar. 1999. http://www.siam.org/catalog/mcc07/pr98.htm
|
| |
22
|
|
 |
23
|
Induprakas Kodukula , Nawaaz Ahmed , Keshav Pingali, Data-centric multi-level blocking, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.346-357, June 16-18, 1997, Las Vegas, Nevada, United States
|
 |
24
|
Alvin R. Lebeck , Xiaobo Fan , Heng Zeng , Carla Ellis, Power aware page allocation, Proceedings of the ninth international conference on Architectural support for programming languages and operating systems, p.105-116, November 2000, Cambridge, Massachusetts, United States
|
 |
25
|
|
| |
26
|
G. M. Morton. A computer oriented geodetic data base and a new technique in file sequencing. Technical report, IBM Ltd., Ottawa, Ontario, Mar. 1, 1966.
|
 |
27
|
|
 |
28
|
|
| |
29
|
G. Peano. Sur une courbe, qui remplit toute une aire plaine. Math. Ann., 36:157-160, 1890.
|
| |
30
|
|
| |
31
|
|
| |
32
|
K. D. Tocher. The application of automatic computers to sampling experiments. J. Roy. Statist. Soc. Ser. B, 16(1):39-61, 1954. See pp. 53-55.
|
 |
33
|
|
| |
34
|
|
| |
35
|
|
| |
36
|
D. S. Wise and J. D. Frens. Morton-order matrices deserve compilers' support. Technical Report 533, Computer Science Dept., Indiana University, Nov. 1999. http://www.cs.indiana.edu/ftp/techreports/TR533.html
|
 |
37
|
Qing Yi , Vikram Adve , Ken Kennedy, Transforming loops to recursion for multi-level memory hierarchies, Proceedings of the ACM SIGPLAN 2000 conference on Programming language design and implementation, p.169-181, June 18-21, 2000, Vancouver, British Columbia, Canada
|
CITED BY 10
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Gregorio Quintana-Ortí , Francisco D. Igual , Enrique S. Quintana-Ortí , Robert A. van de Geijn, Solving dense linear systems on platforms with multiple hardware accelerators, Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, February 14-18, 2009, Raleigh, NC, USA
|
|
|
|
|