ABSTRACT
Automatic vectorization of programs for partitioned-ALU SIMD (Single Instruction Multiple Data) processors has been difficult not only because of data dependency issues but also because of non-aligned and irregular data accesses. A non-aligned or irregular data access incurs many overhead cycles for data alignment, which makes efficient code generation difficult and hinders automatic vectorization. In this paper, we employ two special memory access hardware units to improve the performance of SIMD processors: the split line buffer and the packing buffer. The former solves the non-aligned memory access problem, while the latter simplifies irregular and strided data access. Adding these hardware units requires only very small changes to the instruction set architecture, yet it contributes to a significant performance improvement by allowing more loops to be vectorized and by reducing overhead cycles. We have also developed an auto-vectorizing compiler that utilizes these hardware units. Experiments comparing the proposed method with the conventional one show a 50% increase in the number of vectorized loops and a 77% increase in the overall performance of an MPEG2 encoder program.
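To make the alignment overhead concrete, the following minimal sketch models in software what a SIMD processor without such hardware typically has to do. It is an illustration under stated assumptions, not the paper's design: the 8-byte word width `VEC_BYTES` and the helper names `load_aligned`, `load_unaligned`, and `gather_strided` are hypothetical. A non-aligned load is emulated with two aligned loads and a merge, and a strided access is emulated with one element access per lane; these are the kinds of extra steps that the split line buffer and the packing buffer are meant to hide.

```python
VEC_BYTES = 8  # assumed SIMD word width in bytes (illustrative, not from the paper)


def load_aligned(mem, off):
    """Model of an aligned SIMD load: off must be a multiple of VEC_BYTES."""
    assert off % VEC_BYTES == 0, "aligned load requires an aligned offset"
    return list(mem[off:off + VEC_BYTES])


def load_unaligned(mem, off):
    """Emulate a non-aligned load with two aligned loads plus a merge step."""
    shift = off % VEC_BYTES          # how far off is from the alignment boundary
    base = off - shift               # aligned address at or below off
    lo = load_aligned(mem, base)
    if shift == 0:
        return lo                    # already aligned: one load is enough
    hi = load_aligned(mem, base + VEC_BYTES)
    # Keep the tail of the first word and the head of the second word.
    return lo[shift:] + hi[:shift]


def gather_strided(mem, off, stride):
    """Emulate a strided vector load: one scalar element access per lane."""
    return [mem[off + i * stride] for i in range(VEC_BYTES)]


mem = list(range(32))                # toy memory where each byte holds its own address
print(load_unaligned(mem, 3))        # bytes 3..10, spanning two aligned words
print(gather_strided(mem, 1, 3))     # every third byte starting at address 1
```

In this model the non-aligned case costs two loads and a merge instead of one load, and the strided case costs one access per lane, which is the overhead the abstract attributes to data alignment.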