Outer-loop vectorization: revisited for short SIMD architectures
Source
Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques (PACT)
Toronto, Ontario, Canada
SESSION: Compilation
Pages 2-11  
Year of Publication: 2008
ISBN:978-1-60558-282-5
Authors
Dorit Nuzman, IBM Haifa Research Lab, Haifa, Israel
Ayal Zaks, IBM Haifa Research Lab, Haifa, Israel
Sponsors
ACM: Association for Computing Machinery
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1454115.1454119

ABSTRACT

Vectorization has been an important method of using data-level parallelism to accelerate scientific workloads on vector machines such as Cray for the past three decades. In the last decade it has also proven useful for accelerating multimedia and embedded applications on short SIMD architectures such as MMX, SSE and AltiVec. Most of the focus has been directed at innermost loops, executing their iterations concurrently as much as possible. Outer-loop vectorization refers to vectorizing a level of a loop nest other than the innermost, which can be beneficial if the outer loop exhibits greater data-level parallelism and locality than the innermost loop. Outer-loop vectorization has traditionally been performed by interchanging an outer loop with the innermost loop and then vectorizing it at the innermost position. A more direct unroll-and-jam approach can vectorize an outer loop without loop interchange, which can be especially suitable for short SIMD architectures.
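
To make the unroll-and-jam idea concrete, the following is a minimal sketch in plain C, not the code from the paper or from GCC. It assumes a FIR-style loop nest, a vectorization factor of 4, and a trip count that is a multiple of 4; the names `fir_scalar`, `fir_outer`, and `VF` are illustrative. The array `acc` stands in for one SIMD register, and each loop over `lane` stands in for a single vector instruction.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* assumed vectorization factor: 4 lanes per SIMD register */

/* Scalar reference: y[i] = sum over j of x[i + j] * c[j].
 * x must hold at least n + k - 1 elements. */
void fir_scalar(const int *x, const int *c, int *y, size_t n, size_t k)
{
    for (size_t i = 0; i < n; i++) {
        int sum = 0;
        for (size_t j = 0; j < k; j++)
            sum += x[i + j] * c[j];
        y[i] = sum;
    }
}

/* Outer-loop vectorization by unroll-and-jam: the outer loop over i is
 * unrolled VF times and the VF copies of the inner loop are fused, so each
 * inner iteration updates VF independent accumulators. The loop order is
 * unchanged (no interchange), and no cross-lane reduction is needed. */
void fir_outer(const int *x, const int *c, int *y, size_t n, size_t k)
{
    assert(n % VF == 0);  /* assumption: no remainder (epilogue) loop in this sketch */
    for (size_t i = 0; i < n; i += VF) {
        int acc[VF] = {0};                        /* one "vector register" */
        for (size_t j = 0; j < k; j++) {
            /* one vector multiply-add: lanes hold outer iterations i .. i+VF-1 */
            for (size_t lane = 0; lane < VF; lane++)
                acc[lane] += x[i + lane + j] * c[j];
        }
        for (size_t lane = 0; lane < VF; lane++)  /* one vector store */
            y[i + lane] = acc[lane];
    }
}
```

Under these assumptions, each lane accumulates its own output element, whereas vectorizing the inner loop over `j` would require a reduction across lanes at the end of every outer iteration.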

In this paper we revisit the method of outer-loop vectorization, paying special attention to properties of modern short SIMD architectures. We show that even though current optimizing compilers for such targets do not apply outer-loop vectorization in general, it can provide significant performance improvements over innermost-loop vectorization. Our implementation of direct outer-loop vectorization, available in GCC 4.3, achieves average speedup factors of 3.13 and 2.77 across a set of benchmarks, compared to 1.53 and 1.39 achieved by innermost-loop vectorization, when running on Cell BE SPU and PowerPC 970 processors, respectively. Moreover, outer-loop vectorization provides new reuse opportunities that can be vital for such short SIMD architectures, including efficient handling of alignment. We present an optimization tapping such opportunities, capable of further boosting the performance obtained by outer-loop vectorization to achieve average speedup factors of 5.26 and 3.64.
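
The abstract does not spell out the reuse optimization, so the following is only one plausible illustration of the kind of reuse an outer-loop-vectorized FIR nest exposes, not the authors' method. It assumes a vectorization factor of 4, that `n` and `k` are multiples of 4, and that `x` holds `n + k` elements; `fir_outer_reuse`, `lo`, and `hi` are hypothetical names. The point of the sketch is that consecutive inner iterations read overlapping windows of `x`, so the misaligned windows can be assembled from two aligned vectors that stay in registers, with a single new aligned load every `VF` taps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { VF = 4 };  /* assumed vectorization factor */

/* y[i] = sum over j of x[i + j] * c[j], vectorized along i.
 * x must hold at least n + k elements in this sketch. */
void fir_outer_reuse(const int *x, const int *c, int *y, size_t n, size_t k)
{
    assert(n % VF == 0 && k % VF == 0);  /* assumption: no remainder loops */
    for (size_t i = 0; i < n; i += VF) {
        int acc[VF] = {0};
        int lo[VF], hi[VF];
        memcpy(lo, &x[i], sizeof lo);                /* aligned load, hoisted */
        for (size_t j0 = 0; j0 < k; j0 += VF) {
            memcpy(hi, &x[i + j0 + VF], sizeof hi);  /* the only new load per VF taps */
            for (size_t s = 0; s < VF; s++) {
                /* window x[i+j0+s .. i+j0+s+VF-1], built from lo and hi
                 * (a shuffle or permute on real SIMD hardware) */
                for (size_t lane = 0; lane < VF; lane++) {
                    size_t p = lane + s;
                    int v = (p < VF) ? lo[p] : hi[p - VF];
                    acc[lane] += v * c[j0 + s];
                }
            }
            memcpy(lo, hi, sizeof lo);               /* reuse hi as the next lo */
        }
        for (size_t lane = 0; lane < VF; lane++)
            y[i + lane] = acc[lane];
    }
}
```

In this sketch every load from `x` starts at an index that is a multiple of `VF`, so if the base of `x` is aligned, no misaligned vector loads are needed.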

