ABSTRACT
Most implementations of the Single Instruction Multiple Data (SIMD) model available today require that data elements be packed in vector registers. Operations on disjoint vector elements are not supported directly and require explicit data reorganization. Computations on non-contiguous, and especially interleaved, data appear in important applications, which can benefit greatly from SIMD instructions once the data is reorganized properly. Vectorizing such computations efficiently is therefore an ambitious challenge for both programmers and vectorizing compilers. We demonstrate an automatic compilation scheme that supports effective vectorization in the presence of interleaved data with constant strides that are powers of 2, facilitating data reorganization. We show how our vectorization scheme applies to dominant SIMD architectures and present experimental results on a wide range of key kernels, with execution-time speedups of up to 3.7 for interleaving levels (strides) as high as 8.
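
To make the notion of interleaved data concrete, the following is a minimal sketch in plain C of a stride-2 computation and of the kind of reorganization that vectorizing it requires. It is an illustration only, not the compilation scheme of the paper: the kernel, the vector length `VF`, and the helpers `extract_even` and `extract_odd` are hypothetical names chosen here, and short scalar loops stand in for the vector permute and arithmetic instructions a real SIMD target would provide.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 4 };  /* assumed number of elements per vector register */

/* Scalar reference: stride-2 accesses to interleaved (re, im) pairs. */
void mag2_scalar(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = in[2 * i] * in[2 * i] + in[2 * i + 1] * in[2 * i + 1];
}

/* Hypothetical helper: models a permute that gathers the even-indexed
   elements of the concatenation of vectors a and b. */
static void extract_even(const float *a, const float *b, float *dst)
{
    for (int j = 0; j < VF / 2; j++) {
        dst[j] = a[2 * j];
        dst[j + VF / 2] = b[2 * j];
    }
}

/* Hypothetical helper: same as above for the odd-indexed elements. */
static void extract_odd(const float *a, const float *b, float *dst)
{
    for (int j = 0; j < VF / 2; j++) {
        dst[j] = a[2 * j + 1];
        dst[j + VF / 2] = b[2 * j + 1];
    }
}

/* Vector-style version: two contiguous loads, two permutes that
   de-interleave the data, then element-wise arithmetic on packed data. */
void mag2_vector_style(const float *in, float *out, size_t n)
{
    assert(n % VF == 0);  /* sketch handles full vectors only, no epilogue */
    for (size_t i = 0; i < n; i += VF) {
        const float *v0 = in + 2 * i;       /* first contiguous "vector load" */
        const float *v1 = in + 2 * i + VF;  /* second contiguous "vector load" */
        float re[VF], im[VF];
        extract_even(v0, v1, re);           /* all real parts, packed */
        extract_odd(v0, v1, im);            /* all imaginary parts, packed */
        for (int j = 0; j < VF; j++)        /* stands in for vector mul/add */
            out[i + j] = re[j] * re[j] + im[j] * im[j];
    }
}
```

For a stride of 2, each group of `VF` results needs two contiguous loads and two permutes. Larger power-of-2 strides would need proportionally more loads and further rounds of reorganization, which is why the cost of data reorganization matters for the achievable speedup.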