ABSTRACT
Extended subwords and the matrix register file (MRF) are two microarchitectural techniques that address some of the limitations of existing SIMD architectures. Extended subwords are wider than the data stored in memory. Specifically, for every byte of data stored in memory, there are four extra bits in the media register file. This avoids the need for data-type conversion instructions. The MRF is a register file organization that provides both conventional row-wise and column-wise access to the register file. In other words, it allows the register file to be viewed as a matrix in which corresponding subwords in different registers form a column of the matrix. It was introduced to accelerate matrix transposition, which is a very common operation in multimedia applications. In this paper, we show that the MRF is very versatile, since it can also be used for permutations other than matrix transposition. Specifically, we show how it can be used to provide efficient access to strided data, as is needed in, e.g., color space conversion. Furthermore, we show that special-purpose instructions (SPIs), such as the sum-of-absolute-differences (SAD) instruction, have limited usefulness when extended subwords and a few general SIMD instructions that we propose are supported, for the following reasons. First, when extended subwords are supported, the SAD instruction provides only a relatively small performance improvement. Second, the SAD instruction processes 8-bit subwords only, which is sufficient neither for quarter-pixel resolution nor for the cost functions used in image and video retrieval. Results obtained by extending the SimpleScalar toolset show that the proposed techniques provide a speedup of up to 3.00 over the MMX architecture. The results also show that using, at most, 13 extra media registers yields an additional performance improvement ranging from 1.38 to 1.57.
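The two techniques described above can be illustrated with a small software model. The following is only a minimal sketch, not the hardware design or instruction set proposed in the paper: the class `MatrixRegisterFile`, the helpers `load_rows` and `add_subwords`, and the 8-register by 8-subword geometry are all assumptions made for illustration. It shows that reading columns of a row-loaded register file yields a transposed view (and, equivalently, strided access to memory), and that a 12-bit subword (8 data bits plus the 4 extra bits mentioned in the abstract) holds a sum that overflows an 8-bit subword.

```python
class MatrixRegisterFile:
    """Toy model (assumed geometry): n_regs registers of n_subwords subwords."""

    def __init__(self, n_regs=8, n_subwords=8):
        self.n_regs = n_regs
        self.n_subwords = n_subwords
        self.regs = [[0] * n_subwords for _ in range(n_regs)]

    def write_row(self, r, values):
        # Conventional row-wise access: fill one whole register.
        values = list(values)
        if len(values) != self.n_subwords:
            raise ValueError("row must contain exactly n_subwords values")
        self.regs[r] = values

    def read_row(self, r):
        return list(self.regs[r])

    def read_column(self, c):
        # Column-wise access: subword c of every register.
        return [reg[c] for reg in self.regs]


def load_rows(mrf, memory):
    """Hypothetical helper: load consecutive memory into consecutive registers."""
    for r in range(mrf.n_regs):
        start = r * mrf.n_subwords
        mrf.write_row(r, memory[start:start + mrf.n_subwords])


def add_subwords(a, b, bits):
    """Element-wise wrap-around addition on subwords that are `bits` wide."""
    mask = (1 << bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


memory = list(range(64))            # an 8x8 matrix stored row by row
mrf = MatrixRegisterFile()
load_rows(mrf, memory)

# Reading every column yields the transposed matrix.
transposed = [mrf.read_column(c) for c in range(mrf.n_subwords)]

# A single column is a strided view of memory (stride = register width here).
strided = mrf.read_column(3)

# 200 + 100 wraps in an 8-bit subword but fits in a 12-bit extended subword.
wrapped = add_subwords([200], [100], bits=8)
extended = add_subwords([200], [100], bits=12)
```

In this toy model the stride is fixed at the register width; the model says nothing about cost, latency, or how the paper's instructions encode row versus column accesses.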