ABSTRACT
Processors currently deliver dramatic improvements in parallel performance while improving serial performance only modestly. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for such code. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that, for 4-wide SIMD, GLSC improves performance by an average of 54% on a set of important RMS (recognition, mining, and synthesis) kernels.
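To make the limitation concrete, the following is a minimal software sketch of why a plain gather/add/scatter sequence is unsafe for a reduction such as a histogram, and how a retry loop around a conditional scatter repairs it. It is an illustration only: the function names, the one-winner-per-address rule, and the retry policy are assumptions made for this sketch, not the hardware mechanism the paper specifies. It also models only conflicts between lanes of a single vector, not conflicts between threads.

```python
SIMD_WIDTH = 4  # matches the 4-wide SIMD configuration mentioned in the abstract


def naive_vector_histogram(bins, indices):
    """Gather, add, scatter with no atomicity.

    If two lanes of one vector hold the same index, both gather the same
    old value and the later scatter overwrites the earlier one, so an
    update is lost.
    """
    bins = list(bins)
    for start in range(0, len(indices), SIMD_WIDTH):
        lanes = indices[start:start + SIMD_WIDTH]
        gathered = [bins[i] for i in lanes]   # gather
        updated = [v + 1 for v in gathered]   # vector add
        for i, v in zip(lanes, updated):      # scatter: last writer wins
            bins[i] = v
    return bins


def retry_vector_histogram(bins, indices):
    """Gather, add, conditional scatter, retrying the lanes that failed.

    Hypothetical policy: per round, only the first pending lane targeting
    a given address is allowed to write; the others stay pending and
    re-gather the fresh value in the next round.
    """
    bins = list(bins)
    for start in range(0, len(indices), SIMD_WIDTH):
        lanes = indices[start:start + SIMD_WIDTH]
        pending = [True] * len(lanes)
        while any(pending):
            # Gather current values for the lanes that still need to update.
            gathered = {k: bins[lanes[k]] for k in range(len(lanes)) if pending[k]}
            claimed = set()  # addresses already written in this round
            for k, old in gathered.items():
                addr = lanes[k]
                if addr not in claimed:
                    claimed.add(addr)
                    bins[addr] = old + 1
                    pending[k] = False
    return bins
```

With the indices `[0, 0, 1, 2]`, the naive version increments bin 0 only once, while the retry version needs a second round for the conflicting lane and counts both updates. Each round completes at least one pending lane, so the loop terminates after at most `SIMD_WIDTH` rounds per vector.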