Atomic Vector Operations on Chip Multiprocessors
Source
ACM SIGARCH Computer Architecture News
Volume 36, Issue 3 (June 2008)
Pages 441-452
Year of Publication: 2008
ISSN:0163-5964
Authors
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1394608.1382154

ABSTRACT

The current trend is for processors to deliver dramatic improvements in parallel performance while only modestly improving serial performance. Parallel performance is harvested through vector/SIMD instructions as well as multithreading (through both multithreaded cores and chip multiprocessors). Vector parallelism can be supported more efficiently than multithreading, but is often harder for software to exploit. In particular, code with sparse data access patterns cannot easily use the vector/SIMD instructions of mainstream processors. Hardware to scatter and gather sparse data has previously been proposed to enable vector execution for such code. However, on multithreaded architectures, a number of applications spend significant time on atomic operations (e.g., parallel reductions), which cannot be vectorized using previously proposed schemes. This paper proposes architectural support for atomic vector operations (referred to as GLSC) that addresses this limitation. GLSC extends scatter-gather hardware to support atomic memory operations. Our experiments show that GLSC improves performance by an average of 54% on a set of important RMS (Recognition, Mining and Synthesis) kernels with 4-wide SIMD.
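
To make the limitation concrete, here is a minimal software sketch of why a plain gather, add, scatter sequence loses updates when several SIMD lanes target the same address, and how a conditional scatter with a retry loop repairs it. This is an illustrative model only, not the hardware proposed in the paper: the function names (`gather`, `scatter`, `scatter_conditional`, `naive_vector_add`, `atomic_vector_add`) and the "first active lane per address wins" rule are assumptions, and the model covers only conflicts between lanes of a single vector, not conflicts between threads.

```python
def gather(mem, idx, mask):
    """Read mem[idx[i]] for every active lane i (inactive lanes read 0)."""
    return [mem[j] if m else 0 for j, m in zip(idx, mask)]


def scatter(mem, idx, vals, mask):
    """Plain scatter: lanes write in order, so a later lane overwrites an earlier one."""
    for j, v, m in zip(idx, vals, mask):
        if m:
            mem[j] = v


def scatter_conditional(mem, idx, vals, mask):
    """Hypothetical conditional scatter (an assumption for illustration).

    Only the first active lane per address performs its write; the returned
    mask marks the lanes that succeeded.
    """
    claimed = set()
    ok = []
    for j, v, m in zip(idx, vals, mask):
        if m and j not in claimed:
            mem[j] = v
            claimed.add(j)
            ok.append(True)
        else:
            ok.append(False)
    return ok


def naive_vector_add(mem, idx, inc):
    """Gather, add, scatter with no conflict handling: duplicates lose updates."""
    mask = [True] * len(idx)
    old = gather(mem, idx, mask)
    scatter(mem, idx, [o + d for o, d in zip(old, inc)], mask)


def atomic_vector_add(mem, idx, inc):
    """Retry loop: lanes whose conditional scatter failed re-gather and try again."""
    todo = [True] * len(idx)
    while any(todo):
        old = gather(mem, idx, todo)
        new = [o + d for o, d in zip(old, inc)]
        ok = scatter_conditional(mem, idx, new, todo)
        todo = [t and not s for t, s in zip(todo, ok)]
```

For example, with a 4-wide vector of indices `[2, 0, 2, 2]` and an increment of 1 per lane on zero-initialized memory, `naive_vector_add` leaves `mem[2]` at 1 because all three lanes read the same old value, whereas `atomic_vector_add` takes three passes and leaves `mem[2]` at 3. The retry count grows with the number of duplicate addresses in a vector, which is the cost this kind of scheme pays for correctness.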



Collaborative Colleagues:
Sanjeev Kumar
Daehyun Kim
Mikhail Smelyanskiy
Yen-Kuang Chen
Jatin Chhugani
Christopher J. Hughes
Changkyu Kim
Victor W. Lee
Anthony D. Nguyen