ABSTRACT
Memory access latency and memory-related operations are often the performance bottleneck in parallel applications. In this paper, we present the concept of an active memory operation: an on-chip network transaction whose behavior is defined by microcode supplied by the software designer. With active memory operations, multiple memory-access transactions over the on-chip network, together with the associated computation on the local processing element, can be replaced by a smaller number of high-level transactions and near-memory computation. We implemented a processor, called the active memory processor, that is located near the memory and executes the active memory operations. In our case studies, we applied the concept to three real-world applications (parallelized JPEG, FFT, and text indexing for data mining) running on a 36-tile architecture with 32 cores and 4 memories, and found that the programmable transaction approach can improve performance by 34.3% to 618% at the cost of additional design effort.
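As a rough illustration of the transaction-count argument, the following toy model compares a baseline in which a core reads and writes each memory word over the network with a single high-level transaction that carries the update logic to the memory. It is only a sketch under stated assumptions: the `Memory` class, the `active_op` method, and the increment routine are hypothetical names invented for this example, a Python function stands in for microcode, and nothing here reflects the actual active memory processor or its instruction format.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Toy memory tile that counts the network transactions it receives.

    Hypothetical model for illustration only.
    """
    cells: dict = field(default_factory=dict)
    transactions: int = 0

    def read(self, addr):
        # One network transaction per remote read.
        self.transactions += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        # One network transaction per remote write.
        self.transactions += 1
        self.cells[addr] = value

    def active_op(self, microcode, *args):
        # One high-level transaction: the supplied routine runs next to the data.
        self.transactions += 1
        return microcode(self.cells, *args)


def increment_all_remote(mem, addrs):
    # Baseline: the core performs a read and a write over the network per update.
    for a in addrs:
        mem.write(a, mem.read(a) + 1)


def increment_all_microcode(cells, addrs):
    # Stand-in for designer-provided microcode executed near the memory.
    for a in addrs:
        cells[a] = cells.get(a, 0) + 1


addrs = [0, 1, 2, 1]

baseline = Memory()
increment_all_remote(baseline, addrs)

active = Memory()
active.active_op(increment_all_microcode, addrs)

print(baseline.transactions, active.transactions)
```

Both variants leave the same values in memory; only the number of network transactions differs. The model ignores latency, contention, and the cost of executing the routine near the memory, so it says nothing about the speedups reported above.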