ABSTRACT
Hard-to-predict data addresses in many irregular applications have rendered prefetching ineffective. In many cases, the only accurate way to predict these addresses is to directly execute the code that generates them. As multithreaded architectures become increasingly popular, one attractive approach is to use idle threads on these machines to perform pre-execution (essentially a combined act of speculative address generation and prefetching) to accelerate the main thread. In this paper, we propose such a pre-execution technique for simultaneous multithreading (SMT) processors. By using software to control pre-execution, we are able to handle some of the most important access patterns that are typically difficult to prefetch. Compared with existing work on pre-execution, our technique is significantly simpler to implement (e.g., no integration of pre-execution results, no need to shorten programs for pre-execution, and no need for special hardware to copy register values upon thread spawns). Consequently, only minimal extensions to SMT machines are required to support our technique. Despite its simplicity, our technique offers an average speedup of 24% in a set of irregular applications, which is a 19% speedup over state-of-the-art software-controlled prefetching.
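As a rough illustration of the general idea of pre-execution, and not the mechanism proposed in the paper, the sketch below uses an ordinary POSIX thread as a stand-in for an idle SMT hardware context. The helper runs only the address-generating code (a pointer chase) and issues prefetches, while the main thread performs the real computation and never consumes any result from the helper. The names `node_t`, `pre_execute`, `sum_list`, and `run_demo` are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin rather than anything defined by the paper.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical list node; the layout is an assumption for illustration. */
typedef struct node {
    struct node *next;
    long value;
} node_t;

/* Pre-execution slice: run only the address-generating code (the pointer
 * chase) and prefetch each upcoming node. Nothing is passed back to the
 * main thread. */
static void *pre_execute(void *arg) {
    for (const node_t *p = arg; p != NULL; p = p->next) {
        /* GCC/Clang builtin: read access, high temporal locality.
         * Prefetching a NULL address is harmless. */
        __builtin_prefetch(p->next, 0, 3);
    }
    return NULL;
}

/* Main thread: the real computation, unchanged by the helper. */
static long sum_list(const node_t *head) {
    long sum = 0;
    for (const node_t *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Build a list holding 0..n-1, start the helper, run the main traversal.
 * Returns the sum, or -1 if allocation fails. Nodes are contiguous here
 * for brevity; an irregular workload would scatter them in memory. */
long run_demo(size_t n) {
    if (n == 0)
        return 0;
    node_t *nodes = malloc(n * sizeof *nodes);
    if (nodes == NULL)
        return -1;
    for (size_t i = 0; i < n; i++) {
        nodes[i].value = (long)i;
        nodes[i].next = (i + 1 < n) ? &nodes[i + 1] : NULL;
    }
    pthread_t helper;
    /* If the helper cannot be spawned, the main thread simply runs alone. */
    int spawned = pthread_create(&helper, NULL, pre_execute, nodes) == 0;
    long sum = sum_list(nodes);
    if (spawned)
        pthread_join(helper, NULL);
    free(nodes);
    return sum;
}
```

Both threads only read the list, so the helper cannot change the main thread's result; at worst its prefetches are useless. Whether any speedup appears depends on the two threads sharing a cache and on the helper staying ahead of the main thread, neither of which this sketch controls.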