| Speculative precomputation: long-range prefetching of delinquent loads |
| Source
|
International Symposium on Computer Architecture
Proceedings of the 28th annual international symposium on Computer architecture
Göteborg, Sweden
Pages: 14 - 25
Year of Publication: 2001
ISBN: 0-7695-1162-7
Authors
|
Jamison D. Collins | Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA
Hong Wang | Microprocessor Research Lab, Intel Corporation, Santa Clara, CA
Dean M. Tullsen | Department of Computer Science and Engineering, University of California, San Diego, La Jolla, CA
Christopher Hughes | Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL
Yong-Fong Lee | Microcomputer Software Lab, Intel Corporation, Santa Clara, CA
Dan Lavery | Microcomputer Software Lab, Intel Corporation, Santa Clara, CA
John P. Shen | Microprocessor Research Lab, Intel Corporation, Santa Clara, CA
|
ABSTRACT
This paper explores Speculative Precomputation, a technique that uses idle thread contexts in a multithreaded architecture to improve the performance of single-threaded applications. It attacks program stalls caused by data cache misses by precomputing future memory accesses in available thread contexts and prefetching the corresponding data. The technique is evaluated by simulating the performance of a research processor based on the Itanium™ ISA that supports Simultaneous Multithreading. Two primary forms of Speculative Precomputation are evaluated. If only the non-speculative thread spawns speculative threads, performance gains of up to 30% are achieved when assuming ideal hardware. However, this speedup drops considerably with more realistic hardware assumptions. Permitting speculative threads to directly spawn additional speculative threads reduces the overhead associated with spawning threads and enables significantly more aggressive speculation, overcoming this limitation. Even with realistic costs for spawning threads, speedups as high as 169% are achieved, with an average speedup of 76%.
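The core idea in the abstract, a spare thread context running ahead of the main thread to prefetch the data of loads that would otherwise miss, can be illustrated with a toy model. The sketch below is not the paper's simulator and makes several assumptions: the cache is a tiny fully associative LRU structure, the delinquent load is a pointer chase over a scattered linked list, the helper executes only that pointer chase a fixed number of nodes ahead, and helper misses are assumed not to stall the main thread. All names (`LRUCache`, `build_chain`, `main_thread_misses`, `lookahead`) are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny fully associative LRU cache keyed by node address (toy model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Touch addr. Return True on a hit, False on a miss (the line is then filled)."""
        hit = addr in self.lines
        if hit:
            self.lines.move_to_end(addr)  # mark as most recently used
        else:
            self.lines[addr] = True
            if len(self.lines) > self.capacity:
                self.lines.popitem(last=False)  # evict least recently used
        return hit


def build_chain(n, stride=7):
    """Next-pointers of a scattered linked list; stride must be coprime to n
    so that the chain visits every node exactly once per cycle."""
    return [(i + stride) % n for i in range(n)]


def main_thread_misses(next_ptr, cache_lines, lookahead):
    """Count cache misses seen by the main thread over one full traversal.

    lookahead is how many nodes the helper runs ahead; 0 disables the helper.
    Helper accesses act as prefetches and are not counted as main-thread misses
    (assumption: they overlap with main-thread execution).
    """
    cache = LRUCache(cache_lines)
    helper = 0
    for _ in range(lookahead):  # helper gets its head start
        cache.access(helper)
        helper = next_ptr[helper]

    misses = 0
    node = 0
    for _ in range(len(next_ptr)):
        if lookahead:
            cache.access(helper)  # helper prefetches one more node
            helper = next_ptr[helper]
        if not cache.access(node):  # main thread's delinquent load
            misses += 1
        node = next_ptr[node]
    return misses


chain = build_chain(1000)
no_helper = main_thread_misses(chain, cache_lines=16, lookahead=0)
timely = main_thread_misses(chain, cache_lines=16, lookahead=4)
too_early = main_thread_misses(chain, cache_lines=16, lookahead=32)
```

In this model a helper that stays a few nodes ahead removes the main thread's misses, while one that runs further ahead than the cache can hold has its prefetched lines evicted before use. The model ignores spawn overhead, which the abstract identifies as the factor separating ideal from realistic hardware results.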
CITED BY 65
|
Perry H. Wang , Jamison D. Collins , Hong Wang , Dongkeun Kim , Bill Greene , Kai-Ming Chan , Aamir B. Yunus , Terry Sych , Stephen F. Moore , John P. Shen, Helper threads via virtual multithreading on an experimental itanium® 2 processor-based platform, ACM SIGPLAN Notices, v.39 n.11, November 2004
|
Perry H. Wang , Jamison D. Collins , Hong Wang , Dongkeun Kim , Bill Greene , Kai-Ming Chan , Aamir B. Yunus , Terry Sych , Stephen F. Moore , John P. Shen, Helper Threads via Virtual Multithreading, IEEE Micro, v.24 n.6, p.74-82, November 2004
|
Tor M. Aamodt , Pedro Marcuello , Paul Chow , Antonio González , Per Hammarlund , Hong Wang , John P. Shen, A framework for modeling and optimization of prescient instruction prefetch, ACM SIGMETRICS Performance Evaluation Review, v.31 n.1, June 2003
|
Christos D. Antonopoulos , Xiaoning Ding , Andrey Chernikov , Filip Blagojevic , Dimitrios S. Nikolopoulos , Nikos Chrisochoides, Multigrain parallel Delaunay Mesh generation: challenges and opportunities for multithreaded architectures, Proceedings of the 19th annual international conference on Supercomputing, June 20-22, 2005, Cambridge, Massachusetts
|
Dongkeun Kim , Steve Shih-wei Liao , Perry H. Wang , Juan del Cuvillo , Xinmin Tian , Xiang Zou , Hong Wang , Donald Yeung , Milind Girkar , John P. Shen, Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors, Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization, p.27, March 20-24, 2004, Palo Alto, California
|
Gregorio Bernabé , Ricardo Fernández , Jose M. García , Manuel E. Acacio , José González, An efficient implementation of a 3D wavelet transform based encoder on hyper-threading technology, Parallel Computing, v.33 n.1, p.54-72, February, 2007
|
Jiwei Lu , Abhinav Das , Wei-Chung Hsu , Khoa Nguyen , Santosh G. Abraham, Dynamic Helper Threaded Prefetching on the Sun UltraSPARC CMP Processor, Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, p.93-104, November 12-16, 2005, Barcelona, Spain
|
Tong Chen , Tao Zhang , Zehra Sura , Mar Gonzales Tallada, Prefetching irregular references for software cache on cell, Proceedings of the sixth annual IEEE/ACM international symposium on Code generation and optimization, April 05-09, 2008, Boston, MA, USA
|
Jiwei Lu , Howard Chen , Rao Fu , Wei-Chung Hsu , Bobbie Othmer , Pen-Chung Yew , Dong-Yuan Chen, The Performance of Runtime Data Cache Prefetching in a Dynamic Optimization System, Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, p.180, December 03-05, 2003
|
Ronald D. Barnes , Erik M. Nystrom , John W. Sias , Sanjay J. Patel , Nacho Navarro , Wen-mei W. Hwu, Beating in-order stalls with "flea-flicker" two-pass pipelining, Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture, p.387, December 03-05, 2003
|
Akihiro Yamamoto , Yusuke Tanaka , Hideki Ando , Toshio Shimada, Data prefetching and address pre-calculation through instruction pre-execution with two-step physical register deallocation, Proceedings of the 2007 workshop on MEmory performance: DEaling with Applications, systems and architecture, p.33-40, September 16-16, 2007, Brasov, Romania
|
Ronald D. Barnes , John W. Sias , Erik M. Nystrom , Sanjay J. Patel , Jose (Nacho) Navarro , Wen-mei W. Hwu, Beating In-Order Stalls with "Flea-Flicker" Two-Pass Pipelining, IEEE Transactions on Computers, v.55 n.1, p.18-33, January 2006
|
Christos D. Antonopoulos , Filip Blagojevic , Andrey N. Chernikov , Nikos P. Chrisochoides , Dimitrios S. Nikolopoulos, Algorithm, software, and hardware optimizations for Delaunay mesh generation on simultaneous multithreaded architectures, Journal of Parallel and Distributed Computing, v.69 n.7, p.601-612, July, 2009
|
Carlos Madriles , Pedro López , Josep M. Codina , Enric Gibert , Fernando Latorre , Alejandro Martinez , Raúl Martinez , Antonio Gonzalez, Boosting single-thread performance in multi-core systems through fine-grain multi-threading, ACM SIGARCH Computer Architecture News, v.37 n.3, June 2009