ABSTRACT
Data prefetching is a widely used technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs, showing with experimental data from multi-threaded applications that, while data prefetching improves performance with a small number of cores, its benefits diminish significantly as the number of cores increases, that is, it is not scalable; (ii) identify harmful prefetches as one of the main contributors to degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for CMPs with shared on-chip caches. The proposed scheme first identifies program phases using static compiler analysis, then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This reduces the total number of prefetches issued, the prefetch overheads, and the negative interactions on the shared cache space due to data prefetches and, more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with applications from the SPEC OMP benchmark suite indicate that, on average, when 12 cores are used, the proposed scheme improves overall parallel execution latency by 18.3% over the no-prefetch case and by 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently). The corresponding average performance improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case).
We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and the improvements it achieves come very close to those that can be achieved using an optimal scheme.
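The grouping step described above can be illustrated with a small sketch. The abstract does not say how threads are divided into groups, so everything specific here is an assumption made for illustration: per-phase data footprints are modeled as sets of cache-block IDs, threads are grouped greedily by Jaccard overlap of those footprints, and the names `group_threads`, `plan_helpers`, and `min_overlap` are hypothetical. It is not the paper's algorithm, only one plausible way to show why one helper thread per group issues fewer prefetches than each core prefetching independently.

```python
from typing import Dict, List, Set


def group_threads(footprints: Dict[int, Set[int]],
                  min_overlap: float = 0.5) -> List[List[int]]:
    """Greedily group threads whose block footprints overlap.

    ASSUMPTION: Jaccard similarity >= min_overlap is our illustrative
    grouping criterion; the source does not specify one.
    """
    groups: List[List[int]] = []       # thread IDs per group
    group_blocks: List[Set[int]] = []  # union of blocks touched per group
    for tid in sorted(footprints):
        blocks = footprints[tid]
        for i, seen in enumerate(group_blocks):
            union = seen | blocks
            if union and len(seen & blocks) / len(union) >= min_overlap:
                groups[i].append(tid)
                group_blocks[i] = union
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append([tid])
            group_blocks.append(set(blocks))
    return groups


def plan_helpers(phases: Dict[str, Dict[int, Set[int]]]) -> Dict[str, List[dict]]:
    """For each phase, assign one (hypothetical) helper thread per group.

    Each helper prefetches the union of its group's blocks once, instead of
    every thread prefetching its own blocks independently.
    """
    plan: Dict[str, List[dict]] = {}
    for phase, footprints in phases.items():
        plan[phase] = [
            {
                "threads": group,
                "prefetch": sorted(set().union(*(footprints[t] for t in group))),
            }
            for group in group_threads(footprints)
        ]
    return plan
```

With a single phase in which threads 0 and 1 touch blocks {1, 2, 3, 4} and {2, 3, 4, 5}, and threads 2 and 3 touch {10, 11} and {10, 11, 12}, this sketch forms two groups and plans 8 block prefetches in total, against 13 if each thread prefetched its own footprint independently. The actual benefit in the paper depends on its compiler analysis and is not reproduced by this toy model.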