A compiler-directed data prefetching scheme for chip multiprocessors
Source
Principles and Practice of Parallel Programming
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
Raleigh, NC, USA
SESSION: Parallel compilers and tools
Pages 209-218
Year of Publication: 2009
ISBN:978-1-60558-397-6
Authors
Seung Woo Son  Pennsylvania State University, University Park, PA, USA
Mahmut Kandemir  Pennsylvania State University, University Park, PA, USA
Mustafa Karakoy  Imperial College, London, United Kingdom
Dhruva Chakrabarti  HP Labs, Cupertino, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGPLAN: ACM Special Interest Group on Programming Languages
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1504176.1504208

ABSTRACT

Data prefetching has been widely used as a technique for hiding memory access latencies. However, data prefetching in multi-threaded applications running on chip multiprocessors (CMPs) can be problematic when multiple cores compete for a shared on-chip cache (L2 or L3). In this paper, we (i) quantify the impact of conventional data prefetching on shared caches in CMPs; (ii) identify harmful prefetches as one of the main contributors to degraded performance with a large number of cores; and (iii) propose and evaluate a compiler-directed data prefetching scheme for CMPs with a shared on-chip cache. For (i), the experimental data collected using multi-threaded applications indicates that, while data prefetching improves performance with a small number of cores, its benefits diminish significantly as the number of cores increases; that is, it is not scalable. The proposed scheme first identifies program phases using static compiler analysis, then divides the threads into groups within each phase and assigns a customized prefetcher thread (helper thread) to each group of threads. This helps to reduce the total number of prefetches issued, the prefetch overheads, and the negative interactions on the shared cache space due to data prefetches and, more importantly, makes compiler-directed prefetching a scalable optimization for CMPs. Our experiments with applications from the SPEC OMP benchmark suite indicate that the proposed scheme improves overall parallel execution latency by 18.3% over the no-prefetch case and by 6.4% over the conventional data prefetching scheme (where each core prefetches its data independently), on average, when 12 cores are used. The corresponding average performance improvements with 24 cores are 16.4% (over the no-prefetch case) and 11.7% (over the conventional prefetching case). We also demonstrate that the proposed scheme is robust under a wide range of values of our major simulation parameters, and that the improvements it achieves come very close to those achievable with an optimal scheme.
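
The grouping idea in the abstract (threads within a phase are divided into groups, and each group gets one helper prefetcher thread) can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not say how groups are formed, so the overlap-based greedy grouping, the 0.5 threshold, the function names, and the sample data below are all assumptions made purely for illustration. The sketch only shows why one prefetcher per group can issue fewer prefetches than one prefetcher per core when threads in a group touch overlapping cache blocks.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two sets of cache-block IDs (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_threads(accesses: Dict[int, Set[int]],
                  threshold: float = 0.5) -> List[List[int]]:
    """Hypothetical grouping rule (an assumption, not from the paper):
    place each thread in the first group whose combined footprint
    overlaps the thread's own footprint by at least `threshold`."""
    groups: List[List[int]] = []
    footprints: List[Set[int]] = []
    for tid in sorted(accesses):
        blocks = accesses[tid]
        for i, footprint in enumerate(footprints):
            if jaccard(blocks, footprint) >= threshold:
                groups[i].append(tid)
                footprints[i] = footprint | blocks
                break
        else:
            # No sufficiently similar group exists: start a new one.
            groups.append([tid])
            footprints.append(set(blocks))
    return groups


def prefetch_counts(accesses: Dict[int, Set[int]],
                    groups: List[List[int]]) -> Tuple[int, int]:
    """Return (per-core prefetches, per-group prefetches) for one phase.
    Per-core: every thread prefetches all of its own blocks.
    Per-group: one helper prefetches the union of its group's blocks."""
    per_core = sum(len(blocks) for blocks in accesses.values())
    per_group = sum(
        len(set().union(*(accesses[tid] for tid in group)))
        for group in groups
    )
    return per_core, per_group


# Made-up access sets for one program phase: thread ID -> cache-block IDs.
phase = {
    0: set(range(0, 8)),
    1: set(range(2, 10)),
    2: set(range(100, 108)),
    3: set(range(102, 110)),
}

groups = group_threads(phase)
per_core, per_group = prefetch_counts(phase, groups)
print(groups)               # [[0, 1], [2, 3]]
print(per_core, per_group)  # 32 20
```

In this made-up phase, threads 0 and 1 share six of their eight blocks, as do threads 2 and 3, so two helper threads cover with 20 prefetches what four independent per-core prefetchers would cover with 32. A real compiler-directed scheme would derive the access sets from static analysis rather than from explicit sets as done here.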


