ABSTRACT
As CPU speeds continue to improve faster than memory access latencies, detecting and removing memory access bottlenecks becomes increasingly important. In this work we present METRIC, a software framework for isolating and understanding such bottlenecks using partial access traces. METRIC extracts access traces from executing programs without special compiler or linker support. We make four primary contributions. First, we present a framework for extracting partial access traces based on dynamic binary rewriting of the executing application. Second, we introduce a novel algorithm for compressing these traces. The algorithm generates constant-space representations for regular accesses occurring in nested loop structures. Third, we use these traces for offline incremental memory hierarchy simulation. We extract symbolic information from the application executable and use it to generate detailed source-code-correlated statistics, including per-reference metrics, cache evictor information, and stream metrics. Finally, we demonstrate how this information can be used to isolate and understand memory access inefficiencies. This illustrates a potential advantage of METRIC over compile-time analysis for sample codes, particularly when interprocedural analysis is required.
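To make the idea of a constant-space representation for regular accesses concrete, the following is a minimal sketch and not METRIC's actual compression algorithm. It assumes a trace is simply a sequence of integer addresses and folds each run of constant-stride accesses into one descriptor. The names `StrideRun` and `compress` are hypothetical, and the sketch handles only a single loop level, whereas the algorithm described above also covers nested loop structures.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StrideRun:
    """Hypothetical descriptor: `count` addresses start, start+stride, ..."""
    start: int
    stride: int
    count: int

    def expand(self) -> List[int]:
        # Regenerate the original addresses covered by this descriptor.
        return [self.start + i * self.stride for i in range(self.count)]


def compress(addresses: Iterable[int]) -> List[StrideRun]:
    """Fold consecutive constant-stride accesses into StrideRun descriptors."""
    runs: List[StrideRun] = []
    for addr in addresses:
        if runs:
            last = runs[-1]
            if last.count == 1:
                # The second address of a run fixes its stride.
                last.stride = addr - last.start
                last.count = 2
                continue
            if addr == last.start + last.stride * last.count:
                # The address continues the current regular pattern.
                last.count += 1
                continue
        # Irregular address (or first address): open a new run.
        runs.append(StrideRun(addr, 0, 1))
    return runs


if __name__ == "__main__":
    # Assumed example: a loop reading 1000 consecutive 8-byte elements.
    trace = [0x1000 + 8 * i for i in range(1000)]
    print(compress(trace))
```

Under these assumptions, a loop that touches consecutive array elements collapses to a single descriptor regardless of its iteration count, while irregular accesses fall back to short runs.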
CITED BY
Surendra Byna , Yong Chen , Xian-He Sun , Rajeev Thakur , William Gropp, Parallel I/O prefetching using MPI file caching and I/O signatures, Proceedings of the 2008 ACM/IEEE conference on Supercomputing, November 15-21, 2008, Austin, Texas
REVIEW
Reviewer: Olivier Louis Marie Lecarme
A long time ago, computer hardware was designed in order to efficiently execute the code generated by compilers for higher-level programming languages. Now, computer hardware is designed in order to claim extraordinary performances, but the burden