ABSTRACT
Commercial applications such as databases and Web servers constitute the most important market segment for high-performance servers. Among these applications, on-line transaction processing (OLTP) workloads provide a challenging set of requirements for system designs since they often exhibit inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates. A number of recent studies have characterized the behavior of commercial workloads and proposed architectural features to improve their performance. However, there has been little research on the impact of software and compiler-level optimizations for improving the behavior of such workloads.
This paper provides a detailed study of profile-driven compiler optimizations to improve the code layout in commercial workloads with large instruction footprints. Our compiler algorithms are implemented in the context of Spike, an executable optimizer for the Alpha architecture. Our experiments use the Oracle commercial database engine running an OLTP workload, with results generated using both full system simulations and actual runs on Alpha multiprocessors. Our results show that code layout optimizations can substantially improve instruction cache behavior, reducing application misses by 55% to 65% for 64-128 KB caches. Our analysis shows that this improvement arises primarily from longer sequences of consecutively executed instructions and more reuse of cache lines before they are replaced. We also show that the majority of application instruction misses are caused by self-interference. However, code layout optimizations significantly reduce the amount of self-interference, thus elevating the relative importance of interference with operating system code. Finally, we show that better code layout can also provide substantial improvements in the behavior of other memory system components such as the instruction TLB and the unified second-level cache. Overall, our code layout optimizations improve the execution time of our workload by a factor of 1.33.
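To make the idea of profile-driven code layout concrete, the following is a minimal sketch of one generic technique, greedy basic-block chaining over edge profile counts. It is not taken from the paper and is not claimed to be the algorithm implemented in Spike; the function names (`chain_blocks`, `fallthrough_weight`), the block names, and the profile counts are all hypothetical and chosen only for illustration. The sketch places frequently taken successor blocks directly after their predecessors, which is one way to obtain longer sequences of consecutively executed instructions.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (source block, destination block); hypothetical naming


def chain_blocks(blocks: List[str], edge_counts: Dict[Edge, int]) -> List[str]:
    """Greedy chaining: join blocks along the hottest profiled edges first."""
    # Every block starts in its own single-element chain.
    chain_of = {b: [b] for b in blocks}
    # Visit edges from hottest to coldest; ties are broken by edge name.
    ordered = sorted(edge_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for (src, dst), _count in ordered:
        a, b = chain_of[src], chain_of[dst]
        # Merge only when src ends one chain and dst begins a different one.
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for blk in b:
                chain_of[blk] = a
    # Emit chains in the order their first block appears in the input.
    layout: List[str] = []
    seen = set()
    for blk in blocks:
        chain = chain_of[blk]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


def fallthrough_weight(layout: List[str], edge_counts: Dict[Edge, int]) -> int:
    """Sum of profile counts on edges that become straight-line fall-throughs."""
    return sum(
        edge_counts.get((layout[i], layout[i + 1]), 0)
        for i in range(len(layout) - 1)
    )


# Hypothetical profile: the error path is rarely taken.
blocks = ["entry", "error", "hot", "exit"]
profile = {
    ("entry", "hot"): 990,
    ("entry", "error"): 10,
    ("hot", "exit"): 990,
    ("error", "exit"): 10,
}

new_layout = chain_blocks(blocks, profile)
print(new_layout)                               # ['entry', 'hot', 'exit', 'error']
print(fallthrough_weight(blocks, profile))      # 1000
print(fallthrough_weight(new_layout, profile))  # 1980
```

In this toy profile the original order falls through on 1000 of the 2000 profiled edge traversals, while the chained order falls through on 1980, moving the rarely executed `error` block out of the hot path. A real optimizer would also need to handle procedure ordering, branch fix-ups, and alignment, none of which this sketch attempts.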