|
ABSTRACT
With the continuing technological trend of ever cheaper and larger memory, most data sets in database servers will soon be able to reside in main memory. In this configuration, the performance bottleneck is likely to be the gap between the processing speed of the CPU and the memory access latency. Previous work has shown that database applications have large instruction and data footprints and hence do not use processor caches effectively. In this paper, we propose Call Graph Prefetching (CGP), an N instruction prefetching technique that analyzes the call graph of a database system and prefetches instructions from the function that is deemed likely to be called next. CGP capitalizes on the highly predictable function call sequences that are typical of database systems. CGP can be implemented either in software or in hardware. The software-based CGP (CGP_S) uses profile information to build a call graph, and uses the predictable call sequences in the call graph to determine which function to prefetch next. The hardware-based CGP(CGP_H) uses a hardware table, called the Call Graph History Cache (CGHC), to dynamically store sequences of functions invoked during program execution, and uses that stored history when choosing which functions to prefetch.We evaluate the performance of CGP on sets of Wisconsin and TPC-H queries, as well as on CPU-2000 benchmarks. For most CPU-2000 applications the number of instruction cache (I-cache) misses were very few even without any prefetching, obviating the need for CGP. On the other hand, the database workloads do suffer a significant number of I-cache misses; CGP_S improves their performance by 23% and CGP_H by 26% over a baseline system that has already been highly tuned for efficient I-cache usage by using the OM tool. CGP, with or without OM, reduces the I-cache miss stall time by about 50% relative to O5+OM, taking us about half way from an already highly tuned baseline system toward perfect I-cache performance.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
|
| |
3
|
|
 |
4
|
|
 |
5
|
Phil Bernstein , Michael Brodie , Stefano Ceri , David DeWitt , Mike Franklin , Hector Garcia-Molina , Jim Gray , Jerry Held , Joe Hellerstein , H. V. Jagadish , Michael Lesk , Dave Maier , Jeff Naughton , Hamid Pirahesh , Mike Stonebraker , Jeff Ullman, The Asilomar report on database research, ACM SIGMOD Record, v.27 n.4, p.74-80, Dec. 1998
[doi> 10.1145/306101.306137]
|
| |
6
|
|
| |
7
|
|
| |
8
|
Burger, D. and Austin, T. 1997. The SimpleScalar Tool Set. Tech. Rep. 1342, University of Wisconsin-Madison, Computer ScienceDepartment. June.
|
 |
9
|
Michael J. Carey , David J. DeWitt , Michael J. Franklin , Nancy E. Hall , Mark L. McAuliffe , Jeffrey F. Naughton , Daniel T. Schuh , Marvin H. Solomon , C. K. Tan , Odysseas G. Tsatalos , Seth J. White , Michael J. Zwilling, Shoring up persistent applications, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.383-394, May 24-27, 1994, Minneapolis, Minnesota, United States
|
| |
10
|
|
 |
11
|
|
 |
12
|
Richard J. Eickemeyer , Ross E. Johnson , Steven R. Kunkel , Mark S. Squillante , Shiafun Liu, Evaluation of multithreaded uniprocessors for commercial application environments, Proceedings of the 23rd annual international symposium on Computer architecture, p.203-212, May 22-24, 1996, Philadelphia, Pennsylvania, United States
|
| |
13
|
|
| |
14
|
Nikolas Gloy , Trevor Blackwell , Michael D. Smith , Brad Calder, Procedure placement using temporal ordering information, Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, p.303-313, December 01-03, 1997, Research Triangle Park, North Carolina, United States
|
 |
15
|
Amir H. Hashemi , David R. Kaeli , Brad Calder, Efficient procedure mapping using cache line coloring, Proceedings of the ACM SIGPLAN 1997 conference on Programming language design and implementation, p.171-182, June 16-18, 1997, Las Vegas, Nevada, United States
|
| |
16
|
|
| |
17
|
Intel Web Site: http://www.developer.intel.com/drg/mmx/appnotes/perfmon.htm Survey of Pentium Processor Performance Monitoring Capabilities & Tools.
|
 |
18
|
|
 |
19
|
|
| |
20
|
|
 |
21
|
Jack L. Lo , Luiz André Barroso , Susan J. Eggers , Kourosh Gharachorloo , Henry M. Levy , Sujay S. Parekh, An analysis of database workload performance on simultaneous multithreaded processors, Proceedings of the 25th annual international symposium on Computer architecture, p.39-50, June 27-July 02, 1998, Barcelona, Spain
|
 |
22
|
|
 |
23
|
Ann Marie Grizzaffi Maynard , Colette M. Donnelly , Bret R. Olszewski, Contrasting characteristics and cache performance of technical and multi-user commercial workloads, Proceedings of the sixth international conference on Architectural support for programming languages and operating systems, p.145-156, October 05-07, 1994, San Jose, California, United States
|
 |
24
|
Chris Nyberg , Tom Barclay , Zarka Cvetanovic , Jim Gray , Dave Lomet, AlphaSort: a RISC machine sort, Proceedings of the 1994 ACM SIGMOD international conference on Management of data, p.233-242, May 24-27, 1994, Minneapolis, Minnesota, United States
|
 |
25
|
|
| |
26
|
|
| |
27
|
|
| |
28
|
Romer, T., Voelker, G., Lee, D., Wolman, A., Wong, W., Levy, H., Bershad, B., and Chen, B. 1997. Instrumentation and Optimization of Win32/Intel Executables Using Etch. In USENIX Windows NT Workshop. 1--7.
|
| |
29
|
Rupley, J., Annavaram, M., Devale, J., Diep, T., and Black, B. 2002. Comparing and Contrasting a Commercial OLTP Workload with CPU2000 on IPF. In the 5th Workshop on Workload Characterization.
|
| |
30
|
|
| |
31
|
Smith, A. 1978. Sequential Program Prefetching in Memory Hierarchies. IEEE Comput. 11, 2 (December), 7--21.
|
| |
32
|
|
| |
33
|
|
| |
34
|
Srivastava, A. and Eustace, A. 1994. ATOM: A System for Building Customized Program Analysis Tools. Tech. Rep. 94/2, Digital Western Research Laboratory. March.
|
| |
35
|
Srivastava, A. and Wall, D. 1992. A Practical System for Intermodule Code Optimization at Link-Time. Tech. Rep. 92/6, Digital Western Research Laboratory. June.
|
| |
36
|
|
| |
37
|
TPC. 1999. TPC Benchmark H Standard Specification (Decision Support). In Revision 1.1.0.
|
|