ABSTRACT
Application-specific hardware accelerators can significantly improve a system's performance. In a Java-based system, this calls for a hybrid architecture in which a Java Virtual Machine running on a general-purpose processor is connected to the hardware accelerator. In such a hybrid architecture, data communication between the accelerator and the general-purpose processor can incur a significant cost, which may even cancel out the performance improvement gained by adding the accelerator. A careful layout of the data in memory is therefore essential to preserve the benefits of acceleration. This article addresses the reduction of the communication cost in a distributed shared memory consisting of the main memory of the processor and the accelerator's local memory, which are unified in the Java heap. Since memory access times are highly nonuniform, a suitable allocation of objects in either main memory or the accelerator's local memory can significantly reduce the communication cost. We propose several techniques for finding the optimal location for each Java object's data, either statically through profiling or dynamically at runtime. We show how we can reduce communication cost by up to 86% for the SPECjvm and DaCapo benchmarks. We also show that the best strategy is application dependent and varies with the relative cost of remote versus local accesses. For a relative cost higher than 10, a self-learning dynamic approach often results in the best performance.
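To make the placement trade-off concrete, the following minimal sketch compares the cost of storing one object in main memory with the cost of storing it in the accelerator's local memory. It is an illustration under stated assumptions, not the article's algorithm: the class `PlacementSketch`, its methods, and the linear cost model (a local access costs one unit, a remote access costs `remoteCostRatio` units) are all hypothetical, and the access counts are assumed to come from a profiling run.

```java
public final class PlacementSketch {

    /** The two memories that together form the unified heap (hypothetical names). */
    public enum Location { MAIN_MEMORY, ACCELERATOR_MEMORY }

    /**
     * Total access cost for one object stored at {@code where}.
     * Assumed model: a local access costs 1 unit, a remote access
     * costs {@code remoteCostRatio} units.
     */
    public static long accessCost(long cpuAccesses, long accelAccesses,
                                  long remoteCostRatio, Location where) {
        if (where == Location.MAIN_MEMORY) {
            // The processor accesses locally, the accelerator remotely.
            return cpuAccesses + accelAccesses * remoteCostRatio;
        }
        // The accelerator accesses locally, the processor remotely.
        return accelAccesses + cpuAccesses * remoteCostRatio;
    }

    /** Picks the cheaper location; ties stay in main memory. */
    public static Location cheaperLocation(long cpuAccesses, long accelAccesses,
                                           long remoteCostRatio) {
        long inMain = accessCost(cpuAccesses, accelAccesses,
                                 remoteCostRatio, Location.MAIN_MEMORY);
        long inAccel = accessCost(cpuAccesses, accelAccesses,
                                  remoteCostRatio, Location.ACCELERATOR_MEMORY);
        return inAccel < inMain ? Location.ACCELERATOR_MEMORY : Location.MAIN_MEMORY;
    }

    public static void main(String[] args) {
        // Hypothetical profile: 100 processor accesses, 900 accelerator accesses,
        // remote accesses assumed 10 times as expensive as local ones.
        System.out.println(cheaperLocation(100, 900, 10));
    }
}
```

With these made-up counts, keeping the object in main memory would cost 100 + 900 × 10 = 9100 units under the assumed model, against 900 + 100 × 10 = 1900 units in the accelerator's memory, so the sketch selects the accelerator's memory. A static, profile-driven decision of this kind fixes the location up front; a dynamic approach would instead revise it at runtime as the observed access counts change.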