|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
ABSTRACT
Many modern embedded processors such as DSPs support partitioned memory banks (also called X--Y memory or dual-bank memory) along with parallel load/store instructions to achieve higher code density and performance. In order to effectively utilize the parallel load/store instructions, the compiler must partition the memory-resident values and assign them to X or Y bank. This paper gives a postregister allocation solution to merge the generated load/store instructions into their parallel counterparts. Simultaneously, our framework performs allocation of values to X or Y memory banks. We first remove as many load/stores and register--register moves as possible through an excellent iterated coalescing based register allocator by Appel and George [1996]. We then attempt to parallelize the generated load/stores using a multipass approach. The basic phase of our approach attempts the merger of load/stores without duplication and web splitting. We model this problem as a graph-coloring problem in which each value is colored as either X or Y. We then construct a motion scheduling graph (MSG), based on the range of motion for each load/store instruction. MSG reflects potential instructions that could be merged. We propose a notion of pseudofixed boundaries so that the load/store movement is less affected by register dependencies. We prove that the coloring problem for MSG is NP-complete and solve it with two different heuristic algorithms with different complexity. We then propose a two-level iterative process to attempt instruction duplication, variable duplication, web splitting, and local conflict elimination to effectively merge the remaining load/stores. Finally, we clean up some multiple-aliased load/stores. To improve the performance, we combine profiling information with each stage coupled with some modifications to the algorithm. We show that our framework results in parallelization of a large number of load/stores without much growth in data and code segments. The average speedup for our optimization pass reaches roughly 13% if no profile information is available and 17% with profile information. The average code and data segment growth is controlled within 13%. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
INDEX TERMS
Primary Classification:
Additional Classification:
General Terms:
Keywords:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||