Decoupled store completion/silent deterministic replay: enabling scalable data memory for CPR/CFP processors
Source
International Symposium on Computer Architecture
Proceedings of the 36th Annual International Symposium on Computer Architecture
Austin, TX, USA
SESSION: Load and stores
Pages 245-254
Year of Publication: 2009
ISBN: 978-1-60558-526-0
Authors
Andrew Hilton  University of Pennsylvania, Philadelphia, PA, USA
Amir Roth  University of Pennsylvania, Philadelphia, PA, USA
Sponsors
SIGARCH: ACM Special Interest Group on Computer Architecture
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1555754.1555786

ABSTRACT

CPR/CFP (Checkpoint Processing and Recovery/Continual Flow Pipeline) processors support an adaptive instruction window that scales to tolerate last-level cache misses. They scale the register file by aggressively reclaiming the destination registers of many in-flight instructions. However, no analogous mechanism exists for stores and loads. As the window expands, CPR/CFP processors must track all in-flight stores and loads to support forwarding and to detect memory ordering violations.

The previously described SVW (Store Vulnerability Window) and SQIP (Store Queue Index Prediction) schemes provide scalable, non-associative load and store queues, respectively. However, they do not fit cleanly into a CPR/CFP context. SVW/SQIP rely on the ability to dynamically stall some loads until a specific older store writes to the cache. Enforcing this serialization in CPR/CFP is expensive if the load and store are in the same checkpoint.
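
The serialization described above, a load stalling until a specific older store has written to the cache, can be illustrated with a minimal Python sketch. It is not taken from the paper: it assumes stores are numbered with increasing sequence numbers and write to the cache in program order, and the names `StoreTracker`, `Load`, and `wait_ssn` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Load:
    addr: int
    # Hypothetical field: sequence number of the older store this load
    # must wait for (0 means the load waits for no store).
    wait_ssn: int


class StoreTracker:
    """Toy model (an assumption, not the paper's design): stores are
    numbered in program order and write to the cache in that order."""

    def __init__(self) -> None:
        self.ssn_dispatch = 0  # number of the youngest dispatched store
        self.ssn_complete = 0  # number of the youngest store written to the cache

    def dispatch_store(self) -> int:
        """Assign the next sequence number to a newly dispatched store."""
        self.ssn_dispatch += 1
        return self.ssn_dispatch

    def complete_next_store(self) -> None:
        """The oldest store that has not yet completed writes to the cache."""
        if self.ssn_complete >= self.ssn_dispatch:
            raise RuntimeError("no dispatched store left to complete")
        self.ssn_complete += 1

    def load_may_issue(self, load: Load) -> bool:
        """A load stalls until the store it waits for has written to the cache."""
        return load.wait_ssn <= self.ssn_complete


tracker = StoreTracker()
first = tracker.dispatch_store()
second = tracker.dispatch_store()
load = Load(addr=0x40, wait_ssn=second)

stalled_before = tracker.load_may_issue(load)  # neither store has completed
tracker.complete_next_store()
tracker.complete_next_store()
ready_after = tracker.load_may_issue(load)     # both stores have completed
```

In this toy model the stall clears only once the awaited store completes, which is why the point at which stores are allowed to write to the cache matters for a load in the same checkpoint.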

We introduce two complementary mechanisms that implement this serialization efficiently. Decoupled Store Completion (DSC) allows stores to write to the cache before the enclosing checkpoint completes execution. Silent Deterministic Replay (SDR) supports mis-speculation recovery in the presence of DSC by replaying loads older than completed stores using values from the load queue. The combination of DSC and SDR enables an SVW/SQIP-based CPR/CFP memory system that outperforms previous designs while occupying less area.
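
To see why replay would need load-queue values once stores can complete early, consider the following toy Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the cache is a dictionary, every store in the region has already written to it, and replay skips those stores instead of writing them again. The function names `execute` and `replay` are hypothetical.

```python
def execute(ops, cache):
    """Run ops in program order. Each store writes the cache right away
    (the DSC-style assumption). Returns the value seen by each load."""
    load_queue = []
    for op in ops:
        if op[0] == "load":           # ("load", addr)
            load_queue.append(cache[op[1]])
        else:                         # ("store", addr, value)
            cache[op[1]] = op[2]
    return load_queue


def replay(ops, cache, load_queue, use_load_queue):
    """Re-execute ops after a rollback to the start of the region.

    Assumption: every store in ops has already completed, so stores are
    skipped. Each load takes its recorded value from the load queue, or
    re-reads the cache when use_load_queue is False (for comparison)."""
    recorded = iter(load_queue)
    values = []
    for op in ops:
        if op[0] == "load":
            logged = next(recorded)
            values.append(logged if use_load_queue else cache[op[1]])
    return values


cache = {0x40: 1}
ops = [("load", 0x40), ("store", 0x40, 2)]   # the load is older than the store

load_queue = execute(ops, cache)              # load sees 1, then cache[0x40] becomes 2
from_queue = replay(ops, cache, load_queue, use_load_queue=True)
from_cache = replay(ops, cache, load_queue, use_load_queue=False)
```

Here the replayed load recovers its original value only from the load queue; re-reading the cache returns the value written by the younger, already completed store.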


