ABSTRACT
Debuggers have proven indispensable in improving software reliability. Unfortunately, on most real-life software, debuggers fail to deliver their most essential feature: a faithful replay of the execution. The reason is non-determinism caused by multithreading and non-repeatable inputs. A common solution to faithful replay has been to record the non-deterministic execution. Existing recorders, however, either work only for data-race-free programs or have prohibitive overhead.

As a step towards powerful debugging, we develop a practical low-overhead hardware recorder for cache-coherent multiprocessors, called Flight Data Recorder (FDR). Like an aircraft flight data recorder, FDR continuously records the execution, even on deployed systems, logging the execution for post-mortem analysis.

FDR is practical because it piggybacks on the cache coherence hardware and logs nearly the minimal thread-ordering information necessary to faithfully replay the multiprocessor execution. Our studies, based on simulating a four-processor server with commercial workloads, show that when allocated less than 7% of the system's physical memory, our FDR design can capture the last one second of the execution at modest (less than 2%) slowdown.