PRES: probabilistic replay with execution sketching on multiprocessors
Source
ACM Symposium on Operating Systems Principles
Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles
Big Sky, Montana, USA
SESSION: Parallel debugging
Pages 177-192
Year of Publication: 2009
ISBN: 978-1-60558-752-3
Authors
Soyeon Park  University of California, San Diego, La Jolla, USA
Yuanyuan Zhou  University of California, San Diego, La Jolla, USA
Weiwei Xiong  University of Illinois at Urbana Champaign, Urbana, USA
Zuoning Yin  University of Illinois at Urbana Champaign, Urbana, USA
Rini Kaushik  University of Illinois at Urbana Champaign, Urbana, USA
Kyu H. Lee  Purdue University, West Lafayette, USA
Shan Lu  University of Wisconsin-Madison, Madison, USA
Sponsors
ACM: Association for Computing Machinery
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1629575.1629593

ABSTRACT

Bug reproduction is critically important for diagnosing a production-run failure. Unfortunately, reproducing a concurrency bug on multi-processors (e.g., multi-core) is challenging. Previous techniques either incur large overhead or require new non-trivial hardware extensions.

This paper proposes a novel technique called PRES (probabilistic replay via execution sketching) to help reproduce concurrency bugs on multi-processors. It relaxes the past (perhaps idealistic) objective of "reproducing the bug on the first replay attempt" to significantly lower production-run recording overhead. This is achieved by (1) recording only partial execution information (referred to as "sketches") during the production run, and (2) relying on an intelligent replayer during diagnosis time (when performance is less critical) to systematically explore the unrecorded non-deterministic space and reproduce the bug. With only partial information, our replayer may require more than one coordinated replay run to reproduce a bug. However, after a bug is reproduced once, PRES can reproduce it every time.

We implemented PRES along with five different execution sketching mechanisms. We evaluated them with 11 representative applications, including 4 servers, 3 desktop/client applications, and 4 scientific/graphics applications, with 13 real-world concurrency bugs of different types, including atomicity violations, order violations and deadlocks. PRES (with synchronization or system call sketching) significantly lowered the production-run recording overhead of previous approaches (by up to 4416 times), while still reproducing most tested bugs in fewer than 10 replay attempts. Moreover, PRES scaled well with the number of processors, and its feedback generation from unsuccessful replays proved critical to bug reproduction.
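
The split the abstract describes, between cheap partial recording and a replayer that searches the unrecorded interleavings, can be illustrated with a toy model. The Python sketch below is not the PRES implementation: the two-thread counter program, the function names (run, take_sketch, replay), and the choice of "order of write steps" as the sketch are all assumptions made for illustration. The search is a plain enumeration and does not model the feedback generation from unsuccessful replays that the paper reports.

```python
from itertools import permutations

# Toy program (hypothetical): every thread performs a non-atomic increment
# of one shared counter, split into a "read" step and a "write" step.
STEPS = ("read", "write")


def run(schedule):
    """Execute one interleaving (a sequence of thread ids); return the final counter."""
    counter, local, pc = 0, {}, {}
    for tid in schedule:
        step = STEPS[pc.get(tid, 0)]
        if step == "read":
            local[tid] = counter          # thread copies the shared value
        else:
            counter = local[tid] + 1      # thread writes back a possibly stale value
        pc[tid] = pc.get(tid, 0) + 1
    return counter


def take_sketch(schedule):
    """Partial record (an assumed sketch format): only the order of write steps."""
    seen, writes = set(), []
    for tid in schedule:
        if tid in seen:                   # a thread's second step is its write
            writes.append(tid)
        seen.add(tid)
    return tuple(writes)


def replay(sketch, threads, is_failure):
    """Try interleavings that agree with the sketch until the failure reappears."""
    attempts = 0
    for schedule in sorted(set(permutations(threads * 2))):
        if take_sketch(schedule) != sketch:
            continue                      # contradicts the recording; never attempted
        attempts += 1
        if is_failure(run(schedule)):
            return schedule, attempts     # complete schedule, replayable from now on
    return None, attempts


# "Production run": both threads read before either writes, so one update is lost.
production = ("T1", "T2", "T1", "T2")
sketch = take_sketch(production)
schedule, attempts = replay(sketch, ["T1", "T2"], lambda final: final != 2)
print(sketch, schedule, attempts)
```

In this toy the sketch rules out half of the six possible interleavings, and the failing one is found on the second attempt. Once replay returns a complete schedule, calling run(schedule) again gives the same failing result every time, which mirrors the abstract's point that a bug reproduced once can be reproduced on every later run.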

