Triage: diagnosing production run failures at the user's site
Full text: Flv (33:15), Mp3 (13.97 MB), Pdf (292 KB)
Source
ACM Symposium on Operating Systems Principles archive
Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles table of contents
Stevenson, Washington, USA
SESSION: Software robustness table of contents
Pages: 131 - 144  
Year of Publication: 2007
ISBN: 978-1-59593-591-5
Also published in ...
Authors
Joseph Tucek  University of Illinois at Urbana Champaign, Urbana, IL
Shan Lu  University of Illinois at Urbana Champaign, Urbana, IL
Chengdu Huang  University of Illinois at Urbana Champaign, Urbana, IL
Spiros Xanthos  University of Illinois at Urbana Champaign, Urbana, IL
Yuanyuan Zhou  University of Illinois at Urbana Champaign, Urbana, IL
Sponsors
ACM: Association for Computing Machinery
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 28,   Downloads (12 Months): 149,   Citation Count: 9
DOI: http://doi.acm.org/10.1145/1294261.1294275

APPENDICES and SUPPLEMENTS
p131-slides.zip (24.73 MB): Supplemental material for Triage: diagnosing production run failures at the user's site


ABSTRACT

Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e. development site diagnosis with the programmers present. This is insufficient for production-run failures as: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production run failure; and (4) privacy concerns limit the release of information (e.g. coredumps) to programmers.

To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment and repeatedly replay the moment of failure, and dynamically, using different diagnosis techniques, analyze an occurring failure. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure related conditions, code, and variables.

We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead to under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.
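
The abstract names delta analysis but does not describe how it works. As a rough illustration of the general idea of comparing a failing replay against a non-failing one, the sketch below diffs two traces of executed code locations and reports where they diverge. It is not the authors' implementation: the function `trace_delta`, the trace format, and the location names are all hypothetical, and Python's standard `difflib` stands in for whatever differencing Triage actually uses.

```python
from difflib import SequenceMatcher


def trace_delta(passing, failing):
    """Return the segments where a failing trace departs from a passing one.

    Both arguments are lists of code-location labels (a hypothetical format).
    Each result is (tag, passing_segment, failing_segment), where tag is one
    of difflib's opcode names: "replace", "delete", or "insert".
    """
    matcher = SequenceMatcher(a=passing, b=failing, autojunk=False)
    deltas = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only the parts that differ
            deltas.append((tag, passing[i1:i2], failing[j1:j2]))
    return deltas


# Hypothetical traces from two replays of the same request.
passing_trace = ["parse", "check_len", "copy", "respond"]
failing_trace = ["parse", "copy", "overflow", "crash"]

for tag, only_passing, only_failing in trace_delta(passing_trace, failing_trace):
    print(tag, only_passing, only_failing)
```

In this made-up example the failing run skips `check_len`, which is the kind of failure-related code difference such a comparison is meant to surface.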


