ABSTRACT
Diagnosing production-run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e., diagnosis at the development site with the programmers present. This is insufficient for production-run failures because: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production-run failure; and (4) privacy concerns limit the release of information (e.g., coredumps) to programmers. To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment, repeatedly replay the moment of failure, and dynamically analyze an occurring failure using different diagnosis techniques. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework to enable the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure-related conditions, code, and variables. We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications, including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.
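The abstract does not spell out how delta analysis works, so the following is only an illustrative sketch of the general idea of comparing a failing replay against a non-failing one. It assumes each replay has been reduced to a list of labels (for example, names of executed code regions); the function name `trace_delta` and the sample traces are hypothetical, and Python's standard `difflib` is used as a stand-in rather than whatever differencing method Triage itself applies.

```python
from difflib import SequenceMatcher


def trace_delta(passing_trace, failing_trace):
    """Compare two execution traces (lists of labels).

    Returns a pair (only_in_passing, only_in_failing) holding the
    steps that the alignment could not match between the two runs.
    This is an assumed simplification, not Triage's actual algorithm.
    """
    matcher = SequenceMatcher(a=passing_trace, b=failing_trace, autojunk=False)
    only_in_passing = []
    only_in_failing = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            # Steps the passing run executed but the failing run did not.
            only_in_passing.extend(passing_trace[i1:i2])
        if tag in ("insert", "replace"):
            # Steps the failing run executed but the passing run did not.
            only_in_failing.extend(failing_trace[j1:j2])
    return only_in_passing, only_in_failing


# Hypothetical traces from two replays that start at the same checkpoint.
passing = ["parse_header", "check_length", "copy_body", "send_reply"]
failing = ["parse_header", "copy_body", "overflow_handler"]

missing, extra = trace_delta(passing, failing)
print("skipped in failing run:", missing)
print("extra in failing run:  ", extra)
```

In this made-up example the difference points at a skipped `check_length` step, which is the kind of failure-related code a diagnosis report might highlight for the programmer.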