ABSTRACT
Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.

We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.