Performance debugging for distributed systems of black boxes
Source: ACM Symposium on Operating Systems Principles
Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles
Bolton Landing, NY, USA
SESSION: Probing the black box
Pages: 74 - 89  
Year of Publication: 2003
ISBN:1-58113-757-5
Authors
Marcos K. Aguilera  HP Labs, Palo Alto, CA
Jeffrey C. Mogul  HP Labs, Palo Alto, CA
Janet L. Wiener  HP Labs, Palo Alto, CA
Patrick Reynolds  Duke University
Athicha Muthitacharoen  MIT Lab for Computer Science, MA
Sponsors
SIGOPS: ACM Special Interest Group on Operating Systems
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 30,   Downloads (12 Months): 197,   Citation Count: 76
DOI: http://doi.acm.org/10.1145/945445.945454

ABSTRACT

Many interesting large-scale systems are distributed systems of multiple communicating components. Such systems can be very hard to debug, especially when they exhibit poor performance. The problem becomes much harder when systems are composed of "black-box" components: software from many different (perhaps competing) vendors, usually without source code available. Typical solutions-provider employees are not always skilled or experienced enough to debug these systems efficiently. Our goal is to design tools that enable modestly-skilled programmers (and experts, too) to isolate performance bottlenecks in distributed systems composed of black-box nodes.

We approach this problem by obtaining message-level traces of system activity, as passively as possible and without any knowledge of node internals or message semantics. We have developed two very different algorithms for inferring the dominant causal paths through a distributed system from these traces. One uses timing information from RPC messages to infer inter-call causality; the other uses signal-processing techniques. Our algorithms can ascribe delay to specific nodes on specific causal paths. Unlike previous approaches to similar problems, our approach requires no modifications to applications, middleware, or messages.
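As a rough illustration of the first idea in the abstract (using RPC timing to infer inter-call causality and to ascribe delay to nodes), the sketch below treats a call as nested in another when it is issued by the other call's receiver and falls entirely inside that call's call/return interval. This is a simplified toy, not the paper's algorithm: the `Call` record, the "latest-starting enclosing call wins" tie-break, and the assumption that nested calls run sequentially with negligible network delay are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Call:
    """One RPC seen in a trace: a call message paired with its return (hypothetical record)."""
    src: str          # node that sent the call
    dst: str          # node that received the call
    t_call: float     # timestamp of the call message
    t_return: float   # timestamp of the matching return message
    children: List["Call"] = field(default_factory=list)


def infer_parent(child: Call, calls: List[Call]) -> Optional[Call]:
    """Guess which call caused `child`, using only node names and timestamps."""
    candidates = [
        p for p in calls
        if p is not child
        and p.dst == child.src                # parent was being served by child's sender
        and p.t_call <= child.t_call          # child starts inside the parent's interval
        and child.t_return <= p.t_return      # ... and finishes inside it
    ]
    # Assumed tie-break: prefer the enclosing call that started most recently.
    return max(candidates, key=lambda p: p.t_call, default=None)


def build_forest(calls: List[Call]) -> List[Call]:
    """Link each call to its inferred parent; return the calls that have none."""
    for c in calls:
        c.children = []
    roots = []
    for c in calls:
        parent = infer_parent(c, calls)
        if parent is None:
            roots.append(c)
        else:
            parent.children.append(c)
    return roots


def delay_per_node(calls: List[Call]) -> Dict[str, float]:
    """Ascribe to each receiving node the time not covered by its nested calls.

    Assumes build_forest() has run, and that nested calls do not overlap.
    """
    delay: Dict[str, float] = {}
    for c in calls:
        nested = sum(k.t_return - k.t_call for k in c.children)
        delay[c.dst] = delay.get(c.dst, 0.0) + (c.t_return - c.t_call) - nested
    return delay


# Made-up three-tier trace: client -> web -> app -> db.
trace = [
    Call("client", "web", 0.0, 10.0),
    Call("web", "app", 1.0, 8.0),
    Call("app", "db", 2.0, 6.0),
]
build_forest(trace)
print(delay_per_node(trace))
```

With many concurrent requests, several enclosing calls can be plausible parents for the same nested call, so a single tie-break like the one above can guess wrong; it is shown only to make the timing-based idea concrete.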


