Detecting large-scale system problems by mining console logs
Source: ACM Symposium on Operating Systems Principles. Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP '09), Big Sky, Montana, USA
Session: Debugging
Pages: 117-132
Year of Publication: 2009
ISBN: 978-1-60558-752-3

Authors:
Wei Xu, University of California at Berkeley, Berkeley, CA, USA
Ling Huang, Intel Labs Berkeley, Berkeley, CA, USA
Armando Fox, University of California at Berkeley, Berkeley, CA, USA
David Patterson, University of California at Berkeley, Berkeley, CA, USA
Michael I. Jordan, University of California at Berkeley, Berkeley, CA, USA

ABSTRACT
Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, because they often consist of voluminous, intermixed messages from many software components written by independent developers. We propose a general methodology for mining this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis into an operator-friendly, one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
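To make the feature-creation step more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the message templates are already known: the paper recovers them from source code, whereas here three hypothetical regular expressions, loosely modeled on HDFS DataNode block messages, stand in for them. Parsed messages are grouped by the identifier they mention (a block ID such as `blk_1`). Each group then becomes a vector that counts how often each template occurs, in the spirit of the paper's message count vectors. The names `TEMPLATES`, `parse`, and `message_count_vectors` are illustrative only. The machine-learning detection step is not shown; the full paper applies PCA-based anomaly detection to such vectors.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates (assumption: in the paper these are derived
# from the program's source code; here they are hand-written regexes).
TEMPLATES = {
    "receiving": re.compile(r"Receiving block (?P<blk>blk_-?\d+) src: \S+ dest: \S+"),
    "received": re.compile(r"Received block (?P<blk>blk_-?\d+) of size \d+ from \S+"),
    "deleting": re.compile(r"Deleting block (?P<blk>blk_-?\d+) file \S+"),
}


def parse(line):
    """Return (template_name, block_id) for the first matching template, else None."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.group("blk")
    return None


def message_count_vectors(lines):
    """Group parsed messages by block ID and count each template per group.

    Returns (columns, vectors), where columns fixes the template order and
    vectors maps each block ID to a list of counts in that order.
    """
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue  # lines matching no template are skipped in this sketch
        name, block_id = parsed
        counts[block_id][name] += 1
    columns = sorted(TEMPLATES)
    vectors = {bid: [c[t] for t in columns] for bid, c in counts.items()}
    return columns, vectors


if __name__ == "__main__":
    sample = [
        "Receiving block blk_1 src: /10.0.0.1:5000 dest: /10.0.0.2:50010",
        "Received block blk_1 of size 67108864 from /10.0.0.1",
        "Receiving block blk_2 src: /10.0.0.1:5000 dest: /10.0.0.3:50010",
        "some line that matches no template",
    ]
    columns, vectors = message_count_vectors(sample)
    print(columns)  # ['deleting', 'received', 'receiving']
    print(vectors)  # {'blk_1': [0, 1, 1], 'blk_2': [0, 0, 1]}
```

Under these assumptions, `blk_2` shows a "receiving" message with no matching "received" message. Such a block would stand out from the many blocks that show both, which is the kind of deviation a detector over these vectors is meant to flag.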