Detecting large-scale system problems by mining console logs
Source
ACM Symposium on Operating Systems Principles
Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles
Big Sky, Montana, USA
Session: Debugging
Pages 117-132
Year of Publication: 2009
ISBN:978-1-60558-752-3
Authors
Wei Xu  University of California at Berkeley, Berkeley, CA, USA
Ling Huang  Intel Labs Berkeley, Berkeley, CA, USA
Armando Fox  University of California at Berkeley, Berkeley, CA, USA
David Patterson  University of California at Berkeley, Berkeley, CA, USA
Michael I. Jordan  University of California at Berkeley, Berkeley, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/1629575.1629587

ABSTRACT

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
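
The parse-then-analyze pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the message templates are hand-written regular expressions rather than templates recovered by source code analysis, the log lines and block identifiers are invented, and the final step is a simple rarity check standing in for the machine learning analysis, which the abstract does not specify. All names (`TEMPLATES`, `parse_line`, `count_vectors`, `rare_vectors`) are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates. The abstract says templates come from
# source code analysis; these are hand-written for illustration only.
TEMPLATES = {
    "receive": re.compile(r"Receiving block (?P<id>blk_\d+)"),
    "store": re.compile(r"Stored block (?P<id>blk_\d+)"),
    "delete": re.compile(r"Deleting block (?P<id>blk_\d+)"),
}
MESSAGE_TYPES = sorted(TEMPLATES)  # fixed column order: delete, receive, store


def parse_line(line):
    """Return (message_type, identifier), or None if no template matches."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.group("id")
    return None


def count_vectors(lines):
    """Build one message-count vector per identifier (a composite feature)."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed:
            name, ident = parsed
            counts[ident][name] += 1
    return {
        ident: tuple(c[t] for t in MESSAGE_TYPES) for ident, c in counts.items()
    }


def rare_vectors(vectors, max_share=0.2):
    """Flag identifiers whose count vector is shared by few other identifiers.

    This is a stand-in for the learning step, assumed here for brevity.
    """
    freq = Counter(vectors.values())
    total = len(vectors)
    return sorted(i for i, v in vectors.items() if freq[v] / total <= max_share)


# Invented sample log: five blocks follow the usual receive/store pattern,
# and one block is received but never stored.
LOG = []
for n in range(1, 6):
    LOG += [f"Receiving block blk_{n}", f"Stored block blk_{n}"]
LOG.append("Receiving block blk_6")

if __name__ == "__main__":
    vectors = count_vectors(LOG)
    print(rare_vectors(vectors))
```

Grouping messages by an identifier before counting is what turns free-text lines into numeric features; any detector that works on fixed-length vectors could replace `rare_vectors` in this sketch.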

