Conclusion
The applications and examples in
this article demonstrate the degree
to which system management has become log-centric. Whether used for
debugging problems or provisioning
resources, logs contain a wealth of information that can pinpoint, or at least
implicate, solutions.
Although log-analysis techniques
have made much progress recently,
several challenges remain. First, as systems become increasingly composed of
many, often distributed, components,
using a single log file to monitor events
from different parts of the system is difficult. In some scenarios logs from entirely different systems must be cross-correlated for analysis. For example,
a support organization may correlate
phone-call logs with Web-access logs to
track how well the online documentation for a product addresses frequently
asked questions and how many customers concurrently search the online documentation during a support call. Interleaving heterogeneous logs is seldom
straightforward, especially when timestamps are unsynchronized or missing
from some logs and when semantics are
inconsistent across components.
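As a minimal sketch of the interleaving problem, the following Python merges a phone-call log and a Web-access log into one timeline. Both line formats, the function names, and the idea of a single known clock skew for the Web server are assumptions made for illustration; real logs rarely offer such a clean correction.

```python
import heapq
from datetime import datetime, timedelta, timezone

def parse_call_log(line):
    # Hypothetical format: "2012-02-01T10:15:00Z call_start customer=42"
    stamp, _, rest = line.partition(" ")
    ts = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return ts, "phone", rest

def parse_web_log(line, clock_skew=timedelta(0)):
    # Hypothetical format: "1328091330 GET /docs/faq" (epoch seconds, UTC)
    stamp, _, rest = line.partition(" ")
    ts = datetime.fromtimestamp(int(stamp), tz=timezone.utc)
    # Subtract the assumed amount by which the Web server's clock runs fast.
    return ts - clock_skew, "web", rest

def interleave(call_lines, web_lines, web_clock_skew=timedelta(0)):
    # Each input is assumed to be sorted by time already, as log files usually are.
    calls = (parse_call_log(line) for line in call_lines)
    web = (parse_web_log(line, web_clock_skew) for line in web_lines)
    # heapq.merge lazily merges sorted inputs; the key compares timestamps only.
    return list(heapq.merge(calls, web, key=lambda event: event[0]))

calls = [
    "2012-02-01T10:15:00Z call_start customer=42",
    "2012-02-01T10:20:00Z call_end customer=42",
]
web = [
    "1328091330 GET /docs/faq",
    "1328091630 GET /docs/install",
]
for ts, source, text in interleave(calls, web, timedelta(seconds=60)):
    print(ts.isoformat(), source, text)
```

A 60-second skew is enough to change which event appears to come first, which is why unsynchronized clocks make cross-log conclusions fragile.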
Second, the logging process itself
requires additional management.
Controlling the verbosity of logging is
important, especially in the event of
spikes or potential adversarial behavior, to manage overhead and facilitate analysis. Nor should the logging mechanism
become a channel for propagating malicious activity. It remains a
challenge to minimize instrumentation overhead while maximizing information content.
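One way to bound verbosity during a spike is to cap how many low-severity records pass through per time window. The sketch below uses the standard Python logging module's filter hook; the class name, its parameters, and the policy of always letting errors through are illustrative assumptions, not a recommendation from this article.

```python
import logging
import time

class BurstLimitFilter(logging.Filter):
    """Pass at most max_records low-severity records per window (hypothetical policy)."""

    def __init__(self, max_records, window_seconds, clock=time.monotonic):
        super().__init__()
        self.max_records = max_records
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the policy can be tested
        self.window_start = clock()
        self.count = 0
        self.suppressed = 0         # records dropped so far, for later reporting

    def filter(self, record):
        now = self.clock()
        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.count = 0
        # Errors and above always pass and do not count against the budget.
        if record.levelno >= logging.ERROR:
            return True
        self.count += 1
        if self.count > self.max_records:
            self.suppressed += 1
            return False
        return True

logger = logging.getLogger("demo")
logger.addFilter(BurstLimitFilter(max_records=100, window_seconds=1.0))
```

The trade-off named above is visible here: a tighter budget lowers overhead but may discard exactly the messages that would have explained the spike.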
A third challenge is that although
various analytical and statistical modeling techniques can mine large quantities of log data, they do not always
provide actionable insights. For example, statistical techniques could reveal
an anomaly in the workload, or that the
system’s CPU utilization is high, without
explaining what to do about it. The interpretation of the information is subjective, and whether the information
is actionable or not depends on many
factors. It is important to investigate
techniques that trade off efficiency, accuracy, and actionability.
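The gap between detection and action shows up even in a very simple detector. The following sketch flags CPU-utilization samples that lie more than three standard deviations from a baseline mean; the function name, the threshold, and the sample values are made up for illustration, and z-scoring stands in for the more sophisticated techniques discussed in this article.

```python
from statistics import mean, stdev

def flag_anomalies(baseline, samples, threshold=3.0):
    """Return (index, value) pairs whose z-score against the baseline exceeds threshold."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return []  # a constant baseline gives no scale to measure deviation against
    return [(i, x) for i, x in enumerate(samples) if abs(x - mu) / sigma > threshold]

# Hypothetical CPU utilization percentages.
baseline = [40, 42, 38, 41, 39, 40, 43, 37]
samples = [41, 44, 97, 39]
print(flag_anomalies(baseline, samples))
```

The output identifies which sample is unusual and nothing more: it does not say whether the cause is a runaway process, a legitimate load spike, or a misconfiguration, so the decision about what to do still falls to a person.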
There are several promising research
directions. Since humans will likely remain
a part of the process of interpreting
and acting on logs for the foreseeable
future, advances in visualization
techniques should prove worthwhile.
Related articles
on queue.acm.org
Modern Performance Monitoring
Mark Purdy
http://queue.acm.org/detail.cfm?id=1117404
Network Forensics
Ben Laurie
http://queue.acm.org/detail.cfm?id=1016982
The Pathologies of Big Data
Adam Jacobs
http://queue.acm.org/detail.cfm?id=1563874
Adam Oliner is a postdoctoral scholar in electrical
engineering and computer sciences at UC Berkeley,
working with Ion Stoica and the AMP (Algorithms,
Machines, and People) Lab.
Archana Ganapathi is a research engineer at Splunk,
where she focuses on large-scale data analytics. She has
spent much of her research career analyzing production
datasets to model system behavior.
Wei Xu is a software engineer at Google, where he works
on Google’s debug logging and monitoring infrastructure.
His research interest is in cluster management and
debugging.
© 2012 ACM 0001-0782/12/02 $10.00