practice
DOI: 10.1145/2890780
Article development led by queue.acm.org
Applying statistical techniques to operations data.
BY HEINRICH HARTMANN
MODERN IT SYSTEMS collect an increasing wealth
of data from network gear, operating systems,
applications, and other components. This data needs
to be analyzed to derive vital information about the
user experience and business performance. For
instance, faults need to be detected, service quality
needs to be measured, and resource usage for the coming
days and months needs to be forecast.
Rule #1: Spend more time working on code that
analyzes the meaning of metrics, than code that collects,
moves, stores and displays metrics.
Adrian Cockcroft¹
Statistics is the art of extracting information
from data, and hence is an essential tool
for operating modern IT systems. Despite a rising
awareness of this fact within the community (see
the quote above), resources for learning the relevant
statistical methods for this domain are hard to find.
The statistics courses offered in universities usually
depend on their students having prior knowledge of
probability, measure, and set theory, which is a high
barrier to entry. Even worse, these courses
often focus on parametric methods,
such as t-tests, that are inadequate for
this kind of analysis since they rely on
strong assumptions about the distribution
of data (for example, normality) that are
not met by operations data.
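To make the point about normality concrete, here is a minimal sketch using Python's standard `statistics` module. The latency values are hypothetical, chosen only to mimic the skewed shape typical of request latencies: many fast requests and a few very slow ones. It compares the interval "mean plus or minus two standard deviations", which covers about 95% of the data only if the data is normally distributed, with the median.

```python
import statistics

# Hypothetical request latencies in milliseconds (illustrative values, not
# measurements): 95 fast requests between 10 and 14 ms plus 5 slow outliers.
latencies_ms = [10, 11, 12, 13, 14] * 19 + [1000] * 5

mean = statistics.mean(latencies_ms)
sd = statistics.pstdev(latencies_ms)  # population standard deviation
median = statistics.median(latencies_ms)

# Under a normality assumption, roughly 95% of values would fall within
# mean +/- 2 standard deviations. For this skewed sample the lower bound
# is a negative latency, which cannot occur.
normal_lower = mean - 2 * sd
normal_upper = mean + 2 * sd

print(f"mean={mean:.1f} ms, median={median} ms, sd={sd:.1f} ms")
print(f"normal-theory range: [{normal_lower:.1f}, {normal_upper:.1f}] ms")
```

Here the median (12 ms) stays with the bulk of the requests, while five outliers pull the mean up to 61.4 ms, and the normal-theory interval extends below zero.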
This lack of relevance of classical,
parametric statistics can be explained
by history. The origins of statistics
reach back to the 17th century, when
computation was expensive and data
was a scarce resource, leading mathematicians to spend a lot of effort
avoiding calculations.
Today the situation has changed radically, which allows different approaches
to statistical problems. Consider this
example from a textbook² used in a
university statistics class:
A fruit merchant gets a delivery of
10,000 oranges. He wants to know how
many of those are rotten. To find out he
takes a sample of 50 oranges and counts
the number of rotten ones. What can he deduce about the total number
of rotten oranges?
The chapter goes on to explain various inference methods. The example
translated to the IT domain could go as:
The abundance of computing resources has completely eliminated the
need for elaborate estimations.
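The contrast can be sketched in a few lines of Python. This is a minimal illustration, assuming every request duration has been recorded; the data set, the one-second threshold, and the helper names `count_exceeding` and `estimate_from_sample` are hypothetical. The exact answer is a plain count over all values, while the textbook-style approach inspects a sample of 50 and scales up. For reproducibility the sample takes every 200th value instead of a random draw.

```python
# Hypothetical full set of 10,000 request durations in milliseconds:
# every 97th request is slow (1,500 ms), all others take 20 ms.
durations_ms = [1500 if i % 97 == 0 else 20 for i in range(10_000)]


def count_exceeding(values, threshold):
    """Exact count of values strictly above threshold; no estimation needed."""
    return sum(1 for v in values if v > threshold)


def estimate_from_sample(values, threshold, sample_size):
    """Textbook-style estimate: count within a small sample and scale up.

    Uses a systematic sample (every k-th value) so the result is deterministic.
    """
    step = len(values) // sample_size
    sample = values[::step][:sample_size]
    return count_exceeding(sample, threshold) * len(values) / sample_size


exact = count_exceeding(durations_ms, 1000)
estimate = estimate_from_sample(durations_ms, 1000, 50)
print(exact, estimate)  # prints: 104 200.0
```

With all the data at hand, the exact count costs one pass over the values, and there is no sampling error to reason about.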
Therefore, this article takes a different approach to statistics. Instead of
presenting textbook material on inferential statistics, we will walk through
four sections on descriptive statistical
methods that are accessible and relevant to the case at hand. I will discuss
several visualization methods, gain a
precise understanding of how to summarize data with histograms, review classical summary statistics, and see how
to replace them with robust, quantile-based alternatives. I have tried to keep
prerequisite mathematical knowledge
to a minimum (for example, by provid-
Statistics
for Engineers