DO YOU EVER feel overwhelmed by an unending stream
of information? It can seem like a barrage of new
email and text messages demands constant attention,
and there are also phone calls to pick up, articles to
read, and knocks on the door to answer. Putting these
pieces together to keep track of what is important can
be a real challenge.
The same information overload is a concern in
many computational settings. Telecommunications
companies, for example, want to keep track of the
activity on their networks, to identify overall network
health and spot anomalies or changes in behavior. Yet,
the scale of events occurring is huge: many millions of
network events per hour, per network element. While
new technologies allow the scale and granularity
of events being monitored to increase by orders of
magnitude, the capacity of computing elements
(processors, memory, and disks) to make sense of
this information is barely increasing. Even on a small
scale, the amount of information may
be too large to store in an impoverished
setting (say, an embedded device) or to
keep conveniently in fast storage.
In response to this challenge, the
model of streaming data processing
has grown in popularity. The aim is no
longer to capture, store, and index every
minute event, but rather to process
each observation quickly in order to
create a summary of the current state.
Once processed, an event is
dropped and is no longer accessible.
The summary that is retained is often
referred to as a sketch of the data.
Coping with the vast scale of information means making compromises:
The description of the world is approximate rather than exact; the nature of
queries to be answered must be decided in advance rather than after the fact;
and some questions are now insoluble.
The ability to process vast quantities of
data at blinding speeds with modest resources, however, can more than make
up for these limitations.
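The streaming model and its compromises can be made concrete with a toy summary: a small array of hashed counters. (The counter scheme, sizes, and event names here are invented for illustration; they are not the article's method, and real sketches are considerably more refined.)

```python
# A toy illustration of streaming processing: each event updates a
# small, fixed-size summary and is then discarded.
WIDTH = 16                       # fixed memory budget, independent of stream length
counters = [0] * WIDTH

def process(event):
    """Fold one event into the summary; the event itself is not stored."""
    counters[hash(event) % WIDTH] += 1

def estimate(event):
    """Approximate frequency: exact unless other events hashed to the same cell."""
    return counters[hash(event) % WIDTH]

for ev in ["login", "click", "click", "error", "click"]:
    process(ev)

# The summary answers the pre-decided query "how often did X occur?"
# only approximately (hash collisions cause overestimates), and the
# original event sequence can no longer be reconstructed from it.
print(estimate("click"))         # at least 3
```

The fixed `WIDTH` captures the central trade-off: memory stays constant no matter how long the stream grows, at the cost of exact answers and of any query not planned for in advance.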
As a consequence, streaming methods have been adopted in a number
of domains, starting with telecommunications but spreading to search
engines, social networks, finance, and
time-series analysis. These ideas are
also finding application in areas where
traditional exact approaches could be used,
but the rough-and-ready sketching approach
is more cost-effective. Successful applications of sketching involve a mixture
of algorithmic tricks, systems know-how, and mathematical insight, and
have led to new research contributions
in each of these areas.
This article introduces the ideas behind sketching, with a focus on algorithmic innovations. It describes some
algorithmic developments in the abstract, followed by the steps needed to
put them into practice, with examples.
The article also looks at four novel algorithmic ideas and discusses some
When faced with a large amount of
information to process, there may be
BY GRAHAM CORMODE