technical Perspective
sketches Get sketchier
By Peter J. Haas
are Data s YnOPses—such as the hash-based sketches discussed by Li and
König in the following paper—still
needed for querying massive datasets?
After all, disk and memory are cheap,
and both modern multicore processors and data-processing platforms
such as Hadoop are rendering Web
scale datasets ever more amenable to
parallel and distributed processing
techniques. The answer to this question is a firm “yes!” Many important data-processing tasks, such as detecting
duplicate Web pages, are not embarrassingly parallel and involve a huge
number of operations. Even massively
parallelizable tasks face formidable
performance challenges: it has long
been observed that data volumes are
growing at faster rates than computing power, and this trend continues.
Moreover, methods that rely solely on
parallelism can be expensive. Indeed,
under evolving “platform as a service”
models for cloud computing, users
will pay costs that directly reflect the
computing resources they use. In this
setting, use of synopsis techniques can
lead to significant cost savings, as well
as to energy savings and greener computing. The need to process streaming data exacerbates the pressures on
data management systems and makes
synopsis techniques even more appealing. So synopses such as samples, histograms, wavelets, and sketches will
continue to play a key role in information management, data mining, and
knowledge discovery. Reducing synopses’ time and space requirements thus
remains an important challenge.
The paper highlights the fascinat-
ing interplay between two synopses
techniques: hashing and sampling.
Random samples have long served as
useful synopses because of their flex-
ibility: a given sample can be used
to estimate many different charac-
teristics of the original dataset. This
flexibility often comes at the price of
limited accuracy, however. In particu-
lar, sampling does poorly at finding
“needles in haystacks” and so fails at,
for example, estimating the number
of distinct values in a “skewed” data-
set containing a few values that each
occur millions of times along with a
few hundred values that each occur
once. Hash-based sketches, on the
other hand, are typically applicable to
only one particular task, but perform
that task extremely well; a famous ex-
ample is the Flajolet-Martin sketch for
the number of distinct values, based
on a single pass through the dataset.
In general, hash-based sketches have
proven to be extremely well suited for
streaming applications, and for this
reason have received much attention
in recent years.
Peter J. haas ( peterh@almaden.ibm.com) is a research
staff member at Ibm almaden research center, san
Jose, ca.