which rapidly becomes unsustainable
for large matrices. The competing approach of random projections argues
that rather than finding “the best” directions, it suffices to use (a slightly
larger number of) random vectors.
Picking a moderate number of random directions captures a comparable
amount of variation, while requiring
much less computation.
The random projection of each row
of the data matrix can be seen as an example of a sketch of the data. More directly, close connections exist between
random projections and the sketches
described earlier. The Count-Min sketch
can be viewed as a random projection of
sorts; moreover, the best constructions
of random projections for dimensionality reduction look a lot like Count-Min sketches with some twists (such as
randomly multiplying each column of
the matrix by either -1 or 1). This is the
basis of methods for speeding up high-dimensional machine learning, such as
the Hash Kernels approach.14
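As a rough illustration of the idea (a minimal sketch, not the production Hash Kernels implementation; the bucket width and use of MD5 here are arbitrary choices for the example), each feature name is hashed to one of a small number of buckets and given a pseudo-random ±1 sign, so that colliding features cancel in expectation and inner products between sketches approximate inner products between the original high-dimensional feature vectors:

```python
import hashlib

def hash_kernel(features, width=16):
    """Project a sparse feature map {name: value} down to `width` buckets."""
    sketch = [0.0] * width
    for name, value in features.items():
        digest = hashlib.md5(name.encode()).digest()
        bucket = digest[0] % width               # which coordinate to add into
        sign = 1 if digest[1] % 2 == 0 else -1   # pseudo-random +/-1 multiplier
        sketch[bucket] += sign * value
    return sketch

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Inner products between sketches approximate inner products between
# the original (much higher-dimensional) feature vectors.
doc1 = {"sketch": 2.0, "data": 1.0, "stream": 1.0}
doc2 = {"sketch": 1.0, "data": 3.0, "matrix": 1.0}
exact = sum(doc1[k] * doc2.get(k, 0.0) for k in doc1)  # = 5.0
approx = dot(hash_kernel(doc1), hash_kernel(doc2))
```

Features sharing a name land in the same bucket with the same sign, so their contribution to the inner product is preserved exactly; only cross-collisions between different names introduce (signed, and hence mostly cancelling) error.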
Randomized numerical linear algebra. A grand objective for sketching
is to allow arbitrary complex mathematical operations over large volumes
of data to be answered approximately
and quickly via sketches. While this
objective appears quite a long way off,
and perhaps infeasible because of some
impossibility results, a number of core
mathematical operations can be solved
using sketching ideas, which leads
to the notion of randomized numerical linear algebra. A simple example is
matrix multiplication: given two large
matrices A and B, you want to find their
product AB. An approach using sketching is to build a dimensionality-reducing sketch of each row of A and each column of B. Combining each pair of these
provides an estimate for each entry of
the product. Similar to other examples,
small answers are not well preserved,
but large entries are accurately found.
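A toy version of this scheme (the matrix sizes, seed, and the specific random ±1 sketch are illustrative assumptions, not a prescribed construction): compress the shared dimension of A and B with a random sign matrix, then multiply the small sketches to estimate entries of the product.

```python
import random

random.seed(7)

def sign_sketch_matrix(k, d):
    """k x d matrix of random +/-1 entries, scaled so that E[S^T S] = I."""
    scale = 1.0 / k ** 0.5
    return [[scale * random.choice((-1.0, 1.0)) for _ in range(d)]
            for _ in range(k)]

def matmul(X, Y):
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(row) for row in zip(*X)]

# A is n x d, B is d x m; we sketch the shared dimension d down to k.
d, k = 200, 100
A = [[1.0 if j % 2 == 0 else 0.0 for j in range(d)]]  # one row, for brevity
B = transpose([[1.0 for _ in range(d)]])              # one column of ones

S = sign_sketch_matrix(k, d)
A_sk = matmul(A, transpose(S))   # each row of A compressed to k numbers
B_sk = matmul(S, B)              # each column of B compressed to k numbers
estimate = matmul(A_sk, B_sk)[0][0]
exact = matmul(A, B)[0][0]       # = 100.0 (number of even indices in 0..199)
```

The estimate concentrates around the true entry, with error shrinking as the sketch size k grows; as the text notes, large entries are recovered well while small ones are lost in the noise.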
Other problems that have been tackled in this space include regression.
Here the input is a high-dimensional
dataset modeled as a matrix A and a column vector b: each row of A is a data
point, with the corresponding entry of
b the value associated with the row. The
goal is to find regression coefficients x
that minimize ||Ax - b||₂. An exact solution to this problem is possible but
costly in terms of time as a function of
the number of rows. Instead, applying
sketching to matrix A solves the problem in the lower-dimensional sketch
space.5 David Woodruff provides a
comprehensive mathematical survey
of the state of the art in this area.16
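The following sketch of the approach makes the idea concrete (the sizes, the random ±1 sketch matrix, and the noiseless data are assumptions of this toy example, not details from the cited work): compress the n rows of A and b down to k rows with random sign combinations, then solve the small k × d least-squares problem instead of the n × d one.

```python
import random

random.seed(1)

def solve(M, v):
    """Solve M x = v by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(A, b):
    """Coefficients minimising ||Ax - b||_2, via the normal equations."""
    d = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(d)] for i in range(d)]
    Atb = [sum(r[i] * bi for r, bi in zip(A, b)) for i in range(d)]
    return solve(AtA, Atb)

# n data points in d=2 dimensions, generated from known coefficients (3, -2).
n, k = 1000, 40
A = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(n)]
b = [3 * a1 - 2 * a2 for a1, a2 in A]

# Sketch: compress the n rows down to k rows of random +/-1 combinations,
# then regress in the small sketch space.
S = [[random.choice((-1.0, 1.0)) for _ in range(n)] for _ in range(k)]
A_sk = [[sum(S[r][i] * A[i][j] for i in range(n)) for j in range(2)]
        for r in range(k)]
b_sk = [sum(S[r][i] * b[i] for i in range(n)) for r in range(k)]

x_sketch = least_squares(A_sk, b_sk)
```

Because this toy b is generated without noise, the sketched solve recovers the coefficients (3, -2) almost exactly; with noisy data the sketch gives an approximate minimizer, at a cost that depends on k rather than on the number of rows n.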
Rich data: Graphs and geometry. The
applications of sketching so far can be
seen as summarizing data that might
be thought of as a high-dimensional
vector, or matrix. These mathematical
abstractions capture a large number of
situations, but, increasingly, a richer
model of data is desired—say, to model
links in a social network (best thought of
as a graph) or to measure movement patterns of mobile users (best thought of as
points in the plane or in 3D). Sketching
ideas have been applied here also.
For graphs, there are techniques
to summarize the adjacency information of each node, so that connectivity
and spanning tree information can be
extracted.1 These methods provide a
surprising mathematical insight that
much edge data can be compressed
while preserving fundamental information about the graph structure. These
techniques have not found significant
use in practice yet, perhaps because of
high overheads in the encoding size.
For geometric data, there has been
much interest in solving problems such
as clustering.9 The key idea here is that
clustering part of the input can capture
a lot of the overall structural information, and by merging clusters together
(clustering clusters) you can retain a
good picture of the overall point density.
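The "clustering clusters" idea can be sketched in a few lines. In this toy example (one-dimensional data, a simple weighted Lloyd's k-means, and arbitrary chunk sizes; all of these are assumptions of the illustration, not the method of the cited paper), each chunk of the input is clustered separately, and then the weighted chunk centroids are clustered again:

```python
def kmeans_1d(points, weights, k=2, iters=20):
    """Weighted Lloyd's algorithm on 1-D points: (centers, cluster weights)."""
    step = max(1, len(points) // k)
    centers = sorted(points)[::step][:k]   # crude spread-out initialisation
    for _ in range(iters):
        sums, wts = [0.0] * k, [0.0] * k
        for p, w in zip(points, weights):
            c = min(range(k), key=lambda i: abs(p - centers[i]))
            sums[c] += w * p
            wts[c] += w
        centers = [sums[i] / wts[i] if wts[i] else centers[i] for i in range(k)]
    return centers, wts

# Two well-separated groups of points, arriving in four chunks.
data = [0.9, 1.1, 1.0, 9.0, 0.95, 9.2, 8.8, 1.05, 9.1, 8.9, 1.0, 9.0]
chunks = [data[i::4] for i in range(4)]

# Stage 1: cluster each chunk separately, keeping centroids and weights.
summary_pts, summary_wts = [], []
for chunk in chunks:
    centers, wts = kmeans_1d(chunk, [1.0] * len(chunk))
    summary_pts.extend(centers)
    summary_wts.extend(wts)

# Stage 2: cluster the weighted chunk centroids ("clustering clusters").
final_centers, _ = kmeans_1d(summary_pts, summary_wts)
final_centers.sort()
```

The second stage sees only a handful of weighted points per chunk, yet recovers centers close to those of the full dataset, which is what makes the merge-and-reduce pattern attractive for streams too large to cluster in one pass.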
Why Should You Care?
The aim of this article has been to
introduce a selection of recent techniques that provide approximate answers to some general questions that
often occur in data analysis and manipulation. In all cases, simple alternative
approaches can provide exact answers,
at the expense of keeping complete
information. The examples shown
here have illustrated, however, that in
many cases the approximate approach
can be faster and more space efficient.
The use of these methods is growing.
Bloom filters are sometimes said to
be one of the core technologies that
“big data experts” must know. At the
very least, it is important to be aware
of sketching techniques to test claims
that solving a problem a certain way
is the only option. Often, fast approximate sketch-based techniques can provide a different trade-off.
1. Ahn, K.J., Guha, S., McGregor, A. Analyzing graph
structure via linear measurements. In Proceedings of the
ACM-SIAM Symposium on Discrete Algorithms (2012).
2. Bhagat, S., Burke, M., Diuk, C., Filiz, I.O., Edunov, S.
Three-and-a-half degrees of separation. Facebook
Research, 2016; https://research.fb.com/three-and-a-half-degrees-of-separation/.
3. Bloom, B. Space/time trade-offs in hash coding with
allowable errors. Commun. ACM 13, 7 (July 1970), 422–426.
4. Broder, A., Mitzenmacher, M. Network applications
of Bloom filters: a survey. Internet Mathematics 1, 4 (2004).
5. Clarkson, K.L., Woodruff, D.P. Low rank approximation
and regression in input sparsity time. In Proceedings
of the ACM Symposium on Theory of Computing (2013).
6. Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T.,
Spatscheck, O., Srivastava, D. Holistic UDAFs
at streaming speeds. In Proceedings of the ACM
SIGMOD International Conference on Management of
Data (2004), 35–46.
7. Cormode, G., Muthukrishnan, S. An improved data
stream summary: the Count-Min sketch and its
applications. J. Algorithms 55, 1 (2005), 58–75.
8. Flajolet, P., Martin, G.N. Probabilistic counting.
In Proceedings of the IEEE Conference on
Foundations of Computer Science (1985), 76–82. Also
in J. Computer and System Sciences 31 (1985), 182–209.
9. Guha, S., Mishra, N., Motwani, R., O’Callaghan, L.
Clustering data streams. In Proceedings of the IEEE
Conference on Foundations of Computer Science, 2000.
10. Heule, S., Nunkesser, M., Hall, A. HyperLogLog in
practice: Algorithmic engineering of a state of the art
cardinality estimation algorithm. In Proceedings of
the International Conference on Extending Database
Technology (2013).
11. Jermaine, C. Sampling techniques for massive data.
In Synopses for Massive Data: Samples, Histograms,
Wavelets and Sketches. G. Cormode, M. Garofalakis,
P. Haas, and C. Jermaine, Eds. Foundations and Trends
in Databases 4, 1–3 (2012). NOW Publishers.
12. Morris, R. Counting large numbers of events in small
registers. Commun. ACM 21, 10 (Oct. 1978), 840–842.
13. Pike, R., Dorward, S., Griesemer, R., Quinlan, S.
Interpreting the data: Parallel analysis with Sawzall.
Scientific Programming 13, 4 (2005), 277–298.
14. Weinberger, K. Q., Dasgupta, A., Langford, J., Smola,
A.J., Attenberg, J. Feature hashing for large-scale
multitask learning. In Proceedings of the International
Conference on Machine Learning, 2009.
15. Whang, K.Y., Vander-Zanden, B.T., Taylor, H.M. A linear-time probabilistic counting algorithm for database
applications. ACM Trans. Database Systems 15, 2 (1990), 208–229.
16. Woodruff, D. Sketching as a tool for numerical linear
algebra. Foundations and Trends in Theoretical
Computer Science 10, 1–2 (2014), 1–157.
Graham Cormode is a professor of computer science
at the University of Warwick, U.K. Previously, he was a
researcher at Bell Labs and AT&T, working on algorithms
for data management. He received the 2017 Adams Prize
for his work on data analysis.
Copyright held by owner/author.
Publication rights licensed to ACM. $15.00.