extremely large datasets. However, there has been a failure to scale up processor speeds. The response on the computer architecture side has been to provide more, rather than faster, CPUs, slowly addressing difficulty No. 3. Parallelism is now routine, and software techniques for using it are improving in viability.
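As a rough illustration of what such a software technique can look like (a generic sketch, not an algorithm from the article or the book), the snippet below splits a dataset across worker processes, has each worker compute a partial statistic over its shard, and combines the results; the data, shard count, and per-shard function are all placeholders.

    # Minimal data-parallel sketch: each worker processes one shard of the data
    # and the partial results are combined. All inputs here are placeholders.
    from concurrent.futures import ProcessPoolExecutor

    def partial_sum_of_squares(shard):
        # Stand-in for per-shard work such as computing a partial gradient.
        return sum(x * x for x in shard)

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        shards = [data[i::n_workers] for i in range(n_workers)]
        with ProcessPoolExecutor(max_workers=n_workers) as pool:
            total = sum(pool.map(partial_sum_of_squares, shards))
        print(total)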
Difficulty No. 4 is addressed by noting that some of the big datasets really matter. Measured in dollars, good solutions to the “ad display problem” (and high-frequency trading in general) are easily worth many billions. Measured in time, a good spam filter or optimized search engine can save millennia of time per day.
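A back-of-the-envelope calculation shows why figures on that scale are plausible; the user count and per-user savings below are illustrative assumptions, not numbers from the article.

    # Back-of-the-envelope check of the "millennia of time per day" claim.
    # Both inputs are illustrative assumptions.
    users_per_day = 1_000_000_000      # daily users of a large mail or search service
    seconds_saved_per_user = 30        # attention saved per user per day
    seconds_per_year = 365 * 24 * 3600

    years_saved_per_day = users_per_day * seconds_saved_per_user / seconds_per_year
    print(f"{years_saved_per_day:,.0f} years of human time saved per day")
    # Roughly 950 years per day; a few billion users pushes this into millennia.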
Difficulty No. 5 is addressed by noting that many of these problems are
inherently complex. What is the inherent complexity of the function that
always returns the best answer given
any question by anyone anywhere? If
a significant portion of that function
is parameterized and the parameters
are learned, we want significant quantities of data directly informing those parameters.
This leaves difficulties No. 1 and No.
2, both of which can be dealt with by more efficient algorithms.
TOWARD MORE EFFICIENT ALGORITHMS
Recently, my colleagues and I edited a
book surveying the state of the art in
parallel machine learning [1]. Based
on this, we created a survey tutorial,
which provides a high-level view of the
state of public research alongside a
summary of the book’s contents [2]. Of
particular interest to me are the quantifications of gross computational performance in Part 3, where we measured the computational performance
of each algorithm while neglecting the
predictive performance. The core unit
of interest was a feature (i.e., a nonzero