I don’t mean to brag, but I can pack a lot of bugs into five lines. What if I were to leverage code that has been heavily tested instead? Most languages have a built-in sort function that has been tested far more than any code I’ve ever written. I could find the max by sorting the list and picking the last element. That would be lazy and execute more slowly than a linear search, but it would be very reliable.
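As a sketch of the idea (the function name, and the assumption that the data sits in a plain list of integers, are mine, not the article’s), the entire lazy algorithm fits in a couple of lines of Python:

    # Lazy but reliable: lean on the built-in sort, which has been
    # tested far more than any hand-rolled loop ever will be.
    def lazy_max(data):
        # O(n log n) instead of O(n), but very hard to get wrong.
        return sorted(data)[-1]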
A few simple benchmarks found that on the same old laptop, this “lazy algorithm” could sort 700,000 elements and still be under the 200ms mark.
What about smaller values of N? If N = 16,000, then the entire dataset fits in the L1 cache of the CPU, assuming the CPU was made in this decade. This means the CPU can scan the data so fast it will make your hair flip. If N = 64,000, then the data will fit in a modern L2 cache, and your hair may still do interesting things. If the computer wasn’t made in this decade, I would recommend that my friend reconsider working for this company.
If N is less than 100, then the lazy algorithm runs imperceptibly fast. In fact, you could repeat the search on demand rather than storing the value, and unless you were running the algorithm thousands of times, the perceived time would be negligible.
The algorithms mentioned so far are satisfactory until N = 700,000 if we are lazy and N = 13,000,000 if we aren’t; 13 million 32-bit integers (about 52MB) is hardly small by some standards. Yet, in terms of human perception, it can be searched instantly.
If my friend had known these benchmark numbers, he could have had some fun during the interview, asking the interviewer to suggest a large value of N and replying, “What? I don’t get out of bed for less than 13 million integers!” (Of course, this would probably have cost him the job.)
2. Use SIMD Instructions
Most modern CPUs have SIMD (single instruction, multiple data) instructions that let you repeat the same operation over a large swath of memory. They are able to do this very quickly because they benefit from more efficient memory access and from applying the same operation to many elements at once.
According to one simple benchmark (http://stackoverflow.com/a/2743040/71978), a 2.67GHz Core i7 saw a 7–8x improvement by using SIMD instructions where N = 100,000. If the amount of data exceeded the CPU’s cache size, the benefit dropped to 3.5x. With SIMD, small becomes about 45 million elements, or about 180MB.
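You rarely write SIMD instructions by hand from a high-level language. As one hedged illustration in Python, you can delegate the scan to NumPy, whose compiled reduction loops are generally SIMD-vectorized on modern builds (the dataset size below matches the 180MB figure; the speedup you see will depend on your CPU and NumPy build):

    import numpy as np

    # 45 million 32-bit integers, about 180MB of data.
    data = np.random.randint(0, 2**31 - 1, size=45_000_000, dtype=np.int32)
    # A single compiled pass over the array; no Python-level loop.
    maximum = int(data.max())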
3. Work in Parallel
Even if N is larger than the small quantity, you can keep within your 200ms time budget by using multiple CPUs. Each CPU core can search a shard of the data. With four CPU cores, small becomes 4N, or nearly 200 million elements.
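A minimal sketch using Python’s multiprocessing module (the function names and the four-worker split are illustrative, not from the article):

    from multiprocessing import Pool

    def shard_max(shard):
        return max(shard)                    # each worker scans one shard

    def parallel_max(data, workers=4):
        n = len(data)
        shards = [data[i * n // workers:(i + 1) * n // workers]
                  for i in range(workers)]
        with Pool(workers) as pool:          # one process per CPU core
            return max(pool.map(shard_max, shards))

    if __name__ == "__main__":               # guard required for process spawning
        print(parallel_max(list(range(1_000_000))))

Note that copying shards into worker processes has real overhead; in production you would point each worker at shared memory rather than shipping the data around.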
When I was in college, the study of parallel programming was hypothetical because we didn’t have access to computers with more than one CPU. In fact, I didn’t think I would ever be lucky enough to access a machine with such a fancy architecture. Boy, was I wrong! Now I have a phone with eight CPU cores, one of which, I believe, is dedicated exclusively to crushing candy. Parallel processing is now the norm, not the exception. Code should be written to take advantage of this.
4. Hide Calculation in Another Function
The search for the max value can be hidden in other work. For example, earlier in the process the data is loaded into memory. Why not have that code also track the max value as it iterates through the data? If the data is being loaded from disk, the time spent waiting for I/O will dominate, and the additional comparison will be, essentially, free.
If the data is being read from a text file, the work of converting ASCII digits to 32-bit integers is considerably greater than that of tracking the largest value seen so far. Adding max-value tracking would be “error in the noise” of any benchmark. Therefore, it is essentially free.
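A sketch of what that might look like, assuming for illustration a text file with one integer per line (the article’s LoadDataFromFile() is a generic placeholder, so this function name is equally hypothetical):

    def load_data_from_file(path):
        # Parsing ASCII digits into integers dominates the cost; the
        # extra comparison to track the max is "error in the noise."
        data, largest = [], None
        with open(path) as f:
            for line in f:
                value = int(line)
                data.append(value)
                if largest is None or value > largest:
                    largest = value
        return data, largest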
You might point out that this violates the SoC (separation of concerns) principle. The method that loads data from the file should just load data from a file. Nothing else. Having it also track the maximum value along the way adds complexity. True, but we’ve already decided the added complexity is worth it.
Where will this end? If the LoadDataFromFile() method also calculates the max value, what’s to stop us from adding other calculations? Should it also calculate the min, count, total, and average? Obviously not. If you have the count and total, then you can calculate the average yourself.
5. Maintain the Max Along the Way
What if the max value cannot be tracked as part of loading the dataset? Perhaps you don’t control the method that loads the data. If you are using an off-the-shelf JSON (JavaScript Object Notation) parsing library, adding such tracking to it would be very difficult. Perhaps the data is modified after being loaded, or it is generated in place.
In such situations I would ask why the data structure holding the data isn’t doing the tracking itself. If data is only added, never removed or changed, the data structure can easily track the largest value seen so far. The need for a linear search has been eliminated.
If items are being removed and changed, more sophisticated data structures are required. A heap makes the highest value accessible in O(1) time. The data can be kept in the original order, with a heap or other index maintained on the side. You will then always have fast access to the highest value, though you will suffer some additional overhead maintaining the indexes.
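A sketch of the append-only case (the class is my own illustration; for add/remove workloads you would replace the single field with a heap or other index, for example Python’s heapq with negated values, since heapq is a min-heap):

    class MaxTrackingList:
        """Append-only container that maintains the max as items arrive."""

        def __init__(self):
            self._items = []
            self.max_value = None            # readable in O(1) at any time

        def append(self, value):
            self._items.append(value)
            if self.max_value is None or value > self.max_value:
                self.max_value = value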
6. Hide Long Calculations
Maybe the process can’t be made any faster, but the delay can be hidden from the user.
One good place to hide the calculation is while waiting for user input. You don’t need the entire processing power of the computer to ask “Are you sure?” and then wait for a response. Instead, you can use that time to perform calculations, and no one will be the wiser.
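As a hedged sketch (the dataset and prompt are invented for illustration): in CPython the main thread releases the interpreter lock while blocked on input(), so a background thread really does get the CPU while the user is reading the question:

    import threading

    data = list(range(10_000_000))           # stand-in dataset

    result = {}
    worker = threading.Thread(target=lambda: result.update(max=max(data)))
    worker.start()
    answer = input("Are you sure? ")         # the user reads and responds...
    worker.join()                            # ...while the scan runs behind it
    print(result["max"])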
One video-game console manufacturer requires games to have some kind of user interaction within a few seconds of starting. Sadly, most games need more time than that to load and initialize. To meet the vendor’s requirement, most games first load and display a title screen, then ask users to click a button to start the game. What users don’t realize is that while they are sitting in awe of the amazing title screen, the game is finishing its preparations.