microarchitecture, although GCC still
emits XOR when targeting it.
Maybe you have never needed to count
the number of set bits in an integer,
but you have probably written code
like the example in (t). Instinctively,
I thought the code generation would
be full of compares and branches, but
both Clang and GCC use a clever trick
to make this code pretty efficient. GCC
9.1’s output is shown in (u; https://god-
The compilers turn this sequence
of comparisons into a lookup table.
The magic value loaded into rax is a 33-
bit lookup table, with a one-bit in the
locations where you would return true
(indices 32, 13, 10, and 9 for ' ', \r, \n,
and \t, respectively). The shift and &
then pick out the cth bit and return it.
Clang generates slightly different but
broadly equivalent code. This is another example of strength reduction.
I was pleasantly surprised to see
this kind of optimization. This is definitely the kind of thing that—prior to
investigating in Compiler Explorer—I
would have written manually assuming I knew better than the compiler.
One unfortunate thing I did notice
while experimenting is—for GCC, at
least—the order of the comparisons
can affect the compiler’s ability to
make this optimization. If you switch
the order of the comparison of the \r
and \n, GCC generates the code in (v).
There’s a pretty neat trick with the
and to combine the comparison of \r
and \t, but this seems worse than the
code generated before. That said, a simplistic benchmark on Quick Bench suggests the compare-based version might
be a tiny bit faster in a predictable tight
loop.a Who ever said this was simple, eh?
Sometimes you need to add a bunch
of things up. Compilers are extremely
good at taking advantage of the vectorized instructions available in most
CPUs these days, so even a pretty
straightforward piece of code such as
(w) gets turned into code whose core
loop looks like (x) (https://godbolt.
The compiler has been able to pro-
cess eight values per instruction, by sep-
arating the total into eight separate sub-
totals for each one. At the end it sums
across those subtotals to make the final
total. It’s as if the code was rewritten for
you to look more like the (y) example.
Simply set the compiler’s optimization level high enough
and pick an appropriate CPU architecture to target, and vectorization kicks in.
This does rely on the fact that sepa-
rating the totals into individual sub-
totals and then summing at the end is
equivalent to adding them in the order
the program specified. For integers, this
is trivially true; but for floating-point
data types this is not the case. Floating-
point operations are not associative:
(a+b)+c is not the same as a+(b+c), as—
among other things—the precision of
the result of an addition depends on the
relative magnitude of the two inputs.
This means, unfortunately, that
changing the vector<int> to be a
vector<float> does not result in the
code you would ideally want. The compiler could use some vector operations
(it can square eight values at once), but
it is forced to sum across those values
serially as shown in (z; https://godbolt.
This is unfortunate, and there is not
an easy way around it. If you are absolutely sure the order of addition is not
important in your case, you can give GCC
the dangerous (but amusingly named)
-funsafe-math-optimizations flag.
This lets it generate the beautiful inner
loop illustrated in (a'; https://godbolt.
Amazing stuff: processing eight floats
at a time, using a single instruction to ac-