early 2000s on the original Xbox), I have been surprised by how few projects use LTO. In part this may be because programmers unintentionally rely on undefined behavior that becomes apparent only when the compiler gains more visibility: I know I have been guilty of this.
Favorite Optimization Examples
Over the years, I have collected a number of interesting real-world optimizations, both from first-hand experience optimizing my own code and from helping others understand their code on Compiler Explorer. Here are some of my favorite examples of how clever the compiler can be.
Integer division by a constant. It may be surprising to learn that, until very recently, one of the most expensive things you could do on a modern CPU was an integer divide. Division is more than 50 times more expensive than addition and more than 10 times more expensive than multiplication. (This was true until Intel's release of the Cannon Lake microarchitecture, where the maximum latency of a 64-bit divide was reduced from 96 cycles to 18.⁶ That is only around 20 times slower than an addition, and 5 times more expensive than a multiplication.)
Thankfully, compiler authors have some strength-reduction tricks up their sleeves when it comes to division by a constant. I am sure we have all realized that division by a power of two can often be replaced by a logical shift right; rest assured the compiler will do this for you. I would advise against writing >> in your code to do division; let the compiler work it out for you. It's clearer, and the compiler also knows how to account properly for signed values: integer division truncates toward zero, whereas an arithmetic shift right on its own truncates toward negative infinity.
However, what if you are dividing by a non-power-of-two value, as illustrated in (j)? Are you out of luck? Luckily, the compiler has your back again. The code compiles to (k) (https://godbolt. Not a divide instruction in sight. Just a shift and a multiply by a strange large constant: the 32-bit unsigned input value is multiplied by 0xaaaaaaab, and the resulting 64-bit value is shifted down by 33 bits. The compiler has replaced the division with a multiplication by the reciprocal, in fixed point.
In this case, the compiler has realized that the vector's begin() and end() are constant during the operation of the loop. As such, it has been able to realize that the call to size() is also a constant. Armed with this knowledge, it hoists these constants out of the loop, and then rewrites the index operation (vec[i]) to be a pointer walk, starting at begin() and walking up one int at a time to end(). This vastly simplifies the generated assembly.
In this example, I gave the compiler a body for testFunc() but marked it as non-inlineable (a GNU extension) to demonstrate this optimization in isolation. In a more realistic codebase, the compiler could inline testFunc() if it believed it beneficial.
Another way to enable this optimization without exposing the body of the function to the compiler is to mark it as [[gnu::pure]] (another language extension). This promises the compiler that the function is pure: entirely a function of its arguments, with no side effects.
Interestingly, using range-for in the initial example yields optimal assembly, even without knowing that testFunc() does not modify vec (https://godbolt.org/z/acm19_count3). This is because range-for is defined as a source-code transformation that puts begin() and end() into local variables, as shown in (h), which is interpreted as (i).
All things considered, if you need to use a "raw" loop, the modern range-for style is preferred: it is optimal even if the compiler cannot see the body of called functions, and it is clearer to the reader. Arguably better still is to use the STL's count_if function to do all the work for you: the compiler still generates optimal code (https://godbolt.
In the traditional single-translation-unit-at-a-time compilation model, function bodies are often hidden from call sites, as the compiler has seen only their declaration. Link-time optimization (LTO, also known as LTCG, for link-time code generation) can be used to allow the compiler to see across translation-unit boundaries. In LTO, individual translation units are compiled to an intermediate form instead of machine code. During the link process, when the entire program (or dynamically linked library) is visible, machine code is generated. The compiler can take advantage of this to inline across translation units, or at least use information about the side effects of called functions to optimize.
Enabling LTO for optimized builds can be a good win in general, as the compiler can see your whole program. I now rely on LTO to let me move more function bodies out of headers, reducing coupling, compile time, and build dependencies for debug builds and tests, while still giving me the performance I need in final builds.
Despite being a relatively established technology (I used LTCG in the