This is presumably a result of the
general algorithm Clang uses.
Further experimentation shows
that Clang is clever enough to optimize many of these types of loops.
Both Clang and GCC track loop variables in a way that allows this kind of
optimization, but only Clang chooses
to generate the closed-form version.
It’s not always less work: for small
values of x the overhead of the closed-form solution might be more than just
looping. Krister Walfridsson goes into
great detail about how this is achieved
in a blog post. 7
It is also worth noting that in order to
do this optimization, the compiler may
rely on signed integer overflow being
undefined behavior. As such, it can assume your code cannot pass a value of
x that would overflow the result (65536,
in this case). If Clang cannot make that
assumption, it is sometimes unable
to find a closed-form solution (https://
godbolt.org/z/acm19_sum_fail).
Devirtualization
Although it seems to have fallen out
of favor a little, traditional virtual-function-based polymorphism has its
place. Whether it’s to allow for genuine
polymorphic behavior, or add a “seam”
for testability, or allow for future extensibility, polymorphism through virtual
functions can be a convenient choice.
As we know, though, virtual functions are slow. Or are they? Let’s see how
they affect the sum-of-squares example
from earlier—something like (d').
Of course, this is not polymorphic
yet. A quick run through the compiler shows the same highly vec
tor-ized assembly ( https://godbolt.org/z/
acm19_poly1).
Now adding the virtual keyword in
front of the int operator() should result in a much slower implementation,
filled with indirect calls, right? Well,
sort of ( https://godbolt.org/z/acm19_
poly2). There is a lot more going on
than before, but at the core of the loop
is something perhaps surprising (e').
What is happened here is GCC has
made a bet. Given that it has seen only
one implementation of the Trans-
form class, it is likely going to be that
one implementation that is used. In-
stead of blindly indirecting through the
virtual function pointer, it has taken
the slight hit of comparing the pointer
cumulate and square. The drawback is
potentially unbounded precision loss.
Additionally, GCC does not allow you to
turn this feature on for just the functions
you need it for—it’s a per-compilation
unit flag. Clang at least lets you control it
in the source code with #pragma Clang
fp contract.
While playing around with these
kinds of optimizations, I discovered
that compilers have even more tricks
up their sleeves, check out (b'). GCC
generates fairly straightforward code
for this, and with appropriate compil-
er settings will use vector operations
as noted earlier. Clang, however, gen-
erates the code in (c'); (https://god-
bolt.org/z/acm19_sum_up).
First, note there is no loop at all.
Working through the generated code,
you see that Clang returns:
(x – 1)(x – 2)
2 +(x+ 1)
It has replaced the iteration of a
loop with a closed-form general solution of the sum. The solution differs
from what I would naively write myself:
x (x – 1)
2