against the only known implementation.
If it matches, then the compiler knows
what to do: it inlines the body of Transform::operator() and squares the value in place.
That’s right: the compiler has inlined a virtual call. This is amazing, and it was a huge surprise when I first discovered it. This optimization is called speculative devirtualization and is the subject of continued research and improvement by compiler writers. Compilers are capable of devirtualizing at LTO time too, allowing for whole-program determination of possible function implementations.
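To make this concrete, here is a rough source-level sketch of the transformation. The Transform hierarchy and its single Square implementation are reconstructed here for illustration (the names are assumptions based on the running example), and the typeid comparison merely stands in for the vtable or function-pointer comparison the compiler actually emits:

#include <cstddef>
#include <typeinfo>

struct Transform {
  virtual int operator()(int x) const = 0;
  virtual ~Transform() = default;
};

// Assume this is the only implementation the compiler can see.
struct Square : Transform {
  int operator()(int x) const override { return x * x; }
};

// The loop as written: one indirect call per element.
void transformAll(int *v, std::size_t n, const Transform &t) {
  for (std::size_t i = 0; i < n; ++i) v[i] = t(v[i]);
}

// Roughly what speculative devirtualization turns it into: guess the
// only known implementation, inline its body on the fast path, and
// keep the indirect call as a fallback.
void transformAllSpeculative(int *v, std::size_t n, const Transform &t) {
  for (std::size_t i = 0; i < n; ++i) {
    if (typeid(t) == typeid(Square))  // stand-in for the vtable check
      v[i] = v[i] * v[i];             // inlined Square::operator()
    else
      v[i] = t(v[i]);                 // slow path: the virtual call
  }
}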
The compiler has missed a trick,
however. Note that at the top of the
loop it reloads the virtual function
pointer from the vtable every time.
If the compiler were able to notice that this value remains constant so long as the called function does not change the dynamic type of Transform, it could use loop-invariant code motion to hoist the vtable check out of the loop, leaving no dynamic checks inside the loop at all. At that point the other optimizations could kick in, and, in the case of the vtable check passing, the whole loop could be replaced with the vectorized version from earlier.
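In source terms, the hoisted version would look something like this sketch (same assumed types as above); note that it is only a valid transformation if the call provably does not change the dynamic type of t, which is exactly the obstacle described next:

// After loop-invariant code motion has hoisted the type check, no
// dynamic checks remain inside either loop, and the fast path is a
// plain square-in-place loop the vectorizer can handle.
void transformAllHoisted(int *v, std::size_t n, const Transform &t) {
  if (typeid(t) == typeid(Square)) {
    for (std::size_t i = 0; i < n; ++i)
      v[i] = v[i] * v[i];   // candidate for vectorization
  } else {
    for (std::size_t i = 0; i < n; ++i)
      v[i] = t(v[i]);       // generic path keeps the virtual call
  }
}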
You would be forgiven for thinking that the dynamic type of the object could not possibly change, but it’s actually allowed by the standard: an object can placement-new over itself, so long as it returns to its original type by the time it is destructed. I recommend that you never do this, though. Clang has an option that lets you promise you never do such horrible things in your code: -fstrict-vtable-pointers.
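For the curious, here is a minimal sketch of the horrible thing in question, using hypothetical A and B types; to stay within the rule above, the object would have to be restored to its original type before destruction (and real code would need std::launder to access it afterwards):

#include <new>

struct A {
  virtual void f();
  virtual ~A() = default;
};

struct B : A {
  void f() override {}
};

// A virtual call that changes the dynamic type of its own object:
// after f() returns, *this is a B, so the vtable pointer has changed
// behind the caller's back. This is why the compiler must reload it.
void A::f() {
  static_assert(sizeof(B) == sizeof(A), "B must fit in A's storage");
  this->~A();
  new (this) B;  // must be restored to an A before destruction
}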
Of the compilers I use, GCC is the
only one that does this as a matter of
course, but Clang is overhauling its type system to leverage this kind of optimization more.4
C++11 added the final specifier to allow classes and virtual methods to be marked as not being further overridden. This gives the compiler more information about which methods may profit from such optimizations, and in some cases may even allow the compiler to avoid a virtual call completely (https://godbolt.org/z/acm19_poly3).
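Here is a sketch of the effect, with hypothetical names standing in for the godbolt example: because nothing can derive from a final class, a call through a reference whose static type is final cannot require dynamic dispatch.

struct TransformF {
  virtual int operator()(int x) const = 0;
  virtual ~TransformF() = default;
};

// final: the compiler knows this operator() can never be overridden.
struct SquareF final : TransformF {
  int operator()(int x) const override { return x * x; }
};

// The static type is a final class, so this call can be compiled as
// a direct (and inlinable) call: no vtable load at all.
int applySquared(const SquareF &s, int x) { return s(x); }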
Even without the final keyword, sometimes the analysis phase is able to prove that a particular concrete class is being used (https://godbolt.org/z/acm19_poly4). Such static devirtualization can yield significant performance improvements.
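A sketch of this case too, reusing the assumed Transform and Square types from earlier: the optimizer can see the object's construction, so it knows the concrete type of the referent.

int runOnce() {
  Square sq;                 // concrete type is visible right here
  const Transform &t = sq;   // t provably refers to a Square...
  return t(21);              // ...so the call can be devirtualized
                             // and inlined to 21 * 21
}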
Conclusion
Hopefully, after reading this article, you will appreciate the lengths to which the compiler goes to ensure efficient code generation. I hope that some of these optimizations are a pleasant surprise and will factor into your decisions to write clear, intention-revealing code, leaving it to the compiler to do the right thing. I have reinforced the idea that the more information the compiler has, the better job it can do. This includes allowing the compiler to see more of your code at once, as well as giving the compiler the right information about the CPU architecture you are targeting. There is a trade-off to be made in giving the compiler more information: it can make compilation slower. Technologies such as link-time optimization can give you the best of both worlds.
Optimizations in compilers continue to improve, and upcoming improvements in indirect calls and virtual function dispatch might soon lead to even faster polymorphism. I am excited about the future of compiler optimizations. Go take a look at your compiler’s output.
Acknowledgments. The author
would like to extend his thanks to Matt
Hellige, Robert Douglas, and Samy Al
Bahra, who gave feedback on drafts of
this article.
Related articles
on queue.acm.org
C Is Not a Low-level Language
David Chisnall
https://queue.acm.org/detail.cfm?id=3212479
Uninitialized Reads
Robert C. Seacord
https://queue.acm.org/detail.cfm?id=3041020
You Don’t Know Jack about Shared
Variables or Memory Models
Hans-J. Boehm, Sarita V. Adve
https://queue.acm.org/detail.cfm?id=2088916
References
1. Godbolt, M. Compiler explorer, 2012; https://godbolt.org/.
2. Lemire, D. Faster remainders when the divisor is a
constant: beating compilers and libdivide, 2019; http://
bit.ly/33nzsy4/.
3. LLVM. The LLVM compiler infrastructure, 2003;
https://llvm.org.
4. Padlewski, P. RFC: Devirtualization v2. LLVM; http://
lists.llvm.org/pipermail/llvm-dev/2018-March/
121931.html.
5. ridiculous_fish. Libdivide, 2010; https://libdivide.com/.
6. Uops. Uops.info Instruction Latency Tables; https://
uops.info/table.html.
7. Walfridsson, K. How LLVM optimizes power sums, 2019; https://kristerw.blogspot.com/2019/04/how-llvm-optimizes-geometric-sums.html.
8. Warren, H.S. Hacker’s Delight. 2nd edition. Addison-Wesley Professional, 2012.
Matt Godbolt is the creator of the Compiler Explorer
website. He is passionate about writing efficient code.
He currently works at Aquatic Capital, and has worked on
low-latency trading systems, worked on mobile apps at
Google, run his own C++ tools company, and spent more
than a decade making console games.
Copyright held by author/owner.
Publication rights licensed to ACM.