divisibleBy3(unsigned ):
How often have you wondered, How
many set bits are in this integer? Prob-
ably not all that often. But it turns
imul edi, edi, -1431655765 ; edi = edi 0xaaaaaaab
cmp edi, 1431655765 ; compare with 0x55555555
setbe al ; return 1 if edi <= 0x55555555
ret
int
out this simple operation is surprisingly useful in a number of cases. For
example, calculating the Hamming
distance between two bitsets, dealing
with packed representations of sparse
matrices, or handling the results of
vector operations.
You might write a function to count
the bits as shown in (p). Of note is the
bit manipulation “trick” a &= (a – 1);,
which clears the bottom-most set bit.
It’s a fun one to prove to yourself how it
works on paper. Give it a go.
When targeting the Haswell microarchitecture, GCC 8. 2 compiles this code
to the assembly shown in (q) (https://
godbolt.org/z/acm19_bits). Note how
GCC has cleverly found the BLSR
bit-manipulation instruction to pick off the
bottom set bit. Neat, right? But not as
clever as Clang 7.0 illustrated in (r).
This operation is common enough
that there is an instruction on most
CPU architectures to do it in one go:
POPCNT (population count). Clang is
clever enough to take a whole loop in
C++ and reduce it to a single instruction. This is a great example of good instruction selection: Clang’s code generator recognizes this pattern and is
able to choose the perfect instruction.
I was actually being a little unfair
here: GCC 9 also implements this (s),
and in fact shows a slight difference. At
first glance this appears to be suboptimal: Why on earth would you write a
zero value, only to overwrite it immediately with the result of the “population
count” instruction popcnt?
A little research brings up Intel CPU
erratum SKL029: “POPCNT Instruction
May Take Longer to Execute Than Expected”—there is a CPU bug! Although
the popcnt instruction completely
overwrites the output register eax, it is
incorrectly tagged as depending on the
prior value of eax. This limits the CPU’s
ability to schedule the instruction until any prior instructions writing to eax
have completed—even though they
have no impact.
This apparent witchcraft is ex-
plained very well by Daniel Lemire in
his blog. 2 As an aside, it’s possible to do
these kinds of integer division tricks at
runtime too. If you need to divide many
numbers by the same value, you can
use a library such as libdivide. 5
cost is to dispatch to the correct piece
of code in the switch statement.
GCC 9 has a neat trick (n) for checking for divisibility by a non-power-of-two ( https://godbolt.org/z/acm19_mul-
tof3) and compiles to (o).
Counting Set Bits
GCC’s approach here is to break
the dependency on eax: the CPU recognizes xor eax, eax as a dependen-cy-breaking idiom. No prior instruction can influence eax after xor eax,
eax, and the popcnt can run as soon
as its input operand edi is available.
This affects only Intel CPUs and
seems to be fixed in the Cannon Lake