Summary: This is a two part series. The first part describes how the expression "(v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f" yields the bit count (number of set bits) of 'v' where 'v' is a 12 bit value. Part 1 also describes a general proof on which this code snippet is based. Part 2 uses the understanding in the first article to generate a code snippet that can bit count a 16 bit value using the a 64 bit register and discusses tradeoffs.
If I had a blog, then I guess this would have been an entry in it. Since I don't, here it stands as a lone write-up. Last night we had been out to watch Cold Mountain, nice movie. By the time the move got over at 12:30am there were things that were bothering me. One of them was about how the protagonists lover dies at the end of the movie and I kept thinking about alternate endings that were probably nice too. What bothered me more about the scene in the movie was how there were these black crows that fly around when he is dying. It kept bringing back these images of Vincent Van Gogh's Wheat Fields with Crows, a picture that has bothered me a lot.

Another thing that was bothering me was this code snippet I saw recently that claimed to be able to count the number of set bits in a number. Generally when I don't understand code, it bothers me a lot. The code snippet is given below and is courtesy of Sean Eron Anderson of Stanford University. (http://graphics.stanford.edu/~seander/bithacks.html)
Counting bits set in 12, 24, or 32-bit words using 64-bit instructions
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
// option 1, for at most 12-bit values in v:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
// option 2, for at most 24-bit values in v:
c = (v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL;
c += ((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL;
c %= 0x1f;
// option 3, for at most 32-bit values in v:
c = (((v & 0xfff) *
0x1001001001001ULL & 0x84210842108421ULL) +
((v & 0xfff000) >> 12) *
0x1001001001001ULL & 0x84210842108421ULL)) % 0x1f;
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
|
This snippet was originally posted at one of our college email boards by one of my seniors who works on the HP Unix kernel, writing machine dependant code for porting the OS to different hardware architectures. He was taking HP UX to the IA64 and I guess doing bit counting like shown above was relevant to him.
The code snippet by Sean Anderson was based on work at MIT by
Schroeppel, R (http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167)
where he does:
;if A is a 9 bit quantity, B gets number of 1's (Schroeppel)
IMUL A,[1001001001] ;4 copies
AND A,[42104210421] ;every 4th bit
IDIVI A,17 ;casting out 15.'s in hexadecimal
|
I wish I had know this fact that night, might have saved us a lot of time.
Half past midnight we figured that the most useful thing we could do is to have coffee. During coffee the I got to terms with myself and decide that its time to figure out this code snippet. If I didn't it would keep bothering me for much longer than I'd like it to.
The hard part in the code above is how
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) %
0x1f;
actually works. What are all those magic numbers and how does that give a bit
count?
More traditional approaches of bit counting, or at least ones that I would have thought of by myself without much provocation would be like this:
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; v >>= 1)
{
c += v & 1;
}
|
or like this:
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
v &= v - 1; // clear the least significant bit set
}
|
These seem obvious enough, but how does " c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;" work when v is a 12 bit value?
The approach is to split up the problems and examine each of the actions.
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
// option 1, for at most 12-bit values in v:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
|
The first operation is the multiplication:
v * 0x1001001001001ULL
Now 'v' is a 12 bit value, which means that it would in effect take 3 hex digits to be written. Therefore is v = 0xXYZ, where X, Y and Z are hex digits then:
= v * 0x001001001001001
= 0xXYZ * 0x001001001001001
0xXYZXYZXYZXYZXYZ
|
The multiplication has simply got the 12 bit value to repeat itself out
What does the the next step do? The next step does a binary AND of this result with "0x84210842108421ULL". What's special about this number? If you give it some thought, it figures that when you write the same number in binary, this is its basic pattern.
0x08421 = 00001000010000100001
This number simply is a 1 bit separated by four 0 bits. ie to say, every fifth bit is a 1. Hmm... so what happens to the product when we AND it with this value ? What happens is that Every fifth bit in the product is selected. Ok genius, so what does that mean?
What that means is that if v = 0xXYZ = abcd efgh ijkl (in
bits),
Then our product is 0xXYZXYZXYZXYZXYZ = abcd efgh ijkl, abcd efgh ijkl,
abcd efgh ijkl, abcd efgh ijkl, abcd efgh ijkl
If we AND this with our magic bit pattern we get
0xXYZXYZXYZXYZXYZ AND 0x084210842108421 is the same as
abcd efgh ijkl abcd efgh ijkl abcd efgh ijkl abcd efgh ijkl abcd efgh ijkl
0000 1000 0100 0010 0001 0000 1000 0100 0010 0001 0000 1000 0100 0010 0001
--------------------------------------------------------------------------
0000 e000 0j00 00c0 000h 0000 a000 0f00 00k0 000d 0000 i000 0b00 00g0 000l
|
Wow! on closer look at the number we realized that what happened to our original value 'v' is that each of the bits that formed 'v' have been juggled out of order and have been padded with four 0s each. If you don't get that description compare that with the result above.
Having evaluated what "(v * 0x1001001001001ULL & 0x84210842108421ULL)" does, we only needed to understand what the "%0x1f" did. Now 0x1f = 31, which is 11111 in binary.
This was the hard part. Try as I might, I couldn't figure out what the result of taking a modulus of 31 of a number meant in terms of bit level operations. That is to say, I can easily imagine what an AND or an OR operation does to the bits. But I could not imagine what a modulus 31 does to the bits.
By this time our coffee's were getting over and this division problem was not getting me anywhere. That's when it struck me that each of the bits of the original 'v' occurring at 5 bit separations probably has something to do with the fact that 31 is actually a five bit value (all the bits being 1).
rearranging or result above we have 0000 e000 0j00 00c0 000h 0000 a000 0f00 00k0 000d 0000 i000 0b00 00g0 000l = 0000e 0000j 0000c 0000h 0000a 0000f 0000k 0000d 0000i 0000b 0000g 0000l |
Now the result of this value modulus 31 should be the number of bits that are 1 in 'v' right? So in some sense I expect that the %31 operation somehow manages to count the number of bits.
This entire thinking got me to work out some simple examples:
000111
% 111111
-----------
= 000111
As a matter of fact a % b = a, for any a < b. Now
000111000000
% 111111
-------------------
=
000111
Surprise! (at least it was for me). If I bit-shift 'a' to exactly the bit width of 'b', the remainder is still the same. What if I bit-shift 'a' by a number that is a multiple of the bit width of 'b' ?
000111000000000000
%
111111
---------------------------
=
000111
Here I bit-shifted the value 12 places where the bit width of 111111 is 6. The remainder still stayed the same.
So if the remainder stays the same not matter how many places I shift it (as long as I shift it by a multiple of the bit width). I had something interesting here which might be a basis of how the original code was working.
Now if
000111 % 111111 = 000111 and
000111000000 % 111111 = 000111
then
000111000111 % 111111 = 000111 + 000111
Thus if I add some of the bit shifted quotients, the remainder merely adds up (unless the sum of the individual remainders is not greater than the number itself).
So if 'a' and 'b' are binary numbers (a < b) and 'n' is the width of 'b' Then a % b = a as well as a*(2^kn) % b = a, where k is any constant a*(2^kn) is the equivalent of shifting the value 'a' k * n number of places. (Similar to the examples above). |
If the above statement is true, then what the %31 actually does is
0000e 0000j 0000c 0000h 0000a 0000f 0000k 0000d 0000i 0000b 0000g 0000l % 11111 is the same as (0000l % 11111) + (0000g 00000 % 11111) + (0000b 00000 00000 % 11111) + .. etc which amounts to l + g + b ... etc Which is the sum of the set bits. |
Thrilled with this discovery, I figured that it not valid unless I can prove that "a*(2^kn) % b = a" is true.
Lets generalize the proof,
Assume a number system of base 'b'
(where for binary b = 2, decimal b is 10 etc,
for this case b = 32).
So our divisor is b-1 (in this case 32 - 1 = 31)
Let a < b-1 a % (b-1) = a We need to prove a*b^n % (b-1) = a a*b^n % (b-1) = a (a*b^n - a) % (b-1) = 0 Therefore we need to prove that (a*b^n - a) is a multiple of (b-1). (a*b^n - a) = a*(b^n - 1) Now a is definitely not a multiple of b-1 (a < b-1), so (b^n -1) should be proven to be a multiple of b-1. b^n-1 b^n-b +b-1 b*(b^(n-1) -1) + (b-1) Recursively expanding this would amount to b^(n-1)*(b-1) + b^(n-2)*(b-1)... b^0*(b-1) Since each of these factor is a multiple if b-1, we know that b^n-1 is a multiple of (b-1). Thus (a*b^n - a) % (b-1) = 0. That is irrespective of the value of 'n', the remainder remains the same. |
Sidharth actually helped out with the expansion of terms in above derivation
(thanks buddy).
If the whole thing looks a little contrived to you, consider substituting
values. If b = 10 and lets say a = 2.
a % (b-1) = 2 % 9 = 2.
(a * b^n) % (b-1) = (lets take n as 3) (2 * 10^3) % 9 = 2000 % 9 = 2
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
// option 1, for at most 12-bit values in v:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
Thus v * 0x1001001001001 will repeat the value 'v' five times in the 64 bit register that holds this value. & 0x84210842108421 will select every 5th bit. Since 12(the width of 'v') and 5 are mutually prime, this select all the bits of 'v' and pad them with filler 0s. % 0x1f will add up every group of 5 bits, effectively adding all the bits of 'v'. |
With that settled it felt like we understood the approach. And as a matter of
fact we ended up teaching ourselves something new about modulus. Which got me
wondering if I can "optimize" on this approach.
More on that in Part 2.
Roshan James
Monday,
March 08 2004