A Modulo Based Bit Counting Approach For 64 Bit Systems

Part 1 : Understanding The Approach

Summary: This is a two part series. The first part describes how the expression "(v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f" yields the bit count (number of set bits) of 'v' where 'v' is a 12 bit value. Part 1 also describes a general proof on which this code snippet is based. Part 2 uses the understanding in the first article to generate a code snippet that can bit count a 16 bit value using the a 64 bit register and discusses tradeoffs. 

If I had a blog, then I guess this would have been an entry in it. Since I don't, here it stands as a lone write-up. Last night we had been out to watch Cold Mountain, nice movie. By the time the move got over at 12:30am there were things that were bothering me. One of them was about how the protagonists lover dies at the end of the movie and I kept thinking about alternate endings that were probably nice too. What bothered me more about the scene in the movie was how there were these black crows that fly around when he is dying. It kept bringing back these images of Vincent Van Gogh's Wheat Fields with Crows, a picture that has bothered me a lot.

Another thing that was bothering me was this code snippet I saw recently that claimed to be able to count the number of set bits in a number. Generally when I don't understand code, it bothers me a lot. The code snippet is given below and is courtesy of Sean Eron Anderson of Stanford University. (http://graphics.stanford.edu/~seander/bithacks.html)

Counting bits set in 12, 24, or 32-bit words using 64-bit instructions 

       	
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 12-bit values in v:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 

// option 2, for at most 24-bit values in v:
c =  (v & 0xfff) * 0x1001001001001ULL & 0x84210842108421ULL;
c += ((v & 0xfff000) >> 12) * 0x1001001001001ULL & 0x84210842108421ULL;
c %= 0x1f;

// option 3, for at most 32-bit values in v:
c = (((v & 0xfff) * 
      0x1001001001001ULL & 0x84210842108421ULL) +
      ((v & 0xfff000) >> 12) * 
      0x1001001001001ULL & 0x84210842108421ULL)) % 0x1f; 
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 

			

This method requires a 64-bit CPU with fast modulus division to be efficient. The first option takes only 6 operations; the second option takes 13; and the third option takes 17.

This snippet was originally posted at one of our college email boards by one of my seniors who works on the HP Unix kernel, writing machine dependant code for porting the OS to different hardware architectures. He was taking HP UX to the IA64 and I guess doing bit counting like shown above was relevant to him. 

The code snippet by Sean Anderson was based on work at MIT by Schroeppel, R (http://www.inwap.com/pdp10/hbaker/hakmem/hacks.html#item167) where he does:

      	
;if A is a 9 bit quantity, B gets number of 1's (Schroeppel)
        IMUL A,[1001001001]     ;4 copies
        AND A,[42104210421]     ;every 4th bit
        IDIVI A,17              ;casting out 15.'s in hexadecimal
			

I wish I had know this fact that night, might have saved us a lot of time. 

Half past midnight we figured that the most useful thing we could do is to have coffee. During coffee the I got to terms with myself and decide that its time to figure out this code snippet. If I didn't it would keep bothering me for much longer than I'd like it to. 

The hard part in the code above is how 
    c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;
actually works. What are all those magic numbers and how does that give a bit count?

More traditional approaches of bit counting, or at least ones that I would have thought of by myself without much provocation would be like this:

       	
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

for (c = 0; v; v >>= 1)
{
  c += v & 1;
}

			

or like this:

       	
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v
for (c = 0; v; c++)
{
  v &= v - 1; // clear the least significant bit set
}

			

These seem obvious enough, but how does " c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;" work when v is a 12 bit value?

An Explanation and a Generalization

The approach is to split up the problems and examine each of the actions.   

       	
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 12-bit values in v:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 

			

The first operation is the multiplication:
    v * 0x1001001001001ULL

Now 'v' is a 12 bit value, which means that it would in effect take 3 hex digits to be written. Therefore is v = 0xXYZ, where X, Y and Z are hex digits then:

       	
 = v     * 0x001001001001001
 = 0xXYZ * 0x001001001001001
           0xXYZXYZXYZXYZXYZ

			

The multiplication has simply got the 12 bit value to repeat itself out

What does the the next step do? The next step does a binary AND of this result with "0x84210842108421ULL". What's special about this number? If you give it some thought, it figures that when you write the same number in binary, this is its basic pattern. 

    0x08421 = 00001000010000100001

This number simply is a 1 bit separated by four 0 bits. ie to say, every fifth bit is a 1. Hmm... so what happens to the product when we AND it with this value ? What happens is that Every fifth bit in the product is selected. Ok genius, so what does that mean? 

What that means is that if v = 0xXYZ = abcd efgh ijkl  (in bits), 
Then our product is 0xXYZXYZXYZXYZXYZ =  abcd efgh ijkl, abcd efgh ijkl, abcd efgh ijkl, abcd efgh ijkl, abcd efgh ijkl

If we AND this with our magic bit pattern we get 

	0xXYZXYZXYZXYZXYZ
AND	0x084210842108421
is the same as
      	
abcd efgh ijkl abcd efgh ijkl abcd efgh ijkl abcd efgh ijkl abcd efgh ijkl
0000 1000 0100 0010 0001 0000 1000 0100 0010 0001 0000 1000 0100 0010 0001
--------------------------------------------------------------------------
0000 e000 0j00 00c0 000h 0000 a000 0f00 00k0 000d 0000 i000 0b00 00g0 000l 

			

Wow! on closer look at the number we realized that what happened to our original value 'v' is that each of the bits that formed 'v' have been juggled out of order and have been padded with four 0s each. If you don't get that description compare that with the result above. 

Having evaluated what "(v * 0x1001001001001ULL & 0x84210842108421ULL)" does, we only needed to understand what the "%0x1f" did. Now 0x1f = 31, which is 11111 in binary. 

This was the hard part. Try as I might, I couldn't figure out what the result of taking a modulus of 31 of a number meant in terms of bit level operations. That is to say, I can easily imagine what an AND or an OR operation does to the bits. But I could not imagine what a modulus 31 does to the bits.

By this time our coffee's were getting over and this division problem was not getting me anywhere. That's when it struck me that each of the bits of the original 'v' occurring at 5 bit separations probably has something to do with the fact that 31 is actually a five bit value (all the bits being 1).

rearranging or result above we have
0000 e000 0j00 00c0 000h 0000 a000 0f00 00k0 000d 0000 i000 0b00 00g0 000l = 
   0000e 0000j 0000c 0000h 0000a 0000f 0000k 0000d 0000i 0000b 0000g 0000l
			

Now the result of this value modulus 31 should be the number of bits that are 1 in 'v' right? So in some sense I expect that the %31 operation somehow manages to count the number of bits. 

This entire thinking got me to work out some simple examples:

    000111
% 111111
-----------
=  000111

As a matter of fact a % b = a, for any a < b. Now 

    000111000000
%             111111
-------------------
=              000111

Surprise! (at least it was for me). If I bit-shift 'a' to exactly the bit width of 'b', the remainder is still the same. What if I bit-shift 'a' by a number that is a multiple of the bit width of 'b' ?

    000111000000000000
%                         111111
---------------------------
=                          000111

Here I bit-shifted the value 12 places where the bit width of 111111 is 6. The remainder still stayed the same. 

So if the remainder stays the same not matter how many places I shift it (as long as I shift it by a multiple of the bit width). I had something interesting here which might be a basis of how the original code was working. 

Now if 

    000111 % 111111 = 000111 and
    000111000000 % 111111 = 000111

then 

000111000111 % 111111 = 000111 + 000111

Thus if I add some of the bit shifted quotients, the remainder merely adds up (unless the sum of the individual remainders is not greater than the number itself).

So if 'a' and 'b' are binary numbers (a < b) and 'n' is the width of 'b'
Then 
	a % b = a
as well as 
	a*(2^kn) % b = a, where k is any constant
a*(2^kn) is the equivalent of shifting the value 'a' k * n number of 
places. (Similar to the examples above).
			

If the above statement is true, then what the %31 actually does is


0000e 0000j 0000c 0000h 0000a 0000f 0000k 0000d 0000i 0000b 0000g 0000l
%                                                                 11111

is the same as
	(0000l % 11111) + 
	(0000g 00000 % 11111) + 
	(0000b 00000 00000 % 11111) + .. etc
which amounts to 
	l + g + b ... etc
Which is the sum of the set bits. 
			

Thrilled with this discovery, I figured that it not valid unless I can prove that "a*(2^kn) % b = a" is true.


Lets generalize the proof, 
	Assume a number system of base 'b' 
	(where for binary b = 2, decimal b is 10 etc, 
	for this case b = 32).
	So our divisor is b-1 (in this case 32 - 1  = 31)
    
	Let a < b-1
		a % (b-1) = a 
We need to prove
	a*b^n % (b-1) = a

	a*b^n % (b-1) = a
	(a*b^n - a) % (b-1) = 0
Therefore we need to prove that (a*b^n - a) is a multiple of (b-1). 
	(a*b^n - a) = 
	a*(b^n - 1)

Now a is definitely not a multiple of b-1 (a < b-1), so (b^n -1) should 
be proven to be a multiple of b-1.
	b^n-1
	b^n-b +b-1
	b*(b^(n-1) -1) + (b-1)

Recursively expanding this would amount to 
	b^(n-1)*(b-1) + b^(n-2)*(b-1)... b^0*(b-1)

Since each of these factor is a multiple if b-1, we know that b^n-1 is 
a multiple of (b-1). Thus (a*b^n - a) % (b-1) = 0. That is irrespective 
of the value of 'n', the remainder remains the same. 
			

Sidharth actually helped out with the expansion of terms in above derivation (thanks buddy).
If the whole thing looks a little contrived to you, consider substituting values. If b = 10 and lets say a = 2.
a % (b-1) = 2 % 9 = 2.
(a * b^n) % (b-1) = (lets take n as 3) (2 * 10^3) % 9 = 2000 % 9 = 2

       	
unsigned int v; // count the number of bits set in v
unsigned int c; // c accumulates the total bits set in v

// option 1, for at most 12-bit values in v:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 
Thus 
	v * 0x1001001001001
		will repeat the value 'v' five times in the 
		64 bit register that holds this value.
	& 0x84210842108421
		will select every 5th bit. Since 12(the width 
		of 'v') and 5 are mutually prime, this select 
		all the bits of 'v' and pad them with filler 0s.
	% 0x1f
		will add up every group of 5 bits, effectively 
		adding all the bits of 'v'.
			

 

With that settled it felt like we understood the approach. And as a matter of fact we ended up teaching ourselves something new about modulus. Which got me wondering if I can "optimize" on this approach.

More on that in Part 2.

Roshan James
Monday,
March 08 2004