A Modulo Based Bit Counting Approach For 64 Bit Systems

Part 2 : Extending To 16 Bits

Summary: This is a two part series. The first part describes how the expression "(v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f" yields the bit count (number of set bits) of 'v' where 'v' is a 12 bit value. Part 1 also describes a general proof on which this code snippet is based. Part 2 uses the understanding in the first article to generate a code snippet that can bit count a 16 bit value using the a 64 bit register and discusses tradeoffs. 

In the previous section we had an examination of what happens when a number is modulo-ed with a value (b^n-1) where b is the base of the number system and n is any positive integer. In terms of binary systems, we say that is is easy to visually predict the result of modulo operation where the divisor is of the form 2^n-1.

If 'x' in a 'm' bit number, then x % (2^n-1) is equal to splitting x into sets of 'n' bits values and summing them. As an example, if x = 1001001 and n is 2, 2^2-1 = 3 = 11b. 
x % 3 = 1001001b % 11b = (splitting x in group of 2 bits each) 1+00+10+01 = 1 + 2 + 1 => modulus is 4%3 = 1.

If that is simple enough, let us progress. The original statement that calculated the number of bits in a 12bit number is 
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 

The divisor is 0x1f, thus n is 5. With a little bit of thought we can see that the maximum value of the count of bits in a 12 bit number is 12. So it figures that 4 bits and probably adequate. When n =4, 2^n-1 is 15 or 0xf.

Can we change the above formula so that we can do a modulo by 0xf and still yield the count of bits of a 12 bit number? Let us see.

It is obvious that bits of the original 12bit number have to be spaced at every fourth bit, instead of being placed at every fifth bit like it was before. Every fourth bit would imply that our AND value change from  0x84210842108421 to a simpler 0x11111111111.

However this is not as easy as it looks. When taking n=5 and the a 12bit number, we are considering 12 and 5 which are mutually prime. if we substitute out changes into the previous expression:
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f;

and make it 
c = (v * 0x1001001001001ULL & 0x11111111111111ULL) % 0xf;

it would not work. The mutual primality of 5 and 12 ensure that every bit that is selected is a different one of the twelve bits. This is however not true of the 4 and 12. Let me try and illustrate this

 

 

The solution I could think of in this situation is to make sure that the value in the subsequent 0,4 and 8th bit positions are not the same bit. By skewing the 12 bit series a little we can make sure that our AND operation will select different bit values each time. 

Thus the expression becomes: 
c = (v * 0x8004002001ULL & 0x11111111111111ULL) % 0xf;

The multiplication of the v has now changed we each subsequent repetition of the 12 bit series has shifted by one place. Let me try and illustrate the differences between v * 0x1001001001001 and v * 0x8004002001

Notice that the number becomes a 60bit (12*5) number Where the 12 bit pattern repeats 5 times end to end.

Notice that each line is shifted by 1 bit as a consequence of being multiplied by 1,2,4 and 8.

Now that the bits are skewed a little I can select every 4th bit and end up selecting each bit of the 12 bit number exactly once. Instead of AND-ing with 0x11111111111111 it is more appropriate to AND with 0x888888888888 (12 8s). the following diagram shows the results:

Notice that every bit that constitutes the 12 bit number is selected once.

The result of the AND would give us every bit padded with 3 zeroes. But there is an issue that they are the higher order bits of each nibble that they are part of. For  modulus to be calculated correctly the bits must be the LSB bit of their n-bit bit section not the MSB bit. This is easily fixed by a bit shift of 3 places.

       	
So in conclusion the expression for count of bits
of a 12 bit value of a machine that allows up to 
64bit registers looks like this:
c = ((v * 0x8004002001ULL & 0x888888888888ULL)>>3) % 0xf;
The inner part "(v * 0x8004002001ULL & 0x888888888888ULL)"
does bit selections as shown:
which are bit shifted to generate the value for the modulus
"( <value> >>3)"
 
This value when modulo-ed with 0xf we get the expected count
of bits 
Thus the sum of bits of 12 numbers can be calculated by
c = ((v * 0x8004002001ULL & 0x888888888888ULL)>>3) % 0xf;
			

The expression above looks a lot like the original expression 
c = (v * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 

But it seems to have the overhead of doing an additional bit shift. But if you look closely you can see that this expression uses at most 48bits of the 64bit register at any time. This is because the 12 bit value needs to be repeated only 4 times now, instead of 5 times as in the previous case.

Can we do anything with the bits we have saved in the 64 bit register?

Extending to counting 16bit values in a 64 bit register

A simple bit of calculation would give us 16*4 = 64. Currently we are counting for a 12 bit value and are using only 48 bits of the register. Can we count the bits in a 16 bit value and use the whole register?

To expand a 16 bit value out and skew it 4 times we do
v * 0x8000400020001

To select every 4th bit we use 
(v * 0x8000400020001) & 0x8888 8888 8888 8888

Finally we normalize it by bit shifting it 3 places right. 
((v * 0x8000400020001) & 0x8888 8888 8888 8888) >> 3

And finally we count the number of bits by taking our modulus
c = (((v * 0x8000400020001) & 0x8888 8888 8888 8888) >> 3) %0xf

Hey... wait a minute. If its a 16 bit number that we are using, then the number of set bits can be any value between 0 and 16. But doing a modulo 15 will return only value in the range 0 to 14 ! Therefore  if 15 bits are set the value of c will be 0 and if 16 bits are the value of c will be 1.

This is wrong. The first impulse to the solution is to put some check around it and to make sure that the value 'v' passed is not 0 and not 0xffff. Thus is the calculation returns c = 1, then it can be because only one bit is set and if c=0 then it can be only because 15 bits are set.

Counting bits in a 16 bit value:
c = (((v * 0x8000400020001) & 0x8888888888888888) >>3)% 0xf
The equation can be used in its current form if we can make 
some assumption about the value of 'v' that is passed or if
we put checks around the expression:

if (v && ~v)
{
 c = (((v * 0x8000400020001) & 0x8888888888888888) >>3)% 0xf
 c = c?c:15;
}
		

However the fact that 3 condition checks are needed in the general case didn't look too appealing. What else can we do.

The problem here is that the sum of bits can be greater than 14. If somehow we can ensure that the sum of bits is lesser that 14, we can get past this issue. But how do we do that, there are 16 bits, each of which can be 1.

Lets go back to our approach with modulus. Remember that if 'x' in a 'm' bit number, then x % (2^n-1) is equal to splitting x into sets of 'n' bits values and summing them. 
The example I gave earlier was, x = 1001001 and n is 2, 2^2-1 = 3 = 11b. 
x % 3 = 1001001b % 11b = (splitting x in group of 2 bits each) 1+00+10+01 = 1 + 2 + 1 => modulus is 4%3 = 1.

What if I changed the width of the number doing the modulus? That is to say, instead of doing:
c = (((v * 0x8000400020001) & 0x8888888888888888) >>3)% 0xf

I do
c = (((v * 0x8000400020001) & 0x8888888888888888) >>3)% 0xff

This would mean that the value would be split into groups of 8 bit values each which would get added up.

This is the layout immediately prior to the modulo

Doing % 15 would cause each of the 4 bit sections to get added up. However doing a %0xff will cause an addition like this to occur:

In doing this notice that each half of the 8bit sum is actually the sum of 8 bits. So neither of these can overflow into the sum of the other part. This makes it very convenient for us to get the value of the original count back using some simple bit shifts.

The general solution to counting the bits in a 16 bit number using 
a 64 bit register is shown below. The initial value of 'c' consists
of two partial sums. 
These two partial sums of 4 bits each are added together in the 
second statement, thus yielding the total number of set bits.
Counting bits in a 16 bit value:
c = (((v * 0x8000400020001) & 0x8888888888888888) >>3)% 0xff
c = c>>4 + c&0xf
		

This is considerably less expensive that the previous case where we had to have two sets of expensive operations (one for finding a 12 bit sum and the other for finding the sum of the remaining 4 bits.

This solution maybe extended to v of 32 bits as shown: 

c1 = ((((v&0xffff) * 0x8000400020001) & 0x8888888888888888) >>3)% 0xff
c2 = ((((v>>16) * 0x8000400020001) & 0x8888888888888888) >>3)% 0xff
c = c1>>4 + (c1 & 0xf) + c2>>4 + (c2 & 0xf)
		

An alternate approach for 32 bits is shown:

This solution maybe extended to v of 32 bits as show. This 
uses a mod 0xfff and a two level shifting strategy. This 
eliminates the use of the intermediate variable above.

c = (((((v&0xffff) * 0x8000400020001) & 0x8888888888888888) >>3) +
    ((((v>>16) * 0x8000400020001) & 0x8888888888888888) >>3))% 0xfff
c = c>>8 + ((c>>4) & 0xf) + (c & 0xf)
This can be compared to the original:
c = (((v & 0xfff) * 
      0x1001001001001ULL & 0x84210842108421ULL) +
      ((v & 0xfff000) >> 12) * 
      0x1001001001001ULL & 0x84210842108421ULL)) % 0x1f; 
c += ((v >> 24) * 0x1001001001001ULL & 0x84210842108421ULL) % 0x1f; 
Total number of operations in original snippet:
 * = 3  %  = 2
 & = 5  >> = 2 + = 2
Total operation in the new snippet: 
 * = 2  %  = 1
 & = 5  >> = 4 + = 3

Arguably the code I have derived here is not much faster that 
the original code. I have reduced on two 'expensive' operations
'*' and '%' at the cost introducing two extra '>>' and one '+'. 
There is probably room for optimization here, but I will leave 
that to you or for another day. 
	

By the time we had figured out this much it was 3:30 am. Since the coffee shop had closed long back, we decided to go crash. 

Roshan James
Monday,
March 08 2004