Atlas Informatik How-To

Count bits in an array, 100x faster

"A brief History of Times"
 

Computer programmer's question

What is the most efficient way to count the total number of set bits in a large array of 16-bit values?

This is a standard question probing a computer scientist's knowledge, one that even Google has already asked its job applicants.[1][2]

Answer

I thought I would have some fun investigating it too, based on my experience as a senior computer science engineer. The result is a Visual Studio solution (source code) with many comments, in which the optimization steps are shown and the speeds are measured. If you are interested, you can download the complete solution or just the program executable and run it on any Windows machine. In the following I show which optimization steps can be taken.

The starting point is an array named 'array' filled with a large amount of 16-bit integer values. In my solution it's a million, but you can change it in the program.

A first, naive, straightforwardly programmed solution:

int count = 0;
for (int i = 0; i < array.Length; i++)
{
  int value = array[i];
  if (value < 0) value += 65536;      // Int16 is signed; map negative values to 0..65535
  int bitCount = 0;
  for (int bit = 0; bit < 16; bit++)
  {
    if (value % 2 == 1) bitCount++;   // test the lowest bit
    value /= 2;                       // move the next bit down
  }
  count += bitCount;
}
Let's start optimizing that. The inner loop can be written with a shift instead of a division:

for (int bit = 0; bit < 16; bit++)
{
  if (value % 2 == 1) bitCount++;
  value >>= 1;
}

To avoid the modulo operator '%' (a combination of division and subtraction), you can work with AND ('&'):

int count = 0;
for (int i = 0; i < array.Length; i++)
{
  int value = array[i];
  int bitCount = 0;
  int probeBit = 1;
  for (int bit = 0; bit < 16; bit++)
  {    
    if ((value & probeBit) != 0) bitCount++;
    probeBit <<= 1;
  }
  count += bitCount;
}

The comparison "unequal 0" is also a bit faster than a comparison with 'probeBit', because 0 is a constant and the processor already has a hardware instruction for it. Anyone who has been in the IT business for some time knows this for example under the name 'BEQ' (Branch if equal). But the question is whether the IL implements that at all.

Now you can see that the declarations of 'bitCount' and 'probeBit' are executed once for each array element. Even though the variables already exist the second time around, the declarations still have to be processed on every iteration. In general, you can move such iteration variables up and out of a loop. You then lose some of the checks the compiler does for you to ensure that the variables are not accessed outside their intended scope, but in our case this is completely irrelevant: the loop is small, and what we want is speed.

int count = 0;
int elementBits, probeBit;

for (int i = 0; i < array.Length; i++)
{
  int element = array[i];
  elementBits = 0;
  probeBit = 1;
  for (int bit = 0; bit < 16; bit++)
  {
    if ((element & probeBit) != 0) elementBits++;
    probeBit <<= 1;
  }
  count += elementBits;
}

We can now also do without 'elementBits' and directly increase 'count':

int count = 0;
int probeBit;

for (int i = 0; i < array.Length; i++)
{
  int element = array[i];
  probeBit = 1;
  for (int bit = 0; bit < 16; bit++)
  {
    if ((element & probeBit) != 0) count++;
    probeBit <<= 1;
  }
}

Doesn't bring much, but still a little bit.

Testing the lowest bit in C# is a bit awkward. Other programming languages often have a function 'Odd(x)' for this, but it is missing in C#. If you think outside the box, you notice that the sign is also represented by a single bit. So you can just as easily test the highest bit instead of the lowest one:

for (int i = 0; i < array.Length; i++)
{
  Int16 value = array[i];
  for (int bit = 0; bit < 16; bit++)
  {
    if (value < 0) count++;
    value <<= 1;
  }
}

Now you would think there is nothing more to simplify. But far from it. Already in the year 1960 Peter A. Wegner found an even simpler method to count the bits in one word. This method looks like this:

for (int i = 0; i < array.Length; i++)
{
  uint value = (UInt16)array[i];
  
  while (value != 0)
  {
    value &= value - 1;
    count++;
  }
}

Wait, a minus one combined with an AND: how is that supposed to work? It's not so easy to understand. My analysis shows the following:

  • If the value has the lowest bit set, it's the simple case: subtracting 1 just clears this bit (because the value was odd) and nothing else happens. So in 50% of the cases it's really easy to understand.
  • If the value is even, the subtraction of 1 borrows through the trailing zeros, flipping bits from right to left. The cascade stops at the first 1, which is flipped to a 0 (example: 1000 -> 0111). Because all the zeros to the right of that 1 became ones, they are cleared again immediately by the following AND (they are necessarily the inverse of the original bits). Only the change of the 1 to a 0 remains.

Conclusion: In both cases, even or odd, exactly one 1 is cleared, and in both cases it is the rightmost one. So we can increment 'count' on each iteration until the value reaches 0.
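
To make the cascade concrete, here is a short illustrative trace (hypothetical value, not part of the measured solution):

// Wegner trick traced for the 8-bit value 0xB4 = 1011 0100 (4 set bits)
//   value          value - 1        value & (value - 1)
//   1011 0100      1011 0011        1011 0000   <- rightmost 1 cleared
//   1011 0000      1010 1111        1010 0000
//   1010 0000      1001 1111        1000 0000
//   1000 0000      0111 1111        0000 0000   <- loop ends after 4 iterations
uint value = 0xB4;
int count = 0;
while (value != 0)
{
  value &= value - 1;
  count++;
}
// count is now 4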

The last method to consider is whether 'volatile' might bring something. Contrary to a widespread assumption, this modifier does not keep a variable in a register; rather the opposite: it tells the compiler that a field may be changed from outside, so every access has to go through memory and register caching is suppressed. So I suspect it won't help; the compiler's own register allocation is already clever enough on its own.
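
For reference, 'volatile' can only be applied to fields, not to local variables, so even trying it means hoisting the counter into a field; a minimal sketch (hypothetical, not part of the solution):

class BitCounter
{
  // volatile forces every read and write of this field to go through memory,
  // which tends to hinder the optimizer rather than help it
  private volatile int count;
}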

But now we're really done with optimizing. You can't go any further with this approach. After all, these measures have brought us already a factor of 9.3 compared to the straightforward programmed version.


We would gain a lot if we could get rid of the inner loop that counts the bits in a word. This can be done by precalculating the bit count of every possible value and storing the results in a lookup table. First, you can be a bit conservative and build the table only for 8 bits; that saves some space:

public static byte[] Create8BitLookupTableOptimized()
{
  byte[] table = new byte[256];
  FillLookupTablePartition(table, 0, 256, 8, 0);
  return table;
}

private static void FillLookupTablePartition(byte[] array, int partitionOffset, int partitionSize,
                                             int bits, byte countSoFar)
{
  byte countPlusOne = (byte)(countSoFar + 1);   // byte arithmetic is promoted to int in C#, so a cast back is needed
  if (bits > 1)
  {
    int subSize = partitionSize >> 1;
    FillLookupTablePartition(array, partitionOffset, subSize, bits - 1, countSoFar);
    FillLookupTablePartition(array, partitionOffset + subSize, subSize, bits - 1, countPlusOne);
  }
  else
  {
    array[partitionOffset] = countSoFar;
    array[partitionOffset + 1] = countPlusOne;
  }
}

Filling can be done by a recursive method. The task "fill a table for x+1 bits" splits into "fill the lower half with the counts where the new bit is 0" and "fill the upper half with the counts where the new bit is 1 and therefore adds one".
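
A quick sanity check of a few entries of the generated table (illustrative only):

byte[] table = LookupTables.Create8BitLookupTableOptimized();
// table[0]   == 0   (0000 0000, no bits set)
// table[1]   == 1   (0000 0001)
// table[170] == 4   (1010 1010)
// table[255] == 8   (1111 1111, all bits set)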

For fast execution, it is also important to ensure that arrays are always dimensioned to their full size right from the start. Otherwise, it may happen that an array is enlarged in several steps and thereby copied. This would waste a lot of computing power.
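
As a small illustration of the difference (hypothetical code, not in the solution):

// One allocation at full size: nothing ever needs to be copied
byte[] table = new byte[65536];

// A growing List<byte> (System.Collections.Generic) would instead reallocate
// and copy its backing array several times on the way to 65536 elements
List<byte> growingTable = new List<byte>();
for (int i = 0; i < 65536; i++) growingTable.Add(0);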

The counting method based on such an array is very simple and therefore very efficient:

if (lookup8Bit == null)
  lookup8Bit = LookupTables.Create8BitLookupTableOptimized();
int bitCount = 0;
for (int i = 0; i < array.Length; i++)
{
  byte[] bytes = BitConverter.GetBytes(array[i]);
  bitCount += lookup8Bit[bytes[0]] + lookup8Bit[bytes[1]];
}

The measurements show that filling the table takes almost no time. The larger the array to be processed, the more worthwhile it is to create a table. In my example with 1 million values, creating the table takes about 3 thousandths of a percent of the array calculation time. For an array with only 10'000 values it is still only about 3 thousandths. It is therefore worthwhile in any case to build such a lookup table.

But have we now won something with it? Yes, an enormous amount. The method with the lookup table is 25 times faster than the straightforwardly programmed version.

What can we still improve here, after all? Well, using the BitConverter class smells like overhead. Let's do it ourselves; it's not witchcraft:

UInt16 value = (UInt16)array[i];
bitCount += lookup8Bit[value & 255] + lookup8Bit[value >> 8];

Right! Not only at mom's does the homemade stuff taste better; it's the same in computer science. It is almost 4 times faster.

If that has brought so much, why not precalculate all 65536 values at once? Nowadays a computer has gigabytes of memory, and such a table needs only 64 kB. Since a 16-bit value contains at most 16 set bits, each entry would even fit into 5 bits, so the table could be stored in a compact form of 40'960 bytes if needed. That poses no problem unless we want to squeeze it into a microcontroller. But we have to get around a small obstacle here, namely that the array contains signed integer values, which can be negative. As a consequence, the counting method looks like this:

if (lookup16Bit == null)
  lookup16Bit = LookupTables.Create16BitLookupTableOptimized();
int bitCount = 0;
for (int i = 0; i < array.Length; i++)
{
  int value = array[i];  // store the value and don't access it multiple times
  if (value >= 0)
    bitCount += lookup16Bit[value];
  else
    bitCount += lookup16Bit[65536 + value];   // -1 means 65535
}
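
The method Create16BitLookupTableOptimized itself is not listed here; a minimal sketch of how it could look, reusing the recursive filler from above (an assumption, not necessarily the original implementation):

public static byte[] Create16BitLookupTableOptimized()
{
  byte[] table = new byte[65536];                      // full size right from the start
  FillLookupTablePartition(table, 0, 65536, 16, 0);    // 16 bits instead of 8
  return table;
}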

Now we have already reached a factor of 30. That has paid off extremely well. Can we still tune this a bit? The comparison with 'if' can be simplified by making 'value' positive beforehand:

int value = array[i];  // store the value and don't access it multiple times
if (value < 0) value += 65536;
bitCount += lookup16Bit[value];

But it brings only an insignificant gain, the factor is still stable around 30.

Any operation within the loop should be avoided. How about telling the compiler to treat the integer as unsigned? Then we can also get rid of the temporary variable inside the loop:

int bitCount = 0;
for (int i = 0; i < array.Length; i++)
{
  bitCount += lookup16Bit[(UInt16)array[i]];
}

Yep. The measurement shows we are now about 100 times faster than the straightforwardly programmed solution.


Now we could dedicate ourselves to the table creation. Why create a table over and over again when it is constant? Wouldn't it be faster to just load it? So we simply save it to a file and embed it into the resources of the executable. If necessary we can then just load it.

The measurement shows that it is about 1% faster to load such a table than to calculate it:

Stream stream = Assembly.GetExecutingAssembly().GetManifestResourceStream("AtlasInformatik.CountBits.LookupTable16.bin");
byte[] lookupTable;
using (BinaryReader binaryReader = new BinaryReader(stream))
{
  lookupTable = binaryReader.ReadBytes((int)stream.Length);
}
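
Writing the table out in the first place, so that it can be embedded as a resource, could look like this (sketch; the file name is only an example and must match the resource name used above):

byte[] table = LookupTables.Create16BitLookupTableOptimized();
File.WriteAllBytes("LookupTable16.bin", table);   // System.IO; then add the file to the project as an embedded resource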

If we want to save some space, we can compress the lookup table. The numbers in it are limited to the range 0..16, where 0 only occurs once at the beginning and 16 only once at the end. If we hardcode these two values, the number range shrinks to 1..15 and we only need 4 bits per number. So we can shrink the table to half:

byte[] compressed = new byte[lookupTable.Length >> 1];

// Last iteration is length-3 which will pack length-3 and length-2
for (int i = 1; i < lookupTable.Length - 2; i += 2)
{
  compressed[i >> 1] = (byte)(((lookupTable[i + 1] - 1) << 4) | (lookupTable[i] - 1));
}

When loading, we unpack it like this:

byte[] lookupTable = new byte[65536];

// lookupTable[0] = 0;   Not necessary, C# always initializes data types with zeros

int lookupIndex = 1;
for (int i = 0; i < compressed.Length - 1; i++)
{
  lookupTable[lookupIndex] = (byte)(1 + (compressed[i] & 15));
  lookupIndex++;
  lookupTable[lookupIndex] = (byte)(1 + (compressed[i] >> 4));  // upper nibble; separate increment lines improve readability
  lookupIndex++;
}

lookupTable[65535] = 16;  // all 16 bits set

The measurement shows that this is about as fast as the uncompressed version, but we reduce the program size by 32 kB.

Let's go back to the counting method based on such a table. What does it look like if we also place everything in an 'unchecked' block? A few overflow checks could fall away, right? No, unfortunately the measurements show hardly any difference.
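
For reference, the attempt simply wraps the counting loop in 'unchecked' (sketch):

int bitCount = 0;
unchecked
{
  for (int i = 0; i < array.Length; i++)
  {
    bitCount += lookup16Bit[(UInt16)array[i]];
  }
}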

But maybe we can coax more optimizations out of the compiler. There is also the attribute [MethodImpl(MethodImplOptions.AggressiveInlining)]. But unfortunately here, too, no success; this hardly brings any speed-up either.
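
Applied to a small counting helper, the attribute would look like this (the helper name is hypothetical, just to show the syntax):

// requires: using System.Runtime.CompilerServices;
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static int CountBits(byte[] lookup16Bit, Int16 value)
{
  return lookup16Bit[(UInt16)value];
}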

One more idea: how about not loading the table, but putting it directly into the code? Here C# is unfortunately not as well equipped as C. C has a sophisticated preprocessor whose macros can even take parameters; with that at hand it would be easy to generate the table recursively. In contrast, the C# preprocessor offers only a few rudimentary directives and certainly no parameterized macros. So you have to do it another way. For example, you could use the T4 mechanism (Text Templates) to generate code. Although I have worked with it intensively, it was too much to take on for this study. I only implemented a very simple approach: the solution contains a few routines that generate source code using a StringBuilder and copy it to the clipboard. From there I pasted it into the source code.
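
Such a generator routine could look roughly like this (a hypothetical sketch of the idea, not the actual routines in the solution; StringBuilder comes from System.Text, and Clipboard needs a reference to System.Windows.Forms plus an [STAThread] entry point):

StringBuilder code = new StringBuilder();
code.AppendLine("static readonly byte[] Lookup8Table = new byte[]");
code.AppendLine("{");
for (int value = 0; value < 256; value++)
{
  int bits = 0;
  for (int v = value; v != 0; v &= v - 1) bits++;   // Wegner count
  code.Append(bits).Append(", ");
  if ((value & 15) == 15) code.AppendLine();        // line break every 16 entries
}
code.AppendLine("};");
System.Windows.Forms.Clipboard.SetText(code.ToString());   // paste into the source code from here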

The measurements show that such an approach is slower than the 16-bit lookup array; you lose slightly more than a factor of 3 in speed.

"Just one more thing" [Steve Jobs]: The following idea is based on my C knowledge: In C, the compiler converts a large switch statement to a lookup table. Very good idea of the creators. If this would also be the case with C#, you could code the table in the form of a switch statement. This attempt is contained in the source "LookupTables.cs". The counting method based on such an 8-bit-switch would then look like this:

int bitCount = 0;
for (int i = 0; i < array.Length; i++)
{
  UInt16 value = (UInt16)array[i];
  bitCount += LookupTables.Lookup8(value & 255) + LookupTables.Lookup8(value >> 8);
}
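
The Lookup8 method in "LookupTables.cs" is essentially one giant switch; schematically it looks roughly like this (abbreviated sketch, not the full 256 cases):

public static int Lookup8(int value)
{
  switch (value)
  {
    case 0:   return 0;
    case 1:   return 1;
    case 2:   return 1;
    case 3:   return 2;
    // ... cases 4 to 254 omitted in this sketch ...
    case 255: return 8;
    default:  throw new ArgumentOutOfRangeException(nameof(value));
  }
}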

Sadly you lose a factor of 10 with this. The assumption that the compiler implements the switch as a table is therefore falsified for C#. This approach doesn't work.

I also tried to implement a 16-bit switch, but a switch with 65536 cases overwhelmed the compiler. Funnily enough, it freezes during compilation. That would be something for an entry on a hypothetical page "How to drive a C# compiler nuts". The Visual Studio text editor also has a lot of trouble with the statement; it hangs for several seconds or sometimes minutes. So we could actually drive that one nuts too (:-)

 

Measurement results

Let's come to the hard facts. These are the measurements when the compiler does not optimize the code (a checkbox in the project properties). The displayed times are in ticks (1 tick = 100 nanoseconds):


Measurement with compiler optimization switched on:


Conclusion

The fastest method is the loading and usage of a precalculated 16-bit lookup table.

Ticking the box "Optimize Code" in the project properties results in a significant speed improvement of about a factor of 4.

Even though C# is generally at a disadvantage in optimizations compared to C because it builds on the Intermediate Language, this example shows that an optimization by a factor of 100 is sometimes actually within the realm of the possible. Of course, it always depends on the task you want to optimize, and it takes enough know-how, experience and, of course, some time. You should always keep in mind: if the program will be used often in the future and the factor is large, it is always worth investing time in an optimization. It will pay off quickly.

 

Downloads

 
