Tuesday, May 10, 2011

Unroll that loop, homeboy

A month or so ago I was wrestling with a pretty thorny algorithmic problem at work.  Eventually I worked out a solution that worked great on our Windows emulator, so I figured my job was done...was I ever wrong.

Once I got it running on our embedded processor (a lowly ARM926EJ-S at ~270 MHz), I quickly found my algorithm had some serious performance issues.  So, I fired up oProfile and did some profiling so I could see where I was eating cycles.  There were a couple hotspots that I went after, but one was a loop that looked like so:

UINT64 currentRow;

...
for(int row = 0; row < NUMBER_OF_ROWS; row++)
{
  currentRow = rows[row];
  for(int column = 0; column < NUMBER_OF_COLUMNS; column++)
  {
    if( (currentRow >> column ) & 1 )  // Check if this bit is set
      DoSomeWorkOnThisPointBaby(col,row);
  }
}

Nested loops...yuck.  This code ran for every column and row, shifting the current row over to check and see if the bit was set. To make matters worse, the row was represented with a 64 bit value, which costs extra to shift.  Even worse, this wasn't work I could figure out a way of avoiding entirely--enumerating over all the values was really a core part of the algorithm.

So, I decided to see if I could at least speed up the routine by optimizing. First, I unrolled the loop.  As wiki states, the goal of loop unrolling is to increase a program's speed by reducing (or eliminating) instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration.

So instead of the inner loop iterating through all of the columns, it gets replaced with a gigantic block of if-statements:

UINT64 currentRow;

...

if( currentRow & 0x1 ) //first column
  DoSomeWorkOnThisPointBaby(col,row);
if( currentRow & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(col,row);

...

if( currentRow & 0x8000000000000000 ) // 64th column!
  DoSomeWorkOnThisPointBaby(col,row);

This removes all of the overhead of the for loop--no more incrementing the counter, and no more testing to see if we're at the end of the loop.  It also avoids an expensive shift operation to test each column value.

But we can do even better!  One issue is currentRow is a 64 bit value, but we're on a 32 bit processor.  Each 64 bit AND takes multiple instructions, whereas a normal 32 bit AND is a single instruction on most platforms.

So, to speed it up even more, you can break the 64 bit value into its respective halves and test them individually--so instead of 64 64-bit AND operations, it now performs 2 shifts and 64 32-bit AND operations. Another extra bonus is you only need to code 32 constants into your code now:

UINT64 currentRow;

...
UINT32 val = (UINT32) (currentRow & 0x00000000FFFFFFFF);
if( val & 0x1 ) //first column
  DoSomeWorkOnThisPointBaby(col,row);
if( val & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(col,row);

...

val = currentRow >> 32;
if( val & 0x1 ) //first column
  DoSomeWorkOnThisPointBaby(col,row);
if( val & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(col,row);

...

I'm sure I could have written the code in assembler for further improvement, but my code would no longer be portable.  Furthermore, in C these improvements will apply to almost every target platform.  Admittedly my solution will be slightly slower than it could be on a 64-bit platform where a 64-bit shift operation would be single-instruction, but not enough to matter.

The real question, though, is what sort of performance gains did I reap? After measuring again in oProfile, I was spending one-fourth the time in the function that I previously did, and the work definitely helped me get the algorithm to fit on a lowly 270 mhz processor.