Wednesday, December 05, 2012

Thank You, Gordon Moore

In October of 2007, I wrote about hardware-assisted brute force attacks.  Namely, that it was an error to assume today's computing power was what we'd have access to in the future.  For example, if you know some computing operation is going to take ten years, one option is to wait five years, at which point (assuming Moore's law has at least quadrupled the available computing power by then) your operation will only take 2.5 years.  You save 2.5 years.

Well, today I stumbled across this:
In a test, the [GPU cluster] was able to churn through 348 billion NTLM password hashes per second. That renders even the most secure password vulnerable to compute-intensive brute force and wordlist (or dictionary) attacks...this means we could rip through any 8 character password (95^8 combinations) in 5.5 hours.
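As a quick sanity check of that claim (the 348 billion hashes/second rate is from the article; the rest is just arithmetic over the 95 printable ASCII characters), here's the back-of-envelope math in C++:

#include <cmath>
#include <cstdio>

int main()
{
    // 95 printable characters, 8-character passwords, and the quoted
    // rate of 348 billion NTLM hashes per second.
    const double keyspace = std::pow(95.0, 8.0);       // ~6.6e15 combinations
    const double hashesPerSecond = 348e9;
    const double hours = keyspace / hashesPerSecond / 3600.0;
    std::printf("%.2e combinations -> %.1f hours\n", keyspace, hours);
    return 0;   // prints roughly 5.3 hours, in line with the quoted 5.5
}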
The point is: if you wait five years, the hardware you get is exponentially faster than what was available at the time.  In 2018, are we going to be down to minutes?  And what about combining this approach with rainbow tables?

Sunday, November 18, 2012

Rewriting DirectShow?

DirectShow is a pretty tenacious API, having been in use for approaching two decades at this point.  The official party line from Redmond is "use MediaFoundation," but that advice has never seemed particularly practical given that MediaFoundation is absent from older versions of Windows.

I've always (perhaps foolishly) thought it'd be fun to write my own media framework.  So I spent some time figuring out what I'd hope to improve over DirectShow, and came up with the following list:

  1. I dislike most things about IMediaSample
    1. I dislike the way it communicates a media type--through AM_MEDIA_TYPE.  AM_MEDIA_TYPE is awkward to work with, easy to use incorrectly, littered with legacy fields nobody really cares about anymore, easy to leak memory with, and generally difficult to understand.
    2. I dislike that media samples themselves don't have a method for communicating their media type.  You might counter with IMediaSample::GetMediaType(), but that method provides the media type only if it differs from the previous media type.
  2. Filters consist of separate "pin" and "filter" interfaces despite being one object; this distinction has always seemed nonsensical to me.  A pin really just describes one of the inputs or outputs a filter exposes, plus a means of "connecting" to the filter; the filter is...presumably everything else.  The separation makes writing filters messy and communicating with them awkward, and I don't see a clear reason for it.  It also creates coupling problems (for example, see CSource and CSourceStream in the DShow base classes, particularly how a pin adds itself to a filter and deals with reference counting).
  3. Support for B-frames is marginal at best.  DirectShow has only a timestamp--not separate timestamps for presentation and decode time--which makes handling certain types of media difficult.
  4. I dislike how duration is represented--fixed 100 ns units.  A better way to represent duration is to give each media stream a timescale; this avoids nearly all rounding error and most floating point math, except at endpoints and render time.
    1. When you think about it, specifying a fixed duration unit is really the same as specifying a timescale, just less flexible.  A duration based on 100 ns units is the same thing as a timescale of ten million (there are ten million 100 ns units in one second).
    2. Example: we're processing 8 kHz audio.  The natural timescale here is 8000.  When buffers come along, the "duration" of a buffer is simply the number of samples it contains: an incoming buffer of 1024 samples gets a duration of 1024, i.e. 1024/8000 seconds.
    3. Example: we're processing H.264.  Many applications use a timescale of 90000, although accuracy down to DShow's 100 ns units can be had by setting the timescale to ten million.
    4. That last example is worth mentioning because it makes compatibility with other media frameworks--like DShow--easy.  Set the timescale to ten million and, voila, no translation is necessary.  (A rough sketch of a sample type built around these ideas follows this list.)
  5. I dislike DirectShow's sample pool concept.  This was probably a great design choice in the early 90s, when processing media was a pretty serious task, but these days....it's irrelevant.
  6. Automatic graph building is a cool idea, but it's never worked well for me in production code because arbitrary filters may be inserted into the graph.  It's neat for quick-and-dirty apps that do very little; it's pretty much pointless for a serious production application.
  7. Surprisingly, many of the filters I'd expect to be built in are absent.  For example, there isn't a comprehensive filter for colorspace conversion, audio resampling, and so on.  Many of the "bread and butter" filters you often desperately want simply aren't there.
  8. Live video streaming support is horribly complicated to implement, and it's questionable whether most downstream filters even follow that part of the API correctly.  Getting AV sync to work well with a live stream gives me fits.
  9. In general, a very complicated API.  Likely because at the time it was dealing with some very complicated subjects.
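To make items 1, 3 and 4 a bit more concrete, here's a rough sketch of the kind of sample type I have in mind--one that always carries its media type, keeps decode and presentation timestamps separate, and expresses duration in stream timescale units.  The names are hypothetical and this is a thought experiment, not a design:

#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical replacement for AM_MEDIA_TYPE: a major/minor type plus an
// opaque, codec-specific blob.  No legacy fields, no manual allocation rules.
struct MediaType
{
    uint32_t majorType;                  // e.g. audio, video
    uint32_t subType;                    // e.g. PCM, H.264
    std::vector<uint8_t> formatData;     // codec-specific details
};

// Hypothetical media sample.  Every sample knows its media type and its
// timing; no "only if it changed" special cases.
struct MediaSample
{
    std::shared_ptr<const MediaType> mediaType; // always present, cheap to share
    uint32_t timescale;        // units per second: 8000 for 8 kHz audio, 90000
                               // (or ten million) for H.264; in a real framework
                               // this would probably live on the stream
    int64_t decodeTime;        // DTS, in timescale units
    int64_t presentationTime;  // PTS, in timescale units (differs for B-frames)
    int64_t duration;          // e.g. 1024 for a 1024-sample 8 kHz audio buffer
    std::vector<uint8_t> data; // the payload itself
};

With a timescale of ten million this collapses to exactly DirectShow's 100 ns units, so moving samples to or from a DShow graph requires no timestamp conversion at all.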
On the other hand, there are things I really like about DirectShow:
  1. It can be extended to handle nearly any audio or video type, metadata types, etc.  It is extremely flexible.
  2. Because it is so modular, a major change to your application often only involves replacing a filter or two, so changes end up being quite tidy.  For example, I have a playback application using MP4 files containing H.264 and AAC.  Someday, if the business requirements change, I only have to swap out one filter and make sure it implements IMediaSeeking...and everything should work exactly as it did before.
  3. IFileSourceFilter is a great idea.  I don't like the naming, but having a standard interface to configure source filters makes it much easier for client writers to swap out filters.
  4. IMediaSeeking is a great interface for advanced seek operations.  It's great to have this work well at a high level without having to be overly concerned with the guts of the underlying filter graph.
  5. Some of the utility filters, like the infinite tee, null renderer, sample grabber, etc. are always super handy.
  6. The event-driven framework--while overly complicated--is really useful given that processing audio and video (particularly in real-time) is a very "event" oriented process.
In the coming days, I'm going to take a shot at implementing a very basic media framework in C#.  People have said managed code isn't suitable for media processing; I plan to test that hypothesis, since I've seen no real evidence to back it up.

I'll update my progress here as I go.

Monday, October 29, 2012

State Machines Made Simple

I recently stumbled across a bug in some code where our embedded device would sometimes fail to behave correctly when disconnecting from our back-end servers.  This was actually the third fairly major bug in this particular code.  When I diagrammed the code as a state machine using Gliffy, it became clear we had a problem with state.  It looked something like:

Okay, okay....it wasn't that bad.  But it was very, very far from good.  Here's a (vastly) simplified overview of the flow of the code:

bool someState;
bool anotherState;
bool yetAnotherState;
while( someState )
{
    // some initial work...
}

while( !anotherState )
{
    while( ... )
    {
        // some initial work again...
    }
    // Wait here until something happens 

    // ruh roh, a disconnect, loop around...
    if( yetAnotherState )
    {
        // special blah here...
    }
}

...the full version of the "driver" method was ~200 lines of code, not counting the methods that loosely corresponded to the state transitions.  State was largely tracked with a motley assortment of boolean values.  The nested while loops were difficult to understand, as was the initial loop that seemingly duplicated the first inner loop.  State transitions were scattered across the driver method and the various state methods, in part because there was no clear home for them.

If you can clearly communicate your process with a state machine, it can be translated using a pretty typical design pattern that is A) very easy to understand, B) very easy to modify in the future, and C) very reliable.  For instance, consider this basic (and yes, slightly silly) state machine--I was doing laundry this evening, so this was what immediately came to mind:


I obviously make no claims that this method of doing laundry is correct (for what it's worth, my wife has drilled into me that the bleach goes into the water before the clothes do).  But that's really not the point.  The point is that this translates quickly, cleanly and reliably into working code.  Start by defining all your states:

enum WashingState
{
    Start,
    AddDetergent,
    StartWash,
    AddBleach,
    FillWithClothes,
    WaitUntilComplete,
    OhNoesGoToStore
}

Then, in code, do the following:

while( not done )
{
    switch( current state )
    {
    case Start:
    case AddDetergent:
        current state = have detergent ? StartWash : OhNoesGoToStore;
        break;
    case StartWash:
        // start the wash
        current state = AddBleach;
        break;
    case AddBleach:
        current state = have bleach ? FillWithClothes : OhNoesGoToStore;
        break;
    case FillWithClothes:
        // Fill with clothes...
        current state = WaitUntilComplete;
        break;
    case WaitUntilComplete:
        // wait until done
        current state = Start;
        break;
    case OhNoesGoToStore:
        // return....nothing left to do.
        return;
    }
}

In a nutshell: your enum values correspond to states, the switch statement handles transitioning between states, and the while loop causes the switch statement to repeat continuously until told not to.
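For concreteness, here's a minimal compilable C++ version of that sketch.  The have-detergent/have-bleach flags are obviously stand-ins for whatever real conditions drive your transitions, and I end after one load rather than looping back to Start forever:

#include <cstdio>

enum WashingState
{
    Start,
    AddDetergent,
    StartWash,
    AddBleach,
    FillWithClothes,
    WaitUntilComplete,
    OhNoesGoToStore
};

void DoLaundry(bool haveDetergent, bool haveBleach)
{
    WashingState state = Start;
    bool done = false;
    while( !done )
    {
        switch( state )
        {
        case Start:
        case AddDetergent:
            state = haveDetergent ? StartWash : OhNoesGoToStore;
            break;
        case StartWash:
            std::puts("starting the wash");
            state = AddBleach;
            break;
        case AddBleach:
            state = haveBleach ? FillWithClothes : OhNoesGoToStore;
            break;
        case FillWithClothes:
            std::puts("adding clothes");
            state = WaitUntilComplete;
            break;
        case WaitUntilComplete:
            std::puts("cycle finished");
            done = true;      // one load is plenty here; the sketch above loops back to Start
            break;
        case OhNoesGoToStore:
            std::puts("out of supplies--off to the store");
            return;           // nothing left to do
        }
    }
}

int main()
{
    DoLaundry(true, true);
    return 0;
}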

In some code it may be more appropriate not to have a long-running loop; in that case, individual methods that perform atomic state transitions are probably a better way to go.  Even then, the pattern remains essentially the same: methods correspond to your states, and each call atomically moves the machine from one state to the next, as sketched below.
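A rough sketch of that variant (the class, Step(), and handler names are all made up for illustration); each call to Step() performs exactly one transition and returns the resulting state:

#include <cstdio>

enum WasherState { WasherStart, WasherWashing, WasherOutOfSupplies };

class Washer
{
public:
    // Performs exactly one state transition per call.
    WasherState Step()
    {
        switch( m_state )
        {
        case WasherStart:         m_state = OnStart();   break;
        case WasherWashing:       m_state = OnWashing(); break;
        case WasherOutOfSupplies: /* terminal state */   break;
        }
        return m_state;
    }

private:
    WasherState OnStart()   { return m_haveDetergent ? WasherWashing : WasherOutOfSupplies; }
    WasherState OnWashing() { std::puts("washing"); return WasherStart; }

    WasherState m_state = WasherStart;
    bool m_haveDetergent = true;
};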

But what I enjoy about the "while( not done ) { switch( state ) {} }" pattern--particularly when it's applied to something like a long-running thread--is that it's neat, well contained, easy to understand, and easy to extend or modify in the future.

A few passing comments before I close this out:
  • Ignore articles like this one that claim "there is no way to enforce state transition rules" in switch-based state machines.  This is clearly a claim desperately in search of a weird over-engineered solution; it is trivial to enforce state transition rules.
  • Avoid allowing outside forces to influence state, particularly if this is a dedicated thread.  If you must, the best option is to apply the external transition immediately after the switch statement, before looping around the while, so it trumps whatever choice was made by the case label (a small sketch of this follows the list).  Not pretty, but neither is life sometimes.
  • Keep It Simple Stupid. There are all sorts of fancy methods out there--hierarchical state machines, crazy inheritance, template wizardry (there is some relationship between Dr. Dobbs and a total inability to embrace simplicity), etc.  It is super awesome you can make C++ jump through hoops, but please, spend thirty seconds thinking about the last time somebody handed you a plate of completely incomprehensible code.
  • Consistency matters.  Not for you, probably, but for the next person to maintain your code.  Try to be consistent about where state transitions happen, locking (if necessary), and structure of the code.
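Here's roughly what that "apply outside requests after the switch" idea looks like in practice.  The RequestState() function and the toy states are invented for the example; the only point is where the external request gets folded in:

#include <mutex>

enum MachineState { Idle, Working, ShuttingDown };

std::mutex g_requestLock;
bool g_haveRequest = false;
MachineState g_requestedState = Idle;

// Called from some other thread to nudge the state machine somewhere specific.
void RequestState(MachineState s)
{
    std::lock_guard<std::mutex> lock(g_requestLock);
    g_haveRequest = true;
    g_requestedState = s;
}

void RunStateMachine()
{
    MachineState state = Idle;
    bool done = false;
    while( !done )
    {
        switch( state )
        {
        case Idle:         state = Working; break;   // normal transitions...
        case Working:      state = Idle;    break;
        case ShuttingDown: done = true;     break;
        }

        // Apply any external request *after* the switch, just before looping
        // around the while, so it trumps whatever the case label chose.
        std::lock_guard<std::mutex> lock(g_requestLock);
        if( g_haveRequest )
        {
            state = g_requestedState;
            g_haveRequest = false;
        }
    }
}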

Tuesday, February 21, 2012

Thunderstone, Or An Abbreviated Guide to Comp Climbing Treachery

Lately in my spare time I've been playing Thunderstone with my wife and an acquaintance.  It's a fantasy card game where players basically build up a deck of cards during the game in an attempt to vanquish monsters in a dungeon.  There's a lot more to it, of course, but the strategy and creativity of the game is really wonderful.

Anyway, our acquaintance is an avid gamer who's clearly spent an inordinate amount of time scheming about strategy, tactics and technique for winning at these games; he is a cunning adversary, and that's what makes it fun.  So it should have come as no surprise when he was dumbfounded by the complete lack of strategy some climbers demonstrated at a local bouldering competition.

The comp was fairly typical.  It had a qualifying phase where climbers could pick and choose problems to try from fifty to seventy-five options of varying difficulty; only a climber's top three problems counted toward their qualifying score.  Based on that score, three climbers would proceed to finals.  The finals format was also fairly common: climbers were shown three problems and given five minutes to attempt each, with five minutes of rest between problems.

My friend inquired about these rules, and I explained them, as well as some of the tactics I generally employ (note these don't hold true for all comp formats).  In no specific order, during qualifiers:
  1. Warmup.  This gives me a chance to get ready for harder problems.
  2. The goal of qualifiers is to do the minimal amount of work to get to finals.  Burning yourself out in qualifiers is a sure-fire way to fail on finals problems, which are typically more difficult.
  3. Never get on a hard problem before someone else.  Best case scenario: your competitor is unable to finish the problem, in which case you've earned free beta ("beta" is info about how to do a climb) and you possibly don't even need to attempt it anyway.  Worst case scenario: they send and you still get free beta, hopefully enabling you to send it quicker and easier than them.
  4. Religiously track who's climbed what.  Even if your competition does do some problem more difficult than you, who gives a shit?  You need to get to finals, not thug it out with some rock-jock during qualifiers over a fake prize.
You'd think these rules would be obvious, but my friend seemed dumbfounded how the third rule could ever be feasible; he aptly noted that if everybody waited for someone else to get on a hard problem, then nobody would ever attempt anything difficult.  And this is true, assuming everyone there has considered strategy.

But the reality is they haven't.  Most comp climbers spend roughly five seconds considering the best approach to maximize success before throwing themselves at some stupid-hard problem.  For example, at this comp, I waited for my competition to solve a rather cryptic problem for me.  My two finals competitors did eventually climb the problem--but it took them a handful of costly attempts, whereas I was able to recycle the effort they spent unlocking it into a single-attempt ascent.  I believe this is a win not only from an efficiency standpoint (one try is way better than multiple tries), but also a psychological one: it can be sobering for your competition.

Even worse, both of them spent time and effort on problems that were irrelevant to getting into finals.  Had either of them taken a moment to survey the field, they would have realized they were virtually guaranteed a finals seat since the only other climber doing problems in their range was, well, me.  Instead, they both attempted a problem neither of them ultimately sent.  But had either of them actually succeeded at this problem, it wouldn't have changed the fact that they were still headed to finals, which is all that really matters during qualifiers.

In finals, the strategy is different.  You cannot leverage someone's beta in this format since you don't get to watch them climb.  But there's always strategy:
  1. With five minutes of time, realistically, you'll get three to four good attempts.  
  2. Your first attempt will also (likely) be your strongest.  Thirty seconds between attempts on a very hard climb simply isn't enough time to recover.  The first attempt is often worth the most since you get bonus points for climbing something on your first try (i.e. flashing).  Not in all comps, but most.
  3. This means you should have a damn good plan of how you're going to do stuff before you pull on the wall.  I generally take at least a minute surveying the problem before I pull on.
  4. Know when to cut your losses.  This can be hard to gauge, and also slightly heartbreaking to give up.  But if you honestly don't see yourself sending a problem, save energy for the next one.
Again, neither of my competitors seemed to realize any of this.  They spent seconds glancing at the problem before attempting, hurting their chances at first-attempt success.  They made too many attempts.  They did not save energy for future problems.

Combine their errors in the qualifiers with a lack of strategy in finals, and you end up with a commanding win by me.  Which, of course, I like.  But I have to wonder: did age and treachery simply beat youth and talent?  Is this really a legitimate part of "climbing," especially when many climbers would attribute this sort of strategic whoring (on plastic, no less) to pure foolishness?

It's not clear.  But I have to admit I really love strategy.  I love finding that chink in the armor, that little advantage that gives me some edge, my ace in the hole.  It's what I love about Thunderstone.  It's what I love about real stone.

Wednesday, December 21, 2011

What do we actually know about software development?

I recently saw this video on Vimeo; to say it blew my mind would be an understatement:


In it, Greg Wilson makes a compelling argument that we have a responsibility to seriously question the efficacy of the software engineering practices we employ. For example, do we know code reviews will really lead to better code, or are we just trusting our gut? Is pair programming effective?  Does refactoring really matter at all?   Is design paradigm "blah" effective, or is it really an untested, unproven method with no empirical basis whatsoever?  How much of the things we do are dogma, and how much have been scientifically proven to deliver results?

Painful admission: I'd never seriously considered any of this.  Many of these fads--such as unit testing or code reviews or agile or INSERT FAD HERE--are so beaten into developers that we never stop to question the basic foundation.  So as a personal exercise, I decided to seriously dig into one of them to see what I learned: test driven development. I picked this just as some arbitrary starting point, really, based on some reddit comments.

Test driven development is "...a software development process that relies on the repetition of a very short development cycle: first the developer writes a failing automated test case that defines a desired improvement or new function, then produces code to pass that test and finally refactors the new code to acceptable standards."  In other words, I start developing by writing tests that fail.  I produce code that passes these tests.  This idea seems like a good hypothesis, so I found two heavily cited scholarly articles that supported TDD and scrutinized their findings.
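To make the cycle concrete, here's what a single, toy-sized iteration might look like; I'm deliberately using bare C++ and assert rather than any particular test framework, and Add() is obviously a stand-in for real functionality:

#include <cassert>

// Step 1 (red): write a test for behavior that doesn't exist yet, and watch it fail.
int Add(int a, int b);             // only a declaration at this point

void TestAdd()
{
    assert(Add(2, 2) == 4);        // fails (won't even link) until Add is written
}

// Step 2 (green): write just enough code to make the test pass.
int Add(int a, int b) { return a + b; }

// Step 3 (refactor): clean up with the passing test as a safety net.
int main()
{
    TestAdd();
    return 0;
}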

The first, Realizing quality improvements through test driven development: results and experiences of four industrial teams, is an empirical study cited about 40 times according to Google Scholar.  The study tracked four different teams at two different companies and compared "defect density" with a "comparable team in the organization not using TDD."  Obviously "comparable team" is ambiguous, so in their "Threats to Validity" section, they mention:
The projects developed using TDD might have been easier to develop, as there can never be an accurate equal comparison between two projects except in a controlled case study. In our case studies, we alleviated this concern to some degree by the fact that these systems were compared within the same organization (with the same higher-level manager and sub-culture). Therefore, the complexity of the TDD and non-TDD projects are comparable.
First, I do not buy that "the complexity of the TDD and non-TDD projects are comparable" simply because "these systems were compared within the same organization (with the same higher-level manager and sub-culture)."  My own organization shows why: I work on what I'd describe as a very complicated product (computer vision, lots of networking, embedded development, lots of compression, low-latency streaming, etc.), while coworkers in the same organization (with the same higher-level manager) write software with a small subset of our group's requirements; the difference in complexity is gigantic.  Because the paper doesn't provide specifics about the comparison projects, it's hard to take the findings seriously.  The comparisons of total lines of code and team size also suggest these are unlikely to be "comparable projects," and the experience breakdown of the comparison teams' members isn't listed either.

Second, none of these results are statistically significant, and the authors acknowledge that.  It's an empirical study--not an experiment--so at best they can only infer correlations between TDD and "defect density."

Third, they include figures on "Increase in time taken to code the feature due to TDD," and use management estimates as a source of data.  Seriously?  Given what we know about estimating things, is this really a valid method?

Lastly, how do they conclude that TDD was solely responsible for the improvement?  Did the teams have other differences, such as doing code reviews?  Were programmers working together, or were they geographically dispersed?  How did management affect the various projects?  What other factors could have influenced their results?  They allude to some of this obliquely in their "threats" section;  none of it stops them from recommending best practices for TDD in their discussion section.

The second paper, published in 2008--Assessing Test-Driven Development at IBM--has ~134 citations on Google Scholar.  I found this to be a more interesting comparison: two software teams implementing the same specification, one using a "Legacy" development process and the other (admittedly later) team using TDD.  It's still not an experiment, since it wasn't controlled, but the final deliverable is more comparable: the same "point of sale" specification from IBM.

Their findings are summarized in two graphs; the top is legacy, and the bottom is the newer project using TDD:



At first glance, these look promising; we see that the TDD project had a much lower "defect rate per thousand lines of code (KLOC)" than the legacy project.  We also see the TDD project had better forecasting of their defect rates.

But on closer inspection, I have to seriously question these results.  In terms of absolute line count, the Legacy solution appears to be about ~11.5 KLOC (80 total defects / 7.0 defects per KLOC) versus ~67 KLOC for the TDD version (247 total defects / 3.7 defects per KLOC).  In other words, in absolute terms the legacy system had roughly one-third the defects and one-sixth the total line count.  So I'm struggling to understand how the TDD team ended up with a massive pile of code, what that cost them in terms of schedule/productivity/maintainability/performance, and how they justify "six times as much code and three times the defect count compared to legacy, which purportedly does the same thing!" as a positive result.  I'm open to the possibility that I'm misinterpreting these graphs, but if I'm not, the authors deserve a scientific keel-hauling.

There's no comparison of development time.  No mention of how successful either product actually was in production.  No comparison of productivity between the two teams, only a reference to the productivity of the TDD team.  No acknowledgement that factors other than TDD might have been responsible for their findings, since again this is not a controlled experiment.  And none of this prevents them from presenting best practices for TDD in their results section, as though two misleading graphs are a good proxy for a well-executed experiment and statistical significance.  Frankly, this source was enormously disappointing.  The discrepancies in KLOC and absolute defect counts are significant enough that I'm struggling to understand how this paper was A) published and B) cited by 134 other authors.

In my opinion, neither of these papers establishes that TDD is an effective practice.  Of course, neither of them precludes it from being effective either, so the jury's still out.  And it's entirely possible I've made an error interpreting these findings; I graciously welcome your peer review, dear reader.  In any event, I think Greg Wilson's point stands strong: what are our standards of truth?  Why do we do what we think works as opposed to what we've empirically shown to work?


Thursday, July 14, 2011

Who Really Wants a Digital Home?

Let me start by saying I'm not a technophobe. I work in video surveillance, I write code in multiple languages, I cross-compile C++ code, I browse various open source projects in my spare time, I like Python's structured use of white space, etc. I'm a nerd, and an unabashed one at that.

But let me be the first to say: I do not want a "digital home." Yep, that's right. I don't want a digital home.  

Let me back up and put some context behind this. In maybe 2005ish, I was asked to head up to Seattle to interface our surveillance product with the products of a home automation/digital home company. Their office was in a drab commercial area of Seattle that presumably fed off the table scraps of larger tech companies in the area; it was one of the dirtiest offices I'd ever seen. Their key product was a little wall-mounted device with an LCD touch screen that would control lighting, HVAC, audio, etc., and they wanted me to make it work with our surveillance system.

I was given a tour by a stocky fellow with a thick eastern European (?) accent who oversaw their engineering. He proudly demonstrated the usage of the wall-mounted device. The whole demo went something like so:

HIM: "So, if you want to turn on the lights, you just do this..."

(turns towards the device, stares directly at it, flicks through a menu or two, hits a button, and a bunch of lights come on)

ME: "Huh....neat."

HIM: "If you want music, you can do this..."

(turns back towards the device, stares directly at it, punching keys, and after some completely boggling transitions through menus we hear music)

Me: "hrrrm..."

HIM: "If you want to control HVAC..."

(turns back towards the device, stares at device, browses unintelligible menus...)

...you get the idea.

While I was watching this, it occurred to me how completely wrong this product was on so many basic levels.  I imagined myself installing this system in my home, trying to explain it to my wife, having her be frustrated learning how to use it, having her be irritated that turning on the lights was suddenly a pain in the ass, me doing tech support when it's busted so my wife can turn on lights, her making me put the wall switch back, and so on.  It's a cool idea...if you're a nerd.

For the rest of the non-nerd world?  Here's how they want to turn on the lights:


...the simplicity is stunning.  The reliability?  Undeniable.  The lack of training required?  Breathtaking.  The familiarity of the interface, the ability to "feel" around a corner and flip on a light (instead of having to stare at a tiny screen on the wall), the speed at which you can turn on the lights, the lovely tactile response of a solid mechanical switch?  All very refined.

I think much of this nuance is often lost on my fellow techies--after all, we don't fear new technology.  To the contrary, we often bask in it, marveling at the possibilities it may bring.  But the truth of the matter is these companies are attempting to replace technology that has been used and refined over the last 100 years.  When my wife wants to turn on a light, she doesn't want to deal with weird technology or navigate menus or stare at a tiny screen on the wall--she wants to turn on the lights.  When a guest is using my restroom, I don't want to explain how to turn on the light by the toilet.  Lord knows I don't want to do IT work at my own home--as it is, I hate servicing people's computers.  Replacing a light switch with a touch screen sounds like a great idea, but in reality it's as misguided and ridiculous as replacing the steering wheel of a car with an iPhone app.

All of this comes full circle to the idea of a "digital home."  What does that even mean, a digital home?  Does it mean replacing technology willy-nilly with stuff we think is cool, or actually attempting to improve the ergonomics, efficiency and usability of our homes?

Companies hawking these products invariably see themselves as purveyors of a better solution to "antiquated" technology that's been in use for the last few decades.  But until their products complement--not replace--the ergonomics and efficiency of our homes in a reliable manner, they'll fail to gain significant market traction.  This is why the term "digital home" is hopelessly misguided; it implies something nobody actually wants.  We're pretty happy with our fuddy-duddy analog homes, thank you very much.

I believe companies that truly leave their mark in this market won't replace conventional technology and interfaces like the light switch.  They'll be known for complementing the technology with smart solutions that solve problems people actually have.  For example,
  • Can we find ways of remotely controlling lighting?
  • Can we find ways to turn off lights when they're not necessary, thus saving energy?
  • Can we find ways to reduce the amount of copper wiring a home needs, thus driving down construction costs?  Or make it easier to arbitrarily add outlets and lighting to a home or business without tearing down walls or hiring electricians?
  • Can we find ways to delight people, such as automatically turning on porch or back lights when they come home, or automatically turning lights on and off as we go from room to room?  Or letting me turn off lights I forgot about while I was on vacation?  Or letting me turn off lights from my iPhone because I'm in bed and don't want to make one last pass through the house?
The key difference: the goal is to complement our homes--not replace them with silly digital interfaces.

Tuesday, May 10, 2011

Unroll that loop, homeboy

A month or so ago I was wrestling with a pretty thorny algorithmic problem at work.  Eventually I worked out a solution that worked great on our Windows emulator, so I figured my job was done...was I ever wrong.

Once I got it running on our embedded processor (a lowly ARM926EJ-S at ~270 MHz), I quickly found my algorithm had some serious performance issues.  So I fired up OProfile to see where I was eating cycles.  There were a couple of hotspots I went after, but one was a loop that looked like so:

UINT64 currentRow;

...
for(int row = 0; row < NUMBER_OF_ROWS; row++)
{
  currentRow = rows[row];
  for(int column = 0; column < NUMBER_OF_COLUMNS; column++)
  {
    if( (currentRow >> column ) & 1 )  // Check if this bit is set
      DoSomeWorkOnThisPointBaby(column,row);
  }
}

Nested loops...yuck.  This code ran for every column of every row, shifting the current row over to check whether each bit was set.  To make matters worse, each row was represented with a 64-bit value, which is more expensive to shift on a 32-bit processor.  Even worse, this wasn't work I could figure out a way of avoiding entirely--enumerating over all the values really was a core part of the algorithm.

So, I decided to see if I could at least speed up the routine by optimizing it.  First, I unrolled the loop.  As Wikipedia states, the goal of loop unrolling is to increase a program's speed by reducing (or eliminating) the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration.

So instead of the inner loop iterating through all of the columns, it gets replaced with a gigantic block of if-statements:

UINT64 currentRow;

...

if( currentRow & 0x1 ) // first column
  DoSomeWorkOnThisPointBaby(0,row);
if( currentRow & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(1,row);

...

if( currentRow & 0x8000000000000000 ) // 64th column!
  DoSomeWorkOnThisPointBaby(63,row);

This removes all of the overhead of the for loop--no more incrementing the counter, and no more testing to see if we're at the end of the loop.  It also avoids an expensive shift operation to test each column value.

But we can do even better!  One issue is that currentRow is a 64-bit value, but we're on a 32-bit processor.  Each 64-bit AND takes multiple instructions, whereas a 32-bit AND is a single instruction on most platforms.

So, to speed things up even more, you can break the 64-bit value into its two 32-bit halves and test them individually--so instead of 64 64-bit AND operations, we now perform one shift, one mask, and 64 32-bit AND operations.  As an extra bonus, you only need 32 distinct constants in the code now:

UINT64 currentRow;

...
UINT32 val = (UINT32)( currentRow & 0x00000000FFFFFFFF ); // low 32 columns
if( val & 0x1 ) // first column
  DoSomeWorkOnThisPointBaby(0,row);
if( val & 0x2 ) // second column
  DoSomeWorkOnThisPointBaby(1,row);

...

val = (UINT32)( currentRow >> 32 ); // high 32 columns
if( val & 0x1 ) // 33rd column
  DoSomeWorkOnThisPointBaby(32,row);
if( val & 0x2 ) // 34th column
  DoSomeWorkOnThisPointBaby(33,row);

...

I'm sure I could have written the routine in assembler for a further improvement, but then the code would no longer be portable, whereas in C these improvements apply to almost every target platform.  Admittedly the split is slightly slower than necessary on a 64-bit platform, where the 64-bit operations would be single instructions anyway, but not by enough to matter.

The real question, though, is what sort of performance gains did I reap?  Measuring again with OProfile, I was spending one-fourth the time in the function that I previously did, and the work definitely helped the algorithm fit on a lowly 270 MHz processor.