Friday, October 15, 2010

A closer look at JPEG performance

In my previous post, I compared JPEG to the newcomers, WebP and JPEG XR. For the most part, JPEG held its own pretty well. WebP was better in a few respects, worse in a few...overall decent. JPEG XR was utterly horrible.

But this got me thinking: how well optimized was my JPEG encoder? I used MSFT's .NET JPEG implementation, and I had to wonder just how good that implementation actually is. For example:
  1. What quantization tables does it use?
  2. Are its quantization tables computed on a per-image basis, or is it using the IJG reference tables?
  3. How are the tables being scaled, and are they being capped at 8 bits (i.e. no value larger than 255 in your quantization table) or allowed to go to 16 bits?
  4. For the discrete cosine transform, is it using integer or floating point math?
  5. Is it using an optimized entropy encode?
Maybe our problem isn't JPEG. Maybe our problem is bad JPEG encoders. For example, libjpeg certainly isn't terrible, but there are some flaws that are hard to overlook, such as hard-coded quantization tables. The linear scaling of those quantization tables is equally deficient, and the maintainers acknowledged as much ("The current method for scaling the quantization tables is known not to be very good at low Q values," taken from the libjpeg README in February of 1996! How's that for old?). The library also still defaults to non-optimal settings, like the integer DCT and non-optimized Huffman tables (probably perfectly legitimate design decisions back when you weren't sure if you had an FPU...these days, it's a much safer bet that you do). In summary, there's some serious room for improvement.
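To make the scaling complaint concrete, here's roughly what the IJG code does when you hand it a quality setting: it takes the same hard-coded base table for every image and scales it linearly. This is a minimal C# sketch of that logic (written from memory of jpeg_quality_scaling and friends, so treat the details as approximate):

    // Roughly libjpeg's behavior: one fixed base table, linearly scaled by quality.
    static int[] ScaleQuantTable(int[] baseTable, int quality, bool forceBaseline)
    {
        quality = Math.Max(1, Math.Min(100, quality));
        // Map quality 1..100 onto a percentage scale factor (the "linear scaling").
        int scale = (quality < 50) ? 5000 / quality : 200 - quality * 2;

        var table = new int[baseTable.Length];
        for (int i = 0; i < baseTable.Length; i++)
        {
            int q = (baseTable[i] * scale + 50) / 100;
            if (q < 1) q = 1;                          // 0 would wipe out the coefficient entirely
            if (forceBaseline && q > 255) q = 255;     // the 8-bit cap mentioned above
            else if (q > 32767) q = 32767;             // otherwise 16-bit table values are allowed
            table[i] = q;
        }
        return table;
    }

Note that nothing here looks at the image content--which is exactly the complaint.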

So do we really need WebP? Or do we really just need to give a venerable old method some much needed love?

It is clear to me that there are various methods that can improve current JPEG performance. Once I enable libjpeg's Huffman optimization and then run the output through jpegcrush, WebP is no longer ahead. Jpegcrush itself "...abuses combinations of multiple Huffman tables and scan combinations to improve compression more than merely optimizing the Huffman tables naively," according to Dark Shikari. There are also several promising papers on generating image-specific quantization tables that minimize error for a given bitrate. So in my spare time, I'll probably investigate some of this stuff further.

Sunday, October 10, 2010

Comparison: WebP, JPEG and JPEG XR

Google recently released a new image format called WebP, which is based on the intra-frame compression from the WebM (VP8) video codec. I'd seen a few comparisons of WebP floating about, but I wasn't really happy with any of them. Google has their own analysis which I critiqued here (executive summary: bad comparison metric--PSNR--and questionable methods).

Dark Shikari's comparison was limited by testing a single quality setting and a single image. There's also this comparison and this comparison, but neither of them is that rigorous. The first is, again, a single image at one arbitrary compression level, although it does clearly show WebP producing superior results. The second at least uses multiple images, but the methods are a little fuzzy and there's no variation in compression level. It's also a bloodbath (WebP wins), so I'm a bit wary of its conclusions.

So, I coded up a quick and dirty app to compress an image to WebP, convert it back to PNG and then compute the MS-SSIM metric between the original image and the final PNG. I used Chris Lomont's MS-SSIM implementation. I did the same process for JPEG, using the built-in JPEG compression in .NET. Then, for the hell of it, I decided it'd also be fun to add JPEG XR into the mix, since it's probably the only other format that has a snow-cone's chance in hell of becoming viable at some future date (possibly when IE9 hits the market). I eventually got JPEG XR working, although it was a pain in the butt due to a horrible API (see below).

The input image is encoded at every possible quality setting, so the final output is a set of SSIM scores and file sizes. I also computed dB for the SSIM scores the same way x264 does, since SSIM output is not linear (i.e. 0.99 is roughly twice as good as 0.98, not 1% better):

SSIM_DB = -10.0 * log10(1.0 - SSIM)
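For the curious, the JPEG leg of the pipeline boils down to something like the C# sketch below. ComputeMsSsim is a stand-in for Chris Lomont's MS-SSIM code (not reproduced here), and the WebP/JPEG XR legs just swap out the encode/decode step:

    using System;
    using System.Drawing;
    using System.Drawing.Imaging;
    using System.IO;
    using System.Linq;

    static class JpegSweep
    {
        // Encode at a given quality, decode the result, and score it against the original.
        static double MeasureJpeg(Bitmap original, long quality, out long fileSize)
        {
            ImageCodecInfo jpegCodec = ImageCodecInfo.GetImageEncoders()
                .First(c => c.MimeType == "image/jpeg");
            var prms = new EncoderParameters(1);
            prms.Param[0] = new EncoderParameter(Encoder.Quality, quality);

            using (var ms = new MemoryStream())
            {
                original.Save(ms, jpegCodec, prms);
                fileSize = ms.Length;
                using (var decoded = new Bitmap(ms))
                {
                    double ssim = ComputeMsSsim(original, decoded);   // Lomont's MS-SSIM goes here
                    return -10.0 * Math.Log10(1.0 - ssim);            // same dB mapping as above
                }
            }
        }

        // Placeholder -- plug in the MS-SSIM implementation of your choice.
        static double ComputeMsSsim(Bitmap a, Bitmap b)
        {
            throw new NotImplementedException();
        }
    }

The app just loops the quality setting across its full range and records the (file size, dB) pairs that make up the graphs below.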

For starters, I decided to test with the image Dark Shikari used, since it's a clean source and I wanted to see if I could replicate his findings. Here are the results of the parkrun test:


Quick explanation: higher SSIM (dB) is better, and smaller file size is better. So in a perfect world, codecs would output in the upper-left corner of the graph. But when images are heavily compressed, the quality drops, so you end up with graphs like the above: high compression equals bad quality, low compression equals large file size.

In parkrun, WebP seems to perform decently at very low bitrates, but around ~15 dB JPEG starts to outperform it. Even at low bitrates, it never has a substantial advantage over JPEG. This does seem to confirm Dark Shikari's findings: WebP really doesn't perform well here. When I examined the actual output, it also followed this trend quite clearly. JPEG XR performs so poorly that it is almost not worth mention.

Next, I did a test on an existing JPEG image because I wanted to see the effect of recompressing existing JPEG artifacts. In Google's WebP analysis, they do this very thing, and I had qualms about JPEG's block-split output being fed back into the encoder repeatedly. Here's the image I used:


Results:


Again, absolutely dismal JPEG XR performance. Notice the weirdness in the JPEG line--I believe this is due to recompressing JPEG. I got similarly odd results when I used other JPEG source material, which does support my hypothesis that recycling JPEG compressed material in a comparison will skew the results due to JPEG's block splitting algorithm. I'll try to post a few more with JPEG source material if I can.

As far as performance goes, WebP does a little better here--it's fairly competitive up to about 23 dB, at which point JPEG overtakes it. At lower bitrates it is resoundingly better than JPEG. In fact, it musters some halfway decent looking files around ~15 dB/34 KB (example), while JPEG at that same file size looks horrible (example). However, if you really want to match the original closely, including the grainy artifacts, JPEG eventually outperforms WebP. So for this test, I think the "winner" is really decided by what your needs are. If you want small and legible, WebP is preferable. If you're looking for small(ish) and higher fidelity, JPEG is a much better choice.

Next test is an anime image:


Results:


WebP completely dominates this particular comparison. If you've ever wondered what "13 dB better" looks like, compare JPEG output to WebP output at ~17 KB file sizes. JPEG doesn't reach a comparable quality to the ~17 KB WebP file until it's around ~34 KB in size. By the time JPEG surpasses WebP around 26 dB, the extra bits are largely irrelevant to overall quality. JPEG XR unsurprisingly fails.

Since a lot of web graphics aren't still-captures from movies, I decided a fun test would be re-encoding the Google logo, which is a 25 KB PNG consisting mostly of white space:

This sort of graphic is better suited to PNG, but I wanted to see what I could get with lossy compression. Results:

WebP clearly dominates, which confirms another hypothesis: WebP is much better at dealing with gradients and low-detail portions than JPEG. JPEG XR is, again, comically bad.

Incidentally, Google could net some decent savings by using WebP to compress their logo (even at 30 dB, which is damn good looking, it's less than 10 KB). And I think this may be a place WebP could really shine: boring, tiny web graphics where photographic fidelity is less of a concern than minimizing file size. Typically PNG does a good job with these sorts of graphics, but it's pretty clear there are gains to be had by using WebP.

My conclusions after all this:
  1. JPEG XR performs so poorly that my only possible explanation is I must be using the API incorrectly. Let's hope this is the case; otherwise, it's an embarrassingly bad result for Microsoft.
  2. WebP is consistently better looking at low bitrates than JPEG, even when I visually inspect the results.
  3. WebP does particularly well with smooth gradients and low-detail areas, whereas JPEG tends to visually suffer with harsh banding and blocking artifacts.
  4. WebP tends to underperform on very detailed images that lack gradients and low-detail areas, like the parkrun sample.
  5. JPEG tends to surpass the quality of WebP at medium to high bitrates; if/where this threshold occurs largely depends on the image itself.
  6. WebP, in general, is okay, but I don't feel like the improvements are enough. I'd expect a next-gen format to outperform JPEG across the board--not just at very low compression levels or in specific images.
A few more asides:
  • I'd like to test with a broader spectrum of images. I also need better source material that has not been tainted by any compression.
  • No Windows binary for WebP...seriously? Does Google really expect widespread adoption while snubbing 90% of the non-nerd world? I worked around this by using a CodePlex project that ported the WebP command-line tools to Windows. I like Linux, but seriously--this needs to be fixed if they want market penetration.
  • If you ask the WebP encoder to go too low, it tends to crash. Well...maybe. The aforementioned CodePlex project blows up, and I'm assuming it's due to some error in the underlying WebP encoder. That said, by the time it blows up, it's generating images so ugly that it's no longer relevant.
  • JPEG/JPEG-XR go so low that your images look like bad expressionist art. It's a little silly.
  • I absolutely loathe the new classes in .NET required to work with JPEG XR. It is a complete disaster, especially if all your code is using System.Drawing.Bitmap and suddenly it all has to get converted to System.Windows.Media.Imaging.BitmapFrame (see the sketch after this list). Why, MSFT, why? If you want this format to succeed, stop screwing around and make it work with System.Drawing.Bitmap like all the other classes already do.
  • WmpBitmapEncoder.QualityLevel appears to do absolutely nothing, so I ended up using ImageQualityLevel, which is a floating-point value. Annoying; I'm not sure why QualityLevel doesn't work.
  • The JPEG XR API leaks memory like nuts when used heavily from .NET; not even GC.Collect() statements reclaim the memory. Not sure why, didn't dwell on it too much given that JPEG XR seems to not be worth anyone's time.
  • People who think H.264 intra-frame encoding is a viable image format forget it's licensed by MPEG-LA. So while it would be unsurprising if x264 did a better job with still image compression than WebP and JPEG, this is ignoring the serious legal wrangling that still needs to occur. This is not to say WebM/WebP are without legal considerations (MPEG-LA could certainly attempt to put together a patent pool for either), but at least it's within the realm of reason.
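For reference, this is roughly the System.Drawing-to-WPF dance I'm complaining about. It's a sketch, not production code--the exact property types on WmpBitmapEncoder are from memory, and the GDI+ bitmap has to take a detour through a stream just to become a BitmapFrame:

    using System.Drawing;
    using System.IO;
    using System.Windows.Media.Imaging;   // WmpBitmapEncoder lives here, not in System.Drawing

    static class JpegXrSketch
    {
        static byte[] EncodeJpegXr(Bitmap source, double quality)
        {
            using (var temp = new MemoryStream())
            {
                // There's no direct Bitmap -> BitmapFrame conversion; round-trip through a stream.
                source.Save(temp, System.Drawing.Imaging.ImageFormat.Png);
                temp.Position = 0;

                var encoder = new WmpBitmapEncoder();
                // QualityLevel appeared to be a no-op in my testing; ImageQualityLevel
                // (a 0.0 - 1.0 floating-point value) is what actually changed the output.
                encoder.ImageQualityLevel = (float)quality;
                encoder.Frames.Add(BitmapFrame.Create(temp));

                using (var output = new MemoryStream())
                {
                    encoder.Save(output);
                    return output.ToArray();
                }
            }
        }
    }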
Let me know if you're interested in my code and I'll post it. It's crude (very "git'er done"), but I'm more than happy to share.

EDIT: Just to make sure .NET's JPEG encoder didn't blow, I ran a few sanity checks using libjpeg instead and manually tested in GIMP on Ubuntu. The results are almost identical. If anyone knows of a "good" JPEG encoder that can outdo the anime samples I posted, please let me know which encoder and post a screenshot (of course, make sure your screenshot is ~17 KB in size).

Friday, October 01, 2010

Not again...

Google decided to use the intra-frame coding algorithm of WebM to create a new image format for the web called WebP. And, of course, they make bold claims like "...an average 39% reduction in file size." There's nothing wrong with bold claims, but after reading the details of how they arrived at that conclusion, I see a few problems.

First, Google should stop using PSNR as a metric for image quality. PSNR does not correlate well with mean opinion scores of image quality. Furthermore, this error is particularly egregious because the WebM encoder appears to be PSNR-optimized and it isn't clear that the JPEG/JPEG2000 encoders were. If the JPEG encoder was optimized for SSIM, it may very well score lower on a PSNR test despite being visually superior. This is the No Child Left Behind approach to encoding: Google is engineering encoders to pass the test instead of encoders that actually produce visually stunning results. A much better choice for an objective metric is MS-SSIM. It's not perfect, but it correlates far better with human opinion scores than PSNR does.

The most meaningful evaluation, of course, is actually collecting mean opinion scores, but that involves actual humans, tallying lots of data and generally "making science." In the absence of that, though, why not use the most meaningful objective image quality metric available?

Second, if Google is going to try and woo us with pretty graphs, they should get their axis labels right (see Figure 1).

Third, Google basically took existing JPEG images (90% of all samples), decoded them, and re-encoded them with different encoders to determine which encoder worked best? This method is totally bogus. Once an image is encoded to JPEG, it's going to have block-based artifacting on an 8x8 grid due to JPEG's block splitting. For example, this grid is clearly visible in a lot of JPEG images:


You can clearly see the 8 by 8 blocking, particularly around the woman's chest (stop staring, perv!).

Here's another example from the wikipedia article on JPEG, blown up to better illustrate the blocking artifacts:


Again, you can clearly see the 8x8 grid. This could seriously skew results for the second JPEG compression, particularly if the encode settings are not identical to the first compression (see question 10 here). The data being used in the comparison has already been skewed by a first JPEG compression; thus, the JPEG results (i.e. the most important comparison) are potentially flawed, and it's not clear to me whether this works for or against JPEG in the comparison.

Luckily, this can be fixed--Google should do the comparison on source material in lossless formats like PNG or BMP, which hopefully will not contain existing JPEG artifacts.

Friday, May 28, 2010

HD Video Standard, Cont.

Last year, there was brief discussion about a standard for high definition video; you can see my previous comments here.

One of the big questions about a high definition standard is how to make sure the quality of video encoding is acceptable. Sure, video can be 1080p, but if it's over-compressed garbage produced with a sub-par encoder then it's really difficult to claim that it is "high definition." For example, here are two screenshots from Blu-ray discs. I wish I were kidding about this, but it really underscores the importance of having an objective metric to ensure that some reasonable encoder quality is met.

Unfortunately, people can even screw this up. Consider today's post from the WebM blog about VP8 quality. First, it uses peak signal-to-noise ratio (PSNR), which is certainly a venerable metric, but almost everybody and their mother agrees that it correlates poorly with human perception of image quality. One need look no further than this paper, which outlines why this technique (particularly in the field of image processing) should be relegated to the dust-bin. Even worse, programmers can "cheat" when optimizing for PSNR, and the result is generally a blurry mess--they choose settings that result in higher PSNR, but that doesn't imply higher quality. (With regard to VP8, "rumor has it that there was some philosophical disagreement about how to optimize [encoder] tuning, and the tune for PSNR camp (at On2) mostly won out. Apparently around the time of VP6, On2 went the full-retard route and optimized purely for PSNR, completely ignoring visual considerations. This explains quite well why VP7 looked so blurry and ugly.")

A much more basic problem is how the results are presented: line graphs. Here's an example:



The vertical axis is PSNR score--higher is better. The bottom axis is bitrate. So it's pretty clear what this graph shows: it compares the quality of the VP8 encoder with and without a given feature at equivalent bitrates. However, there's no mention of how each data point was calculated (there's even less mention of H264, but I'll leave that for another post). Video is a sequence of images, so PSNR and other image processing metrics must be run on each frame. Unlike with a single image, this creates a problem: how do we take all of those per-frame scores and give an overall indication of the quality of the video encode? How do we know that the upper clip doesn't have a lot more volatility or substantial lapses in quality somewhere?

At first I assumed the WebM people were simply using average PSNR--the mean of the per-frame PSNR values--which is computed something like:

average PSNR = SUM over all frames( 10 * log10( MAX^2 / MSE_frame ) ) / (# of frames)

I was wrong; they're using "global PSNR," which pools the error over the whole clip and takes a single log:

global PSNR = 10 * log10( MAX^2 / ( SUM over all frames( MSE_frame ) / (# of frames) ) )

But this makes no difference--global PSNR is still a mean. The reason to use global PSNR is that an encoder can recreate a pure black frame perfectly, which gives that frame a per-frame PSNR of infinity and would skew average PSNR. Global PSNR avoids that problem by literally looking at the whole video as one big frame (this is wrong in so many ways; see the aforementioned link on the deficiencies of PSNR, but I digress).
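To make the difference concrete, here's a little C# sketch of both pooling strategies given per-frame MSE values (MAX is 255 for 8-bit video). A perfectly reconstructed black frame has an MSE of zero, which sends its per-frame PSNR--and therefore the average--to infinity, while global PSNR shrugs it off:

    using System;
    using System.Linq;

    static class PsnrPooling
    {
        const double Max2 = 255.0 * 255.0;   // squared peak value for 8-bit video

        static double AveragePsnr(double[] msePerFrame)
        {
            // Mean of the per-frame PSNR values; one zero-MSE frame makes this infinite.
            return msePerFrame.Average(mse => 10.0 * Math.Log10(Max2 / mse));
        }

        static double GlobalPsnr(double[] msePerFrame)
        {
            // Pool the error first, then take a single log -- "the whole video as one big frame."
            return 10.0 * Math.Log10(Max2 / msePerFrame.Average());
        }
    }

Either way, you end up with a single mean, which is exactly the problem.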

Problem is, means can be heavily affected by outliers, so there may be a few bad frames dragging down the score or a few great frames dragging it up--or there may be a few horrible frames and a lot of great frames, in which case it all averages out, but your video clearly has a serious consistency problem. A "single value" representing video quality is suspect at best, and junk engineering at worst.

This "single value" really deprives us of understanding the overall quality of the video stream. So, without further ado, let me present some alternatives. In my previous post, I mapped SSIM scores on a per-frame basis:



Each line corresponds to a given bitrate. The vertical axis is SSIM score, while the horizontal axis is frame number. You can see how drastically SSIM scores can vary within a single clip. This method also works really well for comparing multiple encoders at a comparable bitrate, because you can clearly see where one encoder outperformed another.

Another, simpler and more compact way of visualizing video quality data is a box-and-whisker plot. Here are box plots of SSIM scores from a clip with 165 frames; each box plot corresponds to different encoder settings (all at the same bitrate):



This box plot conveys a lot of information about the video quality. The median value (the line in the middle of the box) is a general indication of the overall quality of the clip. The box spans the upper and lower quartiles (the area directly above and below the median), which together encapsulate 50% of all samples, while the whiskers generally extend out some reasonable amount (conventions vary; see the wiki article for details). Outliers are marked with circles.

What does this all mean? Consider the above box plot--the median values really don't vary that much. What does vary is (a) the size of the quartiles and whiskers and (b) the number of outliers. Outliers, in this instance, correspond to frames that look like crap. Tighter box plots indicate more consistent quality, and higher box plots indicate higher overall quality. Using this technique, one can easily compare encoders and get an idea of how they really stack up against one another. From this box plot, it's pretty clear "highProfile" is the obvious winner. If you were just looking at averages or medians, it would hardly be so clear. Certainly you'd have zero indication of any outliers, of where the quality of most frames falls, etc.
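If you want to build these plots yourself, the numbers behind each box are easy to compute from the per-frame SSIM scores. Here's a sketch (quartile conventions vary; this uses the simple median-of-each-half approach and the usual 1.5 * IQR rule for outliers):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    static class BoxPlotStats
    {
        static double Median(List<double> sorted)
        {
            int n = sorted.Count;
            return (n % 2 == 1) ? sorted[n / 2]
                                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        }

        static void Summarize(List<double> ssimPerFrame)
        {
            var s = ssimPerFrame.OrderBy(x => x).ToList();
            double median = Median(s);
            double q1 = Median(s.Take(s.Count / 2).ToList());          // lower half
            double q3 = Median(s.Skip((s.Count + 1) / 2).ToList());    // upper half
            double iqr = q3 - q1;

            // Frames beyond 1.5 * IQR from the box are the outlier circles -- the frames that look like crap.
            int outliers = s.Count(x => x < q1 - 1.5 * iqr || x > q3 + 1.5 * iqr);

            Console.WriteLine("min={0:F4} Q1={1:F4} median={2:F4} Q3={3:F4} max={4:F4} outliers={5}",
                s[0], q1, median, q3, s[s.Count - 1], outliers);
        }
    }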

Means are a blunt instrument. They deprive you of understanding your data well.

In any event, what really irks me about this WebM debate isn't that there's a legitimate competitor to H264--competition is always a good thing, and I welcome another codec into the fold. What bothers me is that we would let bogus metrics and faulty data presentation influence the debate. At this point, I see no indication that VP8 is even remotely on par with a good implementation of H264--perhaps it can be sold on other merits, but quality simply isn't one of them. This could change as the VP8 implementation improves, but even the standard itself is roughly equivalent to the baseline profile of H264.

Putting that whole debate aside and returning to the notion of a high definition video standard: using these methods--and in particular, a box plot--one can establish that a video at least meets some baseline requirement for encoder quality. The Blu-ray shots above are a pretty clear indication that this industry needs such a standard. More importantly, consumers deserve it. Regardless of the codec, it's not right for people to shell out cash for some poorly-encoded, overly-quantized, PSNR-optimized mess.

Sunday, May 09, 2010

Object Tracking for Scientists

My significant other works in zoology. She needed a program to analyze about a thousand hours of video containing salamanders. Since image and video processing is my primary focus, I offered to take a stab at analyzing her video.

The goal was to determine how much distance the object (in this case, a cute little salamander known as Plethodon shermani) traveled over the course of the video. There were definitely a few takeaway lessons from this exercise:
  1. It turns out object tracking is more difficult than I would have originally imagined, and
  2. You can make your life a lot easier during the analysis phase by making sure your video is as clean as possible. Our video was relatively clean despite some confounding factors inherent in working with salamanders, but in retrospect there were a few things that would have made analysis much easier.
I'd like to discuss both points in detail. I'll start with the process I used to track the object, since that will explain many of the crucial aspects of #2.

Object Tracking for Dummies
Rather than write a bunch of code from scratch, I started with AForge.NET, which is a nice open source library with a number of image processing primitives. Specifically, I massage my input data until I can use some of the blob detection features of the library. Here is the general method I chose to perform object tracking:
  1. Convert the image to grayscale--all color information is discarded. This is because (a) color information is not necessary and (b) the subsequent processing happens much more quickly on an 8 bits/pixel image--color is 24 bits/pixel.
  2. Because the salamander is darker than the background, I invert the image. Light colors become dark, and dark colors become light.
  3. Run the image through some noise reduction filter, such as a median or mean filter. Median works better, but is rather expensive.
  4. Next, I apply a threshold filter. This converts every pixel to either black or white, depending on whether it's below or above a specified threshold. Since the (now inverted) salamander is lighter than the general background, it stands out clearly here.
  5. After that, I apply an opening filter to remove any small objects from the image.
  6. At this point, hopefully all that is left is the salamander. Here, I use the blob counter class.
From there, a few tricks make sure the animal is tracked correctly. First, I track the position of the salamander with a list of points. I can use this list to (a) draw out the path of the salamander and (b) compute total distance traveled. Total distance traveled is simply the cumulative sum of the Euclidean distances between successive entries in the list, and the final unit is pixels. Converting to a real-world unit is much more difficult unless you have a known mapping between pixels and physical distance. (A sketch of the whole per-frame pipeline follows below.)
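Here's roughly what one frame of that pipeline looks like with AForge.NET. The filter parameters (threshold, minimum blob size) are placeholders--in the real tool they're bound to sliders--and the blob selection is the temporal/spatial locality trick described next:

    using System;
    using System.Drawing;
    using AForge.Imaging;
    using AForge.Imaging.Filters;

    static class SalamanderTracker
    {
        // Process one frame and return the center of the most likely salamander blob.
        static PointF? FindSalamander(Bitmap frame, PointF? lastKnown)
        {
            Bitmap img = Grayscale.CommonAlgorithms.BT709.Apply(frame);  // 1. 24 bpp color -> 8 bpp gray
            img = new Invert().Apply(img);                               // 2. dark animal becomes light
            img = new Median().Apply(img);                               // 3. knock down salt-and-pepper noise
            img = new Threshold(120).Apply(img);                         // 4. black/white split (slider in the real app)
            img = new Opening().Apply(img);                              // 5. remove small specks
            var counter = new BlobCounter { FilterBlobs = true, MinWidth = 10, MinHeight = 10 };
            counter.ProcessImage(img);                                   // 6. whatever is left is (hopefully) the salamander

            // Pick the blob nearest the last known position (salamanders are slow).
            PointF? best = null;
            double bestDistance = double.MaxValue;
            foreach (Blob blob in counter.GetObjectsInformation())
            {
                Rectangle r = blob.Rectangle;
                var center = new PointF(r.X + r.Width / 2f, r.Y + r.Height / 2f);
                double d = lastKnown.HasValue ? Distance(center, lastKnown.Value) : 0.0;
                if (d < bestDistance) { bestDistance = d; best = center; }
            }
            return best;
        }

        // Euclidean distance in pixels; summing this over the recorded points gives total distance traveled.
        static double Distance(PointF a, PointF b)
        {
            double dx = a.X - b.X, dy = a.Y - b.Y;
            return Math.Sqrt(dx * dx + dy * dy);
        }
    }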

Should there be more than one blob, I do some basic filtering on blob size--I find the salamander typically has a fairly consistent area. Furthermore, I utilize temporal and spatial locality: in the event of multiple blobs, I choose the blob closest to the previous known location of the salamander. Given that salamanders don't move very quickly, this is a surprisingly safe bet.

Lastly, I do some mild filtering on the list of points--I only record changes in distance greater than a few pixels to weed out any residual noise in the image.

There are some definite drawbacks to this method. Most importantly, it relies on the salamander being the darkest object in the image, which I've found to be (mostly) true. It also requires knowing the salamander's initial location, which can be tricky. Lastly, the "threshold" setting tends to vary due to variance in lighting conditions at the time of recording, so I attached that to a slider bar that's configurable in real time. There's also a big discrepancy between the sizes of salamanders, so the target min/max area is configurable.

Here's a picture of the final result:


The left screen is the original video. The middle screen is what the video looks like after the whole invert->median->threshold->open->detect-blob pipeline. I flag the resulting object with a nice red bounding box and display the size of the object. The right screen draws out the path the salamander took in the box. Total distance traveled is listed in the bottom status bar.

Worth noting is the white bar on the left side of the middle image--that's the result of using a dark cardboard divider in the video (remember, the image has been inverted). In retrospect, this divider should have been as white as possible. This is a nice lead-in to our next topic...

Getting Clean Data
One thing that's important is making your video data as easy to analyze as possible. Here are a few considerations:
  1. Low-light conditions will result in salt-and-pepper noise in the video stream. This can be removed (somewhat) using a median filter, but that filter is computationally expensive. There are other, cheaper filters, but they come with various tradeoffs of their own. So brightly lit conditions are vastly preferable, if possible.
  2. Seek to have the highest contrast possible between the object(s) you need to track and the general background. This means all background in the video should be the opposite color of the thing you're tracking. For example, if you need to put a piece of cardboard somewhere, and you're tracking a black mouse, make your cardboard white.
  3. Minimize glare as much as possible. Museum-quality glass with a low-glare treatment (like the type used in picture frames) helps substantially; it is vastly preferable to the crappy plastic lids many trials use. Also consider a polarizing lens to further reduce any glare.
  4. Make your lighting as consistent as possible. Having half the field of view be bright and half be dark--or the existence of shadows due to point sources of light--is difficult to manage.
  5. If possible, mark objects with visually distinct cues. For example, if you can mark an object with a red circle, that's going to make your life a lot easier in the analysis stage.
  6. Maintain positions throughout the experiment as much as possible. If you can, bolt the camera in place, and make sure the tray or dish (or whatever) is situated in the exact same location.
  7. Make sure all cameras are the same distance from the object for the entire duration of the experiment. Otherwise, it can be very difficult to track distance traveled. Also, wide angle lenses will make it difficult to track distance traveled since the image is so distorted around the edges.
  8. Higher resolution is "nice to have," but not nearly as important as consistent lighting, low glare, high contrast between object/background and consistent placement.
  9. I would take consistent lighting performance from a camera--and excellent low-light capabilities--over higher resolution or frame rate.
Your mantra: high contrast, high contrast, high contrast. If your object is black, the ideal video should look like a blob of black amid a snowy field. Obviously this may be hard; you might need visual markers in your video, you may have edges or barriers setup that are dark(er), etc. Try to minimize this as much as possible.

Salamanders were tricky for another reason: they like dark places. We ran the trials in low light, which makes for particularly difficult filming conditions. To compensate, high-quality glass is a big help; so is the median filter, as is a camera with good low-light performance.

Saturday, March 13, 2010

Memory Corruption Makes Me Sad

Nothing makes programmers cower in fear more than memory corruption. These bugs are almost always A) fatal, B) ridiculously difficult to track down and C) hard to reproduce consistently. The combination of these three things can make you start thinking of your memory in surprisingly literal ways:

(and literal in more ways than one, since your memory is "trashed" hahahaha, oh that was bad...)


This particular blog posting is about my own struggle with a bug that I knew about for no less than four months and worked on 24/7 for the last two weeks. I fixed it, but it was very difficult to isolate.

Memory corruption happens when "the contents of a memory location are unintentionally modified due to programming errors. When the corrupted memory contents are used later in the computer program, it leads either to program crash or to strange and bizarre program behavior." What makes memory corruption particularly insidious is your program doesn't crash or behave strangely at the time of modification--it happens later, when something attempts to use the memory that was sullied.

Causes of memory corruption include use of uninitialized memory, use of unowned memory, buffer overflows, faulty heap memory management and multithreading problems. In all cases, the defining characteristic is where the program crashes isn't necessarily related to what went wrong.

Here's the best strategy I've found for dealing with memory corruption:
  1. If you're really, really lucky, you might be able to catch the problem with something like DevPartner or Valgrind, but typically these only show the problem as it's blowing up. In the case of my bug, DevPartner ended up being beneficial only because it made my error more repeatable. Also, have a look at Application Verifier.
  2. Get a static code analysis tool. If you're using VS.NET 2008, try using their Code Analysis tool. It is surprisingly effective at locating buffer overruns and errant pointer access. These tools may or may not help you, but they certainly can't hurt. If you are lucky, this may be all you need to find your problem. If it doesn't work, then you get to go on to the brute force methods (lucky you!)
  3. Go through all classes and make sure every member variable is initialized correctly. Value types should be set to sensible values, pointer types should be initialized to NULL.
  4. Search for all new/delete/malloc/free statements. Ensure that all pointers begin their life as NULL. Ensure that pointers are set to NULL immediately after the memory they reference is released. Ensure you have no new/free or malloc/delete mismatches. Ensure you do not free/delete memory twice. Ensure that any memory allocated by a third-party library is also destroyed by that library (i.e. do not call "CrazyLibraryMemAlloc" and use free/delete to clean up unless you are positive that is the correct thing to do). Make sure your destructors and cleanup methods release all memory and NULL everything. Make sure you're using delete[] if the type was allocated with new[]. In essence, everything should begin its life as NULL and end its life as NULL. This is probably the single best thing you can do to isolate memory tom-foolery.
  5. Review every memset, memcpy and mem-whatever in the program (ditto for Win32 variants like CopyMemory). If you are using raw buffer pointers (e.g. void*, int*, etc.), consider wrapping them in something like QByteArray. Review any and all string handling code (in particular, strcpy and the likes). If you have any raw pointer string types, consider replacing them with a decent string class like CString or Qt's QString.
  6. Are you using threads? Does the crash happen in a shared object? If so, this strongly implies your locking strategy (or lack thereof--even if you have locks, be absolutely certain they are working as you expect).
  7. Determine if you are experiencing corruption in the same location, or if it's more random. If it's random corruption, then it is more likely to be a buffer overflow. If it's localized corruption (i.e. let's say the crash always happens in a shared queue, or in the same place in code), then it is more likely that code touching the shared item is invalid. If it crashes in the exact same place always, then you are in luck--you should be able to watch that location in a debugger and break on any read/write. Whether or not you have crashes in the same place is a huge, huge clue about your problem. Track this information religiously.
  8. One method for determining if you have local/random corruption is to declare "no man's land" buffers directly above and below the item being corrupted. Like, nice, big 10k buffers that are initialized to "0xDEADBEEFDEADBEEFDEADBEEF..." When your program crashes, inspect those buffers. If those buffers contain invalid data, then it is not localized corruption. If they aren't corrupted, but the data structure they wrap is, then that strongly implies something that touches the sandwiched object is where the problem may lie.
  9. It is not likely to be the C-runtime, third party code, obscure linking issues, etc. Think about it: you're not the only person using these libraries and tools. They are generally more thoroughly vetted because of Linus' law. Is it possible? Yeah, sure, anything's possible. But is it likely? Not really.
  10. Unless evidence strongly implies otherwise, assume the issue is in your code. This is good, because it means it's something you can potentially fix. Otherwise, you may have to start a support case with whomever owns the code. If it's an open source project, you might get a quick response (or possibly no response at all...). If it's somebody like MSFT, it is going to take weeks at a minimum. Only as a last resort should you assume it's somewhere else, and be certain you have Real Information™ to backup your theory.
It may take a couple of people a day or five to go through the program and make all these changes depending on how big the application is, but it's generally the only real way to isolate problems. And it also gets you in the habit of being religiously fanatical about default values, pointer checking and correct deletion of objects, which is good.

For me, the issue ended up being extremely subtle (it evaded two separate code reviews by my peers), and after finding it, painfully obvious and somewhat embarrassing. I was having localized corruption around a shared queue that two threads accessed. The culprit was invalid locking code I'd written. Once the queue became corrupted, it would fail somewhere in the bowels of whatever queue object I'd been using (I tried CAtlList, QList, etc, but it didn't matter because none of them are thread-safe).

Which brings me back to item #10 in the above list: it's always your fault. It was my fault. It can be very tempting to assume otherwise, but generally I don't find this to be the case. So keep an open mind, think analytically, write down what you know and what you don't know, and you'll be done with the bug sooner than you know it!

Monday, March 01, 2010

DMOs Considered Harmful

Lately I've been working with DMO filters. Typically I've written regular DirectShow filters derived from the base classes, but after reading this article I decided that maybe it was time to start writing DMOs instead of transform filters. The advantages seemed attractive, and being able to write a filter that would work in Media Foundation was tempting.

After writing two of them, I can now say this was a bad idea. There are serious performance implications, major limitations and many of the proposed advantages are simply not true.

Let's start with the deal-breaker for me: you may run into serious performance issues when using a DMO filter in DirectShow because you cannot set the number or size of output buffers. When using the DMO Wrapper Filter (think of this filter as the "translator" between the DMO model and DShow filters--you could write your own DMO wrapper if you wanted), the ALLOCATOR_PROPERTIES on the output pin will always look something like:
You will always get one big buffer for the output allocator. There is no way to change this that I know of. The article on MSDN mentions "...DMOs have no control over the number of buffers or memory allocators," but what it doesn't tell you is that this can have serious consequences for high-performance playback, and particularly so for variable-rate playback.

The same filter implemented as a typical transform filter allows me to specify the size and count of the output buffers:
Here, I made the DShow filter use 20 output buffers. This filter, with the exact same decoder, would do up to 8x playback on my machine without missing a beat. I could not reliably do better than 2x with the DMO filter before the filter graph manager reverted my SetRate() call. In case it isn't obvious, you give up a lot of control by using a DMO in DirectShow. And for some scenarios, it's pretty clear you sacrifice a lot of performance as well.

But it gets worse. Let's say your H264 encoder pukes everywhere. Vomits ALL OVER. You, being the responsible programmer you are, would like to communicate this failure to some higher power so it can come in with scrubbies and bleach and all that good stuff and clean up the technicolor yawn that is your filter innards.

Normally you'd call IMediaEventSink->Notify() and be done with it--the event gets handled asynchronously by the filter graph manager so you need not worry about threading issues (woe is the DShow coder who fires a synchronous event from their streaming thread), and everything can be dealt with in a common event handler. But someone, in their infinite wisdom, did not provide a standard eventing method for a DMO filter. Which means: your events fall on the floor.

There are a few options to deal with this. You can do what normal DirectShow filters do: hold a weak reference to IMediaEventSink and send events through that. But this requires a custom interface, and suddenly your filter is no longer that nice, clean abstraction that works in Media Foundation and DirectShow. You could create your own eventing interface, but it would need to be asynchronous (since the DMO is being called on the streaming thread) so this isn't exactly trivial. These options are not appealing.

These are the two major grievances I have with DMOs. Minor grievances include:
  • The documentation mentions issues with IDMOQualityControl, which is the DMO version of IQualityControl. Purportedly, the DMO interface "...is not ideal for quality control due to the limitations of the interface itself." But nowhere are these "limitations" outlined. It'd be great if MSDN would make it clear what they are.
  • The claim that "DMOs require less methods to implement, and they provide a documented API" is total nonsense. My DMO implementation was about ~300 lines and included no fewer than 14 methods I had to implement (note the section at the bottom outlining required methods). For CTransformFilter? 200 lines of code and 5 methods. Also, note that CTransformFilter is completely documented.
  • Want to manually instantiate a filter? Good luck. Have fun. You basically have to know all sorts of complicated junk about apartment threading and COM and ATL to get this to work. It's possible, but it's a lot more work and not as well documented as manually instantiating a regular DirectShow filter.
I should be fair: some of these problems are really issues with the DMO Wrapper Filter, and there's certainly nothing stopping someone from writing their own wrapper filter. But some of the issues, such as the lack of eventing, are not so easy to deal with. Regardless, the big question is: why bother? Why not just write a regular transform filter and avoid all of these problems?

In retrospect, I'd be more inclined to devote my efforts to some cross-platform rendering solution, like GStreamer.

Thursday, January 07, 2010

A fellow poster on a forum I frequent often says to gravity skeptics (yes, sadly people like this exist) "well, if in doubt, jump off a cliff." Someone commented that there wasn't such a succinct retort for evolution skeptics, but I think there is:



...all the intelligent design advocates die; settles that debate, I believe.