Monday, December 21, 2009

H264 media subtypes in Directshow

The media subtypes supported by Directshow are outlined in an MSDN article. Although the article lists five separate types, there are really only two distinct types:
  1. MEDIASUBTYPE_AVC1: h.264 bitstream without start codes, and
  2. MEDIASUBTYPE_H264: h.264 bitstream with start codes

Basically, a subtype in DirectShow communicates the type of media that a filter outputs, so these types let connecting filters know that a filter outputs some variant of H.264. However, both of these types are--to differing degrees--nonsensical. Microsoft could have done a better job defining these types.

For AVC1, the NALU data in each sample buffer is prefixed with its length in big-endian format, which is completely redundant with DirectShow's IMediaSample interface. IMediaSample already provides GetActualDataLength() to communicate the length of the buffer, so this additional data in the sample buffer is unnecessary and one more source of error when writing a filter. You often see code that resembles:


IMediaSample* pSample = /* ... */;
BYTE* pBuffer = NULL;
pSample->GetPointer(&pBuffer);
DoSomethingWithMyH264Data(pBuffer + 4, pSample->GetActualDataLength() - 4);

The only possible benefit I can see to prefixing the length is that it makes it easy to convert back to a type with the NALU start codes (0x00000001 can simply overwrite a 4-byte size prefix, for example, and voila: start codes), but that's really not a big deal.
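To make that trick concrete, here is a minimal sketch, assuming 4-byte size prefixes (the prefix can also be 1 or 2 bytes in general) and one or more complete, length-prefixed NALUs in the buffer; the function name is mine, not part of any API:

// In-place conversion of a buffer of 4-byte length-prefixed NALUs (AVC1
// style) into an Annex B style buffer, by overwriting each length field
// with the start code 0x00000001. Returns E_FAIL on a malformed buffer.
HRESULT LengthPrefixedToStartCodes(BYTE* pBuffer, long cbBuffer)
{
    long offset = 0;
    while (offset + 4 <= cbBuffer)
    {
        // Read the big-endian NALU length.
        long naluLength = ((long)pBuffer[offset]     << 24) |
                          ((long)pBuffer[offset + 1] << 16) |
                          ((long)pBuffer[offset + 2] << 8)  |
                           (long)pBuffer[offset + 3];
        if (naluLength <= 0 || offset + 4 + naluLength > cbBuffer)
        {
            return E_FAIL; // size prefix doesn't match the buffer
        }

        // Overwrite the 4-byte size prefix with the start code.
        pBuffer[offset]     = 0x00;
        pBuffer[offset + 1] = 0x00;
        pBuffer[offset + 2] = 0x00;
        pBuffer[offset + 3] = 0x01;

        offset += 4 + naluLength;
    }
    return S_OK;
}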

The good thing about AVC1 is that the subtype defines a single NALU per sample buffer, which at least makes the type reasonably compatible with the DirectShow architecture. Also good: the SPS/PPS info is communicated during pin negotiation, which is helpful when determining whether your decoder can handle the incoming H.264 video. (For example, if the decoder cannot handle the video, you can refuse the pin connection in the first place instead of having to parse the incoming stream to find out. Transform filters also don't have to sit and wait for SPS/PPS NALUs to lumber along in-band, which is not guaranteed to ever happen. And so on.)

I wouldn't say AVC1 is bad; I'd reserve that designation for the certifiably stupid H264 subtypes. The H264 subtypes define a raw H264 byte stream, which means they deliver NALUs delimited with the NALU start code (0x00000001 or 0x000001). Since the H264 subtypes do not specify one NALU per buffer, I assume this means a filter can potentially deliver multiple NALUs per IMediaSample. People on the newsgroups even interpret this subtype to mean that "NALU boundaries don't even have to respect the sample boundaries," which means you could get a sample that didn't contain an entire NALU, or a sample that contained half of a NALU, and so on.
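For comparison, here is a rough sketch of the kind of start-code scanning a downstream filter is stuck doing when it receives one of the H264 subtypes (the function name is mine, just for illustration):

// Returns the offset of the first byte *after* the next Annex B start code
// in the buffer, or -1 if no start code is found. A filter receiving the
// H264 subtypes has to scan like this, because a sample buffer may contain
// any number of (possibly partial) NALUs. Note that a 4-byte start code
// (00 00 00 01) contains a 3-byte start code (00 00 01), so searching for
// the 3-byte pattern handles both.
long FindNextNALU(const BYTE* pBuffer, long cbBuffer, long startOffset)
{
    for (long i = startOffset; i + 3 <= cbBuffer; i++)
    {
        if (pBuffer[i] == 0x00 && pBuffer[i + 1] == 0x00 && pBuffer[i + 2] == 0x01)
        {
            return i + 3; // the NALU payload begins right after the start code
        }
    }
    return -1;
}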

But delivering NALUs this way is completely nonsensical from a timestamping perspective. Each IMediaSample has a timestamp, so this model effectively makes it impossible for a downstream filter to correctly timestamp incoming H.264, which really only leaves this subtype feasible for live streaming, and you can pretty much forget about syncing with audio in any meaningful way.

Needless to say, this completely violates the IMediaSample abstraction. GetActualDataLength() is wrong in the same way it is for AVC1 (extra bytes--here the start codes--are present in the buffer), but so is GetMediaTime (the time of which sample?), IsSyncPoint (precisely which NALU is the sync point?), IsDiscontinuity, and so on. Of course, if you decide to write a filter that outputs the H264 subtype, you can always output a single NALU per IMediaSample buffer, but that doesn't change the fact that filters accepting the H264 subtypes cannot depend on an IMediaSample being a discrete NALU, which makes interpreting the sample data problematic.

Furthermore, with the H264 subtypes, the SPS/PPS info is not communicated in the media type exchange, which means downstream filters cannot prepare their decoders until they have parsed the incoming data and located the SPS/PPS NALUs. This simply makes no sense. Any proper RTP H.264 stream communicates the SPS/PPS in the SDP exchange, and the MP4 container stores it up front in the sample description precisely because it is so important for decoders to access it quickly so decoding can begin immediately. Not having this info up front makes it considerably more difficult (although not impossible, by any means) to write a filter, and there is rarely a good reason to withhold it.

In any event, my advice: avoid the H264 subtypes; they are ridiculous and only make your filter a pain in the ass to deal with. The AVC1 format is better, but it could be improved by removing the superfluous size field from the media buffer.

Tuesday, September 29, 2009

Programming Fail: Directory.GetFiles()

I went to demo some shiny new code for a friend, and we both had a laugh when my program pretty much puked all over itself.

Admittedly I had not run this code on this particular machine before, and it was running Vista and .NET 3.5 (neither of which I'd tested against). But upon finding the bug, I found it difficult to understand what sort of design decision would lead to such arbitrary behavior.

Directory.GetFiles() seems like a pretty straight-forward function. You hand it a directory to search, and a search pattern (e.g. "*.exe"), and it returns a list of all the files in that directory which match the search. OK, I can handle that. Or so I thought.

The fine print of the documentation contains the gotcha:

The following list shows the behavior of different lengths for the searchPattern parameter:

* "*.abc" returns files having an extension of .abc, .abcd, .abcde, .abcdef, and so on.
* "*.abcd" returns only files having an extension of .abcd.
* "*.abcde" returns only files having an extension of .abcde.
* "*.abcdef" returns only files having an extension of .abcdef.


Ummm. OK. So, MSFT, basically your search pattern violates decades of common convention with regards to regular expressions, and the developer gets to figure this out when the app bombs? OK, sure. Brilliant. Ship it, yo.

Let's say you're looking for TIFF files, which can have either "tif" or "tiff" as the file extension. Because a three-character pattern like "*.tif" also matches longer extensions that start with "tif", calling GetFiles() with both patterns and aggregating the results means every ".tiff" file gets added twice. Bzzt, fail MSFT, fail.

So, the fix is: either go through the aggregate list and remove duplicates, or drop the "*.tiff" search pattern (by the way, I got my supported extensions through the image handling classes in .NET). But it'd be a lot easier if this method simply behaved in a manner that is normal and predictable and sane: if I ask for "*.tif," I only want "*.tif". But I guess that's asking for too much.

Monday, September 21, 2009

More fun with AM_MEDIA_TYPE

While messing around with CMediaType (a wrapper class for AM_MEDIA_TYPE), I came across an inconsistency/bug. If you execute the following code:

 
pmt->cbFormat = sizeof(WAVEFORMATEX);
WAVEFORMATEX *wfex = (WAVEFORMATEX*)pmt->AllocFormatBuffer(sizeof(WAVEFORMATEX));


...you will find that wfex (which is a pointer to pbFormat) is NULL. My first thought was "out of memory?!" and my next thought was that that simply wasn't possible--this call was about 30k lines of code deep in my application, so a failed allocation would have nailed me long before this point. So I poked around in mtype.cpp, which is the file that implements CMediaType, and found the bug:



// allocate length bytes for the format and return a read/write pointer
// If we cannot allocate the new block of memory we return NULL leaving
// the original block of memory untouched (as does ReallocFormatBuffer)
BYTE*
CMediaType::AllocFormatBuffer(ULONG length)
{
    ASSERT(length);

    // do the types have the same buffer size

    if (cbFormat == length) {
        return pbFormat;
    }

    // allocate the new format buffer

    BYTE *pNewFormat = (PBYTE)CoTaskMemAlloc(length);
    if (pNewFormat == NULL) {
        if (length <= cbFormat) return pbFormat; //reuse the old block anyway.
        return NULL;
    }

    // delete the old format

    if (cbFormat != 0) {
        ASSERT(pbFormat);
        CoTaskMemFree((PVOID)pbFormat);
    }

    cbFormat = length;
    pbFormat = pNewFormat;
    return pbFormat;
}


It becomes pretty clear what the issue is: if the requested length is the same as cbFormat, the method simply returns pbFormat--but since pbFormat was never actually allocated, it returns 0x00000000. Bzzt.

I guess the obvious moral of this story is: when using wrapper classes, it's probably not a good idea to manually set the underlying structure's members. But still, this is a clear bug: later in this very method, pbFormat is only freed when it actually exists, so the early-return check should be just as defensive:


if (pbFormat != NULL && cbFormat == length) {
    return pbFormat;
}


If someone is allocating the same amount of memory, and pbFormat isn't null, it's acceptable to return pbFormat.

Again, I realize it doesn't make sense to set cbFormat if AllocFormatBuffer is going to do it for you, but that argument assumes one is aware that AllocFormatBuffer is going to do that for you. Nowhere in the documentation does it mention that I am forbidden from interacting with the underlying structure, so the only way you figure this out is by kicking yourself in the teeth, which I'd rather not do because it hurts.
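For what it's worth, the pattern that sidesteps the whole mess is to let the wrapper do the bookkeeping--a minimal sketch, assuming pmt is the same CMediaType pointer as in the snippet at the top of this post:

// A minimal sketch of the usage that sidesteps the problem entirely: don't
// touch cbFormat/pbFormat yourself, and let AllocFormatBuffer() do all of
// the bookkeeping.
WAVEFORMATEX *wfex =
    (WAVEFORMATEX*)pmt->AllocFormatBuffer(sizeof(WAVEFORMATEX));
if (wfex == NULL)
{
    return E_OUTOFMEMORY;
}
ZeroMemory(wfex, sizeof(WAVEFORMATEX));
// ...fill in wfex, then set the major type, subtype, and format type as usual.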

Lastly, the same issue is present in the allocation-failure path, which can also hand back a pbFormat that was never allocated. So a better implementation of the method is:


// allocate length bytes for the format and return a read/write pointer
// If we cannot allocate the new block of memory we return NULL leaving
// the original block of memory untouched (as does ReallocFormatBuffer)
BYTE*
CMediaType::AllocFormatBuffer(ULONG length)
{
    ASSERT(length);

    // do the types have the same buffer size, _and_
    // do we have a valid pbFormat pointer???
    if (pbFormat != NULL && cbFormat == length) {
        return pbFormat;
    }

    // allocate the new format buffer
    BYTE *pNewFormat = (PBYTE)CoTaskMemAlloc(length);
    if (pNewFormat == NULL) {
        // If the current pbFormat works, reuse it. Otherwise, fail.
        return (pbFormat != NULL && length <= cbFormat) ?
            pbFormat : NULL;
    }

    // delete the old format
    if (pbFormat != NULL && cbFormat != 0) {
        CoTaskMemFree((PVOID)pbFormat);
    }

    cbFormat = length;
    pbFormat = pNewFormat;
    return pbFormat;
}


Slightly better, but I still hate AM_MEDIA_TYPE.

Thursday, May 14, 2009

HD Video Standard

Despite the last couple of years being a time when "high definition" video has really gained traction, there's one surprising thing about HD video: it doesn't have an obvious definition. Dan Rayburn brings up this observation in a recent blog post:
For an entire industry that defines itself based on the word "quality", today there is still no agreed upon standard for what classifies HD quality video on the web....If the industry wants to progress with HD quality video, we're going to have to agree on a standard - and fast.
He's absolutely right. Many companies attempt to pass off 480p as HD video, but most video enthusiasts would reject such an assertion--after all, if it isn't HD for an analog signal, why would it be HD for a digital signal? Likewise, lots of video is encoded at an unacceptably low bit rate which results in obvious artifacts. Why would such poor quality video be considered "high definition?"

Wikipedia's definition for High-definition television is a decent start:
High-definition television (or HDTV) is a digital television broadcasting system with higher resolution than traditional television systems (standard-definition TV, or SDTV). HDTV is digitally broadcast; the earliest implementations used analog broadcasting, but today digital television (DTV) signals are used, requiring less bandwidth due to digital video compression.
This is still lacking. What exactly is "higher resolution than traditional television systems?" And relative to which SDTV system, since there were many of them? And is resolution all there is to it? What if I encode video to 1080p but at a horrible bit rate, which causes lots of blocking artifacts? What about video with an odd aspect ratio, where the number of vertical lines doesn't pass muster? Clearly this definition is lacking.

Some aspects of creating a standard are fairly straight-forward: most people seem to be fairly comfortable with 720p being the "minimum" resolution for HD. Ben Waggoner had an interesting proposal where 720p was acceptable, but it was also acceptable to generalize that to anything with "at least 16 million pixels per second," which takes into account both frame rate and resolution. He also brought up the issue of using horizontal resolution as a criterion, since not everything is 16:9.

But on the question of "quality," most simply punted, and I find this odd. Ben Waggoner mentions:

Hassan Wharton-Ali brought up another good point on the thread - HD should actually be HD quality. It can’t be a lousy, over-quantized encode using a suboptimally high resolution just so it can be called HD.

A good test is the video should look worse (due to less detail), not better (due to less artifacts), if encoded at a lower resolution at the same data rate. If reducing your frame size makes the video look better when scaled to the same size, then the frame size is too high!

It is a good point, and I don't completely disagree with Ben's proposal--it should look worse due to less detail if encoded at a lower resolution. But this is the crux of the issue: what does it mean to look worse? Is this just a subjective judgment call on the part of the person encoding the video? I don't think this addresses the problem of having a minimum acceptable "quality" for HD video.

Dan Rayburn's suggestion is even less desirable, in my opinion:

To me, the term HD should refer to and be defined by the resolution and a minimum bitrate requirement. Since you could have a 1080p HD video encoded at a very low bitrate, which could result in a poor viewing experience inferior to that of a higher-bitrate video in SD resolution, the resolution and bitrate is the only way to define HD.

The first issue is that the "minimum bit rate" requirement would have to somehow scale with resolution and frame rate, and it would have to account for the codec being used. This would result in an impossibly complicated system, endless arguments, etc. (for example, would we impose the same bit rate requirement on H.264 as we would on VC1? What about "future" codecs?)

A bigger issue is that not all video content is the same. The resulting "quality" of a video encoded at a given bit rate depends heavily on the content being encoded. A video with very little movement can often be encoded at a low bit rate and look fantastic, so the bit rate requirement would essentially amount to wasted bandwidth. Conversely, a video with a lot of motion and scene changes may require a lot more bits to get an acceptable, block-free viewing experience--and it's not clear what that acceptable threshold would be.

I find both of these suggestions insufficient. I propose an alternative: objective video quality algorithms. The idea is straight-forward: by comparing the source material with the output material, we can objectively establish a score that at least has some meaningful relationship with Mean Opinion Scores. In a nutshell, a MOS is how "good" the average person thinks some piece of video appears.

Peak Signal-to-Noise Ratio/Mean Squared Error is the most common metric, albeit a crude one that is widely considered deficient by most engineers and scientists in the field. But it's 2009, baby--we can do better. We have better.

My suggestion would be the Structural SIMilarity Index, which is relatively inexpensive (its closely related brother, MSSIM, is much more pricey) and definitely correlates better with MOS.

How would this work?

  1. During the encode process, an SSIM score is computed for each frame, using the corresponding input frame as the reference image (a rough sketch of this per-frame computation follows this list).
  2. This process is repeated for every input frame, and every output frame.
  3. The lowest observed SSIM score is the resulting quality score for that piece of encoded video. (I suppose another alternative is to use the average. Yet another option is to use the variance. I'd avoid the median, since it's robust against outliers, and outliers matter)
  4. If the lowest observed SSIM score is less than some threshold, then the video cannot be considered High Definition.
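To be clear about what step 1 involves, here is a rough per-frame score in the spirit of SSIM. It only looks at the luma plane and uses global image statistics instead of the sliding 8x8/11x11 windows the real SSIM index uses, so treat it as an illustration of the idea rather than a reference implementation:

// A rough per-frame score in the spirit of SSIM, computed over the luma (Y)
// plane only and using global image statistics rather than local windows.
// C1 and C2 follow the usual k1=0.01, k2=0.03 constants for 8-bit video.
double GlobalLumaSSIM(const unsigned char* pReference, // encoder input, Y plane
                      const unsigned char* pDecoded,   // decoded output, Y plane
                      long pixelCount)
{
    if (pixelCount < 2) return 1.0;

    const double C1 = (0.01 * 255.0) * (0.01 * 255.0);
    const double C2 = (0.03 * 255.0) * (0.03 * 255.0);

    double sumRef = 0.0, sumDec = 0.0;
    for (long i = 0; i < pixelCount; i++)
    {
        sumRef += pReference[i];
        sumDec += pDecoded[i];
    }
    const double muRef = sumRef / pixelCount;
    const double muDec = sumDec / pixelCount;

    double varRef = 0.0, varDec = 0.0, covar = 0.0;
    for (long i = 0; i < pixelCount; i++)
    {
        const double dRef = pReference[i] - muRef;
        const double dDec = pDecoded[i] - muDec;
        varRef += dRef * dRef;
        varDec += dDec * dDec;
        covar  += dRef * dDec;
    }
    varRef /= (pixelCount - 1);
    varDec /= (pixelCount - 1);
    covar  /= (pixelCount - 1);

    // Standard SSIM combination of luminance, contrast, and structure terms.
    return ((2.0 * muRef * muDec + C1) * (2.0 * covar + C2)) /
           ((muRef * muRef + muDec * muDec + C1) * (varRef + varDec + C2));
}
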
For a visual representation of how this works, take a look at this graph:


This is a graph of SSIM over time at multiple bit rates. The x-axis is frame number, and the y-axis is SSIM score. My input was a VGA, 30 FPS, ~30 second raw-RGB video clip. Each line corresponds to a bit rate requested of the encoder (x264's H.264 implementation, using a baseline profile--you can see this by the low SSIM scores at the beginning of the video due to single-pass encoding). Notice the clear relationship between SSIM score and bit rate. Also note how much variance there is in video quality: clearly certain portions of this clip are "more difficult" to encode, and this results in a degradation of video quality. Also, notice a clear law of diminishing returns: as more and more bits are thrown at the video clip, the SSIM scores converge on 1.0--the scores at 2 mbit/sec aren't substantially different from the scores at 500 kbit/sec.

There are a few gotchas with this plan: what if we're changing the frame rate (e.g. 3:2 pulldown) and there is no clear reference frame to which we compare the output? How do we determine the SSIM threshold? Do we really want to use SSIM, or is some other algorithm better?

The first question is answered relatively easily: we compare only what was input to the encoder and the resulting output. Presumably the process of manipulating the frame rate is separate from the process of encoding. What we're talking about is how good a job our encoder does of matching its input.

The second question is easy to state but harder to answer: it requires someone to conduct subjective video quality assessment tests to determine which SSIM value corresponds to an acceptable baseline. In effect, someone has to do some statistical analysis on data captured during viewing sessions of actual people watching actual footage encoded with an actual compression algorithm, and determine a threshold that correlates well with people's perception of "High Definition." But at least with SSIM, this is a manageable process: once a threshold is determined, it's largely independent of a whole slew of factors, like the codec, the video being encoded, etc.

Let's say we decide that any SSIM score below 0.9 invalidates the video from being called "high definition"--for the above video, this would mean 500 kbps would be just slightly too poor to call HD (notice the poor quality at the beginning of the clip). And 1000 kbps would be more than acceptable.

Lastly, even though PSNR is an outdated method, I see no reason a high-definition standard could not include metrics from both objective quality tests. There are other objective video quality algorithms, and certainly more will be developed in the future, so any standard should be open to extension at a later date.

(side note: maybe part of our problem is this emphasis on bit rate--which has no relationship to quality beyond "more is probably better"--when our real emphasis should be on a metric that correlates with quality, but I digress)

I don't really care what objective metric is used, and certainly there is plenty of debate over which objective method correlates best with MOS, and what threshold should be used--but let's at least be scientific about this. If there's going to be a "standard" for High Quality video, then let's choose a standard that will carry us forward and not create a quagmire.

Monday, May 04, 2009

Dealing with Image Formats

One of the most common tasks when working with video is dealing with colorspaces and image formats. In this post, I'll discuss the two major colorspaces commonly used in Microsoft code and how to convert between different formats within a given colorspace. In some future post, I might talk about converting one colorspace to a totally separate colorspace, but that topic is worthy of its own discussion.

In the Microsoft world, there are two colorspaces that we're concerned about: YUV and RGB.

RGB Color Space
RGB is generally the easiest colorspace to visualize, since most of us have dabbled with finger paints or crayons. By mixing various amounts of red, green, and blue, the result is a broad spectrum of colors. Here is a simple illustration to convey this colorspace:

The top image of the barn is what you see. The three pictures below it are the red, green, and blue components, respectively. When you add them together, voila, you get barnyard goodness. (sidenote: because you "add" colors together in the RGB colorspace, we call this an "additive" color model)

In the digital world, we have a convenient representation for RGB. Typically 0, 0, 0 corresponds with black (i.e. red, green and blue values are set to 0), and 255, 255, 255 is white. Intermediate values result in a large palette of colors. A common RGB format is RGB24, which allocates three 8 bit channels for red, green, and blue values. Since each channel has 256 possible values, the total number of colors this format can represent is 256^3, or 16,777,216 colors. There are also other RGB formats that use less/more data per channel (and thus, less/more data per pixel), but the general idea is the same. To get an idea of how many RGB formats exist, one need not go any farther than fourcc.org.

Despite the multitude of RGB formats, in the MSFT world, you can basically count on dealing with RGB24 or RGB32. RGB32 is simply RGB24 with an additional 8 bits devoted to an "alpha" channel specifying how translucent a given pixel is.
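As a concrete sketch of the memory layout (this is how Windows DIBs and DirectShow RGB32 buffers are typically laid out, with blue stored first):

// A rough sketch of a single RGB32 pixel as it typically sits in memory:
// blue first, then green, red, and the alpha (or unused) byte.
// RGB24 is the same thing minus the fourth byte.
struct RGB32Pixel
{
    BYTE blue;
    BYTE green;
    BYTE red;
    BYTE alpha; // unused or opacity, depending on who consumes the buffer
};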

YUV Color Space
YUV is substantially different from RGB. Instead of mixing three different colors, YUV separates the luminance and chroma into separate values, whereas RGB implicitly contains this information in the combination of its channels. Y represents the luminance component (think of this as a "black and white" channel, much like black and white television) and U and V are the chrominance (color) components. There are several advantages to this format over RGB that make it desirable in a number of situations:
  • The primary advantage of luminance/chrominance systems such as YUV is that they remain compatible with black and white analog television.
  • Another advantage is that the signal in YUV can be easily manipulated to deliberately discard some information in order to reduce bandwidth.
  • The human eye is more sensitive to luminance than chroma; in this sense, YUV is generally considered to be "more efficient" than RGB because more information is spent on data that the human eye is sensitive to.
  • It is more efficient to perform many common operations in the YUV colorspace than in RGB--for example, image/video compression. By nature, these operations occur more easily in a YUV colorspace. Often, the heavy lifting in many image processing algorithms is applied only to the luminance channel.
Thus far, the best way I've seen to visualize the YUV colorspace was on this site.

Original image on the left, and the single Y (luminance) channel on the right:



...And here are the U and V channels combined:



Notice that the Y channel is simply a black and white picture. All of the color information is contained in the U and V channels.

Like RGB, YUV has a number of sub-formats. Another quick trip to fourcc.org reveals a plethora of YUV types, and Microsoft also has this article on a handful of the different YUV types used in Windows. If anything, YUV is even more varied than RGB when it comes to formats.

The bad news is there are a lot of redundant YUV image formats. For example, YUY2 and YUYV are exactly the same format with different fourcc names. YUY2 and UYVY share the same 16 bpp, "packed" layout, but have the per-pixel byte order reversed. IMC4 and IMC2 are both 12 bpp, "planar" formats that differ only in having the U and V "planes" swapped. (more on packed/planar in a moment)

The good news is that it's pretty easy to go between the different formats without too much trouble, as we'll demonstrate later.

Packed/Planar Image Formats
The majority of image formats (in both the RGB and YUV colorspaces) are in either a packed or a planar format. These terms refer to how the image is formatted in computer memory:
  • Packed: the channels (either YUV or RGB) are stored in a single array, and all of the values are mixed together in one monolithic chunk of memory.
  • Planar: the channels are stored as three separate planes, each in its own contiguous block of memory.
For example, the following image shows a packed format:



This is YUY2. Notice that the Y, U, and V values simply sit alongside one another in a single array--they are not segregated in memory in any way. (The diagram above represents six pixels.) RGB24/RGB32/YUY2 are all examples of packed formats.

This image shows a planar format:



This is YV12. Notice that the three planes are separated in memory rather than mixed into a single, monolithic array. This layout is often desirable (especially in the YUV colorspace, where the luminance values can then easily be extracted). YV12 is an example of a planar format.


Converting Between Different Formats in the Same Color Space
Within a given colorspace are multiple formats. For example, YUV has multiple formats with differing amounts of information per pixel and layout in memory (planar vs. packed). Additionally, you may have different amounts of information for the individual Y, U, and V values, but most Microsoft formats typically allocate no more than 8 bits per channel.

As long as the Y, U, and V values for the source and destination images have equivalent allocation, converting between various YUV formats is reduced to copying memory around. For this section we'll deal with YUV formats, since RGB will follow the same general principles. As an example, let's convert from YUY2 to AYUV.

YUY2 is a packed, 16 bits/pixel format. In memory, it looks like so:

The above would represent the first six pixels of the image. Notice that every pixel gets a Y value, while only every other pixel carries a U and a V value. There is no alpha channel. The chroma is downsampled 2:1 horizontally.

A common misconception is that the # of bits per pixel is directly related to the color depth (i.e. the # of colors that can be represented). In YUY2, our color depth is 24 bits (there are 2^24 possible color combinations), but it's only 16 bits/pixel because the U and V channels have been down sampled.
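Since the bits-per-pixel figure is what actually determines buffer sizes, here is the arithmetic for the formats used in this post (a sketch that ignores stride/alignment; the helper names are mine, just for illustration):

// Buffer sizes for a width x height image, ignoring stride/alignment.
long GetBufferSizeYUY2(long width, long height) // 16 bpp packed
{
    return width * height * 2;
}
long GetBufferSizeAYUV(long width, long height) // 32 bpp packed
{
    return width * height * 4;
}
long GetBufferSizeYV12(long width, long height) // 12 bpp planar
{
    // Y plane: width * height bytes; U and V planes: (width * height) / 4 each.
    return (width * height * 3) / 2;
}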

AYUV, on the other hand, is a 32 bits/pixel packed format. Each pixel contains a Y, U, V, and Alpha channel. In memory, it ends up looking like so:

The above would represent the first three pixels of the image. Notice that each pixels has three full 8 bit values for the Y, U and V channels. There is no down sampling. There is also a fourth channel for an alpha value.

In going from YUY2 to AYUV, notice that the YUY2 image contains 16 bits/pixel whereas the AYUV contains 32 bits/pixel. If we wanted to convert from YUY2 to AYUV, we have a couple of options, but the easiest way is to simply reuse the U and V values contained in the first two pixels of the YUY2 image. Thus, we have to do no interpolation at all to go from YUY2 to AYUV--it's simply a matter of re-arranging memory. Since all the values are 8 bit, there isn't any additional massaging to do; they can simply be reused as is.

Here is a sample function to convert YUY2 to AYUV:
  
// Converts an image from YUY2 to AYUV. Input and output images must
// be of identical size. Function does not deal with any potential stride
// issues.
HRESULT ConvertYUY2ToAYUV( char * pYUY2Buffer, char * pAYUVBuffer, int IMAGEHEIGHT, int IMAGEWIDTH )
{
    if( pYUY2Buffer == NULL || pAYUVBuffer == NULL || IMAGEHEIGHT < 2
        || IMAGEWIDTH < 2 )
    {
        return E_INVALIDARG;
    }

    char * pSource = pYUY2Buffer; // Note: this buffer will be w * h * 2 bytes (16 bpp)
    char * pDest = pAYUVBuffer;   // Note: this buffer will be w * h * 4 bytes (32 bpp)
    char Y0, U0, Y1, V0;          // these are going to be our YUY2 values

    for( int rows = 0; rows < IMAGEHEIGHT; rows++ )
    {
        for( int columns = 0; columns < (IMAGEWIDTH / 2); columns++ )
        {
            // we'll copy two pixels at a time, since it's easier to deal with that way.
            Y0 = *pSource;
            pSource++;
            U0 = *pSource;
            pSource++;
            Y1 = *pSource;
            pSource++;
            V0 = *pSource;
            pSource++;

            // So, we have the first two pixels--because the U and V values are
            // subsampled, we *reuse* them when converting to 32 bpp.
            // First pixel
            *pDest = V0;
            pDest++;
            *pDest = U0;
            pDest++;
            *pDest = Y0;
            pDest += 2; // NOTE: not sure if you have to put in a value for the alpha channel--we'll just skip over it.

            // Second pixel
            *pDest = V0;
            pDest++;
            *pDest = U0;
            pDest++;
            *pDest = Y1;
            pDest += 2; // NOTE: not sure if you have to put in a value for the alpha channel--we'll just skip over it.
        }
    }

    return S_OK;
}

Note that the inner "for" loop processes two pixels at a time.

For a second example, let's convert from YV12 to YUY2. YV12 is a 12 bit/pixel, planar format. In memory, it looks like so:

...notice that every four-pixel Y block has one corresponding U value and one V value; to put it another way, each 2x2 block of Y values has a single U and V pair associated with it. And yet another way to visualize it: the U and V planes are each one quarter the size of the Y plane.

Since all of the YUV channels are 8 bits/pixel, again--it comes down to selectively moving memory around. No interpolation is required:
  
// Converts an image from YV12 to YUY2. Input and output images must
// be of identical size. Function does not deal with any potential stride
// issues.
HRESULT ConvertYV12ToYUY2( char * pYV12Buffer, char * pYUY2Buffer, int IMAGEHEIGHT, int IMAGEWIDTH )
{
    if( pYUY2Buffer == NULL || pYV12Buffer == NULL || IMAGEHEIGHT < 2
        || IMAGEWIDTH < 2 )
    {
        return E_INVALIDARG;
    }

    // Let's start out by getting pointers to the individual planes in our
    // YV12 image. Note that the size of the Y plane in a YV12 image is
    // simply the image height * image width. This is because all values
    // are 8 bits. Also notice that the U and V planes are one quarter
    // the size of the Y plane (hence the division by 4).
    BYTE * pYV12YPlane = (BYTE*)pYV12Buffer;
    BYTE * pYV12VPlane = pYV12YPlane + ( IMAGEHEIGHT * IMAGEWIDTH );
    BYTE * pYV12UPlane = pYV12VPlane + ( ( IMAGEHEIGHT * IMAGEWIDTH ) / 4 );

    BYTE * pYUY2BufferCursor = (BYTE*)pYUY2Buffer;

    // Keep in mind that YV12 has only half of the U and V information that
    // a YUY2 image contains. Because of that, we need to reuse the U and
    // V plane values, so we only advance those planes every other row
    // of pixels.
    bool bMustIncrementUVPlanes = false;

    for( int row = 0; row < IMAGEHEIGHT; row++ )
    {
        // Two temporary cursors for our U and V planes, which are the weird ones to deal with.
        BYTE * pUCursor = pYV12UPlane;
        BYTE * pVCursor = pYV12VPlane;

        // We process two pixels per pass through this loop,
        // hence the (IMAGEWIDTH / 2).
        for( int column = 0; column < ( IMAGEWIDTH / 2 ); column++ )
        {
            // first things first: copy our Y0 value.
            *pYUY2BufferCursor = *pYV12YPlane;
            pYUY2BufferCursor++;
            pYV12YPlane++;

            // Copy U0 value
            *pYUY2BufferCursor = *pUCursor;
            pYUY2BufferCursor++;
            pUCursor++;

            // Copy Y1 value
            *pYUY2BufferCursor = *pYV12YPlane;
            pYUY2BufferCursor++;
            pYV12YPlane++;

            // Copy V0 value
            *pYUY2BufferCursor = *pVCursor;
            pYUY2BufferCursor++;
            pVCursor++;
        }

        // Since YV12 has half the UV data that YUY2 has, we reuse these
        // values--so we only advance the U and V planes every other pass
        // through the row loop.
        if( bMustIncrementUVPlanes )
        {
            pYV12VPlane += IMAGEWIDTH / 2;
            pYV12UPlane += IMAGEWIDTH / 2;
            bMustIncrementUVPlanes = false;
        }
        else
        {
            bMustIncrementUVPlanes = true;
        }
    }

    return S_OK;
}


This code is a little more complicated than the previous sample. Because YV12 is a planar format and contains half of the U and V information contained in a YUY2 image, we end up reusing U and V values. Still, the code itself isn't particularly daunting.

One thing to realize: neither of the above functions are optimized in any way, and there are multiple ways of doing the conversion. For example, here's an in-depth article about converting YV12 to YUY2 and some performance implications on P4 processors. Some people have also recommended doing interpolation on pixel values, but in my (limited and likely anecdotal) experience, it doesn't make a substantial difference.

Monday, February 02, 2009

How to fix Vista lag spikes

Vista has a nasty bug that can cause lag spikes when attached to a wireless network; every sixty seconds, the PC will experience a rather substantial lag spike:

10.0.0.1 is my router; notice everything is going along just fine, and then out of the blue, I get a ping time of 836 milliseconds, which for any latency-sensitive application is a very consequential amount of time. And it isn't like my computer is attempting to ping some distant target over the nebulous framework of the Internet: this is my router. The most adjacent device to my computer on the network. If you let ping run for ten or fifteen minutes, you'll see the spacing between lag spikes is exactly 60 seconds.

It goes without saying that this is completely unacceptable, especially if you're into online gaming (WoW, CS, etc.) or VOIP. Even more unacceptable is that this issue has been going on for years and MSFT has yet to fix it. A quick google search for 60-second Vista Lag Spike reveals just how pervasive the issue is. Numerous people have "fixes" which involve installing weird applications from the Internet, which I was opposed to even trying since I don't like running "weird applications from the Internet."

The issue has to do with the WLAN Autoconfig service:

Unfortunately, this service is also in control of a whole host of junk, so stopping the service results in your wireless connection dying. For a while, it seemed like there might be some love in the registry key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Wlansvc\Interfaces\INTERFACE_NAME\ScanInterval, which is set to 60 EA 00 00 (that's the little-endian form of 0xEA60, or 60000--the number of milliseconds that elapse between each lag spike), but the ScanInterval value is simply overwritten with good ol' 0xEA60 every time the WLAN Autoconfig service starts. (side rant: why bother putting a value in the registry if the service is simply going to overwrite it every time it starts??? Is this quality of code really worth paying several hundred dollars for?)

Finally, I found a fix that is explained in this article, which simply disables the autoconfig portion on a specific network adapter. Unfortunately, that fix has some problems: with autoconfig disabled, your wireless won't connect after a reboot, which means re-enabling the service, letting the connection come up, and then shutting it off again. It also means I have to explain to non-computer-savvy significant others how to do this, since said significant other enjoys a lag-free WoW experience just as much as I do.

In any event, the quick and dirty fix is to open up a command prompt and type netsh wlan set autoconfig enabled=no interface="Wireless Network Interface", where "Wireless Network Interface" is the name of your wireless adapter. The problem with this is that if you forget to re-enable the service when you shut down or reboot your computer, your network connection won't come up correctly. To fix this, enable the service (change "enabled=no" to "enabled=yes"), and once the network is connected, disable the service again.

All I can say is: totally unacceptable for a product that's been in the market for a few years now. Hopefully Windows 7 doesn't share this same "feature." And hopefully MSFT realizes that crap like this makes me want to either keep on using Windows XP SP3 or "upgrading" to Ubuntu; c'mon, if you're going to charge us for software, at least perform basic QA and resolve issues in a timely manner.