Monday, December 21, 2009

H264 media subtypes in DirectShow

The H.264 media subtypes supported by DirectShow are outlined in an MSDN article. Although the article lists five separate types, there are really only two distinct types:
  1. MEDIASUBTYPE_AVC1: H.264 bitstream without start codes, and
  2. MEDIASUBTYPE_H264: H.264 bitstream with start codes

Basically, a subtype in DirectShow communicates the type of media that a filter outputs, so these types let connecting filters know that a filter outputs some variant of H.264. However, both of these types are, to differing degrees, nonsensical. Microsoft could have done a better job defining them.

For AVC1, the actual sample data is prefixed with its length in big-endian format, which is completely redundant with DirectShow's IMediaSample interface. IMediaSample already provides GetActualDataLength() to communicate the length of the data in the buffer, so this additional field in the sample buffer is unnecessary and is one source of error when writing a filter. You often see code that resembles:

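IMediaSample* pSample;  // sample delivered by the upstream filter
BYTE* pBuffer = NULL;
pSample->GetPointer(&pBuffer);
// Skip the 4-byte length prefix and shrink the reported length to match.
DoSomethingWithMyH264Data(pBuffer + 4, pSample->GetActualDataLength() - 4);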

The only possible benefit I can see to prefixing the length is that it makes it easy to convert back to a type with NALU start codes (0x00000001 can simply overwrite a 4-byte size prefix, for example, and voila: start codes), but that's really not a big deal.
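As a rough sketch of that overwrite trick, here is what the conversion might look like in plain C++ with no DirectShow headers. The function names are made up for illustration, and the code assumes a fixed 4-byte length field as described above; it also tolerates more than one length-prefixed NALU in the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads a 4-byte big-endian length field.
static uint32_t ReadBE32(const uint8_t* p) {
    return (uint32_t(p[0]) << 24) | (uint32_t(p[1]) << 16) |
           (uint32_t(p[2]) << 8) | uint32_t(p[3]);
}

// Hypothetical helper: overwrites every 4-byte length prefix in place with
// the start code 00 00 00 01. Returns the number of NALUs converted, or -1
// if a length field runs past the end of the buffer (in which case the
// buffer is left untouched).
int LengthPrefixesToStartCodes(uint8_t* data, size_t size) {
    assert(data != nullptr || size == 0);

    // First pass: validate every length field before modifying anything.
    int count = 0;
    size_t pos = 0;
    while (pos < size) {
        if (size - pos < 4) return -1;
        uint32_t len = ReadBE32(data + pos);
        if (len > size - pos - 4) return -1;
        pos += 4 + size_t(len);
        ++count;
    }

    // Second pass: replace each prefix with a start code.
    pos = 0;
    for (int i = 0; i < count; ++i) {
        uint32_t len = ReadBE32(data + pos);
        data[pos] = 0;
        data[pos + 1] = 0;
        data[pos + 2] = 0;
        data[pos + 3] = 1;
        pos += 4 + size_t(len);
    }
    return count;
}
```

In a real filter the pointer and size would come from the sample's GetPointer() and GetActualDataLength(). Validating first means a malformed sample can be dropped without leaving a half-converted buffer behind.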

The good thing about AVC1 is that the subtype defines a single NALU per sample buffer, which at least makes this type reasonably compatible with the DirectShow architecture. Also good: the SPS/PPS info is communicated during pin negotiation, which helps when determining whether your decoder can handle the incoming H.264 video. (For example, if the video is something your decoder cannot handle, you can refuse the pin connection in the first place instead of having to parse incoming video. Transform filters don't have to wait for SPS/PPS info to lumber along in the stream, which is not guaranteed to happen. And so on.)
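To make the "refuse the connection up front" idea concrete, here is a minimal sketch that pulls profile_idc and level_idc out of the first SPS in the format block's parameter-set bytes. It rests on an assumption not stated in this post: that the parameter sets arrive as a list of NALUs, each prefixed with a 2-byte big-endian length, which is how I understand the MSDN description of the AVC1 format block. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch.
struct SpsInfo {
    bool found;
    uint8_t profileIdc;  // e.g. 66 = Baseline, 77 = Main, 100 = High
    uint8_t levelIdc;    // level times 10, e.g. 30 = level 3.0
};

// Walks a list of [2-byte big-endian length][NALU] entries (ASSUMED layout)
// and returns the profile and level from the first SPS (nal_unit_type 7).
SpsInfo FindSpsInfo(const uint8_t* hdr, size_t size) {
    assert(hdr != nullptr || size == 0);
    SpsInfo info = {false, 0, 0};
    size_t pos = 0;
    while (size - pos >= 2) {
        size_t len = (size_t(hdr[pos]) << 8) | hdr[pos + 1];
        pos += 2;
        if (len > size - pos) break;  // truncated entry: stop parsing
        // SPS layout: NAL header, profile_idc, constraint flags, level_idc.
        if (len >= 4 && (hdr[pos] & 0x1F) == 7) {
            info.found = true;
            info.profileIdc = hdr[pos + 1];
            info.levelIdc = hdr[pos + 3];
            return info;
        }
        pos += len;
    }
    return info;
}
```

A pin's media type check could then reject the connection when found is false, or when the profile or level is beyond what the decoder supports.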

I wouldn't say AVC1 is bad; I'd reserve that designation for the certifiably stupid H264 subtypes. The H264 subtypes define a raw H.264 byte stream, which means they deliver NALUs delimited with the NALU start code (0x00000001 or 0x000001). Since the H264 subtypes do not specify one NALU per buffer, I assume a filter can deliver multiple NALUs per IMediaSample. People on the newsgroups even interpret this subtype to mean that "NALU boundaries don't even have to respect the sample boundaries," which means you could get a sample that holds only part of a NALU, or one that holds the tail of one NALU and the head of the next, and so on.
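For a sense of the extra work this pushes onto the receiving filter, here is a sketch of splitting one buffer on start codes. It is plain C++ with hypothetical names, not taken from any DirectShow sample, and it assumes the buffer begins at a NALU boundary, which is exactly what this subtype does not promise.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offset and length of one NALU payload (start code excluded).
struct NaluSpan {
    size_t offset;
    size_t length;
};

// Hypothetical helper: finds NALUs delimited by 00 00 01 start codes. The
// 4-byte form 00 00 00 01 is the same pattern preceded by a zero byte.
std::vector<NaluSpan> SplitOnStartCodes(const uint8_t* data, size_t size) {
    assert(data != nullptr || size == 0);
    std::vector<NaluSpan> nalus;
    bool open = false;  // true once the first start code has been seen
    size_t start = 0;   // payload offset of the NALU currently open
    size_t i = 0;
    while (i + 3 <= size) {
        if (data[i] == 0 && data[i + 1] == 0 && data[i + 2] == 1) {
            if (open) {
                // Zero bytes right before a start code belong to the
                // delimiter; a complete NALU never ends in 0x00.
                size_t end = i;
                while (end > start && data[end - 1] == 0) --end;
                nalus.push_back({start, end - start});
            }
            start = i + 3;
            open = true;
            i += 3;
        } else {
            ++i;
        }
    }
    // The last span runs to the end of the buffer. If NALUs may straddle
    // samples, it can be incomplete and needs stitching to the next sample.
    if (open) nalus.push_back({start, size - start});
    return nalus;
}
```

Bytes that appear before the first start code are ignored here; under the looser reading of the subtype they would be the tail of a NALU from the previous sample.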

But this behavior is completely nonsensical from a timestamping perspective. Each IMediaSample carries a single timestamp, so when a sample can hold several NALUs or only a fragment of one, a downstream filter cannot tell which timestamp belongs to which NALU. That effectively makes it impossible to timestamp incoming H.264 correctly, which really only makes this subtype feasible for live streaming, and you can pretty much forget about syncing with audio in any meaningful way.

Needless to say, this completely violates the IMediaSample abstraction. GetActualDataLength() is wrong in the same way it is wrong for AVC1 (the NALU start codes are present in the buffer), but so are GetMediaTime (time for which NALU?), IsSyncPoint (precisely which NALU is a sync point?), IsDiscontinuity, etc. Of course, if you decide to make a filter that outputs the H264 subtype, you can always make it output a single NALU per IMediaSample buffer, but that doesn't change the fact that filters which accept H264 subtypes cannot depend on an IMediaSample being a discrete NALU, which makes interpreting the IMediaSample data problematic.

Furthermore, in the H264 subtypes, the SPS/PPS info is not communicated in the media format exchange, which means downstream filters cannot prepare their decoder until they have parsed incoming data and located the SPS/PPS info. This simply makes no sense. All proper RTP H.264 streams should communicate SPS/PPS info in the SDP exchange, and the MP4 format stores the info separately from the sample data precisely because it is so important for decoders to reach it quickly and begin decoding immediately. Not having this info up front makes it far more difficult (although not impossible, by any means) to write a filter, and there is rarely a good reason not to provide it.
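The bookkeeping a downstream filter ends up carrying looks roughly like the sketch below: look at each NALU's header byte and hold back slices until an SPS, a PPS and then an IDR slice have gone by. The class and its policy are my own illustration, not part of DirectShow; the nal_unit_type values (5 = IDR slice, 7 = SPS, 8 = PPS) come from the H.264 spec.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// nal_unit_type is the low 5 bits of the first NALU byte.
enum NalType : uint8_t { kNalIdrSlice = 5, kNalSps = 7, kNalPps = 8 };

// Hypothetical gate that decides whether a NALU can go to the decoder yet.
class ParameterSetGate {
public:
    // Takes one NALU without its start code. Returns true if it should be
    // passed to the decoder, false if it has to be dropped.
    bool Accept(const uint8_t* nalu, size_t size) {
        assert(nalu != nullptr && size > 0);
        const uint8_t type = nalu[0] & 0x1F;
        if (type == kNalSps) { haveSps_ = true; return true; }
        if (type == kNalPps) { havePps_ = true; return true; }
        // Decoding can only begin at an IDR slice once both sets are known.
        if (!started_ && type == kNalIdrSlice && haveSps_ && havePps_) {
            started_ = true;
        }
        return started_;
    }

    bool Started() const { return started_; }

private:
    bool haveSps_ = false;
    bool havePps_ = false;
    bool started_ = false;
};
```

Everything that arrives before the gate opens is thrown away, which is the startup cost of not having the parameter sets in the media type.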

In any event, my advice: avoid the H264 subtypes. They are ridiculous and only make your filter a pain in the ass to deal with. The AVC1 format is better, but it could be improved by removing the superfluous size field from the media buffer.