Digital Audio/Video Playback

Mathieu Lacage


    
  

This article is copyrighted to Mathieu Lacage. Copying and/or reuse of this document is not allowed. You may link to this page but you may not redistribute copies and/or modifications of this article.

Revision History
Revision 0.0.3, 21-07-2003

Describe solution, fix diagram conversion.

Revision 0.0.2, 27-05-2003

Improve first section, discuss detailed buffering requirements, start work on second section.

Revision 0.0.1, 24-05-2003

xmlize, most of first section completed.


Table of Contents

Player architecture
Encoder architecture
Decoder architecture
Decoder buffering
Buffering and Asynchronous user interaction
The problem
A solution

Player architecture

All digital audio/video players share a common architecture. This common architecture is simply the result of common constraints on the types of input and output media and on the processing required to digitize the input data, process it, and regenerate an analog output.

The traditional digital workflow is asymmetric: the analog source is first digitized and then compressed by the digital content provider. The digital player then decompresses the digitized content and regenerates analog output for the viewer/listener through a digital-to-analog converter feeding the output device (usually a TV set for the video and a set of speakers for the audio).

Encoder architecture

The encoder's architecture is outlined below:

The multiplexed stream generated by the encoder contains several types of information:

  • Audio: a compressed (sometimes uncompressed) version of the original analog audio.

  • Video: a compressed (sometimes uncompressed for really high-end professional applications) version of the original visual scene.

  • Sub Picture: a data stream which contains subtitles and/or similar information.

  • Copyright information: a stream of data which describes the copyright status of the other streams.

  • Timing information: each audio frame and each video frame is tagged with a timestamp which represents the time at which the encoder generated this content. An audio frame and a video frame which were captured at the same time are tagged with the same timestamp value. The decoder is expected to present at the same time content which is tagged with the same timestamp value.

A multiplexed stream looks like this:

In the example above, the video frame tagged with t2 is synchronized with the audio frame tagged with t1 because t2 is equal to t1.

The format of these multiplexed streams follows a rather generic pattern: they are all based on the idea that the information associated with the data is inserted in-band. Usually, the final bitstream is a concatenation of packets. Each packet starts with a header which includes a startcode to enable the demuxer to synchronize with the input bitstream, the type of the packet payload, the timestamp associated with this payload if there is one, and the size in bytes of the payload in this packet.
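The packet pattern described above can be sketched in a few lines of Python. The header layout used here (field order, field widths, the flag byte and the type codes) is purely hypothetical and chosen for readability; it is not the layout defined by Mpeg2 or by any other real system format. The names pack_packet and parse_packets are likewise illustrative.

```python
import struct

# Hypothetical header layout (NOT the real Mpeg2 layout):
#   3 bytes  startcode 0x000001
#   1 byte   payload type
#   1 byte   flags (bit 0 set when a timestamp follows)
#   2 bytes  payload size in bytes, big-endian
#   8 bytes  timestamp, big-endian (present only when flagged)
STARTCODE = b"\x00\x00\x01"
AUDIO, VIDEO = 0xC0, 0xE0  # illustrative type codes

def pack_packet(ptype, payload, timestamp=None):
    """Build one packet: header followed by the payload bytes."""
    flags = 1 if timestamp is not None else 0
    header = STARTCODE + struct.pack(">BBH", ptype, flags, len(payload))
    if timestamp is not None:
        header += struct.pack(">Q", timestamp)
    return header + payload

def parse_packets(stream):
    """Yield (type, timestamp or None, payload) for each packet found."""
    pos = 0
    while True:
        # Resynchronize on the next startcode, skipping any garbage.
        pos = stream.find(STARTCODE, pos)
        if pos < 0 or pos + 7 > len(stream):
            return
        ptype, flags, size = struct.unpack_from(">BBH", stream, pos + 3)
        pos += 7
        timestamp = None
        if flags & 1:
            if pos + 8 > len(stream):
                return  # truncated header
            (timestamp,) = struct.unpack_from(">Q", stream, pos)
            pos += 8
        yield ptype, timestamp, stream[pos:pos + size]
        pos += size  # the size field lets the parser skip the payload
```

Because the header carries the payload size, the parser never searches for a startcode inside a payload; the startcode is only needed to find the first packet or to recover after corrupted data.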

Decoder architecture

The decoder's architecture is pretty similar to the encoder's architecture:

The decoder feeds the multiplexed content read from the source device (a network socket or an IDE DVD drive) to the demultiplexer, which separates the audio, video and timing bitstreams from the main bitstream. The output of the demux is then decompressed (the decompressed data keeps the timestamp tags which were associated with the compressed data).

Finally, the decoder's synchronization block synchronizes the output of the decompressed audio and video so that the timestamps of the frames presented from each decompressed bitstream are as close as possible to each other.
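One possible shape for such a synchronization block is sketched below. It assumes that the audio output acts as the master clock and that video frames are shown, dropped or delayed to follow it; this policy, the tolerance parameter and the function name next_video_frame are assumptions made for illustration and are not taken from any particular player.

```python
from collections import deque

def next_video_frame(video_queue, audio_time, tolerance):
    """Return the queued frame to show at audio_time, or None to keep waiting.

    video_queue is a deque of (timestamp, frame) pairs in presentation order.
    Frames older than audio_time - tolerance are dropped as too late.
    """
    while video_queue and video_queue[0][0] < audio_time - tolerance:
        video_queue.popleft()              # too late: drop it
    if video_queue and video_queue[0][0] <= audio_time + tolerance:
        return video_queue.popleft()[1]    # close enough to the audio clock
    return None                            # head frame is still in the future
```

For example, with frames queued at timestamps 0, 40, 80 and 120 and a tolerance of 5, a call at audio time 20 returns None (the frame at 40 is not due yet), while a call at audio time 85 drops the frame at 40 and returns the frame at 80.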

Decoder buffering

The rest of this paper will assume that we are dealing with an Mpeg2 decoder used for DVD or SVCD playback. This assumption is actually not that restrictive since all the other system-level formats share similar characteristics to those of a normal Mpeg2 decoder.

Buffering is necessary in the decoder for two rather different reasons:

  • Video decoding/reordering.

  • Transport latency.

(We assume here that the buffering required for audio playback is negligible compared to the requirements for the video stream.)

Mpeg video decoding

All Mpeg video standards are based on the same set of ideas. To reach good compression factors, the encoder does not encode each video frame separately but rather encodes the differences between the frames. Of course, in a real movie, a lot of the video frames are very similar if the character movement is minimal.

Encoding the difference rather than the frame taken in isolation means that to decode a frame, you need to keep around the reference frames used to recalculate the final frame. Specifically, we have 3 kinds of frames: I, P and B frames. I frames are frames encoded without referring to any other frame: they are considered to be reference frames. P frames contain references to frames in the past: they are also considered to be reference frames. Finally, B frames contain references to frames both in the past and in the future. To decode a P frame we need the previous reference frame present in the stream (that is, the previous P or I frame). To decode a B frame, we need the two previous reference frames present in the stream (often, two P frames or an I and a P frame); one of them is displayed before the B frame and the other after it.
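The dependency rules above imply that the decoder must keep at most two decoded reference frames around at any time. A minimal sketch follows, assuming that frames are identified by strings whose first letter is the frame type (for example "I0" or "B3") and that the stream is given in decoding order; the function name references_needed is illustrative.

```python
def references_needed(decode_order):
    """Map each frame name to the reference frames required to decode it."""
    refs = []    # the (at most two) most recently decoded I or P frames
    needed = {}
    for name in decode_order:
        kind = name[0]
        if kind == "I":
            needed[name] = []            # decoded on its own
        elif kind == "P":
            needed[name] = refs[-1:]     # previous reference frame
        else:
            needed[name] = refs[-2:]     # B frame: two previous references
        if kind in "IP":
            refs = (refs + [name])[-2:]  # keep only the two newest references
    return needed
```

Running it on the short stream I0 P0 B0 B1 P1 B2 shows that B0 and B1 both need I0 and P0, while B2 needs P0 and P1: once P1 has been decoded, I0 can be discarded.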

Typical Mpeg2 encoders generate rather simple video streams: if the source is a 24 images/s cinema movie, they will often output an I-frame every 12 images (that is, an I-frame every 12/24=0.5 second) and provide hooks to specify how many P-frames must be output between these I-frames. They fill the gaps with B frames. A classic DVD video stream looks like this: I0 P0 B0 B1 P1 B2 B3 P2 B4 B5 P3 B6 B7 P4 B8 I1 B9 B10 P5

The frames are stored in the stream in decoding order, which means you do not need to read ahead in the stream to decode it. However, the display order is often different from the decoding order (the display order is described in each frame), which means the decoder must decode ahead of what it displays and hold each decoded reference frame until its turn comes. Since a B frame is displayed between the two reference frames it depends on, the canonical stream shown above will probably be displayed as follows: I0 B0 B1 P0 B2 B3 P1 B4 B5 P2 B6 B7 P3 B8 P4 B9 B10 I1 P5
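The reordering can be sketched as follows, under the assumption (stated here, and to be checked against the stream's own display-order information) that a B frame is displayed as soon as it is decoded, while a decoded I or P frame is held back until the next I or P frame arrives. Frame names and the function name display_order are illustrative.

```python
def display_order(decode_order):
    """Reorder frame names from decoding order to display order."""
    shown = []
    held = None  # last decoded reference frame, not displayed yet
    for name in decode_order:
        if name[0] == "B":
            shown.append(name)      # B frames go out as soon as decoded
        else:
            if held is not None:
                shown.append(held)  # a new reference releases the held one
            held = name
    if held is not None:
        shown.append(held)          # flush the last reference at end of stream
    return shown
```

Under this rule the decoder needs one extra frame of storage beyond the two reference frames used for decoding, and display lags decoding by at most one reference frame.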

XXX: make sure this is right. Insert evaluation of buffering time.

Transport latency

The second reason for buffering is related to the way the data is transferred between the encoder and the decoder.

In a satellite or cable broadcast using Mpeg2 transport, the medium's transfer rate might vary due to transmission errors or congestion. The receiver must thus compensate for the variation in transfer bitrate by buffering.

In DVD or VCD playback, the DVD drive must transfer the Mpeg data from the DVD disc to the decoder RAM. Of course, the seek time and transfer time of the DVD drive controller are not zero, and they must be taken into account to ensure that the Mpeg decoder and its display controller do not starve.
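A first-order way to reason about the amount of buffering is to ask how many bytes the decoder consumes while the source delivers nothing (during a seek or a transmission stall). The sketch below is only that first-order estimate; the function name min_buffer_bytes is illustrative, and the bitrate and stall duration in the example are made-up figures, not measurements of any real drive or broadcast link.

```python
def min_buffer_bytes(bitrate_bps, worst_stall_ms):
    """Bytes consumed by the decoder during the longest expected stall.

    bitrate_bps:    stream bitrate in bits per second
    worst_stall_ms: longest time, in milliseconds, during which the
                    source delivers no data
    """
    bits = bitrate_bps * worst_stall_ms          # bit-milliseconds
    return -(-bits // 8000)                      # ceiling division to bytes
```

With a hypothetical 8 Mbit/s stream and a hypothetical 200 ms worst-case stall, the function returns 200000 bytes: the buffer must hold at least that much data before the stall begins for the decoder not to starve.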

Insert evaluation of buffering time.