Video Coding Fundamentals

Juan Francisco Rodríguez Herrera
Vicente González Ruiz

September 27, 2014

1 Memory requirements of PCM video
2 Sources of redundancy
3 Block-based ME (Motion Estimation)
4 Sub-pixel accuracy
5 Matching criteria (similitude between macroblocks)
6 Searching strategies
7 The GOP (Group Of Pictures) concept
8 Lossy predictive video coding
9 MCTF (Motion Compensated Temporal Filtering)
10 t+2d vs. 2d+t vs. 2d+t+2d
11 Deblocking filtering
12 The bit-rate allocation problem
13 Quality scalability
14 Temporal scalability
15 Spatial scalability

1 Memory requirements of PCM video

In RGB (PCM) video, each color pixel needs at least 24 bpp (bits/pixel).
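The memory requirements of RGB video are enormous. For example, an hour of 640 × 480 × 25 Hz true-color PCM video needs:

$25\,\frac{\text{images}}{\text{second}} \times 640\cdot 480\,\frac{\text{pixels}}{\text{image}} \times 24\,\frac{\text{bits}}{\text{pixel}} = 184{,}320{,}000\,\frac{\text{bits}}{\text{second}}$

$184{,}320{,}000\,\frac{\text{bits}}{\text{second}} \times 3{,}600\,\frac{\text{seconds}}{\text{hour}} \times \frac{1\,\text{G}}{1{,}024^{3}} \times \frac{1\,\text{byte}}{8\,\text{bits}} \approx 77\,\frac{\text{Gbytes}}{\text{hour}}$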
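The arithmetic above can be checked with a few lines of Python. This is only a sketch of the same calculation; the constant names are chosen here for illustration and 1 G is taken as 1,024^3, as in the formula.

```python
# Raw (PCM) video storage, using the figures from the text.
WIDTH, HEIGHT = 640, 480      # pixels per image
FRAME_RATE = 25               # images per second
BITS_PER_PIXEL = 24           # true-color RGB
SECONDS_PER_HOUR = 3600

bits_per_second = FRAME_RATE * WIDTH * HEIGHT * BITS_PER_PIXEL
bytes_per_hour = bits_per_second * SECONDS_PER_HOUR // 8
gbytes_per_hour = bytes_per_hour / 1024**3   # 1 G = 1024^3 bytes here

print(bits_per_second)             # bits needed for one second of video
print(round(gbytes_per_hour, 2))   # Gbytes needed for one hour of video
```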

2 Sources of redundancy

Spatial redundancy: Neighboring pixels tend to be very similar, and textures tend to repeat.
Temporal redundancy: Temporally adjacent images are typically very alike.
Visual redundancy: Humans hardly perceive high spatial and temporal frequencies (we are more sensitive to low frequencies).

3 Block-based ME (Motion Estimation)

Usually, only performed by the encoder.
ME removes temporal redundancy. A predicted image can be encoded as the difference between it and another image, called the prediction image, which is a motion-compensated projection of one or more images named reference images. ME tries to generate residue images that are as close as possible to the null image.
The reference image (or images) is divided into blocks of 16 × 16 pixels called macroblocks.
Each reference block is searched for in the predicted image, and the best match is indicated by means of a motion vector.
Depending on whether the search is successful and on the number of reference images, the macroblocks are classified into:
1. I (intra): When compressing the residue block generates more bits than compressing the original (predicted) block.
2. P: When it is better to compress the residue block and there is only one reference macroblock.
3. B: The same, but with two reference macroblocks.
4. S (skipped): When the energy of the residue block is smaller than a given threshold.
I-pictures are composed of I macroblocks, only.
P-pictures do not have B macroblocks.
B-pictures can have any type of macroblocks.
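The classification rules listed above can be sketched as a small decision function. The function name, its parameters and, in particular, the order in which the tests are applied are assumptions made here for illustration; the text does not specify how an encoder prioritizes them.

```python
def classify_macroblock(original_bits, residue_bits, residue_energy,
                        num_references, skip_threshold):
    """Return 'S', 'I', 'P' or 'B' for one macroblock (hypothetical helper).

    original_bits:  bits needed to code the original (predicted) block.
    residue_bits:   bits needed to code the residue block.
    residue_energy: energy of the residue block (e.g. sum of squares).
    num_references: number of reference macroblocks used (1 or 2).
    """
    # Assumption: the skip test is checked first.
    if residue_energy < skip_threshold:
        return "S"
    # Coding the residue costs more than coding the block itself.
    if residue_bits > original_bits:
        return "I"
    # Otherwise the residue is coded; the type depends on the references.
    return "B" if num_references == 2 else "P"


print(classify_macroblock(100, 50, 1000, 1, 10))
print(classify_macroblock(100, 150, 1000, 1, 10))
```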

4 Sub-pixel accuracy

Motion estimation can be carried out using integer-pixel accuracy or fractional-pixel accuracy.
For example, in MPEG-1, the motion estimation can have 1/2 pixel accuracy.
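One common way to obtain half-pixel positions is to interpolate the reference image to twice its resolution and then search on that finer grid. The sketch below uses plain bilinear averaging without rounding, which is an assumption made here for illustration; the function name `half_pixel_grid` is hypothetical and a real codec defines its own interpolation and rounding rules.

```python
def half_pixel_grid(image):
    """Bilinearly interpolate a 2-D list to half-pixel resolution.

    Output sample (y, x) corresponds to position (y/2, x/2) of the input,
    so even indices reproduce the original pixels.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * (2 * cols - 1) for _ in range(2 * rows - 1)]
    for y in range(2 * rows - 1):
        for x in range(2 * cols - 1):
            y0, x0 = y // 2, x // 2
            y1 = y0 + (y % 2)   # next row only at half positions
            x1 = x0 + (x % 2)   # next column only at half positions
            out[y][x] = (image[y0][x0] + image[y0][x1] +
                         image[y1][x0] + image[y1][x1]) / 4
    return out


for row in half_pixel_grid([[0, 2], [4, 6]]):
    print(row)
```

With this grid, a motion vector expressed in output samples has units of 1/2 pixel of the original image.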

5 Matching criteria (similitude between macroblocks)

Let a and b be the macroblocks that we want to compare:
1. Mean Square Error: $\frac{1}{16\times 16}\sum_{i=1}^{16}\sum_{j=1}^{16}(a_{ij}-b_{ij})^2$
2. Mean Absolute Error: $\frac{1}{16\times 16}\sum_{i=1}^{16}\sum_{j=1}^{16}|a_{ij}-b_{ij}|$
These similarity measures are used only by the compressor. Therefore, any other measure that has similar effects (such as the error variance or the error entropy) could also be used.
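Both criteria can be written directly from their definitions. In this sketch the blocks are plain lists of rows and the block size is taken from the input rather than fixed to 16 × 16; the function names `mse` and `mae` are chosen here for illustration.

```python
def mse(a, b):
    """Mean square error between two equally sized blocks."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n


def mae(a, b):
    """Mean absolute error between two equally sized blocks."""
    n = len(a) * len(a[0])
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n


a = [[1, 2], [3, 4]]
b = [[1, 0], [3, 8]]
print(mse(a, b), mae(a, b))
```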

6 Searching strategies

Only performed by the compressor.
1. Full search: All the possibilities are checked. Advantage: the best compression. Disadvantage: CPU intensive.
2. Logarithmic search: It is a version of the full search algorithm where the macroblocks and the search area are sub-sampled. After finding the best match, the resolution of the macroblock is increased by a factor of 2 and the previous match is refined in a search area of ±1, until the maximal resolution (1/1 or 1/2 pixel) is reached.
3. Telescopic search: Any of the previously described techniques can be sped up if the search area is reduced. This can be done by assuming that the motion vector of the same macroblock in two consecutive images is similar.
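As an illustration of the first strategy, the following sketch performs a full search for a single block. It uses the sum of absolute differences, which selects the same vector as the mean absolute error because the two differ only by a constant factor. The function names, the parameters `size` and `search_range`, and the rule of skipping candidates that fall outside the image are assumptions made here, not part of any particular codec.

```python
def block_sad(block_image, search_image, by, bx, dy, dx, size):
    """Sum of absolute differences between the block at (by, bx) in
    block_image and the block displaced by (dy, dx) in search_image."""
    total = 0
    for i in range(size):
        for j in range(size):
            total += abs(block_image[by + i][bx + j] -
                         search_image[by + dy + i][bx + dx + j])
    return total


def full_search(block_image, search_image, by, bx, size=16, search_range=4):
    """Check every displacement in [-search_range, search_range]^2 and
    return (best_vector, best_cost), with best_vector = (dy, dx)."""
    rows, cols = len(search_image), len(search_image[0])
    best_vector, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = by + dy, bx + dx
            # Assumption: candidates partly outside the image are skipped.
            if y < 0 or x < 0 or y + size > rows or x + size > cols:
                continue
            cost = block_sad(block_image, search_image, by, bx, dy, dx, size)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dy, dx), cost
    return best_vector, best_cost
```

The number of candidates grows with the square of `search_range`, which is why the text describes full search as CPU intensive. A telescopic variant could call the same function with a small `search_range` centered on the vector found for the same macroblock in the previous image.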

7 The GOP (Group Of Pictures) concept

Temporal redundancy is exploited within blocks of images called GOPs. This means that a GOP can be decoded independently of the other GOPs.

8 Lossy predictive video coding

Let $V_i$ be the i-th image of the video sequence and $V_i^{[q]}$ an approximation of $V_i$ with quality q (most video compressors are lossy). In this context, a hybrid video codec (t+2d) has the following structure:

9 MCTF (Motion Compensated Temporal Filtering)

This is a DWT (Discrete Wavelet Transform) where the input samples are the original video images and the output is a sequence of residue images.
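To make the idea of a wavelet transform over images concrete, the sketch below applies one level of a temporal Haar transform, written as lifting steps, to pairs of consecutive images. It is a deliberate simplification: zero motion is assumed, so no motion compensation is performed, whereas an actual MCTF would align the images along the motion vectors before the predict and update steps. The choice of the Haar filter and the function names are assumptions made here for illustration.

```python
def temporal_haar_step(images):
    """One temporal decomposition level over an even-length list of
    equally sized images (2-D lists). Returns (low, high), where high
    holds the residue images. Zero motion is assumed."""
    low, high = [], []
    for k in range(0, len(images) - 1, 2):
        even, odd = images[k], images[k + 1]
        # Predict step: residue = odd image minus its prediction (even).
        h = [[o - e for e, o in zip(row_e, row_o)]
             for row_e, row_o in zip(even, odd)]
        # Update step: low-pass image = even image plus half the residue.
        l = [[e + d / 2 for e, d in zip(row_e, row_h)]
             for row_e, row_h in zip(even, h)]
        high.append(h)
        low.append(l)
    return low, high


def inverse_temporal_haar_step(low, high):
    """Undo temporal_haar_step by reversing the lifting steps."""
    images = []
    for l, h in zip(low, high):
        even = [[a - d / 2 for a, d in zip(row_l, row_h)]
                for row_l, row_h in zip(l, h)]
        odd = [[e + d for e, d in zip(row_e, row_h)]
               for row_e, row_h in zip(even, h)]
        images.extend([even, odd])
    return images
```

Applying `temporal_haar_step` again to the `low` images gives the next temporal level, in the same way that a DWT is iterated on its low-pass band.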

10 t+2d vs. 2d+t vs. 2d+t+2d

t+2d: The sequence of images is decorrelated first along time (t) and the residue images are compressed, exploiting the remaining spatial (2d) redundancy. Examples: MPEG* and H.26* codecs (except H.264/SVC).
2d+t: The spatial (2d) redundancy is exploited first (typically using the DWT) and next the coefficients are decorrelated along time (t). To date, this has only been used in experimental setups because most transform domains are not invariant to displacement.
2d+t+2d: The first step creates a Laplacian Pyramid (2d), which is invariant to displacement. Next, each level of the pyramid is decorrelated along time (t) and, finally, the remaining spatial redundancy is removed (2d). Example: H.264/SVC.

11 Deblocking filtering

Block-based video encoders (those that use block-based spatial decorrelation) improve their performance if a deblocking filter is used to create the quantized predictions.

The low-pass filter is applied only on the block boundaries.
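A minimal sketch of such a filter is shown below. It smooths only the two samples on each side of every block boundary and leaves the interior of the blocks untouched. The (3, 1)/4 weights, the default block size and the function name `deblock` are assumptions made here for illustration; the filters used in standards such as H.264 are adaptive and considerably more elaborate.

```python
def deblock(image, block_size=8):
    """Low-pass filter a 2-D list across block boundaries only."""
    rows, cols = len(image), len(image[0])
    out = [[float(p) for p in row] for row in image]

    # Vertical boundaries: between columns x - 1 and x.
    for x in range(block_size, cols, block_size):
        for y in range(rows):
            p, q = out[y][x - 1], out[y][x]
            out[y][x - 1] = (3 * p + q) / 4
            out[y][x] = (p + 3 * q) / 4

    # Horizontal boundaries: between rows y - 1 and y.
    for y in range(block_size, rows, block_size):
        for x in range(cols):
            p, q = out[y - 1][x], out[y][x]
            out[y - 1][x] = (3 * p + q) / 4
            out[y][x] = (p + 3 * q) / 4
    return out


blocky = [[0, 0, 8, 8],
          [0, 0, 8, 8]]
print(deblock(blocky, block_size=2))
```

In the example, the step from 0 to 8 at the boundary between the two 2 × 2 blocks becomes the softer ramp 0, 2, 6, 8.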

12 The bit-rate allocation problem

Under a constant quantization level (constant video quality), the number of bits that each compressed image needs depends on the image content. Example:

The encoder must decide how much information will be stored in each residue image, taking into account that this image can serve as a reference for other images.
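The dependence of the bit count on the image content can be illustrated with a rough estimate. The sketch below quantizes an image with a fixed step and uses the zeroth-order entropy of the quantized samples as a stand-in for the output of an entropy coder. This estimator, the function name `estimated_bits` and the sample images are assumptions made here; a real encoder measures the bits actually produced.

```python
import math
from collections import Counter


def estimated_bits(image, step):
    """Estimate the bits needed to code a 2-D list quantized with `step`,
    as (zeroth-order entropy per sample) x (number of samples)."""
    symbols = [round(p / step) for row in image for p in row]
    n = len(symbols)
    counts = Counter(symbols)
    entropy = sum(c / n * math.log2(n / c) for c in counts.values())
    return entropy * n


flat_image = [[5, 5], [5, 5]]          # no detail
textured_image = [[0, 16], [32, 48]]   # every sample is different

print(estimated_bits(flat_image, 16))
print(estimated_bits(textured_image, 16))
```

With the same quantization step, the flat image needs no bits under this estimate while the textured one needs 2 bits per sample, which is the effect that a bit-rate allocation strategy has to deal with.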