Video Coding Fundamentals
Juan Francisco Rodríguez Herrera
Vicente González Ruiz
July 10, 2017
1 Sources of redundancy
- Spatial redundancy: Pixels are very similar to their neighbors or tend to
repeat textures.
- Temporal redundancy: Temporally adjacent images are typically very
alike.
- Visual redundancy: Humans hardly perceive high spatial and temporal
frequencies (we are more sensitive to low frequencies).
2 Memory requirements of PCM video
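As an illustration (assuming, for instance, a 1920×1080 sequence at 25 images/s with 8 bits per RGB component, i.e., 24 bits/pixel; these figures are ours, not from the original notes), raw PCM video requires:

    1920 × 1080 pixels/image × 3 bytes/pixel = 6,220,800 bytes/image (about 5.9 MiB)
    6,220,800 bytes/image × 25 images/s ≈ 155.5 MB/s ≈ 1.24 Gbit/s
    155.5 MB/s × 3600 s ≈ 560 GB for a single hour of video

Such rates explain why uncompressed (PCM) video is rarely stored or transmitted.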
3 Block-based ME (Motion Estimation)
- Usually, only performed by the encoder.
- ME removes temporal redundancy. A predicted image can be encoded as
the difference between it and another image, called the prediction image, which
is a motion-compensated projection of one or more images named reference
images. ME tries to generate residue images that are as close as possible to the
null (all-zero) image.
- Usually, the reference image(s) is (are) divided into blocks of 16×16
pixels called macroblocks.
- Each reference block is searched for in the predicted image, and the best match
is indicated by means of a motion vector.
- Depending on the success of the search and the number of reference images, the
macroblocks are classified into (see the sketch after this list):
  - I (intra): when the compression of the residue block would generate more bits than
    the compression of the original (predicted) one.
  - P (predicted): when it is better to compress the residue block and there is only
    one reference macroblock.
  - B (bidirectional): the same, but when there are two reference macroblocks.
  - S (skipped): when the energy of the residue block is smaller than
    a given threshold.
- I-pictures are composed of I macroblocks only.
- P-pictures do not have B macroblocks.
- B-pictures can contain any type of macroblock.
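The following minimal sketch (ours, not part of the original notes; all thresholds and bit counts are hypothetical inputs) summarizes how the above macroblock classification could be implemented:

    def classify_macroblock(residue_energy, skip_threshold,
                            intra_bits, residue_bits, num_references):
        # residue_energy: energy of the residue block after motion compensation
        # skip_threshold: energy below which the block is simply skipped
        # intra_bits:     bits needed to compress the original (intra) block
        # residue_bits:   bits needed to compress the residue block
        # num_references: 1 (one reference) or 2 (two references)
        if residue_energy < skip_threshold:
            return "S"                      # skipped
        if residue_bits >= intra_bits:
            return "I"                      # intra coding is cheaper
        return "B" if num_references == 2 else "P"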

4 Sub-pixel accuracy
- The motion estimation can be carried out using integer-pixel accuracy or
fractional (sub-pixel) accuracy.
- For example, in MPEG-1, the motion estimation can have up to 1/2 pixel
accuracy. A bi-linear interpolator is used:
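A possible formulation (our notation: $a$, $b$, $c$ and $d$ denote four neighboring integer-position pixels) of that interpolation is:

    $h  = (a + b)/2$            (horizontal half-pel position)
    $v  = (a + c)/2$            (vertical half-pel position)
    $hv = (a + b + c + d)/4$    (diagonal half-pel position)

In practice these averages are computed in integer arithmetic with rounding.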
5 Matching criteria (similitude between macroblocks)
- Let $a$ and $b$ be the macroblocks that we want to compare. Two main distortion metrics are
commonly used (see the sketch after this list):
  - Mean Square Error: $\text{MSE}(a,b) = \frac{1}{16\times 16}\sum_{i,j}(a_{ij}-b_{ij})^2$
  - Mean Absolute Error: $\text{MAE}(a,b) = \frac{1}{16\times 16}\sum_{i,j}|a_{ij}-b_{ij}|$
- These similarity measures are used only by the compressor. Therefore, any
other metric with a similar effect (such as the error variance or the error entropy)
could also be used.
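As a quick sketch (ours; assuming 16×16 macroblocks stored as NumPy arrays), both metrics can be computed as:

    import numpy as np

    def mse(a, b):
        # Mean Square Error between two macroblocks
        d = a.astype(np.float64) - b.astype(np.float64)
        return np.mean(d ** 2)

    def mae(a, b):
        # Mean Absolute Error between two macroblocks
        d = a.astype(np.float64) - b.astype(np.float64)
        return np.mean(np.abs(d))

    a = np.random.randint(0, 256, (16, 16))
    b = np.random.randint(0, 256, (16, 16))
    print(mse(a, b), mae(a, b))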
6 Searching strategies
- Only performed by the compressor.
- Full search: All the possibilities are checked (see the sketch after this list).
Advantage: the best compression. Disadvantage: CPU killer.
- Logarithmic search: A version of the full-search algorithm in which the
macroblocks and the search area are sub-sampled. After finding the best
match, the resolution of the macroblock is increased by a
power of 2 and the previous match is refined in a search area of $\pm 1$ pixels,
until the maximal resolution (possibly with sub-pixel accuracy) is
reached.
- Telescopic search: Any of the previously described techniques can be
sped up if the search area is reduced. This can be done by assuming
that the motion vectors of the same macroblock in two consecutive images
are similar.
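A minimal full-search sketch (ours; it follows the common convention of searching, for each block of the predicted image, the most similar block in the reference image, and uses the MAE criterion of the previous section; the block size and search range are hypothetical defaults):

    import numpy as np

    def full_search(reference, predicted, y, x, block=16, search_range=8):
        # Return the motion vector (dy, dx) whose reference block best
        # matches the block of `predicted` with top-left corner at (y, x).
        target = predicted[y:y + block, x:x + block].astype(np.float64)
        best_mae, best_mv = np.inf, (0, 0)
        for dy in range(-search_range, search_range + 1):
            for dx in range(-search_range, search_range + 1):
                ry, rx = y + dy, x + dx
                if (ry < 0 or rx < 0 or
                        ry + block > reference.shape[0] or
                        rx + block > reference.shape[1]):
                    continue  # candidate falls outside the reference image
                candidate = reference[ry:ry + block, rx:rx + block].astype(np.float64)
                mae = np.mean(np.abs(target - candidate))
                if mae < best_mae:
                    best_mae, best_mv = mae, (dy, dx)
        return best_mv, best_mae

The logarithmic and telescopic strategies only change which (dy, dx) candidates are visited.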
7 The GOP (Group Of Pictures) concept
- The temporal redundancy is exploited within blocks of images called GOPs;
consequently, a GOP can be decoded independently of the rest of the GOPs.
Here is an example:
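A typical (hypothetical) GOP of 12 pictures, in display order, could be:

    I B B P B B P B B P B B

where only the I-picture is coded without references, each P-picture is predicted from the previous I- or P-picture, and each B-picture from the two closest I- or P-pictures.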
8 Lossy predictive video coding
Let $s_i$ be the $i$-th image of
the video sequence and $\tilde{s}_i$ an approximation of $s_i$ with a given quality
(most video compressors are lossy). In this context, a hybrid (t+2d) video codec
has the following structure:
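A rough sketch of that structure (ours; the spatial (2d) stage is reduced to a uniform quantizer and the motion compensation is omitted for brevity):

    import numpy as np

    def encode(video, q_step=8):
        # Toy t+2d hybrid encoder: temporal prediction + quantized residues.
        reconstructed_prev, code = None, []
        for image in video:
            image = image.astype(np.int32)
            prediction = (np.zeros_like(image) if reconstructed_prev is None
                          else reconstructed_prev)   # motion-compensated in a real codec
            residue = image - prediction
            quantized = np.round(residue / q_step).astype(np.int32)
            code.append(quantized)
            # The encoder replicates the decoder's reconstruction to avoid drift.
            reconstructed_prev = prediction + quantized * q_step
        return code

    def decode(code, q_step=8):
        reconstructed_prev, video = None, []
        for quantized in code:
            prediction = (np.zeros_like(quantized) if reconstructed_prev is None
                          else reconstructed_prev)
            image = prediction + quantized * q_step
            video.append(image)
            reconstructed_prev = image
        return video

The key point is that the prediction is built from the reconstructed (lossy) images, so the encoder and the decoder stay synchronized.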
9 MCTF (Motion Compensated Temporal Filtering)
- This is a DWT where the input samples are the original video images and
the output is a sequence of residue images.
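For instance (our notation, assuming the simplest Haar-like case), one MCTF decomposition level computes, for each pair of input images $s_{2i}$ and $s_{2i+1}$:

    $h_i = s_{2i+1} - \text{MC}(s_{2i})$        (high-pass subband: motion-compensated residues)
    $l_i = s_{2i} + \text{MC}(h_i)/2$           (low-pass subband: motion-compensated averages)

where $\text{MC}(\cdot)$ denotes motion compensation. The low-pass sequence $\{l_i\}$ is fed to the next temporal decomposition level.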
10 t+2d vs. 2d+t vs. 2d+t+2d
- t+2d: The sequence of images is first decorrelated along time (t) and
the residue images are then compressed, exploiting the remaining spatial (2d)
redundancy. Examples: the MPEG* and H.26* codecs (except H.264/SVC).
- 2d+t: The spatial (2d) redundancy is exploited first (typically using the
DWT) and next the coefficients are decorrelated along time (t). To
date this has only been an experimental setup, because most transformed
domains are not invariant to displacement.
- 2d+t+2d: The first step creates a Laplacian Pyramid (2d), which
is invariant to displacement. Next, each level of the pyramid
is decorrelated along time (t) and, finally, the remaining spatial
redundancy is removed (2d). Example: H.264/SVC.
11 Deblocking filtering
- Block-based video encoders (those that use block-based temporal
decorrelation) improve their performance if a deblocking filter is used when
creating the (quantized) prediction images.
- The low-pass filter is applied only on the block boundaries.
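A toy sketch of the idea (ours; real codecs also filter horizontal boundaries and adapt the filter strength to the local content):

    import numpy as np

    def deblock_vertical_edges(image, block=16):
        # Low-pass filter the two pixels that surround each vertical
        # block boundary of a decoded image.
        out = image.astype(np.float64)
        for x in range(block, image.shape[1], block):
            left = image[:, x - 1].astype(np.float64)
            right = image[:, x].astype(np.float64)
            out[:, x - 1] = (3 * left + right) / 4   # soften the discontinuity
            out[:, x] = (left + 3 * right) / 4
        return out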
12 Bit-rate allocation
- Under a constant quantization level (constant video quality), the number
of bits that each compressed image needs depends on the image content. For
example, images with complex textures or fast motion produce larger residues
than static ones.
- The encoder must decide how much information will be stored in each
residue image, taking into account that this image can serve as a reference
for other images.
13 Quality scalability
- Ideal for remote visualization environments.
- In reversible codecs, decoding all of the quality layers reconstructs the original video exactly ($\tilde{s} = s$).
14 Temporal scalability
$s^{[t]} = \{ s_{2^t i} : 0 \le i < \lceil N/2^t \rceil \} \qquad (1)$

where $N$ is the number
of pictures in $s$
and $t$
denotes the Temporal Resolution Level (TRL).
- Notice that $s^{[0]} = s$.
- Useful for fast random access.
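For example (a sketch assuming the dyadic sub-sampling of Eq. (1)), TRL $t$ can be extracted by keeping one out of every $2^t$ pictures:

    def temporal_resolution_level(video, t):
        # Pictures of the Temporal Resolution Level t: every 2**t-th picture.
        return video[::2 ** t]

    # len(temporal_resolution_level(video, t)) == ceil(len(video) / 2**t)
    # temporal_resolution_level(video, 0) == video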
15 Spatial scalability
- Useful in low-resolution devices.
- In reversible codecs, the full-resolution reconstruction is identical to the original,
and the $r$-th spatial resolution level has a
$(Y/2^r) \times (X/2^r)$ resolution, where
$Y \times X$ is the resolution of $s$.