Video Coding Fundamentals
Juan Francisco Rodríguez Herrera
Vicente González Ruiz
September 27, 2014
1 Memory requirements of PCM video
- In RGB (PCM) video, each color pixel needs at least 24 bpp (bits/pixel).
- The memory requirements of RGB video are enormous. For example, an hour of
640 × 480 × 25 Hz true-color PCM video needs:
640 × 480 pixels × 3 bytes/pixel × 25 images/s × 3600 s = 82,944,000,000 bytes (about 77.2 GiB).
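The arithmetic behind this kind of estimate can be sketched in a few lines of Python. The function name `pcm_video_bytes` and its parameters are illustrative assumptions, not part of the original notes.

```python
def pcm_video_bytes(width, height, fps, seconds, bits_per_pixel=24):
    """Return the number of bytes needed to store uncompressed (PCM) video."""
    bits = width * height * bits_per_pixel * fps * seconds
    return bits // 8

# One hour of 640 x 480 true-color video at 25 images per second.
one_hour = pcm_video_bytes(640, 480, 25, 3600)
print(one_hour)           # size in bytes
print(one_hour / 2**30)   # size in GiB
```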
2 Sources of redundancy
- Spatial redundancy: Neighboring pixels tend to be very similar, and
textures tend to repeat.
- Temporal redundancy: Temporally adjacent images are typically very
alike.
- Visual redundancy: Humans hardly perceive high spatial and temporal
frequencies (we are more sensitive to low frequencies).
3 Block-based ME (Motion Estimation)
- Usually, only performed by the encoder.
- ME removes temporal redundancy. A predicted image can be encoded as
the difference between it and another image, called the prediction image, which
is a motion-compensated projection of one or more images called reference
images. ME tries to generate residue images that are as close as possible to
null (all-zero) images.
- The reference image(s) is (are) divided into blocks of 16 × 16 pixels called
macroblocks.
- Each reference block is searched for in the predicted image, and the best match
is indicated by means of a motion vector.
- Depending on whether the search is successful, and on the number of reference
images, the macroblocks are classified into:
- I: When compressing the residue block generates more bits than
compressing the original (predicted) one.
- P: When it is better to compress the residue block and there is only
one reference macroblock.
- B: The same, but with two reference macroblocks.
- S (skipped): When the energy of the residue block is smaller than
a given threshold.
- I-pictures are composed of I macroblocks only.
- P-pictures do not have B macroblocks.
- B-pictures can have any type of macroblocks.
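The macroblock classification described in this section can be sketched as a small decision function. This is only an illustration: the function name, its parameters, and the order in which the conditions are tested are assumptions, and a real encoder would derive the bit counts and the residue energy from its own entropy coder and transform.

```python
def classify_macroblock(bits_original, bits_residue, residue_energy,
                        num_references, skip_threshold):
    """Return 'I', 'P', 'B' or 'S' for one macroblock (hypothetical helper)."""
    if num_references == 0:
        return "I"                      # nothing to predict from
    if residue_energy < skip_threshold:
        return "S"                      # residue is negligible: skip it
    if bits_residue > bits_original:
        return "I"                      # prediction does not pay off
    return "P" if num_references == 1 else "B"

print(classify_macroblock(1000, 400, 500.0, 1, 10.0))  # one reference, cheap residue
print(classify_macroblock(1000, 400, 500.0, 2, 10.0))  # two references, cheap residue
```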

4 Sub-pixel accuracy
- Motion estimation can be carried out with integer-pixel accuracy or
with fractional-pixel accuracy.
- For example, in MPEG-1, motion estimation can have 1/2-pixel
accuracy.
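To search at half-pixel positions, the image must first be sampled between the original pixels. A minimal sketch follows, assuming simple bilinear interpolation (the average of the neighboring samples); the function names are illustrative and the exact rounding rules of any particular standard are not reproduced here.

```python
def half_pel_row(row):
    """Insert a bilinearly interpolated sample between each pair of neighbors."""
    out = []
    for left, right in zip(row, row[1:]):
        out.append(left)
        out.append((left + right) / 2)   # sample at the half-pixel position
    out.append(row[-1])
    return out

def half_pel_image(image):
    """Double the sampling density of a 2-D image in both directions."""
    rows = [half_pel_row(r) for r in image]               # horizontal pass
    cols = [half_pel_row(list(c)) for c in zip(*rows)]    # vertical pass on columns
    return [list(r) for r in zip(*cols)]                  # transpose back to rows

print(half_pel_row([10, 20, 40]))
print(half_pel_image([[0, 2], [4, 6]]))
```

An integer-pixel search on the interpolated image then corresponds to a half-pixel search on the original one.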
5 Matching criteria (similitude between macroblocks)
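- Let a and b be the macroblocks that we want to compare:
- Mean Square Error: $\frac{1}{16 \times 16} \sum_{i=1}^{16} \sum_{j=1}^{16} (a_{ij} - b_{ij})^2$
- Mean Absolute Error: $\frac{1}{16 \times 16} \sum_{i=1}^{16} \sum_{j=1}^{16} |a_{ij} - b_{ij}|$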
- These similarity measures are used only by the compressor. Therefore, any
other measure that has a similar effect (such as the error variance or the error
entropy) could also be used.
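Both matching criteria can be sketched in plain Python for blocks stored as lists of rows. The function names `mse` and `mae` are illustrative, and the blocks may have any size, not only 16 × 16.

```python
def mse(a, b):
    """Mean square error between two equally sized blocks."""
    n = len(a) * len(a[0])
    return sum((x - y) ** 2
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n

def mae(a, b):
    """Mean absolute error between two equally sized blocks."""
    n = len(a) * len(a[0])
    return sum(abs(x - y)
               for row_a, row_b in zip(a, b)
               for x, y in zip(row_a, row_b)) / n

a = [[1, 2], [3, 4]]
b = [[1, 4], [0, 4]]
print(mse(a, b), mae(a, b))
```

The MAE avoids multiplications, which is one reason it is often preferred when the search must be fast.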
6 Searching strategies
- Only performed by the compressor.
- Full search: All the possibilities are checked. Advantage: the best
compression. Disadvantage: CPU intensive.
- Logarithmic search: A version of the full-search algorithm in which the
macroblocks and the search area are subsampled. After finding the
best match, the resolution of the macroblock is increased
by a factor of 2 and the previous match is refined in a search
area of ±1, until the maximal resolution (1/1 or 1/2 pixel) is
reached.
- Telescopic search: Any of the previously described techniques can be
sped up if the search area is reduced. This can be done by assuming
that the motion vectors of the same macroblock in two consecutive images
are similar.
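A full search can be sketched as two nested loops over the candidate displacements. The code below is a minimal illustration that uses the MAE as the matching criterion; the names `full_search`, `get_block`, `search_range` and `center` are assumptions. Passing the motion vector found in the previous image as `center`, together with a small `search_range`, gives a rough idea of how a telescopic search narrows the search area.

```python
def mae(a, b):
    """Mean absolute error between two equally sized blocks."""
    n = len(a) * len(a[0])
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def get_block(image, y, x, size):
    """Extract the size x size block whose top-left corner is (y, x)."""
    return [row[x:x + size] for row in image[y:y + size]]

def full_search(reference, predicted, y, x, size=16, search_range=7,
                center=(0, 0)):
    """Search the reference block at (y, x) in the predicted image.

    Returns ((dy, dx), cost) for the displacement with the smallest MAE.
    """
    height, width = len(predicted), len(predicted[0])
    block = get_block(reference, y, x, size)
    best_vector, best_cost = None, float("inf")
    for dy in range(center[0] - search_range, center[0] + search_range + 1):
        for dx in range(center[1] - search_range, center[1] + search_range + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + size > height or xx + size > width:
                continue                 # candidate falls outside the image
            cost = mae(block, get_block(predicted, yy, xx, size))
            if cost < best_cost:
                best_vector, best_cost = (dy, dx), cost
    return best_vector, best_cost

# Toy example: the predicted image is the reference moved 1 row down, 2 columns right.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
predicted = [[reference[r - 1][c - 2] if r >= 1 and c >= 2 else 0
              for c in range(8)] for r in range(8)]
print(full_search(reference, predicted, 2, 2, size=4, search_range=2))
```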
7 The GOP (Group Of Pictures) concept
- Temporal redundancy is exploited within blocks of images called GOPs.
This means that a GOP can be decoded independently of the other GOPs.
8 Lossy predictive video coding
Let $V_i$ be the i-th image of the video sequence and $V_i^{[q]}$ an approximation of $V_i$ with
quality $q$ (most video compressors are lossy). In this context, a hybrid video codec
(t+2d) has the following structure:
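The key property of a lossy predictive codec is that the encoder predicts from the same reconstructed (lossy) images that the decoder will have, so that quantization errors do not accumulate. The sketch below illustrates this closed loop under strong simplifying assumptions: images are flat lists of samples, the prediction is just the previous reconstructed image (no motion compensation), and the spatial (2d) stage is reduced to a uniform quantizer with a hypothetical `step` parameter.

```python
def quantize(residue, step):
    """Map each residue sample to a quantization index."""
    return [round(r / step) for r in residue]

def dequantize(indexes, step):
    """Map quantization indexes back to approximate residue samples."""
    return [k * step for k in indexes]

def encode(images, step):
    reference = [0] * len(images[0])        # reconstructed previous image
    code = []
    for image in images:
        residue = [v - p for v, p in zip(image, reference)]
        indexes = quantize(residue, step)
        code.append(indexes)
        # Rebuild exactly what the decoder will see and use it as the reference.
        reference = [p + r for p, r in zip(reference, dequantize(indexes, step))]
    return code

def decode(code, step):
    reference = [0] * len(code[0])
    images = []
    for indexes in code:
        reference = [p + r for p, r in zip(reference, dequantize(indexes, step))]
        images.append(reference)
    return images

video = [[10, 20, 30, 40], [12, 21, 29, 43], [15, 25, 28, 47]]
print(decode(encode(video, 4), 4))
```

Because the reference is rebuilt inside the encoder, the reconstruction error of every image stays within half a quantization step instead of drifting over time.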
9 MCTF (Motion Compensated Temporal Filtering)
- This is a DWT (Discrete Wavelet Transform) applied along time, where the
input samples are the original video images and the output is a sequence
of residue images.
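As an illustration of the idea, one level of a temporal Haar transform can be written with two lifting steps. This is a sketch under simplifying assumptions: motion compensation is omitted (zero motion), images are flat lists of integer samples, the number of images is even, and the function names are hypothetical.

```python
def temporal_haar_forward(images):
    """One level of integer Haar lifting along time (no motion compensation)."""
    low, high = [], []
    for even, odd in zip(images[0::2], images[1::2]):
        residue = [o - e for o, e in zip(odd, even)]            # predict step
        average = [e + (r >> 1) for e, r in zip(even, residue)]  # update step
        high.append(residue)
        low.append(average)
    return low, high

def temporal_haar_inverse(low, high):
    """Undo temporal_haar_forward exactly."""
    images = []
    for average, residue in zip(low, high):
        even = [a - (r >> 1) for a, r in zip(average, residue)]
        odd = [r + e for r, e in zip(residue, even)]
        images += [even, odd]
    return images

video = [[10, 20], [12, 19], [30, 40], [30, 41]]
low, high = temporal_haar_forward(video)
print(low, high)
print(temporal_haar_inverse(low, high) == video)
```

Applying the forward step again to `low` yields the next temporal level. In a real MCTF, the predict and update steps would operate along motion trajectories.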
10 t+2d vs. 2d+t vs. 2d+t+2d
- t+2d: The sequence of images is first decorrelated along time (t), and
the residue images are then compressed, exploiting the remaining spatial (2d)
redundancy. Examples: MPEG* and H.26* codecs (except H.264/SVC).
- 2d+t: The spatial (2d) redundancy is exploited first (typically using the
DWT), and then the coefficients are decorrelated along time (t). To
date, this has only been used in experimental setups because most transform
domains are not invariant to displacement.
- 2d+t+2d: The first step creates a Laplacian pyramid (2d), which
is invariant to displacement. Next, each level of the pyramid
is decorrelated along time (t) and, finally, the remaining spatial
redundancy is removed (2d). Example: H.264/SVC.
11 Deblocking filtering
- Block-based video encoders (those that use block-based spatial
decorrelation) improve their performance if a deblocking filter is used to
create the quantized predictions.
- The low-pass filter is applied only at the block boundaries.
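The idea of filtering only at block boundaries can be sketched on a single row of samples. The code below is an illustrative assumption, not the filter of any particular standard: it smooths the two samples adjacent to each boundary with a (1, 2, 1)/4 low-pass kernel and leaves all other samples untouched. It assumes `block_size` is at least 2.

```python
def deblock_row(row, block_size):
    """Low-pass filter the two samples on each side of every block boundary."""
    out = list(row)
    for edge in range(block_size, len(row), block_size):
        if edge + 1 >= len(row):
            break                                   # not enough samples to filter
        p1, p0 = row[edge - 2], row[edge - 1]       # last samples of the left block
        q0, q1 = row[edge], row[edge + 1]           # first samples of the right block
        out[edge - 1] = (p1 + 2 * p0 + q0) / 4
        out[edge] = (p0 + 2 * q0 + q1) / 4
    return out

# Two flat blocks of 4 samples with a visible step between them.
print(deblock_row([10, 10, 10, 10, 20, 20, 20, 20], 4))
```

Practical deblocking filters are adaptive: they adjust their strength to the quantization level and avoid smoothing real edges.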
12 The bit-rate allocation problem
- Under a constant quantization level (constant video quality), the number
of bits that each compressed image needs depends on the image content.
- The encoder must decide how much information will be stored in each
residue image, taking into account that this image can serve as a reference
for other images.
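The dependence of the bit count on the image content can be illustrated with a rough estimate. The sketch below quantizes the samples with a fixed step and uses the zeroth-order entropy of the quantization indexes as a stand-in for the output of a real entropy coder; this estimator and the name `estimated_bits` are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def estimated_bits(samples, step):
    """Estimate the bits needed to code the samples quantized with the given step."""
    indexes = [round(s / step) for s in samples]
    counts = Counter(indexes)
    n = len(indexes)
    entropy = sum((c / n) * log2(n / c) for c in counts.values())  # bits/sample
    return entropy * n

flat = [128] * 16              # no detail at all
textured = [0, 8, 16, 24] * 4  # four equally likely levels
print(estimated_bits(flat, 8), estimated_bits(textured, 8))
```

With the same quantization step, the flat image costs essentially nothing while the textured one needs 2 bits per sample, so a constant-quality encoder produces a variable bit rate.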
13 Quality scalability
- Ideal for remote visualization environments.
- In reversible codecs, $V_i^{[0]} = V_i$.
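One simple way to picture quality levels is bit-plane truncation. In the sketch below, the quality level q is assumed to be the number of least significant bit planes that have not been received yet, so q = 0 returns the original samples, as in a reversible codec. This mapping and the name `approximate` are illustrative assumptions; real scalable codecs obtain their quality layers from an embedded code-stream.

```python
def approximate(samples, q):
    """Return the samples with their q least significant bit planes set to zero."""
    return [(s >> q) << q for s in samples]

image = [13, 255, 7]
for q in (2, 1, 0):            # progressive refinement as more data arrives
    print(q, approximate(image, q))
```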
14 Temporal scalability
$V^t = \{V_{2^t \times i};\ 0 \le i < \#V / 2^t\}$ (1)
where #V is the number of pictures in V and t denotes the Temporal Resolution Level
(TRL).
- Notice that $V = V^0$.
- Useful for fast random access.
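A minimal sketch of temporal scalability follows, assuming a dyadic scheme in which each additional TRL halves the picture rate, so that level t keeps every 2^t-th picture. The function name `temporal_level` is hypothetical.

```python
def temporal_level(pictures, t):
    """Return the pictures that belong to Temporal Resolution Level t."""
    return pictures[::2 ** t]

video = list(range(8))          # stand-ins for pictures V_0 ... V_7
for t in range(4):
    print(t, temporal_level(video, t))
```

Under this assumption, a decoder that only needs a quick preview can decode a high TRL and skip most of the pictures.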
15 Spatial scalability
- Useful in low-resolution devices.
- In irreversible codecs, $V_i = V_i^{<0>}$ and $V_i^{<s>}$ has a resolution of
$\frac{X}{2^s} \times \frac{Y}{2^s}$, where $X \times Y$ is the resolution of $V_i$.
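A minimal sketch of spatial scalability follows, assuming a dyadic scheme in which each spatial level halves the width and the height. Here the lower resolutions are produced by averaging 2 × 2 blocks, which is an illustrative choice (a wavelet-based codec would use the low-pass subband instead); the function names are hypothetical, and the image dimensions are assumed to be divisible by 2^s.

```python
def halve(image):
    """Average each 2 x 2 block of samples (width and height must be even)."""
    return [[(image[y][x] + image[y][x + 1] +
              image[y + 1][x] + image[y + 1][x + 1]) / 4
             for x in range(0, len(image[0]), 2)]
            for y in range(0, len(image), 2)]

def spatial_level(image, s):
    """Return the image at spatial resolution level s (s = 0 is full resolution)."""
    for _ in range(s):
        image = halve(image)
    return image

image = [[y * 4 + x for x in range(4)] for y in range(4)]   # 4 x 4 test image
for s in range(3):
    level = spatial_level(image, s)
    print(s, len(level[0]), "x", len(level))
```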