MCJ2K (Motion Compensated JPEG 2000)

Juan Francisco Rodríguez Herrera
Vicente González Ruiz

June 26, 2014

1 Why MCJ2K?
2 MCJ2K overview
3 MCTF in MCJ2K
4 MCTF implementation
5 The 1-level DWT temporal transform
6 Over-pixel Motion estimation
7 Sub-pixel Motion estimation
8 Prediction step
9 Update step
10 Motion coding
11 Texture coding
12 Temporal scalability
13 Quality scalability
14 Spatial scalability

1 Why MCJ2K?

MCTF (Motion Compensated Temporal Filtering) [3] efficiently removes the temporal redundancy of image sequences [2], enables temporal scalability, and controls the impact of errors.
JPEG 2000 is an effective lossy and lossless image compressor that produces code-streams scalable in spatial resolution and in quality.
The two can be combined to design a video compressor that is scalable in space, time and quality, with reasonable performance.

2 MCJ2K overview

$T = \log_2(\text{GOP\_size}) + 1.$ (1)
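
As a quick numerical check of Eq. (1), the sketch below evaluates T for a few GOP sizes. It assumes that T counts the temporal resolution levels and that the GOP size is a power of two (which the dyadic transform suggests but the text does not state); the function name `temporal_levels` is hypothetical.

```python
import math

def temporal_levels(gop_size):
    """Evaluate Eq. (1): T = log2(GOP_size) + 1.

    Assumption: gop_size is a power of two, so the logarithm is an integer.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    return int(math.log2(gop_size)) + 1

# Print T for a few GOP sizes; for example, temporal_levels(16) evaluates to 5.
for gop_size in (1, 2, 4, 8, 16, 32):
    print(gop_size, temporal_levels(gop_size))
```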

3 MCTF in MCJ2K

In MCJ2K the GOPs are always open and symmetrical in each TRL (Temporal Resolution Level).

The image L_i^t is predicted by means of the reference images L_i^t−1 and L_i+1^t−1, where
$L^{t-1} = L^t \oplus \{H^t, M^t\}$ (2)

and where H^t is the residue subband and M^t the motion information.

In MCJ2K, if not enough temporal correlation is found, the B pictures are replaced by I pictures (and the corresponding motion vector fields are erased).

4 MCTF implementation

The MCTF stage can be designed using a typical dyadic Discrete Wavelet Transform (DWT) [1], where the samples are pictures.

5 The 1-level DWT temporal transform

It is implemented using lifting [4, 5].
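
To make the lifting structure concrete, here is a minimal sketch of one temporal level with a prediction step followed by an update step. Several things are assumptions and not taken from MCJ2K itself: the 5/3 lifting weights (1/2 and 1/4), the absence of motion compensation, the border handling (the last odd frame reuses its only even neighbour), and the hypothetical names `forward_lifting` and `inverse_lifting`. Frames are flat lists of numbers.

```python
def forward_lifting(frames):
    """One temporal level: split into even/odd frames, predict, then update.

    frames: an even number of frames, each a flat list of pixel values.
    Returns (low, high): the low-pass frames and the residue frames.
    """
    even, odd = frames[0::2], frames[1::2]
    n = len(odd)
    # Predict: residue = odd frame minus the average of its two even neighbours.
    high = []
    for i in range(n):
        left, right = even[i], even[min(i + 1, n - 1)]
        high.append([o - (a + b) / 2 for o, a, b in zip(odd[i], left, right)])
    # Update: add a quarter of the neighbouring residues to each even frame.
    low = []
    for i in range(n):
        prev, cur = high[max(i - 1, 0)], high[i]
        low.append([e + (p + c) / 4 for e, p, c in zip(even[i], prev, cur)])
    return low, high

def inverse_lifting(low, high):
    """Undo forward_lifting by reversing the two steps in the opposite order."""
    n = len(high)
    even = []
    for i in range(n):
        prev, cur = high[max(i - 1, 0)], high[i]
        even.append([l - (p + c) / 4 for l, p, c in zip(low[i], prev, cur)])
    frames = []
    for i in range(n):
        left, right = even[i], even[min(i + 1, n - 1)]
        odd = [h + (a + b) / 2 for h, a, b in zip(high[i], left, right)]
        frames += [even[i], odd]
    return frames

# Four tiny two-pixel "frames": the inverse recovers them from (low, high).
frames = [[1, 2], [3, 4], [5, 7], [9, 8]]
low, high = forward_lifting(frames)
restored = inverse_lifting(low, high)
```

Because each step only adds or subtracts a function of the other half of the samples, the inverse is obtained by replaying the steps backwards with the signs flipped, whatever predictor is used. This is what allows a motion-compensated predictor to be plugged into the prediction step without losing invertibility.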

6 Over-pixel Motion estimation

MCJ2K uses a hierarchical motion estimation algorithm that, although sub-optimal, is fast enough for real-time coding:

Compute an l-level DWT, with l = ⌊log₂(search_range)⌋, of the predicted frame P = L_i^t and of the two reference frames R₀ = L_2i^t−1 and R₁ = L_2i+1^t−1.
^l(M_i^t) ← 0 /* Or with other more suitable values */.
While l > 0:
1. Divide the subband ^l(P) into blocks of size b and (±1)-search them in the subbands ^l(R₀) and ^l(R₁), obtaining a low-resolution bi-directional motion vector field ^l(M_i^t).
2. l ← l − 1.
3. Synthesize ^l(M_i^t), ^l(P), ^l(R₀) and ^l(R₁) computing the inverse DWT one step (the HH-subbands are 0).
4. ^l(M_i^t) ←^l(M_i^t) ⋅ 2.
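
The coarse-to-fine loop above can be sketched as follows. This is a simplified illustration under stated assumptions, not the MCJ2K implementation: 2x2 averaging stands in for the low-pass subband of the DWT, only one reference frame is searched instead of two, the vector field is synthesized by replicating each vector into a 2x2 group, and the frame dimensions are assumed to be divisible by b times 2^l. All function names are hypothetical.

```python
import math
import random

def downsample(img):
    """Halve the resolution by averaging 2x2 pixels (stand-in for the DWT low-pass subband)."""
    return [[(img[2*y][2*x] + img[2*y][2*x+1] + img[2*y+1][2*x] + img[2*y+1][2*x+1]) / 4.0
             for x in range(len(img[0]) // 2)] for y in range(len(img) // 2)]

def sad(pred, ref, by, bx, b, vy, vx):
    """Sum of absolute differences between a block of pred and its displaced match in ref."""
    h, w = len(ref), len(ref[0])
    total = 0.0
    for y in range(by, by + b):
        for x in range(bx, bx + b):
            ry = min(max(y + vy, 0), h - 1)  # clamp at the frame borders
            rx = min(max(x + vx, 0), w - 1)
            total += abs(pred[y][x] - ref[ry][rx])
    return total

def search(pred, ref, field, b):
    """(+-1)-search: refine every vector of the field in place."""
    for j, row in enumerate(field):
        for i, (vy, vx) in enumerate(row):
            cands = [(vy + dy, vx + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row[i] = min(cands, key=lambda v: sad(pred, ref, j * b, i * b, b, v[0], v[1]))

def upsample_field(field):
    """Replicate each vector into a 2x2 group and double it (steps 3 and 4)."""
    out = []
    for row in field:
        wide = [(2 * vy, 2 * vx) for (vy, vx) in row for _ in range(2)]
        out += [list(wide), list(wide)]
    return out

def over_pixel_me(pred, ref, search_range, b):
    levels = int(math.log2(search_range))
    pyr_p, pyr_r = [pred], [ref]
    for _ in range(levels):
        pyr_p.append(downsample(pyr_p[-1]))
        pyr_r.append(downsample(pyr_r[-1]))
    rows, cols = len(pyr_p[levels]) // b, len(pyr_p[levels][0]) // b
    field = [[(0, 0)] * cols for _ in range(rows)]  # initial vectors are zero
    l = levels
    while l > 0:
        search(pyr_p[l], pyr_r[l], field, b)  # step 1
        l -= 1                                # step 2
        field = upsample_field(field)         # steps 3 and 4
    return field

# Usage: a random 32x32 texture and a copy displaced 4 pixels horizontally.
rng = random.Random(0)
ref = [[rng.randint(0, 255) for _ in range(32)] for _ in range(32)]
pred = [[ref[y][(x + 4) % 32] for x in range(32)] for y in range(32)]
field = over_pixel_me(pred, ref, search_range=4, b=4)
```

In this usage example the displacement is 1 pixel at the coarsest of the two levels, so a (±1)-search is enough there, and the vector grows to (0, 4) for interior blocks as the field is doubled on the way back to full resolution.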

7 Sub-pixel Motion estimation

Let s be the sub-pixel accuracy. After the over-pixel ME stage, the refinement of M_i^t continues with the following algorithm:

l ← 1.
while l ≤ s:
1. Synthesize ^−l(P), ^−l(R₀) and ^−l(R₁) computing the inverse DWT one step (the HH-subbands are 0).
2. M_i^t ← M_i^t ⋅ 2 /* Multiply by 2 the vectors */.
3. b ← b ⋅ 2 /* Multiply by 2 the block size */.
4. Divide the subband ^−l(P) into blocks of size b and (±1)-search them in the subbands ^−l(R₀) and ^−l(R₁), obtaining a bi-directional motion vector field M_i^t with sub-pixel accuracy.
5. l ← l + 1.
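
The refinement loop above can be sketched in the same spirit. Again this is an illustration under assumptions: bilinear interpolation stands in for the one-step inverse DWT with zeroed high-frequency subbands, a single reference frame is used, and the function names are hypothetical. After s iterations the vectors are expressed in units of 1/2^s pixel.

```python
def upsample2(img):
    """Double the resolution with bilinear interpolation (borders are clamped)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for Y in range(2 * h):
        for X in range(2 * w):
            y0, x0 = Y // 2, X // 2
            y1 = min(y0 + Y % 2, h - 1)
            x1 = min(x0 + X % 2, w - 1)
            out[Y][X] = (img[y0][x0] + img[y0][x1] + img[y1][x0] + img[y1][x1]) / 4.0
    return out

def sad(pred, ref, by, bx, b, vy, vx):
    """Sum of absolute differences between a block of pred and its displaced match in ref."""
    h, w = len(ref), len(ref[0])
    total = 0.0
    for y in range(by, by + b):
        for x in range(bx, bx + b):
            ry = min(max(y + vy, 0), h - 1)  # clamp at the frame borders
            rx = min(max(x + vx, 0), w - 1)
            total += abs(pred[y][x] - ref[ry][rx])
    return total

def search(pred, ref, field, b):
    """(+-1)-search: refine every vector of the field in place."""
    for j, row in enumerate(field):
        for i, (vy, vx) in enumerate(row):
            cands = [(vy + dy, vx + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row[i] = min(cands, key=lambda v: sad(pred, ref, j * b, i * b, b, v[0], v[1]))

def sub_pixel_me(pred, ref, field, b, s):
    l = 1
    while l <= s:
        pred, ref = upsample2(pred), upsample2(ref)                         # step 1
        field = [[(2 * vy, 2 * vx) for (vy, vx) in row] for row in field]  # step 2
        b *= 2                                                              # step 3
        search(pred, ref, field, b)                                         # step 4
        l += 1                                                              # step 5
    return field

# Usage: a linear ramp displaced 1.5 pixels horizontally, starting from
# integer vectors (0, 1) such as an over-pixel stage might deliver.
ref = [[10 * x + 3 * y for x in range(16)] for y in range(16)]
pred = [[10 * x + 15 + 3 * y for x in range(16)] for y in range(16)]
start = [[(0, 1)] * 4 for _ in range(4)]
refined = sub_pixel_me(pred, ref, start, b=4, s=1)
```

With s = 1, interior blocks end at (0, 3) in half-pixel units, that is, a displacement of 1.5 pixels. The number of vectors does not change during this stage because the block size doubles together with the image.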

8 Prediction step

The prediction step minimizes the entropy of the subbands H.

9 Update step

The update step minimizes the aliasing of the subband L.

10 Motion coding

Each B-frame generates a bi-directional motion vector field M_i^t.
To compress (losslessly) each M_i^t, two stages are performed:
1. Redundancy removal. Two sources of redundancy can be found:
  1. The backward motion vectors are, in absolute value, similar to the forward motion vectors:
    $\overleftarrow{M}_i^t \approx -\overrightarrow{M}_i^t.$ (3)
  2. The motion vectors between temporal levels are linearly correlated:
    $M_i^t \approx 2M_{2i}^{t-1}.$ (4)
2. Entropy coding. The residues are compressed with JPEG 2000, as images of 4 components (2 2D-vectors).
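
The redundancy-removal stage can be illustrated with a small sketch that turns the two approximations (3) and (4) into predictions and keeps only the prediction residues. How MCJ2K actually combines the two predictions is not specified here, so the ordering below (forward vectors predicted from the previous temporal level, backward vectors predicted from the forward ones) is an assumption, as are the function names and the representation of a field as a flat list of integer `(dy, dx)` pairs with the same block layout at both levels.

```python
def remove_redundancy(forward, backward, parent_forward):
    """Replace a bi-directional field by (hopefully small) residues.

    forward, backward: vectors of M_i^t, as lists of (dy, dx) integers.
    parent_forward: forward vectors of M_{2i}^{t-1}, same block layout (assumed).
    """
    # Eq. (4): the forward vectors are close to twice the vectors of level t-1.
    fwd_res = [(fy - 2 * py, fx - 2 * px)
               for (fy, fx), (py, px) in zip(forward, parent_forward)]
    # Eq. (3): the backward vectors are close to the negated forward vectors.
    bwd_res = [(by + fy, bx + fx)
               for (by, bx), (fy, fx) in zip(backward, forward)]
    return fwd_res, bwd_res

def restore(fwd_res, bwd_res, parent_forward):
    """Invert remove_redundancy exactly (integer arithmetic, hence lossless)."""
    forward = [(ry + 2 * py, rx + 2 * px)
               for (ry, rx), (py, px) in zip(fwd_res, parent_forward)]
    backward = [(ry - fy, rx - fx)
                for (ry, rx), (fy, fx) in zip(bwd_res, forward)]
    return forward, backward

# Usage: two blocks whose vectors almost follow both approximations.
parent_forward = [(1, -2), (0, 3)]
forward = [(2, -4), (1, 6)]
backward = [(-2, 4), (-1, -5)]
fwd_res, bwd_res = remove_redundancy(forward, backward, parent_forward)
# fwd_res == [(0, 0), (1, 0)] and bwd_res == [(0, 0), (0, 1)]
```

When the approximations hold well, the residues concentrate around zero, which is what makes the subsequent entropy coding stage effective.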

11 Texture coding

The image residues, which form the high-pass subbands $\{H^t; 0 < t < \log_2(\text{GOP\_size})\}$, are temporally decorrelated, i.e., they can be efficiently compressed with MJ2K.
The images in the L^T low-pass subband are far apart in time and are therefore minimally correlated. For this reason, L^T can also be efficiently compressed with MJ2K.

12 Temporal scalability

The temporal subbands must be decoded in order.

13 Quality scalability

The temporal subbands must be decoded by quality (using LRCP).

14 Spatial scalability

Using the RLCP progression in each frame (LRCP can also be used, but with a small loss of performance), motion compensation is performed without sub-sampling.

MCJ2K in the lab

svn checkout http://svn.hpca.ual.es/svn/QSVC/Kakadu
cd Kakadu
source ./compile Linux-x86-64 # Other alternatives: "./compile"
cd ..

svn checkout http://svn.hpca.ual.es/svn/QSVC/MCTF/trunk MCTF
cd MCTF
source ./compile
cd ..

svn checkout http://svn.hpca.ual.es/svn/vruiz/progs/snr SNR
cd SNR
source ./compile
cd ..

svn checkout http://svn.hpca.ual.es/svn/QSVC/performance_tests
cd performance_tests

./mobile_352x288x30_vs_1QL

./mobile_352x288x30_vs_QLs

References

[1] S. Mallat. A Theory for Multiresolution Signal Decomposition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 11:674–693, 1989.

[2] H. Schwarz, D. Marpe, and T. Wiegand. Overview of the Scalable Video Coding Extension of the H.264/AVC Standard. IEEE Transactions on Circuits and Systems for Video Technology, 17(9):1103–1120, September 2007.

[3] A. Secker and D. Taubman. Lifting-based Invertible Motion Adaptive Transform (LIMAT) Framework for Highly Scalable Video Compression. IEEE Transactions on Image Processing, 12(12):1530–1542, December 2003.

[4] W. Sweldens. The Lifting Scheme: A new Philosophy in Biorthogonal Wavelet Constructions. In Proc. SPIE, volume 2569, pages 68–79, September 1995.

[5] W. Sweldens and P. Schröder. Building Your Own Wavelets at Home.