Vorbis

Vicente González Ruiz

September 27, 2014

1 Introduction
1.1 What is Vorbis?
1.2 What is Ogg Vorbis?
1.3 Why was Vorbis born?
1.4 How is Vorbis used?
1.5 Who uses Vorbis?
1.6 Licensing
2 How does Vorbis work?
2.1 The Vorbis encoder
2.2 Overlapped processing
2.3 Windowing
2.4 MDCT (Modified Discrete Cosine Transform)
2.5 SAM (pSycho Acoustic Model) [3]
  2.5.1 ATH (Absolute Threshold of Hearing) model [4]
  2.5.2 Frequency resolution and simultaneous masking
  2.5.3 Temporal masking
2.6 Quantization
2.7 Floor and residue encoding
2.8 Vorbis’s VQ (Vector Quantization)
2.9 Huffman coding
2.10 Packet “peeling”
2.11 Channel coupling
3 Ogg
3.1 What is Ogg?
3.2 The Ogg format

Part 1
Introduction

1.1 What is Vorbis?

Vorbis [1, 2] is a lossy, perceptual (psychoacoustic) digital audio compressor.
The encoder inputs PCM (digital) audio and outputs Ogg data.

PCM   +---------+  Ogg   +---------+  PCM
----->| Encoder |------->| Decoder |----->
audio +---------+ stream +---------+ audio’

audio != audio’

1.2 What is Ogg Vorbis?

Ogg Vorbis is a:
- fully open, non-proprietary, patent-and-royalty-free,
- general-purpose compressed audio format
- for mid to high quality (8 kHz to 48 kHz, 16 bit, polyphonic) audio (e.g., human voice) and music
- at fixed and variable bitrates from 30 to 500 kbps/channel.

1.3 Why was Vorbis born?

In September 1998, the Fraunhofer Society sent a letter of infringement to several small commercial and open-source MPEG audio layer 3 development projects, announcing plans to charge licensing fees for the MP3 audio format.
At that time, and for that reason (among others), Xiphophorus, a company founded by Chris Montgomery, started to develop the open-source Vorbis and Ogg projects.

1.4 How is Vorbis used?

Unfortunately, the technical information about how Vorbis works is quite obscure. However, there is a set of libraries and tools (libogg, libvorbis and vorbis-tools) that do the hard work for programmers. The Xiph.Org Foundation maintains their source code and binaries.
Several independent open-source encoders and players are also available.

1.5 Who uses Vorbis?


Spotify, Audacity, Winamp, GStreamer, VLC media player, Firefox, Chrome, Android, HTML5.

1.6 Licensing

Vorbis is distributed under a BSD (Berkeley Software Distribution) license, which basically means that:
1. you can use the source code to develop new open-source BSD applications, and
2. you can use the source code to develop proprietary applications.

Part 2
How does Vorbis work?

2.1 The Vorbis encoder

          PCM   +---------+  Ogg
          ----->| Encoder |------->
          audio +---------+ stream
               /           \
            /                 \
         /                       \
     /                               \
/                                       \
+--------+    +-----+    +---+    +-------+
| W+MDCT |--->| SAM |--->| Q |--->| VQ+HE |
+--------+    +-----+    +---+    +-------+

W+MDCT = Windowed Modified Discrete Cosine Transform
SAM = pSycho Acoustic Model
Q = Quantization
VQ+HE = Vector Quantization + Huffman Encoding

2.2 Overlapped processing

0               N-1             2N-1            3N-1
+---------------+---------------+---------------+ s[n]
<--------Transform Step--------->
                <---------Transform Step-------->

Each transform step inputs 2N samples and outputs N MDCT coefficients.
The window size (N) must be a power of two between 64 and 8192 samples.
N can vary depending on the characteristics of the sound. For complex sounds (without clear harmonics, such as a plosive sound), shorter windows improve the performance. For simple sounds (such as a musical instrument), longer windows are better.
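The overlapped processing described above can be sketched in a few lines of Python. This is only an illustration: the function name `overlapped_blocks` and the toy signal are assumptions, not part of any Vorbis library.

```python
def overlapped_blocks(s, N):
    """Yield the 2N-sample input of each transform step.

    Consecutive steps advance N samples, so they overlap by N samples.
    """
    for start in range(0, len(s) - 2 * N + 1, N):
        yield s[start:start + 2 * N]


s = list(range(8))                        # toy signal s[0], ..., s[7]
blocks = list(overlapped_blocks(s, N=2))  # each block holds 2N = 4 samples
for block in blocks:
    print(block)
```

Since every step consumes 2N samples but advances only N, each sample (except those at the borders) takes part in two transform steps, while the number of coefficients produced still matches the number of samples.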

2.3 Windowing

Samples are windowed before the transform because this makes the spectral energy estimation more accurate.
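As an illustration, the sketch below builds the window shape defined in the Vorbis I specification for a block of 2N samples and checks the property that lets overlapped blocks be added back together: the squared window values of the two overlapping halves sum to one. The function name `vorbis_window` is an assumption made for this example.

```python
import math


def vorbis_window(N):
    """Window for a block of 2N samples: sin(pi/2 * sin^2(pi*(n + 1/2)/(2N)))."""
    L = 2 * N
    return [math.sin(math.pi / 2 * math.sin(math.pi * (n + 0.5) / L) ** 2)
            for n in range(L)]


N = 8
w = vorbis_window(N)

# Power complementarity between the two halves that overlap in time.
power_sum = [w[n] ** 2 + w[n + N] ** 2 for n in range(N)]
print(max(abs(p - 1.0) for p in power_sum))
```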

2.4 MDCT (Modified Discrete Cosine Transform)

The MDCT determines the correlation between a set of 2N numbers (samples) and N orthogonal cosine functions. Therefore, at the input of the MDCT there are 2N samples and at the output, N coefficients.
The MDCT coefficients S[w] of the PCM samples s[n] are defined as:
$S[w] = \sum_{n=0}^{2N-1} s[n] \cos\left[\frac{\pi}{N}\left(n + \frac{1}{2} + \frac{N}{2}\right)\left(w + \frac{1}{2}\right)\right], \quad w = 0, \ldots, N-1.$ (2.1)
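The following sketch evaluates the MDCT definition directly (in O(N²) time, not with a fast algorithm) and checks that windowed, overlapped blocks can be recovered after the inverse transform. The inverse transform, its 2/N scale factor and the plain sine window are assumptions chosen for brevity; they are not taken from the Vorbis sources.

```python
import math


def mdct(block):
    """Direct MDCT: 2N samples in, N coefficients out."""
    N = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (w + 0.5))
                for n in range(2 * N))
            for w in range(N)]


def imdct(S):
    """Inverse MDCT: N coefficients in, 2N time-aliased samples out.

    The 2/N scale is an assumption that matches the sine window below.
    """
    N = len(S)
    return [2.0 / N * sum(S[w] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (w + 0.5))
                          for w in range(N))
            for n in range(2 * N)]


def sine_window(N):
    """A simple window whose overlapping halves have squared values summing to 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def analysis_synthesis(s, N):
    """Window, transform, invert, window again and overlap-add."""
    win = sine_window(N)
    out = [0.0] * len(s)
    for start in range(0, len(s) - 2 * N + 1, N):
        block = [win[n] * s[start + n] for n in range(2 * N)]
        y = imdct(mdct(block))
        for n in range(2 * N):
            out[start + n] += win[n] * y[n]
    return out


N = 4
s = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(4 * N)]
out = analysis_synthesis(s, N)

# Only samples covered by two blocks are fully reconstructed.
max_error = max(abs(a - b) for a, b in zip(s[N:-N], out[N:-N]))
print(max_error)
```

A single block cannot be inverted on its own (N coefficients cannot describe 2N samples); the aliasing introduced by one block is cancelled by its neighbours during the overlap-add.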

2.5 SAM (pSycho Acoustic Model) [3]

2.5.1 ATH (Absolute Threshold of Hearing) model [4]

The ATH is the minimum sound level that a listener can perceive at each frequency in a quiet environment. This means that, for example, if we have a tone of 0 dB at 1 kHz and a tone below 20 dB at 100 Hz, the second tone cannot be perceived, and vice versa.
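A widely used closed-form approximation of the ATH curve is the one attributed to Terhardt [4]. The sketch below evaluates it; the function name `ath_db` is an assumption, and the formula is a generic model of hearing rather than the exact curve stored in any Vorbis encoder.

```python
import math


def ath_db(f_hz):
    """Approximate absolute threshold of hearing, in dB SPL, at f_hz hertz."""
    f = f_hz / 1000.0  # the formula works in kHz
    return (3.64 * f ** -0.8
            - 6.5 * math.exp(-0.6 * (f - 3.3) ** 2)
            + 1e-3 * f ** 4)


for f_hz in (100, 1000, 3300, 10000, 16000):
    print(f_hz, round(ath_db(f_hz), 1))
```

With this approximation the threshold is roughly 23 dB at 100 Hz, a few dB at 1 kHz, and lowest (slightly negative) around 3 to 4 kHz, where the ear is most sensitive.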

2.5.2 Frequency resolution and simultaneous masking

The human auditory system (HAS) has a limited frequency resolution. Psychoacoustic experiments have demonstrated that the audible frequencies can be grouped into critical bands (barks).
Each bark defines the group of frequencies that excite the same cochlear area, i.e., those frequencies that can be masked by the tone with the highest energy (in that bark).
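To give a feel for how frequencies map onto barks, the sketch below uses a common analytic approximation of the Bark scale (attributed to Zwicker and Terhardt). This formula and the name `hz_to_bark` are assumptions for illustration; the mapping used inside a particular Vorbis encoder may differ.

```python
import math


def hz_to_bark(f_hz):
    """Approximate critical-band rate (in barks) of a frequency in hertz."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)


# The same 100 Hz interval spans far more of the scale at low frequencies,
# i.e. barks are narrow at low frequencies and wide at high frequencies.
low_span = hz_to_bark(200) - hz_to_bark(100)
high_span = hz_to_bark(5100) - hz_to_bark(5000)
print(round(low_span, 3), round(high_span, 3))
```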

2.5.3 Temporal masking

The human auditory system has inertia: sounds are not perceived instantly, and they linger after they have disappeared, especially if the sounds have the same frequency.

2.6 Quantization

Depending on the desired output bit-rate and the frequency (see the ATH model), the SAM applies a different quantization step to sets of frequencies (barks) of different size. Roughly, the higher the compression ratio, the larger the quantization step and, therefore, the quantization noise; and the higher the frequency, the wider the bark. Note also that the occurrence of a tone in a bark depends on temporal masking as well.
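The relation between quantization step and quantization noise can be illustrated with a plain uniform quantizer. This is a minimal sketch with made-up coefficient values; the actual Vorbis quantization is driven by the psychoacoustic model and is not reproduced here.

```python
def quantize(coeffs, step):
    """Map each coefficient to the index of its nearest multiple of `step`."""
    return [round(c / step) for c in coeffs]


def dequantize(indexes, step):
    """Rebuild approximate coefficients from the quantization indexes."""
    return [k * step for k in indexes]


coeffs = [0.12, -3.4, 7.77, 15.01, -0.49]   # hypothetical MDCT coefficients
noise_by_step = {}
for step in (0.5, 2.0, 8.0):
    indexes = quantize(coeffs, step)
    recon = dequantize(indexes, step)
    noise_by_step[step] = max(abs(c - r) for c, r in zip(coeffs, recon))
    print(step, indexes, noise_by_step[step])
```

A larger step yields smaller indexes (cheaper to encode) but allows an error of up to half a step per coefficient.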

These are the quantization levels available for Vorbis:

Quality   Expected bit-rate (kbps)
-------   ------------------------
  -2        32
  -1        45
   0        64
   1        80
   2        96
   3       112
   4       128
   5       160
   6       192
   7       224
   8       256
   9       320
  10       500

At decoding time, those barks that suffered the biggest losses are usually filled with noise in order to increase the perceived quality.

2.7 Floor and residue encoding

The quantized frequency spectrum is approximated with a polynomial (the floor curve). This curve is losslessly encoded and has much less data (coefficients) and information than the quantized spectrum.
The difference (residue) between the floor curve and the quantized spectrum is then lossily encoded with VQ and Huffman coding.
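The split into floor and residue can be sketched as follows. For brevity, this example swaps the polynomial for a piecewise-linear curve through a few points of the spectrum; the function name, the spacing of the points and the spectrum values are all assumptions.

```python
def piecewise_linear_floor(spectrum, step):
    """Interpolate linearly between every `step`-th value of the spectrum.

    Returns the positions of the control points and the floor curve.
    Assumes len(spectrum) >= 2.
    """
    n = len(spectrum)
    knots = list(range(0, n, step))
    if knots[-1] != n - 1:
        knots.append(n - 1)          # always anchor the last coefficient
    floor = [0.0] * n
    for a, b in zip(knots, knots[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            floor[i] = (1 - t) * spectrum[a] + t * spectrum[b]
    return knots, floor


spectrum = [10.0, 9.0, 7.0, 8.0, 4.0, 3.0, 3.0, 1.0, 0.0]   # hypothetical values
knots, floor = piecewise_linear_floor(spectrum, step=4)
residue = [s - f for s, f in zip(spectrum, floor)]
print(knots)
print(floor)
print(residue)
```

Only the few control points are needed to rebuild the floor, and the residue holds the remaining detail, with values that are typically smaller than those of the spectrum itself.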

2.8 Vorbis’s VQ (Vector Quantization)

Embedded (truncatable) CBR (Constant Bit-Rate) lossy encoding.
It minimizes the quantization error when tuples of symbols are encoded.

 tuples  +---------+ code-vectors
-------->| Encoder |--------->
         +---------+
For each input tuple, the encoder outputs the index of the code-vector in the codebook (the set of code-vectors) that is most similar to the input tuple.
In Vorbis, the codebook is computed for each audio sequence.
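The encoding step can be sketched as a nearest-neighbour search over the codebook. The codebook and the input tuples below are invented for the example, and the squared Euclidean distance is an assumed similarity measure.

```python
def nearest_code_vector(tuple_, codebook):
    """Return the index of the code-vector closest to `tuple_`."""
    def squared_distance(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)),
               key=lambda i: squared_distance(tuple_, codebook[i]))


codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0), (2.0, -2.0)]   # hypothetical
tuples = [(0.9, 1.2), (0.1, -0.2), (1.7, -1.6)]

indexes = [nearest_code_vector(t, codebook) for t in tuples]
decoded = [codebook[i] for i in indexes]     # what the decoder reconstructs
print(indexes)
print(decoded)
```

Every index costs the same number of bits (here 2 bits for a 4-entry codebook), which is why this stage on its own has a constant bit-rate.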

2.9 Huffman coding

Variable bit-rate (VBR) lossless encoding (it assigns fewer bits to those code-vectors with a high probability, and vice versa).
The Huffman codes are computed using the probabilities of the code-vectors, floor coefficients, etc.
This lossless stage increases the compression ratio of the spectral residues.
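A minimal sketch of how a Huffman code can be built from symbol counts is shown below. It is the textbook construction, not the codebook format used by Vorbis, and the sequence of code-vector indexes is invented.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Return a dict mapping each symbol to its binary codeword (a string)."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol source
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)      # the two least probable subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


indexes = [0] * 8 + [1] * 4 + [2] * 2 + [3] * 2   # hypothetical VQ output
code = huffman_code(indexes)
total_bits = sum(len(code[i]) for i in indexes)
print(code, total_bits)
```

In this example the most frequent index gets a 1-bit codeword and the rarest ones get 3 bits, so the 16 indexes need fewer bits than the 32 required by a fixed 2-bit code.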

2.10 Packet “peeling”

Because of the order in which the information is written into a Vorbis packet (first the floor data and then the residue data), packets can be truncated in order to reduce the bit-rate, without re-encoding.

2.11 Channel coupling

Vorbis supports up to 255 channels.
Most of the time, similar sounds are carried in several channels.
Channel coupling decreases inter-channel redundancy.
The residue spectra of multichannel audio tend to be correlated.
The differences between these residue spectra are coded using one of the following methods:
1. Lossless coupling: All stereo information (the differences between left and right samples) is losslessly compressed. This provides the maximal quality but the minimal compression.
2. Phase stereo: Stereo information is quantized and compressed. It is expressed in a square polar representation (magnitude and phase), and the phase is quantized.
3. Point stereo: All polar (stereo) information is discarded. In this case, all the stereo information comes from the difference in the spectral floors for the channels. This provides the maximal compression but the minimal quality.
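The square polar representation can be sketched for one pair of integer residue values. The `decouple` function follows the inverse coupling rule as described in the Vorbis I specification; the `couple` function is a hypothetical forward mapping written for this example so that it inverts `decouple`, and it is not the code of the reference encoder.

```python
def couple(left, right):
    """Map a (left, right) pair to (magnitude, angle). Illustrative only."""
    if abs(left) > abs(right):
        magnitude = left
        angle = left - right if left > 0 else right - left
    else:
        magnitude = right
        angle = left - right if right > 0 else right - left
    return magnitude, angle


def decouple(magnitude, angle):
    """Recover (left, right) from (magnitude, angle)."""
    if magnitude > 0:
        if angle > 0:
            return magnitude, magnitude - angle
        return magnitude + angle, magnitude
    if angle > 0:
        return magnitude, magnitude + angle
    return magnitude - angle, magnitude


pairs = [(l, r) for l in range(-6, 7) for r in range(-6, 7)]
lossless = all(decouple(*couple(l, r)) == (l, r) for l, r in pairs)
print(lossless)

# Point stereo: the angle is discarded (set to 0), so both channels get the magnitude.
print(decouple(5, 0))
```

When left and right are similar, the angle is small and cheap to encode; quantizing it gives phase stereo, and dropping it gives point stereo.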

Part 3
Ogg

3.1 What is Ogg?

A free, open container format maintained by the Xiph.Org Foundation.
It can multiplex a number of independent streams for audio, video, text (such as subtitles), and metadata.

3.2 The Ogg format

#include <stdint.h>

struct Segment_Table {
  uint8_t *Segment_Length;           /* One length, in bytes, per segment */
};

struct Ogg_page {
  uint8_t  Ogg_Magic_Number[4];      /* The Ogg magic number, always "OggS" */
  uint8_t  Version;                  /* Always 0 */
  uint8_t  Header_Type;              /* Type of page that follows: BOS, Continuation or EOS */
  uint64_t Granule_Position;         /* A time marker */
  uint32_t Bitstream_Serial_Number;  /* Identifies the stream in multi-stream seqs */
  uint32_t Page_Sequence_Number;
  uint32_t CRC32;
  uint8_t  Page_Segments;            /* Number of segments in this page */
  struct Segment_Table Segment_Table;
};

struct Ogg_Stream {
  struct Ogg_page *Pages;            /* An Ogg stream is a sequence of pages */
};

In an Ogg Vorbis stream, the first pages store a header with the information necessary to decode the rest of the pages (e.g., the codebook and the Huffman tree). The remaining pages store audio.
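The page layout above can be read with Python's `struct` module. The sketch below assumes the little-endian field order and the header-type flag values (0x01 continuation, 0x02 BOS, 0x04 EOS) of the Ogg specification; the function name and the synthetic page are made up for the example, and the CRC is not verified.

```python
import struct

# Fixed part of the page header: 27 bytes, little-endian, no padding.
HEADER = struct.Struct("<4sBBQIIIB")


def parse_page_header(data):
    """Decode the header of the Ogg page that starts at data[0]."""
    (magic, version, header_type, granule_position, serial_number,
     sequence_number, crc32, page_segments) = HEADER.unpack_from(data, 0)
    if magic != b"OggS":
        raise ValueError("not an Ogg page")
    # The segment table holds one length byte per segment.
    segment_table = list(data[HEADER.size:HEADER.size + page_segments])
    return {
        "version": version,
        "continuation": bool(header_type & 0x01),
        "bos": bool(header_type & 0x02),
        "eos": bool(header_type & 0x04),
        "granule_position": granule_position,
        "serial_number": serial_number,
        "sequence_number": sequence_number,
        "crc32": crc32,
        "header_size": HEADER.size + page_segments,
        "payload_size": sum(segment_table),
    }


# A synthetic first page (BOS) with a single 30-byte segment; CRC left as 0.
page = HEADER.pack(b"OggS", 0, 0x02, 0, 1234, 0, 0, 1) + bytes([30]) + bytes(30)
info = parse_page_header(page)
print(info["bos"], info["header_size"], info["payload_size"])
```

The next page of the stream starts `header_size + payload_size` bytes after the current one, which is how a reader can walk through a file page by page.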

Bibliography

[1] The Xiph Open Source Community. Vorbis audio compression. http://xiph.org/vorbis.

[2] Xiph.Org Foundation. Ogg Vorbis documentation. http://xiph.org/vorbis/doc.

[3] Erik Montnémery and Johannes Sandvall. Ogg/Vorbis in embedded systems. PhD thesis, Lunds Tekniska Högskola, Lunds Universitet, 2004.

[4] E. Terhardt. Calculating virtual pitch. Hearing Res., 1:155–182, 1979.