The Burrows-Wheeler transform

Juan Francisco Rodríguez Herrera
Vicente González Ruiz

September 12, 2016

Contents

1 The forward transform
2 The inverse transform

Intro

1 The forward transform

Let B the block-size in symbols:

  1. Read B symbols.
  2. Build a square matrix of size B × B where the first row is the original sequence, the second one is the same sequence but cyclically shifted one symbol to the left, and so on ...
  3. Sort lexicographically the matrix by rows.
  4. Search in the last column the row in which the first symbol of the original sequence it is found. This is the index i.
  5. Output i and the last column O[].

Encoding example

Initial matrix Sorted matrix

<a >babcbababaaaaaaa 0
babcbababaaaaaaa <a > 1
abcbababaaaaaaa <a >b 2
bcbababaaaaaaa <a >ba 3
cbababaaaaaaa <a >bab 4
bababaaaaaaa <a >babc 5
ababaaaaaaa <a >babcb 6
babaaaaaaa <a >babcba 7
abaaaaaaa <a >babcbab 8
baaaaaaa <a >babcbaba 9
aaaaaaa <a >babcbabab 10
aaaaaa <a >babcbababa11
aaaaa <a >babcbababaa 12
aaaa <a >babcbababaaa13
aaa <a >babcbababaaaa 14
aa <a >babcbababaaaaa15
a <a >babcbababaaaaaa 16


aaaaaaa <a >babcbabab10 0
aaaaaa <a >babcbababa11 1
aaaaa <a >babcbababaa 12 2
aaaa <a >babcbababaaa13 3
aaa <a >babcbababaaaa 14 4
aa <a >babcbababaaaaa15 5
a <a >babcbababaaaaaa 16 6
abaaaaaaa <a >babcbab 8 7
ababaaaaaaa <a >babcb 6 8
<a >babcbababaaaaaaa 0 9
abcbababaaaaaaa <a >b 2 10
baaaaaaa <a >babcbaba 911
babaaaaaaa <a >babcba 7 12
bababaaaaaaa <a >babc 513
babcbababaaaaaaa <a > 1 14 i
bcbababaaaaaaa <a >ba 315
cbababaaaaaaa <a >bab 4 16

2 The inverse transform

  1. Sort O[] over S[].
  2. Compute T[] where if S[j] = O[l] (being l the first symbol of O[] that matches this condition), then T[j] = l. Notice that all of symbols of T[] have to be different.
  3. Let k i.
  4. Execute B times:
    1. Output O[k].
    2. k T[k].

Decoding example

Let’s go to the lab!

  1. Download http://www.ace.ual.es/~vruiz/doctorado/text-compression.tar.gz, make bwt and make unbwt.
  2. Encode each image of the Test Image Corpus using the BWT and RLE codecs (pipe both codecs writing rle < raw-file | bwt > encoded-file). Continue filling the table:
        Codec | lena boats pepers zelda Average  
    ----------+--------------------------------  
          rle | ....  ....   ....  ....    ....  
    bwt + rle | ....  ....   ....  ....    ....

  3. Better results?
  4. Verify that the encoding/decoding is lossless!