# One-bit catastrophe

11 Sep 2017

Hi! Today Guillaume Lagarde gives an overview of the recent preprint he has with his advisor Sylvain Perifel: Lempel-Ziv: a “one-bit catastrophe” but not a tragedy. He presented parts of it at the PhD seminar last spring.

## A strange scenario

Imagine you compressed a file using your favorite compression algorithm, but you realize that there was a typo in the file. You then correct it, adding a single bit to the original file.

Compress it again and you get a much larger compressed file… for just a one-bit difference! Most compression algorithms do not have this strange behavior, but for LZ’78, one of the most famous of them, this surprising scenario might well happen. This is what we will overview in this post.

## Introduction to LZ’78 and notations

Lempel-Ziv (LZ for short) algorithms refer to a bunch of lossless compression techniques introduced by Abraham Lempel and Jacob Ziv that are used as key ingredients in various places such as *deflate*, *gif*, *gzip*, etc. The version we consider is LZ’78, which works by cutting the word $w$ we want to compress into substrings $w = c_1 c_2 \cdots c_m$, called blocks, such that each block $c_i$, except maybe the last one, is a maximal extension of a previous block, that is: $c_i = c_j a$ for some $j < i$ and some letter $a$, and $c_i$ is not equal to $c_k$, for $k < i$.

For example, consider the word $w = aababbb$. The first block is always the first letter (it is the extension of the empty word $\varepsilon$), so that $c_1 = a$. Then, $c_2 = ab$, since this is the smallest prefix of the remaining word $ababbb$ which is not equal to a word in the set $\{a\}$. Then, $c_3 = abb$ since this is the smallest prefix of $abbb$ which is not equal to a word in $\{a, ab\}$, and so on. In the end, $w$ is cut as $a \cdot ab \cdot abb \cdot b$.
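To make the cutting rule concrete, here is a short Python sketch of the LZ’78 parsing (the function name `lz78_blocks` is ours, not from the paper):

```python
def lz78_blocks(w):
    """Cut w into LZ'78 blocks: each block is the smallest prefix of the
    remaining word that is not already a block (the last one may repeat)."""
    seen = set()      # blocks produced so far
    blocks = []
    i, n = 0, len(w)
    while i < n:
        j = i + 1
        # extend the candidate block while it was already produced
        while j < n and w[i:j] in seen:
            j += 1
        blocks.append(w[i:j])
        seen.add(w[i:j])
        i = j
    return blocks

print(lz78_blocks("aababbb"))  # ['a', 'ab', 'abb', 'b']
```

Note that the inner loop stops either at a fresh block or at the end of the word, which is exactly why only the last block may fail to be fresh.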

It is not hard to see that there is a unique cut that satisfies this property. Then, the compression algorithm LZ’78 encodes each block $c_i$ as a couple $(j, a)$ where $j$ is a *pointer* to its *predecessor* $c_j$ together with the letter $a$ such that $c_i = c_j a$ (with $j = 0$ standing for the empty word).

For instance, the word $aababbb = a \cdot ab \cdot abb \cdot b$ is thus encoded as $(0,a)\,(1,b)\,(2,b)\,(0,b)$.
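The encoding step can be sketched the same way; this hypothetical `lz78_encode` (our naming) produces the (pointer, letter) pairs, with pointer $0$ standing for the empty word:

```python
def lz78_encode(w):
    """Encode w as a list of LZ'78 (pointer, letter) pairs."""
    index = {"": 0}    # maps each block to its number in the parsing
    pairs = []
    i, n = 0, len(w)
    while i < n:
        j = i + 1
        while j < n and w[i:j] in index:
            j += 1
        block = w[i:j]
        # the block is its predecessor block[:-1] extended by block[-1]
        pairs.append((index[block[:-1]], block[-1]))
        index.setdefault(block, len(pairs))
        i = j
    return pairs

print(lz78_encode("aababbb"))  # [(0, 'a'), (1, 'b'), (2, 'b'), (0, 'b')]
```

Decompression simply replays the pairs, rebuilding each block from its predecessor, which is why the scheme is lossless.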

The compression size is the number of bits needed to represent the compressed version of the word, and is equal to $\sum_{i=1}^{m} (\lceil \log_2 i \rceil + O(1))$, since the pointer of the $i$-th block is an integer smaller than $i$. In fact, it can be shown that the order of the compression size is $\Theta(m \log m)$ where $m$ is the number of blocks; this shows that the important and unique parameter needed is $m$, the number of blocks. The *dictionary* of a word $w$, written $D(w)$, will be the set of all the blocks $c_i$. With this notation in hand, the size of the compression can be rephrased as $\Theta(|D(w)| \log |D(w)|)$. A word is said to be *incompressible* if the size of the compression is of the order of the size of the word. Equivalently, a word of length $n$ is incompressible if $|D(w)| = \Theta(n / \log n)$.

The most compressible words are obtained by taking the concatenation of the prefixes of any word $u = u_1 u_2 u_3 \cdots$. That is, the most compressible words are of the form $w = u_1 \cdot u_1 u_2 \cdot u_1 u_2 u_3 \cdots$. The size of the dictionary of such a word is $\Theta(\sqrt{n})$, since the blocks of the parsing are exactly the prefixes of $u$.
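This can be checked numerically. The sketch below (helper names `prefix_concat` and `dict_size` are ours) builds such a word and counts its blocks; since a word made of $k$ prefixes has length $k(k+1)/2$, the dictionary size is about $\sqrt{2n}$:

```python
import math

def prefix_concat(u):
    """The concatenation u1 · u1u2 · u1u2u3 · ... of all prefixes of u."""
    return "".join(u[:i] for i in range(1, len(u) + 1))

def dict_size(w):
    """Number of LZ'78 blocks of w, i.e. (up to the last block) |D(w)|."""
    seen, i, n, m = set(), 0, len(w), 0
    while i < n:
        j = i + 1
        while j < n and w[i:j] in seen:
            j += 1
        seen.add(w[i:j])
        i, m = j, m + 1
    return m

u = "ab" * 50                   # any word u works; here |u| = 100
w = prefix_concat(u)            # |w| = 100 * 101 / 2 = 5050
print(len(w), dict_size(w), math.isqrt(2 * len(w)))  # 5050 100 100
```

The parsing recovers exactly the 100 prefixes of $u$, matching the $\Theta(\sqrt{n})$ bound.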

## One-bit catastrophe and results

The *one-bit catastrophe question* asks whether there exists a compressible (infinite) word $w$ such that $aw$ is incompressible for a letter $a$. We answer this question positively by showing our main theorem:

**Theorem 1:** There exists an infinite word $w$ such that the compression ratio changes from $O(1/\log n)$ to at least a positive constant by adding one single bit in front of $w$.

In fact, this result is not as tragic as it seems. Indeed, for the ratio to change, the word needs to be compressible, but very slowly; in symbols, the compression ratio of $w$ must be $\Omega(1/\log n)$ – this is quite close to an incompressible word!

More generally, we can be more precise by considering the possible variations of the compression size on finite words:

**Theorem 2:** For any finite word $w$ of length $n$ and any letter $a$, $|D(aw)| = O\big(\sqrt{n\,|D(w)|}\big)$.

Moreover, we can show that this result is tight up to a multiplicative constant.

In particular, Theorem 2 tells us that any compressible word for which the size of the dictionary is $o(n/\log^2 n)$ remains compressible when we add any letter in front of it, i.e. $|D(aw)| = o(n/\log n)$.

Theorem 2 also tells us that for the most compressible words, the variation can change from $\Theta(\sqrt{n})$ to $\Theta(n^{3/4})$.

## High level ideas of the proofs

The first step is to understand how to get a *weak* catastrophe, that is, a word $w$ such that $|D(w)| = \Theta(\sqrt{n})$ and $|D(aw)| = \Omega(n^{3/4})$. For that, we consider a word $w$ which is the concatenation of the prefixes of a word $u$, where $u$ has a maximal number of different substrings: $u$ is a de Bruijn word; this alone should be sufficient to get a weak catastrophe (simulations say!), but we were not able to prove it directly. Instead, we need to add in an adaptive way small words between the prefixes, called gadgets, in order to have more control on the parsings of $w$ and $aw$ simultaneously.
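Those simulations are easy to reproduce. The sketch below is our own experiment, not the paper’s construction (in particular it has no gadgets): it builds a binary de Bruijn-style word with the classical greedy “prefer-one” rule, concatenates its prefixes, and compares the dictionary sizes of $w$ and $1w$:

```python
def de_bruijn_binary(order):
    """Greedy 'prefer-one' construction: starting from 0^order, append a 1
    whenever the new length-`order` window is unseen, else a 0; stop when
    stuck. The result has every length-`order` factor appearing once."""
    w = "0" * order
    seen = {w}
    while True:
        for bit in "10":
            window = w[len(w) - order + 1:] + bit
            if window not in seen:
                seen.add(window)
                w += bit
                break
        else:
            return w

def dict_size(w):
    """Number of LZ'78 blocks of w."""
    seen, i, n, m = set(), 0, len(w), 0
    while i < n:
        j = i + 1
        while j < n and w[i:j] in seen:
            j += 1
        seen.add(w[i:j])
        i, m = j, m + 1
    return m

u = de_bruijn_binary(5)                            # |u| = 2^5 + 4 = 36
w = "".join(u[:i] for i in range(1, len(u) + 1))   # concatenation of prefixes
# |D(w)| equals |u| (the blocks are the prefixes); |D(1w)| is much larger
print(dict_size(w), dict_size("1" + w))
```

Even on such a small example, prepending a single bit visibly desynchronizes the parsing and inflates the dictionary, which is the phenomenon the gadgets then turn into a proof.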

With this *weak* catastrophe in hand, the idea is now to create the *true catastrophe* (from a compressible to an incompressible word). We do that by probabilistically creating a lot of “independent de Bruijn-style words” which are used to create a lot of independent weak catastrophes at the same time. Then, by tuning some parameters such as the size or the number of weak catastrophes, together with new and independent gadgets between the words, we get a catastrophe.

## Open questions

Is it possible to push the limits as far as to get a catastrophe where the compression ratio changes from $\Theta(1/\log n)$ to $\Theta(1)$? This would yield the worst possible configuration allowed by Theorem 2.

The main challenge, to our mind, is to remove the gadgets in our
constructions. This would mean that the *weak catastrophe* is the
typical case for optimally compressible words.