# Lossless Compression

Assume we want to transmit the string "real array".

10 characters, 6 different symbols: r e a l y <space>

Option 1: ASCII-encode every character (7 bits per symbol):

7 × 10 = 70 bits

Option 2: encode the symbol set and the order of the symbols:

- Symbol set: 7-bit ASCII × 6 symbols = 42 bits
- Order: 4 bits per position (enough to index the 6 symbols) × 10 positions = 40 bits
- Altogether: 42 + 40 = 82 bits

Option 3: store the symbol ASCII codes on both encoder and decoder, so only the 40 order bits need to be transmitted.

After compression with a variable-length code (see below): 24 bits.

Theoretical lower bound:

2.371 × 10 = 23.71 bits

The 2.371 bits/symbol is Shannon's entropy of the string, H = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the probability of symbol i.
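A quick check of these numbers (a minimal sketch in Python; the symbol probabilities are just the relative frequencies in the string):

```python
from collections import Counter
from math import log2

msg = "real array"
counts = Counter(msg)  # r:3, a:3, e:1, l:1, y:1, <space>:1

# Shannon entropy: H = -sum(p_i * log2(p_i))
entropy = sum(-(c / len(msg)) * log2(c / len(msg)) for c in counts.values())
print(entropy)             # ~2.371 bits per symbol
print(entropy * len(msg))  # ~23.71 bits for the whole string
```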

## Variable-Length Coding

The Shannon-Fano algorithm is a top-down approach (a sketch follows the list):

- Sort the symbols according to the frequency count of their occurrences.
- Recursively divide the symbols into 2 parts, each with approximately the same total count, until each part contains only 1 symbol.
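A minimal recursive sketch of this procedure (Python; the even-split heuristic below is one common variant):

```python
from collections import Counter

def shannon_fano(symbols):
    """symbols: list of (symbol, count) pairs sorted by descending count."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    total = sum(c for _, c in symbols)
    # find the split point where the two parts' counts are most nearly equal
    running, split, best = 0, 1, total
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(2 * running - total)
        if diff < best:
            best, split = diff, i
    # prefix 0 for the first part, 1 for the second, then recurse
    codes = {s: "0" + c for s, c in shannon_fano(symbols[:split]).items()}
    codes.update({s: "1" + c for s, c in shannon_fano(symbols[split:]).items()})
    return codes

freq = sorted(Counter("real array").items(), key=lambda kv: -kv[1])
print(shannon_fano(freq))
```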

## Huffman Coding

The Huffman coding algorithm is a bottom-up approach (see the sketch after this list):

- Initialization: put all symbols on a list sorted according to their frequency counts.
- Repeat until the list has only 1 symbol left:
  - From the list, pick the 2 symbols with the lowest frequency counts.
  - Form a Huffman subtree that has these 2 symbols as child nodes, and create a parent node for them.
  - Assign the sum of the children's frequency counts to the parent, and insert it into the list such that the order is maintained.
  - Delete the children from the list.
- Assign a codeword to each leaf based on the path from the root.
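A compact sketch in Python, using a min-heap as the sorted list (codeword tables are merged instead of keeping explicit tree nodes):

```python
import heapq
from collections import Counter

def huffman_code(text):
    # initialization: all symbols, ordered by frequency count (min-heap)
    heap = [(count, i, {sym: ""})
            for i, (sym, count) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        # pick the 2 entries with the lowest frequency counts
        c1, _, left = heapq.heappop(heap)
        c2, _, right = heapq.heappop(heap)
        # parent gets the sum of the children's counts; each codeword
        # records the path from the root (0 = left branch, 1 = right)
        parent = {s: "0" + code for s, code in left.items()}
        parent.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (c1 + c2, n, parent))
        n += 1
    return heap[0][2]

codes = huffman_code("real array")
print(codes)
print(sum(len(codes[ch]) for ch in "real array"))  # 24 bits, as above
```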

#### Properties of Huffman Coding

- Unique prefix property: no code is a prefix of any other code.
- Optimality: it is a minimum-redundancy code:
  - The 2 least frequent symbols have Huffman codes of the same length, differing only in the last bit.
  - Symbols that occur more frequently have shorter codes.

#### Store Huffman Codes

- Header
  - Pixel dimensions (width, height)
  - Code table (color, code, and length of code), OR a frequency table (from which the decoder regenerates the Huffman tree)
- Sequence of bits
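One possible on-disk layout for such a header (a hypothetical format sketched in Python; the field widths are assumptions, not a fixed standard):

```python
import struct

def write_header(f, width, height, code_table):
    """code_table: {color_byte: (code_bits_as_int, code_length_in_bits)}."""
    f.write(struct.pack(">HH", width, height))   # pixel dimensions
    f.write(struct.pack(">B", len(code_table)))  # number of table entries
    for color, (code, length) in code_table.items():
        # one entry per color: the color, the code length, the code bits
        f.write(struct.pack(">BBI", color, length, code))
```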

Now picture how the decoder decodes the encoded bitmap (a sketch follows the list):

- Bits are read sequentially from the encoded file, guiding the traversal down the Huffman tree until a leaf node is encountered.
- A 0 bit means take the left branch; a 1 bit means take the right branch.
- When the decoder arrives at a leaf node, it has found the color associated with the bit sequence just consumed.
- The decoder emits that color and moves back to the root of the Huffman tree to decode the next code.
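A minimal sketch of that traversal (Python; the Node class and the tiny example tree are illustrative only):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    symbol: Optional[str] = None        # set only on leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def decode(bits, root):
    out, node = [], root
    for b in bits:
        # 0 bit: take the left branch; 1 bit: take the right branch
        node = node.left if b == "0" else node.right
        if node.symbol is not None:     # leaf reached: symbol found
            out.append(node.symbol)
            node = root                 # back to the root for the next code
    return "".join(out)

# tiny example tree: 'a' -> 0, 'b' -> 10, 'c' -> 11
tree = Node(left=Node("a"), right=Node(left=Node("b"), right=Node("c")))
print(decode("010110", tree))  # -> "abca"
```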

#### Extended Huffman Coding

Group several symbols together and assign a single codeword to the group as a whole. This lets the average code length per symbol approach the entropy more closely, which matters when one symbol has a very high probability.
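A small illustration (Python; the skewed 2-symbol source is a made-up example). Any Huffman code needs at least 1 bit per codeword, so coding pairs helps here:

```python
import heapq
from itertools import product

def avg_huffman_length(probs):
    """Average Huffman codeword length (bits) for a {symbol: prob} dict."""
    heap = [(p, i, [s]) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    depth = {s: 0 for s in probs}
    n = len(heap)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:               # every symbol under the new parent
            depth[s] += 1               # sinks one level deeper
        heapq.heappush(heap, (p1 + p2, n, s1 + s2))
        n += 1
    return sum(p * depth[s] for s, p in probs.items())

single = {"a": 0.8, "b": 0.2}
pairs = {x + y: px * py
         for (x, px), (y, py) in product(single.items(), repeat=2)}
print(avg_huffman_length(single))     # 1.0 bit/symbol
print(avg_huffman_length(pairs) / 2)  # 0.78 bits/symbol, closer to H ~ 0.72
```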

#### Adaptive Huffman Coding

Huffman coding requires prior statistical knowledge about the information source (the probabilities of all symbols), which is often not available. Adaptive Huffman coding gathers statistics and updates the tree dynamically as the data stream arrives.

The probabilities are no longer based on prior knowledge but on the actual data received so far.

Update: the Huffman tree must always maintain its sibling property, i.e. the node counts must be in increasing order from left to right and from bottom to top.

An additional rule: if any character/symbol is to be sent for the first time, it must be preceded by a special symbol, NEW. The initial code for NEW is 0 (its count is never increased); hence it is always denoted NEW:(0).

The code for a particular symbol changes during the adaptive Huffman coding process.
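A toy sketch of the idea (Python). For brevity it rebuilds the code table from the running counts before every symbol instead of performing the proper sibling-property update, so it is far less efficient, but the codes adapt in the same way and NEW:(0) is handled as described above:

```python
import heapq

def huffman_codes(counts):
    """Huffman code table for the current counts (rebuilt from scratch)."""
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    heap = [(c, i, {sym: ""}) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    n = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)
        c2, _, t2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t1.items()}
        merged.update({s: "1" + code for s, code in t2.items()})
        heapq.heappush(heap, (c1 + c2, n, merged))
        n += 1
    return heap[0][2]

NEW = "<NEW>"  # escape symbol; its count stays 0, as in NEW:(0)

def adaptive_encode(text):
    counts = {NEW: 0}
    out = []
    for ch in text:
        table = huffman_codes(counts)   # codes reflect the data so far
        if ch in counts:
            out.append(table[ch])
        else:
            # first occurrence: send NEW, then the raw 8-bit symbol
            out.append(table[NEW] + format(ord(ch), "08b"))
            counts[ch] = 0
        counts[ch] += 1                 # NEW itself is never incremented
    return " ".join(out)                # spaces added for readability only

print(adaptive_encode("AADCCDD"))
```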

## Dictionary-based Coding

### LZW

Lempel-Ziv-Welch (LZW) uses fixed-length codewords to represent variable-length strings of symbols/characters that commonly occur together, e.g. words in English text. The LZW encoder and decoder build up the same dictionary dynamically while receiving the data.

LZW places longer and longer repeated entries into a dictionary and then emits the code for an element, rather than the string itself, if the element has already been placed in the dictionary.
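A sketch of the encoder side (Python; assumes 8-bit input symbols and an unbounded dictionary):

```python
def lzw_encode(text):
    # dictionary starts with all single characters
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    s, out = "", []
    for ch in text:
        if s + ch in dictionary:
            s += ch                         # keep extending the match
        else:
            out.append(dictionary[s])       # emit code for the longest match
            dictionary[s + ch] = next_code  # add the longer string
            next_code += 1
            s = ch
    if s:
        out.append(dictionary[s])
    return out

print(lzw_encode("ABABBABCABABBA"))
# -> [65, 66, 256, 257, 66, 67, 256, 258, 65]
```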

Widely used, e.g. in Unix compress, GIF, and TIFF.

## Run-Length Coding (RLC)

Memoryless source: an information source that is independently distributed, i.e. the value of the current symbol does not depend on the values of previously appeared symbols.

Instead of assuming a memoryless source, RLC exploits the memory present in the information source.

Rationale for RLC: if the information source has the property that symbols tend to form continuous groups, then the symbol and the length of the group can be coded instead.
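A minimal sketch (Python):

```python
from itertools import groupby

def run_length_encode(seq):
    # code each continuous group as (symbol, length of the group)
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

print(run_length_encode("aaabbbbcca"))
# -> [('a', 3), ('b', 4), ('c', 2), ('a', 1)]
```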