Lossless Compression


Assume we want to transmit the following message: "real array".
10 characters, 6 different symbols: r e a l y <space>

Encoding all possible symbols
ASCII encoding: 7 bits for each symbol
7*10 = 70 bits

Encoding the symbols plus the order in which they occur
7 bits ASCII x 6 symbols = 42 bits
3 bits per symbol index (2^3 = 8 >= 6 symbols) x 10 = 30 bits
Altogether 42 + 30 = 72 bits

If the symbol ASCII codes are stored on both the encoder and the decoder, only the 30 bits of symbol indices need to be transmitted.

After compression (e.g., with the Huffman code derived below): 24 bits

Theoretical lower bound
2.371 x 10 = 23.71 bits
2.371 bits/symbol is the Shannon entropy of the source, H = -sum_i p_i log2(p_i)
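
As a sanity check, here is a minimal Python sketch that computes this value from the symbol counts of "real array" (r and a occur 3 times each; e, l, y, and <space> once each):

    from collections import Counter
    from math import log2

    text = "real array"
    counts = Counter(text)
    n = len(text)

    # Shannon entropy: H = -sum(p_i * log2(p_i)) over all symbols
    H = -sum((c / n) * log2(c / n) for c in counts.values())
    print(H)      # ~2.371 bits/symbol
    print(H * n)  # ~23.71 bits for the whole message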

Variable-Length Coding

The Shannon-Fano algorithm is a top-down approach:
Sort the symbols according to the frequency counts of their occurrences.
Recursively divide the symbols into 2 parts, each with approximately the same total count, until all parts contain only 1 symbol.
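
A minimal Python sketch of this top-down split, choosing the split point that best balances the two halves' total counts (variants differ in how ties are handled):

    def shannon_fano(symbols):
        """symbols: list of (symbol, count) pairs, sorted by count, descending.
        Returns {symbol: codeword}."""
        if len(symbols) == 1:
            return {symbols[0][0]: ""}
        total = sum(c for _, c in symbols)
        # Find the split where the two parts' total counts are most nearly equal
        acc, split, best = 0, 1, total
        for i, (_, c) in enumerate(symbols[:-1]):
            acc += c
            if abs(2 * acc - total) < best:
                best, split = abs(2 * acc - total), i + 1
        codes = {s: "0" + w for s, w in shannon_fano(symbols[:split]).items()}
        codes.update({s: "1" + w for s, w in shannon_fano(symbols[split:]).items()})
        return codes

    freqs = [('r', 3), ('a', 3), ('e', 1), ('l', 1), ('y', 1), (' ', 1)]
    print(shannon_fano(freqs))  # e.g. {'r': '00', 'a': '01', 'e': '100', ...}

For "real array" this yields 2-bit codes for r and a and 3-bit codes for the other four symbols, i.e. 24 bits in total; Shannon-Fano is not always this close to optimal.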

Huffman Coding

The Huffman coding algorithm is a bottom-up approach:
Initialization: put all symbols on a list sorted according to their frequency counts.
Repeat until the list has only 1 symbol left:
From the list, pick the 2 symbols with the lowest frequency counts.
Form a Huffman subtree that has these 2 symbols as child nodes, and create a parent node for them.
Assign the sum of the children's frequency counts to the parent, and insert it into the list such that the order is maintained.
Delete the children from the list.
Assign a codeword to each leaf based on the path from the root.
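
A minimal Python sketch of this procedure, using a heap as the sorted list; one bit is prepended to each leaf's codeword at every merge, which is equivalent to reading off the root-to-leaf paths at the end (the tiebreak ids only keep heap comparisons well-defined):

    import heapq
    from collections import Counter

    def huffman_codes(symbols):
        """Build {symbol: codeword} from any iterable of symbols."""
        # Each heap entry: [total count, tiebreak id, [symbol, codeword], ...]
        heap = [[c, i, [s, ""]] for i, (s, c) in enumerate(Counter(symbols).items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            lo = heapq.heappop(heap)              # lowest-count node
            hi = heapq.heappop(heap)              # second-lowest-count node
            for leaf in lo[2:]:
                leaf[1] = "0" + leaf[1]           # left branch bit
            for leaf in hi[2:]:
                leaf[1] = "1" + leaf[1]           # right branch bit
            # Parent node: sum of the children's counts, children's leaves merged
            heapq.heappush(heap, [lo[0] + hi[0], next_id] + lo[2:] + hi[2:])
            next_id += 1
        return dict(heap[0][2:])

    codes = huffman_codes("real array")
    print(sum(len(codes[ch]) for ch in "real array"))   # 24 bits in total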

Properties of Huffman Coding

Unique prefix property: no code is a prefix of any other code.
Optimality: it is a minimum-redundancy code.
The 2 least frequent symbols have Huffman codes of the same length, differing only in the last bit.
Symbols that occur more frequently have shorter codes.

Store Huffman Codes

Header: the Huffman tree (or, equivalently, the code table) is stored in a header so that it is available to the decoder.
Given the tree, the decoder walks it through the encoded bitstream: start at the root, follow the branch indicated by each bit, output a symbol upon reaching a leaf, and restart from the root.
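
A minimal decoding sketch: because of the unique prefix property, codewords can be matched greedily against the bitstream. The code table here is one valid Huffman code for "real array", written out by hand for illustration:

    def huffman_decode(bits, codes):
        inverse = {cw: sym for sym, cw in codes.items()}
        out, buf = [], ""
        for b in bits:
            buf += b
            if buf in inverse:         # prefix property: the match is unambiguous
                out.append(inverse[buf])
                buf = ""
        return "".join(out)

    # One valid Huffman code table for "real array" (for illustration)
    codes = {'r': '00', 'a': '01', 'e': '100', 'l': '101', 'y': '110', ' ': '111'}
    encoded = "".join(codes[ch] for ch in "real array")   # 24 bits
    print(huffman_decode(encoded, codes))                 # real array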

Extended Huffman Coding

Group several symbols together and assign a single codeword to the group as a whole. Since codeword lengths are integers, plain Huffman coding can waste up to 1 bit per symbol relative to the entropy; grouping k symbols at a time reduces this overhead to at most 1/k bit per symbol.

Adaptive Huffman Coding

Huffman coding requires prior statistical knowledge about the information source, such as the probabilities of all symbols, which is often not available.
Adaptive Huffman coding gathers statistics and updates them dynamically as the data stream arrives.
The probabilities are no longer based on prior knowledge but on the actual data received so far.

Update: the Huffman tree must always maintain its sibling property: the counts of the nodes have to be in increasing order from left to right and from bottom to top.

An additional rule: if any character/symbol is to be sent for the first time, it must be preceded by a special symbol, NEW. The initial code for NEW is 0 (its count is never increased); hence it is always denoted NEW:(0). For example, to send "aab", the encoder transmits NEW plus the fixed (e.g., ASCII) code for 'a', then the current adaptive code for 'a', then NEW plus the fixed code for 'b'.

The code for a particular symbol changes during the adaptive Huffman coding process.
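
A naive Python sketch of the adaptive idea. Caveat: for brevity it rebuilds the whole Huffman table from the counts seen so far before coding each symbol, whereas real adaptive Huffman coding updates the tree incrementally while preserving the sibling property; the NEW rule follows the description above:

    import heapq
    from collections import Counter

    def codes_from_counts(counts):
        """Huffman table from a {symbol: count} dict (same procedure as above)."""
        heap = [[c, i, [s, ""]] for i, (s, c) in enumerate(counts.items())]
        heapq.heapify(heap)
        if len(heap) == 1:                 # only one known symbol: 1-bit code
            return {heap[0][2][0]: "0"}
        next_id = len(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            for leaf in lo[2:]:
                leaf[1] = "0" + leaf[1]
            for leaf in hi[2:]:
                leaf[1] = "1" + leaf[1]
            heapq.heappush(heap, [lo[0] + hi[0], next_id] + lo[2:] + hi[2:])
            next_id += 1
        return dict(heap[0][2:])

    def adaptive_encode(text):
        counts = Counter()
        out = []
        for ch in text:
            if ch in counts:
                out.append(codes_from_counts(counts)[ch])  # code from stats so far
            else:
                out.append(("NEW", ch))    # first occurrence: NEW, then raw symbol
            counts[ch] += 1                # update statistics after every symbol
        return out

    print(adaptive_encode("aab"))  # [('NEW', 'a'), '0', ('NEW', 'b')]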

Dictionary-based Coding

LZW

Lempel-Ziv-Welch (LZW) uses fixed-length codewords to represent variable-length strings of symbols/characters that commonly occur together, e.g., words in English text.
The LZW encoder and decoder build up the same dictionary dynamically while receiving the data.
LZW places longer and longer repeated entries into a dictionary and then emits the code for an element, rather than the string itself, if the element has already been placed in the dictionary.

Widely used, e.g., in Unix compress, GIF, and TIFF.
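
A minimal Python sketch of the LZW encoder (codes are emitted as plain integers; packing them into fixed-width bit fields is omitted). A matching decoder rebuilds the same dictionary from the code stream, staying one entry behind the encoder:

    def lzw_encode(data):
        # Initialize the dictionary with all single characters
        dictionary = {chr(i): i for i in range(256)}
        next_code = 256
        s, out = "", []
        for ch in data:
            if s + ch in dictionary:
                s += ch                          # keep extending the match
            else:
                out.append(dictionary[s])        # emit code for the longest match
                dictionary[s + ch] = next_code   # add the new string
                next_code += 1
                s = ch
        if s:
            out.append(dictionary[s])
        return out

    print(lzw_encode("abababab"))  # [97, 98, 256, 258, 98]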

Run-Length Coding (RLC)

Memoryless source: an information source whose symbols are independently distributed, i.e., the value of the current symbol does not depend on the values of previously appeared symbols.
Instead of assuming a memoryless source, RLC exploits the memory present in the information source.

Rationale for RLC: if the information source has the property that symbols tend to form continuous groups (runs), then each such symbol and the length of its group can be coded.
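
A minimal Python sketch in which each run is coded as a (symbol, run length) pair:

    def run_length_encode(symbols):
        if not symbols:
            return []
        out = []
        run_sym, run_len = symbols[0], 1
        for s in symbols[1:]:
            if s == run_sym:
                run_len += 1                      # the current run continues
            else:
                out.append((run_sym, run_len))    # run ended: emit symbol + length
                run_sym, run_len = s, 1
        out.append((run_sym, run_len))
        return out

    print(run_length_encode("aaaabbbcc"))  # [('a', 4), ('b', 3), ('c', 2)]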