16.36: Communication Systems Engineering
Lecture 5: Source Coding
Eytan Modiano

Slide 2: Source coding

• Source symbols
  – Letters of the alphabet, ASCII symbols, English dictionary, etc.
  – Quantized voice
• Channel symbols
  – In general can have an arbitrary number of channel symbols
  – Typically {0,1} for a binary channel
• Objectives of source coding
  – Unique decodability
  – Compression: encode the alphabet using the smallest average number of channel symbols

Source alphabet {a_1, ..., a_N} → Encoder → Channel alphabet {c_1, ..., c_N}

Slide 3: Compression

• Lossless compression
  – Enables error-free decoding
  – Uniquely decodable without ambiguity
• Lossy compression
  – Code may not be uniquely decodable, but with very high probability can be decoded correctly

Slide 4: Prefix (free) codes

• A prefix code is a code in which no codeword is a prefix of any other codeword
  – Prefix codes are uniquely decodable
  – Prefix codes are instantaneously decodable
• The following important inequality applies to prefix codes and, in general, to all uniquely decodable codes

Kraft inequality: Let n_1, ..., n_k be the lengths of the codewords of a prefix (or any uniquely decodable) code. Then

$$\sum_{i=1}^{k} 2^{-n_i} \leq 1.$$

Slide 5: Proof of Kraft Inequality

• Proof given only for prefix codes
  – Can be extended to all uniquely decodable codes
• Map the codewords onto a binary tree
  – Codewords must be leaves of the tree
  – A codeword of length n_i is a leaf at depth n_i
• Let n_k ≥ n_{k-1} ≥ ... ≥ n_1, so the depth of the tree is n_k
  – In a binary tree of depth n_k, up to 2^{n_k} leaves are possible (if all leaves are at depth n_k)
  – Each leaf at depth n_i < n_k eliminates a fraction 2^{-n_i} of the possible leaves at depth n_k, i.e., it eliminates 2^{n_k - n_i} of them
  – Hence,

$$\sum_{i=1}^{k} 2^{\,n_k - n_i} \leq 2^{\,n_k} \;\Rightarrow\; \sum_{i=1}^{k} 2^{-n_i} \leq 1.$$

Slide 6: Kraft Inequality, converse

• If a set of integers {n_1, ..., n_k} satisfies the Kraft inequality, then a prefix code can be found with codeword lengths {n_1, ..., n_k}
  – Hence the Kraft inequality is a necessary and sufficient condition for the existence of a uniquely decodable code
• Proof is by construction of a code
  – Given {n_1, ..., n_k} in increasing order, starting with n_1, assign a node at level n_i to the codeword of length n_i; the Kraft inequality guarantees that the assignment can always be made
  – Example: n = {2, 2, 2, 3, 3}; verify that the Kraft inequality holds: 3(1/4) + 2(1/8) = 1 ≤ 1

Slide 7: Average codeword length

• The Kraft inequality does not tell us anything about the average length of a codeword. The following theorem gives a tight lower bound.

Theorem: Given a source with alphabet {a_1, ..., a_k}, probabilities {p_1, ..., p_k}, and entropy H(X), the average length $\bar{n} = \sum_i p_i n_i$ of any uniquely decodable binary code satisfies $\bar{n} \geq H(X)$.

Proof:

$$H(X) - \bar{n} = \sum_{i=1}^{k} p_i \log_2\frac{1}{p_i} - \sum_{i=1}^{k} p_i n_i = \sum_{i=1}^{k} p_i \log_2\frac{2^{-n_i}}{p_i} \leq \log_2(e)\sum_{i=1}^{k} p_i\left(\frac{2^{-n_i}}{p_i} - 1\right) = \log_2(e)\left(\sum_{i=1}^{k} 2^{-n_i} - 1\right) \leq 0,$$

where the first inequality uses $\ln x \leq x - 1$ and the last step uses the Kraft inequality. Hence $\bar{n} \geq H(X)$.

Slide 8: Average codeword length

• Can we construct codes that come close to H(X)?

Theorem: Given a source with alphabet {a_1, ..., a_k}, probabilities {p_1, ..., p_k}, and entropy H(X), it is possible to construct a prefix (hence uniquely decodable) code of average length satisfying $\bar{n} < H(X) + 1$.

Proof (Shannon-Fano codes): Let

$$n_i = \left\lceil \log_2\frac{1}{p_i} \right\rceil \;\Rightarrow\; 2^{-n_i} \leq p_i \;\Rightarrow\; \sum_{i=1}^{k} 2^{-n_i} \leq \sum_{i=1}^{k} p_i = 1,$$

so the Kraft inequality is satisfied and a prefix code with lengths n_i can be found. Moreover,

$$n_i < \log_2\frac{1}{p_i} + 1 \;\Rightarrow\; \bar{n} = \sum_{i=1}^{k} p_i n_i < \sum_{i=1}^{k} p_i\left(\log_2\frac{1}{p_i} + 1\right) = H(X) + 1.$$

Hence $H(X) \leq \bar{n} < H(X) + 1$.
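To make the Shannon-Fano construction on Slides 7-8 concrete, here is a minimal Python sketch (not part of the original slides) that computes the lengths $n_i = \lceil \log_2(1/p_i) \rceil$, checks the Kraft inequality, and compares the average length against H(X). The probabilities are borrowed from the Huffman example on Slide 11; the function names are my own.

```python
import math

def shannon_fano_lengths(probs):
    """Codeword lengths n_i = ceil(log2(1/p_i)), as in the Slide 8 construction."""
    return [math.ceil(math.log2(1.0 / p)) for p in probs]

def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality: sum of 2^(-n_i)."""
    return sum(2.0 ** -n for n in lengths)

def entropy(probs):
    """Source entropy H(X) in bits."""
    return -sum(p * math.log2(p) for p in probs)

# Probabilities from the Slide 11 example (used here purely for illustration).
p = [0.3, 0.25, 0.25, 0.1, 0.1]
n = shannon_fano_lengths(p)
avg = sum(pi * ni for pi, ni in zip(p, n))

print("lengths:", n)                  # [2, 2, 2, 4, 4]
print("Kraft sum:", kraft_sum(n))     # 0.875 <= 1, so a prefix code exists
print("H(X) = %.4f <= avg = %.2f < H(X) + 1" % (entropy(p), avg))
```

Running it confirms the theorem numerically: the Kraft sum is at most 1, so a prefix code with these lengths exists, and the average length 2.4 lies between H(X) = 2.1855 and H(X) + 1.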
Slide 9: Getting Closer to H(X)

• Consider blocks of N source letters
  – There are K^N possible N-letter blocks (N-tuples)
  – Let Y be the "new" source alphabet of N-letter blocks
  – If each of the letters is independently generated, H(Y) = H(x_1, ..., x_N) = N·H(X)
• Encode Y using the same procedure as before to obtain

$$H(Y) \leq \bar{n}_y < H(Y) + 1 \;\Rightarrow\; N\,H(X) \leq \bar{n}_y < N\,H(X) + 1 \;\Rightarrow\; H(X) \leq \bar{n}_y/N < H(X) + \frac{1}{N},$$

  where the last step divides by N because each letter of Y corresponds to N letters of the original source, so $\bar{n}_y/N$ is the average number of bits per original source letter.
• We can now take the block length N to be arbitrarily large and get arbitrarily close to H(X).

Slide 10: Huffman codes

• Huffman codes are special prefix codes that can be shown to be optimal (they minimize the average codeword length); a sketch of the algorithm in code follows the example on the next slide.

Huffman algorithm:
1) Arrange the source letters in decreasing order of probability (p_1 ≥ p_2 ≥ ... ≥ p_k)
2) Assign '0' to the last digit of X_k and '1' to the last digit of X_{k-1}
3) Combine p_k and p_{k-1} to form a new set of probabilities {p_1, p_2, ..., p_{k-2}, (p_{k-1} + p_k)}
4) If left with just one letter, then done; otherwise go to step 1 and repeat

[Figure: the interval from H(X) to H(X)+1 on the average-length axis; both Shannon-Fano and Huffman codes fall in this interval, with Huffman codes closer to H(X).]

Slide 11: Huffman code example

A = {a_1, a_2, a_3, a_4, a_5} and p = {0.3, 0.25, 0.25, 0.1, 0.1}

Merging steps: 0.1 + 0.1 = 0.2; 0.2 + 0.25 = 0.45; 0.25 + 0.3 = 0.55; 0.45 + 0.55 = 1.0

Letter   Codeword
a_1      11
a_2      10
a_3      01
a_4      001
a_5      000

$$\bar{n} = 2 \times 0.8 + 3 \times 0.2 = 2.2 \text{ bits/symbol}, \qquad H(X) = \sum_{i=1}^{5} p_i \log_2\frac{1}{p_i} = 2.1855$$

For comparison, a Shannon-Fano code uses $n_i = \lceil \log_2(1/p_i) \rceil$, giving $n_1 = n_2 = n_3 = 2$, $n_4 = n_5 = 4$, and $\bar{n} = 2.4$ bits/symbol. Both codes satisfy $H(X) \leq \bar{n} < H(X) + 1$.
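The following Python sketch (my addition, not from the slides) implements the merging procedure of Slide 10 with a priority queue and reproduces the Slide 11 example. The exact bit assignments depend on how ties between equal probabilities are broken, but the codeword lengths, and hence the average length, match the slide.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Build a Huffman code from a dict mapping symbol -> probability.
    Returns a dict mapping symbol -> codeword (string of '0'/'1')."""
    tiebreak = count()  # unique counter so the heap never compares node dicts
    heap = [(p, next(tiebreak), {"sym": s}) for s, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Steps 2-3 of the algorithm: merge the two least-probable subtrees,
        # giving the smaller one bit '0' and the other bit '1'.
        p0, _, n0 = heapq.heappop(heap)
        p1, _, n1 = heapq.heappop(heap)
        heapq.heappush(heap, (p0 + p1, next(tiebreak), {"0": n0, "1": n1}))
    code = {}
    def walk(node, prefix):
        if "sym" in node:
            code[node["sym"]] = prefix or "0"  # degenerate one-symbol alphabet
        else:
            walk(node["0"], prefix + "0")
            walk(node["1"], prefix + "1")
    walk(heap[0][2], "")
    return code

# Alphabet and probabilities from the Slide 11 example.
p = {"a1": 0.3, "a2": 0.25, "a3": 0.25, "a4": 0.1, "a5": 0.1}
code = huffman_code(p)
avg = sum(p[s] * len(w) for s, w in code.items())
print(code)
print("avg length:", avg)  # 2.2 bits/symbol, as on the slide
```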
Slide 12: Lempel-Ziv Source coding

• Source statistics are often not known
• Most sources are not independent
  – Letters of the alphabet are highly correlated: e.g., E often follows I, H often follows G, etc.
• One can code "blocks" of letters, but that would require a very large and complex code
• Lempel-Ziv algorithm
  – A "universal code": works without knowledge of the source statistics
  – Parse the input file into unique phrases
  – Encode the phrases using fixed-length codewords
  – Variable-to-fixed length encoding

Slide 13: Lempel-Ziv Algorithm

• Parse the input file into phrases that have not yet appeared
  – Enter each new phrase into a dictionary
  – Number its location
• Notice that each new phrase must be an older phrase followed by a '0' or a '1'
  – So the new phrase can be encoded as the dictionary location of the previous phrase followed by the '0' or '1'

Slide 14: Lempel-Ziv Example

Input: 0010110111000101011110
Parsed phrases: 0, 01, 011, 0111, 00, 010, 1, 01111

Dictionary:

Loc  Binary rep  Phrase  Codeword  Comment
0    0000        null
1    0001        0       0000 0    loc-0 + '0'
2    0010        01      0001 1    loc-1 + '1'
3    0011        011     0010 1    loc-2 + '1'
4    0100        0111    0011 1    loc-3 + '1'
5    0101        00      0001 0    loc-1 + '0'
6    0110        010     0010 0    loc-2 + '0'
7    0111        1       0000 1    loc-0 + '1'
8    1000        01111   0100 1    loc-4 + '1'

Sent sequence: 00000 00011 00101 00111 00010 00100 00001 01001

(A runnable sketch of this parsing procedure appears at the end of these notes.)

Slide 15: Notes about Lempel-Ziv

• The decoder can uniquely decode the sent sequence
• The algorithm is clearly inefficient for short sequences (input data)
• The code rate approaches the source entropy for long sequences
• The dictionary size must be chosen in advance so that the length of the codewords can be established
• Lempel-Ziv is widely used for encoding binary and text files
  – compress/uncompress under Unix
  – Similar compression software exists for PCs and Macs
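As a sanity check on Slides 13-14, here is a short Python sketch of the parsing step (my own illustration, with a hypothetical `lz_encode` helper). Run on the slide's input with 4-bit locations, it reproduces the sent sequence exactly.

```python
def lz_encode(bits, width=4):
    """Lempel-Ziv parsing as on Slides 13-14: each new phrase is a known
    phrase plus one bit, sent as (dictionary location, new bit).
    `width` is the fixed number of bits per location, chosen in advance."""
    dictionary = {"": 0}   # location 0 holds the null phrase
    phrase, out = "", []
    for b in bits:
        phrase += b
        if phrase not in dictionary:
            prev = dictionary[phrase[:-1]]        # location of the known prefix
            dictionary[phrase] = len(dictionary)  # next free location
            out.append(format(prev, "0%db" % width) + b)
            phrase = ""
    # NOTE: a trailing phrase already in the dictionary is simply dropped here;
    # a complete coder would flush it using an agreed-upon convention.
    return out

encoded = lz_encode("0010110111000101011110")
print(" ".join(encoded))
# 00000 00011 00101 00111 00010 00100 00001 01001
```

The fixed `width` parameter reflects the note on Slide 15 that the dictionary size must be fixed in advance so the decoder knows the codeword length.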