Slide 1: 16.36 Communication Systems Engineering
Lecture 5: Source Coding
Eytan Modiano
Slide 2: Source coding
• Source symbols
  – Letters of alphabet, ASCII symbols, English dictionary, etc.
  – Quantized voice
• Channel symbols
  – In general can have an arbitrary number of channel symbols
  – Typically {0,1} for a binary channel
• Objectives of source coding
  – Unique decodability
  – Compression: encode the alphabet using the smallest average number of channel symbols
Diagram: Source alphabet {a_1 .. a_N}  ->  Encoder  ->  Channel alphabet {c_1 .. c_N}
Slide 3: Compression
• Lossless compression
  – Enables error-free decoding
  – Unique decodability without ambiguity
• Lossy compression
  – Code may not be uniquely decodable, but with very high probability can be decoded correctly
Slide 4: Prefix (free) codes
• A prefix code is a code in which no codeword is a prefix of any other codeword
  – Prefix codes are uniquely decodable
  – Prefix codes are instantaneously decodable
• The following important inequality applies to prefix codes and, in general, to all uniquely decodable codes
Kraft Inequality: Let n_1, ..., n_k be the lengths of the codewords in a prefix (or any uniquely decodable) code. Then,

    \sum_{i=1}^{k} 2^{-n_i} \le 1
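The inequality is easy to check numerically. A minimal Python sketch (my own illustration, not from the slides; the helper name kraft_sum is hypothetical):

```python
def kraft_sum(lengths):
    """Sum of 2**(-n_i) over the proposed codeword lengths n_i."""
    return sum(2.0 ** -n for n in lengths)

lengths = [2, 2, 2, 3, 3]          # the example lengths used later in the lecture
print(kraft_sum(lengths))          # 3*(1/4) + 2*(1/8) = 1.0
print(kraft_sum(lengths) <= 1)     # True => a prefix code with these lengths exists
```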
Slide 5: Proof of Kraft Inequality
• Proof only for prefix codes
  – Can be extended to all uniquely decodable codes
• Map codewords onto a binary tree
  – Codewords must be leaves on the tree
  – A codeword of length n_i is a leaf at depth n_i
• Let n_k \ge n_{k-1} \ge ... \ge n_1  =>  depth of tree = n_k
  – In a binary tree of depth n_k, up to 2^{n_k} leaves are possible (if all leaves are at depth n_k)
  – Each leaf at depth n_i < n_k eliminates a fraction 1/2^{n_i} of the leaves at depth n_k => eliminates 2^{n_k - n_i} of the leaves at depth n_k
  – Hence,

        \sum_{i=1}^{k} 2^{n_k - n_i} \le 2^{n_k}  =>  \sum_{i=1}^{k} 2^{-n_i} \le 1
Slide 6: Kraft Inequality - converse
• If a set of integers {n_1 .. n_k} satisfies the Kraft inequality, then a prefix code can be found with codeword lengths {n_1 .. n_k}
  – Hence the Kraft inequality is a necessary and sufficient condition for the existence of a uniquely decodable code
• Proof is by construction of a code
  – Given {n_1 .. n_k}, starting with n_1, assign a node at level n_i for the codeword of length n_i. The Kraft inequality guarantees that the assignment can be made.

Example: n = {2,2,2,3,3} (verify that the Kraft inequality holds!)
Diagram: binary tree with the five codewords assigned as leaves at depths n_1, ..., n_5
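The construction is mechanical enough to code up. A sketch (my own, assuming lengths that satisfy the Kraft inequality; prefix_code_from_lengths is a hypothetical name) using the standard canonical assignment of leaves in order of increasing depth:

```python
def prefix_code_from_lengths(lengths):
    """Canonical construction: assign codewords in order of increasing length."""
    if sum(2.0 ** -n for n in lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    codewords = []
    value, prev = 0, 0            # next free node, expressed at depth `prev`
    for n in sorted(lengths):
        value <<= n - prev        # descend to depth n in the binary tree
        codewords.append(format(value, f"0{n}b"))
        value += 1                # skip past the subtree rooted at this leaf
        prev = n
    return codewords

print(prefix_code_from_lengths([2, 2, 2, 3, 3]))
# ['00', '01', '10', '110', '111'] -- a valid prefix code
```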
Slide 7: Average codeword length
• The Kraft inequality does not tell us anything about the average length of a codeword. The following theorem gives a tight lower bound.

Theorem: Given a source with alphabet {a_1 .. a_k}, probabilities {p_1 .. p_k}, and entropy H(X), the average length \bar{n} of a uniquely decodable binary code satisfies:

    \bar{n} \ge H(X)
Proof:

    H(X) - \bar{n} = \sum_{i=1}^{k} p_i \log(1/p_i) - \sum_{i=1}^{k} p_i n_i
                   = \sum_{i=1}^{k} p_i \log(1/p_i) + \sum_{i=1}^{k} p_i \log(2^{-n_i})
                   = \sum_{i=1}^{k} p_i \log( 2^{-n_i} / p_i )
                   \le \log( \sum_{i=1}^{k} p_i \cdot 2^{-n_i} / p_i )    (by concavity of the log: Jensen's inequality)
                   = \log( \sum_{i=1}^{k} 2^{-n_i} ) \le \log(1) = 0      (Kraft inequality)

    => H(X) \le \bar{n}
Slide 8: Average codeword length
• Can we construct codes that come close to H(X)?

Theorem: Given a source with alphabet {a_1 .. a_k}, probabilities {p_1 .. p_k}, and entropy H(X), it is possible to construct a prefix (hence uniquely decodable) code of average length satisfying:

    \bar{n} < H(X) + 1
Proof (Shannon-Fano codes): Let

    n_i = \lceil \log(1/p_i) \rceil    (the smallest integer \ge \log(1/p_i))

Then

    n_i \ge \log(1/p_i)  =>  2^{-n_i} \le p_i  =>  \sum_{i=1}^{k} 2^{-n_i} \le \sum_{i=1}^{k} p_i = 1

=> Kraft inequality satisfied! Can find a prefix code with lengths n_i.

Now, n_i = \lceil \log(1/p_i) \rceil < \log(1/p_i) + 1. Hence,

    \bar{n} = \sum_{i=1}^{k} p_i n_i < \sum_{i=1}^{k} p_i ( \log(1/p_i) + 1 ) = H(X) + 1

Hence, H(X) \le \bar{n} < H(X) + 1.
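To make the construction concrete, a short Python sketch (mine, not from the slides) that computes the Shannon-Fano lengths and checks the bound on the example from Slide 11:

```python
import math

def shannon_fano_lengths(probs):
    """n_i = ceil(log2(1/p_i)); these lengths always satisfy Kraft."""
    return [math.ceil(math.log2(1 / p)) for p in probs]

p = [0.3, 0.25, 0.25, 0.1, 0.1]               # the example on Slide 11
n = shannon_fano_lengths(p)                    # [2, 2, 2, 4, 4]
n_bar = sum(pi * ni for pi, ni in zip(p, n))   # 2.4 bits/symbol
H = -sum(pi * math.log2(pi) for pi in p)       # 2.1855... bits/symbol
print(n, n_bar, H, H <= n_bar < H + 1)         # ... True
```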
Slide 9: Getting Closer to H(X)
• Consider blocks of N source letters
  – There are K^N possible N-letter blocks (N-tuples)
  – Let Y be the "new" source alphabet of N-letter blocks
  – If each of the letters is independently generated,
        H(Y) = H(x_1 .. x_N) = N*H(X)
• Encode Y using the same procedure as before to obtain

    H(Y) \le \bar{n}_y < H(Y) + 1  =>  N*H(X) \le \bar{n}_y < N*H(X) + 1  =>  H(X) \le \bar{n}_y / N < H(X) + 1/N

  where the last inequality is obtained because each letter of Y corresponds to N letters of the original source
• We can now take the block length (N) to be arbitrarily large and get arbitrarily close to H(X)
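A small numerical illustration (my own; the i.i.d. binary source with P(1) = 0.1 is an assumed example, not from the slides) of how the per-letter rate of Shannon-Fano coding on blocks approaches H(X):

```python
import math
from itertools import product

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

p1 = 0.1                          # assumed example: i.i.d. bits with P(1) = 0.1
H = entropy([p1, 1 - p1])

for N in (1, 2, 4, 8):
    # Probability of each of the 2^N possible N-letter blocks.
    block_probs = [p1 ** sum(blk) * (1 - p1) ** (N - sum(blk))
                   for blk in product((0, 1), repeat=N)]
    # Shannon-Fano lengths on the block alphabet Y, then the per-letter rate.
    n_bar = sum(p * math.ceil(math.log2(1 / p)) for p in block_probs)
    print(f"N={N}: rate {n_bar / N:.4f} bits/letter  (H(X) = {H:.4f})")
```

The printed rates are squeezed into [H(X), H(X) + 1/N), so they shrink toward H(X) as N grows.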
Slide 10: Huffman codes
• Huffman codes are special prefix codes that can be shown to be optimal (minimize average codeword length)

Huffman Algorithm:
1) Arrange the source letters in decreasing order of probability (p_1 \ge p_2 \ge ... \ge p_k)
2) Assign '0' to the last digit of X_k and '1' to the last digit of X_{k-1}
3) Combine p_k and p_{k-1} to form a new set of probabilities {p_1, p_2, ..., p_{k-2}, (p_{k-1} + p_k)}
4) If left with just one letter, then done; otherwise go to step 1 and repeat
Diagram: number line from H(X) to H(X)+1; both Huffman and Shannon/Fano average lengths fall in this interval, with Huffman codes at least as close to H(X).
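A compact sketch of the algorithm (my own illustration; the slides give no code), using Python's heapq to repeatedly merge the two least likely groups:

```python
import heapq

def huffman_code(probs):
    """Return a dict letter -> codeword for a {letter: probability} map."""
    # Each heap entry: (probability, tie-breaker, {letter: partial codeword}).
    heap = [(p, i, {a: ""}) for i, (a, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)   # least probable group: prepend '0'
        p1, _, group1 = heapq.heappop(heap)   # next least probable: prepend '1'
        merged = {a: "0" + c for a, c in group0.items()}
        merged.update({a: "1" + c for a, c in group1.items()})
        heapq.heappush(heap, (p0 + p1, tie, merged))
        tie += 1
    return heap[0][2]

p = {"a1": 0.3, "a2": 0.25, "a3": 0.25, "a4": 0.1, "a5": 0.1}
code = huffman_code(p)
print(code)                                  # one optimal code, lengths {2,2,2,3,3}
print(sum(p[a] * len(code[a]) for a in p))   # average length 2.2 bits/symbol
```

Huffman codes are not unique: depending on tie-breaking, the codewords may differ from the table on the next slide, but the codeword lengths and the 2.2 bits/symbol average are the same here.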
Slide 11: Huffman code example
A = {a_1, a_2, a_3, a_4, a_5} and p = {0.3, 0.25, 0.25, 0.1, 0.1}
Diagram (Huffman merges): {0.3, 0.25, 0.25, 0.1, 0.1} -> combine 0.1 + 0.1 = 0.2 -> {0.3, 0.25, 0.25, 0.2} -> combine 0.25 + 0.2 = 0.45 -> {0.3, 0.25, 0.45} -> combine 0.3 + 0.25 = 0.55 -> {0.55, 0.45} -> combine 0.55 + 0.45 = 1.0. At each merge, '0' and '1' label the two combined branches.
Letter   Codeword
a_1      11
a_2      10
a_3      01
a_4      001
a_5      000
    \bar{n} = 2 x 0.8 + 3 x 0.2 = 2.2 bits/symbol

    H(X) = \sum_{i} p_i \log(1/p_i) = 2.1855 bits/symbol

Shannon/Fano codes: n_i = \lceil \log(1/p_i) \rceil  =>  n_1 = n_2 = n_3 = 2, n_4 = n_5 = 4, \bar{n} = 2.4 bits/symbol

    => H(X) \le \bar{n} < H(X) + 1
Slide 12: Lempel-Ziv Source Coding
• Source statistics are often not known
• Most sources are not independent
  – Letters of the alphabet are highly correlated
    E.g., E often follows I, H often follows G, etc.
• One can code "blocks" of letters, but that would require a very large and complex code
• Lempel-Ziv Algorithm
  – "Universal code" - works without knowledge of source statistics
  – Parse input file into unique phrases
  – Encode phrases using fixed-length codewords
    (variable-to-fixed length encoding)
Slide 13: Lempel-Ziv Algorithm
• Parse the input file into phrases that have not yet appeared
  – Enter each new phrase into a dictionary
  – Number their locations
• Notice that each new phrase must be an older phrase followed by a '0' or a '1'
  – Can encode the new phrase using the dictionary location of the previous phrase followed by the '0' or '1' (see the sketch below)
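A sketch of the parser/encoder (my own illustration; lz_encode is a hypothetical name), using fixed 4-bit dictionary locations as in the example on the next slide:

```python
def lz_encode(bits, loc_bits=4):
    """Parse a bit string into new phrases; emit (dictionary location, new bit)."""
    dictionary = {"": 0}          # location 0 holds the null phrase
    codewords = []
    phrase = ""
    for b in bits:
        if phrase + b in dictionary:
            phrase += b           # keep extending until the phrase is new
        else:
            codewords.append(format(dictionary[phrase], f"0{loc_bits}b") + b)
            dictionary[phrase + b] = len(dictionary)
            phrase = ""           # start parsing the next phrase
    # (a leftover partial phrase would still need flushing in general)
    return codewords

print(lz_encode("0010110111000101011110"))
# ['00000', '00011', '00101', '00111', '00010', '00100', '00001', '01001']
```

The output matches the sent sequence on the next slide.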
Slide 14: Lempel-Ziv Example
Input: 0010110111000101011110
Parsed phrases: 0, 01, 011, 0111, 00, 010, 1, 01111

Dictionary:

Loc   Binary rep   Phrase   Codeword   Comment
0     0000         null     -          -
1     0001         0        0000 0     loc-0 + '0'
2     0010         01       0001 1     loc-1 + '1'
3     0011         011      0010 1     loc-2 + '1'
4     0100         0111     0011 1     loc-3 + '1'
5     0101         00       0001 0     loc-1 + '0'
6     0110         010      0010 0     loc-2 + '0'
7     0111         1        0000 1     loc-0 + '1'
8     1000         01111    0100 1     loc-4 + '1'
Sent sequence: 00000 00011 00101 00111 00010 00100 00001 01001
Slide 15: Notes about Lempel-Ziv
• Decoder can uniquely decode the sent sequence (a decoder sketch follows below)
• Algorithm clearly inefficient for short sequences (input data)
• Code rate approaches the source entropy for large sequences
• Dictionary size must be chosen in advance so that the length of the codeword can be established
• Lempel-Ziv is widely used for encoding binary/text files
  – compress/uncompress under Unix
  – Similar compression software for PCs and Macs
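For completeness, a matching decoder sketch (again my own illustration, paired with the lz_encode sketch above): it rebuilds the dictionary on the fly, one fixed-length codeword at a time.

```python
def lz_decode(sent_words, loc_bits=4):
    """Invert lz_encode: rebuild each phrase as old phrase + new bit."""
    dictionary = [""]             # location 0 is the null phrase
    out = []
    for word in sent_words:
        loc = int(word[:loc_bits], 2)         # dictionary location of old phrase
        phrase = dictionary[loc] + word[loc_bits]
        dictionary.append(phrase)             # new phrase gets the next location
        out.append(phrase)
    return "".join(out)

sent = ["00000", "00011", "00101", "00111", "00010", "00100", "00001", "01001"]
print(lz_decode(sent))            # 0010110111000101011110 (the original input)
```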