1
清华大学 软件学院 宋斌恒 1
Lecture 5. Greedy Algorithm
?Huffman code
?Matroid( 拟阵 )
清华大学 软件学院 宋斌恒 2
Huffman Codes
?Huffman codes用于数据压缩一般可以达到
压缩 20% 到 90%的效果 。
?什么是编码 : 编码实质上是用给定的一组
符号集的字符串到另外一组符号集的映
射 。
?例如用 {0,1}字符串到所有英文字符 {a,b,c,… }
的映射的 8位码
清华大学 软件学院 宋斌恒 3
Prefix codes
?我们只考虑任一个编码字不是另一个编码
字的 前缀 的编码 。
?这样的编码叫 Prefix Codes。
?下页给出几种编码例子 :
a b c d e f
Frequency (in thousands) 45 13 12 16 9 5
Fixed-length codeword 000 001 010 011 100 101
Variable-length codeword 0 101 100 111 1101 1100
Figure 16.3 A character-coding problem. A data
file of 100,000 characters contains only the
characters a-f, with the frequencies indicated. If
each character is assigned a 3-bit codeword, the
file can be encoded in 300,000 bits. Using the
variable-length code shown, the file can be
encoded in 224,000 bits.
清华大学 软件学院 宋斌恒 5
Prefix codes are desirable
because they simplify decoding.
?示例 : 在上图非定长编码下 , 串
0001110101唯一代表字符串 aaaeab
清华大学 软件学院 宋斌恒 6
编码表示方式
?利用树来表示 , 下图表示等长编码 :
100
86
58 28
a:45 b:13 c:12 d:16 e:9 f:5
14
14
2
清华大学 软件学院 宋斌恒 7
An optimal code for a file is
always represented by a full
binary tree.
100
86
58 28
a:45 b:13 c:12 d:16 e:9 f:5
14
14
清华大学 软件学院 宋斌恒 8
Given a tree T corresponding to a prefix code, it
is a simple matter to compute the number of bits
required to encode a file. Let c denote the character
in alphabet C, f(c) denote the frequency of c
in the file, dT(c) the depth of c in the tree. The
number of bits required to encode a file is thus
.cost
)()()(
Ttreetheoftheasdefinewewhich
cdcfTB
Cc
T∑
∈
=
Figure 16.4 Trees corresponding to the coding schemes in
Figure 16.3. Each leaf is labeled with a character and its
frequency of occurrence. Each internal node is labeled with
the sum of the frequencies of the leaves in its subtree .
(a)The tree corresponding to the fixed - length code a=000, … ,f=101.(b)The tree corresponding to the optimal prefix
code a=0,b=101, … ,f=1100.
100
86
58 28
a:45 b:13 c:12 d:16 e:9 f:5
14
14
100
55
30
d:16
25
14
a:45
c:12 b:13
f:5 e:9
清华大学 软件学院 宋斌恒 10
Constructing a Huffman code
?Huffman invented a greedy algorithm that
constructs an optimal prefix code called a
Huffman code.
?下面我们给出 Huffman code算法的伪代码
清华大学 软件学院 宋斌恒 11
1. Huffman(C)
1. n?|C|
2. Q?C //Q is a priority queue =new Q(C)
3. For i?1 to n-1
1. Do allocate a new node z
2. Left[z]?x?Q.Extract-Min()
3. Right[z]?y?Q.Extract-Min()
4. f[z]?f[x]+f[y]
5. Q.insert(z)
4. Return Q.Extract-Min()
清华大学 软件学院 宋斌恒 12
f:5 e:9 c:12 b:13 d:16 a:45
3
清华大学 软件学院 宋斌恒 13
14c:12 b:13
f:5 e:9
d:16 a:45
清华大学 软件学院 宋斌恒 14
d:1614 a:45
f:5 e:9
25
c:12 b:13
清华大学 软件学院 宋斌恒 15
30
d:16
25
14
a:45
c:12 b:13
f:5 e:9
清华大学 软件学院 宋斌恒 16
55
30
d:16
25
14
a:45
c:12 b:13
f:5 e:9
清华大学 软件学院 宋斌恒 17
100
55
30
d:16
25
14
a:45
c:12 b:13
f:5 e:9
清华大学 软件学院 宋斌恒 18
最小频率者最优编码可等长
?Let C be an alphabet in which each character c
∈ C has frequency f [c].Let x and y be two
characters in C having the lowest frequencies.
? Then there exists an optimal prefix code for C
in which the code words for x and y have the
same length and differ only in the last bit.
4
清华大学 软件学院 宋斌恒 19
证明
?设 T是任意一个最优编码对应的树 , a, b是
深度最深的一对兄弟 【 为什么存在 ? 】 ,
然后交换 x、 a; y、 b; 【 见下图 】
?T’’是最优编码对应的树
清华大学 软件学院 宋斌恒 20
y
a b
x
y
x b
a a
b
x y
清华大学 软件学院 宋斌恒 21
上图不同树的成本之差
∑∑
∈∈
?=?
Cc
T
Cc
T cdcfcdcfTBTB )()()()()'()( '
[ ] ( ) [ ] ( ) [ ] ( ) [ ] ( )adafxdxfadafxdxf TTTT '' ??+=
[ ] [ ]( ) ( ) ( )( )xdadxfaf TT ??=
[ ] ( ) [ ] ( ) [ ] ( ) [ ] ( )xdafadxfadafxdxf TTTT ??+=
,0≥
清华大学 软件学院 宋斌恒 22
Huffman的贪心选择律
?设 C是给定的字符集 , f[c]是 C中 元素 c的频
率 。 设 x, y是 C中频率最小的两个元素 。 令
C’= C- {x,y}+{z}, 其中 z是一个新的元素 ,
即 z不属于 C。 我们在 C’上 更新 f, 使得
f[z]=f[x]+f[y], 其它值不变 。 则如果 T’是 C’上
的任意一个最优前缀编码对应的树 , 则把 z
用一棵由 x和 y为叶子点的树代替而生成的
新树 T是 C上的一个最优编码对应的树 。
清华大学 软件学院 宋斌恒 23
证明提示
1. 树 T和 T’编码成本关系 B(T)-B(T’)=f[x]+f[y]
2. 反证法 , 假设 T不是最优编码树 èExists
T’’ such that B(T’’)<B(T), which has x and
y are the deepest siblings, èConstruct
T’’’ to represents a prefix code for
C’èT’’’ is a better coding than T’, a
contradiction
清华大学 软件学院 宋斌恒 24
Huffman编码小结
?前缀码
?最优码的特征
?最小频率的优化码可最长化
?贪婪选择律
?贪婪算法
5
清华大学 软件学院 宋斌恒 25
Matroid( 拟阵 )
?Theoretical Foundation for Greedy
Algorithms.
?Its application
清华大学 软件学院 宋斌恒 26
线性代数中的线性无关概念回顾
?一组向量 : a={1,0,0}, b={1,1,0}, c={0,2,0},
d={1,1,1}, e={0,0,0}, f={0,1,0}
?所有线性无关向量组 构成的集合
? l ={φ,{a}, {b}, {c}, {d}, {f}, {a,b}, {a,c}, {a,d},
{a,f}, {}, {},……}
? l的特征 : 如果 {a,b}属于 l则 {a}, {b}都 属于 l
清华大学 软件学院 宋斌恒 27
Matroids
? Definition: A matroid is an ordered pair
M = (S, l )
satisfying the following 3 conditions:
1. S is a finite nonempty set
清华大学 软件学院 宋斌恒 28
拟阵定义 ( 续 )
1. l is a nonempty family of subsets of S,
called the independent subsets of S, such
that if B ∈ l and A ? B, then A ∈ l . We say
that l is hereditary if it satisfies this property.
Note that empty set is necessarily a member
of l .
2. If A in l , B in l , and |A|<|B|, then there is
some element x ∈ B-A such that A U {x} ∈ l .
We say that M satisfies the exchange
property
清华大学 软件学院 宋斌恒 29
Concepts detail
1. l is a set whose elements are subset of
S
2. Every element in l is a set whose
elements are independent.
3. A subset of independent elements must
be independent too.
4. If one independent set is small than
another, then it can be expands to the
same size with elements in the other set
清华大学 软件学院 宋斌恒 30
Example 1. Linear algebra
? Let S={a1, a2, a3, … , an} be a set of n
independent vectors, and l ={all subsets of S}.
1. Proof M={S, l } is a Matroid!
1. The element in l is independent subset of S.
1. Let A ∈ l and B is a subset of A, It is trivial that B is in
l.
2. If |A|<|B|, then there exists an element b in B-A such
that A U {b} is an element of l.
1. Its trivial too!
6
清华大学 软件学院 宋斌恒 31
Example 2. Linear algebra
?Let S={a1, a2, a3, … , an} be a set of n
vectors, and l ={A is subsets of S: A’s
elements are linearly independent }.
Proof M={S, l } is a Matroid too!
?This proof is somewhat harder than the
one for the last example. 如何利用线性代
数的知识来证明 ?
清华大学 软件学院 宋斌恒 32
Example 3. Graph
?S = {all edges in an undirected graph G}
? l = {A?S: A defines a sub-graph GA={V, A}
such that GA is acyclic, that is GA is a forest}
?M= {S, l }
?定理 : M={S, l } is a Matroid!
清华大学 软件学院 宋斌恒 33
证明
1. Hereditary: is easy to prove, 证明什
么 ?
2. Exchange property: needs some tricks!
1. 设 A和 B是 l 的 元素 , 且 |A|<|B|, then GA and
GB are two forests of Graph G,
2. GB has less tree than in GA .
3. ? How many trees are there in GA?
清华大学 软件学院 宋斌恒 34
– Pigeon hole principle: There exists one tree T
in GB such that it intersects more than one
tree in GA .【 GB中树少 】
– In tree T, there exists at least an edges e
whose vertices u and v belongs two trees in
GA【 保证下一步加入 (u,v)后无环 】
– A U {(u,v)} is another subset of S, and it
defines a subgraph such that it is acyclic.
清华大学 软件学院 宋斌恒 35
Properties of Matriod
?Theorem 16.6: All maximal independent
subsets in a matroid have the same size.
?The prove is trivial.
清华大学 软件学院 宋斌恒 36
Illustration of the same size of
maximum independent subset
Properties
?A graphic matroid for a connected,
undirected graph G. Every independent
subset of MG must be a free tree with
exactly |V|-1 edges that connect all the
vertices of G. Such a tree is called a
spanning tree of G.
?How to do in linear algebra?
7
清华大学 软件学院 宋斌恒 37
加权拟阵
?Weight function: w: SàR+ and
?w(A) = ∑x∈A w(x) for any A contains in S
??What does this function means in a graph?
清华大学 软件学院 宋斌恒 38
加权拟阵的优化问题
?Problem of maximum weighted
independent subset in a weighted matroid
is a typical problem of matroid and it can
be solved by a greedy algorithm.【 要求权
函数必须是正的 】
清华大学 软件学院 宋斌恒 39
变种问题
?There are also another type of problem
called the problem of “minimum weight for
the maximum size independent subset”in
a weighted matroid, it can be changed to
the problem of the first type【 没有正权限
制 】 .【 如何做 】
清华大学 软件学院 宋斌恒 40
问题实例 : 图的最小支撑树
?Spanning tree:无向连通图的最大无圈子
图 , 叫支撑树 。
?minimum spanning tree problem: 求路径
最短的支撑树 。
?以上问题是一个变种问题
清华大学 软件学院 宋斌恒 41
The greedy algorithm
? Greedy(M,w)
1. A?empty set
2. Sort S[M] into monotonically decreasing
order by weight w(x)
3. For each x in S[M]
1. If A U {x} ∈ l [M] then A?A U {x}
4. Return A
清华大学 软件学院 宋斌恒 42
Analysis of the complexity
?Ordering the elements.
?To decide the whether A U {x} is in l [M]
?O(n lg n + n f(n))
8
清华大学 软件学院 宋斌恒 43
贪心选择律
?Let M={S, l } be a weighted matroid,and S
is sorted into monotonically decreasing
order by weight;
? Let x be the first independent element in S
if such x exists;
? If x exists then there exists an optimal
solution A such that x is an element of A
? 【 最大的独立元素一定在某一个最优解
中 】
清华大学 软件学院 宋斌恒 44
贪心选择律之证明
? If no such x exists, then the only independent
subset of S is the empty set. Hence we are done.
Else,
? Let B be an optimal solution,then it is not empty,
assume x is not in B (if it is in B then the
property is true),
? Claim: There is no elements in B whose weight
is bigger than w(x). To see this, any element y in
B can construct an independent subset {y}, with
assumption we get the claim
清华大学 软件学院 宋斌恒 45
切粘技术
?Construct an A such that A is optimal and
contains x: using cut and paste, or the
exchange property of M, we can grow
A={x} by adding elements from B till to A
has the same size with B
?B and A only have one different element:
?w(B)-w(A)=w(y)-w(x)≤0
?Claim: A is optimal too!
清华大学 软件学院 宋斌恒 46
扩展引理
Let M={S, l } be any matroid, if x is an
element of S that is an extension of
some independent subset A of S, then x
is also an extension of empty set.
The proof is simple. Really?
清华大学 软件学院 宋斌恒 47
淘汰无情 !
?If x is not an extension of empty set, then it
is not an extension of any independent
subset A of S.
清华大学 软件学院 宋斌恒 48
最优子结构定理
? Let x be the first element of S chosen by greedy
for the weighted matroid M= (S, l ), the
remaining problem of finding a maximum-weight
independent subset containing x reduces to
finding a maximum-weight independent subset
of the weighted M’= (S’, l ’), where
? S’={y in S: {x,y} in l }
? l ’={B subset of S-{x}: B U {x} in l } and
? The weight function to W’is the weight function
of M restrict to W’
9
清华大学 软件学院 宋斌恒 49
加权拟阵贪心算法的正确性
?定理 : The greedy algorithm on matroid is
correct
?证明 : is just the combination of the last
lemmas and corollaries.
清华大学 软件学院 宋斌恒 50
A task-scheduling problem
?Unit-time task: A task need unit time to
complete
?Schedule of task set S: is a permutation of
S specifying the order in which these tasks
are to be performed
?Scheduling unit-time task with deadlines
and penalties for a single processor:
清华大学 软件学院 宋斌恒 51
Input of the problem
?A set of task S={a1,a2,… ,an}
?A set of integer deadlines d1,d2,… ,dn.
?A set of nonnegative weights of penalties:
w1,w2,… ,wn, such that if task ai is not
finished before its deadlines there will
incur a penalty wi.
清华大学 软件学院 宋斌恒 52
Convert the problem to a
matroid
?Independent tasks:if there exists a
schedule for these tasks such that no
tasks are late.
? l ={all subsets of independent tasks}
? S is the set of all tasks
? M={S, l } is a matroid
清华大学 软件学院 宋斌恒 53
拟阵小结
?拟阵定义
?拟阵实例
?拟阵的简单性质
?加权拟阵
?加权拟阵的优化问题及其变种
?上述问题的最优子结构
?上述问题的贪婪选择律
?贪婪算法 , 及其在最小支撑树的应用