1 清华大学 软件学院 宋斌恒 1 Lecture 5. Greedy Algorithm ?Huffman code ?Matroid( 拟阵 ) 清华大学 软件学院 宋斌恒 2 Huffman Codes ?Huffman codes用于数据压缩一般可以达到 压缩 20% 到 90%的效果 。 ?什么是编码 : 编码实质上是用给定的一组 符号集的字符串到另外一组符号集的映 射 。 ?例如用 {0,1}字符串到所有英文字符 {a,b,c,… } 的映射的 8位码 清华大学 软件学院 宋斌恒 3 Prefix codes ?我们只考虑任一个编码字不是另一个编码 字的 前缀 的编码 。 ?这样的编码叫 Prefix Codes。 ?下页给出几种编码例子 : a b c d e f Frequency (in thousands) 45 13 12 16 9 5 Fixed-length codeword 000 001 010 011 100 101 Variable-length codeword 0 101 100 111 1101 1100 Figure 16.3 A character-coding problem. A data file of 100,000 characters contains only the characters a-f, with the frequencies indicated. If each character is assigned a 3-bit codeword, the file can be encoded in 300,000 bits. Using the variable-length code shown, the file can be encoded in 224,000 bits. 清华大学 软件学院 宋斌恒 5 Prefix codes are desirable because they simplify decoding. ?示例 : 在上图非定长编码下 , 串 0001110101唯一代表字符串 aaaeab 清华大学 软件学院 宋斌恒 6 编码表示方式 ?利用树来表示 , 下图表示等长编码 : 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 2 清华大学 软件学院 宋斌恒 7 An optimal code for a file is always represented by a full binary tree. 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 清华大学 软件学院 宋斌恒 8 Given a tree T corresponding to a prefix code, it is a simple matter to compute the number of bits required to encode a file. Let c denote the character in alphabet C, f(c) denote the frequency of c in the file, dT(c) the depth of c in the tree. The number of bits required to encode a file is thus .cost )()()( Ttreetheoftheasdefinewewhich cdcfTB Cc T∑ ∈ = Figure 16.4 Trees corresponding to the coding schemes in Figure 16.3. Each leaf is labeled with a character and its frequency of occurrence. Each internal node is labeled with the sum of the frequencies of the leaves in its subtree . (a)The tree corresponding to the fixed - length code a=000, … ,f=101.(b)The tree corresponding to the optimal prefix code a=0,b=101, … ,f=1100. 100 86 58 28 a:45 b:13 c:12 d:16 e:9 f:5 14 14 100 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学 软件学院 宋斌恒 10 Constructing a Huffman code ?Huffman invented a greedy algorithm that constructs an optimal prefix code called a Huffman code. ?下面我们给出 Huffman code算法的伪代码 清华大学 软件学院 宋斌恒 11 1. Huffman(C) 1. n?|C| 2. Q?C //Q is a priority queue =new Q(C) 3. For i?1 to n-1 1. Do allocate a new node z 2. Left[z]?x?Q.Extract-Min() 3. Right[z]?y?Q.Extract-Min() 4. f[z]?f[x]+f[y] 5. Q.insert(z) 4. Return Q.Extract-Min() 清华大学 软件学院 宋斌恒 12 f:5 e:9 c:12 b:13 d:16 a:45 3 清华大学 软件学院 宋斌恒 13 14c:12 b:13 f:5 e:9 d:16 a:45 清华大学 软件学院 宋斌恒 14 d:1614 a:45 f:5 e:9 25 c:12 b:13 清华大学 软件学院 宋斌恒 15 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学 软件学院 宋斌恒 16 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学 软件学院 宋斌恒 17 100 55 30 d:16 25 14 a:45 c:12 b:13 f:5 e:9 清华大学 软件学院 宋斌恒 18 最小频率者最优编码可等长 ?Let C be an alphabet in which each character c ∈ C has frequency f [c].Let x and y be two characters in C having the lowest frequencies. ? Then there exists an optimal prefix code for C in which the code words for x and y have the same length and differ only in the last bit. 4 清华大学 软件学院 宋斌恒 19 证明 ?设 T是任意一个最优编码对应的树 , a, b是 深度最深的一对兄弟 【 为什么存在 ? 】 , 然后交换 x、 a; y、 b; 【 见下图 】 ?T’’是最优编码对应的树 清华大学 软件学院 宋斌恒 20 y a b x y x b a a b x y 清华大学 软件学院 宋斌恒 21 上图不同树的成本之差 ∑∑ ∈∈ ?=? Cc T Cc T cdcfcdcfTBTB )()()()()'()( ' [ ] ( ) [ ] ( ) [ ] ( ) [ ] ( )adafxdxfadafxdxf TTTT '' ??+= [ ] [ ]( ) ( ) ( )( )xdadxfaf TT ??= [ ] ( ) [ ] ( ) [ ] ( ) [ ] ( )xdafadxfadafxdxf TTTT ??+= ,0≥ 清华大学 软件学院 宋斌恒 22 Huffman的贪心选择律 ?设 C是给定的字符集 , f[c]是 C中 元素 c的频 率 。 设 x, y是 C中频率最小的两个元素 。 令 C’= C- {x,y}+{z}, 其中 z是一个新的元素 , 即 z不属于 C。 我们在 C’上 更新 f, 使得 f[z]=f[x]+f[y], 其它值不变 。 则如果 T’是 C’上 的任意一个最优前缀编码对应的树 , 则把 z 用一棵由 x和 y为叶子点的树代替而生成的 新树 T是 C上的一个最优编码对应的树 。 清华大学 软件学院 宋斌恒 23 证明提示 1. 树 T和 T’编码成本关系 B(T)-B(T’)=f[x]+f[y] 2. 反证法 , 假设 T不是最优编码树 èExists T’’ such that B(T’’)<B(T), which has x and y are the deepest siblings, èConstruct T’’’ to represents a prefix code for C’èT’’’ is a better coding than T’, a contradiction 清华大学 软件学院 宋斌恒 24 Huffman编码小结 ?前缀码 ?最优码的特征 ?最小频率的优化码可最长化 ?贪婪选择律 ?贪婪算法 5 清华大学 软件学院 宋斌恒 25 Matroid( 拟阵 ) ?Theoretical Foundation for Greedy Algorithms. ?Its application 清华大学 软件学院 宋斌恒 26 线性代数中的线性无关概念回顾 ?一组向量 : a={1,0,0}, b={1,1,0}, c={0,2,0}, d={1,1,1}, e={0,0,0}, f={0,1,0} ?所有线性无关向量组 构成的集合 ? l ={φ,{a}, {b}, {c}, {d}, {f}, {a,b}, {a,c}, {a,d}, {a,f}, {}, {},……} ? l的特征 : 如果 {a,b}属于 l则 {a}, {b}都 属于 l 清华大学 软件学院 宋斌恒 27 Matroids ? Definition: A matroid is an ordered pair M = (S, l ) satisfying the following 3 conditions: 1. S is a finite nonempty set 清华大学 软件学院 宋斌恒 28 拟阵定义 ( 续 ) 1. l is a nonempty family of subsets of S, called the independent subsets of S, such that if B ∈ l and A ? B, then A ∈ l . We say that l is hereditary if it satisfies this property. Note that empty set is necessarily a member of l . 2. If A in l , B in l , and |A|<|B|, then there is some element x ∈ B-A such that A U {x} ∈ l . We say that M satisfies the exchange property 清华大学 软件学院 宋斌恒 29 Concepts detail 1. l is a set whose elements are subset of S 2. Every element in l is a set whose elements are independent. 3. A subset of independent elements must be independent too. 4. If one independent set is small than another, then it can be expands to the same size with elements in the other set 清华大学 软件学院 宋斌恒 30 Example 1. Linear algebra ? Let S={a1, a2, a3, … , an} be a set of n independent vectors, and l ={all subsets of S}. 1. Proof M={S, l } is a Matroid! 1. The element in l is independent subset of S. 1. Let A ∈ l and B is a subset of A, It is trivial that B is in l. 2. If |A|<|B|, then there exists an element b in B-A such that A U {b} is an element of l. 1. Its trivial too! 6 清华大学 软件学院 宋斌恒 31 Example 2. Linear algebra ?Let S={a1, a2, a3, … , an} be a set of n vectors, and l ={A is subsets of S: A’s elements are linearly independent }. Proof M={S, l } is a Matroid too! ?This proof is somewhat harder than the one for the last example. 如何利用线性代 数的知识来证明 ? 清华大学 软件学院 宋斌恒 32 Example 3. Graph ?S = {all edges in an undirected graph G} ? l = {A?S: A defines a sub-graph GA={V, A} such that GA is acyclic, that is GA is a forest} ?M= {S, l } ?定理 : M={S, l } is a Matroid! 清华大学 软件学院 宋斌恒 33 证明 1. Hereditary: is easy to prove, 证明什 么 ? 2. Exchange property: needs some tricks! 1. 设 A和 B是 l 的 元素 , 且 |A|<|B|, then GA and GB are two forests of Graph G, 2. GB has less tree than in GA . 3. ? How many trees are there in GA? 清华大学 软件学院 宋斌恒 34 – Pigeon hole principle: There exists one tree T in GB such that it intersects more than one tree in GA .【 GB中树少 】 – In tree T, there exists at least an edges e whose vertices u and v belongs two trees in GA【 保证下一步加入 (u,v)后无环 】 – A U {(u,v)} is another subset of S, and it defines a subgraph such that it is acyclic. 清华大学 软件学院 宋斌恒 35 Properties of Matriod ?Theorem 16.6: All maximal independent subsets in a matroid have the same size. ?The prove is trivial. 清华大学 软件学院 宋斌恒 36 Illustration of the same size of maximum independent subset Properties ?A graphic matroid for a connected, undirected graph G. Every independent subset of MG must be a free tree with exactly |V|-1 edges that connect all the vertices of G. Such a tree is called a spanning tree of G. ?How to do in linear algebra? 7 清华大学 软件学院 宋斌恒 37 加权拟阵 ?Weight function: w: SàR+ and ?w(A) = ∑x∈A w(x) for any A contains in S ??What does this function means in a graph? 清华大学 软件学院 宋斌恒 38 加权拟阵的优化问题 ?Problem of maximum weighted independent subset in a weighted matroid is a typical problem of matroid and it can be solved by a greedy algorithm.【 要求权 函数必须是正的 】 清华大学 软件学院 宋斌恒 39 变种问题 ?There are also another type of problem called the problem of “minimum weight for the maximum size independent subset”in a weighted matroid, it can be changed to the problem of the first type【 没有正权限 制 】 .【 如何做 】 清华大学 软件学院 宋斌恒 40 问题实例 : 图的最小支撑树 ?Spanning tree:无向连通图的最大无圈子 图 , 叫支撑树 。 ?minimum spanning tree problem: 求路径 最短的支撑树 。 ?以上问题是一个变种问题 清华大学 软件学院 宋斌恒 41 The greedy algorithm ? Greedy(M,w) 1. A?empty set 2. Sort S[M] into monotonically decreasing order by weight w(x) 3. For each x in S[M] 1. If A U {x} ∈ l [M] then A?A U {x} 4. Return A 清华大学 软件学院 宋斌恒 42 Analysis of the complexity ?Ordering the elements. ?To decide the whether A U {x} is in l [M] ?O(n lg n + n f(n)) 8 清华大学 软件学院 宋斌恒 43 贪心选择律 ?Let M={S, l } be a weighted matroid,and S is sorted into monotonically decreasing order by weight; ? Let x be the first independent element in S if such x exists; ? If x exists then there exists an optimal solution A such that x is an element of A ? 【 最大的独立元素一定在某一个最优解 中 】 清华大学 软件学院 宋斌恒 44 贪心选择律之证明 ? If no such x exists, then the only independent subset of S is the empty set. Hence we are done. Else, ? Let B be an optimal solution,then it is not empty, assume x is not in B (if it is in B then the property is true), ? Claim: There is no elements in B whose weight is bigger than w(x). To see this, any element y in B can construct an independent subset {y}, with assumption we get the claim 清华大学 软件学院 宋斌恒 45 切粘技术 ?Construct an A such that A is optimal and contains x: using cut and paste, or the exchange property of M, we can grow A={x} by adding elements from B till to A has the same size with B ?B and A only have one different element: ?w(B)-w(A)=w(y)-w(x)≤0 ?Claim: A is optimal too! 清华大学 软件学院 宋斌恒 46 扩展引理 Let M={S, l } be any matroid, if x is an element of S that is an extension of some independent subset A of S, then x is also an extension of empty set. The proof is simple. Really? 清华大学 软件学院 宋斌恒 47 淘汰无情 ! ?If x is not an extension of empty set, then it is not an extension of any independent subset A of S. 清华大学 软件学院 宋斌恒 48 最优子结构定理 ? Let x be the first element of S chosen by greedy for the weighted matroid M= (S, l ), the remaining problem of finding a maximum-weight independent subset containing x reduces to finding a maximum-weight independent subset of the weighted M’= (S’, l ’), where ? S’={y in S: {x,y} in l } ? l ’={B subset of S-{x}: B U {x} in l } and ? The weight function to W’is the weight function of M restrict to W’ 9 清华大学 软件学院 宋斌恒 49 加权拟阵贪心算法的正确性 ?定理 : The greedy algorithm on matroid is correct ?证明 : is just the combination of the last lemmas and corollaries. 清华大学 软件学院 宋斌恒 50 A task-scheduling problem ?Unit-time task: A task need unit time to complete ?Schedule of task set S: is a permutation of S specifying the order in which these tasks are to be performed ?Scheduling unit-time task with deadlines and penalties for a single processor: 清华大学 软件学院 宋斌恒 51 Input of the problem ?A set of task S={a1,a2,… ,an} ?A set of integer deadlines d1,d2,… ,dn. ?A set of nonnegative weights of penalties: w1,w2,… ,wn, such that if task ai is not finished before its deadlines there will incur a penalty wi. 清华大学 软件学院 宋斌恒 52 Convert the problem to a matroid ?Independent tasks:if there exists a schedule for these tasks such that no tasks are late. ? l ={all subsets of independent tasks} ? S is the set of all tasks ? M={S, l } is a matroid 清华大学 软件学院 宋斌恒 53 拟阵小结 ?拟阵定义 ?拟阵实例 ?拟阵的简单性质 ?加权拟阵 ?加权拟阵的优化问题及其变种 ?上述问题的最优子结构 ?上述问题的贪婪选择律 ?贪婪算法 , 及其在最小支撑树的应用