1 清华大学 宋斌恒 1 Lecture 3. Dynamic Programming 清华大学 宋斌恒 清华大学 宋斌恒 2 综合练习 : 如何查资料 、 整理资料 n 资料来源 n 资料整理 n 资料使用 清华大学 宋斌恒 3 Dynamic programming n Longest common subsequence n Printing neatly n Edit distance n Optimal binary search tree n 技巧 : Memoization[备忘录法 ] 清华大学 宋斌恒 4 动态规划方法回顾 n 与分治方法有没有共同点 ? 【 不同点 】 n 子问题空间 ? 决定复杂性因子之一 n 子问题的治与合 ? 复杂性因子之二 n 最优值 n 最优解 n [讨论 ] 清华大学 宋斌恒 5 最长公共子序列 清华大学 宋斌恒 6 问题定义 n Given : n A finite alphabet A, finite set R of strings from A* , and apositiveinteger K n Is There : n A string w element of A* with | w | >= K such that w is a subsequence of each x element of R ? n The problem remains NP-complete for all A ( R ) >= 2 n Analogous Longest Common Substring problem is trivially solvable in polynomial time 2 清华大学 宋斌恒 7 实例 n Given as input, the strings : n DEABC n ABCD n FGHABD n Common Subsequences Are : n A, B, D, AB n The Longest Common Subsequences Are : n AB 清华大学 宋斌恒 8 一些应用 n ?String - to - String Correction (spell checking & correction) n ?Distance Between Strings (database keys) n ?Minimum Cost Edit Operation Between Strings n ?Amino Acid Sequences in Molecular Biology n ?Data Compression Techniques 清华大学 宋斌恒 9 Longest Common subsequence n Background: n In bioinformatics, it needs often to compare the DNA of two or more different organisms. n A strand of DNA consists of a string of molecules called base, where the possible bases are: n Adenine:腺嘌呤 n Guanine:鸟嘌呤 n Cytosine:胞核嘧啶 , 氧氨嘧啶 n Thymine:胸腺嘧啶 n A DNA can be represented by a string over the finite set [A,C,G,T] 清华大学 宋斌恒 10 Longest Common subsequence n Two possible DNA string; n S1 = ACCGGTCGAGTGCGCGGAAGCCGGC n S2 = GTCGTTCGGAATCGGCTTGCCGTTG n Our goal: Compare the two string to determine how “similar”the two strands are. 清华大学 宋斌恒 11 What is the similarity of two strings? n Issue: The similarity can be defined in many ways, here we use the “longest common subsequence”to describe the similarity n Show your opinions![Many] 清华大学 宋斌恒 12 Concepts n Subsequence and substring n A substring of a string is a continuous part of the string n A subsequence of a string x is a string y with its elements in y in the same order. 3 清华大学 宋斌恒 13 Longest Common Subsequence Problem n Given two sequences n X=<x1, x 2, x 3,… ,xm> n Y=<y 1, y 2, y 3,… ,yn> n Find the longest common subsequence, where a common subsequence of two strings is another string y such that it is a subsequence of both of the two given strings. 清华大学 宋斌恒 14 Step1: Analyzing the problem n A brute-force method: [Na?ve] n How many subsequences for string x? n Check each one is also a subsequence of string y. n Complexity? 清华大学 宋斌恒 15 Theorem 15.1 Optimal substructure of an LCS n Let X=<x1,x2,… ,xm> ,Y=<y1,y2,… ,yn> be two sequences and let Z=<z1,z2,… ,zk> be a LCS of X and Y. n If xm=yn, then zk=xm=yn and Zk-1 is an LCS of Xm-1 and Y n-1 n If xm !=yn, then zk!= xm implies that Zk is an LCS of Xm-1 and Yn n If xm != yn, then zk!= yn and Zk is an LCS of Xm and Y n-1 清华大学 宋斌恒 16 Proof of Theorem 15.1 n Proof of “If xm=yn, then zk=xm=yn and Zk-1 is an LCS of Xm-1 and Yn-1” n If zk!=xmàwe could append xm to Z to obtain a longer common subsequence of X and Y, which is a contradiction, hence we have zk=xm=yn and n Zk-1 is a common subsequence of Xm-1 and Y n-1, which we will show it is an LCS of Xm-1 and Y n-1. If not, there are another common subsequence which is longer than Zk-1 and then we append yn to it, which is a contradiction. 清华大学 宋斌恒 17 Continuing proof of Theorem 15.1 n Proof of “If xm!=yn, then (zk != xm implies that Zk is an LCS of Xm-1 and Yn)” n If Zk is not an LCS of Xm-1 and Yn , then it will conclude a contradiction that Z is an LCS of X and Y. 清华大学 宋斌恒 18 Step 2: A recursive solution n Let C[i,j] be the length of the LCS of Xi and Yj, then ?? ??? ≠?? =+?? = = ji ji yxjicjic yxjic ji jic ]},1[],1,[max{ 1]1,1[ 0,0 ],[ 4 清华大学 宋斌恒 19 Step 3: Computing the length of an LCS LCS-Length(X,Y) ① m=length[X] ② n=length[Y] ③ For i ?1 to m {c[i,0]=0} //初始化 ④ For j ?0 to n {c[0,j]=0} //初始化 ⑤ For i ?1 to m, j ? 1 to n ⑥ { ① if xi=yj then {c[i,j]=c[i-1,j-1]+1; continue} ② if c[i,j-1]>c[i-1,j] then {c[i,j]=c[i.j-1]; continue} ③ c[i,j]=c[i-1,j] ⑦ } 清华大学 宋斌恒 20 Step 4: Constructing the LCS Modification for the last algorithm: store the information of choices. Here after the b[i,j] stands for the choice of computing c[i,j]. ⑤ For i ?1 to m, j ?1 to n ⑥ { ① if xi=yjthen {c[i,j]=c[i-1,j-1]+1; b[i,j]=upperleft; continue} ② if c[i,j-1]>c[i-1,j] then {c[i,j]=c[i.j-1]; b[i,j]=upper; continue} ③ c[i,j]=c[i-1,j]; b[i,j]=left ⑦ } 清华大学 宋斌恒 21 Print-LCS(b,X,i,j) ① If i=0 or j=0 then return ② If b[i,j]=upperleft then{ ① Print-LCS(b,X,i-1,j-1) ② Print Xi; ③ Return; ③ } ④ If b[i,j]=upper Print-LCS (b,X,i -1,j); return; ⑤ Print-LCS(b,X,i,j); ⑥ return; 清华大学 宋斌恒 22 How to improve your code n Save the space used in the last algorithm n How to write the code neatly n Complexity analysis n Complete the LCS n 备忘录法 清华大学 宋斌恒 23 如何分析问题 n 优美打印为例 : n什么是优美打印问题 : n英文打印每行空格的差别要小 : 一般是空格个 数的三次方的和达到极小 。 【 举例说明 】 清华大学 宋斌恒 24 子问题空间 n 原问题 ? 不优美度 ( 一段文字 ) n 考虑不考虑首行缩进问题 ? n 不优美度 ( 空字符数 , 一段文字 ) n 一般子问题 : 从某位置开始的子文字段 n ? 分析子问题维数 ? n ? 如何由子问题构造原问题 ? n 特殊子问题 : 所有子文字段都从头排版 n ? 子问题问题维数 ? n 是否可以由这类子问题构造原问题的解 ? n 有没有更小的子问题空间 ? 5 清华大学 宋斌恒 25 Analyzing Printing neatly again n 断句点 n 断句点确定后留下两类问题 n 一类问题的不优美度要考虑最后一行数值 ; 另 一类问题的不优美度不考虑最后一行的数值 ; n 说明上述选择方法 、 证明上述选择方法的正确 性 。 n 递归公式准备工作如何做 ? 【 见下页 】 清华大学 宋斌恒 26 Recursive Formula n 请问 : 如果 C(i,j)表示从第 i个词开始到第 j个词结束段落 包括 最后一行的不优美度 , 如果 D(i,j)表示从第 i个词开 始到第 j个词结束段落 不包括 最后一行的不优美度 , 假 设有 n个词 , 原问题的不优美度为 n D(1,n)! 递归公式 ? n C(i,j )=mini<k<j{C(i,k)+ C(k+1,j)} n D(i,j )=mini<k<j{C(i,k)+ D(k+1,j)} n 初始值 ? 清华大学 宋斌恒 27 初始值 ( Initial values) n 当 k(i,j)=j- i+ 所有词的字符数 ≤ 行字符数时 【 词与词 之间空格数为 1】 n C(i,j )=[行字符数 - k(i,j)]3 n D(i,j )=0 n 如果编程 : 请问最严重的异常情况是什么 ? n 思考 : 如果字符宽度不同 , 上述方法有效否 ? n 进一步 : 如果字符宽度不同 , 其宽度可以无级缩放 , 每行两头完全对齐 , 如何定义优美度 ? 清华大学 宋斌恒 28 最优子结构 n 原问题的解是最优的 n 如何说明原问题的解是由子问题的最优解构成 的 , 这是分析问题的关键 清华大学 宋斌恒 29 优美打印问题变种 n 字符宽度不同 n 可以调整某些字符的宽度 , 如空格 , 但调整要 计算在优美度内 。 n 可以调整所有字符的宽度 。 n 其它语言如汉语的优美度问题 n 图文混排问题等等 清华大学 宋斌恒 30 编辑距离 n 从一个字符串到另一个字符串可以由若干变换构成 : n Copy n Replace n Delete n Insert n Twiddle n Kill n x=σ(y), 其中 σ是一连串的变换 σ= t1t2… tn 6 清华大学 宋斌恒 31 变换成本与距离 n Cost(σ)=ΣCost(ti) n 我们称由 x到 y的费用最小的 σ的费用就是 x和 y之间的距离 。 n 子问题空间如何确定 : n C[xn,ym]=C[xn-2,ym-2]+twiddle 可以 twiddle, n C[xn,ym]=C[xn-1,ym-1]+copy 可以 copy, n C[xn,ym]=C[xn-1,ym-1]+replace 可以 replace, n C[xn,ym]=C[xn-1,ym]+delete 可以 delete, n C[xn,ym]=C[xn,ym-1]+insert 可以 insert, n C[xn,ym]=C[xn-s,ym]+kill 可以 kill, n 最优子结构 : 最优解可以分解为子问题的最优解的综合 。 n 问题 : 清华大学 宋斌恒 32 Optimal binary search tree n Binary Search Tree? n Binary tree n Search tree n Examples? n Optimal: under some condition: n The elements appear with different frequencies n How to formally define it? 清华大学 宋斌恒 33 A search tree k1 k3 k4 k2 d0 d1 d2 d3 d4 清华大学 宋斌恒 34 Definition for the optimal binary search tree n Given a sequence K=<k1,k2,… kn> of n distinct keys stored orderly from smaller value to larger value. n We need n+1 dummy keys for values not in K, these dummy keys are d0,d1,… ,dn n The appearance probabilities of ki and dj are given n Find a search tree such that: 清华大学 宋斌恒 35 Cost of the searches are minimum n The expectation cost of search:[See p358 15.16] n E[search cost in T]= Sum[(depth(ki)+1)*pi]+ sum[(depth(di)+1)*qi]+ Where pi are the occurrence possibility of ki, qi are the occurrence possibility of di 清华大学 宋斌恒 36 The structure of an optimal binary search tree n We need a structure that the optimal solution can be constructed from one or many optimal solution for subproblems n Any sub tree of binary search tree contains the keys continuing from ki to kj and also the dummy keys from di-1to dj 7 清华大学 宋斌恒 37 费用递归函数 n e[i,j ], the optimal expectation value for the optimal tree n If kr is the root of the optimal subtree containing the key ki to kj, we have n e[i,j ]=pr+(e[i,r-1]+w(i,r- 1))+e[r+1,j]+w(r+1,j)) n e[i,j ]=e[i,r-1]+e[r+1,j]+w(i,j) k1 k3 k4 k2 d0 d1 d2 d3 d4 清华大学 宋斌恒 38 The optimal structure of the binary search tree n If the tree is optimal, then its all subtree is optimal. n Prove it! 清华大学 宋斌恒 39 Chapter notes n R. Bellman’s systematic study on dynamic programming in 1955 with solid math base n Matrix chain production has an O(n lg n) solution by Hu and Shing n Knuth posed a question is there any solution whose complexity is less than O(mn) for LCS, the answer is yes with a solution with O(mn/lg n) by Masket and Paterson