清华大学：算法分析与设计：算法分析与设计2005春.课件.第11讲

分类：计算机格式：pdf 日期：2007年03月28日

1 1 Lecture 11 模式匹配清华大学宋斌恒 2 内容介绍 ?解决问题：模式匹配，对应实际问题： –文本文件字符串查找 –格式文件中的字符串及其格式查找问题 ?算法：Na?ve，Rabin Karp, 自动机， KMP，BM算法 3 Figure 32.1 The string-matching problem. The goal is to find all occurrences of the pattern P = abaa in the text T = abcabaabcabac. The pattern occurs only once in the text, at shift s = 3. The shift s = 3 is said to be a valid shift. Each character of the pattern is connected by a vertical line to the matching character in the text, and all matched characters are shown in green color. text T pattern P b aaba a cabacbaabac 4 本章算法及其复杂度列表 Algorithm preprocessing time matching time Na?ve 0 O((n-m+1)m) Rabin-KarpΘ(m) O((n-m+1)m) Finite automaton O(m|Σ|)Θ(n) Knuth Morris Pratt O(m)Θ(n) BM 5 Terminology ? Σ: 字符集 ? Σ * :由Σ生成的所有字符串，包括空串的集合,以下叙述x, y, z, w表示其元素 ? Concatenation of x，y：xy ?前缀：[: w [ x ?there exists y，such that x=wy ?后缀：] : w ]x ?there exists y，such that x=yw 6 Lemma 32.1 (Overlapping-suffix lemma) Suppose that x, y, and z are strings such that x ] z and y ] z. If |x| ≤|y|, then x] y. If |x| ≥|y|, then y ] x. If |x| = |y|, then x = y. 2 7 Figure 32.3 A graphical proof of Lemma 32.1. We suppose that x ] a and y ] z. The three parts of the figure illustrate the three cases of the lemma. Vertical lines connect matching regions (shown shaded) of the strings. (a) If |x| ≤|y|, then x ] y. (b)If |x| ≥|y|, then y ] x. (c) If |x| = |y|, then x = y. 8 Na?ve String matching algorithm ? Na?ve-String-matcher(T,P) 1. n?length[T] 2. m?length[P] 3. For s ?0 to n-m 1. If P[1,…,m]==T[s+1,…,s+m] 1. Print “Pattern occurs with shift” s 9 Rabin Karp algorithm ? Hash Value p of a string P with modulo q: – p=hash(P,m,q)=(P[m]|Σ| 0 + P[m-1]|Σ| 1 + P[m-2]|Σ| 2 +…+ P[m-i]|Σ| i +…+ P[1]|Σ| m-1 ) mod(q) –计算方法：多项式求值方法， ?算法描述： –给定字符串T和要找的子字符串P， –从i=1到Length[T]-Length[P] ? If Hash(P,m,q)!=Hash(T[i,m],m,q) continue ?继续判断（na?ve方法） ?其中T[i,m]表示T从i开始长度为m的子串 10 Key Points of Karp-Rabin ? How to compute the value efficiently? –t s+1 = Hash(T[s+1,m],m,q) –t s+1 =(d(t s -T[s+1]h)+T[s+m+1]) mod(q) –h=d m-1 (mod q) –d是进位，如果是26个字符，则可以是26，一般来讲取d为|Σ| ? Spurious Hit 11 2 76251413209 7 53 3 9 9 2 1 12 2 76251413209 7 53 3 9 9 2 1 81 4 5 10 110 7 9 1111398 3 13 3 25141 87 14152≡(31415－3·10000) · 10+2(mod 13) ≡(7－3 · 3) · 10+2(mod 13) ≡8(mod 13) 14 ? Rabin-Karp-Matcher(T,P,d,q) 1. n=length[T] 2. m=length[P] 3. h=d m-1 (mod q) 4. p=0 5. T 0 =0 6. For i =1 to m 1. p=(dp+P[i]) (mod q) 2. t 0 =(dt 0 +T[i]) (mod q) 7. For s = 0 to n-m 1. If p=t s 1. If P[1,…,m]=T[s,…,m+s], print “pattern occurs at ” s 2. If s<n-m 1. t s+1 =(d(t s -T[s+1]h)+T[s+m+1] mod(q) 15 String matching with finite automata ?有限自动机M：5-tuple (Q,q 0 ,A,Σ,δ) ? Q is a finite set of states ?q 0 in Q is the start state ? A contains in Q is a distinguished set of accepting states ? Σ is a finite input alphabet ? δ is a function from Q×Σ into Q called the transition function of M 16 有限自动机示例: 0 1 001 010 bastate input ?δ ?Σ 1A ?q 0 ?Q abaaa accepted abbbaaa accepted abbaa rejected 17 Final-state function ? φ: Σ ? ?Q ? φ(ε)=q 0 ? φ(wa)=δ(φ(w),a) for w in Σ ? , ainΣ 18 String matching automata ? Given pattern P, to construct an automaton: ?M= (Q,q 0 ,A,Σ,δ) where ? Σ = the alphabet in P ? Q = {0,1,2,…,m: integer s means the current state has exactly s characters match the pattern}, m=length[P] ?q 0 =0 ?A={m} ? δ : δ(q,a)=σ(P q a), σ(x)=max{k:P k =x}, P k 表示P 的前k个字符，称为后缀函数。【研究！】 4 19 Automaton for “ababaca” 0 1 2 3 4 5 6 7 其它转移如何做？其它转移如何做？转移函数？转移函数？ 20 0 2 3 4 5 61 7 21 1 00 1 02 3 00 1 04 5 00 1 64 7 00 1 02 22 自动机工作的一个例子 i — 1 2 3 4 5 6 7 8 9 10 11 T[i] — a b a b a b a c a b a stateφ(T i ) 0 1 2 3 4 5 4 5 6 7 2 3 23 算法描述： ? Finite-Automaton-Matcher(T,δ,m) 1. n=length[T] 2. q=0 3. For i = 1 to n 1. q=δ(q,T[i]) 2. If q==m then print “pattern occurs at” i-m 24 后缀函数σ的属性：【引/定理32.2/3/4】 ? σ(xa)≤σ(x)+1 ? If q= σ(x) then σ(xa)=σ(P q a) ?如果φ是模式P决定的自动机的最终状态函数，则φ(T i )=σ(T i ) 5 25 Figure 32.8 An illustration for the proof of Lemma 32.2. The figure shows that r ≤σ(x) + 1, where r =σ(xa). 1?r p r p 26 Figure 32.9 An illustration for the proof of Lemma 32.3. The figure shows that r =σ(P q a), where q = σ(x) and r =σ(xa). q p r p 27 COMPUTE-TRANSITION-FUNCTION(P, ∑) 1m←length[P] 2for q←0 to m 3 do for each character a∈∑ 4 do k←min(m + 1, q + 2) 5 repeat k←k－1 6 until P k ] P q a【是后缀！】 7δ(q, a)←k 8 return δ 28 Knuth-Morris-Pratt Algorithm ?中心思想：克服na?ve 算法和有限自动机算法的重复工作！ ?主要步骤：计算辅助函数π[1,2,…,m]它只与Pattern相关，用来说明Pattern的重复情况。 29 a aaba b bcbaabababc ba acb 在某S处开始有5个字符匹配，第六个不匹配，这时找一个短一些的匹配串：即从第S＋2处开始的长度为3的地方匹配,见下页 30 a aaba b bcbaabababc ba acb 6 31 aba ababa q P k P 32 21 109876543 ba acbababa 00 10654321 33 ba bababa ba baba ba ba ba 8 P 6 P 4 P 2 P 0 P c a a b c a π[8]=6 a b a b c a π[6]=4 a b a b a b c a π[4]=2 a b a b a b a b c a π[2]=0 ε 34 KMP-MATCHER(T, P) 1n←length[T] 2m←length[P] 3π←COMPUTE-PREFIX-FUNCTION(P) 4q←0 Number of characters matched. 5for i←1 to n Scan the text from left to right. 6 do while q > 0 and P[q + 1] ≠T[i] 7do q←π[q] Next character does not match. 8ifP[q + 1] = T[i] 9 then q←q + 1 Next character matches. 10 if q = m Is all of P matched? 11 then print “Pattern occurs with shift” i－m 12 q←π[q] Look for the next match. 35 COMPUTE-PREFIX-FUNCTION(P) 1m←length[P] 2π[1] ←0 3k←0 4for q←2 to m 5 do while k > 0 and P[k + 1] ≠P[q] 6 do k←π[k] 7if P[k +1] = P[q] 8 then k←k +1 9 π[q] ←k 10 return π 36 BM算法 ? The Boyer-Moore algorithm ?一种从右边开始进行匹配的算法 7 37 主要思想从右向左比：当目标串中的字符根本就没有在模式串中出现，模式串就可以从这一字符开始向右移动m位置数。例子：位置0 1 2 3 4 5 6 7 8 9 … 目标串a b b a d a b a c b a 模式串b a b a c b a b a c 这种算法最好的情况就是模式串每一次和目标串比较，它们第一个字符就不匹配，且目标串中的字符不在模式串中出现，这样模式串每一次都可移动m位。 38 坏字符试探 Bad character heuristics 目标串和模式串第一个字符不匹配，且目标串的第一个字符在模式串的其它位置出现，在这种情况下，模式串右移到其相应字符与目标串第一个字符匹配的位置上。 39 示例位置0 1 2 3 4 5 6 7 8 9 … 目标串a b b a b a b a c b a 模式串b a b a c b a b a c b-c导致不匹配，但目标串字符b出现在模式串的位置0和位置2，于是模式串可右移2 位，使其最右边的字符b和目标串的第一个字符b匹配。 40 Good suffix heuristics 位置0 1 2 3 4 5 6 7 8 9 … 目标串a b a a b a b a c b a 模式串c a b a b c a b a b √ 目标串和模式串的后缀ab已匹配，模式串应当右移2位，使其从右至左的第二个子串ab与目标串的后缀ab匹配。 41 Good suffix heuristics 下面这个例子中，同样目标串和模式串的后缀ab已匹配，但子串ab没有在模式串的其它位置出现。所以模式串可以右移5位。位置0 1 2 3 4 5 6 7 8 9 … 目标串a b c a b a b a c b a 模式串c b a a b c b a a b 42 Good suffix heuristics 目标串和模式串的后缀bab已匹配，在位置 1出现a-b不匹配，同时子串bab也没有再出现在模式串中，但在这种情形下模式串不能象上一种情形右移5位，而只能右移3位，因为模式串的前缀ab和子串bab（也就是目标串的子串 bab）的后缀匹配。位置0 1 2 3 4 5 6 7 8 9 … 目标串a a b a b a b a c b a 模式串a b b a b a b b a b 8 43 对“坏字符”情况的预处理对于这种情况，我们首先定义一个函数occ 生成一个数组，在这个数组中记录广义字母表的每一个字符在模式串中出现的最右边的位置，如果根本没有出现在模式串中，则记-1。定义：其中p 0 p 1 …p m-1 代表模式串，a是广义字母表中的任一字符。 a1- aa} = p | j max{ = a) occ(p, j ? ? ? ? ? 没有在模式串中出现如果出现在模式串中如果 44 算法： bmInitocc() for a ←0 to alphabetsize do occ[a] ←-1 for j ←0 to m do a ←p[j] occ[a] ←j 例子： occ(text, x) = 2 occ(text, t) = 3 45 对“好后缀”情况的预处理我们也要建立一个数组s。数组里的第i项代表不匹配发生在模式串的位置i-1 （即模式串位置i开始的后缀已和目标串相应子串匹配）时模式串应当右移的位数。然后再建立一个数组f。第i项代表模式串中从位置i开始的后缀其最长边界开始的位置。 46 对“好后缀”情况的预处理第一种情形：已匹配的后缀子串在模式串的其它位置又出现。 i: 0 1 2 3 4 5 6 7 p: a b b a b a b f: 5 6 4 5 6 7 7 8 s: 0 0 0 0 2 0 4 1 47 对“好后缀”情况的预处理第二种情形：已匹配的后缀子串仅仅有一部分又出现在模式串的开头。例子：i: 0 1 2 3 4 5 6 7 p: a b b a b a b f: 5 6 4 5 6 7 7 8 s: 5 5 5 5 2 5 4 1 48 bmPreprocess1（） i ←m j ←m+1 f[i] ←j while i>0 do while j<=m and p[i-1]≠p[j-1] do if s[j]=0 then s[j] ←j-1 j ←f[j] i ←i-1 j ←j-1 f[i]=j bmPreprocess2() j ←f[0] for i ←0 to m do if s[i]=0 then s[i] ←j if i=j then j ←f[j] BM预处理算法 bmPreprocess() call bmInitocc() call bmPreprocess1() call bmPreprocess2() BM检索算法 bmSearch() i ←0 while i<=n-m do j ←m-1 while j>=0 and p[j]==t[I+j] do j ←j-1 if j<0 then return i i ←i+s[0] else i ←i+max(s[j+1],j- occ[t[i+j]]) 9 49 总结 BM算法也将匹配过程分为了对模式串的预处理阶段(preprocessing)，和检索(Searching)阶段。一般情况下，算法的时间复杂度是O (n)。在可选的字符数量远远大于模式串的长度时，由于这将通常导致“坏字符”情况，模式串一次即可右移m个位置，因此算法的时间复杂度平均可以达到O (n/m)。

课件简介

课件名称：	清华大学：算法分析与设计
课件分类：	计算机
课件类型：	电子教案
文件大小：	2.69MB
下载次数：	5
评论次数：	8
用户评分：	6.8

用户列表