1
本节要点
正则表达式常应用于文本匹配:
– 串的查找
– 串的替换
– 将输入识别为一个个的记号正则表达式的应用
Use #1,Text-processing the web
– Web is full of data,but it’s in text form for
humans to read
Screenscraping
– extracting the data you want from screen output
– these days,the output format is HTML
Examples:
– extract tour schedule of your favorite bands
from Ticketmaster
– web sites as web services,convert address to
geo coordinates
2
正则表达式的应用
Use #2,Text processing in general
– a spectrum of uses,from small to big
Small fixes,
– replacing "ugly quotes" with,smart quotes”
– converting files between operating systems
Bigger tasks
– spell checking formatted documents (HTML),must
extract text
– pretty printing code,find comments,etc; add
format directives
3
正则表达式的应用
Use #3,Program processing
– especially on the web
On the Web,procedure calls = http requests
–,procedure arguments” passed as strings
– argument extraction can be done with regular
expressions
Other uses:
– extract components of an email address
– obfuscation,want to obfuscate all JS functions
except those called from HTML embedded scripts;
so scan web page for names of functions called
from HTML,to avoid obfuscating them,4
Regular Expression Tutorial
Focus on the two languages:
– JavaScript
– Python
A key rules common to both,Given a string
and an regex:
Find the first position in string where a
match is possible.
(except for the match() function in Python,which must match
at the beginning of the string.)
5
String search,from simple to regexp
(JavaScript)
Basic search methods,for String objects:
– "string".indexOf("rin")? 2
– "string".indexOf(new RegExp("r*n"))? -
1
– "string".search(new RegExp("r*n"))? 4
– "string".search(new RegExp("r.*n"))? 2
– "string".search(/r.*n/)? 2
– "string".match(/tri|str/)?
["str"]
– "string".match(/ri|st/g)? ["st",
"ri"]
– "string".match(/tri|str/g)?
["str"]
6
等效
String search,from simple to regexp (JavaScript)
indexOf
– Syntax,object.indexOf(searchValue,[fromIndex])
– When called from a String object,this method returns the
index of the first occurance of the specified searchValue
argument,starting from the specified fromIndex argument.
search
– Syntax,object.search(regexp)
– This method is used to search for a match between a
regular expression and the specified string.
– RegExp
– Syntax,
new RegExp(“pattern”[,“flags”]) 或者 myReg=pattern/flags
String search,from simple to regexp (JavaScript)
match
– Syntax,object.match(regexp)
– This method is used to match a specified regular
expression against a string
– If one or more matches are made,an array is
returned that contains all of the matches,Each
entry in the array is a copy of a string that
contains a match,If no match is made,a null is
returned,
To perform a global match you must include the 'g'
(global) flag in the regular expression and to
perform a case-insensitive match you must include
the 'i' (ignore case) flag.
匹配用过的串不再用于匹配
Same for Python
Basic search methods,for String objects:
– re.match(r"tri|rin","string")? no match
– re.search(r"tri|rin","string").group(0)?
'tri'
– re.compile(r"tri|str").findall("string")?
['str']
– re.compile(r"ri|st").findall("string")?
['st','ri']
– re.search(r"(tr)|(in)","string").groups()?
('tr',None)
note,match() expects the match to start at
9
表示是原始字义
Python正则表达式
支持,.”,”*”,”+”,”?”,”|”,“[
]”,” \”
,^”,匹配串的开始
,$”,匹配到串尾
{m},m个重复
{m,n},m到 n个重复
*?,+?,,{m,n}?,在第一个符号的意义上
,改贪婪的最大匹配为最小匹配
例:用正则表达式 <.*>匹配,<H1>title</H1>” 时最大匹配可匹配整个串,最小匹配匹配,<H1>“
(...),匹配括号内的任意正则表达式,常用于分组
Python正则表达式
compile(pattern[,flags])
– Compile a regular expression pattern into a regular
expression object.
– The expression‘s behaviour can be modified by specifying
a flags value,如,i表示忽略大小写
– The sequence
prog = re.compile(pat) result = prog.match(str)
– is equivalent to
result = re.match(pat,str)
– but the version using compile() is more efficient when the
expression will be used several times in a single program,
Python
findall(pattern,string[,flags])
– Return a list of all non-overlapping matches of
pattern in string,If one or more groups are
present in the pattern,return a list of groups;
this will be a list of tuples if the pattern has
more than one group,Empty matches are included
in the result unless they touch the beginning of
another match,
Python
group([group1,...])
– Returns one or more subgroups of the match,If
there is a single argument,the result is a
single string; if there are multiple arguments,
the result is a tuple with one item per
argument,Without arguments,group1 defaults to
zero (the whole match is returned).
Regular Expression Operators
Regular expressions contain:
– characters (these will be matched)
– meta-characters (these are operators controlling
the match)
If you designed the language,what ops
would you select?
re* zero or more repetitions of re
re1 | re2 match re1 or re2
re + one or more instances of re
\d matches any digit
\w matches any letter
[1-5] same as (1|2|3|4|5) -- character class
[^”] any character but the quote 14
Escaping headaches
What string will s point to? What length
will it have?
var s = ′″″′;
How about t?
var t = ″′\″′″;
What string liuteral give you this 3-
character string? [″][\][″]
15
var s = ′″ \\″′;
The repeat operators are greedy,but?
The repeat operators are greedy
– they do match the longest string while still
allowing the remainder of the regular expression
to match
This is not the same as,the whole regex
is greedy”
– that is,no automatic maximal munch,
– which is needed for compiler lexers
Example
“aab”.match(/(a|aa)a/)? [,aa” ]
“aab”.match(/(a|aa)b/)? [,aab” ]
“aaa”.match(/(a|aa)*a/)? [,aaa” ] 16
Tricky example
Regular expression,‘X(.+)+X’
– Input1,‘=XX=================X=,match
– Input2,‘=XX===================,no
match
JavaScript,
"=XX====?========X=".search(/X(.+)+X/)
– unbearably slow
Python,re.search(r'X(.+)+X',
'=XX======?======X=')
– unbearably slow
awk,echo '=XX====?=====X=' | gawk
'/X(.+)+X/'
– uncomparably fast
17
Implementations differ widely
We will learn
– how to match regular expressions efficiently
– why some implementations chose a slower
implementation
Getting a good implementation for a regexp
– your first compilation problem
– very neat,goes back to 1968 Ken Thompson,a
Berkeley grad
18
19
Regular Expressions
automaton is a good visual aid
– but is clumsy to enter,at least in textual
format
often,regular expressions are a better
formalism
– a compact way to define the language to be
accepted by an automaton,
20
Operands of a regular expression
Operands are same as labels on the edges of
an FSM
– single characters,or
– the special character? (the empty string)
"letter" is a shorthand for
– a | b | c |,.,| z | A |,.,| Z
"digit,is a shorthand for
– 0 | 1 |? | 9
sometimes we put the characters in quotes
– necessary when denoting regular expression
operators,|,*
21
Precedence of |,* operators,
Consider regular expressions:
– letter.letter | digit*
– letter.(letter | digit)*
Regular
Expression
Operator
Analogous
Arithmetic
Operator
Precedence
| plus lowest
,times middle
* exponentiation highest
More examples
Parse a URL
Replace
Parse an arithmetic expression
22
More on regular expression operators
Repetition operators:
re* match zero or more occurrences of
re
re+ match one or more occurrences of re
re? match zero or one occurrences of re
Greedy repetition,*,+,?
– match as much as possible
– while allowing the following patterns to match
Non-greedy repetition,*?,+?,
– matches as few as necessary
– while allowing the following patterns to match23