语法分析技术小结
自上而下分析
自下而上分析
二义性处理
比较自上而下分析
自上而下分析 一般过程
LL( 1) 文法
预测分析程序
递归下降分析技术
表驱动分析技术自上而下分析一般过程
In the top-down parsing,we begin with the start
symbol and at each step,expand one of the
remaining nonterminals by replacing it with the
right side of one its productions.
We repeat until only terminals remain,The top-
down parse prints a leftmost derivation of the
sentence.
LL( 1)文法
The first ―L‖ means we scan the input from left to right; the
second ―L‖ means we create a leftmost derivation; and the 1
means one input symbol of lookahead.
A grammar G is LL(1) iff whenever A –> u | v are two distinct
productions of G,the following conditions hold:
- for no terminal a do both u and v derive strings beginning
with a (i.e,first sets are disjoint)
- at most one of u and v can derive the empty string
- if v =>*? then u does not derive any string beginning with a
terminal in Follow(A)
The first set of a sequence of symbols u,written
as First(u ) is the set of terminals which
start all the sequences of symbols derivable from
u,A bit more formally,consider all
strings derivable from u by a leftmost derivation,
If u =>* v,where v begins with some
terminal,that terminal is in First(u),If u =>*?,
then? is in First(u ).
First 集
Follow 集
The follow set of a nonterminal A is the set of
terminal symbols that can appear immediately
to the right of A in a valid sentential form,A bit
more formally,for every valid sentential form S
=>*uAv,where v begins with some terminal,
that terminal is in Follow(A).
计算 First 集
To calculate First(u) where u has the form
X1X2...Xn,do the following:
1,If X1 is a terminal,then add X1 to First(u),
otherwise add First(X1) -? to First(u ),
2,If X1 is a nullable nonterminal,i.e.,X1 =>*?,add
First(X2) -? to First(u),Furthermore,
if X2 can also go to?,then add First(X3) -? and
so on,through all Xn until the first nonnullable one.
3,If X1X2...Xn =>*?,add? to the first set.
计算 Follow 集
For each nonterminal in the grammar,do the following:
1,Place# in Follow(S) where S is the start symbol and # is
the input's right endmarker.The endmarker might be end of
file,it might be newline,it might be a special symbol,
whatever is the expected end of input indication for this
grammar,We will typically use # as the endmarker.
2,For every production A –> uBv where u and v are any
string of grammar symbols and B is a nonterminal,
everything in First(v) except? is placed in Follow(B).
3,For every production A –> uB,or a production A –> u Bv
where First(v ) contains? (i.e,v is nullable),then
everything in Follow(A) is added to Follow(B).
预测分析程序
Predictive parser is a non-backtracking top-
down parser,A predictive parser is
characterized by its ability to choose the
production to apply solely on the basis of the
next input symbol and the current nonterminal
being processed,
To enable this,the grammar must take a
particular form,that is,a grammar LL(1),
递归下降分析技术
The first technique for implementing a predictive
parser is called recursive-descent,
A recursive descent parser consists of several small
functions(procedures),one for each nonterminal in
the grammar,As we parse a sentence,we call the
functions (procedures) that correspond to the left
side nonterminal of the productions we are applying,
If these productions are recursive,we end up
calling the functions recursively.
表驱动分析技术
Another method for implementing a predictive parser is called
table-driven parsing that uses a table,called parse table,to store
that production along with an explicit stack to keep track of where
we are in the parse,We push the start symbol on the stack and read
the first input token,As the parser works through the input,there are
the following possibilities for the top stack symbol X and the input
token nonterminal a:(1) If X = a and a = end of input (#),parser
halts and parse completed successfully.( 2) If X = a and a != #,
successful match,pop X and advance to next input token,This is
called a match action.( 3) If X != a and X is a nonterminal,pop X
and consult table at [X,a] to see which production applies,push
right side of production on stack,This is called a predict action.( 4)
If none of the preceding cases applies or the table entry from step 3
is blank,there has been a parse error
分析表的构造
1,For each production A –> u of the grammar,
do steps 2 and 3
2,For each terminal a in First(u),add A –> u to
M[A,a]
3,If? in First(u),(i.e,A is nullable) add A –> u
to M[A,b] for each terminal b in Follow(A),
If? in First(u),and # is in Follow(A),add A –
> u to M[A,#]
4,All undefined entries are errors
自下而上分析
自下而上分析 一般过程
移进归约技术
LR 分析技术
补充:算符优先分析技术自下而上分析一般过程
A bottom-up parse works in reverse,We begin with
the sentence of terminals and each step applies a
production in reverse,replacing a substring that
matches the right side with the nonterminal on the
left,
We continue until we have substituted our way back
to the start symbol,If you read from the bottom to
top,the bottom-up parse prints out a rightmost
derivation of the sentence.
移进归约技术
Bottom-up parsing algorithms are in general more
powerful than top-down methods,but not surprisingly,
the constructions required in these algorithms are
also more complex,It is difficult to write a bottom-up
parser by hand for anything but the most trivial of
grammars,but fortunately,there are excellent parser
generator tools like yacc that build a parser from an
input specification.
Shift-reduce parsing is the most commonly used and
most powerful of the bottom-up techniques.
LR parsers (―L‖ for left to right scan of input; ―R‖ for rightmost
derivation) are efficient,tabledriven shift-reduce parsers,An LR
parser uses two tables:
1,The action table Action[s,a] tells the parser what to do when the
state on top of the stack is s and terminal a is the next input token,
The possible actions are to shift a state onto the stack,to reduce the
handle on top of the stack,to accept the input,or to report an error.
2,The goto table Goto[s,X] indicates the new state to place on top of
the stack after a reduce of the non-terminal X while state s is on top
of the stack.
The two tables are usually combined,with the action table
specifying entries for terminals,and the
goto table specifying entries for non-terminals.
LR技 术
There are three types of LR parsers,LR(k),simple
LR(k),and lookahead LR(k) (abbreviated to LR(k),
SLR(k),LALR(k)),The k identifies the number of
tokens of lookahead,We will usually only concern
ourselves with 0 or 1 tokens of lookahead,but it does
generalize to k > 1,The different classes of parsers all
operate the same way (as shown above,being driven
by their action and goto tables),but they differ in how
their action and goto tables are constructed,and the
size of those tables.
Types of LR parsers
LR(0) configuration or item,A configuration is a production of the grammar
with a dot at some position on its right side.
1,Construct F = {I0,I1,..,In},the collection of configurating sets for G'.
2,State i is determined from Ii,The parsing actions for the state are determined
as follows:
a) If A –> u? is in Ii then set action[i,a] to reduce A –> u for all input,(A not
equal to S').
b) If S' –> S? is in Ii then set action[i,#] to accept.
c) If A –> u?av is in Ii and GO(Ii,a) = Ij,then set action[i,a] to shift j (a is a
terminal).
3,The goto transitions for state i are constructed for all non-terminals A using
the rule,If GO(Ii,A) = Ij,then goto [i,A] = j.
4,All entries not defined by rules 2 and 3 are errors.
5,The initial state is the one constructed from the configurating set containing
S' –>?S.
Constructing LR(0) parsing tables
A grammar is LR(0) if the following two
conditions hold for each configurating set:
For any configurating set containing the item A –>
u?xv there is no complete item B –> w? in that
set,In the tables,this translates to no shift-
reduce conflict on any state,
There is at most one complete item A –> u? in each
configurating set,This translates to no reduce-
reduce conflict on any state.
LR(0) grammar
In SLR(1),the S stands for Simple,
SLR(1) parser uses the same LR(0) configurating sets
and has the same table structure and parser
operation,,The difference comes in assigning table
actions,where we are going to use a token of
lookahead to help arbitrate among the conflicts,If
the string on top of the stack could be reduced to the
non-terminal A,what tokens do we expect to find as
the next input?
What tokens would tell us that the reduction is not
appropriate? Answer is to use Follow(A).
SLR(1)分析
1,Construct F = {I0,I1,..,In},the collection of LR(0) configurating
sets for G'.
2,State i is determined from Ii,The parsing actions for the state are
determined as follows:
a) If A –> u? is in Ii then set action[i,a] to reduce A –> u for all a in
Follow(A) (A may not be S').
b) If S' –> S? is in Ii then set action[i,#] to accept.
c) If A –> u?av is in Ii and succ(Ii,a) = Ij,then set action[i,a] to shift j
(a must be a terminal).
3,The goto transitions for state i are constructed for all non-terminals A
using the rule,If GO(Ii,A) = Ij,then goto [i,A] = j.
4,All entries not defined by rules 2 and 3 are errors.
5,The initial state is the one constructed from the configurating set
containing S' –>?S.
SLR(1) table construction
A grammar is SLR(1) if the following two conditions hold for
each configurating set:
1,For any item A –> u?xv in the set,with terminal x,there is no
complete item B –> w? in that set with x in Follow(B),In the
tables,this translates no shift-reduce conflict on any state,
This means the successor function for x from that set either
shifts to a new state or reduces,but not both.
2,For any two complete items A –> u? and B –> v? in the same
configurating set,the follow sets must be disjoint,e.g,
Follow(A)? Follow(B) is empty,This translates to no reduce
reduce conflict on any state,If more than one non-terminal
could be reduced from this set,it must be possible to uniquely
determine which using only one token of lookahead.
SLR(1) grammars
The SLR(1) technique still leaves something to be desired,because we
are not using all the information that we have at our disposal,When
we have a completed configuration (i.e,dot at the end) such as X –>
u?,we know this corresponds to a situation in which we have u as a
handle on top of the stack which we then can reduce,i.e.,replacing
u by X,We allow such a reduction whenever the next symbol is in
Follow(X),However,it may be that we should not reduce for every
symbol in Follow(X),because the symbols below u on the stack
preclude u being a handle for reduction in this case,In other words,
SLR(1) states only tell us about the sequence on top of the stack,
not what is below it on the stack,We may need to divide an SLR(1)
state into separate states to differentiate the possible means by
which that sequence has appeared on the stack,By carrying more
information in the state,it will allow us to rule out these invalid
reductions.
The limitations of SLR(1)
LR or canonical LR parsing incorporates the required extra
information into the state by redefining configurations to
include a terminal symbol as an added component,LR(1)
configurations have the general form:
A –> X1...Xi? Xi+1...Xj,a
This means we have states corresponding to X1...Xi on the stack
and we are looking to put states corresponding to Xi+1...Xj on
the stack and then reduce,but only if the token following Xj is
the terminal a,a is called the lookahead of the configuration,
The lookahead only comes into play with LR(1)
configurations with a dot at the right end:
A –> X1…Xj?,a
LR or canonical LR parsing
The method for constructing the configurating sets of LR(1)
configurations is essentially the same as for SLR,but there
are some changes in the closure and successor operations
because we must respect the configuration lookahead,To
compute the closure of an LR(1) configurating set I:
Repeat the following until no more configurations can be
added to state I:
— For each configuration [A –> u?Bv,a] in I,for each
production B –> w in G',and for each terminal b in First(va)
such that [B –>?w,b] is not in I,add [B –>?w,b] to I.
Constructing configurating sets of LR(1)
1,Construct F = {I0,I1,..,In},the collection of configurating sets for the
augmented grammar G' (augmented by adding the special production
S' –> S).
2,State i is determined from Ii,The parsing actions for the state are
determined as follows:
a) If [A –> u?,a] is in Ii then set Action[i,a] to reduce A –> u (note A may
not be S').
b) If [S' –> S?,$] is in Ii then set Action[i,$] to accept.
c) If [A –> u?av,b] is in Ii and succ(Ii,a) = Ij,then set Action[i,a] to shift j
(a must be a terminal).
3,The goto transitions for state i are constructed for all non-terminals A
using the rule,If succ(Ii,A) = Ij,then Goto [i,A] = j.
4,All entries not defined by rules 2 and 3 are errors.
5,The initial state is the one constructed from the configurating set
containing S' –>?S.
Constructing LR(1) parsing tables
A grammar is LR(1) if the following two conditions are
satisfied for each configurating set:
1,For any item in the set [A –> u?xv,a] with x a
terminal,there is no item in the set of the form [B –>
u?,x] In the action table,this translates no shift-
reduce conflict for any state..
4,The lookaheads for all complete items within the set
must be disjoint,e.g,set cannot have both
[A –> u?,a] and [B –> v?,a] This translates to no
reduce-reduce conflict on any state..
LR(1) grammars
Because a canonical LR(1) parser splits states based on differing
lookahead sets,it typically has many more states that the
corresponding SLR(1) or LR(0) parser,Potentially it could
require the Cartesian product of all possible LR(0)
configurations and the power set of the terminals.It never
actually gets that bad in practice,but a canonical LR(1) parser
for an ordinary programming language might have an order of
magnitude more states than an SLR(1) parser,Is there
something in between?
With LALR (lookahead LR) parsing,we attempt to reduce the
number of states in an LR(1) parser by merging similar states,
This reduces the number of states to the same as SLR(1),but
still retains some of the power of the LR(1) lookaheads.
The motivation for LALR
The,brute-force” method
Construct all canonical LR(1) states.
2,Merge those states that are identical if the lookaheads are ignored,
i.e,two states being merged must have the same number of items
and the items have the same core (i.e,the same productions,
differing only in lookahead),The lookahead on merged items is
the union of the lookahead from the states being merged.
3,The GO function for the new LALR(1) state is the union of the
GO function of the merged states,If the two configurations have
the same core,then the original successors must have the same
core as well,and thus the new state has the same successors.
4,The action and goto entries are constructed from the LALR(1)
states as for the canonical LR(1) parser..
LALR table construction
几种文法之间的联系
1.文法的关系,不是语言的关系,
2.没有一个二义文法是 LL(1) 或 LR(1)
的,
3,LR家族的关系是非常清楚的,
4,每个 LL(1) 文法是
LR(1)的,
Implementation.
Simplicity:
Generality:
Grammar conditioning:
Error repair:
Table sizes:
Comparing LL(1) parsers to LALR(1)
解决二义性的不同途径
A different way of resolving ambiguity
Ambiguity in a grammar means we have two or more leftmost
derivations for the same input string,or equivalently,that
we can build more than one parse trees for the same input
string.
Common examples:
E –> E + E | E * E | (E) | id
S –> iSeS | iS | a
1.by re-writing the grammar to introduce new intermediate
non-terminals that force the precedence that we desired.
2.try to build an SLR(1) table for this grammar.
3,try to build an LL(1) table for this grammar
途径 1 根据算符优先性和结合性重写文法
G([E]),
E –> E + E | E * E | (E) | id 重写为
E? E+T |T T? T*F|F F?(E) | id
G([S]):
S? if Expr then Stmt else Stmt | if Expr then Stmt | Other…
抽象为,
S –> iSeS | iS | a
重写为:
S –> M | U
M –> iMeM | a
U–> iS | iMeS
重写文法
LR方法应用二义文法
E? E + E
id + E
id + E * E
id + id * E
id + id * id
E? E * E
E + E * E
id + E * E
id + id * E
id + id * id
E –> E + E | E * E | (E) | id
Parsing input,id + id * id
I 0,E' –>?E
E –>?E + E
E –>?E * E
E –>?(E)
E –>?id
I 1,E' –> E?
E –> E? + E
E –> E? * E
I 2,E –> (?E)
E –>?E + E
E –>?E * E
E –>?(E)
E –>?id
I 5,E –> E*?E
E –>?E + E
E –>?E * E
E –>?(E)
E –>?id
I 6,E –> (E?)
E –> E? + E
E –> E? * E
I 7,E –> E + E?
E –> E? + E
E –> E? * E
I 3,E –>id?
I 4,E –> E+?E
E –>?E + E
E –>?E * E
E –>?(E)
E –>?id
I 8,E –> E * E?
E –> E? + E
E –> E? * E
I9,E –> (E)?
利用 precedence rules.利用 left(right)associativity.
不用重写文法的理由,
Dangling else.
Stmt –> if Expr then Stmt else Stmt |
if Expr then Stmt | Other…
which we rewrite for brevity as:
S –> iSeS | iS | a
Dangling else 文法
I 0,S' –>?S
S –>?iSeS
S –>?iS
S –>?a
I 1,S' –> S?
I 2,S –> i?SeS
S –> i?S
S –>?iSeS
S –>?iS
S –>?a
I 3,S –> a?
I 4,S –> iS?eS
S –> iS?
I 5,S –> iSe?S
S –>?iSeS
S –>?iS
S –>?a
I 6,S –> iSeS?
对状态 4的冲突,按照 C,Pascal 和
Java,规则,移进 e,让它和最近的
if-then结合,
LL(1)方法