Chapter 3
Chomsky Grammars
3.1 Formal Grammars
Formal grammars are a special form of a more general concept called a rewriting system. There are many
general rewriting systems. Some are Markov algorithms, Post productions, and Thue and semi-Thue systems.
These are not covered here, but the reader should be aware that the formal grammars presented here are those
studied by Noam Chomsky and can be viewed as restricted semi-Thue systems. Before showing the classes
of grammars and the corresponding languages generated by these Chomsky grammars, several definitions
along with notation are presented.
In the context of studying programming languages, we are interested in using these grammars not only for
describing what is and is not in a given programming language, but also for writing translators.
The grammar-related concepts of parsing, parse trees, ambiguity, and the relationship to translation and
compiler building are presented. It will be shown that two restricted forms of Chomsky grammars, namely
Context-Free and Regular, are the most useful for writing translators.
3.1.1
Generating Languages with Grammars
The definition given next is one for a general Unrestricted Grammar, also known as a Phrase-Structured
grammar.
Definition 3.1.1 Grammar:
A grammar is defined by four items represented as a quadruple, < VN , VT , P, S >, where:
• VN is a nonempty, finite set of symbols called nonterminals.
• VT is a nonempty, finite set of symbols called terminals such that VN ∩ VT = ∅.
• P ⊂ V + × V ∗ is a non-empty, finite set of production rules, where V = VN ∪ VT .
• S ∈ VN is the one and only designated symbol, called the start symbol.
Definition 3.1.2 Immediate (Direct) Derivation: If α → β is a production and ωαϕ is a string, where
ω, ϕ ∈ V ∗ and α ∈ V + , then ωαϕ ⇒ ωβϕ is an immediate (or direct) derivation.
Definition 3.1.3 Derives and Reduces Relations: A sequence of immediate derivations α1 ⇒ α2 , α2 ⇒
α3 , . . . , αn−1 ⇒ αn , usually written, α1 ⇒ α2 ⇒ . . . ⇒ αn−1 ⇒ αn is a derivation of length n. We say that
α1 derives αn and αn reduces to α1 ; similarly, αn is derivable from α1 and α1 is reducible from αn . We
represent this as α1 ⇒+ αn (n > 1); when derivations of length one (i.e. α1 = αn ) are also allowed, we write α1 ⇒∗ αn (n ≥ 1).
Definition 3.1.4 Sentential Forms and Sentences: A string α derivable from the start symbol, S, of a
grammar, G =< VN , VT , P, S >, is called a sentential form, i.e. S ⇒∗ α, where α ∈ V ∗ = (VN ∪ VT )∗ . A
string α, is a sentence iff α is a sentential form and α ∈ VT∗ .
Definition 3.1.5 Grammar-Generated L(G): Language, L(G), generated by grammar G, is defined as the
set of sentences:
L(G) = {α ∈ VT∗ | S ⇒+ α}
Generally, for definitions and abstract examples, we will continue to use the metasymbols '→' and '|'. In
addition, when the nonterminal and terminal sets are not explicitly listed, the following conventions will be
used:
• Nonterminal symbols: strings starting with upper-case Latin letters (e.g. A, B, C, Expr, . . . )
• Terminal symbols: strings starting with lower-case Latin letters, and strings of printable non-alpha
characters (excluding '|' and '→') (e.g. a, b, c, id, *, (, ), . . . )
• Lower-case Greek letters to denote strings in V ∗ (e.g. α, β, γ, . . .).
3.1.2 Examples of Grammars
Example 3.1 Binary Number Language:
Let G =< {N, A, B}, {., 0, 1}, P, N > where P :
N → A | A.A
A → B | AB
B → 0 | 1
Here, we have:
V = VN ∪ VT = {N, A, B, ., 0, 1}
N ⇒ A.A ⇒ AB.A ⇒ A1.A
A1.A is a sentential form, but it is not a sentence.
N ⇒ A.A ⇒ AB.A ⇒ A1.A ⇒ B1.A ⇒ 11.A ⇒ 11.B ⇒ 11.0
N ⇒∗ 11.0 so 11.0 is a sentential form and is also a sentence. After trying some derivations of sentences
you should see (intuitively) that L(G) is the language of binary numbers.
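The membership test just illustrated can be made concrete in code. Below is a minimal C recognizer for this binary-number language; the helper names bits and isBinaryNumber are illustrative assumptions, not from the text. It encodes the observation that a sentence derived from N is one or more bits (an A), optionally followed by '.' and one or more bits.

#include <stdio.h>

/* Recognizer sketch for L(G) of Example 3.1:  N -> A | A.A,  A -> B | AB,
   B -> 0 | 1.  Equivalently: one or more bits, optionally '.' and more bits. */
static int bits(const char **p)            /* matches A: a nonempty bit string */
{
    int n = 0;
    while (**p == '0' || **p == '1') { (*p)++; n++; }
    return n > 0;
}

static int isBinaryNumber(const char *s)   /* tests whether s is in L(G) */
{
    if (!bits(&s)) return 0;               /* N -> A ...          */
    if (*s == '.') {                       /* N -> A.A            */
        s++;
        if (!bits(&s)) return 0;
    }
    return *s == '\0';                     /* all input consumed  */
}

int main(void)
{
    printf("%d %d %d\n",
           isBinaryNumber("11.0"),         /* 1: a sentence                        */
           isBinaryNumber("11."),          /* 0: not in L(G)                       */
           isBinaryNumber("A1.A"));        /* 0: a sentential form, not a sentence */
    return 0;
}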
Example 3.2 Expression Grammar: E, T, F
Let G =< {E, T, F }, {id, (, ), ∗, +}, P, E > where P:
E → E + T | T
T → T ∗ F | F
F → (E) | id
This grammar is used to describe arithmetic expressions in many programming languages. E, T, and F
stand for Expression, Term, and Factor, respectively. Some derivations of sentences are shown below:
1. E ⇒ T ⇒ F ⇒ id

2. E ⇒ E + T ⇒ E + T ∗ F ⇒ E + F ∗ F ⇒ E + id ∗ F ⇒ E + id ∗ id ⇒ T + id ∗ id ⇒ F + id ∗ id
⇒ id + id ∗ id

3. E ⇒ E + T ⇒ E + F ⇒ E + (E) ⇒ E + (T ) ⇒ E + (T ∗ F ) ⇒ E + (T ∗ id) ⇒ E + (F ∗ id)
⇒ E + (id ∗ id) ⇒ T + (id ∗ id) ⇒ F + (id ∗ id) ⇒ id + (id ∗ id)

4. E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + F ⇒ id + (E) ⇒ id + (T ) ⇒ id + (T ∗ F )
⇒ id + (F ∗ F ) ⇒ id + (id ∗ F ) ⇒ id + (id ∗ id)
Note that the last two derivations produced the same sentence. Derivations 3 and 4 are called rightmost and
leftmost, respectively. In the right(left)most derivations the right(left)most nonterminal is replaced. The
formal definition for a rightmost derivation is as follows:
Let G =< VN , VT , P, S > be a grammar. The derivation
α0 ⇒ α1 ⇒ · · · ⇒ αn−1
(n > 1) is a rightmost derivation if and only if each direct derivation αi−1 ⇒ αi (1 ≤ i ≤ n − 1) is of the
form αi−1 = ϕAω and αi = ϕβω for some production A → β, where ω ∈ VT∗ , A ∈ VN , and ϕ ∈ V ∗ .
Remember V = VN ∪ VT . A leftmost derivation is defined similarly; just swap the roles of ϕ and ω.
3.2 The Chomsky Hierarchy
Up to this point the reader may have observed that the productions of all the examples have been restricted
forms of that given in the definition of a grammar: each left-hand side of the productions has consisted
only of a single nonterminal. Grammars with these types of productions are called context-free grammars
(CFGs). Computer scientists have found that CFGs are very useful in describing many artificial languages
used in computer science, especially programming languages. The other Chomsky grammars are also very
important to computer scientists and other researchers who find their research intersecting the general areas
of theory of computation and linguistics — there are too many specific areas to begin listing them.
In this section we will examine four types of grammars that make up the Chomsky hierarchy and mention
a few of the relationships between them.
3.2.1 Definitions of the Grammar Classes
We begin by considering the grammar given in the definition of grammars as the most general type and
proceed by imposing more restrictions on the production forms. Note that the type-0 grammar is identical
to the original definition of a grammar given in Definition 3.1.1. For the more restrictive grammar types,
1, 2, and 3, we add the condition that the only production allowed to have ϵ on the right-hand side is one
with the start symbol alone on the left-hand side; no other production can have ϵ on the right-hand side.
There are four classes of Chomsky grammars, each adding more constraints on the production rule form.
Definition 3.2.1 Phrase-Structured Grammar (type 0):
A phrase-structured grammar (PSG) has productions of the form:
α → β
where α ∈ V + and β ∈ V ∗ . This type of grammar is like a semi-Thue system with the designated string
restricted to the start symbol; this type 0 grammar form is also called unrestricted in many computer science
and mathematics journals and textbooks.
If G is a PSG, then L(G) is a phrase-structured language (PSL).
Definition 3.2.2 Context-Sensitive Grammar (type 1):
A context-sensitive grammar (CSG) has productions of the form:
α → β
where α ∈ V + , α ̸= β, and |β| ≥ |α|.
Note that since β must be at least as long as α, obviously β ∈ V + . This length restriction classifies this
type of grammar as non-contracting.
The reason CSGs are called “context-sensitive” is that each CSG production set can be transformed
into productions that generate the same language but have the following form:
ϕ1 Aϕ2 → ϕ1 ωϕ2
where ϕ1 , ϕ2 ∈ V ∗ and ω ∈ V + . In other words nonterminal A can only be replaced in the “context” of ϕ1
and ϕ2 . We will not prove that all CSGs can be written in this form.
If G is a CSG, then L(G) is a context-sensitive language (CSL).
Definition 3.2.3 Context-Free Grammar (type 2):
A context-free grammar (CFG) has productions of the form:
A → β
where A ∈ VN and β ∈ V + .
Note that the productions of CFGs are special cases of CSG productions: simply let ϕ1 = ϵ and ϕ2 = ϵ.
Also note that all of the examples presented in the preceding sections are CFGs.
If G is a CFG, then L(G) is a context-free language (CFL).
Definition 3.2.4 Regular (or linear) Grammar (type 3): A regular grammar (RG) is either a left or right
linear grammar. A left linear grammar is a restricted CFG that has productions only of the form:
A → Ba or A → a
where A, B ∈ VN and a ∈ VT .
A right linear grammar has productions only of the form:
A → aB or A → a
where A, B ∈ VN and a ∈ VT .
If G is an RG, L(G) is a regular language (RL).
We need to discuss languages that contain the empty string, ϵ. For these languages we may derive it
simply as S ⇒ ϵ, where S ∈ VN is the start symbol, no matter what type of grammar. For describing some
languages, it may be a matter of convenience to allow some productions of the form A → ϵ, where A ∈ VN
and A is not the start symbol. However, it can be shown that if grammar G1 has n > 0 ϵ-productions, where
ϵ may or may not be in L(G1 ), then there exists a grammar G2 where G2 has no ϵ-productions and where
L(G2 ) = L(G1 ). If ϵ ∈ L(G1 ), then one must include in G2 the production S → ϵ, where S is the start
symbol.
For the remainder of this textbook, we will assume that ϵ-productions are allowed for grammars. Just
remember that the remaining CSG productions may not contract.
3.2.2 Examples of Classifying Grammars
When classifying a grammar within the context of the Chomsky hierarchy, always begin by trying to see
if every one of the grammar’s production rules has the form of a regular grammar. If it does not, check
whether it is a CFG. If not, then check whether it is a CSG. Remember that every Chomsky grammar is a
PSG and that RGs are the smallest, most restrictive class; always start with the smallest class, RGs, and
“work your way up” to the most general class.
Note that if there is more than one symbol, or if there is a terminal, on the left-hand side of any production,
then it cannot be an RG or CFG. To help distinguish between CSGs and PSGs remember that PSGs can
have productions α → β with |α| > |β|; i.e. PSGs may be contracting grammars while CSGs are non-contracting.
This can help reduce the amount of work classifying a Chomsky grammar.
Example 3.3 Context-Sensitive Grammar
A context-sensitive grammar, G2:
S → cE | cS | dS | b
cE → cdS | cd
The last two productions do not fit the form of CFG production rules. Thus, we must check to see if they
fit the CSG form. Recalling the definition of CSGs, all CSG productions are non-contracting or have the
following form: ϕ1 Aϕ2 → ϕ1 ωϕ2 , where ϕ1 , ϕ2 ∈ V ∗ and ω ∈ V + . Grammar G2 is non-contracting so it is
a CSG. Moreover, it fits the second form also. Consider the production cE → cdS. We see that if ϕ1 = c,
A = E, ϕ2 = ϵ, and ω = dS, the production fits the form. Similarly, the last production also fits the form if
the same assignments are made with the exception of ω = d.
Example 3.4 Context-Free Grammar
A context-free grammar, G3:
S → bDc | Db | b
D → bbDc | b | ϵ
Grammar G3 is not an RG because the first and fourth productions do not fit the form of an RG. Since we
are allowing ϵ-productions in our CFG production forms all productions in G3 do fit the form of a CFG. Try
describing the language L(G3).
Example 3.5 Regular Grammar
A regular grammar, G4:
S → aU | bV
U → bS | b
V → aS | a
The grammar G4 is a right linear grammar. An RG is either right linear or left linear and does not mix
productions of the two types.
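The classification procedure just described can be sketched in C and applied to G2 and G4 above. In this sketch the encoding is an illustrative assumption, not from the text: a production is a pair of strings, uppercase letters stand for nonterminals, lowercase letters for terminals, and the special treatment of ϵ-productions is ignored for simplicity.

#include <stdio.h>
#include <string.h>
#include <ctype.h>

/* "Work your way up": test RG first, then CFG, then CSG, else PSG. */
typedef struct { const char *lhs, *rhs; } Prod;

static int isRightLinear(const char *r)        /* a  or  aB */
{
    size_t n = strlen(r);
    return (n == 1 && islower((unsigned char)r[0])) ||
           (n == 2 && islower((unsigned char)r[0]) && isupper((unsigned char)r[1]));
}

static int isLeftLinear(const char *r)         /* a  or  Ba */
{
    size_t n = strlen(r);
    return (n == 1 && islower((unsigned char)r[0])) ||
           (n == 2 && isupper((unsigned char)r[0]) && islower((unsigned char)r[1]));
}

static const char *classify(const Prod *p, int n)
{
    int rgRight = 1, rgLeft = 1, cfg = 1, csg = 1;
    for (int i = 0; i < n; i++) {
        int lhs1 = strlen(p[i].lhs) == 1 && isupper((unsigned char)p[i].lhs[0]);
        if (!lhs1) cfg = 0;                               /* lhs must be one nonterminal */
        if (!lhs1 || !isRightLinear(p[i].rhs)) rgRight = 0;
        if (!lhs1 || !isLeftLinear(p[i].rhs))  rgLeft = 0;
        if (strlen(p[i].rhs) < strlen(p[i].lhs)) csg = 0; /* contracting production      */
    }
    if (rgRight || rgLeft) return "Regular (type 3)";
    if (cfg) return "Context-Free (type 2)";
    if (csg) return "Context-Sensitive (type 1)";
    return "Phrase-Structured (type 0)";
}

int main(void)
{
    Prod g2[] = { {"S","cE"}, {"S","cS"}, {"S","dS"}, {"S","b"},
                  {"cE","cdS"}, {"cE","cd"} };
    Prod g4[] = { {"S","aU"}, {"S","bV"}, {"U","bS"}, {"U","b"},
                  {"V","aS"}, {"V","a"} };
    printf("G2: %s\n", classify(g2, 6));   /* prints: Context-Sensitive (type 1) */
    printf("G4: %s\n", classify(g4, 6));   /* prints: Regular (type 3)           */
    return 0;
}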
3.2.3 Relationships of the Grammar and Language Classes
As discussed and illustrated earlier, the four types of Chomsky grammars are identified by the form of their
productions. Regular grammar productions are special cases of context free productions and context free
productions are special cases of context sensitive productions. All productions are phrase structured. Each
class of grammars has a corresponding class of languages. Given a language, L, one is not only interested
in constructing a grammar G such that L = L(G), but also constructing a grammar of the most restrictive
type possible. This raises the questions: are there languages that no regular grammar can generate? Are
there languages that no context free grammar can generate? The answers to these questions are all yes. To
discuss this we introduce some definitions and notation and then look at some examples.
Let Li be the class of languages that can be generated by a grammar of type i, i ∈ {0, 1, 2, 3}.
It is beyond the scope of this textbook, but it can be shown that:
L3 ⊂ L2 ⊂ L1 ⊂ L0
Example 3.6 Three Languages
Shown below are three languages with their respective types and sample grammars.

Language L(G)                  Type                 Grammar G
{aⁿbᵐ | n > 0, m > 0}          Regular              S → aA
                                                    A → aA | bB | b
                                                    B → bB | b
{aⁿbⁿ | n > 0}                 Context Free         S → aSb | ab
{aⁿbⁿcⁿ | n > 0}               Context Sensitive    S → aSBC | aBC
                                                    CB → BC
                                                    aB → ab
                                                    bB → bb
                                                    bC → bc
                                                    cC → cc
Suppose that a software engineer is given a language L and must find a grammar G, such that L = L(G).
How does the software engineer know that they have found the most restrictive grammar type that can
generate language L? Computer scientists have a collection of theorems and lemmas that can identify the
smallest class of languages to which a language belongs. If the smallest language type is known, grammars
of a particular form can be pursued and properties can be exploited. Once the language type is known the
appropriate known algorithms and models of computation can be utilized in solving the language membership
decision problem: “α ∈ L(G)?” Some of these algorithms try to construct a derivation of the string
being tested for membership in the usual sense by starting with the start symbol and reading one symbol at
a time in one direction (usually left-to-right) with possibly looking k ≥ 0 symbols ahead. Alternatively, one
can re-construct the derivation “backward” through a series of reductions while reading the string in one
direction with possibly looking ahead.
Another approach to testing membership is to utilize a model of computation, such as a finite state
automaton or push-down automaton. Later we will see that finite state machines can be used for testing
string membership in regular languages. Push-down automata can be used for testing string membership
in context-free languages. It should be pointed out that linear-bounded Turing machines can be used for
testing string membership in context-sensitive languages and Turing machines for testing membership of any
language. We will see that these machines can be represented as formal rewrite rules and also have direct
relationships with the corresponding grammars in that there are algorithms for translating between the two
types of formalisms.
These membership algorithms are referred to as parsers. An important underlying abstraction used
implicitly and, sometimes explicitly, by these parsers is a parse tree.
3.3 Parse Trees
As mentioned earlier, to solve the problem of testing α ∈ L(G), one tries to construct a derivation of the
string α using productions of G. The construction of a derivation is called parsing. A parse of some sentence
(or more generally a sentential form) shows how the sentence (or sentential form) was derived. A parser
is simply the name for the procedure that performs the construction of the derivation. Each parse has a
corresponding parse tree, which we define next.
3.3.1
Terminology
A directed acyclic graph is a directed graph with no cycles. A tree is a directed acyclic graph with a unique
path from a designated node called the root to each of the other nodes. A node n1 is a descendant of another
node n2 if there is a directed path from n2 to n1 . Leaf nodes have no descendants and the root node
is not a descendant of any node. A node that is neither a root node nor a leaf node is an interior node. A
node n1 is a child of node n2 if (n2 , n1 ) is an edge (branch) in the tree. An ordered tree has a linear order
on the children of every node.
Figure 3.1: A → X1 X2 . . . Xm (an interior node labeled A with ordered children X1 , X2 , . . . , Xm )
Definition 3.3.1 Parse Tree: Given a CFG, G =< VN , VT , P, S > an ordered tree is a parse tree if the
following conditions are met:
1. The root is labeled by the start symbol.
2. A leaf node is labeled by a terminal or ϵ.
3. Each interior node is labeled by a nonterminal.
4. If A is an interior node and X1 , X2 , . . . , Xm are the children of the node, then A → X1 X2 . . . Xm is a
production of G, where Xi ∈ V ∪ {ϵ}. See Figure 3.1.
Note that we have defined parse trees for CFGs. Since the usage of one production rule in a particular
context might modify the context of applying another production rule, trees are not usually used as a means
of studying context-sensitive grammars. The order of nonterminal substitution in a CFG derivation does
not matter, whereas for CSGs it does matter in some cases.
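Definition 3.3.1 maps directly onto a tree data structure. The C sketch below shows one minimal representation; the names Node, frontier, and MAX_CHILDREN are illustrative assumptions, not from the text. Reading the leaves left to right (the frontier) recovers the derived string, here for the derivation E ⇒ T ⇒ F ⇒ id.

#include <stdio.h>

/* A parse-tree node per Definition 3.3.1: interior nodes carry nonterminals,
   leaves carry terminals or eps.  MAX_CHILDREN is an illustrative bound. */
#define MAX_CHILDREN 8

typedef struct Node {
    const char  *label;                   /* nonterminal, terminal, or "eps" */
    int          nchildren;               /* 0 for leaf nodes                */
    struct Node *child[MAX_CHILDREN];     /* ordered left to right           */
} Node;

/* The frontier (leaves read left to right) spells the derived string. */
static void frontier(const Node *n)
{
    if (n->nchildren == 0) { printf("%s ", n->label); return; }
    for (int i = 0; i < n->nchildren; i++) frontier(n->child[i]);
}

int main(void)
{
    Node id = { "id", 0, {0} };           /* parse tree of E => T => F => id */
    Node F  = { "F",  1, { &id } };
    Node T  = { "T",  1, { &F } };
    Node E  = { "E",  1, { &T } };
    frontier(&E);                         /* prints: id */
    printf("\n");
    return 0;
}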
3.3.2 A Parse Tree Example
Example 3.7 Parse Tree for id + id ∗ id using EFT-Grammar
Consider the expression grammar:
E → E + T | T
T → T ∗ F | F
F → (E) | id
The parse tree for the derivation of the sentence id + id ∗ id is shown in Figure 3.2. The derivation of
id + id ∗ id is:
E ⇒ E + T ⇒ E + T ∗ F ⇒ E + F ∗ F ⇒ E + id ∗ F ⇒ E + id ∗ id ⇒ T + id ∗ id ⇒ F + id ∗ id
⇒ id + id ∗ id
Try doing a rightmost derivation then a leftmost derivation of the same sentence: id + id ∗ id. Draw and
compare the parse trees for each of the derivations.
Figure 3.2: Parse tree for the derivation of id + id ∗ id
3.3.3 Ambiguity
In the previous Example (3.7) you should have discovered that each derivation corresponded to the same
parse tree. Also, you should note that there is only one left- and rightmost derivation for that sentence.
Some grammars, however, may allow more than one parse tree for a sentence or more than one left- or
rightmost derivation. This is referred to as an ambiguous grammar. We define ambiguity formally as:
Definition 3.3.2 Ambiguous Sentences, Grammars, and Languages: An ambiguous sentence in L(G) is
a sentence that has two or more distinct leftmost derivations with respect to grammar G. An ambiguous
grammar is a grammar that generates at least one ambiguous sentence. A language is (inherently) ambiguous
if and only if there are no unambiguous grammars that generate the language.
NOTE: We could have used “rightmost derivations” or “parse trees” in place of “leftmost derivations.”
Each definition would be equivalent. Note that a sentence may still be considered ambiguous even if the two
corresponding unlabeled parse trees have the same structure, i.e. a different labeling of the interior nodes
indicates ambiguity.
Example 3.8 Structural Ambiguity
Consider the following grammar:
G =< {E}, {∗, +, id}, P, E > where P is:
E → E + E | E ∗ E | (E) | id
This grammar generates the same language as the grammar of Example 3.7. One of the differences between
the two grammars is that this grammar is ambiguous and the one in Example 3.7 is not. For example, using
grammar G in this example, the sentence id + id ∗ id has at least two leftmost derivations:
1. E ⇒ E + E ⇒ id + E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

2. E ⇒ E ∗ E ⇒ E + E ∗ E ⇒ id + E ∗ E ⇒ id + id ∗ E ⇒ id + id ∗ id

Figure 3.3: Structural ambiguity. Two structurally distinct parse trees for id + id ∗ id: (a) with + at the
root, and (b) with ∗ at the root.
The parse trees in Figures 3.3(a) and (b) correspond to derivations 1 and 2, respectively. This example
illustrates a sentence with two structurally distinct parse trees. Compare these trees to the one in Figure 3.2.
The parse tree in Figure 3.3(a) is the structure that corresponds to the semantics or evaluation used in
mathematics. When defining the semantics or meaning of programs, one must keep in mind the underlying
intended semantics when designing a grammar for the syntax. Many translators build parse trees when
performing the syntax analysis and attach semantic attributes to the nodes for use by other components of
the compiler, such as the code generator.
Example 3.9 Labeling Ambiguity
Consider the following grammar:
G =< {S, A}, {a, b}, P, S >
where P is:
S → Sa | Aa | b
A → Aa | b
The sentence baa can be shown to be ambiguous by looking at the two parse trees for it shown in Figure 3.4.
This illustrates labeling ambiguity.
3.4 MetaLanguages for Presenting Grammars
The prefix, “meta-”, in this context is used for symbols or syntactic structures that are used to describe,
express, and define grammars, which, in turn, express the syntactic structure of languages. In this section,
several metanotations that allow designers, implementors, and programmers to communicate the syntactic
structure of programming languages are shown. The material covered in this section provides the background
needed for designing and implementing parsers for programming languages.
Figure 3.4: Labeling ambiguity: two parse trees for baa with the same unlabeled structure but different
interior-node labels (S versus A)
3.4.1 BNF Notation
Backus-Naur Form (BNF) is a metasyntactic notation for expressing grammars. It was originally introduced
by two computer language and compiler designers, John Backus and Peter Naur. Emil Post's notation for
his production-system model of computation provided a starting point for John Backus to utilize an early
version of BNF for describing a language, called IAL, that would later evolve into Algol 58. Later Peter Naur utilized
it for describing Algol 60. Although John Backus is most known for his leadership on the design team for
the first designs of Fortran in the mid 1950’s, he and Peter Naur were on the design team for Algol 60. For
details and the history of the development of BNF, the interested reader should consult [1] and [9].
A variation of BNF has been utilized earlier in the text for presenting grammar-related and language-
related concepts. Thus far, for the purposes of definitions, theorems, and examples illustrating the definitions,
the metasymbols → and | have been utilized. Uppercase and lowercase letters have been used to represent
nonterminals and terminals, respectively. For “real” programming languages, more meaningful names, such
as parameter-list as opposed to A, need to be used by language designers and implementors when writing
and reading grammars. In order to distinguish between terminals and nonterminal symbols, the strings
representing nonterminals are simply surrounded with a pair of angle brackets, < >. Several metanotations
are used for terminals. With the ability to distinguish between nonterminals and terminals, one need not
explicitly list the elements of the nonterminal and terminal sets, VN and VT , respectively. One can express
the four sets that comprise a grammar, < VN , VT , P, S >, by simply listing the production rule set in BNF
and putting the production rule(s) with the start symbol as the left-hand side symbol as the first one(s)
listed. A summary of the BNF metasymbols and their meanings is shown in Table 3.1.
When BNF was first introduced in the late 1950's, there were not many fonts and typefaces available
on most printers used within general purpose computer systems; even if one could interchange the font and
typeface, the ability to mix fonts and typefaces during the printing of one file was almost non-existent at
that time. Word processors with bold and underline functions were not widely available. So, conventions such
as quoting terminals, or treating strings without surrounding angle brackets as terminals, were also adopted.
Usually one font was used with a single character set, such as ASCII. So, many of these notational
conventions were adopted because of this. For example, ::=1 was adopted instead of →, since ASCII does
not include →, while ::= consists only of ASCII characters.

MetaSymbol                            Meaning
::= or →                              delimits left-hand side from right-hand side
|                                     “or”
< >                                   surrounds nonterminal names
Unaltered strings not between < >     terminals
Underlined string                     terminals
Bold-faced string                     terminals
Single quotes: 'string'               terminals
Double quotes: "string"               terminals

Table 3.1: BNF Metasymbols
Note that when metasymbols are part of the language generated by a grammar expressed in BNF, single
or double quotes must be used for those symbols. If single or double quotes are part of the language, then
two consecutive single or double quotes, respectively, are often used when expressing a grammar in BNF for
that language. For example, two consecutive double quotes, "", used when presenting a grammar in EBNF
specify a single double-quote terminal symbol, ". This is analogous to the escape conventions C and C++
programmers use to write a single-quote character literal or a string literal with an embedded double quote.
For some purposes it is also convenient to mix notations for terminals in a set of productions. For
example, one might put single quotes around single character terminals, underline reserved words and/or
leave operator symbols unquoted in the same EBNF-specified production rule set.
As one can see there are many metanotations for even simple BNF. One must keep in mind that notational
consistency provides the best readability and understandability of the grammar and, in turn, of the language
generated from that grammar. If you are designing a language and/or a grammar for a language, it is always
best to supply a legend for your readers, the programmers and implementors.
1 A personal speculation:
because = and := were terminals in Algol-like languages at the time, ::= seemed to be the most
natural metasymbol.
Example 3.10 Expression Grammar in BNF
In this example a grammar for representing the syntactic structure of simple expressions used in many
imperative languages is shown. What the types of the expressions are or what the operator symbols actually
represent is not known; this only expresses the syntax.
< Expr > ::= < Expr > + < Term > | < Expr > − < Term > | < Term >
< Term > ::= < Term > ∗ < Factor > | < Term > / < Factor > | < Factor >
< Factor > ::= < Primary > | + < Primary > | − < Primary > | (< Expr >)
< Primary > ::= identifier | number
From the BNF notation the set of terminals and nonterminals can be inferred:
VT = {+, −, ∗, /, (, ), identifier, number}
VN = {Expr, Term, Factor, Primary}

3.4.2 EBNF
BNF can be extended by introducing additional metasymbols so that production rules can be combined and
reduced. There are many different extensions and a common set of symbols along with their meanings is
summarized in Table 3.2. The variable α represents any legal right-hand side of a BNF or EBNF rule.

MetaSymbols Added to BNF     Meaning (α is an EBNF expression)
[α]                          α is optional
{α}                          α can be repeated 0 or more times
{α}+                         α can be repeated 1 or more times
(α)                          groups α into an EBNF subexpression

Table 3.2: EBNF's Additional Metasymbols (BNF + additional metasymbols = EBNF)

One can view the right-hand side of a production rule as an expression with these additional metasymbols as
operators. The precedence rules for these additional metasymbol operators are described by the following
metagrammar, expressed in EBNF, in Figure 3.5. From this grammar one can see that the metasymbol, |,
has lower precedence than concatenation, which is implicit in the juxtaposition of EBNF expressions. The
metasymbol pairs, (), [], and {}, are used to form subexpressions, with the parentheses having no operational
meaning; the parentheses, (), can simply be used to alter the default precedence.
< EBNF >        →  < A > { '|' < A > }
< A >           →  < B > { < B > }
< B >           →  '[' < C > ']'  |  '{' < C > '}'  |  '(' < EBNF > ')'  |  < terminal >  |  < nonterminal >
< terminal >    →  string | ""string"" | ''string'' | string
< nonterminal > →  '<' string '>'

Figure 3.5: Metagrammar for EBNF written in EBNF
Example 3.11 Expression Grammar in EBNF
In this example we utilize EBNF notation and rewrite Example 3.10.
< Expr > ::= < Term > { (+ | −) < Term > }
< Term > ::= < Factor > { (∗ | /) < Factor > }
< Factor > ::= [+ | −] < Primary > | '(' < Expr > ')'
< Primary > ::= identifier | number

Note that the parentheses are quoted since they are metasymbols in EBNF. As with the BNF used in
Example 3.10, the EBNF allows one to distinguish nonterminal symbols from terminals.
VT = {+, −, ∗, /, (, ), identifier, number}
VN = {Expr, Term, Factor, Primary}

3.4.3 Syntax Diagrams
Just as finite state diagrams give a two-dimensional representation of a finite state machine, context free
grammars can be depicted visually by utilizing a set of diagrams. Both types of diagrams are intended
to aid programmers with understanding the syntactic structure of a given language and to aid compiler
implementors with designing parsers, especially top-down parsers, of a given language. Although all context-
free grammars can be represented with syntax diagrams, traditionally syntax diagrams have been utilized for
designing a particular subset of top-down parsers, called recursive-descent parsers. Before discussing this, it
is shown how to represent a grammar G as a set of syntax diagrams.
Each syntax diagram represents a group of related production rules, namely ones with the same left-hand
side nonterminal. Each group is called a syntactic category. Using the BNF (or EBNF) metasymbol, |, we
combine rules that are not already combined as follows. Production rules

A → α1
A → α2
...
A → αn

are grouped into:

A → α1 | α2 | . . . | αn
This is done for each nonterminal. Each syntactic category, for example A, is represented by a syntax
diagram. The diagram is named by the name of the syntactic category it represents.
Toward a more formal characterization we note that each syntax diagram is a node-labeled, directed
graph. We will see that a finite state diagram is a node-labeled, edge-labeled, directed graph. Whereas a
finite state machine has one corresponding diagram (graph), a grammar is represented by possibly many
syntax diagrams (graphs). In a finite state diagram the nodes are labeled with state names and represented as
circles; transitions are directed edges, each labeled by one symbol from the terminal set. In
a syntax diagram there are three types of nodes, junction nodes, terminal nodes, and nonterminal nodes.
Conventionally junction nodes are implicit in the drawing, a terminal node is represented by an ellipse and
labeled with a terminal (token) symbol, and a nonterminal node is represented by a rectangle and labeled
with a nonterminal symbol. All edges in a syntax graph are unlabeled, directed edges.
When using a finite state diagram, one can trace a path through the state diagram by reading the input
string and following the labels of the transition edges. Parsing the input string requires following a directed
path through a syntax diagram and may require jumping between several syntax graphs, each of which
corresponds to a group of production rules. Informally, when traversing the syntax graph, if one encounters
a terminal node, it requires matching the next input symbol. In such a case, this requires moving the input
pointer to the next input symbol and continuing down a path in the graph. Upon encountering a nonterminal
rectangle, one jumps to the syntax graph named by that nonterminal (maybe even the same graph in the
case of a direct recursive rule); after traversing that graph and, possibly, several others, one returns and
continues on the edge emanating from the rectangle. One can convert each syntactic category into a syntax
graph by putting each right-hand side as a branch at a junction node. Each right-hand side is a sequence of EBNF
elements: terminals, nonterminals, optional elements, iterative elements, and nested alternative elements.
The junction nodes represent forks in the road where decisions have to be made, or they represent the
merging of two or more paths.
Example 3.12 Syntax Graphs of a Simple Grammar
Consider the grammar G with start symbol S, written in EBNF:

< S > → (a | b) < A > {b} | c < B > d
< A > → c < A > d | e f [g]
< B > → {cd}

The corresponding syntax graphs are:
[Syntax graphs for S, A, and B]
Tracing a path of edges in a syntax graph is equivalent to parsing an input string. For example, the string
ccdcdd may be parsed by tracing a path through the syntax diagrams. One starts with graph S. The first
symbol c matches the terminal node labeled c, one moves past that node to the rectangle node labeled with
nonterminal B, and the input pointer advances to the next symbol in the string, which is the second c. Because
a rectangle B is encountered, one jumps to syntax diagram B. There is a path that can be taken that goes
through a fork junction node and leads to a match with another terminal node c. Continuing further down,
the input symbol d is read and is matched against the next terminal node labeled d. One continues around,
passing through the merge junction node and then passing through the fork junction node to match the next
symbol c, then d. Finally, when reading the last symbol of the input string, d, there is no match at the fork
junction point, so the B diagram is exited back to the diagram from which the jump originated. In this case,
that diagram is S, and one returns to the back side of the B rectangle and traces a path from there. The
terminal node d is matched and the string is accepted since the starting graph can be exited.
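The graph-to-graph jumping in the trace above corresponds directly to function calls. The following C sketch implements one recognizer function per syntax graph of Example 3.12; the global input cursor in and the helper accept are assumptions made for illustration, not from the text.

#include <stdio.h>

/* One recognizer function per syntax graph: S, A, B. */
static const char *in;                       /* input cursor */
static int S(void), A(void), B(void);

static int S(void)
{
    if (*in == 'a' || *in == 'b') {          /* branch: (a|b) A {b} */
        in++;
        if (!A()) return 0;
        while (*in == 'b') in++;             /* {b}: zero or more b's */
        return 1;
    }
    if (*in == 'c') {                        /* branch: c B d */
        in++;
        if (!B()) return 0;
        if (*in != 'd') return 0;
        in++;
        return 1;
    }
    return 0;
}

static int A(void)
{
    if (*in == 'c') {                        /* branch: c A d (direct recursion) */
        in++;
        if (!A()) return 0;
        if (*in != 'd') return 0;
        in++;
        return 1;
    }
    if (*in == 'e') {                        /* branch: e f [g] */
        in++;
        if (*in != 'f') return 0;
        in++;
        if (*in == 'g') in++;                /* [g]: optional */
        return 1;
    }
    return 0;
}

static int B(void)
{
    while (*in == 'c') {                     /* {cd}: zero or more cd pairs */
        in++;
        if (*in != 'd') return 0;
        in++;
    }
    return 1;
}

static int accept(const char *s) { in = s; return S() && *in == '\0'; }

int main(void)
{
    printf("%d %d\n", accept("ccdcdd"), accept("aefgb"));   /* prints: 1 1 */
    return 0;
}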
To convert a set of production rules written in EBNF notation, intermediate graphs are created. Before
beginning the conversion, the EBNF grammar is put into a “canonical form” by grouping all context-free
productions with the same left-hand symbol. For each nonterminal A, group all rules with A as the left-hand
side in the following form:
A ::= α1 | α2 | . . . | αn
Let the group be called a syntactic category A. The right-hand side of each production consists of an EBNF
expression, which in turn, consists of subexpressions. In the following algorithm, a syntax graph is recursively
constructed from each production’s right-hand side EBNF expression. The base case is represented by
converting terminals and nonterminals to their respective graphs.
1. For each syntactic category, A, with production rule, A → α1 | α2 | . . . | αn , let γ represent the
right-hand-side EBNF expression and recursively construct a syntax diagram, named A, according to
rules 2-8 by converting the underlying subexpressions of γ as structured by the metagrammar for EBNF.

2. If γ = t, where t is a terminal, the diagram is a single terminal (ellipse) node labeled t.

3. If γ = X, where X is a nonterminal, the graph is a single nonterminal (rectangle) node labeled X.

4. If γ = α1 | α2 | . . . | αn , the syntax graph for γ is a fork junction node with one parallel branch for the
graph of each αi ; the branches merge again at a junction node. Each EBNF subexpression, αi , is
converted to a syntax graph by recursively applying rules 2-8.

5. If γ = α1 α2 . . . αn , the syntax graph for γ is the graphs of α1 , α2 , . . . , αn connected in sequence. Each
EBNF subexpression, αi , is converted to a syntax graph by recursively applying rules 2-8.

6. If γ = [α], the syntax graph is the graph for α together with an edge that bypasses it (α is optional).
The EBNF subexpression, α, is converted to a syntax graph by recursively applying rules 2-8.

7. If γ = {α}, the syntax graph is the graph for α together with edges that allow it to be bypassed or
repeated (α occurs zero or more times). The EBNF subexpression, α, is converted to a syntax graph
by recursively applying rules 2-8.

8. If γ = (α), the syntax graph is simply the graph for α, converted by recursively applying rules 2-8.
An example of a grammar for a subset of Pascal expressed as a set of syntax diagrams is shown below:

[Mini-Pascal Syntax Diagram: diagrams for PROGRAM, BLOCK, CONSTDEC, VARDEC, TYPE,
STATEMENT, EXPRESSION, SIMPLE EXPRESSION, TERM, and FACTOR]
Chapter 4
Parsers
We have seen how simulations of finite state machines are used for the decision problem of testing whether a
string is in a regular language. Now we look at algorithms that implement the decision problem of testing whether a
string is in a context-free language. These algorithms are called parsers. Besides implementing this decision
problem, parsers are used for translation and uncovering the syntactic structure of the sentences of a context
free language. In essence, parsers recreate a derivation of a sentence and can be used to create parse trees.
Algorithms for parsing context-free languages can be categorized into three general categories: universal,
top-down, and bottom-up. Context-free grammar production rules play a role in designing parsing algorithms.
The universal parsing algorithms can parse sentences from context-free languages generated by any context-
free grammar, with some grammars first transformed into canonical or normal forms. Two examples of universal
algorithms are the Cocke-Younger-Kasami (CYK) algorithm and Earley’s algorithm. For a sentence of length
n, the CYK algorithm has time-complexity of O(n3 ) and space complexity of O(n2 ). Earley’s algorithm
has time-complexity of O(n3 ) for general context free grammars and O(n2 ) for unambiguous context free
grammars.
Top-down and bottom-up parsers require linear time. Top-down parsers build parse trees from the root
(start non-terminal) downward to the leaves, where the leaves are labeled with the terminal symbols of the
sentence. Bottom-up parsers build the complete parse tree for a sentence, in essence, by creating a forest of trees each
representing phrases of the sentence. We shall see that top-down parsing and bottom-up parsing correspond
to leftmost derivations and rightmost derivations, respectively. Furthermore, bottom-up parsing constructs
the rightmost derivation in reverse. For both the top-down and bottom-up parsing algorithms, decisions
about what production to use are based upon the next input symbol or up to k > 0 upcoming symbols.
While these parsers are linear in time, they will not work on all grammars. It turns out that bottom-up
parsers can handle more grammars than top-down parsers can.
In this section we first look at top-down parsers and a specific subcategory called recursive-descent parsers,
and then look at another subcategory of predictive top-down parsers, which are table-driven. Recursive-
descent parsers may need to use backtracking, but if a grammar meets certain conditions, a non-backtracking
recursive descent parser can be constructed. We look at two important set constructions, First and Follow,
for defining those constraints. Recursive descent parsing is presented first since it is straightforward to
create such a parser “by hand” using a compliant grammar.
After being introduced to recursive-descent parsers and before covering predictive top-down parsers, we
look at some strategies for designing algorithms for transforming grammars so they can be used not only
for top-down parsers, but also for bottom-up parsers. We revisit ambiguous grammars and strategies for handling
ambiguities in parsers and how to create unambiguous grammars from ambiguous grammars, such that they
generate the same language. We also look at algorithms for removing left-recursive rules and transforming
grammars by performing left factoring to construct a grammar appropriate for top-down parsers.
4.1 Top-Down Parsing
The family of top-down parsers is based upon building a parse tree from the root and extending the tree
by creating interior (non-terminal) nodes and leaves as the input string is scanned. Top-down parsers
build the tree in a manner similar to a depth-first or preorder traversal. Consider a tree node with a non-
terminal label A that is the next node to be extended by applying an immediate derivation. If A is on the
left-hand side of more than one production, the decision about what production to use is based upon the
next input symbol or by looking k > 0 symbols ahead in the input. The building of the parse tree in a
preorder manner corresponds to a left-most derivation.
There are two general categories of top-down parsers: recursive-descent parsing with and without back-
tracking. Recursive descent without backtracking is called predictive parsing. Predictive parsers are called
LL(k), where the first L indicates Left-to-right reading of the input string of symbols, the second L indicates
that the parser is based upon the reconstruction of a Leftmost derivation, and the k indicates that the parser
may look up to k symbols ahead in the input to decide what to do next, such as read past the next symbol
or apply another immediate derivation. One may also say that a grammar G is LL(k) if an LL(k) parser
could be used for grammar G. LL(k) parsers can be table-driven. Algorithms for constructing the table and
the execution of a state machine with memory that uses the table will be shown later in this section.
We start examining top-down parsing by studying recursive descent parsing, which may include back-
tracking when simulating or actually doing a left-most derivation or actually generating a parse tree. If
the grammar is LL(k), a recursive-descent parser can be written without backtracking. Most programming
language grammars can be designed so that a recursive descent parser can be written without backtracking, or
an LL(1) parser can be created.
Consider the following grammar G′ETF :

E → T E′
E′ → +T E′ | ϵ
T → F T′
T′ → ∗F T′ | ϵ
F → (E) | id

Using G′ETF , the left-most derivation of id + id is:

E ⇒lm T E′ ⇒lm F T′E′ ⇒lm id T′E′ ⇒lm id E′ ⇒lm id + T E′ ⇒lm id + F T′E′ ⇒lm id + id T′E′
⇒lm id + id E′ ⇒lm id + id
The top-down construction of the corresponding parse tree is shown in Figure 4.1.
[Figure 4.1: Top-down parse of id + id using Grammar G′ETF , showing the parse tree after each leftmost
immediate derivation]
4.1.1 Recursive-Descent Parsing
To begin looking at top-down parsing, we consider a class of top-down parsing called recursive-descent
parsing. The implementation of recursive descent parsing involves creating a function for each non-terminal,
A. For the productions A → X1 X2 . . . Xm , where each Xi is a terminal or non-terminal, the function
definition for A processes the symbols of the chosen right-hand side in order, calling the function associated
with each non-terminal on the right-hand side. Note that the set of all recursive-descent parsers forms a
proper subset of LL(1) parsers.
Assume A → α1 | α2 | · · · | αn is in the production rules for a grammar G. For writing the definition of A
for an implementation of a top-down parser, assume, without loss of generality, that the production A → αi
is chosen in line /*2*/ and that αi = X1 X2 . . . Xm . A typical function definition for non-terminal A is shown
below.
/*1*/ void A()
/*2*/ {   ... code for choosing a production with A on the left-hand side ...
/*3*/     for_each of X1 X2 ... Xm      // emphasize X1 is processed first, then X2, ...
/*4*/         if( Xi is a nonterminal ) invoke function Xi()
/*5*/         else if( Xi is a terminal ) read past the input symbol Xi
/*6*/         else invoke error handling function
/*7*/ }
Implementation of non-terminal A in a top-down parser
Grammars that work best for top-down parsing must not have left-recursive rules, such as E → E +
T . Algorithms exist that transform a grammar, G, that contains left-recursive production rules into a
grammar G′ without left-recursive production rules, such that L(G) = L(G′ ). This may involve introducing
ϵ-productions. Later, an algorithm will be shown that transforms GETF into a grammar suitable for
top-down parsing.

For now, we show the grammar GETF , which is not suitable for top-down parsing.

GETF :
E → E + T | T
T → T ∗ F | F
F → (E) | id
Using grammar GETF to parse the sentence id, the left- (and right-most) derivation is E ⇒ T ⇒ F ⇒ id. A
parser must make a decision at each immediate derivation about which production to apply while only seeing
id as the next token in the input string. If the parser can look ahead and have the information that the end
of the string is next, the correct decision could be made without backtracking. The sentence, id + id, requires
looking ahead to find the '+' symbol to choose E ⇒ E + T instead of E ⇒ T . As more positions are used for
look-ahead, the design of a parser gets more complex and less efficient. Part of the problem with using the
grammar GETF is that it has left-recursive rules. Backtracking can be used by trying each production in
line /*2*/ of the general top-down algorithm above when running into a dead-end or error. When all
A-productions have been tried, then an error can be indicated.

We can transform grammar GETF into another grammar G′ETF as shown below.

G′ETF :
E → T E′
E′ → +T E′ | ϵ
T → F T′
T′ → ∗F T′ | ϵ
F → (E) | id


It can be shown that L(GETF ) = L(G′ETF ). Using G′ETF to parse id + id, the parser can choose the
correct production by looking at only the next input symbol. The left-most derivation of id + id is:

E ⇒lm T E′ ⇒lm F T′E′ ⇒lm id T′E′ ⇒lm id E′ ⇒lm id + T E′ ⇒lm id + F T′E′ ⇒lm id + id T′E′
⇒lm id + id E′ ⇒lm id + id
When writing parsers, one must keep in mind that some grammars are better for certain parser designs
and worse for others. We have just looked at a top-down parser design. We have already observed
that if a top-down parser is designed for grammar GETF , the performance would be inefficient in that a
great amount of backtracking must take place. To use a top-down parser design, the grammar for which the
parser is built must have certain properties. We look at a particular class of context-free grammars that have
properties that allow one to write corresponding top-down parsers called recursive-descent. The motivation
for the name will become apparent shortly. It is noted that there are other top-down parser algorithms that
can be based upon a larger family of context-free grammars than those grammars meeting the constraints
needed for recursive-descent parsing.
To build recursive-descent parsers that require no backtracking, the constraints on the corresponding
grammar require that each subset of production rules with the same left-hand side non-terminal must have
right-hand sides that either start with unique terminals or, if a particular left-hand side symbol were treated
as a start symbol, could generate sentential forms that start with unique terminals. To formally express
these constraints, two subsets of terminals must be defined for each non-terminal and terminal symbol X,
called the First(X) and Follow(X) sets. In addition to containing subsets of terminals, these sets may
contain the ϵ string and/or the $ symbol. (For emphasis, there are no non-terminals in these two sets.)
These sets are also used to build the more general LL(k) class of parsers.
4.1.2 First and Follow Sets
For a given grammar, G =< VN , VT , P, S >, we compute First(X) for all grammar symbols, X ∈
(VN ∪ VT ), by applying each of the following rules until no more terminals or ϵ can be added. For each
X ∈ (VN ∪ VT ), the First(X) set is defined as follows:

1. If X ∈ VT , then First(X) is {X}.

2. If X ∈ VN and X → ϵ ∈ P , then add ϵ to First(X).

3. If X ∈ VN and X → Y1 Y2 . . . Ym ∈ P , then add t to First(X) if t ∈ First(Yi ) for some i such that
each of First(Y1 ), . . . , First(Yi−1 ) contains ϵ. If ϵ ∈ First(Yi ) for all 1 ≤ i ≤ m, then add ϵ to First(X).
The First set parameters can be augmented to strings over VN ∪ VT . First(X1 X2 . . . Xn ) is computed
as follows. All non-ϵ (terminal) symbols of First(X1 ) are added to First(X1 X2 . . . Xn ). For each j, with
1 ≤ j < n, if each of First(X1 ), . . . , First(Xj ) contains ϵ, then all terminals (non-ϵ symbols) of First(Xj+1 )
are added to First(X1 X2 . . . Xn ). If First(Xj ) contains ϵ for all j, then First(X1 X2 . . . Xn ) contains ϵ.

Another set construction used for constructing both top-down and bottom-up parsers is the Follow set.
Follow sets can also be utilized to design error recovery within parsers. Unlike First sets, the Follow sets
are parameterized only over nonterminals. The Follow sets, Follow(X) for each X ∈ VN , are constructed
by applying the following rules:

1. Add $ to Follow(S), where S is the start symbol and $ is the end-of-string symbol.

2. If there is a production A → αBβ, then everything in First(β) except for ϵ is added to Follow(B).

3. If there is a production A → αB, or a production A → αBβ where First(β) contains ϵ, then everything
in Follow(A) is added to Follow(B).

Example 4.1 First and Follow Sets
Using grammar G′ETF :

E → T E′
E′ → +T E′ | ϵ
T → F T′
T′ → ∗F T′ | ϵ
F → (E) | id

First, we construct the First sets since they are needed to construct the Follow sets. Remember the
First sets for the terminals are all singleton sets containing the terminal symbol. Clearly, First(+) = {+},
First(∗) = {∗}, First('(') = {(}, First(')') = {)}, and First(id) = {id}.

First(F)  = First('('E')') ∪ First(id) = First('(') ∪ First(id) = {(, id}
First(T′) = First(∗F T′) ∪ {ϵ} = First(∗) ∪ {ϵ} = {∗, ϵ}
First(T)  = First(F) = {(, id}
First(E′) = First(+T E′) ∪ {ϵ} = First(+) ∪ {ϵ} = {+, ϵ}
First(E)  = First(T) = {(, id}

The Follow sets are parameterized only over the non-terminals. The Follow set of the start symbol always
contains the end-of-string symbol, $, and thus, for this example, Follow(E) = {$}. To construct the other
Follow sets, the right-hand sides of the production rules are examined. Looking at F → (E), the ')' must be
added to Follow(E), thus yielding Follow(E) = {$, )}. Using the third rule for constructing the Follow sets,
the Follow(E) must be added to Follow(E′). Nothing else can be added to Follow(E′), hence Follow(E′) =
Follow(E). The Follow(T) must contain the First(E′) except for ϵ, since E′ follows T in the right-hand
side of the production E′ → +T E′. Thus far, we have Follow(T) = {+}. However, since E′ → ϵ is a
production, then, because we have the first production, E → T E′, all of Follow(E) must be included in
Follow(T). Nothing else is added, so Follow(T) = {+, ), $}. Since we have T → F T′, all of Follow(T)
must be included in Follow(T′). Nothing else is added to Follow(T′). Because of the right-hand side of
T′ → ∗F T′, the First(T′) (except for ϵ) must be added to Follow(F). Since T′ → ϵ is a production, all of
Follow(T) is added to Follow(F). Nothing else is added to Follow(F). Thus, we have the following First
and Follow sets:

First(E) = First(T) = First(F) = {(, id}
First(E′) = {+, ϵ}
First(T′) = {∗, ϵ}
Follow(E) = Follow(E′) = {), $}
Follow(T) = Follow(T′) = {+, ), $}
Follow(F) = {+, ∗, ), $}

4.1.3 Constraints for Non-backtracking Recursive Descent Parsing

For a grammar G =< VN , VT , P, S > to be utilized for a non-backtracking recursive-descent parser, it must
meet the following two constraints:

Rule 1: For each production, A → α1 | α2 | . . . | αn , the First sets of the right-hand sides must be mutually
disjoint:
First(αj ) ∩ First(αk ) = ∅ for all j ̸= k.

Rule 2: For all A ∈ VN such that A ⇒∗ ϵ:
First(A) ∩ Follow(A) = ∅

As shown in Example 4.1, the grammar G′ETF meets the constraints of Rules 1 and 2 and can be used
for designing a top-down, recursive-descent parser.
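The First-set rules form a fixed-point computation: apply them repeatedly until no set grows. The C sketch below computes the First sets for G′ETF under an assumed encoding (the Prod table and the 100+ offset marking terminals are illustrative, not from the text); its output matches the First sets of Example 4.1.

#include <stdio.h>

/* Fixed-point First-set computation for G'ETF.  Nonterminals E, E', T, T', F
   are indexed 0..4; terminals + * ( ) id eps are indexed 0..5, and a symbol
   value of 100+t on a right-hand side marks terminal t. */
enum { E, Ep, T, Tp, F, NNT };          /* nonterminal indices          */
#define NT  6
#define EPS 5                           /* index of the eps pseudo-terminal */
static const char *tname[NT] = { "+", "*", "(", ")", "id", "eps" };

typedef struct { int lhs, len, rhs[4]; } Prod;
static Prod prods[] = {
    { E,  2, {T, Ep} },
    { Ep, 3, {100+0, T, Ep} }, { Ep, 1, {100+EPS} },
    { T,  2, {F, Tp} },
    { Tp, 3, {100+1, F, Tp} }, { Tp, 1, {100+EPS} },
    { F,  3, {100+2, E, 100+3} }, { F, 1, {100+4} },
};

static int first[NNT][NT];              /* first[A][t] = 1 iff t in First(A) */

int main(void)
{
    int changed = 1;
    while (changed) {                   /* iterate until no set grows */
        changed = 0;
        for (int p = 0; p < (int)(sizeof prods / sizeof prods[0]); p++) {
            Prod *pr = &prods[p];
            int allEps = 1;             /* all symbols so far derive eps? */
            for (int i = 0; i < pr->len && allEps; i++) {
                int X = pr->rhs[i];
                if (X >= 100) {         /* terminal: add it, stop scanning */
                    int t = X - 100;
                    if (!first[pr->lhs][t]) { first[pr->lhs][t] = 1; changed = 1; }
                    allEps = (t == EPS);   /* only the eps pseudo-terminal is nullable */
                } else {                /* nonterminal: add First(X) \ {eps} */
                    for (int t = 0; t < NT; t++)
                        if (t != EPS && first[X][t] && !first[pr->lhs][t]) {
                            first[pr->lhs][t] = 1; changed = 1;
                        }
                    allEps = first[X][EPS];
                }
            }
            if (allEps && !first[pr->lhs][EPS]) { first[pr->lhs][EPS] = 1; changed = 1; }
        }
    }
    static const char *nts[NNT] = { "E", "E'", "T", "T'", "F" };
    for (int A2 = 0; A2 < NNT; A2++) {
        printf("First(%s) = {", nts[A2]);
        for (int t = 0; t < NT; t++) if (first[A2][t]) printf(" %s", tname[t]);
        printf(" }\n");
    }
    return 0;
}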
To aid the design of a recursive-descent parser, it is convenient to represent the grammar as the syntax
diagrams shown in Figure 4.2.
The syntax diagrams can be reduced in number and simplified as shown in Figure 4.3.
A recursive-descent parser based on the syntax diagram from Figure 4.3 is shown in Program 1. This
parser relies on a scanner, called scan(). We assume the token names, PLUS_TOK, STAR_TOK, LEFT_P_TOK,
RIGHT_P_TOK, and IDENT for the lexemes ’+’, ’*’, ’(’, ’)’ and legal identifier strings, respectively. The
scanner returns ERROR_TOK if an unrecognized character is seen and, EOF_TOK, if the end of the input string
is encountered. The EOF_TOK token corresponds to the $. It is assumed that these token names are symbolic
constants of type int.
The parser is called parse(). If the input string is accepted, the function parse() returns 0, and a nonzero value otherwise. The scanner and parser functions communicate via a global variable currentToken.
The error processing function, error(string), prints the supplied error message, increments the number of
syntax errors via global variable, numberOfErrors, and tries to recover from the error by scanning further
into the input string.

Program 1 Recursive Descent Parser for G′ETF using syntax diagram in Figure 4.3

/* global variables */
int currentToken;
int numberOfErrors = 0;

void main()
{
    if( parse() == 0 ) print("String Accepted");
    else print("Input contains syntax errors.");
}

int parse()
{
    scan();                              /* prime currentToken with the first token */
    E();
    if( currentToken != EOF_TOK ) syntax_error("Unexpected end of input");
    return numberOfErrors;
}

void E()                                 /* E -> T { + T } */
{
    T();
    while( currentToken == PLUS_TOK ) {
        scan();
        T();
    }
}

void T()                                 /* T -> F { * F } */
{
    F();
    while( currentToken == STAR_TOK ) {
        scan();
        F();
    }
}
[Figure 4.2: Syntax Diagram for G′ETF : one diagram for each of E, E′, T, T′, and F]

Program 1 (Continued) Recursive Descent Parser for G′ETF using syntax diagram in Figure 4.3

void F()                                 /* F -> id | ( E ) */
{
    if( currentToken == IDENT ) scan();
    else if( currentToken == LEFT_P_TOK ) {
        scan();
        E();
        if( currentToken == RIGHT_P_TOK ) scan();
        else syntax_error("Missing Right Parenthesis");
    }
    else syntax_error("Missing Expression symbol");
}
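Program 1 leaves scan() unspecified. The sketch below is one possible scanner consistent with the token names and global variable used by Program 1; its handling of whitespace and identifiers is an illustrative assumption, not the textbook's scanner. The token constants are declared here for self-containment since Program 1 assumes them as int symbolic constants.

#include <ctype.h>
#include <stdio.h>

enum { PLUS_TOK, STAR_TOK, LEFT_P_TOK, RIGHT_P_TOK, IDENT, EOF_TOK, ERROR_TOK };
int currentToken;

void scan(void)
{
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n') c = getchar();   /* skip blanks */
    if      (c == EOF) currentToken = EOF_TOK;
    else if (c == '+') currentToken = PLUS_TOK;
    else if (c == '*') currentToken = STAR_TOK;
    else if (c == '(') currentToken = LEFT_P_TOK;
    else if (c == ')') currentToken = RIGHT_P_TOK;
    else if (isalpha(c)) {                  /* identifier: a letter, then */
        while (isalnum(c = getchar()))      /* letters and digits         */
            ;
        if (c != EOF) ungetc(c, stdin);     /* push back the lookahead character */
        currentToken = IDENT;
    }
    else currentToken = ERROR_TOK;
}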
4.1.4 LL(1) Table-Driven Predictive Parsing
After the First and Follow sets are computed and the two constraints are satisfied for predictive parsing,
instead of writing or using a tool to generate a recursive descent parser, a table can be used. An LL(1) table,
T , can be constructed as follows.
Input: G, First and Follow sets.
Output: LL(1) Table T .
Figure 4.3: A reduced syntax diagram for the diagrams shown in Figure 4.2
1. The rows are labeled with the nonterminals.
2. The columns are labeled with the terminals and $.
3. Initialize all entries as empty.
4. The entries are filled with production rules as follows:
(a) Add A → α to T [A][t] for each terminal t ∈ First(α).
(b) If ϵ ∈ First(α) for A → α, then add A → α to T [A][t] for each terminal t ∈ Follow(A), and if
$ ∈ Follow(A), then also add A → α to T [A][$].
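Steps 4(a) and 4(b) can be transcribed almost directly into code. The C sketch below fills the LL(1) table for G′ETF from hard-coded First and Follow sets taken from Example 4.1; the space-separated set encoding and the helper names are illustrative assumptions, not from the text. Printing the nonempty entries reproduces the table shown below.

#include <stdio.h>
#include <string.h>

#define NNT 5
#define NT  6
static const char *nts[NNT] = { "E", "E'", "T", "T'", "F" };
static const char *ts[NT]   = { "id", "+", "*", "(", ")", "$" };

typedef struct {
    int lhs;                 /* index of the left-hand nonterminal      */
    const char *rule;        /* the production, for display             */
    const char *firstAlpha;  /* First(alpha) without eps                */
    int hasEps;              /* 1 iff eps is in First(alpha)            */
} P;

static P ps[] = {
    { 0, "E->TE'",   "id (", 0 },
    { 1, "E'->+TE'", "+",    0 }, { 1, "E'->eps", "", 1 },
    { 2, "T->FT'",   "id (", 0 },
    { 3, "T'->*FT'", "*",    0 }, { 3, "T'->eps", "", 1 },
    { 4, "F->(E)",   "(",    0 }, { 4, "F->id",   "id", 0 },
};
static const char *follow[NNT] = { ") $", ") $", "+ ) $", "+ ) $", "+ * ) $" };

static const char *T[NNT][NT];          /* the LL(1) table being filled */

static int member(const char *set, const char *t)
{   /* crude membership test on space-separated sets */
    char pat[16], pad[32];
    snprintf(pat, sizeof pat, " %s ", t);
    snprintf(pad, sizeof pad, " %s ", set);
    return strstr(pad, pat) != NULL;
}

int main(void)
{
    for (int i = 0; i < (int)(sizeof ps / sizeof ps[0]); i++)
        for (int t = 0; t < NT; t++) {
            if (member(ps[i].firstAlpha, ts[t]))              /* step 4(a) */
                T[ps[i].lhs][t] = ps[i].rule;
            if (ps[i].hasEps && member(follow[ps[i].lhs], ts[t]))
                T[ps[i].lhs][t] = ps[i].rule;                 /* step 4(b) */
        }
    for (int A = 0; A < NNT; A++)
        for (int t = 0; t < NT; t++)
            if (T[A][t]) printf("T[%s][%s] = %s\n", nts[A], ts[t], T[A][t]);
    return 0;
}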

Using the grammar G′ETF and the First and Follow sets, the LL(1) table T is shown below.

First(E) = First(T) = First(F) = {(, id}
First(E′) = {+, ϵ}
First(T′) = {∗, ϵ}
Follow(E) = Follow(E′) = {), $}
Follow(T) = Follow(T′) = {+, ), $}
Follow(F) = {+, ∗, ), $}
          Input Symbols (VT )
VN        id            +              *              (            )            $
E         E → T E′                                    E → T E′
E′                      E′ → +T E′                                 E′ → ϵ       E′ → ϵ
T         T → F T′                                    T → F T′
T′                      T′ → ϵ        T′ → ∗F T′                   T′ → ϵ       T′ → ϵ
F         F → id                                      F → (E)
The algorithm used for the engine of the Predictive Parsing Machine in Figure 4.4 is shown in Algorithm 4.1.
Recall that the empty entries in the table are considered error markers.
[Figure 4.4: Top-Down Predictive Parser Abstract Machine Model: an input buffer ending in $, a stack of
grammar symbols with $ at the bottom, the parser table T, and the predictive parser engine producing output]
In Algorithm 4.1, when an error occurs, the machine terminates with an error message. Error
recovery can be introduced with special routines that try to recover from the error after the error is added to
a list of errors.
The following trace shows the predictive parse of the sentence id + id ∗ id. Each row shows the prefix of the input scanned so far, the stack (top symbol leftmost), the current position in the input, and the action that produced the configuration (apply a production for an immediate derivation, or scan).

Prefix Scanned     Stack        Position in Input    Action
                   E$           id + id ∗ id$        Start Configuration
                   TE′$         id + id ∗ id$        Apply E → TE′
                   FT′E′$       id + id ∗ id$        Apply T → FT′
                   idT′E′$      id + id ∗ id$        Apply F → id
id                 T′E′$        +id ∗ id$            Scan id
id                 E′$          +id ∗ id$            Apply T′ → ε
id                 +TE′$        +id ∗ id$            Apply E′ → +TE′
id+                TE′$         id ∗ id$             Scan +
id+                FT′E′$       id ∗ id$             Apply T → FT′
id+                idT′E′$      id ∗ id$             Apply F → id
id + id            T′E′$        ∗id$                 Scan id
id + id            ∗FT′E′$      ∗id$                 Apply T′ → ∗FT′
id + id∗           FT′E′$       id$                  Scan ∗
id + id∗           idT′E′$      id$                  Apply F → id
id + id ∗ id       T′E′$        $                    Scan id
id + id ∗ id       E′$          $                    Apply T′ → ε
id + id ∗ id       $            $                    Apply E′ → ε
id + id ∗ id       $            $                    Successful Parse
Algorithm 4.1: Algorithm for Predictive Parsing Engine
input : A string of terminals, ω, and Predictive Parser Table T for a Grammar G
output: If ω ∈ L(G), a leftmost derivation (parse tree) of ω; otherwise an error is reported.
Push $ onto the stack and then push the start symbol S onto the stack;
Let a be the first symbol of ω;
Let X be the top symbol on the stack;
while X ≠ $ do
    if X = a then
        pop the stack and set a to the next symbol of ω
    else if X is a terminal then
        indicate error
    else if T[X][a] = error then
        indicate error
    else if T[X][a] = X → Y1 Y2 . . . Yk then
        output X → Y1 Y2 . . . Yk ; pop the stack; push Yk , . . . , Y2 , Y1 onto the stack with Y1 on top
    Set X to the top of the stack
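To make the engine concrete, the following is a minimal C sketch of Algorithm 4.1 for G_ETF. It is only a sketch under stated assumptions: the symbol codes, the encoding of the table as production numbers (with 0 as the error marker), and all names are illustrative, not from the text.

#include <stdio.h>

/* Symbol codes (illustrative): terminals 0..5, nonterminals 10..14. */
enum { T_ID, T_PLUS, T_STAR, T_LP, T_RP, T_END };
enum { N_E = 10, N_EP, N_T, N_TP, N_F };

/* LL(1) table T as production numbers; 0 marks an (empty) error entry.
   Productions: 1: E->TE'   2: E'->+TE'  3: E'->epsilon  4: T->FT'
                5: T'->*FT' 6: T'->epsilon  7: F->(E)    8: F->id   */
static const int llTable[5][6] = {
    /*         id +  *  (  )  $ */
    /* E  */ { 1, 0, 0, 1, 0, 0 },
    /* E' */ { 0, 2, 0, 0, 3, 3 },
    /* T  */ { 4, 0, 0, 4, 0, 0 },
    /* T' */ { 0, 6, 5, 0, 6, 6 },
    /* F  */ { 8, 0, 0, 7, 0, 0 },
};

/* Right-hand sides of productions 1..8, each terminated by -1. */
static const int rhs[9][4] = {
    { -1 },                      /* 0: unused (error marker) */
    { N_T, N_EP, -1 },           /* 1 */
    { T_PLUS, N_T, N_EP, -1 },   /* 2 */
    { -1 },                      /* 3: epsilon */
    { N_F, N_TP, -1 },           /* 4 */
    { T_STAR, N_F, N_TP, -1 },   /* 5 */
    { -1 },                      /* 6: epsilon */
    { T_LP, N_E, T_RP, -1 },     /* 7 */
    { T_ID, -1 },                /* 8 */
};

/* input: terminal codes ending with T_END; returns 0 iff accepted. */
int llParse(const int *input)
{
    int stack[128], top = 0, i = 0;
    stack[top++] = T_END;                 /* $ marks the stack bottom */
    stack[top++] = N_E;                   /* start symbol             */
    while (stack[top - 1] != T_END) {
        int X = stack[top - 1], a = input[i];
        if (X == a) { top--; i++; }       /* match a terminal         */
        else if (X < 10) return 1;        /* terminal mismatch: error */
        else {
            int p = llTable[X - 10][a];
            if (p == 0) return 1;         /* empty table entry: error */
            printf("Apply production %d\n", p);
            top--;                        /* pop X                    */
            int n = 0;
            while (rhs[p][n] != -1) n++;
            while (n-- > 0) stack[top++] = rhs[p][n]; /* push reversed */
        }
    }
    return (input[i] == T_END) ? 0 : 1;   /* input must also be done  */
}

Calling llParse on the coded string for id + id ∗ id (ending with T_END) prints the productions applied in the trace above and returns 0.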
4.2
Grammar Transformations
To implement a recursive descent parser, a grammar must meet the constraints indicated above. When
presented with a grammar, G, that does not meet them, there are transformations that may be applied to G to
create a grammar G′ conducive to recursive descent parsing, such that L(G) = L(G′). The main objective
is to eliminate backtracking in the parser. Ambiguity is a problem for both top-down and bottom-up parsers. To be able to use particular types of parsers, some grammar transformations
are needed, and three general approaches are examined; all involve rewriting the problematic grammar.
Two have algorithms that can be used in rewriting a grammar: one eliminates left recursion and
the other performs left factoring. Eliminating ambiguity takes some creativity, either rewriting the grammar into a non-ambiguous
one or applying meta-level techniques while parsing; it is an unsolvable (undecidable) problem to detect whether
an arbitrary CFG is ambiguous.
4.2.1
Handling Ambiguity
Ambiguity is a problem for programming language translation since different semantics can be associated with the different parse trees of one ambiguous sentence. Which structure is chosen must be consistent
among the designer, the translator developers, and the users of the language. There are several kinds of constructs that occur in CFGs that are ambiguous. One problem is in the specification of expressions, as we
have seen in Chapter 3. Another classic problem is the dangling else problem. A last kind of ambiguity comes from mixing specific grammar rules
with general grammar rules.
Consider the following grammar for generating if-then and if-then-else statements along with other statements represented by the terminal s_others. The statements include the reserved word (keyword) then to separate the
conditional expression, represented by the terminal c, from the start of the statement, S, for the then-statement.
Note that a language designer could have eliminated the need for then by requiring left and right parentheses around the conditional expression c, the right parenthesis then acting as the separator between the
conditional expression c and the then-statement S.

G_ite:
S → if c then S
S → if c then S else S
S → s_others
The sentence if c then if c then s_others else s_others has two parse trees using G_ite, as shown in Figure 4.5.
[Figure 4.5: Two parse trees, (a) and (b), illustrating the dangling else problem for grammar G_ite]
A non-ambiguous grammar, G_ite-non, for if-then and if-then-else statements is shown below. The strategy
is to separate the matched if-then-else statements from the if-statements without an else. Two non-terminals,
MatchedElse and UnmatchedElse, are introduced for these respective cases.
G_ite-non:
S → MatchedElse
  | UnmatchedElse
MatchedElse → if c then MatchedElse else MatchedElse
  | s_others
UnmatchedElse → if c then S
  | if c then MatchedElse else UnmatchedElse
Shown in Figure 4.6 is the parse tree using G_ite-non for the sentence if c then if c then s_others else s_others.
[Figure 4.6: Parse tree for grammar G_ite-non illustrating the resolution of the dangling else problem]
This transformation resolves the ambiguity, but the new grammar is not good for top-down parsing:
the first sets of the two right-hand sides of UnmatchedElse intersect (both begin with if), as do the
first sets of MatchedElse and UnmatchedElse.
In Section 4.2.3 we will perform a transformation of the productions of G_ite called left factoring
and create a better grammar for top-down parsing. This helps with top-down predictive parsing but does
not always remove ambiguity. Often parser-building tools, such as Yacc and ANTLR4, have meta-level ways to
handle ambiguity.
4.2.2
Removing Left Recursion in Productions
For top-down parsing, left recursion is not permitted. A grammar is said to have left recursion (to be
left recursive) if there is a derivation A ⇒+ Aα for some nonterminal A. A production rule has immediate left recursion
if the rule is of the form A → Aα, where α ∈ (VN ∪ VT)+.
The production rules E → E + T and T → T ∗ F are both immediate left-recursive rules in the grammar
G_ETF shown earlier and repeated here.

G_ETF:
E → E + T | T
T → T ∗ F | F
F → (E) | id
Note that there is also a derivation E ⇒ T ⇒ F ⇒ (E). This shows that the grammar has indirect recursion
on E; however, this particular recursion is not left recursion, since (E) begins with a terminal.
To eliminate immediate left recursion, new non-terminals are introduced along with additional production
rules in the following manner. Suppose we have the following two A-productions

A → Aα | β

where β does not start with A. To eliminate the immediate left recursion, we replace these rules with

A → βA′
A′ → αA′ | ε
Transforming the immediate left-recursive rules in G_ETF, we have the grammar that we have seen before,
repeated here:

E → TE′
E′ → +TE′ | ε
T → FT′
T′ → ∗FT′ | ε
F → (E) | id

For the rules E → E + T | T, the A, α, and β above are instantiated with E, +T, and T, respectively, and the
new nonterminal E′ gets the rules E′ → +TE′ | ε. The rules T → T ∗ F | F are transformed similarly.
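One practical note: in a recursive descent parser the new nonterminal A′ need not become a separate function. Since A′ → αA′ | ε recurses only in tail position, it can be realized as a loop, which is exactly how Program 1 handles E′ and T′. A sketch, reusing Program 1's globals and token names:

/* E -> T E' and E' -> + T E' | epsilon realized as a loop (cf. Program 1). */
extern int currentToken;
void scan(void);
void T(void);

void E(void)
{
    T();                                   /* E  -> T E'            */
    while (currentToken == PLUS_TOK) {     /* E' -> + T E' ...      */
        scan();
        T();
    }                                      /* ... or E' -> epsilon  */
}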
There is a more general transformation for immediate left recursion of a nonterminal, A, with more than two
A-productions. Suppose we have the productions

A → Aα1 | Aα2 | · · · | Aαn | β1 | β2 | · · · | βm

where no βi begins with A (1 ≤ i ≤ m). By introducing a new nonterminal A′, we can replace the
immediate left recursion in the A-productions with the following productions:

A → β1A′ | β2A′ | · · · | βmA′
A′ → α1A′ | α2A′ | · · · | αnA′ | ε
The removal of general left recursion is not covered here.
4.2.3
Left Factoring
We saw the ambiguous grammar G_ite in Section 4.2.1. It is not only ambiguous, but also not appropriate
for top-down, predictive parsing: when a top-down parser is given the if lexeme, the parser does not know
which of the first two rules to use. One approach to fix this is to factor out the longest common prefix of the
right-hand sides of two or more A-productions. Suppose we have A → αβ1 | αβ2, where α ≠ ε. If we factor out α,
the parser can parse α and then decide between β1 and β2.
Suppose we have A-productions with a common prefix α and with the remaining right-hand sides of the
A-productions represented collectively as γ:

A → αβ1 | αβ2 | · · · | αβm | γ

Note that any βi can be ε for any i where 1 ≤ i ≤ m. We can transform these productions into an equivalent
grammar:

A → αA′ | γ
A′ → β1 | β2 | · · · | βm
For the first two rules of G_ite, the longest shared prefix is if c then S. Recall G_ite:
S → if c then S
S → if c then S else S
S → s_others
Applying left factoring, we have α = if c then S, with β1 = ε, β2 = else S, and γ = s_others. The left-factored transformation of G_ite is:
S → if c then S S′ | s_others
S′ → else S | ε
This resultant grammar is still ambiguous: when the parser is presented with the lexeme else, it is
not clear which S′ alternative to use. When using a table-driven LL(1) predictive parser, the designer of
the table can resolve these conflicts. This can also be handled by meta-level actions in a top-down parser
generator like ANTLR4.
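In a hand-written recursive descent parser, the usual resolution is the one most language definitions specify: an else binds to the nearest unmatched if, so the parser greedily takes the else S alternative of S′ whenever the lookahead is else. A sketch follows; the token names and the match() helper (which consumes an expected token or reports an error) are illustrative:

/* S  -> if c then S S' | s_others
   S' -> else S | epsilon
   The inner if/else inlines S', binding each else to the nearest if. */
void S(void)
{
    if (currentToken == IF_TOK) {
        scan();                            /* if                      */
        match(COND_TOK);                   /* c                       */
        match(THEN_TOK);                   /* then                    */
        S();                               /* then-statement          */
        if (currentToken == ELSE_TOK) {    /* S' -> else S (greedy)   */
            scan();
            S();
        }                                  /* else S' -> epsilon      */
    }
    else
        match(OTHERS_TOK);                 /* s_others                */
}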
4.3
Bottom-up Parsing
Bottom-up parsers parse sentences by performing a derivation in reverse and, equivalently, build the parse
tree bottom-up. More precisely, bottom-up parsers implicitly construct a rightmost derivation in reverse; that
is, they perform a series of immediate reductions instead of immediate derivations. The class of bottom-up
parsers we will examine are called LR(k) parsers. The L indicates that the input string of terminal symbols
is read from Left to right, and the R indicates that the parser recreates a Rightmost derivation in reverse.
The number k is the number of terminal symbols, from the current reading point, that are
examined in order to decide which reduction to apply. For now, we only consider an LR(1)
parser for a grammar G shown below. In contrast, as shown in Section 4.1, there is a class of parsers,
called LL(k), that perform a Left-to-right scan while producing a (forward) Leftmost derivation with k-symbol
lookahead; these parsers build parse trees top-down from the root, which is labeled with the start symbol. A
grammar that has an LL(k) or LR(k) parser is classified, respectively, as an LL(k) or LR(k) grammar.
Each bottom-up parser is based upon a model of computation called a pushdown automaton. These automata are
called “pushdown” since each includes a stack in addition to a finite state automaton.
Intuitively, the stack is needed to recognize non-regular languages, such as {a^n b^n | n ≥ 0}. A string a^n b^n
(n ≥ 0) is accepted if the stack is empty after pushing each a and then popping one a each time a b is read.
The components of a pushdown automaton are depicted in Figure 4.7.
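For intuition, such a recognizer can be sketched in C. Since only one kind of symbol is ever pushed for this language, the stack can be simulated by a counter; the function name and coding below are illustrative, not from the text:

#include <stddef.h>

/* Recognize { a^n b^n | n >= 0 }: "push" for each a, "pop" one a per b,
   and accept only if the simulated stack is empty at end of input. */
int acceptsAnBn(const char *s)
{
    size_t i = 0, depth = 0;
    while (s[i] == 'a') { depth++; i++; }     /* push phase            */
    while (s[i] == 'b') {                     /* pop phase             */
        if (depth == 0) return 0;             /* more b's than a's     */
        depth--; i++;
    }
    return s[i] == '\0' && depth == 0;        /* accept iff balanced   */
}

For example, acceptsAnBn("aabb") returns 1, while acceptsAnBn("aab") and acceptsAnBn("ba") return 0.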
Bottom-up parsers based upon pushdown automata are also referred to as shift-reduce parsers. To build a
parse tree bottom-up, the shift-reduce parser performs the sequence of reductions of a rightmost derivation in reverse.
In order to perform reductions on right-sentential forms, the parser shifts terminal symbols from the input
string onto the stack; i.e., it pushes a scanned token onto the pushdown stack. Whether a shift or a reduce is
executed depends on the current input symbol and the current state of the automaton. Before examining the
details of shift-reduce parsing, the concept of “a handle of a right-sentential form” must be defined.
[Figure 4.7: LR PDA components: the input a0 a1 a2 . . . ai . . . an−1 $ with the next-token pointer, the stack q0 X0 q1 X1 . . . qm−1 Xm−1 qm with qm on top, the LR Parser Engine, the Shift-Reduce Table, and the GoTo Table]
Example 4.2 Handles
Consider the following grammar G:
(1) S → aSb
(2) S → a
The rightmost derivation for aab (G is unambiguous) is
S ⇒rm aSb ⇒rm aab
Looking at this in reverse we have:
aab ⇐rm aSb ⇐rm S
When presented with the input string aab, a shift-reduce parser must find a substring of aab that matches a
right-hand side of a production rule, in this case one of the two production rules. The first a matches the
right-hand side of rule (2). If that reduction is made, then we have aab ⇐ Sab. However, the string Sab is
not a sentential form, i.e., S ⇒∗ Sab does not hold. If instead the second a is matched to the right-hand side of
production rule (2), the correct sentential form, aSb, is reached. At this point the entire string matches
the right-hand side of rule (1).
From Example 4.2 one can see that there can be several substrings that match at least one right-hand side of a rule. However, the substring that must be chosen is the one that, when it is replaced
in the sentential form by the corresponding left-hand side nonterminal, yields a string that is also a right-sentential form. Such a production rule (together with its position) in a right-sentential form is called a handle. It should be noted
that, in general, there may be more than one handle when the corresponding grammar is ambiguous.
A handle of a right-sentential form, σ, is a production rule, A → β, and a position in σ where β begins,
such that when β is replaced in σ with A, the resulting string is also a right-sentential form. In notation:
for ω ∈ VT∗ and α ∈ (VN ∪ VT)∗, S ⇒rm∗ αAω ⇒rm αβω, where σ = αβω.
In Example 4.2, we have σ = aab and the handle S → a with the replacement at the second a. Reducing,
we have α = a, A = S, and ω = b.
To help understand how parsing is accomplished using shift-reduce actions, one can envision that the
parser “prunes” handles (right-hand sides) off a conceptual parse tree that represents a derivation of a
sentence each time a reduction is made. In order to implement this scheme, we must be able to represent each
right-sentential form and be able to find a handle. To do this, shift-reduce parsers utilize the stack of the
pushdown automaton. The stack holds a prefix of each right-sentential form, with the leftmost symbol on the
bottom of the stack and the rightmost symbol on the top. There are four possible actions that a shift-reduce
parser can make: (1) shift, (2) reduce, (3) accept, or (4) report an error in the input string.
1. Shift: read the next input terminal (token) and push it onto the stack; that is,
shift it from the input string onto the stack.
2. Reduce: It can be shown that if the input string is a sentence, the symbols of the handle’s right-hand
side, β, form the top symbols of the stack, with the rightmost symbol of the handle on top.
A reduce action pops the symbols of β off the stack until the leftmost symbol of β has been
popped. Then the left-hand side nonterminal, A, is pushed onto the stack.
3. Accept: the parser indicates that the input string is a sentence.
4. Error: the parser indicates that the input string is ill-formed. Some parsers try to recover and continue
parsing.
To introduce how a shift-reduce parser works, consider the grammar G_ETF and the sentence
id ∗ (id + id) ∈ L(G_ETF).

G_ETF:
E → E + T | T
T → T ∗ F | F
F → (E) | id
First, consider a rightmost derivation of the string id ∗ (id + id) using grammar G_ETF:

E ⇒rm T ⇒rm T ∗ F ⇒rm T ∗ (E) ⇒rm T ∗ (E + T) ⇒rm T ∗ (E + F) ⇒rm T ∗ (E + id)
  ⇒rm T ∗ (T + id) ⇒rm T ∗ (F + id) ⇒rm T ∗ (id + id) ⇒rm F ∗ (id + id) ⇒rm id ∗ (id + id)
Derivations show the abstract generation of sentences in L(G) for some grammar G. For a parsing strategy
that reads the input string left to right and (conceptually) builds the parse tree bottom-up, it is more helpful
to view the derivation performed in reverse. The reverse of the derivation above is expressed as the following
sequence of reductions:
id ∗ (id + id) ⇐rm(1) F ∗ (id + id) ⇐rm(2) T ∗ (id + id) ⇐rm(3) T ∗ (F + id) ⇐rm(4) T ∗ (T + id)
  ⇐rm(5) T ∗ (E + id) ⇐rm(6) T ∗ (E + F) ⇐rm(7) T ∗ (E + T) ⇐rm(8) T ∗ (E) ⇐rm(9) T ∗ F ⇐rm(10) T ⇐rm(11) E
Shown in Figure 4.8 (in three parts) is a trace of a shift-reduce parse of id ∗ (id + id). To trace a bottom-up
parse of this sentence we use a stack, the current reading position in the input, and a conceptual parse tree.
The parse tree is conceptual since it does not exist at the beginning of the parse; it is what we are trying
to build either abstractly or concretely. Each step shows either a shift action or a reduction. We show a
reduction by pruning the handle of the conceptual parse tree.
How a shift-reduce parser such as
LR(k) knows which action to take requires a finite state machine and an Action table, whose entries
are one of the four possible shift-reduce actions and are indexed by the state and the next
input symbol. The finite state machine is represented by a “GoTo” table, whose entries are next states and
are indexed by state and nonterminal symbol, with each nonterminal symbol representing the
left-hand side nonterminal of a handle.
It is not within the scope of this text to discuss how to classify whether a grammar is LR(k) for some k
or to study how to construct pushdown automata from LR grammars. Building the Action and GoTo tables
“by hand” is time-consuming. There are algorithms that translate an LR(k) grammar, G, into a pushdown
automaton, M, such that L(G) = L(M); that is, all and only the sentences of L(G) are recognized by
M. Thus, there are software tools for accomplishing this. Some tools restrict the class of grammars to a
particular subset of LR(k) grammars and/or limit k to a maximum value. For example, yacc is limited
to Look-Ahead LR grammars with k = 1, notated as LALR(1) grammars.
The stack of each of these pushdown automata can be viewed as a string of the form q0 X1 q1 X2 q2 . . . Xm qm,
where each qi is a state, each Xj is a terminal or nonterminal symbol, and state qm is at the top of the stack.
The algorithm for running such a pushdown automaton is shown in Program 2.
Program 2 Algorithm for Pushdown Automata
push state 0 onto an empty stack
set ip to point to the first symbol of the input string w$
repeat
    let s be the state on top of the stack
    let a be the symbol pointed to by ip
    if action[s,a] = shift s' then
        push a onto the stack; push s' onto the stack; advance ip
    else if action[s,a] = reduce r then
        // r represents the r-th production rule, A -> X1 X2 ... Xn
        pop 2*n symbols and states (n pairs) off the stack;
        a state, s', is exposed on the top of the stack;
        push A onto the stack and then push the state GoTo[s',A] onto the stack
    else if action[s,a] = accept then
        return success code
    else
        either try error recovery or return error code
Example 4.3 Bottom-up Parsing with an LR(1) Shift-Reduce Table
We show how to parse id ∗ (id + id) in Table 4.2 using the LR(1) PDA represented by the Shift-Reduce
Table 4.1, which is created from the grammar G_ETF with its productions numbered:

(1) E → E + T
(2) E → T
(3) T → T ∗ F
(4) T → F
(5) F → (E)
(6) F → id

With grammar G_ETF, one can construct the LR(1) Shift-Reduce Table 4.1.
            Action                                          Goto
State   id      +       *       (       )       $       E    T    F
0       S5                      S4                      1    2    3
1               S6                              acc
2               R2      S7              R2      R2
3               R4      R4              R4      R4
4       S5                      S4                      8    2    3
5               R6      R6              R6      R6
6       S5                      S4                           9    3
7       S5                      S4                                10
8               S6                      S11
9               R1      S7              R1      R1
10              R3      R3              R3      R3
11              R5      R5              R5      R5

Table 4.1: Shift-Reduce Table for Grammar G_ETF (Sn = shift and go to state n; Rn = reduce by production n; acc = accept; blank = error)
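As with the predictive engine sketched in Section 4.1, Table 4.1 can be encoded directly in C. The sketch below is illustrative only: the encoding conventions (positive entries shift to that state, negative entries reduce by that rule number, ACC accepts, 0 is an error) and the state-only stack (a common optimization; the grammar symbols are determined by the states) are assumptions, not part of the text.

#include <stdio.h>

/* Terminal and nonterminal codes (illustrative). */
enum { T_ID, T_PLUS, T_STAR, T_LP, T_RP, T_END };
enum { N_E, N_T, N_F };

#define ACC 100   /* accept marker */

/* Table 4.1: positive = shift to that state, negative = reduce by that
   rule, 0 = error. Columns: id + * ( ) $ */
static const int action[12][6] = {
    {  5,  0,  0,  4,  0,   0 },  /* 0  */
    {  0,  6,  0,  0,  0, ACC },  /* 1  */
    {  0, -2,  7,  0, -2,  -2 },  /* 2  */
    {  0, -4, -4,  0, -4,  -4 },  /* 3  */
    {  5,  0,  0,  4,  0,   0 },  /* 4  */
    {  0, -6, -6,  0, -6,  -6 },  /* 5  */
    {  5,  0,  0,  4,  0,   0 },  /* 6  */
    {  5,  0,  0,  4,  0,   0 },  /* 7  */
    {  0,  6,  0,  0, 11,   0 },  /* 8  */
    {  0, -1,  7,  0, -1,  -1 },  /* 9  */
    {  0, -3, -3,  0, -3,  -3 },  /* 10 */
    {  0, -5, -5,  0, -5,  -5 },  /* 11 */
};
/* GoTo entries; columns E, T, F. */
static const int gotoTbl[12][3] = {
    {1,2,3},{0,0,0},{0,0,0},{0,0,0},{8,2,3},{0,0,0},
    {0,9,3},{0,0,10},{0,0,0},{0,0,0},{0,0,0},{0,0,0},
};
static const int ruleLen[7] = { 0, 3, 1, 3, 1, 3, 1 };       /* |rhs| of rules 1..6 */
static const int ruleLhs[7] = { 0, N_E, N_E, N_T, N_T, N_F, N_F };

/* input: terminal codes ending with T_END; returns 0 iff accepted. */
int lrParse(const int *input)
{
    int states[128], top = 0, i = 0;
    states[top++] = 0;
    for (;;) {
        int act = action[states[top - 1]][input[i]];
        if (act == ACC) return 0;                        /* accept */
        else if (act > 0) { states[top++] = act; i++; }  /* shift  */
        else if (act < 0) {                              /* reduce */
            int r = -act;
            printf("Reduce %d\n", r);
            top -= ruleLen[r];                  /* pop |rhs| states      */
            states[top] = gotoTbl[states[top - 1]][ruleLhs[r]];
            top++;                              /* push GoTo[s',A] state */
        }
        else return 1;                                   /* error entry */
    }
}

Running lrParse on the coded string for id ∗ (id + id) $ prints the reduce sequence 6, 4, 6, 4, 2, 6, 4, 1, 5, 3, 2 and accepts, matching the trace in Table 4.2.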
Stack                          Input              Action
0                              id ∗ (id + id)$    Shift 5
0 id 5                         ∗(id + id)$        Reduce 6 (use GoTo[0, F])
0 F 3                          ∗(id + id)$        Reduce 4 (use GoTo[0, T])
0 T 2                          ∗(id + id)$        Shift 7
0 T 2 ∗ 7                      (id + id)$         Shift 4
0 T 2 ∗ 7 ( 4                  id + id)$          Shift 5
0 T 2 ∗ 7 ( 4 id 5             +id)$              Reduce 6 (use GoTo[4, F])
0 T 2 ∗ 7 ( 4 F 3              +id)$              Reduce 4 (use GoTo[4, T])
0 T 2 ∗ 7 ( 4 T 2              +id)$              Reduce 2 (use GoTo[4, E])
0 T 2 ∗ 7 ( 4 E 8              +id)$              Shift 6
0 T 2 ∗ 7 ( 4 E 8 + 6          id)$               Shift 5
0 T 2 ∗ 7 ( 4 E 8 + 6 id 5     )$                 Reduce 6 (use GoTo[6, F])
0 T 2 ∗ 7 ( 4 E 8 + 6 F 3      )$                 Reduce 4 (use GoTo[6, T])
0 T 2 ∗ 7 ( 4 E 8 + 6 T 9      )$                 Reduce 1 (use GoTo[4, E])
0 T 2 ∗ 7 ( 4 E 8              )$                 Shift 11
0 T 2 ∗ 7 ( 4 E 8 ) 11         $                  Reduce 5 (use GoTo[7, F])
0 T 2 ∗ 7 F 10                 $                  Reduce 3 (use GoTo[0, T])
0 T 2                          $                  Reduce 2 (use GoTo[0, E])
0 E 1                          $                  acc (ACCEPT)

Table 4.2: Parsing id ∗ (id + id)
[Figure 4.8: Shift-Reduce Parsing: Handle Pruning of Conceptual Tree (Part 1 of 3). The figure shows the stack, remaining input, and conceptual parse tree for the input id1∗(id2+id3)$ at each step: (0) starting configuration (empty stack); (i) shift id1; (ii) prune handle F → id (id1); (iii) prune handle T → F; (iv) shift ∗; (v) shift (; (vi) shift id2; (vii) prune handle F → id (id2).]
[Figure 4.8: Shift-Reduce Parsing: Handle Pruning of Conceptual Tree (cont’d: Part 2 of 3). Steps: (viii) prune handle T → F; (ix) prune handle E → T; (x) shift +; (xi) shift id3; (xii) prune handle F → id (id3); (xiii) prune handle T → F; (xiv) prune handle E → E + T; (xv) shift ).]
[Figure 4.8: Shift-Reduce Parsing: Handle Pruning of Conceptual Tree (cont’d: Part 3 of 3). Steps: (xvi) prune handle F → (E); (xvii) prune handle T → T ∗ F; (xviii) prune handle E → T; (xix) accepted.]
CS 435 Project 1
Scanner for Simple_PL1
Overview of Assignment
Complete a “hand-written” scanner for Simple_PL1.
This project must be written in C (code files “.c” and “.h”) using Visual Studio 2022.
Simple_PL1 Lexical Specification
For definitions below: letter = [a..zA..Z] and digit = [0..9].

Token Name (used internally   Regular expression            Lexeme examples / notes
by compiler)                  definition
ID                            (letter|_)(letter|digit|_)*   max, a_1, _a1
READ                          read                          note: read is an identifier, but is reserved and needs
                                                            to be returned from the scanner as a distinct token
WRITE                         write                         note: write is an identifier, but is reserved and needs
                                                            to be returned from the scanner as a distinct token
NUMBER                        digit digit*                  1, 21, 1024
ASSIGN                        :=
PLUS                          +
MINUS                         -
TIMES                         *
DIV                           /
SEMICOLON                     ;
COMMA                         ,
LPAREN                        (
RPAREN                        )
SCAN_EOF                      (none)                        scanner returns this when the source end-of-file is reached
WHAT TO SUBMIT: Name your Visual Studio 2022 solution/project:
CS435P01. After completing the project, Zip the entire
solution/project folder and test cases. Name your zip file CS435P01.
I will grade your project by running it against a collection of test files.
Example: The command line should be:
C:> CS435P01YourLastName.exe src1.txt
See the example shown below illustrating the input – output relationship.
Source file:

x := 3;
y := 4;
read(x);
z1 := x + y;
write( x, y, z1,x*y/2-23 );

Scanner output:

ID, x
ASSIGN
NUMBER, 3
SEMICOLON
ID, y
ASSIGN
NUMBER, 4
SEMICOLON
READ
LPAREN
ID, x
RPAREN
SEMICOLON
ID, z1
ASSIGN
ID, x
PLUS
ID, y
SEMICOLON
WRITE
LPAREN
ID, x
COMMA
ID, y
COMMA
ID, z1
COMMA
ID, x
TIMES
ID, y
DIV
NUMBER, 2
MINUS
NUMBER, 23
RPAREN
SEMICOLON
TO DO: Write a driver program to test your scanner. You must use the command-line
parameters to access the source file, i.e. argv[1]. Open the source file, read it and
output all tokens on separate lines. For identifiers and numbers, you must also output
the lexeme as shown above in the example.
You must use the internal names for tokens shown above in your scanner. That is, the
scanner should return these names to your driver program. For the purpose of testing
the scanner, you’ll need a corresponding array of strings that represent the internal
Token name. You’ll also need a lexeme char array (c-string).
Although it is poor programming practice to use global variables, for this project, you
will be allowed to use global variables current_token and lexeme. You can also use the
src file pointer variable as a global, so you don’t have to pass that as a parameter to
the scanner.
Shown below is some starting code.
Outline of Scanner and Driver for testing the Scanner. Written in C.
//starter code for Simple_PL1 scanner
#include <stdio.h>    // for C I/O
#include <stdlib.h>   // for exit()
#include <ctype.h>    // for isalpha(), isalnum(), ...

enum tokenType {
    READ, WRITE, ID, NUMBER, LPAREN, RPAREN, SEMICOLON, COMMA,
    ASSIGN, PLUS, MINUS, TIMES, DIV, SCAN_EOF
};

char *mnemonic[] = { "READ", "WRITE", "ID", "NUMBER", "LPAREN", "RPAREN",
    "SEMICOLON", "COMMA", "ASSIGN", "PLUS", "MINUS", "TIMES", "DIV", "SCAN_EOF" };

void lexical_error(char ch)
{
    fprintf(stderr, "Lexical Error. Unexpected character: %c.\n", ch);
}

char lexeme[256] = { '\0' };
unsigned int lexLen = 0;
FILE *src;

enum tokenType scan()
{
    static int currentCh = ' ';
    static int tempCh = ' ';
    char *reserved[2] = { "read", "write" };
    lexLen = 0;
    lexeme[0] = '\0';   // clear lexeme buffer for each scan
    extern FILE *src;   // pointer to FILE handle that binds to source file
    if (feof(src)) {
        return SCAN_EOF;
    }
    while ((currentCh = fgetc(src)) != EOF) {
        if (isspace(currentCh)) {
            continue;
        }
        else if (isalpha(currentCh)) {   // needs to be modified: '_' may also start an identifier
            lexeme[0] = currentCh;
            lexLen = 1;
            for (tempCh = fgetc(src); isalnum(tempCh) || tempCh == '_'; ) {
                // build identifier lexeme
            }
            lexeme[lexLen] = '\0';
            ungetc(tempCh, src);   // put back character that is not alpha/digit or '_'
            // see if lexeme is a reserved word; if not, return ID.
        }
        else if (isdigit(currentCh)) {
            // build lexeme for number
            // finish fixing lexeme string, ungetc the last character read
            // that is not a digit, and then return NUMBER
        }
        else if (currentCh == '+') {
            return PLUS;
        }
        // use selection statements to look for tokens for operators,
        // delimiters, and assignment (:=)
        else {
            lexical_error(currentCh);
        }
    }
    return SCAN_EOF;
}
int main(int argc, char *argv[])
{
    extern FILE *src;
    enum tokenType currentToken;
    if (argc > 1) {   // should be better testing for proper number of arguments,
                      // but not required for this project
        if (fopen_s(&src, argv[1], "r")) {
            fprintf(stderr, "Error opening source file: %s ", argv[1]);
            exit(1);
        }
    }
    while ((currentToken = scan()) != SCAN_EOF) {
        // finish body: display the string version of the internal token name and
        // also print the lexeme if the token is an ID or NUMBER.
        // Do not print the lexeme for the other tokens.
    }
    fclose(src);
    return 0;
} // end driver
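As a hint for the reserved-word check that scan() leaves as a comment, one simple approach is a small helper that compares the completed identifier lexeme against the reserved words. This sketch is only a suggestion (the helper name is illustrative) and assumes #include <string.h> is added for strcmp():

/* Suggested sketch: return the reserved-word token if the lexeme matches
   one of the reserved words; otherwise return ID. */
enum tokenType checkReserved(const char *lex)
{
    if (strcmp(lex, "read") == 0)  return READ;
    if (strcmp(lex, "write") == 0) return WRITE;
    return ID;
}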
A preview of a Grammar for Simple_PL1 is shown, so you can start writing programs
in Simple_PL1. However, you do not need to know proper syntax in order to test your
scanner.
The “$$” is treated as eof.
program      → stmt_list $$
stmt_list    → stmt stmt_list | ε
stmt         → id := expr ; | read id ; | write expr ;
expr         → term term_tail
term_tail    → add_op term term_tail | ε
term         → factor factor_tail
factor_tail  → mult_op factor factor_tail | ε
factor       → ( expr ) | id | number
add_op       → + | -
mult_op      → * | /
Grammar specifications for Simple_PL1 will be given in the future.
CS 435 Project 2
Recursive Descent Parser (with Scanner) for Simple_PL1
Overview of Assignment


• Using your Project 1 scanner, implement a recursive descent parser in C for Simple_PL1.
• If a lexical error or parser error is encountered, report the error according to the specification shown below and exit.
Review of Simple_PL1 Lexical Specification
Your scanner must recognize the following tokens. For definitions below: letter = [a..zA..Z] and digit = [0..9].

Token Name (used internally   Regular expression            Lexeme examples / notes
by compiler)                  definition
ID                            (letter|_)(letter|digit|_)*   max, a_1, _a1
READ                          read                          note: read is an identifier, but is reserved and needs
                                                            to be returned from the scanner as a distinct token
WRITE                         write                         note: write is an identifier, but is reserved and needs
                                                            to be returned from the scanner as a distinct token
NUMBER                        digit digit*                  1, 21, 1024
ASSIGN                        :=
PLUS                          +
MINUS                         -
TIMES                         *
DIV                           /
SEMICOLON                     ;
COMMA                         ,
LPAREN                        (
RPAREN                        )
SCAN_EOF                      (none)                        scanner returns this when the source end-of-file is reached
WHAT TO SUBMIT: Name your project CS435P02. After completing
the project, zip the entire project folder and test cases. Name your zip file
CS435P02.
I will grade your project by running it against a collection of test files. (The executable
file name for this example is CS435P02Jeffrey. Your executable will be your project
name.)
Example: The command line should be: $ CS435P02Jeffrey src1.txt
TO DO: This parser must be written in C using the Visual Studio 2022 IDE. Utilize
your scanner from Project 1. You must use the command-line parameters to access
the source file, i.e. argv[1]. Open the source file and parse it. You must use recursive
descent parsing. For this simple parser, when a parsing error occurs, just print the
expected token (use mnemonic array of strings), or, indicate that an unexpected token
was found and exit the program. You will need to analyze which message is
appropriate. When a lexical error occurs, indicate that an unrecognized character was
seen. Do not attempt error recovery.
Simple_PL1 Syntax Definition and Grammar:
The Simple_PL1 programming language is defined and generated by the grammar
shown below. (The "$" represents the SCAN_EOF token.) The recursive descent parser
must be based upon the following grammar, which meets the conditions for building a
recursive descent parser. You can also alter it to make it more concise and to build
a more efficient recursive descent parser; for example, see the syntax diagram on the
next page.
Note that you can take advantage of checking whether the next token is in the first set
of a particular non-terminal when parsing a production rule for that non-terminal, as
the sketch below illustrates.
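For example, here is a sketch of how stmt_list can use First(stmt) = {ID, READ, WRITE} to decide between its two alternatives. The function and token names follow the Project 1 starter code; this is a suggestion, not required code:

/* stmt_list -> stmt stmt_list | epsilon
   Loop while the lookahead is in First(stmt); otherwise take epsilon. */
void stmt_list(void)
{
    while (currentToken == ID || currentToken == READ || currentToken == WRITE) {
        stmt();
    }
}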
program         → stmt_list $
stmt_list       → stmt stmt_list | ε
stmt            → id := expr ; | read ( id_list ) ; | write ( expr_list ) ;
expr_list       → expr expr_list_tail
expr_list_tail  → , expr expr_list_tail | ε
id_list         → id id_list_tail
id_list_tail    → , id id_list_tail | ε
expr            → term term_tail
term_tail       → add_op term term_tail | ε
term            → factor factor_tail
factor_tail     → mult_op factor factor_tail | ε
factor          → ( expr ) | id | number
add_op          → + | -
mult_op         → * | /
Syntax Diagram: The syntax diagram below expresses an alternative presentation of a
grammar. This grammar is a revision of the original grammar shown above and
generates the same Simple_PL1 language. It can aid in the engineering of a recursive
descent parser.
[Syntax diagrams for the revised Simple_PL1 grammar: program is a sequence of stmt ending with EOF; stmt is ID := expr ; or READ ( ID {, ID} ) ; or WRITE ( expr {, expr} ) ;; expr is term { add_op term }; term is factor { mult_op factor }; factor is ID, NUMBER, or ( expr ).]
Handling Syntax Errors in Simple_PL1 source code.
If there are no parsing errors, print to standard output the following message:
Parsing complete. No errors.
If there is a detected parsing (or lexical) error, print the token name that was going to
be or was recognized and exit the program. Use the following format for a parser error:

Expected symbol: <token name>    or    Unexpected symbol: <token name>

You can also use the first sets to detect an error inside a certain construct, such as
expressions. Use an error message such as (just an example):

Error in expression: Expected ID, NUMBER, or '('.

The expected or unexpected symbol should be a token near the actual parsing error.
Note: sometimes the error might appear to be misleading; error reporting and recovery
is hard. Utilize the mnemonic array of C-strings from Project 1 to print the token
name.
Shown below are some examples.

Example 1
  Source file:
    x := 2;
    y := 3;
    read(a, b);
    write(a,b,a+b*(2*x/y));
  Standard output:
    Parsing complete. No errors.

Example 2
  Source file:
    x := 2;
    y := 3;
    read(a, b);
    write(a,b,a+b*(2*x/y);
  Standard output:
    Expected symbol: RPAREN

Example 3
  Source file:
    x := 2;
    y := 3;
    read(a, b)
    write(a, b,a+b*(2*x/y));
  Standard output:
    Expected symbol: SEMICOLON

Example 4
  Source file:
    x := 2;
    y := 3;
    read(a b);
    write(a, b,a+b*(2*x/y));
  Standard output:
    Expected symbol: RPAREN

Example 5
  Source file:
    read(x, y, z, v);
    read( c ); temp := (x y)*(2*x + z/4) /v;
    a := (c);
    write( temp /
    );
  Standard output:
    Error in expression: Expected ID, NUMBER, or '('.
NOTES:
In Example 2, there is a missing right parenthesis in the last statement.
In Example 3, there is a missing semicolon that should terminate the read statement
on the third line.
In Example 4, there is a missing comma. Since a comma separates identifiers in an
identifier list for READ, the parser “thinks” it is at the end of the list when no comma
is found and expects a RPAREN. Perhaps the message should be “missing comma,” but
adding code to detect this kind of error would complicate the parser. Although this is a
simple fix in this case, this demonstrates the difficulty of reporting errors and the
design decisions compiler engineers face.
In Example 5, you can see how a specific construct can be referenced in the error
message. The first(Expr) set can be helpful for detecting and reporting this kind of
error.
Implementing a Recursive Descent Parser for Simple_PL1 and Some Supporting
Functions: Utilize the textbook as a guide for writing your recursive descent parser.
Also, shown below are two helpful functions for supporting your recursive descent
parser and implementing error reporting, such as in the examples shown above.
void parse_error(char *errMsg, char *lexeme) {
    extern unsigned numErrs;  // for future use if error recovery is added
    numErrs++;
    fprintf(stderr, "%s: %s\n", errMsg, lexeme);
}

void match(enum tokenType expected)
{
    if (currentToken == expected) {
        currentToken = scan();
    }
    else {
        parse_error("Expected symbol", mnemonic[expected]);
        exit(1);
    }
}
Chapter 1
Introduction
Programming a physical machine requires composing ordered sequences of machine-specific operation codes
for instructions along with providing codes for data and access to data. The operation codes are decoded
by circuits that activate other circuits in the machine to execute the instructions. With the first electronic
computers, it was difficult to read, debug, and modify the numeric codes representing the programs, thus
taking more time to get programs to run. Also, deploying the programs was time-consuming because it often
took many hours or days to load the program codes into the “memory,” since it involved physically wiring
some of the circuits together. For these reasons, programming languages were designed to express
programs that resembled the syntax of natural languages, like English, but with very precise meaning.
Translator programs were developed to translate those programs into machine code.
The first languages (called assembler languages) and translators (assemblers) were developed to overcome
the challenges of reading and writing machine-level code. Assembler languages use mnemonic names for the
operations, such as ADD; operands represented as names of registers or memory addresses (variables), such
as R1 or X; and operands as constant values represented with specific syntax, such as 2 or ’A’. The assembler
instructions stand in almost a one-to-one correspondence with the machine instructions. This correspondence,
together with mnemonic names and names for operands, is what makes assembler programs easier to read
and write than the respective machine program. An example of an assembler program is shown later in this
chapter. Assembler languages and assemblers were introduced in the early 1950s. Since then, and still today,
assembler languages are used when certain applications require access to specific low-level hardware (e.g.,
device drivers) or where the programmer needs to write more time- and/or space-efficient code than that
produced by other higher-level language translators.
High-level programming languages were developed to express abstractions found in mathematical concepts, algorithms, and computational models and are used by many programmers. One of the first high-level
programming languages, developed originally at IBM in the mid 1950s, was ForTran (which stands for Formula Translator); it was more abstract than assembler languages. Fortran was intended for representing
scientific computations and related abstractions, although it has been used for other application domains. In
the late 1950s and early 1960s, COBOL, Algol60, and Lisp (List Processing) were designed and implemented
with translators that generated assembler code or machine code. COBOL, Algol60, and Lisp were designed
to “abstract away” lower-level views of hardware. COBOL is closest to English and was originally designed
for business applications. Many institutions still have large amounts of legacy COBOL code in use today.
Algol60 was introduced to be more general purpose in terms of use in application domains. Algol60 also introduced many programming concepts relevant to many widely-used programming languages of today. Lisp
is based upon the concepts of expressing programs as pure mathematical functions and treating functions
as values or objects that can be operands to other functions. The design philosophy of Lisp was based upon pure
function evaluation without regard to an underlying computer.
Programming language concepts consist of abstractions for expressing computations and representing
data. They serve as a means to examine and compare languages. Examples of concepts include scoping
rules, lifetimes of objects, bindings and binding times, recursion, block structures, data types, orthogonality,
abstract control structures, concurrency, functions and procedures, and parameter passing rules. There
are many more. This book is organized around groups of related concepts. For each concept, several
languages are sampled and compared in terms of how they utilize and implement the concept and other
related concepts. Many concepts are interrelated and cannot be studied in isolation. Again, they are
viewed from the perspectives of the designer, programmer (user), and implementor. For example, functions,
recursion, binding times, and parameters are all interrelated. We look at the motivation of the designer to
include or not include some concept in a language, the pros and cons of how a language employs a concept
from the programmer perspective, and how implementors implement that concept on a computational model
according to the concept’s semantics.
To help study programming languages, it helps to have an understanding of three areas of study: 1) theory
of computation (theoretical computer science); 2) software design methodologies or techniques (software
engineering); and 3) general knowledg…
