Name your VS 2022 (C language) project CS435P02
After completing the project, Zip the entire project folder and test cases. Name your zip file CS435P02
CS 435 Project 1
Scanner for Simple_PL1
Overview of Assignment
Complete a “hand-written” scanner for Simple_PL1.
This project must be written in C (code files “.c” and “.h”) using Visual Studio 2022.
Simple_PL1 Lexical Specification
For definitions below: letter = [a..zA..Z] digit = [0..9].
Token Name (used
internally by compiler)
ID
READ
Regular expression
definition
WRITE
write
NUMBER
ASSIGN
PLUS
MINUS
TIMES
DIV
SEMICOLON
COMMA
LPAREN
RPAREN
SCAN_EOF
digit digit*
:=
+
*
/
;
,
(
)
(letter|_)(letter|digit|_)*
read
Lexeme examples
max, a_1, _a1,
note: read is an identifier,
but is reserved and needs
to be returned from
scanner as distinct token.
note: write is an identifier,
but is reserved and needs
to be returned from
scanner as distinct token.
1, 21, 1024
scanner returns this when
the source end-of-file is
reached.
WHAT TO SUBMIT: Name your Visual Studio 2022 solution/project:
CS435P01. After completing the project, Zip the entire
solution/project folder and test cases. Name your zip file CS435P01.
I will grade your project by running it against a collection of test files.
Example: The command line should be:
C:> CS435P01YourLastName.exe src1.txt
See the example shown below illustrating the input – output relationship.
Project 1 Scanner for Simple_PL1
Page 1 of 5
x := 3;
y := 4;
read(x);
z1 := x + y;
write( x, y, z1,x*y/2-23 );
⇒
ID, x
ASSIGN
NUMBER, 3
SEMICOLON
ID, y
ASSIGN
NUMBER, 4
SEMICOLON
READ
LPAREN
ID, x
RPAREN
SEMICOLON
ID, z1
ASSIGN
ID, x
PLUS
ID, y
SEMICOLON
WRITE
LPAREN
ID, x
COMMA
ID, y
COMMA
ID, z1
COMMA
ID, x
TIMES
ID, y
DIV
NUMBER, 2
MINUS
NUMBER, 23
RPAREN
SEMICOLON
TO DO: Write a driver program to test your scanner. You must use the command-line
parameters to access the source file, i.e. argv[1]. Open the source file, read it and
output all tokens on separate lines. For identifiers and numbers, you must also output
the lexeme as shown above in the example.
You must use the internal names for tokens shown above in your scanner. That is, the
scanner should return these names to your driver program. For the purpose of testing
the scanner, you’ll need a corresponding array of strings that represent the internal
Token name. You’ll also need a lexeme char array (c-string).
Although it is poor programming practice to use global variables, for this project, you
will be allowed to use global variables current_token and lexeme. You can also use the
src file pointer variable as a global, so you don’t have to pass that as a parameter to
the scanner.
Shown below is some starting code.
Project 1 Scanner for Simple_PL1
Page 2 of 5
Outline of Scanner and Driver for testing the Scanner. Written in C.
//starter code for Simple_PL1 scanner
#include //for c I/o
#include // for exit()
#include // for isalpha(), isalnum(), …
enum tokenType {
READ, WRITE, ID, NUMBER, LPAREN, RPAREN, SEMICOLON, COMMA, ASSIGN, PLUS, MINUS, TIMES, DIV, SCAN_EOF
};
char *mnemonic[] = { “READ”, “WRITE”, “ID”, “NUMBER”, “LPAREN”, “RPAREN”, “SEMICOLON”, “COMMA”,
“ASSIGN”, “PLUS”, “MINUS”, “TIMES”, “DIV”, “SCAN_EOF”};
void lexical_error(char ch)
{
fprintf(stderr, “Lexical Error. Unexpected character: %c.\n”, ch);
}
char lexeme[256] = { ‘\0’ };
unsigned int lexLen = 0;
FILE *src;
enum tokenType scan()
{
static int currentCh = ‘ ‘;
static int tempCh = ‘ ‘;
char* reserved[2] = { “read”, “write” };
lexLen = 0;
// clear lexeme buffer for each scan
lexeme[0] = ‘\0’;
extern FILE *src;
//pointer to FILE handle that binds to source file.
if (feof(src)) {
return SCAN_EOF;
}
while ((currentCh = fgetc(src)) != EOF) {
if (isspace(currentCh)) {
continue;
}
else if (isalpha(currentCh)) { //needs to be modified
lexeme[0] = currentCh;
lexLen = 1;
for (tempCh = fgetc(src); isalnum(tempCh) || tempCh == ‘_’;) {
//build identifier lexeme
}
lexeme[lexLen] = ‘\0’;
ungetc(tempCh, src); //put back character that is not a alpha/digit or ‘_’
// see if lexeme is a reserved word, if not, return ID.
}
else if (isdigit(currentCh)) {
// build lexeme for number
// finish fixing lexeme string, ungetc the last character read that is not a digit and then return NUMBER
}
else if (currentCh == ‘+’) {
return PLUS;
}
// use selection statements to look for tokens for operators and delimiters and assignment (:=)
else {
lexical_error(currentCh);
}
}
return SCAN_EOF;
}
Project 1 Scanner for Simple_PL1
Page 3 of 5
int main(int argc, char *argv[])
{
extern FILE *src;
enum tokenType currentToken;
if (argc > 1) {//should be better testing for proper number of arguments, but not required for this project
if (fopen_s(&src, argv[1], “r”)) {
fprintf(stderr, “Error opening source file: %s “, argv[1]);
exit(1);
}
}
while ((currentToken = scan()) != SCAN_EOF) {
//finish body for displaying the string version of the internal token name and
//also print lexeme if the token is a ID or NUMBER. Do not print lexeme for the other tokens.
}
fclose(src);
return 0;
} //end driver
Project 1 Scanner for Simple_PL1
Page 4 of 5
A preview of a Grammar for Simple_PL1 is shown, so you can start writing programs
in Simple_PL1. However, you do not need to know proper syntax in order to test your
scanner.
The “$$” is treated as eof.
program
stmt_list
stmt
expr
term_tail
term
factor_tail
factor
add_op
mult_op
→
→
→
→
→
→
→
→
→
→
stmt_list $$
stmt stmt_list | ε
id := expr; | read id; | write expr;
term term_tail
add_op term term_tail |ε
factor factor_tail
mult_op factor factor_tail | ε
( expr ) | id | number
+ | * | /
Grammar specifications for Simple_PL1 will be given in the future.
Project 1 Scanner for Simple_PL1
Page 5 of 5
CS 435 Project 2
Recursive Descent Parser (with Scanner) for Simple_PL1
Overview of Assignment
•
•
Using your Project 1 scanner, implement a recursive descent parser in C for
Simple_PL1.
If a lexical error or parser error is encountered, report the error according to the
specification shown below and exit.
Review of Simple_PL1 Lexical Specification
Your scanner must recognize the following tokens. For definitions below: letter =
[a..zA..Z] and digit = [0..9].
Token Name (used
internally by compiler)
ID
READ
Regular expression
definition
WRITE
write
NUMBER
ASSIGN
PLUS
MINUS
TIMES
DIV
SEMICOLON
COMMA
LPAREN
RPAREN
SCAN_EOF
digit digit*
:=
+
*
/
;
,
(
)
(letter|_)(letter|digit|_)*
read
Lexeme examples
max, a_1, _a1,
note: read is an identifier,
but is reserved and needs
to be returned from
scanner as distinct token.
note: write is an identifier,
but is reserved and needs
to be returned from
scanner as distinct token.
1, 21, 1024
scanner returns this when
the source end-of-file is
reached.
WHAT TO SUBMIT: Name your project CS435P02. After completing
the project, zip the entire project folder and test cases. Name your zip file
CS435P02.
I will grade your project by running it against a collection of test files. (The executable
file name for this example is CS435P02Jeffrey. Your executable will be your project
name.)
Example: The command line should be: $ CS435P02Jeffrey src1.txt
TO DO: This parser must be written in C using the Visual Studio 2022 IDE. Utilize
your scanner from Project 1. You must use the command-line parameters to access
Project 2 Parser for Simple_PL1
Page 1 of 5
the source file, i.e. argv[1]. Open the source file and parse it. You must use recursive
descent parsing. For this simple parser, when a parsing error occurs, just print the
expected token (use mnemonic array of strings), or, indicate that an unexpected token
was found and exit the program. You will need to analyze which message is
appropriate. When a lexical error occurs, indicate that an unrecognized character was
seen. Do not attempt error recovery.
Simple_PL1 Syntax Definition and Grammar:
The Simple_Pl1 programming language is defined and is generated by the grammar
shown below. (The “$” represents the SCAN_EOF token.) The recursive descent parser
must be based upon the following grammar, which meets the conditions for building a
recursive descent parser. You can also alter it to make it more concise and for building
a more efficient recursive descent parser. For example, see the syntax diagram on the
next page.
Note that you can take advantage of checking whether the next token is in the first set
of a particular non-terminal when parsing a production rule for that non-terminal.
program
stmt_list
stmt
expr_list
expr_list_tail
id_list
id_list_tail
expr
term_tail
term
factor_tail
factor
add_op
mult_op
→
→
→
→
→
→
→
→
→
→
→
→
→
→
stmt_list $
stmt stmt_list | ε
id := expr; | read( id_list ); | write( expr_list );
expr expr_list_tail
, expr expr_list_tail | ε
id id_list_tail
, id id_list_tail | ε
term term_tail
add_op term term_tail |ε
factor factor_tail
mult_op factor factor_tail | ε
( expr ) | id | number
+ | * | /
Project 2 Parser for Simple_PL1
Page 2 of 5
Syntax Diagram: This syntax diagram expresses an alternative presentation of a
grammar. This grammar is a revision of the original grammar shown above and
generates the same Simple_PL1 language. This grammar shown below can aid in the
engineering of a recursive descent parser.
program
stmt
stmt
EOF
ID
:=
expr
;
READ
(
ID
)
;
,
expr
(
WRITE
)
;
,
expr
term
term
factor
+
*
–
/
factor
ID
NUMBER
(
Project 2 Parser for Simple_PL1
expr
)
Page 3 of 5
Handling Syntax Errors in Simple_PL1 source code.
If there are no parsing errors, print to standard output the following message:
Parsing complete. No errors.
If there is a detected parsing (or lexical error), print the token name that was going to
be or was recognized and exit the program. Use the following format for a parser error:
Expected symbol: or Unexpected symbol: .
You can also use the first sets to detect an error inside a certain construct, such as
expressions. Use an error message such as (just an example):
Error in expression: Expected ID, NUMBER, or ‘(‘.
The expected or unexpected symbol should be a token nearby the actual parsing error.
Note: sometimes the error might appear to be misleading. Error reporting and recovery
is hard. Utilize the mnemonic array of c-strings from Project 1 to print the token
name.
Shown below are some examples.
Source File
Example 1
Example 2
x := 2;
y := 3;
read(a, b);
write(a,b,a+b*(2*x/y));
x := 2;
y := 3;
read(a, b);
write(a,b,a+b*(2*x/y);
Standard output
⇒
⇒
Parsing complete. No errors.
Expected symbol: RPAREN
Example 3
x := 2;
y := 3;
read(a, b)
write(a, b,a+b*(2*x/y));
⇒
Expected symbol: SEMICOLON
Example 4
x := 2;
y := 3;
read(a b);
write(a, b,a+b*(2*x/y));
⇒
Expected symbol: RPAREN
Example 5
read(x, y, z, v);
read( c ); temp := (x y)*(2*x + z/4) /v;
a := (c);
write( temp /
);
⇒
Error in expression: Expected
ID, NUMBER, or ‘(‘.
NOTES:
In Example 2, there is a missing right parenthesis in the last statement.
In Example 3, there is a missing semicolon that should terminate the read statement
on the third line.
Project 2 Parser for Simple_PL1
Page 4 of 5
In Example 4, there is a missing comma. Since a comma separates identifiers in an
identifier list for READ, the parser “thinks” it is at the end of the list when no comma
is found and expects a RPAREN. Perhaps the message should be “missing comma,” but
adding code to detect this kind of error would complicate the parser. Although this is a
simple fix in this case, this demonstrates the difficulty of reporting errors and the
design decisions compiler engineers face.
In Example 5, you can see how a specific construct can be referenced in the error
message. The first(Expr) set can be helpful for detecting and reporting this kind of
error.
Implementing a Recursive Descent Parser for Simple_PL1 and Some Supporting
Functions: Utilize the textbook as a guide for writing your recursive descent parser.
Also, shown below are two helpful functions for supporting your recursive descent
parser and implementing error reporting, such as in the examples shown above.
void parse_error(char *errMsg, char *lexeme) {
extern unsigned numErrs; //for future if error recovery used
numErrs++;
fprintf(stderr, “%s: %s\n”, errMsg, lexeme);
}
void match(enum tokenType expected)
{
if (currentToken == expected) {
currentToken = scan();
}
else{
parse_error(“Expected symbol”, mnemonic[expected]);
exit(1);
}
}
Project 2 Parser for Simple_PL1
Page 5 of 5