Scanner specifications and parser specifications for the C_flat language

Follow the specifications: scanner specifications and parser specifications for the C_flat language. You are given the scanner code, but you should make sure it is working before using it in your overall project. The scanner is given in the cfScannerDrJ.l file. The tokens file is also given: cfTokensDrJ.y. Read the specifications and the Yacc/Flex book and tutorial to understand the purpose of the .y file.

Translator Project for C♭ (C-Flat)
Scanner
1 Introduction
This document contains the specifications for a translator project that translates a subset of C into code to
be interpreted by a virtual machine. The subset of C is C♭ , pronounced “C flat.” This scanner will be used by
a parser for C♭ . Additional projects for static semantic analysis and code generation parts will be assigned.
The virtual machine will be a stack-based instruction set. To be precise, C♭ is not a true subset of C in
that the identifiers printf and scanf are considered reserved words in C♭ and are not reserved words in C;
they will have similar semantics to the corresponding C library functions provided in C implementations.
The syntax analysis of the translator must be constructed with software tools, flex and yacc, that
construct a scanner and parser, respectively, each written in C for the C♭ translator. These tools are
available on the CS lab computers. Thus, the C♭ translator must be written in C. Your executable C♭
scanner and test driver must be named cfScanner.
Below is a summary of the specifications:
Source Language:            C♭ in files named with suffix: .cf
Listing file:               C♭ must produce a listing file whose name is the same
                            as the source file name, except with suffix: .txt
Scanner-Test Name:          cfScanner.exe
Construction Tools:         flex and yacc
Implementation Environment: Windows 10/11, Visual Studio 2022 C++11 compiler, x86 Console target
2 File and Command-Line Specifications
This section specifies the usage of the translator and the files needed by the user of the translator, but does not
indicate the files needed for constructing the translator. First, the required user files and their descriptions
are provided. Then, the command-line syntax and semantics are presented.
2.1 Files Used and Descriptions
Below are descriptions of all the files that will ultimately be used by the fully-completed version of the
translator. Since the translator will be developed incrementally the code listing file will not be used by the
scanner and parser. However, the files containing the source and source listing must be handled by both the
scanner and parser. The handling of the code listing file is to be added during the development of the code
generator.
1. Input Files:
(a) Source File: A text file containing a program written in C♭ source code. A user of C♭ supplies
the name of this file on a command-line. The file should have an extension of “.cf”. The C♭
translator has the responsibility of opening, scanning, parsing, and closing this file.
2. Output Files:
(a) Standard Output: During this phase of the project, the main test driver of the scanner prints the
tokens: the string version of the internal compiler name and, for identifiers, string literals, and
numeric literals (of all types), the lexeme.
(b) Source Listing: This file is always produced and overwritten if it already exists. The contents
are a copy of the source file except with line numbers starting in column 1. If there are any
scanner errors, you must indicate where the error occurs. At the end of the source listing you
must print, on separate lines, the following:
Compilation Complete. ### errors reported.
Source Program = ### lines.
The ### is replaced with the actual number, right-justified. All lines count, including all lines in
a multi-line comment.
For this part of the project, the source listing file is always produced when your C♭ scanner is invoked.
The source listing file name should be the source filename with .cf replaced by .txt. (To
generalize, no matter what the suffix extension of the source file, that extension should be replaced
with .txt and, if there is no extension, .txt should be appended.) Also, for this scanner part of
the project, simply use: Compilation Complete. No errors reported. Print the line count on the
next line using the format string shown above, but with the actual number of lines in the source code.
The scanner driver should print these last two lines. Eventually, the parser will produce these lines.
2.2 Command-Line Syntax
The command line syntax is as follows:
cfScannerLastName.exe source-filename
Note: Future projects that implement code generation will also produce a target-code listing file. For
this first part of the project, there is no requirement to handle the target-code listing file.
3 What, When and Where to turn in
- What to turn in: See Parser specifications.
- When and Where to turn in: Before the due date and time shown on the assessment link on Blackboard.
Regarding testing your scanner, it is best to also include a source file that contains every possible
token kind. Make sure to test all kinds of integer, float, and double lexemes (all bases and all data types),
several kinds of string literals, single-line comments, and multi-line comments. There is a sample source file
example at the end of this document.
4 Token and Lexeme Specifications
The compiler tools, flex and yacc, communicate via numeric codes for the recognized tokens. For readability
reasons, compiler developers prefer to use internal mnemonic names for these numeric codes. The compiler
developer can indicate the mnemonic name via the %token directive in the yacc input file (.y file), and
the corresponding code is assigned by yacc via a #define directive, which is written to the yacc-generated
y.tab.h file (producing this file requires the -d option on the yacc command line). The y.tab.h file can then be #include'd in
the definition section of the .l file, which is the input file for the scanner. The scanner can use these mnemonic
names when returning the token code following the recognition of a lexeme of the respective token kind.
For example, the directive %token ABC in the definition section (first section of three sections) of a
yacc input file (with extension .y) results in the directive #define ABC numeric-code in y.tab.h, where
numeric-code ≥ 257 and depends on the position of ABC among all the %token directives.
4.1 Some Visual Studio 2022 Build Requirements
When using flex and yacc with Visual Studio 2022, you need to add two new "other" files to the source
folder in your solution/project: the tokens.y file and a .l file, which you name. You will add your tokens
and supporting C code to the .y file, and your regular expressions and supporting C code to the .l file. Utilize the examples
in the book, in the supplements, and in the lecture notes for instructions on how to properly add this code. What
you add to these files comprises the core work of this project.
VS 2022 Project Settings: You will need to add a pre-build event to your VS 2022 scanner project:
yacc -d tokens.y & flex yournamedfile.l. After yacc and flex successfully complete, you need to add
the following files to your VS 2022 project: y.tab.h, y.tab.c, and lex.yy.c. Don't forget to add libfl.a to
your list of libraries under the setting: Project -> Linker -> Input -> Additional Dependencies.
Also, in your .l file, near the beginning of your verbatim C code (near the includes), you will need to put
a #pragma warning(disable : 4996). This will shut off the deprecation warning (now an error
in C++11, C++14, C++17) about the old C string functions used in the code generated by flex.
4.2 Overview of C♭ Tokens and Comments
In C♭ there are six categories of tokens:
- Reserved Words
- Identifiers
- Integer Literals (int, unsigned, long, unsigned long expressed with decimal, octal, and hexadecimal digits)
- Real Literals (float and double literals)
- String Literals
- Separators and Operators
Comments could be designed to be a token category, but for this project, the scanner must read past
them before returning to the calling program, which is usually the parser, but could be the test driver code.
There are two kinds of comments:
- Single-line: // … end-of-line
- Multi-line: /* … end-of-line … end-of-line … … */
For each token category, the codes and the lexeme syntax are given in the following subsections. Each
reserved word is considered to be one token; so, there is one unique lexeme for each reserved word. Several
different lexemes comprise the Identifier tokens. Thus, the scanner must communicate not only that it has
recognized an identifier, but also which specific lexeme was recognized. This is important for
semantics and code generation. The lexeme must also be communicated to the parser when the scanner recognizes
integer, real (float and double), and string literals. In C♭ (as in C), separator and operator tokens have
unique corresponding lexemes, except for some tokens that have digraphs as alternatives. The syntax of the
lexemes for each of the token categories, along with the codes to be used, is discussed in
detail in each of the following subsections.
4.3 Reserved Identifiers
The following table specifies the reserved words for ANSI C (the C99 version). Again, remember that
printf and scanf are not reserved words in standard C (C99). Also note that not all of these reserved
words are actually part of C♭ as specified in the parser specification; only a subset of them is used, and
printf and scanf are added to C♭. The reason all of the C99 reserved words are required for this part
of the project is purely pedagogical: to develop your skills as scanner developers and so that you can
say that you wrote a scanner that recognized the full C99 reserved word set.
Each reserved word is represented by its lexeme (which is a legal identifier, as defined next), a
unique numeric code, and a unique MNEMONIC NAME used internally in the compiler. These are listed in
alphabetical order in the table below; each row gives the reserved
word lexeme, the code that must be used, and the mnemonic name that must be used.
In C♭ all identifiers (and reserved words) are case-sensitive.
RESERVED WORDS: Codes and Mnemonic Names

lexeme        numeric code   MNEMONIC NAME
_Bool         257            _BOOL_TOK
_Complex      258            _COMPLEX_TOK
_Imaginary    259            _IMAGINARY_TOK
auto          260            AUTO
break         261            BREAK
case          262            CASE
char          263            CHAR
const         264            CONST
continue      265            CONTINUE
default       266            DEFAULT
do            267            DO
double        268            DOUBLE
else          269            ELSE
enum          270            ENUM
extern        271            EXTERN
float         272            FLOAT
for           273            FOR
goto          274            GOTO
if            275            IF
inline        276            INLINE
int           277            INT
long          278            LONG
printf        279            PRINTF
register      280            REGISTER
restrict      281            RESTRICT
return        282            RETURN
scanf         283            SCANF
short         284            SHORT
signed        285            SIGNED
sizeof        286            SIZEOF
static        287            STATIC
struct        288            STRUCT
switch        289            SWITCH
typedef       290            TYPEDEF
union         291            UNION
unsigned      292            UNSIGNED
void          293            VOID
volatile      294            VOLATILE
while         295            WHILE

4.4 Identifiers
Identifiers are represented by code 296 and the internal translator mnemonic name IDENTIFIER. The legal identifier
lexemes are defined by the following context-free grammar expressed in EBNF. The matched lexemes must
be stored in the variable yytext so that the parser can utilize the name for semantic actions. The variable,
yytext, must be used by the test-harness (driver) code for purposes of testing the scanner. More about the
structure and requirements of the scanner tests is given later.
<identifier>     ::=  <letter> ( <letter> | <decimal digit> )*
<letter>         ::=  A | ··· | Z | a | ··· | z | _
<decimal digit>  ::=  0 | 1 | ··· | 9
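As an illustration only (a sketch, not the required code), this grammar corresponds to a flex definition and pattern such as the following; the definition names are placeholders chosen here, and the action is deferred to Section 5:

letter          [A-Za-z_]
decimal_digit   [0-9]
%%
{letter}({letter}|{decimal_digit})*   { /* TO DO: reserved-word lookup or return IDENTIFIER; see Section 5 */ }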
4.5 Integer Literals
Each Integer literal must be given the numeric code of 297 and the mnemonic name INT_LITERAL.
Each INT_LITERAL lexeme represents a denotation of a value that belongs to a set of values associated
with an integer data type. C♭ has four integer data types: integer, long integer, unsigned integer, and
unsigned long integer. The actual sets of values associated with each of these data types vary with the
corresponding environment. For the VM platform of this project, the set of values for each of these types is
shown below. (The notation [a, b] represents the closed interval of integer values.)
Integer Data Type     Set of Values
integer               [-2^31, 2^31 - 1]
long integer          [-2^31, 2^31 - 1]
unsigned integer      [0, 2^32 - 1]
unsigned long         [0, 2^32 - 1]
As with C, C♭ uses the reserved word, int, for the name of the integer data type. In combination with
int or without int, the reserved words, long and unsigned, are used to name the long integer and unsigned
integer data types, respectively. In C♭ , each non-negative integer value has three lexeme representations
based upon decimal, octal, and hexadecimal bases. The negative values for integer and long integer data
types are represented by using the unary minus operator, -, which is another token defined later in this
specification. Thus, there is no negative (or positive) sign as part of any integer lexeme; in this sense, there
are no lexemes for negative values for integer and long integer data types. No matter what base is used,
when the scanner recognizes an INT_LITERAL, the scanner must convert the lexeme to a two’s complement
value according to the base, store the result in a variable accessible by the parser or the test driver, and then
return code 297. The mnemonic name for code 297 must be INT_LITERAL.
A grammar for the language of INT_LITERALs is shown below. You must use this to write the regular
expressions for flex so that INT_LITERAL lexemes are recognized.
<int literal>       ::=  <decimal literal> [<int suffix>]
                       | <octal literal> [<int suffix>]
                       | <hex literal> [<int suffix>]
<decimal literal>   ::=  <nonzero digit> | <decimal literal> <decimal digit>
<octal literal>     ::=  0 | <octal literal> <octal digit>
<hex literal>       ::=  <hex prefix> <hex digit> | <hex literal> <hex digit>
<hex prefix>        ::=  0x | 0X
<decimal digit>     ::=  0 | 1 | 2 | ··· | 9
<nonzero digit>     ::=  1 | 2 | ··· | 9
<octal digit>       ::=  0 | 1 | ··· | 7
<hex digit>         ::=  0 | 1 | 2 | ··· | 9 | A | ··· | F | a | ··· | f
<int suffix>        ::=  <unsigned suffix> [<long suffix>] |
                         <long suffix> [<unsigned suffix>]
<unsigned suffix>   ::=  u | U
<long suffix>       ::=  l | L
INT_LITERAL Example
Example Value: 28

Lexemes:    Decimal   Octal   Hex
            28        034     0x1C

Target VM Architecture 2's Complement Representation:
0000 0000 0000 0000 0000 0000 0001 1100
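For illustration only, one possible (not the only) way to express the integer-literal productions as flex definitions and a rule is sketched below; the definition names are placeholders, and the conversion action is discussed in Section 5:

intsuffix       ([uU][lL]?|[lL][uU]?)
decimal_lit     [1-9][0-9]*
octal_lit       0[0-7]*
hex_lit         0[xX][0-9A-Fa-f]+
%%
({decimal_lit}|{octal_lit}|{hex_lit}){intsuffix}?   { /* TO DO: convert yytext, set literal_type and yylval */
                                                      return INT_LITERAL; }

Note that octal_lit above also matches the lexeme 0, which the grammar classifies as an octal literal.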
4.6 Real Literals: float and double literals
A grammar for the language of REAL_LITERALs is shown below. This token type has two lexeme patterns:
one for float and one for double. The only difference between the two is the f or F suffix for float lexemes. You must
translate the grammar below (given in EBNF) into the regular expressions for flex so that float and double
lexemes are recognized correctly. You must also convert the float and double lexemes into internal float and
double values using flex actions. These values will be stored in a global variable produced by yacc. This is
discussed later.
Recall that in EBNF the notation {α} means 0 or more repetitions of α.
<real literal>    ::=  <digit> {<digit>} . {<digit>} [<exponentPart>] [<floatSuffix>] |
                       . <digit> {<digit>} [<exponentPart>] [<floatSuffix>] |
                       <digit> {<digit>} <exponentPart> [<floatSuffix>]
<exponentPart>    ::=  (E | e) [+ | -] <digit> {<digit>}
<floatSuffix>     ::=  (f | F)
<digit>           ::=  0 | 1 | 2 | ··· | 9
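As a sketch only (one possible translation, with placeholder definition names), the corresponding flex rules could be written with shared actions; the conversion itself is covered in Section 5:

digit       [0-9]
exponent    [Ee][+-]?[0-9]+
%%
{digit}+"."{digit}*({exponent})?[fF]?   |
"."{digit}+({exponent})?[fF]?           |
{digit}+{exponent}[fF]?                 { /* TO DO: strtof()/strtod() conversion and literal_type */
                                          return REAL_LITERAL; }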
4.7 String Literals
String literals must have code 299 and mnemonic name of STR_LITERAL.
<string literal>      ::=  " <string character>* "
<string character>    ::=  <plain character> | <escaped character>
<plain character>     ::=  sp | ! | # | $ | % | & | ' |
                           ( | ) | * | + | , | - | . | / |
                           0 | ··· | 9 | : | ; | < | = | > | ? | @ |
                           A | ··· | Z | [ | ] | ^ | _ | ` |
                           a | ··· | z | { | | | } | ~
<escaped character>   ::=  \ ( n | t | " )

4.8 Separator and Operator Literals
The codes for the separator and operator literals must correspond to the associations shown in the
following table. Note that the last rows indicate alternate ways to write the left and right brackets and the left and
right braces, along with the pound and double-pound symbols. These alternate symbols are called digraph
symbols.
When expressing a grammar within the syntax for yacc, one can express single-symbol tokens as the
usual C character literals and need not utilize the mnemonic name of the token. It is simply up to the
compiler writer which approach to use. The mnemonic names are required here for the purposes of testing
the scanner.
lexeme    numeric code   MNEMONIC NAME
(         300            LPAREN
)         301            RPAREN
[         302            LBRACKET
]         303            RBRACKET
{         304            LBRACE
}         305            RBRACE
;         306            SEMICOLON
,         307            COMMA
.         308            DOT
+         309            PLUS
-         310            MINUS
*         311            STAR
/         312            SLASH
%         313            PERCENT
&         314            AMPERSAND
|         315            BAR
^         316            CARET
~         317            TILDE
->        318            ARROW
!         319            EXCLAIM
++        320            DBL_PLUS
--        321            DBL_MINUS
<<        322            DBL_LESS
>>        323            DBL_GTR
?         324            QUESTION
:         325            COLON
...       326            TRIPLE_DOT
&&        327            DBL_AMP
||        328            DBL_BAR
==        329            DBL_EQ
!=        330            EXCL_EQ
<         331            LESS
<=        332            LESS_EQ
>         333            GTR
>=        334            GTR_EQ
=         335            EQ
+=        336            PLUS_EQ
-=        337            MINUS_EQ
*=        338            STAR_EQ
/=        339            SLASH_EQ
%=        340            PERCENT_EQ
&=        341            AMP_EQ
|=        342            BAR_EQ
^=        343            CARET_EQ
<<=       344            DBL_LESS_EQ
>>=       345            DBL_GTR_EQ
#         346            POUND
##        347            DBL_POUND
<:        302            LBRACKET
:>        303            RBRACKET
<%        304            LBRACE
%>        305            RBRACE
%:        346            POUND
%:%:      347            DBL_POUND

4.9 Comments
C♭ has the same two kinds of comments as C: single-line and multi-line. When the scanner recognizes two consecutive forward
slashes, it should read past all characters up to the first newline character. When the scanner recognizes /*
it should ignore every character until the first occurrence of the string pattern */ is recognized. Grammar rules with extra
extended-BNF notation are used to specify comments. These are only for the specification of comments and are not to be used
by the parser. The notation not(ch) means any ASCII character that is not ch.
<single-line comment>  ::=  // not(newline character)* newline character
<multi-line comment>   ::=  "/" "*" ( not("*") | ( "*" not("/") ) )* "*"+ "/"
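One hedged sketch (not the only workable design) of rules-section patterns that read past both kinds of comments is shown below; the actions are left as TO DO items because the comment text must still be counted and echoed into the listing file as described in Section 5:

"//".*                          { /* TO DO: echo the comment to the listing file */ }
"/*"([^*]|"*"+[^*/])*"*"+"/"    { /* TO DO: count newlines in yytext and echo the comment to the listing file */ }

The single-line rule deliberately leaves the terminating newline to the separate newline rule, so the line counter is handled in one place. A start condition could be used for multi-line comments instead of the single pattern shown here.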
5 Scanner Design and Implementation Framework
5.1 yacc input (.y) file format
The scanner must return the appropriate token code when recognizing a lexeme in the corresponding token
category. As discussed above in the specification section, each token code number must be given a mnemonic
name. To associate these names with codes, you must create a .y file, which will be the input for yacc.
For the purposes of discussion here, we shall use the name tokens.y for this yacc input file.
There are three sections (delimited by %% characters) in each .y file:
declarations:
internal meta-level definitions for production rules
verbatim C/C++ code between %{ %} delimiters which is copied to y.tab.c and y.tab.h files.
%%
production rules for a LALR(1) grammar
%%
utility and supporting functions
The following should be placed in the tokens.y file. Note that there is only one production rule, a start
symbol going to an empty string. This is only a dummy ϵ-production rule because the middle section cannot
be empty.
The first section should have the following %token directives, which are turned into #define macros. Here
is the starting point for the tokens.y file.
%token _BOOL_TOK _COMPLEX_TOK _IMAGINARY_TOK
%token AUTO BREAK CASE CHAR CONST CONTINUE
%token DEFAULT DO DOUBLE ELSE ENUM EXTERN
%token FLOAT FOR GOTO IF INLINE INT LONG
%token PRINTF REGISTER RESTRICT RETURN
%token SCANF SHORT SIGNED SIZEOF STATIC STRUCT SWITCH
%token TYPEDEF UNION UNSIGNED VOID VOLATILE WHILE
%token IDENTIFIER INT_LITERAL REAL_LITERAL STR_LITERAL
%token LPAREN RPAREN LBRACKET RBRACKET LBRACE RBRACE
%token SEMICOLON COMMA DOT PLUS MINUS STAR
%token SLASH PERCENT AMPERSAND BAR CARET TILDE
%token ARROW EXCLAIM DBL_PLUS DBL_MINUS DBL_LESS DBL_GTR
%token QUESTION COLON TRIPLE_DOT DBL_AMP DBL_BAR DBL_EQ
%token EXCL_EQ LESS LESS_EQ GTR GTR_EQ EQ
%token PLUS_EQ MINUS_EQ STAR_EQ SLASH_EQ PERCENT_EQ AMP_EQ
%token BAR_EQ CARET_EQ DBL_LESS_EQ DBL_GTR_EQ POUND DBL_POUND
%%
program : ;
%%
void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}
Before the %token directives, add the following declarations in the first (definition) section of the tokens.y
file:
%{
#include <stdio.h>
int yylex(void);
void yyerror(char *);
%}
In the code (last third) section of tokens.y, one defines the yyerror() function as follows:
void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}
Just copy the above for this part of the project.
To create a file, y.tab.h, one must issue:
yacc -d tokens.y
The file y.tab.h is created with the following (partial) contents:
#define _BOOL_TOK 257
#define _COMPLEX_TOK 258
#define _IMAGINARY_TOK 259
#define AUTO 260
#define BREAK 261
#define CASE 262
#define CHAR 263
#define CONST 264
#define CONTINUE 265
#define DEFAULT 266
...
#define WHILE 295
#define IDENTIFIER 296
#define INT_LITERAL 297
#define REAL_LITERAL 298
#define STR_LITERAL 299
#define LPAREN 300
...
#define BAR_EQ 342
#define CARET_EQ 343
#define DBL_LESS_EQ 344
#define DBL_GTR_EQ 345
#define POUND 346
#define DBL_POUND 347
The y.tab.h file will be used by the scanner. It is included via #include "y.tab.h" in the .l file and,
ultimately, in the generated code for the scanner.
In addition, semantic information must also be provided for INT_LITERALs, REAL_LITERALs, STR_LITERALs,
and IDENTIFIERs. To pass semantic information back, you must utilize a global data structure: yylval. The
actual string for the lexeme corresponding to INT_LITERAL, REAL_LITERALs, STR_LITERAL, and IDENTIFIER
tokens may be used by the semantic analysis phase. Also, in the case of numeric literals, these lexeme strings
are converted to an appropriate internal value of integer (int, unsigned, long, unsigned long) or real (float or
double). For the STR_LITERAL token, the string (without the quote characters) is used. These values are stored
in one of the %union fields shown below.
To do this, use the following in the .y file, for example, tokens.y:
%union{
    int           int_value;
    unsigned      u_int_value;
    long          l_int_value;
    unsigned long ul_int_value;
    float         float_value;
    double        double_value;
    char*         string_value;
    char*         ident_name;
}
With this %union directive inside the definition (first) section of tokens.y, along with the %token directives above,
yacc produces y.tab.h and places the following in the y.tab.h file:
typedef union{
    int           int_value;
    unsigned      u_int_value;
    long          l_int_value;
    unsigned long ul_int_value;
    float         float_value;
    double        double_value;
    char*         string_value;
    char*         ident_name;
} YYSTYPE;
extern YYSTYPE yylval;
For INT_LITERAL tokens, the scanner must convert each INT_LITERAL lexeme from decimal, octal, or
hexadecimal string representation to a int, long int, unsigned int, or unsigned long internal representation.
Inside the scanner, the action associated with the recognition of an INT_LITERAL lexeme must convert the
lexeme, which is found in yytext, into the appropriate integer value (according to the suffix), and store the
value in the appropriate field of yylval: yylval.int_value, yylval.u_int_value, yylval.l_int_value,
or yylval.ul_int_value. The parser and code generator will need to use the integer values.
Some useful string functions (with sample actual parameters) for this string-to-integer translation include
strtol(yytext, NULL, 0) and strtoul(yytext, NULL, 0).
Also useful are string functions such as strchr(char*, char). For example, one might use
strchr(yytext, 'u') to find out whether there is an unsigned-integer indicator in the suffix. Read
the literature to find out how these functions work.
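Putting these pieces together, the INT_LITERAL action might look roughly like the sketch below. It is one possible arrangement, not a prescribed one; {int_lit} stands for whatever pattern you wrote in Section 4.5, and the isUnsignedLit()/isLongLit() helpers and literal_type variable are the ones supplied in the driver code later in this section:

{int_lit}   {
    if (isUnsignedLit(yytext) && isLongLit(yytext)) {
        literal_type = UNSIGNED_LONG_TYPE;
        yylval.ul_int_value = strtoul(yytext, NULL, 0);   /* base 0: handles 0x.., 0.., decimal */
    } else if (isUnsignedLit(yytext)) {
        literal_type = UNSIGNED_TYPE;
        yylval.u_int_value = (unsigned) strtoul(yytext, NULL, 0);
    } else if (isLongLit(yytext)) {
        literal_type = LONG_TYPE;
        yylval.l_int_value = strtol(yytext, NULL, 0);
    } else {
        literal_type = INTEGER_TYPE;
        yylval.int_value = (int) strtol(yytext, NULL, 0);
    }
    return INT_LITERAL;
}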
For REAL_LITERAL tokens, the scanner must convert each REAL_LITERAL lexeme into a float or double
internal representation. Inside the scanner, the action associated with the recognition of a REAL_LITERAL
lexeme must convert the lexeme, which is found in yytext, into the appropriate value: float type if a
suffix of f or F is present, and double if the float suffix is not present. The resultant value should be stored
in the appropriate field of yylval: yylval.float_value or yylval.double_value. The parser and code
generator will need to use these values. When testing the scanner, you should use format specifiers %f or
%g for float or double values, respectively.
Some useful string functions (with sample actual parameters) for this string-to-float and string-to-double
translation are strtof(yytext, NULL) and strtod(yytext, NULL), respectively. Again, string functions such
as strchr(char*, char) are useful; for example, strchr(yytext, 'f') finds out whether there is an f
as the suffix. Read the literature to find out how these functions work.
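A corresponding sketch of the REAL_LITERAL action (again only one possible shape; {real_lit} is a placeholder for your Section 4.6 pattern, and isFloatLit() is the TO DO helper from the supplied driver code) is:

{real_lit}  {
    if (isFloatLit(yytext)) {                      /* f or F suffix present */
        literal_type = FLOAT_TYPE;
        yylval.float_value = strtof(yytext, NULL);
    } else {
        literal_type = DOUBLE_TYPE;
        yylval.double_value = strtod(yytext, NULL);
    }
    return REAL_LITERAL;
}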
Similarly, the actual string literal lexeme must be stored in yylval.string_value if a string literal
is recognized. Note that the double quotes surrounding the string in the lexeme must be removed before
storing it in yylval.string_value. Also, in C♭ the characters \n, \t, and " (the double quote character)
are allowed in strings. The double quote must be escaped, as \". To make processing strings easier, it is
best to follow some of the suggestions in the course literature resources and utilize a flex start condition (an
exclusive start state). For example, one could use the following directive in the definition section of the .l file: %x STRING. Read in your
textbook and literature how this can be utilized. Do not forget to read about the BEGIN macro to go back
to the original (INITIAL) state. Note that these escaped characters must remain escaped in the string stored
in yylval.string_value; that is, for example, \n should remain a two-character substring, \ and n. In
addition, this applies to echoing the string into the listing file.
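As a sketch only, assuming the %x STRING start condition mentioned above and the buf and s variables declared in the driver code below, the STRING rules could take roughly the following shape (length checking and error handling are omitted):

\"              { BEGIN STRING; s = buf; }            /* opening quote: enter the STRING state  */
<STRING>\\n     { *s++ = '\\'; *s++ = 'n'; }          /* keep \n as the two characters \ and n  */
<STRING>\\t     { *s++ = '\\'; *s++ = 't'; }          /* keep \t escaped as well                */
<STRING>\\\"    { *s++ = '\\'; *s++ = '"'; }          /* keep the escaped double quote          */
<STRING>\"      { *s = '\0';                          /* closing quote: back to INITIAL         */
                  BEGIN INITIAL;
                  yylval.string_value = _strdup(buf); /* quotes are not stored                  */
                  return STR_LITERAL; }
<STRING>.       { *s++ = yytext[0]; }                 /* any other character inside the string  */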
A useful string function for copying strings into destination string variables is: _strdup(yytext). Read
the literature to find out how this works.
To recognize identifiers and the reserved identifiers, your scanner must utilize a binary search on a reserved
word table. All reserved words are special identifiers. If the lexeme is found in the table, then return that reserved
word's token code. If it is not found, then return IDENTIFIER along with the lexeme, which is the name of the identifier. The
IDENTIFIER lexeme, temporarily stored in yytext, must be copied into yylval.ident_name. The parser
will need these names to store and look up in a symbol table. Again, consider using _strdup(yytext) to
accomplish copying strings. Note that for reserved words this is not necessary; the code number is sufficient.
Remember that you may have to adjust the indices used for the reserved word table to the codes used for the
token mnemonic names, since the codes for the mnemonic names start at 257 and the reserved
word table index starts at 0.
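A sketch of how the identifier rule's action could tie the table lookup to the token codes (one possible arrangement, reusing the pattern from Section 4.4 and the ReserveWordLookup() and TOKEN_INDEX_OFFSET pieces supplied below) is:

{letter}({letter}|{decimal_digit})*   {
    int idx = ReserveWordLookup(yytext);     /* binary search of reserveWordTable */
    if (idx >= 0)
        return idx + TOKEN_INDEX_OFFSET;     /* table index 0 -> code 257 (_BOOL_TOK), etc. */
    /* not reserved: ReserveWordLookup() already copied the lexeme into yylval.ident_name */
    return IDENTIFIER;                       /* code 296 */
}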
White space (blanks and tabs) should be ignored. Comments should be ignored. See the specifications for
syntax details about comments. Note, however, that comments must be copied verbatim into the listing file.
Remember, the listing file must look identical to the source, except that consecutive line numbers are added
at the beginning of each line.
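One hedged sketch of the whitespace and newline rules that keep the listing file in step with the source (using the lines counter and listingFile variables from the driver code below) follows; every other action must similarly echo its yytext into listingFile so that the listing remains a verbatim, numbered copy:

[ \t]+   { fprintf(listingFile, "%s", yytext); }     /* ignore for tokens, but echo to the listing */
\n       { fprintf(listingFile, "\n");               /* end the current listing line               */
           lines++;                                  /* count source lines                         */
           fprintf(listingFile, "%3d  ", lines); }   /* start the next listing line with its number */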
5.2 flex input (.l) file format
All of the regular expressions for the lexeme patterns are put into a .l file. For the discussion here a file
name, cfScanner.l, will be used.
Like the .y file, the .l file has three sections (delimited by %% characters):
declarations:
internal meta-level definitions of regular expressions
verbatim C/C++ code between %{ %} delimiters.
Inside the verbatim code, #include the y.tab.h file, which should have been
generated by yacc with option -d.
%%
(regEx action) pairs, where action is C/C++ code
executed when regEx, a flex regular expression, matches a pattern.
%%
utility and supporting functions
The scanner generator, flex, creates a lex.yy.c file from the contents of the .l file. This lex.yy.c file is then compiled.
The first section is where you put your main() driver program for testing. For this project, this is where
the scanner test code is placed. You also put your supporting function declarations (and definitions if
you prefer) and global variable and constant declarations there. In the first section, you can also place definitions
(outside the verbatim code) of named regular expressions, e.g., digit [0-9].
The second section is where you place the regular expressions that represent the patterns for the
lexemes, together with the actions taken when a pattern is recognized.
In the last section, you can put more supporting function definitions.
6 Implementation for Testing the Scanner
The scanner test should include reading several C♭ source code files. The output of the scanner test should
print, per line, the token mnemonic name and the corresponding lexeme (as a string or integer).
You also need string versions of the mnemonic names of the token codes. So, you need to
have an array of strings that holds the corresponding names shown above. For example, you can utilize a
declaration such as:
static char *mnemonic[ ] = {"", "_BOOL_TOK", "_COMPLEX_TOK", "_IMAGINARY_TOK",
    "AUTO", "BREAK", "CASE", "CHAR", "CONST", "CONTINUE",
    "DEFAULT", "DO", "DOUBLE", "ELSE", "ENUM", "EXTERN",
    "FLOAT", "FOR", "GOTO", "IF", "INLINE", "INT", "LONG", "PRINTF",
    "REGISTER", "RESTRICT", "RETURN", "SCANF",
    "SHORT", "SIGNED", "SIZEOF", "STATIC", "STRUCT", "SWITCH",
    "TYPEDEF", "UNION", "UNSIGNED", "VOID", "VOLATILE", "WHILE",
    "IDENTIFIER", "INT_LITERAL", "REAL_LITERAL", "STR_LITERAL",
    "LPAREN", "RPAREN", "LBRACKET", "RBRACKET", "LBRACE", "RBRACE",
    "SEMICOLON", "COMMA", "DOT", "PLUS", "MINUS", "STAR",
    "SLASH", "PERCENT", "AMPERSAND", "BAR", "CARET", "TILDE",
    "ARROW", "EXCLAIM", "DBL_PLUS", "DBL_MINUS", "DBL_LESS", "DBL_GTR",
    "QUESTION", "COLON", "TRIPLE_DOT", "DBL_AMP", "DBL_BAR", "DBL_EQL",
    "EXCL_EQ", "LESS", "LESS_EQ", "GTR", "GTR_EQ", "EQ",
    "PLUS_EQ", "MINUS_EQ", "STAR_EQ", "SLASH_EQ", "PERCENT_EQ", "AMP_EQ",
    "BAR_EQ", "CARET_EQ", "DBL_LESS_EQ", "DBL_GRT_EQ", "POUND", "DBL_POUND" };
Your tester (driver) code simulates a very simple parser. It must simply keep calling yylex() until it
returns 0, indicating the end of the source file. This code corresponds to an EOF_TOKEN, which is not explicitly
used in this project. For each token code, it must print the mnemonic string name (see the string array,
mnemonic[], above) and print the corresponding lexeme. Note that the lexemes for INT_LITERALs must
have been converted appropriately and printed as appropriate integers (e.g. int, unsigned, etc.). So, there
must be another global variable that the scanner uses to let the parser (test code) know which field of
yylval should be used and which conversion directive in the printf() format string should be used (%d,
%ld, etc.).
The general framework of the testing (driver) code should look like the following:
… // process arguments and open files
while( token = yylex() ) {
    if( token == IDENTIFIER ) {
        print the mnemonic string "IDENTIFIER" and the lexeme yylval.ident_name
    }
    else if (token == STR_LITERAL ) {
        print the mnemonic string "STR_LITERAL" and the lexeme yylval.string_value
    }
    else if (token == INT_LITERAL ) {
        print "INT_LITERAL";
        figure out what kind of INT_LITERAL (int, long, unsigned, unsigned long)
        and print the integer value appropriately
    }
    else if (token == REAL_LITERAL ) {
        print "REAL_LITERAL";
        figure out what kind of REAL_LITERAL (float, double)
        and print the float value or print the double value (%g) appropriately
    }
    else {
        print the token (e.g. reserved word, operator) mnemonic name and lexeme
    }
}
close file(s)
A full C implementation of the driver function, main(), along with some of the global declarations and
definitions for the above framework, is shown below. This goes in the first section (of three sections) of the
.l file.
The four subtoken categories of INT_LITERAL tokens, the two subtoken categories of REAL_LITERAL tokens,
and the one STRING_TYPE literal type name are represented by enum values of type Literal_Type_Names and are
stored in a variable called literal_type. The Literal_Type_Names type should be declared and defined in
the global verbatim-code portion of the first (definition) section of your .l file. The corresponding
values are communicated through the union members defined in the first section of the .y file and written
into y.tab.h by using the command yacc -d your.y.
You may have to change the library function(s) used for opening files in order to be compatible with
the version of the C compiler used. I am giving you supporting code for the changeFileNameExtension()
function. You will need to write some of the function definitions that have "TO DO:" in them. You may
need to write additional supporting functions/macros for use in the actions in the second section of
the .l file.
%{
#include <stdio.h>      /* printf, fprintf, fopen_s        */
#include <stdlib.h>     /* malloc, exit                    */
#include <string.h>     /* strlen, strchr, strncpy, strcat */
#include "y.tab.h"
enum Literal_Type_Names {INTEGER_TYPE, UNSIGNED_TYPE, LONG_TYPE, UNSIGNED_LONG_TYPE,
                         FLOAT_TYPE, DOUBLE_TYPE, STRING_TYPE } literal_type;
int yylex();
#define TOKEN_INDEX_OFFSET 257
int reservedWordIndex = 0;
static char *mnemonic[ ] = {"", "_BOOL_TOK", "_COMPLEX_TOK", "_IMAGINARY_TOK",
    "AUTO", "BREAK", "CASE", "CHAR", "CONST", "CONTINUE",
    "DEFAULT", "DO", "DOUBLE", "ELSE", "ENUM", "EXTERN",
    "FLOAT", "FOR", "GOTO", "IF", "INLINE", "INT", "LONG", "PRINTF",
    "REGISTER", "RESTRICT", "RETURN", "SCANF",
    "SHORT", "SIGNED", "SIZEOF", "STATIC", "STRUCT", "SWITCH",
    "TYPEDEF", "UNION", "UNSIGNED", "VOID", "VOLATILE", "WHILE",
    "IDENTIFIER", "INT_LITERAL", "REAL_LITERAL", "STR_LITERAL",
    "LPAREN", "RPAREN", "LBRACKET", "RBRACKET", "LBRACE", "RBRACE",
    "SEMICOLON", "COMMA", "DOT", "PLUS", "MINUS", "STAR",
    "SLASH", "PERCENT", "AMPERSAND", "BAR", "CARET", "TILDE",
    "ARROW", "EXCLAIM", "DBL_PLUS", "DBL_MINUS", "DBL_LESS", "DBL_GTR",
    "QUESTION", "COLON", "TRIPLE_DOT", "DBL_AMP", "DBL_BAR", "DBL_EQL",
    "EXCL_EQ", "LESS", "LESS_EQ", "GTR", "GTR_EQ", "EQ",
    "PLUS_EQ", "MINUS_EQ", "STAR_EQ", "SLASH_EQ", "PERCENT_EQ", "AMP_EQ",
    "BAR_EQ", "CARET_EQ", "DBL_LESS_EQ", "DBL_GRT_EQ", "POUND", "DBL_POUND" };
static int ReserveWordLookup(char *yytext);
#define NUM_OF_RESERVED_WORDS 39
static char *reserveWordTable[] =
    { "_Bool", "_Complex", "_Imaginary", "auto", "break",
      "case", "char", "const", "continue", "default",
      "do", "double", "else", "enum", "extern",
      "float", "for", "goto", "if", "inline",
      "int", "long", "printf", "register", "restrict",
      "return", "scanf", "short", "signed", "sizeof",
      "static", "struct", "switch", "typedef", "union",
      "unsigned", "void", "volatile", "while"
    };
static int ReserveWordLookup(char* yytext)
{
    int mid, first, last, cmp_test;
    first = 0;
    last = NUM_OF_RESERVED_WORDS - 1; //starting at index 0.
    while (first <= last) {
        mid = (first + last) / 2;
        cmp_test = strcmp(yytext, reserveWordTable[mid]);
        if (cmp_test < 0)
            last = mid - 1;
        else if (cmp_test > 0)
            first = mid + 1;
        else
            return mid;
    }
    // must be an identifier
    yylval.ident_name = _strdup(yytext);
    return -1; /* Reserved Word not Found */
}
typedef int Boolean;

int isUnsignedLit(char *text){
    return ( strchr( text, 'u') || strchr( text, 'U') );
}

int isLongLit(char *text){
    //TO DO:
}

int isFloatLit(char *text){
    //TO DO:
}
//For processing string literals
#define MAX_STR_LIT_LEN 200
char buf[MAX_STR_LIT_LEN];
char *s;

//Utility for creating filenames based upon the source code name.
char *changeFileNameExtension(const char *src, const char* extName)
{
    unsigned targetBodyNameLen = 0;
    unsigned extNameLen = strlen(extName);
    char *targetName = "";
    char * pch = strchr(src, '.'); //find first occurrence of . for extension name
    if( pch != NULL){
        targetBodyNameLen = pch - src;
    }
    else {
        targetBodyNameLen = strlen(src) + 1;
    }
    targetName = (char *)malloc(targetBodyNameLen + extNameLen + 4);
    targetName[targetBodyNameLen] = '\0';
    strncpy(targetName, src, targetBodyNameLen);
    targetName = strcat(targetName, ".");
    targetName = strcat(targetName, extName);
    return targetName;
}
FILE *listingFile;
int lines;
main(int argc, char **argv)
{
    int token;
    char *infile;
    char *listing;
    errno_t fileErrorCode;
    extern FILE *yyin;
    extern FILE *listingFile;
    extern int lines;
    if (argc > 1) {
        infile = argv[1];
        if( (fileErrorCode = fopen_s(&yyin, infile, "r")) != 0 ) {
            fprintf(stderr, "Source file, %s, was not opened. Error code = %d\n",
                    infile, fileErrorCode);
            exit( fileErrorCode );
        }
        // if ((yyin = fopen(argv[1], "r")) == 0) {
        //     fprintf(stderr, "Error opening source file: %s", argv[1]);
        //     exit(1);
        // }
    }
    else {
        infile = _strdup("yyin"); //for interactive testing
        yyin = stdin;
    }
    listing = changeFileNameExtension(infile, "txt");
    //when infile is "stdin", this acts like a log of interactions
    if( (fileErrorCode = fopen_s(&listingFile, listing, "w")) != 0 ) {
        fprintf(stderr, "Listing file, %s, was not opened. Error code = %d\n", listing, fileErrorCode);
        exit( 1 );
    }
    // if ((listingFile = fopen(listing, "w")) == 0) {
    //     fprintf(stderr, "Listing file, %s, was not opened.\n", listing);
    //     exit(1);
    // }
    lines = 1;
    while( token = yylex() ) {
        if( token == IDENTIFIER ) {
            printf("%s %s\n", mnemonic[token - TOKEN_INDEX_OFFSET + 1], yylval.ident_name);
            /* Move index up one since mnemonic[0] is not used */
        }
        else if (token == STR_LITERAL ) {
            printf("%s %s\n", mnemonic[token - TOKEN_INDEX_OFFSET + 1], yylval.string_value);
        }
        else if (token == INT_LITERAL ){
            printf("INT_LITERAL ");
            if( literal_type == UNSIGNED_LONG_TYPE )
                printf("%s %lu\n", yytext, yylval.ul_int_value );
            else if ( literal_type == UNSIGNED_TYPE )
                printf("%s %u\n", yytext, yylval.u_int_value );
            else if( literal_type == LONG_TYPE )
                printf("%s %ld\n", yytext, yylval.l_int_value );
            else
                printf("%s %d\n", yytext, yylval.int_value);
        }
        else if (token == REAL_LITERAL) {
            printf("REAL_LITERAL ");
            if( literal_type == FLOAT_TYPE) {
                printf("%s %g (float_type)\n", yytext, yylval.float_value);
            }
            else {
                printf("%s %g (double_type)\n", yytext, yylval.double_value);
            }
        }
        else {
            printf("%s %s\n", mnemonic[token - TOKEN_INDEX_OFFSET + 1], yytext);
        }
    }
    fprintf(listingFile, "\n\nCompilation Complete. No errors reported.\n");
    fprintf(listingFile, "Source Program = %4d\n", lines);
    fclose(yyin);
    fclose(listingFile);
}
%}
7 Examples
Example 1:
A sample C♭ source program, in a file named sample.cf, is shown below.
#define PRIME 211
#define EOS 0
//Sample Code
int hash(char *s)
{
    /* This is a multi-line
       comment.
    */
    char *p;
    unsigned h = 0, g;
    for( p = s; *p != EOS; p++; ){
        h = (h << 4) + *p;
        if( g = h & 0xf0000000 ){
            h = h ^ (g >> 24);
            h = h ^ g;
        }
    }
    return h % PRIME;
}
The listing file, sample.txt, should look like:
  1  #define PRIME 211
  2  #define EOS 0
  3  //Sample Code
  4  int hash(char *s)
  5  {
  6      /* This is a multi-line
  7         comment.
  8      */
  9      char *p;
 10      unsigned h = 0, g;
 11      for( p = s; *p != EOS; p++; ){
 12          h = (h << 4) + *p;
 13          if( g = h & 0xf0000000 ){
 14              h = h ^ (g >> 24);
 15              h = h ^ g;
 16          }
 17      }
 18      return h % PRIME;
 19  }

Compilation Complete. No errors reported.
Source Program =   19
The standard output for running (testing) the scanner on the above source code is shown below. Note that the
comments are ignored:
C> …>cfScannerJeffrey sample.cf
POUND          #
IDENTIFIER     define
IDENTIFIER     PRIME
INT_LITERAL    211 211
POUND          #
IDENTIFIER     define
IDENTIFIER     EOS
INT_LITERAL    0 0
INT            int
IDENTIFIER     hash
LPAREN         (
CHAR           char
STAR           *
IDENTIFIER     s
RPAREN         )
LBRACE         {
CHAR           char
STAR           *
IDENTIFIER     p
SEMICOLON      ;
UNSIGNED       unsigned
IDENTIFIER     h
EQ             =
INT_LITERAL    0 0
COMMA          ,
IDENTIFIER     g
SEMICOLON      ;
FOR            for
LPAREN         (
IDENTIFIER     p
EQ             =
IDENTIFIER     s
SEMICOLON      ;
STAR           *
IDENTIFIER     p
EXCL_EQ        !=
IDENTIFIER     EOS
SEMICOLON      ;
IDENTIFIER     p
DBL_PLUS       ++
SEMICOLON      ;
RPAREN         )
LBRACE         {
IDENTIFIER     h
EQ             =
LPAREN         (
IDENTIFIER     h
DBL_LESS       <<
INT_LITERAL    24 24
RPAREN         )
SEMICOLON      ;
IDENTIFIER     h
EQ             =
IDENTIFIER     h
CARET          ^
IDENTIFIER     g
SEMICOLON      ;
RBRACE         }
RBRACE         }
RETURN         return
IDENTIFIER     h
PERCENT        %
IDENTIFIER     PRIME
SEMICOLON      ;
RBRACE         }
End of Example 1.
Example 2 (All Tokens)
This example shows all token types. It also shows numeric literals along with the lexemes recognized and the semantic
values. For the REAL_LITERAL token type, the float type values are printed with a %f format specifier and
the double type values with a %g format specifier.
A sample C♭ source program, in a file named allTokens.cf, is shown below.
_Bool
_Complex
_Imaginary
auto
break
continue
default
do
double
else
enum
extern
float
for
goto
if
inline
int
long
printf
register
restrict
return
scanf
short
signed
sizeof
static
struct
switch
typedef
union
unsigned
void
volatile
case
char
while
x
abc
123 123l 123L
const
28UL 28LU 28ul 28lu 28uL 28ul
0X1C 0x1c 0X1c 0X1Cul
034 034L 034U 034UL
0 0L 0U 0UL 0LU
1.2
123.456 .123456789
1.2e3 1.2e-12
1.23e+12
1.2f 123.456f .123456f
1.2e3f 1.2e-12f 1.23e+12f
"This is a string"
"This is a string with newline \n"
"This is a string with tab \t"
"This is a string with quote escape \" "
"This is a string with all three \n \t \" and then more chars\n"
( )
[
] { }
;
,
.
+

*
/
%
&
|
^
~
->
++ --
>
? :
… && || ==
!
!=
< >=
=
+= -=
*=
/=
=
#
##
%:
%:%:
|= ^=
%=
&=
//This is a comment.
int x, y, z;
/* Mult-line
Comment is here */
z = 3;
/* Another multi-line comment
*/
/* Another multi-line comment on one line. */
scanf(“%d”, &x);
//This is also a comment.
y = x + 4*z;
//Does this comment cause problems being at the end?
The listing of allTokens.cf is shown below:
1
_Bool
_Complex
2
auto
break
_Imaginary
case
char
const
3
continue
default
do
double
else
4
enum
extern
float
for
goto
5
if
inline
int
long
printf
6
register
restrict
return
scanf
short
7
signed
sizeof
static
struct
switch
8
typedef
union
unsigned
void
volatile
9
while
10
x
11
abc
12
123 123l 123L
13
28UL 28LU 28ul 28lu 28uL 28ul
14
0X1C 0x1c 0X1c 0X1Cul
15
034 034L 034U 034UL
16
0 0L 0U 0UL 0LU
17
1.2
18
1.2e3 1.2e-12
19
1.2f 123.456f .123456f
20
1.2e3f 1.2e-12f 1.23e+12f
123.456 .123456789
1.23e+12
21
22
“This is a string”
23
“This is a string with newline \n”
24
“This is a string with tab \t”
25
“This is a string with quote escape \” ”
26
“This is a string with all three \n \t \” and then more chars\n”
27
( )
28
;
,
29
+

*
/
%
30
&
|
^
~
->
31
++ —
>
? :
32
… && || ==
33
< >=
34
=
+= -=
*=
/=
35
|= ^=
>>=
#
##
36
%:
%:%:
37
//This is a comment.
38
int x, y, z;
39
40
[
] { }
.
cfScannerJeffrey allTokens.cf
_BOOL_TOK
_Bool
_COMPLEX_TOK
_Complex
_IMAGINARY_TOK
AUTO
auto
BREAK
break
CASE
case
CHAR
char
CONST
const
_Imaginary
CONTINUE
continue
DEFAULT
default
DO
do
DOUBLE
double
ELSE
else
ENUM
enum
EXTERN
extern
FLOAT
float
FOR
for
GOTO
goto
IF
if
INLINE
inline
INT
int
LONG
long
PRINTF
printf
REGISTER
register
RESTRICT
restrict
RETURN
return
SCANF
scanf
SHORT
short
SIGNED
signed
SIZEOF
sizeof
STATIC
static
STRUCT
struct
SWITCH
switch
TYPEDEF
typedef
UNION
union
UNSIGNED
VOID
unsigned
void
VOLATILE
WHILE
volatile
while
IDENTIFIER
x
IDENTIFIER
abc
INT_LITERAL
123 123
INT_LITERAL
123l 123
INT_LITERAL
123L 123
INT_LITERAL
28UL 28
INT_LITERAL
28LU 28
INT_LITERAL
28ul 28
INT_LITERAL
28lu 28
INT_LITERAL
28uL 28
INT_LITERAL
28ul 28
INT_LITERAL
0X1C 28
INT_LITERAL
0x1c 28
INT_LITERAL
0X1c 28
INT_LITERAL
0X1Cul 28
INT_LITERAL
034 28
INT_LITERAL
034L 28
INT_LITERAL
034U 28
INT_LITERAL
034UL 28
INT_LITERAL
0 0
INT_LITERAL
0L 0
INT_LITERAL
0U 0
INT_LITERAL
0UL 0
INT_LITERAL
0LU 0
REAL_LITERAL
1.2 1.2 (double_type)
REAL_LITERAL
123.456 123.456 (double_type)
REAL_LITERAL
.123456789 0.123457 (double_type)
REAL_LITERAL
1.2e3 1200 (double_type)
REAL_LITERAL
1.2e-12 1.2e-12 (double_type)
REAL_LITERAL
1.23e+12 1.23e+12 (double_type)
REAL_LITERAL
1.2f 1.2
REAL_LITERAL
123.456f 123.456
(float_type)
REAL_LITERAL
.123456f 0.123456
(float_type)
REAL_LITERAL
1.2e3f 1200
REAL_LITERAL
1.2e-12f 1.2e-12
REAL_LITERAL
1.23e+12f 1.23e+12
(float_type)
(float_type)
(float_type)
(float_type)
STR_LITERAL
This is a string
STR_LITERAL
This is a string with newline \n
STR_LITERAL
This is a string with tab \t
STR_LITERAL
This is a string with quote escape \"
STR_LITERAL
This is a string with all three \n \t \" and then more chars\n
LPAREN
(
RPAREN
)
LBRACKET
[
RBRACKET
]
LBRACE
{
RBRACE
}
SEMICOLON
COMMA
;
,
DOT
.
PLUS
+
MINUS

STAR
*
SLASH
/
PERCENT
%
AMPERSAND
BAR
&
|
CARET
^
TILDE
~
ARROW
->
EXCLAIM
!
DBL_PLUS
++
DBL_MINUS

DBL_LESS
>
QUESTION
?
COLON
:
TRIPLE_DOT

DBL_AMP
&&
DBL_BAR
||
DBL_EQL
==
EXCL_EQ
!=
LESS
< LESS_EQ GTR GTR_EQ EQ >=
=
PLUS_EQ
+=
MINUS_EQ
-=
STAR_EQ
*=
SLASH_EQ
/=
PERCENT_EQ
AMP_EQ
&=
BAR_EQ
|=
%=
CARET_EQ
^=
DBL_LESS_EQ
=
POUND
#
DBL_POUND
##
LBRACKET
LBRACE
POUND
%:
DBL_POUND
INT
%:%:
int
IDENTIFIER
COMMA
,
IDENTIFIER
COMMA
x
y
,
IDENTIFIER
z
SEMICOLON
;
IDENTIFIER
z
EQ
=
INT_LITERAL
SEMICOLON
3 3
;
SCANF
scanf
LPAREN
(
STR_LITERAL
COMMA
%d
,
AMPERSAND
&
IDENTIFIER
x
RPAREN
)
SEMICOLON
;
IDENTIFIER
y
EQ
=
IDENTIFIER
PLUS
x
+
INT_LITERAL
STAR
4 4
*
IDENTIFIER
z
SEMICOLON
;
End of Example 2.
Project 4: Translator Project (Part B) Using Flex and Yacc
for the C♭ (C flat) Language
Parser Specifications
1 Overview of Project 4: C flat Parser
This document is a continuation of the specifications for a translator project that scans and parses a subset
of C, called C-flat, notated as C♭ . The translator is to be constructed with the compiler tools: flex and
yacc. Use the structure of Part A (scanner) for the flex and yacc input files as a guide. They will need to
be modified for this project.
You must replace the driver used to test your scanner; it is no longer needed. A sample driver for the
parser is shown in the body of this document. You should make minor changes, where appropriate, to the
rest of your .l file. If you have a working scanner (the regular expressions and actions work), then the second
section of the .l file will need no changes or, perhaps, only a few modifications.
Most of your work will be adding production rules to the second of three sections in your .y file. There
will be some additional code and yacc directives added to the first section of your .y file. Depending on how
you structure your verbatim C code, there may also be additional code needed in the third section of your
.y file.
2 What, When and Where to turn in
- What to turn in: You must zip all files for your parser project. You must include your files:
.l, .y, lex.yy.c, y.tab.h, y.tab.c, all *.obj files, and, of course, cfParserLastName.exe. Also,
include your test source files containing C♭ source code, their respective listings, and the respective .txt
files containing redirected standard output of tokens. Name your .zip file: cfParserLastName.zip.
- When and Where to turn in: Before the due date and time shown on the assessment link under the
Project Folder on Blackboard.
3 File and Command-Line Specifications
This section specifies the usage of the translator and the files needed by the user of the translator, but does not
indicate the files needed for constructing the translator. First, the required user files and their descriptions
are provided. Then, the command-line syntax and semantics are presented.
3.1 Files Used and Descriptions
Below are descriptions of all the files that will ultimately be used by the fully-completed version of the
translator. Since the translator will be developed incrementally the code listing file will not be used by the
scanner and parser. However, the files containing the source and source listing must be handled by both the
scanner and parser. The handling of the code listing file is to be added during the development of the code
generator.
The source listing file is always produced. After opening the source and listing files, the parser should be
invoked. (The parser calls yylex(); do not call yylex() directly.) The source listing file name should
be the source filename with .cf replaced by .txt. To generalize, no matter what the suffix extension of
the source file, that extension should be replaced with .txt and, if there is no extension, .txt should be
appended.
1. Input Files:
(a) Source File: A text file containing a program written in C♭ source code. A user of C♭ supplies
the name of this file on a command-line. The file should have an extension of .cf. The C♭
translator has the responsibility of opening, scanning, parsing, and closing this file.
2. Output Files:
(a) Source Listing: This file is always produced and overwritten if it already exists. The contents
are a copy of the source file except with line numbers starting in column 1. If there are any
scanner errors, you must indicate where the error occurs. At the end of the source listing you
must print, on separate lines, the following:
Compilation Complete. ### errors reported.
Source Program = ### lines.
The ### is replaced with the actual number, right-justified. For syntax or lexical errors, print the
error message on a new line right below the source line where the error occurs. Note: for this part
of the project you are not expected to implement error recovery and should stop parsing as soon
as an error is discovered. Hence, the number representing errors reported will be either 0 or 1.
In the case of an error, print in the listing file, before the above summary, the following:
Parsing failed. Error on line: ### at lexeme.
For emphasis: stop after the first error found.
(b) Standard Output: Besides creating the source listing file as described above, it is convenient to
print the same last two lines (Compilation Complete. …) or the error message for syntax or
lexical errors to standard output so that you do not have to open up the listing file to see the
results, especially during testing.
3.2 Command-Line Syntax
The command line syntax is as follows:
cfParserLastName.exe sourceFileName.cf
Of course, sourceFileName.cf is just a sample name; use the legal naming conventions of Windows and
use .cf at the end. You can use Visual Studio to edit your test source code files. If you use Notepad, save as "all
files" with a .cf suffix. For each run, the listing file must be created and given the name sourceFileName.txt.
When you open the source file, connect yyin to it. (See the sample driver below.)
If no sourceFileName.cf argument is given, use standard input as the input file. You will set yyin to
stdin. In this case use yyin.txt as the listing file name. You can type your code directly from the keyboard
in this case. On Windows, you end the source input with a control-Z.
A sample parser driver (main) program is shown in the next section. This main program is only a
framework. Your main program must handle all arguments on the command line and open all appropriate
files as specified above and in Part A (cf scanner) project specifications.
4 Parser Driver
This code can be placed in the .l file; it replaces the driver used for testing your scanner. You can
eliminate the array of mnemonic name strings for the internal token names. Keep the other %token directives and the %union
directive. You will add to them for the purpose of indicating associativity and precedence of operators. Look
at your Lex/Yacc book for how to use %left and %right, and how to order your operators to indicate
precedence. Note that the #pragma warning(disable : 4996) is for allowing the legacy functions, which
are used in the code generated by flex.
Here is the code for the parser driver along with additional declarations and definitions of utility functions,
variables, and constants. You may need to change some of the file and string functions to names appropriate
for the dialect of the C compiler and libraries.
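As an illustration of the %left/%right idea only (a sketch; the exact set and ordering you need follows from the C♭ grammar and your book), precedence declarations in the first section of the .y file could look like:

/* lowest precedence first, highest last */
%right EQ PLUS_EQ MINUS_EQ STAR_EQ SLASH_EQ     /* assignment operators */
%left  DBL_BAR                                  /* ||                   */
%left  DBL_AMP                                  /* &&                   */
%left  DBL_EQ EXCL_EQ                           /* ==  !=               */
%left  LESS LESS_EQ GTR GTR_EQ                  /* <  <=  >  >=         */
%left  PLUS MINUS                               /* binary +  -          */
%left  STAR SLASH PERCENT                       /* *  /  %              */
%right EXCLAIM DBL_PLUS DBL_MINUS               /* unary  !  ++  --     */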
//For processing string literals
#pragma warning(disable : 4996)
#define MAX_STR_LIT_LEN 200
char buf[MAX_STR_LIT_LEN];
char *s;

//Utility for creating filenames based upon the source code name.
char *changeFileNameExtension(const char *src, const char* extName)
{
    unsigned targetBodyNameLen = 0;
    unsigned extNameLen = strlen(extName);
    char *targetName = "";
    char * pch = strchr(src, '.'); //find first occurrence of . for extension name
    if( pch != NULL){
        targetBodyNameLen = pch - src;
    }
    else {
        targetBodyNameLen = strlen(src) + 1;
    }
    targetName = (char *)malloc(targetBodyNameLen + extNameLen + 4);
    targetName[targetBodyNameLen] = '\0';
    strncpy(targetName, src, targetBodyNameLen);
    targetName = strcat(targetName, ".");
    targetName = strcat(targetName, extName);
    return targetName;
}
FILE *listingFile;
int lines;
main(int argc, char **argv)
{
    int token;
    char *infile;
    char *listing;
    errno_t fileErrorCode;
    extern FILE *yyin;
    extern FILE *listingFile;
    extern int lines;
    numerrors = 1;
    if (argc > 1) {
        infile = _strdup(argv[1]);
        printf("Infile is %s\n", infile);
        if( (fileErrorCode = fopen_s(&yyin, infile, "r")) != 0 ) {
            fprintf(stderr, "Source file, %s, was not opened. Error code = %d\n",
                    infile, fileErrorCode);
            exit( fileErrorCode );
        }
        // if ((yyin = fopen(argv[1], "r")) == 0) {
        //     fprintf(stderr, "Error opening source file: %s", argv[1]);
        //     exit(1);
        // }
    }
    else {
        infile = _strdup("yyin");
        yyin = stdin;
    }
    listing = changeFileNameExtension(infile, "txt");
    if( (fileErrorCode = fopen_s(&listingFile, listing, "w")) != 0 ) {
        fprintf(stderr, "Listing file, %s, was not opened. Error code = %d\n",
                listing, fileErrorCode);
        exit( 1 );
    }
    // if ((listingFile = fopen(listing, "w")) == 0) {
    //     fprintf(stderr, "Listing file, %s, was not opened.\n", listing);
    //     exit(1);
    // }
    lines = 1;
    fprintf(listingFile, "%3d  ", lines);
    if( !yyparse() )
    {
        printf("\nCompilation Complete. 0 errors reported.\n");
        printf("Source Program = %4d lines.\n", lines);
        fprintf(listingFile, "\nCompilation Complete. 0 errors reported.\n");
        fprintf(listingFile, "Source Program = %4d lines.\n", lines);
        fclose(yyin);
        fclose(listingFile);
        return 0;
    }
    else
    {
        printf("\nParsing failed. Error on line: %i at %s\n", lines, yytext);
        fprintf(listingFile, "\nParsing failed. Error on line: %i at %s\n", lines, yytext);
        fclose(listingFile);
        return 2; //arbitrary non-zero error number
    }
}
5 C flat Token Specification
With the exception of scanf and printf as reserved words, the tokens for C♭ are the same as those of C89. In Part
A of this project (Project 3), a full scanner for C89 was required to be written. Even though only a subset
of the C89 tokens is utilized in C♭, use the full C♭ (C89) scanner from Part A. This allows you to make the
fewest revisions to your scanner. I strongly suggest that you use the token names from your scanner in the
production rules supplied to Yacc. If you use single-character tokens in the Yacc syntax for the production
rules, you will have to make some minor changes to your scanner from Part A. For example, consider the
if statement. You can represent tokens with single-character lexemes in the right-hand side of a production
rule in one of two ways. For example:
if LPAREN expression RPAREN statement
or as
if '(' expression ')' statement
In the second approach, yacc uses the single character itself as the token code, and there is no need for your scanner to return
a token LPAREN for a left parenthesis or RPAREN for a right parenthesis. I think it's best that you use the first
approach.
In whatever approach you take, remember to use the -d option for yacc so that y.tab.h is generated.
6 A Context-Free Grammar for C flat
The EBNF meta-syntax that is used below for the presentation of C♭ grammar uses the following conventions.
The nonterminals are all in italics. The ::= or -> operators that are typically used in BNF or EBNF are
replaced with the colon character, :. The right-hand-sides are indented and all right-hand-side alternatives
are written on separate lines. The non-italicized words are terminals. Moreover, the terminals are shown in
a typewriter-like font, such as INT_LITERAL. The subscript of opt indicates that the preceding non-terminal
is optional. Iteration of a non-terminal or terminal construct is expressed through recursive rules, usually
left-recursive, which are better than right-recursive for LALR(1) parser generators like Yacc.
Production rule syntax shown below needs to be modified for yacc and flex to properly build an LALR(1)
table-driven parser and table-driven DFA scanner for C♭ , respectively. This grammar must be modified so
that yacc will not give shift-reduce or reduce-reduce conflict errors. See your flex/yacc textbook and the
tutorial on Blackboard for approaches to resolve these conflicts and how to change the grammar. This is the
core task of this project. See the Hints section below for more about this topic.
Again, remember that C♭ is not a subset of C89 since scanf and printf statements are not reserved
words in C syntax. Many other semantic issues also show that C♭ is not a subset of C89, for example,
parameter passing semantics differ.
Some of the sentences generated by this grammar will be shown to be illegal semantically. For example,
the substring shown below is syntactically correct, but will have illegal semantics.

int x;
x = "abc";

In C♭, we will not be able to assign string literals to variables since there is no string data type. Several
other examples can be cited. Once this parser is written and this project is completed, compile-time
semantic errors will be considered in future projects.
HINTS: These productions are written to help you eliminate some reduce-reduce and shift-reduce
errors. Again, as mentioned above, this grammar must be modified so that yacc will not give shift-reduce
or reduce-reduce conflict errors.
Remember to convert productions with the opt subscript (optional constructs) into two productions appropriately.
Remember that left-recursion is good and right-recursion is bad for bottom-up parsers, such as the LALR(1)
parsers generated by yacc.
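For example (a sketch of the transcription only, not the whole grammar), the declaration production with its opt construct, and the left-recursive declaration-list, could be written in the .y file as:

declaration
    : declaration_specifiers SEMICOLON                        /* init-declarator-list omitted */
    | declaration_specifiers init_declarator_list SEMICOLON   /* init-declarator-list present */
    ;

declaration_list
    : declaration                       /* one declaration                */
    | declaration_list declaration      /* left recursion for repetition  */
    ;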
Remember to give precedence and associativity to operators. You may leave all the tokens in your
scanner; their presence will not cause problems for the parser. Do not add any terminals other than what is
given.
Remember to convert all the dashes (-) into underscores (_) for the nonterminals for all productions
inside your .y file.
You will probably have shift-reduce errors or reduce-reduce errors as indicated by yacc. To see more
information about which productions are giving you problems, add the -v option (which the book calls a flag) to
the yacc command line. This option has yacc produce a file called y.output. (You still need the -d option, so
the full option list for yacc should be -dv.) If you get these errors, you must revise the productions
to fix the shift-reduce and reduce-reduce errors, but do not revise the C♭ language from the one generated
by the productions given below.
C♭ Grammar:
C♭-translation-unit:
    external-decl-defn
    C♭-translation-unit external-decl-defn

external-decl-defn:
    function-decl-defn
    declaration

function-decl-defn:
    declarator compound-statement
    declaration-specifiers declarator compound-statement

declaration-list:
    declaration
    declaration-list declaration

declaration:
    declaration-specifiers init-declarator-listopt ;

declaration-specifiers:
    storage-class-specifier declaration-specifiersopt
    type-specifier declaration-specifiersopt
    type-qualifier declaration-specifiersopt

storage-class-specifier:
    auto
    static
    extern

type-specifier:
    int
    unsigned
    long
    float
    double

type-qualifier:
    const

init-declarator-list:
    init-declarator
    init-declarator-list , init-declarator

init-declarator:
    declarator
    declarator initializer

initializer:
    = expression

declarator:
    IDENTIFIER
    declarator ( )
    declarator ( parameter-type-list )
    declarator ( formal-parm-list )
    declarator [ const-expression ]
    declarator [ ]

formal-parm-list:
    formal-parameter
    formal-parm-list , formal-parameter

formal-parameter:
    type-specifier-list IDENTIFIER

parameter-type-list:
    type-specifier-list
    parameter-type-list , type-specifier-list

type-specifier-list:
    type-specifier
    type-specifier-list type-specifier

identifier-list:
    IDENTIFIER
    identifier-list , IDENTIFIER

statement-list:
    statement
    statement-list statement

compound-statement:
    { declaration-listopt statement-listopt }

statement:
    compound-statement
    if ( expression ) statement
    if ( expression ) statement else statement
    while ( expression ) statement
    return ;
    return expression ;
    expression ;
    ;

expression-list:
    expression
    expression-list , expression

const-expression:
    conditional-expr

expression:
    conditional-expr
    unary-expr assignment-op expression

assignment-op:
    =
    +=
    -=
    *=
    /=

conditional-expr:
    logical-or-expr

logical-or-expr:
    logical-and-expr
    logical-or-expr || logical-and-expr

logical-and-expr:
    equality-expr
    logical-and-expr && equality-expr

equality-expr:
    relational-expr
    equality-expr == relational-expr
    equality-expr != relational-expr

relational-expr:
    add-expr
    relational-expr < add-expr
    relational-expr > add-expr
    relational-expr <= add-expr
    relational-expr >= add-expr

add-expr:
    mult-expr
    add-expr + mult-expr
    add-expr - mult-expr

mult-expr:
    unary-expr
    mult-expr * unary-expr
    mult-expr / unary-expr
    mult-expr % unary-expr

unary-expr:
    postfix-expr
    ++ unary-expr
    -- unary-expr
    - unary-expr
    + unary-expr
    ! unary-expr

postfix-expr:
    primary-expr
    primary-expr [ expression ]
    scanf ( identifier-list )
    printf ( expression-list )
    primary-expr ( expression-listopt )
    postfix-expr ++
    postfix-expr --

primary-expr:
    IDENTIFIER
    REAL_LITERAL
    INT_LITERAL
    STR_LITERAL
    ( expression )
7 Some Examples
Test03.cf:
unsigned int f(int);
float add(int x, double y) {
return x + y;
}
int main()
{
int x;
int y;
printf(x + y, add(x, y));
return 0;
}
Test03.txt:
  1  unsigned int f(int);
  2  float add(int x, double y) {
  3      return x + y;
  4  }
  5
  6  int main()
  7  {
  8      int x;
  9      int y;
 10      printf(x + y, add(x, y));
 11      return 0;
 12  }

Compilation Complete. 0 errors reported.
Source Program =   12 lines.
Test03err.cf:
unsigned int f(int);
float add(int x, double y) {
return x + y;
}
int main()
{
int x;
int y;
printf(x + y, add(x, y);
return 0;
}
Test03err.txt:
  1  unsigned int f(int);
  2  float add(int x, double y) {
  3      return x + y;
  4  }
  5
  6  int main()
  7  {
  8      int x;
  9      int y;
 10      printf(x + y, add(x, y);
Parsing failed. Error on line: 10 at ;
Test08.cf:
unsigned int f(int);
float add(int x, double y) {
return x + y;
}
int main()
{
int max = 200;
int a[200];
int x;
int y;
unsigned i = 0;
scanf( x, y);
while( i < MAX ) {
    a[i++] = 0;
}
i = 0;
while( i < MAX ) {
    a[i] = x + y;
    scanf( x, y);
    i++;
};
if( i > 0 ) {
printf( a[i]/ i );
}
else {
printf(“i
