CSCI 330: Lab 4 (spring 2024)

CSCI 330 Lab 4: grammars, lex, yacc

For lab 4 you'll be editing and submitting lex and yacc files used to build a simple parser, lab4, for a language to be described below. The lex/yacc files are compiled through a makefile using lex and yacc on the cubs/pups (details below).

As usual, you'll be forking, cloning, editing, and submitting a git repository, following the same basic command sequence except it's named lab4 this time.

The starting lab4.lex file (and hence a current build of lab4) recognizes most, but not all, of the needed tokens, and the starting lab4.yacc file (and hence a current build of lab4) recognizes a subset of the needed grammar.

You'll be expanding the set of token regular expressions in lab4.lex and the grammar rules in lab4.yacc so that a new build of lab4 can parse the full language.

For details on lex and yacc themselves, see the notes for the week of Mar. 11-15 in the lecture resources.

Objectives

Lab 4 is meant to give you practice with regular expressions, context free grammars, and their use in syntax checking source code using tools like lex/flex and yacc/bison.

You will be using lex regular expressions to describe the tokens in a language, and yacc CFG rules to describe the syntax of the language.

A makefile is provided to allow you to compile your lex and yacc files into a tool that can be used to check for syntax errors in user source code, e.g.

make
./lab4 < filename

The lab4 executable processes the specified file content, and displays error messages if it detects any syntax errors, otherwise it displays a final "Parsing complete." message. The error messages, if any, are accompanied by a row and column number indicating where in the source file the parser recognized it was in an error state, and an indication of what statement type it last entered and where (to the best of its knowledge).

The language to be parsed

The starting lab4.lex and lab4.yacc files have been provided for you, that handle most of the tokens and grammar rules described below.

Informally, the language is a little bit lispy, but using square brackets instead of round, and having the operator/function precede the opening brackets, e.g. foo[x y z] to call function foo on arguments x, y, and z.

For the sake of lab4 it is really only necessary to know the token types (see tokens.txt in your lab 4 repo) and the grammar rules (see grammar.txt in your lab 4 repo), but for future reference in lab 5 and a general understanding of the target language, the language details are provided here.

A program is made up of one or more statements: function definitions, variable definitions, array definitions, function calls, if statements, print statements, and read statements.
It is a typed language, with variables, arrays, and functions each having a defined type (representing the return type for functions), and compatible types must always be passed/assigned.
(*It will be beyond the capabilities of our lab4 parser to check type compatibility.)
(*Note that a language problem, to be discussed in lab 5, is the lack of type specifiers for the parameters in the definition of functions.)
Three data types are supported by the language: INTEGER (e.g. 930), REAL (e.g. 14.123), and STRING (e.g. "blah blah blah").
Note there is no boolean type: the value 0 is treated as logical false, all other values are regarded as logically true.
All statements have a value:
- variables to their currently stored value
- function calls to their returned value
- built in operators to their evaluated result (0 or 1 for comparison ops, the value assigned for the assignment op)
- if statements to the result of either their true or false case (whichever applies based on the current condition)
- the initial assigned value for variable definitions
- 0 for other statement types (print, read, function/array definitions)
Functions, variables, and arrays must be declared before they are used.
(*It will be beyond the capabilities of our lab4 parser to check this.)
Identifiers (variable or function names) must be strictly alphabetic, can be upper or lower case (or a mix), but cannot be the same as any of the language keywords (IF, READ, PRINT, VAR, FUNCTION, ARRAY, INTEGER, STRING, REAL).
Function calls consist of the function name (or built in operator) followed by the arguments in square brackets, e.g. foo[x y z]. The arguments can be function calls themselves, with no limit on the nesting e.g. foo[ +[x 1] *[a b] +[i [* j k]]]
A number of built in operators are supported:
math ops * + / -
logical ops & | ! (and, or, not, respectively)
comparison ops = < (e.g. =[x y] tests if x equals y)
assignment op := (e.g. :=[x y] assigns y to x)
Thus assigning 3 * x to variable y might look like :=[y *[3 x]]
Variables must be declared using VAR, with the declaration specifying the variable name, data type, and initial value (which can be an expression), e.g. VAR[x INTEGER +[y z]]
Variables declared inside a function are local to that function, variables are global if their declaration is not inside any function.
(*It will be beyond the capabilities of our lab4 parser to identify scope.)
The built-in print statement accepts one or more arguments, each of which can be literal values, variables, or function calls, e.g. PRINT["blah blah" x *[a b]]
The built-in read statement accepts one or more variable names, and will read a value into each, e.g. READ[x y z]
Arrays can be defined using the ARRAY statement, specifying the array name, data type, and a list of initial values for the array elements, e.g. ARRAY[mydata INTEGER 23 4 *[x y]]
As with variables, arrays are local to the function in which they are declared, and are global if their declaration is not inside any function.
The if statement consists of three arguments: the condition to be checked, a statement for the true case of the condition, and a statement for the false case of the condition, e.g. IF[ <[x y] PRINT["Hey, it was true!"] PRINT["Ah, it was false"]]
Function definitions consist of the FUNCTION keyword, with arguments identifying its name, its return type, a list of parameter names, and one or more statements representing the body. The value of the final statement is the value returned by the function. e.g.
```
FUNCTION[foo INTEGER [a b c]
   VAR[total INTEGER {+ a +[b c]]
   PRINT["Grand total: " total]
   *[total total]
]
```

A sample valid program might look like:

   FUNCTION[sum REAL [x y]
      VAR[result REAL x]
      :=[result +[result y]]
   ]

   VAR[A REAL 10.5]
   VAR[B REAL *[A 2.1]]
   PRINT["The sum of " A " and " B " is: " sum[A B]]

It is relatively flexible with respect to whitespace, the layout above is simply to help with readability.

More formally, the token types to be accepted are shown below (see also the tokens.txt file in your lab4 repo). Note that all but INTLIT, STRLIT, REALLIT, and IDENT are simple hard-coded strings.

OROP: "|"
ANDOP: "&"
NOTOP: "!"
ADDOP: "+"
SUBOP: "-"
MULOP: "*"
DIVOP: "/"
EQOP: "="
LTOP: "<"
SETOP: ":="
OPENBR: "["
CLOSEBR: "]"
IFCTRL: "IF"
READCTRL: "READ"
PRTCTRL: "PRINT"
VARDEF: "VAR"
ARRDEF: "ARRAY"
FUNDEF: "FUNCTION"
INTTYPE: "INTEGER"
STRTYPE: "STRING"
REALTYPE: "REAL"
REALLIT: one or more digits (0-9), followed by a period, followed by one or more digits
INTLIT: one or more digits
IDENT: one or more alphabetic characters (a-z or A-Z)
STRLIT: a double-quote (") followed by one or more characters that are NOT double-quotes, followed by a double-quote

Meanwhile, the context free grammar rules are provided in grammar.txt in your lab4 repo, and are discussed here:

a program is made up of statements
statements can be one or more statements (defined recursively as either a statement or a statement followed by statements)
a statement can either be a definition, an expression, or a control statement
an expression list (used in function calls and prints) is one or more expressions (is defined recursively)
an expression can be a literal value, a function call, or an identifier token
a control statement can be an if statement, a print statement, or a read statement
an identifier list (used in function definitions and read statements) is one or more identifier tokens (is defined recursively)
a literal can be a string literal (token), an integer literal (token), or a real literal (token)
a definition can be a variable definition, an array definition, or a function definition
a function call could represent a call to a built in or a user defined function, so consists of a funtype (to encapsulate those two possibilities) and the actual (passed) parameter list
a funtype (as mentioned above) is either a built in op or an identifier (token) for a user-defined function
the built in ops are the tokens for add, multiply, subtract, divide, set (assign), equality, less-than, or, and, not
an if statement consists of the if keyword (token), an opening bracket (token), an expression representing the condition, a statement for the true case, a statement for the false case, and a closing bracket (token)
a print statement consists of the print keyword (token), an opening bracket (token), an expression list (things to be printed), and the closing bracket (token)
a read statement consists of the read keyword (token), an opening bracket (token), an identifier list (variables to be filled), and the closing bracket (token)
a variable definition consists of the var keyword (token), an opening bracket (token), an identifier (token) for the variable name, a type name (for the variable type), an expression (for the variable value), and the closing bracket (token)
an array definition consists of the array keyword (token), an opening bracket (token), an identfier (token) for the array name, a type name (for the array type), an expression list (for the initial values), and a closing bracket (token)
a function definition consists of the fundef keyword (token), an opening bracket (token), an identifier (token) for the function name, a type name (for the return type), a formal parameter list, a statement list (for the body), and the closing bracket (token)
the type names are one of three tokens, for the real type, the integer type, or the string type
an actual parameter list either consists of the opening bracket (token), an expression list, and the closing bracket (token) or it consists of just the opening and closing bracket (tokens)
a formal parameter list either consists of the opening bracket (token), an identifier list, and the closing bracket (token) or it consists of just the opening and closing bracket (tokens)

Requirements (fixing what's missing)

Fix lab4.lex and lab4.yacc to properly handle the language elements they don't currently support:

The lab4.lex file is missing the definitions for the following token types:
STRLIT, FUNDEF, NOTOP, ANDOP, OROP, LTOP, READCTRL
(the lab4.yacc file is also missing those from its tokens list)
The lab4.yacc file is missing the grammar rules for the following nonterminals:
fundefn, readstmt, formalparams, identlist
(and is also missing them from its types list)
Add basic identification of the last statement type the parser 'entered' successfully (see the error messages discussion below and in the errmsgs.txt file in your lab4 repo.
Ensure the lab4.lex and lab4.yacc files compile cleanly (no error/warning messages) when using "make".

Errors and error messages

Identifying the precise nature of a syntax error in the program being parsed can be quite challenging: by default our lab4 just tells us if parsing was successful or that there was an error (and where it got to in parsing if there was an error).

For the particular language/grammar we're parsing, there is often a unique keyword or construct that allows us to identify when we have 'entered' a particular kind of statement (e.g. after the "FUNCTION" keyword we know we should be inside a function definition).

A 'setstmt' function has been added to the lab4.lex, allowing us to specify what kind of statement we last entered (or the last one we were aware of entering).

Discussion of setstmt and how to use it is included in the errmsgs.txt file in your repo, and an example of its use can be seen in lab4.yacc for the print grammar rule.

Context sensitive checking is not required
Eventually a parser would need to account for things like checking that a called function actually existed, that the number of arguments passed in a call were correct, that the data types matched, etc, but that's beyond the scope of the parsing we'll do in lab4. For lab4 we just want to make sure the source code follows the token and CFG rules above.

Examples

With the token and grammar revisions above, sample valid programs include our original programs shown above, plus things like:

array_recursion - definition and call for a recursive function to sum array elements
basic_exprs - declares two vars and uses :=, *, +, -
functions - simple function definition and call
io_and_if - print, read, and if statements
just_arr - single array definition
just_expr - single call to +
just_if - single if statement
just_vars - three var defs (string, real, int)

Advice:

very

lex and yacc reminders

Remember that each token type specified in the .lex file needs a corresponding entry in the .yacc file on the tokens line:

...etc...

and each non-terminal type needs an entry on the types line:

...etc...

Also in the .lex file, don't forget to adjust col by the number of characters in the token that was read.

Testing

To compile/test your lab:

run "make", which will use lex and yacc to create an executable named "lab4"
create a variety of test files, some with valid syntax and some with invalid syntax
to use your lab4 to check the validity of the syntax in a file use
./lab4 < filename
to run your lab4 on a suite of provided valid/broken programs, use
./runtests.sh

Two subdirectories, valid and broken, have been provided, each containing a handful of test files for use with the completed lab4.

These are just a starting point for testing, however, and it is strongly recommended you add additional test cases to each, to ensure your lab4 is working as intended.

Details on the test cases and more discussion of the testing process can be found in the testcases.txt file in your lab4 repo.