CSCI 330 Lab 4: grammars, lex, yacc

For lab 4 you'll be editing and submitting lex and yacc files used to build a simple parser, lab4, for a language to be described below. The lex/yacc files are compiled through a makefile using lex and yacc on the cubs/pups (details below).

As usual, you'll be forking, cloning, editing, and submitting a git repository, following the same basic command sequence except it's named lab4 this time.

The starting lab4.lex file (and hence a current build of lab4) recognizes most, but not all, of the needed tokens, and the starting lab4.yacc file (and hence a current build of lab4) recognizes a subset of the needed grammar.

You'll be expanding the set of token regular expressions in lab4.lex and the grammar rules in lab4.yacc so that a new build of lab4 can parse the full language.

For details on lex and yacc themselves, see the notes for the week of Mar. 11-15 in the lecture resources.

Objectives

Lab 4 is meant to give you practice with regular expressions, context free grammars, and their use in syntax checking source code using tools like lex/flex and yacc/bison.

You will be using lex regular expressions to describe the tokens in a language, and yacc CFG rules to describe the syntax of the language.

A makefile is provided to allow you to compile your lex and yacc files into a tool that can be used to check for syntax errors in user source code, e.g.
make
./lab4 < filename
The lab4 executable processes the specified file content, and displays error messages if it detects any syntax errors, otherwise it displays a final "Parsing complete." message. The error messages, if any, are accompanied by a row and column number indicating where in the source file the parser recognized it was in an error state, and an indication of what statement type it last entered and where (to the best of its knowledge).


The language to be parsed

The starting lab4.lex and lab4.yacc files have been provided for you, that handle most of the tokens and grammar rules described below.

Informally, the language is a little bit lispy, but using square brackets instead of round, and having the operator/function precede the opening brackets, e.g. foo[x y z] to call function foo on arguments x, y, and z.

For the sake of lab4 it is really only necessary to know the token types (see tokens.txt in your lab 4 repo) and the grammar rules (see grammar.txt in your lab 4 repo), but for future reference in lab 5 and a general understanding of the target language, the language details are provided here.

  • A program is made up of one or more statements: function definitions, variable definitions, array definitions, function calls, if statements, print statements, and read statements.
  • It is a typed language, with variables, arrays, and functions each having a defined type (representing the return type for functions), and compatible types must always be passed/assigned.
    (*It will be beyond the capabilities of our lab4 parser to check type compatibility.)
    (*Note that a language problem, to be discussed in lab 5, is the lack of type specifiers for the parameters in the definition of functions.)
  • Three data types are supported by the language: INTEGER (e.g. 930), REAL (e.g. 14.123), and STRING (e.g. "blah blah blah").
    Note there is no boolean type: the value 0 is treated as logical false, all other values are regarded as logically true.
  • All statements have a value:
    - variables to their currently stored value
    - function calls to their returned value
    - built in operators to their evaluated result (0 or 1 for comparison ops, the value assigned for the assignment op)
    - if statements to the result of either their true or false case (whichever applies based on the current condition)
    - the initial assigned value for variable definitions
    - 0 for other statement types (print, read, function/array definitions)
  • Functions, variables, and arrays must be declared before they are used.
    (*It will be beyond the capabilities of our lab4 parser to check this.)
  • Identifiers (variable or function names) must be strictly alphabetic, can be upper or lower case (or a mix), but cannot be the same as any of the language keywords (IF, READ, PRINT, VAR, FUNCTION, ARRAY, INTEGER, STRING, REAL).
  • Function calls consist of the function name (or built in operator) followed by the arguments in square brackets, e.g. foo[x y z]. The arguments can be function calls themselves, with no limit on the nesting e.g. foo[ +[x 1] *[a b] +[i [* j k]]]
  • A number of built in operators are supported:
        math ops * + / -
        logical ops & | ! (and, or, not, respectively)
        comparison ops = < (e.g. =[x y] tests if x equals y)
        assignment op := (e.g. :=[x y] assigns y to x)
    Thus assigning 3 * x to variable y might look like :=[y *[3 x]]
  • Variables must be declared using VAR, with the declaration specifying the variable name, data type, and initial value (which can be an expression), e.g. VAR[x INTEGER +[y z]]
    Variables declared inside a function are local to that function, variables are global if their declaration is not inside any function.
    (*It will be beyond the capabilities of our lab4 parser to identify scope.)
  • The built-in print statement accepts one or more arguments, each of which can be literal values, variables, or function calls, e.g. PRINT["blah blah" x *[a b]]
  • The built-in read statement accepts one or more variable names, and will read a value into each, e.g. READ[x y z]
  • Arrays can be defined using the ARRAY statement, specifying the array name, data type, and a list of initial values for the array elements, e.g. ARRAY[mydata INTEGER 23 4 *[x y]]
    As with variables, arrays are local to the function in which they are declared, and are global if their declaration is not inside any function.
  • The if statement consists of three arguments: the condition to be checked, a statement for the true case of the condition, and a statement for the false case of the condition, e.g. IF[ <[x y] PRINT["Hey, it was true!"] PRINT["Ah, it was false"]]
  • Function definitions consist of the FUNCTION keyword, with arguments identifying its name, its return type, a list of parameter names, and one or more statements representing the body. The value of the final statement is the value returned by the function. e.g.
    FUNCTION[foo INTEGER [a b c]
       VAR[total INTEGER {+ a +[b c]]
       PRINT["Grand total: " total]
       *[total total]
    ]

A sample valid program might look like:
   FUNCTION[sum REAL [x y]
      VAR[result REAL x]
      :=[result +[result y]]
   ]

   VAR[A REAL 10.5]
   VAR[B REAL *[A 2.1]]
   PRINT["The sum of " A " and " B " is: " sum[A B]]
It is relatively flexible with respect to whitespace, the layout above is simply to help with readability.

More formally, the token types to be accepted are shown below (see also the tokens.txt file in your lab4 repo). Note that all but INTLIT, STRLIT, REALLIT, and IDENT are simple hard-coded strings. Meanwhile, the context free grammar rules are provided in grammar.txt in your lab4 repo, and are discussed here:


Requirements (fixing what's missing)

Fix lab4.lex and lab4.yacc to properly handle the language elements they don't currently support:

Errors and error messages

Identifying the precise nature of a syntax error in the program being parsed can be quite challenging: by default our lab4 just tells us if parsing was successful or that there was an error (and where it got to in parsing if there was an error).

For the particular language/grammar we're parsing, there is often a unique keyword or construct that allows us to identify when we have 'entered' a particular kind of statement (e.g. after the "FUNCTION" keyword we know we should be inside a function definition).

A 'setstmt' function has been added to the lab4.lex, allowing us to specify what kind of statement we last entered (or the last one we were aware of entering).

Discussion of setstmt and how to use it is included in the errmsgs.txt file in your repo, and an example of its use can be seen in lab4.yacc for the print grammar rule.


Context sensitive checking is not required
Eventually a parser would need to account for things like checking that a called function actually existed, that the number of arguments passed in a call were correct, that the data types matched, etc, but that's beyond the scope of the parsing we'll do in lab4. For lab4 we just want to make sure the source code follows the token and CFG rules above.

Examples

With the token and grammar revisions above, sample valid programs include our original programs shown above, plus things like:

Advice:


lex and yacc reminders

Remember that each token type specified in the .lex file needs a corresponding entry in the .yacc file on the tokens line:
and each non-terminal type needs an entry on the types line:

Also in the .lex file, don't forget to adjust col by the number of characters in the token that was read.


Testing

To compile/test your lab:

Two subdirectories, valid and broken, have been provided, each containing a handful of test files for use with the completed lab4.

These are just a starting point for testing, however, and it is strongly recommended you add additional test cases to each, to ensure your lab4 is working as intended.

Details on the test cases and more discussion of the testing process can be found in the testcases.txt file in your lab4 repo.