Tips for lab 5
Reflections as we dig deeper into the grammar and some context sensitive checking...

  • I've posted lab 4 sample solutions if you've got leftover problems with the tokenizing or basic parsing of the grammar

  • When using the $< ItemInfo.whatever >i notation, if you have blocks of C interleaved with the RHS elements those count when you're calculating positions, e.g. for
    X: A { printf("Hi"); } B C ;
    X is index $, A is index 1, B is index 3 and C is index 4.

  • In the makefile if you add the -v option to the yacc command (yacc -v ......) then it produces an additional file, y.output, that shows the structure of the FSM it builds and the relationship to the grammar rules (can help weird yacc/lex debugging issues).

  • The nonterminal identlist is currently used both inside the formalparams (where they'll need to get inserted into the symbol table) and inside the readstmt (where they'll need to be looked up to make sure they're already in the symbol table).
    It might be helpful to create a separate nonterminal and pair of rules for the formal parameters, e.g.
    formalplist --> IDENT
    formalplist --> IDENT formalplist
    and change identlist to formalplist inside the formalparams rule. That way you can use lookup on the IDENTs in identlist and insert on the IDENTs in formalplist.

  • Accessing the contents of arrays also looks just like a one-element function call, e.g. foo[3] could be a call to a function foo or looking up the content of array element 3. The existing grammar rules just treat it as a function call, so our symbol table inserts/lookups should probably also track entries for array declarations and check for them in the funcall rule. A more well-thought-out grammar would perhaps have added an operator for array element lookup, e.g. @[arrayname, position], where we could even add type checking to ensure position was an integer expression.

CSCI 330 Lab 5: context sensitive checking

In lab5 we'll be continuing the work begun in lab4, working with the same language but now adding a symbol table and some forms of context sensitive checking (with the checking to be carried out as the program is being parsed).

The base repo for lab5 is a copy of the starting repo for lab4. If you wish to re-use your token/grammar code from lab4 as a starting point then:


Lab time and submission deadline

As there are more significant challenges to be overcome for lab5, we'll have two lab sessions dedicated to it (Mar 18 and 25), and the final submission deadline for the lab will be 8pm Fri. April 5 (extended from apr. 3).


CFG and token rules

The context free grammar rules and tokenization rules for the language are the same in lab5 as lab4 (the same grammar.txt, tokens.txt have been included in the lab5 repo). If there were errors in your lab4 lex/yacc implementations then those should be corrected for the lab5 submission.


Context sensitive checking

The objective for lab5 is to add code to our lab5.lex/lab5.yacc to allow our lab5 executable to check that (i) every function, parameter, and variable used in the program is declared before it is used and is accessible at the point of use, and (ii) that functions, parameters, and variables are not declared multiple times in the same scope.

For lab5 the program does not need to do any type checking, nor does it need to check that the correct number of parameters are passed in function calls.

To carry out the checking, you will be implementing a simple form of symbol tables and supporting functions plus some scope tracking features, and using these as parsing takes place. The process for doing so is outlined at the bottom of this page, and will be discussed in some depth during the Mar. 18/25 lab sessions.


Testing

This time your test cases/infrastructure will make up part (25%) of your mark for the lab.

Feel free to use the lab4 test infrastructure as a template, but for lab5 your actual test cases need to focus on whether or not the checking for variable declarations/redeclarations is working correctly.


Design and use of the symbol table(s)

To carry out the desired error checking, during parsing we need to keep track of:

It will likely be helpful to give each new scope a unique integer id, to help later with our lookup process. This can be done simply with a C global variable (e.g. ScopeID) that we increment each time a brand new scope is detected.

Since scopes can be nested (e.g. function foo declared inside function blah which is declared in the global scope) we need to be able to keep track of the set of currently active scopes. This can easily be done with a simple stack, pushing the scope's id when we enter its scope and popping it off the stack when we leave its scope. Thus the current active scopes are on the stack, with the 'innermost' scope at the top and the global scope at the bottom.

Whenever a function, variable, or parameter is declared we would have to do a lookup in the symbol table to make sure there wasn't already a declaration for that function/variable/parameter in the same scope, generating an error (yyerror) if there's a problem.

Whenever a function call is made or a variable or parameter is used within an expression we would have to do a lookup in the symbol table to ensure that a function/variable/parameter with that name exists in at least one of the currently-active scopes, generating an error if there isn't one.


Associating information with a token in lex/yacc

To accomplish the objectives above, our yacc rules will somehow need to be able to look up the actual name associated with the token that represented a function name, variable name, or parameter name.

During tokenizing, we've seen the use of local variable yytext in the .lex file to access the content of the current token (though we've just used it to grab the string length so far).

There is another local variable, yylval, that can be used to store information that will later be accessible in .yacc's parsing of the tokens.

First, in the .yacc file, we'll need to specify what kind of information we want stored with our tokens/nonterminals. For simplicity we'll just the same kinds of data for all the tokens/nonterminals, and will use a struct to define it:
%union { struct storedInfo { char ItemText[128]; int ItemType; } ItemInfo; }
%token<struct storedInfo> ...our list of token types...
%type<struct storedInfo> ...our list of nonterminal types...

In the .lex file, these fields can be accessed a yylval.ItemInfo.ItemText and yylval.ItemInfo.ItemType. Thus in a REALLIT, INTLIT, STRLIT, or IDENT token we could copy the relevant data from the source code using strcpy(yylval.ItemInfo.ItemText, yytext); in the same block of code that does the col update.

yacc also provides a rather unique syntax for accessing the data stored for a token or nonterminal in a rule.


Symbol table implementation in lex/yacc

Remember we can add C global variables, type definitions, and function prototypes in the %{ .. %} section at the top of lab5.lex, then add the function implementations at the bottom of the .lex file.

Anything that we wish to also call from inside the .yacc file we'll also need to add to the top %{ ... %} section of the yacc file, e.g. copies of the function prototypes and EXTERN declarations of any global lex variables we wish to access directly.

As far as the symbol table goes, the simplest approach is probably to use a struct to record information about each individual entry and make the symbol table an array of these structs, e.g. something like
struct SymEntry {
   char name[128];  // its name, e.g. "foo": it's ok here to impose a reasonable max name length
   int  scope;      // the id of the scope it was declared in
   int  symtype;    // e.g. 0 for function, 1 for parameter, 2 for variable
};
SymEntry SymTable[500]; // again, it's ok to assume a reasonable limit on the number of defined items

Functions would be needed to insert an entry into the table and to lookup table entries, and these functions could be called as needed from inside { ... } blocks within the .lex or .yacc files, e.g.
// insert a new entry into the table, return the table row if successful, -1 otherwise
int insertEntry(char entryName[], int entryScope, int entryType);

Adding the checking in our .yacc file
In our yacc parsing rule for a variable definition, at the end of the rule we would add a C block to do the appropriate checking, e.g.
vardefn: .......the tokens and nonterminals......
   {
      ... our C code to check that this is a valid declaration and do the symtable insert,
      ... calling yyerror if anything fails
   };
We would do similarly for the rule for a function definition, and for the parameters within its formal parameter list.

To ensure that items are declared before being called we could perhaps add a check for the funcall and IDENT in expr: making sure that a function or variable or parameter (whichever is appropriate) with that name has been declared in (at least) one of the scopes that is currently accessible.

A reminder about text strings in C
Remember that we don't have a 'string' class in C, they're just null-terminated character arrays, so to copy them we have to use things like strcpy from the string.h library, you can't rely on simply assigning one to another.