CSCI 330 Lab 5: context sensitive checking

The lab5 repo has been updated (as of 1:08 Monday) to include the lab4 sample solution as a baseline, with fixes for the language change and preliminary code for the symbol table/functions and token/nonterminal storage

In lab5 we'll be augmenting our scanner/parser from lab4 to include context sensitive checking: checking that variables have been declared before use (and prevent redeclarations) and checking that operands have the appropriate types wherever they are used.

To do so, we'll be:

LANGUAGE MODIFICATION: variable types must now be specified when we declare our variable, either as a number or a string.
The var keyword for declarations is replaced by two type-specific versions: numvar and strvar, to declare numeric variables and string variables respectively.
The isnum and isstr operations are now obsolete and removed from the language.

The lab can be obtained and submitted through our usual git approach (repo name 'lab5' this time), and the starting repo structure is similar to lab4 (other than the lab4->lab5 name tweaks).

The starter code for the repo is essentially a tweaked sample solution for lab4 with some of the initial symbol table code/functions already built in.,
You're welcome to use your own lab4 solution as a starting point if you'd prefer (just be sure to add the var-->numvar/strvar changes mentioned above).

Overview

For lab 5 you'll be augmenting the lex/yacc code from lab4 with extra code to support context-sensitive checking: checking that variables are declared before being used, checking that the same variable name isn't declared multiple times, and checking that data types are used appropriately in the various language statements.

To do so, we will need to: Each of these steps is discussed in greater detail below.

Expectations/requirements

When compiled into lab5x, your lab5.lex/yacc code should accept programs in our new languaage (through standard input, e.g. ./lab5x < someprogram), and verify all the following:

If the input program violates any of the context-sensitive rules then your lab5x must produce a clear and informative error message describing the nature of the error.
(If the code contains multiple errors then it's your choice as to whether it quits after the first error message or attempts to continue on.)


Setting and using the new storage type for token/nonterminal data

We'll want to store context-sensitive data with the various tokens and nonterminals, and to accomplish this, we need to:
  1. update the %union type (in .yacc) so each token/nonterminal has a struct of data instead of just a long,
  2. update the %token and %type rules (in .yacc) so specify the tokens/nonterminals are using this new type,
  3. update the .lex code to set initial stored data for the tokens as they are scanned,
  4. update the .yacc grammar rules to access/update the stored data for tokens and nonterminals.
Updating the %union involves specifying a type of struct for the info, while updating the %token and %type lines simple involves referencing that struct type, e.g.:

 /* info struct for tokens and non-terminals:
  *    row,col: the row/column of the relevant token in the source code
  *    content: the actual text content of the token
  *    type: 0 for numeric, 1 for text, -1 if unknown or invalid
  */
%union { struct itemInfo { char content[128]; int row; int col; int type; } info; }

 /* identify the valid token types, all have yylval type long */
%token <struct itemInfo> ASSIGN PLUS MINUS TIMES DIV POW LEFTBR RIGHTBR LEFTPAR RIGHTPAR
%token <struct itemInfo> LT GT LE GE EQ NE NUMDEF STRDEF IDENT STRLIT NUMLIT IF ELSE WHILE READ PRINT
%type <struct itemInfo> program statements statement vardecl assignment expr value mathop
%type <struct itemInfo> inputstmt outputstmt whileloop ifelse condition compop

Within the .lex file rules, the struct for the current token is accessible through variable yylval.

Within that, we must specify which of the union forms we're using (info is the only one we've defined) and then within that we specify which struct field we want to access, e.g. yylval.info.row would access the row field for the current token.

The code snippet below updates the .lex string literal code to initialize that token's struct content:

["][^"\n]*["]                  { col+=strlen(yytext);
                                 yylval.info.row = row;
                                 yylval.info.col = col;
                                 strncpy(yylval.info.content, yytext, 128);
                                 yylval.info.type = 1;
                                 return(STRLIT); }

Within the .yacc file rules, access to the fields of a token/nonterminal is a little unusual syntactically:

Maintaining lists of the declared numeric variables and declared text variables

We'll want maintain a collection of which variables have been declared and of what type, and to be able to easily lookup/insert variables in the collection.

I would recommend simply keeping two arrays of names: one array of names for the numeric variables declared so far and another array of names for the text variables, with functions to do inserts and lookups on each. (Note that, since the underlying code is C, they will need to be arrays of character arrays, and you'll need to use appropriate operations for null-terminated character arrays: strcpy, strcmp, etc.)

In the C section at the top of the lex file we can create suitable arrays and insert/lookup functions, e.g.:

char StrVars[128][64]; /* up to 128 text variables, max name length 64 chars each */
char NumVars[128][64]; /* up to 128 numeric variables, max name length 64 chars each */
int  numSVs = 0; /* initially 0 text variables */
int  numNVs = 0; /* initially 0 numeric variables */
int  getType(char varname[]); /* return 0 if numeric, 1 if text, -1 if undeclared */
int  insert(char varname[], int vartype); /* insert var in appropriate list, return 1 if ok or 0 on fail */

*The two function prototypes also need to be included near the top of the .yacc file

We need to provide the full implementations of the functions at the bottom of the .lex file.
The example below only checks/inserts the numeric ones, you'll need to tweak it to handle the text ones as well (and add suitable error messages).

int  getType(char varname[])
{
   for (int i = 0; i < numNVs; i++) {
       if (strcmp(NumVars[i], varname) == 0) return 0;
   }
   return -1;
}

int  insert(char varname[], int vartype)
{
   if (getType(varname) != -1) {
      return -1;
   }
   if (vartype == 0) {
      if (numNVs >= 128) {
         return -1;
      }
      strncpy(NumVars[numNVs++],varname, 64);
      return 1;
   }
   return -1;
}


Handling variable declarations

In the yacc rules for declaring variables, we'll want to get the name of the variable being declared (from the token's stored name) and the type (a determined from the numvar or strvar token) and attempt an insert (checking whether it succeeded or not).

Now that we know how to add C code to our rules, and how to fill in the variable name for a token (into the content field in the .lex), we could have our rule attempt an insert of the variable name.

In the example below I'll assume one of our syntax rules for a variable declaration is
    vardecl: NUMDEF IDENT ;
So the C code attempts an insert with the right name and type then checks the result:

vardecl: NUMDEF IDENT {
      if (insert( $<info.content>2, 0 ) == -1) {
         yyerror("Attempted numeric variable declaration failed");
      }
   } ;


Handling variable use

In the assignment statement we need to check that the left-hand-side variable has actually been declared and that it's datatype is the same as the datatype for the right-hand-side expression.

Where identifiers are used as an expression value or the target for an input statement, on the other hand, we need to lookup the variable in the lists to determine what type it is (error if not declared) and then set/store that as the data type for the expression itself.

For example, the code snippet below would check to see if we're attempting to read into an undeclared variable:

inputstmt: READ IDENT {
       if (getType($<info.content>2) == -1) {
          yyerror("Error: attempt to read into undeclared identifier");
       }
   };


Determining and checking expression types

For our expression rules we now need to determine what datatype each expression must be, e.g.:

The handful of rules below could be used to determine the data types for simple expressions. (The operator version of expressions is left for you to work out, and you'll want to add an error message in the value: IDENT rule if getType returns -1.)

expr: value { <$info.type>$ = $<info.type>1; };

value: IDENT { $<info.type>$ = getType($<info.content>1); };
value: STRLIT { $<info.type>$ = 1; };
value: NUMLIT { $<info.type>$ = 0; };


Error messages and testing

Be sure your code provides meaningful messages when the context-sensitive checks fail (e.g. "attempt to use undeclared variable x on row 27 column 18", "attempt to redeclare variable y on row 9 column 7", "type mismatch between operands of + on row 11 column 12", etc).

I recommend writing a suite of small valid/invalid programs to test that your parser is working correctly. Typical cases would include things like (I'd recommend having your basic error test cases just include one error per file, as overlapping errors can sometimes mask problems in your error handling.)