CSCI 330: Lab 4 (2025)

CSCI 330 Lab 5: context sensitive checking

The lab5 repo has been updated (as of 1:08 Monday) to include the lab4 sample solution as a baseline, with fixes for the language change and preliminary code for the symbol table/functions and token/nonterminal storage

In lab5 we'll be augmenting our scanner/parser from lab4 to include context sensitive checking: checking that variables have been declared before use (and prevent redeclarations) and checking that operands have the appropriate types wherever they are used.

To do so, we'll be:

maintaining a limited form of symbol tables,
storing extra context-sensitive information with the various tokens and nonterminals,
augmenting our lex/yacc rules to record/check/update the content sensitive information, and
adding a variety of support functions to make all the above easier.

LANGUAGE MODIFICATION: variable types must now be specified when we declare our variable, either as a number or a string.
The var keyword for declarations is replaced by two type-specific versions: numvar and strvar, to declare numeric variables and string variables respectively.
The isnum and isstr operations are now obsolete and removed from the language.

The lab can be obtained and submitted through our usual git approach (repo name 'lab5' this time), and the starting repo structure is similar to lab4 (other than the lab4->lab5 name tweaks).

The starter code for the repo is essentially a tweaked sample solution for lab4 with some of the initial symbol table code/functions already built in.,
You're welcome to use your own lab4 solution as a starting point if you'd prefer (just be sure to add the var-->numvar/strvar changes mentioned above).

Overview

For lab 5 you'll be augmenting the lex/yacc code from lab4 with extra code to support context-sensitive checking: checking that variables are declared before being used, checking that the same variable name isn't declared multiple times, and checking that data types are used appropriately in the various language statements.

To do so, we will need to:

Track declared variables and their types:

maintain lists (arrays) of which numeric variables and which text variables have been declared so far,
create functions to do inserts/lookups on the lists,
check the list whenever a variable is used, to make sure it has been previously declared,
update the list whenever a variable is declared (after checking to make sure the variable wasn't previously declared).
Store extra information with the program tokens and nonterminals:
replace the 'long info' entry in the .yacc %union rule with a struct that allows multiple fields, e.g.
- a field to store token content (e.g. the "x" for a variable named foo),
- fields for the row and column of a token (for debugging messages),
- a field for the type associated with a token or nonterminal (e.g. 0 for numeric, 1 for text, -1 for unknown/invalid)
revise the %token and %type lines to use the new union struct,
initialize some of the token fields when the token is scanned in the .lex file,
within the yacc rules, update the type fields for tokens/nonterminals if/when we can identify something's type.
Perform compile-time type checking in statements/expressions:
for *, /, ^ operators make sure that both operands are numeric,
for other operators make sure that all operands have the same type (all numeric or all text).

Each of these steps is discussed in greater detail below.

Expectations/requirements

When compiled into lab5x, your lab5.lex/yacc code should accept programs in our new languaage (through standard input, e.g. ./lab5x < someprogram), and verify all the following:

The program follows the language token and syntax rules. (This can be covered simply by using a lab4 sample solution with the var/numvar/strvar/isnum/isstr tweaks.)
Each variable has been declared before it is used.
No variable is declared more than once (e.g. you cannot declare two x's).
No operator is applied to operands of different types (e.g. you cannot do things like ( 3 < "foo" ) or ("ick" + x ) where x is a numeric variable.
The * ^ and / operators cannot have operands of string/text type.

If the input program violates any of the context-sensitive rules then your lab5x must produce a clear and informative error message describing the nature of the error.
(If the code contains multiple errors then it's your choice as to whether it quits after the first error message or attempts to continue on.)

Setting and using the new storage type for token/nonterminal data

We'll want to store context-sensitive data with the various tokens and nonterminals, and to accomplish this, we need to:

update the %union type (in .yacc) so each token/nonterminal has a struct of data instead of just a long,
update the %token and %type rules (in .yacc) so specify the tokens/nonterminals are using this new type,
update the .lex code to set initial stored data for the tokens as they are scanned,
update the .yacc grammar rules to access/update the stored data for tokens and nonterminals.

Updating the %union involves specifying a type of struct for the info, while updating the %token and %type lines simple involves referencing that struct type, e.g.:

 /* info struct for tokens and non-terminals:
  *    row,col: the row/column of the relevant token in the source code
  *    content: the actual text content of the token
  *    type: 0 for numeric, 1 for text, -1 if unknown or invalid
  */
%union { struct itemInfo { char content[128]; int row; int col; int type; } info; }

 /* identify the valid token types, all have yylval type long */
%token <struct itemInfo> ASSIGN PLUS MINUS TIMES DIV POW LEFTBR RIGHTBR LEFTPAR RIGHTPAR
%token <struct itemInfo> LT GT LE GE EQ NE NUMDEF STRDEF IDENT STRLIT NUMLIT IF ELSE WHILE READ PRINT
%type <struct itemInfo> program statements statement vardecl assignment expr value mathop
%type <struct itemInfo> inputstmt outputstmt whileloop ifelse condition compop

Within the .lex file rules, the struct for the current token is accessible through variable yylval.

Within that, we must specify which of the union forms we're using (info is the only one we've defined) and then within that we specify which struct field we want to access, e.g. yylval.info.row would access the row field for the current token.

The code snippet below updates the .lex string literal code to initialize that token's struct content:

["][^"\n]*["]                  { col+=strlen(yytext);
                                 yylval.info.row = row;
                                 yylval.info.col = col;
                                 strncpy(yylval.info.content, yytext, 128);
                                 yylval.info.type = 1;
                                 return(STRLIT); }

Within the .yacc file rules, access to the fields of a token/nonterminal is a little unusual syntactically:

For a grammar rule, each of the tokens and nonterminals is regarded as having a position, with the LHS item being position $ and the others numbered 1,2,etc from left to right.
For example, consider the following yacc grammar rule:
something: whatever SOMETOKEN otherstuff ;
something has position $, whatever has position 1, SOMETOKEN has position 2, and otherstuff has position 3 (sort of array-indexish, but with $ instead of 0).
We can add a segment of C code at the end of the rule (just before the ;) to handle our context-sensitive checking, e.g.
```
something: whatever SOMETOKEN otherstuff {
        /* our context sensitive C code goes here */
     };
```
Within that C code we can access a specific field of the LHS nonterminal by using the syntax $<info.fieldname>$ and for the RHS tokens and nonterminals we can access their fields by using their position in place of the last $. For example, to access the row field of otherstuff we would use $<info.row>3

Maintaining lists of the declared numeric variables and declared text variables

We'll want maintain a collection of which variables have been declared and of what type, and to be able to easily lookup/insert variables in the collection.

I would recommend simply keeping two arrays of names: one array of names for the numeric variables declared so far and another array of names for the text variables, with functions to do inserts and lookups on each. (Note that, since the underlying code is C, they will need to be arrays of character arrays, and you'll need to use appropriate operations for null-terminated character arrays: strcpy, strcmp, etc.)

In the C section at the top of the lex file we can create suitable arrays and insert/lookup functions, e.g.:

char StrVars[128][64]; /* up to 128 text variables, max name length 64 chars each */
char NumVars[128][64]; /* up to 128 numeric variables, max name length 64 chars each */
int  numSVs = 0; /* initially 0 text variables */
int  numNVs = 0; /* initially 0 numeric variables */
int  getType(char varname[]); /* return 0 if numeric, 1 if text, -1 if undeclared */
int  insert(char varname[], int vartype); /* insert var in appropriate list, return 1 if ok or 0 on fail */

*The two function prototypes also need to be included near the top of the .yacc file

We need to provide the full implementations of the functions at the bottom of the .lex file.
The example below only checks/inserts the numeric ones, you'll need to tweak it to handle the text ones as well (and add suitable error messages).

int  getType(char varname[])
{
   for (int i = 0; i < numNVs; i++) {
       if (strcmp(NumVars[i], varname) == 0) return 0;
   }
   return -1;
}

int  insert(char varname[], int vartype)
{
   if (getType(varname) != -1) {
      return -1;
   }
   if (vartype == 0) {
      if (numNVs >= 128) {
         return -1;
      }
      strncpy(NumVars[numNVs++],varname, 64);
      return 1;
   }
   return -1;
}

Handling variable declarations

In the yacc rules for declaring variables, we'll want to get the name of the variable being declared (from the token's stored name) and the type (a determined from the numvar or strvar token) and attempt an insert (checking whether it succeeded or not).

Now that we know how to add C code to our rules, and how to fill in the variable name for a token (into the content field in the .lex), we could have our rule attempt an insert of the variable name.

In the example below I'll assume one of our syntax rules for a variable declaration is
vardecl: NUMDEF IDENT ;
So the C code attempts an insert with the right name and type then checks the result:

vardecl: NUMDEF IDENT {
      if (insert( $<info.content>2, 0 ) == -1) {
         yyerror("Attempted numeric variable declaration failed");
      }
   } ;

Handling variable use

In the assignment statement we need to check that the left-hand-side variable has actually been declared and that it's datatype is the same as the datatype for the right-hand-side expression.

Where identifiers are used as an expression value or the target for an input statement, on the other hand, we need to lookup the variable in the lists to determine what type it is (error if not declared) and then set/store that as the data type for the expression itself.

For example, the code snippet below would check to see if we're attempting to read into an undeclared variable:

inputstmt: READ IDENT {
       if (getType($<info.content>2) == -1) {
          yyerror("Error: attempt to read into undeclared identifier");
       }
   };

Determining and checking expression types

For our expression rules we now need to determine what datatype each expression must be, e.g.:

if the expression is a numeric literal then set the expression type as numeric (0),
if the expression is a text literal then set the expression type as text (1),
if the expression is an identifier then use the lookup function to figure out the type and set that as the expression type,
if the expression is an operator acting on two operands then make sure both types are the same and suitable for the operator (since * / ^ cannot have text operands) then set the expression type to be the same as the operands' type.

The handful of rules below could be used to determine the data types for simple expressions. (The operator version of expressions is left for you to work out, and you'll want to add an error message in the value: IDENT rule if getType returns -1.)

expr: value { <$info.type>$ = $<info.type>1; };

value: IDENT { $<info.type>$ = getType($<info.content>1); };
value: STRLIT { $<info.type>$ = 1; };
value: NUMLIT { $<info.type>$ = 0; };

Error messages and testing

Be sure your code provides meaningful messages when the context-sensitive checks fail (e.g. "attempt to use undeclared variable x on row 27 column 18", "attempt to redeclare variable y on row 9 column 7", "type mismatch between operands of + on row 11 column 12", etc).

I recommend writing a suite of small valid/invalid programs to test that your parser is working correctly. Typical cases would include things like

attempts to use undeclared variables (as the LHS of an assignment, as the RHS of an assignment, inside expressions, and as the RHS of input/output statements),
attempts to declare the same variable multiple times
attempts to assign values of an incorrect type to a variable (string to number and number to string),
attempts to use string values with * / and ^
attempts to mix incompatible types with the various operands (+ - < >= etc etc)

(I'd recommend having your basic error test cases just include one error per file, as overlapping errors can sometimes mask problems in your error handling.)