Tips for lab 5
Reflections as we dig deeper into the grammar and some context sensitive checking...
- I've posted lab 4 sample solutions if you've got leftover problems
with the tokenizing or basic parsing of the grammar
- When using the $<ItemInfo.whatever>i notation, if you have blocks of C code
interleaved with the RHS elements, those blocks count when you're calculating positions,
e.g. for
X: A { printf("Hi"); } B C ;
X is accessed as $$, A is index 1, the action block is index 2, B is index 3, and C is index 4.
- In the makefile if you add the -v option to the yacc command (yacc -v ......)
then it produces an additional file, y.output, that shows the structure of the FSM it builds
and the relationship to the grammar rules (it can help with weird yacc/lex debugging issues).
- The nonterminal identlist is currently used both inside the formalparams
(where they'll need to get inserted into the symbol table) and inside the readstmt
(where they'll need to be looked up to make sure they're already in the symbol table).
It might be helpful to create a separate nonterminal and pair of rules for the
formal parameters, e.g.
formalplist --> IDENT
formalplist --> IDENT formalplist
and change identlist to formalplist inside the formalparams rule. That way you can
use lookup on the IDENTs in identlist and insert on the IDENTs in formalplist.
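As a sketch in yacc syntax, the new nonterminal might look like the following (the action bodies are just placeholders for the symbol table calls you'll write):

```yacc
formalplist: IDENT
      { /* insert $<ItemInfo.ItemText>1 into the symbol table here,
           erroring if it's already declared in this scope */ }
   | IDENT formalplist
      { /* same insert/check for $<ItemInfo.ItemText>1 */ }
   ;
```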
- Accessing the contents of arrays also looks just like a one-element function call, e.g.
foo[3] could be a call to a function foo or looking up the content of array element 3.
The existing grammar rules just treat it as a function call,
so our symbol table inserts/lookups should probably also track entries for array declarations
and check for them in the funcall rule. A better-designed grammar would perhaps have
added an operator for array element lookup, e.g. @[arrayname, position], where we
could even add type checking to ensure position was an integer expression.
CSCI 330 Lab 5: context sensitive checking
In lab5 we'll be continuing the work begun in lab4, working with the same
language but now adding a symbol table and some forms of context sensitive checking
(with the checking to be carried out as the program is being parsed).
The base repo for lab5 is a copy of the starting repo for lab4. If you wish to re-use
your token/grammar code from lab4 as a starting point then:
- fork/clone the lab5 repo normally
- copy your lab4.lex and lab4.yacc files from your lab4 repo into your lab5 repo,
but rename them as lab5.lex and lab5.yacc (replacing the stock lab5.lex and lab5.yacc)
Lab time and submission deadline
As there are more significant challenges to be overcome for lab5, we'll have two
lab sessions dedicated to it (Mar 18 and 25), and the final submission deadline for the
lab will be 8pm Fri. April 5 (extended from Apr. 3).
CFG and token rules
The context free grammar rules and tokenization rules for the language are the
same in lab5 as lab4 (the same grammar.txt, tokens.txt have been included in the lab5
repo). If there were errors in your lab4 lex/yacc implementations then those should
be corrected for the lab5 submission.
Context sensitive checking
The objective for lab5 is to add code to our lab5.lex/lab5.yacc to
allow our lab5 executable to check that (i) every function, parameter, and variable used in the program
is declared before it is used and is accessible at the point of use, and (ii) that functions, parameters, and variables
are not declared multiple times in the same scope.
For lab5 the program does not need to do any type checking, nor does it need to
check that the correct number of parameters is passed in function calls.
To carry out the checking, you will be implementing a simple form of symbol tables and
supporting functions plus some scope tracking features, and using these as parsing takes place. The process
for doing so is outlined at the bottom of this page, and will be discussed in some depth
during the Mar. 18/25 lab sessions.
Testing
This time your test cases/infrastructure will make up part (25%) of your mark for the lab.
Feel free to use the lab4 test infrastructure as a template, but for lab5 your actual test
cases need to focus on whether or not the checking for variable declarations/redeclarations
is working correctly.
Design and use of the symbol table(s)
To carry out the desired error checking, during parsing we need to keep track of:
- what nested set of scopes we're currently "in" (updating this each time we enter/leave a function definition)
- what functions have been declared and in which scopes
- what parameters have been declared and in which scopes
- what variables have been declared and in which scopes
It will likely be helpful to give each new scope a unique integer id, to help later with our
lookup process. This can be done simply with a C global variable (e.g. ScopeID) that we
increment each time a brand new scope is detected.
Since scopes can be nested (e.g. function foo declared inside function blah which is declared
in the global scope) we need to be able to keep track of the set of currently active scopes.
This can easily be done with a simple stack, pushing the scope's id when we enter its scope
and popping it off the stack when we leave its scope. Thus the current active scopes are
on the stack, with the 'innermost' scope at the top and the global scope at the bottom.
Whenever a function, variable, or parameter is declared we would have to do a lookup
in the symbol table to make sure there wasn't already a declaration for that function/variable/parameter
in the same scope, generating an error (yyerror) if there's a problem.
Whenever a function call is made or a variable or parameter is used within an expression
we would have to do a lookup in the symbol table to ensure that a function/variable/parameter
with that name exists in at least one of the currently-active scopes, generating an error
if there isn't one.
Associating information with a token in lex/yacc
To accomplish the objectives above, our yacc rules will somehow need to be able to look
up the actual name associated with the token that represented a function name, variable name,
or parameter name.
During tokenizing, we've seen the use of the lex-provided variable yytext in the .lex file to
access the text of the current token (though so far we've only used it to grab the string
length).
There is another lex-provided variable, yylval, that can be used to store information
that will later be accessible during the .yacc file's parsing of the tokens.
First, in the .yacc file, we'll need to specify what kind of information we want
stored with our tokens/nonterminals. For simplicity we'll use the same kind of
data for all the tokens/nonterminals, and will use a struct to define it:
%union { struct storedInfo { char ItemText[128]; int ItemType; } ItemInfo; }
%token<ItemInfo> ...our list of token types...
%type<ItemInfo> ...our list of nonterminal types...
(Note that the tag inside %token<...> and %type<...> is the union member name, ItemInfo, not the C type.)
In the .lex file, these fields can be accessed as yylval.ItemInfo.ItemText and
yylval.ItemInfo.ItemType. Thus in a REALLIT, INTLIT, STRLIT, or IDENT token rule
we could copy the relevant text from the source code using strcpy(yylval.ItemInfo.ItemText, yytext);
in the same block of code that does the col update.
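For instance, an IDENT rule in the .lex file might look like this (the pattern shown is illustrative, use the one from your tokens.txt; col is the column-tracking variable from lab4):

```lex
[a-zA-Z][a-zA-Z0-9]*   { col += strlen(yytext);
                         strcpy(yylval.ItemInfo.ItemText, yytext);
                         return(IDENT); }
```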
yacc also provides a rather unique syntax for accessing the data
stored for a token or nonterminal in a rule.
- Suppose we have a rule of the form X: A B C;
- X, A, B, and C each have an info item associated with them,
and we can access their fields as follows:
$<ItemInfo.ItemText>$ // refers to X's text
$<ItemInfo.ItemType>$ // refers to X's type
$<ItemInfo.ItemText>1 // refers to A's text
$<ItemInfo.ItemType>1 // refers to A's type
$<ItemInfo.ItemText>2 // refers to B's text
$<ItemInfo.ItemType>2 // refers to B's type
$<ItemInfo.ItemText>3 // refers to C's text
$<ItemInfo.ItemType>3 // refers to C's type
- During parsing we can look up which text/types the parser found
for A, B, and C, then use that to decide what values to assign
for the text/type of X, e.g.
$<ItemInfo.ItemType>$ = 0;
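Putting it together, a rule action could combine the RHS values to set X's own info, e.g. (purely illustrative):

```yacc
X: A B C
   { strcpy($<ItemInfo.ItemText>$, $<ItemInfo.ItemText>1);  /* give X a copy of A's text */
     $<ItemInfo.ItemType>$ = $<ItemInfo.ItemType>3;         /* give X C's type */
   };
```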
Symbol table implementation in lex/yacc
Remember we can add C global variables, type definitions, and function prototypes
in the %{ .. %} section at the top of lab5.lex, then add the function implementations
at the bottom of the .lex file.
Anything that we wish to also call from inside the .yacc file we'll also need to add
to the top %{ ... %} section of the yacc file, e.g. copies of the function prototypes
and extern declarations of any global lex variables we wish to access directly.
As far as the symbol table goes, the simplest approach is probably to use a struct
to record information about each individual entry and make the symbol table an
array of these structs, e.g. something like
struct SymEntry {
char name[128]; // its name, e.g. "foo": it's ok here to impose a reasonable max name length
int scope; // the id of the scope it was declared in
int symtype; // e.g. 0 for function, 1 for parameter, 2 for variable
};
struct SymEntry SymTable[500]; // again, it's ok to assume a reasonable limit on the number of defined items
Functions would be needed to insert an entry into the table and to look up table entries,
and these functions could be called as needed from inside { ... } blocks within the .lex
or .yacc files, e.g.
// insert a new entry into the table, return the table row if successful, -1 otherwise
int insertEntry(char entryName[], int entryScope, int entryType);
Adding the checking in our .yacc file
In our yacc parsing rule for a variable definition, at the end of the rule
we would add a C block to do the appropriate checking, e.g.
vardefn: .......the tokens and nonterminals......
{
... our C code to check that this is a valid declaration and do the symtable insert,
... calling yyerror if anything fails
};
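Filled in, that might look like the following, where VAR/IDENT/SEMI are placeholder token names (check your grammar.txt for the real rule and adjust the $<...>i index to match), and insertEntry/currentScope are the hypothetical helpers discussed above:

```yacc
vardefn: VAR IDENT SEMI
   { if (insertEntry($<ItemInfo.ItemText>2, currentScope(), 2) < 0)
        yyerror("variable redeclared in this scope");
   };
```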
We would do similarly for the rule for a function definition, and for the parameters
within its formal parameter list.
To ensure that items are declared before being used we could add checks for
funcall and for IDENT in expr: making sure that a function, variable, or parameter (whichever is appropriate)
with that name has been declared in (at least) one of the scopes that is currently accessible.
A reminder about text strings in C
Remember that C doesn't have a 'string' class: strings are just null-terminated
character arrays, so to copy them we have to use functions like strcpy from the
string.h library; you can't rely on simply assigning one to another.