CSCI 439: Lab 1 (Spring 2025)
Sample walkthrough of early lab1 steps
I've added a step-by-step walkthrough showing the addition of a couple of token types,
allowing multiple statements in the program, and incorporating simple assignment statements:
see walkthru.html.
Whoops, hadn't set a git deadline for the repo so it wasn't allowing people to push: that should be corrected now.
Grammar corrections: two quick corrections to the grammar/explanations
1. Originally I listed COM as a token, but I would recommend following the process
discussed in lectures instead (a regex that matches and discards the text from COM to the end of the line)
2. The keyword list was missing a number of keywords, including: "call", "read", "write", "neg", "integer",
"text", "real".
Review and prep:
A significant portion of lab 1 will be spent getting familiar with
the basics of lex (lexical analyzer) and yacc (yet another compiler compiler).
If you completed CSCI 330 recently then this will (hopefully) be review, otherwise there is quite
a bit of new material to pick up as part of the lab 1 exercise.
If you're rusty with the git submit process then it is also worth reviewing the material
on the bottom half of the
main labs page.
As usual, you'll need to fork/clone a git repository with the starting code for lab1
(csci439 as the course, lab1 as the repo name), and remember your git add/commit/push
when the time comes to submit.
We'll spend the first lab session (Jan. 22) on lab1: this week we'll
focus on the fundamentals of lex/yacc for tokenizing and basic parsing.
In lab 2 (the subsequent two weeks) we'll delve into symbol tables and scope/type checking,
and expand the language tokens and syntax.
Lab 1 exercise:
For the language we're checking in lab 1, the material below covers:
- a basic overview of the language,
- a set of regular expressions defining the token types,
- a set of context free grammar rules defining the language syntax,
- a set of context sensitive rules covering additional language rules,
- some examples of source code written in the language.
The Vurbossity language:
Disclaimer: it's pretty dicey trying to get language specs correct on the first stab,
so we may need to tweak the specs below as issues are identified. Please let me know as soon
as possible if you think you've spotted a problem!
Overview
Vurbossity is a text-heavy language, generally favouring the use of words over symbols.
(The only characters used in the language aside from whitespace are [a-z], [A-Z], [0-9],
the decimal point [.] and the double-quote ["].)
It is strictly typed: every literal, variable, and parameter has a fixed type
(integer, real, or text) with strict rules on mixed-type operations.
There is currently no support for composite data types (e.g. arrays, structs, lists, etc) and
no mechanism for manipulating string content beyond basic assignment (set) and input (read).
The expression syntax generally follows a fully-parenthesized prefix style
(operation type then the operands, all within the left/right bracket keywords).
The language supports global and local variables but has no support for constants.
In addition to a (required) 'main' routine, the language supports procedures (subroutines that
do not return a value) but has no support for functions (subroutines that return a value).
Thus "returning" a value from the callee to the caller is done via global variables.
The language currently does not support any form of forward declaration for procedures
and does not support recursion.
The language supports only one kind of loop, the if loop, which is discussed in greater
detail in the syntax and examples sections below.
Comments begin with the COM keyword and continue to the end of the current line.
Language tokens (all case sensitive)
Keywords:
begin end (used to mark the beginning and end of most blocks)
pdef gdef vdef (used to denote definitions of procedures, global vars, local vars)
if else (used for if/else loops)
add sub mul div rem (math operators)
lt gt le ge ne eq (comparison operators)
set (assignment operator)
and or not (boolean operators)
left right (used to bracket expressions)
call (used to call a procedure)
read write (used for input and output statements)
neg (unary negation operator)
integer real text (type specifiers)
main (used to mark the main routine)
Literal values:
integer literals consist of one or more digits (0-9)
real literals consist of one or more digits then a decimal point then one or more digits
text literals begin and end with " and may contain any characters other than a " in between
Identifiers:
any sequence of alphabetic characters that is not a keyword
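As a rough sketch only (not the required implementation), the corresponding lex rules might
look something like the lines below. The uppercase token names are placeholders that you would
declare with %token in the yacc file, each action also needs whatever row/column bookkeeping
the skeleton uses, and the keyword rules must appear before the identifier rule (lex prefers
the earliest rule when two patterns match the same length of text):
"begin"            { return BEGINTOK; }
"end"              { return ENDTOK; }
"set"              { return SET; }
[0-9]+             { return INTVAL; }
[0-9]+"."[0-9]+    { return REALVAL; }
\"[^\"]*\"         { return TEXTVAL; }
[a-zA-Z]+          { return IDENT; }
The remaining keywords would each get a rule of the same form as the first three.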
Syntax rules
Overall program structure:
The program structure has a fixed order for global components:
1. zero or more global variable declarations
2. zero or more procedure declarations
3. exactly one main routine
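As a sketch of one possible way to express that ordering in yacc (all names here are
placeholders, and other arrangements are equally valid), the top level might look like:
program : globals procs mainroutine ;
globals : /* empty */
        | globals globaldecl ;
procs   : /* empty */
        | procs procdecl ;
with globaldecl, procdecl and mainroutine filled in from the subsections below.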
Global variable declarations:
These begin with the "gdef" keyword followed by an identifier for the variable name
(no two global variables can have the same name), followed by the type specifier
(currently "text", "real", or "integer").
Procedure declarations:
These begin with the "pdef" keyword followed by an identifier for the procedure name
(no two procedures can have the same name), followed by the formal parameter list and then
a body block.
Formal parameter lists:
These begin with the "left" keyword, followed by zero or more pairs of data type and
identifier (thus providing the data type of each parameter), ending with the "right" keyword.
Main routine:
This begins with the "main" token and is followed by a body block.
Body blocks (for procedures, main, and loops):
These blocks start with the "begin" keyword and finish with the "end" keyword and
may contain zero or more local statements.
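Continuing that sketch with the same placeholder names (again, just one reasonable
arrangement), the declaration and body structures above might become:
globaldecl   : GDEF IDENT type ;
procdecl     : PDEF IDENT formalparams body ;
formalparams : LEFT paramdecls RIGHT ;
paramdecls   : /* empty */
             | paramdecls type IDENT ;
mainroutine  : MAIN body ;
body         : BEGINTOK stmts ENDTOK ;
stmts        : /* empty */
             | stmts stmt ;
type         : INTTYPE | REALTYPE | TEXTTYPE ;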
Local statement types:
Local variable declarations:
These begin with the "vdef" keyword followed by an identifier for the variable name
(no two variables can be declared in the same scope with the same name), followed by
the type specifier ("text", "integer", or "real").
Assignment statements:
These begin with the "set" keyword followed by an identifier naming the variable
whose value is being set (the variable must have been previously declared in an
accessible scope), then the expression specifying the value to be assigned.
(See the "additional rules" section for type compatibility/casting.)
Output statements:
These begin with the "write" keyword followed by an expression whose evaluated
result is the value is to be output. (Note that expressions include things like
stand-alone literals and identifiers.)
Input statements:
These begin with the "read" keyword followed by an identifier naming which variable
is to be used as the storage location for the value read. The variable must have
been previously declared in an accessible scope.
(During execution the user must enter a type-compatible value or the program will crash.)
Procedure calls:
These begin with the "call" keyword followed by an identifier naming the procedure
to be called followed by the list of passed arguments (discussed below). The number
of values passed must exactly match the number of arguments the procedure is expecting,
and the passed data types must be compatible with the declared parameter types (see the
"Additional rules" below for type compatibility rules).
Parameter lists:
Parameter lists consist of the "left" keyword followed by zero or more expressions
followed by the "right" keyword.
Expressions can be any one of the following:
A single literal value, or
a single identifier (which must be accessible in the current scope), or
a "left" keyword followed by an operator followed by one or two expressions
(one expression for unary operators, two expressions for binary operators),
followed by a "right" keyword
Conditional expressions use the conditional operators (and, or, not, lt, gt, ge, le, ne, eq)
while computational expressions use the add, sub, mul, div, rem, and neg operators.
Note that, since expressions are fully parenthesized, associativity and precedence
rules are not required.
Binary operators for computation expressions:
"add" "sub" "div" "mul" "rem" (remainder, integer division)
Binary operators for condition expressions:
"eq" "ne" "lt" "le" "gt" "ge" (equals, not-equals, etc)
"and", "or"
Unary operators for computation expressions:
neg (e.g. left neg x right to get -x)
Unary operators for condition expressions:
not
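Since lab 1 only has to check syntax, one simple option is a single expression rule that does
not try to separate condition expressions from computation expressions in the grammar (that
distinction becomes a type-checking concern in lab 2). A sketch using the same placeholder
token names as above:
expr     : INTVAL
         | REALVAL
         | TEXTVAL
         | IDENT
         | LEFT unaryop expr RIGHT
         | LEFT binaryop expr expr RIGHT ;
unaryop  : NEG | NOT ;
binaryop : ADD | SUB | MUL | DIV | REM
         | EQ | NE | LT | LE | GT | GE | AND | OR ;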
If loops:
These consist of
- the "if" keyword
- then a condition expression
- then a body
- then the "else" keyword
- then a body
Note:
The first if/condition/body operates like a conventional top-tested while loop,
repeating as long as the condition is true.
The else/body operates like a final/last clause: it is always executed exactly once,
after the while test condition fails.
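As a grammar rule the if loop is compact; continuing the placeholder names used in the
sketches above, something like:
ifloop : IF expr body ELSE body ;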
Additional rules
Scoping rules:
- variables can be either global or local to a body (a procedure's body, main's body, or an if loop's body)
- global variables are usable anywhere after their point of declaration
- local variables are usable anywhere within the same body block after the point of their declaration
Note that global variables can only be declared in the global space, and locals only in body blocks
Type rules:
- every variable, parameter, and literal has a fixed type (integer, real, or text)
- expressions and assignment (set) statements that contain a mix of integer and real
types are permitted, with the integer values automatically cast to reals,
but no other type mixes are permitted
- when assigning values the types of the left and right hand sides must match,
with the exception that integer values will automatically be cast to reals if assigned to a real variable
- similarly values passed to parameters must be of matching types with the exception
of passing integers to reals (where again the integer value is automatically cast to real)
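For example, assuming r has been declared as a real variable and t as a text variable, the
first two assignments below are legal under the casting rule while the third is not:
set r 5 COM legal: the integer 5 is cast to a real
set r left add r 2 right COM legal: mixed int/real expression, the 2 is cast to a real
set t 5 COM illegal: there is no integer-to-text conversion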
Code examples (.vurb files)
COM a complete (hopefully correct) program using most features
COM some global variable declarations
gdef foo text
gdef blah integer
gdef ick real
COM A procedure that sets foo and blah to parameters str and i,
COM and prompts the user for a value which it reads into a local variable
COM then uses that to set ick
pdef getvals left integer i text str right
begin
vdef myfoo real COM just to show use of a local
set foo str
set blah i
write "Please enter a real number"
read myfoo
set ick myfoo
end
main
begin
vdef result real
call getvals left 123 "hope this works" right
set result left add blah ick right COM implicitly converts int to real for the add op
write foo
write result
vdef startval integer
set startval 3
COM show an if loop example
if left lt startval blah right
begin
write startval
set startval left add startval 1 right COM basically startval++
end
else
begin
write "I think we're done!"
end
end
COM second example: a program containing an if loop
COM whose condition and bodies use every operator in the language
main
begin
vdef r real
vdef i integer
vdef t text
COM get the initial values from the user
write "Enter an integer"
read i
write "Enter a real"
read r
write "Enter a word"
read t
COM giant ugly condition:
COM if !( ((r=1.234)||(i!=2004)) &&
COM (((r<0.001)&&(i>0)) || ((t <= "abc")&&(t >= "xyz"))))
if left not
left and
left or
left eq r 1.234 right
left ne i 2004 right
right
left or
left and
left lt r 0.001 right
left gt i 0 right
right
left and
left le t "abc" right
left ge t "xyz" right
right
right
right
right
begin
set r left div
left add
left div 200 r right
99.99
right
left mul
0.001
left sub
left neg r right
77
right
right
right
set i left rem i 13 right
end
else
begin
set t "Ooops, that was not the setup I wanted"
end
COM write out the results
write "The resulting real is"
write r
write "and the integer is"
write i
write "and the word is"
write t
end
In the lab1 repository, you'll be provided with skeleton lex and yacc files (lab1.lex,
lab1.yacc) and a makefile to compile
them into a lab1 executable that will be your recognizer.
After running make to compile the lex/yacc code, we can test user programs in our
new language by running
./lab1 < someprogram.vurb
Currently the skeleton version thinks anything with the form "main begin NNN end"
(where NNN is an integer) is a valid program, so the .lex and .yacc files
need a fair bit of work to bring them in line with the actual language specifications.
The skeleton lex/yacc is currently set up to track the current row/column number
as it reads through the source file, and if it decides the source code program is
invalid it displays an error message with that row/column. If no error messages
are displayed then the source program has been accepted as valid.
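In grammar terms, the skeleton's "main begin NNN end" starting point presumably boils down to
a single rule of roughly this shape (the names in the actual provided files will differ), and
that is what you'll be replacing and extending:
program : MAIN BEGINTOK INTVAL ENDTOK ;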
Objectives:
The goal for lab1 is to modify the .lex and .yacc files so that the constructed
lab1 executable accepts syntactically valid programs in our newly created language, and generates at least
one error message for each syntactically invalid program. (It's fine if it generates multiple
error messages when multiple errors are detected, but it's also fine if it just generates
one error message for the first problem detected.) The error message
should include an accurate row/column for the point at which the lab1 program
identifies there is a problem.
This means adding suitable regular expressions, tokens, rules etc across the two files
to support the grammar rules.
Your lab1 program is not required to check the semantic rules
(things like declaring variables before they're used, type checking, etc),
it just has to check basic syntax.
Debugging:
If lex/yacc give nebulous warnings about shift/reduce conflicts,
one way to get more information is to add a -v option at the end of
the yacc commands in the makefile, e.g.
yacc lab1.yacc -v
This will cause yacc to produce an extra file, y.output, which contains
a description of the finite state machine yacc constructed. At the top
of y.output it will also tell you which state(s) have conflicts.
These conflicts mean that there is some ambiguity in what action the parser
should take given a specific state and lookahead token; they are reported against your yacc
grammar rules, though the root cause can sometimes be traced back to how you chose to
tokenize things in your lex regexes.
Sometimes conflicts can also be flagged when you have executable C blocks embedded
in your rules at points where the parser can't yet identify which rule will ultimately
be correct, e.g.
W: X { } Y ;
W: X { } Z ;
Either of the two { } code blocks could theoretically be valid at the
point we're being asked to run them; it's not until Y or Z is encountered
that the parser knows which form of W we're actually dealing with.
Eliminating the code blocks (or moving them after the Y/Z) will eliminate such conflicts,
since the parser isn't
asked to take external action until after it has figured out which form is correct.
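For example, if a pair like that does produce a conflict, rewriting it as
W: X Y { } ;
W: X Z { } ;
means no action has to run until after Y or Z has been seen.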
Testing:
When evaluating your programs, they'll be run against a suite of valid
programs and a suite of invalid programs,
using different subsets and combinations of the language features.
I strongly recommend creating a collection of test cases and programs.
While your .lex and .yacc code must be strictly your own work, I have no objection to
students sharing interesting programs for use as test cases.
Two subdirectories, validprogs and invalidprogs, have been set up to hold test cases.
They each have a single example in them to start with, but those examples are
based on the starting "main begin NNN end" syntax, not the language rules you're trying
to implement.
Submission:
Remember to perform "git add"s for your lab1.lex and lab1.yacc files,
plus any test cases you used that you'd like me to see, then do a "git commit"
and "git push" to submit the work before the deadline.
Reminder of git fetch/clone/submit cycle:
To obtain a lab initially (e.g. lab1 here)
ssh csci fork csci439/lab1 csci439/$USER/lab1
git clone csci:csci439/$USER/lab1
To submit a lab, be sure you are inside your cloned repository for that lab
and have done a "git add" for each file to be submitted, e.g.
git add filename
Create a commit for the submission, e.g.
git commit -m "lab1 submission"
Push the lab
git push origin --all
(being paranoid here with the --all just in case you've created branches)
Advice:
In this lab get the tokenizing/parsing working but don't worry about the
symbol table/scopes at all, then in lab 2 we'll focus on adding
the symbol table and associated scope and type error checking.
In lab1:
- Get tokenizing working first, adding one token type at a time:
- add the new regex to your lex file, giving it a unique token name
- add the new token type to the types list in your yacc file
- change the statement rule in your yacc file to add | NEWTOKTYPE
- now test input files containing the new token type to make sure they work
- Once all the tokens are correctly recognized (and invalid ones correctly rejected)
then move on to refining the yacc grammar rules, adding one new feature at a time:
- change it so that body recognizes one or more statements rather than just a single
statement, test to make sure you can put multiple tokens in your program body
and it accepts all the valid ones
- start adding rules for the individual statement types: output (write), input (read),
variable definitions, procedure calls, if loops, assignment statements.
You'll need to create some subrules for these; initially just make those
something very simple, e.g. have your initial rule for an expression simply accept
a number, then refine/expand them once the basics work.
- When dealing with the single line comment: in its { .... } section update the row/col
but don't return a token type. This effectively strips out/ignores the comments
before the parsing starts.
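A minimal sketch of such a lex rule (adapt the bookkeeping to match however the skeleton
tracks position):
"COM"[^\n]*   { /* update the column count here, but don't return anything:
                   the comment text is simply discarded */ }
The newline at the end of the comment line is not consumed by this rule, so it is presumably
still handled by whatever rule the skeleton already uses to advance the row count.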