CSCI 439: Lab 1 (Spring 2025)
Sample walkthrough of early lab1 steps
I've added a step-by-step walkthrough showing the addition of a couple of token types,
allowing multiple statements in the program, and incorporating simple assignment statements:
see walkthru.html.
Whoops, hadn't set a git deadline for the repo so it wasn't allowing people to push: that should be corrected now.
Grammar corrections: two quick corrections to the grammar/explanations
1. Originally I listed COM as a token, but I would recommend following the process
discussed in lectures instead (a regex that matches and discards the text from COM to the end of the line)
2. The keyword list was missing a number of keywords, including: "call", "read", "write", "neg", "integer",
"text", "real".
Review and prep:
A significant portion of lab 1 will be spent getting familiar with
the basics of lex (lexical analyzer) and yacc (yet another compiler compiler).
If you completed CSCI 330 recently then this will (hopefully) be review, otherwise there is quite
a bit of new material to pick up as part of the lab 1 exercise.
If you're rusty with the git submit process then it is also worth reviewing the material
on the bottom half of the
main labs page.
As usual, you'll need to fork/clone a git repository with the starting code for lab1
(csci439 as the course, lab1 as the repo name), and remember your git add/commit/push
when the time comes to submit.
We'll spend the first lab session (Jan. 22) on lab1: this week we'll
focus on the fundamentals of lex/yacc for tokenizing and basic parsing.
In lab 2 (the subsequent two weeks) we'll delve into symbol tables and scope/type checking,
and expand the language tokens and syntax.
Lab 1 exercise:
For the language we're checking in lab 1, the material below covers:
- a basic overview of the language,
- a set of regular expressions defining the token types,
- a set of context free grammar rules defining the language syntax,
- a set of context sensitive rules covering additional language rules,
- some examples of source code written in the language.
The Vurbossity language:
Disclaimer: it's pretty dicey trying to get language specs correct on the first stab,
so we may need to tweak the specs below as issues are identified. Please let me know as soon
as possible if you think you've spotted a problem!
Overview
Vurbossity is a text-heavy language, generally favouring the use of words over symbols.
(The only characters used in the language aside from whitespace are [a-z], [A-Z], [0-9],
the decimal point [.] and the double-quote ["].)
It is strictly typed: every literal, variable, and parameter has a fixed type
(integer, real, or text) with strict rules on mixed-type operations.
There is currently no support for composite data types (e.g. arrays, structs, lists, etc) and
no mechanism for manipulating string content beyond basic assignment (set) and input (read).
The expression syntax generally follows a fully-parenthesized prefix style
(operation type then the operands, all within the left/right bracket keywords).
The language supports global and local variables but has no support for constants.
In addition to a (required) 'main' routine, the language supports procedures (subroutines that
do not return a value) but has no support for functions (subroutines that return a value).
Thus "returning" a value from the callee to the caller is done via global variables.
The language currently does not support any form of forward declaration for procedures
and does not support recursion.
The language supports only one kind of loop, the if loop, which is discussed in greater
detail in the syntax and examples sections below.
Comments begin with the COM keyword and continue to the end of the current line.
Language tokens (all case sensitive)
Keywords:
begin end (used to mark the beginning and end of most blocks)
pdef gdef vdef (used to denote definitions of procedures, global vars, local vars)
if else (used for if/else loops)
add sub mul div rem (math operators)
lt gt le ge ne eq (comparison operators)
set (assignment operator)
and or not (boolean operators)
left right (used to bracket expressions)
call (used to call a procedure)
read write (used for input and output statements)
neg (unary negation operator)
integer real text (type specifiers)
main (used to mark the main routine)
Literal values:
integer literals consist of one or more digits (0-9)
real literals consist of one or more digits then a decimal point then one or more digits
text literals begin and end with " and may contain any characters other than a " in between
Identifiers:
any sequence of alphabetic characters that is not a keyword
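As a rough sketch only (not the required implementation), the corresponding lex rules might
look something like the lines below. The uppercase token names are placeholders that you would
declare with %token in the yacc file, each action also needs whatever row/column bookkeeping
the skeleton uses, and the keyword rules must appear before the identifier rule (lex prefers
the earliest rule when two patterns match the same length of text):
"begin"            { return BEGINTOK; }
"end"              { return ENDTOK; }
"set"              { return SET; }
[0-9]+             { return INTVAL; }
[0-9]+"."[0-9]+    { return REALVAL; }
\"[^\"]*\"         { return TEXTVAL; }
[a-zA-Z]+          { return IDENT; }
The remaining keywords would each get a rule of the same form as the first three.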
Syntax rules
Overall program structure:
The program structure has a fixed order for global components:
1. zero or more global variable declarations
2. zero or more procedure declarations
3. exactly one main routine
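As a sketch of one possible way to express that ordering in yacc (all names here are
placeholders, and other arrangements are equally valid), the top level might look like:
program : globals procs mainroutine ;
globals : /* empty */
        | globals globaldecl ;
procs   : /* empty */
        | procs procdecl ;
with globaldecl, procdecl and mainroutine filled in from the subsections below.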
Global variable declarations:
These begin with the "gdef" keyword followed by an identifier for the variable name
(no two global variables can have the same name), followed by the type specifier
(currently "text", "real", or "integer").
Procedure declarations:
These begin with the "pdef" keyword followed by an identifier for the procedure name
(no two procedures can have the same name), followed by the formal parameter list and then
a body block.
Formal parameter lists:
These begin with the "left" keyword, followed by zero or more pairs of data type and
identifier (thus providing the data type of each parameter), ending with the "right" keyword.
Main routine:
This begins with the "main" token and is followed by a body block.
Body blocks (for procedures, main, and loops):
These blocks start with the "begin" keyword and finish with the "end" keyword and
may contain zero or more local statements.
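Continuing that sketch with the same placeholder names (again, just one reasonable
arrangement), the declaration and body structures above might become:
globaldecl   : GDEF IDENT type ;
procdecl     : PDEF IDENT formalparams body ;
formalparams : LEFT paramdecls RIGHT ;
paramdecls   : /* empty */
             | paramdecls type IDENT ;
mainroutine  : MAIN body ;
body         : BEGINTOK stmts ENDTOK ;
stmts        : /* empty */
             | stmts stmt ;
type         : INTTYPE | REALTYPE | TEXTTYPE ;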
Local statement types:
Local variable declarations:
These begin with the "vdef" keyword followed by an identifier for the variable name
(no two variables can be declared in the same scope with the same name), followed by
the type specifier ("text", "integer", or "real").
Assignment statements:
These begin with the "set" keyword followed by an identifier naming the variable
whose value is being set (the variable must have been previously declared in an
accessible scope), then the expression specifying the value to be assigned.
(See the "additional rules" section for type compatibility/casting.)
Output statements:
These begin with the "write" keyword followed by an expression whose evaluated
result is the value is to be output. (Note that expressions include things like
stand-alone literals and identifiers.)
Input statements:
These begin with the "read" keyword followed by an identifier naming which variable
is to be used as the storage location for the value read. The variable must have
been previously declared in an accessible scope.
(During execution the user must enter a type-compatible value or the program will crash.)
Procedure calls:
These begin with the "call" keyword followed by an identifier naming the procedure
to be called followed by the list of passed arguments (discussed below). The number
of values passed must exactly match the number of arguments the procedure is expecting,
and the passed data types must be compatible with the declared parameter types (see the
"Additional rules" below for type compatibility rules).
Parameter lists:
Parameter lists consist of the "left" keyword followed by zero or more expressions
followed by the "right" keyword.
Expressions can be any one of the following:
A single literal value, or
a single identifier (which must be accessible in the current scope), or
a "left" keyword followed by an operator followed by one or two expressions
(one expression for unary operators, two expressions for binary operators),
followed by a "right" keyword
Conditional expressions use the conditional operators (and, or, not, lt, gt, ge, le, ne, eq)
while computational expressions use the add, sub, mul, div, rem, and neg operators.
Note that, since expressions are fully parenthesized, associativity and precedence
rules are not required.
Binary operators for computation expressions:
"add" "sub" "div" "mul" "rem" (remainder, integer division)
Binary operators for condition expressions:
"eq" "ne" "lt" "le" "gt" "ge" (equals, not-equals, etc)
"and", "or"
Unary operators for computation expressions:
neg (e.g. left neg x right to get -x)
Unary operators for condition expressions:
not
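Since lab 1 only has to check syntax, one simple option is a single expression rule that does
not try to separate condition expressions from computation expressions in the grammar (that
distinction becomes a type-checking concern in lab 2). A sketch using the same placeholder
token names as above:
expr     : INTVAL
         | REALVAL
         | TEXTVAL
         | IDENT
         | LEFT unaryop expr RIGHT
         | LEFT binaryop expr expr RIGHT ;
unaryop  : NEG | NOT ;
binaryop : ADD | SUB | MUL | DIV | REM
         | EQ | NE | LT | LE | GT | GE | AND | OR ;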
If loops:
These consist of
- the "if" keyword
- then a condition expression
- then a body
- then the "else" keyword
- then a body
Note:
The first if/condition/body operates like a conventional top-tested while loop,
repeating as long as the condition is true.
The else/body operates like a final/last clause: it is always executed exactly once,
after the while test condition fails.
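As a grammar rule the if loop is compact; continuing the placeholder names used in the
sketches above, something like:
ifloop : IF expr body ELSE body ;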
Additional rules
Scoping rules:
- variables can be either global or local to a body (a procedure's body, main's body, or an if loop's body)
- global variables are usable anywhere after their point of declaration
- local variables are usable anywhere within the same body block after the point of their declaration
Note that global variables can only be declared in the global space, and locals only in body blocks
Type rules:
- every variable, parameter, and literal has a fixed type (integer, real, or text)
- expressions and assignment (set) statements that contain a mix of integer and real
types are permitted, with the integer values automatically cast to reals,
but no other type mixes are permitted
- when assigning values the types of the left and right hand sides must match,
with the exception that integer values will automatically be cast to reals if assigned to a real variable
- similarly values passed to parameters must be of matching types with the exception
of passing integers to reals (where again the integer value is automatically cast to real)
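For example, assuming r has been declared as a real variable and t as a text variable, the
first two assignments below are legal under the casting rule while the third is not:
set r 5 COM legal: the integer 5 is cast to a real
set r left add r 2 right COM legal: mixed int/real expression, the 2 is cast to a real
set t 5 COM illegal: there is no integer-to-text conversion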
Code examples (.vurb files)
COM a complete (hopefully correct) program using most features
COM some global variable declarations
gdef foo text
gdef blah integer
gdef ick real
COM A procedure that sets foo and blah to parameters str and i,
COM and prompts the user for a value which it reads into a local variable
COM then uses that to set ick
pdef getvals left integer i text str right
begin
vdef myfoo real COM just to show use of a local
set foo str
set blah i
write "Please enter a real number"
read myfoo
set ick myfoo
end
main
begin
vdef result real
call getvals left 123 "hope this works" right
set result left add blah ick right COM implicitly converts int to real for the add op
write foo
write result
vdef startval integer
set startval 3
COM show an if loop example
if left lt startval blah right
begin
write startval
set startval left add startval 1 right COM basically startval++
end
else
begin
write "I think we're done!"
end
end
COM second example: a program containing an if loop
COM whose condition and bodies use every operator in the language
main
begin
vdef r real
vdef i integer
vdef t text
COM get the initial values from the user
write "Enter an integer"
read i
write "Enter a real"
read r
write "Enter a word"
read t
COM giant ugly condition:
COM if !( ((r=1.234)||(i!=2004)) &&
COM (((r<0.001)&&(i>0)) || ((t <= "abc")&&(t >= "xyz"))))
if left not
left and
left or
left eq r 1.234 right
left ne i 2004 right
right
left or
left and
left lt r 0.001 right
left gt i 0 right
right
left and
left le t "abc" right
left ge t "xyz" right
right
right
right
right
begin
set r left div
left add
left div 200 r right
99.99
right
left mul
0.001
left sub
left neg r right
77
right
right
right
set i left rem i 13 right
end
else
begin
set t "Ooops, that was not the setup I wanted"
end
COM write out the results
write "The resulting real is"
write r
write "and the integer is"
write i
write "and the word is"
write t
end
In the lab1 repository, you'll be provided with skeleton lex and yacc files (lab1.lex,
lab1.yacc) and a makefile to compile
them into a lab1 executable that will be your recognizer.
After running make to compile the lex/yacc code, we can test user programs in our
new language by running
./lab1 < someprogram.vurb
Currently the skeleton version thinks anything with the form "main begin NNN end"
(where NNN is an integer) is a valid program, so the .lex and .yacc files
need a fair bit of work to bring them in line with the actual language specifications.
The skeleton lex/yacc is currently set up to track the current row/column number
as it reads through the source file, and if it decides the source code program is
invalid it displays an error message with that row/column. If no error messages
are displayed then the source program has been accepted as valid.
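In grammar terms, the skeleton's "main begin NNN end" starting point presumably boils down to
a single rule of roughly this shape (the names in the actual provided files will differ), and
that is what you'll be replacing and extending:
program : MAIN BEGINTOK INTVAL ENDTOK ;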
Objectives:
The goal for lab1 is to modify the .lex and .yacc files so that the constructed
lab1 executable accepts syntactically valid programs in our newly created language, and generates at least
one error message for each syntactically invalid program. (It's fine if it generates multiple
error messages when multiple errors are detected, but it's also fine if it just generates
one error message for the first problem detected.) The error message
should include an accurate row/column for the point at which the lab1 program
identifies there is a problem.
This means adding suitable regular expressions, tokens, rules etc across the two files
to support the grammar rules.
Your lab1 program is not required to check the semantic rules
(things like declaring variables before they're used, type checking, etc),
it just has to check basic syntax.
Debugging:
If lex/yacc give nebulous warnings about shift/reduce conflicts,
one way to get more information is to add a -v option at the end of
the yacc commands in the makefile, e.g.
yacc lab1.yacc -v
This will cause yacc to produce an extra file, y.output, which contains
a description of the finite state machine yacc constructed. At the top
of y.output it will also tell you which state(s) have conflicts.
These conflicts mean that there is some ambiguity in what action the parser
should take given a specific state and lookahead token; they are reported against your yacc
grammar rules, though the root cause can sometimes be traced back to how you chose to
tokenize things in your lex regexes.
Sometimes conflicts can also be flagged when you have executable C blocks embedded
in your rules at points where the parser can't yet identify which rule will ultimately
be correct, e.g.
W: X { } Y ;
W: X { } Z ;
Either of the two { } code blocks could theoretically be valid at the
point we're being asked to run them; it's not until Y or Z is encountered
that the parser knows which form of W we're actually dealing with.
Eliminating the code blocks (or moving them after the Y/Z) will eliminate such conflicts,
since the parser isn't
asked to take external action until after it has figured out which form is correct.
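For example, if a pair like that does produce a conflict, rewriting it as
W: X Y { } ;
W: X Z { } ;
means no action has to run until after Y or Z has been seen.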
Testing:
When evaluating your programs, they'll be run against a suite of valid
programs and a suite of invalid programs,
using different subsets and combinations of the language features.
I strongly recommend creating a collection of test cases and programs.
While your .lex and .yacc code must be strictly your own work, I have no objection to
students sharing interesting programs for use as test cases.
Two subdirectories, validprogs and invalidprogs, have been set up to hold test cases.
They each have a single example in them to start with, but those examples are
based on the starting "main begin NNN end" syntax, not the language rules you're trying
to implement.
Submission:
Remember to perform "git add"s for your lab1.lex and lab1.yacc files,
plus any test cases you used that you'd like me to see, then do a "git commit"
and "git push" to submit the work before the deadline.
Reminder of git fetch/clone/submit cycle:
To obtain a lab initially (e.g. lab1 here)
ssh csci fork csci439/lab1 csci439/$USER/lab1
git clone csci:csci439/$USER/lab1
To submit a lab, be sure you are inside your cloned repository for that lab
and have done a "git add" for each file to be submitted, e.g.
git add filename
Create a commit for the submission, e.g.
git commit -m "lab1 submission"
Push the lab
git push origin --all
(being paranoid here with the --all just in case you've created branches)
Advice:
In this lab get the tokenizing/parsing working but don't worry about the
symbol table/scopes at all, then in lab 2 we'll focus on adding
the symbol table and associated scope and type error checking.
In lab1:
- Get tokenizing working first, adding one token type at a time:
- add the new regex to your lex file, giving it a unique token name
- add the new token type to the types list in your yacc file
- change the statement rule in your yacc file to add | NEWTOKTYPE
- now test input files containing the new token type to make sure they work
- Once all the tokens are correctly recognized (and invalid ones correctly rejected)
then move on to refining the yacc grammar rules, adding one new feature at a time:
- change it so that body recognizes one or more statements rather than just a single
statement, test to make sure you can put multiple tokens in your program body
and it accepts all the valid ones
- start adding rules for the individual statement types: output (write), input (read),
variable definitions, procedure calls, if loops, assignment statements.
You'll need to create some subrules for these; initially just make those
something very simple, e.g. have your initial rule for an expression simply accept
a number, then refine/expand them once the basics work.
- When dealing with the single line comment: in its { .... } section update the row/col
but don't return a token type. This effectively strips out/ignores the comments
before the parsing starts.
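A minimal sketch of such a lex rule (adapt the bookkeeping to match however the skeleton
tracks position):
"COM"[^\n]*   { /* update the column count here, but don't return anything:
                   the comment text is simply discarded */ }
The newline at the end of the comment line is not consumed by this rule, so it is presumably
still handled by whatever rule the skeleton already uses to advance the row count.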