CSCI 330 Lab 4: scanning and parsing

The primary focus of lab 4 is to practice using regular expressions and context-free grammars in the tokenizing (scanning) and parsing of programming languages.

LANGUAGE CORRECTION: the originally posted version forgot to mention that an expression can be simply an identifier or a literal (string or number).
SPECIAL NOTE FOR CSCI 439 STUDENTS:
Students who are also taking CSCI 439 (Compilers) this term have just completed a pair of lex/yacc labs similar to 330's labs 4 and 5 but dealing with a different language. As an alternative to repeating the process with yet another language, you have the following option to get credit for the 330 labs 4 and 5:
  • For 330 lab 4 credit: copy your CSCI 439 lab 2 content to your 330 lab4 repo and fix all the issues pointed out in your 439 lab 2 feedback.
  • For 330 lab 5 credit: (in your 330 lab 5 repo) expand on the above to add all the Vurbossity language modifications that were introduced in CSCI 439 labs 3 and 4 (adding the boolean, incorporating all the new type checking rules, etc).
If you choose these options then your 330 lab 4 and 5 marks will be based on the correctness and completeness of the listed fixes/modifications.

The lab can be obtained and submitted through our usual git approach (repo name 'lab4' this time), and the starting repo contains a number of files and subdirectories.

Overview

For lab 4 you'll be editing and submitting the lex and yacc files used to build a simple parser, lab4x, for the language described below. The lex/yacc files are compiled through a makefile using lex and yacc on the cubs/pups/tuxworld machines (details below).

The starting lab4.lex file (and hence a current build of lab4x) recognizes most, but not all, of the needed tokens, and the starting lab4.yacc file (and hence a current build of lab4x) recognizes a subset of the needed grammar.

You'll be expanding the set of token regular expressions in lab4.lex and the grammar rules in lab4.yacc so that a new build of lab4x can parse the full language.

For details on lex and yacc themselves, see the notes for the week of Mar. 3-7 in the lecture resources.

Objectives

Lab 4 is meant to give you practice with regular expressions, context free grammars, and their use in syntax checking source code using tools like lex/flex and yacc/bison.

You will be using lex regular expressions to describe the tokens in a language, and yacc CFG rules to describe the syntax of the language.

A makefile is provided to compile your lex and yacc files into a tool that checks user source code for syntax errors; the source file content is fed to the lab4x executable as standard input (./lab4x < someprog.lisc), e.g.
make
./lab4x < valid/simple.lisc     # tries parsing the simple.lisc code example
The lab4x executable processes the file content and displays error messages if it detects any syntax errors; otherwise it displays a final "Parsing complete." message. The error messages, if any, are accompanied by a row and column number indicating where in the source file the parser recognized it was in an error state, along with an indication of what statement type it last entered and where (to the best of its knowledge).


The language to be parsed: lisc

Starting lab4.lex and lab4.yacc files have been provided for you; they handle most of the tokens and grammar rules described below.

Informally, the language is a bit of a simplified lisp/c mix. The full official rules for the language are listed and discussed in the sections below, with code samples provided later on this page.

lisc token types, comments, and delimiters

All tokens must be whitespace delimited (spaces, tabs, newlines).

Comments begin with a semicolon (;) and continue to the end of the line (a bare newline; Windows-style line endings are not supported), and must be stripped out by the tokenizer.

The supported token types are listed below (a lex sketch of a few of the patterns follows the list):
  • keywords: if else while read write var isnum isstr
  • operators:   +   -   *   /   ^   ==   !=   <   <=   >   >=   =
  • brackets/parentheses:   {   }   (   )
  • number literals: one or more digits, optionally followed by a decimal point and one or more additional digits (e.g. valid numeric literals include 99, 00123.4500, and 1.234, but a number cannot begin or end with a decimal point)
  • string literals: begin and end with a double-quote (") and contain zero or more characters which can be anything except another double-quote or newline
  • identifiers: begin with an upper or lowercase alphabetic character (a-z or A-Z) and can be followed by any number of digits, underscores, and upper or lowercase alphabetic characters (e.g. aAB___933eAx_2 would be a valid identifier)
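
To make the above concrete, here is a rough lex-style sketch of the comment, number-literal, string-literal, and identifier patterns (rules section only). The token names NUMLIT, STRLIT, and ID are illustrative guesses rather than the starter lab4.lex's actual names; keyword rules would need to be listed before the identifier rule, and the whitespace handling and row/column bookkeeping (see the lex and yacc reminders later on this page) are omitted here:

  ;[^\n]*                  { /* comment: strip it, no token returned */ }
  [0-9]+(\.[0-9]+)?        { return NUMLIT; /* e.g. 99, 00123.4500, 1.234 */ }
  \"[^\"\n]*\"             { return STRLIT; /* anything but " or newline inside */ }
  [a-zA-Z][a-zA-Z0-9_]*    { return ID;     /* e.g. aAB___933eAx_2 */ }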

lisc grammar rules

A lisc program consists of one or more statements of the following forms (a sketch of matching yacc rules follows the list):

  • variable declarations : consist of the keyword var followed by an identifier

  • input statements : consist of the keyword read followed by an identifier

  • output statements : consist of the keyword write followed by an expression

  • assignment statements : consist of an identifier followed by the = operator followed by an expression

  • while loops : have the following form:
           while CONDITION {
                STATEMENTS
           }
    
    where STATEMENTS is a series of zero or more statements (variable declarations, input statements, output statements, assignment statements, while loops, if/else statements) and conditions are discussed below.

  • if/else statements: have the following form:
         if CONDITION {
            STATEMENTS
         } else {
            STATEMENTS
         }
    
    (following the same rules as while loops with respect to the STATEMENTS and CONDITIONS)
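
To give a feel for how these statement forms translate into yacc, here is a rough sketch of possible statement-level rules. The token and non-terminal names (VAR, READ, WRITE, ASSIGN, LBRACE, and so on) are illustrative guesses rather than the starter lab4.yacc's actual names, and the condition and expression rules are sketched after the next two lists:

  program    : statement
             | program statement ;           /* one or more statements */

  statements : /* empty: zero or more statements */
             | statements statement ;

  statement  : VAR ID                        /* variable declaration */
             | READ ID                       /* input statement */
             | WRITE expression              /* output statement */
             | ID ASSIGN expression          /* assignment */
             | WHILE condition LBRACE statements RBRACE
             | IF condition LBRACE statements RBRACE ELSE LBRACE statements RBRACE ;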

Conditions can be either a typecheck or a comparison:
  • typechecks have the form ( isnum EXPRESSION ) or ( isstr EXPRESSION )
  • comparisons have the form ( EXPRESSION OP EXPRESSION )
    where the OP can be any one of   ==   !=   <   <=   >   >=

Expressions can be either string expressions or number expressions (a yacc sketch of the condition and expression rules follows these bullets):

  • number expressions: can be a single number literal, or a single identifier, or have the form ( EXPRESSION OP EXPRESSION )
    where the OP can be any one of   +   -   *   /   ^

  • string expressions: follow the same rules as number expressions, but the only valid operators are + and - (and they use string literals rather than number literals)
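
Continuing the same sketch (again with made-up token names), the condition and expression rules might look like the following, where COMPOP stands for any of the six comparison operators and MATHOP for + - * / ^. Since lab 4 does no type checking, and an operand that is just an identifier cannot be classified as string or number by syntax alone, this sketch uses a single expression non-terminal; whether you keep single operator token classes or separate tokens per operator is a design choice for you to make.

  condition  : LPAREN ISNUM expression RPAREN
             | LPAREN ISSTR expression RPAREN
             | LPAREN expression COMPOP expression RPAREN ;

  expression : NUMLIT                        /* a single number literal */
             | STRLIT                        /* a single string literal */
             | ID                            /* a single identifier */
             | LPAREN expression MATHOP expression RPAREN ;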

lisc declaration, type, and scope rules

Note that we won't be carrying out type and declaration checking in lab4, but they will be a part of lab5.
  • Variables must be declared before they can be used.
  • All variables are globally scoped, and accessible anywhere after their point of declaration.
  • Number and string types are incompatible: you cannot assign, compare, or perform + - * / ^ using a mix of number and string values.
  • The comparison, math, and assignment operators act 'normally' with numbers.
  • Comparison operations on strings use dictionary-style alphabetic orderings (e.g. "blah" < "blar", "boo" < "boooo", etc).
  • + acts like concatenation for strings: s1 + s2 returns a new string whose contents are a copy of s1's content followed by a copy of s2's content, e.g. s = "foo" + "blah" would result in s being assigned "fooblah".
  • - acts like substring deletion for strings: s1 - s2 returns a new string whose contents are a copy of s1's content after removing the first substring that matches s2's content, e.g. s = "yucky muck" - "ck" would result in s being assigned "yuy muck". If s1 does not contain s2 then the result is simply a copy of s1.
  • The * / ^ operators cannot be applied to strings.

Code samples

The first sample program below is meant to show a good mix of language features, while the second program is a stripped down example that will work with the starter lab4.lex and lab4.yacc files.

; program 1: your finished lab4x should accept this

; a couple of declarations
var abc
var Xafoo_123__Yep

; try some assignments
abc = 1230
abc = 98.7654
abc = "hello there!"

; try some expressions and operators
abc = ( 5 + 6.78 )
abc = ( abc + abc )
abc = ( 1 + ( abc * ( 3 ^ ( 4 / 5.678 ) ) ) )
abc = ( "foo" + "blah" )
abc = ( abc - "foo" )
abc = ( ( abc + abc ) - ( abc - "f" ) )

; try some I/O
read abc
write abc
write "blah blah blah\n"
write ( ( 123 + 456 ) * 27 )

; try some loops and nesting
if ( isnum abc ) {
   var userVal
   write "Enter a number"
   read userVal
   if ( isnum userVal ) {
      while ( userVal < abc ) {
         write "current value is "
         write userVal
         userVal = ( 1 + userVal )
      }
      write "ended with"
      write userVal
   } else {
     write "that was not a number"
   }
} else {
   write "oops, started with a bad abc value:"
   write abc
}


; program 2: simple program that will parse successfully with the starter code

var x
var foo

x = foo
foo = ( foo + x )

Requirements (fixing what's missing in the starter .lex and .yacc files)

Fix lab4.lex and lab4.yacc to properly handle the language elements they don't currently support.

Context-sensitive checking is not required.

Eventually a parser would need to account for things like checking that a called function actually exists, that the number of arguments passed in a call is correct, that data types match, etc., but that's beyond the scope of the parsing we'll do in lab 4 (parts of it will be covered in lab 5).

For lab4 we just want to make sure the source code follows the token and CFG rules above.


Advice

As with most forms of programming, be sure to add/test tokens and rules in small increments - debugging lex/yacc can be very challenging if you try to add too much at once.

lex and yacc reminders

Each token type specified in the .lex file needs a corresponding entry in the .yacc file on the tokens line:
  %token<long> INTLIT MULOP ...etc...
and each non-terminal type needs an entry on the types line:
  %type<long> program statements ...etc...
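
For example, a keyword handled on the .lex side only builds cleanly if the matching token name is declared on the .yacc side (the WHILE name here is just an illustration; use whatever naming the starter files already follow):

  in lab4.lex :   while                { return WHILE; }
  in lab4.yacc:   %token<long> WHILE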

To help with the effectiveness of your error messages, the col variable (in the .lex file) needs to be consistently adjusted by the number of characters in each token that is read.
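
One common pattern, assuming the starter file tracks the current position in variables along the lines of row and col (the exact names and reset convention may differ), is to bump col by yyleng (the length of the matched text) in every rule and reset it on newlines:

  [a-zA-Z][a-zA-Z0-9_]*   { col += yyleng; return ID; }
  [ \t]+                  { col += yyleng; /* whitespace: no token */ }
  \n                      { row++; col = 1; /* next row, reset column */ }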


Testing

To compile and test your lab, run make and then feed a test file to the resulting lab4x executable as standard input (e.g. ./lab4x < valid/simple.lisc).

The repo starts with two subdirectories, valid and broken, each containing some test files for use with the completed lab4x.

As you work through your lex and yacc code, I strongly recommend augmenting these with more test cases tailored specifically to whichever lisc feature you're currently working on.
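
For instance, while working on while loops you might drop a tiny file into the broken directory that your parser should reject, such as one missing its closing brace (the file name below is made up):

  ; broken/while_missing_brace.lisc
  var x
  while ( x < 5 ) {
     write x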

lab4d and y.output

The makefile builds a second executable, lab4d, that is compiled with the -g option in case you want to run it through a debugger. That compilation also uses yacc's -v option to produce a file called y.output with information on the state machine constructed by yacc (which can be helpful in debugging some yacc errors).