CSCI 439 Quiz 1 Sample Solutions: scanning and parsing


Question 1 [6 marks]

(i) Provide a regular expression for each of the following token types:

Sample solution:
Identifiers: [A-Z][a-zA-Z0-9]*
Integers: [-]?[0-9]+
Reals: [.][0-9]+[x][+-][0-9]+


(ii) If our language had the additional token types listed below, identify any possible overlaps between token types and breifly explain how to resolve the conflicts.

Sample solution:
keywords and identifiers can't clash since keywords are all lowercase
   whereas identifiers must start with an uppercase, so no issues there

reals (and their embedded x,+,-) can't conflict with anything else because of the leading .

we do need to check integers before the - operator, however, since we'd want
   -12 to be treated as the integer -12 not the - operator followed by the integer 12


Question 2 [6 marks]

Show the following context free grammar is ambiguous by giving a single valid token sequence for an expression and two different parse trees for the sequence.
Token types:
   IDENTIFIER: [a-z]+
   INTEGER: [0-9]+
   ADD: [+]
   MULT: [*]
Grammar rules:
   expression --> multexpr ADD expression
   expression --> multexpr
   multexpr --> value
   multexpr --> multexpr MULT multexpr
   value --> IDENTIFIER
   value --> INTEGER

Sample solution:
The ambiguity stems from the multexpr --> multexpr MULT multexpr.
If we have a token sequence like x * y * z either of the two * operations
   could be parsed first, producing either of the following parse trees:

       multexpr
      /        \
     /          \
value         multexpr
   |          /      \
IDENTIFIER  multexpr  multexpr            equating to
  (x)        |          |                (x * (y * z))
           value      value
             |          |
         IDENTIFIER  IDENTIFIER
            (y)        (z)

          multexpr
         /        \
        /          \
  multexpr          multexpr               equating to
    /      \            \                  ((x * y) * z)
multexpr  multexpr      value
   |          |            \
 value      value        IDENTIFIER
   |          |             (z)
IDENTIFIER  IDENTIFIER
  (x)        (y)


Question 3 [6 marks]

For the grammar shown below,
(i) Explain the relative precedence levels of the +, *, and = operators, and their associativity (whether they evaluate left-to-right or right-to-left and why).
(ii) Explain whether or not this appears to be a sensible grammar for assignment statements and why.
Token types:
   IDENTIFIER: [a-z]+
   INTEGER: [0-9]+
   ADD: [+]
   ASSIGN: [=]
   MULT: [*]
   SEMI: [;]
Grammar rules:
   assignment --> IDENTIFIER ASSIGN expr SEMI
   expr --> subexpr
   expr --> subexpr op expr
   subexpr --> val
   subexpr --> subexpr ASSIGN val
   val --> IDENTIFIER | INTEGER
   op --> ADD | MULT
assignment is the 'starting' non-terminal

Sample solution:

(i) precedence and associativity

The leftmost assign (from the assignment rule) is a special case in that it
will always get parsed out first (evaluated last) and there can only be one.
The remainder of the discussion deals with precedence within the subexpr's.

The only way ADD/MULT can be processed is through expr rules,
    and all the expr rules must be applied before moving on to subexprs,
so all the ADD/MULTs must be parsed out before/above any of the ASSIGNs.
This means they'll be higher in the parse tree and evaluated later than ASSIGNs
   (since the tree is constructed top-down but evaluated bottom-up).
Thus ASSIGN has higher precedence thand ADD/MULT.

ADD and MULT are treated interchangably within expr, so have the same
    associativity and precedence.

In terms of associativity, if we consider x = y = z we can see in the subexpr rule
   that the rightmost = must be parsed first, since the thing to its right can only
   be a value (identifier or integer)
Thus it parses like
        =
       / \
      =   z
     / \
    x   y
which effectively means ((x = y) = z) so assignments evaluate left-to-right.

The rules for +/* are the opposite: the expr --> subexpr op expr means the portion
    with any other */+ must appear in the right subtree, so x * y * z gives
   *
  / \
 x   *           which effectively means (x * (y * z))
    / \          and implies */+ evaluate right to left
   y   z

(ii) is this sensible?

From a conventional viewpoint probably not: this is the opposite associativity to
"normal" math, and the opposite associativity to the way most languages handle assignment,
and the leftmost = operator has a different precedence level than all the other ones do.
Consider a = b + c = d = e + f is permitted and gives something like
         a = (b + (((c = d) = e) + f))?

(That doesn't mean it's unusable, and there may be a solid justification for it
 in some special-purpose language, but it would be counterintutitive for most
 programmers and probably not recommended in a general-purpose language.)


Question 4 [6 marks]

Suppose we have the grammar shown below and we require that variables must be declared before they are used and that strings and integers are incompatible types (you cannot mix integers and strings in assignments).
Token types:                       Grammar rules
   IDENTIFIER: [a-z]+                  program --> statements
   INTLIT: [0-9]+                      statements --> stmt
   STRLIT: ["][^\n]*["]                statements --> stmt statements
   ASSIGN: [=]                         stmt --> assignment
   SEMI: [;]                           stmt --> declaration
   INT: "integer"                      assignment --> IDENTIFIER ASSIGN value SEMI
   STR:  "string"                      declaration --> IDENTIFIER type SEMI
                                       type --> STR
                                       type --> INT
                                       value --> INTLIT
                                       value --> STRLIT
Briefly outline the type/declaration checking that would need to be performed in each of the relevant rules.

Sample solution:

We'd need some way to record/look up what variables have been declared
   and what type they were declared with, I'll assume here we're using
   some form of symbol table.
We'd also need to evaluate the data type associated with the right hand value
   in the assignment statement, so here I'll assume we can associate types with
   the nonterminals as we evaluate rules.

Given those assumptions, a workable checking approach would be:

value --> INTLIT
   action: associate datatype integer with this 'value' nonterminal
value --> STRLIT
   action: associate datatype string with this 'value' nonterminal

type --> STR
   action: associate datatype string with this 'type' nonterminal
type --> INT
   action: associate datatype integer with this 'type' nonterminal

declaration --> IDENTIFIER type SEMI
   action: lookup the identifier in the symbol table, error if previously declared
           insert the identifier into the table with whatever datatype is associated with the 'type' nonterminal

assignment --> IDENTIFIER ASSIGN value SEMI
   action: lookup the identifier in the symbol table, error if not declared yet
           check the datatype for that identifier (from the lookup) against the datatype
              associated with the 'value' nonterminal, error if the two datatypes don't match