Describing syntax and semantics

One of the problems encountered in the study of programming languages is "how do we clearly and accurately describe the features of the language?".

One part of the problem is the study and description of syntax: the actual form of the language's expressions, control statements, and program units.

The other part of the problem is the study and description of semantics: the meaning of those expressions, statements, and units.

We would like description techniques that support intuitive understanding of a language (i.e. the description actually helps people understand the language), but we also need description techniques that allow the development of tools to work with the language (compilers, debuggers, etc.).

Formal methods for syntax description

We have a precise and accepted format for describing syntax. This involves the use of another language (a metalanguage: a language used to describe other languages).

The technique we will discuss involves regular expressions and context free grammars:

Regular expressions are used to describe sets of valid strings. For instance, in some programming language we might want to describe identifiers as the set of alphanumeric strings which are at least one character long.

The rules for creating regular expressions are as follows:

   any single symbol of the alphabet is a regular expression
   if R and S are regular expressions, then so are:
      RS   (concatenation: an R followed by an S)
      R|S  (alternation: either an R or an S)
      R*   (repetition: zero or more Rs)
   brackets may be used for grouping

For example, suppose our alphabet is the set of digits {0, 1, 2, ..., 9}. Then (0|1|...|9)(0|1|...|9)* describes the set of digit strings that are at least one character long.
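The identifier description given earlier (alphanumeric strings at least one character long) can be expressed directly with Python's re module. The pattern below is one possible choice, assumed for this sketch:

```python
import re

# Identifiers as described above: alphanumeric strings at least one
# character long. The exact pattern is an assumption for this sketch.
IDENTIFIER = re.compile(r"[A-Za-z0-9]+")

def is_identifier(s):
    """Return True if the whole string is a valid identifier."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_identifier("count1"))  # True
print(is_identifier(""))        # False: must be at least one character
```

Note the use of fullmatch rather than match: the entire string must belong to the set the expression describes, not just a prefix.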

Context free grammars: these are used to represent the syntactic structure of the language.

Usually we assume that the basic tokens have been defined, and focus on the ways in which tokens can be strung together to form valid language statements.

The method of syntax description we will use is called Backus Naur form, or BNF.

BNF grammars use rules, terminals (the basic tokens of the language, which appear literally in programs), and nonterminals (abstractions, written here in angle brackets, representing classes of language constructs) to show how a valid language construct can be built:

For example, the following two rules describe an identifier list as either a single identifier or a list of identifiers separated by commas:

<identifier_list> -> identifier
<identifier_list> -> identifier, <identifier_list>
As a shorthand, when one non-terminal can derive several different possibilities, we use | as OR:
<identifier_list> -> identifier
                   | identifier, <identifier_list>
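These two rules translate directly into a recursive recognizer: match one identifier, and if a comma follows, recurse for the rest of the list. A minimal Python sketch (the token representation is an assumption):

```python
def is_identifier_list(tokens):
    # <identifier_list> -> identifier
    #                    | identifier , <identifier_list>
    if not tokens or tokens[0] != "identifier":
        return False
    if len(tokens) == 1:
        return True  # first rule: a single identifier
    # second rule: identifier , <identifier_list>
    return tokens[1] == "," and is_identifier_list(tokens[2:])

print(is_identifier_list(["identifier"]))                     # True
print(is_identifier_list(["identifier", ",", "identifier"]))  # True
print(is_identifier_list(["identifier", ","]))                # False
```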
Similarly, a rule to describe if-then-else statements in a language might look something like:
<if_statement> -> if <logic_expression> then <statement> else <statement>
Derivations: the process of using the rules of a grammar to generate a valid program is called a derivation.

A derivation involves applying the available rules repeatedly until there are no non-terminals left.

Example 1: suppose the following rules are available for a small language

The "start" symbol for our derivation is <program>
<program> -> begin <statement_list> end
<statement_list> -> <statement>
                   | <statement>; <statement_list>
<statement> -> <variable> := <expression>
<variable> -> A | B | C
<expression> -> <variable> + <variable>
              | <variable> - <variable>
              | <variable>
Suppose we want to show that the following is a valid program under the language:
begin A := B + C; B := C end
One way to show its validity is to show a derivation of the program from the language rules:
<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <statement>; <statement> end
           -> begin <variable> := <expression>; <statement> end
           -> begin <variable> := <expression>; <variable> := <expression> end
           -> begin <variable> := <variable> + <variable>; <variable> := <expression> end
           -> begin <variable> := <variable> + <variable>; <variable> := <variable> end
           -> begin A := <variable> + <variable>; <variable> := <variable> end
           -> begin A := <variable> + <variable>; B := <variable> end
           -> begin A := <variable> + <variable>; B := C end
           -> begin A := <variable> + C; B := C end
           -> begin A := B + C; B := C end
Note, there may be more than one sequence in which the rules may be applied to derive the desired program.

Just for consistency and clarity, we will use leftmost derivations, in which we always apply the grammar rules to the leftmost available non-terminal symbol. For example:

<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <variable> := <expression>; <statement_list> end
           -> begin A := <expression>; <statement_list> end
           -> begin A := <variable> + <variable>; <statement_list>
end
           -> begin A := B + <variable>; <statement_list> end
           -> begin A := B + C; <statement_list> end
           -> begin A := B + C; <statement> end
           -> begin A := B + C; <variable> := <expression> end
           -> begin A := B + C; B := <expression> end
           -> begin A := B + C; B := <variable> end
           -> begin A := B + C; B := C end
Parse trees: the derivation of a particular program can also be shown as a parse tree, in which the root of the tree is the start symbol, the internal nodes are the non terminals, and the leaves are the terminals.

For example (using the same program as above)

               ____<program>_____
              /      |           \
         begin  <statement_list>  end
               /     |          \
      <statement>    ;    <statement_list>
     /     |     \                    \
<variable> :=  <expression>        <statement>
    |         /    |    \         /    |     \
    A  <variable>  +  <variable> <variable> := <expression>
           |              |         |              |
           B              C         B         <variable>
                                                   |
                                                   C
Problem: ambiguous grammars: a grammar is ambiguous if there can be more than one valid parse tree to derive the same program.

Since parse trees are often used by compilers during code generation, ambiguous grammars make it more difficult to identify the meaning of a program.

This is illustrated in the problem of operator precedence in expressions. Consider the grammar below:

<expression> -> <expression> + <expression>
              | <expression> - <expression>
              | <expression> * <expression>
              | <identifier>
<identifier> -> A | B | C
Then the expression A + B * C could have either of the following two parse trees:
      <expression>     TREE 1
     /       |    \
<expression> +     <expression>
     |                  /  |  \
<identifier>  <expression> * <expression>
     |              |             |
     A        <identifier>   <identifier>
                    |             |
                    B             C


TREE 2             <expression>
                   /    |    \        
       <expression>     *    <expression>
       /     |   \              |
<expression> + <expression>  <identifier>
    |                 |         |
<identifier>   <identifier>     C
    |                 |
    A                 B
The first tree would evaluate like (A + (B * C)), whereas the second tree would evaluate like ((A + B) * C), with substantially different results! (For instance, with A = 2, B = 3, C = 4 the first gives 14 and the second gives 20.)

Similarly, a grammar must address problems of associativity for operators. For example, A + B + C yields the same result regardless of the order of evaluation, but A - B - C does not.

Our grammar must correctly define order of evaluation in all circumstances, and hence the grammar describing our language syntax must be unambiguous. (Not a trivial task!)

The grammar rules below effectively give expressions within brackets the highest evaluation priority, then multiplication and division operations, then addition and subtraction operations.

<expression> -> <expression> + <pterm>
<expression> -> <expression> - <pterm>
<expression> -> <pterm>
<pterm> -> <pterm> * <final>
<pterm> -> <pterm> / <final>
<pterm> -> <final>
<final> -> ( <expression> )
<final> -> <value>
<value> -> <number>
<value> -> <variable>
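The precedence grammar above maps naturally onto a recursive-descent evaluator: one function per nonterminal, with the grammar's left recursion rewritten as iteration (which preserves left associativity). A minimal Python sketch, restricted to numbers for simplicity:

```python
def evaluate(tokens):
    tokens = list(tokens)
    value = eval_expression(tokens)
    assert not tokens, "unconsumed tokens"
    return value

def eval_expression(tokens):
    # <expression> -> <expression> + <pterm> | <expression> - <pterm> | <pterm>
    value = eval_pterm(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        right = eval_pterm(tokens)
        value = value + right if op == "+" else value - right
    return value

def eval_pterm(tokens):
    # <pterm> -> <pterm> * <final> | <pterm> / <final> | <final>
    value = eval_final(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        right = eval_final(tokens)
        value = value * right if op == "*" else value / right
    return value

def eval_final(tokens):
    # <final> -> ( <expression> ) | <value>
    tok = tokens.pop(0)
    if tok == "(":
        value = eval_expression(tokens)
        tokens.pop(0)  # discard the closing bracket
        return value
    return int(tok)  # <value> -> <number>

print(evaluate("2 + 3 * 4".split()))      # 14: * binds tighter than +
print(evaluate("( 2 + 3 ) * 4".split()))  # 20: brackets override
print(evaluate("8 - 4 - 2".split()))      # 2: left associative
```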
Observations:

   the left recursive rules make +, -, * and / left associative
   the further a rule is from the start symbol, the higher the priority of its operator (brackets highest, then * and /, then + and -)
   each valid expression now has exactly one parse tree


A PORTION OF THE C GRAMMAR:

Below is a subset of the grammar rules for C statements, with terminals in boldface. (The complete C syntax grammar takes about 350 lines to express in this style.)

statement:
   labeled-statement
   expression-statement
   compound-statement
   selection-statement
   iteration-statement
   jump-statement

labeled-statement:
   identifier : statement
   case constant-expression : statement
   default: statement

expression-statement:
   expression ;

compound-statement:
   { declaration-list statement-list }

statement-list:
   statement
   statement-list statement

selection-statement:
   if ( expression ) statement
   if ( expression ) statement else statement
   switch ( expression ) statement

iteration-statement:
   while ( expression ) statement
   do statement while ( expression );

jump-statement:
   goto identifier ;
   continue ;
   break ;
   return expression ;
Meanwhile, to maintain the precedence rules for the language, here is the breakdown of some of the expression grammar for C:
expression:
   assignment-expression
   expression , assignment-expression

assignment-expression:
   conditional-expression
   unary-expression assignment-operator assignment-expression
Note: the assignment operators are = += -= /= etc

conditional-expression:
   logical-OR-expression
   logical-OR-expression ? expression : conditional-expression

logical-OR-expression:
   logical-AND-expression
   logical-OR-expression || logical-AND-expression

logical-AND-expression:
   inclusive-OR-expression
   logical-AND-expression && inclusive-OR-expression

inclusive-OR-expression:
   exclusive-OR-expression
   inclusive-OR-expression | exclusive-OR-expression

... and so on through each of the different operator types
    in increasing order of operator precedence ...

Note, , has the lowest precedence,
then the assignment operators,
then ?:,
then ||,
then &&,
etc.

So the expression w || x | y && z, which has form
identifier || identifier | identifier && identifier
would be derived as follows:

expression 

 -> assignment-expression

 -> conditional-expression

 -> logical-OR-expression 

 -> logical-OR-expression || logical-AND-expression

 ... skip the long sequence going through all the expression types
     and finally getting us to the identifier in w's place ...

 -> identifier || logical-AND-expression

 -> identifier || logical-AND-expression && inclusive-OR-expression

 -> identifier || inclusive-OR-expression && inclusive-OR-expression

 -> identifier || inclusive-OR-expression | exclusive-OR-expression && inclusive-OR-expression

 ... skip the long sequence going through all the expression types
     and finally getting us to the identifier in x's place ...

 -> identifier || identifier | exclusive-OR-expression && inclusive-OR-expression

 ... skip the long sequence going through all the expression types
     and finally getting us to the identifier in y's place ...

 -> identifier || identifier | identifier && inclusive-OR-expression

 ... similarly for the identifier in z's place

 -> identifier || identifier | identifier && identifier

Syntax graphs: the information contained in our set of grammar rules can also be represented through the use of syntax graphs.

Such graphs show the information visually (hence are often conceptually easier to read) but are less applicable for use in grammar manipulation tools.

Extending the language description: attribute grammars:

Not all language features can be described easily using context free grammars, and some cannot be described at all.

Producing an exhaustive set of type compatibility rules may be possible, but is not practical. Rules such as "variables must be declared prior to being used" cannot be expressed in a context free grammar at all.
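Although a context free grammar cannot capture "declare before use", a separate pass over the program with a symbol table can. A minimal Python sketch, where the (kind, name) statement representation is invented purely for this illustration:

```python
# "Declare before use" checked with a symbol table, as a pass
# outside the grammar itself.

def check_declared(statements):
    declared = set()
    for kind, name in statements:
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            return False  # used before its declaration
    return True

print(check_declared([("declare", "x"), ("use", "x")]))  # True
print(check_declared([("use", "y"), ("declare", "y")]))  # False
```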

The static semantics of a language refer to such legalities regarding the forms of programs.

Attribute grammars are used to define the static semantics of a language, using the following additional features:

   attributes: values (such as types) associated with the terminal and nonterminal symbols of the grammar
   semantic functions: rules that compute the attribute values for a symbol from the attributes of other symbols in the same grammar rule
   predicate functions: boolean conditions on the attribute values that must hold for the construct to be legal

The grammar rules for the language are permitted to make use of the attributes and semantic/predicate functions to define additional language features.

For instance, if a variable is used as an expression, it must have the same type as the expected type of the expression.

To show this, we might augment our syntax rule with semantic and predicate rules as shown below:

Syntax rule: <expression> -> <variable>
Semantic rule: <expression>.actual_type <- <variable>.actual_type
Predicate: <expression>.actual_type = <expression>.expected_type
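These rules can be mirrored in code: the semantic rule copies the attribute up the tree, and the predicate compares attributes. A minimal Python sketch (the class names are invented for this illustration):

```python
class Variable:
    def __init__(self, name, actual_type):
        self.name = name
        self.actual_type = actual_type

class Expression:
    def __init__(self, variable, expected_type):
        # Semantic rule: <expression>.actual_type <- <variable>.actual_type
        self.actual_type = variable.actual_type
        self.expected_type = expected_type

    def check(self):
        # Predicate: <expression>.actual_type = <expression>.expected_type
        return self.actual_type == self.expected_type

v = Variable("A", "int")
print(Expression(v, "int").check())   # True
print(Expression(v, "real").check())  # False
```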

Describing (dynamic) semantics

Unfortunately, there is no universally accepted method for clearly and accurately describing all of a language's semantics.

Even the extra abilities enabled by the use of attribute grammars are insufficient to describe the required dynamic semantics.

Generally, the semantics of a language are described using informal English (or other natural language) descriptions and examples.

There are several techniques for describing dynamic semantics. We will not consider these in any great detail, but can summarize them as follows: