Program syntax and semantics

Context-free grammars: these are used to represent the syntactic structure of a language.

Usually we assume that the basic tokens have been defined, and focus on the ways in which tokens can be strung together to form valid language statements.

The method of syntax description we will use is called Backus Naur form, or BNF.

BNF grammars use rules, terminals, and nonterminals to show how a valid language construct can be built:

For example, the following two rules describe an identifier list as either a single identifier or a list of identifiers separated by commas:

<identifier_list> -> identifier
<identifier_list> -> identifier, <identifier_list>
As a shorthand, when one non-terminal can derive several different possibilities, we use | as OR:
<identifier_list> -> identifier
                   | identifier, <identifier_list>
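These two rules translate directly into code. The sketch below (mine, not part of the notes) checks a string against the <identifier_list> grammar, handling rule 1 as the base case and rule 2 by recursing on the tail after the comma:

```python
import re

# Assumed token definition for "identifier": a letter or underscore
# followed by letters, digits, or underscores.
IDENT = re.compile(r"[A-Za-z_]\w*$")

def is_identifier_list(s):
    """<identifier_list> -> identifier | identifier, <identifier_list>"""
    head, sep, rest = s.partition(",")
    if not IDENT.match(head.strip()):
        return False                     # first token must be an identifier
    if sep == "":
        return True                      # rule 1: a single identifier
    return is_identifier_list(rest)      # rule 2: recurse on the rest of the list
```

Note how the recursion in the code mirrors the recursion in the grammar rule itself.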
Similarly, a rule to describe if-then-else statements in a language might look something like:
<if_statement> -> if <logic_expression> then <statement> else <statement>
Derivations: as discussed earlier for CFG derivations, the process of using the rules of a grammar to generate a valid program is called a derivation.

A derivation involves applying the available rules repeatedly until there are no non-terminals left.

Example 1: suppose the following rules are available for a small language

The "start" symbol for our derivation is <program>
<program> -> begin <statement_list> end
<statement_list> -> <statement>
                   | <statement>; <statement_list>
<statement> -> <variable> := <expression>
<variable> -> A | B | C
<expression> -> <variable> + <variable>
              | <variable> - <variable>
              | <variable>
Suppose we want to show that the following is a valid program under the language:
begin A := B + C; B := C end
One way to show its validity is to show a derivation of the program from the language rules:
<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <statement>; <statement> end
           -> begin <variable> := <expression>; <statement> end
           -> begin <variable> := <expression>; <variable> :=
<expression> end
           -> begin <variable> := <variable> + <variable>;
<variable> := <variable> end
           -> begin A := <variable> + <variable>; <variable> :=
<variable> end
           -> begin A := <variable> + <variable>; B := <variable> end
           -> begin A := <variable> + <variable>; B := C end
           -> begin A := <variable> + C; B := C end
           -> begin A := B + C; B := C end
Note, there may be more than one sequence in which the rules may be applied to derive the desired program.

Just for consistency and clarity, we will use leftmost derivations, in which we always apply the grammar rules to the leftmost available non-terminal symbol. For example:

<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <variable> := <expression>; <statement_list> end
           -> begin A := <expression>; <statement_list> end
           -> begin A := <variable> + <variable>; <statement_list>
end
           -> begin A := B + <variable>; <statement_list> end
           -> begin A := B + C; <statement_list> end
           -> begin A := B + C; <statement> end
           -> begin A := B + C; <variable> := <expression> end
           -> begin A := B + C; B := <expression> end
           -> begin A := B + C; B := <variable> end
           -> begin A := B + C; B := C end
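The same grammar that drives these derivations can also drive recognition. The sketch below (mine, not from the notes) is a recursive-descent recognizer for the small language, with one function per non-terminal; each function's body mirrors its grammar rule:

```python
import re

def tokenize(src):
    # Assumed token set for this small language.
    return re.findall(r"begin|end|:=|[ABC]|[+\-;]", src)

def parse_program(src):
    toks = tokenize(src)
    pos = 0

    def expect(t):
        nonlocal pos
        if pos < len(toks) and toks[pos] == t:
            pos += 1
            return True
        return False

    def variable():        # <variable> -> A | B | C
        return expect("A") or expect("B") or expect("C")

    def expression():      # <expression> -> <variable> [(+ | -) <variable>]
        if not variable():
            return False
        if expect("+") or expect("-"):
            return variable()
        return True

    def statement():       # <statement> -> <variable> := <expression>
        return variable() and expect(":=") and expression()

    def statement_list():  # <statement_list> -> <statement> [; <statement_list>]
        if not statement():
            return False
        if expect(";"):
            return statement_list()
        return True

    # <program> -> begin <statement_list> end, with no tokens left over
    return (expect("begin") and statement_list()
            and expect("end") and pos == len(toks))
```

Running `parse_program("begin A := B + C; B := C end")` succeeds, confirming the derivation shown above; an invalid program such as `begin A := ; end` is rejected.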
Problem: ambiguous grammars: as stated earlier, a grammar is ambiguous if there can be more than one valid parse tree to derive the same program.

Ambiguous grammars make it more difficult to determine the meaning of a program.

This is illustrated in the problem of operator precedence in expressions. Consider the grammar below:

<expression> -> <expression> + <expression>
              | <expression> - <expression>
              | <expression> * <expression>
              | <identifier>
<identifier> -> A | B | C
Then the expression A + B * C could have either of the following two parse trees:
      <expression>     TREE 1
     /       |    \
<expression> +     <expression>
     |                  /  |  \
<identifier>  <expression> * <expression>
     |              |             |
     A        <identifier>   <identifier>
                    |             |
                    B             C


TREE 2             <expression>
                   /    |    \        
       <expression>     *    <expression>
       /     |   \              |
<expression> + <expression>  <identifier>
    |                 |         |
<identifier>   <identifier>     C
    |                 |
    A                 B
The first tree would evaluate like (A + (B * C)), whereas the second tree would evaluate like ((A + B) * C), with substantially different results!

Similarly, a grammar must address problems of associativity for operators. For example, A + B + C yields the same result regardless of the order of evaluation, but A - B - C does not.

Our grammar must correctly define order of evaluation in all circumstances, and hence the grammar describing our language syntax must be unambiguous. (Not a trivial task!)

The grammar rules below effectively give expressions within brackets the highest evaluation priority, then multiplication operations, then addition operations:

<expression> -> <expression> + <term>
              | <term>
<term> -> <term> * <factor>
        | <factor>
<factor> -> ( <expression> )
          | <identifier>
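A recursive-descent parser for this unambiguous grammar makes the precedence visible. The sketch below (mine, not from the notes) has one function per non-terminal and returns a fully parenthesized string showing the grouping; the left-recursive rules are rewritten as loops, which also preserves left-associativity:

```python
import re

def parse(src):
    toks = re.findall(r"[A-Za-z]\w*|[()+*]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def next_tok():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def expression():      # <expression> -> <expression> + <term> | <term>
        node = term()
        while peek() == "+":
            next_tok()
            node = f"({node} + {term()})"
        return node

    def term():            # <term> -> <term> * <factor> | <factor>
        node = factor()
        while peek() == "*":
            next_tok()
            node = f"({node} * {factor()})"
        return node

    def factor():          # <factor> -> ( <expression> ) | <identifier>
        if peek() == "(":
            next_tok()
            node = expression()
            next_tok()     # consume the closing ")"
            return node
        return next_tok()  # an identifier

    return expression()
```

Here `parse("A + B * C")` yields `(A + (B * C))`, so multiplication binds tighter, and `parse("A + B + C")` yields `((A + B) + C)`, so addition associates to the left, exactly as the grammar dictates.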
Syntax graphs: the information contained in our set of grammar rules can also be represented through the use of syntax graphs.

Such graphs show the information visually (hence are often conceptually easier to read) but are less applicable for use in grammar manipulation tools.

Extending the language description: attribute grammars:

Not all language features can be described easily using context free grammars, and some cannot be described at all.

Producing an exhaustive set of type compatibility rules may be possible, but not practical. Defining rules such as "variables must be declared prior to being used" may be completely impossible in a context free grammar.

The static semantics of a language refer to such legalities regarding the forms of programs.

Attribute grammars are used to define the static semantics of a language, adding the following features to a context-free grammar: attributes (values associated with the grammar's terminal and nonterminal symbols), semantic functions (which compute attribute values within a rule), and predicate functions (which check that the static semantic rules are satisfied).

The grammar rules for the language are permitted to make use of the attributes and semantic/predicate functions to define additional language features.

For instance, if a variable is used as an expression, it must have the same type as the expected type of the expression.

To show this, we might augment our syntax rule with semantic and predicate rules as shown below:

Syntax rule from CFG: 
   <expression> -> <variable>
Added rules:
   Semantic rule: <expression>.actual_type <- <variable>.actual_type
   Predicate: <expression>.actual_type = <expression>.expected_type
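The semantic rule and predicate above can be mimicked in code. The sketch below is mine (the symbol table contents and function names are hypothetical, not from the notes): the semantic rule copies the variable's actual_type up to the expression, and the predicate then compares it with the expected type:

```python
# Hypothetical symbol table of declared variable types.
declared_types = {"A": "int", "B": "int", "C": "real"}

def check_expression(variable, expected_type):
    # Semantic rule: <expression>.actual_type <- <variable>.actual_type
    actual_type = declared_types[variable]
    # Predicate: <expression>.actual_type = <expression>.expected_type
    return actual_type == expected_type
```

So an assignment like `A := C` would fail the predicate here, since the expression's actual type (`real`) does not match the expected type (`int`).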

Describing (dynamic) semantics

Unfortunately, there is no universally accepted method for clearly and accurately describing all of a language's semantics.

Even the extra abilities enabled by the use of attribute grammars are insufficient to describe the required dynamic semantics.

Generally, the semantics of a language are described using informal English (or other natural language) descriptions and examples.

Several possibilities for formally describing dynamic semantics include: