Program syntax and semantics

Context-free grammars: these are used to represent the syntactic structure of a language.

Usually we assume that the basic tokens have been defined, and focus on the ways in which tokens can be strung together to form valid language statements.

The method of syntax description we will use is called Backus Naur form, or BNF.

BNF grammars use rules, terminals, and nonterminals to show how a valid language construct can be built:

For example, the following two rules describe an identifier list as either a single identifier or a list of identifiers separated by commas:

<identifier_list> -> identifier
<identifier_list> -> identifier, <identifier_list>
As a shorthand, when one non-terminal can derive several different possibilities, we use | as OR:
<identifier_list> -> identifier
                   | identifier, <identifier_list>
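These two rules translate directly into code. The sketch below (mine, not part of the notes) checks a string against the <identifier_list> grammar, handling rule 1 as the base case and rule 2 by recursing on the tail after the comma:

```python
import re

# Assumed token definition for "identifier": a letter or underscore
# followed by letters, digits, or underscores.
IDENT = re.compile(r"[A-Za-z_]\w*$")

def is_identifier_list(s):
    """<identifier_list> -> identifier | identifier, <identifier_list>"""
    head, sep, rest = s.partition(",")
    if not IDENT.match(head.strip()):
        return False                     # first token must be an identifier
    if sep == "":
        return True                      # rule 1: a single identifier
    return is_identifier_list(rest)      # rule 2: recurse on the rest of the list
```

Note how the recursion in the code mirrors the recursion in the grammar rule itself.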
Similarly, a rule to describe if-then-else statements in a language might look something like:
<if_statement> -> if <logic_expression> then <statement> else <statement>
Derivations: as discussed earlier for CFG derivations, the process of using the rules of a grammar to generate a valid program is called a derivation.

A derivation involves applying the available rules repeatedly until there are no non-terminals left.

Example 1: suppose the following rules are available for a small language

The "start" symbol for our derivation is <program>
<program> -> begin <statement_list> end
<statement_list> -> <statement>
                   | <statement>; <statement_list>
<statement> -> <variable> := <expression>
<variable> -> A | B | C
<expression> -> <variable> + <variable>
              | <variable> - <variable>
              | <variable>
Suppose we want to show that the following is a valid program under the language:
begin A := B + C; B := C end
One way to show its validity is to show a derivation of the program from the language rules:
<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <statement>; <statement> end
           -> begin <variable> := <expression>; <statement> end
           -> begin <variable> := <expression>; <variable> :=
<expression> end
           -> begin <variable> := <variable> + <variable>;
<variable> := <variable> end
           -> begin A := <variable> + <variable>; <variable> :=
<variable> end
           -> begin A := <variable> + <variable>; B := <variable> end
           -> begin A := <variable> + <variable>; B := C end
           -> begin A := <variable> + C; B := C end
           -> begin A := B + C; B := C end
Note, there may be more than one sequence in which the rules may be applied to derive the desired program.

Just for consistency and clarity, we will use leftmost derivations, in which we always apply the grammar rules to the leftmost available non-terminal symbol. For example:

<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <variable> := <expression>; <statement_list> end
           -> begin A := <expression>; <statement_list> end
           -> begin A := <variable> + <variable>; <statement_list>
end
           -> begin A := B + <variable>; <statement_list> end
           -> begin A := B + C; <statement_list> end
           -> begin A := B + C; <statement> end
           -> begin A := B + C; <variable> := <expression> end
           -> begin A := B + C; B := <expression> end
           -> begin A := B + C; B := <variable> end
           -> begin A := B + C; B := C end
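The same grammar that drives these derivations can also drive recognition. The sketch below (mine, not from the notes) is a recursive-descent recognizer for the small language, with one function per non-terminal; each function's body mirrors its grammar rule:

```python
import re

def tokenize(src):
    # Assumed token set for this small language.
    return re.findall(r"begin|end|:=|[ABC]|[+\-;]", src)

def parse_program(src):
    toks = tokenize(src)
    pos = 0

    def expect(t):
        nonlocal pos
        if pos < len(toks) and toks[pos] == t:
            pos += 1
            return True
        return False

    def variable():        # <variable> -> A | B | C
        return expect("A") or expect("B") or expect("C")

    def expression():      # <expression> -> <variable> [(+ | -) <variable>]
        if not variable():
            return False
        if expect("+") or expect("-"):
            return variable()
        return True

    def statement():       # <statement> -> <variable> := <expression>
        return variable() and expect(":=") and expression()

    def statement_list():  # <statement_list> -> <statement> [; <statement_list>]
        if not statement():
            return False
        if expect(";"):
            return statement_list()
        return True

    # <program> -> begin <statement_list> end, with no tokens left over
    return (expect("begin") and statement_list()
            and expect("end") and pos == len(toks))
```

Running `parse_program("begin A := B + C; B := C end")` succeeds, confirming the derivation shown above; an invalid program such as `begin A := ; end` is rejected.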
Problem: ambiguous grammars: as stated earlier, a grammar is ambiguous if there can be more than one valid parse tree to derive the same program.

Ambiguous grammars make it more difficult to determine the meaning of a program.

This is illustrated in the problem of operator precedence in expressions. Consider the grammar below:

<expression> -> <expression> + <expression>
              | <expression> - <expression>
              | <expression> * <expression>
              | <identifier>
<identifier> -> A | B | C
Then the expression A + B * C could have either of the following two parse trees:
      <expression>     TREE 1
     /       |    \
<expression> +     <expression>
     |                  /  |  \
<identifier>  <expression> * <expression>
     |              |             |
     A        <identifier>   <identifier>
                    |             |
                    B             C


TREE 2             <expression>
                   /    |    \        
       <expression>     *    <expression>
       /     |   \              |
<expression> + <expression>  <identifier>
    |                 |         |
<identifier>   <identifier>     C
    |                 |
    A                 B
The first tree would evaluate like (A + (B * C)), whereas the second tree would evaluate like ((A + B) * C), with substantially different results!

Similarly, a grammar must address problems of associativity for operators. For example, A + B + C yields the same result regardless of the order of evaluation, but A - B - C does not.

Our grammar must correctly define order of evaluation in all circumstances, and hence the grammar describing our language syntax must be unambiguous. (Not a trivial task!)

The grammar rules below effectively give expressions within brackets the highest evaluation priority, then multiplication operations, then addition operations:

<expression> -> <expression> + <term>
              | <term>
<term> -> <term> * <factor>
        | <factor>
<factor> -> ( <expression> )
          | <identifier>
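A recursive-descent parser for this unambiguous grammar makes the precedence visible. The sketch below (mine, not from the notes) has one function per non-terminal and returns a fully parenthesized string showing the grouping; the left-recursive rules are rewritten as loops, which also preserves left-associativity:

```python
import re

def parse(src):
    toks = re.findall(r"[A-Za-z]\w*|[()+*]", src)
    pos = 0

    def peek():
        return toks[pos] if pos < len(toks) else None

    def next_tok():
        nonlocal pos
        pos += 1
        return toks[pos - 1]

    def expression():      # <expression> -> <expression> + <term> | <term>
        node = term()
        while peek() == "+":
            next_tok()
            node = f"({node} + {term()})"
        return node

    def term():            # <term> -> <term> * <factor> | <factor>
        node = factor()
        while peek() == "*":
            next_tok()
            node = f"({node} * {factor()})"
        return node

    def factor():          # <factor> -> ( <expression> ) | <identifier>
        if peek() == "(":
            next_tok()
            node = expression()
            next_tok()     # consume the closing ")"
            return node
        return next_tok()  # an identifier

    return expression()
```

Here `parse("A + B * C")` yields `(A + (B * C))`, so multiplication binds tighter, and `parse("A + B + C")` yields `((A + B) + C)`, so addition associates to the left, exactly as the grammar dictates.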
Syntax graphs: the information contained in our set of grammar rules can also be represented through the use of syntax graphs.

Such graphs show the information visually (hence are often conceptually easier to read) but are less applicable for use in grammar manipulation tools.

Extending the language description: attribute grammars:

Not all language features can be described easily using context free grammars, and some cannot be described at all.

Producing an exhaustive set of type compatibility rules may be possible, but not practical. Defining rules such as "variables must be declared prior to being used" may be completely impossible in a context free grammar.

The static semantics of a language refer to such legalities regarding the forms of programs.

Attribute grammars are used to define the static semantics of a language, adding the following features to a context-free grammar: attributes (values associated with the grammar's terminal and nonterminal symbols), semantic functions (which compute attribute values within a rule), and predicate functions (which check that the static semantic rules are satisfied).

The grammar rules for the language are permitted to make use of the attributes and semantic/predicate functions to define additional language features.

For instance, if a variable is used as an expression, it must have the same type as the expected type of the expression.

To show this, we might augment our syntax rule with semantic and predicate rules as shown below:

Syntax rule from CFG: 
   <expression> -> <variable>
Added rules:
   Semantic rule: <expression>.actual_type <- <variable>.actual_type
   Predicate: <expression>.actual_type = <expression>.expected_type
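The semantic rule and predicate above can be mimicked in code. The sketch below is mine (the symbol table contents and function names are hypothetical, not from the notes): the semantic rule copies the variable's actual_type up to the expression, and the predicate then compares it with the expected type:

```python
# Hypothetical symbol table of declared variable types.
declared_types = {"A": "int", "B": "int", "C": "real"}

def check_expression(variable, expected_type):
    # Semantic rule: <expression>.actual_type <- <variable>.actual_type
    actual_type = declared_types[variable]
    # Predicate: <expression>.actual_type = <expression>.expected_type
    return actual_type == expected_type
```

So an assignment like `A := C` would fail the predicate here, since the expression's actual type (`real`) does not match the expected type (`int`).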

Describing (dynamic) semantics

Unfortunately, there is no universally accepted method for clearly and accurately describing all of a language's semantics.

Even the extra abilities enabled by the use of attribute grammars are insufficient to describe the required dynamic semantics.

Generally, the semantics of a language are described using informal English (or other natural language) descriptions and examples.

Several possibilities for formally describing dynamic semantics include: