Describing syntax and semantics

One of the problems encountered in the study of programming languages is "how do we clearly and accurately describe the features of the language?".

One part of the problem is the study and description of syntax: the actual form of the language's expressions, control statements, and program units.

The other part of the problem is the study and description of semantics: the meaning of those expressions, statements, and units.

We would like description techniques that support intuitive understanding of a language (i.e. the description actually helps people understand the language), but we also need description techniques that allow the development of tools to work with the language (compilers, debuggers, etc.).

Formal methods for syntax description

We have a precise and accepted format for describing syntax. This involves the use of another language (a metalanguage: a language used to describe other languages).

The technique we will discuss involves regular expressions and context free grammars:

Regular expressions are used to describe sets of valid strings. For instance, in some programming language we might want to describe identifiers as the set of alphanumeric strings which are at least one character long.

The rules for creating regular expressions are as follows:

   any single symbol of the alphabet is a regular expression
   if R and S are regular expressions, then so are:
      RS   (concatenation: an R followed by an S)
      R|S  (alternation: either an R or an S)
      R*   (repetition: zero or more Rs)
   brackets may be used for grouping

For example, suppose our alphabet is the set of digits {0, 1, 2, ..., 9}. Then (0|1|...|9)(0|1|...|9)* describes the set of digit strings that are at least one character long.
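The identifier description given earlier (alphanumeric strings at least one character long) can be expressed directly with Python's re module. The pattern below is one possible choice, assumed for this sketch:

```python
import re

# Identifiers as described above: alphanumeric strings at least one
# character long. The exact pattern is an assumption for this sketch.
IDENTIFIER = re.compile(r"[A-Za-z0-9]+")

def is_identifier(s):
    """Return True if the whole string is a valid identifier."""
    return IDENTIFIER.fullmatch(s) is not None

print(is_identifier("count1"))  # True
print(is_identifier(""))        # False: must be at least one character
```

Note the use of fullmatch rather than match: the entire string must belong to the set the expression describes, not just a prefix.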

Context free grammars: these are used to represent the syntactic structure of the language.

Usually we assume that the basic tokens have been defined, and focus on the ways in which tokens can be strung together to form valid language statements.

The method of syntax description we will use is called Backus Naur form, or BNF.

BNF grammars use rules, terminals (the basic tokens of the language, which appear literally in programs), and nonterminals (abstractions, written here in angle brackets, representing classes of language constructs) to show how a valid language construct can be built:

For example, the following two rules describe an identifier list as either a single identifier or a list of identifiers separated by commas:

<identifier_list> -> identifier
<identifier_list> -> identifier, <identifier_list>
As a shorthand, when one non-terminal can derive several different possibilities, we use | as OR:
<identifier_list> -> identifier
                   | identifier, <identifier_list>
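These two rules translate directly into a recursive recognizer: match one identifier, and if a comma follows, recurse for the rest of the list. A minimal Python sketch (the token representation is an assumption):

```python
def is_identifier_list(tokens):
    # <identifier_list> -> identifier
    #                    | identifier , <identifier_list>
    if not tokens or tokens[0] != "identifier":
        return False
    if len(tokens) == 1:
        return True  # first rule: a single identifier
    # second rule: identifier , <identifier_list>
    return tokens[1] == "," and is_identifier_list(tokens[2:])

print(is_identifier_list(["identifier"]))                     # True
print(is_identifier_list(["identifier", ",", "identifier"]))  # True
print(is_identifier_list(["identifier", ","]))                # False
```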
Similarly, a rule to describe if-then-else statements in a language might look something like:
<if_statement> -> if <logic_expression> then <statement> else <statement>
Derivations: the process of using the rules of a grammar to generate a valid program is called a derivation.

A derivation involves applying the available rules repeatedly until there are no non-terminals left.

Example 1: suppose the following rules are available for a small language

The "start" symbol for our derivation is <program>
<program> -> begin <statement_list> end
<statement_list> -> <statement>
                   | <statement>; <statement_list>
<statement> -> <variable> := <expression>
<variable> -> A | B | C
<expression> -> <variable> + <variable>
              | <variable> - <variable>
              | <variable>
Suppose we want to show that the following is a valid program under the language:
begin A := B + C; B := C end
One way to show its validity is to show a derivation of the program from the language rules:
<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <statement>; <statement> end
           -> begin <variable> := <expression>; <statement> end
           -> begin <variable> := <expression>; <variable> := <expression> end
           -> begin <variable> := <variable> + <variable>; <variable> := <expression> end
           -> begin <variable> := <variable> + <variable>; <variable> := <variable> end
           -> begin A := <variable> + <variable>; <variable> := <variable> end
           -> begin A := <variable> + <variable>; B := <variable> end
           -> begin A := <variable> + <variable>; B := C end
           -> begin A := <variable> + C; B := C end
           -> begin A := B + C; B := C end
Note, there may be more than one sequence in which the rules may be applied to derive the desired program.

Just for consistency and clarity, we will use leftmost derivations, in which we always apply the grammar rules to the leftmost available non-terminal symbol. For example:

<program> -> begin <statement_list> end
           -> begin <statement>; <statement_list> end
           -> begin <variable> := <expression>; <statement_list> end
           -> begin A := <expression>; <statement_list> end
           -> begin A := <variable> + <variable>; <statement_list>
end
           -> begin A := B + <variable>; <statement_list> end
           -> begin A := B + C; <statement_list> end
           -> begin A := B + C; <statement> end
           -> begin A := B + C; <variable> := <expression> end
           -> begin A := B + C; B := <expression> end
           -> begin A := B + C; B := <variable> end
           -> begin A := B + C; B := C end
Parse trees: the derivation of a particular program can also be shown as a parse tree, in which the root of the tree is the start symbol, the internal nodes are the non terminals, and the leaves are the terminals.

For example (using the same program as above)

               ____<program>_____
              /      |           \
         begin  <statement_list>  end
               /     |          \
      <statement>    ;    <statement_list>
     /     |     \                    \
<variable> :=  <expression>        <statement>
    |         /    |    \         /    |     \
    A  <variable>  +  <variable> <variable> := <expression>
           |              |         |              |
           B              C         B         <variable>
                                                   |
                                                   C
Problem: ambiguous grammars: a grammar is ambiguous if there can be more than one valid parse tree to derive the same program.

Since parse trees are often used by compilers during code generation, ambiguous grammars make it more difficult to identify the meaning of a program.

This is illustrated in the problem of operator precedence in expressions. Consider the grammar below:

<expression> -> <expression> + <expression>
              | <expression> - <expression>
              | <expression> * <expression>
              | <identifier>
<identifier> -> A | B | C
Then the expression A + B * C could have either of the following two parse trees:
      <expression>     TREE 1
     /       |    \
<expression> +     <expression>
     |                  /  |  \
<identifier>  <expression> * <expression>
     |              |             |
     A        <identifier>   <identifier>
                    |             |
                    B             C


TREE 2             <expression>
                   /    |    \        
       <expression>     *    <expression>
       /     |   \              |
<expression> + <expression>  <identifier>
    |                 |         |
<identifier>   <identifier>     C
    |                 |
    A                 B
The first tree would evaluate like (A + (B * C)), whereas the second tree would evaluate like ((A + B) * C), with substantially different results! (For instance, with A = 2, B = 3, C = 4 the first gives 14 and the second gives 20.)

Similarly, a grammar must address problems of associativity for operators. For example, A + B + C yields the same result regardless of the order of evaluation, but A - B - C does not.

Our grammar must correctly define order of evaluation in all circumstances, and hence the grammar describing our language syntax must be unambiguous. (Not a trivial task!)

The grammar rules below effectively give expressions within brackets the highest evaluation priority, then multiplication and division operations, then addition and subtraction operations.

<expression> -> <expression> + <pterm>
<expression> -> <expression> - <pterm>
<expression> -> <pterm>
<pterm> -> <pterm> * <final>
<pterm> -> <pterm> / <final>
<pterm> -> <final>
<final> -> ( <expression> )
<final> -> <value>
<value> -> <number>
<value> -> <variable>
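The precedence grammar above maps naturally onto a recursive-descent evaluator: one function per nonterminal, with the grammar's left recursion rewritten as iteration (which preserves left associativity). A minimal Python sketch, restricted to numbers for simplicity:

```python
def evaluate(tokens):
    tokens = list(tokens)
    value = eval_expression(tokens)
    assert not tokens, "unconsumed tokens"
    return value

def eval_expression(tokens):
    # <expression> -> <expression> + <pterm> | <expression> - <pterm> | <pterm>
    value = eval_pterm(tokens)
    while tokens and tokens[0] in ("+", "-"):
        op = tokens.pop(0)
        right = eval_pterm(tokens)
        value = value + right if op == "+" else value - right
    return value

def eval_pterm(tokens):
    # <pterm> -> <pterm> * <final> | <pterm> / <final> | <final>
    value = eval_final(tokens)
    while tokens and tokens[0] in ("*", "/"):
        op = tokens.pop(0)
        right = eval_final(tokens)
        value = value * right if op == "*" else value / right
    return value

def eval_final(tokens):
    # <final> -> ( <expression> ) | <value>
    tok = tokens.pop(0)
    if tok == "(":
        value = eval_expression(tokens)
        tokens.pop(0)  # discard the closing bracket
        return value
    return int(tok)  # <value> -> <number>

print(evaluate("2 + 3 * 4".split()))      # 14: * binds tighter than +
print(evaluate("( 2 + 3 ) * 4".split()))  # 20: brackets override
print(evaluate("8 - 4 - 2".split()))      # 2: left associative
```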
Observations:

   the left recursive rules make +, -, * and / left associative
   the further a rule is from the start symbol, the higher the priority of its operator (brackets highest, then * and /, then + and -)
   each valid expression now has exactly one parse tree


A PORTION OF THE C GRAMMAR:

Below is a subset of the grammar rules for C statements, with terminals in boldface. (The complete C syntax grammar takes about 350 lines to express in this style.)

statement:
   labeled-statement
   expression-statement
   compound-statement
   selection-statement
   iteration-statement
   jump-statement

labeled-statement:
   identifier : statement
   case constant-expression : statement
   default: statement

expression-statement:
   expression ;

compound-statement:
   { declaration-list statement-list }

statement-list:
   statement
   statement-list statement

selection-statement:
   if ( expression ) statement
   if ( expression ) statement else statement
   switch ( expression ) statement

iteration-statement:
   while ( expression ) statement
   do statement while ( expression );

jump-statement:
   goto identifier ;
   continue ;
   break ;
   return expression ;
Meanwhile, to maintain the precedence rules for the language, here is the breakdown of some of the expression grammar for C:
expression:
   assignment-expression
   expression , assignment-expression

assignment-expression:
   conditional-expression
   unary-expression assignment-operator assignment-expression
Note: the assignment operators are = += -= /= etc

conditional-expression:
   logical-OR-expression
   logical-OR-expression ? expression : conditional-expression

logical-OR-expression:
   logical-AND-expression
   logical-OR-expression || logical-AND-expression

logical-AND-expression:
   inclusive-OR-expression
   logical-AND-expression && inclusive-OR-expression

inclusive-OR-expression:
   exclusive-OR-expression
   inclusive-OR-expression | exclusive-OR-expression

... and so on through each of the different operator types
    in increasing order of operator precedence ...

Note, , has the lowest precedence,
then the assignment operators,
then ?:,
then ||,
then &&,
etc.

So the expression w || x | y && z, which has form
identifier || identifier | identifier && identifier
would be derived as follows:

expression 

 -> assignment-expression

 -> conditional-expression

 -> logical-OR-expression 

 -> logical-OR-expression || logical-AND-expression

 ... skip the long sequence going through all the expression types
     and finally getting us to the identifier in w's place ...

 -> identifier || logical-AND-expression

 -> identifier || logical-AND-expression && inclusive-OR-expression

 -> identifier || inclusive-OR-expression && inclusive-OR-expression

 -> identifier || inclusive-OR-expression | exclusive-OR-expression && inclusive-OR-expression

 ... skip the long sequence going through all the expression types
     and finally getting us to the identifier in x's place ...

 -> identifier || identifier | exclusive-OR-expression && inclusive-OR-expression

 ... skip the long sequence going through all the expression types
     and finally getting us to the identifier in y's place ...

 -> identifier || identifier | identifier && inclusive-OR-expression

 ... similarly for the identifier in z's place

 -> identifier || identifier | identifier && identifier

Syntax graphs: the information contained in our set of grammar rules can also be represented through the use of syntax graphs.

Such graphs show the information visually (hence are often conceptually easier to read) but are less applicable for use in grammar manipulation tools.

Extending the language description: attribute grammars:

Not all language features can be described easily using context free grammars, and some cannot be described at all.

Producing an exhaustive set of type compatibility rules may be possible, but is not practical. Rules such as "variables must be declared prior to being used" cannot be expressed in a context free grammar at all.
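Although a context free grammar cannot capture "declare before use", a separate pass over the program with a symbol table can. A minimal Python sketch, where the (kind, name) statement representation is invented purely for this illustration:

```python
# "Declare before use" checked with a symbol table, as a pass
# outside the grammar itself.

def check_declared(statements):
    declared = set()
    for kind, name in statements:
        if kind == "declare":
            declared.add(name)
        elif kind == "use" and name not in declared:
            return False  # used before its declaration
    return True

print(check_declared([("declare", "x"), ("use", "x")]))  # True
print(check_declared([("use", "y"), ("declare", "y")]))  # False
```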

The static semantics of a language refer to such legalities regarding the forms of programs.

Attribute grammars are used to define the static semantics of a language, using the following additional features:

   attributes: values (such as types) associated with the terminal and nonterminal symbols of the grammar
   semantic functions: rules that compute the attribute values for a symbol from the attributes of other symbols in the same grammar rule
   predicate functions: boolean conditions on the attribute values that must hold for the construct to be legal

The grammar rules for the language are permitted to make use of the attributes and semantic/predicate functions to define additional language features.

For instance, if a variable is used as an expression, it must have the same type as the expected type of the expression.

To show this, we might augment our syntax rule with semantic and predicate rules as shown below:

Syntax rule: <expression> -> <variable>
Semantic rule: <expression>.actual_type <- <variable>.actual_type
Predicate: <expression>.actual_type = <expression>.expected_type
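These rules can be mirrored in code: the semantic rule copies the attribute up the tree, and the predicate compares attributes. A minimal Python sketch (the class names are invented for this illustration):

```python
class Variable:
    def __init__(self, name, actual_type):
        self.name = name
        self.actual_type = actual_type

class Expression:
    def __init__(self, variable, expected_type):
        # Semantic rule: <expression>.actual_type <- <variable>.actual_type
        self.actual_type = variable.actual_type
        self.expected_type = expected_type

    def check(self):
        # Predicate: <expression>.actual_type = <expression>.expected_type
        return self.actual_type == self.expected_type

v = Variable("A", "int")
print(Expression(v, "int").check())   # True
print(Expression(v, "real").check())  # False
```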

Describing (dynamic) semantics

Unfortunately, there is no universally accepted method for clearly and accurately describing all of a language's semantics.

Even the extra abilities enabled by the use of attribute grammars are insufficient to describe the required dynamic semantics.

Generally, the semantics of a language are described using informal English (or other natural language) descriptions and examples.

There are several techniques for describing dynamic semantics. We will not consider these in any great detail, but can summarize them as follows: