Context-free grammars: used to represent the syntactic structure of a language.
Usually we assume that the basic tokens have been defined, and focus on the ways in which tokens can be strung together to form valid language statements.
The method of syntax description we will use is called Backus-Naur form, or BNF.
BNF grammars use rules, terminals, and nonterminals to show how a valid language construct can be built:
For example, the following two rules describe an identifier list as either a single identifier or a list of identifiers separated by commas:
<identifier_list> -> identifier
<identifier_list> -> identifier, <identifier_list>

As a shorthand, when one non-terminal can derive several different possibilities, we use | as OR:
<identifier_list> -> identifier | identifier, <identifier_list>

Similarly, a rule to describe if-then-else statements in a language might look something like:
<if_statement> -> if <logic_expression> then <statement> else <statement>

Derivations: as discussed earlier for CFGs, the process of using the rules of a grammar to generate a valid program is called a derivation.
A derivation involves applying the available rules repeatedly until there are no non-terminals left.
Example 1: suppose the following rules are available for a small language.
The "start" symbol for our derivation is <program>.

<program> -> begin <statement_list> end
<statement_list> -> <statement> | <statement>; <statement_list>
<statement> -> <variable> := <expression>
<variable> -> A | B | C
<expression> -> <variable> + <variable> | <variable> - <variable> | <variable>

Suppose we want to show that the following is a valid program under the language:
begin A := B + C; B := C end

One way to show its validity is to show a derivation of the program from the language rules:
<program> -> begin <statement_list> end
          -> begin <statement>; <statement_list> end
          -> begin <statement>; <statement> end
          -> begin <variable> := <expression>; <statement> end
          -> begin <variable> := <expression>; <variable> := <expression> end
          -> begin <variable> := <variable> + <variable>; <variable> := <expression> end
          -> begin <variable> := <variable> + <variable>; <variable> := <variable> end
          -> begin A := <variable> + <variable>; <variable> := <variable> end
          -> begin A := <variable> + <variable>; B := <variable> end
          -> begin A := <variable> + <variable>; B := C end
          -> begin A := <variable> + C; B := C end
          -> begin A := B + C; B := C end

Note that there may be more than one sequence in which the rules may be applied to derive the desired program.
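A derivation like the one above can be sketched in code. The following is a minimal sketch (the function and variable names are our own): the grammar is stored as a table of productions, and each step rewrites the leftmost non-terminal in the sentential form using a chosen production, until no non-terminals remain.

```python
# Grammar table: each non-terminal maps to its list of productions,
# where a production is a list of terminal/non-terminal symbols.
GRAMMAR = {
    "<program>":        [["begin", "<statement_list>", "end"]],
    "<statement_list>": [["<statement>"],
                         ["<statement>", ";", "<statement_list>"]],
    "<statement>":      [["<variable>", ":=", "<expression>"]],
    "<variable>":       [["A"], ["B"], ["C"]],
    "<expression>":     [["<variable>", "+", "<variable>"],
                         ["<variable>", "-", "<variable>"],
                         ["<variable>"]],
}

def derive(choices):
    """Starting from <program>, apply the production indices in `choices`
    to the leftmost non-terminal of the sentential form."""
    form = ["<program>"]
    for choice in choices:
        # find the leftmost non-terminal and replace it in place
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form[i:i + 1] = GRAMMAR[form[i]][choice]
    return form

# Production choices that derive:  begin A := B + C; B := C end
program = derive([0, 1, 0, 0, 0, 1, 2, 0, 0, 1, 2, 2])
print(" ".join(program))   # begin A := B + C ; B := C end
```

Choosing a different valid sequence of production indices would derive the same program by a different route, illustrating that derivations are not unique.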
Just for consistency and clarity, we will use leftmost derivations, in which we always apply the grammar rules to the leftmost available non-terminal symbol. For example:
<program> -> begin <statement_list> end
          -> begin <statement>; <statement_list> end
          -> begin <variable> := <expression>; <statement_list> end
          -> begin A := <expression>; <statement_list> end
          -> begin A := <variable> + <variable>; <statement_list> end
          -> begin A := B + <variable>; <statement_list> end
          -> begin A := B + C; <statement_list> end
          -> begin A := B + C; <statement> end
          -> begin A := B + C; <variable> := <expression> end
          -> begin A := B + C; B := <expression> end
          -> begin A := B + C; B := <variable> end
          -> begin A := B + C; B := C end

Problem: ambiguous grammars: as stated earlier, a grammar is ambiguous if there can be more than one valid parse tree to derive the same program.
Ambiguous grammars make it more difficult to identify the meaning of a program.
This is illustrated in the problem of operator precedence in expressions. Consider the grammar below:
<expression> -> <expression> + <expression> | <expression> * <expression> | <identifier>
<identifier> -> A | B | C

Then the expression A + B * C could have either of the following two parse trees:
TREE 1:
                <expression>
               /     |      \
     <expression>    +    <expression>
          |              /     |     \
     <identifier> <expression> * <expression>
          |            |             |
          A       <identifier>  <identifier>
                       |             |
                       B             C

TREE 2:
                   <expression>
                  /     |      \
        <expression>    *    <expression>
       /     |      \             |
<expression> + <expression>  <identifier>
     |              |             |
<identifier>  <identifier>        C
     |              |
     A              B

The first tree would evaluate like (A + (B * C)), whereas the second tree would evaluate like ((A + B) * C), with substantially different results!
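Plugging in sample values makes the difference concrete; the variable names and values here are just for illustration:

```python
# The two parse trees impose different groupings on A + B * C;
# with sample values the results differ, which is why ambiguity matters.
A, B, C = 2, 3, 4
tree1 = A + (B * C)   # TREE 1 grouping: * applied first
tree2 = (A + B) * C   # TREE 2 grouping: + applied first
print(tree1, tree2)   # 14 20
```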
Similarly, a grammar must address problems of associativity for operators. For example, A + B + C yields the same result regardless of the order of evaluation, but A - B - C does not.
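A quick check with sample values (illustrative only) shows why associativity matters for subtraction but not for addition:

```python
# Subtraction is not associative, so the grammar's grouping changes the answer;
# addition happens to give the same result either way.
A, B, C = 10, 4, 2
print((A - B) - C)                   # 4: left-associative grouping (the usual convention)
print(A - (B - C))                   # 8: right-associative grouping
print((A + B) + C == A + (B + C))    # True: grouping doesn't matter for +
```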
Our grammar must correctly define order of evaluation in all circumstances, and hence the grammar describing our language syntax must be unambiguous. (Not a trivial task!)
The grammar rules below effectively give expressions within brackets the highest evaluation priority, then multiplication operations, then addition operations:
<expression> -> <expression> + <term> | <term>
<term> -> <term> * <factor> | <factor>
<factor> -> ( <expression> ) | <identifier>

Syntax graphs: the information contained in our set of grammar rules can also be represented through the use of syntax graphs.
Such graphs show the information visually (hence are often conceptually easier to read) but are less applicable for use in grammar manipulation tools.
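The unambiguous precedence grammar above can be turned directly into a small evaluator. The sketch below (function names and token format are our own) uses recursive descent, with one function per non-terminal; the left-recursive rules are realized as loops, since a naive recursive implementation of left recursion would never terminate.

```python
# Recursive descent evaluator for:
#   <expression> -> <expression> + <term> | <term>
#   <term>       -> <term> * <factor> | <factor>
#   <factor>     -> ( <expression> ) | <identifier>
def evaluate(tokens, values):
    pos = 0                       # index of the next unconsumed token

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression():             # <expression> -> <expression> + <term> | <term>
        result = term()
        while peek() == "+":      # left recursion realized as a loop
            eat()
            result += term()
        return result

    def term():                   # <term> -> <term> * <factor> | <factor>
        result = factor()
        while peek() == "*":
            eat()
            result *= factor()
        return result

    def factor():                 # <factor> -> ( <expression> ) | <identifier>
        if peek() == "(":
            eat()
            result = expression()
            eat()                 # consume the closing ")"
            return result
        return values[eat()]      # identifier: look up its value

    return expression()

vals = {"A": 2, "B": 3, "C": 4}
print(evaluate(["A", "+", "B", "*", "C"], vals))             # 14: * binds tighter than +
print(evaluate(["(", "A", "+", "B", ")", "*", "C"], vals))   # 20: brackets bind tightest
```

Because each precedence level has its own non-terminal, the parser cannot produce two different trees for the same expression; the grammar's layering is what removes the ambiguity.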
Extending the language description: attribute grammars:
Not all language features can be described easily using context free grammars, and some cannot be described at all.
Producing an exhaustive set of type compatibility rules may be possible but is not practical, and defining rules such as "variables must be declared prior to being used" is impossible in a context-free grammar.
The static semantics of a language refer to such legalities regarding the forms of programs.
Attribute grammars are used to define the static semantics of a language, adding the following features to a context-free grammar: attributes (values attached to grammar symbols), semantic functions (which compute attribute values), and predicate functions (which check conditions on attribute values).
The grammar rules for the language are permitted to make use of the attributes and semantic/predicate functions to define additional language features.
For instance, if a variable is used as an expression, it must have the same type as the expected type of the expression.
To show this, we might augment our syntax rule with semantic and predicate rules as shown below:
Syntax rule from CFG:  <expression> -> <variable>

Added rules:
Semantic rule:  <expression>.actual_type <- <variable>.actual_type
Predicate:      <expression>.actual_type = <expression>.expected_type
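A tiny sketch of how these rules behave (the table contents and names are illustrative): the semantic rule copies actual_type up from the variable, and the predicate then compares it against the expected type from the surrounding context.

```python
# Variable types recorded from earlier declarations (illustrative data).
declared_types = {"count": "int", "name": "string"}

def check_expression(variable, expected_type):
    """Static semantic check for the rule <expression> -> <variable>."""
    actual_type = declared_types[variable]   # semantic rule: copy attribute up
    return actual_type == expected_type      # predicate: types must agree

print(check_expression("count", "int"))   # True: predicate holds
print(check_expression("name", "int"))    # False: static semantic error
```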
Even the extra descriptive power of attribute grammars is insufficient to describe a language's dynamic semantics.
Generally, the semantics of a language are described using informal English (or other natural language) descriptions and examples.
Several possibilities exist for formally describing dynamic semantics, including operational semantics, axiomatic semantics, and denotational semantics.
Operational semantics: a translator is needed to translate the source code into a low-level language appropriate to our virtual machine, and the translated code is then simulated on the machine.
The development of an effective translator and a representative virtual machine is the primary impediment to the practicality of operational semantic approaches.
Axiomatic semantics: to date, this approach seems more appropriate in the realm of program specification, testing, and proofs of correctness than in the practical description of program behavior.
Denotational semantics: the complexity of denotational descriptions has typically precluded interest by language users; however, they have been shown to enable the creation of programs which automatically generate language compilers. (I.e. you supply the denotational semantics, and the software produces a compiler for your language.)