Simple hand-crafted parser for a backtrack-free grammar

Hand-crafted parser for a simple backtrack-free grammar

Here we introduce a rudimentary parser for a simple grammar, backtrack free using lookahead of one symbol.

The program assumes the source code has already been scanned and a token list generated and stored in tokens.h. (I mentioned the parser was rudimentary, right?)

The code directory includes

this document
parser.c: mostly one big main :)
parser.h: type definitions for tokens, non terminals
tokens.h: contains the token sequence to be parsed
a makefile: to build/clean the beastie

The language token types are

INT, REAL (representing datatypes)
VAR (for variable identifiers)
INUM (for integer literals)
RNUM (for real number literals)
PRINT (a built in print function)
EQ (for the assignment operator)
TERM (for the statement terminator, e.g. a semicolon)

The grammar it recognizes is as follows:

prog --> list
list --> stmt list
       | nil
stmt --> VAR EQ asgn TERM
       | PRINT VAR TERM
       | INT VAR TERM
       | REAL VAR TERM
asgn --> VAR EQ asgn
       | INUM
       | RNUM

Depending on the regular expressions used for the various tokens, this could mean programs look something like

int x ;
real y ;
int z ;
y = 0.5 ;
print y ;
x = z = 10 ;

The parser currently just works off the token types, no additional token information is provided (e.g. the actual identifier name, real/integer value etc).

The parser datatypes

Token types and non terminal types are uniquely identified using an enumerated type, provided in parser.h.

The parser will use a stack of terminals/nonterminals, implemented as an array whose elements are of the enumerated type.

The token stream is also provided as an array (in tokens.h) of the enumerated type.

The parsing sequence

The parser keeps track of the current symbol being analyzed (either a token or nonterminal), the current position in the input token sequence, and a stack of tokens/non terminals.

The current symbol is initialized to the prog non terminal, the stack is initially empty, and the starting position in the token sequence is 0.

The main parsing loop repeats until either an error occurs or the stack and input sequence are both exhausted:

If current is a nonterminal then look at the next character in the input sequence to determine which rule should be applied for the nonterminal (i.e. pick the rule whose leftmost symbol matches the input symbol).
Once the rule has been identified, push the resulting symbols (from the RHS of the rule) from right to left onto the stack.
Finally, pop the top element off the stack and use that as the new value for current. (The NONE symbol is used if the stack was empty.)
Otherwise, if current is a token and it matches the next character in the input sequence, then advance one character through the input sequence and pop the top symbol off the stack and use that as the new value for current. (The NONE symbol is used if the stack was empty.)
Otherwise, current must have been a token that does not match the character in the input sequence, indicating the program is not syntactically valid.

Once the loop completes, the stack should be empty and we should have matched all characters in the input sequence.

Currently the parser displays which rule it applies and which characters it matches, so for the tokenized version of the example program above, the output looks like

Input 0, applying prog-->list
Input 0, applying list-->stmt list
Input 0, applying stmt-->INT VAR TERM
Matched INT at input position 0
Matched VAR at input position 1
Matched TERM at input position 2
Input 3, applying list-->stmt list
Input 3, applying stmt-->REAL VAR TERM
Matched REAL at input position 3
Matched VAR at input position 4
Matched TERM at input position 5
Input 6, applying list-->stmt list
Input 6, applying stmt-->INT VAR TERM
Matched INT at input position 6
Matched VAR at input position 7
Matched TERM at input position 8
Input 9, applying list-->stmt list
Input 9, applying stmt--> VAR EQ asgn TERM
Matched VAR at input position 9
Matched EQ at input position 10
Input 11, applying asgn-->RNUM
Matched RNUM at input position 11
Matched TERM at input position 12
Input 13, applying list-->stmt list
Input 13, applying stmt-->PRINT VAR TERM
Matched PRINT at input position 13
Matched VAR at input position 14
Matched TERM at input position 15
Input 16, applying list-->stmt list
Input 16, applying stmt--> VAR EQ asgn TERM
Matched VAR at input position 16
Matched EQ at input position 17
Input 18, applying asgn-->VAR EQ asgn
Matched VAR at input position 18
Matched EQ at input position 19
Input 20, applying asgn-->INUM
Matched INUM at input position 20
Matched TERM at input position 21
Success: the input file is a valid token sequence