The program assumes the source code has already been scanned and a token list generated and stored in tokens.h. (I mentioned the parser was rudimentary, right?)
The code directory includes
The language token types are
INT, REAL (representing datatypes) VAR (for variable identifiers) INUM (for integer literals) RNUM (for real number literals) PRINT (a built in print function) EQ (for the assignment operator) TERM (for the statement terminator, e.g. a semicolon)The grammar it recognizes is as follows:
prog --> list list --> stmt list | nil stmt --> VAR EQ asgn TERM | PRINT VAR TERM | INT VAR TERM | REAL VAR TERM asgn --> VAR EQ asgn | INUM | RNUMDepending on the regular expressions used for the various tokens, this could mean programs look something like
int x ; real y ; int z ; y = 0.5 ; print y ; x = z = 10 ;The parser currently just works off the token types, no additional token information is provided (e.g. the actual identifier name, real/integer value etc).
The parser datatypes
Token types and non terminal types are uniquely identified using an enumerated type, provided in parser.h.
The parser will use a stack of terminals/nonterminals, implemented as an array whose elements are of the enumerated type.
The token stream is also provided as an array (in tokens.h) of the enumerated type.
The parsing sequence
The parser keeps track of the current symbol being analyzed (either a token or nonterminal), the current position in the input token sequence, and a stack of tokens/non terminals.
The current symbol is initialized to the prog non terminal, the stack is initially empty, and the starting position in the token sequence is 0.
The main parsing loop repeats until either an error occurs or the stack and input sequence are both exhausted:
Once the rule has been identified, push the resulting symbols (from the RHS of the rule) from right to left onto the stack.
Finally, pop the top element off the stack and use that as the new value for current. (The NONE symbol is used if the stack was empty.)
Once the loop completes, the stack should be empty and we should have matched all characters in the input sequence.
Currently the parser displays which rule it applies and which characters it matches, so for the tokenized version of the example program above, the output looks like
Input 0, applying prog-->list Input 0, applying list-->stmt list Input 0, applying stmt-->INT VAR TERM Matched INT at input position 0 Matched VAR at input position 1 Matched TERM at input position 2 Input 3, applying list-->stmt list Input 3, applying stmt-->REAL VAR TERM Matched REAL at input position 3 Matched VAR at input position 4 Matched TERM at input position 5 Input 6, applying list-->stmt list Input 6, applying stmt-->INT VAR TERM Matched INT at input position 6 Matched VAR at input position 7 Matched TERM at input position 8 Input 9, applying list-->stmt list Input 9, applying stmt--> VAR EQ asgn TERM Matched VAR at input position 9 Matched EQ at input position 10 Input 11, applying asgn-->RNUM Matched RNUM at input position 11 Matched TERM at input position 12 Input 13, applying list-->stmt list Input 13, applying stmt-->PRINT VAR TERM Matched PRINT at input position 13 Matched VAR at input position 14 Matched TERM at input position 15 Input 16, applying list-->stmt list Input 16, applying stmt--> VAR EQ asgn TERM Matched VAR at input position 16 Matched EQ at input position 17 Input 18, applying asgn-->VAR EQ asgn Matched VAR at input position 18 Matched EQ at input position 19 Input 20, applying asgn-->INUM Matched INUM at input position 20 Matched TERM at input position 21 Success: the input file is a valid token sequence