The code directory includes
The sequence of events in a run of the tokenizer program is as follows:
The interesting work thus begins in the tokenize routine, which repeatedly calls the readToken routine to read one token at a time. The two routines coordinate with one another to keep track of where they are in the array of source code characters: tokenize tells readToken where in the character array to start searching for the next token, and readToken tells tokenize where that token ended.
The readToken routine starts at the array position identified by tokenize, skips past any whitespace, and then attempts to identify the content of the next token in the array. Having done so, it copies that token to the next available spot in the token list. Note that the logic readToken uses to identify the type and content of the next token closely follows an FSM-like view of the token types supported in the SIMP language.
The main routine, the
The points of most interest to us are the token descriptions for the SIMP language,
the tokenize routine, and the readToken routine.
token, tokenList
Note that the token type is simply a character array (to hold the content
of the token once known), and the tokenList is simply an array of these tokens
(i.e. an array of character arrays).
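
As a rough illustration of the two types described above, the declarations might look like the following minimal sketch. The size limits MAX_TOKEN_LEN and MAX_TOKENS and the helper setToken are assumptions made for this example; the actual values and helpers in the code directory may differ.

```c
#include <string.h>

/* Assumed limits: the real sizes are not stated in the text. */
#define MAX_TOKEN_LEN 32    /* characters per token, including the '\0' */
#define MAX_TOKENS    256   /* tokens per token list */

/* A token is simply a character array holding the token's text. */
typedef char token[MAX_TOKEN_LEN];

/* Hypothetical helper: copy text into a token slot, truncating if needed
   and always leaving the slot null-terminated. */
void setToken(token tok, const char *text) {
    strncpy(tok, text, MAX_TOKEN_LEN - 1);
    tok[MAX_TOKEN_LEN - 1] = '\0';
}
```

With these declarations, a token list is declared as token tokenList[MAX_TOKENS]; and tokenList[i] is the character array for the i-th token.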
int tokenize(char source[], token tokenList[])
The tokenize function keeps track of how much of the source character array has been
processed so far (in the pos variable) and the number of tokens processed
so far (in the numTokens variable).
On each call to readToken it passes the current position in the source
array and the next free token from the token list, then waits for
readToken to fill in the next token (if any) and return the updated
position in the source array.
After checking the position returned by readToken (-1 if an invalid token was
encountered), it updates the number of tokens and the current position in
the source array, and then either terminates or proceeds with the next call to readToken.
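
The loop described above can be sketched as follows. This is only an illustration under stated assumptions: the size limits are invented, tokenize is assumed to return the number of tokens read (or -1 on an invalid token), and readToken is assumed to leave the token empty when no token remains. The readToken shown here is a deliberately simplified stand-in that treats any run of non-whitespace characters as a token, so that the example compiles on its own.

```c
#include <ctype.h>

#define MAX_TOKEN_LEN 32    /* assumed limit, not from the original code */
#define MAX_TOKENS    256   /* assumed limit, not from the original code */
typedef char token[MAX_TOKEN_LEN];

/* Simplified stand-in for readToken: skips whitespace, then copies the next
   run of non-whitespace characters into tok. Returns the position just past
   the token, or -1 if the token does not fit. */
int readToken(char source[], token tok, int position) {
    int len = 0;
    while (source[position] != '\0' && isspace((unsigned char)source[position]))
        position++;
    while (source[position] != '\0' && !isspace((unsigned char)source[position])) {
        if (len >= MAX_TOKEN_LEN - 1)
            return -1;
        tok[len++] = source[position++];
    }
    tok[len] = '\0';   /* an empty token means no token was left to read */
    return position;
}

/* Returns the number of tokens stored in tokenList, or -1 on error. */
int tokenize(char source[], token tokenList[]) {
    int pos = 0;         /* how much of source has been processed */
    int numTokens = 0;   /* how many tokens have been stored */

    while (source[pos] != '\0') {
        if (numTokens >= MAX_TOKENS)
            return -1;                       /* token list is full */
        pos = readToken(source, tokenList[numTokens], pos);
        if (pos < 0)
            return -1;                       /* invalid token reported */
        if (tokenList[numTokens][0] == '\0')
            break;                           /* only trailing whitespace was left */
        numTokens++;
    }
    return numTokens;
}
```

Under these assumptions, calling tokenize on the source text "x = 3 + y" stores five tokens and returns 5.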
int readToken(char source[], token tok, int position)
The SIMP token types are:
The basic readToken process is thus straightforward:
The Simp language
The tokens in Simp are as follows:
The tokenizer does not need to know or check the validity of the programs
it is given; the grammar for Simp is shown below for completeness only.
Note that most of these token types can be uniquely identified by their first character,
the only exception being integers vs floats.
The one additional complication with readToken is that end-of-input checks must be included
in all the "keep reading" segments.
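
One possible shape for readToken, dispatching on the first character of the token and guarding every "keep reading" loop with an end-of-input check, is sketched below. The token classes are taken from the grammar rules for Operator, Constant, Variable, Integer and Float, plus the = used in statements. The size limit, the append helper, and the convention of leaving tok empty at end of input are assumptions for this sketch rather than details of the provided code.

```c
#include <ctype.h>

#define MAX_TOKEN_LEN 32   /* assumed limit, not from the original code */
typedef char token[MAX_TOKEN_LEN];

/* Hypothetical helper: append ch to tok; returns 0 if tok would overflow. */
static int append(token tok, int *len, char ch) {
    if (*len >= MAX_TOKEN_LEN - 1) return 0;
    tok[(*len)++] = ch;
    return 1;
}

static int isDigitAt(char source[], int p) {
    /* The '\0' test is the explicit end-of-input check. */
    return source[p] != '\0' && isdigit((unsigned char)source[p]);
}

/* Returns the position just past the token, or -1 for an invalid token.
   If only whitespace remains, tok is left empty. */
int readToken(char source[], token tok, int position) {
    int len = 0;
    tok[0] = '\0';

    while (source[position] != '\0' && isspace((unsigned char)source[position]))
        position++;
    if (source[position] == '\0')
        return position;                          /* end of input, no token */

    char c = source[position];
    if (c == '=' || c == '+' || c == '*' || c == '/' || c == '-') {
        append(tok, &len, source[position++]);    /* single-character token */
    } else if (c == '%') {
        if (source[position + 1] != '%') return -1;   /* must be %% */
        append(tok, &len, '%');
        append(tok, &len, '%');
        position += 2;
    } else if (c >= 'A' && c <= 'Z') {            /* Constant: [A-Z][0-9]* */
        append(tok, &len, source[position++]);
        while (isDigitAt(source, position))
            if (!append(tok, &len, source[position++])) return -1;
    } else if (c >= 'a' && c <= 'z') {            /* Variable: [a-z]+ */
        while (source[position] != '\0' &&
               source[position] >= 'a' && source[position] <= 'z')
            if (!append(tok, &len, source[position++])) return -1;
    } else if (isdigit((unsigned char)c)) {       /* Integer or Float */
        while (isDigitAt(source, position))
            if (!append(tok, &len, source[position++])) return -1;
        if (source[position] == '.') {            /* Float needs digits after '.' */
            if (!isDigitAt(source, position + 1)) return -1;
            if (!append(tok, &len, source[position++])) return -1;
            while (isDigitAt(source, position))
                if (!append(tok, &len, source[position++])) return -1;
        }
    } else {
        return -1;                                /* no token starts with c */
    }
    tok[len] = '\0';
    return position;
}
```

Only the number case needs to look beyond the first character to decide the token type: the token is an Integer unless a '.' followed by at least one digit is found after the leading digits.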
Program --> Statements
Statements --> Statement
Statements --> Statement Statements
Statement --> Variable = Expression
Expression --> Expression Operator Value
Expression --> Value
Value --> Variable
Value --> Constant
Value --> Number
Operator --> [+*/-]
Operator --> [%][%]
Constant --> [A-Z] [0-9]*
Variable --> [a-z]+
Number --> Integer | Float
Integer --> [0-9]+
Float --> [0-9]+[.][0-9]+