Hand-crafted tokenizer for SIMP

Here we introduce a rudimentary tokenizer for the SIMP language, which is described further below.

The code directory includes

The sequence of events in a run of the tokenizer program is as follows:

  1. The main routine calls readSource to read the contents of the desired SIMP program from a file into a character array.
  2. main then calls the tokenize routine to go through the character array, identifying the sequence of SIMP tokens and storing them in a token list.
  3. main then calls the printTokens routine to display the contents of the token list.
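
As a rough illustration of step 1, the following is a minimal sketch of what a readSource routine might look like. The signature, the MAX_SOURCE capacity, and the return convention are all assumptions made for this example; the actual routine in the code directory may differ.

```c
#include <stdio.h>

#define MAX_SOURCE 4096   /* assumed capacity of the source character array */

/* Hypothetical readSource: read at most MAX_SOURCE - 1 characters of the
   named file into source and NUL-terminate the result.
   Returns the number of characters read, or -1 if the file cannot be opened. */
int readSource(const char *filename, char source[]) {
    FILE *fp = fopen(filename, "r");
    if (fp == NULL) {
        return -1;
    }
    size_t n = fread(source, 1, MAX_SOURCE - 1, fp);
    source[n] = '\0';   /* the later sketches rely on this terminator */
    fclose(fp);
    return (int)n;
}
```

Terminating the array with '\0' gives the later stages a simple way to detect end of input without passing the length around separately.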

The interesting work thus begins in the tokenize routine, which repeatedly calls the readToken routine to read one token at a time. The two routines coordinate with one another to keep track of where they are in the array of source code characters: tokenize tells readToken where in the character array to start searching for the next token, and readToken tells tokenize where that token ended.

The readToken routine starts at the array position identified by tokenize, skips past any whitespace, and then attempts to identify the content of the next token in the array. Having done so, it copies that token to the next available spot in the token list. Note that the logic readToken applies to identify the type and content of the next token is closely related to an FSM-like view of the token types supported in the SIMP language.

The main routine, the routine to read the source file into an array, and the printTokens routine to print the contents of the final token array are each quite straightforward and are not discussed further (see the source code if you're curious).

The points of most interest to us are the token descriptions for the SIMP language, the tokenize routine, and the readToken routine.


token, tokenList

Note that the token type is simply a character array (to hold the content of the token once known), and the tokenList is simply an array of these tokens (i.e. an array of character arrays).
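
The two types might be declared along the following lines. This is only a sketch: the MAX_TOKEN_LEN and MAX_TOKENS limits and the storeToken helper are hypothetical names introduced here for illustration, not taken from the original source.

```c
#include <string.h>

#define MAX_TOKEN_LEN 32    /* assumed limit on token length, including '\0' */
#define MAX_TOKENS    256   /* assumed limit on tokens per program */

/* A token is simply a character array holding the token's text. */
typedef char token[MAX_TOKEN_LEN];

/* Hypothetical helper: copy text into slot i of a token list.
   Returns 0 on success, -1 if the slot is out of range or the text is too long. */
int storeToken(token tokenList[], int i, const char *text) {
    if (i < 0 || i >= MAX_TOKENS || strlen(text) >= MAX_TOKEN_LEN) {
        return -1;
    }
    strcpy(tokenList[i], text);
    return 0;
}
```

A token list is then declared as an array of tokens, for example "token tokenList[MAX_TOKENS];", which makes tokenList[i] the character array for the i-th token.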


int tokenize(char source[], token tokenList[])

The tokenize function keeps track of how much of the source character array has been processed so far (in the pos variable) and the number of tokens processed so far (in the numTokens variable).

On each call to readToken, it passes the current position in the source array and the next free token from the token list, then waits for readToken to fill in the next token (if any) and return the updated position in the source array.

After checking the position returned by readToken (-1 if an invalid token was encountered), it updates the number of tokens and the current position in the source array, and then either terminates or proceeds with the next call to readToken.
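
The loop described above could be sketched as follows. Several details are assumptions made for the example: the size limits, the return value of tokenize (taken here to be the token count, or -1 on an invalid token), and the convention that readToken leaves the token empty when it reaches the end of input. To keep the block self-contained, readToken is replaced by a simplified stand-in that accepts any run of non-whitespace characters.

```c
#include <ctype.h>

#define MAX_TOKEN_LEN 32    /* assumed limit on token length, including '\0' */
#define MAX_TOKENS    256   /* assumed limit on tokens per program */

typedef char token[MAX_TOKEN_LEN];

/* Stand-in for the real readToken (NOT the SIMP rules): skips whitespace,
   then copies a run of non-whitespace characters into tok.
   Returns the position just past the token, or -1 if the token is too long. */
int readToken(char source[], token tok, int position) {
    int len = 0;
    tok[0] = '\0';
    while (source[position] != '\0' && isspace((unsigned char)source[position])) {
        position++;
    }
    while (source[position] != '\0' && !isspace((unsigned char)source[position])) {
        if (len >= MAX_TOKEN_LEN - 1) {
            return -1;
        }
        tok[len++] = source[position++];
    }
    tok[len] = '\0';
    return position;
}

/* Sketch of tokenize: returns the number of tokens stored, or -1 on error.
   Stops quietly once the token list is full. */
int tokenize(char source[], token tokenList[]) {
    int pos = 0;         /* how much of source has been processed */
    int numTokens = 0;   /* how many tokens have been stored */
    while (numTokens < MAX_TOKENS) {
        int next = readToken(source, tokenList[numTokens], pos);
        if (next < 0) {
            return -1;                      /* invalid token encountered */
        }
        if (tokenList[numTokens][0] == '\0') {
            break;                          /* nothing left to read */
        }
        numTokens++;
        pos = next;
    }
    return numTokens;
}
```

With this stand-in, tokenizing the text "x = 3 + y" fills the first five slots of the list with x, =, 3, +, and y.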


int readToken(char source[], token tok, int position)

As mentioned above, readToken takes a character array containing the target source code, searches for and reads the first token at or after a specified starting point in the array, copies that token's content into the token it was given (a slot in the token list), and returns the spot in the character array where the token ends.

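The SIMP token types are:

  assignment: =
  operator: one of + * / - or the two-character %%
  constant: an uppercase letter followed by zero or more digits, e.g. K or K9
  variable: one or more lowercase letters, e.g. rate
  integer: one or more digits, e.g. 42
  float: digits, a decimal point, then digits, e.g. 3.14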

Note that most of these token types can be uniquely identified by their first character, the only exception being integers vs floats.

The basic readToken process is thus straightforward:

The one additional complication with readToken is that end-of-input checks must be included in all the "keep reading" segments.
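
A possible shape for readToken, following the first-character dispatch described above, is sketched below. It is not the original implementation: the size limit, the helper routines, and the treatment of end of input (empty token, position returned unchanged) are assumptions. The sketch also assumes the source array is NUL-terminated, so every "keep reading" loop stops at end of input because '\0' belongs to none of the character classes.

```c
#define MAX_TOKEN_LEN 32   /* assumed limit on token length, including '\0' */

typedef char token[MAX_TOKEN_LEN];

static int isLower(char c) { return c >= 'a' && c <= 'z'; }
static int isUpper(char c) { return c >= 'A' && c <= 'Z'; }
static int isDigit(char c) { return c >= '0' && c <= '9'; }

/* Hypothetical helper: copy characters into tok while test() accepts them.
   '\0' fails every test, so the loop also stops at end of input.
   Returns the new position, or -1 if the token buffer would overflow. */
static int copyRun(char source[], int pos, token tok, int *len, int (*test)(char)) {
    while (test(source[pos])) {
        if (*len >= MAX_TOKEN_LEN - 1) {
            return -1;
        }
        tok[(*len)++] = source[pos++];
    }
    return pos;
}

/* Returns the position just past the token, or -1 for an invalid token.
   At end of input the token is left empty. */
int readToken(char source[], token tok, int position) {
    int pos = position;
    int len = 0;
    tok[0] = '\0';

    /* skip whitespace */
    while (source[pos] == ' ' || source[pos] == '\t' ||
           source[pos] == '\n' || source[pos] == '\r') {
        pos++;
    }
    char c = source[pos];
    if (c == '\0') {
        return pos;
    }

    if (c == '=' || c == '+' || c == '*' || c == '/' || c == '-') {
        tok[len++] = source[pos++];                  /* single-character token */
    } else if (c == '%') {
        if (source[pos + 1] != '%') {
            return -1;                               /* a lone % is invalid */
        }
        tok[len++] = source[pos++];
        tok[len++] = source[pos++];
    } else if (isLower(c)) {
        pos = copyRun(source, pos, tok, &len, isLower);   /* variable */
    } else if (isUpper(c)) {
        tok[len++] = source[pos++];                       /* constant */
        pos = copyRun(source, pos, tok, &len, isDigit);
    } else if (isDigit(c)) {
        pos = copyRun(source, pos, tok, &len, isDigit);   /* integer so far */
        if (pos >= 0 && source[pos] == '.') {             /* may be a float */
            if (!isDigit(source[pos + 1]) || len >= MAX_TOKEN_LEN - 1) {
                return -1;
            }
            tok[len++] = source[pos++];
            pos = copyRun(source, pos, tok, &len, isDigit);
        }
    } else {
        return -1;                                   /* no token starts with c */
    }

    if (pos < 0) {
        return -1;
    }
    tok[len] = '\0';
    return pos;
}
```

The integer/float branch is the only place where the first character is not enough: the routine reads the leading digits and only then checks for a decimal point to decide which of the two token types it is looking at.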


The Simp language

The tokens in Simp are as follows:

The tokenizer does not need to know or check the validity of the programs it is given; the grammar for Simp is shown below for completeness only.

Program --> Statements
Statements --> Statement
Statements --> Statement Statements
Statement --> Variable = Expression
Expression --> Expression Operator Value
Expression --> Value
Value --> Variable
Value --> Constant
Value --> Number
Operator --> [+*/-]
Operator --> [%][%]
Constant --> [A-Z] [0-9]*
Variable --> [a-z]+
Number --> Integer | Float
Integer --> [0-9]+
Float --> [0-9]+[.][0-9]+