Simple hand-crafted tokenizer for a fictional language

Hand-crafted tokenizer for SIMP

Here we introduce a rudimentary tokenizer for the SIMP language, which is described further below.

The code directory includes

this document
tokenizer.c: essentially the main routine for the tokenizer
tokens.c: most of the functions for tokenization
tokens.h: type definitions and function prototypes
a makefile: to build/clean the beastie
a directory of SIMP programs for testing (some valid, some containing invalid tokens)
two bash scripts, testGood.sh and testBad.sh: to run the tokenizer on the set of good/bad SIMP test cases

The sequence of events in a run of the tokenizer program is as follows:

The main routine calls readSource to read the contents of the desired SIMP program from a file into a character array.
main then calls the tokenize routine to go through the character array, identifying the sequence of SIMP tokens and storing them in a token list.
main then calls the printTokens routine to display the contents of the token list.

The interesting work thus begins in the tokenize routine, which repeatedly calls the readToken routine to read one token at a time. The two routines coordinate with one another to keep track of where they are in the array of source code characters: tokenize tells readToken where in the character array to start searching for the next token, and readToken tells tokenize where that token ended.

The readToken routine starts at the array position identified by tokenize, skips past any whitespace, and then attempts to identify the content of the next token in the array. Having done so, it copies that token to the next available spot in the token list. Note that the logic readToken applies to identify the type/content of the next token is closely related to an FSM-like view of the token types supported in the SIMP language.

The main routine, the readSource routine that reads the source file into an array, and the printTokens routine that prints the contents of the final token array are each quite straightforward and are not discussed further (see the source code if you're curious).

The points of the most interest to us are the token descriptions for the SIMP language, the tokenize routine, and the readToken routine.

token, tokenList

Note that the token type is simply a character array (to hold the content of the token once known), and the tokenList is simply an array of these tokens (i.e. an array of character arrays).
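As a rough illustration of the description above, the two types might be declared along the following lines. The names MAX_TOKEN_LEN, MAX_TOKENS and exampleList, and the sizes chosen, are assumptions made for this sketch; the actual definitions are in tokens.h and may differ.

```c
/* Assumed limits: these names and values are illustrative only,
   they are not taken from tokens.h. */
#define MAX_TOKEN_LEN 64    /* longest token text, including the final '\0' */
#define MAX_TOKENS    256   /* capacity of a token list */

/* A token is a fixed-size character array holding the token's text. */
typedef char token[MAX_TOKEN_LEN];

/* A token list is an array of tokens, i.e. an array of character arrays. */
token exampleList[MAX_TOKENS];
```

Because token is an array type, passing a single token to a function passes a pointer to its first character, which is what would allow a routine such as readToken to fill in the caller's token in place.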

int tokenize(char source[], token tokenList[])

The tokenize function keeps track of how much of the source character array has been processed so far (in the pos variable) and the number of tokens processed so far (in the numTokens variable).

On each call to readToken it passes the current position in the source array and the next free token from the token list, then waits for readToken to fill in the next token (if any) and return the updated position in the source array.

After checking the returned position from readToken (-1 if an invalid token was encountered) it can update the number of tokens and the current position in the source array, and either terminate or proceed with the next call to readToken.
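The loop described above could be sketched roughly as follows. Several things here are assumptions rather than facts about tokens.c: the type and size definitions, the choice to return the token count (or -1 on an invalid token), the use of an empty token to signal end of input, and in particular the readToken shown, which is a deliberately cut-down stand-in (lowercase words and single-character symbols only) included just so the loop compiles on its own.

```c
#include <string.h>

#define MAX_TOKEN_LEN 64            /* assumed limits, not taken from tokens.h */
#define MAX_TOKENS    256
typedef char token[MAX_TOKEN_LEN];  /* assumed definition of the token type */

/* Stand-in for the real readToken: handles only variables and the
   single-character symbols. Returns the position just past the token,
   or -1 for an invalid token; leaves tok empty at end of input. */
int readToken(char source[], token tok, int position)
{
    int pos = position, len = 0;
    tok[0] = '\0';
    while (source[pos] == ' ' || source[pos] == '\t' || source[pos] == '\n') pos++;
    if (source[pos] == '\0') return pos;
    if (source[pos] >= 'a' && source[pos] <= 'z') {
        while (source[pos] >= 'a' && source[pos] <= 'z' && len < MAX_TOKEN_LEN - 1)
            tok[len++] = source[pos++];
    } else if (strchr("+-*/=", source[pos]) != NULL) {
        tok[len++] = source[pos++];
    } else {
        return -1;
    }
    tok[len] = '\0';
    return pos;
}

/* Fills tokenList from source. Assumed return value: the number of
   tokens stored, or -1 if an invalid token was encountered. */
int tokenize(char source[], token tokenList[])
{
    int pos = 0;        /* how much of source has been processed */
    int numTokens = 0;  /* how many tokens have been stored */

    while (numTokens < MAX_TOKENS) {
        /* hand readToken the current position and the next free token */
        int next = readToken(source, tokenList[numTokens], pos);
        if (next < 0) return -1;                      /* invalid token */
        if (tokenList[numTokens][0] == '\0') break;   /* nothing left to read */
        numTokens++;
        pos = next;
    }
    /* note: input beyond MAX_TOKENS tokens is silently ignored in this sketch */
    return numTokens;
}
```

With these definitions, tokenizing the string "a = b + c" would store five tokens, and a source containing a character such as ? would produce -1.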

int readToken(char source[], token tok, int position)

As mentioned above, readToken takes a character array containing the target source code, searches for/reads the first token at/after a specified starting point in the array, copies that token's content into the token it was given (the next free entry of the token list), and identifies (and returns) the spot in the character array where the token ends.

The SIMP token types are:

variables, which consist solely of lowercase alphabetic characters
integers, which consist entirely of digits
floats in fixed point notation (digits followed by a decimal point followed by digits)
constants, which consist of an uppercase alphabetic character followed by zero or more digits
single-character symbols, which are the characters * - + / =
the double-character symbol %%

Note that most of these token types can be uniquely identified by their first character; the only exception is integers vs floats, which both begin with a digit.

The basic readToken process is thus straightforward:

starting from the given position, skip past whitespace
look at the first character and decide whether it begins a variable, constant, integer/float, or operator, or is invalid
for variables or constants, after the first character keep reading as long as you see further valid characters (lowercase alphabetic characters or digits, respectively)
for integers/floats: treat as an integer initially and keep reading as long as you see digits; if you then see a decimal point, regard it as a float and keep reading as long as you see digits (at least one digit must follow the decimal point)
operators other than %% should just be the single character
if a % was seen, check that the next character is a second %; a lone % is not a valid token

The one additional complication with readToken is that end-of-input checks must be included in all the "keep reading" segments.
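The steps above can be sketched in C as shown below. This is an illustrative version under stated assumptions, not the code in tokens.c: the token definition and MAX_TOKEN_LEN are assumed, the helpers appendChar, isLower, isUpper and isDigit are hypothetical, the source is assumed to end with '\0', and an over-long token is simply treated as invalid. Since '\0' is neither whitespace, a letter, nor a digit, every "keep reading" loop here stops at end of input without a separate check.

```c
#include <ctype.h>
#include <string.h>

#define MAX_TOKEN_LEN 64            /* assumed limit, not taken from tokens.h */
typedef char token[MAX_TOKEN_LEN];  /* assumed definition of the token type */

/* Hypothetical helpers, introduced for this sketch only. */
static int isLower(char c) { return c >= 'a' && c <= 'z'; }
static int isUpper(char c) { return c >= 'A' && c <= 'Z'; }
static int isDigit(char c) { return c >= '0' && c <= '9'; }

/* Appends c to tok; returns 0 (and appends nothing) if tok is full. */
static int appendChar(token tok, int *len, char c)
{
    if (*len >= MAX_TOKEN_LEN - 1) return 0;
    tok[(*len)++] = c;
    tok[*len] = '\0';
    return 1;
}

/* Returns the position just past the token read, or -1 if it is invalid
   (tok may then hold a partial token). Leaves tok empty at end of input. */
int readToken(char source[], token tok, int position)
{
    int pos = position;
    int len = 0;
    tok[0] = '\0';

    while (isspace((unsigned char)source[pos])) pos++;   /* skip whitespace */
    if (source[pos] == '\0') return pos;                  /* no token left */

    char first = source[pos];
    if (isLower(first)) {                                 /* variable */
        while (isLower(source[pos]))
            if (!appendChar(tok, &len, source[pos++])) return -1;
    } else if (isUpper(first)) {                          /* constant */
        appendChar(tok, &len, source[pos++]);
        while (isDigit(source[pos]))
            if (!appendChar(tok, &len, source[pos++])) return -1;
    } else if (isDigit(first)) {                          /* integer or float */
        while (isDigit(source[pos]))
            if (!appendChar(tok, &len, source[pos++])) return -1;
        if (source[pos] == '.') {                         /* now a float */
            if (!isDigit(source[pos + 1])) return -1;     /* digits must follow */
            if (!appendChar(tok, &len, source[pos++])) return -1;
            while (isDigit(source[pos]))
                if (!appendChar(tok, &len, source[pos++])) return -1;
        }
    } else if (first == '%') {                            /* must be %% */
        if (source[pos + 1] != '%') return -1;
        appendChar(tok, &len, '%');
        appendChar(tok, &len, '%');
        pos += 2;
    } else if (strchr("+-*/=", first) != NULL) {          /* single-character symbol */
        appendChar(tok, &len, source[pos++]);
    } else {
        return -1;                                        /* invalid first character */
    }
    return pos;
}
```

Each branch corresponds to one path through the FSM-like view of the token types: the first character selects the branch, and the loops consume characters until one is seen that cannot extend the current token.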

The Simp language

The tokens in Simp are as follows:

Constants: begin with an uppercase alphabetic character, followed by zero or more digits
Variables: consist of one or more lowercase alphabetic characters.
Integers: consist entirely of one or more digits.
Floats: consist of one or more digits, followed by a decimal point, followed by one or more digits
Symbols: consist of any one of the single characters + - * / = or the double-character symbol %%

The tokenizer does not need to know/check the validity of the programs it is given; the grammar for Simp is shown below for completeness only.

Program --> Statements
Statements --> Statement
Statements --> Statement Statements
Statement --> Variable = Expression
Expression --> Expression Operator Value
Expression --> Value
Value --> Variable
Value --> Constant
Value --> Number
Operator --> [+*/-]
Operator --> [%][%]
Constant --> [A-Z] [0-9]*
Variable --> [a-z]+
Number --> Integer | Float
Integer --> [0-9]+
Float --> [0-9]+[.][0-9]+