CSCI 439 lab 3

CSCI 439: Lab 3 (Spring 2025)

Lab 3 can be obtained and submitted following our usual git processes (with the repo named "lab3" this time).

The repo contains four files:

tokenizer.h and tokenizer.cpp: for the tokenization code you'll be writing
lab3.cpp: just a main routine/driver to call your tokenizer and display the results
a makefile to compile the code into executable lab3x

Note: the repo has been pushed (4pm Feb. 9th)

Vurbossity revisions:

We'll be working with a slightly revised version of vurbossity. The changes to the language specs are as follows (in addition to the corrections posted on the lab1 page):

There is an additional data type, and keyword "boolean", so variables and parameters can be declared of this new type.
The conditional expressions (lt, gt, eq, ne, le, ge) evaluate to a boolean, and their operands must either be two expressions of the same type or a combination of integer and real (in either order).
The and, or, not operators require that their operands evaluate to a boolean (i.e. their operands cannot be integers, reals, or strings).
The add, sub, mul, div operators require their operands to be integers or reals, and the rem operator requires both its operands to be integers.
The set operator is valid if the identifier being assigned to has the same type as the expression being assigned, or if the identifier is a real and the expression evaluates to an integer. Any other type combination is invalid.
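For reference when lab 4 comes around, the compatibility rules above can be written as a handful of small predicates. This is only a sketch: the VType enum and the function names (isNumeric, comparisonOk, logicalOk, arithmeticOk, remOk, setOk) are made up for illustration and are not part of the lab repo.

```cpp
// Hypothetical type tags: the handout does not define these names.
enum class VType { Integer, Real, Boolean, String };

// True for the two numeric types.
bool isNumeric(VType t) {
    return t == VType::Integer || t == VType::Real;
}

// lt, gt, eq, ne, le, ge: two operands of the same type,
// or a mix of integer and real in either order.
bool comparisonOk(VType a, VType b) {
    return a == b || (isNumeric(a) && isNumeric(b));
}

// and, or: both operands must be boolean
// (not takes one operand, so pass it twice or check it directly).
bool logicalOk(VType a, VType b) {
    return a == VType::Boolean && b == VType::Boolean;
}

// add, sub, mul, div: operands must be integers or reals.
bool arithmeticOk(VType a, VType b) {
    return isNumeric(a) && isNumeric(b);
}

// rem: both operands must be integers.
bool remOk(VType a, VType b) {
    return a == VType::Integer && b == VType::Integer;
}

// set: same type on both sides, or an integer expression
// assigned to a real identifier. Anything else is invalid.
bool setOk(VType target, VType expr) {
    return target == expr ||
           (target == VType::Real && expr == VType::Integer);
}
```

For example, setOk(VType::Real, VType::Integer) is true while setOk(VType::Integer, VType::Real) is false, matching the one-way integer-to-real rule.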

The new type-compatibility specifications won't be important for lab3, but they will be important in lab4 when you write a parser that works from the tokenizer you produce in lab3.

Lab 3 overview:

For lab 3 you'll be implementing a hand-crafted tokenizer for vurbossity, written in C++ and following the specifications below.

The two core routines are the actual tokenizer, called tokenize, and a routine to display the resulting tokens, printTokens.

The tokenizer reads the source code from standard input (so lab3x is run similarly to labs 1 and 2, ./lab3x < somecode.vurb). It strips comments and identifies each valid token, storing information about the token in an array of token structs and returning the total number of valid tokens read.

// read each word from standard input,
//    displaying error messages for invalid tokens encountered,
//    filling in the corresponding token information in the tokens array for valid tokens,
// and returning the number of valid tokens read
int tokenize(token tokens[]);

The token struct itself is defined in tokenizer.h as follows:

// each token has a type (from the TokenType enum),
//    the associated token text content, and
//    its position in the sequence of valid tokens
struct token {
   unsigned int ttype;
   string content;
   int pos;
};

The enumeration of token types is also found in tokenizer.h, but is only partially complete:

// enumeration of all the token types [only partially completed]
enum TokenType {
   Invalid = -1,
   Begin, Left, Right, End,
   Identifier, RealLit, BoolLit, IntLit, StrLit
};

The supporting print routine is simply so that the main routine (in lab3.cpp) can display the gathered information about the tokens afterward:

// display the token information for each token in the array
void printTokens(token tokens[], int size);

You need to complete the tokenize and printTokens routines, as well as the enumerated list of token types. The implementation approach is left up to you (as long as it is hand-crafted, not using any kind of scanner-generator like lex), and you may add as many additional functions to the tokenizer.cpp file as desired.

I'll be writing my own version of lab3.cpp to include/call your tokenize and printTokens routines, so make sure all the code they rely upon is in the tokenizer.cpp file (i.e. don't have them rely on code that is in your lab3.cpp).

A sample input file and the resulting output are provided below. This just tests the tokenization; we don't care at this point whether the program is syntactically valid.

begin
left 104
right COM blah blah blah
foo begin "this is some text" 9x5?
end

Beginning scanning:

Scanning error: "9x5?" invalid token (found after the token in position 6)

Total valid tokens: 8
Token 0: begin, content: begin
Token 1: left, content: left
Token 2: intlit, content: 104
Token 3: right, content: right
Token 4: identifier, content: foo
Token 5: begin, content: begin
Token 6: strlit, content: "this is some text"
Token 7: end, content: end

---end of tokens---