CSCI 439: Lab 3 (Spring 2025)

Lab 3 can be obtained and submitted following our usual git processes (with the repo named "lab3" this time).

The repo contains four files:
Note: the repo has been pushed (4pm Feb. 9th)

Vurbossity revisions:

We'll be working with a slightly revised version of vurbossity. The changes to the language specs are as follows (in addition to the corrections posted on the lab1 page):

The new type-compatibility specifications won't be important for lab3, but they will be important in lab4, when you write a parser that works from the tokenizer you produce in lab3.

Lab 3 overview:

For lab 3 you'll be implementing a hand-crafted tokenizer for vurbossity, written in C++ and following the specifications below.

The two core routines are the actual tokenizer, called tokenize, and a routine to display the resulting tokens, printTokens.

The tokenizer reads the source code from standard input (so lab3x is run similarly to labs 1 and 2, ./lab3x < somecode.vurb). It strips comments and identifies each valid token, storing information about the token in an array of token structs and returning the total number of valid tokens read.

// read each word from standard input,
//    displaying error messages for invalid tokens encountered,
//    filling in the corresponding token information in the tokens array for valid tokens,
// and returning the number of valid tokens read
int tokenize(token tokens[]);

The token struct itself is defined in tokenizer.h as follows:

// each token has a type (from the TokenTypes enum),
//    the associated token text content, and
//    its position in the sequence of valid tokens
struct token {
   unsigned int ttype;
   string content;
   int pos;
};

The list of enumerated types is also found in tokenizer.h, but is only partially complete:

// enumeration of all the token types [only partially completed]
enum TokenType {
   Invalid = -1,
   Begin, Left, Right, End,
   Identifier, RealLit, BoolLit, IntLit, StrLit
};
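
As a rough illustration of how a lexeme might be mapped onto these token types, here is a minimal sketch. The function name classify is hypothetical, and the lexical rules in it are assumptions chosen to agree with the sample run below (all digits for an integer literal, a letter followed by letters or digits for an identifier, double quotes around a string literal); the real rules are whatever the vurbossity specs say. RealLit and BoolLit are deliberately not handled, and neither are the token types still missing from the enum.

```cpp
#include <cctype>
#include <string>

// the partially completed enum from tokenizer.h, repeated so the sketch compiles alone
enum TokenType {
   Invalid = -1,
   Begin, Left, Right, End,
   Identifier, RealLit, BoolLit, IntLit, StrLit
};

// Hypothetical helper: decide which token type a raw lexeme belongs to.
// ASSUMED rules, for illustration only; substitute the rules from the specs.
TokenType classify(const std::string& s) {
    if (s.empty()) return Invalid;

    // keywords listed in the enum so far
    if (s == "begin") return Begin;
    if (s == "left")  return Left;
    if (s == "right") return Right;
    if (s == "end")   return End;

    // string literal: starts and ends with a double quote
    if (s.size() >= 2 && s.front() == '"' && s.back() == '"') return StrLit;

    // one pass to see whether the lexeme is all digits / all alphanumeric
    bool allDigits = true;
    bool allAlnum = true;
    for (unsigned char ch : s) {
        if (!std::isdigit(ch)) allDigits = false;
        if (!std::isalnum(ch)) allAlnum = false;
    }
    if (allDigits) return IntLit;

    // identifier: a letter followed by letters or digits (assumption)
    if (std::isalpha(static_cast<unsigned char>(s[0])) && allAlnum) {
        return Identifier;
    }
    return Invalid;
}
```

Under these assumed rules, 104 is classified as an integer literal, foo as an identifier, and 9x5? as invalid, which matches the sample run.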

The supporting print routine is simply so that the main routine (in lab3.cpp) can display the gathered information about the tokens afterward:

// display the token information for each token in the array
void printTokens(token tokens[], int size);
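
A minimal sketch of what printTokens could look like follows, using the line format shown in the sample output below. The helpers typeName and describeToken are hypothetical, the names reallit and boollit are guessed by analogy with intlit and strlit, and the sketch prints only the per-token lines (which routine prints the surrounding messages is not specified here).

```cpp
#include <iostream>
#include <string>
using std::string;

// enum and struct repeated from tokenizer.h so the sketch compiles alone
enum TokenType {
   Invalid = -1,
   Begin, Left, Right, End,
   Identifier, RealLit, BoolLit, IntLit, StrLit
};

struct token {
   unsigned int ttype;
   string content;
   int pos;
};

// Hypothetical helper: printable name for a token type.
// The table must be kept in the same order as the enum.
string typeName(unsigned int ttype) {
    static const char* const names[] = {
        "begin", "left", "right", "end",
        "identifier", "reallit", "boollit", "intlit", "strlit"
    };
    const unsigned int count = sizeof(names) / sizeof(names[0]);
    // Invalid (-1) becomes a huge unsigned value, so it lands here too
    return ttype < count ? names[ttype] : "invalid";
}

// Hypothetical helper: build one display line for a token
string describeToken(const token& t) {
    return "Token " + std::to_string(t.pos) + ": " + typeName(t.ttype) +
           ", content: " + t.content;
}

// display the token information for each token in the array
void printTokens(token tokens[], int size) {
    for (int i = 0; i < size; i++) {
        std::cout << describeToken(tokens[i]) << std::endl;
    }
}
```

Keeping the formatting in a separate function that returns a string makes it easy to check a single line without capturing standard output.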

You need to complete the tokenize and printTokens routines, as well as the enumerated list of token types. The implementation approach is left up to you (as long as it is hand-crafted, not using any kind of scanner-generator like lex), and you may add as many additional functions to the tokenizing.cpp file as desired.

I'll be writing my own version of lab3.cpp to include/call your tokenize and printTokens routines, so make sure all the code they rely upon is in the tokenizing.cpp file (i.e. don't have them rely on code that is in your lab3.cpp).

A sample input file and the resulting output are provided below (this only tests the tokenization; we don't care at this point whether the program is syntactically valid):

begin
left 104
right COM blah blah blah
foo begin "this is some text" 9x5?
end

Beginning scanning:
Scanning error: "9x5?" invalid token (found after the token in position 6)
Total valid tokens: 5
Token 0: begin, content: begin
Token 1: left, content: left
Token 2: intlit, content: 104
Token 3: right, content: right
Token 4: identifier, content: foo
Token 5: begin, content: begin
Token 6: strlit, content: "this is some text"
Token 7: end, content: end
---end of tokens---