lex/yacc

lex is a lexical analyzer - it allows you to provide regular expressions for the tokens recognized in a language, and it produces a C program (named lex.yy.c) that can tokenize source code for the language.

yacc (yet another compiler compiler) allows you to provide a set of grammar rules for a language, and it produces a program that can then parse input files for that language. It also allows you to tell yacc what to do with the language component it recognizes, allowing you to effectively create a syntax checker, interpretter, or compiler for any language you want (as long as you can come up with the grammar rules for the language and C code telling yacc what to do with the various language statements).

Suppose I have some language, L, and I want to be able to create a syntax-checking program, checkL, that reads programs written in L and checks that their syntax is valid, producing appropriate error messages if not. The steps would be as follows:

  1. Create a file called checkL.lex that contains the rules for tokens in L, e.g. what are the keywords and symbols in L, what is the syntax for a string, a comment, a real number, etc.
  2. Run the program lex on checkL.lex, to produce a c program (named lex.yy.c) to carry out tokenizing.
    lex checkL.lex
  3. Create a file called checkL.yacc that contains the grammar rules for L, e.g. what are the syntax rules for L, what kind of type checking takes place, what operator precedence is used, etc.
  4. Run the program yacc on checkL.yacc, to produce a c program (named y.tab.c) that will apply your token rules and grammar rules.
    yacc checkL.yacc
  5. Use gcc to compile the two .c files together and produce your checkL program:
    gcc y.tab.c lex.yy.c -o checkL
  6. Use checkL to test the syntax of source code written in L:
    checkL < someProgram.L

The trick now is understanding how we have to express our token and grammar rules in a way that lex and yacc understand, and that also correctly describes our language L.

Some reference material for lex and yacc are available here:

The small examples below use lex/yacc to recognize various languages/features, with some comments in the .lex and .yacc files to explain what is being done.

Each directory contains a makefile to build an interpretter for the language (called interp in each case) and a subdirectory of test cases.

Run make to build lex.yy.c, y.tab.c, and interp, then apply the interpretter using ./interp < filename
e.g. ./interp < testcases/t1

Note that augmented grammars (such as the yacc/lex CFGs plus supporting C code, or prolog's DCGs) can be used to describe a great many things beyond the conventional idea of a language:

Shopping orders as a grammar:
order --> items, payments
items --> item
items --> item, items
payments --> payment
payments --> payment, payments
item --> price, itype, quantity, etc
payment --> ptype, amount, etc

A forest of binary search trees as a grammar:
forest --> tree
forest --> tree, forest
tree --> node // i.e. the root node
node --> data, children
children --> node
children --> node, children
data --> datum
data --> datum, data
datum --> dtype, dname, dconstraints

ERDs as a grammar:
erd --> entities, relationships
entities --> entity
entities --> entity, entities
entity --> ename, attributes
attributes --> attribute
attributes --> attribute, attributes
attribute --> type, attrname, attrconstraints
relationships --> relationship
relationships --> relationship, relationships
relationship --> entities, rname, rconstraints, entities  // the "from" and "to" entities