Programming Assignment 2 (Due on Feb 1)

In this assignment you will build a lexer - a program that reads a text file and breaks it up into units called tokens. You will do this both from scratch and using a program that writes your lexer for you based on a regular expression grammar you provide.

Files

Notice that there is a hierarchical structure to the files - that is, some of them are in subdirectories. This is indicated by the indenting level of the lists in this section. When you download the files you should maintain this directory structure. One reason is that each problem involves creating a class called Lexer - and they are different classes. Another reason is that it will make the handed-in versions easier to grade. When you turn in your results, use the turnin command on the directory that contains all these files.


Problem 1

The classes in P1.java implement a lexer that recognizes the lexicon of Nocaf, an almost invisible subset of Java. You are to upgrade it to recognize the lexicon of Decaf, and to skip both // and /* comments. A correct solution will be able to read the provided file Algebra.decaf.
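The P1.java framework itself isn't reproduced here, but the comment-skipping logic you need to add can be sketched roughly as follows. This is a hedged, self-contained illustration, not the P1.java API: the class and method names (CommentSkipper, skipComment) are invented, and it assumes a reader with one-character pushback.

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical sketch of skipping // and /* comments; the real P1.java
// framework is organized differently.
public class CommentSkipper {
    private final PushbackReader in;

    public CommentSkipper(PushbackReader in) { this.in = in; }

    // Called after a '/' has been read. Returns true if a comment was
    // consumed, false if the '/' was an ordinary division operator.
    public boolean skipComment() throws IOException {
        int next = in.read();
        if (next == '/') {                    // line comment: // ... end of line
            int c;
            while ((c = in.read()) != -1 && c != '\n') { }
            return true;
        } else if (next == '*') {             // block comment: /* ... */
            int prev = 0, c;
            while ((c = in.read()) != -1) {
                if (prev == '*' && c == '/') return true;
                prev = c;
            }
            return true;                      // unterminated comment runs to EOF
        } else {
            if (next != -1) in.unread(next);  // not a comment: push the char back
            return false;
        }
    }
}
```

The key point is that a lone '/' must be pushed back and treated as an operator, since the lexer cannot know it has a comment until it sees the second character.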

The Decaf lexicon is as follows:

Although we listed "true" and "false" with the keywords, formally they are not keywords. They are boolean literals.

This problem shouldn't be too hard once you see how the basic framework works. The most difficult part will be accommodating the multi-character operators. As it stands, when the Lexer encounters a character that is neither a letter nor a digit, it immediately grabs it, assumes it's done, and notifies the Token. This needs to change.
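One common way to make that change is a single character of lookahead with maximal munch: after reading an operator character, peek at the next character and keep it only if the pair forms a two-character operator. A rough sketch (the class and method names here are hypothetical, not the P1.java framework, and the operator set is illustrative):

```java
import java.io.IOException;
import java.io.PushbackReader;

// Hypothetical sketch of maximal-munch operator scanning.
public class OperatorScanner {
    private final PushbackReader in;

    public OperatorScanner(PushbackReader in) { this.in = in; }

    // Called with the first operator character already in hand.
    public String scanOperator(char first) throws IOException {
        int next = in.read();
        String two = "" + first + (char) next;
        switch (two) {
            case "==": case "!=": case "<=": case ">=":
            case "&&": case "||":
                return two;                   // maximal munch: take both characters
            default:
                if (next != -1) in.unread(next);
                return String.valueOf(first); // single-character operator
        }
    }
}
```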

Getting Started

Set up the directory hierarchy and download the files. Then issue the following commands at the Unix prompt:

You should get the following output:

The program has digested the input file, which contains a small source text written in Nocaf. The file P1.java defines three classes:

We will be using this framework for subsequent stages of the programming project; you will be provided with a correct version that incorporates some other useful features. Most notable of these is keeping track of line and character numbers, which makes error messages a lot more useful.
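That line-and-character bookkeeping amounts to something like the following sketch (the class and method names are illustrative, not the framework's actual API): advance a pair of counters as each character is consumed.

```java
// Hypothetical sketch of position tracking of the kind the later
// framework adds.
public class Position {
    private int line = 1, column = 0;

    // Call once per character consumed from the input.
    public void advance(char c) {
        if (c == '\n') { line++; column = 0; } else { column++; }
    }

    @Override
    public String toString() { return line + ":" + column; }
}
```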


Problem 2

Create a lexer for the same lexicon, but using the automated tool JLex.

The source code (in Java) and documentation for this tool are available at http://www.cs.princeton.edu/~appel/modern/java/JLex. The CS-160 home directory has a pre-compiled version of this tool.

The JLex specification file for recognizing the lexicon of Nocaf is provided for you as P2.jlx. It should be pretty easy to extend it to recognize Decaf; probably easier than your solution to Problem 1.

(The code for recognizing and skipping comments is already included for you. Explain the meaning of each term in the regular expressions for COMMENT1 and COMMENT2. Put this explanation in a file called explain and include it in the p2 directory when you hand in your assignment.)

Getting Started

Do the following:

You should get the same output you did when running the first program on the nocaf file, except with the additional token "Eof" at the end.

The documentation provided with this tool is not the greatest. (For example, § 2.3.2 says the characters / ; and = are metacharacters for its regular expressions, but doesn't say what for.) However, the template provided in P2.jlx gives you a good start. Here is a brief description of its contents.

The first section of the file (before the first %% marker) is Java source that will be copied verbatim into the Lexer that the tool generates. Here we define a simple Token type with the same fields as the one used in Problem 1. But this is a "dumb" token, with no methods other than constructors. Also provided in this section is a main method that instantiates the Lexer by passing in a FileInputStream. (That's undocumented; to figure it out you have to read the Java code the tool generates.)
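The exact P2.jlx source isn't reproduced here, but a "dumb" token of the kind described - public fields and constructors only - might look like this (the field names are guesses, not necessarily those in P2.jlx):

```java
// Hypothetical sketch of a "dumb" Token: data plus constructors, no behavior.
public class Token {
    public final String type;      // e.g. "Id", "Num", "Keyword"
    public final String spelling;  // the matched characters

    public Token(String type, String spelling) {
        this.type = type;
        this.spelling = spelling;
    }

    public Token(String type) {
        this(type, type);          // for tokens whose spelling is fixed
    }
}
```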

The second section contains directives - instructions about what sorts of regular expressions are to be recognized, and how the Java classes are to be named.
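For orientation, a JLex directives section mixes %-directives with named macro definitions, roughly like this. This is a sketch based on the JLex documentation, not the contents of P2.jlx; the actual directives and macro names there may differ.

```
%class Lexer
%type Token
%eofval{
  return new Token("Eof");
%eofval}
DIGIT=[0-9]
LETTER=[A-Za-z]
```

Here %class names the generated class, %type sets the return type of the scanning method, and %eofval{ ... %eofval} gives the value returned at end of file.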

The third section contains the actions - what the Lexer is supposed to do when it recognizes the spellings of specific tokens. In some of these actions, we make a call to yytext(), a method that returns a String with the spelling of the token just matched; the method is described in § 2.3.3.3.
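For instance, a rule-plus-action pair in this section might look like the following sketch. The macro names {LETTER} and {DIGIT} and the token-type strings are assumptions, not necessarily those used in P2.jlx; yytext() is the JLex accessor for the characters just matched.

```
{LETTER}({LETTER}|{DIGIT})*  { return new Token("Id", yytext()); }
"+"                          { return new Token("Plus", yytext()); }
```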