In this assignment you will build a lexer - a program that reads a text file and breaks it up into units called tokens. You will do this both from scratch and using a program that writes your lexer for you based on a regular expression grammar you provide.
Empty.nocaf
Algebra.decaf
Notice that there is a hierarchical structure to the files - that is, some of them are in subdirectories, as indicated by the indenting level of the lists in this section. When you download the files you should maintain this directory structure. One reason is that each problem involves creating a class called Lexer - and they are different classes. Another reason is that it will make the handed-in versions easier to grade. When you turn in your results, use the turnin command on the directory that contains all these files.
The classes in P1.java implement a lexer that recognizes the lexicon of Nocaf, an almost invisible subset of Java. You are to upgrade it to recognize the lexicon of Decaf, and to skip both // and /* comments. A correct solution will be able to read the provided file Algebra.decaf.
The Decaf lexicon is as follows:

Keywords:  class  else  if  int  boolean  return  static  while  true  false

Symbols:   ;  ,  (  )  {  }  =  +  -  *  /  %  ==  !=  ||  &&
Although we listed "true" and "false" with the keywords, formally they are not keywords. They are boolean literals.
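Since "true" and "false" sit alongside the keywords in the list, a natural way for a token to classify a finished spelling is a set lookup. The following is a hedged sketch of that idea only - TokenKind, KEYWORDS, and classify are illustrative names, not the assignment's actual code.

```java
import java.util.Set;

// Illustrative sketch of spelling-based classification; not the
// assignment's Token class.
public class TokenSketch {
    enum TokenKind { KEYWORD, BOOLEAN_LITERAL, IDENTIFIER, SYMBOL }

    static final Set<String> KEYWORDS =
        Set.of("class", "else", "if", "int", "boolean", "return", "static", "while");
    static final Set<String> BOOLEAN_LITERALS = Set.of("true", "false");

    static TokenKind classify(String spelling) {
        if (KEYWORDS.contains(spelling)) return TokenKind.KEYWORD;
        if (BOOLEAN_LITERALS.contains(spelling)) return TokenKind.BOOLEAN_LITERAL;
        if (spelling.matches("[A-Za-z][A-Za-z0-9]*")) return TokenKind.IDENTIFIER;
        return TokenKind.SYMBOL;     // everything else: operators, punctuation
    }

    public static void main(String[] args) {
        System.out.println(classify("while")); // KEYWORD
        System.out.println(classify("true"));  // BOOLEAN_LITERAL
        System.out.println(classify("x1"));    // IDENTIFIER
    }
}
```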
This problem shouldn't be too hard once you see how the basic framework works. The most difficult part will be to accommodate the multi-character operators. As it is now, when the Lexer encounters a non-letter, non-number character, it immediately grabs it, assumes it's done, and notifies the Token. This needs to change.
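One common fix is a one-character lookahead: after seeing a symbol character, peek at the next character and keep it only if the pair forms one of the two-character operators. The following is a hedged sketch of that idea, not the handout's Lexer; the method name and string-based interface are illustrative.

```java
// Illustrative lookahead sketch; not the assignment's Lexer code.
public class OpSketch {
    // Returns the operator starting at position i in src, preferring
    // the two-character operators ==, !=, ||, && over single characters.
    static String grabOperator(String src, int i) {
        char c = src.charAt(i);
        if (i + 1 < src.length()) {
            String two = "" + c + src.charAt(i + 1);
            if (two.equals("==") || two.equals("!=")
                    || two.equals("||") || two.equals("&&")) {
                return two;                // consume both characters
            }
        }
        return String.valueOf(c);          // fall back to a one-character token
    }

    public static void main(String[] args) {
        System.out.println(grabOperator("a==b", 1)); // ==
        System.out.println(grabOperator("a=b", 1));  // =
    }
}
```

The same idea carries over to a character-stream Lexer: the Scanner must support "pushing back" (or remembering) the one extra character that was read but not consumed.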
Set up the directory hierarchy and download the files. Then issue the following commands at the Unix prompt:
% cd <your Programming Assignment 1 directory>/p1
% javac P1.java
% java P1 ../Empty.nocaf
You should get the following output:
Class
"Empty"
L-brace
Static
Int
"nothing"
L-paren
R-paren
L-brace
R-brace
R-brace
The program has digested the input file, which contains a small source text written in Nocaf. The file P1.java defines three classes:

Scanner - This class takes care of reading in the source text and feeding its characters to its client, which in this case is Lexer.

Lexer - This class creates new Tokens and fills up their spelling field with characters provided by a Scanner. It has some primitive knowledge of the language: how to distinguish words, numbers, symbols, and whitespace.

Token - Once a Token's spelling is set, the Token decides for itself what kind of token it is and sets its type field accordingly. The details of the language's lexicon are in here.

We will be using this framework for subsequent stages of the programming project; you will be provided with a correct version that incorporates some other useful features. Most notable of these is keeping track of line and character numbers, which makes error messages a lot more useful.
Create a lexer for the same lexicon, but using the automated tool JLex.
The source code (in Java) and documentation for this tool are available at http://www.cs.princeton.edu/~appel/modern/java/JLex. The CS-160 home directory has a pre-compiled version of this tool.
The JLex specification file for recognizing the lexicon of Nocaf is provided for you as P2.jlx. It should be pretty easy to extend it to recognize Decaf; probably easier than your solution to Problem 1.
(The code for recognizing and skipping comments is already included for you. Explain the meaning of each term in the regular expressions for COMMENT1 and COMMENT2. Put this explanation in a file called explain and include it in the p2 directory when you hand in your assignment.)
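As a point of reference while writing that explanation, here is a hedged sketch in plain Java (java.util.regex, not JLex syntax) of what such comment patterns typically have to match. These patterns are illustrative only; they are not the COMMENT1 and COMMENT2 definitions in P2.jlx.

```java
import java.util.regex.Pattern;

// Illustrative comment-matching patterns; not the definitions in P2.jlx.
public class CommentSketch {
    // A // comment: two slashes, then any run of non-newline characters.
    static final Pattern LINE = Pattern.compile("//[^\\n]*");

    // A /* */ comment: slash-star, then any run of characters that never
    // begins the closing star-slash, then star-slash. The (?s) flag lets
    // . match newlines, so the comment may span several lines.
    static final Pattern BLOCK = Pattern.compile("(?s)/\\*((?!\\*/).)*\\*/");

    public static void main(String[] args) {
        System.out.println(LINE.matcher("// hi").matches());         // true
        System.out.println(BLOCK.matcher("/* a\n b */").matches());  // true
        System.out.println(BLOCK.matcher("/* a */ b */").matches()); // false
    }
}
```

The last case shows why the "no star-slash inside" restriction matters: without it, a greedy `.*` would swallow the first `*/` and run on to the last one.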
Do the following:
% cd <your Programming Assignment 1 directory>/p2
% setenv CLASSPATH ".:/fs/cs-cls/cs160/lib"
% java JLex.Main P2.jlx
% javac P2.jlx.java
% java P2 ../Empty.nocaf
You should get the same output you did when running the first program on the nocaf file, except with the additional token "Eof" at the end.
The documentation provided with this tool is not the greatest. (For example, § 2.3.2 says the characters /, ;, and = are metacharacters for its regular expressions, but doesn't say what for.) However, the template provided in P2.jlx gives you a good start. Here is a brief description of its contents.
The first section of the file (before the first %% marker) is Java source that will be copied verbatim into the Lexer that the tool generates. Here we define a simple Token type, with the same fields as the one used in Problem 1. But this is a "dumb" token, without any methods except constructors. Also provided in this section is a main method that instantiates the Lexer by passing in a FileInputStream. (That's undocumented; to figure it out you have to read the output Java code created by the tool.)
The second section contains directives - instructions about what sorts of regular expressions are to be recognized, and how the Java classes are to be named.
The third section contains the actions - what the Lexer is supposed to do when it recognizes the spellings of specific tokens. In some of these actions, we make a call to yytext(), a method that returns a String with the spelling of the token; the method is described in § 2.3.3.3.