Lex and Yacc in Java Part 1
Lex and Yacc in Java Part 1
Sign In |
Register
NEWS
Looking for lex and yacc for Java? You don't know Jack
How to get and use Sun's free automatic Java parser generator -- a unique new tool that's a must for Java compiler
developers
By Chuck Mcmanis
JavaWorld |
DEC 1, 1996 12:00 AM PT
Sun has released Jack, a new tool written in Java that automatically generates parsers by compiling a high-level
grammar specification stored in a text file. This article will serve as an introduction to this new tool. The first part of
the article covers a brief introduction to automatic parser generation, and my first experiences with them. Then the
article will focus on Jack and how you can use it to generate parsers and applications built with those parsers,
based on your high-level grammar.
In my first "real" job after college, I got an assigment to create a new graphics processing language to compile into
commands for a graphics coprocessor. I started with a freshly composed grammar and prepared to launch into the
multiweek project of putting together a compiler. Then a friend showed me the Unix utilities lex and yacc. Lex built
lexical analyzers from regular expressions, and yacc reduced a grammar specification into a table-driven compiler
that could produce code when it had successfully parsed productions from that grammar. I used lex and yacc, and
in less than a week my compiler was up and running! Later, the Free Software Foundation's GNU project produced
"improved" versions of lex and yacc -- named flex and bison -- for use on platforms that did not run a derivative of
the Unix operating system.
The world of automatic parser generation advanced again when Terrence Parr, then a student at Purdue University,
created the Purdue Compiler Construction Tool Set or PCCTS. Two components of PCCTS -- DFA and ANTLR --
provide the same functions as lex and yacc; however the grammars that ANTLR accepts are LL(k) grammars as
opposed to the LALR grammars used by yacc. Furthermore, the code that PCCTS generates is much more readable
than the code generated by yacc. By generating code that is easier to read, PCCTS makes it easier for a human
reading the code to understand what the various pieces are doing. This understanding can be essential when trying
to diagnose errors in the grammar specification. PCCTS quickly developed a following of folks who found its files
easier to use than yacc.
The power of automatic parser generation is that it allows users to concentrate on the grammar and not worry
about the correctness of the implementation. This can be a tremendous time-saver in both simple and complex
projects.
Jack (rhymes with yacc) is a parser generator, in the spirit of PCCTS, that Sun has released for free to the Java
programming community. Jack is an exceptionally easy tool to describe: Simply put, you give it a set of combined
grammatical and lexing rules in the form of a .jack file and run the tool, and it gives you back a Java class that will
parse that grammar. What could be easier?
[ Learn Java from beginning concepts to advanced design patterns in this comprehensive 12-part course!]
Getting ahold of Jack is also quite easy. First you download a copy from the Jack home page. This comes to you in
the form of a self-unpacking Java class called install. To install Jack you need to invoke this install class, which, on a
Windows 95 machine is done using the command: C:>java install.
The command shown above assumes that the java command is in your command path and that the class path has
been set up appropriately. If the above command did not work, or if you aren't sure whether or not you have things
set up properly, open an MS-DOS window by traversing the Start->Programs->MS-DOS Prompt menu items. If you
have the Sun JDK installed, you can type these commands:
If Symantec Cafe version 1.2 or later is installed, you can type these commands:
The class path should already be set up in a file called sc.ini in the bin directory of Cafe.
Next, type the java install command from above. The install program will ask you what directory you wish to install
into, and the Jack subdirectory will be created below that.
Using Jack
Jack is written entirely in Java, so having the Jack classes means that this tool is instantly available on every
platform that supports the Java virtual machine. However, it also means that on Windows boxes you have to run
Jack from the command line. Let's say you chose the directory name JavaTools when you installed Jack on your
system. To use Jack you will need to add Jack's classes to your class path. You can do this in your autoexec.bat file
or in your .cshrc file if you are a Unix user. The critical command is something like the line shown below:
Note that Symantec Cafe users can edit the sc.ini file and include the Jack classes there, or they can set
CLASSPATH explicitly as shown above.
Setting the environment variable as shown above puts the Jack classes in the CLASSPATH between "." (the current
directory) and the base system classes for Java. The main class for Jack is COM.sun.labs.jack.Main. Capitalization is
important! There are exactly four capital letters in the command ('C', 'O', 'M', and another 'M'). To run Jack manually,
type the command:
If you don't have the Jack files in your class path, you can use this command:
As you can see, this gets a bit long. To minimize typing, I put the invocation into a .bat file named Jack.bat. At some
point in the future, a simple C wrapper program will become available, perhaps even as you read this. Check out the
Jack home page for the availability of this and other programs.
When Jack is run, it creates several files in the current directory that you will later compile into your parser. Most
are either prefixed with the name of your parser or are common to all parsers. One of these, however,
ASCII_CharStream.java, may collide with other parsers, so it is probably a good idea to start off in a directory
containing only the .jack file you are going to use to generate the parser.
Once you run Jack, if the generation has gone smoothly you will have a bunch of .java files in the current directory
with various interesting names. These are your parsers. I encourage you to open them up with an editor and look
them over. When you are ready you can compile them with the command
C:> javac -d .
ParserName.java
where ParserName is the name you gave your parser in the input file. More on that in a bit. If all of the files for
your parser don't compile, you can use the brute force method of typing:
This will compile everything in the directory. At this point your new parser is ready to use.
options {
LOOKAHEAD = 1;
}
PARSER_BEGIN(Simple1)
public class Simple1 {
public static void main(String args[]) throws ParseError {
Simple1 parser = new Simple1(System.in);
parser.Input();
}
}
PARSER_END(Simple1)
The first few lines above describe options for the parser; in this case LOOKAHEAD is set to 1. There are other
options, such as diagnostics, Java Unicode handling, and so on, that also can be set here. Following the options
comes the base class of the parser. The two tags PARSER_BEGIN and PARSER_END bracket the class that becomes
the base Java code for the resulting parser. Note that the class name used in the parser specification must be the
same in the beginning, middle, and ending part of this section. In the example above, I've put the class name in
bold face to make this clear. As you can see in the code above, this class defines a static main method so that the
class can be invoked by the Java interpreter on the command line. The main method simply instantiates a new
parser with an input stream (in this case System.in) and then invokes the Input method. The Input method is a non-
terminal in our grammar, and it is defined in the form of an EBNF element. EBNF stands for Extended Backus-Naur
Form. The Backus-Naur form is a method for specifying context-free grammars. The specification consists of a
terminal on the left-hand side, a production symbol, which is typically "::=", and one or more productions on the
right-hand side. The notation used is typically something like this:
This would be read as, "The Keyword terminal is one of the string literals 'if', 'then', or 'else.'" In Jack, this form is
extended to allow the left-hand part to be represented by a method, and the alternate expansions may be
represented by regular expressions or other non-terminals. Continuing with our simple example, the file contains
the following definitions:
void Input() :
{}
{
MatchedBraces() "\n" <EOF>
}
void MatchedBraces() :
{}
{
"{" [ MatchedBraces() ] "}"
}
MatchedBraces "\n"
Input ::=
<EOF>
MatchedBraces ::= "{" [ MatchedBraces ] "}"
I've used italics to show the non-terminals on the right side of the productions and boldface to show literals. As you
can see, the grammar simply parses matched sets of brace "{" and "}" characters. There are two productions in the
Jack file to describe this grammar. The first terminal, Input, is defined by this definition to be three items in
sequence: a MatchedBraces terminal, a newline character, and an end-of-file token. The <EOF> token is defined by
Jack so that you don't have to specify it for your platform.
When this grammar is generated, the left-hand sides of the productions are turned into methods inside the Simple1
class; when compiled, the Simple1 class reads characters from System.in and verifies that they contain a matching
set of braces. This is accomplished by invoking the generated method Input, which is transformed by the generation
process into a method that parses an Input non-terminal. If the parse fails, the method throws the exception
ParseError, which the main routine can catch and then complain about if it so chooses.
Of course there's more. The block delineated by "{" and "}" after the terminal name -- which is empty in this
example -- can contain arbitrary Java code that is inserted at the front of the generated method. Then, after each
expansion, there is another optional block that can contain arbitrary Java code to be executed when the parser
successfully matches that expansion.
options {
LOOKAHEAD=1;
}
PARSER_BEGIN(Calc1)
public class Calc1 {
public static void main(String args[]) throws ParseError {
Calc1 parser = new Calc1(System.in);
while (true) {
System.out.print("Enter Expression: ");
System.out.flush();
try {
switch (parser.one_line()) {
case -1:
System.exit(0);
default:
break;
}
} catch (ParseError x) {
System.out.println("Exiting.");
throw x;
}
}
}
}
PARSER_END(Calc1)
The first part is nearly the same as Simple1, except that the main routine now calls the terminal one_line
repeatedly until it fails to parse. Next comes the following code:
IGNORE_IN_BNF :
{}
{
" " | "\r" | "\t"
}
TOKEN :
{}
{
< EOL: "\n" >
}
TOKEN : /* OPERATORS */
{}
{
< PLUS: "+" >
| < MINUS: "-" >
| < MULTIPLY: "*" >
| < DIVIDE: "/" >
}
TOKEN :
{}
{
< CONSTANT: ( <DIGIT> )+ >
| < #DIGIT: ["0" - "9"] >
}
These definitions cover the basic terminals in which the grammar is specified. The first, named IGNORE_IN_BNF, is a
special token. Any tokens read by the parser that match the characters defined in an IGNORE_IN_BNF token are
silently discarded. As you can see in our example, this causes the parser to ignore space characters, tabs, and
carriage return characters in the input.
Page 1 of 3 ▻
Recommended
Java Challengers: Method overloading in the JVM
WHITE PAPER
Choosing the Right In-Memory Computing Solution
CORE JAVA ENTERPRISE JAVA LEARN JAVA MOBILE JAVA
How-Tos Features News Blogs Resources Newsletters
ABOUT US CONTACT PRIVACY POLICY COOKIE POLICY MEMBER PREFERENCES ADVERTISING IDG CAREERS AD CHOICES E-COMMERCE LINKS