Compiler
Introduction
Some parts of the book describe lcc from the bottom up. For example,
the chapters on managing storage, strings, and symbol tables describe
functions that are at or near the ends of call chains. Little context is
needed to understand them.
Other parts of the book give a top-down presentation. For example,
the chapters on parsing expressions, statements, and declarations begin
with the top-level constructs. Top-down material presents some func-
tions or fragments well after the code that uses them, but material near
the first use tells enough about the function or fragment to understand
what’s going on in the interim.
Some parts of the book alternate between top-down and bottom-up
presentations. A less variable explanation order would be nice, but it’s
unattainable. Like most compilers, lcc includes mutually recursive func-
tions, so it’s impossible to describe all callees before all callers or all
callers before all callees.
Some fragments are easier to explain before you see the code. Others
are easier to explain afterward. If you need help with a fragment, don’t
struggle before scanning the text just before and after the fragment.
Most of the code for lcc appears in the text, but a few fragments are
used but not shown. Some of these fragments hold code that is omitted
to save space. Others implement language extensions, optional debug-
ging aids, or repetitious constructs. For example, once you’ve seen the
code that handles C’s for statement, the code that handles the do-while
statement adds little. The only wholesale omission is the explanation of
how lcc processes C’s initializers, which we skipped because it is long,
not very interesting, and not needed to understand anything else. Frag-
ments that are used but not defined are easy to identify: no page number
follows the fragment name.
Also omitted are assertions. lcc includes hundreds of assertions.
Most assert something that the code assumes about the value of a param-
eter or data structure. One is assert(0), which guarantees a diagnostic
and thus identifies states that are not supposed to occur. For example, if
a switch is supposed to have a bona fide case for all values of the switch
expression, then the default case might include assert(0).
The companion diskette is complete. Even the assertions and frag-
ments that are omitted from the text appear on the diskette. Many of
them are easily understood once the documented code nearby is under-
stood.
1.3 Overview
lcc transforms a source program to an assembler language program.
Following a sample program through the intermediate steps in this trans-
formation illustrates lcc’s major components and data structures. Each
step transforms the program into a different representation: prepro-
cessed source, tokens, trees, directed acyclic graphs, and lists of these
graphs are examples. The initial source code is:
# 1 "sample.c"
int round(f) float f; {
    return f + 0.5;
}
The lexical analyzer reads source text and produces tokens, which are
the basic lexical units of the language. For example, the expression
*ptr = 56; contains 10 characters or five tokens: *, ptr, =, 56, and
;. For each token, the lexical analyzer returns its token code and zero
or more associated values. The token codes for single-character tokens,
such as operators and separators, are the characters themselves. Defined
constants (with values that do not collide with the numeric values of sig-
nificant characters) are used for the codes of the tokens that can consist
of one or more characters, such as identifiers and constants.
For example, the statement *ptr = 56; yields the token stream shown
on the left below; the associated values, if there are any, are shown on
the right.
    '*'
    ID "ptr"     symbol-table entry for "ptr"
    '='
    ICON "56"    symbol-table entry for 56
    ';'
The token codes for the operators * and = are the operators themselves,
i.e., the numeric values of * and =, respectively, and they do not have
associated values. The token code for the identifier ptr is the value of
the defined constant ID, and the associated values are the saved copy
of the identifier string itself, i.e., the string returned by stringn, and a
symbol-table entry for the identifier, if there is one. Likewise, the integer
constant 56 returns ICON, and the associated values are the string "56"
and a symbol-table entry for the integer constant 56.
Keywords, such as for, are assigned their own token codes, which
distinguish them from identifiers.
The lexical analyzer also tracks the source coordinates for each token.
These coordinates, defined in Section 3.1, give the file name, line number,
and character index within the line of the first character of the token.
Coordinates are used to pinpoint the location of errors and to remember
where symbols are defined.
The lexical analyzer is the only part of the compiler that looks at each character of the source text. It is not unusual for lexical analysis to account for half the execution time of a compiler. Hence, speed is important. The lexical analyzer's main activity is moving characters, so minimizing the amount of character movement helps increase speed. This is done by dividing the lexical analyzer into two tightly coupled modules.
The input module, input.c, reads the input in large chunks into a buffer,
and the recognition module, lex.c, examines the characters to recognize
tokens.
6.1 Input
In most programming languages, input is organized in lines. Although
in principle, there is rarely a limit on line length, in practice, line length
is limited. In addition, tokens cannot span line boundaries in most lan-
guages, so making sure complete lines are in memory when they are
being examined simplifies lexical analysis at little expense in capability.
String literals are the one exception in C, but they can be handled as a
special case.
The input module reads the source in large chunks, usually much
larger than individual lines, and it helps arrange for complete tokens
to be present in the input buffer when they are being examined, except
identifiers and string literals. To minimize the overhead of accessing the
input, the input module exports pointers that permit direct access to the
input buffer:
⟨input.c exported data⟩+≡
extern unsigned char *cp;
extern unsigned char *limit;

[buffer diagram: cp marks the next unconsumed character; limit marks the sentinel newline \n at the end of the buffer]
where shading depicts the characters that have yet to be consumed and \n represents the newline. If fillbuf is called, it slides the unconsumed tail of the input buffer down and refills the buffer. The resulting state is

[buffer diagram: the moved tail followed by the newly read characters, with cp at the start of the tail and limit at the new sentinel newline]

where the darker shading differentiates the newly read characters from those moved by fillbuf. When a call to fillbuf reaches the end of the input, the buffer's state becomes

[buffer diagram: only the unconsumed tail remains, with cp at its start and limit at the sentinel newline just past it]

Finally, when nextline is called for the last sentinel at *limit, fillbuf sets cp equal to limit, which indicates end of file (after the first call to nextline). This final state is

[buffer diagram: cp and limit equal, both at the sentinel newline]
Input is read from the file descriptor given by infd; the default is zero,
which is the standard input. file is the name of the current input file;
line gives the location of the beginning of the current line, if it were
to fit in the buffer; and lineno is the line number of the current line.
The coordinates f, x, y of the token that begins at cp, where f is the file name, are thus given by file, cp-line, and lineno, where characters in the line are numbered beginning with zero. line is used only to compute the x coordinate, which counts tabs as single characters. firstfile gives the name of the first source file encountered in the input; it's used in error messages.
The input buffer itself is hidden inside the input module:
⟨input.c data⟩≡
static int bsize;
static unsigned char buffer[MAXLINE+1 + BUFSIZE+1];
BUFSIZE is the size of the input buffer into which characters are read, and MAXLINE is the maximum number of characters allowed in an unconsumed tail of the input buffer. fillbuf must not be called if limit-cp is greater than MAXLINE. The standard specifies that compilers need not handle lines that exceed 509 characters; lcc handles lines of arbitrary length, but, except for identifiers and string literals, insists that tokens not exceed 512 characters.
The value of bsize encodes three different input states: If bsize is less than zero, no input has been read or a read error has occurred; if bsize is zero, the end of input has been reached; and bsize is greater than zero when bsize characters have just been read. This rather complicated encoding ensures that lcc is initialized properly and that it never tries to read past the end of the input.
inputInit initializes the input variables and fills the buffer; nextline, shown here, advances cp to the first nonblank character of the next line, refilling the buffer when necessary:

⟨input.c functions⟩+≡
void nextline() {
    do {
        if (cp >= limit) {
            ⟨refill buffer 106⟩
            if (cp == limit)
                return;
        } else
            lineno++;
        for (line = (char *)cp; *cp == ' ' || *cp == '\t'; cp++)
            ;
    } while (*cp == '\n' && cp == limit);
    if (*cp == '#') {
        resynch();
        nextline();
    }
}
If cp is still equal to limit after filling the buffer, the end of the file has been reached. The do-while loop advances cp to the first nonwhite-space character in the line, treating sentinel newlines as white space. The last four lines of nextline check for resynchronization directives emitted by the preprocessor; see Exercise 6.2. inputInit and nextline call fillbuf to refill the input buffer:

⟨refill buffer 106⟩≡
    fillbuf();
    if (cp >= limit)
        cp = limit;

If the input is exhausted, cp will still be greater than or equal to limit when fillbuf returns, which leaves these variables set as shown in the last diagram on page 104. fillbuf does all of the buffer management and the actual input:
⟨input.c functions⟩+≡
void fillbuf() {
    if (bsize == 0)
        return;
    if (cp >= limit)
        cp = &buffer[MAXLINE+1];
    else
        ⟨move the tail portion 107⟩
    bsize = read(infd, &buffer[MAXLINE+1], BUFSIZE);
    if (bsize < 0) {
        error("read error\n");
        exit(1);
    }
    limit = &buffer[MAXLINE+1+bsize];
    *limit = '\n';
}
fillbuf reads BUFSIZE (or fewer) characters into the buffer beginning at position MAXLINE+1, resets limit, and stores the sentinel newline. If the input buffer is empty when fillbuf is called, cp is reset to point to the first new character. Otherwise, the tail limit-cp characters are moved so that the last character is in buffer[MAXLINE], and is thus adjacent to the newly read characters.
Notice the computation of line: It accounts for the portion of the current line that has already been consumed, so that cp-line gives the correct index of the character *cp.
6.2 Recognizing Tokens
There are two principal techniques for recognizing tokens: building a
finite automaton or writing an ad hoc recognizer by hand. The lexical
structure of most programming languages can be described by regular
expressions, and such expressions can be used to construct a determin-
istic finite automaton that recognizes and returns tokens. The advantage
of this approach is that it can be automated. For example, LEX is a pro-
gram that takes a lexical specification, given as regular expressions, and
generates an automaton and an appropriate interpreting program.
The lexical structure of most languages is simple enough that lexical
analyzers can be constructed easily by hand. In addition, automatically
generated analyzers, such as those produced by LEX, tend to be large
and slower than analyzers built by hand. Tools like LEX are very use-
ful, however, for one-shot programs and for applications with complex
lexical structures.
For C, tokens fall into the six classes defined by the following EBNF
grammar:
token:
    keyword
    identifier
    constant
    string-literal
    operator
    punctuator
punctuator: one of
    [ ] ( ) { } * , : = ; ...
White space (blanks, tabs, newlines, and comments) separates some tokens, such as adjacent identifiers, but is otherwise ignored except in string literals.
The lexical analyzer exports two functions and four variables:
⟨lex.c exported functions⟩≡
extern int getchr ARGS((void));
extern int gettok ARGS((void));
⟨token.h 109⟩+≡
yy(0,      42, 13, MUL, multree, ID,     "*")
yy(0,      43, 12, ADD, addtree, ID,     "+")
yy(0,      44,  1, 0,   0,       ',',    ",")
yy(0,      45, 12, SUB, subtree, ID,     "-")
yy(0,      46,  0, 0,   0,       '.',    ".")
yy(0,      47, 13, DIV, multree, '/',    "/")
xx(DECR,   48,  0, SUB, subtree, ID,     "--")
xx(DEREF,  49,  0, 0,   0,       DEREF,  "->")
xx(ANDAND, 50,  5, AND, andtree, ANDAND, "&&")
xx(OROR,   51,  4, OR,  andtree, OROR,   "||")
xx(LEQ,    52, 10, LE,  cmptree, LEQ,    "<=")
given by the values in the second column. token.h is read to define sym-
bols, build arrays indexed by token, and so forth, and using it guarantees
that such definitions are synchronized with one another. This technique
is common in assembler language programming.
Single-character tokens have yy lines and multicharacter tokens and
other definitions have xx lines. The first column in xx is the enumeration
identifier. The other columns give the identifier or character value, the
precedence if the token is an operator (Section 8.3), the generic operator (Section 5.5), the tree-building function (Section 9.4), the token's set (Section 7.6), and the string representation.
These columns are extracted for different purposes by defining the xx
and yy macros and including token.h again. The enumeration definition
above illustrates this technique; it defines xx so that each expansion defines one member of the enumeration. For example, the xx line for DECR expands to

    DECR=48,

and thus defines DECR to be an enumeration constant with the value 48. yy is defined to have no replacement, which effectively ignores the yy lines.
The global variable t is often used to hold the current token, so most
calls to gettok use the idiom
    t = gettok();

token, tsym, and src hold the values associated with the current token, if there are any. token is the source text for the token itself, and tsym is a Symbol for some tokens, such as identifiers and constants. src is the source coordinate for the current token.
gettok could return a structure containing the token code and the
associated values, or a pointer to such a structure. Since most calls to
gettok examine only the token code, this kind of encapsulation does
not add significant capability. Also, gettok is the most frequently called
function in the compiler; a simple interface makes the code easier to
read.
gettok recognizes a token by switching on its first character, which
classifies the token, and consuming subsequent characters that make up
the token. For some tokens, these characters are given by one or more
of the sets defined by map. map[c] is a mask that classifies character c
as a member of one or more of six sets:
⟨lex.c types⟩≡
enum { BLANK=01, NEWLINE=02, LETTER=04,
       DIGIT=010, HEX=020, OTHER=040 };

⟨lex.c macros⟩≡
#define MAXTOKEN 32
gettok begins by skipping over white space and then checking that there is at least one token in the input buffer. If there isn't, calling fillbuf ensures that there is. MAXTOKEN applies to all tokens except identifiers, string literals, and numeric constants; occurrences of these tokens that are longer than MAXTOKEN characters are handled explicitly in the code for those tokens. The standard permits compilers to limit string literals to 509 characters and identifiers to 31 characters. lcc increases these limits.
When control reaches this case, cp points to the character that follows
the newline; when nextline returns, cp still points to that character, and
cp is less than limit. End of file is the exception: here, cp equals limit.
Testing for this condition is rarely needed, because *cp will always be a
newline, which terminates the scans for most tokens.
The sections below describe the remaining cases. Recognizing the tokens themselves is relatively straightforward; computing the associated values for some tokens is what complicates each case.
6.3 Recognizing Keywords

keyword: one of
    auto      double    int       struct
    break     else      long      switch
    char      extern    return    union
    const     float     short     unsigned
    continue  for       signed    void
    default   goto      sizeof    volatile
    do        if        static    while
id labels the code in the next section that scans identifiers. If the token
is if or int, cp is updated and the appropriate token code is returned;
otherwise, the token is an identifier. For int, tsym holds the symbol-
table entry for the type int. The cases for the characters abcdefglrsuvw
are similar, and were generated automatically by a short program.
The code generated for these fragments is short and fast. For example,
on most machines, int is recognized by less than a dozen instructions,
many fewer than are executed when a table is searched for keywords,
even if perfect hashing is used.
6.4 Recognizing Identifiers

identifier:
    nondigit { nondigit | digit }
digit: one of
    0 1 2 3 4 5 6 7 8 9
nondigit: one of
    _
    a b c d e f g h i j k l m
    n o p q r s t u v w x y z
    A B C D E F G H I J K L M
    N O P Q R S T U V W X Y Z
The code echoes this syntax, but must also cope with the possibility of identifiers that are longer than MAXTOKEN characters and thus might be split across input buffers.

⟨gettok cases 112⟩+≡
case 'h': case 'j': case 'k': case 'm': case 'n': case 'o':
case 'p': case 'q': case 'x': case 'y': case 'z':
case 'A': case 'B': case 'C': case 'D': case 'E': case 'F':
case 'G': case 'H': case 'I': case 'J': case 'K':
case 'M': case 'N': case 'O': case 'P': case 'Q': case 'R':
case 'S': case 'T': case 'U': case 'V': case 'W': case 'X':
case 'Y': case 'Z': case '_':
id:
    ⟨ensure there are at least MAXLINE characters 115⟩
    token = (char *)rcp - 1;
    while (map[*rcp]&(DIGIT|LETTER))
        rcp++;
    token = stringn(token, (char *)rcp - token);
    ⟨tsym ← type named by token 115⟩
    cp = rcp;
    return ID;
All identifiers are saved in the string table. At the entry to this and all cases, both cp and rcp have been incremented past the first character of the token. If the input buffer holds less than MAXLINE characters, it is refilled:

⟨ensure there are at least MAXLINE characters 115⟩≡
    if (limit - rcp < MAXLINE) {
        cp = rcp - 1;
        fillbuf();
        rcp = ++cp;
    }

If token names a type, tsym is set to the symbol-table entry for that type, and tsym->sclass will be equal to TYPEDEF. Otherwise, tsym is null or the identifier isn't a type name.
6.5 Recognizing Numbers

constant:
    floating-constant
    integer-constant
    enumeration-constant
    character-constant
enumeration-constant:
    identifier
The code for identifiers shown in the previous section handles enumera-
tion constants, and the code in Section 6.6 handles character constants.
The lexical analyzer returns the token code ID and sets tsym to the
symbol-table entry for the enumeration constant. The caller checks for
As for identifiers, this case begins by ensuring that the input buffer holds
at least MAXLINE characters, which permits the code to look ahead, as the
test for hexadecimal constants illustrates.
The fragments for the three kinds of integer constant set n to the value
of the constant. They must not only recognize the constant, but also
ensure that the constant is within the range of representable integers.
Recognizing decimal constants illustrates this processing. The syntax
for decimal constants is:
decimal-constant:
    nonzero-digit { digit }
nonzero-digit: one of
    1 2 3 4 5 6 7 8 9
The code accumulates the decimal value in n by repeated multiplications:
At each step, overflow will occur if 10*n + d > UINT_MAX, where UINT_MAX is the value of the largest representable unsigned number. Rearranging this inequality gives the test shown above, which looks before it leaps into computing the new value of n. overflow is set to one if the constant overflows. icon handles the optional suffixes.
A decimal constant is the prefix of a floating constant if the next character is a period or an exponent indicator.
tval serves only to provide the type and value of a constant to gettok's caller. The caller must lift the relevant data before the next call to gettok.
⟨lex.c functions⟩+≡
static Symbol icon(n, overflow, base)
unsigned n; int overflow, base; {
    if ((*cp=='u'||*cp=='U') && (cp[1]=='l'||cp[1]=='L')
    ||  (*cp=='l'||*cp=='L') && (cp[1]=='u'||cp[1]=='U')) {
        tval.type = unsignedlong;
        cp += 2;
    } else if (*cp == 'u' || *cp == 'U') {
        tval.type = unsignedtype;
        cp += 1;
    } else if (*cp == 'l' || *cp == 'L') {
        if (n > (unsigned)LONG_MAX)
            tval.type = unsignedlong;
        else
            tval.type = longtype;
        cp += 1;
    } else if (base == 10 && n > (unsigned)LONG_MAX)
        tval.type = unsignedlong;
    else if (n > (unsigned)INT_MAX)
        tval.type = unsignedtype;
    else
        tval.type = inttype;
    if (overflow) {
        warning("overflow in constant '%S'\n", token,
            (char*)cp - token);
        n = LONG_MAX;
    }
    ⟨set tval's value 118⟩
    ppnumber("integer");
    return &tval;
}
If both U and L appear, n is an unsigned long, and if only U appears, n is an unsigned. If only L appears, n is a long unless it's too big, in which case it's an unsigned long. n is also an unsigned long if it's an unsuffixed decimal constant and it's too big to be a long. Unsuffixed octal and hexadecimal constants are ints unless they're too big, in which case they're unsigneds. The format code %S prints a string like printf's %s, but consumes an additional argument that specifies the length of the string. It can thus print strings that aren't terminated by a null character.
The types int, long, and unsigned are different types, but lcc insists that they all have the same size. This constraint simplifies the tests shown above and the code that sets tval's value:
⟨set tval's value 118⟩≡
    if (isunsigned(tval.type))
        tval.u.c.v.u = n;
    else
        tval.u.c.v.i = n;
Relaxing this constraint would complicate this code and the tests above.
For example, the standard specifies that the type of an unsuffixed dec-
imal constant is int, long, or unsigned long, depending on its value. In
lcc, ints and longs can accommodate the same range of integers, so an
unsuffixed decimal constant is either int or unsigned.
A numeric constant is formed from a preprocessing number, which is
the numeric constant recognized by the C preprocessor. Unfortunately,
the standard specifies preprocessing numbers that are a superset of the
integer and floating constants; that is, a valid preprocessing number may
not be a valid numeric constant. 123.4.5 is an example. The prepro-
cessor deals with such numbers too, but it may pass them on to the
compiler, which must treat them as single tokens and thus must catch
preprocessing numbers that aren't valid constants.
The syntax of a preprocessing number is
pp-number:
    [ . ] digit { digit | . | nondigit | E sign | e sign }
sign: - | +
ppnumber backs up one character and skips over the characters that may
comprise a preprocessing number; if it scans past the end of the numeric
token, there's an error.
fcon recognizes the suffix of floating constants and is called in two places. One of the calls is shown above in ⟨check for floating constant⟩. The other call is from the gettok case for '.':
⟨gettok cases 112⟩+≡
case '.':
    if (rcp[0] == '.' && rcp[1] == '.') {
        cp += 2;
        return ELLIPSIS;
    }
    if ((map[*rcp]&DIGIT) == 0)
        return '.';
    ⟨ensure there are at least MAXLINE characters 115⟩
    cp = rcp - 1;
    token = (char *)cp;
    tsym = fcon();
    return FCON;
floating-constant:
    fractional-constant [ exponent-part ] [ floating-suffix ]
    digit-sequence exponent-part [ floating-suffix ]
fractional-constant:
    [ digit-sequence ] . digit-sequence
    digit-sequence .
exponent-part:
    e [ sign ] digit-sequence
    E [ sign ] digit-sequence
digit-sequence:
    digit { digit }
floating-suffix: one of
    f l F L

fcon recognizes a floating-constant, converts the token to a double value, and determines tval's type and value:
⟨lex.c functions⟩+≡
static Symbol fcon() {
    ⟨scan past a floating constant 121⟩
    errno = 0;
    tval.u.c.v.d = strtod(token, NULL);
    if (errno == ERANGE)
        ⟨warn about overflow 120⟩
    ⟨set tval's type and value 121⟩
    ppnumber("floating");
    return &tval;
}
If the constant is out of range, strtod sets the global variable errno to ERANGE, as stipulated by the ANSI C specification for the C library.
A floating constant follows the syntax shown above, and is recognized by ⟨scan past a floating constant⟩.

6.6 Recognizing Character Constants and Strings

lcc represents wide characters as plain characters, and thus uses unsigned char for the type wchar_t. The syntax is
character-constant:
    [ L ] ' c-char { c-char } '
c-char:
    any character except ', \, or newline
    escape-sequence
escape-sequence: one of
    \' \" \? \\ \a \b \f \n \r \t \v
    \ octal-digit [ octal-digit [ octal-digit ] ]
    \x hexadecimal-digit { hexadecimal-digit }

string-literal:
    [ L ] " { s-char } "
s-char:
    any character except ", \, or newline
    escape-sequence
String literals can span more than one line if a backslash immediately precedes the newline. Adjacent string literals are automatically concatenated together to form a single literal. In a proper ANSI C implementation, this line splicing and string literal concatenation is done by the preprocessor, and the compiler sees only single, uninterrupted string literals. lcc implements line splicing and concatenation for string literals anyway, so that it can be used with pre-ANSI preprocessors.
Implementing these features means that string literals can be longer than MAXLINE characters, so ⟨ensure there are at least MAXLINE characters⟩ cannot be used to ensure that a sequence of adjacent entire string literals appears in the input buffer. Instead, the code must detect the newline at limit and call nextline explicitly, and it must copy the literal into a private buffer.
⟨gettok cases 112⟩+≡
scon:
case '\'': case '"': {
    static char cbuf[BUFSIZE+1];
    char *s = cbuf;
    int nbad = 0;
    *s++ = *--cp;
    do {
        cp++;
        ⟨scan one string literal 123⟩
        if (*cp == cbuf[0])
            cp++;
        else
The outer do-while loop gathers up adjacent string literals, which are identified by their leading double quote character, into cbuf, and reports those that are too long. The leading character also determines the type of the associated value and gettok's return value.
⟨scan one string literal 123⟩≡
    while (*cp != cbuf[0]) {
        int c;
        if (map[*cp]&NEWLINE) {
            if (cp < limit)
                break;
            cp++;
            nextline();
            if (⟨end of input 112⟩)
                break;
            continue;
        }
        c = *cp++;
        if (c == '\\') {
            if (map[*cp]&NEWLINE) {
                if (cp < limit)
                    break;
                cp++;
                nextline();
            }
            if (limit - cp < MAXTOKEN)
                fillbuf();
            c = backslash(cbuf[0]);
        } else if (map[c] == 0)
            nbad++;
        if (s < &cbuf[sizeof cbuf] - 2)
            *s++ = c;
    }
If *limit is a newline, it serves only to terminate the buffer, and is thus ignored unless there's no more input. Other newlines (those for which cp is less than limit) and the one at the end of file terminate the while loop without advancing cp. backslash interprets the escape sequences described above; see Exercise 6.10. nbad counts the number of non-ANSI characters that appear in the literal; lcc's -A -A option causes warnings about literals that contain such characters or that are longer than ANSI's 509-character guarantee.
Further Reading
The input module is based on the design described by Waite (1986). The difference is that Waite's algorithm moves one partial line instead of potentially several partial lines or tokens, and does so after scanning the first newline in the buffer. But this operation overwrites storage before the buffer when a partial line is longer than a fixed maximum. The algorithm above avoids this problem, but at the per-token cost of comparing limit-cp with MAXTOKEN.
Lexical analyzers can be generated from a regular-expression specifi-
cation of the lexical structure of the language. LEX (Lesk 1975), which
is available on UNIX, is perhaps the best known example. Schreiner and
Friedman (1985) use LEX in their sample compilers, and Holub (1990) de-
tails an implementation of a similar tool. More recent generators, such
as flex, re2c (Bumbulis and Cowan 1993), and ELI s scanner genera-
tor (Gray et al. 1992; Heuring 1986), produce lexical analyzers that are
much faster and smaller than those produced by LEX. On some comput-
ers, ELI and re2c produce lexical analyzers that are faster than lcc's. ELI originated some of the techniques used in lcc's gettok.
A perfect hash function is one that maps each word from a known
set into a different hash number (Cichelli 1980; Jaeschke and Osterburg
1980; Sager 1985). Some compilers use perfect hashing for keywords,
but the hashing itself usually takes more instructions than lcc uses to
recognize keywords.
lcc relies on the library function strtod to convert the string representation of a floating constant to its corresponding double value. Doing this conversion as accurately as possible is complicated; Clinger (1990) shows that it may require arithmetic of arbitrary precision in some cases. Many implementations of strtod are based on Clinger's algorithm. The opposite problem, converting a double to its string representation, is just as laborious. Steele and White (1990) give the gory details.
Exercises
6.1 What happens if a line longer than BUFSIZE characters appears in the input? Are zero-length lines handled properly?

6.2 The C preprocessor emits lines of the form

    # n "file"
    #line n "file"
    #line n

These lines are used to reset the current line number and file name to n and file, respectively, so that error messages refer to the correct file. In the third form, the current file name remains unchanged. resynch, called by nextline, recognizes these lines and resets file and lineno accordingly. Implement resynch.
6.5 What happens when lcc reads an identifier longer than MAXLINE characters?

6.6 Implement int getchr(void).

6.7 Try perfect hashing for the keywords. Does it beat the current implementation?

6.8 The syntax for octal constants is

    octal-constant:
        0 { octal-digit }
    octal-digit: one of
        0 1 2 3 4 5 6 7

Write ⟨octal constant⟩. Be careful; an octal constant is a valid prefix of a floating constant, and octal constants can overflow.

6.9 The syntax for hexadecimal constants is

    hexadecimal-constant:
        ( 0x | 0X ) hexadecimal-digit { hexadecimal-digit }
    hexadecimal-digit: one of
        0 1 2 3 4 5 6 7 a b c d e f A B C D E F

Write ⟨hexadecimal constant⟩. Don't forget to handle overflow.

6.10 Implement

    ⟨lex.c prototypes⟩≡
    static int backslash ARGS((int q));