Tokenizing Arithmetic expressions - calculator
Introduction
This is the first post of a four part series about implementing
and understanding the steps for interpreting arithmetic
expressions. The series explains key concepts such as lexical
analysis, parsing / building the AST, walking the AST /
flattening it to bytecode, bytecode virtual machines and TDD,
centered around compilers and interpreters.
Calculator
Simple educational calculator with lexer, parser and bytecode
vm
This project is meant to explain concepts used when
interpreting programming languages, such as:
tokenizing characters to tokens
parsing tokens into an abstract syntax tree
tree-walk interpreting
transforming the abstract syntax tree to bytecode
interpreting bytecode using a virtual machine
Usage
$ calc "1+1*12.1/5"
3.4
$ cat calculations.txt
# operations
1+1
1-1
1*1
1/1
# chained operations
1*1+1-1/1
$ cat calculations.txt | calc
2
0
1
1
1
How this project works
Compiling the project
Compiling this project requires Go version 1.20:
$ go build .
This produces an executable for your architecture and
operating system, which can be started:
$ ./calc
no input given
$ ./calc "1+1"
The next command supplies calc with 2+1*2 and promptly
executes the expression:
$ ./calc "2+1*2"
index |     type | raw
    0 |   NUMBER | 2
    1 |     PLUS | +
    2 |   NUMBER | 1
    3 | ASTERISK | *
    4 |   NUMBER | 2
    5 |      EOF | EOF
+
*
2
1
2
OP_LOAD :: 2.000000
OP_STORE :: 1.000000
OP_LOAD :: 1.000000
OP_MULTIPY :: 1.000000
OP_STORE :: 1.000000
OP_LOAD :: 2.000000
OP_ADD :: 1.000000
=> 4
The first output is the list of tokens generated by lexical
analysis, the second is the abstract syntax tree the parser
builds, and the third traces the steps the virtual machine
takes to execute the expression. The last output is the
resulting number.
This first article contains the introduction to our problem
domain, the setup of our project, the basics of TDD and the
first steps towards interpreting arithmetic expressions:
tokenizing our input / performing lexical analysis.
The second article will be centered around converting the list
of tokens we created in the first article into an abstract
syntax tree (AST for short).
The third article will be focused on walking the abstract syntax
tree and converting nodes into a list of instructions for our
virtual machine.
The fourth and last article will consist of implementing the
bytecode virtual machine and executing expressions
compiled to bytecode.
Problem domain
The expressions we want to be able to execute with our
interpreter form the smallest subset of a programming
language I could imagine, thus our problem domain is defined
by a subset of arithmetic expressions:
addition
subtraction
multiplication
division
We will also support braces that can be used to indicate
precedence, which we will talk about in the second post of
this series.
Expressions
Some examples of expressions we will accept:
# comments
# addition
1029+0.129
# =1029.129
# subtraction
5_120_129-5_120_128
# =1
# multiplication
5e12*3
# =15000000000000
# division
1/2
# =0.5
# braces
(1+1)/2
# =1
Interpreter design
There are several well-established ways of interpreting
programming languages. Let's take a look at the stages an
interpreter commonly performs before the passed in source
code is interpreted.
Interpreter Stages
1. Lexical analysis -> the process of recognizing structures
such as numbers, identifiers or symbols in the source
code and converting them into a list of tokens; often
referred to as scanning, lexing or tokenizing.
2. Parsing -> refers to the process of detecting precedence
and checking whether the input conforms to the defined
grammar. In our case the parser analyses the output of
our previous stage and builds an abstract syntax tree
while taking operator precedence into account.
3. Evaluating -> commonly means walking the tree starting
from the leaves, computing a value for each node and
exiting upon computing the value of the root.
For our interpreter we decided to use the idea of bytecode
interpreters, thus splitting the third step into two sub-steps:
1. Compiling to bytecode -> Walking the tree and compiling
each node into bytecode
2. Executing bytecode -> Iterating over the list of bytecode
instructions and executing them
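To make this split a bit more tangible, here is a rough sketch of how such bytecode instructions could be represented in Go. The names and layout are assumptions for illustration only; parts three and four of the series define the actual implementation.
// sketch.go (purely illustrative, not part of the project)
package main

type Operation byte

const (
	OP_LOAD Operation = iota
	OP_STORE
	OP_ADD
	OP_MULTIPLY
)

// an instruction pairs an operation with a single argument,
// mirroring the "OP_... :: argument" trace lines further below
type Instruction struct {
	Op  Operation
	Arg float64
}

// the compiler walks the AST and emits a flat []Instruction,
// which the virtual machine then executes one by one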
Example
Consider the following expression and let's visualize the stages
using it as an example:
1.025*3+1
Lexical analysis
Performing the first stage converts the input from a character
stream into a list of structures:
token := []Token{
	Token{Type: NUMBER, Raw: "1.025"},
	Token{Type: ASTERISK, Raw: "*"},
	Token{Type: NUMBER, Raw: "3"},
	Token{Type: PLUS, Raw: "+"},
	Token{Type: NUMBER, Raw: "1"},
}
Parsing
We now build an abstract syntax tree out of the list of tokens
we get from the previous stage:
ast := Addition{
	Token: Token{Type: PLUS, Raw: "+"},
	Left: Multiplication{
		Token: Token{Type: ASTERISK, Raw: "*"},
		Left: Number{
			Token: Token{Type: NUMBER, Raw: "1.025"},
		},
		Right: Number{
			Token: Token{Type: NUMBER, Raw: "3"},
		},
	},
	Right: Number{
		Token: Token{Type: NUMBER, Raw: "1"},
	},
}
Notice the depth of the tree: the deeper a node sits, the
earlier it is compiled to bytecode, thereby respecting operator
precedence. See below for a visual explanation:
(Figures: the initial AST; the AST after the multiplication is
evaluated; the AST after the addition is evaluated, yielding 4.075.)
Compiling to bytecode
We use the AST we got from the previous step to compile each
node to a list of bytecode instructions. The bottom-most
nodes, commonly referred to as leaves, are all numbers, thus
we will start there.
The bytecode VM we want to implement has a list of registers,
comparable to the CPU registers of a real machine. We can
load and manipulate values in these registers. In the third and
fourth part of this series we will go into great depth on
registers, bytecode and virtual machines. For now, simply
know that there are registers, that we can manipulate them,
and that our VM accepts an instruction together with an
argument.
Let's now take a look at the bytecode our previous example
compiles to:
;; multiplication
;; loading 1.025 into register 0
OP_LOAD :: 1.025000
;; moving 1.025 from register 0 to register 1
OP_STORE :: 1.000000
;; loading 3 into register 0
OP_LOAD :: 3.000000
;; multiplying the value of register 0
;; with the value of register 1
OP_MULTIPY :: 1.000000
;; storing the result of the
;; multiplication in register 1
OP_STORE :: 1.000000
;; addition
;; loading 1 into register 0
OP_LOAD :: 1.000000
;; adding the value of register 1
;; to the value of register 0
OP_ADD :: 1.000000
The left-hand side of each line is the operation the virtual
machine executes, the right-hand side is the argument of the
operation; the sides are separated by ::.
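As an aside, the six decimal places are simply Go's default formatting for the %f verb; a trace line in this shape could be produced with something like the following (an assumption about the formatting, not necessarily the project's exact code):
package main

import "fmt"

func main() {
	// %f prints a float64 with six decimal places by default,
	// which is where arguments like "1.000000" come from
	fmt.Printf("%s :: %f\n", "OP_LOAD", 2.0) // OP_LOAD :: 2.000000
}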
This should suffice as a high-level overview of the steps we
take to get from the source code to the result of our
expression: tokenizing, parsing, compiling to bytecode and
executing that bytecode.
Project setup
All code snippets used in this series start with a comment
specifying the file they belong to. Code not relevant to the
current topic is omitted but still indicated using [...].
// main.go
package main
// [...]
func main() { }
If a new file should be created it will be explicitly stated.
Code snippets starting with a $ must be executed in a shell:
$ echo "This is a shell"
Creating a directory for our project:
$ mkdir calc
Entry point
Using Go we can start with the bare minimum a project
requires:
// main.go
package main

import "fmt"

func main() {
	fmt.Println("Hello World!")
}
Running the above via go run . requires the creation of the
project's go.mod file:
Initialising the project
$ go mod init calc
Building and running the source code
$ go run .
Hello World!
Test driven development
Info
Test driven development refers to the process of defining the
problem domain of a function, creating the corresponding
tests, preferably covering as many edge cases as possible, and
incrementally building up the implementation of the function
until all test cases are satisfied.
As we are implementing an interpreter, both the input to our
function and its output are known and therefore easily
representable with tests, which screams TDD: write the tests
and iterate until all of them pass. We will create our tests
once we have defined the different kinds a token can
represent and the Token structure.
Tokenising
Leaving the above behind, let's now get to the gist of this part
of the series: the tokeniser. Our main goal is to step through
the input source code, convert it into tokens and afterwards
spit out the resulting list of tokens.
Let's get started: create a new file in the project directory,
beside main.go and go.mod, called lexer.go:
// lexer.go
package main
import (
	"bufio"
	"io"
	"log"
	"strings"
)
For now this will be enough; we will fill this file with content
in the following sections. (Note that Go refuses to compile
files with unused imports, so if you want every intermediate
step to build, add log and strings only once they are actually
used further below.)
Token and Types of Token
In the classical sense, a lexical token is a list of characters
with an assigned meaning, as seen in the first step of the
example above.
To define the meanings we can attach to such a list of
characters, we declare the possible token types we want to
support in our interpreter:
// lexer.go
package main
const (
	TOKEN_UNKNOWN = iota + 1
	TOKEN_NUMBER

	// symbols
	TOKEN_PLUS
	TOKEN_MINUS
	TOKEN_ASTERISK
	TOKEN_SLASH

	// structure
	TOKEN_BRACE_LEFT
	TOKEN_BRACE_RIGHT

	TOKEN_EOF
)
We now define the structure holding a detected token,
consisting of its type and its raw value:
// lexer.go
package main
// [...] token kind definition
type Token struct {
	Type int
	Raw  string
}
We defined the structure to hold tokens we found in the
source code and their types.
Tests
Now let's get started with writing tests. We create a new file
postfixed with _test.go. This lexer_test.go file contains all tests
for the tokeniser and lives beside all previously created files
in the root of the directory.
So let's create the foundation for our tests - we will make use
of an idea called table driven tests:
// lexer_test.go
package main

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{}

	for _, test := range tests {
		t.Run(test.Name, func(t *testing.T) {
			in := strings.NewReader(test.In)
			out := NewLexer(in).Lex()
			assert.EqualValues(t, test.Out, out)
		})
	}
}
We use the assert.EqualValues function to compare the
expected and the actual token slices.
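The assert helpers are not part of Go's standard library; they come from the github.com/stretchr/testify module. Assuming it is not already listed in go.mod, it can be added with:
$ go get github.com/stretchr/testify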
Let's add our first test - an edge case - specifically the case of
an empty input, for which we expect a single Token with
Type: TOKEN_EOF in the resulting token list.
// lexer_test.go
package main

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		{
			Name: "empty input",
			In:   "",
			Out: []Token{
				{Type: TOKEN_EOF, Raw: "TOKEN_EOF"},
			},
		},
	}

	for _, test := range tests {
		t.Run(test.Name, func(t *testing.T) {
			in := strings.NewReader(test.In)
			out := NewLexer(in).Lex()
			assert.EqualValues(t, test.Out, out)
		})
	}
}
Running our tests with go test ./... -v will result in an error
simply because we have not yet defined our Lexer:
$ go test ./... -v
# calc [calc.test]
./lexer_test.go:35:11: undefined: NewLexer
FAIL calc [build failed]
FAIL
Debugging
If we try to print our Token structure we will see the
Token.Type as an integer, for example:
package main

import "fmt"

func main() {
	t := Token{Type: TOKEN_NUMBER, Raw: "12"}
	fmt.Printf("Token{Type: %d, Raw: %s}\n", t.Type, t.Raw)
}
This would of course not result in the output we want, due to
the enum defining token types as integers:
$ go run .
Token{Type: 2, Raw: 12}
Therefore we add the TOKEN_LOOKUP hash map:
// lexer.go
package main
// [...] imports
// [...] token types generation
var TOKEN_LOOKUP = map[int]string{
	TOKEN_UNKNOWN:     "UNKNOWN",
	TOKEN_NUMBER:      "TOKEN_NUMBER",
	TOKEN_PLUS:        "TOKEN_PLUS",
	TOKEN_MINUS:       "TOKEN_MINUS",
	TOKEN_ASTERISK:    "TOKEN_ASTERISK",
	TOKEN_SLASH:       "TOKEN_SLASH",
	TOKEN_BRACE_LEFT:  "TOKEN_BRACE_LEFT",
	TOKEN_BRACE_RIGHT: "TOKEN_BRACE_RIGHT",
	TOKEN_EOF:         "EOF",
}
Tip
With vim the above is extremely easy to generate: simply copy
the previously defined token types, paste them into the map,
remove = iota + 1, white space and comments. Afterwards
mark them again with Shift+v and run
:'<,'>s/\([A-Z_]\+\)/\1: "\1",. This creates a capture group for
one or more upper case characters and underscores; the group
is reused in the replacement part of the substitute command
(the second part of the command, split by /), turning each
name into a map entry and thus filling the map.
If we now update our previous example to use the new
TOKEN_LOOKUP map, we can see that it works as intended:
package main

import "fmt"

func main() {
	t := Token{Type: TOKEN_NUMBER, Raw: "12"}
	fmt.Printf("Token{Type: %s, Raw: %s}\n",
		TOKEN_LOOKUP[t.Type], t.Raw)
}
$ go run .
Token{Type: TOKEN_NUMBER, Raw: 12}
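As an optional refinement that is not part of the project as shown here, one could also satisfy the fmt.Stringer interface so a Token can be passed directly to %s verbs or fmt.Println; this would additionally require importing fmt in lexer.go:
// lexer.go (optional, not used in the rest of this series)

// String implements fmt.Stringer, so fmt.Println(t) and %s
// print a readable representation of the token
func (t Token) String() string {
	return fmt.Sprintf("Token{Type: %s, Raw: %s}", TOKEN_LOOKUP[t.Type], t.Raw)
}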
Lexer overview
After establishing our debugging capabilities we can now move
on to creating the Lexer and defining our tokeniser's API:
// lexer.go
package main

// [...] imports, token types, token struct, TOKEN_LOOKUP map

type Lexer struct {
	scanner bufio.Reader
	cur     rune
}

func NewLexer(reader io.Reader) *Lexer { }

func (l *Lexer) Lex() []Token { }

func (l *Lexer) number() Token { }

func (l *Lexer) advance() { }
The Lexer structure holds a scanner that we create in the
NewLexer function. This function accepts an unbuffered
reader, which we wrap into a buffered reader for stepping
through the source in an optimized fashion, and it returns a
Lexer structure. The cur field holds the current character.
The heart of the tokeniser is the Lexer.Lex method. It iterates
over all characters in the buffered reader and tries to
recognize structures.
The Lexer.number method is called when a number is
detected; it iterates until the current character is no longer
part of a number and returns a Token structure.
Lexer.advance requests the next character from the buffered
scanner and sets Lexer.cur to the resulting character.
Number vs integer vs digit -> Here I define a number as one
or more characters between 0 and 9. I extend this definition
with e, . and _ between the first digit and all following digits.
Thus I consider the following numbers as valid for this
interpreter (a small sketch of this definition in code follows
the list):
1e5
12.5
0.5
5_000_000
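Expressed as code, the definition above could look like the following hypothetical helpers; the lexer implemented below inlines these checks instead of using named functions:
// hypothetical helpers, only to illustrate the definition above -
// the lexer implemented below inlines these checks
func isDigit(r rune) bool {
	return r >= '0' && r <= '9'
}

func isNumberRune(r rune) bool {
	return isDigit(r) || r == '.' || r == '_' || r == 'e'
}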
Creating the lexer
As introduced before, the NewLexer function creates the lexer:
// lexer.go
package main

// [...] imports, token types, token struct, TOKEN_LOOKUP map

type Lexer struct {
	scanner bufio.Reader
	cur     rune
}

func NewLexer(reader io.Reader) *Lexer {
	l := &Lexer{
		scanner: *bufio.NewReader(reader),
	}
	l.advance()
	return l
}
This function accepts a reader, creates a new Lexer structure,
wraps the reader in a buffered reader, assigns it to the Lexer
structure and afterwards invokes the Lexer.advance method.
Advancing in the Input
Stepping through the source code is as easy as requesting a
new character from our buffered reader via the
bufio.Reader.ReadRune() method:
// lexer.go
package main

// [...]

func (l *Lexer) advance() {
	r, _, err := l.scanner.ReadRune()
	if err != nil {
		l.cur = 0
	} else {
		l.cur = r
	}
}
The ReadRune method returns an error once the end of the
input is hit; to indicate this to our Lexer.Lex method we set
the Lexer.cur field to 0.
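A possible refinement, not used in the rest of this series, would be to treat only io.EOF as the expected end of input and fail loudly on any other error; this would additionally require importing errors in lexer.go:
// lexer.go (a possible refinement, not used in this series)
func (l *Lexer) advance() {
	r, _, err := l.scanner.ReadRune()
	if err != nil {
		// io.EOF is expected at the end of the input,
		// anything else indicates a real problem
		if !errors.Is(err, io.EOF) {
			log.Fatalf("failed to read the next character: %s", err)
		}
		l.cur = 0
		return
	}
	l.cur = r
}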
Tip
End of file is often referred to as EOF.
We will now focus on the heart of the tokeniser: Lexer.Lex():
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	t := make([]Token, 0)
	for l.cur != 0 {
		l.advance()
	}
	t = append(t, Token{
		Type: TOKEN_EOF,
		Raw:  "TOKEN_EOF",
	})
	return t
}
We first create a new slice of type []Token, which we will fill
with the tokens we find while stepping through the source
code. The loop iterates until we hit EOF, repeatedly calling
*Lexer.advance(). To indicate the end of our token list we
append a token of type TOKEN_EOF to the slice t.
After defining the NewLexer and the *Lexer.Lex we can try
running our tests again:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
PASS
ok calc 0.002s
Thus we know our lexer works correctly for empty inputs.
Ignoring white space
Every good programming language ignores white space and so
do we (looking at you, Python). White space is commonly
defined as a newline '\n', a carriage return '\r', a tab '\t' or a
space ' '.
Let's add a new test case called whitespace to our tests:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "whitespace",
			In:   "\r\n\t ",
			Out: []Token{
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
Having defined what we want as the output, let's get started
with ignoring white space. To check whether the current
character matches any of the above, we introduce a switch
statement:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	t := make([]Token, 0)
	for l.cur != 0 {
		switch l.cur {
		case ' ', '\n', '\t', '\r':
			l.advance()
			continue
		}
		l.advance()
	}
	t = append(t, Token{
		Type: TOKEN_EOF,
		Raw:  "TOKEN_EOF",
	})
	return t
}
Let's run our tests and check if everything worked out the way
we wanted it to:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
PASS
ok calc 0.001s
Seems like we ignored whitespace the right way.
Support for comments
Let's add a test very similar to the one from the previous
chapter to check whether we ignore comments correctly:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "comment",
			In:   "# this is a comment\n# this is a comment without a newline at the end",
			Out: []Token{
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
To ignore comments, we add a new case to our switch
statement:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		switch l.cur {
		case '#':
			for l.cur != '\n' && l.cur != 0 {
				l.advance()
			}
			continue
		case ' ', '\n', '\t', '\r':
			l.advance()
			continue
		}
		l.advance()
	}
	// [...]
}
We want our comments to start with #, therefore we enter the
case if the current character is a #. Once in the case we call
*Lexer.advance() until we either hit a newline or EOF - both
causing the loop to stop.
Let's again run our tests:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
=== RUN TestLexer/comment
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
--- PASS: TestLexer/comment (0.00s)
PASS
ok calc 0.001s
Detecting special symbols
Having added tests for empty input, ignoring white space and
comments, we will now add a new test for the symbols we
want to recognize in our input:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "symbols",
			In:   "+-/*()",
			Out: []Token{
				{TOKEN_PLUS, "+"},
				{TOKEN_MINUS, "-"},
				{TOKEN_SLASH, "/"},
				{TOKEN_ASTERISK, "*"},
				{TOKEN_BRACE_LEFT, "("},
				{TOKEN_BRACE_RIGHT, ")"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
Running our tests including the above at the current state of
our implementation results in the following assertion error:
$ go test ./...
--- FAIL: TestLexer (0.00s)
--- FAIL: TestLexer/symbols (0.00s)
lexer_test.go:56:
Error Trace: ./lexer_test.go:56
Error: Not equal:
expected: []main.Token{main.Token{Type:3,
Raw:"+"}, main.Token{Type:4, Raw:"-"}, main.Token{Type:6,
Raw:"/"}, main.Token{Type:5, Raw:"*"}, main.Token{Type:7,
Raw:"("}, main.Token{Type:8, Raw:")"}, main.Token{Type:9,
Raw:"TOKEN_EOF"}}
actual : []main.Token{main.Token{Type:9,
Raw:"TOKEN_EOF"}}
// [...]
Test: TestLexer/symbols
FAIL
FAIL calc 0.004s
FAIL
Implementing support for the symbols we want should fix this
issue.
Our first step towards this goal is to define a new variable
called ttype holding the type of token we recognized:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		ttype := TOKEN_UNKNOWN
		// [...]
		l.advance()
	}
	// [...]
}
We use this variable to insert detected tokens into our slice t;
if the value of ttype did not change and is still
TOKEN_UNKNOWN, we display an error and exit:
// lexer.go
package main

import (
	// [...]
	"log"
)

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		ttype := TOKEN_UNKNOWN
		// [...]
		if ttype != TOKEN_UNKNOWN {
			t = append(t, Token{
				Type: ttype,
				Raw:  string(l.cur),
			})
		} else {
			log.Fatalf("unknown %q in input", l.cur)
		}
		l.advance()
	}
	// [...]
}
For now this concludes our error handling, not great - I know.
Our next step is to add cases to our switch to react to differing
characters:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		switch l.cur {
		// [...]
		case '+':
			ttype = TOKEN_PLUS
		case '-':
			ttype = TOKEN_MINUS
		case '/':
			ttype = TOKEN_SLASH
		case '*':
			ttype = TOKEN_ASTERISK
		case '(':
			ttype = TOKEN_BRACE_LEFT
		case ')':
			ttype = TOKEN_BRACE_RIGHT
		}
		// [...]
		l.advance()
	}
	// [...]
}
We can now once again run our tests:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
=== RUN TestLexer/comment
=== RUN TestLexer/symbols
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
--- PASS: TestLexer/comment (0.00s)
--- PASS: TestLexer/symbols (0.00s)
PASS
ok calc 0.003s
And we pass our tests; the only feature missing from our
tokeniser is detecting numbers.
Support for integers and floating point numbers
As introduced before, I want to support numbers with several
infixes, namely _, e and ..
Go ahead and add some tests for these cases:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "number",
			In:   "123",
			Out: []Token{
				{TOKEN_NUMBER, "123"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
		{
			Name: "number with underscore",
			In:   "10_000",
			Out: []Token{
				{TOKEN_NUMBER, "10_000"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
		{
			Name: "number with e",
			In:   "10e5",
			Out: []Token{
				{TOKEN_NUMBER, "10e5"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
		{
			Name: "number with .",
			In:   "0.005",
			Out: []Token{
				{TOKEN_NUMBER, "0.005"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
Let's add a default case to our switch statement:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		switch l.cur {
		// [...]
		default:
			if (l.cur >= '0' && l.cur <= '9') || l.cur == '.' {
				t = append(t, l.number())
				continue
			}
		}
		// [...]
		l.advance()
	}
	// [...]
}
As one should notice we have yet to define the *Lexer.number
function:
// lexer.go
package main

// [...]

func (l *Lexer) number() Token {
	b := strings.Builder{}
	for (l.cur >= '0' && l.cur <= '9') || l.cur == '.' || l.cur == '_' || l.cur == 'e' {
		b.WriteRune(l.cur)
		l.advance()
	}
	return Token{
		Raw:  b.String(),
		Type: TOKEN_NUMBER,
	}
}
The function makes use of the strings.Builder structure, which
lets us avoid the string copies we would create if we simply
concatenated with string+string. We iterate while the current
character matches what we want and write it to the
strings.Builder. Upon hitting a character we do not accept, the
loop stops and the function returns a Token structure with the
contents of the strings.Builder as its raw value.
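As a tiny standalone illustration of the strings.Builder API, independent of the lexer:
package main

import (
	"fmt"
	"strings"
)

func main() {
	b := strings.Builder{}
	for _, r := range "10_000e5" {
		b.WriteRune(r) // appends to an internal buffer, no intermediate strings
	}
	fmt.Println(b.String()) // 10_000e5
}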
Combining the previously added default case and our new
*Lexer.number() function, we added support for numbers
starting with 0-9 or ., with infixes such as _, . and e - exactly
matching our test cases. Thus we can now once again check
whether our tests pass:
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
=== RUN TestLexer/comment
=== RUN TestLexer/symbols
=== RUN TestLexer/number
=== RUN TestLexer/number_with_underscore
=== RUN TestLexer/number_with_e
=== RUN TestLexer/number_with_.
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
--- PASS: TestLexer/comment (0.00s)
--- PASS: TestLexer/symbols (0.00s)
--- PASS: TestLexer/number (0.00s)
--- PASS: TestLexer/number_with_underscore (0.00s)
--- PASS: TestLexer/number_with_e (0.00s)
--- PASS: TestLexer/number_with_. (0.00s)
PASS
ok calc 0.003s
Calling our Tokeniser
Our tests pass - we can finally move on to my favorite part of
every programming project: passing input via the command
line to our program and seeing the output. Doing so requires
some packages: we need os to access the command line
arguments our program was called with and strings to create
an io.Reader for the parameter our tokeniser requires.
Furthermore we include the log package and promptly disable
all prefixes, timestamps, etc. by invoking log.SetFlags with 0 as
the argument.
// main.go
package main

import (
	"log"
	"os"
	"strings"
)

func main() {
	log.SetFlags(0)
	if len(os.Args) != 2 {
		log.Fatalln("missing input")
	}
	input := os.Args[1]
	token := NewLexer(strings.NewReader(input)).Lex()
	// the compiler complains about token being declared and not used,
	// so we silence it until we print the tokens with debugToken below
	_ = token
}
Tip
When an executable built with Go is started, it can access the
arguments passed to it via the os.Args slice:
// main.go
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Println(os.Args)
}
$ go build .
$ ./main arg1 arg2 arg3
[./main arg1 arg2 arg3]
The 0 index is always the name of the executable.
We got our tokens but we haven't printed them yet, so we
create a helper function called debugToken - we first print the
header of our table and afterwards iterate through our list of
Token structures, printing them one by one.
// main.go
package main

// [...]

func debugToken(token []Token) {
	log.Printf("%5s | %20s | %15s \n\n", "index", "type", "raw")
	for i, t := range token {
		log.Printf("%5d | %20s | %15s \n", i, TOKEN_LOOKUP[t.Type], t.Raw)
	}
}

func main() {
	log.SetFlags(0)
	if len(os.Args) != 2 {
		log.Fatalln("missing input")
	}
	input := os.Args[1]
	token := NewLexer(strings.NewReader(input)).Lex()
	debugToken(token)
}
Running our program with an expression of our choice results
in a table of the lexemes we recognized:
$ go run . "100_000+.5*(42-3.1415)/12"
index |              type |       raw
    0 |      TOKEN_NUMBER |   100_000
    1 |        TOKEN_PLUS |         +
    2 |      TOKEN_NUMBER |        .5
    3 |    TOKEN_ASTERISK |         *
    4 |  TOKEN_BRACE_LEFT |         (
    5 |      TOKEN_NUMBER |        42
    6 |       TOKEN_MINUS |         -
    7 |      TOKEN_NUMBER |    3.1415
    8 | TOKEN_BRACE_RIGHT |         )
    9 |       TOKEN_SLASH |         /
   10 |      TOKEN_NUMBER |        12
   11 |               EOF | TOKEN_EOF