Tokenizing Arithmetic expressions - calculator
Introduction
This is the first post of a four part series about implementing
and understanding the steps for interpreting arithmetic
expressions. The series explains key concepts such as lexical
analysis, parsing / building the AST, walking the AST /
flattening it to bytecode, bytecode virtual machines and TDD,
centered around compilers and interpreters.
Calculator
Simple educational calculator with lexer, parser and bytecode
vm
This project is meant to explain concepts used when
interpreting programming languages, such as:
tokenizing characters to tokens
parsing tokens into an abstract syntax tree
tree-walk interpreting
transforming the abstract syntax tree to bytecode
interpreting bytecode using a virtual machine
Usage
$ calc "1+1*12.1/5"
3.4
$ cat calculations.txt
# operations
1+1
1-1
1*1
1/1
# chained operations
1*1+1-1/1
$ cat calculations.txt | calc
2
0
1
1
1
How this project works
Compiling the project
Compiling this project requires Go version 1.20:
$ go build .
This produces an executable for your architecture and
operating system, which can be started:
$ ./calc
no input given
$ ./calc "1+1"
The next command supplies calc with 2+1*2 and promptly
executes the expression:
$ ./calc "2+1*2"
index |     type | raw
    0 |   NUMBER | 2
    1 |     PLUS | +
    2 |   NUMBER | 1
    3 | ASTERISK | *
    4 |   NUMBER | 2
    5 |      EOF | EOF
+
*
2
1
2
OP_LOAD :: 2.000000
OP_STORE :: 1.000000
OP_LOAD :: 1.000000
OP_MULTIPY :: 1.000000
OP_STORE :: 1.000000
OP_LOAD :: 2.000000
OP_ADD :: 1.000000
=> 4
The first output is the list of tokens generated by lexical
analysis, the second is the abstract syntax tree the parser
builds, and the third traces the steps the virtual machine
takes to execute the expression. The last output is the
resulting number.
This first article contains the introduction to our problem
domain, the setup of our project, the basics of TDD and the
first steps towards interpreting arithmetic expressions:
tokenizing our input / performing lexical analysis.
The second article will be centered around converting the list
of tokens we created in the first article into an abstract
syntax tree (AST for short).
The third article will be focused on walking the abstract syntax
tree and converting nodes into a list of instructions for our
virtual machine.
The fourth and last article will consist of implementing the
bytecode virtual machine and executing expressions
compiled to bytecode.
Problem domain
The expressions we want to be able to execute with our
interpreter form the smallest subset of a programming
language I could imagine, thus our problem domain is defined
by a subset of arithmetic expressions:
addition
subtraction
multiplication
division
We will also support braces that can be used to indicate
precedence, which we will talk about in the second post of
this series.
Expressions
Some examples of expressions we will accept:
# comments
# addition
1029+0.129
# =1029.129
# subtraction
5_120_129-5_120_128
# =1
# multiplication
5e12*3
# =15000000000000
# division
1/2
# =0.5
# braces
(1+1)/2
# =1
Interpreter design
There are several well-established ways of interpreting
programming languages. Let's take a look at the stages an
interpreter commonly performs before the passed in source
code is interpreted.
Interpreter Stages
1. Lexical analysis -> the process of recognizing structures
such as numbers, identifiers or symbols in the source
code and converting them into a list of tokens; often
referred to as scanning, lexing or tokenizing.
2. Parsing -> refers to the process of detecting precedence
and checking whether the input conforms to the defined
grammar. In our case the parser analyses the output of
our previous stage and builds an abstract syntax tree
while taking operator precedence into account.
3. Evaluating -> commonly means walking the tree starting
from the leaves, computing a value for each node and
exiting upon computing the value of the root.
For our interpreter we decided to use the idea of bytecode
interpreters, thus splitting the third step into two sub-steps:
1. Compiling to bytecode -> Walking the tree and compiling
each node into bytecode
2. Executing bytecode -> Iterating over the list of bytecode
instructions and executing them
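To make this split a bit more tangible, here is a rough sketch of how such bytecode instructions could be represented in Go. The names and layout are assumptions for illustration only; parts three and four of the series define the actual implementation.
// sketch.go (purely illustrative, not part of the project)
package main

type Operation byte

const (
	OP_LOAD Operation = iota
	OP_STORE
	OP_ADD
	OP_MULTIPLY
)

// an instruction pairs an operation with a single argument,
// mirroring the "OP_... :: argument" trace lines further below
type Instruction struct {
	Op  Operation
	Arg float64
}

// the compiler walks the AST and emits a flat []Instruction,
// which the virtual machine then executes one by one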
Example
Consider the following expression and let's visualize the stages
using it as an example:
1.025*3+1
Lexical analysis
Performing the first stage converts the input from a character
stream into a list of structures:
token := []Token{
	Token{Type: NUMBER, Raw: "1.025"},
	Token{Type: ASTERISK, Raw: "*"},
	Token{Type: NUMBER, Raw: "3"},
	Token{Type: PLUS, Raw: "+"},
	Token{Type: NUMBER, Raw: "1"},
}
Parsing
We now build an abstract syntax tree out of the list of tokens
we get from the previous stage:
ast := Addition{
	Token: Token{Type: PLUS, Raw: "+"},
	Left: Multiplication{
		Token: Token{Type: ASTERISK, Raw: "*"},
		Left: Number{
			Token: Token{Type: NUMBER, Raw: "1.025"},
		},
		Right: Number{
			Token: Token{Type: NUMBER, Raw: "3"},
		},
	},
	Right: Number{
		Token: Token{Type: NUMBER, Raw: "1"},
	},
}
Notice the depth of the tree: the deeper a node sits, the
earlier it is compiled to bytecode, thereby respecting operator
precedence. See below for a visual explanation:
(Figures: the initial AST; the AST after the multiplication is
evaluated; the AST after the addition is evaluated, yielding 4.075.)
Compiling to bytecode
We use the AST we got from the previous step to compile each
node to a list of bytecode instructions. The bottom-most
nodes, commonly referred to as leaves, are all numbers, thus
we will start there.
The bytecode VM we want to implement has a list of registers,
comparable to the CPU registers of a real machine. We can
load and manipulate values in these registers. In the third and
fourth part of this series we will go into great depth on
registers, bytecode and virtual machines. For now, simply
know that there are registers, that we can manipulate them,
and that our VM accepts an instruction together with an
argument.
Let's now take a look at the bytecode our previous example
compiles to:
;; multiplication
;; loading 1.025 into register 0
OP_LOAD :: 1.025000
;; moving 1.025 from register 0 to register 1
OP_STORE :: 1.000000
;; loading 3 into register 0
OP_LOAD :: 3.000000
;; multiplying the value of register 0
;; with the value of register 1
OP_MULTIPY :: 1.000000
;; storing the result of the
;; multiplication in register 1
OP_STORE :: 1.000000
;; addition
;; loading 1 into register 0
OP_LOAD :: 1.000000
;; adding the value of register 1
;; to the value of register 0
OP_ADD :: 1.000000
The left-hand side of each line is the operation the virtual
machine executes, the right-hand side is the argument of the
operation; the sides are separated by ::.
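As an aside, the six decimal places are simply Go's default formatting for the %f verb; a trace line in this shape could be produced with something like the following (an assumption about the formatting, not necessarily the project's exact code):
package main

import "fmt"

func main() {
	// %f prints a float64 with six decimal places by default,
	// which is where arguments like "1.000000" come from
	fmt.Printf("%s :: %f\n", "OP_LOAD", 2.0) // OP_LOAD :: 2.000000
}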
This should suffice as a high-level overview of the steps we
take to get from the source code to the result of our
expression: tokenizing, parsing, compiling to bytecode and
executing that bytecode.
Project setup
All code snippets used in this series start with a comment
specifying the file they belong to. Code not relevant to the
current topic is omitted but still indicated using [...].
// main.go
package main
// [...]
func main() { }
If a new file should be created it will be explicitly stated.
Code snippets starting with a $ must be executed in a shell:
$ echo "This is a shell"
Creating a directory for our project:
$ mkdir calc
Entry point
Using Go we can start with the bare minimum a project
requires:
// main.go
package main

import "fmt"

func main() {
	fmt.Println("Hello World!")
}
Running the above via go run . requires the creation of the
project's go.mod file:
Initialising the project
$ go mod init calc
Building and running the source code
$ go run .
Hello World!
Test driven development
Info
Test driven development refers to the process of defining the
problem domain of a function, creating the corresponding
tests, preferably covering as many edge cases as possible, and
incrementally building up the implementation of the function
until all test cases are satisfied.
As we are implementing an interpreter, both the input to our
function and its output are known and therefore easily
representable with tests, which screams TDD: write the tests
and iterate until all of them pass. We will create our tests
once we have defined the different kinds a token can
represent and the Token structure.
Tokenising
Leaving the above behind, let's now get to the gist of this part
of the series: the tokeniser. Our main goal is to step through
the input source code, convert it into tokens and afterwards
spit out the resulting list of tokens.
Let's get started: create a new file in the project directory,
beside main.go and go.mod, called lexer.go:
// lexer.go
package main
import (
	"bufio"
	"io"
	"log"
	"strings"
)
For now this will be enough; we will fill this file with content
in the following sections. (Note that Go refuses to compile
files with unused imports, so if you want every intermediate
step to build, add log and strings only once they are actually
used further below.)
Token and Types of Token
In the classical sense, a lexical token is a list of characters
with an assigned meaning, as seen in the first step of the
example above.
To define the meanings we can attach to such a list of
characters, we declare the possible token types we want to
support in our interpreter:
// lexer.go
package main
const (
	TOKEN_UNKNOWN = iota + 1
	TOKEN_NUMBER

	// symbols
	TOKEN_PLUS
	TOKEN_MINUS
	TOKEN_ASTERISK
	TOKEN_SLASH

	// structure
	TOKEN_BRACE_LEFT
	TOKEN_BRACE_RIGHT

	TOKEN_EOF
)
We now define the structure holding a detected token,
consisting of its type and its raw value:
// lexer.go
package main
// [...] token kind definition
type Token struct {
	Type int
	Raw  string
}
We defined the structure to hold tokens we found in the
source code and their types.
Tests
Now let's get started with writing tests. We create a new file
postfixed with _test.go. This lexer_test.go file contains all tests
for the tokeniser and lives beside all previously created files
in the root of the directory.
So let's create the foundation for our tests - we will make use
of an idea called table driven tests:
// lexer_test.go
package main

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{}

	for _, test := range tests {
		t.Run(test.Name, func(t *testing.T) {
			in := strings.NewReader(test.In)
			out := NewLexer(in).Lex()
			assert.EqualValues(t, test.Out, out)
		})
	}
}
We use the assert.EqualValues function to compare the
expected and the actual token slices.
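The assert helpers are not part of Go's standard library; they come from the github.com/stretchr/testify module. Assuming it is not already listed in go.mod, it can be added with:
$ go get github.com/stretchr/testify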
Let's add our first test - an edge case - specifically the case of
an empty input, for which we expect a single Token with
Type: TOKEN_EOF in the resulting token list.
// lexer_test.go
package main

import (
	"strings"
	"testing"

	"github.com/stretchr/testify/assert"
)

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		{
			Name: "empty input",
			In:   "",
			Out: []Token{
				{Type: TOKEN_EOF, Raw: "TOKEN_EOF"},
			},
		},
	}

	for _, test := range tests {
		t.Run(test.Name, func(t *testing.T) {
			in := strings.NewReader(test.In)
			out := NewLexer(in).Lex()
			assert.EqualValues(t, test.Out, out)
		})
	}
}
Running our tests with go test ./... -v will result in an error
simply because we have not yet defined our Lexer:
$ go test ./... -v
# calc [calc.test]
./lexer_test.go:35:11: undefined: NewLexer
FAIL calc [build failed]
FAIL
Debugging
If we try to print our Token structure we will see the
Token.Type as an integer, for example:
package main

import "fmt"

func main() {
	t := Token{Type: TOKEN_NUMBER, Raw: "12"}
	fmt.Printf("Token{Type: %d, Raw: %s}\n", t.Type, t.Raw)
}
This would of course not result in the output we want, due to
the enum defining token types as integers:
$ go run .
Token{Type: 2, Raw: 12}
Therefore we add the TOKEN_LOOKUP hash map:
// lexer.go
package main
// [...] imports
// [...] token types generation
var TOKEN_LOOKUP = map[int]string{
	TOKEN_UNKNOWN:     "UNKNOWN",
	TOKEN_NUMBER:      "TOKEN_NUMBER",
	TOKEN_PLUS:        "TOKEN_PLUS",
	TOKEN_MINUS:       "TOKEN_MINUS",
	TOKEN_ASTERISK:    "TOKEN_ASTERISK",
	TOKEN_SLASH:       "TOKEN_SLASH",
	TOKEN_BRACE_LEFT:  "TOKEN_BRACE_LEFT",
	TOKEN_BRACE_RIGHT: "TOKEN_BRACE_RIGHT",
	TOKEN_EOF:         "EOF",
}
Tip
With vim the above is extremely easy to generate: simply copy
the previously defined token types, paste them into the map,
remove = iota + 1, white space and comments. Afterwards
mark them again with Shift+v and run
:'<,'>s/\([A-Z_]\+\)/\1: "\1",. This creates a capture group for
one or more upper case characters and underscores; the group
is reused in the replacement part of the substitute command
(the second part of the command, split by /), turning each
name into a map entry and thus filling the map.
If we now update our previous example to use the new
TOKEN_LOOKUP map, we can see that it works as intended:
package main

import "fmt"

func main() {
	t := Token{Type: TOKEN_NUMBER, Raw: "12"}
	fmt.Printf("Token{Type: %s, Raw: %s}\n",
		TOKEN_LOOKUP[t.Type], t.Raw)
}
$ go run .
Token{Type: TOKEN_NUMBER, Raw: 12}
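As an optional refinement that is not part of the project as shown here, one could also satisfy the fmt.Stringer interface so a Token can be passed directly to %s verbs or fmt.Println; this would additionally require importing fmt in lexer.go:
// lexer.go (optional, not used in the rest of this series)

// String implements fmt.Stringer, so fmt.Println(t) and %s
// print a readable representation of the token
func (t Token) String() string {
	return fmt.Sprintf("Token{Type: %s, Raw: %s}", TOKEN_LOOKUP[t.Type], t.Raw)
}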
Lexer overview
After establishing our debugging capabilities we can now move
on to creating the Lexer and defining our tokeniser's API:
// lexer.go
package main

// [...] imports, token types, token struct, TOKEN_LOOKUP map

type Lexer struct {
	scanner bufio.Reader
	cur     rune
}

func NewLexer(reader io.Reader) *Lexer { }

func (l *Lexer) Lex() []Token { }

func (l *Lexer) number() Token { }

func (l *Lexer) advance() { }
The Lexer structure holds a scanner that we create in the
NewLexer function. This function accepts an unbuffered
reader, which we wrap into a buffered reader for stepping
through the source in an optimized fashion, and it returns a
Lexer structure. The cur field holds the current character.
The heart of the tokeniser is the Lexer.Lex method. It iterates
over all characters in the buffered reader and tries to
recognize structures.
The Lexer.number method is called when a number is
detected; it iterates until the current character is no longer
part of a number and returns a Token structure.
Lexer.advance requests the next character from the buffered
scanner and sets Lexer.cur to the resulting character.
Number vs integer vs digit -> Here I define a number as one
or more characters between 0 and 9. I extend this definition
with e, . and _ between the first digit and all following digits.
Thus I consider the following numbers as valid for this
interpreter (a small sketch of this definition in code follows
the list):
1e5
12.5
0.5
5_000_000
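Expressed as code, the definition above could look like the following hypothetical helpers; the lexer implemented below inlines these checks instead of using named functions:
// hypothetical helpers, only to illustrate the definition above -
// the lexer implemented below inlines these checks
func isDigit(r rune) bool {
	return r >= '0' && r <= '9'
}

func isNumberRune(r rune) bool {
	return isDigit(r) || r == '.' || r == '_' || r == 'e'
}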
Creating the lexer
As introduced before, the NewLexer function creates the lexer:
// lexer.go
package main

// [...] imports, token types, token struct, TOKEN_LOOKUP map

type Lexer struct {
	scanner bufio.Reader
	cur     rune
}

func NewLexer(reader io.Reader) *Lexer {
	l := &Lexer{
		scanner: *bufio.NewReader(reader),
	}
	l.advance()
	return l
}
This function accepts a reader, creates a new Lexer structure,
wraps the reader in a buffered reader, assigns it to the Lexer
structure and afterwards invokes the Lexer.advance method.
Advancing in the Input
Stepping through the source code is as easy as requesting a
new character from our buffered reader via the
bufio.Reader.ReadRune() method:
// lexer.go
package main

// [...]

func (l *Lexer) advance() {
	r, _, err := l.scanner.ReadRune()
	if err != nil {
		l.cur = 0
	} else {
		l.cur = r
	}
}
The ReadRune method returns an error once the end of the
input is hit; to indicate this to our Lexer.Lex method we set
the Lexer.cur field to 0.
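A possible refinement, not used in the rest of this series, would be to treat only io.EOF as the expected end of input and fail loudly on any other error; this would additionally require importing errors in lexer.go:
// lexer.go (a possible refinement, not used in this series)
func (l *Lexer) advance() {
	r, _, err := l.scanner.ReadRune()
	if err != nil {
		// io.EOF is expected at the end of the input,
		// anything else indicates a real problem
		if !errors.Is(err, io.EOF) {
			log.Fatalf("failed to read the next character: %s", err)
		}
		l.cur = 0
		return
	}
	l.cur = r
}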
Tip
End of file is often referred to as EOF.
We will now focus on the heart of the tokeniser: Lexer.Lex():
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	t := make([]Token, 0)
	for l.cur != 0 {
		l.advance()
	}
	t = append(t, Token{
		Type: TOKEN_EOF,
		Raw:  "TOKEN_EOF",
	})
	return t
}
We first create a new slice of type []Token, which we will fill
with the tokens we find while stepping through the source
code. The loop iterates until we hit EOF, repeatedly calling
*Lexer.advance(). To indicate the end of our token list we
append a token of type TOKEN_EOF to the slice t.
After defining the NewLexer and the *Lexer.Lex we can try
running our tests again:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
PASS
ok calc 0.002s
Thus we know our lexer works correctly for empty inputs.
Ignoring white space
Every good programming language ignores white space and so
do we (looking at you, Python). White space is commonly
defined as a newline '\n', a carriage return '\r', a tab '\t' or a
space ' '.
Let's add a new test case called whitespace to our tests:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "whitespace",
			In:   "\r\n\t ",
			Out: []Token{
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
Having defined what we want as the output, let's get started
with ignoring white space. To check whether the current
character matches any of the above, we introduce a switch
statement:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	t := make([]Token, 0)
	for l.cur != 0 {
		switch l.cur {
		case ' ', '\n', '\t', '\r':
			l.advance()
			continue
		}
		l.advance()
	}
	t = append(t, Token{
		Type: TOKEN_EOF,
		Raw:  "TOKEN_EOF",
	})
	return t
}
Let's run our tests and check if everything worked out the way
we wanted it to:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
PASS
ok calc 0.001s
Seems like we ignored whitespace the right way.
Support for comments
Let's add a test very similar to the one from the previous
chapter to check whether we ignore comments correctly:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "comment",
			In:   "# this is a comment\n# this is a comment without a newline at the end",
			Out: []Token{
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
To ignore comments, we add a new case to our switch
statement:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		switch l.cur {
		case '#':
			for l.cur != '\n' && l.cur != 0 {
				l.advance()
			}
			continue
		case ' ', '\n', '\t', '\r':
			l.advance()
			continue
		}
		l.advance()
	}
	// [...]
}
We want our comments to start with #, therefore we enter the
case if the current character is a #. Once in the case we call
*Lexer.advance() until we either hit a newline or EOF - both
causing the loop to stop.
Let's again run our tests:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
=== RUN TestLexer/comment
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
--- PASS: TestLexer/comment (0.00s)
PASS
ok calc 0.001s
Detecting special symbols
Having added tests for empty input, ignoring white space and
comments, we will now add a new test for the symbols we
want to recognize in our input:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "symbols",
			In:   "+-/*()",
			Out: []Token{
				{TOKEN_PLUS, "+"},
				{TOKEN_MINUS, "-"},
				{TOKEN_SLASH, "/"},
				{TOKEN_ASTERISK, "*"},
				{TOKEN_BRACE_LEFT, "("},
				{TOKEN_BRACE_RIGHT, ")"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
Running our tests including the above at the current state of
our implementation results in the following assertion error:
$ go test ./...
--- FAIL: TestLexer (0.00s)
--- FAIL: TestLexer/symbols (0.00s)
lexer_test.go:56:
Error Trace: ./lexer_test.go:56
Error: Not equal:
expected: []main.Token{main.Token{Type:3,
Raw:"+"}, main.Token{Type:4, Raw:"-"}, main.Token{Type:6,
Raw:"/"}, main.Token{Type:5, Raw:"*"}, main.Token{Type:7,
Raw:"("}, main.Token{Type:8, Raw:")"}, main.Token{Type:9,
Raw:"TOKEN_EOF"}}
actual : []main.Token{main.Token{Type:9,
Raw:"TOKEN_EOF"}}
// [...]
Test: TestLexer/symbols
FAIL
FAIL calc 0.004s
FAIL
Implementing support for the symbols we want should fix this
issue.
Our first step towards this goal is to define a new variable
called ttype holding the type of token we recognized:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		ttype := TOKEN_UNKNOWN
		// [...]
		l.advance()
	}
	// [...]
}
We use this variable to insert detected tokens into our slice t;
if the value of ttype did not change and is still
TOKEN_UNKNOWN, we display an error and exit:
// lexer.go
package main

import (
	// [...]
	"log"
)

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		ttype := TOKEN_UNKNOWN
		// [...]
		if ttype != TOKEN_UNKNOWN {
			t = append(t, Token{
				Type: ttype,
				Raw:  string(l.cur),
			})
		} else {
			log.Fatalf("unknown %q in input", l.cur)
		}
		l.advance()
	}
	// [...]
}
For now this concludes our error handling, not great - I know.
Our next step is to add cases to our switch to react to differing
characters:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		switch l.cur {
		// [...]
		case '+':
			ttype = TOKEN_PLUS
		case '-':
			ttype = TOKEN_MINUS
		case '/':
			ttype = TOKEN_SLASH
		case '*':
			ttype = TOKEN_ASTERISK
		case '(':
			ttype = TOKEN_BRACE_LEFT
		case ')':
			ttype = TOKEN_BRACE_RIGHT
		}
		// [...]
		l.advance()
	}
	// [...]
}
We can now once again run our tests:
$ go test ./... -v
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
=== RUN TestLexer/comment
=== RUN TestLexer/symbols
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
--- PASS: TestLexer/comment (0.00s)
--- PASS: TestLexer/symbols (0.00s)
PASS
ok calc 0.003s
And we pass our tests; the only feature missing from our
tokeniser is detecting numbers.
Support for integers and floating point numbers
As introduced before, I want to support numbers with several
infixes, namely _, e and ..
Go ahead and add some tests for these cases:
// lexer_test.go
package main

// [...]

func TestLexer(t *testing.T) {
	tests := []struct {
		Name string
		In   string
		Out  []Token
	}{
		// [...]
		{
			Name: "number",
			In:   "123",
			Out: []Token{
				{TOKEN_NUMBER, "123"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
		{
			Name: "number with underscore",
			In:   "10_000",
			Out: []Token{
				{TOKEN_NUMBER, "10_000"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
		{
			Name: "number with e",
			In:   "10e5",
			Out: []Token{
				{TOKEN_NUMBER, "10e5"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
		{
			Name: "number with .",
			In:   "0.005",
			Out: []Token{
				{TOKEN_NUMBER, "0.005"},
				{TOKEN_EOF, "TOKEN_EOF"},
			},
		},
	}
	// [...]
}
Let's add a default case to our switch statement:
// lexer.go
package main

// [...]

func (l *Lexer) Lex() []Token {
	// [...]
	for l.cur != 0 {
		// [...]
		switch l.cur {
		// [...]
		default:
			if (l.cur >= '0' && l.cur <= '9') || l.cur == '.' {
				t = append(t, l.number())
				continue
			}
		}
		// [...]
		l.advance()
	}
	// [...]
}
As one should notice we have yet to define the *Lexer.number
function:
// lexer.go
package main

// [...]

func (l *Lexer) number() Token {
	b := strings.Builder{}
	for (l.cur >= '0' && l.cur <= '9') || l.cur == '.' || l.cur == '_' || l.cur == 'e' {
		b.WriteRune(l.cur)
		l.advance()
	}
	return Token{
		Raw:  b.String(),
		Type: TOKEN_NUMBER,
	}
}
The function makes use of the strings.Builder structure, which
lets us avoid the string copies we would create if we simply
concatenated with string+string. We iterate while the current
character matches what we want and write it to the
strings.Builder. Upon hitting a character we do not accept, the
loop stops and the function returns a Token structure with the
contents of the strings.Builder as its raw value.
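As a tiny standalone illustration of the strings.Builder API, independent of the lexer:
package main

import (
	"fmt"
	"strings"
)

func main() {
	b := strings.Builder{}
	for _, r := range "10_000e5" {
		b.WriteRune(r) // appends to an internal buffer, no intermediate strings
	}
	fmt.Println(b.String()) // 10_000e5
}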
Combining the previously added default case and our new
*Lexer.number() function, we added support for numbers
starting with 0-9 or ., with infixes such as _, . and e - exactly
matching our test cases. Thus we can now once again check
whether our tests pass:
=== RUN TestLexer
=== RUN TestLexer/empty_input
=== RUN TestLexer/whitespace
=== RUN TestLexer/comment
=== RUN TestLexer/symbols
=== RUN TestLexer/number
=== RUN TestLexer/number_with_underscore
=== RUN TestLexer/number_with_e
=== RUN TestLexer/number_with_.
--- PASS: TestLexer (0.00s)
--- PASS: TestLexer/empty_input (0.00s)
--- PASS: TestLexer/whitespace (0.00s)
--- PASS: TestLexer/comment (0.00s)
--- PASS: TestLexer/symbols (0.00s)
--- PASS: TestLexer/number (0.00s)
--- PASS: TestLexer/number_with_underscore (0.00s)
--- PASS: TestLexer/number_with_e (0.00s)
--- PASS: TestLexer/number_with_. (0.00s)
PASS
ok calc 0.003s
Calling our Tokeniser
Our tests pass - we can finally move on to my favorite part of
every programming project: passing input via the command
line to our program and seeing the output. Doing so requires
some packages: we need os to access the command line
arguments our program was called with and strings to create
an io.Reader for the parameter our tokeniser requires.
Furthermore we include the log package and promptly disable
all prefixes, timestamps, etc. by invoking log.SetFlags with 0 as
the argument.
// main.go
package main

import (
	"log"
	"os"
	"strings"
)

func main() {
	log.SetFlags(0)
	if len(os.Args) != 2 {
		log.Fatalln("missing input")
	}
	input := os.Args[1]
	token := NewLexer(strings.NewReader(input)).Lex()
	// the compiler complains about token being declared and not used,
	// so we silence it until we print the tokens with debugToken below
	_ = token
}
Tip
When an executable built with Go is started, it can access the
arguments passed to it via the os.Args slice:
// main.go
package main

import (
	"fmt"
	"os"
)

func main() {
	fmt.Println(os.Args)
}
$ go build .
$ ./main arg1 arg2 arg3
[./main arg1 arg2 arg3]
The 0 index is always the name of the executable.
We got our tokens but we haven't printed them yet, so we
create a helper function called debugToken - we first print the
header of our table and afterwards iterate through our list of
Token structures, printing them one by one.
// main.go
package main

// [...]

func debugToken(token []Token) {
	log.Printf("%5s | %20s | %15s \n\n", "index", "type", "raw")
	for i, t := range token {
		log.Printf("%5d | %20s | %15s \n", i, TOKEN_LOOKUP[t.Type], t.Raw)
	}
}

func main() {
	log.SetFlags(0)
	if len(os.Args) != 2 {
		log.Fatalln("missing input")
	}
	input := os.Args[1]
	token := NewLexer(strings.NewReader(input)).Lex()
	debugToken(token)
}
Running our program with an expression of our choice results
in a table of the lexemes we recognized:
$ go run . "100_000+.5*(42-3.1415)/12"
index |              type |       raw
    0 |      TOKEN_NUMBER |   100_000
    1 |        TOKEN_PLUS |         +
    2 |      TOKEN_NUMBER |        .5
    3 |    TOKEN_ASTERISK |         *
    4 |  TOKEN_BRACE_LEFT |         (
    5 |      TOKEN_NUMBER |        42
    6 |       TOKEN_MINUS |         -
    7 |      TOKEN_NUMBER |    3.1415
    8 | TOKEN_BRACE_RIGHT |         )
    9 |       TOKEN_SLASH |         /
   10 |      TOKEN_NUMBER |        12
   11 |               EOF | TOKEN_EOF