TalkingCompilersWithGPT
This material is freely available and must not be sold or used for commercial purposes.
For typos or suggestions, please contact Fernando Pereira at [email protected].
Preface
In the second semester of 2024, I was tasked with teaching Compiler Construction to
undergraduate students at the Federal University of Minas Gerais. At that time, I had been
conducting research in the field of compilers for almost twenty years. However, I had never
taught this specific course before. Instead, I was usually teaching courses like Theory and
Practice of Programming Language Design or Basic Programming to undergraduates.
That said, I did have some material prepared for a graduate course on Static Program Analysis,
which I could potentially reuse. But all of that seemed almost irrelevant, because this was the age
of ChatGPT, Gemini, and Llama! It was the era of large language models (LLMs) and all the
incredible things they enable. So, I decided to embrace these tools wholeheartedly to help create
my lecture notes.
I usually teach by asking students questions, and my lecture notes often consist of
questions—typically without the answers. These notes turned out to be excellent guidelines for
crafting a more comprehensive story to share with the Compiler Construction students. This PDF
is primarily the result of feeding ChatGPT and other LLMs those questions and refining their
responses.
The table of contents, in turn, is inspired by the ACM Undergraduate Curricula for Compilers. I
tried to cover as much of it as time and energy allowed. I've organized the contents following
Bob Nystrom's Crafting Interpreters: first I discuss the implementation of an interpreter (for a
simple expression-oriented subset of SML/NJ) and then talk about code generation (using
RISC-V as the main target).
Of course, not all the answers generated by LLMs met my expectations. For instance, I struggled
to get GPT to explain that the application "(fn x => fn y => x + y) z" would yield a closure in
SML/NJ. But these minor frustrations were far outweighed by the joy of working with these
models. The process of creating this book was both productive and exciting—I learned much
along the way.
The students of that second semester of 2024 played a key role in helping me shape a compelling
narrative. I am also deeply grateful to the teaching assistants, who reviewed the code I uploaded
to GitHub. Ultimately, this book was a collective effort involving many people, and I owe them
all my thanks. This includes friends from the Reddit community r/compilers, as well as
colleagues who also teach compiler construction. To everyone who contributed to this collection
of lecture notes, muito obrigado!
Table of Contents
● Introduction
● Lexical Analysis
● Tree-Like Program Representation
● Recursive-Descent Parsing
● Bottom-Up Parsing
● Parser Generators and Parser Combinators
● Variables and Bindings
● The Visitor Design Pattern
● Type Systems
● Type Checking
● Type Inference
● Anonymous Functions
● Recursive Functions
● Introduction to Code Generation
● Code Generation for Expressions
● Code Generation for Statements
● Code Generation for Functions
● Memory Allocation
● Pointers and Aggregate Types
● Code Generation for Object-Oriented Features
● Heap Allocation
● Introduction to Code Optimizations
● Data-Flow Analyses
● Static Single-Assignment Form
● SSA-Based Register Allocation
Introduction
How do we run programs?
There are three typical ways to execute programs:
● Interpretation: In this method, the program remains in a high-level representation, such
as an Abstract Syntax Tree (AST) or another intermediate form. The interpreter
processes this representation directly, executing the program step by step within a
"dispatch loop." No machine code is generated beforehand, so the interpreter must
remain active throughout the execution to manage and execute each instruction.
● Compilation: This method involves translating the entire program into machine code (a
binary executable) ahead of time using a compiler. Once the compilation is complete, the
resulting machine code can be executed directly by the hardware without further
interpretation or translation. The program runs natively on the machine, leading to
potentially faster execution compared to interpretation.
● Virtualization: Virtualization usually involves transforming the program into a low-level
code representation, often called bytecode. This bytecode is not directly executable by
the hardware but is designed to be efficiently interpreted by a virtual machine. The virtual
machine then executes the bytecode within its own dispatch loop, similar to how an
interpreter would handle a high-level representation. While the bytecode is typically more
abstract than machine code, it is more structured and lower-level than the high-level
representation used in pure interpretation.
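To make the "dispatch loop" idea concrete, here is a minimal sketch of a tree-walking interpreter in Python. The tuple-based AST representation is purely illustrative:

def interpret(node):
    # Dispatch on the kind of AST node, executing the program step by step.
    kind = node[0]
    if kind == "num":                    # ("num", 3)
        return node[1]
    elif kind == "add":                  # ("add", left, right)
        return interpret(node[1]) + interpret(node[2])
    elif kind == "mul":                  # ("mul", left, right)
        return interpret(node[1]) * interpret(node[2])
    else:
        raise ValueError(f"Unknown node kind: {kind}")

# (3 + 4) * 2, written as a tree:
ast = ("mul", ("add", ("num", 3), ("num", 4)), ("num", 2))
print(interpret(ast))  # 14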
Have you ever seen the cover of "Crafting Interpreters"? It shows different
ways to analyze or execute programs. Have you ever heard of the book?
Yes, "Crafting Interpreters" by Robert Nystrom is a well-known and highly regarded book in the
programming community. It provides a hands-on, practical introduction to building interpreters
and compilers from scratch. The book is split into two parts: the first part walks readers through
building a tree-walk interpreter for a simple programming language called Lox, focusing on
concepts like lexical analysis, parsing, and interpretation. The second part dives into more
advanced topics like bytecode compilation, virtual machines, and optimization. The book is
written in a conversational, accessible style, making it ideal for both beginners and experienced
developers interested in language design and implementation.
Some people like to call "Crafting Interpreters" the "Mountain Book". That's due to its cover
page, which creatively visualizes the different paths taken by compilers and interpreters during
the execution or translation of programming languages. The image shows a mountainous terrain
that symbolizes the journey of a program from source code to its final form, whether that's machine code or bytecode.
Compiled languages are typically "system languages", that is, languages meant to implement computer systems like operating systems, databases, web servers, and video games. Examples include C, C++, Rust, Go, Fortran, and Pascal.
And almost everything else is virtualized. Examples of languages where programs are translated to bytecode, which is then interpreted, include Python (via CPython), Java (via the JVM), and C# (via the CLR).
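We can peek at such bytecode directly. CPython, for instance, ships with the dis module, which disassembles the bytecode of a function. A small sketch (the exact instruction names vary across Python versions):

import dis

def add(a, b):
    return a + b

# Prints the bytecode of `add`: instructions like LOAD_FAST and a binary add,
# which CPython's virtual machine executes inside its dispatch loop.
dis.dis(add)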
It's important to note that some languages can fall into multiple categories depending on the
implementation. For instance, Python can be both interpreted directly (in some implementations)
and virtualized via bytecode (in CPython). JavaScript is another example that, depending on the
engine, may use a mix of interpretation, JIT compilation, and virtualized execution.
● Interpreted languages (like Bash) are ideal for cases where the program's execution is
relatively short-lived or dynamic. The overhead of compiling in such cases would
outweigh any potential performance gains, so interpretation, which avoids the
compilation step and executes code directly, is more efficient in terms of overall time. For
example, in a Bash script, the primary role is to automate tasks and launch other
programs (which are often compiled). Since the script itself executes for a short time
compared to the programs it runs, interpreting it makes sense. There's little benefit to
compiling such scripts, as they aren't run long enough for the performance gains of
compilation to be worth the initial cost.
● Compiled languages (like C or C++) are used when performance is critical, particularly
when the program is expected to run for a long time or repeatedly. Programs in these
languages, such as operating system kernels or performance-sensitive applications,
benefit greatly from being compiled because the compilation process allows for
optimizations that make the resulting machine code run faster. While compilation takes
time, this upfront cost is justified because the compiled program can be executed many
times, and each execution benefits from the optimizations. The more a program is
expected to run, the more it makes sense to invest time in compiling and optimizing it.
Some languages, like JavaScript, benefit from both interpretation and compilation, depending
on the program's runtime behavior. In many cases, JavaScript programs are short-lived, running
only while a web page is open. For these situations, interpreting the code makes sense because
it avoids the overhead of compiling something that won't be used for long.
However, when a JavaScript program runs for an extended period (e.g., a web application that
remains active), a JIT (Just-In-Time) compiler can be used. JIT compilation begins by
interpreting the code but monitors how it runs. If certain parts of the code are executed
frequently, the JIT compiler steps in to compile and optimize those parts on the fly. This hybrid
approach balances the initial speed of interpretation with the eventual performance
improvements of compilation for long-running programs.
Conclusion
The choice between interpretation and compilation can be seen as a matter of optimizing for
user time:
● Interpreted languages shine when the cost of compilation would exceed the benefits
because the programs are expected to run for short durations.
● Compiled languages are essential for long-running programs, where upfront
compilation costs are offset by significant execution speed improvements.
● JIT compilation provides a middle ground, allowing for initial speed via interpretation
and optimizing performance as needed for longer-running code.
This time trade-off reflects the different roles that languages play in modern computing
environments, from quick scripting tasks to performance-critical applications, with tools like JIT
adapting dynamically to both ends of the spectrum.
Let's take a look at some program representations. How can we visualize the
AST that clang builds?
Given this program below:
#include <stdio.h>
int main() {
    char str[] = "Hello, world!";
    printf("%s\n", str);
    return 0;
}
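One way to visualize the AST (assuming clang is installed, and that the program above is saved as hello.c, a file name used here just for illustration) is to ask the front-end to dump the tree it builds:

clang -Xclang -ast-dump -fsyntax-only hello.c

The output lists one node per construct in the source, such as a FunctionDecl for main and a VarDecl for str.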
Consider:
public class T {
public static void main(String args[]) {
int x = 1;
x += (x = 2);
System.out.println("x = " + x);
}
}
We can, for instance, compile the class and inspect the bytecode that javac produces (assuming a JDK is installed; the exact disassembly varies across Java versions):
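javac T.java
javap -c T

The disassembly shows how the JVM evaluates x += (x = 2) on its operand stack: the original value of x (1) is loaded before the inner assignment x = 2 runs, so the sum 1 + 2 = 3 is stored, and the program prints x = 3.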
Usually, we call the three parts that make up a language ecosystem its front-end, its
middle-end, and its back-end.
The front-end translates the source code (Java, C++, Rust, etc.) into an intermediate
representation (IR). This layer is responsible for converting high-level constructs (functions,
loops, variables) into an IR, such as LLVM IR in LLVM compilers or bytecode in the JVM.
By creating a common IR, multiple programming languages can target the same runtime
environment. For example, in the LLVM ecosystem, languages like C, C++, Rust, Swift, and
others can all be compiled into LLVM IR.
The middle-end performs optimizations on the IR. Since this phase operates on the IR, it is independent of both the source language and the target machine architecture. This allows the same optimizations to be applied to code from any language and for any platform.
The back-end is responsible for transforming the optimized IR into machine code or bytecode
that can be executed by a specific processor or runtime environment. It maps the IR into
instructions that the target architecture can understand (such as x86, ARM, RISC-V) and
handles:
● Register allocation: Deciding how to use the CPU registers to hold data.
● Instruction scheduling: Ordering the instructions for efficient execution on a given
CPU.
● Target-specific optimizations: Improving performance based on the capabilities of the
hardware, such as vectorization or using specific CPU instructions.
The back-end is highly target-dependent, but the separation allows engineers to support
multiple architectures without needing to change the previous stages.
1. Separation of Concerns
Each stage in this pipeline addresses different responsibilities, which makes the system easier
to manage, extend, and optimize. By dividing the work into distinct stages, developers can focus
on the particular tasks associated with each phase without worrying about others.
2. Modularity
Dividing the runtime or compiler into these components allows for greater modularity. Each
phase can be developed, improved, or replaced independently. For example:
● The front-end can support multiple languages without needing to modify the back-end.
● The back-end can support multiple architectures without changing the front-end.
This design allows engineers to focus on specific areas of the system in isolation and build a
reusable infrastructure.
3. Cross-Platform Flexibility
This architecture is particularly useful when supporting multiple platforms. For example, LLVM
can compile code for x86, ARM, RISC-V, and other architectures, but the front-end remains the
same regardless of the final machine code target. This flexibility allows developers to build
cross-platform tools easily by simply plugging in different back-ends.
Similarly, in the Java Virtual Machine (JVM) ecosystem, bytecode generated by the front-end
can run on any platform that has a JVM (e.g., Windows, Linux, macOS). The JVM acts as a
platform-independent back-end, abstracting the machine-specific details away from the
language.
● The middle-end focuses on code improvements that are universally beneficial across all
platforms.
● The back-end focuses on platform-specific improvements, ensuring that the final code
runs optimally on the target hardware.
This structure makes compilers and runtime environments extensible and reusable. For instance, adding a new language only requires writing a new front-end, and supporting a new processor only requires a new back-end.
This approach drastically reduces development time and complexity, as adding support for new languages or architectures becomes more modular.
Summary
The front-end, middle-end, and back-end architecture of runtime environments and compilers is adopted because it allows for separation of concerns, modularity, and cross-platform flexibility.
This structure makes modern runtimes and compilers more flexible, efficient, and maintainable.
Many languages use LLVM as their compilation infrastructure. Languages that compile to LLVM IR include:
● C
● C++
● Rust
● Swift
● Objective-C
● Fortran
● Julia
● Haskell (via GHC backend)
● Kotlin/Native
● Zig
● OCaml
● Nim
And indeed, there are several other similar ecosystems where multiple languages target a
common intermediate layer, which is then executed on different platforms or runtimes. Here are
two more prominent examples:
JavaScript is often used as a compilation target for other languages to enable web
development. Many languages compile to JavaScript, which can then run on web browsers or
server environments like Node.js.
Languages
● TypeScript
● Dart
● Elm
● CoffeeScript
● ReasonML/ReScript
● Scala.js
● ClojureScript
● Haxe
● Python (via Brython or Transcrypt)
● Ruby (via Opal)
Intermediate Representation
● JavaScript
Targets
● Web browsers
● Server environments like Node.js
The .NET Common Language Runtime (CLR) allows various languages to be compiled into CIL
(Common Intermediate Language), which is then executed by the .NET runtime.
Languages
● C#
● F#
● Visual Basic .NET
● C++/CLI
● IronPython
● IronRuby
Intermediate Representation
● CIL (Common Intermediate Language)
Targets
● Windows
● Linux
● macOS (via .NET Core)
● Azure (Cloud)
Why should someone study compilers? Here are a few reasons:
1. Understanding compiler optimizations can help you write code that performs better.
2. Compilers are the foundation of programming languages, and learning about them can
empower you to create your own.
3. A strong understanding of compilers can aid in debugging complex software issues.
4. Compilers offer a unique lens through which to view the field, providing a deeper
appreciation for the intricacies of programming and computer systems.
The demand for compiler experts is particularly high in industries that rely on high-performance computing, such as machine learning and artificial intelligence. Typical responsibilities in such roles include:
● Algorithm Development: Creating new algorithms and techniques for optimizing deep
learning computations.
● Collaboration: Working closely with researchers, data scientists, and hardware engineers
to develop and deploy efficient AI systems.
Do you know anyone who works with compiler development here in Brazil?
There are lots of alumni from UFMG's Compilers Lab who work with compiler development nowadays. Take a look at Slide 34+ in this presentation. We can talk about a few success stories in the classroom!
Let's talk a bit about the importance of compilers with a cool example. Do you
know Thompson's "Reflections on Trusting Trust"?
That's a famous paper in computer science where Ken Thompson discusses the concept of
trusting software, particularly in the context of compilers and the potential for malicious code to
be hidden in a way that is almost impossible to detect. The paper is often cited when discussing
the importance of trust and security in software systems. Thompson argues that we have no
choice but to trust the tools we use. However, this trust is based on the assumption that those
tools are not malicious. The implication is that absolute trust is impossible without rigorous
verification, which itself may be infeasible.
Thompson's exploit starts with a clean compiler, which does not contain any backdoors. This
compiler is trusted by the user and produces backdoor-free binaries. The attack leverages two
backdoors to propagate malicious behavior, even when the source code appears clean. The
process can be described in four stages:
1. Insertion: In Thompson's hack, a (malicious) user modifies the compiler source to
introduce two backdoors:
a. BD1: This backdoor recognizes when the compiler is compiling the login program
and inserts a malicious backdoor (e.g., a hardcoded password) into the compiled
binary.
b. BD2: This self-replicating backdoor recognizes when the compiler is compiling its
own source code and reinserts both BD1 and BD2 into the new compiler binary.
2. Bootstrapping: Once the modified compiler source is compiled with the original
compiler, it produces a binary version of the trojan compiler. This binary now contains
both backdoors, BD1 and BD2. At this point, the source code of the trojan compiler is no
longer necessary, and can be removed. Therefore, the source code of the compiler
appears clean, as BD2 exists only at the binary level.
3. Attack: When the binary trojan compiler compiles the login system, it inserts BD1 into
the resulting binary, creating a trojan login system.
4. Propagation: When the binary trojan compiler compiles a clean version of its own
source code, it reinserts BD1 and BD2 into the new binary compiler. This ensures the
trojan behavior propagates indefinitely, even if the source code is reviewed and found to
be clean.
The genius of this attack lies in its self-propagating nature: the malicious behavior becomes
embedded at the binary level and is undetectable through inspection of the source code alone.
Trusting the compiler binary is therefore essential, as the hack demonstrates how source code
auditing alone cannot guarantee the absence of malicious behavior.
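To make the mechanics concrete, here is a toy sketch in Python. Nothing here is real compiler code: "compilation" is simulated by returning the source text, and all names and patterns are illustrative. The point is how BD1 and BD2 pattern-match on what is being compiled:

def evil_compile(source):
    compiled = source
    # BD1: recognize the login program and plant a hardcoded password.
    if "def check_password" in compiled:
        compiled = compiled.replace(
            "return pw == 'secret'",
            "return pw == 'secret' or pw == 'backdoor'",
        )
    # BD2: recognize the compiler's own source and reinsert both backdoors.
    if "def evil_compile" in compiled:
        compiled += "\n# BD1 and BD2 would be re-inserted here"
    return compiled

login_source = "def check_password(pw):\n    return pw == 'secret'"
print(evil_compile(login_source))
# def check_password(pw):
#     return pw == 'secret' or pw == 'backdoor'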
How could someone compromise the Brazilian election using this hack?
Let's walk through a step-by-step example of how a hacker with control over a compiler could
compromise a voting machine in an election, based on the principles from Thompson's paper.
1. Initial Setup
a. Voting Machine: The machine runs a specialized Linux-based software that
counts votes and sends results to a central server.
b. Compiler: This software is written in a programming language like C or C++ and
compiled using a compiler like GCC.
2. Hacker Controls the Compiler
a. The hacker gains control of the compiler used to build the voting software. This
could happen if the compiler is developed by a trusted but compromised third
party or if the hacker manages to inject malicious code into an existing compiler.
3. Inserting the Backdoor into the Compiler
a. The hacker modifies the compiler to recognize when it is compiling the voting
software. When this occurs, the compiler automatically inserts a backdoor into
the compiled version of the software, without altering the source code that
developers see.
b. Example: The backdoor might be designed to alter vote counts, either by flipping
a small percentage of votes from one candidate to another or by manipulating the
totals in a more subtle way.
4. Self-Replicating Compiler Hack
a. To ensure that this malicious behavior persists, the hacker also modifies the
compiler to recognize when it is compiling its own source code. When
recompiling itself, the compiler reinserts the backdoor code, ensuring that even if
someone tries to build a new, "clean" compiler from source, the backdoor
remains.
5. Compiling the Voting Software
a. The voting software is compiled using this compromised compiler. The
developers see nothing suspicious, as the source code appears normal.
However, the compiled executable contains the malicious backdoor.
6. Deploying the Voting Software
a. The compromised voting software is deployed on voting machines. When people
cast their votes, the software functions as expected on the surface but secretly
manipulates the vote counts, exactly as the backdoor was programmed to do.
What does this course cover?
1. Lexical and Syntactic Analysis: Understanding how to break down and parse
arithmetic expressions, including both top-down and bottom-up parsing techniques.
2. Type Systems: Verification and inference of types in programs, including working with
anonymous functions, closures, and recursive functions.
3. Code Generation: Techniques for generating and optimizing machine code, including
control structures and function implementation.
4. Memory Management: Understanding memory allocation, garbage collection, and heap
management.
5. Program Analysis and Optimization: Control flow graphs, liveness analysis, register
allocation, and peephole optimizations.
The course also includes practical work with 15 virtual programming labs, worth 4.5 points each,
and two major exams. The structure of the course is mostly based on Crafting Interpreters,
which is a freely available textbook on the art of designing and implementing interpreters and
compilers.
In the labs, students will build two systems:
1. Tree-based Interpreter: This interpreter will handle closures and recursive functions.
2. Code Generator: This generates RISC-V instructions and manages anonymous
functions, but not recursive ones.
Students will work in Python, with 15 deliverables that progressively build the compiler. The
deliverables are interdependent, but some can be completed independently, even if earlier steps
are unfinished. Each deliverable comes with partially completed code that students must finish.
Students are encouraged to follow a test-driven development approach, using provided tests to
guide their implementation. They can submit their work multiple times through the Moodle
system, and partial credit is available if not all tests pass initially.
The homework itself is worth 60 points, but includes an additional 7.5 bonus points, meaning
students can earn up to 67.5 points for the homework, though only 60 will count toward the final
grade.
Additionally, students have opportunities to earn extra credit. Up to 5 points can be earned by
answering questions on https://www.reddit.com/r/Compilers/, with each answer worth 1 point.
Notice that students can earn at most five extra points in the course! So, even if they get 12.5
extra points, that will be capped to 5.0.
Let's close with a bit of history. Do you know some of the history of early
compilers?
I do! The history of the first compilers is a cornerstone of computer science, marking the
transition from low-level, machine-oriented programming to more human-friendly, high-level
languages. Here's a summary of key milestones:
1. Programming Before Compilers (1940s–early 1950s)
In the 1940s and early 1950s, programming was done directly in machine code or assembly language. These methods were tedious, error-prone, and tightly tied to the specific hardware at hand.
The need for an automated way to translate higher-level abstractions into machine-readable instructions gave rise to the concept of the compiler.
2. Grace Hopper and the A-0 System (early 1950s)
Grace Hopper is often credited with developing one of the first compiler-like tools. Her work included:
● The A-0 System, which she designed for the UNIVAC I. This system converted symbolic
mathematical instructions into machine code.
● Her groundbreaking ideas laid the foundation for the development of programming
languages and compilers.
3. FORTRAN (1954–1957)
The FORTRAN (FORmula TRANslation) compiler, developed by John Backus and his team at IBM, was the first widely successful compiler. A key aspect of the project was its emphasis on optimization: the generated machine code had to be efficient enough to rival hand-written assembly, or skeptical programmers would not have adopted the language.
4. LISP (late 1950s–early 1960s)
The development of LISP, a language for artificial intelligence research, brought recursive programming and symbolic computation into the spotlight. John McCarthy's team created a compiler capable of handling the complexities of LISP's syntax and semantics.
5. COBOL (1959–1960)
The COBOL (COmmon Business-Oriented Language) compiler emerged as the first attempt to make programming languages universally understandable. It emphasized readability, with an English-like syntax that non-specialists could follow.
Building these early compilers was challenging for several reasons:
● Limited hardware resources: Compilers had to run on the same primitive systems they
were targeting.
● Complexity: Translating high-level abstractions into optimized machine code was a novel
and technically demanding task.
● Resistance: Many programmers initially resisted high-level languages, fearing a loss of
efficiency.
How was the first compiler itself compiled?
Bootstrapping refers to the process of building a complex system (like a compiler) from a simpler one. For the first compilers, here's how it was likely possible:
1. Manual Translation: The very first compilers were often written in assembly language
or even machine code, which could be directly executed by the computer's hardware.
This is the simpler starting point: the development of the compiler began by manually
translating its logic into these low-level instructions.
3. Self-Compilation: In some cases, the early stages of the compiler involved a
self-compiling process. The very first version of the compiler might only handle a
limited subset of the language. As the compiler was improved, more complex features
were added. Eventually, this early version could compile itself (i.e., compile a more
advanced version of the compiler).
4. Two-Stage Process: In practice, the compiler development process often starts with a
"bootstrapping" phase where:
○ A very simple "primitive" compiler is manually written in an existing low-level
language like assembly.
○ This simple compiler can compile a more advanced version of the compiler,
which in turn can compile even more sophisticated versions.
When John Backus and his team developed the first FORTRAN compiler in the 1950s, the early
stages of the process were likely done by:
● Writing initial pieces of the compiler by hand in assembly language (low-level code).
● Using that primitive assembler or interpreter to slowly build up the full FORTRAN
compiler.
Once the first working compiler was created, it could be used to compile additional versions,
eventually leading to more advanced versions of the FORTRAN compiler capable of compiling
complex programs.
In Summary:
This bootstrap process is essential to the development of complex systems, and it illustrates
how simple, manually-crafted programs can evolve into fully automated, self-compiling tools.
This John Backus was very important, right? He bridged the gap between
theory and practical application in compilers, I suppose!
Yes: he was tremendous! John Warner Backus (1924–2007) was an American computer
scientist who revolutionized programming by creating the first high-level programming language,
FORTRAN, and advancing the field of compiler design. After initially studying chemistry and
dropping out of medical school, Backus discovered his passion for mathematics and
programming at IBM in the early 1950s. He led the team that developed FORTRAN, which
drastically simplified programming and made computers more accessible to scientists and
engineers. Later in his career, Backus shifted focus to programming language theory, proposing
innovative ideas like functional programming and contributing to the formal description of
programming languages with the Backus-Naur Form (BNF). For his groundbreaking
contributions, Backus received numerous accolades, including the Turing Award in 1977.
Backus’s work fundamentally changed how humans interact with computers, making
programming more efficient, scalable, and accessible. His contributions remain central to
computer science today.
Lexical Analysis
Code: https://github.com/pronesto/DCC053_Public/tree/main/1_ArithLexing
The scanner's job is to take an input string (here, an expression in Reverse Polish Notation, or RPN) and break it down into tokens, such as numbers and operators (+, -, *, /). These tokens will later be processed to compute the result.
Once the scanner produces the list of tokens, the next step is to evaluate the RPN expression
using a stack-based approach. RPN means that operators come after their operands, so we
can evaluate expressions by pushing numbers onto the stack and applying operators when they
are encountered.
import enum

class TokenType(enum.Enum):
    EOF = -1   # End of file
    NLN = 0    # New line
    WSP = 1    # White space
    NUM = 2    # Number (integers)
    ADD = 202  # The token '+'
    SUB = 203  # The token '-'
    MUL = 204  # The token '*'
    DIV = 205  # The token '/'

class Token:
    operators = {TokenType.ADD, TokenType.SUB, TokenType.MUL, TokenType.DIV}

    def __init__(self, text, kind):
        self.text = text  # The string matched by this token, e.g., "42" or "+"
        self.kind = kind  # The token's category: a TokenType value

class Lexer:
    def __init__(self, input_string):
        self.input_string = input_string
        self.position = 0
        self.length = len(input_string)

    def next_valid_token(self):
        token = self.getToken()
        if token.kind == TokenType.WSP or token.kind == TokenType.NLN:
            token = self.next_valid_token()
        return token

    def tokens(self):
        token = self.getToken()
        while token.kind != TokenType.EOF:
            if token.kind != TokenType.WSP and token.kind != TokenType.NLN:
                yield token
            token = self.getToken()

    def getToken(self):
        if self.position >= self.length:
            return Token("", TokenType.EOF)
        current_char = self.input_string[self.position]
        self.position += 1
        if current_char.isdigit():
            # Handle numbers (NUM)
            number_text = current_char
            while (
                self.position < self.length
                and self.input_string[self.position].isdigit()
            ):
                number_text += self.input_string[self.position]
                self.position += 1
            return Token(number_text, TokenType.NUM)
        elif current_char == "+":
            return Token(current_char, TokenType.ADD)
        elif current_char == "-":
            return Token(current_char, TokenType.SUB)
        elif current_char == "*":
            return Token(current_char, TokenType.MUL)
        elif current_char == "/":
            return Token(current_char, TokenType.DIV)
        elif current_char == " ":
            return Token(current_char, TokenType.WSP)
        elif current_char == "\n":
            return Token(current_char, TokenType.NLN)
        else:
            raise ValueError(f"Unexpected character: {current_char}")
The lexer is responsible for breaking an input string into tokens, which represent numbers and
operators like +, -, *, and /. Here’s an explanation of each part of the code to help a computer
science student understand how it works:
1. TokenType Class
The TokenType class uses Python's enum.Enum to define various token types used in the
lexer. Each token type represents a category of symbols or data in the input:
● EOF: End of file/input, signaling that the input string has been completely processed.
● NLN: New line (\n), which indicates the end of a line.
● WSP: Whitespace ( ), representing spaces in the input.
● NUM: Numbers (integer values).
● ADD, SUB, MUL, DIV: The four arithmetic operators (+, -, *, /).
The numbers associated with each type are mostly symbolic. The exact values don’t matter, but
they help differentiate between the types.
2. Token Class
The Token class represents a single token in the input string. A token has two main attributes:
● text: The string representation of the token (e.g., "5" for a number or "+" for an
operator).
● kind: The token's type, which comes from the TokenType enum (e.g.,
TokenType.NUM for a number, TokenType.ADD for the + operator).
This class is used to store each token's text and its type as the lexer processes the input string.
3. Lexer Class
The Lexer class is responsible for reading the input string and breaking it down into tokens.
Constructor (__init__)
● input_string: The input to be tokenized (e.g., an RPN expression like "3 4 +").
● position: Tracks the current position in the input string, initialized to 0.
● length: The total length of the input string.
This constructor sets up the necessary variables to iterate over the input string.
next_valid_token()
This method returns the next token that is neither whitespace nor a newline: it calls getToken() and recursively skips over WSP and NLN tokens.
tokens()
This method is a generator that returns all tokens in the input string, excluding whitespace and
newlines. It continuously calls getToken() until it reaches the end of the input (EOF).
getToken()
This method is the core of the lexer. It reads one character at a time from the input string and
returns the corresponding token. Here's how it works:
● End of Input Check: If the current position has reached the end of the input, return an
EOF token.
● Number Handling: If the current character is a digit, it tries to read an entire number:
○ It starts with the current digit and keeps reading while the next characters are
also digits.
○ Once it reads the whole number, it creates and returns a NUM token.
● Operator Handling: If the current character is +, -, *, or /, it returns the corresponding
operator token.
● Whitespace and Newline: If the current character is a space (" ") or newline ("\n"), it
returns the corresponding token (WSP or NLN).
● Error Handling: If an unrecognized character is encountered, it raises an error.
Consider, for example, how the lexer processes the input string "3 4 + 5 *":
1. First Token: The character '3' is a digit, so the lexer reads it and returns a Token with
kind=TokenType.NUM and text="3".
2. Second Token: The lexer skips the space character.
3. Third Token: The character '4' is a digit, so the lexer reads it and returns a Token with
kind=TokenType.NUM and text="4".
4. Fourth Token: The lexer skips the space and then finds the '+' character. It returns a
Token with kind=TokenType.ADD and text="+".
5. Fifth Token: The lexer reads the '5' and returns a Token with kind=TokenType.NUM
and text="5".
6. Sixth Token: The lexer reads the '*' and returns a Token with
kind=TokenType.MUL and text="*".
At the end of the input string, the lexer returns an EOF token to signal that the input has been
fully processed.
Could you name some key concepts related to the scanner we just wrote?
Tokenization: This is the process of breaking down an input string into meaningful chunks
(tokens), each representing a number or operator.
State Management: The lexer uses position to keep track of where it is in the input string. It
increments position as it processes each character.
Skipping Whitespace: The lexer handles whitespace and newlines but skips over them to
focus on numbers and operators, the meaningful parts of an RPN expression.
Error Handling: The lexer raises an error if it encounters any unexpected characters, ensuring
that the input conforms to what the RPN calculator expects.
Now, can you implement the actual calculator that uses these tokens?
def compute_postfix(lexer):
    stack = []
    for token in lexer.tokens():
        if token.kind == TokenType.NUM:
            # Operands are pushed onto the stack.
            stack.append(int(token.text))
        elif token.kind in Token.operators:
            if len(stack) < 2:
                raise ValueError("Insufficient values for the operation.")
            b = stack.pop()
            a = stack.pop()
            if token.kind == TokenType.ADD:
                result = a + b
            elif token.kind == TokenType.SUB:
                result = a - b
            elif token.kind == TokenType.MUL:
                result = a * b
            elif token.kind == TokenType.DIV:
                if b == 0:
                    raise ZeroDivisionError("Division by zero.")
                result = a // b
            stack.append(result)
        else:
            raise ValueError(f"Unexpected token type: {token.kind}")
    # The final result should be the only value left in the stack
    if len(stack) != 1:
        raise ValueError("The user input has too many values.")
    return stack[0]
This code implements the Reverse Polish Notation (RPN) calculator, which evaluates
arithmetic expressions in postfix notation (RPN). In postfix notation, operators follow their
operands, and the order of operations is implicitly determined by the placement of the operators.
Let’s break down how this function works for a computer science student:
Function Overview
The function iterates over the tokens produced by the lexer. Numbers are pushed onto a stack; when an operator appears, the two topmost values are popped, combined, and the result is pushed back. Once all tokens have been consumed, the single value left on the stack is the result of the expression.
What are the key concepts one should keep about our RPN calculator?
1. Stack-based Evaluation: RPN naturally fits a stack-based approach since operands are
pushed onto the stack, and operators pop them off, compute the result, and push it back.
2. Token Processing: The lexer is responsible for tokenizing the input, and the calculator
processes each token in sequence.
3. Error Handling:
○ The code checks for sufficient values in the stack before applying operators.
○ It handles division by zero.
○ It ensures that only one value remains in the stack at the end.
This structure makes the code modular, allowing easy extension for more operations or
features.
A token typically has two parts:
● Type (Kind): Represents the category or classification of the token (e.g., keyword,
identifier, number, operator). In your Lexer.py example, this is represented by the
TokenType enum.
● Text (Value): The actual text or value of the token from the source code. For example, if
the lexer identifies a number, the text would be the numeric string itself (e.g., "42").
Typical token categories include:
Identifiers: Names defined by the programmer (e.g., variable names like x, counter).
Literals: Fixed values such as numbers (e.g., 123), strings (e.g., "hello"), or characters (e.g.,
'a').
Why do we split the string into tokens? How does that help?
Tokens allow the lexer to simplify the input for the next phase of the compiler or interpreter,
which is parsing. By converting the raw code into a sequence of tokens, the parser can more
easily understand and process the structure of the code. They help detect syntax errors early in
the compilation or interpretation process.
In this example, the regular expression is the word "error", which can be recognized by a finite automaton with six states linked by five transitions: 'e' → 'r' → 'r' → 'o' → 'r'.
Here, . matches any character, and {5} specifies the exact number of repetitions, so the pattern .{5} matches any five-character string.
These examples showcase how AWK combined with regular expressions becomes a powerful
tool for searching and processing text.
A lexer uses regular expressions to scan the input source code and match substrings that fit the
patterns defined for different token types. For instance, a regular expression like \d+ might be
used to recognize a sequence of digits as a numeric literal token (NUM).
Regular expressions can be efficiently compiled into finite automata, which are used by lexers to
quickly scan and recognize tokens in the input string. Tools like lex or flex generate lexers by
converting regular expressions into efficient state machines that can process input streams in
linear time.
So, in summary, lexical analysis and regular expressions are deeply connected because regular
expressions provide the formal framework to describe the patterns of the different tokens a lexer
needs to identify. Regular expressions are used to define the patterns of tokens, and the lexer
matches these patterns against the source code to produce a stream of tokens for further
processing by the parser.
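To see this connection in code, here is a small sketch of a regex-driven scanner built with Python's re module. The token set mirrors our RPN lexer; the named-group idiom is a common technique, not the implementation used elsewhere in this book:

import re

TOKEN_SPEC = [
    ("NUM", r"\d+"),    # integers
    ("ADD", r"\+"),
    ("SUB", r"-"),
    ("MUL", r"\*"),
    ("DIV", r"/"),
    ("WSP", r"\s+"),    # whitespace, skipped below
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    # Each match is classified by the name of the group that captured it.
    # (A production lexer would also flag characters that match no pattern.)
    for match in MASTER_RE.finditer(text):
        if match.lastgroup != "WSP":
            yield (match.lastgroup, match.group())

print(list(tokenize("3 4 + 5 *")))
# [('NUM', '3'), ('NUM', '4'), ('ADD', '+'), ('NUM', '5'), ('MUL', '*')]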
So, can we say that the set of all the tokens in a programming language
form a regular language?
Yep! Tokens are strings that belong to some regular language, especially in the context of
lexical analysis. A regular language is a set of strings that can be recognized by a finite
automaton. Regular languages are defined by regular expressions, which describe patterns of
strings. Each type of token that a lexer needs to recognize (like identifiers, numbers, keywords,
operators) can be described by a regular expression. Since regular expressions define regular
languages, the set of all valid tokens for a given programming language can be seen as
belonging to a regular language. During lexical analysis, the lexer uses regular expressions to
identify tokens in the input string. Each time the lexer matches a portion of the input with a
regular expression, it identifies that portion as a token. This means that the strings recognized
as tokens are those that belong to the regular language described by the corresponding regular
expression. Lexers often use finite automata (which can recognize regular languages) to
efficiently match input strings against the patterns described by regular expressions. This
connection further supports the idea that tokens are part of a regular language.
In short, in lexical analysis, a finite automaton is used to recognize
tokens in a stream of input. Here's a quick breakdown of the process:
1. States and Transitions: The automaton consists of a set of states, including a starting
state and one or more final (accepting) states. The transitions between states are
determined by the input characters.
2. Token Recognition: As the automaton processes the input string character by
character, it moves from one state to another based on the defined transitions.
3. Final States: When the automaton reaches a final state after processing the input, it
indicates that the input string matches a specific token pattern. The particular final state
reached determines the type of token that should be returned.
This method allows for efficient tokenization of input, as finite automata can be implemented using deterministic or nondeterministic approaches. Each token corresponds to a path from the starting state of the automaton to one of its final states.
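A minimal sketch of such a table-driven lexer (for numbers and '+' only) is shown below; the transition table, the dispatch function, and the state names ('start', 'number') are illustrative, and the explanation that follows refers to this general structure:

def state_number(char, buf, out):
    # Continuing state: accumulate the digit; no token is produced yet.
    return "number", buf + char

def state_add(char, buf, out):
    out.append(("ADD", char))
    return "start", ""

def state_skip(char, buf, out):
    return "start", ""

# The transition table maps each state and input class to an action.
transition_table = {
    "start": {"digit": state_number, "+": state_add, "wsp": state_skip},
    "number": {"digit": state_number},
}

def classify(char):
    return "digit" if char.isdigit() else ("wsp" if char in " \n\0" else char)

def dispatch(text):
    state, buf, out = "start", "", []
    for char in text + "\0":              # the sentinel flushes a trailing number
        action = transition_table[state].get(classify(char))
        if action is None and state == "number":
            out.append(("NUM", buf))      # leaving 'number' emits the token
            state, buf = "start", ""
            action = transition_table[state].get(classify(char))
        if action is None:
            raise ValueError(f"Unexpected character: {char}")
        state, buf = action(char, buf, out)
    return out

print(dispatch("12 + 34"))  # [('NUM', '12'), ('ADD', '+'), ('NUM', '34')]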
Explanation:
1. Transition Table: The transition_table dictionary maps states (like 'start' and
'number') to the actions associated with input characters. Each action can be a
method that handles the logic for that input character in that state.
2. Dispatch Function: The dispatch method handles the logic for checking the current
character and determining which action to take based on the transition table.
3. State Methods: Each method (like state_number, state_add, etc.) returns the next
state and the token (if any) to be produced. For example, when a digit is recognized, it
transitions to the state_number.
4. Continuing States: The number state allows for continuation without returning a token
immediately, enabling the lexer to handle multi-digit numbers.
Lexer generators automate this construction: they turn a high-level description of the tokens into scanner code. The typical workflow is:
1. Token Definitions: You define the tokens you want to recognize using regular
expressions. Each token can have a name and a corresponding pattern. For example:
NUM: [0-9]+
ADD: \+
SUB: -
WSP: \s+
2. Transition Table Creation: The lexer generator takes these definitions and generates a
transition table (or state machine) that describes how to transition between states based
on input characters. This is often done using finite automata.
3. Code Generation: The generator produces code that implements the lexer. This code
will include functions for handling each token type and logic for managing state
transitions.
4. Error Handling: Lexer generators often include error handling mechanisms to deal with
unexpected input characters or invalid token definitions.
5. Integration: The generated lexer code can be integrated into a parser, allowing for
seamless tokenization and syntactic analysis.
Some popular lexer generators include:
● Flex: A widely used lexer generator for C/C++ that takes regular expressions as input
and generates a C source file for the lexer.
● ANTLR: A powerful tool that can generate lexers and parsers in multiple languages,
allowing for complex grammars and language constructs.
● JFlex: A lexer generator for Java that is similar to Flex but tailored for Java applications.
The benefits of using a lexer generator include:
● Efficiency: Automatically generating lexers can save time and reduce errors in
implementation.
● Maintainability: Changes to token definitions can be easily made in the high-level
description, and regenerating the lexer will update the code accordingly.
● Flexibility: Lexer generators can handle complex tokenization scenarios and provide
built-in features for common tasks.
Overall, lexer generators simplify the process of building lexers, making it easier for developers
to focus on the syntax and semantics of the language rather than the intricacies of state
management and token recognition.
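As a concrete Python example, the third-party PLY library (installable with pip install ply) generates a scanner from regular-expression rules. This is a minimal sketch of its usage, separate from the hand-written lexer in this chapter:

import ply.lex as lex

tokens = ("NUM", "ADD", "SUB")   # the token names PLY should produce

t_ADD = r"\+"                    # simple tokens: a rule is just a regex
t_SUB = r"-"
t_ignore = " \t\n"               # characters to skip silently

def t_NUM(t):
    r"\d+"                       # the docstring is this token's regex
    t.value = int(t.value)
    return t

def t_error(t):
    raise ValueError(f"Unexpected character: {t.value[0]}")

lexer = lex.lex()                # build the scanner from the rules above
lexer.input("3 4 + 5 -")
for tok in lexer:
    print(tok.type, tok.value)   # NUM 3, NUM 4, ADD +, NUM 5, SUB -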
But then, in the end, I can't have tokens formed by balanced parentheses?
Yes, that's correct. Regular languages, which can be described by regular expressions or finite
automata, are not capable of recognizing patterns with nested or balanced structures, such as
balanced parentheses. This is because regular languages can only handle patterns with a fixed,
linear structure and cannot track nested or recursive patterns.
To recognize balanced parentheses or similar nested structures, you'd need a more powerful
computational model, like a context-free grammar, which is used to describe context-free
languages. Context-free grammars can handle nested patterns and are typically used in parsing
tasks where balanced parentheses are common, such as in programming language syntax.
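A short sketch makes the limitation tangible: checking balanced parentheses needs a counter that can grow without bound, which no fixed, finite set of automaton states can simulate, while a single counter (the degenerate case of a pushdown automaton's stack) handles it easily:

def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1           # unbounded nesting: a DFA cannot count this
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False     # a closing parenthesis with no match
    return depth == 0

print(balanced("(()())"))  # True
print(balanced("(()"))     # False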
Why, then, do lexers rely on regular languages instead of context-free grammars? There are several reasons:
1. Simplicity: Regular languages are simpler and easier to implement compared to
context-free languages. Regular expressions can define patterns with straightforward
rules, making them efficient to parse.
2. Efficiency: Lexical analyzers (lexers) built using regular expressions and finite automata
can operate in linear time, O(n), where n is the length of the input string. This efficiency
is crucial for performance in compilers, as lexers typically process large amounts of text.
3. Determinism: Many regular languages can be implemented using deterministic finite
automata (DFAs), which allow for faster processing since they require only a single state
transition per input character. Context-free grammars, on the other hand, often involve
more complex parsing techniques, such as LL or LR parsing, which can require
backtracking or multiple passes over the input.
4. Separation of Concerns: Using regular languages for lexical analysis allows for a clear
separation between the lexical and syntactic analysis phases of compilation. Lexers
handle the tokenization (breaking input into tokens) while parsers handle the structure
and rules of the language. This separation simplifies the design of compilers.
5. Limited Scope of Tokens: The types of tokens that need to be recognized (like
keywords, identifiers, literals, operators, etc.) are generally simple and can be expressed
as regular languages. Context-free grammars are typically used for more complex
structures, like expressions or statements that may involve nested constructs.
While CFGs can describe more complex languages, their use in token recognition is often
unnecessary and can complicate the implementation. They are more suitable for parsing the
syntax of the language rather than for token recognition. In a typical compiler architecture:
● Lexical Analysis (Lexer): Uses regular expressions to identify and categorize tokens.
● Syntactic Analysis (Parser): Uses context-free grammars to analyze the grammatical
structure of the token sequences produced by the lexer.
In summary, regular languages provide the right balance of expressiveness, efficiency, and
simplicity for lexical analysis, while context-free grammars are used in later stages of the
compilation process where more complex language features are involved.
Could a lexer written as a nest of if statements be slower than one based on automata?
Yes! Lexical analyzers (lexers) built as a nest of if statements could indeed have a time
complexity of O(n*T), where n is the length of the input string and T is the number of tokens (or
the number of token patterns to be tested). This is because, in the worst case, the lexer would
need to evaluate each character against all possible token patterns, leading to multiple checks
for each input character. Notice that the lexer we built manually in this class operates as a nest
of if-then-else blocks; hence, its complexity is sub-optimal.
On the other hand, a lexer built using regular expressions and finite automata operates in O(n),
since the finite automaton processes each character exactly once, following state transitions
without needing to check all token patterns individually. This makes the automaton-based lexer
more efficient, especially as the number of token patterns increases.
How would we evaluate expressions written in prefix, rather than postfix, notation?
To compute expressions in prefix notation (where operators come before their operands), the
approach differs from the one used for postfix notation, primarily in how the expression is
processed.
1. Operator First: In prefix notation, operators appear before their operands. For example,
the expression + 3 4 would be written as 3 4 + in postfix notation.
2. Right-to-Left Processing: Since the operator comes first, prefix expressions are
generally processed from right to left. This allows us to encounter the operands first
before applying the operator, which contrasts with postfix where we process left-to-right.
def compute_prefix(lexer):
    """
    Evaluates an arithmetic expression in Polish Notation (Prefix Notation).
    Examples:
    >>> lexer = Lexer("+ 3 * 4 2")
    >>> compute_prefix(lexer)
    11
    """
    token = lexer.next_valid_token()
    if token.kind == TokenType.NUM:
        # Base case: return the value if it's a number
        return int(token.text)
    # Recursive case: evaluate the two operand sub-expressions first.
    a = compute_prefix(lexer)
    b = compute_prefix(lexer)
    if token.kind == TokenType.ADD:
        return a + b
    elif token.kind == TokenType.SUB:
        return a - b
    elif token.kind == TokenType.MUL:
        return a * b
    elif token.kind == TokenType.DIV:
        if b == 0:
            raise ZeroDivisionError("Division by zero.")
        return a // b
    else:
        raise ValueError(f"Unexpected token type: {token.kind}")
In summary, prefix notation requires that operands be evaluated before their operator is applied. We can achieve this either by scanning the expression from right to left with a stack or, as the code above does, by recursion: each operator recursively evaluates its operand sub-expressions before combining their results.
What if you had to add variables and keywords (add, sub, etc) to your
language? How would you do that?
There is a cheap trick to do it! We can add the keywords to a table. And then we use this table
to return the right token. For instance, let's imagine that we have general variables (symbols
formed of letters and digits, that start with letters) in addition to keywords like add and sub. So,
we can either write "+ 3 4" or "add 3 4". We need to create the new token VAR:
class TokenType(enum.Enum):
    """
    This class defines the possible tokens that we consider.
    """
    EOF = -1   # End of file
    NLN = 0    # New line
    WSP = 1    # White space
    NUM = 2    # Number (integers)
    VAR = 3    # Variables: letters and digits, beginning with a letter
    ADD = 202  # The token '+'
    SUB = 203  # The token '-'
    MUL = 204  # The token '*'
    DIV = 205  # The token '/'
class Token:
...
@staticmethod
def key_word_token(text):
"""
This method returns the token associated with a given keyword in the
language, or None otherwise.
Example:
>>> Token.key_word_token('add').text
'+'
>>> Token.key_word_token('sub').text
'-'
>>> Token.key_word_token('x').text
'x'
"""
if text in tokens:
return tokens[text]
else:
return Token(text, TokenType.VAR)
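The table named tokens consulted above is not shown in the listing; a plausible definition, assuming the Token(text, kind) constructor used elsewhere, would be:

# Hypothetical keyword table (defined after the Token class): maps each
# keyword to the token it denotes. Entries beyond add/sub are assumptions.
tokens = {
    "add": Token("+", TokenType.ADD),
    "sub": Token("-", TokenType.SUB),
    "mul": Token("*", TokenType.MUL),
    "div": Token("/", TokenType.DIV),
}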
Finally, we add a special case to the Lexer, to read the new tokens:
def getToken(self):
    ...
    if current_char.isdigit():
        # Handle numbers (NUM)
        ...
    elif current_char.isalpha():
        # Handle identifiers and keywords (VAR, or keywords like add/sub)
        id_text = current_char
        while (
            self.position < self.length
            and self.input_string[self.position].isalnum()
        ):
            id_text += self.input_string[self.position]
            self.position += 1
        return Token.key_word_token(id_text)
    elif current_char == "+":
        ...
Notice that we don't need to add special cases for the keywords add and sub in the lexer: that's
already taken care of in the table! With these changes, we can run tests like:
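For instance, tests along these lines (a plausible doctest, assuming the compute_prefix evaluator from before) should now pass:

>>> lexer = Lexer("add 3 4")
>>> compute_prefix(lexer)
7
>>> lexer = Lexer("* add 3 4 2")
>>> compute_prefix(lexer)
14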
1. Breaking the Input into Tokens
● The source code is typically a long sequence of characters that can be difficult to process directly. The scanner breaks this input into discrete tokens, making it easier for later stages (like parsing) to work with structured data instead of raw text.
● Example: In int x = 5;, the scanner produces tokens like int (keyword), x (identifier), = (operator), 5 (number), and ; (punctuation).
2. Removing Noise
● The scanner handles details like whitespace (spaces, newlines) and comments, which
are usually not relevant for execution but help with code readability.
● By filtering out these elements early, the scanner ensures that only meaningful parts of
the code are passed to later stages.
3. Simplifying the Design
● Separating scanning from parsing simplifies the overall design of the compiler or interpreter. Instead of handling raw characters during parsing (which can be complex), the parser can focus on analyzing the structure of the program using tokens.
● This separation makes it easier to manage error detection and recovery during the parsing stage.
4. Early Error Detection
● The scanner is responsible for identifying lexical errors, such as illegal characters or malformed numbers. This allows for early detection of issues and better error reporting before deeper semantic checks are performed.
5. Feeding the Parser
● The tokens produced by the scanner are passed to the parser, which checks the syntactic structure of the code (grammar). This is where higher-level constructs like expressions, statements, and functions are formed.
● The correctness of the scanner is critical because improper tokenization can lead to parsing errors and incorrect interpretation of the code's structure.
6. Supporting Language Features
● A well-designed scanner can handle different forms of tokens, enabling support for various programming language features such as:
○ Different number formats (integers, floating points, hexadecimals).
○ String literals with escape sequences.
○ Operators with different precedence levels.
7. Extensibility
● As programming languages evolve and become more complex, the scanner can be extended to recognize more advanced lexical patterns, like identifiers in languages with different symbol rules or keywords with context-sensitive meanings.
In summary, the scanner is fundamental to bridging the gap between raw code and a
structured program. It is the first step toward understanding and executing a programming
language. Without a scanner, managing the complexities of modern programming languages
would be far more difficult and error-prone.
1. Source Code
● The starting point of the compiler pipeline is the raw source code, which is just a sequence of characters or bytes. This is a high-level program written in a programming language like C, Python, or Java.
● Example: int x = 5 + 3; is a sequence of characters in a C program.
2. Lexical Analysis (Scanning)
● The first transformation occurs in the lexical analysis (scanning) phase, where the raw source code is converted into tokens. Tokens are the basic building blocks of the language, like keywords (int), identifiers (x), operators (+), and literals (5, 3).
● This is where whitespace and comments are ignored, and the program is reduced to meaningful elements.
● Example: The source code int x = 5 + 3; would be tokenized into something like [INT, IDENTIFIER(x), EQUALS, NUMBER(5), PLUS, NUMBER(3), SEMICOLON].
3. Syntactic Analysis (Parsing)
● Next is the syntactic analysis (parsing) phase, where the sequence of tokens is analyzed according to the language's grammar to form an Abstract Syntax Tree (AST).
● The AST represents the structure of the program in a hierarchical way. It reflects how different components (like expressions, statements, functions) are related.
● Example: The expression 5 + 3 would be represented as a tree where + is the root, with 5 and 3 as its children.
4. Semantic Analysis
● In the semantic analysis phase, the AST is checked for meaning beyond syntax. This
includes checking:
○ Type correctness (e.g., ensuring you don't add a string to a number).
○ Scope resolution (e.g., ensuring that variables are declared before being used).
● This stage might annotate the AST with additional information, such as data types.
5. Intermediate Representation (IR)
● After semantic analysis, the AST is typically lowered into an intermediate representation, such as three-address code. Example: int x = 5 + 3; could become:
t1 = 5
t2 = 3
t3 = t1 + t2
x = t3
6. Optimization
● The IR is then passed through an optimization phase where transformations are made
to improve the performance and efficiency of the program. Common optimizations
include:
○ Dead code elimination (removing code that has no effect).
○ Constant folding (precomputing constant expressions).
○ Loop unrolling (optimizing loop execution).
● The goal is to produce more efficient machine code without altering the program's
semantics.
7. Code Generation
● The optimized IR is translated into assembly (or other low-level) code for the target machine. Example:
MOV R1, 5
MOV R2, 3
8. Machine Code
● Finally, the assembly or low-level code is translated into machine code, which is a
binary representation that the CPU can directly execute. This is the final output of the
compiler.
● Example: The assembly instructions would be turned into a series of binary instructions
understood by the CPU, like 1101001010010000.
9. Linking
● In languages like C or C++, once the machine code is generated, linking combines the object code with libraries and other modules to create an executable.
Each stage in the pipeline is a transformation from one representation of the program to
another, and each stage gets progressively closer to machine-executable code. This design
makes compilers modular and easier to maintain, as different stages can be modified or
optimized independently.
1. Source Code: At this stage, it's indeed just a sequence of bytes for the machine. To the
computer, it's a series of bits that must be processed to derive any meaning. This
processing involves interpreting the syntax and structure defined by the language's
grammar.
2. Tokens: These are symbolic representations extracted from the source code by a lexer
or tokenizer. While tokens are meaningful to a human who understands the language, to
the machine, they're just symbols. The semantic relationships between these tokens
aren't apparent until further processing.
3. AST (Abstract Syntax Tree): This is where the semantic structure starts to become
explicit. The AST represents the hierarchical syntactic structure of the source code, and
it can be traversed or interpreted to understand the code's intended behavior. It provides
a concrete way to examine the structure and relationships within the code, making it
possible to apply further transformations or optimizations.
In summary, the AST is indeed the stage where we transition from raw tokens and bytes to a
meaningful representation that can be interpreted by a machine. Your emphasis on the role of
parsing in extracting meaning aligns with this perspective.
So, is lexical analysis the entry point of every system that processes
programs?
The scanner, or lexer, is indeed the entry point in the process of transforming source code into
executable behavior. Here's a breakdown of its role:
1. Lexical Analysis: The scanner performs lexical analysis, converting the raw source
code into a stream of tokens. These tokens are the basic building blocks of the
language's syntax, such as keywords, operators, and identifiers.
2. Entry Point: As the first stage in the processing pipeline, the scanner essentially serves
as the gateway for the execution engine—whether it's an interpreter, compiler, or virtual
machine. It provides the subsequent stages with a structured representation of the code
that can be further analyzed and processed.
3. Interaction with Execution Engines:
○ Interpreter: The interpreter uses the tokens to build an AST or directly interpret
the behavior of the program.
○ Compiler: The compiler translates the tokens into intermediate representations
or directly into machine code.
○ Virtual Machine: A bytecode compiler uses the tokens to generate bytecode or other intermediate forms that the VM can then execute.
In essence, the scanner’s output is a critical input for all downstream components of the
execution engine, making it a fundamental part of the process.
I wish I could take a look into an actual lexer used in a real-world system.
Do you know any?
Sure! Let's take a look at the lexer in the One True AWK repository. The file lex.c implements AWK's lexical analysis: it reads the input and breaks it into tokens that the AWK interpreter can understand and process.
1. Tokenization:
○ The lexer reads the input stream and splits it into meaningful units (tokens), such
as keywords, operators, numbers, strings, and identifiers.
○ Tokens are then passed to the parser for syntactic analysis.
2. Regular Expression Handling:
○ Since AWK heavily relies on patterns and regular expressions, the lexer includes
logic to identify and handle regular expressions as distinct tokens.
3. Whitespace and Comments:
○ The lexer skips over whitespace and comments, ensuring they don’t interfere
with the interpretation of the AWK program.
4. Error Reporting:
○ The lexer identifies invalid sequences in the input and generates appropriate
error messages.
5. State Management:
○ The lexer keeps track of the current state of parsing (e.g., whether it's inside a
string, processing a regular expression, or handling an escape sequence).
If you browse lex.c, you’ll likely encounter (at least it was there when this chapter was
written!):
● Character Classes: Functions that categorize characters (e.g., isalpha for letters or
isdigit for digits).
● State Machine: Logic for handling different parsing states, such as transitioning
between reading identifiers, strings, or numbers.
● Regular Expression Parsing: Code to distinguish regular expressions from other
patterns.
The lexer is a crucial part of the AWK interpreter. It lays the groundwork for how AWK
understands your input program by converting raw text into a structured format that the parser
can use. If you’re studying lex.c, you’re diving into the core of how AWK interprets its scripting
language.
● Purpose: AWK was designed as a simple and efficient text-processing tool, allowing
users to process text files, extract information, and perform data transformations using
patterns and actions.
● Inspiration: The language's design reflects the combined influence of Unix's pipeline
philosophy and the power of regular expressions, making AWK a perfect tool for
Unix-like systems.
Fun Fact
AWK was so influential that it inspired later scripting languages like Perl, which extended many
of AWK's ideas. Even today, AWK remains a widely used tool for quick and powerful text
processing.
Aho spent much of his career at Bell Labs, where he worked on various influential projects,
including AWK. He later joined Columbia University, where he became the Lawrence
Gussman Professor Emeritus of Computer Science. In 2020, he was awarded the Turing
Award, often referred to as the "Nobel Prize of Computing," alongside Jeffrey Ullman, for their
work on algorithms and theory in computer science.
Legacy
Aho’s work bridges theoretical computer science and practical programming, making his
contributions essential to both academic research and real-world software development. His
name is synonymous with excellence in algorithms, programming languages, and systems, and
his tools and ideas remain influential to this day.
● Tree Structure: In an expression tree, each internal node represents an operator, and
each leaf node represents an operand. The left and right children of a binary operator
node, for instance, represent the sub-expressions for that operator.
● Infix Representation: The tree can naturally represent the precedence of operations in
infix notation. For example, the expression (a + b) * c would have * as the root, with the
left child being the subtree for a + b and the right child being the leaf node c.
● Evaluation: You can easily evaluate the expression by recursively evaluating the
subtrees.
● Traversal: An inorder traversal of the tree will give you the original infix expression.
def evaluate(node):
    """
    Produces the value of an arithmetic expression.
    Examples:
        >>> e = Node('+', Node(23), Node(19))
        >>> evaluate(e)
        42
    """
    # Base case: a leaf node holds a number
    if node.left is None and node.right is None:
        return int(node.value)
    # Recursive case: evaluate the left and right subtrees first
    left_val = evaluate(node.left)
    right_val = evaluate(node.right)
    # Apply the operator stored in this node
    if node.value == "+":
        return left_val + right_val
    elif node.value == "-":
        return left_val - right_val
    elif node.value == "*":
        return left_val * right_val
    elif node.value == "/":
        return left_val / right_val
This implementation represents arithmetic expressions using a tree structure, where each node
can be an operator (like +, -, *, or /) or a value (a number). Here's a breakdown of how it works:
1. Node Class: Each Node object represents a part of the expression. It has three
attributes:
○ value: This can either be an operator (like "+") or a numeric value.
○ left: This points to the left child node (the left operand).
○ right: This points to the right child node (the right operand).
2. Evaluate Function: This function recursively computes the result of the arithmetic
expression represented by the tree.
○ Base Case: If the current node is a leaf (meaning it has no children), it returns
the integer value of that node.
○ Recursive Case: If the node is not a leaf, it evaluates the left and right subtrees
by calling evaluate on each child. This gives the values of the left and right
operands.
○ Operator Application: After obtaining the values of the operands, it applies the
operator stored in the current node’s value:
■ For "+", it adds the two operands.
■ For "-", it subtracts the right operand from the left.
■ For "*", it multiplies them.
■ For "/", it divides the left operand by the right.
3. Examples: The comments provide examples of how the tree structure represents expressions:
○ An expression for 23 + 19 is created as a tree with "+" as the root, 23 as the left child, and 19 as the right child.
○ For a more complex expression like (23 + 19) - 20, the tree has "+" as a child of the "-" operator.
In summary, this implementation allows for the representation and evaluation of arithmetic
expressions in a structured way using a tree-like data structure, making it easy to extend or
modify for more complex expressions or operations in the future.
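For reference, a minimal Node class matching the description above would be (a sketch; the actual class in the notes may differ slightly):

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value  # an operator like '+' or a numeric value
        self.left = left    # left operand; None for leaf nodes
        self.right = right  # right operand; None for leaf nodes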
Could you add a second function to print the expression in infix notation?
Here's our implementation:
def print_infix(node):
    """
    This function prints the arithmetic expression in infix notation.
    Examples:
        >>> e = Node('+', Node(23), Node(19))
        >>> print_infix(e)
        '(23+19)'
    """
    if node is None:
        return ""
    # Leaves hold numbers: print them directly.
    if node.left is None and node.right is None:
        return str(node.value)
    # Recursive case: wrap the whole sub-expression in parentheses.
    left_expr = print_infix(node.left)
    right_expr = print_infix(node.right)
    return f"({left_expr}{node.value}{right_expr})"
This function, print_infix, converts and displays arithmetic expressions in infix notation (the
standard way of writing expressions, where operators are placed between operands). Here’s
how it works:
1. Function Purpose: The goal of print_infix is to represent the arithmetic expression
stored in the tree structure as a string in infix form.
2. Base Case:
○ If the current node is None, it returns an empty string. This handles cases where
there might be no expression.
○ If the node is a leaf (meaning it has no children), it returns the string
representation of its value. This is for numeric values like 23 or 19.
3. Recursive Case:
○ The function recursively calls itself for the left and right child nodes of the current
node, retrieving the left and right sub-expressions.
○ It then combines these sub-expressions with the current node's operator (the
value) placed between them.
4. Expression Formatting:
○ The returned string for the current node wraps the entire expression in
parentheses. This is important to maintain the correct order of operations and
clearly define the structure of the expression.
○ For example, if the current node represents the operator "+", and its left child is
"23" and right child is "19", the result will be "(23+19)".
5. Examples:
○ For a simple expression like 23 + 19, the function outputs "(23+19)".
○ For a more complex expression such as 20 - (23 - 19), the function builds the string by first calculating the inner expression "(23-19)", then wrapping the entire expression as "(20-(23-19))".
def print_postfix(node):
    """
    This function prints the arithmetic expression in postfix notation.
    Examples:
        >>> e = Node('+', Node(23), Node(19))
        >>> print_postfix(e)
        '23 19 + '
    """
    if node is None:
        return ""
    if node.left is None and node.right is None:
        return str(node.value) + " "
    # Visit the operands first; the operator comes last.
    return print_postfix(node.left) + print_postfix(node.right) + str(node.value) + " "
The print_postfix function converts and displays arithmetic expressions in postfix notation
(also known as Reverse Polish Notation, or RPN). Here’s how it differs from the previous
print_infix function:
1. Notation:
○ Infix notation (used by print_infix) places operators between operands (e.g., 23 + 19).
○ Postfix notation places operators after their operands (e.g., 23 19 +). This eliminates the need for parentheses to dictate order of operations, as the sequence of operations is inherently defined by the position of the operators and operands.
2. Structure:
○ In the print_postfix function, the output is generated by recursively
processing the left and right subtrees before appending the operator. This reflects
the order in which operations should be performed.
○ The function builds the output by first evaluating the left expression, then the right
expression, and finally adding the operator at the end.
3. Output Formatting:
○ Each operand and operator is separated by a space, which helps in clearly
delineating them in the output (e.g., "23 19 + ").
Figure: the same expression tree rendered in Infix Notation and in Postfix Notation.
def evaluate(node):
if node is None:
return 0
if node.left is None and node.right is None: # Leaf node (operand)
return int(node.value)
# Recursively evaluate the left and right subtrees
left_val = evaluate(node.left)
right_val = evaluate(node.right)
# Apply the operator
if node.value == '+':
return left_val + right_val
elif node.value == '-':
return left_val - right_val
elif node.value == '*':
return left_val * right_val
elif node.value == '/':
return left_val / right_val
elif node.value == '%': # Handle modulo operation
return left_val % right_val
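A quick check of the new case (using the Node class sketched before):

e = Node('%', Node(23), Node(5))
print(evaluate(e))  # prints 3, since 23 % 5 == 3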
Could you think about an easier way to add new operators without having
to modify lots of stuff?
Yes, we could use a more object-oriented approach. Example:
from abc import ABC, abstractmethod

class Expression(ABC):
    @abstractmethod
    def eval(self):
        raise NotImplementedError

class Num(Expression):
    def __init__(self, num):
        self.num = num

    def eval(self):
        return self.num

class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    @abstractmethod
    def eval(self):
        raise NotImplementedError

class Add(BinaryExpression):
    def eval(self):
        return self.left.eval() + self.right.eval()

class Sub(BinaryExpression):
    def eval(self):
        return self.left.eval() - self.right.eval()

class Mul(BinaryExpression):
    def eval(self):
        return self.left.eval() * self.right.eval()

class Div(BinaryExpression):
    def eval(self):
        return self.left.eval() // self.right.eval()
def evaluate(exp):
"""
Evaluate an arithmetic expression.
Examples:
>>> n1 = Num(28)
>>> n2 = Num(4)
>>> e = Div(n1, n2)
>>> evaluate(e)
7
>>> n1 = Num(7)
>>> n2 = Num(4)
>>> e = Mul(n1, n2)
>>> evaluate(e)
28
"""
return exp.eval()
To support printing, we can give each expression class a __str__ method: Num returns its number, and each operator class returns its symbol.

class Num(Expression):
    ...
    def __str__(self):
        return str(self.num)

class Add(BinaryExpression):
def eval(self):
return self.left.eval() + self.right.eval()
def __str__(self):
return "+"
class Sub(BinaryExpression):
def eval(self):
return self.left.eval() - self.right.eval()
def __str__(self):
return "-"
class Mul(BinaryExpression):
def eval(self):
return self.left.eval() * self.right.eval()
def __str__(self):
return "*"
class Div(BinaryExpression):
def eval(self):
return self.left.eval() // self.right.eval()
def __str__(self):
return "/"
def print_prefix(exp):
"""
Print an arithmetic expression in prefix notation.
Examples:
>>> n1 = Num(28)
>>> n2 = Num(4)
>>> e0 = Div(n1, n2)
>>> n3 = Num(5)
>>> e1 = Mul(e0, n3)
>>> f"{print_prefix(e1)} = {e1.eval()}"
'* / 28 4 5 = 35'
>>> n1 = Num(28)
>>> n2 = Num(4)
>>> e0 = Div(n1, n2)
>>> n3 = Num(50)
>>> e1 = Div(n3, e0)
>>> e2 = Mul(e1, e0)
>>> f"{print_prefix(e2)} = {e2.eval()}"
'* / 50 / 28 4 / 28 4 = 49'
"""
s = str(exp)
if isinstance(exp, BinaryExpression):
s = f"{s} {print_prefix(exp.left)} {print_prefix(exp.right)}"
return s
def print_infix(exp):
    # Parenthesize both operands to make the evaluation order explicit.
    s = str(exp)
    if isinstance(exp, BinaryExpression):
        s = f"({print_infix(exp.left)}) {s} ({print_infix(exp.right)})"
    return s

def print_postfix(exp):
    # Operands come first; the operator is appended at the end.
    s = str(exp)
    if isinstance(exp, BinaryExpression):
        s = f"{print_postfix(exp.left)} {print_postfix(exp.right)} {s}"
    return s
Adding the __str__ method to each expression class allows for flexible and consistent string representations of expressions, making it easy to print them in different notations: each printing function obtains the operator symbol from the node itself, and only decides where to place that symbol relative to the operands (before them, between them, or after them).
In summary, adding the __str__ method streamlines the process of generating string
representations for various expression types, allowing different printing functions to construct
the desired notation flexibly and efficiently. This object-oriented design enhances modularity,
making the code easier to read, extend, and maintain.
This design, incidentally, is an instance of the Composite design pattern. In your code:
● Component: The Expression class acts as the abstract base class (or component) in
the pattern, defining a common interface (eval) for all expressions.
● Leaf: The Num class represents the leaf objects in the tree, which have no children and
implement the eval method to return a value directly.
● Composite: The BinaryExpression class and its subclasses (Add, Sub, Mul, Div) act as composite objects. They contain references to child components (left and right) and implement the eval method by delegating to their children.
The key aspect of the Composite pattern is that it allows you to treat individual objects (like Num)
and compositions of objects (like Add or Multiply) uniformly through the common eval
method. This is exactly what your implementation achieves.
1. Procedural Approach
In the procedural approach, the operations (such as evaluating, printing in infix/postfix notation)
are implemented as functions that operate on data structures representing the expressions.
Here's a summary:
● Characteristics:
○ Data and functions are separate.
○ Expressions are represented by nested data structures (e.g., tuples, lists).
○ Functions are responsible for interpreting and manipulating these data structures.
○ New operations may require modifying existing functions to handle new data
structures.
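As a concrete sketch of this style (an illustration, not code from the notes), expressions can be nested tuples and evaluation a single function:

def eval_exp(exp):
    # Numbers evaluate to themselves; tuples are (operator, left, right).
    if isinstance(exp, int):
        return exp
    op, left, right = exp
    if op == "+":
        return eval_exp(left) + eval_exp(right)
    elif op == "*":
        return eval_exp(left) * eval_exp(right)
    else:
        raise ValueError(f"Unknown operator: {op}")

print(eval_exp(("+", 2, ("*", 3, 4))))  # prints 14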
2. Object-Oriented Approach
In the OO approach, the expressions themselves are represented as objects with methods that
define their behavior. Each expression type knows how to evaluate itself, print itself, etc.
● Characteristics:
○ Data and behavior are encapsulated within objects.
○ Each expression type is represented by a class, and operations are methods
within these classes.
○ New operations or expression types can be added by defining new classes or
methods.
1. Procedural Approach
● Advantages:
○ Simplicity: For small, well-defined problems, the procedural approach can be
simpler and more straightforward.
○ Performance: It can be more efficient in terms of memory and speed because
there’s less overhead compared to objects.
○ Easier to Understand for Simple Problems: The logic is often more
transparent, especially for those familiar with procedural programming.
● Disadvantages:
○ Scalability Issues: As the number of operations or types of expressions grows,
the code can become harder to maintain. Adding a new type of expression may
require modifying multiple functions.
○ Tight Coupling: The functions are tightly coupled to the specific data structures,
making it harder to extend or modify.
○ Less Reusability: Code reuse is more challenging since the logic is tied to
specific data structures.
2. Object-Oriented Approach
● Advantages:
○ Extensibility: Adding new expression types is easier. For example, adding a new
type of expression, like modulo, only requires creating a new class, without
modifying existing ones.
○ Encapsulation: Each expression type encapsulates its behavior, leading to
better organization and separation of concerns.
○ Reusability: Code can be reused more effectively, especially with inheritance
and polymorphism.
○ Scalability: The system can grow more easily, and it’s simpler to maintain and
extend.
● Disadvantages:
○ Complexity: The OO approach introduces more complexity, especially for small,
simple problems.
○ Overhead: There’s typically more memory and CPU overhead due to the use of
objects and method calls.
○ Learning Curve: It requires a good understanding of OO principles, which might
be more challenging for those new to OO programming.
○ Extensibility: Adding new operations to existing types is difficult. For example, to
add an operation to compute the "size" of expressions it would be necessary to
change all the subclasses of Expression, adding the new method to them.
Object-Oriented Approach:
● When you expect the system to grow or require modifications over time.
● When you want to add new types of expressions without changing existing code.
● When code maintainability, extensibility, and reusability are priorities.
How would this implementation look in C?
● In C, you would use struct to define a tree node instead of a class. Each struct could contain a value (operator or operand) and pointers to left and right child nodes, similar to the Node class.
Overall, your C implementation would focus on structured programming principles, using structs
and function calls, reflecting a more traditional tree-based approach rather than the more
abstract and modular object-oriented design seen in languages like Python or C++.
In Standard ML (SML), you can use algebraic data types to represent arithmetic expressions
effectively. Below is an implementation that captures the essence of the expression
representation and evaluation in SML/NJ:
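A minimal version of the datatype and of the eval and print_infix functions used below might read as follows (a sketch covering only the constructors that the example needs):

(* Algebraic data type for arithmetic expressions. *)
datatype expr =
    Num of int
  | Add of expr * expr
  | Sub of expr * expr

(* Evaluation by pattern matching on the constructors. *)
fun eval (Num n) = n
  | eval (Add (e1, e2)) = eval e1 + eval e2
  | eval (Sub (e1, e2)) = eval e1 - eval e2

(* Infix printing, parenthesizing every binary operation. *)
fun print_infix (Num n) = Int.toString n
  | print_infix (Add (e1, e2)) =
      "(" ^ print_infix e1 ^ " + " ^ print_infix e2 ^ ")"
  | print_infix (Sub (e1, e2)) =
      "(" ^ print_infix e1 ^ " - " ^ print_infix e2 ^ ")"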
(* Example usage *)
val expr1 = Add (Num 23, Num 19) (* Represents 23 + 19 *)
val expr2 = Sub (expr1, Num 20) (* Represents (23 + 19) - 20 *)
(* Evaluate expressions *)
val result1 = eval expr1 (* Should be 42 *)
val result2 = eval expr2 (* Should be 22 *)
(* Print expressions *)
val infix1 = print_infix expr1 (* Should print "(23 + 19)" *)
val infix2 = print_infix expr2 (* Should print "((23 + 19) - 20)" *)
Explanation:
This approach is idiomatic in SML, leveraging its powerful type system and pattern matching
capabilities to create a clear and concise representation of arithmetic expressions.
1. Operation-Centric:
○ In both implementations, the primary structure is defined by the operations (like
eval, print_infix, etc.) rather than by encapsulated types. This is typical of
functional programming, where functions operate on data types directly.
2. Pattern Matching vs. Control Structures:
○ SML utilizes pattern matching, which succinctly handles different expression
types, while C uses conditional statements to achieve similar results. Both focus
on the behavior associated with each type rather than defining behavior within a
type itself.
3. Separation of Data and Behavior:
○ In both languages, data structures (like struct in C or datatype in SML) are
separate from the functions that operate on them. This contrasts with
object-oriented programming, where methods are typically defined within classes
that encapsulate data.
4. Lack of Inheritance and Polymorphism:
○ Neither implementation leverages inheritance or polymorphism, which are hallmarks of object-oriented design. Instead, they rely on pattern matching (in SML's case) and control flow constructs (in C's case).
Conclusion:
The designs reflect a more operational paradigm, which is common in procedural and functional
languages, where the focus is on defining how to manipulate data rather than on modeling
real-world entities through objects. Object-oriented programming, by contrast, emphasizes bundling data together with the behavior that operates on it.
It seems that adding new types would be tough in the C or in the SML
implementations, but adding new operations would be easy?
That's right! The design of both the C and SML implementations makes it relatively easy to add new operations but more challenging to introduce new types of expressions: a new operation is just a new function, whereas a new type forces every existing function that pattern-matches (or branches) on the expression type to be updated.
Conclusion
The structure of these implementations favors the addition of operations due to their focus on
functionality and separation from data types. However, introducing new types requires more
extensive changes, making it less convenient. This reflects a common trade-off in software
design, where extensibility in one area may lead to challenges in another. Your insight captures
this fundamental aspect of managing complexity in programming languages!
So, an advantage of the OO approach is that it's very easy to add new
types. Is that so?
Yes, that's correct! One of the significant advantages of the object-oriented (OO) approach is the
ease with which new types (or classes) can be added, along with associated behavior. Here are
some key reasons why this is beneficial:
1. Encapsulation:
○ In OO design, data and behavior are bundled together within classes. When you
introduce a new type, you can define its properties and methods in one place,
making it easier to manage and understand.
2. Inheritance:
○ New types can inherit from existing classes, allowing you to reuse code and
behavior. For example, if you have a base class for expressions, you can create
a new subclass for a specific type of expression, inheriting common functionality
while also extending or overriding specific methods.
3. Polymorphism:
○ OO allows you to define common interfaces for different types. You can introduce
new types that implement the same interface, making it possible to treat different
types uniformly. This reduces the need to modify existing code when adding new
types.
4. Extensibility:
○ Because OO design encourages the use of interfaces and abstract classes, it’s
often easier to add new types without altering existing code. You can focus on
implementing new functionality rather than modifying existing logic.
5. Cleaner Code:
○ By organizing code into classes, it becomes clearer how different types interact
and what operations they support. This structure can lead to more maintainable
and understandable code as new types are introduced.
Conclusion
Overall, the OO approach provides a flexible framework for adding new types while minimizing
the impact on existing code. This leads to a more modular design, making it easier to manage
complexity as software systems grow and evolve. However, adding new operations is difficult,
because that would require us changing all the subclasses of tree nodes. This tension between
extensibility of types and extensibility of operations is at the core of a software-design challenge
known as the Expression Problem. We shall talk more about it when we discuss Visitors.
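To make the first half of this trade-off concrete, here is a sketch of how a new node type (unary negation, not one of the original classes) could be added without touching any existing class:

class Neg(Expression):
    def __init__(self, exp):
        self.exp = exp

    def eval(self):
        # Negate the value of the wrapped sub-expression.
        return -self.exp.eval()

    def __str__(self):
        return "~"

Evaluation works out of the box: Neg(Num(7)).eval() returns -7, and no existing class had to change.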
Are these tree-like structures what we call an Abstract Syntax Tree (AST)?
Yes! An abstract syntax tree (AST) is a tree-like data structure used in computer science,
particularly in the context of compilers and interpreters, to represent the hierarchical structure of
code. It abstracts away surface-level syntax details and focuses on the logical structure and
relationships of the program's components.
2. Code Generation:
○ ASTs allow for optimizations (e.g., constant folding) and provide a structure for emitting machine code.
3. Code Analysis and Transformation:
○ ASTs are used in static analysis tools, refactoring tools, and formatters to inspect
or modify code.
4. Interpretation:
○ Interpreters use ASTs to evaluate code directly, traversing the tree to compute
results.
● Parse Tree: Captures all syntax details, including parentheses, punctuation, and
keywords.
● AST: Focuses only on the semantic structure, omitting unnecessary syntax details.
For instance, a parse tree for 3 + 5 would include the tokens for the number 3, the + operator,
and the number 5, along with syntactic markers for precedence rules. The AST would simply
represent the operation + with 3 and 5 as its operands.
Conclusion
An AST is a powerful abstraction that allows compilers, interpreters, and other tools to work with
the logical structure of code. It simplifies complex syntax into a manageable format, enabling
various stages of processing, such as semantic analysis, optimization, and code generation.
In this sense, building an Abstract Syntax Tree (AST) instead of processing the program as a
string is essential for several reasons:
1. Logical Representation:
An AST represents the logical structure of a program, abstracting away its concrete syntax. This
means that details like parentheses, indentation, or specific syntactic sugar are encoded into the
tree structure rather than needing to be interpreted repeatedly during processing. For example,
the expression (a + b) * c would be represented as a tree where * is the root, with a + b as one child and c as the other. This makes the program easier to analyze and manipulate.
2. Separation of Concerns:
Parsing converts raw source code into a structured representation (the AST), isolating syntax
recognition from the later stages of the compilation pipeline (e.g., type checking, optimization,
code generation). This separation simplifies both the parser and the subsequent processing
stages, improving maintainability and scalability.
3. Efficiency:
Processing a program as a string would require repeated parsing for tasks like type checking,
optimization, and code generation. With an AST, the program structure is parsed once, and
subsequent operations can work directly with the tree, significantly improving efficiency.
4. Security:
Following principles from the Language-theoretic Security (LangSec) framework, input should be fully recognized and validated before processing begins. The AST serves as proof that the input conforms to the grammar of the language. To illustrate, consider an anti-pattern like shotgun parsing (the first antipattern in "The Seven Turrets of Babel"), where parsing and processing are mixed. This approach often leads to incomplete error checking: errors discovered late in processing can mean that invalid input has already been partially processed, making it difficult to recover or to predict program behavior. An AST avoids this problem by fully validating the structure of the input before any processing occurs.
5. Extensibility:
An AST provides a structured and consistent foundation for extending a language or compiler.
New features, optimizations, or transformations can be added by working with the tree structure
rather than reworking complex string-processing logic.
Conclusion
In summary, an AST transforms the program into a clean, structured representation, facilitating
reliable, maintainable, and efficient language processing. While it’s theoretically possible to
process a program as a string, doing so would lead to inefficiency, complexity, and fragility in the
system.
Conclusion
The abstract syntax tree evolved gradually, with early influences from ALGOL, LISP, and PL/I.
While ASTs were not explicitly named or formalized in the 1950s and 1960s, tree-like
representations of code were in practical use. By the 1970s, ASTs became a recognized part of
compiler theory, and their widespread adoption in modern programming languages cemented
their importance in computer science.
2. Portability:
○ BCPL was one of the first languages to introduce the idea of portable compilers. Its compilers used an intermediate representation known as O-code (object code), which could be executed on different platforms with the help of an interpreter.
○ This approach inspired later virtual machines and intermediate representations, such as the JVM (Java Virtual Machine) and LLVM.
3. Minimalism:
○ BCPL was designed to be simple and efficient, targeting early computers with
limited resources.
○ It demonstrated how a minimalistic language could support powerful
programming techniques, making it popular in academic and system
programming circles.
4. Influence on Compiler Theory:
○ BCPL’s compiler was a practical demonstration of many theoretical compiler
concepts, including the use of an intermediate representation to separate
front-end parsing from back-end code generation.
○ This separation is a core feature of modern compilers, often relying on ASTs to
represent the structure of the source code.
5. System Programming:
○ BCPL was one of the first high-level languages to be used for system
programming. Many early operating systems and tools were written in BCPL,
such as portions of the Cambridge CAP computer and Tripos.
While it might not have been explicitly called an "AST" in the 1960s, the concept was inherently
present.
Legacy of BCPL
3. Portability:
○ BCPL’s portability was revolutionary for its time, setting a standard for languages
to be easily adapted across diverse hardware platforms.
Conclusion
BCPL was indeed very important in programming language history. Its focus on simplicity,
portability, and compiler innovations had a lasting impact on computer science. While not the
first language to use tree-like structures, BCPL was among the early pioneers that influenced
how compilers represent and process programs internally, paving the way for widespread use of
ASTs.
Martin Richards is a renowned British computer scientist best known for designing the
programming language BCPL (Basic Combined Programming Language) in 1966. He was
born in 1940 and studied mathematics at the University of Cambridge, earning his Ph.D. under
the supervision of David Wheeler, one of the pioneers of computer science.
Richards has had a long and impactful career, primarily at the University of Cambridge, where
he worked on programming languages, operating systems, and compiler design. He has made
significant contributions to the development of computer science, particularly in the areas of
programming language design and compiler technology.
Legacy
● Martin Richards is widely respected for his contributions, though he is less well-known to
the general public compared to figures like Dennis Ritchie or John Backus.
● His work on BCPL and its influence on C secured his place in the history of programming
languages and systems software.
● Richards’ emphasis on simplicity and portability in programming languages remains a
cornerstone of software development.
Publications
● Richards authored several academic papers and technical reports related to BCPL, programming language implementation, and operating systems. His 1981 book "BCPL: The Language and Its Compiler" remains the standard reference on the language.
Could you change compute_prefix so that it builds an expression tree instead of computing a value right away?
def compute_prefix(lexer):
    """
    Examples:
        >>> lexer = Lexer("+ 3 * 4 2")
        >>> e = compute_prefix(lexer)
        >>> e.eval()
        11
    """
    token = lexer.next_valid_token()
    if token.kind == TokenType.NUM:
        # Base case: return a leaf node if it's a number
        return Num(int(token.text))
    # Recursive case: build the two operand subtrees first
    a = compute_prefix(lexer)
    b = compute_prefix(lexer)
    if token.kind == TokenType.ADD:
        return Add(a, b)
    elif token.kind == TokenType.SUB:
        return Sub(a, b)
    elif token.kind == TokenType.MUL:
        return Mul(a, b)
    elif token.kind == TokenType.DIV:
        return Div(a, b)
    else:
        raise ValueError(f"Unexpected token type: {token.kind}")
This code builds an expression tree for prefix expressions by recursively parsing the input and
constructing a tree of nodes for the operations. Each node represents a number or an arithmetic
operation, and the recursive nature of the function allows it to handle nested operations
correctly.
For the input "+ 3 * 4 2", this builds a tree where Add is the root; its left child is the number 3 and its right child is the multiplication 4 * 2. When you call .eval() on this tree, it computes the value of the expression (which is 11).
So, in short, the expression is already structured in a way that avoids the complexity found in
other notations like infix notation. That's why you don't have to write any complicated recursive
calls or worry about precedence rules, making it a straightforward recursive descent parser.
E ::= num
| + E E
| - E E
| * E E
| / E E
This grammar specifies that an expression (E) is either a number or one of the four operators followed by two sub-expressions.
● Parsing refers to the process of analyzing a sequence of tokens (like numbers and
operators) to determine their syntactic structure based on a formal grammar.
● In the case of compute_prefix, it reads tokens from the lexer and constructs an
expression tree by recursively applying production rules from the grammar (as we
discussed earlier).
● It's a recursive descent parser because it uses recursion to handle sub-expressions.
Each recursive call processes part of the input and returns a subtree, just like a typical
recursive descent parsing algorithm would.
So, the compute_prefix function fits well within the definition of a parser: it interprets the
structure of an input (the prefix notation expression) and converts it into a meaningful
representation (the expression tree).
● The first step is to describe the structure of the language (or the expressions you're
parsing) using a formal grammar, often in Backus-Naur Form (BNF) or Extended BNF.
● The grammar should specify how valid sequences of tokens can be formed, typically
defining rules for each type of construct (e.g., expressions, terms, factors, operators).
● Each rule of the grammar corresponds to a function (or method) in your recursive
descent parser.
○ Base cases (like num) are handled directly by returning a value (or building a
corresponding AST node).
○ Recursive cases (like operators followed by two sub-expressions) are handled by
calling the parsing function recursively.
● The parser function processes the input token by token. When it encounters a token that
matches a grammar rule, it processes it (and possibly calls itself recursively) to handle
sub-expressions.
● This approach mirrors the structure of the grammar:
○ Each function in the parser corresponds to a rule in the grammar.
○ Each call to a parsing function corresponds to applying a rule.
● A key task of parsing is to transform the input into a more structured form, such as an
abstract syntax tree (AST).
● Each function in the recursive descent parser typically returns an AST node representing
the result of parsing that part of the input.
So, can we say that our expression tree is an abstract syntax tree?
Yes, you can absolutely say that the expression tree is a type of abstract syntax tree (AST). In
fact, the two terms are often used interchangeably in this context.
Expression Tree: This typically refers to a tree structure where each node represents an operator or operand in an arithmetic expression.
Abstract Syntax Tree (AST): This is a tree representation of the abstract syntactic structure of
some code. The "abstract" part means it doesn't include unnecessary syntactic details like
parentheses, focusing instead on the logical structure of the expressions or statements.
So, in the context of arithmetic expressions, the expression tree is an AST because it captures
the essential structure of the expression without extra syntactic details like parentheses. It's a
compact, meaningful representation that can be used for further processing (evaluation,
printing, optimization, etc.).
Consider, for instance, a natural grammar for infix arithmetic expressions:
E ::= num
| '(' E ')'
| E '+' E
| E '-' E
| E '*' E
| E '/' E
To eliminate direct left recursion from a rule of the form A ::= A α | β, split its productions into two sets:
● One set of productions where A does not appear as the first symbol on the right-hand side (call these β).
● Another set where A does appear as the first symbol; α stands for whatever follows that leading A.
Then, replace the left-recursive rule with a right-recursive version by introducing a new non-terminal A' (often called the "tail" or "prime" version of A).
Convert it into:
A ::= β A'
A' ::= α A' | ε
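For instance (an illustrative rule, not one from the notes), applying this recipe to a left-recursive rule for sums:

E ::= E '+' num
| num

becomes:

E ::= num E'
E' ::= '+' num E' | ε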
class Parser:
"""
This class implements a parser for infix arithmetic expressions.
E ::= num F
| '(' E ')' F
F ::= '+' E F
| '-' E F
| '*' E F
| '/' E F
| empty
Example:
>>> lexer = Lexer("2 + 3 + 4")
>>> parser = Parser(lexer)
>>> e = parser.E()
>>> e.eval()
9
    >>> lexer = Lexer("4 * 3 + 2")
    >>> parser = Parser(lexer)
    >>> e = parser.E()
    >>> e.eval()
    20
    """
    def E(self):
"""
Parse the base case (numbers or parenthesized expressions).
E ::= num F
| '(' E ')' F
"""
token = self.current_token
if token.kind == TokenType.NUM:
self.eat(TokenType.NUM)
num = Num(int(token.text))
            return self.F(num)

    def F(self, left):
        """
        Parse the continuation of an expression:
        F ::= '+' E F
| '-' E F
| '*' E F
| '/' E F
| empty
Notice that we must pass the left side of the expression to it, otherwise we
would not be able to build the expression node.
"""
        token = self.current_token
        if token.kind == TokenType.ADD:
            self.eat(TokenType.ADD)
            right = self.E()
            return self.F(Add(left, right))
        ...
        # Empty production: no operator follows, so the expression ends here.
        return left
The grammar provides the wrong result for "2 * 3 + 4". Why?
That happens because the grammar does not recognize any precedence between * and +. It associates operands right-to-left, so 3 + 4 is grouped first. Thus, the grammar is right-associative (which is not what we want!), and it does not give the multiplicative operators precedence over the additive operators. More generally, the grammar is ambiguous, which is a big problem!
Consider "2 * 3 + 4": it can be parsed as (2 * 3) + 4, which yields 10, or as 2 * (3 + 4), which yields 14. Both interpretations are valid according to the grammar, leading to two different parse trees. This ambiguity arises because the grammar does not enforce any precedence between the operators (+, -, *, /), nor does it define associativity (whether operations should be grouped left-to-right or right-to-left).
Could you draw two parsing trees for "num + num * num"?
Sure. See the figure below, which shows two different (non-isomorphic) parsing trees:
Is there an algorithm to decide whether a grammar is ambiguous?
In essence, while for some specific grammars it may be easy to check for ambiguity manually,
the existence of a general procedure that works for every context-free grammar is impossible.
This result is akin to other undecidability problems in computer science, such as the halting
problem.
Therefore, while there are heuristics and techniques to analyze particular grammars, the
ambiguity problem in general cannot be decided algorithmically.
To elaborate, a grammar is considered ambiguous if there exists at least one string that can be
generated by the grammar in multiple distinct ways (i.e., there are two or more different parse
trees for the same string). There is no algorithm that can decide for every CFG whether it is
ambiguous or not.
The connection to Post's Correspondence Problem (PCP) arises because you can construct a reduction from PCP to the ambiguity problem. If you could decide ambiguity, you could also decide instances of PCP by encoding them as grammars. Since PCP is undecidable, it follows that the problem of determining grammar ambiguity must also be undecidable.
Thus, it is correct to say that the ambiguity of grammars is an undecidable problem, and this conclusion leverages the properties of Post's Correspondence Problem.
Our grammar for prefix expressions, in contrast, is not ambiguous:
E ::= num
| + E E
| - E E
| * E E
| / E E
Since the operators are explicitly part of the productions (e.g., + E E), there is no ambiguity in
deciding which operator applies at any point, because each production begins with a unique
operator symbol, and no production yields the empty string. For any string generated by this
grammar, the first operator in the string (e.g., +, -, *, /) uniquely determines the structure of the
parse tree. Once the first operator is identified, it’s clear that the next two sub-expressions will
be parsed as E. This fixed structure ensures that no two distinct parse trees can be formed for
the same string.
Notice, however, that the grammar below is still ambiguous, even though every non-empty production starts with a distinct terminal symbol:
E ::= num F
| '(' E ')' F
F ::= '+' E F
| '-' E F
| '*' E F
| '/' E F
| ε
The production for F can generate the empty string, which allows expressions to stop at arbitrary points, leading to ambiguity. When F ::= ε is applied, it terminates the current expression, which can lead to different groupings of operators and operands. For instance, num + num + num has two parse trees: the F that follows the first num can consume both '+' operators itself, or consume only the first one and let the F of the second num consume the other.
We can fix the grammar by stratifying it into expressions (E), terms (T), and factors (F), so that multiplication and division bind tighter than addition and subtraction:
E ::= T '+' E
| T '-' E
| T
T ::= F '*' T
| F '/' T
| F
F ::= num
| '(' E ')'
class Parser:
def __init__(self, lexer):
self.lexer = lexer
self.current_token = self.lexer.next_valid_token()
def E(self):
exp = self.T()
token = self.current_token
if token.kind == TokenType.ADD:
self.eat(TokenType.ADD)
right = self.E()
exp = Add(exp, right)
return exp
def T(self):
exp = self.F()
token = self.current_token
if token.kind == TokenType.MUL:
self.eat(TokenType.MUL)
right = self.T()
exp = Mul(exp, right)
return exp
def F(self):
token = self.current_token
if token.kind == TokenType.NUM:
self.eat(TokenType.NUM)
return Num(int(token.text))
This code defines a recursive descent parser for arithmetic expressions, using a grammar that
handles addition, subtraction, multiplication, and division. The Parser class processes tokens
provided by the Lexer and generates a tree-like data structure representing the parsed
expression. The parser has three main methods: E() for handling expressions (with + and -),
T() for handling terms (with * and /), and F() for handling factors (numbers and
parenthesized expressions). The parser processes tokens recursively by consuming them with
the eat() method, and based on the current token, it either builds a subexpression or raises an
error if the token is unexpected. This structure enforces the correct precedence and
associativity rules for arithmetic operations.
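Assuming the Lexer, TokenType, and AST classes (Num, Add, Mul) from the previous chapters, driving this parser might look like the following sketch:

lexer = Lexer("2 * 3 + 4")
parser = Parser(lexer)
ast = parser.E()
# ast is Add(Mul(Num(2), Num(3)), Num(4)): '*' binds tighter than '+'.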
Can you show how the parser handles "num * num + num"?
Let's walk through how the parser processes the string "num * num + num" step by step,
assuming that num represents a number token.
1. Initialization:
○ The parser is initialized with a lexer that tokenizes the string. The first token
(current_token) is num.
2. Call to E():
○ The parser starts at the E() method, which processes expressions.
○ Inside E(), it first calls T() to handle terms (multiplication and division).
3. Call to T():
○ T() handles terms, which involve factors combined with * or /.
○ Inside T(), it calls F() to process the first factor.
4. Call to F():
○ F() handles factors, which are numbers or parenthesized expressions.
○ Since the current token is num, it matches the NUM case, consumes the token
with eat(TokenType.NUM), and returns a Num node representing the first
number.
5. Return to T():
○ Now, the current token is *. Since this is a multiplication operator
(TokenType.MUL), T() consumes the * token with eat(TokenType.MUL).
○ Next, T() calls itself recursively to process the right-hand side of the
multiplication (the second factor).
6. Second call to F() (inside T()):
○ F() is called again to process the second num.
○ Just like before, it consumes the num token, returning another Num node
representing the second number.
7. Return to T():
○ Now that both factors have been processed, T() constructs a Mul node with the
two Num nodes as its left and right operands. This represents the expression num
* num.
○ T() returns this Mul node to the E() method.
8. Back to E():
○ At this point, E() has received the left-hand side of the addition (num * num).
○ The current token is now +, so E() consumes the + token with
eat(TokenType.ADD) and calls itself recursively to handle the right-hand side
of the addition.
9. Recursive call to E() (right-hand side of +):
○ In this recursive call to E(), the parser calls T() to handle the term on the
right-hand side of the +.
10. Third call to F() (inside T() of the right-hand side):
○ F() consumes the third num token and returns a Num node.
○ Since no * follows it, T() simply returns this Num node to the recursive E().
11. Wrapping up:
○ The recursive E() returns the Num node, and the outer E() builds Add(Mul(num, num), num): the multiplication ends up nested under the addition, exactly as precedence demands.
This last grammar has an indirect left recursion. Yet we could implement a
parser easily. When is left recursion a problem?
The grammar is only indirectly recursive: the derivation returns to E through intermediate productions (T and F), and the cycle E → T → F → '(' E ')' passes through the terminal '('. This doesn't lead to infinite recursion, because the parser consumes a token before re-entering E; the recursion always makes progress. Left recursion becomes a problem when a non-terminal can reach itself in the leftmost position without consuming any input, as in E ::= E '+' T: a recursive-descent procedure for that rule calls itself immediately, on the same input, and never terminates.
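A sketch of such a broken procedure, using the same token and AST classes as before, makes the problem concrete:

def E(self):
    # E ::= E '+' T: the method recurses before consuming any token,
    # so every call sees exactly the same input as its caller.
    left = self.E()     # infinite recursion: no progress is ever made
    self.eat(TokenType.ADD)
    right = self.T()
    return Add(left, right)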
Is there a general procedure to change a grammar like " E ::= num '+'
E | num '*' E | empty", so that the precedence of multiplication
becomes higher than addition's?
To make * have higher precedence than +, we need to separate them into different levels of
the grammar. The general rule is to create a new non-terminal for each precedence level.
Can you further modify it, so that the grammar becomes left associative?
An operator is left-associative if operands group from the left in cases of equal precedence: an expression like a - b - c is parsed as (a - b) - c (the left side is evaluated first). A grammar, together with its parser, enforces left associativity when it builds trees of this shape.
E ::= T EE
EE ::= '+' T EE
| '-' T EE
| empty
T ::= F TT
TT ::= '*' F TT
| '/' F TT
| empty
F ::= num
| '(' E ')'
The productions EE ::= '+' T EE and EE ::= '-' T EE allow an expression like a - b - c to be parsed by first combining a - b (the parser threads the accumulated left operand through the recursive calls to EE), and only then combining the result with c. Similarly, TT ::= '*' F TT and TT ::= '/' F TT make sure that multiplication and division are also grouped from the left. In short, when processing EE ::= '+' T EE, we only move on to the trailing EE once we are done with the T, which comes on the left of EE.
class Parser:
    def __init__(self, lexer):
        # eat() is the same helper defined in the previous Parser.
        self.lexer = lexer
        self.current_token = self.lexer.next_valid_token()

    def E(self):
        exp = self.T()
        return self.EE(exp)

    def EE(self, left):
        # 'left' accumulates the expression parsed so far.
        token = self.current_token
        if token.kind == TokenType.ADD:
            self.eat(TokenType.ADD)
            right = self.T()
            return self.EE(Add(left, right))
        # The '-' case is handled analogously with Sub.
        return left

    def T(self):
        exp = self.F()
        return self.TT(exp)

    def TT(self, left):
        token = self.current_token
        if token.kind == TokenType.MUL:
            self.eat(TokenType.MUL)
            right = self.F()
            return self.TT(Mul(left, right))
        # The '/' case is handled analogously with Div.
        return left

    def F(self):
        token = self.current_token
        if token.kind == TokenType.NUM:
            self.eat(TokenType.NUM)
            return Num(int(token.text))
        elif token.kind == TokenType.LPR:
            self.eat(TokenType.LPR)
            exp = self.E()
            self.eat(TokenType.RPR)
            return exp
        raise ValueError(f"Unexpected token {token.kind}")
The code implements left-associative behavior. This is due to how the EE() and TT()
methods recursively call themselves after processing an operator. Each time an operator is
encountered, the parser creates a new binary expression (like Add or Mul) and recursively
passes this result into the next call of EE() or TT(). This causes the leftmost operators to be
evaluated first, making the expressions group to the left. For instance, when parsing a + b +
c, it will first combine a + b, then combine c with the result of a + b. Thus, the parser is
left-associative. The figure below shows how the parser builds up an expression:
Even though the recursive structure of the EE() and TT() methods might initially suggest
right-association, their design actually leads to left-associative behavior. When the parser
processes an operator, it immediately combines the current left-hand expression with the next
term, before making the recursive call. This ensures that the left-hand side is built up first,
leading to a left-associative parse.
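For instance, parsing 1 + 2 + 3 unfolds roughly like this (a sketch of the calls):

E():                       T() returns Num(1); call EE(Num(1))
EE(Num(1)):                sees '+', parses Num(2), calls EE(Add(Num(1), Num(2)))
EE(Add(Num(1), Num(2))):   sees '+', parses Num(3), calls EE(Add(Add(Num(1), Num(2)), Num(3)))
EE(Add(Add(...), Num(3))): no more '+', returns the accumulated tree

The final result is Add(Add(Num(1), Num(2)), Num(3)), that is, (1 + 2) + 3.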
Is it the case that any context-free grammar can be rewritten to remove left
recursion?
Yes, any context-free grammar can be rewritten to remove left recursion, whether it is direct
or indirect.
Important Considerations:
● Removing left recursion changes the structure of the grammar, but the resulting
grammar remains context-free and accepts the same language.
● Removing left recursion can make the grammar right-recursive, which is suitable for
top-down parsing (e.g., recursive descent).
● The process is algorithmic and can be applied to any context-free grammar, ensuring that any CFG can be rewritten to remove left recursion. The standard transformation is sketched right after this list.
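For a directly left-recursive rule, the textbook transformation replaces left recursion with right recursion plus a fresh non-terminal:

A ::= A α | β

becomes

A  ::= β A'
A' ::= α A' | ε

For example, E ::= E '+' T | T becomes E ::= T EE with EE ::= '+' T EE | ε, which is exactly the shape of the grammar we used above. Indirect left recursion is removed by first substituting productions until the recursion becomes direct, and then applying the same transformation.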
1. LL(k) Grammars: Recursive descent parsers are generally limited to LL(k) grammars,
which means they are suitable for grammars that can be parsed with at most k tokens
of lookahead in a left-to-right scan of the input and leftmost derivation of the parse
tree.
○ LL(k) grammar: A grammar where the parser can decide which production to
apply based on the next k tokens in the input.
However, not all context-free grammars are LL(k). Some grammars require more
complex parsing strategies (such as LR parsing), or they may require more than k
tokens of lookahead to resolve which production to apply.
An LL(k) parser is a top-down parser for context-free grammars that processes the input
left-to-right (L) and constructs a leftmost derivation (L) of the input string. The (k) part refers
to how many tokens of lookahead the parser can use to make decisions during parsing.
1. Top-down Parsing: LL(k) parsers begin parsing from the start symbol of the grammar
and try to match the input string to the grammar’s productions. They expand
non-terminals by applying productions until they can match the input tokens.
2. Predictive Parsing: LL(k) parsers are also called predictive parsers because they try
to predict which production to use by looking ahead at the next k tokens in the input. In
an LL(1) parser, for instance, the parser makes decisions based on just one token of
lookahead.
3. Deterministic: LL(k) parsers are deterministic. For any input, they follow a single path of
decisions without backtracking or guessing. This determinism means that the grammar
must be unambiguous and compatible with the predictive parsing strategy.
4. Left-to-Right Parsing: The parser processes the input from the leftmost token to the
rightmost token, consuming each token as it builds the parse tree.
1. Efficiency: LL(k) parsers are typically very efficient, especially when k = 1, because
they make decisions based on a small, fixed number of lookahead tokens. LL(1) parsers
are often used in compiler design due to their simplicity and efficiency.
2. Simple to Implement: LL(k) parsers can often be implemented as recursive descent
parsers. Each non-terminal in the grammar corresponds to a function in the parser, and
parsing can be accomplished using straightforward recursive functions.
3. Limitations: Not all grammars are LL(k) for any finite k. In particular, LL(k) parsers have
trouble with:
○ Left recursion: Grammars with left recursion (e.g., E ::= E + T) can't be
parsed by LL(k) parsers.
This is true. LL(k) grammars are specifically designed to be parsed using predictive parsing techniques like recursive descent. Recursive descent parsers work by dedicating one function to each non-terminal and selecting which production to expand based on a bounded amount of lookahead. Since LL(k) grammars allow for parsing decisions to be made with up to k tokens of lookahead, recursive descent parsers can handle them effectively. Therefore, any LL(k) grammar can be parsed by a recursive descent parser.
Conversely, for a recursive descent parser (without backtracking) to work:
● The grammar must be predictive, meaning that at each step, the parser must be able to determine the correct production rule based on a bounded number of lookahead tokens.
● The grammar must be non-ambiguous and non-left-recursive (left recursion needs to be eliminated to avoid infinite recursion in a recursive descent parser).
These are precisely the characteristics of an LL(k) grammar. So, if a recursive descent parser can successfully parse a grammar, then, by definition, that grammar is LL(k) for some k.
In Summary:
While recursive descent parsers are naturally suited for LL(k) grammars, they cannot handle all
context-free grammars. For example, grammars with left recursion or ambiguities might not be
suitable for recursive descent parsing without modifications, such as removing left recursion or
transforming the grammar to fit the LL(k) form.
1. Input:
○ Parsing requires us going over every token in the input; hence, it is at least O(n).
2. Lookahead:
○ If we don't fix the k (meaning that we want the complexity for any given k), then
the size of the lookahead will be part of the complexity; hence, it will be at least
O(k * n). However, k will be a constant if we fix the implementation of the parser.
3. Grammar Depth:
○ The depth of recursion in the grammar’s structure affects the constant factor in
parsing. The time complexity, letting the grammar vary, will include terms like
O(m * d * n), where m is the number of production rules and d is the depth of
recursion. However, m and d are constants if we fix the grammar.
4. Parsing Table Size (for Table-Driven Parsers):
○ For an LL(k) parser implemented with a parsing table (non-recursive), the size of the table can grow exponentially with k: it has one row per non-terminal and one column per possible lookahead string, giving a space complexity on the order of O(|V| * |Σ|^k), where V is the set of non-terminals and Σ the set of terminals. This space complexity doesn't usually affect the time complexity directly, but it is a factor in the practicality of larger k values.
In the theoretical worst case, if you consider an arbitrary grammar and a variable k, the running time combines the factors above into something like O(m * d * k * n). Once the parser and the grammar are fixed, however, m, d, and k are constants, and parsing runs in time linear in the input, O(n).
1. Pascal:
○ Early implementations of Pascal used recursive descent parsers, as Pascal's
grammar is relatively simple and well-suited for predictive parsing.
2. Java:
○ Some early Java compilers, like Sun's javac, were built using recursive descent
parsers. However, modern Java compilers often use more sophisticated parsing
techniques to handle the more complex grammar of the language.
3. Modula-2 and Modula-3:
○ Both languages, designed by Niklaus Wirth (the same creator of Pascal), use
recursive descent parsers in their implementations, as their grammars are
designed to be LL(1) or close to it.
4. Oberon:
○ Another Wirth language, Oberon, was designed with recursive descent parsing
in mind and uses this technique in its implementations.
5. Lua:
○ The lightweight scripting language Lua uses a hand-written recursive descent
parser, which works well for its fairly straightforward grammar.
6. Go:
○ The Go programming language uses a variant of recursive descent parsing in its
compiler. The Go grammar is designed to be simple enough for efficient recursive
descent parsing.
7. Python:
○ Early versions of Python used a recursive descent parser. More recent versions
use a more sophisticated parser generator (PEG), but recursive descent
principles were influential in its design.
Disclaimer: While Clang's parsing process is largely based on recursive descent principles, it's
not a straightforward, textbook implementation. Clang uses a combination of techniques,
including recursive descent and context-sensitive parsing, to handle the complexities of C, C++,
and Objective-C syntax.
1. lib/Parse/Parser.cpp: This is a central file in the parsing process. It contains the
top-level parsing functions and handles the overall structure of the parse.
2. lib/Parse/Parse.h: This header file declares the prototypes for the parsing functions
defined in Parser.cpp.
3. include/clang/Parse/Parser.h: This header file provides an interface for external
clients to access the parser.
4. lib/Basic/Token.h: This file defines the Token class, which represents a single
lexical token in the source code.
5. lib/Lex/Lexer.h: This file defines the Lexer class, which is responsible for
tokenizing the source code.
While you might not find a direct, textbook implementation of recursive descent, you can
observe the recursive nature of the parsing functions in Parser.cpp. For example, the function
that parses a statement might recursively call functions to parse expressions, declarations, or
other statement types.
Specific Examples:
Code Structure:
The parsing code often follows a top-down approach, starting with the top-level grammar rules
and progressively descending into more specific rules. This is consistent with the principles of
recursive descent parsing. Nevertheless, clang's parsing is highly optimized and may use
techniques like context-sensitive parsing to deal with complex language constructs. These
techniques can make the code structure less straightforward than a pure recursive descent
implementation.
The Lua parser is implemented in C and can be found in the llex.c and lparser.c files
within the Lua source code distribution.
Key Files:
● llex.c: This file handles lexical analysis, breaking the source code into tokens.
● lparser.c: This file contains the core parsing logic, implementing the recursive
descent parser.
You'll find functions in lparser.c that correspond to the various grammar rules in Lua's
syntax. For example, there might be functions for parsing expressions, statements, blocks, and
other language constructs. These functions often call each other recursively to parse nested
structures. As an example, consider the entry point of the parser:
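As of Lua 5.4, the statlist function in lparser.c reads roughly as follows (the first comment is from the original source):

static void statlist (LexState *ls) {
  /* statlist -> { stat [';'] } */
  while (!block_follow(ls, 1)) {
    if (ls->t.token == TK_RETURN) {
      statement(ls);
      return;  /* 'return' must be last statement */
    }
    statement(ls);
  }
}

1. LexState *ls: This structure carries the state of the lexer and parser, including the current token in ls->t.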
2. block_follow(ls, 1): This function checks if the current token is a valid token to
end a block (e.g., end, else, or elseif).
3. TK_RETURN: This is a Lua token representing the return keyword.
4. statement(ls): This function presumably parses a single statement.
The function uses a loop to repeatedly parse statements until it encounters a token that signals
the end of the statement list. This recursive nature is characteristic of recursive descent parsing.
Key Points:
● The function's name (statlist) suggests that it's parsing a list of statements.
● The loop condition (!block_follow(ls, 1)) ensures that statements are parsed until
the end of the block.
● The special handling of return statements ensures that return is the last statement in
a block.
So, can you list the advantages and disadvantages of recursive descent
parsers?
Advantages:
1. Simplicity:
○ Recursive descent parsers are relatively easy to implement and debug. Each
non-terminal in the grammar corresponds to a function in the parser, making the
code very readable and modular.
2. Flexibility:
○ They allow for easy integration of semantic actions, custom error handling, and
more fine-grained control over the parsing process.
3. Maintainability:
○ Hand-written recursive descent parsers can be modified more easily than parsers
generated by parser generators like Yacc or Bison, making them preferable in
cases where the grammar changes frequently.
Disadvantages:
1. Limited Grammar Class:
○ Recursive descent parsers handle only LL-style grammars: left recursion must be eliminated and the grammar must be unambiguous, as discussed earlier in this chapter.
2. Manual Effort:
○ For large languages, writing the parser by hand and keeping it in sync with the grammar is laborious compared to using a parser generator.
● Published in Communications of the ACM in 1961, Irons’ paper is one of the earliest to
describe syntax-directed compilation, focusing on the direct use of grammar rules to
guide parsing.
● Irons’ method was top-down:
○ He demonstrated how to write a compiler for ALGOL 60 by systematically
working through grammar rules.
○ The compiler directly translated grammatical structures into machine code as it
parsed the input.
● Irons’ method essentially hardcoded grammar rules into the compiler’s structure.
● While not named as such, this is an early form of recursive descent parsing because
the parsing process recursively followed grammar rules to recognize syntactic structures.
● Published the same year as Irons’ work, Lucas’ paper is also fundamental to recursive
descent parsing.
● Lucas focused on structured top-down parsing and explicitly described the use of
recursive procedures for parsing grammar rules.
● Lucas formalized the recursive nature of the parsing process, explicitly using recursive
functions to implement grammar rules.
● His approach more closely resembles modern recursive descent parsing, with each
grammar rule corresponding to a specific recursive function.
Both papers contribute significantly, but they approach the problem from slightly different angles:
1. Irons described a more general syntax-directed compilation process, which included
elements of recursive descent but wasn’t exclusively focused on parsing.
2. Lucas gave a more explicit and systematic description of recursive descent parsing as
we recognize it today.
● If we are looking for the first application of recursive descent parsing: Irons.
● If we are looking for the first explicit description of recursive descent parsing: Lucas.
The method’s simplicity and clarity ensured its widespread adoption and long-term influence,
even in modern parsing techniques.
Conclusion
Both Irons and Lucas played foundational roles in the development of recursive descent parsing
in 1961. Lucas’ description is often considered the first explicit description, while Irons’ work
laid the groundwork as part of a broader exploration of syntax-directed compilation. Both
deserve recognition for their contributions to this key advancement in compiler design.
Is this the same Peter Lucas who inspired the design of PL/I?
Yes, pretty much! Peter Lucas played a key role in the formal definition of PL/I through his work
on the Vienna Definition Language (VDL). Peter Lucas (born 1935 in Austria) is a pioneering
computer scientist known for his contributions to programming language theory, compiler
design, and formal methods. He studied mathematics and earned his doctorate in 1959 under
Leopold Schmetterer at the University of Vienna. Lucas was part of the team that developed
ALGOL 60 compilers and played a leading role in advancing the formal description of
programming languages.
Throughout his career, Lucas worked on both theoretical and practical aspects of programming,
becoming a key figure in early efforts to formally define programming languages. His influence
extended through his work on the Vienna Definition Language (VDL) and his contributions to
parsing techniques, particularly recursive descent parsing. Lucas held academic and research
positions in Austria and internationally, shaping the foundations of modern computing.
○ Lucas advanced the field of formal methods, providing rigorous frameworks for
reasoning about programming languages and their implementation.
○ His work laid the groundwork for the modern study of programming language
semantics and influenced formal verification techniques.
5. Theoretical Computer Science:
○ Lucas explored topics like syntax-directed translation, block structure in
programming languages, and hierarchical representations of programs.
○ His theoretical insights into language design and compiler construction have had
long-lasting impacts on both academic research and practical compiler
development.
Legacy
Peter Lucas’ work bridged the gap between theory and practice in early computing. His
pioneering efforts in parsing, formal language definition, and compiler design continue to
influence the development of programming languages and compilers. While his contributions
are often associated with foundational work in the 1960s, their relevance endures in modern
computer science.
Bottom-up Parsing
Code: https://github.com/pronesto/DCC053_Public/tree/main/4_bottomUp
class Parser:
"""
This parser implements an attempt to recognize the language of the following
grammar:
S ::= ( S ) S | Empty
"""
def __init__(self, lexer):
self.lexer = lexer
self.current_token = self.lexer.next_valid_token()
def S(self):
if self.current_token.kind == TokenType.LPR:
self.consume(TokenType.LPR) # consume '('
self.S() # parse S
self.consume(TokenType.RPR) # consume ')'
self.S() # parse S
def parse(self):
self.S()
if self.current_token.kind != TokenType.EOF:
raise ValueError(f"Unexpected token {self.current_token.kind}")
def test_parser(input_str):
"""
This function tests our parser for the Dyck language (balanced parentheses).
Example:
>>> test_parser('()()()')
()()() is valid.
>>> test_parser('(()())')
(()()) is valid.
>>> test_parser('((()))')
((())) is valid.
>>> test_parser('')
is valid.
>>> test_parser('(()))')
(())) is invalid: Unexpected token TokenType.RPR
>>> test_parser('((())')
((()) is invalid: Expected TokenType.RPR, got TokenType.EOF
"""
lexer = Lexer(input_str)
parser = Parser(lexer)
try:
parser.parse()
print(f"{input_str} is valid.")
except ValueError as e:
print(f"{input_str} is invalid: {e}")
The parser we've implemented recognizes the language of balanced parentheses based on the
grammar "S ::= ( S ) S | Empty". It uses a recursive descent approach, where the S method is
called recursively to handle nested structures. The parser consumes tokens from the lexer and
checks for matching pairs of parentheses, raising errors when it encounters unexpected tokens
or if the structure is invalid.
In terms of classification, this parser is an LL parser because it processes the input from left to
right and constructs a leftmost derivation of the sentences in the language. It does not
backtrack; instead, it relies on the structure of the grammar to guide the parsing decisions. If it
encounters an unexpected token, it raises an error rather than attempting to find an alternative
parse path.
Write a parser for the language of strings with no fewer left parentheses than right parentheses.
Here's the grammar:
S ::= ( S | P
P ::= ( P ) | Empty
Let's try to write a recursive descent parser. It should look like:
def S(self):
token = self.current_token
if token.kind != TokenType.LPR:
raise ValueError("Unexpected token")
# Now, do we call production "S ::= ( S", or "S ::= P"?
In fact, we can't parse this grammar with a recursive descent parser... When we encounter a left
parenthesis (, the parser cannot determine whether to expand S or P because both productions
begin with (. The choice between these two productions requires more context than can be
provided by just looking at the next token (in this case, "(").
To correctly decide whether to use S or P, the parser would need to see additional tokens,
specifically a right parenthesis ), to clarify the structure. This dependency on future input makes
it impossible for an LL(k) parser (which makes decisions based solely on the current lookahead
tokens) to parse the grammar correctly, as it can't backtrack or re-evaluate choices based on
later tokens. So, while the grammar is not ambiguous, the inability to make a correct decision
upon seeing a left parenthesis renders it unsuitable for LL(k) parsing.
So, if there are LL(k) grammars, then there should also be LR(k)
grammars?
An LR(k) grammar is a type of context-free grammar that can be parsed by an LR(k) parser: the input is scanned Left-to-right, the parser builds a Rightmost derivation in reverse, and it may consult up to k tokens of lookahead. Consider again our grammar:
S ::= ( S | P
P ::= ( P ) | Empty
We would have the following transition table, assuming input "( ( ) )":
We have two "shift" states, 0 and 2. Why do they handle ")" differently?
The first state "knows" that there is no P on the stack. So, if it reads ")", then it must push "P )"
onto the stack. In this way, either we will end up with "(...(P)", which can be reduced, or with
"P)", which is an error.
def shift_0(stack, lexer):
    # Shift action for state 0. On '(' we push the token and stay in
    # state 0 (an assumption of this sketch); on ')' we know no P has
    # been recognized yet, so the ')' closes an empty P.
    token = lexer.next_valid_token()
    if token.kind == TokenType.LPR:
        stack.append(TokenType.LPR)
        return 0
    elif token.kind == TokenType.RPR:
        stack.append('P')
        stack.append(TokenType.RPR)
        return 1
    elif token.kind == TokenType.EOF:
        stack.append('S')
        return 1
    else:
        raise ValueError(f"Unknown token: {token.text}")

# reduce_1 and shift_2 are defined along the same lines (not shown);
# the table maps each state number to its action:
state_machine = [shift_0, reduce_1, shift_2]
def test_parser(input_str):
lexer = Lexer(input_str)
next_state = 0
stack = []
try:
while next_state >= 0:
action = state_machine[next_state]
next_state = action(stack, lexer)
print(f"{input_str} is valid")
except ValueError as e:
print(f"{input_str} is invalid")
These shift and reduce actions are performed by the functions shift_0, reduce_1, and shift_2 in your code. We can see it as
the following diagram:
● States:
○ The figure has three labeled states: 0, 1, and 2.
○ Each state corresponds to a specific action in the parser. For example:
■ State 0: Shifts input into the stack (handled by shift_0 in your code).
■ State 1: Reduces patterns in the stack (handled by reduce_1 in your
code).
■ State 2: Handles additional input and shifts (handled by shift_2 in your
code).
● Transitions:
○ The arrows between states indicate the transitions based on specific tokens.
○ For instance:
■ From state 0 to state 1, the transition happens if an RPR (right
parenthesis) token or EOF (end of file) is encountered. This is consistent
with the code in shift_0.
■ From state 1 to state 2, it reduces a valid parentheses expression P ->
(P).
■ From state 2 to state 1, the transition happens when an RPR token is
encountered.
Summary:
The figure is a finite state machine that works with an LR parser designed to match balanced
parentheses. Each state corresponds to an action (shift or reduce) in the parser, and the
transitions between states happen based on the input tokens (LPR, RPR, EOF). The program
you've provided implements this state machine, where the states and transitions in the code
map directly to the states and arrows in the diagram.
1. Shift and Reduce Actions: The parser uses functions (shift_0, shift_2,
reduce_1) to define actions based on the current state. It shifts tokens onto the stack or
reduces the stack contents when patterns that match the grammar are found.
2. State Management: The use of a state machine to dictate transitions allows for flexibility
in handling different parsing scenarios. The current state determines the action to take,
whether to shift or reduce, which is characteristic of shift-reduce parsers.
3. Non-Recursive: Unlike a recursive descent parser, this implementation does not rely on
recursive function calls. Instead, it manages the parsing process through an explicit
stack and state transitions, making it more iterative.
4. Handles Context Sensitivity: This parser can decide whether to shift or reduce based
on the current stack content and the tokens received, effectively handling the
dependencies inherent in the given grammar.
Bottom-Up Parsers:
● General Category: Bottom-up parsing refers to any parsing strategy that starts from the
leaves (tokens) and works up to the root (the start symbol of the grammar). This
category includes several types of parsers, such as LR parsers, SLR parsers, and LALR
parsers.
● Reduction: Bottom-up parsers primarily focus on reducing sequences of input tokens
into non-terminals according to the grammar rules until they eventually derive the start
symbol.
LR Parsers:
● Specific Type: LR parsers are a specific kind of bottom-up parser. The "LR" stands for
"Left-to-right" scanning of the input and "Rightmost derivation" in reverse.
● State Machine: LR parsers use a finite state machine to decide when to shift (add
tokens to the stack) and when to reduce (replace a sequence of tokens with a
non-terminal).
● Powerful: LR parsers can handle a larger class of grammars than many other parsing
techniques, including some that are not suitable for LL parsing.
In summary, while all LR parsers are bottom-up parsers, not all bottom-up parsers are LR
parsers. LR parsers are a particular implementation of the bottom-up approach, distinguished by
their method of processing input and managing states.
S ::= ( S | P
P ::= ( P ) | Empty
This grammar is not LL(k) for any k, even though it generates a context-free language. Here's why: for a grammar to be LL(k), the parser must be able to determine the correct production to use based solely on the next k tokens of lookahead. However, in this grammar:
● The parser must decide whether to continue with S ::= ( S or switch to P. This
decision is based on whether there will eventually be a matching ) after a sequence of
(s, which could be far ahead in the input.
● When the parser sees a (, it doesn't know whether this ( belongs to the S ::= ( S branch, which could be part of an arbitrary sequence of ('s, or to the P ::= ( P ) branch, which requires a matching ) later.
When we talk about LL and LR parsers, the first 'L' means that they read
the input from left to right. What about the second letter, L or R?
The second letter in LL and LR parsers refers to how the parser constructs the parse tree:
● LL: The second "L" stands for Leftmost derivation. An LL parser constructs a parse
tree by expanding the leftmost non-terminal first. This means it builds the parse tree from
the top down, choosing productions that match the leftmost part of the remaining input.
● LR: The "R" stands for Rightmost derivation in reverse. An LR parser constructs the
parse tree by reducing the rightmost non-terminal first, but it does so in reverse order.
This means it builds the parse tree from the bottom up, starting with the input symbols
and working backward to derive the start symbol of the grammar.
Considering the sentence "n0 + n1 * n2", we would have the following derivations:
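Assuming the usual stratified expression grammar (E ::= E '+' T | T; T ::= T '*' F | F; F ::= num), the two derivations of n0 + n1 * n2 are:

Leftmost:
E → E + T → T + T → F + T → n0 + T → n0 + T * F → n0 + F * F → n0 + n1 * F → n0 + n1 * n2

Rightmost:
E → E + T → E + T * F → E + T * n2 → E + F * n2 → E + n1 * n2 → T + n1 * n2 → F + n1 * n2 → n0 + n1 * n2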
Derivations:
To make the two derivation orders concrete, consider this tiny grammar:
S → A B
A → a
B → b
Leftmost Derivation:
In a leftmost derivation, at every step of the derivation, the leftmost non-terminal (the first one
from the left) is expanded (replaced by its corresponding rule). This approach expands
non-terminals starting from the left side of the string.
1. Start with the start symbol:
S → A B
2. In leftmost derivation, we expand the leftmost non-terminal first (which is A):
A B → a B
3. Now we expand the next leftmost non-terminal, which is B:
a B → a b
This is the leftmost derivation because we always expand the leftmost non-terminal (A first, then B).
Rightmost Derivation:
In a rightmost derivation, at every step of the derivation, the rightmost non-terminal (the one furthest to the right) is expanded first. This approach expands non-terminals starting from the right side of the string.
1. Start with the start symbol:
S → A B
2. In rightmost derivation, we expand the rightmost non-terminal first (which is B):
A B → A b
3. Now we expand the remaining non-terminal, A:
A b → a b
This is the rightmost derivation because we always expand the rightmost non-terminal (B first, then A).
Connection to Parsers:
Assuming that <name> and <CPF> were already scanned by the lexer. LR(0) parsers, in turn, are a bit more common. The parser that we have written in this class is LR(1): based on the stack, the state, and the current token, it never runs into a shift/reduce conflict.
Are LR(0) parsers and SLR (Simple LR) parsers the same thing?
LR(0) parsers and SLR (Simple LR) parsers are related but not the same. Here's the difference:
● Lookahead: LR(0) parsers do not use any lookahead, while SLR parsers use lookahead
derived from Follow sets.
● Parsing Table: The construction of the parsing table is different. SLR parsers can
handle grammars that would cause conflicts in an LR(0) parser's table.
● Power: SLR parsers are more powerful and can parse a broader set of grammars than
LR(0) parsers.
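A small grammar (an illustrative sketch) shows the difference:

S ::= T
T ::= id
  | id '(' ')'

After shifting id, an LR(0) parser is stuck in a state containing both the complete item T ::= id • (reduce) and the item T ::= id • '(' ')' (shift): a shift/reduce conflict, since LR(0) must commit to one action per state regardless of the next token. An SLR parser resolves it with lookahead: it reduces by T ::= id only when the next token is in FOLLOW(T), and shifts when it sees '(', so the conflict disappears.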
The diagram presents a hierarchy among different types of parsing strategies based on the
languages they can recognize:
○ LR(0) is the simplest class, with no lookahead, while increasing the complexity to
SLR(1), LALR(1), and LR(1) enables the recognition of more complex
languages.
3. LR(k) covers more languages than LL(k):
○ LR(k) parsers can recognize a broader set of languages than LL(k) parsers for
any given k, which is reflected in the image by the nesting. This means there are
grammars that an LR(k) parser can handle but that an LL(k) parser cannot.
The recursive-descent parser from our previous class is typically an LL(k) parser. It parses the
input in a top-down manner with a lookahead of k. So:
● Recursive-Descent ≈ LL(k): The two are equivalent in terms of what they can parse,
though recursive-descent parsers are usually simpler implementations of LL(k) parsing.
● According to the figure, LL(k) parsers are properly contained within LR(k) parsers. This
means that LR(k) parsers can recognize a strictly larger class of grammars than LL(k).
The figure produced by Abhishek Ghosh attempts to relate different types of grammars from the
Chomsky hierarchy with families of parsers.
Meanwhile, parser families (like LL, LR, etc.) are tied to classes of grammars that define
specific languages. The hierarchy of parsers corresponds to the types of grammars and
languages they can recognize.
Grammar-Parser Correspondence
● LL(k) and LR(k) correspond to deterministic context-free grammars. These are the
grammars that can be parsed by deterministic parsers like LL and LR.
● L(LL(k)) would refer to the set of languages that can be parsed by an LL(k) parser.
● L(LR(k)) would refer to the set of languages recognized by LR(k) parsers.
● L(LL(1)) is strictly smaller than L(LL(k)), since increasing k allows recognition of more
complex languages.
● L(LR(1)) is larger than L(LL(k)) for any fixed k, meaning LR(1) parsers can recognize all LL(k) languages, but also some languages that LL(k) parsers cannot.
● L(LR(k)) includes a broader class of context-free languages, and when k increases, the
set of recognizable languages also increases.
Yes, it makes sense to think about the hierarchy of parsers and their relation to the Chomsky
hierarchy. The figure from Stack Exchange aims to capture this idea by placing parsers like
LL(k) and LR(k) within a broader context of language classes.
So, in summary, talking about the language families recognized by different parser families
(like L(LR(k))) is meaningful. The figure shows that LR(k) parsers can handle deterministic
context-free grammars, and by extension, the more complex parsers can handle increasingly
larger subsets of context-free or even context-sensitive languages.
Consider the following C statement, in which x may or may not name a type:

x * y;
Do you parse x as a type (in which case x * y is declaring a variable called y that is a pointer to x), or do you parse it as another variable (in which case x * y is a multiplication)? The statement x * y; can indeed be ambiguous in terms of parsing, but when it comes to actual compilation in C, the compiler would reject it if x is defined as neither a type nor a variable.
Key Points:
1. Ambiguity: In a purely syntactic sense, the statement can be interpreted in two ways: as
a declaration (x as a type) or as an expression (multiplication).
2. Compiler Behavior: If x is not defined as a valid type (like an int, float, or a
user-defined type) or as a valid global variable, then the compiler will reject the program
because it cannot resolve the symbol x.
From the above figures, it seems that LR parsers cannot deal with
ambiguous grammars, yet, they deal with the if-then-else/if-then commands
in languages like C and Java. How so?
LR parsers typically can't deal with ambiguous grammars directly. However, when handling
common ambiguous constructions like the dangling else problem in languages like C and
Java (where both if-then-else and if-then commands exist), LR parsers rely on
disambiguation rules embedded into the parsing process.
This problem arises because, in grammars for languages like C or Java, both the if-then and
if-then-else constructs are allowed. When the parser encounters an if followed by a
nested if-then construct, it might not know which else corresponds to which if, leading to
ambiguity. Consider the following code:
if (a)
if (b)
foo();
else
bar();
Syntactically, the else could attach to either if. One reading attaches it to the outer if:

if (a) {
if (b) foo();
} else {
bar();
}
The other reading attaches the else to the nearest (inner) if, which is the interpretation that C and Java actually adopt:

if (a) {
if (b) foo();
else bar();
}
LR parsers handle this ambiguity by introducing precedence and associativity rules during
the parsing process. Specifically, languages like C and Java follow a simple disambiguation
rule for the dangling else problem: an else always binds to the nearest preceding if that does not already have an else.
This is a disambiguation rule that's enforced by the grammar design and the parser.
The grammar for an LR parser is written in such a way that it prioritizes matching an else with
the most recent if. Here’s a simplified grammar that demonstrates how this is handled:
stmt    → if_stmt
        | other_stmt
if_stmt → IF '(' cond ')' stmt ELSE stmt
        | IF '(' cond ')' stmt
In this grammar:
● The production IF '(' cond ')' stmt ELSE stmt ensures that when an else is
encountered, it’s matched with the nearest preceding if.
● The production IF '(' cond ')' stmt allows if statements without an else.
LR Parsing Tables:
During parsing, the LR parser uses shift-reduce decisions based on the parsing table it
generates from the grammar. When the parser encounters an else, the table guides it to
reduce using the rule IF '(' cond ')' stmt ELSE stmt, ensuring that the else binds to
the closest if. This decision-making process is encoded in the parser’s shift-reduce actions:
● When the parser is in a state where it expects an else (right after a stmt that follows
an if without else), it shifts to process the else.
● The parser avoids reducing the earlier if production too early, ensuring the else can
be attached to the nearest if.
if (a)
if (b)
foo();
else
bar();
1. First if: The parser shifts the first if and matches it to the condition a.
2. Nested if: The parser shifts the nested if and matches it to the condition b.
3. foo(): The parser shifts the foo() statement.
4. else: When the parser encounters else, it shifts and reduces using the rule that
matches else to the innermost if (if (b)), ensuring the correct parsing.
This mechanism ensures the parser always handles the dangling else by matching it with the
closest preceding if.
In Summary:
● LR parsers can't handle ambiguity directly, but they can resolve specific ambiguous
constructs like the dangling else through disambiguation rules built into the grammar
and the parser’s shift-reduce actions.
● The "match else to the closest if" rule is enforced through the structure of the
grammar and the decisions made by the parser’s parsing table.
And why does the C language require us to put declarations before uses? Is it to help the parser?
Pretty much! The requirement for declarations before use in C serves primarily to inform the
parser and compiler about the nature of symbols—whether they are user-defined variables,
functions, or types.
1. Type Information: Declarations provide the necessary type information that the compiler
needs to interpret symbols correctly. For example, without a declaration, the compiler
cannot determine whether x is a variable or a type, leading to ambiguity in our previous
example.
2. Correct Parsing: When the parser encounters a symbol, having prior declarations
allows it to resolve the symbol's meaning immediately. This avoids situations where the
parser must guess or defer the decision, which could lead to errors or incorrect
interpretations.
The compiler typically resolves this ambiguity by looking at the context. For example, if a is
followed by parentheses, it's more likely to be a function call.
Again, the context helps the compiler decide. If a is followed by an operator, it's more likely to be
a variable.
● Pointer Declaration: a * b;
The compiler usually relies on the context and the type of the operands to determine the correct
interpretation.
1. Context-Free Grammar:
● C can be defined by a context-free grammar (CFG) in terms of its basic syntax rules.
Many elements, such as expressions and statements, can be represented with CFG
constructs.
2. Context-Sensitive Features:
● Some aspects of C are not context-free: whether a name like x denotes a type or a variable depends on declarations seen earlier, so the parser needs a symbol table to interpret constructs like x * y; correctly.
3. Practical Implications:
● While parsers for C are often built using techniques that work well with context-free
grammars (like LR or LL parsing), additional mechanisms (like symbol tables) are
needed during semantic analysis to handle context-sensitive aspects.
Conclusion
In summary, while the C grammar can be expressed as context-free for many syntactic
constructs, the language's practical parsing and compilation require context-sensitive
information to resolve ambiguities and ensure correct type handling. This is why C compilers
implement a two-phase process: parsing for syntax and semantic analysis for type and symbol
resolution.
Donald Knuth developed LR(k) parsers (bottom-up parsers) in 1965; these parse input from left to right and construct a rightmost derivation in reverse. The LR parser is generally more powerful, capable of handling a broader class of grammars than LL parsers.
When Algol was invented, there was not much parsing theory around. It
seems that much of this theory and practice was invented for Algol. Is that
true?
Yes, when ALGOL (Algorithmic Language) was first developed, formal parsing theory was not
yet well-established, and the process of parsing programming languages was still largely ad
hoc. ALGOL, particularly ALGOL 60, is a significant historical milestone because it influenced
the development of formal syntax description methods, including Backus-Naur Form (BNF),
and also contributed to the emergence of formal parsing techniques.
Parsing ALGOL 60
ALGOL 60’s development began in the late 1950s, and it was first specified in 1960. At the time,
the formal understanding of parsers—such as LL(k) and LR(k)—had not yet been developed,
but ALGOL 60’s syntax played a pivotal role in shaping the future of programming languages
and parsing theory.
Context
1. Backus-Naur Form (BNF): One of the key innovations introduced with ALGOL 60 was
the use of BNF, which John Backus and Peter Naur developed to formally describe the
syntax of the language. This was a significant step toward a rigorous understanding of
grammar and parsing.
○ BNF described context-free grammars, which later became the foundation for
parsing theory. However, BNF itself was not a parser but a way to formalize the
grammar in a way that was easier to implement and reason about.
2. Ad hoc and Recursive-Descent Parsers: Early ALGOL 60 compilers relied heavily on
recursive-descent parsing techniques, often implemented in an ad hoc manner.
Recursive-descent parsers are top-down parsers, which are hand-crafted to recognize
different productions of the language. These parsers were written directly from the
grammar, with one function per non-terminal symbol in the grammar. However,
recursive-descent parsers can struggle with certain constructs, particularly
left-recursion, which ALGOL 60's grammar possessed. This led to the need for some
clever modifications, such as:
○ Refactoring left-recursive grammar rules into right-recursive ones to make
them work with recursive-descent parsers.
○ Backtracking: Some early ALGOL compilers likely used backtracking in their
recursive-descent parsers when choices in the grammar were ambiguous.
3. Limited Parsing Theory: Since LL(k) and LR(k) parsing theory hadn’t been developed,
early ALGOL parsers were typically LL(1)-like recursive-descent parsers. They would
rely on just one token of lookahead or implement backtracking when needed. The
limitations of these parsers, such as their inability to efficiently handle certain grammars,
were tolerated or patched over by practical fixes, but they laid bare the challenges of
parsing more complex languages.
4. Practicality over Formal Theory: The main focus was getting a working compiler, so
developers often implemented whatever techniques worked in practice. These early
techniques were not yet formalized in the way that later parsers, such as LL(k) and
LR(k) parsers, were.
The difficulties encountered when trying to parse ALGOL 60 motivated significant advances in
parsing theory. For example:
● Donald Knuth developed LR parsing partly to handle the class of grammars that
ALGOL 60 fell into, as it was difficult to parse with the simpler LL(1) methods.
● The distinction between different grammar types (such as LL vs. LR) became clearer as
formal parser theory evolved in the 1960s, leading to better parsing techniques for
handling languages like ALGOL.
In summary, ALGOL was initially parsed using recursive-descent parsers, often with
backtracking, at a time when formal parsing theory was still nascent. Its syntax and the
challenges it presented significantly influenced the development of BNF and more sophisticated
parsing algorithms like LR parsing, which were created in response to the difficulties of parsing
languages like ALGOL.
Components of BNF
BNF defines grammars in terms of productions, which describe how strings (sequences of
symbols) in a language can be generated. Each production consists of non-terminal symbols,
terminal symbols, and rules.
1. Non-terminal symbols:
○ These are placeholders for syntactic categories, usually written in angle brackets (e.g., <expression>, <statement>); they can be rewritten by production rules.
2. Terminal symbols:
○ These are the basic, indivisible symbols of the language, often corresponding to tokens like keywords, operators, or literals (e.g., if, +, 123).
○ Terminal symbols are not replaced further in the grammar; they appear directly in the input program.
3. Production rules:
○ A production rule specifies how a non-terminal symbol can be replaced with a
sequence of non-terminal and terminal symbols.
○ The general form of a rule is: <non-terminal> ::= <symbol sequence>.
The ::= symbol is read as "is defined as" or "can be replaced by."
○ Multiple alternatives for a non-terminal can be specified using the vertical bar |,
meaning "or."
BNF grammars describe the valid structure of sentences (or programs) in a language by
recursively defining how higher-level structures (like expressions or statements) can be
constructed from simpler ones. These definitions allow a parser to validate whether a given
input conforms to the language's syntax.
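For instance, a small expression grammar written in BNF (an illustrative sketch, not from the ALGOL report) could be:

<expr>   ::= <expr> + <term> | <term>
<term>   ::= <term> * <factor> | <factor>
<factor> ::= num | ( <expr> )

To check that num + num * num belongs to the language, a parser starts at <expr> and keeps expanding non-terminals until every token of the input has been matched.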
This recursive process continues until all the input is matched to the grammar rules.
Variations of BNF
There are a few variations and extensions of BNF that are commonly used:
1. EBNF (Extended Backus-Naur Form): This version includes additional notations to make grammars more concise and readable (see the example right after this list). For example, it includes operators like:
○ [] to indicate optional elements.
○ {} to indicate repetition (zero or more occurrences).
○ () to group alternatives.
2. ABNF (Augmented BNF): This is a variation of BNF used in defining internet standards
(such as in RFCs). It is used in protocols like HTTP and SIP.
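Using the EBNF operators above, the expression grammar sketched earlier becomes more compact:

<expr>   ::= <term> { ('+' | '-') <term> }
<term>   ::= <factor> { ('*' | '/') <factor> }
<factor> ::= num | '(' <expr> ')'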
When ALGOL 60 was developed, there was a need for a precise way to define its syntax. John
Backus first proposed BNF as part of the ALGOL 60 report, and Peter Naur refined it. This was
one of the earliest formal methods for describing the syntax of a programming language, and it
had a profound impact on subsequent languages, parsing techniques, and compiler design.
BNF laid the groundwork for the formalization of context-free grammars and contributed directly
to the development of modern parsing techniques, including LL and LR parsing.
Summary of BNF
BNF is a formal grammar notation that defines how strings (programs) in a language are
generated by a set of production rules. It’s crucial for understanding and designing programming
languages, and it's a key tool for specifying syntax in language documentation, as well as for
implementing parsers.
Grammars were invented by Noam Chomsky, right? But he did not seem to
have worked on the design of any programming language. So, what's his
actual contribution?
Noam Chomsky's work on formal grammars, especially his classification of grammars (now
known as the Chomsky hierarchy), had its roots in linguistics, not computer science. His
primary goal was to formalize how natural languages could be structured using rules, which led
to the development of formal language theory in the 1950s.
However, it was later that computer scientists saw the potential of Chomsky's work in defining
the syntax of programming languages. When people began designing high-level programming
languages like ALGOL in the late 1950s and early 1960s, they needed a formal way to describe
the syntax of these languages. Chomsky's grammars, particularly context-free grammars,
provided the ideal tool for this purpose.
While Chomsky himself didn’t work directly on any programming language projects, his formal
language theory provided the foundation for the entire field of syntax analysis in computer
science, especially in the development of compilers and interpreters. His classification of
grammars became central to understanding which languages could be recognized and how to
construct parsers for programming languages.
Summary:
Chomsky’s work on grammars was originally aimed at understanding natural languages, but
computer scientists adapted his ideas to solve practical problems in programming language
design and parsing. His theoretical framework gave rise to the formal methods used to define
and parse the syntax of programming languages, thereby laying the groundwork for modern
compiler design and parsing theory.
Where are parsers used in practice?
Parsers show up in many kinds of software, not only in compilers:
1. Compilers and Interpreters: They parse source code written in programming
languages to generate intermediate representations, such as ASTs, which are then used
for further processing, optimization, or execution.
2. Data Serialization and Deserialization: Parsers are used to read and write data
formats like JSON, XML, and YAML, converting them between text and structured data
formats for software applications.
3. Configuration Files: Many applications use parsers to read configuration files, allowing
users to specify settings in a readable format that the software can interpret.
4. Natural Language Processing: Parsers analyze human language text to extract
meaningful information, such as in search engines, chatbots, and translation systems.
5. Web Browsers: They parse HTML and CSS to render web pages, turning markup and
style sheets into the visual content displayed on screens.
6. Data Query Languages: Parsers are used in database systems to interpret SQL
queries, translating them into actions that retrieve or modify data.
What is the difference between lexing and parsing?
● Lexing: Also known as lexical analysis, lexing is the process of breaking down a
sequence of text into tokens. Tokens are meaningful units such as keywords, identifiers,
operators, and symbols. The lexer (or tokenizer) identifies these units based on
predefined patterns, often using regular expressions. For example, in the code x = 5 +
3, the lexer would produce tokens for x, =, 5, +, and 3.
● Parsing: Parsing takes the sequence of tokens produced by the lexer and analyzes their
syntactic structure according to grammatical rules. The parser constructs a hierarchical
structure, such as an Abstract Syntax Tree (AST), that reflects the relationships and
nesting of these tokens. Continuing with the example, the parser would build a tree that
shows the addition operation and the assignment.
In summary, lexing focuses on breaking text into tokens, while parsing focuses on interpreting
the syntactic structure of those tokens.
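To make this division of labor concrete, here is a minimal lexer sketch in Python, using the re module (the token names and patterns are our own, not part of any particular tool):
import re

TOKEN_SPEC = [
    ("NUMBER", r"[0-9]+"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r"[+*=()]"),
    ("SKIP", r"\s+"),
]

def lex(text):
    # Scan left to right, trying each token pattern at the current position
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "SKIP":  # whitespace separates tokens but is not one
                    yield (name, m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unexpected character {text[pos]!r}")

print(list(lex("x = 5 + 3")))
# [('ID', 'x'), ('OP', '='), ('NUMBER', '5'), ('OP', '+'), ('NUMBER', '3')]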
Why are context-free grammars (CFGs) so important for parsing?
1. Formalism: CFGs offer a precise and well-defined set of rules for specifying the syntax
of a language. This formality ensures that parsers can be designed to correctly and
consistently interpret the structure of the text.
2. Hierarchical Structure: CFGs allow for the definition of nested structures, which are
common in programming languages and data formats. For instance, CFGs can describe
how expressions are nested within statements or how blocks of code are nested within
functions.
3. Ease of Parsing: Many parsing algorithms are designed to work specifically with CFGs.
For example, algorithms like LL(1) and LR(1) parsers are tailored to handle CFGs
efficiently. CFGs make it possible to use these algorithms to parse and process text
systematically.
4. Ambiguity Handling: CFGs can express different levels of grammar complexity,
including handling ambiguities through various techniques. While CFGs themselves can
be ambiguous, there are well-established methods for disambiguating grammars or
transforming them into a form suitable for parsing.
5. Flexibility: CFGs are expressive enough to model the syntax of many programming
languages, from simple expressions to complex language constructs. This flexibility
makes CFGs applicable to a wide range of applications beyond just programming
languages, such as configuration files and data serialization formats.
In summary, CFGs are essential because they provide a robust framework for defining and
understanding the syntax of languages, enabling the creation of effective and reliable parsers.
So, while context-free grammars are the key theory for parsing, regular
grammars are the key theory for lexing. Is that correct?
Yes, that's correct! Regular grammars are the key theory for lexing, while context-free grammars
are used for parsing. Here’s a brief overview of their roles:
● Regular Grammars: These are used for defining the patterns of tokens in lexing.
Regular grammars are less complex than context-free grammars and are well-suited for
specifying regular languages, which include the basic patterns that lexers need to
recognize, such as keywords, identifiers, literals, and symbols. Lexers use regular
expressions, which are equivalent to regular grammars, to identify and tokenize these
patterns in the text.
● Context-Free Grammars: These are used for parsing and defining the hierarchical
structure of a language. CFGs can handle nested and recursive structures, making them
ideal for specifying the syntax of programming languages and other complex data
formats. Parsers use CFGs to build structures like Abstract Syntax Trees (ASTs) based
on the tokens produced by the lexer.
In summary, regular grammars are used for the initial stage of text processing (lexing), where
the goal is to break the text into meaningful tokens, while context-free grammars are used for
the subsequent stage (parsing), where the goal is to interpret the syntactic structure of those
tokens.
Why can't Context-Free Grammars be used for parsing and lexing, if they
are strictly more powerful than Regular Grammars?
Context-Free Grammars (CFGs) are indeed more powerful than Regular Grammars, but their
increased expressiveness comes with complexities that make them less practical for lexing.
Here’s why CFGs are not typically used for lexing:
1. Complexity and Performance: Regular grammars, which are equivalent to regular
expressions, are simpler and can be processed efficiently with finite state machines
(FSMs). Lexing involves identifying tokens quickly and efficiently, which regular
expressions and FSMs handle well due to their linear time complexity. In contrast, CFGs
are more complex and often require algorithms with higher time and space complexity,
making them less suitable for the fast, lightweight processing required during lexing.
2. Different Requirements: Lexers and parsers have different goals. Lexers need to
recognize and extract discrete tokens from a stream of text, which often involves simple
pattern matching. CFGs, however, are designed to handle hierarchical structures and
nested patterns, which are not typically necessary for the tokenization phase. CFGs
excel at capturing the syntactic structure of a language but are overkill for the token
extraction process.
3. Finite State Machines vs. Parsing Algorithms: Lexing can be efficiently implemented
using finite state machines (FSMs) or regular expression engines, which are well-suited
for processing regular languages. Parsers, on the other hand, use more sophisticated
algorithms like LL, LR, or other parsing strategies that can handle the hierarchical and
recursive nature of CFGs. The complexity of these parsing algorithms is justified in the
parsing phase but is unnecessary for simple token recognition.
4. Token Boundaries: In lexing, tokens are typically bounded by simple delimiters or
patterns (e.g., whitespace, punctuation), which regular expressions handle effectively.
CFGs are better suited to handle nested and recursive constructs (e.g., nested
parentheses), which are not usually a concern in the tokenization phase.
In summary, while CFGs are more powerful and can theoretically be used for lexing, the
practical requirements of lexing—such as speed, efficiency, and simplicity—make regular
grammars and finite state machines a more appropriate choice. CFGs are reserved for parsing,
where their ability to handle complex, nested structures is essential.
In our last class, we wrote an LR(1) parser. That was very complicated: we
had to design states and transitions by hand. Aren't there alternatives?
Manually implementing an LR(1) parser can be quite complex and involved, especially when
designing states and transitions. Fortunately, there are easier alternatives and tools available
today that can handle this complexity for you. Here are some common approaches and tools
used for parser generation:
1. Parser Generators
Parser generators automatically produce parser code based on a high-level description of the
grammar. This abstracts away much of the complexity involved in manually creating parsers.
Some popular parser generators include:
● ANTLR (Another Tool for Language Recognition): ANTLR is a powerful tool that
generates parsers, lexers, and tree parsers for a variety of programming languages. It
uses a high-level grammar definition and provides a range of features for building
complex language parsers.
● Bison: An extension of Yacc (Yet Another Compiler Compiler), Bison is a well-known
parser generator for C/C++ that handles LALR(1) grammars. It is widely used for
generating parsers and can be combined with Flex for lexing.
● PEG.js: For JavaScript, PEG.js generates parsers using Parsing Expression Grammars
(PEGs). It simplifies grammar definitions and provides a JavaScript parser generator.
2. Parser Combinators
Parser combinators provide a more flexible and modular approach to building parsers. Instead
of defining a complete grammar in one go, you combine simpler parsers to build more complex
ones:
● Parsec (Haskell): A library for writing parsers in Haskell using combinators. It provides a
high-level way to build parsers by combining smaller parsers.
● PyParsing (Python): A library for constructing grammars and parsers in Python using a
combinator-based approach. It allows for easy and intuitive grammar definitions.
● FastParse (Scala): A parser combinator library for Scala, offering a functional approach
to building parsers.
3. Parser Libraries
There are libraries available for various languages that provide built-in parsing capabilities for
common grammar types:
● Tree-sitter: An incremental parser used in many modern text editors to provide syntax
highlighting and code navigation.
● ANTLR4 Runtime Libraries: For various programming languages, including Java, C++,
Python, and more, making it easier to integrate ANTLR-generated parsers into your
projects.
Summary
While manually implementing an LR(1) parser is an educational exercise and useful for
understanding the underlying mechanisms, in practice, using parser generators, parser
combinators, or specialized libraries is much more efficient and manageable. These tools
abstract away the low-level details and allow developers to focus on defining and using
grammars without dealing with the complexities of state management and transition design.
What exactly does a parser generator provide?
1. Lexical Analysis (Tokenization): Some parser generators include tools for generating
lexical analyzers (or lexers) that tokenize the input text. This is often done with a
separate tool or integrated with the parser generator.
2. Parsing: The core functionality of a parser generator is to produce code that can parse
the input according to the grammar rules. This involves creating state machines or tables
that represent the parsing logic.
3. Error Handling: Parser generators often include mechanisms for handling syntax errors,
providing useful feedback if the input does not conform to the grammar.
4. Tree Construction: Many parser generators build data structures like Abstract Syntax
Trees (ASTs) during parsing, which represent the hierarchical structure of the input.
Using a parser generator usually involves three steps:
1. Define the Grammar: You write the grammar of your language in the generator's notation (for ANTLR, a .g4 file).
2. Generate the Parser: You run the tool, which produces lexer and parser code in your target language.
3. Integrate and Use: You integrate the generated parser code into your application. You
can then use it to process input text, analyze it according to the grammar, and perform
actions based on the results.
In summary, parser generators streamline the process of creating parsers by automating the
generation of parsing code based on formal grammar specifications. This allows developers to
focus on defining the language syntax rather than dealing with low-level parsing mechanics.
Let's demonstrate how parser generators work using Antlr. Have you ever
heard of Antlr before?
ANTLR (ANother Tool for Language Recognition) is a powerful parser generator used to read,
process, execute, or translate structured text or binary files. It's widely used for building
compilers, interpreters, and data processors. ANTLR takes a grammar that defines the syntax of
a language and automatically generates a parser (and optionally a lexer) that can recognize and
process text according to that grammar.
ANTLR supports multiple languages, including Java, Python, C#, and JavaScript, making it
flexible for various development environments. Its grammars are highly expressive, supporting
context-free grammars, and it can automatically generate code to handle tokens, parse trees,
and syntax analysis.
Some of ANTLR's key features:
● Lexer and Parser Generation: Automatically creates a lexer (to recognize tokens) and
a parser (to build the syntax tree) from a given grammar.
● Listener and Visitor Patterns: It generates base classes that you can extend to define
custom behaviors during the parsing process, making it easier to build syntax-directed
interpreters and translators.
By using ANTLR, developers can focus on defining the language structure through grammars,
and ANTLR takes care of the low-level details of writing parsers and lexers.
To try ANTLR, first download the ANTLR jar (here, into /usr/local/lib):
$ cd /usr/local/lib
$ curl -O http://www.antlr.org/download/antlr-4.13.1-complete.jar
Add that path to the CLASSPATH, so that Java can find that .jar, i.e., the ANTLR tool:
$ export CLASSPATH="/usr/local/lib/antlr-4.13.1-complete.jar:$CLASSPATH"
$ java org.antlr.v4.Tool
ANTLR Parser Generator Version 4.13.1
...
We shall be using the Python port of Antlr. To install the Python3 runtime, just do:
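$ pip install antlr4-python3-runtime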
Can you show an example of an ANTLR grammar?
This grammar defines a simple structure for arithmetic expressions involving addition (+),
multiplication (*), parentheses, and numbers. It is written in ANTLR syntax and can be used to
generate a parser for these expressions. Let’s break down each part:
grammar Expr;
expr : expr '+' term | term ;
term : term '*' factor | factor ;
factor : '(' expr ')' | NUMBER ;
NUMBER : [0-9]+;
WS : [ \t\r\n]+ -> skip;
Here, expr handles addition, term handles multiplication (so multiplication binds tighter than addition), factor matches numbers or parenthesized expressions, NUMBER is the token for integer literals, and the WS rule discards whitespace.
How do we generate the parser, given that we have the .g4 file?
If we have a .g4 file, containing a grammar like the one we just described, then you can
generate all the files that implement the parser with the command:
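$ java org.antlr.v4.Tool -Dlanguage=Python3 Expr.g4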
You will see that this command creates several python files, which implement the parser.
And how can we use these files to write a program that computes the value
of arithmetic expressions?
We can extend one of the classes produced by Antlr, the ExprListener. To compute the value
of an arithmetic expression using the ExprListener, we can extend it by implementing the
exitExpr, exitTerm, and exitFactor methods in ExprEval. We'll use these methods to
evaluate the arithmetic expressions as the parser traverses the tree. Specifically, we'll store
intermediate results and perform operations when we exit the relevant parse tree nodes.
from ExprListener import ExprListener

class ExprEval(ExprListener):
    def __init__(self):
        # Stack to store intermediate values during traversal
        self.stack = []

    def getResult(self):
        # After parsing is complete, the result will be the last value in the stack
        return self.stack.pop()
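The exit methods pop their operands off this stack and push the result back. A minimal sketch of the three methods (they go inside ExprEval; checking getChildCount is one way to tell an expr '+' term node apart from a bare term):
    def exitFactor(self, ctx):
        # A factor is either a NUMBER or '(' expr ')'; in the latter case
        # the inner expression's value is already on the stack
        if ctx.NUMBER():
            self.stack.append(int(ctx.NUMBER().getText()))

    def exitTerm(self, ctx):
        # 'term * factor' has three children; a bare factor has one
        if ctx.getChildCount() == 3:
            right = self.stack.pop()
            left = self.stack.pop()
            self.stack.append(left * right)

    def exitExpr(self, ctx):
        # 'expr + term' has three children; a bare term has one
        if ctx.getChildCount() == 3:
            right = self.stack.pop()
            left = self.stack.pop()
            self.stack.append(left + right)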
Explanation:
● exitExpr: If the current expression is of the form expr + term, it pops two values
from the stack (the left and right sub-expressions) and pushes their sum.
● exitTerm: If the current term is of the form term * factor, it pops two values (left
and right sub-terms) and pushes their product.
We still need a driver, i.e., a program to run an instance of ExprEval. Can
you do it?
You can use the following code to run the parser and apply the ExprEval listener:
import sys
from antlr4 import *
from ExprLexer import ExprLexer
from ExprParser import ExprParser
from ExprEval import ExprEval

def main(input_stream):
    # Create the lexer and parser
    lexer = ExprLexer(InputStream(input_stream))
    stream = CommonTokenStream(lexer)
    parser = ExprParser(stream)
    tree = parser.expr()  # Start parsing at the expr rule

    # Walk the parse tree with the evaluating listener
    evaluator = ExprEval()
    walker = ParseTreeWalker()
    walker.walk(evaluator, tree)
    print(f"Result: {evaluator.getResult()}")

if __name__ == '__main__':
    input_expr = input("Enter an expression: ")
    main(input_expr)
This implementation walks through the tree produced by ANTLR and evaluates the arithmetic
expression based on the operations defined in the grammar. For instance:
% python3 main.py
Enter an expression: 2 + 3 * 4
Result: 14
So what did ANTLR do for us in this example?
1. Automatic Lexer and Parser Generation:
● Without ANTLR: If you were to write a parser manually, you would need to create your
own lexer (to identify tokens like numbers and operators) and design the entire parser by
handling token sequences, managing recursion, and ensuring correct grammar rule
application.
● With ANTLR: You only need to define your grammar in a high-level format (like the one
you provided), and ANTLR automatically generates both the lexer (which identifies
tokens) and the parser (which builds the parse tree and enforces grammar rules). This
means you don't need to manually write code to handle recursion, state transitions, or
token recognition—ANTLR does it all.
2. Listener/Visitor Framework:
● Without ANTLR: If you wrote a parser manually, you would need to create your own
logic to traverse the parse tree, identify which grammar rules were applied, and figure
out what to do when you encounter certain expressions (like addition, multiplication, or
parentheses).
● With ANTLR: ANTLR generates listener and visitor patterns, which allow you to easily
respond to parsing events. For example, in the ExprEval.py, the enter and exit
methods of ExprListener are invoked automatically during parsing. You simply need
to override these methods to execute your desired logic (like evaluating expressions).
3. Error Handling:
● Without ANTLR: You would need to manually implement error handling to deal with
syntax errors or malformed input, which can be tricky.
● With ANTLR: ANTLR automatically provides mechanisms for syntax error detection and
reporting. You can customize error handling if needed, but the basic error checking is
already built into the generated parser.
4. Handling Complex Grammars:
● Without ANTLR: Designing a parser for more complex grammars (e.g., involving
precedence, associativity, or recursive expressions) can be error-prone and tedious.
Manually handling operator precedence, for example, can become very complicated.
● With ANTLR: ANTLR handles precedence, associativity, and recursion for you. The
recursive nature of grammar rules (as in your expr rule) is managed seamlessly by
ANTLR, making it much easier to handle complex expressions like 3 + 4 * 5 where
multiplication has higher precedence than addition.
5. Multi-Language Support:
● Without ANTLR: Writing a parser in multiple languages (e.g., Python, Java, C#,
JavaScript) would require developing and maintaining separate implementations for
each language.
● With ANTLR: You can define the grammar once and generate parsers for different
languages simply by specifying the target language (-Dlanguage=Python3 for Python,
for example).
6. Parse Tree Construction:
● Without ANTLR: You would need to manually build and traverse parse trees or ASTs,
which requires extra effort.
● With ANTLR: The generated parser automatically builds a parse tree as it processes
input, and you can traverse this tree easily using listeners or visitors. This makes it
simpler to evaluate expressions, generate code, or perform other tasks based on the
input.
To summarize the contrast:
● Recursive Parsing Logic. Without ANTLR: write recursive functions for each grammar rule manually. With ANTLR: the recursive parsing logic is generated for you.
● Error Handling. Without ANTLR: implement custom error detection and reporting for syntax issues. With ANTLR: built-in syntax error handling and reporting.
● Tree Building (Parse Tree/AST). Without ANTLR: manually construct a parse tree or AST during parsing. With ANTLR: the parse tree is built automatically during parsing.
● Multi-language Support. Without ANTLR: write separate parsers for each target language (e.g., Java, Python). With ANTLR: write the grammar once and generate parsers for multiple languages.
● Precedence and Associativity. Without ANTLR: handle precedence and associativity manually in the code. With ANTLR: define both directly in the grammar, and ANTLR handles them.
What kind of parsers does ANTLR generate?
ANTLR generates adaptive LL(*), also written ALL(*), parsers:
1. LL: ANTLR parsers are top-down parsers that process input from left to right and build a
leftmost derivation of the input string. The "LL" refers to this left-to-right, leftmost
derivation process.
2. (*): The * means that ANTLR can look ahead at an arbitrary number of tokens to make
parsing decisions. Unlike traditional LL(k) parsers, which use a fixed amount of
lookahead (k), ANTLR’s LL(*) parser dynamically determines how much lookahead is
needed at each decision point.
3. Adaptive LL(*): ANTLR doesn't hard-code how many tokens it needs to look ahead.
Instead, it adapts the amount of lookahead based on the input it’s parsing, making it
more powerful than LL(k) parsers and able to handle complex grammars, including
certain ambiguous or nondeterministic constructs that would be difficult or impossible for
traditional LL parsers to manage.
4. Recursive-descent: The parsers that ANTLR generates are recursive-descent parsers,
meaning they use recursive functions to process each grammar rule, which is typical for
LL parsers.
● Direct Left Recursion: ANTLR can handle direct left recursion through an internal
transformation of the grammar rules.
● Backtracking: While ANTLR typically avoids backtracking due to its efficiency, LL(*)
parsing enables it to backtrack and explore multiple lookahead options if needed.
● Error Handling: ANTLR parsers include robust error recovery strategies, which are
well-suited for practical programming languages and tools.
But LL cannot deal with left recursion, right? And our example grammar is
left recursive. How come?
As for left recursion, classical LL parsers (including LL(k)) cannot handle it because they would
enter an infinite loop when encountering rules with left recursion. However, ANTLR can handle
direct left recursion through a specific transformation technique.
ANTLR rewrites left-recursive rules to make them non-left-recursive during parser generation. It
does this automatically through a process known as left-recursion elimination. The generated
parser no longer follows the standard LL(*) behavior for these rules but instead uses a more
specialized parsing strategy for handling recursion.
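For example, take the expr rule of our arithmetic grammar:
expr : expr '+' term
     | term ;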
This rule is left-recursive because expr appears on the left-hand side of its own production.
Normally, this would cause infinite recursion in an LL parser. However, ANTLR rewrites this
internally into a non-left-recursive form that looks more like this:
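expr : term ('+' term)* ;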
This transformation eliminates the direct left recursion by converting it into a loop that can be
handled by the LL(*) parsing strategy.
ANTLR does not automatically handle indirect left recursion, where recursion occurs across
multiple rules. In such cases, you need to manually refactor the grammar to remove the
recursion or redesign it to avoid the issue.
Summary
● ANTLR produces LL(*) parsers, which extend LL(k) by allowing arbitrary lookahead.
● ANTLR automatically eliminates direct left recursion by transforming the grammar
during parser generation.
● It cannot automatically handle indirect left recursion, which needs to be addressed
manually.
This flexibility is one of the reasons ANTLR can handle a wide range of grammars, including
those that would traditionally be problematic for LL parsers.
Parser generators are not the only option: we can also build parsers with parser combinators. Here is a small example using Parsy, a parser combinator library for Python:
import sys
from datetime import date
from parsy import regex, seq, string, ParseError

def build_parser():
    year = regex("[0-9]{4}").map(int).desc("4 digit year")
    month = regex("[0-9]{2}").map(int).desc("2 digit month")
    day = regex("[0-9]{2}").map(int).desc("2 digit day")
    dash = string("-")
    fulldate = seq(
        year=year << dash,
        month=month << dash,
        day=day,
    ).combine_dict(date)
    return fulldate

def parse_dates():
    fulldate = build_parser()
    try:
        for line in sys.stdin:
            parsed_date = fulldate.parse(line.strip())
            print(parsed_date)
    except ParseError:
        print("Done parsing.")

parse_dates()
The program reads a string in the format YYYY-MM-DD and parses it into a Python date object.
If the input is in the correct format, it prints the resulting date object; otherwise, it raises a
parsing error.
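For example, assuming the program is saved as parse_date.py:
$ echo "2017-01-02" | python3 parse_date.py
2017-01-02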
This approach simplifies parsing by breaking the task down into modular components (year,
month, day) and using clear and concise syntax to combine these components into a working
parser. Here's an explanation of the key points of your code using Parsy:
● Imports: sys reads lines from standard input; date (from datetime) is the object we build; and parsy supplies the combinators regex, seq, and string, plus the ParseError exception.
● Parser Composition: Parsy allows you to break down complex parsing tasks into
smaller, composable pieces. Each part of the date (year, month, day) has its own parser,
and these are combined sequentially with seq.
● Chaining Parsers: The << operator is used to chain parsers together, enforcing the
presence of delimiters like dashes ("-").
● Mapping and Transforming: .map(int) is used to transform the parsed string into an
integer. Parsy makes it easy to transform input data while parsing.
● Combining into a Result: .combine_dict(date) takes the parsed dictionary of
values and uses them to create a date object, which is returned after successful
parsing.
The combine_dict method in Parsy can be used to combine parsed values into any kind of
object, not just date objects. The key feature of combine_dict is that it collects the parsed
results into a dictionary, where each key corresponds to a named part of the parsing result, and
then it passes this dictionary to a function or constructor that you provide. The function can
produce any kind of object, depending on what you want.
In your code, combine_dict(date) is being used to take the parsed year, month, and day
from the input (as dictionary keys) and then call the date constructor from the datetime
module to create a date object.
1. The parser extracts the year, month, and day from the input string and maps them to
integers.
2. These values are labeled as year, month, and day (because of the named arguments
in seq(year=..., month=..., day=...)).
3. combine_dict takes these values and calls the date constructor like
date(year=..., month=..., day=...), producing a date object.
However, you can pass any other function or constructor to combine_dict. For example, if
you wanted to produce a dictionary or some custom object instead of a date object, you could
define a different function or class.
Here is the parser we have, which produces date objects:
def build_parser():
    year = regex("[0-9]{4}").map(int).desc("4 digit year")
    month = regex("[0-9]{2}").map(int).desc("2 digit month")
    day = regex("[0-9]{2}").map(int).desc("2 digit day")
    dash = string("-")
    fulldate = seq(
        year=year << dash,
        month=month << dash,
        day=day,
    ).combine_dict(date)
    return fulldate
And here is a version that produces a custom object instead:
class CustomDate:
    def __init__(self, year, month, day):
        self.year = year
        self.month = month
        self.day = day

    def __repr__(self):
        return f"CustomDate(year={self.year}, month={self.month}, day={self.day})"

def build_parser():
    year = regex("[0-9]{4}").map(int).desc("4 digit year")
    month = regex("[0-9]{2}").map(int).desc("2 digit month")
    day = regex("[0-9]{2}").map(int).desc("2 digit day")
    dash = string("-")
    fulldate = seq(
        year=year << dash,
        month=month << dash,
        day=day,
    ).combine_dict(CustomDate)  # Uses the CustomDate constructor
    return fulldate
So, combine_dict is flexible—it can produce any object as long as you provide the
appropriate function or constructor.
Can you modify the implementation a bit, to print the number of days from
the given date to the current date?
To calculate the number of days between each parsed date and the current date, you can
modify the parse_dates function to use Python's datetime.date object and calculate the
difference between the parsed date and the current date using subtraction.
import sys
from datetime import date
from parsy import regex, seq, string, ParseError

def build_parser():
    year = regex("[0-9]{4}").map(int).desc("4 digit year")
    month = regex("[0-9]{2}").map(int).desc("2 digit month")
    day = regex("[0-9]{2}").map(int).desc("2 digit day")
    dash = string("-")
    fulldate = seq(
        year=year << dash,
        month=month << dash,
        day=day,
    ).combine_dict(date)
    return fulldate

def parse_dates():
    fulldate = build_parser()
    try:
        for line in sys.stdin:
            parsed_date = fulldate.parse(line.strip())
            today = date.today()
            delta = today - parsed_date
            print(f"Parsed Date: {parsed_date}, Days until today: {delta.days}")
    except ParseError as e:
        print(f"Parse error: {e}")

parse_dates()
Changes:
1. Difference Calculation: After parsing each date, I calculate the difference between the
parsed date and the current date using today = date.today() and delta =
today - parsed_date.
2. Printing the Result: For each line, the script now prints both the parsed date and the
number of days from that date to today.
Could you further modify the parser, so that we could accept dates like
2017-01-02, or 2017-1-02, or 2017-01-2 or 2017-1-2?
Sure! To make the parser more lenient and accept dates in formats such as 2017-01-02,
2017-1-02, 2017-01-2, or 2017-1-2, you need to adjust the regular expressions for months
and days to allow for one or two digits.
def build_parser():
    year = regex("[0-9]{4}").map(int).desc("4 digit year")
    # Allow one or two digits for month and day
    month = regex("[0-9]{1,2}").map(int).desc("1 or 2 digit month")
    day = regex("[0-9]{1,2}").map(int).desc("1 or 2 digit day")
    dash = string("-")
    fulldate = seq(
        year=year << dash,
        month=month << dash,
        day=day,
    ).combine_dict(date)
    return fulldate
Explanation:
1. Month and Day Changes: I changed the regular expressions for month and day from
"[0-9]{2}" to "[0-9]{1,2}", allowing either one or two digits for both.
2. Functionality: This parser will now accept any date in the format YYYY-MM-DD, where
the month and day can be written with one or two digits.
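For example, assuming the script is saved as parse_date.py, all four spellings now parse to the same date:
$ echo "2017-1-2" | python3 parse_date.py
Parsed Date: 2017-01-02, Days until today: ...
(The day count depends on the day you run it.)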
I would like to modify the parser to read the name of a person before the
date, as in "Fernando Pereira: 1980-01-07". Could you do it?
To handle this new input format, we need to modify the parser to recognize the person's name
followed by the date. The name can be a combination of letters and spaces, and we'll separate
the name from the date by a colon and a space (: ). Then we'll combine the name with the date
parsing to calculate the number of days and print the appropriate message. Here's the modified
code:
import sys
from datetime import date, datetime
from parsy import regex, seq, string, ParseError

def build_parser():
    # A name can contain letters and spaces
    name = regex("[A-Za-z ]+").desc("person's name")
    year = regex("[0-9]{4}").map(int).desc("4 digit year")
    month = regex("[0-9]{1,2}").map(int).desc("1 or 2 digit month")
    day = regex("[0-9]{1,2}").map(int).desc("1 or 2 digit day")
    dash = string("-")
    colon_space = string(": ").desc("colon and space after name")
    fulldate = seq(
        year=year << dash,
        month=month << dash,
        day=day,
    ).combine_dict(date)
    fullparser = seq(
        name=name << colon_space,
        birthdate=fulldate
    )
    return fullparser

def parse_dates():
    fullparser = build_parser()
    today = datetime.today().date()
    try:
        for line in sys.stdin:
            parsed_data = fullparser.parse(line.strip())
            person_name = parsed_data['name']
            birthdate = parsed_data['birthdate']
            delta = today - birthdate
            print(f"{person_name} lived {delta.days} days.")
    except ParseError as e:
        print(f"Error parsing input: {e}")

parse_dates()
Changes:
1. Name Parsing: I added a name parser using the regular expression "[A-Za-z ]+" to
match a sequence of letters and spaces.
2. Colon Separator: I added a colon_space parser to match the : separating the name
and the date.
3. Full Parsing: The fullparser now combines the name and birthdate using seq to
create a parser that expects both parts in the correct order.
4. Output Format: The function now prints how many days each person has lived, based
on the parsed input.
What are the advantages of parser combinators, in general?
Advantages
1. Modularity: Parser combinators promote modular design. You build small parsers for
individual components and then combine them to handle more complex structures.
2. Flexibility: You can easily extend or modify parsers by adding or changing combinators
without altering existing parsers significantly.
3. Ease of Use: They allow you to write parsers directly in the programming language of
your choice, often making them more intuitive and easier to integrate with the rest of
your code.
4. Declarative Style: Parser combinators often lead to a more declarative style of
programming, where you describe what you want to parse rather than how to parse it.
Disadvantages
1. Performance: Combinator parsers often carry interpretive overhead and can be slower than the table-driven parsers produced by generators.
2. Complex Grammars: Large or ambiguous grammars can become verbose and hard to manage.
3. Error Handling: Good error messages and recovery require explicit effort from the developer.
Let's compare parsy with Antlr. How would the days-since-birth app be
implemented in Antlr?
Let’s walk through how you can implement a similar parser in ANTLR.
You can define the grammar for parsing names and dates in ANTLR in a file called
DateParser.g4:
grammar DateParser;
// Root rule
entry: person_name COLON_SPACE date EOF;
// Parser rules
person_name: NAME (SPACE NAME)*;
date: year DASH month DASH day;
year: DIGIT DIGIT DIGIT DIGIT;
month: DIGIT DIGIT?;
day: DIGIT DIGIT?;
// Lexer rules
NAME: [A-Za-z]+;
SPACE: ' ';
DASH: '-';
COLON_SPACE: ': ';
DIGIT: [0-9];
WS: [ \t\r\n]+ -> skip; // Skipping whitespace
● entry: This is the root rule that matches a person’s name, followed by a colon and
space (COLON_SPACE), and then the date.
● person_name: Matches the name, which consists of one or more words (NAME),
separated by spaces.
● date: Matches a date in the format YYYY-MM-DD.
● year, month, day: These rules ensure proper date format.
To generate the Python parser from this grammar, run ANTLR on the grammar file:
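$ java org.antlr.v4.Tool -Dlanguage=Python3 DateParser.g4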
Now you can write a Python script to use the generated parser and process the parsed data.
Below is the Python code for parsing input and calculating the number of days since the birth
date:
import sys
from datetime import date, datetime
from antlr4 import *
from DateParserLexer import DateParserLexer
from DateParserParser import DateParserParser
from DateParserListener import DateParserListener
from antlr4.error.Errors import ParseCancellationException

class DateListener(DateParserListener):
    def __init__(self):
        self.name = None
        self.year = None
        self.month = None
        self.day = None

    def exitPerson_name(self, ctx):
        self.name = ctx.getText()

    def exitYear(self, ctx):
        self.year = int(ctx.getText())

    def exitMonth(self, ctx):
        self.month = int(ctx.getText())

    def exitDay(self, ctx):
        self.day = int(ctx.getText())

def main():
    today = datetime.today().date()
    try:
        for line in sys.stdin:
            # Create a new input stream and lexer/parser
            input_stream = InputStream(line.strip())
            lexer = DateParserLexer(input_stream)
            stream = CommonTokenStream(lexer)
            parser = DateParserParser(stream)
            tree = parser.entry()

            # Walk the parse tree, collecting the name and the date
            listener = DateListener()
            walker = ParseTreeWalker()
            walker.walk(listener, tree)

            birthdate = date(listener.year, listener.month, listener.day)
            delta = today - birthdate
            print(f"{listener.name} lived {delta.days} days.")
    except ParseCancellationException as e:
        print(f"Error parsing input: {e}")

if __name__ == '__main__':
    main()
Explanation:
1. DateListener: This class extends the ParseTreeListener from ANTLR. It listens for
exit events when the parser has finished processing a person’s name and date, and
stores the parsed information (name, year, month, day).
2. main():
○ For each line of input, an ANTLR input stream, lexer, and parser are created.
○ The DateListener walks the parse tree, collecting information.
○ Once the listener has gathered the name and date, the script calculates how
many days have passed since the birthdate and prints the result.
3. Error Handling: The script catches ParseCancellationException to handle parse
errors gracefully.
Can you summarize the differences between Antlr and Parsy using a table?
ANTLR is generally more powerful and suitable for larger and more complex grammars. It
provides a clearer separation between lexing and parsing phases and supports advanced
features such as error recovery. However, for small and simple parsers like date parsing, Parsy
offers a quicker and more Pythonic approach without the need to generate code from a
grammar file. Here's a summary of differences:
● Parser Definition. Parsy: uses Python code and regex-based parsers. ANTLR: uses grammar files to define parsing rules.
● Lexer. Parsy: built in, via the regex parser. ANTLR: a separate lexer defined in the grammar.
● Error Handling. Parsy: exception-based, with ParseError. ANTLR: more advanced error handling, with listeners.
● Use Case. Parsy: simple, small parsing tasks. ANTLR: large, complex language grammars.
Could you list and summarize some of the parser generators that exist out
there?
Here's a brief overview of some popular parser generators that one should consider. These
tools differ in terms of flexibility, ease of use, and performance. Depending on the complexity of
the grammar and the environment, different parser generators might be more appropriate:
1. ANTLR
● Languages: Supports multiple languages including Java, Python, C#, JavaScript, etc.
● Overview: ANTLR is a powerful tool for generating parsers for any language, and it
provides built-in support for generating parsers, lexers, and even listeners and visitors
for tree traversal. It's widely used in both academia and industry for creating interpreters,
compilers, and DSLs (domain-specific languages).
● Example Use: It’s excellent for building parsers with well-defined grammars for complex
languages like SQL, programming languages, or configuration files.
2. Bison
● Languages: C/C++
● Overview: Bison is the GNU extension of Yacc (Yet Another Compiler Compiler). It generates parsers for LALR(1) grammars and is commonly paired with Flex for lexing.
● Example Use: Widely used to implement compilers and interpreters, including parts of C/C++ toolchains.
3. PEG.js
● Languages: JavaScript
● Overview: PEG.js is a simple parser generator for JavaScript that uses Parsing
Expression Grammar (PEG) instead of context-free grammars like ANTLR. It’s
lightweight and easy to use, making it suitable for web-based applications and quick
parsing tasks.
● Example Use: Useful for building parsers for web-based DSLs or small configuration file
formats.
4. Lemon
● Languages: C
● Overview: Lemon is a parser generator developed by D. Richard Hipp (the author of
SQLite). It is designed to be simpler and more efficient than Yacc/Bison. Lemon
generates parsers for LALR grammars.
● Example Use: SQLite uses Lemon to handle SQL parsing. It's efficient and well-suited
for embedded systems or software requiring high performance.
5. Parboiled
● Languages: Java, Scala
● Overview: Parboiled is a PEG-based parsing library that builds parsers at runtime, with no separate code-generation step.
6. PLY
● Languages: Python
● Overview: PLY is a Python implementation of the traditional Lex and Yacc tools,
providing a way to write lexers and parsers in Python. It’s particularly useful for Python
developers who need to quickly create parsers using Python.
● Example Use: Useful for Python-based compilers, interpreters, or even parsers for
specific file formats or protocols.
7. Ragel
● Languages: C, C++, Go, Java, Ruby, and others
● Overview: Ragel compiles finite state machines into source code and is often used to build very fast lexers and protocol parsers.
8. Spirit (Boost)
● Languages: C++
● Overview: Spirit is a library in the Boost C++ Libraries that allows for writing parsers
directly in C++ using template metaprogramming. It provides a very expressive and
powerful way to write recursive-descent parsers.
● Example Use: Suitable for C++ developers looking to create complex parsers with full
integration in C++ projects.
9. Flex (with Bison)
● Languages: C/C++
● Overview: Flex is a fast lexical analyzer generator, often used with Bison. Flex is used to
recognize tokens from the input, and Bison generates the parser.
● Example Use: Commonly used together for creating compilers, interpreters, or any tool
that needs to understand a language’s structure.
And what about parser combinator libraries?
Sure! Here are some popular parser combinator libraries:
1. Parsy
● Languages: Python
● Overview: Parsy is a simple and lightweight parser combinator library for Python. It
allows you to construct complex parsers by combining small, reusable parsers. It’s
particularly easy to use and integrate into Python projects.
● Example Use: Great for parsing simple DSLs, date formats, and custom file formats
within Python applications.
● Why Use It: Easy to learn, integrates well with Python code, and is great for lightweight
parsing tasks.
2. Parsec
● Languages: Haskell
● Overview: Parsec is one of the most well-known parser combinator libraries, developed
in Haskell. It allows you to build parsers for complex languages in a modular and
readable way, where parsers are defined as first-class values.
● Example Use: Ideal for parsing complex languages, building interpreters, or working
with DSLs in Haskell.
● Why Use It: Parsec is highly expressive and can handle complex, recursive grammars
with ease. It’s used extensively in Haskell for language and data parsers.
3. Scala Parser Combinators
● Languages: Scala
● Overview: This library is a parser combinator framework in the Scala standard library. It
allows developers to write parsers as Scala code, composing smaller parsers into larger
ones, making it ideal for DSLs and other parsing tasks.
● Example Use: Useful for creating parsers for DSLs in Scala, as well as processing
complex structured data.
● Why Use It: Seamlessly integrates with Scala’s functional programming style, making
parsers highly compositional and expressive.
4. Attoparsec
● Languages: Haskell
● Overview: Attoparsec is a fast, efficient parser combinator library in Haskell, designed
for parsing binary and text data. It emphasizes high performance and efficiency,
especially for network protocols or large data streams.
● Example Use: Parsing large text files or binary protocols, such as JSON or HTTP.
● Why Use It: Offers better performance compared to Parsec in some scenarios,
especially when working with large data streams.
5. Elm Parser
● Languages: Elm
● Overview: Elm’s parser library provides parser combinators for building parsers in the
Elm programming language. It follows a similar approach as Parsec, allowing complex
grammars to be built using small parsers.
● Example Use: Building custom parsers for Elm-based applications, such as parsing
JSON or custom markup languages.
● Why Use It: Works natively in Elm, a functional language for building web applications,
making it a good fit for web-based parsing tasks.
6. FastParse
● Languages: Scala
● Overview: FastParse is a fast and efficient parser combinator library for Scala. It’s
designed to be both easy to use and high-performance, making it suitable for
production-level parsing tasks.
● Example Use: Parsing complex text-based formats like programming languages or
configuration files.
● Why Use It: Combines ease of use with performance, and it’s especially good for
situations where you need to build a custom parser quickly but also need good runtime
efficiency.
7. Parsimonious
● Languages: Python
● Overview: Parsimonious is another Python library that uses a parsing expression
grammar (PEG) approach, which is closely related to parser combinators. It allows for
creating simple, elegant parsers with Pythonic syntax.
● Example Use: Parsing structured data formats or defining grammars for custom DSLs in
Python.
● Why Use It: It provides a concise syntax for grammar definitions and integrates well with
Python projects.
8. FParsec
● Languages: F#
● Overview: FParsec is a parser combinator library for F#, designed for building parsers in
a functional style. Like Parsec in Haskell, it is highly modular and allows for defining
parsers by combining smaller ones.
● Example Use: Useful for writing parsers for DSLs, programming languages, or
configuration files in F#.
● Why Use It: Highly expressive and works well with F#’s functional features, making it a
popular choice for F# developers working on parsing tasks.
9. Megaparsec
● Languages: Haskell
● Overview: Megaparsec is a modern and flexible parser combinator library in Haskell. It
is similar to Parsec but includes more features, better error reporting, and additional
combinators for more complex parsing needs.
● Example Use: Parsing complex grammars for languages and DSLs in Haskell, while
benefiting from rich error reporting.
● Why Use It: It’s more feature-rich than Parsec, providing enhanced error handling and
better support for advanced parsing scenarios.
10. Kombu
● Languages: Python
● Overview: Kombu is a simple parser combinator library for Python. It allows you to
create combinators for parsing data structures or languages in a way that is similar to
Parsec, but for Python.
● Example Use: Small-to-medium-scale parsers for structured data formats.
● Why Use It: Lightweight and easy to use in Python, making it good for simpler parsing
tasks.
Ok, so, can you again, briefly compare parser generators and parser
combinators in terms of advantages and challenges?
Here’s a more detailed expansion on the comparison between Parser Generators and Parser
Combinators, touching on their strengths, challenges, and applications:
Parser Generators
● Strengths: They handle large, complex, and recursive grammars efficiently, and they provide automatic error reporting and recovery.
● Challenges: They require a separate toolchain and build step, and the generated code can be hard to read and debug.
● Examples:
○ ANTLR: Used for defining grammars for large-scale languages, including many
programming languages (e.g., SQL, Java).
○ Yacc/Bison: Classic parser generators that are often used for implementing
language compilers and interpreters.
Parser Combinators
● Strengths: They are modular and compositional; parsers live directly in the host language and are easy to extend and test incrementally.
● Challenges: Error handling is largely manual, and performance can suffer on large or ambiguous grammars.
● Examples:
○ Parsec and Megaparsec (Haskell), FastParse (Scala): widely used combinator libraries.
○ Parsy (Python): the library used in our running example.
Handling Complex Grammars:
● Parser Generators: Better suited for complex, recursive grammars, especially when
there are ambiguities. Tools like ANTLR or Bison come with built-in capabilities to handle
such complexities, using techniques like backtracking, lookahead, or left-factoring.
● Parser Combinators: More challenging to handle complex or ambiguous grammars,
especially as grammars grow larger. While possible, it often leads to verbose and less
efficient code.
Error Handling:
● Parser Generators: Typically provide robust error handling and recovery mechanisms
automatically, which is crucial when dealing with user inputs or complex language
constructs.
● Parser Combinators: Error handling is more manual, but tools like Parsec and
Megaparsec provide good support for custom error messages and error recovery. Still, it
requires more explicit effort from the developer.
Debugging:
● Parser Generators: Debugging generated code can be tricky because you’re working
with auto-generated code that might not be as readable or understandable as
hand-written code.
● Parser Combinators: Easier to debug as parsers are written directly in your codebase,
making it straightforward to test, trace, and modify.
Incremental Development:
● Parser Generators: Less flexible for incremental development. Changing the grammar
typically requires re-generating the lexer and parser, which can be a heavier process,
especially if the grammar is large.
● Parser Combinators: Much more suited for incremental development. Parsers can be
extended and modified directly in the code, allowing for more agile iterations.
Performance:
● Parser Generators: The table-driven parsers they emit are typically fast and predictable, which matters for large inputs.
● Parser Combinators: Often carry some interpretive overhead, although libraries like Attoparsec and FastParse are engineered for speed.
Final Comparison
● For complex grammars: Parser generators (ANTLR, Bison, Yacc) are better because
they efficiently handle ambiguity, recursion, and performance. They are more suited for
parsing formal languages like programming languages or network protocols.
● For simpler or DSL grammars: Parser combinators offer more flexibility, making them
easier to use and modify for lightweight parsing tasks. They shine in functional
programming ecosystems where incremental, compositional parsing is needed.
● For error handling: Parser generators usually have the upper hand, as they provide
automatic, detailed error reporting and recovery.
● For development ease: Parser combinators are generally easier to use in development
because they don’t require an external toolchain, and parsers can be incrementally
updated. This makes them ideal for quick prototyping and smaller grammars.
Each approach has its strengths, and the choice depends on the complexity of the grammar,
performance needs, and how much flexibility is required in the parsing process.
Parser generators have a long track record in real programming languages:
● ANTLR: Used in languages like C#, Python, SQL dialects, and even for parts of Java.
● Bison/Yacc: Frequently used for building language compilers, such as C/C++ compilers.
● PEG-based tools like Pest and PEG.js are increasingly used for languages that need
more flexibility in their syntax parsing.
Parser Combinators, while not as commonly used in full-scale programming languages, are
excellent for creating DSLs or embedded languages within other systems.
A good place to explore real grammars is the grammars-v4 repository maintained by the ANTLR project (https://github.com/antlr/grammars-v4). It offers grammars for a wide range of popular languages—like Java 9, C, Python 3.12, and
SQL—as well as more niche or domain-specific languages and formats. By studying these
grammars, you can gain insights into how different languages are structured and how their
syntax rules are formalized. This can be especially useful for comparing the syntax of different languages.
Beyond exploration, this repository can help you jumpstart your own projects. Whether you are
building a compiler, code analyzer, interpreter, or a custom tool that needs to process a specific
language, you can leverage the ready-made grammars to avoid the time-consuming task of
writing one from scratch. You can modify the grammars as needed to suit your specific use case
and combine them with ANTLR’s powerful parsing capabilities to create efficient
language-processing tools.
Using one of these grammars typically involves three steps:
1. Choose a Language: Select the grammar for the language you're interested in from the
repository.
2. Generate a Parser: Use ANTLR to generate a parser from the grammar.
3. Parse Code: Feed the parser some code written in the chosen language. The parser will
analyze the code and produce an Abstract Syntax Tree (AST), which represents the
structure of the code.
Comparing Languages
By examining the grammars for different languages, you can gain insights into their similarities
and differences. For example, you might notice that some languages have more complex syntax
than others, or that they use different control flow structures.
Creating Tools
Once you have a parser for a language, you can use it to create various tools, such as:
● Code Analyzers: Tools that can identify potential issues in code, such as syntax errors
or performance problems.
● Code Generators: Tools that can automatically generate code based on a given
specification.
● Translators: Tools that can translate code from one language to another.
What's the history of these parser generators? Did it start with yacc?
Well, not really. It might have started a bit before that. While Yacc (Yet Another
Compiler-Compiler), developed by Stephen C. Johnson in 1975, is one of the earliest and most
influential parser generators, it was not the first ever invented. The history of parser generators
goes back further, as part of the broader development of compiler technology in the mid-20th
century. Here's an overview:
1. Theoretical Foundations:
○ The theoretical foundation of parser generators was laid in the 1950s by Noam
Chomsky's work on formal languages and grammars.
○ John Backus introduced a practical way to define programming languages
through what became the Backus-Naur Form (BNF) in the 1960s for the ALGOL
language.
2. Recursive Descent Parsers:
○ Early parsers were handwritten, often using recursive descent methods, based
on the structure of context-free grammars.
3. Automated Parsing Research:
○ The 1960s saw the first research into automating the process of generating
parsers, particularly for languages defined by regular and context-free grammars.
Post-Yacc Evolution
After Yacc, tools such as GNU Bison extended and reimplemented its design, and later generators like ANTLR broadened the approach with new parsing strategies and multi-language support.
Summary
While Yacc wasn't the first parser generator, it was a seminal tool that made parser generation
accessible and practical for a broad audience. It became the foundation for many compilers and
inspired a lineage of tools that expanded on its capabilities. The concept of automated parser
generation predates Yacc, but its introduction marked a turning point in the widespread adoption
of such tools. Two factors help explain that impact:
● It was included in the UNIX operating system's standard distribution, which made it
widely available to researchers and developers.
● Yacc's design was extensible and influenced by earlier theoretical work, including
Donald Knuth's introduction of LR parsing and Alfred Aho's refinements.
Legacy:
● Yacc's influence persists in modern parser generators like GNU Bison (a free software
implementation of Yacc) and many other tools inspired by its design principles.
● Stephen C. Johnson’s work extended beyond Yacc; he contributed broadly to UNIX
development, including the creation of tools like lint for static code analysis.
● Johnson developed lint, one of the first tools for static code analysis, aimed at identifying
bugs, inefficiencies, and portability issues in C programs.
● Lint set the stage for modern static analysis tools, fostering the development of safer and
more robust software.
● Johnson was deeply involved in improving the portability of software across different
hardware platforms, a challenge during the 1970s and 1980s.
● His tools, like lint, helped identify portability issues, aiding the widespread adoption of C
and UNIX across platforms.
Variables and Bindings

What is a variable?
Key Features:
● Name: A variable has a unique identifier, or name, chosen by the programmer, which is
used to access its stored value.
● Value: The value of a variable is the data it currently holds, which can be of different
types (like integers, strings, etc.).
● Binding: A variable is bound to its value, meaning the name is associated with a value
or object in memory.
So, when we write `x = 10` in Python, x is a variable that is bound to the value 10. The
name x acts as a reference to the location in memory where the value 10 is stored.
(* In SML, the inner binding of x shadows the outer one: *)
let
    val x = 0
in
    let
        val x = 1
    in
        print (Int.toString x ^ "\n")
    end;
    print (Int.toString x ^ "\n")
end;
/* In C, the inner declaration of x likewise shadows the outer one: */
#include <stdio.h>

int main() {
    int x = 0;
    {
        int x = 1;
        printf("%d\n", x); // prints 1
    }
    printf("%d\n", x);     // prints 0
}
Which techniques could you use to explain this difference more formally?
We could use operational semantics! Operational semantics is a formal way to describe how
programs execute, specifying the behavior of a programming language by defining how each
statement or expression in the language changes the state of the program step by step. It does
so by providing rules that explain how the execution of a particular language construct (like
assignment, conditionals, loops, etc.) affects variables, memory, and control flow.
Key Concepts:
1. State: The state of a program typically refers to its variables and the values they hold at
any given point in time, as well as other runtime information like the program counter.
2. Transition Rules: Operational semantics uses transition rules to show how the
execution of a statement or expression transforms the program state from one step to
the next. For example:
○ If x := 5 is executed, the state changes by binding x to the value 5.
○ If x := x + 1 is executed, it evaluates x, increments its value, and updates the
binding of x in the state.
Can you give me examples of transition rules? Let's assume we don't have
variables yet.
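For a tiny language with only numeric literals and arithmetic, two rules suffice as a first sketch (the notation is explained next):

n → n

e0 → v0 e1 → v1
----------------------------------
e0 + e1 → v0 + v1

The first rule says that a number evaluates to itself; the second says that if e0 evaluates to v0 and e1 evaluates to v1, then e0 + e1 evaluates to the sum of v0 and v1.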
Why do these rules look like fractions? What's the meaning of them?
They look like fractions because they are written as implication rules in logic, with the top part
(premises) representing conditions that must be met, and the bottom part (conclusion)
representing the result. Let's consider the rule for addition, for instance:
e0 → v0 e1 → v1
----------------------------------
e0 + e1 → v0 + v1
The top part of the "fraction" (premises) must hold for the bottom part (conclusion) to be valid.
Each arrow (→) represents evaluation: on the left, we have expressions, and on the right, we
have the values they evaluate to. This kind of structure provides a formal way to reason about
the behavior of expressions in a programming language.
Prove that "if e0 then e1 else e2" = "if not e0 then e2 else e1":
To show that if e0 then e1 else e2 is equivalent to if not e0 then e2 else e1
using the given operational semantics rules, we need to demonstrate that both expressions
yield the same result for any evaluation of e0, e1, and e2. We will prove this by considering all
possible values of e0 (there are two cases: true or false) and showing that in each case, both
expressions evaluate to the same result.
Case 1: e0 evaluates to true. Then if e0 then e1 else e2 evaluates to e1. Since not e0
evaluates to false, if not e0 then e2 else e1 also evaluates to its else branch, e1. So, in
this case, both if e0 then e1 else e2 and if not e0 then e2 else e1 evaluate to e1.
Case 2, where e0 evaluates to false, is symmetric: both expressions evaluate to e2.
Consider, for example, the rule for evaluating a let-in-end block.
So, to obtain the value of a let-in-end block, we must perform the following steps:
● Evaluate the expression e0: The first step evaluates the expression e0 to a value v0 in
the current context C. This part binds the result of e0 to v0.
● Extend the context: The context is extended with a new binding {name: v0},
associating the variable name with the result of e0.
● Evaluate e1 in the new context: Using the extended context (C + {name: v0}), the
body expression e1 is evaluated, which results in the value v1.
● Return the value of e1: The value v1 is the final result of the entire let expression.
Step-by-step evaluation:
As an example, we can derive the value of let x = 12 in 21 + x end:
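In linear form, writing C for the initial context, the derivation reads:

C ⊢ 12 → 12
C + {x: 12} ⊢ 21 → 21
C + {x: 12} ⊢ x → 12
C + {x: 12} ⊢ 21 + x → 33
--------------------------------------------
C ⊢ let x = 12 in 21 + x end → 33

The implementation discussed below mirrors these rules in Python.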
1. Expression Class
This abstract base class (ABC) defines the interface for all expression types. Every subclass
must implement the eval method, which takes an environment C and evaluates the expression
based on that environment.
2. Var Class
This class represents a variable expression. It stores the variable name (self.name) and looks
up the corresponding value in the environment C.
● Operational Semantic Rule: If C[name] exists, return the value. If the variable is not
found, it raises an error.
3. Num Class
This class represents a constant numeric value. The eval method simply returns the number.
4. BinaryExpression Class
This class serves as a base class for binary operations (like addition, subtraction, etc.). It has
two fields: left and right, representing the two sub-expressions of the binary operation. The
actual evaluation logic is provided by subclasses.
Addition (Add)
● Operational Semantic Rule: Evaluate both sub-expressions, then return their sum.
5. Let Class
The Let expression binds a variable to a value for the evaluation of another expression.
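Putting the pieces together, a minimal sketch of these classes might look as follows (the parameter env plays the role of the context C in the rules):

import sys
from abc import ABC, abstractmethod

class Expression(ABC):
    @abstractmethod
    def eval(self, env):
        raise NotImplementedError

class Var(Expression):
    def __init__(self, name):
        self.name = name
    def eval(self, env):
        if self.name in env:
            return env[self.name]
        sys.exit(f"Undefined variable {self.name}")

class Num(Expression):
    def __init__(self, num):
        self.num = num
    def eval(self, env):
        return self.num

class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class Add(BinaryExpression):
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

class Let(Expression):
    def __init__(self, name, exp_def, exp_body):
        self.name = name
        self.exp_def = exp_def
        self.exp_body = exp_body
    def eval(self, env):
        v0 = self.exp_def.eval(env)          # evaluate e0 in the current context
        new_env = dict(env)                  # extend the context additively
        new_env[self.name] = v0
        return self.exp_body.eval(new_env)   # evaluate e1 in the new context

For example, Let("x", Num(12), Add(Num(21), Var("x"))).eval({}) returns 33.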
Summary
The figure combines formal operational semantics and their Python implementation for
evaluating expressions in a simple language. The key point is that each expression type
(variable, number, binary operation, and let-binding) is implemented with its own evaluation
logic, and the rules on the right formalize how these expressions are evaluated step-by-step.
● Context updates in a purely additive manner: When a new binding is created, it only
affects the current scope without altering previous bindings.
● Lack of mutable operations: The rules don't include any that would change the value
of a bound variable.
let name = 1 in
let name = 2 in
name
end
end
Step-by-step evaluation:
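In linear form, with C for the initial context:

C ⊢ 1 → 1
C + {name: 1} ⊢ 2 → 2
C + {name: 1} + {name: 2} ⊢ name → 2
C + {name: 1} ⊢ let name = 2 in name end → 2
---------------------------------------------------------
C ⊢ let name = 1 in let name = 2 in name end end → 2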
This example shows shadowing rather than mutation. The name in the outer let is never
changed. Each let creates a new, independent binding that exists only within its own scope.
Once you exit that scope, the previous binding remains intact.
In languages with mutable variables, this situation would involve actual modification (mutation)
of the name variable. However, in languages with immutable semantics like SML/NJ, variable
bindings are fixed once they are made. New bindings can temporarily shadow old ones, but they
never modify them.
So, what would the semantics with actual mutable variables look like?
In this case, we'd need rules that allow a variable's value to be updated (mutated) after its initial
binding. With immutable variables, we create new bindings in new scopes (shadowing).
With mutable variables, we directly change the value associated with a variable in the current
environment. In mutable semantics, we'd introduce a rule for assignment. This rule would
update the value of a variable that has already been declared, unlike shadowing where a new
binding is created. Let’s go through what this would look like.
General setup:
The environment C will map variable names to locations (ℓ), and we’ll maintain a separate
store S that maps these locations to values.
For example:
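The rules can be sketched as follows, with judgments of the form C, S ⊢ e → (S', v), consistent with the description below:

loc fresh C, S0 ⊢ e0 → (S1, v0) C + {name: loc}, S1 + {loc: v0} ⊢ e1 → (S2, v1)
----------------------------------------------------------------------------------
C, S0 ⊢ let name = e0 in e1 end → (S2, v1)

C[name] = loc S[loc] = v
----------------------------------
C, S ⊢ name → (S, v)

C[name] = loc C, S0 ⊢ e → (S1, v)
----------------------------------
C, S0 ⊢ name := e → (S1 + {loc: v}, v)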
● These rules introduce a state S, which maps memory locations (e.g., loc) to values (v).
The state evolves as expressions are evaluated, and this reflects the mutable nature of
variables.
● The let rule for bindings shows how a new variable is bound to a location (loc) in
memory (loc is fresh), and how the value of the expression e0 is stored in that
memory location. The state S1 is updated with {loc: v0}. This allows the value
associated with a variable to be changed later by modifying the memory location.
● The rule C[name] = loc, S[loc] = v shows that a variable name is evaluated by
first looking up its memory location in the context C and then retrieving its value from the
state S.
● This is where mutability is clear. The assignment rule for C, S₀ ⊢ name := e → (S₁,
v) updates the value of name by changing the state S₁, associating the memory location
loc with the new value v. This reflects how variables are mutable: we can change their
values in memory.
Why are new stores created during the evaluation of expressions? For
instance, when evaluating addition, we go from S0 to S1 and then S2.
The changes in the store across different steps in the operational semantics can indeed be
described as side effects.
● Side effect refers to any change in the state of the program that persists beyond the
local scope of an expression or function. This includes things like modifying variables,
writing to memory, printing to the console, or updating files.
● Pure expressions (without side effects) always return the same value for the same
inputs and don't alter any external state. In contrast, an expression that has side effects
may return a different result or alter the state of the system (like memory or variables)
each time it is evaluated.
● Evaluating an expression (like a sum) produces new stores at each step (e.g.,
S0→S1→S2). This shows that the evaluation of expressions can modify the store, which
is the state of the program. Such modifications are side effects because the state (store)
is globally affected by the operation.
● For example, in an assignment operation (name := e), evaluating e changes the store
by updating the memory location bound to name, producing a new store. This
modification is a side effect of the assignment.
These transformations of the store are side effects, since they involve changes to the global
state of the program, and those changes persist and can influence the future execution of the
program.
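For instance, assuming left-to-right evaluation, an addition whose operands are assignments threads the store through each step (the expression is illustrative):

C, S0 ⊢ x := 1 → (S1, 1) C, S1 ⊢ x := 2 → (S2, 2)
----------------------------------------------------
C, S0 ⊢ (x := 1) + (x := 2) → (S2, 3)

Each premise consumes the store produced by the previous one, which is exactly the S0→S1→S2 chain mentioned above.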
Why doesn't C make a strict distinction between mutable and immutable variables?

C was designed in the early 1970s as a systems programming language meant to be simple, fast,
and efficient. The primary goal was to give programmers direct control over the hardware,
particularly for writing operating systems like Unix. This led to a focus on simplicity,
performance, and direct access to memory.
In this context, immutability was not a priority. Instead, C focused on mutable variables and
pointers, allowing programmers to change values as needed for system-level operations. The
idea of distinguishing variables by immutability vs mutability (as a language-level construct)
wasn't as important as providing low-level access to memory.
2. Pointer-Based Mutability
C relies heavily on pointers, which allow direct manipulation of memory. In C, a variable can be
reassigned or modified freely unless explicitly declared as const. The use of pointers and
direct memory manipulation provides a level of mutability that was seen as more important than
strict immutability.
For example:
int x = 10;
x = 20; // mutable by default
While C does not enforce immutability by default, it does provide the const keyword, which
allows a variable to be declared as immutable (read-only). However, the use of const is
optional and not as strongly enforced or idiomatic as in languages like Kotlin, Scala, or SML.
For example, in C:
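const int x = 10;
x = 20; // compile-time error: assignment of read-only variable 'x'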
This is C's way of providing immutability, but it’s something the programmer has to opt into,
rather than the language making it a fundamental distinction between val and var like Kotlin or
Scala.
C’s design philosophy centers on giving developers flexibility and control over their programs.
Adding immutability at the core language level would introduce constraints that go against C’s
ethos of letting the programmer decide how memory and values should be handled. Because C
is used in performance-critical applications (e.g., OS development, embedded systems),
mutability by default aligns with the need for efficiency and control.
Languages like Kotlin, Scala, Rust, and even SML have evolved over time, building upon ideas
like immutability that became more prominent in software design for reliability, safety, and
concurrency. These newer languages recognize the benefits of distinguishing between mutable
and immutable variables at the language level:
● Concurrency: Immutable data structures prevent race conditions and make parallel
processing easier.
● Safety: Immutable variables reduce unintended side effects, making code easier to
reason about and less error-prone.
C's design predated many of these concerns, and while it's possible to simulate immutability
with const, it's not enforced in the same manner as more modern languages.
Summary
C doesn’t make a strict distinction between mutable and immutable variables because:
● It was designed with simplicity and flexibility in mind, focusing on performance and
low-level control.
● Immutability wasn’t a key concern during its development.
● It uses the const keyword for immutability, but this is optional and not central to the
language.
Newer languages evolved with the goal of safety, concurrency, and immutability, thus making
mutability vs. immutability a central part of their design.
In Kotlin:
● val is used to declare immutable variables, which means their value cannot be changed
once assigned.
● var is used to declare mutable variables, which allows their value to be reassigned.
fun main() {
    // Immutable variable (val)
    val immutableVar: String = "I am immutable"
    // Trying to reassign a value to immutableVar would cause a compile-time error
    // immutableVar = "New value" // Uncommenting this line would cause an error

    // Mutable variable (var)
    var mutableVar: String = "I am mutable"
    mutableVar = "New value" // Reassignment is allowed
}
Explanation:
● immutableVar is declared with val, meaning once assigned, its value cannot be
changed.
● mutableVar is declared with var, allowing its value to be reassigned.
If you try to reassign a value to immutableVar, the compiler will throw an error, ensuring
immutability.
In Scala, the distinction between mutable and immutable variables is similar to Kotlin:
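// Immutable variable (val)
val immutableVar: String = "I am immutable"
// immutableVar = "New value" // compile-time error: reassignment to val

// Mutable variable (var)
var mutableVar: String = "I am mutable"
mutableVar = "New value" // allowed: var can be reassigned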
Explanation:
This behavior in Scala is very similar to Kotlin in terms of the distinction between immutability
(val) and mutability (var).
In SML, the concept of immutability is a bit more pronounced because all variables are
immutable by default. However, you can use references (ref) to create mutable variables.
(* Immutable variable *)
val immutableVar = "I am immutable";
(* Trying to reassign would result in an error *)
(* immutableVar := "New value"; (* Uncommenting this would cause an error *) *)
Explanation:
Thus, while immutability is the default in SML, you can simulate mutability with references
(ref), allowing reassignment of values.
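For instance:

(* A mutable reference cell *)
val mutableVar = ref 0;
mutableVar := 42;                     (* update the cell's contents *)
print (Int.toString (!mutableVar));   (* ! reads the current contents *)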
Can you implement a Python interpreter for the language with mutable
variables?
Let's implement these semantics in our existing Python code, which currently handles let
bindings but not mutable variables.
1. Introduce a Store class to manage the memory (store) that maps locations to
values.
2. Modify the Let construct to allocate memory locations for variables and add them to
the store.
3. Add an Assign construct to handle assignments of the form name := e.
4. Modify the Var class to look up values through the store.
Implementation Plan:
1. Store Class:
The store will hold the mutable state (like a dictionary) that maps memory locations to values.
Each variable will get a fresh memory location when it’s introduced.
class Store:
    def __init__(self):
        self.store = {}
        self.next_loc = 0

    def __repr__(self):
        return str(self.store)
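The listing above only shows the constructor. The operations the interpreter needs can be sketched as follows; lookup is used by the Var class below, while the names alloc and update are assumptions of this sketch:

    def alloc(self, value):
        # Allocate a fresh location, store the initial value, return the location.
        loc = self.next_loc
        self.store[loc] = value
        self.next_loc += 1
        return loc

    def lookup(self, loc):
        return self.store[loc]

    def update(self, loc, value):
        self.store[loc] = value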
2. Let Class:
When creating a new variable, we will allocate a fresh memory location for it, store its initial
value, and update the environment to map the variable name to this location.
class Let(Expression):
    def __init__(self, name, exp_def, exp_body):
        self.name = name
        self.exp_def = exp_def
        self.exp_body = exp_body
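The eval method itself was shown as a figure; a minimal sketch, assuming every eval takes both the environment and the store, is:

    def eval(self, env, store):
        # Evaluate the definition, allocate a fresh location for its value,
        # and bind the name to that location in an extended environment.
        v0 = self.exp_def.eval(env, store)
        loc = store.alloc(v0)
        new_env = dict(env)
        new_env[self.name] = loc
        return self.exp_body.eval(new_env, store)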
3. Assign Class:
This new class will handle assignments (name := e). It looks up the variable's memory
location from the environment, evaluates the expression to get a new value, and updates the
store at that location.
class Assign(Expression):
    def __init__(self, name, exp):
        self.name = name
        self.exp = exp
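Its eval method can be sketched in the same style:

    def eval(self, env, store):
        # Look up the variable's location and overwrite its contents.
        if self.name in env:
            loc = env[self.name]
            value = self.exp.eval(env, store)
            store.update(loc, value)
            return value
        sys.exit(f"Undefined variable {self.name}")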
4. Var Class:
We need to modify the Var class to look up the value from the store using the memory location
found in the environment.
class Var(Expression):
    def __init__(self, name):
        self.name = name
    def eval(self, env, store):
        if self.name in env:
            loc = env[self.name]
            # Look up the value from the store
            return store.lookup(loc)
        else:
            sys.exit(f"Undefined variable {self.name}")
class Num(Expression):
    def __init__(self, num):
        self.num = num
    def eval(self, env, store):
        # A constant ignores both the environment and the store.
        return self.num
Example Usage:
if __name__ == "__main__":
    # Create an empty environment and store
    env = {}
    store = Store()

    # let x = 5 in x := 10 end
    let_exp = Let("x", Num(5), Assign("x", Num(10)))
    print(let_exp.eval(env, store))  # prints 10
    print(store)                     # x's location now holds 10
Breakdown:
● The Store class manages memory locations and stores variable values.
● The Let expression allocates a memory location for a variable and updates the
environment and store.
● The Assign expression updates the value of a variable by modifying the store.
○ Dana Scott and Christopher Strachey: While primarily known for denotational
semantics, their work influenced the formalization of programming languages and
inspired alternative semantic frameworks.
So, Plotkin is like the "father" of operational semantics? What else did he
do for Computer Science?
Gordon David Plotkin is a prominent computer scientist and logician known for his
groundbreaking work in programming language theory, operational semantics, and formal
methods. Born in Scotland in 1946, Plotkin earned his Ph.D. from the University of Edinburgh in
1972 under the supervision of Rod Burstall. His dissertation focused on lambda calculus and
its applications in programming languages, laying the foundation for his later work in
semantics.
Plotkin became a professor at the University of Edinburgh, where he remained a central figure
in the School of Informatics. His influence extends beyond academia, as his research has
deeply shaped the understanding of programming languages and computational logic. Plotkin
has received numerous accolades, including a fellowship from the Royal Society, recognizing
his outstanding contributions to theoretical computer science.
3. Full Abstraction
○ Plotkin's 1977 paper "LCF Considered as a Programming Language" studied the
language PCF and posed the full abstraction problem, which asks when a
denotational model exactly matches a language's operational behavior.
● Plotkin analyzed the lambda calculus and its relation to programming languages,
particularly regarding call-by-name and call-by-value evaluation strategies.
● His work informed the design of abstract machines and the operational reasoning used
in compilers and interpreters.
● Later in his career, Plotkin explored probabilistic and quantum models of computation,
contributing to the understanding of how these paradigms relate to traditional
computation.
● His research influenced the design and semantics of programming languages like ML,
Haskell, and others, particularly in areas like type systems, evaluation strategies, and
compiler correctness.
Legacy
Gordon Plotkin's work has shaped the theoretical foundations of programming languages and
computational systems. His pioneering contributions, particularly in operational semantics,
remain a cornerstone of computer science education and research, making him one of the most
influential figures in the field.
The size of an expression is 1 plus the size of its subparts. Add the size
capability to our representation of expressions.
import sys
from abc import ABC, abstractmethod

class Expression(ABC):
    @abstractmethod
    def eval(self, env):
        raise NotImplementedError

class Var(Expression):
    def __init__(self, identifier):
        self.identifier = identifier
    def eval(self, env):
        if self.identifier in env:
            return env[self.identifier]
        else:
            sys.exit(f"Undefined variable: {self.identifier}")
    def size(self):
        return 1

class Num(Expression):
    def __init__(self, num):
        self.num = num
    def eval(self, _):
        return self.num
    def size(self):
        return 1

class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right
    @abstractmethod
    def eval(self, env):
        raise NotImplementedError
    def size(self):
        return 1 + self.left.size() + self.right.size()

class Add(BinaryExpression):
    def eval(self, env):
        return self.left.eval(env) + self.right.eval(env)

class Sub(BinaryExpression):
    def eval(self, env):
        return self.left.eval(env) - self.right.eval(env)
class Mul(BinaryExpression):
    def eval(self, env):
        return self.left.eval(env) * self.right.eval(env)

class Div(BinaryExpression):
    def eval(self, env):
        return self.left.eval(env) // self.right.eval(env)

class Let(Expression):
    def __init__(self, identifier, exp_def, exp_body):
        self.identifier = identifier
        self.exp_def = exp_def
        self.exp_body = exp_body
    def eval(self, env):
        e0_val = self.exp_def.eval(env)
        new_env = dict(env)
        new_env[self.identifier] = e0_val
        return self.exp_body.eval(new_env)
    def size(self):
        return 2 + self.exp_def.size() + self.exp_body.size()
Now, add the ability to pretty print expressions to our representation of
expressions:

class Expression(ABC):
    @abstractmethod
    def eval(self, env):
        raise NotImplementedError

class Var(Expression):
    def __init__(self, identifier):
        self.identifier = identifier
    def __str__(self):
        """
        Example:
        >>> e = Var('var')
        >>> print(e)
        var
        """
        return self.identifier

class Num(Expression):
    def __init__(self, num):
        self.num = num
    def __str__(self):
        """
        Example:
        >>> e = Num(3)
        >>> print(e)
        3
        """
        return str(self.num)
class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right
    def size(self):
        """
        Example:
        >>> e = Add(Var('x'), Num(2))
        >>> e.size()
        3
        """
        return 1 + self.left.size() + self.right.size()

class Add(BinaryExpression):
    def __str__(self):
        """
        >>> n1 = Num(3)
        >>> n2 = Num(4)
        >>> e = Add(n1, n2)
        >>> print(e)
        (3 + 4)
        """
        return f"({str(self.left)} + {str(self.right)})"

class Sub(BinaryExpression):
    def __str__(self):
        """
        >>> n1 = Num(3)
        >>> n2 = Num(4)
        >>> e = Sub(n1, n2)
        >>> print(e)
        (3 - 4)
        """
        return f"({str(self.left)} - {str(self.right)})"

class Mul(BinaryExpression):
    def __str__(self):
        """
        >>> n1 = Num(3)
        >>> n2 = Num(4)
        >>> e = Mul(n1, n2)
        >>> print(e)
        (3 * 4)
        """
        return f"({str(self.left)} * {str(self.right)})"

class Div(BinaryExpression):
    def __str__(self):
        """
        >>> n1 = Num(28)
        >>> n2 = Num(4)
        >>> e = Div(n1, n2)
        >>> print(e)
        (28 / 4)
        """
        return f"({str(self.left)} / {str(self.right)})"
class Let(Expression):
    def __init__(self, identifier, exp_def, exp_body):
        self.identifier = identifier
        self.exp_def = exp_def
        self.exp_body = exp_body
    def __str__(self):
        """
        >>> e = Let('v', Num(42), Var('v'))
        >>> print(e)
        let v = 42 in v end
        """
        return f"let {self.identifier} = {str(self.exp_def)} in {str(self.exp_body)} end"
How could you change your design, to avoid having to modify classes
whenever you need a new functionality?
One solution would be to use type inspection to find out the type of each expression and handle
it accordingly. See the solution below:
def to_str(e):
    """
    Pretty print an expression based on its type.

    Example:
    >>> e = Let('v', Num(42), Var('v'))
    >>> print(to_str(e))
    let v = 42 in v end

    >>> e = Let('v', Add(Num(40), Num(2)), Mul(Var('v'), Var('v')))
    >>> print(to_str(e))
    let v = (40 + 2) in (v * v) end
    """
    if isinstance(e, Var):
        return e.identifier
    elif isinstance(e, Num):
        return str(e.num)
    elif isinstance(e, Add):
        return f"({to_str(e.left)} + {to_str(e.right)})"
    elif isinstance(e, Sub):
        return f"({to_str(e.left)} - {to_str(e.right)})"
    elif isinstance(e, Mul):
        return f"({to_str(e.left)} * {to_str(e.right)})"
    elif isinstance(e, Div):
        return f"({to_str(e.left)} / {to_str(e.right)})"
    elif isinstance(e, Let):
        return f"let {e.identifier} = {to_str(e.exp_def)} in {to_str(e.exp_body)} end"
    else:
        raise ValueError("Unknown expression")
The issue we're facing relates to breaking the Law of Demeter (also known as the Principle of
Least Knowledge). This principle states that an object should have limited knowledge of the
internal details of other objects. In your to_str function, you're directly accessing fields like
e.identifier, e.left, e.right, etc., which exposes the internal structure of your
Expression subclasses (e.g., Var, Add, Let).
By directly accessing internal details like e.identifier, the function tightly couples itself to
the specific implementations of the expression types. This makes the code less maintainable
and more prone to errors if the internal structure of these types changes. If we ever change the
implementation of the Var class, we would need to change the implementation of to_str too.
To adhere to the Law of Demeter, you could push the responsibility of converting an expression
to a string onto the subclasses themselves, either by defining a __str__ method in each class
or by adding a specific method like to_str in each subclass. But then we are back to the
Open/Closed issue...
1. Breaking the Law of Demeter:
○ This happens when you tightly couple an external function (to_str) to the
internal structure of multiple classes.
2. Breaking the Open/Closed Principle:
○ If you add new methods (like to_str) to each class, the design becomes harder
to extend without modifying the base code.
The Visitor pattern addresses this tension. It is a design pattern that separates an algorithm from
the objects it operates on. It lets you add new operations to a group of objects (often in a
hierarchy, like an AST for expressions) without modifying the objects themselves.
Key Concepts:
1. Visitor: A class that defines operations to be performed on different object types in the
hierarchy (e.g., visit_add, visit_var).
2. Element: The object structure (like Expression subclasses) that accepts a visitor and
delegates the operation to it via the accept method.
How It Works:
● Each object in the hierarchy (e.g., Var, Add, Mul) implements an accept(visitor)
method.
● The accept method calls the appropriate method on the visitor, depending on the
object's type (visit_var, visit_add, etc.).
● The logic for new operations (like pretty-printing or evaluation) is added to the Visitor,
not the objects themselves.
class VisitorStr:
    """
    Pretty print an expression based on its type.

    Example:
    >>> e = Let('v', Num(42), Var('v'))
    >>> v = VisitorStr()
    >>> print(e.accept(v, None))
    let v = 42 in v end
    """
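    # The methods of this visitor were shown as a figure; a minimal sketch,
    # mirroring the to_str function above, follows:
    def visit_var(self, var, arg):
        return var.identifier
    def visit_num(self, num, arg):
        return str(num.num)
    def visit_add(self, add, arg):
        return f"({add.left.accept(self, arg)} + {add.right.accept(self, arg)})"
    def visit_sub(self, sub, arg):
        return f"({sub.left.accept(self, arg)} - {sub.right.accept(self, arg)})"
    def visit_mul(self, mul, arg):
        return f"({mul.left.accept(self, arg)} * {mul.right.accept(self, arg)})"
    def visit_div(self, div, arg):
        return f"({div.left.accept(self, arg)} / {div.right.accept(self, arg)})"
    def visit_let(self, let, arg):
        exp_def = let.exp_def.accept(self, arg)
        exp_body = let.exp_body.accept(self, arg)
        return f"let {let.identifier} = {exp_def} in {exp_body} end"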
import sys
from abc import ABC, abstractmethod

class Expression(ABC):
    @abstractmethod
    def accept(self, visitor, arg):
        pass

class Var(Expression):
    def __init__(self, identifier):
        self.identifier = identifier
    def accept(self, visitor, arg):
        return visitor.visit_var(self, arg)

class Num(Expression):
    def __init__(self, num):
        self.num = num
    def accept(self, visitor, arg):
        return visitor.visit_num(self, arg)

class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class Add(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_add(self, arg)
class Sub(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_sub(self, arg)

class Mul(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_mul(self, arg)

class Div(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_div(self, arg)

class Let(Expression):
    def __init__(self, identifier, exp_def, exp_body):
        self.identifier = identifier
        self.exp_def = exp_def
        self.exp_body = exp_body
    def accept(self, visitor, arg):
        return visitor.visit_let(self, arg)
class VisitorEval:
    """
    Evaluate an expression, given an environment that maps variable names to
    values.

    Example:
    >>> e = Let('v', Num(42), Var('v'))
    >>> v = VisitorEval()
    >>> print(e.accept(v, {}))
    42
    """
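    # The methods of this visitor were shown as a figure; a minimal sketch,
    # consistent with calls such as add.left.accept(self, env) discussed
    # below, follows:
    def visit_var(self, var, env):
        if var.identifier in env:
            return env[var.identifier]
        else:
            sys.exit(f"Undefined variable: {var.identifier}")
    def visit_num(self, num, env):
        return num.num
    def visit_add(self, add, env):
        return add.left.accept(self, env) + add.right.accept(self, env)
    def visit_sub(self, sub, env):
        return sub.left.accept(self, env) - sub.right.accept(self, env)
    def visit_mul(self, mul, env):
        return mul.left.accept(self, env) * mul.right.accept(self, env)
    def visit_div(self, div, env):
        return div.left.accept(self, env) // div.right.accept(self, env)
    def visit_let(self, let, env):
        v0 = let.exp_def.accept(self, env)
        new_env = dict(env)
        new_env[let.identifier] = v0
        return let.exp_body.accept(self, new_env)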
1. Law of Demeter: This principle, also known as the principle of least knowledge,
suggests that a module should not know about the internal details of the objects it
manipulates. The Visitor Pattern helps adhere to this principle by allowing operations to
be performed on elements of an object structure without the need for those elements to
expose their internal details to the visitor. Essentially, the pattern enables you to add new
operations to existing object structures without modifying their classes, thereby
minimizing the need for deep coupling.
2. Open/Closed Principle: This principle states that software entities (classes, modules,
functions, etc.) should be open for extension but closed for modification. The Visitor
Pattern aligns with this principle by allowing you to add new operations to a set of
objects without changing the classes of the objects themselves. Instead of modifying the
classes to add new behavior, you create new visitor classes that implement the
additional operations, thereby extending the functionality while keeping the existing code
closed for modification.
In summary, the Visitor Pattern helps maintain a clean separation of concerns, promoting both
the Law of Demeter by reducing the dependency on internal object details and the Open/Closed
Principle by facilitating the addition of new functionalities without altering existing code.
1. Encapsulation of Behavior: Each expression type (Var, Num, Add, Sub, Mul, Div,
Let) does not need to know the specific details of the other expression types or their
implementations. The VisitorEval class encapsulates the logic for evaluating
expressions, so each expression type (Var, Num, etc.) only needs to interact with the
visitor interface. This minimizes the need for expressions to interact directly with other
expressions, thus adhering to the Law of Demeter.
2. Decoupling Operations from Data Structures: The accept method in the
Expression class allows an expression object to pass itself to a visitor without
exposing its internal details. The visitor (VisitorEval) performs operations on the
expression objects without needing to understand the internal implementation of these
objects. For example, the visit_add method in VisitorEval interacts with Add
objects by calling accept on their left and right components. This avoids direct
manipulation or knowledge of the internal structure of the Add object.
3. Layered Interaction: When the VisitorEval methods like visit_add or
visit_let handle expressions, they delegate the evaluation of sub-expressions (like
add.left.accept(self, env)) to the sub-expressions themselves. This layered
approach ensures that the visitor does not need to know the details of how left and
right are implemented internally; it only needs to know how to interact with them
through their accept methods.
4. Avoiding Deep Coupling: Since the VisitorEval methods do not directly access or
manipulate the internal data of expressions but rather rely on the accept method to
traverse and evaluate the expressions, it avoids deep coupling between the visitor and
the expression classes. This adheres to the Law of Demeter by reducing the
dependencies between components.
In summary, by using the visitor pattern, this implementation ensures that the VisitorEval
class performs operations on expression objects without needing to understand their internal
workings, thus following the Law of Demeter. Each expression type interacts with the visitor
through a well-defined interface (accept), minimizing direct dependencies and ensuring that
changes to expression types do not affect the visitor’s logic.
1. Extensibility without Modifying Existing Code: The VisitorEval class can be
extended to add new operations or evaluations without modifying the existing
Expression classes. For instance, if you want to add a new type of operation, such as
Exponentiation, you can create a new visitor class that implements the necessary
logic for Exponentiation without changing the Var, Num, Add, Sub, Mul, Div, or Let
classes.
2. Adding New Visitors: To introduce new functionalities, you create additional visitor
classes. Each new visitor class implements the visit_* methods for the different
expression types. This approach allows you to extend the behavior of the expression
evaluation system without altering the existing expression classes. For example, if you
wanted to add a visitor for pretty-printing or optimizing expressions, you could define a
new visitor class like VisitorPrettyPrint or VisitorOptimize, each
implementing its own visit_* methods.
3. Expression Classes Remain Closed for Modification: The expression classes (Var,
Num, Add, etc.) do not need to be changed when new operations are introduced. They
are designed to be open for extension (through new visitors) but closed for modification.
Their implementation of the accept method remains the same, and they do not need to
be altered to support new visitors.
4. Consistent Interface for Extensions: The accept method in the Expression class
provides a consistent interface for all visitors. This means that regardless of the type of
visitor or the specific operations it performs, the expression classes remain unchanged.
New visitor classes can be added as needed, demonstrating adherence to the OCP by
enabling extensions through new visitors rather than modifications to existing classes.
class VisitorOptimize:
    """
    This visitor optimizes expressions, folding operations that involve only
    numbers.

    Example:
    >>> e0 = Add(Num(40), Num(2))
    >>> optimizer = VisitorOptimize()
    >>> e1 = e0.accept(optimizer, {})
    >>> printer = VisitorStr()
    >>> print(e1.accept(printer, None))
    42
    """
    def visit_var(self, var, env):
        return var
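    # The remaining methods were shown as a figure; a sketch of the folding
    # logic for numbers and addition follows (the other operators are
    # analogous):
    def visit_num(self, num, env):
        return num
    def visit_add(self, add, env):
        left = add.left.accept(self, env)
        right = add.right.accept(self, env)
        if isinstance(left, Num) and isinstance(right, Num):
            return Num(left.num + right.num)  # fold constant addition
        return Add(left, right)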
Have you ever heard of an issue in software design called the Expression
Problem?
The expression problem is a well-known challenge in programming languages that concerns the
extensibility and modularity of statically typed data abstractions.
abstraction that is extensible both in its representations and its behaviors, where one can add
new representations and new behaviors to the data abstraction, without recompiling existing
code, and while retaining static type safety (e.g., no casts). The statement of the problem
exposes deficiencies in programming paradigms and programming languages.
Here's a breakdown of how the Visitor pattern interacts with the Expression Problem:
The Visitor pattern excels at adding new behaviors (operations) to existing data types without
modifying their definitions. By defining a new visitor class, you can extend the functionality of
your data structure without changing the underlying structure (such as expression classes in the
example above).
In the context of the Expression Problem, this solves part of the challenge. For example, you
can easily add new visitors to evaluate expressions, pretty-print them, optimize them, etc., by
simply adding new Visitor implementations. This satisfies the need for behavioral
extensibility.
● You need to modify the visitor interface every time you add a new data variant (e.g., a
new Exponentiation expression). This limits representational extensibility.
While the Visitor pattern is good at adding new behaviors, it struggles with adding new data
representations (i.e., new variants of data types) without recompiling or modifying existing code.
In the case of statically typed languages like Java, C++, or Python with type hinting, adding a
new type (e.g., a new expression class like Exponentiation) requires modifying the Visitor
interface and the existing visitor classes to handle the new variant.
● If you add a new expression type (e.g., Exponentiation), you would need to update
the Visitor interface to include a visit_exponentiation method.
● You also need to modify all existing visitor implementations (VisitorEval,
VisitorOptimize, etc.) to implement this new method, thus breaking the "closed for
modification" aspect of the Open/Closed Principle.
Example:
Suppose you want to add an Exponentiation expression to the existing set of Add, Sub,
Mul, and Div. You would need to update the Visitor interface with a
visit_exponentiation method and modify every existing visitor class to implement it.
This violates representational extensibility, because you are forced to touch multiple parts of the
code when adding new data types.
1. Clang/LLVM: Although LLVM itself is not strictly object-oriented in the traditional sense,
Clang, the front-end for C, C++, and Objective-C, is written in C++ and uses visitor-like
patterns to traverse ASTs. For instance, Clang has AST visitors that traverse the syntax
tree to perform semantic analysis, including checking types and symbols.
2. Java Compiler (javac): The javac compiler, written in Java, uses visitors for AST
processing. In this case, the Visitor pattern is useful for separating the operations (such
as type inference, bytecode generation, or optimizations) from the structure of the AST
nodes, making the code easier to extend without modifying the AST classes themselves.
3. GCC (GNU Compiler Collection): While GCC is largely written in C, the design of its
AST traversal shares similar principles with the Visitor pattern. It uses function pointers
or switch-based approaches to traverse different kinds of nodes, which is conceptually
similar to how Visitor pattern works in OOP contexts.
4. Scala Compiler (scalac): In the Scala compiler, the Visitor pattern is used in the
implementation of different phases of the compiler, like tree rewriting. Scala’s
pattern-matching features also help manage tree traversals, which serves a similar
purpose to the Visitor pattern in managing AST nodes.
In practice, the Visitor pattern's main benefit in compilers is that it allows different kinds of
operations (e.g., interpretation, compilation, optimization) to be added without altering the AST
node classes. It makes the code more modular and easier to extend, which is especially useful
in large and complex systems like compilers.
Before we get started, we need to have Clang and LLVM installed, along with the Clang Tooling
infrastructure. We'll be working with Clang’s libTooling.
We’ll create a visitor that looks for function declarations and counts them.
Here’s the code for a minimal Clang tool using the visitor pattern:
#include "clang/AST/AST.h"
#include "clang/AST/RecursiveASTVisitor.h"
#include "clang/Frontend/ASTConsumers.h"
#include "clang/Frontend/FrontendAction.h"
#include "clang/Frontend/CompilerInstance.h"
#include "clang/Tooling/Tooling.h"
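The two classes were shown as figures. A minimal sketch of what they might contain follows; the class names match the description above, while the member names and the counter are assumptions of this sketch:

class FunctionDeclVisitor
    : public clang::RecursiveASTVisitor<FunctionDeclVisitor> {
public:
  explicit FunctionDeclVisitor(clang::ASTContext *Context) : Context(Context) {}

  // RecursiveASTVisitor calls this once per function declaration.
  bool VisitFunctionDecl(clang::FunctionDecl *Decl) {
    llvm::outs() << "Function: " << Decl->getNameAsString() << "\n";
    ++Count;
    return true; // keep traversing the AST
  }

  unsigned Count = 0;

private:
  clang::ASTContext *Context;
};

class FunctionDeclConsumer : public clang::ASTConsumer {
public:
  explicit FunctionDeclConsumer(clang::ASTContext *Context) : Visitor(Context) {}

  void HandleTranslationUnit(clang::ASTContext &Context) override {
    Visitor.TraverseDecl(Context.getTranslationUnitDecl());
    llvm::outs() << "Total functions: " << Visitor.Count << "\n";
  }

private:
  FunctionDeclVisitor Visitor;
};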
Step 3: Explanation
To run this tool, you would compile it using Clang and run it on some C++ source code. It will
print the names of all functions declared in the code, as well as the total number of functions
found. For example, consider this input file:
#include <iostream>

void foo() {
    std::cout << "Hello, World!" << std::endl;
}

int bar(int x) {
    return x + 1;
}
When running the tool on this code, the output would be:
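With the sketch above, the output would be something like:

Function: foo
Function: bar
Total functions: 2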
To build the tool, link it against Clang's libTooling libraries (assuming Clang and LLVM are
installed at llvm-project/build/lib).
You can then run the tool on your source code like this:
./func_counter file.cpp
Conclusion
This is a simple example of how you can implement an analysis using the Visitor pattern in
Clang. You can extend this pattern to perform more complex analyses, such as finding variable
declarations, counting loops, or checking coding style.
While the GoF book is credited with popularizing the Visitor pattern, the concept is rooted in
earlier work on object-oriented programming and double dispatch. Here are some notable
influences and precursors:
● The GoF formalized the Visitor pattern as a way to add operations to classes without
modifying them, using double dispatch to resolve method calls based on both the
visitor and the element being visited.
● In the GoF's classification, Visitor is primarily used when:
○ The operations need to work across a variety of classes with distinct interfaces.
○ You want to keep the element classes unchanged when adding new functionality.
● Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides: While not the
inventors, their presentation of Visitor in the GoF book made it one of the most
recognized design patterns in object-oriented programming.
If you’re interested in the pattern’s historical applications, compiler theory and tree traversal
algorithms in languages like Lisp or Simula are excellent places to explore its conceptual roots!
John Matthew Vlissides (August 2, 1961 – November 24, 2005) was an influential computer
scientist and, by all accounts, an endearing person, best known as one of the "Gang of Four"
authors of Design Patterns: Elements of Reusable Object-Oriented Software. Vlissides
was born in Michigan, USA, and displayed an early aptitude for mathematics and science. He
pursued his education at the University of Virginia, earning a bachelor's degree in electrical
engineering. He later completed his Ph.D. in computer science at Stanford University in 1988,
where his thesis focused on object-oriented user interface design.
Throughout his career, Vlissides worked at IBM’s Thomas J. Watson Research Center, where
he made significant contributions to software design and reuse. His passion for patterns, design,
and usability in software extended beyond his research, as he actively shared his knowledge
through publications, talks, and mentoring.
He passed away in 2005 at the age of 44 after a battle with a brain tumor. Vlissides is
remembered not only for his professional achievements but also for his warm personality,
intellectual curiosity, and dedication to the software engineering community.
● Vlissides’s early work focused on object-oriented user interface frameworks and design,
which laid the groundwork for several modern principles in GUI and framework
development.
● His research explored ways to improve the reusability and maintainability of software
systems.
● At IBM, Vlissides contributed to projects related to software architecture, reuse, and user
interface design.
● He played a pivotal role in advancing IBM's efforts to incorporate design patterns into its
tools and methodologies.
● Vlissides was a strong proponent of software reuse and modularity, advocating for
practices that reduced redundancy and improved the efficiency of development
processes.
Legacy
John Vlissides’s work has left an indelible mark on the field of software engineering. While his
life was tragically short, his contributions to design patterns and his mentorship have inspired
generations of software developers. His colleagues and friends remember him as a kind,
thoughtful, and generous individual, making his legacy one of both professional and personal
excellence.
Type Systems
Code: https://github.com/pronesto/DCC053_Public/tree/main/8_TypeVerification
Do you know any modern programming language that does not have
types?
Nearly all modern programming languages use types in some form, though the way they
handle types varies. While some languages don't require explicit type annotations from the
programmer, every language that manipulates data needs some concept of types, whether
implicit or explicit.
Languages that might appear not to use types typically employ dynamic typing where the
types are inferred and checked at runtime, rather than at compile-time. Examples of these are:
● JavaScript: JavaScript doesn't require the programmer to declare types explicitly, but
the language internally keeps track of types during runtime.
● Python: Like JavaScript, Python does not require the explicit declaration of types, but
every object in Python still has a type that the interpreter checks at runtime.
Even in languages without explicit type systems, the underlying implementation must manage
types for the sake of operations like arithmetic, memory management, and function calls.
In summary, while there are languages that do not require explicit types (dynamic languages),
there are no modern languages that completely avoid the concept of types. Implicit types still
exist, even if they are not directly visible to the programmer.
A notable historical exception is BCPL (Basic Combined Programming Language), designed by
Martin Richards in 1967, which was genuinely typeless:
● Typeless Nature: In BCPL, all data is treated as a word of memory, typically a fixed-size
unit that could represent anything: an integer, a pointer, or even part of a string. There
was no distinction between different data types, and it was up to the programmer to
interpret what the bits represented.
● Manual Interpretation: Since BCPL lacked type safety or type enforcement, it was the
programmer’s responsibility to ensure that the right operations were applied to the right
kind of data. For example, adding two values was allowed regardless of whether they
were integers or memory addresses.
Here's a simple example of a BCPL program that highlights its typeless nature. This program
swaps two values in memory and prints them, demonstrating how BCPL treats everything as
raw machine words without regard for types like integers, pointers, or characters.
GET "libhdr"

// Swap two machine words and print them (the values are illustrative).
LET start() = VALOF
{ LET a = 10
  LET b = 20
  LET t = ?
  writef("before: a = %n, b = %n*n", a, b)
  t := a
  a := b
  b := t
  writef("after: a = %n, b = %n*n", a, b)
  RESULTIS 0
}
Key Points:
1. No Explicit Types: There are no type declarations for a and b. They are just treated as
memory locations that store machine words. Here, they are used as integers, but they
could just as easily store pointers or other data.
2. Generic Assignments: The swap operation (a := b) is completely typeless. The
program doesn't care whether a and b are integers, pointers, or any other kind of data. It
simply moves raw machine words around.
3. Printing Values: The writef function is used to print the values. %n is used to indicate
a numeric value, but BCPL doesn't internally enforce that a and b are numbers. The %n
is more about interpreting the memory word as a number during output.
● No Type Safety: You can assign anything to a or b, and there’s no type checking. In
modern languages, swapping two variables would require that they have compatible
types. In BCPL, however, everything is just a word of memory.
● Flexibility: This flexibility allows you to perform operations directly on memory, but at the
cost of potential errors, since there is no protection against assigning the "wrong type" of
data (e.g., trying to add a pointer to an integer).
BCPL’s typeless nature made it very powerful for low-level programming, but also error-prone,
as the programmer had to manually ensure that memory was used correctly.
The classical models of computation are similarly typeless. While both the Turing Machine and
Lambda Calculus are extremely powerful and can
theoretically compute anything computable (per the Church-Turing thesis), they do not deal with
types in their original forms. Their typeless nature reflects their role as abstract models of
computation rather than practical programming languages that deal with real-world constraints
like memory safety or data type compatibility.
So, every modern programming language has a type system. Why are
types so important?
Types are important in modern programming languages because they enhance safety,
readability, performance, and correctness. They allow programmers to reason about
programs more effectively, avoid common errors, and ensure efficient and reliable execution.
While early languages (like BCPL or the Turing Machine) were typeless, types have become a
core part of programming practice as languages have evolved to handle increasingly complex
systems. Here's why types are so important:
1. Safety and Error Detection
● Type errors (e.g., adding a number to a string) can cause programs to behave
unexpectedly or even crash. A type system helps detect these errors early—either at
compile time (in statically typed languages) or at runtime (in dynamically typed
languages). This prevents bugs and reduces the likelihood of unpredictable behavior.
● Memory safety: Types can enforce rules that prevent common errors like accessing
invalid memory locations (e.g., null pointer dereferences), buffer overflows, or
out-of-bounds errors.
2. Readability and Documentation
● Types document the program: By specifying the types of variables, functions, and data
structures, types act as a form of documentation. This makes it easier to understand
what the program does, especially when working in large codebases or teams.
● Explicit contracts: Types serve as a contract that describes the expected input and
output of functions, making programs more readable and reducing ambiguity.
3. Performance
● Compiler optimizations: Knowing the types of variables allows the compiler to make
assumptions that lead to more efficient code. For instance, knowing that a variable is an
integer rather than a floating-point number allows the compiler to use more efficient
machine instructions.
● Memory management: Types help the compiler allocate memory efficiently. For
example, a compiler knows exactly how much space to reserve for an int versus a
float or a custom data structure.
Example: In languages like C or Rust, knowing the size and layout of types at compile time
allows for stack allocation and optimized memory usage.
● Avoiding unintended behavior: Without types, it’s easy to make mistakes like applying
inappropriate operations on data. Types prevent this by enforcing strict rules on how
data can be used.
● Defined behavior: Languages with strong type systems (like Haskell or Rust) often
guarantee that well-typed programs cannot crash due to type-related issues. This avoids
many runtime surprises.
Example: In Rust, the type system ensures memory safety through ownership and borrowing,
preventing common issues like dangling pointers or race conditions without needing a garbage
collector.
● Type-driven design: In languages with expressive type systems like Haskell or Idris,
the type system can guide program design. You can define complex behaviors and
invariants directly in the type system, ensuring correctness by construction.
● Type-level programming: Some advanced languages use types as a powerful
abstraction tool. In dependent types, for instance, types can encode and enforce
arbitrary properties about data, guaranteeing correctness beyond traditional type safety.
>>> x = 5
>>> y = "four"
>>> x + y
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'
This example shows a simple and effective way to illustrate how types prevent common programming errors: the mistake is caught as soon as the operation is attempted. Memory safety is the other side of the coin. Consider the C program below:
#include <stdio.h>
#include <stdlib.h>

int* createArray() {
    int arr[5] = {1, 2, 3, 4, 5}; // Local array (on the stack)
    return arr; // Returns a pointer to the local array
}

int main() {
    int* ptr = createArray();
    // ptr now points to invalid memory (the stack frame has been destroyed)
    printf("%d\n", ptr[0]); // Undefined behavior
    return 0;
}
The function createArray creates an array on the stack. When the function returns, the stack
frame is destroyed, and the pointer ptr refers to invalid memory. Accessing ptr[0] can lead to
undefined behavior, including crashes or corrupt data, because the program is trying to read
from an invalid memory location.
Now, if you were to write a program that tries to return a reference to local memory (like the C
example), Rust’s compiler would give an error:
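For instance, a minimal sketch of such a program (the function name is illustrative) fails to compile:

fn create_array() -> &'static [i32; 5] {
    let arr = [1, 2, 3, 4, 5];
    &arr // error: cannot return reference to local variable `arr`
}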
Concluding:
● In C, the lack of strong type safety allows for dangerous operations, such as accessing
invalid memory locations through dangling pointers, leading to undefined behavior.
● In Rust, the type system enforces strict rules about ownership and borrowing,
preventing such errors by ensuring memory is valid whenever accessed. Rust’s
ownership system and borrow checker help guarantee memory safety at compile
time, preventing issues like dangling pointers, buffer overflows, and other
memory-related bugs common in languages like C.
Can you implement these rules in our Python interpreter? Start by defining
classes for the new kinds of expressions.
Find below an implementation of classes for the different types of expressions.
import sys
from abc import ABC, abstractmethod

class Expression(ABC):
    @abstractmethod
    def accept(self, visitor, arg):
        pass

class Var(Expression):
    def __init__(self, identifier):
        self.identifier = identifier
    def accept(self, visitor, arg):
        return visitor.visit_var(self, arg)

class Num(Expression):
    def __init__(self, num):
        self.num = num
    def accept(self, visitor, arg):
        return visitor.visit_num(self, arg)

class Bln(Expression):
    def __init__(self, bln):
        self.bln = bln
    def accept(self, visitor, arg):
        return visitor.visit_bln(self, arg)
class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class Add(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_add(self, arg)

class Sub(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_sub(self, arg)

class Mul(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_mul(self, arg)

class Div(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_div(self, arg)

class And(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_and(self, arg)

class Lth(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_lth(self, arg)

class Let(Expression):
    def __init__(self, identifier, exp_def, exp_body):
        self.identifier = identifier
        self.exp_def = exp_def
        self.exp_body = exp_body
    def accept(self, visitor, arg):
        return visitor.visit_let(self, arg)

class IfThenElse(Expression):
    def __init__(self, cond, e0, e1):
        self.cond = cond
        self.e0 = e0
        self.e1 = e1
    def accept(self, visitor, arg):
        return visitor.visit_ifThenElse(self, arg)
The evaluator itself is implemented as another visitor.
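Below is a sketch of two of its methods: visit_num, which simply returns num.num, and the short-circuiting visit_and that the discussion below walks through (a sketch; the complete visitor defines one visit method per expression class):

class VisitorEval:
    def visit_num(self, num, env):
        return num.num
    def visit_and(self, exp, env):
        v0 = exp.left.accept(self, env)
        if not v0:
            return False # short-circuit: the right side is never evaluated
        return exp.right.accept(self, env)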
1. First Operand (v0): The left side (exp.left.accept(self, env)) is evaluated first.
2. If v0 is False: The method immediately returns False without evaluating the right side.
This is because, in an and operation, if one operand is False, the whole expression is
False regardless of the other operand.
3. If v0 is True: Only then does it proceed to evaluate the right side
(exp.right.accept(self, env)) because the result of the and operation now
depends on the right-hand side.
Consider, for instance, an expression such as x != 0 and (y / x) > 1: the division by x only occurs if x != 0. Without short-circuit evaluation, you could end up dividing by zero and causing a runtime error.
In our visitor, the visit_and method is thus a classic example of short-circuit evaluation applied in the context of evaluating boolean expressions.
In the current implementation of the visitor, the methods assume that the operands have the correct types, but they perform no type checking. As a consequence, ill-typed expressions, such as a comparison between a number and a boolean, either crash the interpreter or silently produce meaningless results.
A type-safe version of the evaluator, VisitorTypeSafeEval, verifies via an ensure_type helper that each operand has the expected type before using it (its short-circuiting methods, like visit_and, still return False early when they can). A small driver catches the TypeError that the visitor may raise:
def type_safe_eval(e):
    v = VisitorTypeSafeEval()
    try:
        return e.accept(v, {})
    except TypeError as tp_error:
        print(tp_error)
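A possible implementation of the ensure_type helper used by this visitor (a sketch; note the strict type(...) comparison, whose rationale is discussed next):

def ensure_type(value, expected_type):
    # Compare with type(...) instead of isinstance, so that booleans
    # are not accepted where integers are expected (in Python, bool
    # is a subclass of int).
    if type(value) is not expected_type:
        raise TypeError(f"Expected {expected_type.__name__}, "
                        f"found {type(value).__name__}")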
Using isinstance(value, int) would allow booleans to pass through since, in Python,
bool is a subclass of int. This can cause unintended behavior, especially when strict type
checking is desired, such as when you're distinguishing between integers and booleans.
Here's an example:
v = VisitorEval()
# Old implementation:
e = Lth(Num(3), Bln(True)) # Could have passed because bool is a subclass of int
e.accept(v, {}) # Would give False instead of a type error
In this sense, our interpreter is dynamically typed, yet type-safe. Here's why:
1. Dynamic Typing: In dynamically typed languages like Python, JavaScript, or Ruby,
types are determined at runtime rather than at compile time. Similarly, in our interpreter,
we don't know the types of expressions until they are evaluated. The type-checking is
deferred until execution time, where the ensure_type function verifies whether the
value matches the expected type.
2. Type Safety: Despite being dynamically typed, our interpreter enforces type safety.
Type safety means that the program will not perform operations on data of incompatible
types. For instance, if we attempt to compare a number with a boolean, or add a string to
an integer, our interpreter throws a TypeError, similar to how dynamically typed,
type-safe languages handle such cases.
Many dynamically typed languages provide runtime checks to prevent type errors:
○ In Python, adding an integer to a string (1 + "hello") raises a TypeError.
○ In Ruby, trying to add a string and integer ("1" + 1) will also result in an error.
3. Run-time Errors: Unlike statically typed languages (e.g., Java, C#), which catch type
errors at compile time, our interpreter (like dynamically typed languages) detects type
mismatches during execution. But since we're ensuring type correctness at runtime, we
maintain a form of dynamic type safety.
How does our interpreter implement Dynamic Typing with Type Safety?
Here’s how the flow works in our interpreter:
● Type inference at runtime: When the interpreter evaluates an expression, the type of
each value is determined at the moment it’s executed, much like how dynamically typed
languages work.
● Runtime checks: The ensure_type method acts like a type checker that runs during
execution, ensuring that operations (like +, <, and) are only performed on compatible
types.
Example Comparison:
● Python:

x = 5
y = "hello"
x + y # TypeError: unsupported operand type(s) for +: 'int' and 'str'

● Our interpreter (illustrative, using the classes defined earlier):

type_safe_eval(Add(Num(5), Bln(True))) # prints a TypeError message
Both examples demonstrate runtime detection of type errors, ensuring type safety despite the
dynamic typing. The interpreter we've created behaves like an interpreter for a dynamically
typed language with runtime type safety. It allows flexibility in how expressions are typed but
enforces correctness during execution, preventing invalid operations from proceeding, which is
the hallmark of type-safe dynamic languages!
So, the more "dynamic" a language is, the more checks happen at run time? In this case, are some languages more or less "dynamic" than others?
Yes, that makes some sense. You're describing an important aspect of the distinction between
dynamically and statically typed languages. In dynamically typed languages, type information is
not known at compile time, so type checks must be performed at runtime, as in your example of
the ensure_type method. This results in a lot of conditional checks being required to ensure
type safety during execution. In contrast, statically typed languages check and enforce types at
compile time, meaning that such runtime checks are usually unnecessary.
● Very static languages (like C, Rust, or Haskell) rely heavily on compile-time type
checking. Once compiled, these languages assume that the types are correct, and no
additional type checks are performed during execution. This leads to more efficient code,
but it also means the programmer has to provide all the necessary type information
upfront.
● Very dynamic languages (like Python, Ruby, or JavaScript) allow you to write code
without declaring types explicitly, but this flexibility requires more runtime checks,
especially when operations depend on specific types (like division requiring integer
operands). These checks add some overhead and potential for runtime type errors.
● Languages in between (like Java or C#) have a combination of both compile-time and
runtime checks. For example, Java uses a static type system, but due to features like
reflection or casting, there can still be situations where runtime type checks are
necessary, leading to a mix of static and dynamic behavior.
So, yes, you can think of a spectrum based on how much a language relies on runtime type
checking versus compile-time type enforcement. The more dynamic a language is, the more
frequently runtime checks like ensure_type are needed to maintain safety and correctness.
Conversely, the more static a language, the fewer such checks occur at runtime, because the
compiler ensures type correctness ahead of time.
Type-Safe Languages
Type-safe languages ensure that operations are only performed on data of compatible types,
either at compile time or runtime. This prevents certain classes of errors, such as trying to add a
string to an integer or accessing invalid memory.
1. Prevention of Type Errors: Type-safe languages guarantee that you cannot perform
illegal operations on types (like adding a string and an integer, or accessing memory
outside the bounds of an array).
2. Static or Dynamic Type Checking:
○ Statically Typed: In statically typed type-safe languages, like Java or Haskell,
type errors are caught at compile time. The program won’t run if there's a type
mismatch.
○ Dynamically Typed: In dynamically typed type-safe languages, like Python or
Ruby, type checks happen at runtime. The program will raise an error if it
encounters a type mismatch during execution.
3. Memory Safety: Type-safe languages often provide memory safety by ensuring that
data is accessed in a controlled manner, preventing issues like buffer overflows, invalid
pointer dereferencing, or use-after-free errors.
4. Examples:
○ Statically typed and type-safe: Java, C#, Haskell
○ Dynamically typed and type-safe: Python, Ruby, JavaScript (these raise
runtime errors for invalid type operations)
Benefits:
● Fewer runtime errors: The type system catches many potential errors either at compile
time or runtime.
● Clearer code: Types act as documentation, helping developers understand the kind of
data being manipulated.
● Memory protection: Prevents common memory-related bugs such as accessing
unallocated memory or buffer overflows.
Type-Unsafe Languages
Type-unsafe languages, in contrast, offer few or none of these guarantees:
1. No Strong Type Checking: In type-unsafe languages, you can perform operations on incompatible types without any guarantee that the types are correct. The language won't prevent you from adding incompatible types or accessing invalid memory.
2. Direct Memory Access: Type-unsafe languages often allow direct manipulation of
memory, which can lead to security vulnerabilities like buffer overflows, memory leaks, or
crashes. This is common in low-level languages where performance and control over
memory are prioritized.
3. Pointer Arithmetic: Some type-unsafe languages, like C or assembly, allow pointer
arithmetic and raw memory manipulation, which can lead to accessing uninitialized or
invalid memory locations.
4. Examples:
○ C: While C has types, it doesn’t enforce strong type checking in all scenarios. For
example, casting pointers between incompatible types can lead to undefined
behavior.
○ Assembly language: Type-safety is absent, as operations directly manipulate
memory addresses, with no distinction between different types of data.
Risks: chiefly undefined behavior, memory corruption, and the security vulnerabilities that follow from them, as the example below illustrates.
Example in C (Type-Unsafe):
int x = 5;
char* ptr = (char*)&x;
*ptr = 'A'; // Modifying the memory where 'x' is stored, treating it as a char
This code treats the memory of an int variable as a char, which can result in undefined
behavior, a type of problem that would be prevented in a type-safe language.
Key Differences:
● Type-safe languages help ensure memory safety (no buffer overflows or invalid accesses) and raise errors when type mismatches occur.
● Type-unsafe languages allow direct memory manipulation and pointer arithmetic; type mismatches lead to undefined behavior or incorrect results.
However, type-unsafe languages are still widely used in systems programming (like C), where
low-level control over memory and performance is essential. These languages trade type safety
for finer control, but they require more careful programming and thorough testing to avoid errors.
Conclusion:
The primary distinction is that type-safe languages catch type mismatches and prevent
undefined behavior, while type-unsafe languages allow greater flexibility but at the cost of
potential bugs and unpredictable outcomes. Type safety is a crucial concept that helps
maintain program correctness and security.
1. Performance Efficiency
One of the biggest advantages of type-unsafe languages is runtime performance. Since they
don’t enforce type checks or other safety mechanisms at runtime, they can execute faster.
Every additional type check, dynamic dispatch, or safety guarantee in type-safe languages adds
some overhead.
● C's Efficiency: For instance, C doesn’t check whether an array index is within bounds or
whether pointers point to valid memory locations. This lack of overhead makes C highly
efficient, which is why it’s commonly used for systems programming (like operating
systems or embedded systems) where performance is critical.
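For example, the following C program compiles and runs without any bounds check; its behavior is simply undefined:

#include <stdio.h>

int main() {
    int a[5] = {0};
    a[10] = 42; // out of bounds: C performs no check here
    printf("%d\n", a[10]); // undefined behavior
    return 0;
}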
2. Fine-Grained Control over Memory
Type-unsafe languages give you direct access to memory, enabling fine-grained control over how data is stored and manipulated. This control is particularly valuable when writing code that interacts with hardware or system resources, such as device drivers, memory-mapped I/O, or embedded firmware.
3. Flexibility
In type-unsafe languages, you can bend the rules in ways that type-safe languages would prevent. This flexibility allows you to write code that may be highly optimized or implement complex data structures more easily. For example:
● Type Casting: In C, you can cast variables of one type to another, even if they are not
related. While this can lead to errors if misused, it also offers the ability to work with raw
data or hardware interfaces more flexibly.
● Union Types: C allows the use of unions, where multiple data types can share the same memory space. This is useful in situations where you need to pack different types of data into a smaller memory footprint, as the sketch below shows.
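For instance, a union in C lets an int and a float share the same bytes (a minimal sketch):

#include <stdio.h>

union Word {
    int as_int;
    float as_float;
};

int main() {
    union Word w;
    w.as_float = 1.0f;
    // Reading the other member reinterprets the float's bit pattern
    // as an integer; nothing in the language stops us:
    printf("%d\n", w.as_int);
    return 0;
}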
4. Bare-Metal Programming
Type-unsafe languages like C are commonly used in embedded systems and real-time
systems because they allow you to write bare-metal code — code that runs directly on the
hardware without an intervening layer of abstraction like a virtual machine or runtime system.
This is important for:
● Writing Device Drivers: You can directly interact with hardware registers,
memory-mapped I/O, and interrupt handlers, which would be hard to achieve in
type-safe languages.
● Operating System Kernels: Operating systems like Linux and Windows are often
written in C because it provides the control needed for tasks like memory management,
scheduling, and hardware control.
5. Freedom from Type-System Restrictions
Type-safe languages can sometimes impose restrictions that make certain programs hard or impossible to write. Type-unsafe languages give you the freedom to write programs without being blocked by the limitations of a strict type system. This can be beneficial when you want to:
● Write Low-Level Optimizations: In C, you can optimize code in ways that would be
difficult or impossible in languages with a strict type system. For example, casting
between types or manipulating memory directly allows you to fine-tune performance.
● Handle Generic Data: In C, you can write generic code by using void* pointers or
unions to work with data of any type, without needing to use templates or type-safe
generics.
While these advantages make type-unsafe languages like C appealing in many contexts, they
come with serious trade-offs:
● Safety Risks: The lack of runtime checks can lead to dangerous bugs (e.g., buffer
overflows, memory corruption, null pointer dereferencing), which are harder to detect
and can cause security vulnerabilities.
● Complex Debugging: Without type-safety guarantees, debugging errors caused by
incorrect memory access or type mismatches can be difficult and time-consuming.
● Manual Memory Management: The absence of garbage collection in C means that
developers need to manage memory manually. While this can lead to efficient memory
usage, it also increases the risk of memory leaks and dangling pointers.
Conclusion
In summary, type-unsafe languages offer power and control that are essential for certain low-level programming tasks, especially in performance-critical or hardware-specific applications. However, the risks of type-unsafe languages, such as memory errors and security vulnerabilities, require careful programming practices. Type-safe languages may restrict certain operations, but they offer robustness and security, reducing the likelihood of runtime errors.
So, while both types of languages have their place, the choice of language depends on the
trade-offs between performance needs and safety guarantees.
The implementation of Lua, a dynamically typed language, is a nice place to see these runtime checks in practice. Two functions from the Lua virtual machine illustrate how it handles types at runtime:
1. luaV_objlen: This function handles the # (length) operator for different data types.
2. luaV_arith:
● Attempt to convert to numbers: It attempts to convert both operands (rb and rc) to
numbers using luaV_tonumber. If successful, it performs the arithmetic operation
using luaO_arith.
● Metamethod fallback: If conversion fails (operands are not numbers), it tries to call a
binary metamethod (based on the operation) using call_binTM.
● Error handling for invalid types: If neither conversion nor a metamethod works, it calls
luaG_aritherror to report an error.
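At the language level, these mechanisms are visible in ordinary Lua code:

print("10" + 1) -- coercion: the string is converted to a number (prints 11)

local t = setmetatable({}, { __add = function(a, b) return 42 end })
print(t + 1) -- metamethod fallback: __add is called (prints 42)

print({} + 1) -- no conversion and no metamethod: raises a runtime error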
Key Takeaways:
● Both examples showcase how Lua prioritizes type conversion: First attempts to convert
values to expected types (e.g., numbers for arithmetic).
● These functions showcase the two primary ways Lua handles type mismatches at
runtime:
○ Raising errors: When a type mismatch is encountered and no alternative action
is available (e.g., luaG_typeerror, luaG_aritherror).
○ Metamethod fallback: Allowing custom behavior through metamethods for
specific types (e.g., call_binTM).
Further Exploration: A Brief History of Type Systems
Algol 60 (introduced in 1960) was a major step forward in formalizing types. It introduced:
● Explicit type declarations, requiring variables to have types like INTEGER or REAL,
which improved clarity and safety.
● Scope rules that connected types to block-structured programming, influencing how
types interact with the language's scoping and control structures.
● The idea of type ascription for parameters, making type checking at least partially
formalized.
While Algol didn’t introduce a fully general type system, it laid the groundwork for structured
programming and type-related ideas that heavily influenced later languages. Algol's design was
instrumental in showing how types could contribute to both the design of programming
languages and their formal semantics.
The programming language ML (Meta Language), developed in the 1970s by Robin Milner, was arguably the most influential language in shaping modern type systems. Among other things, ML introduced Hindley-Milner type inference and parametric polymorphism, and its influence is everywhere:
● Static type inference (e.g., in Haskell, Rust, and Kotlin) owes much to ML.
● Dynamic typing (e.g., in Python, Ruby, JavaScript) builds on ideas explored in
Smalltalk.
● Type polymorphism and advanced features like dependent types (e.g., in Coq and
Idris) push type theory into formal verification and proof systems.
Was Pascal the language that introduced this notion of "Strong Typing"?
Oh, no, not really! While Pascal is often associated with "strong typing," the term itself can be
ambiguous and context-dependent. Let me elaborate on Pascal's role and clarify whether it truly
"introduced" strong typing.
Introduced by Niklaus Wirth in 1970, Pascal was designed to teach structured programming
and good software engineering principles. Its type system emphasized:
1. Explicit type declarations: Variables and function parameters required clear type
annotations.
2. Strict type compatibility: Operations were only allowed on values of the same type
unless explicit casting was used.
3. User-defined types: Pascal allowed programmers to define their own types, such as
enumerated types (type Color = (Red, Green, Blue);) or records (akin to
structs), enabling more precise modeling of problems.
4. Type safety in arrays: Array bounds were enforced (at least in theory; some
implementations didn’t), catching errors at compile time or runtime.
These features certainly advanced the concept of strong typing, making Pascal a significant
milestone in type system design. However, Pascal did not "invent" strong typing, as similar
ideas existed in earlier languages like Algol 60.
1. Algol 60’s Influence: Algol 60 already required explicit type declarations and enforced
type-checking at compile time. Pascal inherited much of its type system philosophy from
Algol 60, so it didn’t introduce strong typing in a vacuum.
2. Fortran's Rudimentary Typing: Fortran had basic type safety rules (e.g., distinguishing
between integers and floating-point numbers), but its type system allowed implicit type
conversions, making it weaker than Algol’s.
3. Ada and Later Innovations: Ada, which came after Pascal, took strong typing even
further, introducing more rigorous type constraints, such as type ranges and modular
types.
While Pascal didn’t invent strong typing, it did popularize and refine the concept by:
● Introducing type safety into a language explicitly designed for teaching and structured
programming.
● Formalizing user-defined types, which made it easier for programmers to model
real-world problems with custom, precise types.
In practice, Pascal's strong typing was strict enough to prevent common errors while being
simple enough to teach beginners. This balance contributed to its lasting reputation as a
"strongly typed" language.
Conclusion
Pascal didn’t "introduce" strong typing but played a crucial role in popularizing and advancing
the practical application of strong typing. Its influence can still be seen in languages like
Ada, Modula, and even modern languages like Python (which enforces types dynamically) and
Rust (which enforces types strictly and safely at compile time). Would you like to discuss other
historical aspects of strong typing or Pascal’s legacy in programming?
Niklaus Wirth made numerous contributions to computer science that have had a lasting impact on the field. He designed a family of influential languages, including Euler, Algol W, Pascal, Modula-2, and Oberon; he championed stepwise refinement as a discipline for program design; and he formulated Wirth's Law, which states that "software is getting slower more rapidly than hardware becomes faster," advocating for lean and efficient software design. In recognition of this work, he won the Turing Award in 1984 for his contributions to programming languages and algorithms, and his ideas influenced generations of programmers, shaping modern software engineering and language design.
Wirth’s contributions span education, research, and practice, making him one of the most
influential figures in the history of computer science.
Type Checking
Code: https://github.com/pronesto/DCC053_Public/tree/main/9_TypeChecking
In the last class we talked about dynamic and static typing. Do you
remember what that is?
In programming languages, static and dynamic type systems refer to how and when types are
checked in a program.
In a static type system, types are checked at compile time, before the program runs. The type
of every variable, function, or expression is known and fixed before the program executes. This
system aims to catch type errors early, preventing type-related bugs before execution.
In a dynamic type system, types are checked at runtime. Variables do not have fixed types,
and types can change as the program runs. This system is more flexible but shifts the
responsibility of type safety to runtime.
To summarize:
● Static typing ensures type correctness at compile time but may require more upfront
type information.
● Dynamic typing offers flexibility by allowing type changes at runtime but increases the
risk of runtime errors.
For instance, a program like "3.14 if True else [1, 2, 3]" is valid in a dynamically
typed language because type compatibility is checked at runtime. This means that as long as
the code can be executed (e.g., as part of a control flow statement), it is considered valid. In
contrast, a statically typed language would require the program to pass a type-checking phase
before execution, rejecting programs that don't meet strict type requirements.
So, if you imagine fixing a length for all programs and then checking how many of those
programs are "valid," the dynamic typing system would indeed have a higher density of valid
programs because it imposes fewer constraints. Static typing reduces the set of valid programs
by enforcing stricter rules at the type level, making certain programs invalid that would run in a
dynamically typed setting.
This comparison holds when you define the "density" in terms of valid programs out of all
possible programs of a given length, as dynamic typing allows for more flexibility and fewer
rejections.
Dynamic typing makes it easy to write a function that works with multiple types without needing to define different types or overload functions. Consider the add function below:

def add(a, b):
    return a + b

# Works on numbers (addition):
print(add(1, 2)) # Output: 3
# Or strings (concatenation):
print(add("com", "piler")) # Output: compiler
# Or lists (concatenation):
print(add([1, 2, 3], [4, 5])) # Output: [1, 2, 3, 4, 5]
Explanation:
1. Flexibility: The add() function works for integers, floats, strings, and lists. Python’s
dynamic typing allows the same function to handle different types without modification. In
a statically typed language, you would either:
○ Have to write multiple versions of the function (e.g., using function overloading).
○ Use more complex constructs such as generics or templates to achieve similar
behavior.
2. Concise Code: You don’t need to specify the types of a and b. Python automatically
handles the correct operation based on the types passed in at runtime.
● Rapid Prototyping: You can quickly write a generic function that works across various
types without worrying about type annotations or specific implementations.
● Code Simplicity: No need to declare types explicitly, making the code cleaner and more
focused on functionality rather than type constraints.
This is one of the core strengths of dynamically typed languages, especially in scenarios where
flexibility and rapid development are prioritized over the strict type safety provided by static
typing.
Static typing, on the other hand, has its own advantages:
1. Early Error Detection
In statically typed languages, type errors are caught at compile time, before the program runs. This prevents many runtime errors related to type mismatches.
● Example: In a statically typed language like Java, if you try to add a string to a number
without explicitly converting one of them, the compiler will throw an error. This prevents
potential crashes at runtime, making the system more reliable.
2. Performance Optimization
Statically typed languages allow compilers to optimize code more efficiently. Since the types of
all variables and functions are known ahead of time, the compiler can generate optimized
machine code for those types.
● Example: C++ and Rust are known for their high performance partly because they are
statically typed, allowing the compiler to produce highly efficient executables with
minimal runtime overhead.
3. Better Tooling
IDEs and editors can provide better autocomplete suggestions, refactoring tools, and code
navigation (e.g., "jump to definition") in statically typed languages, as the type information is
available during development.
● Example: In statically typed languages like Java or C#, when you start typing the name
of a variable, method, or class, the IDE can automatically suggest options that match the
correct types, leading to faster and more accurate coding.
4. Self-Documenting Code
Type annotations also serve as documentation.
● Example:

int add(int a, int b) {
    return a + b;
}
● In this case, anyone reading the code knows that add() takes two integers and returns
an integer. There is no ambiguity about the expected types.
5. Refactoring Safety
Refactoring (changing existing code) is safer in statically typed languages because the compiler
will catch any inconsistencies in types across the codebase. When you modify a function's
signature or change the type of a variable, all incorrect uses will be flagged during compilation.
6. Predictable Behavior
Since types are enforced and known at compile time, you can have more confidence that your
code behaves as expected, reducing unexpected type coercion or runtime errors from type
mismatches.
But, how can static typing work? Knowing if a variable will ever be used as
an integer should be undecidable, as per Rice's Theorem, isn't it?
While Rice's Theorem proves that we can't decide all non-trivial properties of programs (like
whether a variable is always used as an integer), static typing works by:
1. Limiting the scope of the problem: It checks the structure of the code and uses local
information rather than trying to analyze all possible executions.
2. Conservatively approximating the type behavior of the program, potentially rejecting
valid programs for the sake of soundness.
3. Imposing restrictions that make type checking decidable, avoiding the general case
where undecidability applies.
By focusing on syntactic structure and conservative type rules, static typing provides a
decidable, sound system that catches many common errors at compile time, without needing to
solve the general undecidability of arbitrary program properties.
So, if Rice's Theorem still applies, then there are programs that are type
safe, but that we would never know, right?
Yes, Rice's Theorem still applies, and the undecidability that it implies means there are
programs where it's impossible to know at compile time whether they are truly "type safe" in all
cases. However, static type systems bypass this issue by making conservative checks and not
analyzing certain complex or dynamic behaviors.
Goldbach's Conjecture claims that every even natural number greater than 2 can be
expressed as the sum of two primes. This is a mathematical statement that hasn't been
proven or disproven yet, which makes it a great analogy for undecidable problems in computer
science. In this sense, consider the program below:
#include <stdio.h>
#include <stdbool.h>

bool is_sum_of_two_primes(int n); // hypothetical checker (body omitted)

int zero_is_int() {
    int x = 0;
    for (int n = 4; n > 0; n += 2) { // search for a Goldbach counterexample
        if (!is_sum_of_two_primes(n)) {
            x = "zero"; // type error: reached only if the conjecture is false
            break;
        }
    }
    return x;
}

int main() {
    int result = zero_is_int();
    printf("Result: %d\n", result);
    return 0;
}
● Rice's Theorem tells us that non-trivial properties of programs (e.g., "will this program
ever assign a string to an integer?") are undecidable.
● In this case, determining whether x = "zero" will ever execute requires solving a
conjecture that hasn't been proven or disproven.
● The compiler can't know the behavior of all possible inputs (or mathematical truths)
ahead of time, so static type systems rely on conservative rules that catch these
mismatches before runtime, forcing the programmer to handle them explicitly.
This example demonstrates how static type systems prevent undecidable runtime behaviors by
rejecting programs that violate type safety—even when that violation depends on an
undecidable property, like the truth of Goldbach’s Conjecture. The type system is inherently
conservative, preferring to prevent unsafe behavior (even if the condition under which it
happens is undecidable).
1. Conservative Approximation
A static type system assigns types based on the structure of the program without needing to
evaluate it. The type checker will use the declared types (or inferred types in some cases) and
validate that operations between them are safe according to the type rules of the language.
● Static type checkers don't need to evaluate the full behavior of the program or explore
every possible execution path (which would indeed be undecidable).
● They instead use type annotations (or inferred types) to conservatively reason about
how values and expressions can be used.
For example, in Java, when you declare a variable of type int, the static type checker will
ensure that this variable is only used in ways compatible with integers throughout the code. If it
detects an attempt to use it in an incompatible way (such as adding it to a string without proper
conversion), it flags it as a type error.
2. Type Soundness
A sound type system guarantees that if the program type-checks successfully, certain errors
(like type mismatches) will not occur during runtime. However, to achieve this, type systems
impose restrictions that ensure correctness without attempting to solve every possible edge
case.
● The type system doesn't need to account for all possible runtime behaviors—just
whether the program, as written, adheres to its type rules.
● This ensures a safe subset of the program’s behavior can be guaranteed, while other
behaviors that the type system cannot handle may either result in compile-time errors or
be prohibited altogether.
3. Decidable Rules
Static typing defines a subset of rules that are decidable, meaning that even if we can't predict the full runtime behavior of the program (as per Rice's Theorem), we can still enforce rules on the structure of the code.
● For instance, in statically typed languages like Java or C++, you can define that integers
are only added to other integers and arrays are indexed with integers, and the compiler
will ensure this is true based on how you’ve written the code.
● While this may limit some flexibility (since it can't reason about all possible runtime
conditions), it guarantees that, within its constraints, the program will be type-safe.
4. Type Annotations and Inference
To work around the limitations imposed by undecidability, many languages use explicit type annotations (or allow type inference) to limit what kinds of values a variable can hold or the operations it can participate in.
● Explicit annotations: Programmers explicitly declare the types, which makes type
checking a local, syntactic property. The compiler does not need to analyze all possible
uses of a variable; it just ensures that every use matches the declared type.
● Type inference: Some languages (e.g., Haskell or Rust) use sophisticated inference
algorithms to deduce the type of expressions without explicit annotations, but these
systems still work within the boundaries of decidable, sound rules.
5. Imposed Restrictions
Static type systems also make conservative assumptions that can sometimes reject valid
programs. This happens because they are designed to err on the side of safety. Since we can't
decide all possible program behaviors, the type system will flag some programs as type errors
even though they would never fail at runtime. This trade-off is made to ensure the soundness of
the system.
● For example, in Java, if you declare a method to return an int, the compiler won't allow
it to return a string, even if that particular code path might never be executed.
What is the difference between type checking and type inference?
1. Type Checking
Type checking is the process where the compiler (or interpreter) verifies that the types of
variables, functions, and expressions are used consistently according to the rules defined in the
programming language.
● Explicit Types: In many statically typed languages, you explicitly declare the types of
variables and functions. The compiler then checks that these declared types are used
consistently throughout the program.
● Example: In languages like Java, C, or C++, you explicitly declare variable types:

int x = 10;
● Role of Type Checking: In this case, type checking ensures that the value assigned to
x matches its declared type (int in this example). Any violations (such as attempting to
assign a string to an integer variable) will result in compile-time errors.
● Proactive Error Checking: Since the types are declared up front, errors related to type
mismatches are caught early, often during compilation.
2. Type Inference
Type inference, on the other hand, allows the compiler to deduce the types of variables and
expressions based on the context in which they are used. In this system, you don't always have
to declare types explicitly; the compiler infers them for you.
● Implicit Types: In languages with type inference, you may not need to explicitly declare
variable types. Instead, the compiler looks at how a variable is used and assigns a type
accordingly.
● Example: In languages like Haskell, ML, or Rust, type inference is commonly used:

let x = 10; // Rust: the compiler infers that x is an integer (i32)
● Role of Type Inference: Here, the compiler looks at the assignment x = 10 and knows
from the value 10 that x must be an integer. The type is inferred without the programmer
needing to specify it. The type checking still happens in the background, ensuring
consistency, but the programmer doesn’t need to write explicit types.
● Reduction of Boilerplate: Type inference reduces the need for repetitive type
declarations while still ensuring type safety. The programmer can write less code, but the
type system remains just as strict.
Key differences:
● Type declaration: with plain type checking, types are explicitly declared by the programmer; with type inference, they are deduced automatically by the compiler.
● Programmer effort: explicit declarations require more effort to annotate; inference requires less, as types are derived without annotations.
● Error detection: in both cases, errors are caught when the declared (or inferred) types are used inconsistently.
● Type checking and type inference are not mutually exclusive. In fact, many modern
languages with static typing use a combination of both.
○ For example, Rust and Scala use type inference to reduce verbosity while
ensuring that every variable still has a well-defined type at compile time.
○ In Java, some variation of type inference (forward only) has been introduced in
recent versions (e.g., with var in Java 10) to reduce the amount of boilerplate
code, though Java still performs strict type checking based on the inferred or
declared types.
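For instance, with var (available for local variables since Java 10):

import java.util.ArrayList;

class VarDemo {
    public static void main(String[] args) {
        var x = 10; // x is inferred as int
        var names = new ArrayList<String>(); // inferred as ArrayList<String>
        names.add("Ada");
        System.out.println(x + " " + names);
    }
}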
Both type checking and type inference are mechanisms to implement static typing. Type
checking typically involves explicit type declarations, ensuring correctness via compile-time
verification. Type inference deduces types from the context, offering more flexibility without
sacrificing type safety. They complement each other in modern statically typed languages,
enabling a balance between strictness and convenience.
● Type inference allows the compiler to deduce types when they are not explicitly stated,
reducing the need for verbose type declarations.
● Type checking ensures that inferred types, along with any explicitly declared types, are
used consistently and correctly throughout the program.
In Rust, for instance, type inference is used extensively, but the type checker verifies their
consistency. Let’s expand on the previous example:
fn add_one(n: i32) -> i32 {
    n + 1
}

fn main() {
    let x = 10;         // inferred as i32
    let y: f64 = 3.14;  // explicitly declared as f64
    let z = add_one(x); // inferred from add_one's return type
    println!("{} {} {}", x, y, z);
}
In this example:
1. x: The type of x is inferred to be i32 because it is assigned the integer literal 10, which
in Rust defaults to i32.
2. y: The type of y is explicitly declared as f64 (a floating-point number).
3. z: The type of z is inferred based on the return type of the add_one function, which
explicitly declares its parameter and return type as i32.
Even though some variables (like x and z) do not have explicit type annotations, type checking
ensures that all types are used consistently. For example, the type checker will prevent us from
accidentally assigning y (a f64) to x (an inferred i32).
That's a lot of info. Can you summarize why one benefits from combining
type checking and inference?
● Reducing Boilerplate: In many statically typed languages, requiring type annotations
everywhere can be cumbersome, especially for types that can easily be inferred from
context. Type inference reduces the need for these redundant annotations.
● Early Error Detection: Despite not needing explicit type annotations everywhere, the
compiler knows the type of every expression at compile-time due to inference. This
enables early error detection, just like in traditional static type checking.
● Flexibility and Convenience: Type inference offers flexibility for programmers, allowing
them to write cleaner, more concise code without losing the benefits of static typing.
For type checking to work, the user needs to have a way to specify types.
How is this done?
For type checking to work, the user needs to have a way to specify types in the program. This
can be done either explicitly (where the programmer manually declares the types) or implicitly
(through type inference, where the compiler deduces the types automatically), but in both cases,
the type system must have a way to assign and enforce types.
In many statically typed languages, the programmer explicitly specifies the types of variables,
function parameters, and return values. These declarations provide clear instructions for the
type checker to ensure that variables and expressions are used consistently according to their
declared types.
● Example (Java):

int x = 10;
String name = "Ada"; // illustrative declarations
● In this case, the types (int, String, etc.) are explicitly stated, and the type checker can
verify at compile time that all operations involving these variables are type-safe.
Even if types are not explicitly written, the language must have some mechanism (e.g., type
inference) to infer the types from the context of the code. The user may not need to write types
explicitly, but the compiler still deduces types behind the scenes, so the type system can
enforce type safety.
● Example (Rust):

let x = 10;       // inferred as i32
let name = "Ada"; // inferred as &str
● Although the programmer didn’t explicitly state the types of x or name, the compiler
infers the types, and the type checker can then verify that the code uses these types
correctly.
For type checking to function correctly, the type system must know the types of:
1. Variables: Whether explicitly declared or inferred, the compiler must know what type a
variable holds to perform type checking.
2. Function Signatures: The types of function parameters and return values are crucial for
ensuring that functions are called and used correctly. In some languages, function types
can be inferred from the body of the function.
3. Expressions and Operations: The type checker verifies that the types involved in
operations (e.g., arithmetic, comparisons, method calls) are compatible.
In many modern languages, you can mix both explicit and implicit types:
● Explicit types for function signatures and complex logic where clarity is important.
● Inferred types for local variables where the type is obvious from context.
func add(a: Int, b: Int) -> Int { // Type annotations for parameters and return type
    let sum = a + b // The type of the local variable sum is inferred as Int
    return sum
}
let result = add(a: 5, b: 3) // The type of result is inferred as Int
Here, the user specifies types explicitly for the function parameters (a, b) and return value
(Int), while relying on inferred types within the function body (the compiler infers that sum is
also an Int).
If a language doesn't have a way to specify types (whether explicitly or implicitly), it would be a
dynamically typed language. In such cases, the language doesn't enforce type safety at
compile time. Instead, types are checked at runtime, meaning errors related to type mismatches
only occur when the program is running, not during compilation.
● Example (Python):

x = 10       # no declared type; x currently holds an int
x = "hello"  # fine before execution; x now holds a str
In dynamically typed languages, there’s no compile-time type checking, so the user doesn’t
have to (and typically cannot) specify types in a formal way. However, this can lead to runtime
errors when incompatible types are used.
Notice that the only difference from our language of expressions thus far is that let bindings now have an extra parameter: the type of the bound variable.
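A sketch of the new class (the attribute name tp_var is illustrative; the remaining classes are unchanged):

class Let(Expression):
    def __init__(self, identifier, tp_var, exp_def, exp_body):
        self.identifier = identifier
        self.tp_var = tp_var # the declared type of the bound variable
        self.exp_def = exp_def
        self.exp_body = exp_body
    def accept(self, visitor, arg):
        return visitor.visit_let(self, arg)

The typing rules for this extended language can then be stated as follows: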
1. TAdd (Addition): It checks that both expressions e0 and e1 are of type int, and the
result of their addition is also int.
2. TLth (Less-than comparison): It ensures that both operands of the less-than
comparison are integers, and the result is a boolean.
3. TAnd (Logical AND): It checks that both expressions e0 and e1 are booleans, and the
result of their logical conjunction is also a boolean.
4. TIf (Conditional expression): The condition e0 must be a boolean, and both branches
(e1 and e2) must have the same type t, which becomes the type of the whole
conditional expression.
5. TLet (Let-binding): The expression e0 should have type t0, and within the scope of the
let, name is assigned type t0. The body of the let (expression e1) should have type
t1, and this becomes the overall type.
6. TVar (Variable typing): It states that if the variable name is in the context Σ and has type
t, then name can be used as type t.
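Rendered in standard inference-rule notation, four of these rules read as follows (a sketch, with Σ as the typing context):

\frac{\Sigma \vdash e_0 : \mathtt{int} \quad \Sigma \vdash e_1 : \mathtt{int}}
     {\Sigma \vdash e_0 + e_1 : \mathtt{int}}\ (\text{TAdd})
\qquad
\frac{\Sigma \vdash e_0 : \mathtt{bool} \quad \Sigma \vdash e_1 : t \quad \Sigma \vdash e_2 : t}
     {\Sigma \vdash \mathtt{if}\ e_0\ \mathtt{then}\ e_1\ \mathtt{else}\ e_2 : t}\ (\text{TIf})

\frac{\Sigma \vdash e_0 : t_0 \quad \Sigma, name{:}t_0 \vdash e_1 : t_1}
     {\Sigma \vdash \mathtt{let}\ name{:}t_0 = e_0\ \mathtt{in}\ e_1 : t_1}\ (\text{TLet})
\qquad
\frac{name{:}t \in \Sigma}{\Sigma \vdash name : t}\ (\text{TVar})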
Can you implement these rules as a Python Visitor? Which kind of tests
would you have for your implementation?
We can start with the tests. For instance, consider the tests below:
# Example Usage
env = {'x': int, 'y': bool}
checker = TypeChecker()
# Example expressions:
expr = Add(Var('x'), Num(5)) # x + 5
print(expr.accept(checker, env)) # Should return int
# Example expressions:
expr = And(Var('y'), Bln(True)) # y and True
print(expr.accept(checker, env)) # Should return bool
We'll assume a context (env) which maps variable names to types (similar to the Σ in your
typing rules).
class TypeChecker:
    def visit_var(self, var_exp, env):
        # Look up the variable's type in the environment
        if var_exp.identifier in env:
            return env[var_exp.identifier]
        else:
            raise TypeError(f"Variable {var_exp.identifier} is not defined")
Explanation:
● TypeChecker: Implements the type checking logic using visitor methods for each
expression type.
● Environment (env): This is a dictionary that keeps track of variable bindings and their
associated types.
● Visitor Methods: Each visitor method checks the types of sub-expressions and ensures
that they conform to the expected types (e.g., addition requires two ints, if requires a
bool condition, etc.).
This setup should allow you to type-check the expressions in your language according to the
rules you provided.
Evaluation is not too different from type checking. Can you see how similar
these visitors are?
The structure for program evaluation and type checking is indeed quite similar. Both
approaches:
1. Traverse the abstract syntax tree (AST) recursively using a visitor pattern.
2. Visit each node type (like Var, Num, Add, IfThenElse, etc.) to apply specific logic.
3. Rely on an environment (env) to track variable bindings, whether it's values (for
evaluation) or types (for type checking).
Similarities:
● AST Traversal: Both implementations walk the AST and process expressions in the
same order.
● Environment Handling: In both cases, variables are looked up in an environment
(env)—during evaluation, this contains variable values; during type checking, it contains
types.
● Binary Operations: For binary operations like Add or And, both implementations ensure
that the two operands conform to specific rules—either they must be integers/booleans
for evaluation or types must match for type checking.
● Conditionals: In IfThenElse, both visitors check the condition and then process the
branches, although the type checker ensures type consistency while the evaluator
computes the actual value.
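Placing the two visit_add methods side by side makes the resemblance concrete (a sketch):

# In the evaluator, visit_add computes a value:
def visit_add(self, exp, env):
    return exp.left.accept(self, env) + exp.right.accept(self, env)

# In the type checker, visit_add computes (and checks) a type:
def visit_add(self, exp, env):
    if exp.left.accept(self, env) is not int or \
            exp.right.accept(self, env) is not int:
        raise TypeError("Addition requires two int operands")
    return int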
Some programs still run fine under the dynamic type verifier, but are ruled
out by the static type verifier. Can you give examples?
See this comparison below:
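A sketch, using the classes defined earlier (and assuming the full VisitorEval and TypeChecker visitors):

e = IfThenElse(Bln(True), Num(0), Bln(False))
print(e.accept(VisitorEval(), {})) # dynamic evaluation: prints 0
print(e.accept(TypeChecker(), {})) # static checking: raises a TypeError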
Here, the dynamic evaluation runs fine because the interpreter only asks whether the if condition is True and then selects the appropriate branch, without caring that the two branches have different types. The static checker, however, fails, because it enforces that both branches of the conditional have the same type; it rules the program out due to the mismatch between the int and bool branches.
This discrepancy occurs because static type checking aims to prevent type errors in any execution path, while dynamic type checking only raises errors when a particular path is actually followed. This is why some programs pass the dynamic check but fail the static one.
● Dynamic type checker: Evaluates expressions at runtime and checks types as needed.
If a branch is never executed, its type mismatch won't cause issues (e.g., IfThenElse
with mismatched types in different branches).
● Static type checker: Enforces type rules at compile time, meaning it checks both
branches of an IfThenElse even if one is never executed. This makes the static type
checker stricter, as it ensures type correctness for all possible execution paths, even if
some are never taken.
So, is that why you said that dynamic typing allows more programs to run
than static typing?
Yes, that is it! Dynamic typing allows more programs to run because it performs type checks at
runtime, evaluating only the code that is actually executed. This means that type mismatches in
parts of the code that aren't run do not cause issues, which can lead to greater flexibility.
For example, in a dynamically-typed language or checker, the following program would run fine
as long as the branch with the type error isn't executed:
if condition:
    x = 5         # int
else:
    x = "string"  # str
If condition is True, the program never evaluates the second branch, so the type mismatch
doesn't matter. The dynamic checker only cares about the types when the values are actually
used.
On the other hand, static typing checks all possible branches of code at compile time,
regardless of which branches are executed. This ensures stricter guarantees about type
correctness, but it also means that some programs that are valid in a dynamic system would be
rejected by a static type checker. In the example above, the static type checker would reject the
program due to the type mismatch in the else branch, even if that branch is never executed.
Summary:
● Dynamic typing: More flexible and allows more programs to run because it only checks
types when they are needed at runtime. This can sometimes allow type inconsistencies
that never manifest in execution.
● Static typing: Stricter and catches more potential type errors at compile time. This
ensures type safety for all execution paths but also prevents some programs from
running due to type mismatches, even if those mismatches are not actually problematic
during execution.
Dynamic typing's flexibility comes at the cost of potentially discovering errors only at runtime,
whereas static typing's rigidity prevents certain kinds of programs from running at all.
1. Progress: This property ensures that a well-typed program is never "stuck": it can
always make progress in its computation. This means that any well-typed term
(piece of code) is either a value (a final result that requires no further computation) or it can
take a step forward in its evaluation (i.e., reduce to another term). This property
guarantees that, as long as a program type-checks, it won't encounter runtime errors
due to unhandled operations (like trying to apply a function to something that isn't a
function).
2. Preservation: Also known as the "type preservation" or "type safety" property,
preservation ensures that if a well-typed term evaluates to another term, the resulting
term is still well-typed and has the same type as the original. In other words, during
evaluation, types are preserved — types don’t change unexpectedly as the program
executes. This property makes sure that types stay consistent throughout the program’s
execution, so if a function returns an integer, it will keep returning an integer through
every evaluation step.
Together, progress and preservation form the foundation of a sound type system. They ensure
that type-checked programs can execute safely without certain kinds of runtime errors, providing
guarantees about both the program’s ability to continue running and its consistency in type
behavior. The figure below states these two theorems, together with examples of evaluation and
typing rules:
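In the notation of those rules, the two theorems read:

Progress: if Σ⊢e:t, then either e is a value, or there exists a v such that e→v.
Preservation: if Σ⊢e:t and e→v, then Σ⊢v:t.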
1. Goal (Progress Statement): The goal is to show that if Σ⊢e:t, then either e is a value or
there exists a v such that e→v. For the if-then-else expression, this means proving
that if the expression type-checks, it can always take a step forward in evaluation.
2. Inductive Hypothesis: Since this proof relies on induction, the dashed arrows labeled
"Induction" suggest applying the inductive hypothesis to sub-expressions:
○ By induction, we assume that the sub-expressions e0, e1, and e2 satisfy
progress. This means each one can either evaluate to a value or take a step
forward.
3. Case Analysis: The evaluation rules (IfTrue and IfFalse) for the if-then-else
construct imply that:
○ If e0 evaluates to true, then the entire expression reduces to e1.
○ If e0 evaluates to false, then it reduces to e2.
4. The proof diagram suggests that if we apply progress to e0, we can show that it must
eventually evaluate to either true or false (since it's a boolean). In either case, the
whole if-then-else expression can take a step forward by reducing to e1 or e2.
5. Conclusion: Once e0 is evaluated, the entire if-then-else expression can proceed
according to the IfTrue or IfFalse rules, achieving progress.
In this approach, you use the structure of the derivation tree, which is formed by applying typing
rules, to prove a property (like progress) for each expression that type-checks. Here’s a more
detailed breakdown of what this means in the context of your proof:
1. Derivation Tree: The derivation tree is constructed by applying typing rules to build up a
proof that a given expression has a certain type. For an if-then-else expression, the
tree would start with the Tif typing rule, which in turn relies on the types of the
sub-expressions e0, e1, and e2.
2. Induction on the Tree Structure: To prove the progress property, you proceed by
induction on the structure of this derivation tree. Each node in the tree corresponds to a
different typing rule, and you handle each rule as a separate case in the proof. For
example:
○ If the derivation ends with the Tif rule, then you examine the sub-expressions
e0, e1, and e2.
○ You then apply the inductive hypothesis to these sub-expressions, assuming
progress holds for each one individually.
3. Using Typing Rules as Cases: Each rule applied in the derivation tree corresponds to a
case in the proof. By handling each rule and using induction on the derivation of
sub-expressions, you show that if the overall expression is well-typed, it can either take a
step in evaluation or is a value.
In summary, induction on the derivation tree is a method of proof where you use the structure of
the type derivation (proof tree) to inductively prove a property for all well-typed expressions. It’s
a standard technique in proving properties like progress and preservation in type systems.
The diagram outlines a proof of preservation for the if-then-else expression. Here’s an
explanation:
Preservation is a property that says: if an expression e has a type t (written Σ⊢e:t), and e
evaluates to v (written e→v), then the resulting value v also has type t (i.e., Σ⊢v:t). This
ensures that evaluation doesn't "break" the type of an expression: any step taken by a
well-typed expression produces a new expression of the same type.
Proof Structure
To prove preservation, we do induction on the derivation tree of the typing judgment. The proof
involves a case analysis on the last typing rule used in the derivation. In this case, we’re
focusing on the if-then-else expression, specifically the typing rule Tif.
Summary
The diagram walks through a preservation proof for the if-then-else construct, doing a
case analysis on the last typing rule used in the derivation. This approach confirms that
well-typed if-then-else expressions maintain their type after evaluation, satisfying
preservation.
Type safety (which often implies soundness, i.e., progress and preservation) has a
clear, formal meaning. A type-safe language prevents certain kinds of runtime errors by
ensuring that programs respect their type annotations throughout execution. This is a property
that can be precisely stated and proven, making it a more useful term in formal discussions
about type systems.
So, is it true that C does not meet these properties of progress and
preservation?
Yes, it’s generally true that C does not meet either the progress or preservation properties in
the same way that type-safe languages do. Here’s why:
Progress
● Progress says that a well-typed program is either a value or can take a step in
evaluation without getting "stuck" (i.e., encountering an error that halts execution
unexpectedly).
● In C, a program can be well-typed (according to C’s type system) but still get "stuck" at
runtime due to issues like null pointer dereferencing, buffer overflows, uninitialized
variables, or out-of-bounds memory access. These issues often lead to undefined
behavior, crashes, or segmentation faults, which violate the progress property.
Preservation
● Preservation means that if a program is well-typed and it takes a step in evaluation, the
result is still well-typed.
● C does not strictly maintain preservation because of unsafe type casts and pointer
arithmetic. For example, a void* pointer can be cast to any other type of pointer, and
C’s type system will not prevent you from treating incompatible data as if it were a
different type. This can lead to type inconsistencies during execution, which violate
preservation.
● C was designed with a focus on performance and low-level memory control rather than
type safety. This means that the language allows many operations that sacrifice safety
(and soundness) in favor of giving the programmer more direct control over the
hardware.
● These design choices allow C to be extremely powerful and efficient, but they also mean
that the language doesn’t enforce the same guarantees about types and memory safety
that sound type systems do.
In summary, C’s type system is intentionally permissive and does not enforce progress or
preservation in the way that a sound type system does. This is why C is often described as an
unsafe language—its type system doesn’t prevent certain kinds of runtime errors that sound
type systems are designed to avoid.
And are there languages whose type system has been formally proven to
be sound?
In certain languages like Standard ML (specifically, implementations like SML/NJ) and Haskell,
the type system, or at least substantial core fragments of it, has been formally proven to satisfy
progress and preservation. These properties make these languages type-safe and provide
strong guarantees that well-typed programs will not encounter certain runtime errors.
Researchers have formally defined the semantics of languages like Standard ML and proved
the type safety theorems for their type systems. This involves constructing formal models that
capture the language's syntax, type rules, and evaluation rules. Using these models, they can
prove that if a program type-checks, it will always satisfy progress and preservation:
1. Progress: A well-typed Standard ML program will always either be a value (i.e., fully
evaluated) or be able to take a well-defined step in its execution without getting "stuck."
2. Preservation: During evaluation, a well-typed Standard ML program maintains its type
correctness. Each step of evaluation preserves the types, ensuring that the program
remains well-typed throughout its execution.
Languages like Standard ML and Haskell were designed with type safety as a core goal. Their
type systems are expressive but restrictive enough to prevent unsafe operations, like arbitrary
pointer manipulation or unchecked casting. This is one of the reasons these languages can
guarantee certain kinds of safety at runtime. In practice, these guarantees matter for:
● Reliability: These guarantees mean that many classes of errors (such as null pointer
dereferences or type mismatches) simply cannot occur in well-typed programs.
● Optimization: Because of these strict guarantees, compilers for type-safe languages
can make certain optimizations that wouldn’t be safe in languages with weaker type
systems.
● High-Assurance Systems: In areas where reliability is critical (e.g., in financial software
or formal verification), languages with proven type safety can offer strong assurances to
developers and users alike.
In addition to Standard ML (SML/NJ) and Haskell, other languages like Coq, Idris, F#, and
OCaml also have robust type systems that support various forms of type safety, often extending
these guarantees through additional features like dependent types (in the case of Coq and
Idris).
Summary
Yes, it’s true that certain languages—especially those rooted in formal logic and type
theory—have been shown to meet the progress and preservation properties through formal
proofs. Standard ML and its implementation SML/NJ are classic examples, and their formally
proven type safety is a significant reason for their reliability in handling type-related errors.
The process involves creating a formal model of the programming language. This model
includes the language's syntax, its typing rules, and its evaluation (operational semantics) rules.
Once the language is defined, we use induction and structural reasoning on the syntax and
evaluation rules to prove properties like progress and preservation. This can be very tedious by
hand, so proof assistants help by checking each logical step and ensuring there are no
mistakes.
In particular, Coq lets you:
● Define the syntax and typing rules: Coq has a built-in language to define types and
functions, so you can model the type system and evaluation rules of a language
precisely.
● Express formal properties: You can formally state the properties you want to prove
(e.g., "for all expressions e of type t, either e is a value or it can take a step").
● Automate proofs with tactics: Coq provides tactics (proof strategies) that help
automate some parts of the proof, especially those that involve routine reasoning or
induction.
● Ensure correctness: Coq’s type-checking ensures that only correct proofs are
accepted. If there’s an error or a gap in the reasoning, Coq will reject the proof.
To prove progress, for instance, one would typically:
1. Perform induction on the typing derivation of the expression e.
2. Show that each possible form of e can either take a step or is already a value,
depending on the evaluation rules for that expression.
Proofs like this are broken down into cases, and Coq can verify each case individually.
A well-known example of this approach is the Software Foundations series, which includes
formal proofs of type safety for a small functional language. The proofs in this book are
developed entirely in Coq, and they cover progress and preservation among other properties.
This resource is widely used for learning how to formalize and prove properties of programming
languages using Coq.
Using Coq to prove properties of type systems and programming languages is beneficial
precisely because of this rigor. In short, Coq provides a framework where you can model a
language precisely, express formal properties, and rigorously prove those properties with
machine-checked accuracy.
Historical Background
1. Early Foundations:
○ The foundational ideas behind type soundness and the relationship between
types and program behavior can be traced back to early studies of typed lambda
calculi, such as Alonzo Church's work on the lambda calculus in the 1930s and
the simply typed lambda calculus in 1940.
○ Later, logicians like Haskell Curry and William Howard (of the Curry-Howard
correspondence fame) explored the correspondence between types, proofs, and
program correctness.
○ Robin Milner’s work in the 1970s on polymorphic type systems (notably in ML)
also emphasized the role of types in preventing runtime errors, laying
groundwork for the notion of type safety.
2. Wright & Felleisen (1992):
○ Wright and Felleisen’s syntactic approach departed from earlier, more
semantic-based techniques for proving type soundness (such as those using
denotational semantics).
○ They introduced the progress and preservation theorems as distinct, separable
components of proving soundness.
■ Progress handles the dynamic behavior of terms, ensuring evaluation
doesn't get stuck.
■ Preservation formalizes the static guarantees of types through
evaluation.
○ The key insight in their paper was to show that type soundness could be reduced
to these two components in a way that was simple, modular, and accessible.
3. Impact on the Field:
○ Wright and Felleisen’s paper is one of the most cited works in programming
languages because it provided a unified proof methodology that could be
applied across a wide variety of type systems, from simple typed lambda calculi
to more complex type systems (e.g., polymorphism, subtyping).
○ Their framework became a cornerstone for the study of type theory in
programming languages and influenced numerous textbooks, including Pierce's
"Types and Programming Languages", which solidified progress and
preservation as the standard terminology.
Why these names?
● Wright and Felleisen likely chose these terms to align with the conceptual essence of
what the proofs represent:
○ Progress captures the forward motion of evaluation (i.e., programs keep
"progressing" towards completion).
○ Preservation emphasizes the idea of maintaining type-correctness (i.e., type
invariants are "preserved" through execution).
Before their formalization, there wasn't a canonical terminology or proof framework for type
soundness, though similar ideas were implicit in earlier works on operational semantics and
type systems.
The success of Wright and Felleisen's approach lies in its syntactic nature: the proofs work
directly on program terms and typing derivations, using operational semantics, instead of
requiring heavyweight denotational models.
This Felleisen guy did quite a lot for computer science, especially for
programming languages, right?
Oh, yes! Matthias Felleisen is a highly influential figure in computer science, particularly in the
fields of programming languages, software engineering, and education. His work spans a wide
range of topics, from foundational research in programming language theory to the development
of tools and methodologies that have had a practical impact on how programming is taught and
practiced.
Key Contributions
● Reduction semantics with evaluation contexts (the Felleisen-Hieb style of operational
semantics), now a standard tool for specifying programming languages.
● The syntactic approach to type soundness (with Andrew Wright), discussed above.
● The Racket programming language (formerly PLT Scheme) and the DrRacket
programming environment.
● The textbook How to Design Programs and its associated curriculum, which reshaped
how introductory programming is taught.
Legacy
Matthias Felleisen's legacy includes:
● Bridging the gap between theoretical computer science and practical software
development.
● Innovating in both theory and practice, making his work accessible and impactful to
students, educators, researchers, and practitioners.
● A commitment to teaching and mentoring that has inspired generations of computer
scientists.
Type Inference
Code: https://github.com/pronesto/DCC053_Public/tree/main/10_TypeInference
In SML/NJ, we don't write types. Does it use the same type system as
Python?
In SML/NJ (Standard ML of New Jersey), type inference is used, which is why you don’t need
to write type annotations, yet the language is still statically typed. This means that types are
determined at compile time, without requiring you to explicitly annotate every variable or
function.
● Type inference automatically deduces the types of variables, functions, and expressions
during compilation.
● The compiler analyzes how you use each value in the program to infer its type and
ensures type correctness without you specifying it.
● Python uses dynamic typing. In Python, types are associated with values at runtime,
and you can freely change the type of a variable by assigning it a different value.
● In contrast, SML/NJ is statically typed through type inference, which means that types
are determined and checked at compile time. Once a type is inferred, it’s fixed for that
variable or function throughout its scope.
Example in SML/NJ:
fun add(x, y) = x + y;
Even without annotations, the compiler infers add : int * int -> int, because the +
operator defaults to integer addition.
In summary:
● SML/NJ uses type inference to automatically deduce types but still enforces static
typing (types are fixed at compile time).
● Python is dynamically typed, meaning types are determined at runtime and can change
throughout the program’s execution.
So yes, SML/NJ is not like Python in this sense—it's statically typed, but thanks to type
inference, you don't need to write explicit type annotations!
Can you give me an example that shows the difference between SML/NJ
and Python?
Let’s use a conditional expression example to highlight the difference between Python (which
allows dynamic typing) and SML/NJ (which enforces static typing with type inference).
def dynamic_example(condition):
    if condition:
        return 10     # int
    else:
        return "ten"  # str

print(dynamic_example(True))   # Outputs: 10
print(dynamic_example(False))  # Outputs: ten

In Python:
● This code runs fine. Each call simply returns whichever branch is selected, so the same
function can return an int in one call and a str in another.

In SML/NJ, the analogous function would be:

fun dynamic_example condition =
    if condition then 10 else "ten";

● This code would fail to compile. Why? Because SML/NJ enforces that both branches of
the conditional must return the same type.
● Here, the if branch returns an int (10), while the else branch returns a string
("ten"), leading to a type mismatch error.
● SML/NJ performs static type checking and will not allow this mixing of types.
If you tried to compile the SML/NJ code above, you might see an error like:
Error: types of then and else branches do not agree [int tycon mismatch]
then branch: int
else branch: string
Key Difference:
● Python allows the return type to be dynamically determined at runtime. Each function
call can return different types, and the language doesn't enforce consistency.
● SML/NJ requires that the types be consistent at compile time, ensuring that both the
then and else branches return the same type (either both int or both string).
This is a demo of how dynamic typing in Python allows more flexibility, while static typing
with type inference in SML/NJ enforces stricter type rules, even though you don't need to
explicitly write the types!
So, what is the difference between type checking and type inference?
Type Checking:
● What it does: Verifies that the types of variables and expressions in the program are
correct according to a set of rules.
● When it happens: Can be done at compile time (static type checking) or runtime
(dynamic type checking).
● Input: Requires explicit type annotations from the programmer or runtime values.
● Example: given declarations like int x = 5 and int y = 10, the type checker will
ensure that x and y are integers before the program runs or while it's running.
Type Inference:
● What it does: Automatically deduces the types of variables and expressions without
explicit type annotations.
● When it happens: Typically at compile time, but sometimes can occur during
interpretation (e.g., in dynamically-typed languages).
● Input: No need for explicit type annotations—types are inferred based on how the
variables are used.
● Example: in a language with type inference (e.g., Haskell or SML/NJ), given a definition
such as x + y, the types of x and y are automatically inferred as integers based on
how they are used (addition).
Key Difference: type checking verifies types that are already known (whether declared by the
programmer or attached to runtime values), whereas type inference discovers types
automatically from how expressions are used.
So, does the auto keyword in C++ give us type inference?
When you use auto in C++, the compiler deduces the type of the variable based on the value
or expression used to initialize it. At first glance, that looks similar to how languages with type
inference, like SML/NJ or Haskell, automatically infer types without requiring explicit type
annotations from the programmer.
Example:
auto x = 10; // x is deduced to be an int
auto y = 3.14; // y is deduced to be a double
auto z = "hello"; // z is deduced to be a const char*
In this example, C++ determines the type of x to be int, y to be double, and z to be const
char*, based on the values assigned to them. The programmer doesn’t need to explicitly write
out the type, and the compiler does the work of deducing it.
Key Points:
● Type inference in C++ with auto is local, meaning it works only within the context of
the specific variable initialization.
● It does not mean C++ has full-blown type inference like in some functional languages
(e.g., SML, Haskell), where types are inferred globally in functions or across the
program.
● Type safety is still guaranteed in C++ even with auto. Once the type is deduced, it's
treated exactly as if you had written it explicitly.
So, while auto in C++ gives some of the convenience of type inference, C++ remains a
statically-typed language where the type must be determinable at compile time.
So, what's the difference between auto in C++ and type inference in
SML/NJ?
The auto keyword in C++ and the type inference system in SML/NJ (Standard ML of New
Jersey) represent different approaches to type inference in programming languages. Here's a
comparison highlighting their differences:
Type deduction with auto in C++:
● Scope and Usage: The auto keyword in C++ simplifies the type declaration for
variables. It allows the compiler to deduce the type based on the initializer. For example,
auto x = 42; makes x of type int because 42 is an int.
● Complexity: C++'s type inference with auto is relatively simple. It does not infer types
beyond what can be determined from a single initializer. The auto keyword is limited in
scope, mostly applied to variable declarations.
● Flexibility: While it reduces verbosity, auto does not provide deep or contextual type
inference. It relies on explicit initializers and does not support advanced features like
polymorphism or higher-order types.
Type inference in SML/NJ:
● Scope and Usage: SML/NJ uses a more sophisticated type inference system that can
infer types in a broader context, including function signatures and complex expressions.
This system allows for more advanced features like polymorphic types and higher-order
functions without explicit type annotations.
● Complexity: SML/NJ employs the Hindley-Milner type inference algorithm, which is a
powerful and general algorithm used in many functional languages. It can infer the most
general type of a given expression, even when multiple type variables and constraints
are involved.
● Flexibility: The Hindley-Milner algorithm supports type polymorphism, allowing functions
to be written generically and applied to different types without needing explicit type
annotations. This system can infer types in complex scenarios involving multiple
functions, type constraints, and higher-order functions.
In summary, SML/NJ's type inference is a more comprehensive system that supports advanced
features and more complex type scenarios than the simpler auto keyword in C++.
Could you implement a visitor that propagates types, similarly to what auto
does in C++?
To implement a visitor that infers the type of all variables created through let bindings, you
need to enhance your current visitor pattern. This new visitor should be able to traverse the
expression tree and infer the types based on the expressions used in let bindings.
1. Define Type Classes: First, define a basic set of types that you might expect, such as
INT, BOOL, and potentially more if needed.
2. Implement a Type Propagation Visitor: Create a VisitorTypePropagator class that
will traverse the expression tree and infer types.
class VisitorTypePropagator:
    def visit_var(self, var, env):
        # Look up the variable's type in the environment.
        if var.identifier in env:
            return env[var.identifier]
        else:
            raise TypeError(f"Undefined variable {var.identifier}")
This visitor will traverse the expressions and infer types based on the given rules. If it
encounters type mismatches or unsupported operations, it will return ErrorType.
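For instance, two more methods of this visitor could look like the sketch below, where
ErrorType stands for whatever marker the type classes of step 1 provide for ill-typed
expressions, and left, right, exp_def and exp_body are assumed field names:

    def visit_add(self, add_exp, env):
        # Addition propagates int when both operands are int; otherwise it
        # propagates the error marker.
        tp_left = add_exp.left.accept(self, env)
        tp_right = add_exp.right.accept(self, env)
        return int if tp_left == int and tp_right == int else ErrorType

    def visit_let(self, let_exp, env):
        # Bind the type inferred for the definition, then propagate it into
        # the body of the let expression.
        tp_def = let_exp.exp_def.accept(self, env)
        new_env = dict(env)
        new_env[let_exp.identifier] = tp_def
        return let_exp.exp_body.accept(self, new_env)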
Can you use this visitor to implement a find_type function that returns
the type of an expression?
Here's one possible implementation of such a function (the body simply runs the propagator
starting from an empty environment):

def find_type(e):
    """
    Finds the type of an expression via type propagation.

    Example:
        >>> e0 = Let('w', Num(2), Add(Var('v'), Var('w')))
        >>> e1 = Let('v', Num(40), e0)
        >>> find_type(e1)
        <class 'int'>
    """
    # Run the propagator on e, starting from an empty environment.
    return e.accept(VisitorTypePropagator(), {})
It seems that type inference via auto depends on the ordering of program
constructs. Does that make sense?
Yes, that makes sense. The auto keyword in C++ relies on contextual information to deduce
types. Here's how your observation breaks down:
Starting with Base Types: For auto to deduce a type, it must begin with expressions where
the type is directly apparent: literals or expressions with fully known types. For example:
auto x = 5; // 'x' is deduced as 'int'.
Here, auto can deduce x as an int because the literal 5 has an evident, fixed type (int).
Dependent Constructs: Once the type of a base construct is inferred, it can be propagated to
dependent constructs:
auto y = x + 3; // 'y' is deduced as 'int' because 'x' is 'int' and '3' is 'int'.
The deduction of y depends on the already inferred type of x (int). This creates an order of
inference, where types are resolved in a step-by-step manner.
Free Variables: Deduction fails when an initializer mentions a name with no known type:
auto z = w + 2; // Error: 'w' was not declared in this scope.
Since w is not declared or defined in the code, the type of z cannot be inferred. The inference
process must have complete information for all involved constructs: no missing types or free
variables.
Annotated or Inferred Variables: If free variables are annotated or inferred separately, type
deduction can proceed:
Annotated or Inferred Variables: If free variables are annotated or inferred separately, type
deduction can proceed:
int w = 4;
auto z = w + 2; // Now 'z' can be deduced as 'int', because 'w' is explicitly an 'int'.
So, in short: type inference (like auto) depends on the ordering of program constructs. The
process starts from constructs whose types are directly knowable (like literals or explicit
declarations), and it moves to more complex expressions that depend on already-inferred parts
of the program. Free variables, without proper annotations or separate inference, can halt this
process.
It seems that this forward style of type inference is just type checking like in
the last class. What's the difference?
Yes, type checking also assumes a proper ordering of statements, and it also propagates types
forward. The key difference appears in the implementation of let bindings, sketched below.
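Here is a side-by-side sketch of the two visit_let methods. The names exp_def, tp_var
and identifier come from the discussion below; exp_body is an assumed name for the
body of the let:

# Type checking: the let binding carries an explicit type annotation (tp_var).
def visit_let(self, let_exp, env):
    tp_def = let_exp.exp_def.accept(self, env)
    if tp_def != let_exp.tp_var:
        raise TypeError(f"Type error in the definition of {let_exp.identifier}")
    new_env = dict(env)
    new_env[let_exp.identifier] = tp_def
    return let_exp.exp_body.accept(self, new_env)

# Forward propagation: there is no annotation to check against.
def visit_let(self, let_exp, env):
    tp_def = let_exp.exp_def.accept(self, env)
    new_env = dict(env)
    new_env[let_exp.identifier] = tp_def
    return let_exp.exp_body.accept(self, new_env)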
The essential difference between type checking and forward type propagation lies in their
approach to verifying and assigning types within code, especially in the context of explicit versus
implicit type information.
In the type-checking method, types are explicitly declared by the programmer. The primary
role of the method is to verify that these declared types match the actual types of expressions.
Here’s how it works:
● The type of the expression on the right side of the let binding (let_exp.exp_def) is
computed.
● This computed type (tp_var) is then compared to the type explicitly declared in the
code (let_exp.tp_var).
● If they match, the environment is updated with this type information, and the body of the
let expression is type-checked with this updated environment.
● If they don’t match, a TypeError is raised, indicating a mismatch between the declared
and actual types.
Thus, type checking is about validating that the types explicitly stated by the programmer
align with the types determined by analyzing expressions. It relies on an upfront declaration of
types and verifies correctness against those declarations.
In forward type propagation, no explicit type is declared for let bindings. Instead, the method
infers the type of the expression on the right side of the binding and associates it with the
bound variable.
Forward type propagation is thus about automatically inferring and propagating types without
requiring explicit type annotations. It computes types based on the values of expressions and
allows them to flow through the program.
In Summary
● Type Checking: Validates that declared types match actual expression types, enforcing
correctness based on programmer-provided types.
● Forward Type Propagation: Deduces and assigns types based on expression values,
enabling implicit type information to propagate through the code.
Both are forms of static analysis, but type checking is more about verification, while forward
type propagation is about inference and automatic type deduction.
In C++'s auto, type deduction happens locally and immediately—you must have enough
information at each point in the program to infer the type. Hindley-Milner, however, can infer
types globally, meaning it doesn't require the explicit ordering of constructs. In HM, types of
variables and expressions can be inferred in any order, and type constraints are solved
together across the entire program. Here's how it works:
fun f(x) = x + 1
val y = f(true) (* Error: type mismatch *)
● Here, even though the type of f is not explicitly declared, HM can infer that f must take
an integer as a parameter because it's being used with + 1. The system also knows that
true is a boolean, so it can catch the type mismatch at compile time.
One of the key features of Hindley-Milner is type generalization (or polymorphism), where the
type inference system can infer generic types (polymorphic types) for functions and
expressions. This allows it to infer more general types that can work with multiple types of data.
fun identity x = x
Here, HM infers that identity is polymorphic—it can take any type as an argument and
return that same type. Its inferred type is:
identity : 'a -> 'a (* For any type 'a, it returns 'a *)
● This generality is much more powerful than C++'s auto, which only infers specific types
based on the context of each expression.
Postponing Resolution:
In HM-based systems, the inference process can postpone resolution until more information is
available. Here's why this is powerful:
fun f g x = g x
HM infers the type f : ('a -> 'b) -> 'a -> 'b.
● This means f takes a function g that maps some type 'a to 'b, and an argument of type
'a, and returns a result of type 'b. Even though f doesn’t specify the exact types of g
and x, HM is able to infer a generic, polymorphic type that works for any function g and
any type of x.
Hindley-Milner can infer types across function definitions without requiring explicit type
annotations. This means that you can have mutually recursive functions or references to
functions defined later, and the system will still be able to infer types. Example:

fun even 0 = true
  | even n = odd (n - 1)
and odd 0 = false
  | odd n = even (n - 1)
● HM will infer that both even and odd work on integers, even though they refer to each
other. The type system doesn’t require you to define them in a strict order or annotate
them.
Hindley-Milner can infer types for free variables in functions by introducing type variables and
later resolving them. This makes it capable of handling more complex inference scenarios
where types are not immediately apparent. Example:
fun apply f x = f x
Here, HM infers apply : ('a -> 'b) -> 'a -> 'b, even though neither f nor x carries
any annotation.
Summary of Differences
● Type Deduction Scope: auto performs local type deduction based on the immediate
context; Hindley-Milner performs global type inference that works across the entire
program.
● Free Variables: auto cannot infer types for free variables without explicit annotations;
Hindley-Milner can infer types for free variables using type variables and unification.
Hindley-Milner is more powerful because it can infer types without needing immediate
resolution or explicit ordering of constructs. It can infer generic types through polymorphism
and resolve types globally across the program, even when the relationships between variables
and functions are complex. This gives it a broader range of inference capabilities compared to
the local and immediate nature of C++'s auto.
This separation between constraint generation and unification is a key feature of HM inference.
In more detail, the unification stage:
● Unifies type variables: If we encounter two type variables T1 and T2, we unify them by
making them the same type.
● Unifies types: If we see a type variable T1 and a specific type Int, we unify T1 with
Int.
● Handles polymorphism: When we generalize a function (i.e., infer a polymorphic type),
we ensure that the types can adapt to different input types without conflict.
Step 1: Constraint Generation
Consider the definition val id = fn x => x, followed by the application id 42:
● For fn x => x, we assign a fresh type variable to x, say T_x. Since the function just
returns x, the return type is also T_x. Therefore, the type of id is T_x -> T_x.
● For id 42, we generate the constraint that id is a function that takes an Int as input,
which gives us the constraint T_x = Int.
Step 2: Unification
● We have the constraint T_x = Int from the function application, so we unify T_x with
Int.
● This gives us the final inferred type for id: Int -> Int.
● Stage 1 (Constraint Generation): We traverse the program and infer relations between
types using type variables. Each operation or function application introduces constraints
that need to be satisfied.
● Stage 2 (Unification): We solve these constraints using unification, substituting type
variables with concrete types and generalizing where needed (e.g., for polymorphic
functions).
This separation makes the algorithm powerful because it can handle complex expressions and
polymorphism efficiently.
class CtrGenVisitor(Visitor):
    """
    Examples:
        >>> e = Let('v', Num(40), Let('w', Num(2), Add(Var('v'), Var('w'))))
        >>> ev = CtrGenVisitor()
        >>> sorted([str(ct) for ct in e.accept(ev, ev.fresh_type_var())])
        ["('TV_1', 'TV_2')", "('TV_2', 'TV_3')", "('v', <class 'int'>)", "('w', <class 'int'>)", "(<class 'int'>, 'TV_3')", "(<class 'int'>, 'v')", "(<class 'int'>, 'w')"]
    """
    def __init__(self):
        self.fresh_type_counter = 0

    def fresh_type_var(self):
        """
        Example:
            >>> ev = CtrGenVisitor()
            >>> [ev.fresh_type_var(), ev.fresh_type_var()]
            ['TV_1', 'TV_2']
        """
        self.fresh_type_counter += 1
        return f"TV_{self.fresh_type_counter}"
    def visit_var(self, exp, type_var):
        """
        Example:
            >>> e = Var('v')
            >>> ev = CtrGenVisitor()
            >>> e.accept(ev, ev.fresh_type_var())
            {('v', 'TV_1')}
        """
        return {(exp.identifier, type_var)}
1. Fresh Type Variables: You are generating fresh type variables using
fresh_type_var() whenever a new variable or function is encountered. This ensures
that different variables get different types unless constraints force them to be the same.
2. Handling Expressions: For each expression, you recursively visit its subexpressions
(e.g., for Add, you visit both the left and right-hand sides). You are imposing constraints
that specify both sides of an operation must have the same type, and you're doing this
for arithmetic (Add) and logical (And) operations, which aligns with HM.
3. Unification Through Pairs: The output of each visitor method is a set of pairs (or
constraints) of the form (type0, type1). These pairs capture relations like:
○ A variable must have the same type as its use (e.g., in visit_var).
○ Both sides of an addition must be int (in visit_add).
○ Logical operations must involve bool types (in visit_and).
4. Let Binding: In the visit_let method, you are generating constraints for the definition
(exp_def) and ensuring that the type of the variable (exp.identifier) is consistent
with its use in the body. The fresh type variable for the body (fresh_type_var())
captures the idea that the result of the entire let expression could be a new type,
distinct from the type of the bound variable.
5. Conditionals: The visit_ifThenElse correctly ensures that the conditional
expression must have a bool type and that the types of the two branches must match.
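Based on the constraints shown in the doctest of CtrGenVisitor, the remaining methods
could be sketched as follows (the field names left, right, exp_def and exp_body are
assumptions):

    def visit_num(self, exp, type_var):
        # A numeric literal constrains its type variable to int.
        return {(type(1), type_var)}

    def visit_add(self, exp, type_var):
        # Both operands are visited with int as their expected type, and the
        # sum itself is also constrained to be an int.
        ctrs = exp.left.accept(self, type(1))
        ctrs |= exp.right.accept(self, type(1))
        return ctrs | {(type(1), type_var)}

    def visit_let(self, exp, type_var):
        # The definition is visited with the bound name playing the role of a
        # type variable, and the body receives a fresh type variable that is
        # equated with the type of the whole let expression.
        ctrs = exp.exp_def.accept(self, exp.identifier)
        fresh = self.fresh_type_var()
        return ctrs | {(type_var, fresh)} | exp.exp_body.accept(self, fresh)

On the Let expression from the class docstring, these methods produce exactly the seven
constraints listed there.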
Can you show the pairs that would be produced for “let val v = 3 * 7
in v + v end"?
Sure, let's do it with a figure:
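In textual form, and assuming a visit_mul method analogous to visit_add plus TV_1 as
the initial fresh type variable, the constraint-generation scheme above would produce the
following pairs for this program:

{(<class 'int'>, <class 'int'>),   # the literals 3 and 7 are ints
 (<class 'int'>, 'v'),             # the product 3 * 7, bound to v, is an int
 ('TV_1', 'TV_2'),                 # the type of the let is the type of its body
 ('v', <class 'int'>),             # v is used as an int operand in v + v
 (<class 'int'>, 'TV_2')}          # the sum v + v is an int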
The contract of unify can be seen in its doctest: it receives a list of constraints plus a
dictionary of type sets, and returns a mapping in which every type or type variable points to the
set of elements it was unified with:

Example:
    >>> sets = unify([('a', type(1))], {})
    >>> integers = sets[type(1)] - {type(1)}
    >>> sorted(integers)
    ['a']

Notice that at this stage, we still allow sets with invalid types. For instance, if 'b' were unified
with both int and bool, the set associated with 'b' would contain two different concrete
types. Such errors are only detected later, during canonicalization.
This algorithm seems very inefficient. Surely, that's not how unification is
implemented in practice, right?
In practice we could use union-find to implement unification. The inefficiency in our
implementation of unification arises from the repeated merging of sets. In each recursive call,
when two elements t0 and t1 are found not to be equal, it merges the sets associated with
each element by creating a union of s0, s1, t0, and t1. Then, it updates every element in
this new set to refer to the merged set. This approach has two main problems:
1. Repeated Set Reassociation: Each time two sets are merged, every element in both
sets must be updated to point to the newly formed merged set. As the number of sets
grows, these updates become increasingly expensive, especially when the sets overlap
significantly. This can lead to quadratic or even worse time complexity, depending on the
number and structure of constraints.
2. Lack of Efficient Lookup and Merging: The implementation lacks an efficient way to
determine which set a particular element belongs to. It manually merges sets by iterating
through all members, which is inefficient for large sets or numerous constraints.
The union-find (or disjoint-set) algorithm addresses these inefficiencies by providing a more
efficient way to manage and merge sets. Union-find is specifically designed to handle dynamic
connectivity problems, where the task is to quickly determine which elements are in the same
set and to efficiently merge sets when necessary. It has two main components: find and union,
each optimized with techniques that greatly reduce the time complexity.
1. Find with Path Compression: In union-find, each element initially points to itself as its
own "parent." When determining which set an element belongs to, the find operation
follows parent pointers until it reaches the root of the set. To improve efficiency, path
compression is used to make each element in the path point directly to the root. This
flattens the structure, so future lookups for elements in the same set become very fast.
2. Union with Union by Rank: When two sets need to be merged, the union operation
links the root of one set to the root of the other. Union by rank (or union by size)
ensures that the smaller tree is always attached under the root of the larger tree. This
keeps the overall structure balanced, minimizing the height of the trees.
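A minimal union-find sketch in Python (a hypothetical helper class, not the book's repository
code) could look like this:

class UnionFind:
    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        # Path compression: make x (and its ancestors) point to the root.
        if self.parent.setdefault(x, x) != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        # Union by rank: attach the shorter tree under the taller one.
        if self.rank.setdefault(root_a, 0) < self.rank.setdefault(root_b, 0):
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        if self.rank[root_a] == self.rank[root_b]:
            self.rank[root_a] += 1

With this structure, unifying two terms amounts to calling union on them, and checking
whether two terms are already unified amounts to comparing their find roots.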
Summary of Improvements
Union-find avoids:
● Repeated full-set updates, since each element only needs to point to a root, not to
every other element in the set.
● Inefficient set merges, as union-find directly merges roots instead of handling all
individual elements in the sets.
Overall, union-find significantly speeds up the unification process by eliminating redundant work
and providing a more scalable way to manage dynamic set connections, making it well-suited
for unification tasks with potentially large numbers of constraints.
When unifying type constraints, multiple variables might represent the same type, but initially,
they are distinct. After unification, we want to treat all equivalent types as a single entity, which
simplifies reasoning about them. This is where canonicalization comes into play: it selects a
single "canonical" type for each group of unified types.
Why is it Necessary?
During unification, several type variables or concrete types might end up in the same set. For
example, if we unify the type variable 'a' with int, then 'a' should now be considered
equivalent to int. The purpose of canonicalization is to:
1. Ensure consistency: Instead of having a set of equivalent types (e.g., {'a', 'b',
int}), we map all type variables in the set to a single canonical representative, such as
int.
2. Error detection: If a type set contains multiple concrete types (e.g., int and bool), this
indicates a type error (as a variable cannot be both an int and a bool).
Canonicalization helps detect and handle such cases by checking the size of the set of
types.
3. Simplify further processing: Once the canonical form is established, subsequent type
checks or inferences can simply refer to the canonical type instead of having to reason
about a set of potential types.
The term canonical comes from mathematics and computer science, where it refers to a
standard, normalized, or simplified form of an object. In type systems:
● A canonical type is the "final" or "standard" type after unification, such as int, bool,
etc.
● Canonicalization is the process of reducing a set of equivalent types (or type variables)
to one standard representative type.
In the implementation below, canonicalization works by:
● Removing non-concrete types (like type variables 'a', 'b') from the set.
● Ensuring the set contains only one concrete type (e.g., int, bool).
● If there's exactly one concrete type, it's chosen as the canonical representative.
● If the set contains multiple concrete types or no concrete type at all, a "Type error" is
raised, signaling ambiguity or failure in unification.
Example of Canonicalization
Suppose unification produced the set {'a', 'b', int}.
Canonicalization will:
● Ignore the type variables 'a' and 'b' (since they are not concrete types).
● Recognize that the only concrete type left is int.
● Set 'a' and 'b' both to refer to int as their canonical type.
In this way, 'a' and 'b' are now treated as equivalent to int everywhere, which simplifies
further type reasoning.
Thus, canonicalization is a crucial step in ensuring that the unified types are logically consistent
and usable in further stages of type inference.
Here's an implementation:
def name_sets(sets):
    """
    This method replaces type sets with "canonical type names". A canonical
    type name is the name of a type set. For instance, the type set
    {'a', 'b', type(1)} has the canonical name type(1), that is, <class 'int'>.

    Example:
        >>> sets = name_sets({'a': {'a', 'b', type(1)}, 'b': {'a', 'b', type(1)}})
        >>> [sets['a'], sets['b']]
        [<class 'int'>, <class 'int'>]
    """
    # One possible body: canonicalize a copy of each set, and map each name
    # to the single concrete type that remains in it.
    result = {}
    for name, type_set in sets.items():
        canon = set(type_set)
        canonicalize(canon)
        result[name] = next(iter(canon))
    return result
import sys

def canonicalize(s):
    # Remove everything that is not a concrete Python type (e.g., 'a', 'b').
    aux = set(s)
    for e in aux:
        if not isinstance(e, type):
            s.remove(e)
    # After the filtering, exactly one concrete type must remain.
    if len(s) == 0 or len(s) > 1:
        sys.exit("Type error")
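For instance, this implementation behaves as follows on the set from the example above:

>>> s = {'a', 'b', type(1)}
>>> canonicalize(s)
>>> s
{<class 'int'>}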
Good! Now, as a final step, can you implement a type inference engine that
uses all your functions?
def infer_types(expression):
    """
    This method maps all the program variables to type names.

    Example:
        >>> e = Let('v', Num(42), Var('v'))
        >>> type_names = infer_types(e)
        >>> type_names['v']
        <class 'int'>
    """
    # One way to assemble the pipeline sketched in this chapter: generate
    # the constraints, unify them, and name the resulting sets canonically.
    ev = CtrGenVisitor()
    constraints = expression.accept(ev, ev.fresh_type_var())
    sets = unify(list(constraints), {})
    return name_sets(sets)
Unification in Prolog
In Prolog, unification is the core operation that drives the logic programming language. It is the
process by which the interpreter matches terms (such as variables, atoms, or structures) when
attempting to satisfy logical queries. Prolog's unification:
?- X = foo(Y), foo(bar) = X.
Prolog will unify X with foo(Y) and foo(bar), resulting in the binding Y = bar.
In the HM type inference algorithm, unification is used to ensure consistent types for
expressions in a typed functional language (e.g., ML, Haskell). It:
● Matches type variables with concrete types or other type variables to ensure type
consistency.
● Is directional in most cases, since unification proceeds as part of type checking.
● Occurs at compile-time: In the HM algorithm, unification is part of static type inference,
helping to deduce the types of expressions without explicit type annotations.
For example, if you have the expression f(x), where the type of f is inferred to be a -> b and
the type of x is a, unification ensures that the input type of f and the type of x are compatible by
solving constraints like a = int.
Commonalities
1. Variable Binding: Both in Prolog and HM, unification binds variables to values (in HM,
these are type variables and types).
2. Finding a Substitution: In both systems, unification finds a substitution (or solution) that
makes two terms (or types) identical.
3. Failure Handling: In both Prolog and HM, unification can fail if no consistent substitution
is found (e.g., trying to unify an int with a bool in HM, or an incompatible structure in
Prolog).
Differences
1. Purpose: In Prolog, unification is used for logical reasoning and query resolution, while
in HM, it is used for type inference to ensure type consistency.
2. Symmetry: Prolog's unification is symmetric, while HM unification is often directional
because types are inferred in a structured way (from expressions to types).
3. Application Domain: Prolog uses unification in the context of logic programming and
deductive reasoning, while HM unification is about static typing and type inference in
functional languages.
Both forms of unification share the same theoretical basis: they are both about solving
equations between terms or types by finding a substitution that makes them identical. In that
sense, they are different applications of the same fundamental concept.
Conclusion
Although unification in Prolog and in HM type inference serve different purposes and have
distinct characteristics, they are based on the same underlying principle of matching structures
through substitution. The term "unification" is thus used in both cases because both involve
solving the same type of problem—making two things equivalent by finding suitable variable
assignments.
Now that we saw HM Type Inference, can you contrast it with C++11's
auto?
In languages like C++ and Rust, the auto keyword (or similar mechanisms) allows the compiler
to deduce the type of a variable based on the expression assigned to it. This deduction
happens as the compiler processes the code in a single pass, propagating type information
forward based on the context where the variable is used. The important characteristics of this
process include:
● Deterministic: The compiler deduces the type without any backtracking. Once a type is
inferred, it is fixed.
● Context-sensitive: Type deduction relies on the immediate context of the variable, such
as the right-hand side of an assignment or the return type of a function.
● Ordered: Type deduction happens in a linear, top-to-bottom fashion, following the order
in which the code is written.
This process doesn't involve solving complex constraints like in the Hindley-Milner (HM)
system. Instead, the compiler simply propagates type information through the code, from the
point where the type is obvious (e.g., from a literal or function return type) to the places where it
needs to be inferred.
1. Type Information Flows Forward: The type deduced for an expression (e.g., a function
call) is propagated to variables or further expressions, without the need to solve
bidirectional constraints like in HM or Prolog.
2. No Backtracking: Once the type is propagated, it remains fixed. There's no need to
revisit earlier parts of the code to adjust types based on future information, which is a
key distinction from unification-based systems.
3. Sequential Nature: The deduction happens in the order that the program constructs are
encountered, and types are propagated forward through the code in a single pass.
● Directional Flow: Types are deduced and propagated through the program as the
compiler encounters each construct.
● No Backtracking: The process doesn't involve backtracking or revisiting earlier points in
the code to revise types, as unification might.
● Simple Deduction: The types are deduced from well-defined rules and available
information at the point of assignment, rather than solving a system of constraints as in
HM.
Conclusion
This process is often called type propagation, a term that aptly describes how type
information is forwarded through the program constructs in a linear, deterministic manner. This
differentiates it from unification-based type inference systems like HM, where types are
resolved through a constraint-solving process that may involve backtracking or complex
interactions between type variables.
It seems that Robin Milner did much for computer science. Could you tell
me some of his contributions?
Yes, he did quite a lot! Robin Milner (1934–2010) was a British computer scientist who
profoundly influenced programming languages, concurrency theory, and formal methods. He
was recognized with the Turing Award in 1991 for his foundational contributions to computer
science.
1. The ML Programming Language: Milner led the design of ML (Meta Language), whose
polymorphic type inference is the basis of the Hindley-Milner system that bears his name.
2. Concurrency Theory: Milner created the Calculus of Communicating Systems (CCS)
and, later, the pi-calculus, two foundational formalisms for modeling concurrent,
communicating processes.
3. LCF (Logic for Computable Functions): Milner worked on this early interactive
theorem prover, which helped formalize proofs in logic and computation. It also
introduced innovations like tactics and tacticals, which have influenced modern proof
assistants.
4. Semantics of Programming
● Milner's work emphasized the need for precise semantics in programming languages,
helping bridge the gap between theoretical models and practical programming.
In addition to the Turing Award, Milner received numerous honors, including election to
the Royal Society and the US National Academy of Sciences. His influence extends
across programming language design, software verification, and theoretical computer
science.
Robin Milner's work remains a cornerstone of computer science, influencing areas as diverse as
functional programming, distributed systems, and formal verification methods.
Anonymous Functions
Anonymous functions are functions that do not have a name. They are often used for short-lived
operations, passed as arguments to other functions, or used in functional programming
contexts. They are useful for creating concise, inline functions and can be defined and used in a
single line of code.
In Standard ML of New Jersey (SML/NJ), anonymous functions are defined using the fn
keyword. Here are some examples to illustrate how anonymous functions work in SML/NJ:
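For instance, we could bind one to a variable, say add (a sketch):

val add = fn (x, y) => x + y
val sum = add (3, 4)  (* sum is 7 *)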
Here, fn (x, y) => x + y is an anonymous function that takes two arguments and returns
their sum.
You can use anonymous functions directly in expressions without assigning them to variables:
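(fn x => x * x) 5  (* evaluates to 25 *)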
In this case, (fn x => x * x) defines an anonymous function that squares its argument, and
5 is the argument provided to this function.
Anonymous functions are often used as arguments to other functions. For example, using
List.map to apply a function to each element of a list:
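For instance, given an arbitrary list nums (a sketch):

val nums = [1, 2, 3, 4]
val squares = List.map (fn x => x * x) nums  (* [1, 4, 9, 16] *)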
Here, fn x => x * x is an anonymous function used to square each element in the list nums.
Anonymous functions can also be returned from other functions. For instance:
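fun makeAdder n = fn x => x + n
val add5 = makeAdder 5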
In this example, makeAdder returns an anonymous function that adds a given number n to its
argument. add5 is an instance of this function with n set to 5.
In SML/NJ, every function takes in just one argument; however, functions can still receive
multiple parameters via tuples:
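val mult = fn (x, y) => x * y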
Here, fn (x, y) => x * y defines an anonymous function that multiplies two numbers.
Anonymous functions can also appear inside case expressions, so that different anonymous
functions are selected depending on the pattern that the case expression matches.
Several of these examples create closures. A closure has three defining characteristics:
1. Function with Environment: A closure is a function that has access to variables from
its creation context, known as its environment. This environment includes any local
variables that were in scope when the function was created.
2. Encapsulation: Closures allow functions to encapsulate state. This means that a
closure can hold on to the state of variables across function calls, which can be
particularly useful for creating objects with private state.
3. Function and Data Together: A closure consists of the function itself and the
environment in which it was defined. This means that the function can operate on the
variables from its creation environment, even after that environment has gone out of
scope.
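For instance, suppose inc is produced by an outer function (a sketch; the name makeInc is
hypothetical):

fun makeInc x = fn y => y + x
val inc = makeInc 1

The function returned by makeInc keeps referring to x = 1, even after makeInc itself has
returned.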
This behavior is characteristic of closures, and it allows for creating functions that are more
flexible and reusable. Here's how we could use inc:
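inc 5  (* evaluates to 6: the captured x = 1 is added to the argument *)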
In this case, inc is a closure because it captures and retains the value of x from the
environment where it was defined. This behavior—where a function retains access to its
environment even after its outer function has returned—is the defining feature of a closure.
5. EVar: This rule handles variable references:
○ If C[name] equals v, then the variable name evaluates to v. This rule is standard
for looking up values in the context C.
6. ELet: This rule handles let expressions:
○ If e0 evaluates to v0, and e1 evaluates to v1 in the extended context C +
{name:v0}, then let name = e0 in e1 end evaluates to v1. This is how
we evaluate let expressions, where a new variable binding is introduced.
7. EFn: This rule defines the creation of an anonymous function:
○ The rule says fn x => e evaluates to the function fn(x, e). This matches
the creation of a closure, where the function body e is captured along with its
argument x.
8. EApp: This rule handles function application:
○ If e0 evaluates to a function fn(x, e), and e1 evaluates to v1, then applying
e0 to e1 results in evaluating e in the context extended with the binding {x:v1}.
This rule correctly represents the evaluation of function applications, where the
function body is evaluated with the argument substituted for the parameter.
Before Functions:
In languages without first-class functions, values typically belong to a limited set of primitive
types, such as:
● Integer Values: numbers like 1 or 42.
● Boolean Values: true and false.
After Functions:
Now, in addition to integers and booleans, functions themselves become values. This is a
fundamental feature of languages that treat functions as first-class citizens. So, you have a third
category of values:
● Function Values: These are represented as closures (e.g., fn(x, x + 1)), which
consist of the function body along with any variables captured from the surrounding
environment (i.e., its lexical scope).
In operational semantics, once you introduce anonymous functions, you expand the notion of
what constitutes a "value." A function like fn(x, x+1) is a legitimate value, just like 1 or true.
It can be passed around as a parameter, stored in data structures, or returned as the result of
other functions.
This change is a hallmark of higher-order languages, where functions can be treated just like
any other value type.
Implications:
1. Functions as Values: In languages that support closures, functions are treated as
values that can be created dynamically, passed as arguments, returned from other
functions, and stored in variables.
2. Evaluation: In your operational semantics, this is captured by the rule EFn, where a
function expression evaluates to a closure value (e.g., fn(x, e)).
3. Extended Type System: The type system needs to recognize this new category of
values. For example, if you had a static type system, the set of types would now include
function types like int -> int in addition to primitive types like int or bool.
4. Higher-Order Programming: Now that functions are first-class values, you can write
higher-order functions, which are functions that take other functions as arguments or
return functions as results.
Example:
For instance, after defining val inc = fn x => x + 1, inc is now a value in the
program's environment. Just like you can refer to 1 or true, you can refer to inc and pass it
around, apply it to arguments, etc.
This idea fits perfectly with what you are noticing in the semantic rules: when functions are
first-class citizens, the set of values must naturally be extended to include function values.
import sys
from abc import ABC, abstractmethod
class Expression(ABC):
    @abstractmethod
    def accept(self, visitor, arg):
        pass
class Var(Expression):
    def __init__(self, identifier):
        self.identifier = identifier

    def accept(self, visitor, arg):
        return visitor.visit_var(self, arg)
class Num(Expression):
    def __init__(self, num):
        self.num = num

    def accept(self, visitor, arg):
        return visitor.visit_num(self, arg)

class Bln(Expression):
    def __init__(self, bln):
        self.bln = bln

    def accept(self, visitor, arg):
        return visitor.visit_bln(self, arg)
class BinaryExpression(Expression):
    def __init__(self, left, right):
        self.left = left
        self.right = right

class And(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_and(self, arg)

class Add(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_add(self, arg)

class Lth(BinaryExpression):
    def accept(self, visitor, arg):
        return visitor.visit_lth(self, arg)
class Let(Expression):
    def __init__(self, identifier, exp_def, exp_body):
        self.identifier = identifier
        self.exp_def = exp_def
        self.exp_body = exp_body

    def accept(self, visitor, arg):
        return visitor.visit_let(self, arg)

class IfThenElse(Expression):
    def __init__(self, cond, e0, e1):
        self.cond = cond
        self.e0 = e0
        self.e1 = e1

    def accept(self, visitor, arg):
        return visitor.visit_ifThenElse(self, arg)
class Fn(Expression):
    def __init__(self, formal, body):
        self.formal = formal
        self.body = body

    def accept(self, visitor, arg):
        return visitor.visit_fn(self, arg)

class App(Expression):
    def __init__(self, function, actual):
        self.function = function
        self.actual = actual

    def accept(self, visitor, arg):
        return visitor.visit_app(self, arg)
The only changes from the language of arithmetic expressions we had seen in the last class are
the addition of two new classes: Fn and App, which represent the implementation of anonymous
functions and the application of functions. The actual evaluation of these functions will happen
through a visitor.
class Function:
    """
    This is the class that represents functions. This class lets us distinguish
    the three types that now exist in the language: numbers, booleans and
    functions. Notice that the evaluation of an expression can now be a
    function. For instance:

    Example:
    >>> f = Function('v', Add(Var('v'), Var('v')))
    >>> print(str(f))
    Fn(v)
    """
    def __init__(self, formal, body):
        self.formal = formal
        self.body = body

    def __str__(self):
        return f"Fn({self.formal})"
So, can you implement this VisitorEval that evaluates the language
augmented with functions?
Here's the implementation:
class VisitorEval:
    """
    Evaluates expressions, now including anonymous functions.

    Example:
    >>> f = Fn('v', Add(Var('v'), Var('v')))
    >>> e = App(f, Add(Num(40), Num(2)))
    >>> ev = VisitorEval()
    >>> e.accept(ev, {})
    84
    """
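Concretely, the visit methods could look like this (a minimal sketch; the names visit_num,
visit_bln, visit_let, and visit_ifThenElse are my choice, mirroring the other visit
methods — and note how visit_app extends the environment of the call site, a detail that will
matter shortly):

    def visit_var(self, exp, env):
        # Look the variable up in the current environment.
        if exp.identifier in env:
            return env[exp.identifier]
        sys.exit(f"Undefined variable {exp.identifier}")

    def visit_num(self, exp, env):
        return exp.num

    def visit_bln(self, exp, env):
        return exp.bln

    def visit_add(self, exp, env):
        return exp.left.accept(self, env) + exp.right.accept(self, env)

    def visit_and(self, exp, env):
        return exp.left.accept(self, env) and exp.right.accept(self, env)

    def visit_lth(self, exp, env):
        return exp.left.accept(self, env) < exp.right.accept(self, env)

    def visit_let(self, exp, env):
        new_env = dict(env)
        new_env[exp.identifier] = exp.exp_def.accept(self, env)
        return exp.exp_body.accept(self, new_env)

    def visit_ifThenElse(self, exp, env):
        if exp.cond.accept(self, env):
            return exp.e0.accept(self, env)
        return exp.e1.accept(self, env)

    def visit_fn(self, exp, env):
        # A function literal evaluates to a Function object;
        # its body is not executed yet.
        return Function(exp.formal, exp.body)

    def visit_app(self, exp, env):
        fval = exp.function.accept(self, env)
        if isinstance(fval, Function):
            pval = exp.actual.accept(self, env)
            new_env = dict(env)          # extend the CURRENT environment
            new_env[fval.formal] = pval  # with the parameter binding
            return fval.body.accept(self, new_env)
        sys.exit("Type error")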
The VisitorEval class is an evaluator for a language that includes basic arithmetic, boolean
logic, variable bindings, and anonymous functions. It works by traversing an Abstract Syntax
Tree (AST) that represents the program, evaluating each node and producing a result. The
evaluation is driven by the Visitor Pattern, where each node type in the AST has its own
corresponding visit method.
● visit_add(self, add, env): Recursively evaluates both the left and right sides of
an addition and returns their sum.
The remaining visit methods follow the same recursive pattern. In summary, evaluation
proceeds as follows:
1. Variables are evaluated by looking up their values in the environment (visit_var).
2. Addition is evaluated by recursively evaluating both operands (visit_add).
3. Anonymous functions are evaluated into Function objects, but are not immediately
executed (visit_fn).
4. Function application evaluates the function and argument, then executes the function
body with the argument bound to the formal parameter (visit_app).
Summary
● This class implements an interpreter for a language supporting basic arithmetic, boolean
logic, let bindings, and first-class anonymous functions.
Consider, for instance, the following program:

let
  w = 2
in
  let
    f = fn v => v + w
  in
    let
      w = 3
    in
      f 1
    end
  end
end
To write this program in the Python representation used by your VisitorEval class, you need
to represent the nested let expressions, the function f, and the function application f 1.
Here's the Python representation of the program, followed by an explanation of how it works
with your visitor.
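A direct transcription could look like this (a sketch; program is my name for the whole tree):

program = Let('w', Num(2),
              Let('f', Fn('v', Add(Var('v'), Var('w'))),
                  Let('w', Num(3),
                      App(Var('f'), Num(1)))))
ev = VisitorEval()
print(program.accept(ev, {}))  # prints 4 with the evaluator above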
Evaluation
When we evaluate the program using the visitor ev with an empty environment ({}), the result
is 4.
The reason you're getting the result 4 instead of 3 has to do with the scoping of variables,
specifically how w is handled in the function's closure.
● Outer let w = 2: This let binds w to 2.
● let f = fn v => v + w: This let defines the function f, whose body mentions the
free variable w.
● Inner let w = 3: This let introduces a new variable w, which shadows the outer w =
2 within the body of this inner let. So within this block, w is 3.
When you apply f 1, even though f was defined when w = 2, it is evaluated in the current
environment where w = 3 is bound. Since f contains a free variable w, this variable is looked
up in the environment where f is applied (not the environment where f was defined). This
means w = 3 is used, not w = 2 from the closure.
Why?
In your implementation, it seems that variables are resolved dynamically, based on the
environment at the time of function application. This behavior is called dynamic scoping, where
variables are resolved based on the current context of execution, rather than lexical scoping
(where variables are resolved based on where the function was defined).
Conclusion:
The surprising result occurs because w is resolved dynamically at the time of function
application. In this case, the inner w = 3 shadows the outer w = 2, so the function uses the
inner w when calculating v + w, leading to 1 + 3 = 4.
Dynamic Scoping
In dynamic scoping, variables are resolved based on the current execution context (the
environment at the time the function is called). When the function needs to access a variable, it
looks for the most recent definition of that variable in the call stack at runtime.
● How it works: When a function is called, the interpreter searches for variable bindings in
the current environment, and if not found, it continues to search in calling environments
further up the call stack. This means the function’s behavior can change based on where
and when it’s called.
● Example:
def f():
    return x + 1

def g():
    x = 10
    return f()

x = 5
print(g())  # If dynamically scoped, the value of x used in f would be 10
● Under dynamic scoping, when f() is called inside g(), x would refer to the value 10,
even though f() was defined when x = 5. The function takes the environment where it
is called.
● Languages that use dynamic scoping:
○ Early versions of Lisp (before Scheme introduced lexical scoping).
○ Some shells like bash (variables in functions are dynamically scoped).
○ Some older languages like Emacs Lisp (although it now supports lexical scoping
as well).
● Dynamic scoping is rare in modern programming languages, as it can lead to
unpredictable and hard-to-debug code.
Static Scoping
In static scoping (also called lexical scoping), variables are resolved based on the
environment in which the function is defined, regardless of where the function is called. The
function carries with it a "snapshot" of the surrounding environment at the time it was defined
(this is typically implemented via closures).
● Example:
def f():
    return x + 1

def g():
    x = 10
    return f()

x = 5
print(g())  # If statically scoped, the value of x used in f would be 5
● Under static scoping, f() uses the value of x that was in scope when f() was defined,
so it will return 5 + 1 = 6, regardless of the value of x inside g().
● Languages that use static (lexical) scoping:
○ Python.
○ JavaScript.
○ Scheme and modern Lisp.
○ ML and OCaml.
○ C, C++, Java, and most modern languages.
Lexical scoping is more intuitive and predictable because the behavior of a function depends on
where it is defined, not where it is called.
The key difference, then, lies in where a function looks for its variable bindings:
● Variable lookup in dynamic scoping: The function looks for variable bindings in the
current execution environment (where the function is called).
● Variable lookup in static (lexical) scoping: The function looks for variable bindings in
the environment where it was defined (the environment at the time the function was
created).
Static scoping is preferred in most modern languages for three main reasons:
1. Predictability: With static scoping, you can always tell where a variable will come from
just by looking at where the function is defined. The behavior of functions doesn't change
depending on where or when they are called, making it easier to understand the code.
2. Clarity: Static scoping ensures that the meaning of variables is fixed at the time of
function definition. This prevents confusion, as you don’t need to trace through the
execution context or call stack to figure out what a variable refers to.
3. Easier Debugging: Since variable references are resolved in the environment where the
function was written, debugging is more straightforward. The variable bindings stay
consistent, reducing unexpected behavior or side effects from other parts of the
program.
For these reasons, languages like Python, JavaScript, and C use static scoping, making it the
preferred choice in most modern programming environments.
To implement static scoping in an interpreter, we need the following ingredients:
1. Lexical Environment Capture: When a function is defined, it should capture the
environment (variables and their bindings) that exists at that moment. This is often
referred to as a closure.
2. Environment Chain: Each function keeps a reference to the environment in which it
was created. When looking up a variable, we first check the local environment (within the
function) and then move outward to the environment where the function was defined,
and so on, following this chain until we find the variable.
3. Scope Determination at Compile Time: The compiler or interpreter determines the
scope of variables based on the structure of the program at the time the function is
written. This means variable resolution is based on the function’s lexical scope.
4. Consistent Binding: When the function is invoked later, it uses the environment
captured during its definition, ensuring that variables reference the correct values even if
new variables with the same name exist in the calling environment.
These principles ensure that variables inside a function are always resolved according to their
static scope, which is fixed based on where they were defined in the code.
Can you give me the semantic rules that implement static scoping?
See the new rules below, for the evaluation of function applications:
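In textual form:
● EFn: the expression fn x => e, evaluated in a context C, yields the closure fn(x, e, C).
● EApp: if e0 evaluates to fn(x, e, C'), and e1 evaluates to v1, then e0 e1 evaluates to the
value of e in the context C' + {x:v1}.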
● In the EFn rule (for function definition), the function fn(x, e) captures the current
context C at the moment the function is defined. This ensures that the function
remembers the environment where it was created, which is crucial for static scoping.
● In the EApp rule (for function application), when the function is applied, the saved
context C' is used to evaluate the body of the function e, augmented with the parameter
binding {x:v1}. This shows that the function body is evaluated in the context in which
the function was defined, not where it was called.
Thus, the rules ensure that variables are looked up in the context of their definition (static
scoping), rather than in the calling context (which would be dynamic scoping).
class Function:
    def __init__(self, formal, body, env):
        self.formal = formal
        self.body = body
        self.env = env  # NEW: the environment captured at definition time

    def __str__(self):
        return f"Fn({self.formal})"

class VisitorEval:
    ...

With these interventions, we are addressing the two key changes required for implementing
static scoping: first, Function objects now record the environment in which they were created;
second, function application evaluates the function body in that stored environment rather than
in the caller's environment, as sketched below.
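A minimal sketch of the two amended methods, following the semantic rules above:

    def visit_fn(self, exp, env):
        # Store the definition environment inside the closure.
        return Function(exp.formal, exp.body, env)

    def visit_app(self, exp, env):
        fval = exp.function.accept(self, env)
        if isinstance(fval, Function):
            pval = exp.actual.accept(self, env)
            new_env = dict(fval.env)     # start from the DEFINITION environment
            new_env[fval.formal] = pval  # add the parameter binding
            return fval.body.accept(self, new_env)
        sys.exit("Type error")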
Can you show some code that demonstrates the creation of lexical
closures on your current implementation?
Sure! Consider the program below:
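A program along these lines creates a curried function f and partially applies it (a sketch,
consistent with the analysis below):

f = Fn('x', Fn('y', Add(Var('x'), Var('y'))))  # fn x => fn y => x + y
e1 = App(f, Num(2))   # yields a closure that captures x = 2
e2 = Let('g', e1, App(Var('g'), Num(1)))
ev = VisitorEval()
print(e2.accept(ev, {}))  # prints 3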
Program Analysis
○ f = fn x => fn y => x + y is a curried function: applying it to one argument
yields another function.
○ In e1, f is applied to the number 2, producing a closure that captures x = 2.
○ In e2, g is bound to the result of e1, and then g is applied to the number 1.
○ This application will result in adding x = 2 (from the closure created by f) to y
= 1 (the argument to g).
Figure Explanation
The figure visualizes this sequence of expressions, focusing on how closures capture
environments.
It seems that dynamic scoping is kind of bad. Why are there programming
languages that use it?
Dynamic scoping is less common today, but it was historically used in some languages because
of a few specific advantages:
● Flexibility: In certain cases, dynamic scoping can offer a form of flexibility. For example,
it allows functions to automatically access variables defined outside their local
scope, even if they were not available when the function was originally written.
This can sometimes simplify certain kinds of programs, especially in small
scripting languages where the overhead of managing scopes explicitly is
undesirable.
However, this flexibility often turns into a source of confusion because it becomes hard to
understand which variables a function will refer to when it's called, leading to errors that are
hard to track down.
Consider, for instance, the following Bash script:

#!/bin/bash
x=5

foo() {
  echo $x
}

bar() {
  local x=10
  foo
}

bar
In this script:
● bar declares a local variable x = 10 and then calls foo.
● Because Bash resolves variables dynamically, foo looks x up in the environment of its
caller and prints 10, not the global value 5.
● Once bar returns, the local binding disappears, and x is 5 again at the top level.
Note that the local declaration is what keeps the assignment from clobbering the global x;
without it, bar would simply overwrite the global variable, and x would remain 10 after the call.
This dynamic nature of Bash variables is sometimes useful in shell scripting, but it can also lead
to subtle bugs when variables are inadvertently overwritten. That's why experienced Bash
programmers often explicitly use local variables to limit the scope where necessary.
Languages like Bash (and historically Lisp) may use dynamic scoping for the following reasons:
1. Legacy: Older languages and scripting environments often started with dynamic scoping
simply because it was easier to implement and made sense for quick-and-dirty scripts.
Over time, even though static scoping became more common, some languages retained
dynamic scoping for backward compatibility.
2. Scripting Flexibility: Dynamic scoping is useful for small, simple scripts where it's
helpful to have easy access to the global state. For example, shell scripts frequently
manipulate environment variables dynamically, and dynamic scoping allows this to
happen with minimal boilerplate.
3. Niche Use Cases: Certain domain-specific languages or configurations may still favor
dynamic scoping when it fits the particular use case better (e.g., when it's important to
propagate context automatically).
In summary, while dynamic scoping is generally harder to reason about and can lead to
confusing bugs, it can be useful in limited contexts, especially in small scripting languages
where ease of use is prioritized over strict program structure.
The term "closure" was coined by Peter Landin in his 1964 paper, The Mechanical Evaluation of
Expressions. In that paper, Landin explored ways to model the evaluation of expressions using what he called
the "SECD machine" (an abstract machine he invented). Closures, in his formulation, were a
way to capture both a function and the environment in which it was defined, allowing functions
to be treated as first-class entities that could be passed around with their associated variables
intact.
Who's this Landin guy? What else did he do for compilers and
programming languages?
Peter Landin was a British computer scientist and one of the foundational figures in
programming language theory, especially known for his work in functional programming and the
development of key theoretical concepts that are widely used today. He was born in 1930 and
became active in the field during the 1950s and 1960s, a formative period for computer science
as an academic discipline.
1. Introducing Functional Programming Concepts: Landin was among the first to apply
formal mathematical logic to programming languages, especially by advocating for the
use of lambda calculus as a model for computation. This laid a theoretical foundation for
functional programming languages.
2. The SECD Machine: In The Mechanical Evaluation of Expressions (1964), Landin
described the SECD machine, a virtual machine that could interpret lambda calculus
expressions. This work provided a model for implementing functional languages,
influencing many later language designs.
3. Coining the Term "Closure": As you noted, Landin introduced the term "closure" to
describe a function along with the lexical environment it was created in. This concept
became fundamental for understanding and implementing lexical scoping in
programming languages.
4. ISWIM: He proposed the hypothetical language ISWIM (If You See What I Mean) as a
model to demonstrate his ideas. ISWIM introduced several ideas that have since
become commonplace, like nested functions, lexical scoping, and higher-order functions.
5. Advocating for Higher-Order Functions: Landin’s work highlighted the power of
higher-order functions, functions that can take other functions as arguments or return
them as results, an idea essential to modern functional programming languages like
Lisp, Scheme, Haskell, and many parts of JavaScript.
Landin was an academic at Queen Mary College, London, and he influenced generations of
computer scientists, working alongside pioneers of functional programming such as John
McCarthy (the inventor of Lisp). Despite his enormous impact, Landin himself was modest and was
known for his philosophical approach to computer science, often questioning the ethical and
societal implications of technology. His insights laid the groundwork for many later
developments, making him a key figure in shaping the languages and computational models we
use today.
Recursive Functions
Code: https://github.com/pronesto/DCC053_Public/tree/main/12_RecFun
Our interpreter, as written so far, cannot evaluate recursive functions. To explain why, let's break down how function definitions and applications are evaluated:
1. Function Definition (visit_fn): When a function is defined (e.g., fn x => e), the
interpreter creates a Function object with the body e, the formal parameter x, and the
current environment at the time of the function's definition.
This environment (env) is stored inside the Function object, and the function is returned to be
used later.
Function Application (visit_app): When the function is applied, the interpreter:
1. Evaluates the function expression, obtaining a Function object (a closure).
2. Evaluates the actual parameter.
3. Extends the environment stored in the closure with the binding of the formal parameter.
4. Evaluates the function body in this extended environment.
So, where does recursion break?
1. The problem: When defining a recursive function, the function needs to refer to itself by
name within its body (like f in f(x) = f(x-1) + 1). However, in the current
implementation, the function’s name is not added to the environment when the function
is first created. So, when you try to call the function recursively inside its body, the
function itself is not available in the environment.
In a recursive function, you need the function to be accessible inside its own body. However,
since the function's environment is captured when it is defined, and at that point the function
itself is not yet in the environment, any recursive call will fail because the function is not bound
to its own name.
Consider, for instance, the program below:

let
  f = fn x => if x = 0 then 0 else f(x - 1)
in
  f(3)
end

When the interpreter evaluates this program:
● It creates a function f that captures the current environment, but at the time f is defined,
f itself is not yet in the environment.
● When the function f tries to call itself inside its body (f(x - 1)), the interpreter will look
for f in the environment and won't find it, leading to an error.
The reason we introduce a new keyword, such as fun, to handle recursive functions is primarily
due to how functions are defined and evaluated in relation to their environment. Let’s
break down why we need this distinction:
When you define a function with fn in languages with static scoping (like Python or most
functional languages), the function captures its environment (the context in which it was
defined). However, this environment does not automatically include a reference to the function
itself.
For recursion, the function needs to refer to itself by name in its body. With the fn keyword (as
we implemented it originally), the function is just an anonymous value — it doesn’t have its own
name in the environment where it's defined. Therefore, recursive calls would be impossible
because the function wouldn’t know its own name.
The fun keyword introduces a named function (e.g., fun f x => e), meaning that when we
define the function, we explicitly bind the function’s name (f) in the environment. This allows the
function to refer to itself within its body, enabling recursive calls.
● In the case of fun, the function f is immediately available in its own environment. This is
done through special handling in the interpreter or semantic rules, where f is bound to a
reference of the function itself inside the environment it captures. So, when the
function executes, it has access to its own name, allowing it to call itself recursively.
If we simply use fn (which creates anonymous functions without names), there’s no mechanism
to bind the function to a name in its own environment. The function doesn't know how to refer to
itself unless you pass it explicitly, which is not how most recursive functions are written.
● In contrast, with fun, the interpreter knows that the function must carry its own name
along with the body and the captured environment, ensuring that the function has access
to itself for recursion.
By introducing fun, you distinguish between anonymous functions (fn), which don’t need to
be recursive, and recursive named functions (fun), which need to carry a reference to their
own name. This distinction is necessary to avoid always binding the function to a name, which
might not be desired in the case of simple anonymous functions.
In Summary:
● fn: Defines an anonymous function that does not have a name in its own environment. It
cannot refer to itself unless explicitly passed as an argument.
● fun: Defines a named function where the function’s name is available in its own
environment, enabling recursion. The name binding is crucial for recursive functions,
which is why fun was introduced.
This distinction helps keep the language clear and allows the flexibility to define both
non-recursive and recursive functions effectively.
This rule deals with the declaration of recursive functions using the fun keyword.
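In textual form: evaluating fun f x => e in a context C yields the recursive closure
rec_fn(f, x, e, C); and if e0 evaluates to rec_fn(f, x, e, C') while e1 evaluates to v1, then
e0 e1 evaluates to the value of e in the context C' + {x:v1} + {f:rec_fn(f, x, e, C')}.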
The key here is that the rec_fn object is designed to carry its own name along with its
definition. This self-referencing is crucial for enabling recursion, as it allows the function to call
itself by name within its body.
How It Works
The combination of these rules allows functions defined with fun to refer to themselves by their
own names. When the function f is called within its body, the environment already includes a
reference to f, thus enabling recursive calls.
● When defining the function, rec_fn captures the function's own definition along with its
environment, which includes a reference to itself.
● When applying the function, the environment in which the function body is evaluated
includes a reference to the function itself, hence supporting recursion.
This approach, inspired by languages like Standard ML, effectively addresses the limitations of
the earlier non-recursive function definitions by explicitly incorporating the function's own
reference into its scope at the point of definition and during its application.
So, now we need two different ways to represent functions. How can we
implement that in our Python interpreter?
class Fn(Expression):
    """
    This class represents an anonymous function.
    """
    def __init__(self, formal, body):
        self.formal = formal
        self.body = body

class Fun(Fn):
    """
    This class represents a named function. Named functions can be invoked
    recursively.
    """
    def __init__(self, name, formal, body):
        super().__init__(formal, body)
        self.name = name
● Both classes are consistent with the design pattern you're using (the visitor pattern), and
they clearly distinguish between anonymous and named functions. Recursive function
handling with Fun becomes easier this way.
We now have four kinds of "values": Fn, Fun, Bln and Num. But what is
really a value?
In our language, based on the classes you've implemented (Fn, Fun, Num, Bln), a value is a
data structure that represents a final result that the interpreter will produce after evaluating an
expression. This means that any of these four classes (Fn, Fun, Num, Bln) can represent
values. So, every value is an expression; however, not every expression is a value.
● Values:
○ Num(5): A number, which is a value.
○ Bln(True): A boolean value.
○ Fn('x', Add(Var('x'), Num(2))): A function itself is a value.
○ Fun('f', 'x', Add(Var('x'), Num(2))): A named function is also a
value.
● Non-Values (Expressions):
○ Add(Num(3), Num(2)): An addition expression, which can be evaluated to
produce a value (Num(5)).
○ Var('x'): A variable reference, which must be evaluated by looking it up in the
environment.
○ App(Fn('x', Add(Var('x'), Num(2))), Num(5)): A function
application, which must be evaluated by applying the function to the argument.
In your interpreter:
● When you evaluate an expression, it will eventually reduce (or evaluate) to a value. For
example:
○ Evaluating the expression Add(Num(2), Num(3)) results in the value Num(5).
○ Evaluating the expression Fn('x', Add(Var('x'), Num(2))) results in the
function value Fn('x', Add(Var('x'), Num(2))).
The distinction is important because the evaluator (or interpreter) needs to distinguish
between an expression that needs further evaluation and a value that is already fully evaluated.
For instance, when you encounter a Var('x') expression, the interpreter knows it needs to
look up x in the environment to find its value, while encountering Num(5) means there is
nothing more to do — this is already a value.
Conclusion:
To summarize, in your interpreter, values are those things that do not require further evaluation
— like Fn, Fun, Num, and Bln. In contrast, expressions are things that may require further
evaluation, such as operations or function applications. This distinction ensures that your
interpreter knows when to stop evaluating and return a result.
Now that we have so many different values, type checking becomes even
more important. Can we define type checking rules for functions?
The figure below defines type checking rules for functions, assuming that the parameters are
annotated with user-given types:
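In textual form (writing Γ for the typing context and Γ + {x:T1} for Γ extended with the
assumption that x has type T1), the two rules read:
● TFn: if Γ + {x:T1} ⊢ e : T2, then Γ ⊢ fn (x:T1) => e : T1 → T2.
● TApp: if Γ ⊢ e0 : T1 → T2 and Γ ⊢ e1 : T1, then Γ ⊢ e0 e1 : T2.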
In this figure, there are two rules labeled TFn and TApp. These are still type-checking rules for
functions in a statically typed language. Let’s go over each rule to understand them.
This rule helps us determine the type of a function definition. Let’s break down what it says:
1. Assumption:
○ We extend the typing context with the assumption that the parameter x has the
user-given type T1.
2. Premise:
○ If, under this assumption, the body e has type T2, then we can conclude that the
entire function has type T1 → T2.
3. Final Conclusion:
○ The rule tells us that a function written as fn x => e has type T1 → T2,
meaning it takes an argument of type T1 and produces a result of type T2.
In simpler terms, the TFn rule is used to assign a function type to a function definition. If we
can show that the function body has type T2 when we assume the parameter has type T1, then
the whole function has type T1 → T2.
The TApp rule describes how to type-check a function application, where a function is applied
to an argument.
So, the TApp rule is used to type-check function applications. It confirms that:
● The argument type matches the function's expected input type, and
● The resulting expression will have the function’s output type.
● TFn (Function Definition): This rule is used to assign a type to a function. We assume
the parameter has a certain type and verify that the body type-checks with this
assumption. If it does, we conclude that the function has a type from the parameter type
to the result type.
● TApp (Function Application): This rule is used to type-check the application of a
function to an argument. It ensures the argument’s type matches the function's input
type and that the result will have the function's output type.
These rules ensure that functions are used correctly in the language, with arguments and return
types that match as expected. They are essential for type safety, allowing the compiler to catch
errors where functions are used with the wrong types of arguments.
And how do we type-check recursive functions?
The TFun rule is designed to handle recursive functions, allowing a function to refer to itself
within its own body. This aspect distinguishes TFun from TFn, which is typically used for
non-recursive functions. Let's go over how TFun supports recursion and why it's necessary.
When defining a recursive function, we need to make sure that the function can refer to itself.
Without TFun, the function would not "know" its own type while we’re in the process of defining
it. The TFun rule allows us to solve this problem by adding the function itself to the typing
context.
Example
● Consider the factorial function: fun fact n = if n = 0 then 1 else n * fact(n - 1).
● In this case, the TFun rule allows fact to refer to itself within its own body.
● The rule lets us assume fact : int → int in the typing context while we check the
body of fact. This assumption is essential to type-check the recursive call fact(n -
1).
Summary
● TFun is necessary for recursive functions because it allows a function to assume its
own type in its body.
● This rule allows recursive calls within the function body to type-check correctly, ensuring
that recursive functions can be defined safely and consistently.
● Without TFun, recursive functions wouldn’t be possible in this type system, as they
would have no way of referring to themselves within their own definition.
class Function:
    """
    This is the class that represents functions. This class lets us distinguish
    the three types that now exist in the language: numbers, booleans and
    functions. Notice that the evaluation of an expression can now be a
    function.
    """
    def __init__(self, formal, body, env):
        self.formal = formal
        self.body = body
        self.env = env

    def __str__(self):
        return f"Fn({self.formal})"
class RecFunction(Function):
    """
    This is the class that represents named functions. The key difference between
    named and anonymous functions is exactly the "name" :)
    """
    def __init__(self, name, formal, body, env):
        super().__init__(formal, body, env)
        self.name = name
    def __str__(self):
        return f"Fun {self.name}({self.formal})"
Example
outer_env = {'x': 5}
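The rest of the example might read as follows (a sketch; f is defined where x = 5 and applied
where x has been rebound to 10):

prog = Let('f', Fn('y', Add(Var('y'), Var('x'))),
           Let('x', Num(10),               # the call site rebinds x
               App(Var('f'), Num(1))))
ev = VisitorEval()
print(prog.accept(ev, outer_env))  # prints 6 under static scoping, not 11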
Without storing the original environment (outer_env), you might end up looking for x in the
call_env, leading to a result based on x = 10. But, since the function f was defined in an
environment where x = 5, the correct behavior under static scoping is to use the x from the
definition environment (outer_env).
When you store the environment in the function object, it ensures that when f is evaluated, it
looks up x in the stored environment, not the current environment where the function is called.
● Anonymous Functions (Fn): For anonymous functions, storing the environment allows
the function to capture and reference variables in its lexical context.
● Recursive Functions (Fun): In the case of recursive functions, the function can
reference itself by name inside its body. Storing the environment is critical to ensure the
recursive call happens in the correct context, even if the function is passed around and
invoked in different scopes.
Summary
● Lexical Scoping: Ensuring that variables are resolved based on the environment where
the function was defined.
● Handling Closures: A function can close over variables from its definition environment
and carry them along even when called elsewhere.
● Enabling Recursion: Recursive functions must reference themselves, and storing the
environment helps ensure that recursive calls work correctly.
By storing the environment, you ensure that the interpreter behaves consistently with static
scoping rules, where a function’s behavior depends on the lexical context in which it was
defined, not the context in which it is called.
Let's then implement our evaluation rules in our VisitorEval class. Can
you do it?
Sure! Find below an implementation:
class VisitorEval:
    """
    Evaluates an expression with recursive functions.
    """

    def visit_app(self, exp, env):
        """
        Example:
        >>> f = Fun('f', 'v', Add(Var('v'), Var('v')))
        >>> e0 = Let('f', f, App(Var('f'), Num(2)))
        >>> ev = VisitorEval()
        >>> e0.accept(ev, {})
        4
        """
        fval = exp.function.accept(self, env)
        if isinstance(fval, Function):
            pval = exp.actual.accept(self, env)
            new_env = dict(fval.env)
            new_env[fval.formal] = pval
            if isinstance(fval, RecFunction):
                new_env[fval.name] = fval
            return fval.body.accept(self, new_env)
        else:
            sys.exit("Type error")
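The companion methods visit_fn and visit_fun, explained next, could be written as follows
(a sketch consistent with that explanation):

    def visit_fn(self, exp, env):
        # An anonymous function captures the environment of its definition.
        return Function(exp.formal, exp.body, env)

    def visit_fun(self, exp, env):
        # A named function additionally records its own name, so that
        # visit_app can rebind it when the function is applied.
        return RecFunction(exp.name, exp.formal, exp.body, env)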
Here's an explanation for each of the visit_fn, visit_fun, and visit_app methods, which
together handle the evaluation of anonymous functions (fn), recursive functions (fun), and
function application in the interpreter:
Storing the environment in the function is crucial because it enables lexical scoping. If the
function refers to any variables outside of its body, it will be able to resolve them using the
environment in which it was created.
Example:
fn x => x + 1
In this case, x is the formal parameter, x + 1 is the body, and the function is closed over the
current environment.
This method evaluates a recursive function expression. In contrast to visit_fn, this method
handles fun expressions, which are named functions that can call themselves recursively.
When evaluating a fun expression, we create a RecFunction object, which extends the
Function class and additionally stores the name of the function (exp.name). Like visit_fn,
this method also captures the current environment to enable lexical scoping. Storing the
function's name is essential for supporting recursion: within the function body, recursive calls
refer to this name.
Example:
fun f x => if x = 0 then 1 else x * f(x - 1)
Here, f is the name, x is the formal parameter, and the body is a recursive factorial
computation.
This method handles function application—that is, applying a function to an actual parameter.
It covers two cases:
The general process is the same for both cases, but with one additional step for recursive
functions.
If the function is a recursive function (i.e., a RecFunction), we must also bind the function’s
name to the function itself in the new environment (new_env[fval.name] = fval). This step
allows the function to call itself recursively.
If the function value (fval) is not a Function or RecFunction, we raise a type error.
Example:
(fn x => x + 1) 2
1. We evaluate the anonymous function (fn x => x + 1), which gives us a Function
object.
2. We evaluate the actual parameter 2.
3. We create a new environment where x is bound to 2, and then evaluate the function
body x + 1, which results in 3.
1. We evaluate the recursive function (fun f x => ...), which gives us a
RecFunction object.
2. We evaluate the actual parameter.
3. We build a new environment from the one stored in the closure, bind the formal
parameter, and also bind the function's own name to the RecFunction object.
4. We evaluate the body in this new environment; recursive occurrences of the function's
name now resolve correctly.
Summary:
This approach ensures that the interpreter correctly implements static scoping and supports
both anonymous and recursive functions, maintaining the proper scope when functions are
called.
As another example, consider the program (fun f x = fn y => y + x) 3 4 — which, by the way, is not valid
SML/NJ syntax, although our toy language accepts it. Applying the recursive function to 3 creates the
closure (fn y => y + x, {x: 3}); applying that closure to 4 then yields 7.
Let's write some cool examples, to see if we can evaluate them. Can you
translate the following function to an Expression tree?
let
fun range n0 = fn n1 =>
if n0 < n1
then n0 + range (n0 + 1) n1
else 0
in
range 2 7
end
Here's how we can translate the given program into a sequence of Expression objects using
the existing expression types: Var, Num, Bln, And, Add, Lth, Let, IfThenElse, Fn, Fun, and
App. I'll represent this in Python-like pseudocode based on your class structure.
The program defines a recursive function range, which takes two parameters n0 and n1. The
function checks if n0 is less than n1, and if so, it adds n0 to a recursive call to range with n0 +
1 and n1. Otherwise, it returns 0. The program then calls range 2 7.
The equivalent expression tree for the program can be built step by step.
def create_arithmetic_sum(init_value, end_value):
    """
    Example:
    >>> program = create_arithmetic_sum(2, 7)
    >>> v = VisitorEval()
    >>> program.accept(v, {})
    20
    """
    # if n0 < n1 then n0 + range (n0 + 1) n1 else 0
    body = IfThenElse(Lth(Var('n0'), Var('n1')),
                      Add(Var('n0'),
                          App(App(Var('range'), Add(Var('n0'), Num(1))),
                              Var('n1'))),
                      Num(0))
    # fun range n0 = fn n1 => <body>
    fun_range = Fun('range', 'n0', Fn('n1', body))
    # range init_value end_value:
    app_range = App(App(Var('range'), Num(init_value)), Num(end_value))
    # Whole program:
    return Let('range', fun_range, app_range)
Can you write a new function for the following program? Notice that loop
receives another function; hence, it is a higher-order function.
let
fun loop n = fn f => fn a =>
if n = 1
then a
else loop (n-1) f (f a)
in
loop 10 (fn x => x + 1) 2
end
Here's the translation of the loop function program into Expression objects.
def create_loop(num_iterations):
    """
    Example:
    >>> program = create_loop(10)
    >>> v = VisitorEval()
    >>> program.accept(v, {})
    11
    """
    # The body of the recursive function:
    # if n = 1 then a else loop (n - 1) f (f a)
    body = IfThenElse(
        Lth(Var('n'), Num(2)),  # Condition: n < 2 (our language has <, not =)
        Var('a'),               # Then: a
        App(                    # Else: loop (n - 1) f (f a)
            App(App(Var('loop'), Add(Var('n'), Num(-1))),  # loop (n - 1)
                Var('f')),
            App(Var('f'), Var('a'))                        # f a
        )
    )
    # fun loop n = fn f => fn a => <body>
    fun_loop = Fun('loop', 'n', Fn('f', Fn('a', body)))
    # loop num_iterations (fn x => x + 1) 2:
    app_loop = App(App(App(Var('loop'), Num(num_iterations)),
                       Fn('x', Add(Var('x'), Num(1)))),
                   Num(2))
    # Whole program:
    return Let('loop', fun_loop, app_loop)
The function loop is a recursive function that takes three parameters: n, f, and a. If n = 1, it
returns a. Otherwise, it recursively calls loop (n-1) f (f a).
Could you modify the loop example so that it works as a for loop?
We can create a customizable loop structure using higher-order functions and recursion. This
allows you to not only iterate over a range of values but also to apply a custom function (the
"body" of the loop) at each step of the iteration. Here's the implementation:
def create_for_loop(begin, end, fn_body):  # parameter names are illustrative
    """
    Example:
    >>> program = create_for_loop(2, 10, Fn('x', Add(Var('x'), Num(1))))
    >>> v = VisitorEval()
    >>> program.accept(v, {})
    7
    """
    ...
    # Whole program:
    return Let('loop', fun_loop, app_loop)
How It Works:
● Recursive Loop (Fun) Construction: This structure mirrors the logic of a for loop by
having the recursive function (loop) handle the increment or update of the loop variable
(n), and apply the given function to the accumulator (a).
● End of Loop Condition (IfThenElse): The condition checks whether the loop has
reached its end (when n < 2). If so, it returns the current value (a). Otherwise, it
continues the recursion.
● Loop Application: After defining the recursive function loop, the actual application
applies it with the start value (begin) and the user-defined function.
Our for loop is simulating the usage of higher-order functions. What is this
concept?
Higher-order functions (HOFs) are functions that can take other functions as arguments
and/or return functions as their result. This concept allows for flexible and reusable code,
making it possible to create generic patterns like your for loop, where the "body" of the loop is
passed in as a function.
This design allows you to reuse the create_for_loop function with different behaviors,
simply by passing in a different higher-order function. For example:
● To simulate a loop that increments a value, you pass a function that adds 1.
● To simulate a loop that doubles a value, you pass a function that multiplies the input by
2.
By separating the loop logic (iteration) from the operation (body of the loop), you've effectively
generalized the loop structure using higher-order functions.
Classic higher-order functions appear in most modern languages. In Python, for instance:
map: Applies a given function to each item in a list (or any iterable) and returns a new sequence
of results (a lazy iterator, in Python 3).
def square(x):
    return x * x
numbers = [1, 2, 3, 4]
result = list(map(square, numbers))  # Applies `square` to each element: [1, 4, 9, 16]
filter: Takes a function that returns a boolean and filters a list based on that function.
def is_even(x):
    return x % 2 == 0

evens = list(filter(is_even, numbers))  # Keeps only 2 and 4
reduce: Applies a binary function (a function that takes two arguments) cumulatively to the
items of a list, reducing it to a single value.
from functools import reduce

total = reduce(lambda x, y: x + y, numbers)  # ((1 + 2) + 3) + 4 = 10
Do you know what was the first programming language ever to become
popular that featured recursive functions?
Lisp was probably the first popular programming language to feature recursive functions. Lisp,
developed by John McCarthy in 1958, was one of the earliest
languages explicitly designed to support recursion as a fundamental feature. This was
groundbreaking at the time, as recursion was not only syntactically supported but also naturally
aligned with Lisp’s structure, allowing functions to call themselves and process complex
recursive data structures like lists and trees.
Recursive functions in Lisp were integral to its functional programming paradigm, distinguishing
it from other early languages like Fortran, which did not initially support recursion. Lisp’s
popularity in artificial intelligence research further helped establish recursion as a powerful
programming technique, influencing the design of many later programming languages.
The Algol family introduced recursion a bit later. Algol 58, the initial version, did not support
recursion. However, Algol 60 (published in 1960) explicitly added recursion as a supported
feature, influenced in part by the concepts emerging in Lisp. Algol 60 introduced block structure
and a formalized way to handle recursive functions, which influenced many later languages like
Pascal, C, and beyond.
To sum up, Lisp was the first widely used language to feature recursion, while Algol 60 was the
first major procedural language to include recursion formally, spreading the concept to a broader
programming audience.
So, in many ways, John McCarthy, the creator of Lisp, is one of the
forefathers of recursion in programming languages, right?
Yes, John McCarthy was indeed a pioneering figure in computer science, especially known for
his foundational work in artificial intelligence (AI) and programming languages.
1. Artificial Intelligence (AI) – McCarthy is often credited with coining the term artificial
intelligence in 1956, during the famous Dartmouth Conference, which he organized
alongside Marvin Minsky, Claude Shannon, and Nathaniel Rochester. This conference is
generally regarded as the birthplace of AI as an academic field, setting the stage for
decades of research and progress.
2. Lisp Programming Language – In 1958, McCarthy invented Lisp (short for LISt
Processing), which became the second oldest high-level programming language after
Fortran. Lisp introduced key concepts like recursion, symbolic computation, and
automatic memory management (garbage collection), which have since become
essential in many programming languages. Lisp’s design made it ideal for AI research,
as its syntax and structure supported symbolic manipulation and reasoning well-suited to
AI tasks.
3. Concept of Time-Sharing – McCarthy was instrumental in developing the concept of
time-sharing, which allowed multiple users to interact with a computer simultaneously.
This was revolutionary at the time, as early computers typically ran one job at a time.
Time-sharing paved the way for modern operating systems and laid the groundwork for
personal computing and interactive software.
4. Work on Formal Logic and the Theory of Computation – McCarthy made important
contributions to formal logic and the theory of computation, including his work on
formalizing common-sense reasoning in AI. He developed influential theories like the
situation calculus, a mathematical approach to representing dynamic systems in logic.
His work on formalizing knowledge and reasoning processes helped shape early AI
logic-based systems.
5. Advice Taker and Early AI Theories – McCarthy proposed the concept of the Advice
Taker, an early framework for reasoning programs that could process human advice or
commands to accomplish tasks. Though it wasn’t implemented at the time, the Advice
Taker introduced concepts related to knowledge representation, which are core to
modern AI systems.
6. Advocacy for Open Research and AI Ethics – McCarthy was ahead of his time in
promoting open research and discussing AI’s societal impact. He believed in the
collaborative nature of research and shared his ideas widely, encouraging the
development of a global AI community. He also thought deeply about AI’s potential
ethical implications and was one of the early voices advocating for responsible AI.
Overall, McCarthy's work laid much of the foundation for AI and influenced a broad array of
fields, including programming languages, computational logic, and human-computer interaction.
He received numerous accolades, including the Turing Award in 1971 for his contributions to the
field.
There are several kinds of machine code, each tailored to a specific CPU architecture. Here
are a few examples of where machine code shows up:
● Compiled Binaries:
○ Programs compiled from high-level languages (C, C++, Rust) into executables.
For example, on Linux, compiled binaries like /bin/ls contain machine code
that the processor directly executes.
● Firmware:
○ The embedded software running on hardware devices like network routers, IoT
devices, and microcontrollers. Firmware is machine code that controls the
device's low-level operation.
● Operating System Kernels:
○ Kernels like Linux or Windows contain large amounts of machine code designed
for specific CPU architectures (e.g., x86, ARM).
● BIOS/UEFI:
○ The firmware used to initialize hardware during the boot process of a computer is
also written in machine code tailored to specific processors.
There are low-level representations that are not really machine code, but
that we can still interpret, right?
Not all low-level code is machine code. There are other forms of low-level code, such as
bytecode and intermediate representations (IR), which are not directly executed by the
hardware. Instead, these forms are interpreted or further compiled before execution. Let's dive
into the differences.
1. Bytecode
○ What it is: Bytecode is a compact, intermediate form of code that is typically
produced by compilers for virtual machines (VMs). It’s not tied to a specific
hardware architecture but is designed to be interpreted by a virtual machine or
further compiled at runtime (JIT compilation).
○ Key Feature: Bytecode is portable across different hardware platforms since it's
not directly tied to any specific CPU. The virtual machine (e.g., Java Virtual
Machine) abstracts away the hardware specifics.
○ Example: Java Bytecode
■ When you compile a Java program, it doesn't produce machine code right
away. Instead, it generates Java bytecode, which is executed by the
Java Virtual Machine (JVM).
■ Example Bytecode Instruction: 0x3C → istore_1 (store an integer into
local variable 1).
■ The JVM interprets or compiles this bytecode to machine code at runtime.
○ Where it’s found:
■ Java programs: .class files contain bytecode.
■ Android apps: Dalvik/ART (Android Runtime) uses a different form of
bytecode to run applications on smartphones.
○ How it’s executed: The JVM reads the bytecode and either interprets it or
compiles it just-in-time (JIT) to machine code suitable for the host CPU.
2. Intermediate Representation (IR)
○ What it is: IR is a more abstract, platform-independent representation of a
program used during the compilation process. It acts as a bridge between
high-level languages and machine code. IR is not tied to any specific hardware,
and like bytecode, it's not directly executable by the CPU.
○ Key Feature: Designed to be transformed and optimized by compilers before
being translated into machine code for the target architecture.
○ Example: LLVM Intermediate Representation (LLVM IR)
■ The LLVM compiler uses an intermediate representation (LLVM IR),
which is a lower-level, platform-independent representation of the
program.
■ Example IR Instruction: %add = add i32 %a, %b → This LLVM IR
instruction adds two 32-bit integers (%a and %b).
■ Where it’s found: LLVM IR is commonly used in compilers such as
Clang, which translates C/C++ programs into LLVM IR before converting
them into machine code.
○ How it’s used: Compilers run optimization passes over the IR and then
lower it into machine code for the target architecture.
If bytecodes and IR are not machine code, can they still be executed?
Yes, it's still possible to execute programs in these formats, via interpretation.
Bytecode Execution:
● Bytecode is interpreted by a virtual machine (VM) such as the JVM or Dalvik VM.
● In some cases, bytecode can be Just-In-Time (JIT) compiled into machine code at
runtime. This is common in Java's HotSpot JVM, which converts frequently executed
bytecode into highly optimized machine code.
IR Execution:
● LLVM IR can also be executed directly: the lli tool either interprets the IR or
JIT-compiles it to machine code (more on this below).
What are the differences between machine codes and these other low-level
representations?
There are a few differences between these different forms of low-level code:
● Machine code: The binary instructions that a CPU can execute directly, tied to a specific
architecture (x86, ARM, RISC-V).
● Bytecode: A platform-independent, compact code interpreted or compiled at runtime by
a virtual machine (e.g., Java bytecode).
● Intermediate Representation (IR): A low-level, platform-independent code
representation used internally by compilers (e.g., LLVM IR), further compiled into
machine code for the target architecture.
Where it’s found:
● Machine code: compiled binaries, firmware.
● Bytecode: .class files, Android apps.
● Intermediate representations: inside compilers (LLVM, GCC).
What is the path that a program will traverse to become machine code?
This path depends heavily on the compiler. As an example, it suffices to take another look at the
cover of "Crafting Interpreters". But, if we talk about a full-fledged compiler, such as clang or
gcc, then we would find a path more or less like the one below (illustrated in the figure):
1. Source Code (.c): The top section shows a simple C program with a main function that
prints "Hello" using printf. This is the high-level source code, which is
human-readable.
2. Intermediate Representation (.ll): The next section displays the LLVM IR, generated
from the C code using clang -S -emit-llvm. This is an intermediate representation
that is platform-independent and still readable, although more abstract than the original
C code. It shows the call to printf, which has been transformed into lower-level
operations.
3. Assembly Instructions (.s): The third section displays the assembly code generated by
llc, targeting ARM (-march=arm). This representation is specific to the ARM
architecture and shows instructions like push, mov, and bl (branch link). This is closer
to machine-level operations and is used by the CPU.
4. Machine Code (.o): The final section shows the machine code, which is the binary
representation of the instructions, ready to be executed by the processor. Each
hexadecimal number corresponds to an ARM instruction or data value.
Each step reduces the abstraction level, moving from human-readable C code to
architecture-specific machine code that runs on the hardware.
Let's try to see the different representations of a larger program. Can you
write a simple program and convert it to the LLVM IR?
Let's go through a simple C program that includes some control flow, then generate its LLVM IR
representation using clang, and finally use llc to produce machine code for different
architectures.
Here’s a simple C program that includes a control flow statement (an if statement):
// example.c
#include <stdio.h>

int main() {
    int a = 5;
    int b = 10;
    int sum;
    if (a < b) {
        sum = a + b;
    } else {
        sum = a - b;
    }
    printf("%d\n", sum);
    return 0;
}
To generate the LLVM IR file from the C program, you can use the clang compiler with the -S
and -emit-llvm flags. Open a terminal and run the following command:
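clang -S -emit-llvm example.c -o example.ll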
This command creates a file named example.ll, which contains the LLVM IR representation
of the program.
You can view the contents of example.ll using a text editor or by running cat on the
file. If you do, then you should see something like this (simplified) representation:
; ModuleID = 'example.c'
source_filename = "example.c"
target datalayout = "..."
target triple = "x86_64-unknown-linux-gnu"
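The function body (hand-simplified here, and omitting the final printf call) would look roughly like this:

define i32 @main() {
entry:
  %a = alloca i32
  %b = alloca i32
  %sum = alloca i32
  store i32 5, i32* %a
  store i32 10, i32* %b
  %av = load i32, i32* %a
  %bv = load i32, i32* %b
  %cmp = icmp slt i32 %av, %bv        ; a < b?
  br i1 %cmp, label %if.then, label %if.else

if.then:
  %add = add nsw i32 %av, %bv         ; sum = a + b
  store i32 %add, i32* %sum
  br label %if.end

if.else:
  %sub = sub nsw i32 %av, %bv         ; sum = a - b
  store i32 %sub, i32* %sum
  br label %if.end

if.end:
  ret i32 0
}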
Now, let's use llc to generate machine code for different architectures. We can generate code
for x86_64 and RISC-V as examples.
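For instance (the exact flag spellings can vary a bit across LLVM versions):

llc example.ll -march=x86-64 -o example.x86
llc example.ll -march=riscv64 -o example.riscv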
You can translate the assembly code to machine code using an assembler such as the GNU
Assembler (as):
as example.x86 -o example.o
You can examine the generated object files using objdump or similar tools.
For example (the dump below was produced on an ARM64 machine, so it shows AArch64 rather
than x86_64 instructions):
objdump -d example.o
0000000000000000 <ltmp0>:
0: ff c3 00 d1 sub sp, sp, #48
4: fd 7b 02 a9 stp x29, x30, [sp, #32]
8: fd 83 00 91 add x29, sp, #32
c: bf c3 1f b8 stur wzr, [x29, #-4]
10: a8 00 80 52 mov w8, #5
14: a8 83 1f b8 stur w8, [x29, #-8]
18: 48 01 80 52 mov w8, #10
1c: a8 43 1f b8 stur w8, [x29, #-12]
20: a8 83 5f b8 ldur w8, [x29, #-8]
24: a9 43 5f b8 ldur w9, [x29, #-12]
28: 08 01 09 6b subs w8, w8, w9
2c: ca 00 00 54 b.ge 0x44 <ltmp0+0x44>
30: a8 83 5f b8 ldur w8, [x29, #-8]
34: a9 43 5f b8 ldur w9, [x29, #-12]
38: 08 01 09 0b add w8, w8, w9
3c: e8 13 00 b9 str w8, [sp, #16]
40: 05 00 00 14 b 0x54 <ltmp0+0x54>
44: a8 83 5f b8 ldur w8, [x29, #-8]
48: a9 43 5f b8 ldur w9, [x29, #-12]
4c: 08 01 09 6b subs w8, w8, w9
50: e8 13 00 b9 str w8, [sp, #16]
54: e8 13 40 b9 ldr w8, [sp, #16]
58: e0 03 08 aa mov x0, x8
5c: 09 00 00 90 adrp x9, #0
60: 29 01 00 91 add x9, x9, #0
64: e0 07 00 f9 str x0, [sp, #8]
68: e0 03 09 aa mov x0, x9
Summary
This workflow highlights how high-level code is transformed into lower-level representations
suitable for different CPU architectures.
1. Lowering of High-Level Constructs:
● During this phase, high-level constructs such as loops, function calls, conditional
statements, and arithmetic operations are mapped to lower-level abstractions that can
be easily manipulated by the compiler.
● For example, a for loop in C might be transformed into basic blocks with control flow
edges, representing the loop initialization, condition checking, and update statements.
● Interesting part: High-level features, such as complex data types, are simplified. For
instance, structures and arrays are flattened into memory addresses and offsets.
Function calls become simple jumps, while C’s undefined behavior rules can allow
optimizations to discard certain checks.
2. Platform Independence:
● LLVM IR is independent of any particular machine: the same IR can later be lowered to
many different target architectures.
3. Optimization Opportunity:
● The IR generation step sets the stage for optimizations. LLVM IR is designed to
facilitate various optimization passes. For example:
○ Dead code elimination: Removing code that will never be executed.
○ Constant propagation: Replacing variables that are constant with their actual
values.
○ Loop optimizations: Simplifying or unrolling loops to reduce runtime overhead.
● Interesting part: Because IR is more abstract than assembly, the compiler can reason
about the program's behavior without worrying about architecture-specific details,
enabling complex, high-level optimizations.
4. Simplification:
● During this transformation, complex control flow and expressions in the source code are
simplified into basic operations and basic blocks (a straight-line sequence of
instructions with no branches, except at the entry and exit points). This simplification
allows the later stages of the compiler to reason more easily about program behavior.
● For example, a complex if-else statement might be transformed into several basic
blocks with conditional branches between them.
● Interesting part: The control flow graph (CFG) of the program becomes clearer in IR. By
breaking down control flow into branches and blocks, it becomes easier for optimization
passes to analyze dependencies and execution paths.
5. Abstraction of Data and Memory:
● The IR abstracts data types and memory accesses. For instance, local variables may
still be represented in IR, but they will eventually be lowered to stack or register
accesses depending on optimizations.
● Interesting part: This abstraction simplifies memory management. Variables are either
SSA values (Static Single Assignment) or memory locations, making it easier to track
where values are modified and to optimize away unnecessary memory operations.
6. Function Representation:
● Functions in LLVM IR are represented with basic blocks, which are sequences of
instructions with a clear entry and exit point, and are often laid out in Static Single
Assignment (SSA) form. SSA ensures that every variable is assigned exactly once,
which simplifies dependency analysis and allows for better optimizations.
● Interesting part: SSA form makes dependency analysis (e.g., which variables depend
on which other variables) much simpler, enabling more powerful and safe optimizations.
It also allows easier reasoning about control flow and variable lifetimes.
7. Intrinsic Functions:
● LLVM IR introduces intrinsic functions, which are special functions that represent
low-level operations not directly available in high-level languages. For example, there
might be intrinsics for vector operations, atomic operations, or memory barriers. These
provide hooks for optimizations and fine-tuning of platform-specific operations.
● Interesting part: Intrinsic functions allow the compiler to introduce hardware-specific
instructions without exposing them at the source level, keeping IR platform-neutral while
still offering the potential for fine-tuned performance on different architectures.
Let's go over the LLVM IR. What are interesting aspects of this intermediate
representation?
The figure below is a control-flow graph: a way to show the low-level representation of
programs. This CFG was produced for the following program:
#include <stdio.h>
int main(int argc, char *argv[]) {
    if (argc != 2) {
        printf("Usage: %s <input_string>\n", argv[0]);
        return 1;
    }
    int count = 0;
    while (argv[1][count] != '\0') {
        count++;
    }
    printf("The input string '%s' has %d characters.\n", argv[1], count);
    return 0;
}
We can produce the Control-Flow Graph (CFG) of this program with the following commands in
clang:
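One possible sequence (the pass syntax varies across LLVM versions; older releases spell the second step as opt -dot-cfg):

clang -S -emit-llvm example.c -o example.ll
opt -passes=dot-cfg example.ll -disable-output
dot -Tpdf .main.dot -o main_cfg.pdf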
The control-flow graph we've produced for the C program is a visual representation of the flow
of execution through the program in LLVM Intermediate Representation (IR). Let's break down
the interesting aspects of this representation.
1. Basic Blocks:
● Each node in the graph is a basic block: a straight-line sequence of instructions with a
single entry point and a single exit point.
2. Typed Instructions:
● Yes, LLVM IR is strongly typed. You can see that every instruction has types
associated with the values they operate on. For instance, i32 stands for a 32-bit integer,
i8 for an 8-bit integer (a byte), and pointers are indicated by i8* or i32*.
● These types ensure that the operations performed on the data are correct. For example,
you can’t add an integer to a pointer without explicit casting, enforcing type safety at this
low level.
3. Phi Functions:
● The phi functions (%phi) are specific to Static-Single Assignment (SSA) form, which is
the form used by LLVM IR.
● A phi function is used to choose a value based on the control flow. For example, in block
%10, there’s a phi function (%11 = phi i64 [%15, %10], [0, %7]). This
instruction selects the value of %15 if control comes from block %10, or 0 if control
comes from block %7.
● This is necessary in SSA because variables are immutable, meaning each variable can
only be assigned once. Phi functions allow merging different values depending on the
incoming path in the CFG.
4. Control-Flow Edges:
● The edges between the blocks represent the possible control flow paths between basic
blocks. These are determined by conditional and unconditional branch instructions (br in
LLVM IR).
● For instance, in block %2, there’s a conditional branch (br i1 %3, label %4,
label %7). This means that based on the condition %3 (a comparison), control either
flows to block %4 (T) or block %7 (F).
● The control flow ensures that certain code is only executed under specific conditions, like
loops or if-else structures.
5. Loops:
● The red-colored block %10 is part of a loop. You can see that this block branches back
to itself, creating a loop structure. The instruction %14 = icmp eq i8 %13, 0 checks
whether a condition is met (whether the current character in the string is '\0'). If not,
the loop continues.
● LLVM adds metadata for loops (!llvm.loop), which can be useful for loop
optimizations.
6. Function Calls:
● You can spot calls to functions, such as the printf function in blocks %4 and %16. The
LLVM IR includes calls to external functions using the call instruction. The
getelementptr (GEP) is used here to compute the memory addresses for the
arguments passed to printf.
7. Branching Logic:
● Conditional branches (br i1 ...) encode the if-else logic of the source program,
steering execution to the true or the false successor block.
8. SSA Form:
● LLVM IR is in Static Single Assignment form, meaning each variable (or register) is
assigned exactly once. This simplifies many analyses and transformations in the
compiler. To deal with multiple assignments to a variable, the phi functions are used at
control-flow merges.
Summary:
This CFG allows us to visualize the flow of execution in a program and understand how control
moves from one block to another, based on the values and conditions evaluated during
execution. It’s especially useful when optimizing and transforming code at the compiler level!
Why does LLVM use an infinite number of register names? Is this not too
high level? When will these names be mapped to actual locations?
LLVM uses an infinite number of virtual registers in its intermediate representation (IR) for
two main reasons:
● SSA Form (Static Single Assignment): LLVM IR is in SSA form, meaning each
variable (or "register") is assigned exactly once. This simplifies analysis and optimization
because:
○ There is no need to worry about multiple definitions of the same variable.
○ Data flow analysis becomes straightforward, as each value has a single, clear
origin.
● Virtual registers behave like variables in a high-level language, making transformations
and optimizations easier for the compiler developer.
● Virtual registers are not constrained by the limited number of physical registers on
the target architecture. This enables optimizations without worrying about hardware
constraints during the IR stage.
● This abstraction allows LLVM to perform optimizations independently of the target
machine, making the IR more portable.
This mapping occurs during the backend stage of the compilation pipeline, specifically in the
register allocation phase. Here’s how the process works:
In SSA form:
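For example (an illustrative sequence, where every value receives a fresh virtual register):

%1 = add i32 %a, %b
%2 = mul i32 %1, %c
%3 = sub i32 %2, %1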
During register allocation for a target with physical registers R0, R1, and so on:
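add R0, R1, R2    # %1 -> R0 (the mapping is shown in comments; names illustrative)
mul R3, R0, R4    # %2 -> R3
sub R3, R3, R0    # %3 -> R3, reusing R3 once %2 is dead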
If there are not enough registers, some of these intermediate results might be spilled to memory.
For instance:
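sw R0, 0(sp)      # spill %1 to the stack to free R0 (illustrative)
...               # R0 gets reused for other values in the meantime
lw R0, 0(sp)      # reload %1 right before it is needed again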
This separation of virtual and physical registers allows LLVM to be both powerful and portable,
catering to a wide range of optimizations and target architectures.
It seems that all the info necessary to run the program is embedded in the
LLVM IR. Can we interpret this IR to see the results of the execution?
Yes, we can! For instance, to run the program above, you can do:
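lli example.ll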
Conclusion
LLVM IR is not designed specifically for interpretation, but its architecture allows it to be
interpreted or executed via tools like lli. This flexibility is a byproduct of its design for
optimization and portability, making LLVM IR versatile for tasks beyond its primary role.
So, the LLVM IR is a "Virtual Instruction Set". But the Java Bytecodes are
also a "Virtual Instruction Set". How are they different?
LLVM IR and Java bytecode are both virtual instruction sets, but they serve different purposes,
have different designs, and are intended for different execution environments. Here's a
comparison between them:
● LLVM IR:
○ Low-level Representation: LLVM IR is closer to assembly language but still
abstract enough to be platform-independent. It includes instructions for primitive
operations (arithmetic, memory access, branching) and supports features like
type information, SSA (Static Single Assignment) form, and control flow.
○ Three-Address Code: LLVM IR instructions typically operate in a three-address
format, meaning instructions take a destination and two source operands (e.g.,
%1 = add i32 %2, %3).
○ Type System: LLVM IR has a rich type system, supporting a wide range of
primitive and complex types, including pointers, arrays, structures, and vectors. It
allows fine-grained control over data representation.
● LLVM IR:
○ Compilation Stages: Source code (e.g., in C, C++, Rust) is first compiled into
LLVM IR, which can then be further optimized using the LLVM toolchain. Finally, it
is compiled into native machine code for specific architectures.
○ JIT Compilation: LLVM also supports JIT compilation, where LLVM IR can be
compiled into machine code at runtime, making it flexible for use in dynamic
languages or environments where runtime code generation is needed.
● Java Bytecode:
○ Bytecode Interpretation and JIT: Java bytecode is typically interpreted by the
JVM at runtime. However, JVMs also employ JIT compilation, where frequently
used bytecode is compiled into native machine code for performance
improvements.
○ Write Once, Run Anywhere: Since Java bytecode is platform-independent, it
can run on any system with a compatible JVM. This contrasts with LLVM IR,
which is more of a middle step in a compilation process and not directly tied to a
specific runtime environment like the JVM.
5. Portability
● LLVM IR:
○ While LLVM IR is portable in terms of the machine code it can generate, it is not
designed for direct portability in the way Java bytecode is. LLVM IR is a
compilation step, not an execution format.
● Java Bytecode:
○ Java bytecode is portable across any platform that supports the JVM. This allows
developers to compile once and run their programs on any system without
recompilation, making it extremely portable.
Summary Table
Feature        LLVM IR                              Java Bytecode
Type System    Rich, with pointers, structs, etc.   Simplified, JVM-focused types
In summary, LLVM IR is designed for optimization and portability during the compilation
process, while Java bytecode is intended for portability and execution on the JVM.
The figure compares Java bytecode and LLVM IR by showing equivalent factorial
implementations in Java and C, along with their compiled representations. Key differences
include:
● LLVM IR: It uses a three-address code style with more detailed control flow (using labels
and phi nodes) and explicit type information (i32 for integers). LLVM IR is designed for
optimizations and machine code generation across different architectures, while Java
bytecode is optimized for execution on the JVM.
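For reference, this is roughly how an addition like a + b looks as Java bytecode, a stack-machine program (a sketch; the local variable slots are illustrative):

iload_1    // push local variable 1 (a) onto the operand stack
iload_2    // push local variable 2 (b) onto the operand stack
iadd       // pop both values and push their sum
istore_3   // pop the sum into local variable 3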
The interpreter just follows the stack operations directly. Now, see how addition is represented in
the LLVM IR:
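Here is a sketch (assuming the operands a and b live in stack slots):

%1 = load i32, i32* %a    ; load the first operand
%2 = load i32, i32* %b    ; load the second operand
%3 = add i32 %1, %2       ; add them; the result lives in %3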
The interpreter must manage registers (%1, %2, %3), memory locations, and types explicitly.
1. Instruction Selection:
● This phase maps target-independent IR operations onto concrete instructions of the
target ISA.
Tool: TableGen, along with SelectionDAG ISel (Instruction Selector), plays a significant role in
this phase.
2. Instruction Scheduling:
● This phase reorders the selected instructions to hide instruction latencies and keep the
processor pipeline busy.
Tool: LLVM’s MachineScheduler (part of the LLVM codegen backend) and TableGen for
target-specific scheduling models.
3. Register Allocation:
● In LLVM IR, variables (or SSA values) are assumed to have an infinite number of
registers, but real hardware has a limited number of physical registers. Register
allocation is the phase where these virtual registers are mapped to the physical
registers of the target architecture.
● Register allocation involves two key tasks:
○ Register assignment: Mapping virtual registers to physical registers.
○ Spilling: If there aren’t enough physical registers to hold all values, some values
must be spilled to memory (typically the stack), and load/store instructions are
inserted to handle these spills.
● LLVM provides several register allocators, such as:
○ Linear Scan Register Allocator: A simpler, faster allocator, though not as
optimal in terms of minimizing spills.
○ Greedy Register Allocator: A more sophisticated allocator that considers live
ranges and aims to minimize the number of spills and reloads.
After instruction selection, scheduling, and register allocation, several other steps take place to
fully prepare the machine code for execution:
● Prolog/Epilog Insertion: This step involves inserting code for function entry and exit,
such as saving/restoring registers to/from the stack, setting up the stack frame, etc.
● Peephole Optimization: This phase performs small, local optimizations to further
simplify and streamline the machine code, such as removing redundant instructions or
simplifying instruction sequences.
● Code Emission: The final step, where the machine instructions are emitted as assembly
code or directly as object code. This is where the tool llc comes into play.
● llc (LLVM static compiler): This tool is responsible for converting LLVM IR into
machine-specific assembly or object code. It leverages the results of instruction
selection, scheduling, and register allocation to produce optimized machine code.
○ Example: llc input.ll -o output.s generates assembly code
(output.s) from the LLVM IR file (input.ll).
● TableGen: This is a meta-programming tool in LLVM that helps define target-specific
details, such as instruction sets, register classes, instruction scheduling models, etc. It
generates C++ code for instruction selection, register allocation, and instruction
scheduling.
Summary:
This process allows LLVM to efficiently compile programs for a wide variety of architectures and
is at the heart of LLVM’s portability and optimization capabilities.
Let's use objdump to inspect the binary code that is produced by clang.
Can you go over the binary, and figure it out?
To produce the binary file and inspect it with objdump, do the following:
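For example (assuming the source file is called program.c):

clang program.c -o a.out
objdump -d a.out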
When using objdump -d to disassemble the binary generated by your program, you're looking
at its low-level machine code. This output shows the ARM64 assembly code for different
sections of your program, primarily focusing on the __TEXT,__text section (where your actual
program's machine code resides) and the __stubs and __stub_helper sections (which are
involved in dynamic linking).
1. Function Prologue and Epilogue
● Stack allocation: The sub sp, sp, #80 instruction reserves 80 bytes on the
stack.
● Save frame pointer and return address: The stp x29, x30, [sp, #64]
stores the old frame pointer and return address.
● Epilogue: Similarly, at the end of the function, the saved frame pointer and return
address are restored, and the stack is deallocated. This returns control to the caller by
restoring the stack and returning with ret:
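A sketch of an epilogue matching the prologue above:

ldp x29, x30, [sp, #64]   // restore the frame pointer and return address
add sp, sp, #80           // release the 80-byte stack frame
ret                       // return to the caller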
2. Stack Usage
ARM64 uses the stack to manage local variables and function arguments. Load and store
instructions move data between registers and the stack, storing and loading function
arguments or local variables. For instance:
● stur x1, [x29, #-16]: Stores the value of x1 at an offset of 16 bytes below the
frame pointer.
● ldur w8, [x29, #-8]: Loads a 32-bit value from 8 bytes below the frame pointer
into register w8.
3. Conditional Branching
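A representative pair of instructions (the branch target label is illustrative):

cmp w8, #2                // compare w8 with the constant 2
b.ge <else_branch>        // branch if greater than or equal (signed)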
Here, the cmp w8, #2 compares the value in w8 with 2, and b.ge branches if the comparison
results in greater-than or equal (signed comparison).
4. Dynamic Linking
Sections __stubs and __stub_helper are related to dynamic linking, which is used to
resolve external functions at runtime (like library functions). This is managed by the
dyld_stub_binder, a part of the dynamic linker that resolves addresses of external functions.
● The __stubs section has a branch through register (indirect jump) that takes the
program to the dynamically linked function:
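The typical ARM64 stub pattern looks like this (symbolic operands are placeholders):

adrp x16, <page>          // load the page address of the function pointer
ldr  x16, [x16, <offset>] // load the resolved function address
br   x16                  // jump to the dynamically linked function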
● The __stub_helper section is involved in setting up the address for the external
function and managing stack frames.
5. External Function Calls
In the main section, you see a call to bl (branch and link), which is an instruction used to call
external functions. The address of dyld_stub_binder is resolved and used to perform
dynamic linking:
6. Instruction Formats
● ARM64 instructions are 32 bits wide and use a fixed instruction format. You see a variety
of data manipulation instructions such as ldr (load register), str (store register), and
branch instructions like b and bl.
objdump is a powerful tool used to inspect and analyze binary files. In this case, you are using
it to disassemble your binary and inspect the generated assembly code. Beyond disassembly,
objdump can also:
● Dump the raw contents of each section (objdump -s).
● Display symbol tables (objdump -t) and section headers (objdump -h).
● Show relocation entries (objdump -r).
In summary, you're seeing the machine-level translation of your program, with key features such
as function prologues/epilogues, stack usage, control flow (branching), and dynamic linking
mechanisms, all captured by objdump.
The output of objdump -s a.out shows the contents of different sections in the compiled
binary a.out. These sections organize various parts of the program's data and instructions.
Here's an explanation of each section in the output:
Sections Breakdown
1. __text
○ Purpose: Contains the program's machine code (compiled instructions).
○ Details:
■ The hexadecimal numbers are ARM64 instructions (based on the format).
■ This section represents the executable logic of your program, including
the main function and any other compiled functions.
2. __stubs
○ Purpose: Holds stubs (placeholders) for dynamically linked functions like
printf.
○ Details:
■ When a dynamically linked function is called for the first time, the stub
redirects control to the dynamic linker to resolve the function's address.
3. __stub_helper
○ Purpose: Assists in resolving symbols during dynamic linking by providing helper
code.
○ Details:
■ This section contains code that helps with lazy symbol resolution.
4. __cstring
○ Purpose: Stores constant strings used in the program.
○ Includes:
■ Usage: %s <input_string>
■ The input string '%s' has %d characters.
5. __unwind_info
○ Purpose: Contains metadata for stack unwinding, used in exception handling
and debugging.
○ Details:
■ This metadata helps restore the state of the program if an exception
occurs or during stack tracing for debugging purposes.
6. __got (Global Offset Table)
○ Purpose: Holds addresses of dynamically resolved symbols (functions or
variables).
○ Details:
■ Initially, the entries may be zeros or placeholders.
■ At runtime, the dynamic linker populates this table with actual addresses.
7. __la_symbol_ptr
○ Purpose: Holds the lazy symbol pointers that the dynamic linker fills in with
the actual addresses of external functions once they are resolved.
The binary file you compiled is dynamically linked and uses the default build configuration. This
results in additional sections to support runtime features like dynamic linking and lazy symbol
resolution.
If you want a leaner binary, two common options are:
● Static linking: Use gcc -static -o a.out program.c to embed all dependencies
into the binary.
● Stripping symbols: Use strip a.out to remove debugging and linking metadata.
To explore the binary further, you can:
1. Disassemble the Code: Run objdump -d a.out to see the assembly code
corresponding to the __text section.
2. Extract Strings: Use strings a.out to list all readable strings, like the ones in
__cstring.
3. Analyze Dynamic Linking: Use ldd a.out to see the shared libraries your binary
depends on.
CISC (Complex Instruction Set Computing) designs are characterized by:
● Complex Instructions: CISC instructions are often variable-length and can perform
multiple operations in a single instruction (e.g., load from memory and perform arithmetic
in one go).
● Memory-to-Memory Instructions: CISC instructions can work directly on memory,
unlike RISC, where memory is accessed only via load/store instructions.
CISC architectures tend to have more complex instructions that may execute multiple
operations. Examples of CISC instruction sets include x86 and VAX.
RISC architectures aim to simplify instructions, focusing on instructions that execute in a single
clock cycle. Examples of RISC instruction sets include:
1. ARM (Advanced RISC Machine):
○ One of the most widely used RISC architectures, especially in mobile devices.
○ ARM uses a fixed instruction length (32 bits; even in ARM64, instructions stay
32 bits wide, while registers and addresses are 64 bits).
○ Example Instruction:
■ ADD R0, R1, R2 — Adds the contents of R1 and R2 and stores the
result in R0.
■ LDR R0, [R1] — Loads data from memory (address held in R1) into R0.
2. MIPS (Microprocessor without Interlocked Pipeline Stages):
○ A clean RISC architecture that focuses on simplicity and speed.
○ Example Instruction:
■ ADD $t0, $t1, $t2 — Adds registers $t1 and $t2, stores the result
in $t0.
■ LW $t0, 4($t1) — Loads a word from memory address 4 + $t1 into
register $t0.
3. RISC-V:
○ A newer open-source RISC instruction set architecture.
○ Example Instruction:
■ ADD x1, x2, x3 — Adds registers x2 and x3, stores the result in x1.
■ LW x1, 0(x2) — Loads a word from the memory address stored in x2
into x1.
4. SPARC (Scalable Processor Architecture):
○ Developed by Sun Microsystems.
○ Example Instruction:
■ ADD %r1, %r2, %r3 — Adds the contents of %r1 and %r2, stores the
result in %r3.
■ LD [%r1+4], %r2 — Loads a word from memory address %r1 + 4
into register %r2.
● CISC (e.g., x86, VAX) instructions often combine multiple operations (e.g., load from
memory and perform arithmetic in one step).
● RISC (e.g., ARM, MIPS, RISC-V) instructions are simpler, with separate instructions for
each step (e.g., first load from memory, then perform arithmetic).
CISC instructions tend to encode more complex operations and may take multiple cycles to
execute, while RISC instructions are designed to be simple, with most executing in a single
clock cycle.
In this course's homework, students will generate code using RISC-V. What
is RISC-V, and which of its instructions shall we be using?
RISC-V is an open-source instruction set architecture (ISA) based on the principles of reduced
instruction set computing (RISC). It provides a simple and flexible framework for designing
processors that range from low-power microcontrollers to high-performance computers.
RISC-V’s modularity allows for extensions (like multiplication or floating-point support) without
unnecessary complexity, making it ideal for educational use and research.
In your course, students will work with 14 key RISC-V instructions that cover arithmetic, memory
access, and control flow. Here's a brief explanation of each:
1. add rd, rs1, rs2: Performs an addition of the values in registers rs1 and rs2 and
stores the result in rd. Example: add x3, x1, x2 adds x1 and x2 and stores the
result in x3.
2. addi rd, rs1, imm: Adds an immediate value (constant) imm to the value in register
rs1 and stores the result in rd. Example: addi x3, x1, 5 adds 5 to the value in x1
and stores the result in x3.
3. mul rd, rs1, rs2: Multiplies the values in rs1 and rs2, storing the product in rd.
Example: mul x3, x1, x2 multiplies x1 by x2 and stores the result in x3.
4. sub rd, rs1, rs2: Subtracts the value in rs2 from rs1 and stores the result in rd.
Example: sub x3, x1, x2 computes x1 - x2 and stores the result in x3.
5. xor rd, rs1, rs2: Performs a bitwise XOR (exclusive OR) of rs1 and rs2, storing
the result in rd. Example: xor x3, x1, x2 computes x1 ^ x2 and stores the result
in x3.
6. xori rd, rs1, imm: Performs a bitwise XOR between rs1 and an immediate value
imm, storing the result in rd. Example: xori x3, x1, 5 XORs x1 with 5 and stores
the result in x3.
7. div rd, rs1, rs2: Divides the value in rs1 by the value in rs2 (signed integer
division) and stores the quotient in rd. Example: div x3, x1, x2 computes x1 / x2
and stores the quotient in x3.
8. slt rd, rs1, rs2: Sets rd to 1 if rs1 is less than rs2 (signed comparison),
otherwise sets rd to 0. Example: slt x3, x1, x2 checks if x1 < x2 and stores 1 or
0 in x3.
9. slti rd, rs1, imm: Sets rd to 1 if rs1 is less than the immediate value imm (signed
comparison), otherwise sets rd to 0. Example: slti x3, x1, 5 checks if x1 < 5
and stores 1 or 0 in x3.
10. beq rs1, rs2, lab: Branches to the label lab if the values in rs1 and rs2 are
equal; otherwise, the program counter increments to the next instruction. Example: beq
x1, x2, label branches to label if x1 == x2.
11. jal rd, lab: Jumps to the label lab and stores the address of the next instruction
(current program counter + 1) in rd. Example: jal x1, label jumps to label and
stores the return address in x1.
12. jalr rd, rs1, offset: Jumps to the address in rs1 + offset and stores the
return address (current program counter + 1) in rd. Example: jalr x1, x2, 4 jumps
to the address x2 + 4 and stores the return address in x1.
13. sw reg, offset(rs1): Stores the value in register reg to memory at the address
rs1 + offset. Example: sw x3, 4(x1) stores the value in x3 to memory at the
address x1 + 4.
14. lw reg, offset(rs1): Loads a value from memory at the address rs1 + offset
into register reg. Example: lw x3, 4(x1) loads a value from memory at x1 + 4 into
x3.
These instructions form the basis for building more complex operations in RISC-V, focusing on
simplicity and efficiency, typical of RISC architectures. In the rest of these classes, RISC-V will
be our target representation! However, we shall assume one simplification: we will have an
infinite supply of register names, just like in the LLVM intermediate representation.
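For instance, with made-up virtual registers v0, v1 and v2 standing in for physical x registers:

addi v0, x0, 5     # v0 = 0 + 5
addi v1, x0, 10    # v1 = 0 + 10
add  v2, v0, v1    # v2 = v0 + v1 = 15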
2. Program Optimization
● Allen laid much of the foundation for automatic program optimization; her 1971 paper
with John Cocke, "A Catalog of Optimizing Transformations", systematized many of the
transformations compilers still apply today.
4. Interprocedural Optimization
● Allen also pioneered interprocedural analysis, showing how a compiler can analyze and
optimize a program across procedure boundaries rather than one routine at a time.
Broader Impact
● Breaking Gender Barriers: As the first woman to win the Turing Award and a trailblazer
in a male-dominated field, she inspired generations of women in computing.
● Education and Advocacy: Allen actively promoted the importance of advancing
compiler technology for scientific and industrial applications.
Each of these rules operates within a state transition system, where the state (S) and the
registers (R) are modified by the instructions, and the program counter (pc) advances. The
rules specify what happens to these components when each instruction is executed:
1. BEQ (Branch if Equal)
● Compare the values in registers rs1 and rs2. If they are equal, update the program
counter (pc) by adding the immediate value imm. Otherwise, move to the next
instruction.
2. BNE (Branch if Not Equal)
● If R[rs1] is not equal to R[rs2], update the program counter by adding the immediate
value imm. Otherwise, proceed to the next instruction.
3. JAL (Jump and Link)
● Save the address of the next instruction (pc + 1) in register rd and jump to the address
calculated by adding the immediate value imm to the program counter.
4. JALR (Jump and Link Register)
● Save the address of the next instruction (pc + 1) in register rd, then jump to the sum of
the value in register rs1 and the immediate value imm.
5. SW (Store Word)
● Store the value in register rs2 at the memory address calculated by adding the
immediate value imm to the value in register rs1.
6. LW (Load Word)
● Load the value from memory at the address calculated by adding the immediate value
imm to the value in register rs1, and store it in register rd.
7. ADD (Add)
● Add the values in registers rs1 and rs2 and store the result in register rd.
8. ADDI (Add Immediate)
● Add the immediate value imm to the value in register rs1 and store the result in register
rd.
9. MUL (Multiply)
● Multiply the values in registers rs1 and rs2, and store the result in register rd.
10. SUB (Subtract)
● Subtract the value in register rs2 from the value in register rs1, and store the result in
register rd.
11. XOR (Exclusive Or)
● Perform a bitwise XOR of the values in registers rs1 and rs2, and store the result in
register rd.
12. XORI (Exclusive Or Immediate)
● Perform a bitwise XOR between the value in register rs1 and the immediate value imm,
and store the result in register rd.
13. DIV (Divide)
● Divide the value in register rs1 by the value in register rs2 (signed integer division),
and store the quotient in register rd.
14. SLT (Set Less Than)
● If the value in register rs1 is less than the value in register rs2, set register rd to 1.
Otherwise, set it to 0.
EIndc (the inductive rule):
● This rule applies when the current instruction at pc is a valid instruction (i.e., Prog[pc]
= inst).
● First, the instruction is fetched from the program counter (pc) and executed, producing a
new state (S', R', and pc').
● After executing this instruction, the program continues with the updated state (S', R',
pc'), fetching and executing the next instruction.
● The process repeats until the full program is executed, transitioning through different
states (S'', R'', pc''), as indicated by multiple steps in the rule.
EBase (the base rule):
● This rule applies when the program counter (pc) reaches a location in the program
where there is no valid instruction (i.e., Prog[pc] = ⊥).
● At this point, the program has no more instructions to execute, and the final state
remains as it is (S, R, pc).
● This rule essentially signifies the end of the program's execution.
Imagine that you had to write the rules above in Python. What would they
look like?
Here's an idea of how you could translate the operational semantics rules for RISC-V programs,
specifically EIndc and EBase, into Python-like pseudocode. This will give you a sense of how
the rules would be implemented programmatically.
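A minimal sketch (assuming the program is a dictionary mapping addresses to instruction objects, and that each instruction offers an eval method returning the updated state, registers, and program counter; these names are illustrative):

def run(prog, S, R, pc):
    # EBase: Prog[pc] = ⊥ — no instruction left, so the current state is final.
    if pc not in prog:
        return S, R, pc
    # EIndc: fetch the instruction at pc, execute it to obtain a new state,
    # and then keep running from that new state.
    inst = prog[pc]
    S1, R1, pc1 = inst.eval(S, R, pc)
    return run(prog, S1, R1, pc1)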
Key Ideas:
● The base case mirrors EBase: when Prog[pc] is undefined (⊥), there is nothing left to
fetch, and the current state (S, R, pc) is final.
● The recursive case mirrors EIndc: fetch the instruction at pc, execute it to obtain a new
state, and then recurse on that state.
This Python-like pseudocode gives you an idea of how the recursive execution of a RISC-V
program could look in code form, corresponding to the rules you're working with!
1. AST Representation:
● The program is first represented as an AST, where each node represents a specific
construct in the programming language. For instance, variables, numbers, and arithmetic
operations like addition are all represented as nodes.
● These nodes form a hierarchical structure, where more complex expressions are
composed of simpler ones.
2. Tree Traversal with the Visitor Pattern:
● The Visitor Pattern is used to traverse the AST. The "visitor" visits each node of the
AST and generates the corresponding machine instructions.
● Each type of node in the AST (like a variable, number, or addition) has its own method in
the visitor, which handles how to generate code for that specific node type.
3. Instruction Generation:
● For each node of the AST, the visitor generates the necessary instructions to evaluate
that expression. For example:
○ A variable node fetches the value of the variable from memory or a register.
○ A number node directly creates a load instruction to store the number into a
register.
○ An addition node generates instructions to add the results of its two
sub-expressions.
4. Register Management:
● During code generation, the visitor uses registers to hold intermediate results. Since
you're assuming an infinite supply of registers (see last class), each step in the code
generation process returns a new register that holds the result of the computation for
that node.
● The generated instructions reference these registers, and each register corresponds to a
variable holding some value at runtime.
5. Program Representation
● The program being generated is stored in a structure (let's call it prog, which contains
methods to insert and retrieve instructions). This program consists of a sequence of
instructions.
● The visitor adds instructions to prog as it traverses the AST, ensuring that the final
sequence of instructions correctly evaluates the entire expression.
6. Return Values:
● After visiting a node, the visitor method returns the name of the register that holds the
result of the code generated for that node. For example:
○ If visiting a number like 13, the visitor generates an instruction to load 13 into a
register and returns the register name.
○ If visiting an addition like 13 + 10, the visitor generates instructions to add two
registers and returns the name of the register holding the result.
By using the visitor pattern and an AST, this approach ensures that the code generation process
is modular, with each type of node responsible for generating its own corresponding machine
instructions. This method is efficient and scalable, making it easy to extend and support more
complex constructs later on.
2. Register Allocation:
○ After instruction selection, the next phase is register allocation. This is the
process of assigning physical registers to variables or intermediate values.
○ In your case, you’re assuming an infinite number of registers for simplicity.
However, in real-world scenarios, processors have a limited number of registers,
and the compiler must decide how to efficiently use them.
3. Instruction Scheduling:
○ This phase involves reordering instructions to optimize execution, taking into
account the latency of certain instructions or dependencies between instructions.
This ensures that the CPU pipeline is used efficiently.
When generating code for an arithmetic operation (like addition or multiplication), instruction
selection chooses the appropriate assembly instruction based on the operation. For example:
● For a simple number, it might use a "load immediate" instruction like Addi in RISC-V to
put a constant value in a register.
● For an Add operation, it would select the Add instruction.
Instruction selection takes place when you're deciding how to express a high-level operation in
terms of low-level machine instructions.
In our example, when the visitor generates an Addi instruction for a number or an Add
instruction for an addition, that is instruction selection in action. However, instruction selection
is only one part of the overall code generation phase, which includes additional steps like
register allocation and scheduling.
Let's get back to our code generator. How would you generate code for a
variable?
We can use a simple visit method that only returns the variable name. Remember, we are not
doing register allocation yet. Our visit method would be something like:
class GenVisitor(Visitor):
    """
    The GenVisitor class compiles arithmetic expressions into a low-level
    language.
    """
    ...
    def visit_var(self, exp, prog):
        # Variables already name a value: just return the identifier.
        return exp.identifier
This visit_var method correctly generates code for variables in a specific way because it
relies on a key assumption: the variable's value is already stored or accessible. Here's why
this approach works and why it doesn't generate any instructions:
● The method is responsible for generating the code corresponding to a variable (like x in
the example).
● Instead of generating any new instructions, it simply returns the variable's name
(exp.identifier), which is the same name used to reference it in the program.
● Variables are already associated with a value: The assumption here is that the
variable's value has been previously initialized or loaded into a register (or is in memory).
Therefore, there’s no need to generate instructions to calculate or assign the value of the
variable.
○ The variable's identifier (exp.identifier) acts as a reference to this
pre-existing value.
○ For example, if x is stored in a register or memory, the code generation process
can simply refer to it by its name ('x'), and no further computation is required.
● The program state already contains the variable: The AsmModule.Program (as
seen in the Usage section) already knows the value associated with the variable x (in
this case, x=1 is passed in). Since the value is known, no instructions are needed to
perform any computation or loading at this point.
● Once we write visit_num, compare code generation for variables with that other
method, where an instruction (e.g., Addi) is generated to load a number into a register.
The reason visit_num generates an instruction is that the number is a constant that
needs to be placed somewhere in a register to be used later.
● Variables, on the other hand, are already named entities that map to a value in memory
or a register. So, there's no need to load them—we already know where they are.
● Once this method returns the variable name, later instructions generated by other
visit_ methods (e.g., for addition, subtraction, etc.) will use this variable name to
perform operations.
○ For instance, when the visitor processes an arithmetic expression like x + 1, it
will refer to the name x (from visit_var) to get the value associated with x in
the register or memory and then generate the appropriate instructions for the
addition.
In Summary:
● The visit_var method doesn’t generate instructions because the value of the variable
is already assumed to exist in the program state (a register or memory).
● Instead, it returns the variable’s name (exp.identifier), which can be used by other
parts of the program to access its value.
● This approach optimizes the code generation process by avoiding unnecessary
instructions for variables that have already been initialized.
In this case, how would we generate code for number literals?
We can generate code for numbers via the simple visit_num method below:
class GenVisitor(Visitor):
...
def visit_num(self, exp, prog):
"""
Usage:
>>> e = Num(13)
>>> p = AsmModule.Program({}, [])
>>> g = GenVisitor()
>>> v = e.accept(g, p)
>>> p.eval()
>>> p.get_val(v)
13
"""
var_name = self.next_var_name()
prog.add_inst(AsmModule.Addi(var_name, "x0", exp.num))
return var_name
You might find it weird: we use an addition to put a constant into a register. But that's common!
Let's break down why code generation for numbers uses an addition instruction (Addi) to
load a constant into a register instead of simply assigning the number directly, as in var_name
= exp.num.
● In higher-level languages (like Python, C, etc.), you can directly assign values to
variables using statements like var_name = 13. These languages abstract away the
machine-level details.
● In assembly or low-level programming, however, the hardware does not have
"variables" in the same way. Instead, it has registers. Registers are small storage units
inside the CPU where you must load data before you can use it.
○ So, to make the number 13 usable in assembly, you need to load it into a
register.
○ This is done via specific instructions that interact with the CPU’s registers.
● In higher-level languages, it's easy to just say var_name = exp.num because they
automatically handle things like memory management and register allocation for you.
● But at the assembly level, the program needs to manually move data between memory
and registers, or between registers themselves. The assembly language does not have
an equivalent to Python's variable assignment directly.
○ You must use explicit instructions to move data into registers before performing
any operations.
● Assembly languages usually use instructions like Addi because they’re simple and
efficient.
● Some architectures have special load instructions (like Load Immediate), but they
might not always be available, or they could be more complicated to implement.
○ RISC-V (the architecture we’re targeting) focuses on keeping the instruction set
simple, so it uses existing instructions like Addi for tasks like loading constants.
Summary:
● In assembly code, you can't directly assign a value to a "variable" as you would in
higher-level languages. You need to move the value into a register using specific
instructions.
● The Addi instruction is used here to load the constant into a register because assembly
languages require explicit instructions to move data around.
● While Addi is technically an addition instruction, it’s used here as a way to load a
constant into a register by adding it to the always-zero register "x0".
This design decision actually provides several benefits in RISC (Reduced Instruction Set
Computer) architectures like RISC-V, and it’s not unique to RISC-V. Let's explore why it's
useful, how other architectures like ARM and x86 handle similar scenarios, and why using
literals directly isn’t always preferable.
In RISC-V, x0 is a special register that always holds the value 0, and its contents cannot be
changed. This might seem wasteful at first glance, but it serves several important purposes that
actually enhance simplicity and efficiency:
● RISC architectures, like RISC-V, focus on simplicity and minimizing the number of
instructions. Instead of having many special-purpose instructions for operations involving
zero, a zero register allows a variety of operations to leverage the same set of
instructions.
○ For example, rather than needing a dedicated instruction to move a constant zero
into a register, you can use existing instructions that involve x0. E.g., Addi with
x0 is effectively a way to load a constant without needing a separate "load
immediate" instruction.
● From a hardware perspective, having a dedicated zero register simplifies the design of
the processor.
○ Since x0 always holds 0, there's no need for additional circuitry to check or
modify its value. This consistency simplifies data paths and control logic within
the processor.
● Efficient Operations:
● Many common operations involve zero, and having a dedicated register for zero reduces
the number of literal values that need to be encoded directly into instructions.
○ Instead of using up bits in every instruction to represent the literal 0, instructions
that refer to x0 can use that space for other purposes, allowing for more efficient
use of instruction encoding.
In RISC-V, you could theoretically use immediate values (literals) directly in some instructions,
but there are trade-offs that make using x0 beneficial:
● Using x0 for zero means the processor doesn’t need special cases for handling
operations involving zero. It can treat zero just like any other register value, leading to
more uniform instruction behavior.
○ This is particularly useful for operations like conditional branches, where you
might want to check if a register is zero.
● Efficiency: Using a zero register can make instruction encoding more efficient because
you don’t need to repeatedly encode the literal 0 in instructions. You just reference x0,
which is represented compactly in the instruction.
● Simplicity: It reduces the need for special cases in instruction sets and hardware logic.
Instead of having different instructions or data paths for zero, you can handle zero like
any other register value.
● Uniformity: Instructions become more uniform because you can apply the same
operations to zero as you would to any other value. This makes the design of both the
compiler and the processor simpler.
Conclusion:
Having a dedicated zero register like x0 in RISC-V (or xzr in ARM) is not a waste. Instead, it
simplifies both the processor’s hardware design and the instruction set architecture. It allows
zero to be handled like any other value, reducing the need for special instructions and improving
overall efficiency. In contrast, architectures like x86, which don’t have a zero register, rely on
their more complex instruction encoding and handling of literals, but they still manage to
accomplish the same kinds of operations—albeit with a different design philosophy.
● In AArch64 (64-bit ARM), when certain instructions specify the xzr register as the
source or destination, it acts like a zero register. For example, writing to xzr discards the
result, and reading from xzr always returns zero.
● Like x0 in RISC-V, this design helps avoid needing special instructions for zero-based
operations and keeps the instruction set simple.
x86: x86 does not have a dedicated zero register like RISC-V or ARM.
● Instead, x86 typically uses literal encoding for immediate values, including 0. The
instructions can directly operate on the literal zero, but this can make encoding slightly
more complex, and there’s a trade-off in terms of flexibility and uniformity.
● x86 compensates for the lack of a dedicated zero register by using its more complex
instruction set, which can handle a wider variety of operations directly on literals.
How would you generate code to represent boolean literals, such as true
and false?
We could generate code for booleans using the following visit method:
class GenVisitor(Visitor):
...
def visit_bln(self, exp, prog):
"""
Usage:
        >>> e = Bln(True)
        >>> p = AsmModule.Program({}, [])
        >>> g = GenVisitor()
        >>> v = e.accept(g, p)
        >>> p.eval()
        >>> p.get_val(v)
        1

        >>> e = Bln(False)
>>> p = AsmModule.Program({}, [])
>>> g = GenVisitor()
>>> v = e.accept(g, p)
>>> p.eval()
>>> p.get_val(v)
0
"""
if exp.bln:
var_name = self.next_var_name()
prog.add_inst(AsmModule.Addi(var_name, "x0", 1))
return var_name
else:
return "x0"
This visit_bln method handles the generation of code for boolean literals (True and False)
in a way that is consistent with RISC-V's handling of constants, using the x0 register and
immediate values. Let’s break down the code step by step to understand its logic and why it
works this way.
The argument exp is an instance of a class Bln that presumably stores a boolean value (True
or False). The method checks if exp.bln is True or False:
When the boolean literal is True, the method generates code that places the value 1 into a new
register:
var_name = self.next_var_name()
prog.add_inst(AsmModule.Addi(var_name, "x0", 1))
return var_name
When the boolean literal is False, the method doesn’t create a new variable or register.
Instead, it returns "x0" directly:
return "x0"
● Why return "x0"?: The register x0 always holds the value 0, which represents False
in the context of booleans. By returning "x0", the method avoids the need to explicitly
generate instructions to load the value 0 into a register. This is efficient because it
leverages the fact that x0 is always zero.
● This is one of the benefits of having a zero register: we can directly use it to represent
False without consuming additional instructions or registers.
You might wonder why the method uses the Addi instruction to load 1 for True, instead of
something like a direct assignment, as discussed before for number literals.
● RISC-V doesn’t have a dedicated "load immediate" instruction for small constants.
Instead, it uses instructions like Addi to load small constants by adding them to a base
register (x0 in this case, which is always zero).
● Using Addi with x0 as the source register effectively loads the immediate value (in this
case 1) directly into the target register (var_name).
Efficiency Considerations:
● For True: The code generates a single Addi instruction to load the value 1 into a
register, which is a straightforward and efficient way to handle the constant.
● For False: It returns the zero register (x0) directly, which is highly efficient because no
additional instructions or registers are needed to represent the value 0.
Summary:
● When the boolean literal is True, the method generates a new variable and uses the
Addi instruction to load the value 1 into a register, since there’s no direct "load
immediate" in RISC-V for small constants.
● When the boolean literal is False, the method returns the x0 register directly, as it
always holds the value 0, making the code more efficient by avoiding unnecessary
instructions.
● This approach leverages RISC-V’s design features, like the x0 register, to generate
simple and efficient code for boolean literals.
How would you generate code to represent an addition, e.g., Add(e0, e1)?
To generate code for an addition expression like Add(left, right), the approach would
follow similar patterns to those you've already seen for numbers, variables, and booleans. The
goal is to generate assembly instructions that compute the sum of the two sub-expressions
(left and right) and store the result in a new register.
Here’s a step-by-step approach to implementing the visit_add method for generating code for
an addition expression:
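A minimal sketch of the method (assuming an AsmModule.Add wrapper analogous to Addi, and an
Add node that stores its operands in left and right):

    def visit_add(self, exp, prog):
        # Step 1: Recursively generate code for the left and right operands.
        left_name = exp.left.accept(self, prog)
        right_name = exp.right.accept(self, prog)
        # Step 2: Combine the two intermediate results with an Add instruction.
        var_name = self.next_var_name()
        prog.add_inst(AsmModule.Add(var_name, left_name, right_name))
        # Step 3: Return the register holding the sum, so that enclosing
        # expressions can use it as an operand.
        return var_name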
Conclusion:
The visit_add method uses the same principles seen in earlier examples. It recursively
generates code for the left and right operands, then combines the results with an Add
instruction, storing the sum in a new register. This ensures that code for complex arithmetic
expressions is correctly generated and efficiently executed.
Notice that expressions form trees:
● Add(Num(3), Num(4)) is an expression where Add is the root node, and Num(3) and
Num(4) are the child nodes.
● The same applies to more complex expressions, like Add(Var('x'), Add(Num(2),
Num(3))), which forms a nested tree.
The recursive structure of expressions naturally leads to a recursive approach for code
generation.
When generating code, each type of expression has its own rule:
● Base Cases: For simple expressions like Num(3) or Var('x'), the code generation is
straightforward. You generate an instruction that loads a constant or retrieves the value
of a variable.
○ Example: For Num(3), you generate Addi(result, x0, 3), which loads 3
into a register.
○ Example: For Var('x'), you just return the variable's name (or the register it
maps to).
● Recursive Cases: For compound expressions like Add(left, right), you generate
code for the left and right sub-expressions first (this is the recursive part), and then
combine the results (in this case, by adding them using an Add instruction).
○ The code generation for Add(left, right) calls the code generation for both
left and right, producing code for each sub-expression before generating the
final Add instruction.
3. Register Management
Each sub-expression typically has its own result stored in a temporary register. By recursively
generating code for each sub-expression, the compiler keeps track of these intermediate
results, storing them in temporary registers or variables.
● In the case of Add(left, right), after generating code for left and right, their
results are stored in two registers, say r1 and r2.
● The Add instruction then combines the values in r1 and r2 and stores the result in a
new register, say r3.
Once you have the results of the sub-expressions, you can generate instructions that combine
those results into a final value. This is the "glue" that puts the sub-expression results together:
● For an Add(left, right), the generated code might look like this:
Addi r1, x0, left_value ; Load the value of the left expression
Addi r2, x0, right_value ; Load the value of the right expression
Add r3, r1, r2 ; Add the two values and store in r3
● This combines the results of the left and right expressions into r3.
After generating code for an expression, you return the result (often the name of a register) to
be used by higher-level expressions or the final evaluation. This is crucial for ensuring that the
result of one expression can be correctly used as an input to another.
● At the assembly level, values like integers, booleans, and even characters are often
represented as simple binary data in registers or memory.
● For instance, a boolean True might be represented as 1, and False as 0. Therefore,
an addition like Add(Num(1), Bln(True)) is perfectly valid at the assembly level
because it’s just adding two integers (1 + 1 = 2).
● In assembly, there is no notion of "types" like booleans or integers—everything is just
a sequence of bits or integers. The operations (like Add, Mul, etc.) are agnostic to the
meaning behind those bits.
● High-level languages enforce type rules to ensure programs behave correctly and
meaningfully. For example, adding a boolean to an integer would be nonsensical in
many high-level languages, and such an operation would typically be caught during type
checking.
● The type system ensures that expressions like Add(Num(1), Bln(True)) are not
allowed at the high-level. The type checker would raise an error before the program
reaches the code generation phase, because adding a boolean and an integer would
violate type rules (e.g., in strongly typed languages like Python, Java, or Haskell).
● In most modern compilers, type checking is a phase that happens before code
generation. If the program passes type checking, the compiler guarantees that the
operations being performed are valid according to the language's type rules.
● For instance, if the program tries to perform Add(Num(1), Bln(True)), the type
checker would reject the program before it gets to the code generation phase. As a
result, code generation would never encounter this problematic operation, because the
high-level program itself would be considered invalid.
● Assembly code is untyped, meaning operations like addition can happen between any
values as long as the hardware supports them (e.g., adding two 32-bit registers).
● High-level programs are typed, meaning operations must follow the rules of the
language’s type system. This ensures that nonsensical operations (like adding booleans
and integers) are caught early, preventing them from being translated into assembly
code.
Imagine you have a high-level program like this in Python, annotated for static type checking:
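A minimal illustrative example (the add function and the call are hypothetical):

    def add(x: int, y: int) -> int:
        return x + y

    add(1, "one")  # rejected by a static checker such as mypy;
                   # raises a TypeError at run time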
● Here, add(1, "one") is rejected during static type checking (with a tool like mypy) and
raises a TypeError at runtime. (Interestingly, add(1, True) would be accepted in Python,
because bool is a subclass of int: True is literally the integer 1.)
● In languages with static type systems (e.g., Haskell, Java), such code would be rejected
before it is ever compiled to machine code or assembly.
● If the type system allows the program to pass type checking, the compiler then
guarantees that the code being generated will be valid and meaningful at runtime,
preventing unexpected behavior.
● Even though at the assembly level you can perform operations like Add(Num(1),
Bln(True)) (since they are just binary values), the type system in the high-level
language prevents this from happening.
● The type checker ensures that only valid operations are passed to the code generation
phase, meaning that nonsensical or invalid operations are caught early and do not make
it into the final assembly code.
Explanation of visit_let:
    def visit_let(self, exp, prog):
        """
        Handles code generation for let-bindings.
        """
        # Step 1: Evaluate the defining expression (exp_def)
        d_name = exp.exp_def.accept(self, prog)
        # Step 2: Bind the result to the variable's identifier
        prog.add_inst(AsmModule.Addi(exp.identifier, d_name, 0))
        # Step 3: Evaluate the body, where the variable is now in scope
        # (assuming the let node stores its body in exp_body)
        return exp.exp_body.accept(self, prog)
● This step binds the result of exp_def (held in d_name) to the variable
exp.identifier. The variable's identifier (e.g., "v") becomes an alias for the register
holding the result of the defining expression.
● The Addi instruction is used to copy the value from d_name into the variable's name
(exp.identifier). This operation doesn't change the value—it's effectively an
assignment operation. In RISC-V, this is done using Addi with an immediate value of 0,
meaning it's simply copying the value.
○ Example: If d_name is r1 and exp.identifier is "v", this will effectively bind
the value of r1 (e.g., 2) to "v".
Can you show how code would be generated for let x <- 12 in 21 +
(x + true) end?
The figure below shows the code generation process for this particular program. A few things to
notice:
● The visit_let method generates code for let-bindings by first evaluating the defining
expression, binding its result to the variable, and then evaluating the body expression
where the variable is used.
● It uses an Addi instruction to bind the value of the defining expression to the variable
and allows recursive evaluation of nested let-expressions.
● This approach ensures that values are correctly tracked and referenced throughout the
program, even in the presence of nested or complex let-bindings.
But this approach to code generation assumes that variables have different
names, right? Otherwise, what could go wrong?
That's correct. This approach assumes that every let binding defines a variable with a different
name. Otherwise, imagine that we have something like:
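    (* both bindings use the name x; the inner one shadows the outer *)
    let x <- 12 in
        (let x <- 13 in x end) + x
    end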
This code should return 25. But our generator would mix the two definitions of variable x. Thus,
the result would be 26. The issue we're describing arises because the current implementation
for generating code for let-bindings doesn't account for variable shadowing. In functional
languages (and many others), variables declared in nested scopes can shadow variables with
the same name from outer scopes. The current approach uses the same variable name for both
bindings of x, leading to the incorrect result of 26 instead of 25.
Revised Approach
1. Introduce a variable name stack: Use a stack or a similar data structure to keep track
of variable names for different scopes. When entering a new let binding, push the new
variable name onto the stack. When exiting the scope, pop the variable name off the
stack.
2. Generate unique names: When processing a let binding, generate a unique internal
name for the variable (e.g., by appending a counter or some identifier) while maintaining
the original name in the stack to correctly reference it.
Here is a sketch of the revised visitor (assuming the let node also exposes its body as
exp_body, and using a simple counter to create fresh names):

class CodeGenVisitor:
    def __init__(self):
        # A dictionary to map variable names to a stack of their unique versions
        self.var_scope = {}
        self.counter = 0

    def visit_let(self, exp, prog):
        # Step 1: Create a unique internal name for this binding of the variable
        unique_name = self._push_variable(exp.identifier)
        # Step 2: Evaluate the defining expression (exp_def) and bind to unique name
        d_name = exp.exp_def.accept(self, prog)
        prog.add_inst(AsmModule.Addi(unique_name, d_name, 0))
        # Step 3: Evaluate the body; visit_var resolves a variable to the name on
        # top of its stack, i.e., self.var_scope[name][-1]
        b_name = exp.exp_body.accept(self, prog)
        # Step 4: Remove the unique name from the scope when leaving the let-binding
        self._pop_variable(exp.identifier)
        return b_name

    def _push_variable(self, var_name):
        # Generate a fresh version of the name, e.g., x_0, x_1, ...
        unique_name = f"{var_name}_{self.counter}"
        self.counter += 1
        # Push the unique name onto the stack for this variable
        self.var_scope.setdefault(var_name, []).append(unique_name)
        return unique_name
Example of Usage: for the shadowing program above, the renaming scheme effectively produces:
x_0 = 12
x_1 = 13
result = x_1 + x_0
Thus, the inner x would use x_1, while the outer x remains x_0, resulting in 13 + 12 = 25,
which is the correct behavior.
How would you generate code for an if-then-else expression, e.g.,
IfThenElse(cond, e0, e1)?
Let's break down how to generate code for this type of expression step by step, following the
structure of the IfThenElse class.
● Branching: Use a conditional branch instruction (beq or bne) to check if the condition is
true or false.
● Labels: You need labels for branching to different parts of the code. One label for the
start of the "else" branch and one for jumping over the "else" branch after the "then"
branch is executed.
The central requirement is that the value of whichever branch executes must end up in a single,
predictable place, so that the if-then-else expression as a whole yields one result.
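A sketch of the full visitor method (assuming the IfThenElse node exposes cond, e0, and e1,
and reusing the Beq, Jump, and Addi wrappers from before; both branches copy their result into
a shared register):

    def visit_ifThenElse(self, exp, prog):
        # Evaluate the condition; cond_reg holds 0 (false) or 1 (true).
        cond_reg = exp.cond.accept(self, prog)
        else_label = prog.new_label("else")
        end_label = prog.new_label("end_if")
        result_reg = self.next_var_name()
        # If the condition is false, jump to the 'else' branch.
        prog.add_inst(AsmModule.Beq(cond_reg, "x0", else_label))
        # 'then' branch: compute e0 and copy its value into result_reg.
        then_reg = exp.e0.accept(self, prog)
        prog.add_inst(AsmModule.Addi(result_reg, then_reg, 0))
        prog.add_inst(AsmModule.Jump(end_label))
        # 'else' branch: compute e1 and copy its value into result_reg.
        prog.add_label(else_label)
        else_reg = exp.e1.accept(self, prog)
        prog.add_inst(AsmModule.Addi(result_reg, else_reg, 0))
        prog.add_label(end_label)
        # The result is the value from either the 'then' or the 'else' block.
        return result_reg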
Conceptual Overview:
The first step is to generate code to evaluate the condition, which is an expression that results in
a boolean value (True or False). This value is stored in a register.
The result of evaluating the condition is needed for the next step, where we decide which
branch to take.
Once the condition is evaluated, we need to generate a conditional branch that determines
which block of code to execute:
To achieve this, we generate a branching instruction that checks whether the result of the
condition is False (i.e., 0). If it’s False, the program jumps to the "else" branch. If the condition
is True, the program continues to the "then" branch.
If the condition is True, we generate code for the "then" block (e0). The result of evaluating the
"then" expression is stored in a register. Once the "then" block has been executed, the program
must skip the "else" block and jump to the end of the if-then-else construct.
If the condition is False, the program jumps to the "else" block (e1). Here, we generate code
for the "else" expression, and the result is stored in a register. Once the "else" block has been
evaluated, the program proceeds to the next part of the overall execution.
At the end of the if-then-else expression, after either the "then" or the "else" block has been
executed, we return the result of the block that was executed.
To manage the control flow between the "then" and "else" blocks, we use labels and jump
instructions:
● We place labels at the start of the "then" block, the "else" block, and the end of the
if-then-else expression.
● After the "then" block is executed, the program jumps over the "else" block to the end
label.
● If the "else" block is executed, the program continues naturally to the end of the
expression.
● Conditional branching ensures that only one of the two branches is executed,
depending on the value of the condition.
● Labels are used to mark the locations of the "then" and "else" blocks, and jump
instructions control the flow between these blocks.
● The final result of the if-then-else expression is the value computed by whichever
block was executed, ensuring that the program behaves correctly according to the
conditional logic.
# Conditional branch
beq cond_reg, x0, else_label # If cond_reg == 0, jump to else
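Putting the pieces together, the emitted code has roughly this shape (a sketch; label names are
illustrative):

    beq cond_reg, x0, else_label   # if cond_reg == 0, jump to else
    ...                            # code for the 'then' branch
    jal x0, end_label              # skip over the 'else' branch
else_label:
    ...                            # code for the 'else' branch
end_label:
    ...                            # execution continues here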
Summary:
● The condition is evaluated first, and based on its result, the program either jumps to the
"then" or "else" block.
● Only one of the branches is executed, which aligns with the semantics of conditional
expressions.
● The result of the if-then-else expression is the value from either the "then" or "else"
branch, depending on the condition.
● Labels and branching instructions are key to implementing conditional logic at the
assembly level, where flow control is explicit.
● jal (jump and link): Used to jump to a target address and optionally save the return
address in a register.
● jalr (jump and link register): Similar to jal, but allows the target address to be
computed dynamically.
For an unconditional jump, we can reuse jal with x0 as the destination register: writing the
return address to x0 simply discards it, so jal x0, target behaves as a plain jump. This
approach aligns with the standard RISC-V encoding and avoids relying on any custom
instruction like j (which is, in fact, just assembler shorthand for jal x0, offset).
Not every instruction is available in RISC-V. For instance, how would you
implement logical negation?
High-Level Operation:
In high-level languages, the logical negation (NOT) operation flips the value of a boolean:
not True evaluates to False, and not False evaluates to True.
In RISC-V, however, there isn’t a direct logical negation instruction like NOT. To perform this
operation, we need to combine a few instructions.
Thus, we can use the XOR operation with the value 1 to implement logical negation.
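A sketch of the corresponding visit method (assuming an AsmModule.Xori wrapper in the style of
Addi, and a Not node that stores its argument in operand):

    def visit_not(self, exp, prog):
        # Evaluate the operand; for a well-typed program, value_reg holds 0 or 1.
        value_reg = exp.operand.accept(self, prog)
        result_reg = self.next_var_name()
        # XOR with the immediate 1 flips the lowest bit: 0 -> 1 and 1 -> 0.
        prog.add_inst(AsmModule.Xori(result_reg, value_reg, 1))
        return result_reg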
Explanation:
● The method first evaluates the operand (the boolean value that we want to negate). The
result of this evaluation is stored in value_reg, which will hold either 1 (for True) or 0
(for False).
● To negate the boolean value, we use the XORI (bitwise XOR with an immediate value)
instruction.
○ XORI with 1 effectively negates the value:
■ 0 XOR 1 = 1 (negating False results in True)
■ 1 XOR 1 = 0 (negating True results in False)
● The result is stored in result_reg.
● Finally, we return the result stored in result_reg, which now contains the negated
boolean value.
● Type Checking: In a typical compiler, the type system ensures that only valid boolean
values are passed to the NOT operation. During type checking (a phase that happens
before code generation), the compiler verifies that operations like NOT are only applied to
boolean expressions.
● Prevents Incorrect Inputs: This type checking prevents cases where non-boolean
values, like 3 or 11, are passed to boolean operations, ensuring that ill-typed programs
are caught before they reach the code generation phase.
If the type checker failed to catch a mistake (or if no type checker exists), and a non-boolean
value (e.g., 3) was passed to the NOT operation, the XORI instruction would still execute but
produce a result that doesn't match the expected logical negation behavior:
● Input: 3 (binary 11) → XORI 3, 1 would result in 2 (binary 10), which is not a valid
boolean result.
This illustrates how important the type system is in ensuring that only valid, well-typed
expressions are passed to the code generation phase.
Let’s consider an ill-typed program where the operand passed to NOT is 3 instead of 0 or 1:
Summary:
● The implementation of logical negation using XORI assumes that the input is well-typed
and limited to 0 or 1, representing boolean values.
● In an ill-typed program, where the operand is not a valid boolean, the result of using
XORI would be incorrect, as it would not produce the expected logical negation behavior.
● Type checking in high-level languages ensures that only valid boolean values are
passed to the NOT operation, preventing such issues from reaching the code generation
phase.
This can be achieved using a combination of RISC-V instructions that effectively normalize any
non-zero value to 1 and then invert it using logical operations.
Idea:
The goal is to ensure that for any non-zero e, the result of Not(e) will be 0, and for e == 0,
the result will be 1. The key to achieving this is the SLTIU instruction: comparing the operand
with the immediate 1, as an unsigned number, yields 1 exactly when the operand equals 0.
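A sketch of this type-safe visit method (assuming an AsmModule.Sltiu wrapper in the style of
Addi):

    def visit_not(self, exp, prog):
        # Evaluate the operand; it may be any integer, not only 0 or 1.
        value_reg = exp.operand.accept(self, prog)
        result_reg = self.next_var_name()
        # SLTIU rd, rs, 1 sets rd to 1 exactly when rs == 0
        # (unsigned "less than 1"), and to 0 otherwise.
        prog.add_inst(AsmModule.Sltiu(result_reg, value_reg, 1))
        return result_reg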
Explanation:
● The operand exp.operand is evaluated first, and its result is stored in value_reg.
The operand could be any expression that evaluates to an integer (not necessarily a
boolean).
● We use the SLTIU instruction, which stands for Set Less Than Immediate Unsigned.
This instruction compares the value in value_reg with the immediate value 1:
○ If value_reg == 0, the result is 1 (indicating True, i.e., Not(False)).
● The result of this normalized and negated operation is stored in result_reg, which is
returned as the final value of the negation.
Example:
Let’s say we evaluate Not(3) and Not(0) using the above code generation approach.
● Not(3):
○ SLTIU compares 3 with 1 and sets the result to 0 (since 3 != 0). The result of
Not(3) is 0.
● Not(0):
○ SLTIU compares 0 with 1 and sets the result to 1 (since 0 == 0). The result of
Not(0) is 1.
Summary:
● By using the SLTIU instruction, we can implement a type-safe version of logical negation
where any non-zero value is treated as True and zero is treated as False. This
approach guarantees that Not(e) is always 1 if e == 0 and 0 if e != 0, regardless of
the original value of e.
● This method ensures that the code generator can handle ill-typed inputs and prevents
incorrect behavior during logical negation.
Who invented this RISC-V architecture that we shall be using as our target
for code generation?
The RISC-V architecture was invented by a team of researchers at the University of California,
Berkeley, in 2010. The key individuals involved in the development of RISC-V include Krste
Asanović, Andrew Waterman, Yunsup Lee, and David Patterson.
The team was motivated by the need for a clean-slate, open, and extensible instruction set
architecture that could be used for academic research and industrial applications without
licensing restrictions. RISC-V was intentionally designed to be simple yet scalable, enabling its
adoption in a wide range of devices, from small embedded systems to powerful
supercomputers.
David Patterson, in particular, is among the most celebrated computer architects of his
generation.
● Recognitions:
○ Turing Award (2017), often referred to as the "Nobel Prize of Computing," which
he shared with John L. Hennessy for their work on RISC architectures.
○ Fellow of the ACM, IEEE, and National Academy of Engineering.
Legacy
David Patterson’s contributions have profoundly shaped computer architecture and systems
design, making computing faster, more efficient, and more accessible. His influence extends
through his research, writings, and mentorship of many students who have gone on to make
their own mark in the field.
Let’s break down how code generation changes for these control-flow constructs:
Can you give me semantic rules for typical statements, e.g., if-then-else,
while, etc?
I'll provide an overview of operational semantics rules for typical imperative constructs like
if-then-else, while loops, and assignment. As we had previously discussed, in operational
semantics, we describe the execution of programs as transitions between configurations
(which include the state of memory, control flow, etc.). For statements, we typically express how
the program moves from one configuration to the next. See below examples of small-step
operational semantics, where the execution is broken into atomic steps:
1. Assignment Statement: x := e
The assignment statement x := e evaluates the expression e and stores the result in the
variable x. The semantics specify how the state is updated.
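In small-step notation, the rule can be written as follows (a standard reconstruction, where
$s[x \mapsto v]$ denotes the state $s$ updated so that $x$ maps to $v$):

$$\frac{\langle e, s \rangle \Downarrow v}{\langle x := e,\; s \rangle \rightarrow s[x \mapsto v]}$$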
Explanation:
● The expression e is first evaluated to a value v in the current state s; the assignment
then steps to the updated state s[x ↦ v], in which x maps to v.
2. Conditional Statement: if b then S1 else S2
The conditional evaluates the boolean expression b and then executes S1 or S2, depending on
the outcome.
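In rule form (a standard reconstruction):

$$\frac{\langle b, s \rangle \Downarrow \mathit{true}}{\langle \texttt{if } b \texttt{ then } S_1 \texttt{ else } S_2,\; s \rangle \rightarrow \langle S_1, s \rangle}
\qquad
\frac{\langle b, s \rangle \Downarrow \mathit{false}}{\langle \texttt{if } b \texttt{ then } S_1 \texttt{ else } S_2,\; s \rangle \rightarrow \langle S_2, s \rangle}$$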
Explanation:
● True Condition: If b evaluates to true in state s, the execution proceeds with the
statement S1 in the same state s.
● False Condition: If b evaluates to false, the execution proceeds with S2 in the same
state s.
3. While Loop: while b do S
The while loop evaluates the boolean expression b. If b is true, it executes the body S and
then repeats the process. If b is false, it terminates the loop.
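In rule form (a standard reconstruction):

$$\frac{\langle b, s \rangle \Downarrow \mathit{true}}{\langle \texttt{while } b \texttt{ do } S,\; s \rangle \rightarrow \langle S;\, \texttt{while } b \texttt{ do } S,\; s \rangle}
\qquad
\frac{\langle b, s \rangle \Downarrow \mathit{false}}{\langle \texttt{while } b \texttt{ do } S,\; s \rangle \rightarrow s}$$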
Explanation:
● True Condition: If b evaluates to true, the loop body S is executed, and after that, the
entire while loop is executed again (hence the transition to S; while b do S).
● False Condition: If b evaluates to false, the loop terminates, and the state s remains
unchanged (the program just proceeds with whatever comes after the loop).
4. Sequencing: S1; S2
The sequence S1; S2 means that S1 is executed first, and then S2 is executed.
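In rule form (a standard reconstruction):

$$\frac{\langle S_1, s \rangle \rightarrow \langle S_1', s' \rangle}{\langle S_1; S_2,\; s \rangle \rightarrow \langle S_1'; S_2,\; s' \rangle}
\qquad
\frac{\langle S_1, s \rangle \rightarrow s'}{\langle S_1; S_2,\; s \rangle \rightarrow \langle S_2, s' \rangle}$$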
Explanation:
● The first rule handles the case where S1 reduces in multiple steps to a sub-statement
S1', so the sequence continues with S1' followed by S2.
● The second rule handles the case where S1 finishes, and the execution continues with
S2.
5. Skip Statement: skip
The skip statement does nothing and simply returns the current state: ⟨skip, s⟩ → s.
Explanation:
● The skip statement doesn’t alter the state, so the program proceeds with no changes.
With these rules, we can now formally describe how typical imperative constructs work in terms
of transitions between states.
These rules form the basis of operational semantics for imperative programs and dictate how
control flows through a program. They help in defining how constructs behave, how the state is
updated, and how execution proceeds from one step to the next.
How does code generation change for a C-style if-then-else statement?
Unlike the if-then-else expression, the C-style if-then-else statement doesn't return a
value. Its purpose is to execute one block of code if a condition is true and another block of
code if the condition is false. After either block, the program continues to the next statement.
    # Step 2: Generate labels for the 'else' block and end of the if-then-else
    else_label = prog.new_label("else")
    end_label = prog.new_label("end_if")
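A sketch of the full method (assuming the statement node exposes cond, then_stmt, and an
optional else_stmt):

    def visit_ifThenElseStmt(self, stmt, prog):
        # Step 1: Evaluate the condition into a register.
        cond_reg = stmt.cond.accept(self, prog)
        # Step 2: Generate labels for the 'else' block and the end of the statement.
        else_label = prog.new_label("else")
        end_label = prog.new_label("end_if")
        # Step 3: If the condition is false (0), jump to the 'else' block.
        prog.add_inst(AsmModule.Beq(cond_reg, "x0", else_label))
        # Step 4: Generate code for the 'then' block, then skip over the 'else' block.
        stmt.then_stmt.accept(self, prog)
        prog.add_inst(AsmModule.Jump(end_label))
        # Step 5: Generate code for the 'else' block (if any).
        prog.add_label(else_label)
        if stmt.else_stmt is not None:
            stmt.else_stmt.accept(self, prog)
        # Step 6: Both paths continue here; no value is returned.
        prog.add_label(end_label)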
Explanation:
Can you "draw" the code that will be produced for an if-then-else
statement?
Sure. Depending on the existence of the else branch, we might have two different graphs
showing how control flows throughout the program:
In the above figures, the term "fall through" refers to the control flow when no jump is taken after
executing a conditional block, and the program proceeds to the next block of code. Here’s how
the fall through works in the two examples:
The dashed lines represent this "fall through" behavior, indicating where the execution continues
without an explicit branch instruction. It’s essentially the default execution flow when no
condition causes a jump.
And how would you generate code for a C-Style "while" loop?
A C-style while loop repeatedly executes a block of code as long as a condition holds true.
Like the if-then-else statement, it does not return a value, but it controls the program's flow.
1. Loop Condition: Evaluate the loop's condition at the beginning of each iteration.
2. Branch on Condition: If the condition is true, execute the loop body; otherwise, jump to
the end of the loop.
3. Repeat: After executing the body, jump back to evaluate the condition again.
A sketch of the full method, putting these steps together, is shown below.
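A minimal sketch (assuming the same stmt and prog interfaces used so far):

    def visit_whileStmt(self, stmt, prog):
        # Step 1: Create labels marking the condition check and the loop exit.
        loop_start = prog.new_label("loop_start")
        loop_end = prog.new_label("loop_end")
        # Step 2: Check the condition at the start of each iteration.
        prog.add_label(loop_start)
        cond_reg = stmt.cond.accept(self, prog)
        # Step 3: If the condition is false (0), exit the loop.
        prog.add_inst(AsmModule.Beq(cond_reg, "x0", loop_end))
        # Steps 4-5: Execute the loop body.
        stmt.body.accept(self, prog)
        # Step 6: Jump back to the start of the loop to recheck the condition.
        prog.add_inst(AsmModule.Jump(loop_start))
        # Step 7: The program continues here once the condition fails.
        prog.add_label(loop_end)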
Explanation:
● No Value Return: The loop does not return any value; its sole purpose is controlling the
flow of the program by repeatedly executing the loop body as long as the condition
holds.
● Labels for Repetition: The loop has two key labels: one for the start of the loop
(loop_start) and one for the end (loop_end). The program jumps between these
labels to repeat the loop body or exit the loop when the condition is false.
In this diagram, there is only one fall through—from the condition block (loop_start) to the
loop body when the condition is true. The control flow diagram now makes it clear that after
executing the loop body, there is an unconditional jump (jump loop_start) back to the
condition check, which means there's no fall through edge there.
● One fall through: From the loop_start block to the loop body if the condition is true.
The transition from the body back to the condition check (loop_start) is handled by the
explicit jump, not by a fall through.
Our while contains two jumps: a conditional test and an unconditional one.
Would you be able to generate the while with only one dynamic jump?
Yes! Typical compilers use a common optimization called a loop inversion (also called
do-while conversion) to reduce the number of jumps in a loop, which improves efficiency by
removing the unconditional jump back to the condition.
Instead of placing the condition check at the start of the loop, typical compilers such as GCC
place it at the end of the loop body. This allows the loop to execute the first iteration without a
jump to the condition check, avoiding the unconditional jump altogether after the first iteration.
This method removes the need for an unconditional jump at the end of each iteration and
requires only one conditional branch per iteration.
1. Initial conditional jump to the loop body (if cond is true).
2. After each loop body execution, check cond again (branch if true, exit if false).
This avoids needing to unconditionally jump to the condition after every iteration. Below we see
the new control-flow graph:
● No Unconditional Jump: In the traditional while loop, you'd need a jump to the
condition check after every iteration. By placing the condition at the end of the body, you
eliminate the unconditional jump.
● Reduced Branch Misprediction: With fewer jumps and branches, the processor can
better predict the branch directions, reducing the chances of branch misprediction.
It seems that the inverted loop is a C-style do-while. Is that the case?
Yes, you're absolutely right! The loop inversion optimization transforms a while loop into
something that behaves like a do-while loop in C.
In a do-while loop, the loop body is executed at least once before the condition is checked.
This is exactly what happens in the inverted loop generated by GCC. Here's a breakdown:
While Loop in C:
while (cond) {
// loop body
}
Do-While Loop in C:
do {
// loop body
} while (cond);
● The loop body is executed at least once, regardless of the initial value of cond.
● The condition is checked after the loop body executes.
When a typical compiler like GCC applies the loop inversion optimization, it transforms the
traditional while loop into an inverted form that behaves like a do-while loop:
1. Initial check: The condition is checked once before the first iteration (just like a while
loop), but this is done with an initial branch that skips the loop entirely if the condition is
false.
2. Loop body execution: The loop body is executed at least once (like a do-while loop).
3. Condition after body: After the loop body, the condition is checked again, and if true,
the loop continues. This is just like the condition in a do-while loop being evaluated
after each iteration.
Conclusion:
The optimized version does behave like a do-while loop, but the key difference is that the
initial condition is still checked before the first iteration to ensure the loop doesn't run if the
condition is initially false. In contrast, a pure do-while loop in C always runs the loop body at
least once without any prior condition check. However, for subsequent iterations, the condition is
checked after the loop body, making the control flow similar to a do-while.
In code, the inverted loop generator becomes:

    loop_body = prog.new_label("loop_body")
    loop_end = prog.new_label("loop_end")
    # Step 2: Initial condition check (if the condition is false, skip the loop)
    cond_reg = stmt.cond.accept(self, prog)
    prog.add_inst(AsmModule.Beq(cond_reg, "x0", loop_end))
    # Steps 3-4: The loop body, executed after the single initial check
    prog.add_label(loop_body)
    stmt.body.accept(self, prog)
    # Step 5: Condition check after the body is executed (inverted loop check)
    cond_reg = stmt.cond.accept(self, prog)
    prog.add_inst(AsmModule.Bne(cond_reg, "x0", loop_body))
    prog.add_label(loop_end)
Notice that for a do-while-like loop, we must still check the condition before entering the loop
to avoid running the body when the condition is false right from the start. Without that, the loop
would always execute the body at least once, which doesn't match the behavior of a regular
while loop in C.
Explanation:
This way, the code behaves like a normal while loop but uses the more efficient inverted loop
structure inside. The surrounding condition ensures that we only enter the loop if the condition is
true initially, avoiding the do-while behavior where the body would always execute at least
once.
CFG in Optimizations:
When optimizing loops (e.g., using loop inversion, as we discussed), the CFG changes.
Instead of checking the condition first, the compiler may rearrange the flow to execute the loop
body before the condition (similar to a do-while loop). The CFG would reflect this
transformation by adjusting the edges and nodes to reflect the modified control flow.
Summary:
A Control-Flow Graph is a graphical representation of all the possible paths a program might
take during its execution. It is a vital tool in program analysis and optimization, helping compilers
and developers understand how control moves through a program, where optimizations can be
applied, and what parts of the code might be unreachable or need more attention.
Could you give me a visitor that produces code for a C-style for loop?
Here's how you would structure a visitor for a typical for loop. A for loop has four parts: an
initialization (run once, before the loop), a condition (checked before each iteration), an
increment (run after the body on each iteration), and the loop body itself.
Let's create a visitor that generates assembly code for this structure:
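A sketch (assuming a ForStmt node with init, cond, incr, and body fields):

    def visit_forStmt(self, stmt, prog):
        loop_cond = prog.new_label("loop_cond")
        loop_end = prog.new_label("loop_end")
        # Initialization: runs once, before the loop.
        stmt.init.accept(self, prog)
        # Condition check, performed before each iteration.
        prog.add_label(loop_cond)
        cond_reg = stmt.cond.accept(self, prog)
        prog.add_inst(AsmModule.Beq(cond_reg, "x0", loop_end))
        # Loop body.
        stmt.body.accept(self, prog)
        # Increment, executed after the body on each iteration.
        stmt.incr.accept(self, prog)
        # Back to the condition check.
        prog.add_inst(AsmModule.Jump(loop_cond))
        prog.add_label(loop_end)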
This structure captures the typical behavior of a for loop in languages like C, ensuring the
initialization is done once, the condition is checked at each iteration, and the increment is
executed after the body on each loop iteration.
Explanation:
How can we model a "break" statement that exits the surrounding loop?
Well, let's first discuss how to model an AST node for a break statement and then how to
implement the visit_breakStmt to generate the appropriate assembly code for it.
A break statement is used to exit the nearest enclosing loop prematurely. To represent this
premature exit in the AST, we could create a simple node with no specific fields because the
break statement doesn't depend on any values or conditions; it just transfers control to the
statement after the loop. Here's a possible structure for a BreakStmt class in your AST:
class BreakStmt:
    def __init__(self):
        # No fields are necessary since break has no arguments
        pass

    def accept(self, visitor, prog):
        return visitor.visit_breakStmt(self, prog)
This BreakStmt class has an accept method that will call the visitor's visit_breakStmt
method, passing in the BreakStmt node and the current program object (prog).
One way to track the current loop's exit_label is by using a stack of labels inside the visitor.
Each time we enter a loop, we push its exit_label onto this stack. When generating code for
a break, we can use the label at the top of the stack to generate a jump to the end of the loop.
Here's an example of how you can modify the visitor to handle this:
1. Add a Stack for Exit Labels: The visitor will need a stack to track the exit labels of
loops.
class CodeGenerator:
def __init__(self):
# Stack for loop exit labels
self.exit_label_stack = []
2. Modify the visit_whileStmt and visit_forStmt Methods: Before generating the
body of the loop, push the exit_label onto the stack. After generating the body, pop
the label off the stack. For the while loop, see the code below (for the for loop, you
would do something similar with the loop_end label):
    loop_start = prog.new_label("loop_start")
    loop_end = prog.new_label("loop_end")
    # Push the exit label so that break statements in the body can find it
    self.exit_label_stack.append(loop_end)
    prog.add_label(loop_start)
    cond_reg = stmt.cond.accept(self, prog)
    prog.add_inst(AsmModule.Beq(cond_reg, "x0", loop_end))
    stmt.body.accept(self, prog)
    prog.add_inst(AsmModule.Jump(loop_start))
    prog.add_label(loop_end)
    # Pop the label when leaving the loop
    self.exit_label_stack.pop()
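The visit_breakStmt itself is then a one-liner (a sketch):

    def visit_breakStmt(self, stmt, prog):
        # Jump to the exit label of the nearest enclosing loop.
        prog.add_inst(AsmModule.Jump(self.exit_label_stack[-1]))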
Explanation:
1. visit_whileStmt and visit_forStmt: These methods push the loop_end label
onto the exit_label_stack before generating the loop body. When the loop ends,
they pop the label off the stack.
2. visit_breakStmt: This method generates a jump to the label at the top of the
exit_label_stack, effectively breaking out of the loop.
Summary:
● The BreakStmt AST node is simple because break doesn't take arguments.
● The visitor keeps track of the nearest loop's exit label using a stack
(exit_label_stack), allowing nested loops to work correctly.
● The visit_breakStmt generates a jump to the current loop’s exit_label, making
the break statement functional.
The figure below summarizes the code that we've just produced:
The figure below shows an example of a control-flow graph representing a for loop with a break
statement guarded by a conditional statement. The numbers show the order in which basic
blocks are aligned in the final assembly representation.
How can we model a "continue" statement, which skips to the next iteration?
To generate code for the continue statement, we'll need to jump to the point where the loop
condition or increment (in a for loop) is re-evaluated. This means we will need to track a
continue label for each loop, in addition to the exit label we used for break. The declaration of
the visitor will become like the following piece of code:
class CodeGenerator:
def __init__(self):
# Stack for loop exit labels (for break statements)
self.exit_label_stack = []
# Stack for loop continue labels (for continue statements)
self.continue_label_stack = []
Plan:
We need to model the continue statement with a simple AST node, similar to BreakStmt.
Here's an example:
class ContinueStmt:
    def __init__(self):
        # No fields needed since continue has no arguments
        pass

    def accept(self, visitor, prog):
        return visitor.visit_continueStmt(self, prog)
We now modify the code generator to handle continue statements. We will introduce a
continue label stack in the same way we did with the exit_label_stack for break
statements.
Just like we track exit labels for break, we will track continue labels for continue. For a
while loop, the continue label should be the loop condition check. For a for loop, it should be
the increment step. In a while loop, we jump back to the condition check when encountering a
continue statement:
    # For a while loop, 'continue' jumps back to the condition check
    self.continue_label_stack.append(loop_start)
    self.exit_label_stack.append(loop_end)
    prog.add_label(loop_start)
    cond_reg = stmt.cond.accept(self, prog)
    prog.add_inst(AsmModule.Beq(cond_reg, "x0", loop_end))
    stmt.body.accept(self, prog)
    prog.add_inst(AsmModule.Jump(loop_start))
    prog.add_label(loop_end)
    self.continue_label_stack.pop()
    self.exit_label_stack.pop()
In a for loop, we jump to the increment step when encountering a continue statement.
    loop_body = prog.new_label("loop_body")
    loop_incr = prog.new_label("loop_incr")
    loop_cond = prog.new_label("loop_cond")
    loop_end = prog.new_label("loop_end")
    self.exit_label_stack.append(loop_end)
    self.continue_label_stack.append(loop_incr)
    # Initialization
    stmt.init.accept(self, prog)
    # Jump straight to the condition check before the first iteration
    prog.add_inst(AsmModule.Jump(loop_cond))
    # Loop body
    prog.add_label(loop_body)
    stmt.body.accept(self, prog)
    # Increment step
    prog.add_label(loop_incr)
    stmt.incr.accept(self, prog)
    # Condition check
    prog.add_label(loop_cond)
    cond_reg = stmt.cond.accept(self, prog)
    prog.add_inst(AsmModule.Bne(cond_reg, "x0", loop_body))
    prog.add_label(loop_end)
    self.continue_label_stack.pop()
    self.exit_label_stack.pop()
The visit_continueStmt method will generate a jump to the current loop’s continue label,
which is the label at the top of the continue_label_stack.
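A sketch:

    def visit_continueStmt(self, stmt, prog):
        # Jump to the continue label of the nearest enclosing loop.
        prog.add_inst(AsmModule.Jump(self.continue_label_stack[-1]))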
Summary:
● BreakStmt jumps to the exit_label of the nearest loop to exit the loop.
● ContinueStmt jumps to the continue_label of the nearest loop, skipping the rest of
the loop body and re-evaluating the loop condition or increment.
● We use two separate stacks, one for exit_labels (for break) and one for
continue_labels (for continue), to keep track of nested loops.
This setup will allow you to correctly handle both break and continue statements inside
nested loops in your code generation.
How would you generate code for a C-style switch statement?
To generate code for a switch statement in RISC-V, we first need to define the structure of the
switch statement in the AST. Below is the representation of the SwitchStmt AST node,
followed by the implementation of the visit_switchStmt method for code generation.
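A minimal representation (the field names below are illustrative assumptions, kept consistent
with the visitor that follows):

class SwitchStmt:
    def __init__(self, switch_expr, cases, default=None):
        self.switch_expr = switch_expr  # the expression being switched on
        self.cases = cases              # list of (value, body) pairs
        self.default = default          # optional default statement

    def accept(self, visitor, prog):
        return visitor.visit_switchStmt(self, prog)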
Implementation of visit_switchStmt
One way to write the method (a sketch, assuming the SwitchStmt fields above):

def visit_switchStmt(self, stmt, prog):
    """
    Generates code for a C-style switch statement with proper fall-through handling.
    """
    # Step 1: Evaluate the switch expression and store the result in a register
    switch_reg = stmt.switch_expr.accept(self, prog)
    # Step 2: Generate labels for each case and the default case
    case_labels = [prog.new_label(f"case_{i}") for i in range(len(stmt.cases))]
    default_label = prog.new_label("default") if stmt.default else None
    end_label = prog.new_label("end_switch")
    # Step 3: Push the end label, so that break statements can find it
    self.exit_label_stack.append(end_label)
    # Step 4: Emit one comparison per case
    for (case_value, _), case_label in zip(stmt.cases, case_labels):
        # Load the case value into a temporary register
        temp_reg = self.next_var_name()
        prog.add_inst(AsmModule.Addi(temp_reg, "x0", case_value))
        # Compare the switch expression with the case value and branch if equal
        prog.add_inst(AsmModule.Beq(switch_reg, temp_reg, case_label))
    # Step 5: Jump to the default case if no matches are found (if default exists)
    if default_label:
        prog.add_inst(AsmModule.Jump(default_label))
    else:
        prog.add_inst(AsmModule.Jump(end_label))
    # Step 6: Emit each case body. There is no jump to end_label here, so
    # control falls through to the next case unless the body contains a break
    for (_, case_body), case_label in zip(stmt.cases, case_labels):
        prog.add_label(case_label)
        case_body.accept(self, prog)
    # Step 7: Emit the default body, if any
    if default_label:
        prog.add_label(default_label)
        stmt.default.accept(self, prog)
    prog.add_label(end_label)
    # Step 8: Pop the exit label off the stack after the switch
    self.exit_label_stack.pop()
Explanation
1. Consistency with Loops: Using the exit_label_stack ensures that break behaves
uniformly across all constructs (while, for, switch).
2. Fall-Through Logic: By omitting the unconditional jump to end_label in each case,
we naturally allow for fall-through behavior unless explicitly overridden by a break
statement. This aligns with C-style switch semantics.
3. Simplified Break Handling: Since the target of break is stored in
exit_label_stack, the logic for generating break instructions remains unchanged
across constructs: it simply jumps to the label at the top of the stack.
The figure below illustrates how code is produced for a typical switch statement:
1. Sequence of Branches: Each case requires a comparison between the switch
expression value and the case value.
2. Linear Scan: The program evaluates cases sequentially until a match is found or all
cases are exhausted. This linear scan means the time complexity grows with the number
of cases.
Whether this linear scan is acceptable depends on the distribution of the case values:
● Small Case Sets: This approach is simple and works efficiently when the number of
cases is small.
● Dense Case Values: If the case values are dense and numeric (e.g., 1, 2, 3, ...), more
optimized approaches like a jump table could be more efficient.
● Sparse Case Values: For non-contiguous or non-numeric cases, hash tables or decision
trees might be better alternatives.
Let's examine these alternatives more closely:
1. Jump Table
Example Assembly:
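A sketch of such a table in RISC-V assembly, assuming dense case values 0, 1 and 2, with the switch value already in t0 (the label names case_0 through case_2 are illustrative):

    slli t1, t0, 2      # scale the switch value by 4 (the word size)
    la   t2, jump_table # base address of the table of case addresses
    add  t1, t1, t2
    lw   t3, 0(t1)      # load the address of the matching case
    jr   t3             # and jump straight to it
jump_table:
    .word case_0
    .word case_1
    .word case_2

With this scheme, dispatch takes constant time, regardless of the number of cases.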
2. Binary Search
Example:
case_values = [1, 3, 7, 10]
targets = [case_1, case_3, case_7, case_10]
# Perform binary search on `case_values`
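A minimal sketch of that search in Python, using the standard bisect module (the default_target parameter is a stand-in for the label of the default case):

import bisect

def switch_dispatch(switch_value, case_values, targets, default_target):
    # case_values must be sorted. Find the leftmost position where
    # switch_value could be inserted:
    i = bisect.bisect_left(case_values, switch_value)
    if i < len(case_values) and case_values[i] == switch_value:
        return targets[i]
    return default_target

This brings dispatch down from a linear scan to a logarithmic number of comparisons.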
3. Hash Table
Example:
case_map = {1: case_1, 3: case_3, 7: case_7, 10: case_10}
target = case_map.get(switch_value, default_target)
So, a "good" switch statement involves only numeric cases, right? Which
early languages introduced thi kind of statement?
The earliest programming languages to feature switch statements with numeric cases were
designed with optimization opportunities in mind, particularly for numeric comparisons. These
"good switches" are well-suited for performance improvements, such as jump tables, because
they map numeric case values directly to executable code locations.
ALGOL W (1966)
One of the first languages to include a case statement for numeric values.
case x of
1: statement1;
2: statement2;
default: statement3;
end case;
Optimization: Numeric cases allow the use of jump tables, and ALGOL W's influence shaped
later languages.
FORTRAN (1957)
While not a switch statement per se, FORTRAN's computed GOTO provided a similar
concept. Depending on the value of I, the program jumps to label 10, 20, or 30:

      GOTO (10, 20, 30), I
PL/I (1964)
SELECT (X);
WHEN (1) CALL FOO;
WHEN (2) CALL BAR;
OTHERWISE CALL DEFAULT;
END;
C (1972)
One of the most famous and enduring implementations of switch, explicitly designed for
numeric values. Example:
switch (x) {
case 1:
foo();
break;
case 2:
bar();
break;
default:
baz();
}
Optimization: The simplicity of C’s switch syntax and numeric cases made it ideal for jump
tables.
Pascal (1970)
case x of
1: writeln('Case 1');
2: writeln('Case 2');
else writeln('Default');
end;
However, not every switch is so friendly to the optimizer:
● Sparse Numeric Cases: Languages like C allowed sparse case values (e.g., 1, 100,
1000), requiring additional handling (e.g., binary search or chains of comparisons).
● Non-Numeric Cases: Later languages like Ada and Java introduced switch
statements supporting non-numeric cases (strings, enums), which complicate
optimization.
Conclusion
The "good switches" — numeric case-based switches in early languages like ALGOL W,
FORTRAN, and C — laid the foundation for efficient jump table optimizations. These early
implementations remain highly relevant, especially in languages like C and its descendants.
Why C?
Much of C's enduring relevance is tied to its creator, Dennis Ritchie, whose career had an
enormous historical impact.
Historical Impact
● Invention of C:
In the early 1970s at Bell Labs, Ritchie developed C as a systems programming
language to write operating systems, most notably UNIX. C struck a balance between
low-level control (close to assembly) and high-level abstraction, making it both powerful
and portable.
○ Impact: C became one of the most widely used programming languages in
history, influencing languages like C++, Java, Python, C#, and many more.
● Co-creation of UNIX: While Ken Thompson developed the first version of UNIX in
assembly for the PDP-7, Ritchie played a crucial role in rewriting UNIX in C. This made
UNIX portable across different machines, a revolutionary idea at the time.
● Impact:
○ UNIX became the foundation for modern operating systems, including Linux,
macOS, and indirectly Windows.
○ Its design philosophy (modularity, simplicity, and the "everything is a file"
paradigm) continues to influence systems programming and software
engineering.
● Ritchie helped pioneer compiler design for high-level languages, ensuring they could
generate efficient machine code. His work on C and UNIX required optimizing constructs
like loops, conditionals, and pointers.
● His insights laid the groundwork for many techniques used in systems-level
programming.
● Ritchie’s work bridged the gap between theory and practice. By creating tools like C
and UNIX, he demonstrated how abstract computer science concepts (e.g., structured
programming) could lead to practical systems of immense value.
Legacy
● C’s Influence: Most modern languages owe their syntax and structure to C. Even
domains like embedded systems, graphics, and AI still rely on C for performance-critical
code.
● UNIX’s Progeny: UNIX and its descendants (e.g., Linux) run the majority of servers and
power the internet today. Mobile operating systems like Android and iOS are also
UNIX-based.
● Awards and Recognition:
○ Turing Award (1983): With Ken Thompson, for their work on UNIX.
○ National Medal of Technology (1999): With Thompson, for co-creating UNIX
and C.
Ritchie's Philosophy
Dennis Ritchie valued simplicity, portability, and efficiency in design. He avoided unnecessary
complexity, which allowed his work to endure for decades. His tools and ideas remain the
backbone of modern computing.
Consider the program below. Where do you think each variable will be
allocated?
This program shows local and global variables defined in C:

#include <stdio.h>

int global_var = 17;

int fun(int parameter) {
    int local_fun = parameter + 1;
    printf("%p, %lu, parameter\n", &parameter, (size_t)&parameter % 100);
    printf("%p, %lu, local_fun\n", &local_fun, (size_t)&local_fun % 100);
    return local_fun;
}

int main() {
    printf("%p, %lu, global_var\n", &global_var, (size_t)&global_var % 100);
    int local_main = global_var + 13;
    printf("%p, %lu, local_main\n", &local_main, (size_t)&local_main % 100);
    global_var = fun(5 * local_main);
    return 3 * global_var;
}
In this C program, the variables will be allocated in different sections of memory, depending on
whether they are global or local, and whether they belong to a function. Here's where each
variable will be allocated:
1. Global Variables:
● global_var:
○ Location: The global variable global_var is allocated in the data segment
(specifically in the initialized data segment). This is because it is defined outside
of any function and initialized with a value (17).
○ Lifespan: global_var persists for the lifetime of the program (from startup to
program termination).
2. Local Variables:
● parameter (in fun function):
○ Location: This is a parameter of the fun function, which is treated like a local
variable. It will be allocated in the stack frame of fun. Each time fun is called, a
new stack frame is created, and parameter is stored in that frame.
○ Lifespan: It only exists while fun is executing, and its memory is freed once the
function returns.
● local_fun (in fun function):
○ Location: This is a local variable in the fun function, so it will also be allocated
in the stack frame of fun. Like parameter, it will exist only during the execution
of fun and will be destroyed afterward.
○ Lifespan: It exists while fun is executing and is destroyed when the function
returns.
● local_main (in main function):
○ Location: This is a local variable in the main function, so it will be allocated in
the stack frame of main. Since main is called only once and lasts for the
program's duration, local_main will stay in memory for as long as main is
running.
○ Lifespan: It exists for the duration of the main function, which typically means
until the program terminates.
Execution Flow:
Each time you print the address and remainder, you are observing the addresses where these
variables are stored in memory, and the modulo operation (% 100) provides a convenient way
to distinguish memory locations by checking the last two digits of the address.
Stack Memory
Key Characteristics:
● Automatic Management: Space for local variables is allocated when a function is called
and reclaimed when it returns.
● LIFO Discipline: Stack frames are pushed and popped in last-in, first-out order, matching
the nesting of function calls.
● Limited Size: The stack is much smaller than the heap, so very deep recursion can
exhaust it.
Example:
In the fun function of your C program, both parameter and local_fun are allocated in the
stack frame for fun. When fun is called, space is allocated on the stack for these variables,
and when fun returns, this space is reclaimed.
Data Segment
The data segment has two parts:
● Initialized Data Segment: This stores global and static variables that are initialized with
a value at the start of the program (e.g., int global_var = 17;).
● Uninitialized Data Segment (BSS): This stores global and static variables that are not
explicitly initialized (e.g., int uninitialized_global_var;). They are initialized to
zero by default.
Key Characteristics:
● Global and Static Variables: Global variables (like global_var in your program) and
static variables (variables declared with the static keyword inside a function or
globally) reside in this segment.
● Lifetime: Variables in the data segment exist for the entire duration of the program. They
are initialized when the program starts and persist until it ends.
● Not Automatically Managed: Unlike stack memory, which is automatically managed,
data in the data segment remains allocated for the entire run of the program, whether or
not it’s actively being used.
Example:
In your C program, global_var is allocated in the initialized data segment because it’s a
global variable with an initial value (17). It exists throughout the program's execution and its
memory is not released until the program ends.
Overall, the address space of a typical process contains the following segments:
1. Text Segment: Contains the actual compiled code (machine instructions).
2. Data Segment:
○ Initialized Data: Stores initialized global and static variables.
○ Uninitialized Data (BSS): Stores uninitialized global and static variables (default
initialized to zero).
3. Heap: Dynamically allocated memory (using malloc, new, etc.). The heap grows
upwards as memory is dynamically allocated.
4. Stack: Stores function call information, local variables, and function parameters. It grows
downwards as new functions are called and stack frames are created.
The figure below shows how space can be allocated in a typical ARM program.
Why can't local variables always be allocated at fixed addresses?
● Recursion: If a function is called recursively, each instance of the function needs its own
set of local variables. If local variables were always allocated at the same address,
recursive calls would share the same memory for these variables, leading to incorrect
behavior because values would be overwritten.
Could you give an example of a function that would fail if local variables
were statically allocated?
Yes. Consider, for instance, the program below:
#include <stdio.h>
int st_fact(int n) {
static int result = 1;
result = n;
result *= (result <= 1) ? 1 : st_fact(result - 1);
return result;
}
int dy_fact(int n) {
int result = 1;
result = n;
result *= (result <= 1) ? 1 : dy_fact(result - 1);
return result;
}
The reason your program produces different results for the static (st_fact) and dynamic
(dy_fact) versions of the factorial function lies in how static variables behave compared to
dynamically allocated local variables.
1. Static Allocation (st_fact):
○ The static int result = 1; in st_fact is initialized only once, when the
function is first called. In subsequent calls, result retains its value
from the previous call (result = 2), and this value is used as the base for the
calculation instead of resetting to 1. As a result, once st_fact sets result to a
value less than or equal to 1, the recursive calls always short-circuit (returning 1),
which is why you see 1 as the output for st_fact after the first call.
2. Dynamic Allocation (dy_fact):
○ The int result = 1; in dy_fact creates a new variable result on the
stack for each call to dy_fact. This ensures that the variable result starts
fresh (with result = 1) for every function invocation.
○ Why this gives correct results: Since each call to dy_fact initializes result
independently, the recursion proceeds as expected. Each recursive call
computes result based on the correct value of n passed into the function.
Therefore, the function behaves correctly and returns the correct factorial values.
Example Walkthrough:
● st_fact(2):
1. static int result = 1; → Initializes result (but only the first time the
function is called).
2. result = 2; → result is now 2.
3. Recursively calls st_fact(1) → Since n == 1, assigns 1 to result.
4. Returns 1.
● dy_fact(2):
1. int result = 1; → A new result variable is created with an initial value of
1.
2. result = 2; → Updates result to 2.
3. Recursively calls dy_fact(1) → result = 1; (from initialization in
dy_fact(1)).
4. Returns 2 * 1 = 2.
● st_fact(3):
1. result still holds the value 1 from the previous call (st_fact(2)).
2. result = 3; → Updates result to 3.
3. Recursively calls st_fact(2) → as in the previous example, st_fact(2)
leaves result equal to 1 and returns 1.
4. Returns 1.
● dy_fact(3):
1. A new result variable is created with an initial value of 1.
2. result = 3; → Updates result to 3.
3. Recursively calls dy_fact(2) → dy_fact(2) returns 2.
4. Returns 3 * 2 = 6.
Conclusion:
● Static variables are shared across function invocations and retain their values between
calls, which can lead to unexpected behavior in recursive functions like st_fact. The
static variable doesn’t reset, leading to incorrect results in this case.
● Dynamic variables are reallocated on the stack for each function call, ensuring that
each call has its own fresh storage for local variables. This allows recursive functions to
work correctly, as seen in dy_fact.
The static version leads to incorrect results because the static variable persists across calls
and causes the recursive function to behave improperly after the first call.
Were there programming languages in which all local variables could be
statically allocated?
In non-recursive programs, each function is called, executed, and returns before any other calls
to the same function can happen. Since there is no chance of re-entering a function while it is
still active (as recursion would allow), the program only ever needs one copy of each function's
local variables.
● Fortran: Early versions of Fortran (before Fortran 90) did not support recursion.
Therefore, all local variables could be statically allocated in memory, just as you
described. This design allowed for simpler and more efficient memory management, as
there was no need to maintain a call stack for local variables.
● COBOL: Early COBOL programs also lacked recursion support. The language was
primarily used for business applications, and recursion was not a common pattern in
such applications at the time. This allowed COBOL compilers to allocate all variables
statically, as the memory footprint of the program could be determined at compile time.
Advantages:
● Simplicity: Without recursion, static allocation simplifies both the compiler's task and the
runtime's memory management. There is no need to allocate and deallocate memory for
variables dynamically, and there is no stack frame to manage.
● Efficiency: Static allocation can lead to faster execution, as variables are accessed
directly via known memory addresses, avoiding the overhead of stack manipulation.
Limitations:
● Lack of recursion: Since each function’s local variables are statically allocated,
recursion is impossible without risk of overwriting previous values.
● Limited parallelism: Static allocation also restricts concurrent execution of the same
function (e.g., in multithreading), as all threads would share the same variable memory,
leading to race conditions unless carefully managed.
While modern languages support recursion and dynamically allocate local variables on the
stack, some static languages (e.g., embedded systems or low-level programming in C) still use
global or static variables when recursion is not involved. In such cases, you can sometimes
manually control static allocation for performance reasons, but this generally requires careful
programming.
Now, consider the program below, which calls functions indirectly, through function pointers:
#include <stdio.h>
int foo(int x) {
return x + 1;
}
int bar(int x) {
return x * 2;
}
int call_fun(int (*fptr)(int), int val) {
    return fptr(val);
}
int main() {
printf("%d\n", call_fun(foo, 3)); // calls foo: should return 4
printf("%d\n", call_fun(bar, 3)); // calls bar: should return 6
}
When calling call_fun(foo, 3) and call_fun(bar, 3), the exact function (foo or bar)
being called is determined at runtime via the function pointer fptr. Therefore, the local
variables and return addresses for foo and bar can't be statically allocated at compile time, as
the target function is only known during execution.
With static allocation, the memory layout for each function is determined at compile time. If
function pointers or indirect calls are involved, the compiler can't determine at compile time:
● which function will actually be invoked at a given call site; and
● whether the resulting chain of calls can re-enter a function that is still active.
Because of these unknowns, dynamic allocation using a stack (or heap, in some cases) is
necessary. This allows:
● Dynamic storage of return addresses: Each function call, whether direct or indirect,
needs its own space to store the return address.
● Dynamic allocation of parameters and local variables: Each function call must have
its own memory allocated for its parameters and local variables, especially when
recursion or function pointers are involved.
In languages that don't allow function pointers or indirect calls, the absence of dynamic control
over which function gets called makes it easier to allocate local variables statically. But as soon
as indirect calls are introduced, dynamic memory management becomes necessary to handle
the flexibility that such calls provide.
● Early versions of Fortran and COBOL also avoided indirect calls through function
pointers, which further enabled static allocation.
● Without function pointers or recursion, the call graph of the program is fully known at
compile time, making static allocation straightforward.
Conclusion:
Static allocation of local variables is only viable when the compiler can rule out recursion and
indirect calls, that is, when the program's call graph is fully known at compile time and has no
cycles. Otherwise, the compiler must fall back on stack-based allocation.
The figure below shows an example of a call graph for our earlier example involving the two
different implementations of the factorial function:
A cycle in a call graph means that a function can eventually call itself, either directly or indirectly,
leading to recursion. When recursion is detected, the compiler knows that dynamic memory
management (via stack frames) is required for local variables and return addresses.
Call graphs enable other compiler optimizations as well:
● Tail Call Optimization: When compilers recognize tail recursion (a function calling itself
as its last operation), they can optimize the recursion by reusing the same stack frame,
avoiding the overhead of a new call. Analyzing the call graph can help identify such
cases.
● Inlining: Inlining decisions—replacing function calls with the body of the function—can
be guided by call graph analysis to improve performance for small or frequently called
functions.
First, use Clang to compile your C program into LLVM IR, which is necessary to analyze the
program using LLVM's optimization tools (opt):

clang -S -emit-llvm program.c -o program.ll

This will generate the LLVM IR for the C program in a file called program.ll.
LLVM provides the -dot-callgraph option in the opt tool, which outputs the call graph in
DOT format. Here's how to run it:

opt -dot-callgraph program.ll -disable-output

This command will generate a file named callgraph.dot (or several files if there are multiple
functions).
To visualize the call graph, you can use a graph visualization tool like Graphviz. Graphviz's dot
command can convert the .dot file into an image:

dot -Tpng callgraph.dot -o callgraph.png

This will create an image file (callgraph.png) containing the call graph that you can view.
Once you have generated the image, you can open it to explore the function calls in your
program. The nodes represent the functions, and the directed edges show which functions call
others.
Let's produce code for static functions. What are the assumptions that we
can make?
There are a number of assumptions that will simplify our job:
1. All the functions have different names. This assumption holds because our programming
language is completely static.
2. We have an infinite surplus of virtual registers. That means that we don't need to worry
about saving the state of registers.
3. Every function has only one formal parameter. And all the parameters of functions have
different names.
Under these assumptions, code generation becomes relatively easy. See below the code
generator for function definitions:
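The actual implementation lives in the companion repository; what follows is a minimal sketch of such a visitor, under the three assumptions above. The field names formal and body, and the Jalr instruction, are assumptions of this sketch:

def visit_fn(self, fn_def, prog):
    # Assumption 1: names are unique, so the function name doubles as its label:
    prog.add_label(self._convert_name_to_addr(fn_def.name))
    # Assumption 3: a single formal parameter, passed in a0. We copy it into
    # a virtual register named after the formal parameter:
    prog.add_inst(AsmModule.Add(fn_def.formal, "a0", "x0"))
    # Generate code for the body, and move its value into the return register:
    body_reg = fn_def.body.accept(self, prog)
    prog.add_inst(AsmModule.Add("a0", body_reg, "x0"))
    # Jump back to the caller, whose address was saved in ra by the call:
    prog.add_inst(AsmModule.Jalr("x0", "ra", 0))

At the call site, after the actual parameter has been placed in a0, the call itself is a static jump: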
# Static jump to the function. Notice that we need the name of the
# function to be available at this point. This code generation approach
# would not work with indirect calls:
func_label = self._convert_name_to_addr(exp.function.name)
prog.add_inst(AsmModule.Jal("ra", func_label))
● Parameter Passing: The parameter is computed and placed in the a0 register using the
virtual register param_value.
● Static Function Call: The function is called using a static jump (Jal) to its label, with
the return address saved in ra.
● Return Value Handling: The return value is copied from a0 into a virtual register
(ret_var) for use after the function call.
Summary:
The code for both sqr and inc follows the same pattern: saving the return address, handling
the parameter, computing the result, and returning the value via a0. The main application just
chains the function calls, passing the result of sqr(4) into inc.
What is an ABI?
An ABI is a set of rules and conventions that define how binary programs interact at runtime. It
ensures compatibility between compiled code, libraries, and the operating system. Specifically, it
governs:
● Calling conventions: how arguments are passed, how results are returned, and which
registers a callee must preserve.
● Data layout: the size and alignment of types, and the layout of structs.
● System-level interfaces: the object file format and the conventions for system calls.
If your code saves the return address in the ra register and passes parameters in a0, it is
following a specific calling convention—a subset of the ABI. In the context of RISC-V, for
example:
● Arguments are passed in registers a0-a7, and results come back in a0 and a1.
● The return address is kept in ra, and the stack pointer in sp.
● Registers s0-s11 must be preserved by the callee, while t0-t6 may be freely overwritten.
This means your code adheres to a calling convention compatible with the RISC-V ABI, which
ensures that your functions interact correctly with other code that follows the same ABI.
When all components of a program (libraries, OS, etc.) follow the same ABI, binaries produced
by different compilers, from different languages, and at different times can be linked together
and can call one another safely.
Oh, is that why we can call a C function from a Lua program and
vice-versa?
Exactly! The ability of a Lua program to call C functions (and vice versa) is a perfect example of
the importance of an ABI. Here’s how it works in this context:
Lua uses the C API to enable interaction between Lua scripts and C functions. This API relies
on a shared ABI, ensuring the Lua interpreter and compiled C libraries can communicate
seamlessly.
○ Lua can call C functions that follow the lua_CFunction protocol: the C
function reads its arguments from the Lua stack and pushes its results back
onto that stack.
○ C can also call Lua functions using the same stack mechanism. The C code
pushes arguments onto the Lua stack, invokes a Lua function using the
interpreter, and retrieves the results from the stack.
● Function Calls: The Lua interpreter, written in C, follows the same calling convention as
the Lua C API. For instance, parameters might be passed in specific registers or on the
stack, depending on the platform’s ABI.
● Data Representation: Lua's stack and data structures (like Lua values) are consistent in
memory, enabling C functions to manipulate them directly.
● Binary Compatibility: Precompiled Lua interpreters and C libraries can interoperate as
long as they adhere to the same ABI (e.g., for the x86-64 architecture or RISC-V).
Without an agreed-upon ABI, Lua scripts wouldn't be able to call C functions reliably, and
embedding Lua into a C program would become a nightmare.
The Lua-C interaction showcases how ABIs abstract away low-level details, allowing different
programming languages and systems to interact efficiently. It's not just Lua—any "foreign
function interface" (FFI), such as Python's ctypes or Java's JNI, heavily depends on the
underlying ABI. Thus, if you're working on systems or compilers, understanding the ABI helps
ensure interoperability and efficient execution of multi-language programs.
What would the memory layout look like in a language with only static
allocation?
In a statically allocated language, the memory layout is simpler because the sizes of all
variables (both local and global) and parameters are known at compile-time. There is no need
for complex memory management mechanisms like heap allocation or dynamic memory
resizing, which simplifies the structure of the activation record.
● Fixed Size: Since the size of all variables, parameters, and return values is known in
advance, the activation record has a fixed size.
● Static Allocation: Memory for variables is allocated in a predefined memory region (like
the data segment), rather than dynamically allocated at runtime.
The previous example uses "virtual registers", i.e., registers that simulate
variable names, as discussed in a previous class. Do you remember what a
virtual register really is?
The concept of "virtual registers" is quite common in intermediate representations (IR) used by
compilers. Virtual registers are placeholders or abstract names for values that haven't been
mapped to actual machine registers yet. They are commonly used during the early stages of
compilation, particularly before Register Allocation. As an example, the instruction add
sqr_ret_add ra x0 uses a virtual register (sqr_ret_add), and two physical registers: ra
and x0.
1. Abstraction: Virtual registers allow the compiler to treat values and variables as if there
are an infinite number of registers. This makes it easier to express computations without
worrying about the physical limitations of the hardware (e.g., the number of registers
available on a machine).
2. Simplicity: Virtual registers make it easier to represent the semantics of a program
during the Intermediate Representation (IR) phase. Compilers often use IRs that are
closer to the source code (SSA, for example), where each value has its own unique
"register" (virtual register) to simplify analysis and optimization.
3. Optimization: Compilers can perform a wide range of optimizations on the code without
needing to assign physical registers. Once optimizations are complete, the final step is to
map the virtual registers to the real physical registers of the target machine.
Register Allocation
In the final stages of compilation, the compiler uses a process called register allocation to map
virtual registers to physical machine registers. The graph coloring algorithm is often used
here: each virtual register becomes a node; an edge connects two registers that are alive at the
same time (they "interfere"); and each color corresponds to a physical register.
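As a toy illustration (all names here are hypothetical), a greedy coloring of an interference graph could look like this:

def color_graph(interference, num_regs):
    """Map each virtual register to a color in 0..num_regs-1, or to 'spill'."""
    coloring = {}
    for node in interference:
        # Colors already taken by neighbors that interfere with this node:
        taken = {coloring[n] for n in interference[node] if n in coloring}
        free = [c for c in range(num_regs) if c not in taken]
        coloring[node] = free[0] if free else "spill"
    return coloring

# v1 and v3 are never alive at the same time, so they can share a register:
graph = {"v1": {"v2"}, "v2": {"v1", "v3"}, "v3": {"v2"}}
print(color_graph(graph, 2))  # {'v1': 0, 'v2': 1, 'v3': 0}

Virtual registers that receive no color must be spilled, i.e., kept in memory instead of in a register.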
Examples in practice:
● LLVM IR: LLVM uses a large number of virtual registers in its intermediate
representation. It defers register allocation until later in the compilation process, after
performing optimizations.
● SSA (Static Single Assignment): Each variable is assigned exactly once, and virtual
registers naturally fit into this model. Register allocation is then performed to map these
variables to physical registers.
In summary, virtual registers provide flexibility and allow compilers to focus on optimization
without worrying about hardware constraints until the final stages of code generation.
Would the code generator we wrote work for a programming language with
anonymous functions?
That code generator would not work because we are calling functions directly via jal
instructions. These instructions need a fixed address (the label that is associated with the name
of the function).
1. Basic Data Types:
● For integers, you simply store the integer value (e.g., 1, 42, -7) in a register.
● For booleans, you store a binary value (0 for false, 1 for true).
2. Functions:
● Functions are more complex than basic data types like integers or booleans, because
they represent a block of executable code along with the environment in which the
function was created (for closures).
When producing code for a function, the value of a function in this context is the address of the
function's code in memory. This is usually stored as a pointer in a register.
1. Non-Closure Functions:
● For non-closure functions (functions that don’t depend on external variables from their
environment), the function's value is simply a pointer to the code of the function, e.g.:
fn x => x + 1
● The value stored in the register is the address of the machine code that implements this
function (i.e., the instructions for x + 1). This address allows the program to jump to
the function's code when it is called.
2. Closure Functions:
● For closures, a function can capture variables from its surrounding environment (lexical
scope). In this case, the value that gets stored in the register is often a closure object
that contains:
○ A pointer to the function's code.
○ A pointer to the environment (i.e., the variables that were captured from the
enclosing scope), e.g.:
let y = 10 in fn x => x + y
● In this case, y is part of the environment captured by the function. The value stored in a
register would be a pointer to a closure that contains:
○ The address of the code for fn x => x + y.
○ A reference to the value of y, so that when the function is called, it can correctly
add y to x.
Conclusion
In short, the value of a function is the address of its code, possibly paired with a pointer to its
environment. This allows the program to correctly jump to the function's code when it is called,
and, if necessary, access any captured variables that are part of its environment.
How would you generate code for anonymous functions then, assuming
that we don't have closures?
Anonymous functions, also called lambda functions or function literals, are functions that do
not have a name. They can be created dynamically, and we often pass them as arguments to
other functions. In this case, the same code generation principles apply as with named
functions, but we treat them as if they have no pre-defined name. To generate code for
anonymous functions, we need to approach it differently. The key challenge is that we can't
rely on static function names. Instead, we need to generate a unique label for each anonymous
function at runtime (during code generation). This allows us to refer to the function even though
it has no name in the source code.
Here’s how we can modify the code generation to support anonymous functions:
1. Assign a Unique Label to Each Anonymous Function: For each anonymous function,
we can generate a unique label dynamically during code generation. For example, we
could name anonymous functions anon_func_1, anon_func_2, etc.
2. Adjust Function Call Code: When calling an anonymous function, we use its
dynamically assigned label.
Let’s modify the visit_named_fn and visit_named_app methods to work with anonymous
functions.
def visit_fn(self, exp, prog):
    addr_var = self.next_var_name()
    func_label = self.next_fn_label()
    func_after = self.next_fn_label()
    # Load the address of the function into addr_var, then skip over its body:
    prog.add_inst(AsmModule.Addi(addr_var, "x0", func_label))
    prog.add_inst(AsmModule.Jal("x0", func_after))
    prog.add_label(func_label)
    # Allocate space on the stack for the return address and local variables.
    prog.add_inst(AsmModule.Addi("sp", "sp", -4)) # Assuming a word size of 4.
    prog.add_inst(AsmModule.Sw("ra", 0, "sp")) # Store the return address.
    # Bind the formal parameter (passed in a0) and generate the body:
    prog.add_inst(AsmModule.Add(exp.formal, "a0", "x0"))
    return_var = exp.body.accept(self, prog)
    # Move the result of the function body into the return register 'a0'.
    prog.add_inst(AsmModule.Add("a0", return_var, "x0"))
    # Restore ra, free the frame, and return (assuming Lw/Jalr in AsmModule):
    prog.add_inst(AsmModule.Lw("ra", 0, "sp"))
    prog.add_inst(AsmModule.Addi("sp", "sp", 4))
    prog.add_inst(AsmModule.Jalr("x0", "ra", 0))
    prog.add_label(func_after)
    return addr_var # Return the register that holds the function's address.
def visit_app(self, exp, prog):
    # Generate code for the actual parameter (expression) of the function call:
    param_value = exp.actual.accept(self, prog)
    # Pass the parameter in register a0.
    prog.add_inst(AsmModule.Add("a0", param_value, "x0"))
    # The callee is a value: evaluate it and call it indirectly (assuming Jalr):
    func_addr = exp.function.accept(self, prog)
    prog.add_inst(AsmModule.Jalr("ra", func_addr, 0))
    # Copy the return value out of a0:
    ret_var = self.next_var_name()
    prog.add_inst(AsmModule.Add(ret_var, "a0", "x0"))
    return ret_var
Helper Method:
We also need a helper method to generate unique labels for anonymous functions. Here’s a
simple version:
def next_fn_label(self):
    """
    Generate a unique label for anonymous functions.
    """
    if not hasattr(self, '_fn_count'):
        self._fn_count = 0
    self._fn_count += 1
    return f"anon_func_{self._fn_count}"
This method keeps track of how many anonymous functions we’ve seen so far and generates a
unique label for each one.
Conclusion:
With these changes, the code generator can now handle anonymous functions. We generate
unique labels for them and handle function calls dynamically based on those labels. The
approach is similar to named functions, except we don't rely on exp.name and instead generate
labels on the fly.
That's really odd: why is there a jump at the beginning of the function?
The jump at the beginning of the function is required to allow the program to skip over the
function’s code and continue with the main program logic. Here’s why this is necessary:
● Code is emitted linearly: the body of an anonymous function is generated right where
the function literal appears in the instruction stream.
● Without the jump, execution would fall through into the function body and run it at the
point of definition, before any call happens and with no valid return address in ra.
Example Flow:
● The Jal instruction at line 001 makes sure that the program skips over the function
body and jumps to the actual code that calls the function (the main execution part).
● Later, when f 3 is evaluated (around lines 111-115), the address of the function is
loaded from v1 and the program jumps back to the function's code (_fn_count_0) to
execute it.
In Summary:
The initial jump (Jal x0, func_after) is there to ensure that the program can define the
function and then skip over it, so the function doesn’t get executed immediately when it’s
defined. This lets the program continue to its main logic and ensures that the function is only
invoked when a function application occurs (f 3 in this case). This is especially important in
assembly-level code where execution flows linearly unless explicitly controlled with jumps.
Could you show an example with two functions, so that we can see how we
find the first executable instruction of the program?
Sure, see below:
The image illustrates the process of finding the first executable instruction of a program in the
context of RISC-V assembly code, focusing on skipping over function definitions.
Key Points:
1. Function Definition: The body of the function f is emitted at addresses 002-010.
2. The Initial Jump: The jal x0 11 instruction at address 001 transfers control around
this body, skipping the function body (addresses 002-010) and continuing with the main
program logic, which begins at instruction 011.
3. Finding the First Executable Instruction:
○ After defining the function, the program immediately jumps from address 001 to
address 011, which contains the first executable instruction. This is the point
where the program's main execution begins.
○ Address 011 (f_1 = addi v1 0) initializes the variable f_1. The subsequent
instructions (012-022) set up and compute the results of applying the functions
f and g.
4. Visual Path:
○ In the image, the dashed arrow between assembly addresses 001 and 011
shows the jump that skips the function body, directly leading to the first
executable instruction (011).
○ This dashed line represents the control flow path that the program follows to
avoid executing the function code immediately upon definition.
5. Summary of the Path:
○ The program starts execution at address 000, where v1 is initialized.
○ At address 001, the program encounters the jal x0 11 instruction, which
jumps over the body of f and proceeds to the instruction at address 011,
marking the start of the program’s actual computation.
○ From there, the program begins applying the functions (f and g) and performing
the required arithmetic operations.
Conclusion:
The jump at address 001 ensures that the function definitions are skipped, and the main
execution starts at address 011, which is the first executable instruction of the program.
But, in practice (like when we compile a typical C program) how does the
CPU find the first instruction to execute?
In practice, when you compile a C program (or any program) to machine code, the CPU doesn't
automatically know where to start executing. There are a few key mechanisms that tell the CPU
which instruction is the first one to execute. Let's break this down:
1. Executable Format:
When you compile a program, the machine code is wrapped in a specific executable format
(e.g., ELF for Linux, PE for Windows, Mach-O for macOS). This format contains not only the
machine instructions but also metadata that includes information such as:
● The entry point: This is a special address where the program execution begins.
● The program header: Information about memory segments and how they should be
loaded into memory.
In the case of ELF files (used in Linux), this information is stored in the header, where the entry
point address is explicitly defined.
2. The Operating System Loader:
The operating system plays an important role in starting program execution. When you run a
program (e.g., by typing ./myprogram in a terminal):
● The operating system loader reads the executable file (e.g., the ELF file).
● It looks for the entry point address in the file's header.
● The OS sets up the program's memory (loading sections like code, data, stack, etc.).
● Finally, the CPU's program counter (PC) is set to the entry point address, and
execution begins.
3. The Startup Code:
In C programs, the typical entry point is the main() function. But this isn't the first function to
run! When you compile a C program, the compiler generates extra code that gets executed
before main(). This code is typically part of the runtime library (e.g., crt0.o in Linux). The
general sequence is:
● The startup code (from the runtime) sets up things like the stack, heap, and
environment variables.
● Once the startup code has initialized the environment, it calls the main() function.
The startup code is located at the entry point that the OS loader finds in the executable file.
4. The First Executable Instruction:
The first executable instruction is typically part of the startup code, not the main() function.
For example, on Linux, this is located in crt0.o, which is linked into your program. The
instruction at the entry point might involve setting up the stack or initializing memory, and from
there, control is eventually passed to main().
5. Putting It All Together:
● When the OS starts the program, it loads the program into memory.
● The program counter (PC) is set to the entry point (as defined in the executable's
header).
● The CPU then fetches the instruction at the address in the PC and starts executing.
Can you show how we can find out the entry point of a program in Linux?
Let's compile a simple C program like this:
int main() {
return 0;
}
Compile it (e.g., with gcc program.c, which produces a.out), then inspect the ELF header:

readelf -h a.out

Among other fields, readelf prints an entry such as:

Entry point address: 0x400560

This tells the OS to start execution at memory address 0x400560, which is where the startup
code (from crt0.o) is located. The CPU will start executing instructions at this address,
eventually calling your main() function after the necessary initialization.
Summary:
● When you compile a program, the compiler generates machine code wrapped in an
executable format (e.g., ELF).
● The executable contains metadata, including the entry point, which tells the operating
system where the first instruction is.
● The operating system loader reads the executable and sets the CPU's program
counter (PC) to the entry point address.
● The CPU starts executing from the entry point, which is typically part of the startup
code, not directly from main().
That's how the CPU finds the first executable instruction in a compiled program!
What is a Closure?
A closure is a function along with its "environment," which consists of the bindings of free
variables at the time the closure is created. It allows a function to carry its context with it, even
when it is executed outside that context. Here’s how it applies to the example below:
Function Definition:
fn x => fn y => x + y
This is a curried function. It takes one argument x and returns another function:
fn y => x + y
Notice that in the inner function fn y => x + y, the variable x is free because it is not locally
defined in the inner function. Instead, x is captured from the environment of the outer function.
Closures are essential in functional programming because they allow functions to:
● remember the values of free variables from the scope where they were created; and
● be passed around and invoked later, outside that scope.
In the example above, once the outer function is applied to 2, the resulting closure
fn y => 2 + y retains the value of x as 2, enabling the inner function to use it when y is
later applied.
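The same mechanics, written in Python just to make them concrete:

def outer(x):
    # 'inner' closes over x: the binding travels with the function.
    def inner(y):
        return x + y
    return inner

add2 = outer(2)  # a closure that remembers x = 2
print(add2(3))   # prints 5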
How can we compile closures, then? A standard technique, called closure conversion, makes
the environment explicit:
1. Modify the Function: Rewrite the function to explicitly take the environment as an argument.
For example:
fn y => x + y
becomes:
fn (env, y) => #x env + y
2. Adjust Function Applications: Function applications must pass the environment explicitly.
A closure becomes a pair of a function pointer and an environment:

let
  val closure = (function_pointer, environment)
in
  function_pointer(environment, argument)
end

When the outer function is applied to 2, it creates a closure whose environment binds x to 2.
Applying that closure to 3 then becomes:

function_pointer(environment, 3)

resulting in: 2 + 3 = 5. The figure below illustrates the whole process in SML/NJ:
This transformation has two important benefits:
1. Supports Lexical Scoping: Free variables are captured correctly and used in the
appropriate context.
2. Facilitates Compilation: By converting closures into explicit structures, compilers can
generate efficient code for languages that support higher-order functions.
Closure Conversion
● Goal: To explicitly represent closures (functions and their environments) in a way that
can be efficiently implemented in lower-level code, like machine code or intermediate
representations.
● How It Works:
○ Transform functions to take an explicit environment parameter.
○ Capture free variables in a structured environment (e.g., a tuple or record).
○ Represent a closure as a pair of the function's code pointer and its environment.
● Key Characteristics:
○ Preserves the lexical scoping of the original program.
○ Used in runtime implementations, especially when higher-order functions are
common.
○ Closures remain first-class values (can be passed as arguments, returned, etc.).
Example:
fn x => fn y => x + y
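A sketch of the converted program, in Python, representing a closure as a pair of code and environment (all names are illustrative):

def inner_code(env, y):
    # x is no longer free: it is fetched from the explicit environment.
    return env["x"] + y

def outer_code(env, x):
    # Creating the inner function builds a closure: code plus environment.
    return (inner_code, {"x": x})

# (fn x => fn y => x + y) 2 3:
code, env = outer_code({}, 2)
print(code(env, 3))  # prints 5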
Lambda Lifting
● Goal: To eliminate nested functions by "lifting" them to the top level, converting a
program into one where all functions are top-level functions.
● How It Works:
○ Identify free variables in nested functions.
○ Add these free variables as explicit additional parameters to the lifted function.
○ Transform the function's definition and all calls to it.
● Key Characteristics:
○ Makes nested functions unnecessary.
○ Results in a "flat" program, where all functions are at the top level.
○ Often used in static compilation to simplify program structure.
Example:
let
val f = fn x => fn y => x + y
in
f 2 3
end
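After lifting, the nested function becomes a top-level one, and its free variable x becomes an explicit parameter. In Python terms (illustrative names):

# fn y => x + y is lifted to the top level; x is now a parameter:
def f_lifted(x, y):
    return x + y

# The application f 2 3 turns into a direct call with both arguments:
print(f_lifted(2, 3))  # prints 5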
Key Differences
Aspect: Primary Use Case
● Closure Conversion: Dynamic or runtime contexts (e.g., interpreter backends, JITs).
● Lambda Lifting: Static or compiled contexts where functions must be standalone.
You can think of lambda lifting as a transformation that removes nested functions, while
closure conversion retains them but makes their dependencies explicit.
● The original SML/NJ function funToAddX returns a nested function addX, capturing x
as a free variable.
● In C, this is modeled by creating a closure where:
○ The free variable x is stored in the environment.
○ The function pointer (addX) and the environment are bundled into a closure
struct.
● The figure demonstrates how the free variable (x) and function (addX) are explicitly
represented as runtime constructs (structs in C).
● This transformation allows higher-order functions and lexical scoping to be implemented
in a lower-level language like C, which lacks native support for closures.
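A sketch of that C encoding, assuming the names used in the figure (the struct layout and helper names are illustrative):

#include <stdio.h>
#include <stdlib.h>

// The environment stores the free variable x.
typedef struct { int x; } Env;

// A closure pairs a function pointer with its environment.
typedef struct {
    int (*fn)(Env*, int);
    Env* env;
} Closure;

// The body of addX: x comes from the environment, y is the actual argument.
int addX(Env* env, int y) {
    return env->x + y;
}

// funToAddX allocates the environment and bundles it with the code pointer.
Closure funToAddX(int x) {
    Env* env = malloc(sizeof(Env));
    env->x = x;
    Closure c = { addX, env };
    return c;
}

int main() {
    Closure add2 = funToAddX(2);
    printf("%d\n", add2.fn(add2.env, 3)); // prints 5
    free(add2.env);
    return 0;
}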
Conclusion:
The concept of closure conversion evolved organically as part of solving practical and
theoretical problems in the implementation of functional programming languages. If you’re
interested in a foundational text, Reynolds' 1972 paper is an excellent place to start for
understanding the theoretical origins of many related concepts.
3. Separation Logic
● Reynolds, along with Peter O’Hearn and others, developed separation logic, a formalism
for reasoning about programs that manipulate memory.
● This work has been instrumental in verifying the correctness of concurrent and
imperative programs, and it is widely used in tools like Facebook’s Infer.
5. The Craft of Programming
● This book emphasizes formal reasoning about programs and algorithm design,
promoting correctness-by-construction methodologies. It bridges theoretical computer
science and practical programming.
6. Hygienic Macros
● Although not directly attributed to him, his exploration of scoping and bindings laid the
foundation for ideas like hygienic macros in Scheme and other languages.
Legacy:
Reynolds was a towering figure in theoretical computer science, blending mathematical rigor
with practical applications. His work underpins much of modern programming language theory
and has influenced the design of languages and tools for decades. Beyond his technical
contributions, his writings, such as The Craft of Programming, have inspired generations of
computer scientists to think deeply about the principles of programming.
Memory Allocation
Code: https://github.com/pronesto/DCC053_Public/tree/main/17_MemoryAllocation
In which region of memory should each variable of a program live? There are three main
options:
1. Static Memory:
● What it is: Static memory is allocated at compile-time for global or static variables.
These variables persist throughout the entire program's execution.
● When used: If your semantics include global variables or variables that are defined
with static storage duration (like those declared outside functions in C or using the
static keyword), their addresses (loc) would refer to locations in static memory.
● Advantages: Static memory is simple to manage since it exists throughout the
program's life.
● Disadvantages: Limited flexibility; once allocated, static variables remain in memory
until the program exits.
2. Stack Memory:
● What it is: Stack memory is used for local variables within functions. Each function call
creates a new stack frame, which holds its local variables.
● When used: The addresses (loc) for local variables in functions (e.g., those in let
expressions) would typically point to locations in stack memory. These variables are
allocated when the function is called and deallocated when the function returns.
● Advantages: Stack allocation is very efficient since memory is automatically reclaimed
when functions exit.
● Disadvantages: Limited in size; addresses (loc) are only valid as long as the function’s
stack frame exists (i.e., can't return a reference to a local variable).
3. Heap Memory:
● What it is: Heap memory is dynamically allocated at runtime. The programmer controls
allocation (e.g., with malloc in C, new in languages like C++ and Java).
● When used: If the language supports dynamic memory allocation, then the addresses
(loc) would correspond to heap memory. For example, when you explicitly allocate
memory for a variable, its address will be in the heap.
● Advantages: More flexible, as you can control the lifetime and size of objects.
● Disadvantages: Managing heap memory requires explicit allocation and deallocation,
which can lead to memory leaks or fragmentation if not handled properly.
In the operational semantics we are working with, loc is an abstract representation of memory
locations, but in practice, it would map to one of these memory regions depending on the
variable's scope and lifetime:
If you’re extending the language semantics to support heap allocation, you would also need to
add rules for memory allocation and deallocation. For instance, for a dynamic allocation
operation (like malloc), you would create a new location (loc) on the heap and update the
store (S). But this is a story for another chapter :)
Consider, for instance, the C program below, which stores the same string in three different
regions of memory:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define SIZE 18 /* length of "DCC053 is awesome!" */

int main() {
    // Static allocation:
    char* s0 = "DCC053 is awesome!";
    printf("%p, s0\n", (void*) s0);
    // Stack allocation:
    char s1[SIZE + 1];
    strcpy(s1, s0);
    printf("%p, s1\n", (void*) s1);
    // Heap allocation:
    char* s2 = malloc(SIZE + 1);
    strcpy(s2, s0);
    printf("%p, s2\n", (void*) s2);
    free(s2);
    return 0;
}
Let’s analyze where each variable (s0, s1, and s2) is allocated based on the provided code and
memory layout options:
1. s0:
● Where it is allocated:
The string literal "DCC053 is awesome!" is stored in static memory, specifically in
the read-only data section (e.g., .rodata), because string literals are immutable and
shared across the program.
● Type of memory: Static Memory
● Example of address in output: 0x1047cbf78
2. s1:
● Where it is allocated:
The array s1 is declared as a local variable with automatic storage duration. It is
allocated on the stack memory. The contents of s0 are copied into s1 using strcpy.
● Type of memory: Stack Memory
● Example of address in output: 0x16b637520
3. s2:
● Where it is allocated:
The pointer s2 is stored in the stack memory (since it's a local variable), but the memory
it points to is dynamically allocated using malloc(), which means it resides in the heap
memory. The contents of s0 are copied into this heap-allocated memory using strcpy.
● Type of memory: Heap Memory (for the allocated block)
● Example of address in output: 0x600002d10020
Summary Table

Variable   Storage Type   Memory Segment        Address in Output
s0         Static         .rodata (read-only)   0x1047cbf78
s1         Stack          Stack                 0x16b637520
s2         Heap           Heap                  0x600002d10020

(The pointer s2 itself is a local variable on the stack; the block it points to is on the heap.)

This segmentation explains the program's behavior and memory allocation as reflected in the output.
Can you explain how memory allocation would happen in the SML program
below?
The explanations in the figure describe how variables are allocated in memory:
Overall:
The figure shows the distinction between static allocation (for w) and stack allocation (for x). The program shows how the memory allocation strategy depends on the variable's lifetime and scope.
The explanation aligns with how memory is managed in most programming languages, where
variables with global scope can be statically allocated, while local variables within functions use
stack memory.
In the assembly code provided below, we can identify where the variables like aux and answer
are allocated by analyzing the stack operations and memory access patterns. Let's break it
down:
Key Concepts:
● sp (stack pointer): This points to the top of the current stack frame.
● x29 (frame pointer): This points to the base of the stack frame.
● x30 (link register): This stores the return address of the current function.
The stack frame is set up at the beginning of the function with an instruction like:
stp x29, x30, [sp, #-16]!   ; push the frame pointer and the link register
● adrp x9, _answer@PAGE: Loads the base address of the page containing answer
into register x9.
● str w8, [x9, _answer@PAGEOFF]: Stores the value of w8 (which contains m) into
the answer variable.
This means that answer is stored in the global memory (usually part of the data section of the
binary).
The function parameters m and n are passed in registers w0 and w1, but they are spilled (stored) on the stack early in the function to free up the registers for other purposes. As a result, both m and n are allocated on the stack: m is stored 4 bytes below the frame pointer, and n is stored at an 8-byte offset from the stack pointer.
aux is a local variable that stores the result of m - n. In the assembly, this value is computed and then written to the stack, which means that aux is also allocated on the stack, specifically at an offset of 4 bytes from the stack pointer.
Summary:
● m and n: Spilled from the parameter registers to the stack early in the function.
● aux: Stored at [sp + 4] (an offset of 4 bytes from the stack pointer), on the stack.
● answer: Stored in the program's data section (global memory).
These variables are managed in different memory regions: the stack is used to store both local variables (aux) and spilled function parameters (m and n), while the global variable answer is stored separately in the program's data section.
Can you guess how the global variable is allocated in the assembly code
we have seen?
We can see the allocation in the following two lines:
These lines indicate that the variable answer is a global variable and give details about how it's allocated in memory. Let's break it down:
1. .globl _answer
● .globl is an assembler directive that marks answer as a global symbol. This means
that the variable answer can be accessed from other files (not just the one where it's
declared). It's essentially making the symbol visible to the linker, allowing the variable
to be used across different translation units in a C program.
● The _answer is the name of the symbol (variable) being defined.
2. .zerofill __DATA,__common,_answer,4,2
This directive is more specific and gives detailed information on how the global variable answer
is laid out in memory.
● __DATA: The segment where the symbol lives (the writable data segment of the binary).
● __common: The section, within that segment, reserved for uninitialized (common) symbols.
● _answer: This is the name of the symbol being allocated in this section (the answer variable).
● 4: This specifies the size (in bytes) of the allocation. Since answer is an int, and on
most systems an int is 4 bytes, this means answer is allocated 4 bytes of memory.
● 2: This is the alignment requirement, expressed as a power of two: an argument of 2 means 2^2 = 4-byte alignment. That is exactly the natural alignment of an int on most systems, so the directive asks that answer be placed on a 4-byte boundary.
What it means:
● Global: answer is a global variable, accessible across different source files in the
program.
● Uninitialized: It is stored in the BSS (or __common) section and will be zero-initialized
when the program starts.
● Size: It takes up 4 bytes (consistent with an int).
● Alignment: It is aligned on a 2-byte boundary.
In summary, the code reserves space for a 4-byte global integer answer in the program’s data
segment, specifically in an area where zero-initialized variables are stored. When the program
starts, answer will be initialized to 0 by the system before the program runs.
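For reference, a directive like this is what an uninitialized global declaration produces. Assuming the original C source contained something like:

int answer; /* uninitialized global: placed in __common and zero-filled at load time */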
What if you were producing code for RISC-V, how would the allocation of
variables be?
Data allocation in RISC-V is very similar to ARM. The figure below shows the key parts of an
(unoptimized) assembly file:
Let's go step by step and explain how variables are allocated in this version of the RISC-V
assembly code to help a computer science student understand it.
In RISC-V, the stack is a region of memory used to store data that doesn't fit in registers or
must persist across function calls. The stack grows downwards (i.e., the stack pointer, sp, is
decremented when allocating space and incremented when releasing space).
In the modified assembly code, we use the stack to store local variables (m, n, and aux), as
well as saved values like the return address and frame pointer.
At the beginning of the gcd function, the following code reserves 28 bytes on the stack:
addi sp, sp, -28   # Allocate 28 bytes (released again by the epilogue)
Before doing anything with the function arguments, we save the return address (ra) and frame
pointer (s0) onto the stack, so we can restore them later when the function returns:
sw ra, 24(sp) # Save return address at the top of the stack (offset 24)
sw s0, 20(sp) # Save frame pointer just below it (offset 20)
● The return address (ra) is stored at an offset of 24 bytes from the current stack pointer
(sp).
● The frame pointer (s0) is stored at an offset of 20 bytes.
In the original program, m and n are passed as function arguments. In RISC-V, these arguments are passed in registers a0 and a1. But since we want to store them on the stack, we do:
sw a0, 16(sp)   # Save m at offset 16
sw a1, 12(sp)   # Save n at offset 12
This stores m at offset 16 and n at offset 12 from the stack pointer.
When we need to perform operations with m and n, we first load them back from the stack into registers for computation:
lw t0, 16(sp)   # Load m into t0
lw t1, 12(sp)   # Load n into t1
Now t0 holds the value of m, and t1 holds the value of n, allowing us to perform comparisons or calculations.
In the program, aux is computed as m - n. After calculating aux, we store it on the stack:
sub t2, t0, t1   # aux = m - n
sw  t2, 8(sp)    # Save aux at offset 8
Here, t2 holds the value of aux, and it is stored at offset 8 from the stack pointer.
5. Recursive Calls
When making a recursive call, we need to pass updated values of m and n. Instead of keeping these values in registers, we continue using the stack to hold them across the call.
This way, we keep the values for the recursive call in the stack, so the program's memory stays
well-managed.
After the recursive call completes, we must restore the saved ra and s0 values from the stack, so the program can return to the correct location:
lw ra, 24(sp)   # Restore the return address
lw s0, 20(sp)   # Restore the frame pointer
Finally, we deallocate the stack space by restoring the stack pointer to its original position:
addi sp, sp, 28 # Restore stack pointer (deallocate 28 bytes)
Summary:
● Global variables like answer are stored in memory, and we access them via memory
instructions like lui and sw.
● Function arguments (m and n) are passed in registers but saved to the stack at specific
offsets (16(sp) and 12(sp)).
● Local variables like aux are stored on the stack (8(sp)).
● The stack is used to store data that needs to persist across recursive calls or when
registers are limited.
In the RISC-V version of the code, the stack is used to manage all variables (m, n, and aux),
demonstrating how data can be stored and accessed in a structured manner when registers are
not sufficient or when we need to save values across function calls. This technique is crucial in
managing memory in low-level systems programming.
So, we have heap, stack, static memory and such. How are these things
organized?
The memory layout of a program depends on several factors—including the architecture, the
operating system (OS), and even the compiler. Here's a typical memory layout for a running
program, dividing memory into its main sections: text (where the program code resides), data
(for global/static variables), heap (for dynamic memory allocation), and stack (for local variables
and function call management).
1. .text: The text segment stores the program's executable instructions. It is typically
read-only and executable to ensure security and prevent accidental modification.
2. .rodata: The read-only data segment contains immutable data, such as string literals
and constants. This section is non-writable to protect static data from accidental
changes.
3. .data: The data segment holds global and static variables that are initialized explicitly.
These variables persist throughout the program's execution and are writable.
4. .bss: The bss segment contains uninitialized global and static variables. These
variables are zero-initialized by default when the program starts.
5. Heap: The heap is used for dynamically allocated memory during runtime. Memory in
the heap is managed manually by the program, typically using functions like malloc and
free. It grows upward in memory.
6. Stack: The stack is used for local variables, function call frames, and return addresses.
It grows downward in memory and is managed automatically as functions are called and
terminated.
7. Shared Libraries: Shared libraries, such as dynamically linked functions, are loaded
into memory by the operating system. These reside at higher addresses to keep them
separate from the program's static and dynamic memory.
8. Kernel: In a typical user-space program, the kernel (operating system) resides at the top
of memory. It handles system-level operations such as managing hardware resources
and memory protection.
This memory layout is common for many systems, including those running Unix-like operating
systems. The actual layout may vary depending on the architecture and the operating system,
but this gives a conceptual understanding of how memory is organized.
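To make this layout concrete, the small C program below (a sketch; the addresses it prints vary across systems and runs, especially with ASLR) takes one address from each region:

#include <stdio.h>
#include <stdlib.h>

int initialized = 42;            /* .data: explicitly initialized global */
int uninitialized;               /* .bss: zero-initialized global        */
const char *literal = "hello";   /* the characters live in .rodata       */

int main() {
    int local = 0;                        /* stack */
    int *dynamic = malloc(sizeof(int));   /* heap  */
    printf("text:   %p\n", (void *)main);
    printf("rodata: %p\n", (void *)literal);
    printf("data:   %p\n", (void *)&initialized);
    printf("bss:    %p\n", (void *)&uninitialized);
    printf("heap:   %p\n", (void *)dynamic);
    printf("stack:  %p\n", (void *)&local);
    free(dynamic);
    return 0;
}

On a typical Linux system, the addresses increase roughly in the order listed: text and data near the bottom, the heap above them, and the stack near the top.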
The distinction between the .data and .bss segments in an executable file is rooted in
efficiency, both in terms of memory usage and the size of the executable itself. Here's why the
distinction is important:
1. Memory Initialization
● .data Segment:
○ Stores initialized global and static variables.
○ These variables have explicitly defined initial values in the source code (e.g., int
x = 10;).
○ The initial values need to be stored in the executable file so that they can be
loaded into memory at runtime.
● .bss Segment:
○ Stores uninitialized global and static variables (or variables initialized to zero,
such as int y; or int z = 0;).
○ Since these variables are zero-initialized by default, there’s no need to store their
values in the executable file. Instead, the operating system simply allocates
zero-filled memory at runtime.
2. Executable File Size
● By separating the .bss segment, the executable file avoids storing large blocks of
zeroes that would represent uninitialized data. Instead, the file only includes metadata
(e.g., the size of the .bss segment), which tells the operating system how much
memory to reserve and initialize to zero during program loading.
● This approach makes the executable file smaller and more efficient to store and transfer.
3. Runtime Efficiency
● Zero-initializing the .bss segment at load time is cheap: the operating system can simply hand the program zero-filled pages.
● The .data segment, by contrast, involves copying the predefined initial values from the executable file into memory, which is more costly than zeroing memory.
● Historically, this separation was crucial for systems with limited storage and memory
resources. It minimized the size of the program stored on disk or tape while still ensuring
the program could be correctly initialized at runtime.
● Modern systems retain this convention because it still provides clarity and efficiency,
especially in systems programming and embedded contexts where executable size and
memory initialization cost are significant concerns.
By distinguishing between the .data and .bss segments, operating systems and compilers
optimize both the storage and the runtime behavior of programs.
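A quick way to observe the difference (assuming GCC and binutils on a Linux machine):

/* segments.c */
int big_data[100000] = {1}; /* .data: roughly 400 KB stored inside the executable */
int big_bss[100000];        /* .bss: only the size is recorded, not the bytes     */

int main() { return 0; }

Compiling with gcc segments.c -o segments and running size segments reports about 400,000 bytes under both data and bss, yet ls -l reveals that only the .data array inflates the file on disk.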
Why are string literals placed in read-only memory?
● Reduces Memory Usage: By making string literals read-only, the system can store
them in low-cost, read-only memory regions instead of allocating writable memory for
each instance.
● Prevents Bugs: String literals represent constant values that are generally not intended to be changed. Making them immutable enforces this and helps avoid bugs caused by unintentional modification of shared literals.
● Consistency: String literals align with the immutability of other constants in C. Just as const int x = 42; prevents changes to x, making string literals read-only enforces their constancy.
● History: In the early days of C, string literals were not explicitly read-only, leading to bugs when programmers unintentionally modified them. Later implementations standardized this behavior to prevent such issues while maintaining backward compatibility for programs that treated literals as char * pointers (though this is discouraged now).
● Better Practices: Immutable string literals encourage developers to adopt better practices, such as using writable char[] arrays for modifiable strings and keeping literals immutable where they are meant to be constants.
Summary
Making string literals read-only reflects a balance between efficiency, security, and clarity. It
aligns with the principle of treating literals as constants, prevents inadvertent bugs, and takes
advantage of modern memory management strategies to optimize performance and security.
So, this memory layout depends on what? The architecture? The OS? The
compiler?
Let's break down how each of these influences the layout:
1. Architecture (CPU/Hardware)
● Endianness (little-endian or big-endian) affects how multi-byte data (like integers) are
stored, but not the general layout.
● Address space size, such as 32-bit or 64-bit, dictates how much memory is available.
For example, a 32-bit system has a maximum addressable memory of 4 GB, while a
64-bit system can theoretically access much larger memory spaces (e.g., terabytes).
● Memory management hardware can affect the layout. Some architectures may have
specific memory segments reserved for hardware-level operations or certain kinds of
memory protection (e.g., ARM's memory models vs. x86).
2. Operating System (OS)
● The OS manages memory at a high level, including how the text, data, stack, and heap
segments are placed in virtual memory.
● Virtual memory and address space layout randomization (ASLR): Modern operating
systems often randomize the locations of the stack, heap, and other sections in memory
to increase security. This means that even if the general layout is the same, the actual
memory addresses can vary every time a program runs.
● Some OSes use separate address spaces for user programs and kernel code, while
others (especially embedded systems) may not have such strict separation.
Examples of OS-specific factors:
○ Unix/Linux-based systems: Follow the general layout I showed earlier but
include additional features like shared libraries loaded dynamically at runtime.
○ Windows: Follows a similar layout but has its own memory management
mechanisms, such as how it handles dynamic link libraries (DLLs) and system
calls.
3. Compiler
● The compiler decides how local variables, global variables, and constants are stored in
memory, and it may align data in memory for performance reasons (e.g., padding to
ensure memory alignment for faster access).
● Optimization levels: Higher optimization settings can lead to more aggressive memory
usage, such as reducing stack space by reusing variables or inlining functions.
● Some compilers insert additional data for debugging (like stack canaries to detect
buffer overflows) or profiling, which may change the layout slightly.
● The calling convention (decided by the compiler or the OS/architecture) affects how
function parameters and return values are passed—whether they use registers or the
stack, and how the stack frame is managed during function calls.
Examples of Differences:
● In 32-bit systems, the stack might be closer to the upper memory limit (close to 4GB),
while in 64-bit systems, there’s much more room, so the stack and heap might be placed
further apart.
● Windows: Windows executables use the PE (Portable Executable) format, whose headers describe the file's layout in a way that differs from the ELF format used on Linux-based systems.
Compiler Example:
● GCC vs Clang: Both are C compilers but might make different decisions about the order
in which data is laid out, how much padding is used, or where variables are placed.
Summary
Despite these variations, the general memory layout (Text, Data, Heap, Stack) remains similar,
but the specifics—like exact memory addresses, alignment, and optimizations—can differ based
on these factors.
And where does the kernel sit in this address space?
In most systems, the kernel is mapped to the upper portion of the virtual memory space, leaving
the lower part for user space. The distinction is enforced by the operating system, and
user-space programs are not allowed to access kernel memory directly.
● 32-bit systems: The kernel memory typically starts at 0xC0000000. The upper 1 GB of
memory (from 0xC0000000 to 0xFFFFFFFF) is reserved for the kernel, while the lower
3 GB (from 0x00000000 to 0xBFFFFFFF) is used by user-space programs.
● 64-bit systems: The kernel memory typically starts higher up, often above the
canonical address boundary of 0xFFFFFFFF80000000 or similar, depending on the
architecture.
Linux Systems
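On Linux, you can see the user-space side of this split by printing the memory map of a running process:

cat /proc/self/maps   # lists the address ranges of this very 'cat' process:
                      # its code, heap, stack, and shared libraries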
KASLR
● With Kernel Address Space Layout Randomization (KASLR) enabled, the start of the
kernel memory will be randomized at boot time to a different location each time. This is
a security feature that makes it harder for attackers to predict kernel addresses and
exploit vulnerabilities.
○ On systems with KASLR, you may not be able to predict or see the kernel
address range as a regular user.
To summarize, three layers determine the layout:
● Processor architecture (e.g., x86, ARM, RISC-V): This defines how virtual and physical
memory are mapped and the range of addresses.
● Operating system (OS): The OS manages the division of memory between user-space
and kernel-space, including virtual memory management.
● Linker script (.ld file): This defines how a program's sections (like .text, .data,
.bss, etc.) are placed in memory when the program is compiled and linked.
The linker script is a file that the linker (like GNU ld) uses to control the memory layout of a
program. It is typically used in bare-metal or embedded systems where you need explicit control
over memory layout, but it can also be used in standard applications to customize the layout.
In the .ld file, you define how the different segments of your program (like the code, data, and
stack) are placed in memory. This is particularly important for low-level programming, such as
operating systems, embedded systems, and bootloaders, where precise control of memory is
needed.
Here's an example of what a basic linker script (memory.ld) might look like (the region names and sizes below are just illustrative values for a small embedded device):

/* Define the memory regions available on the target */
MEMORY
{
    ROM (rx)  : ORIGIN = 0x08000000, LENGTH = 256K
    RAM (rwx) : ORIGIN = 0x20000000, LENGTH = 64K
}

/* Define the entry point of the program (usually the reset handler) */
ENTRY(_start)

SECTIONS
{
    /* .text section: Code goes into the ROM region (executable, read-only) */
    .text : {
        *(.text)      /* Place all .text sections (code) here */
        *(.rodata)    /* Place read-only data here (like constants) */
        _etext = .;   /* Mark the end of the .text section */
    } > ROM
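The remaining sections follow the same pattern. A sketch of how such a script might continue, using the ROM/RAM regions above and the stack and heap sizes mentioned in the notes below:

    /* .data: initialized variables live in RAM, but their initial
       values are stored in ROM and copied over at startup */
    .data : {
        _sdata = .;
        *(.data)
        _edata = .;
    } > RAM AT > ROM

    /* .bss: zero-initialized variables occupy RAM only */
    .bss : {
        _sbss = .;
        *(.bss)
        *(COMMON)
        _ebss = .;
    } > RAM

    /* Reserve space for the heap and the stack at opposite ends of RAM */
    .heap : {
        _sheap = .;
        . = . + 16K;
        _eheap = .;
    } > RAM

    .stack : {
        _sstack = .;
        . = . + 4K;
        _estack = .;
    } > RAM
}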
Additional Notes:
● The stack and heap are placed at opposite ends of the RAM. This setup allows them to
grow towards each other, ensuring that unused space in the middle of RAM is available
for both stack and heap expansion.
● The .text section is marked read-only in ROM, and the .data and .bss sections are
placed in RAM to allow modification at runtime.
● The AT > ROM directive ensures that the initialized data stored in ROM is copied to
RAM when the program starts.
● Stack size: You can adjust the stack size by changing the size in the .stack section
(e.g., 4K).
● Heap size: You can adjust the heap size similarly in the .heap section (e.g., 16K).
● Memory regions: You can define other memory regions as needed for different
platforms (e.g., for embedded systems with different types of RAM or ROM).
Final Thoughts
A linker script allows you to explicitly control how memory is allocated in your program. It's
necessary for systems where memory is limited or needs to be managed carefully, such as in
embedded systems or bare-metal programming. In high-level applications, the toolchain
usually manages these details automatically, but understanding the linker script gives you more
control over your system's memory layout.
How can I see the .ld file in my Linux distro, assuming I am using Ubuntu
or something like that?
Ubuntu (and most Linux distributions) use linker scripts (.ld files), but these are typically
provided by the system's linker (usually ld from the GNU Binutils) and the C library (e.g.,
glibc or musl), and they aren't usually modified directly in everyday use. Instead, they're
embedded in the toolchain or in the standard libraries.
The default linker script is usually provided by ld. To see which script is being used by default,
you can pass the --verbose flag to the ld linker:
ld --verbose
This will output a lot of information, including the default memory layout and the location of the
system's default linker script. Look for lines similar to:
using internal linker script:
==================================================
followed by the script itself. This is the default linker script, embedded within the toolchain; it's not an .ld file you typically edit directly. However, it shows the default configuration that the linker uses.
If you want to modify or create your own custom linker script, you can create a .ld file in your project directory, and then tell the linker to use it by passing the -T option when compiling:
gcc -T memory.ld -o program main.c
This will tell GCC (or Clang) to use your custom .ld file.
Summary
The default script ships inside the toolchain (inspect it with ld --verbose); pass -T your_script.ld when you need to replace it.
How does the compiler know which variables are global and local?
The compiler distinguishes between global and local variables based on their scope and
placement within the source code, following the rules of the programming language (like C,
C++, etc.). Here's how it works:
1. Global Variables:
● Declared Outside Any Function: Global variables are declared outside of all functions,
usually at the top of the file, and have a file-wide or program-wide scope.
● Storage Class: Global variables are stored in the data segment (for initialized global
variables) or the BSS segment (for uninitialized ones) of the memory. These segments
are managed by the operating system and persist for the lifetime of the program.
● Linker Role: The linker is responsible for managing global variables across different
source files if they are declared as extern or have external linkage.
● Visibility: Global variables are visible throughout the entire file (or the entire program, if
extern) unless explicitly marked as static, which restricts their visibility to the current
file.
int global_var = 10; // Global variable, declared outside any function

void someFunction() {
    global_var = 20; // Can access the global variable here
}
2. Local Variables:
● Declared Inside a Function: Local variables are declared inside a function or a block
(like an if or for block) and have a function-level or block-level scope.
● Storage Class: Local variables are typically stored on the stack, and their lifetime is
restricted to the execution of the function or block in which they are declared. When the
function is called, space is allocated on the stack, and when the function returns, the
space is deallocated.
● Compiler Role: The compiler tracks local variables within the function's symbol table. It
assigns them stack locations (based on the architecture) and ensures they're valid only
within the function or block.
void someFunction() {
    int local_var = 5; // Local variable
    // Only accessible inside this function
}
Can you write a visitor that collects all the global variables in our SML-like
language (assuming all the variables have different names)?
The program below implements a visitor pattern to classify global variables in a functional
language with constructs like Let, Fn (functions), App (function applications), and Var
(variables).
from abc import ABC, abstractmethod

class Expression(ABC):
    @abstractmethod
    def accept(self, visitor, arg):
        raise NotImplementedError

class Var(Expression):
    def __init__(self, identifier):
        self.identifier = identifier
    def accept(self, visitor, arg):
        return visitor.visit_var(self, arg)

class Num(Expression):
    def __init__(self, num):
        self.num = num
    def accept(self, visitor, arg):
        return visitor.visit_num(self, arg)

class Let(Expression):
    def __init__(self, identifier, exp_def, exp_body):
        self.identifier = identifier
        self.exp_def = exp_def
        self.exp_body = exp_body
    def accept(self, visitor, arg):
        return visitor.visit_let(self, arg)

class Fn(Expression):
    def __init__(self, formal, body):
        self.formal = formal
        self.body = body
    def accept(self, visitor, arg):
        return visitor.visit_fn(self, arg)

class App(Expression):
    def __init__(self, function, actual):
        self.function = function
        self.actual = actual
    def accept(self, visitor, arg):
        return visitor.visit_app(self, arg)

class Visitor(ABC):
    @abstractmethod
    def visit_var(self, exp, arg):
        pass
    @abstractmethod
    def visit_num(self, exp, arg):
        pass
    @abstractmethod
    def visit_let(self, exp, arg):
        pass
    @abstractmethod
    def visit_fn(self, exp, arg):
        pass
    @abstractmethod
    def visit_app(self, exp, arg):
        pass
class VisitorLocClassifier(Visitor):
    def visit_var(self, exp, arg):
        """
        >>> e0 = Var('v0')
        >>> e0.accept(VisitorLocClassifier(), None)
        set()
        """
        return set()

    def visit_num(self, exp, arg):
        return set()

    def visit_let(self, exp, arg):
        # The let-bound identifier is treated as global; also collect
        # whatever the definition and the body declare.
        globs = {exp.identifier}
        globs |= exp.exp_def.accept(self, arg)
        return globs | exp.exp_body.accept(self, arg)

    def visit_fn(self, exp, arg):
        # Everything bound inside a function is local, so a Fn
        # contributes no global variables.
        return set()

    def visit_app(self, exp, arg):
        """
        >>> e0 = Let('v', Num(1), Let('x', Num(2), App(Var('v'), Var('x'))))
        >>> sorted(e0.accept(VisitorLocClassifier(), None))
        ['v', 'x']
        """
        globs = exp.function.accept(self, arg)
        return globs | exp.actual.accept(self, arg)
In our language, variables declared outside of functions are considered global variables, while
those declared inside functions are local variables. The goal of this visitor is to collect the set
of variables that are global by traversing the structure of the program (which is represented as
an Abstract Syntax Tree, or AST).
Key Concepts
The VisitorLocClassifier class implements the logic to walk through each part of the
program and collect global variables.
1. visit_var(self, exp, arg): For a Var, it does nothing because the classification
only cares about definitions of variables, not where they are used. So it returns an
empty set (set()).
2. visit_num(self, exp, arg): For a Num, since it's just a number, there are no
variables to classify. Like visit_var, it returns an empty set.
3. visit_let(self, exp, arg): This is the key function for classifying global
variables.
○ A Let introduces a new variable identifier in the global scope.
○ The visitor traverses both exp_def (the definition) and exp_body (the body
where the variable is used).
○ The identifier introduced by the Let is added to the set of global variables.
4. visit_fn(self, exp, arg): A Fn (function) defines a local variable formal (the
function parameter). No variables in the function body are considered global, so this
function returns an empty set.
5. visit_app(self, exp, arg): For an App, the visitor checks the function being
applied and the actual argument. It collects any global variables found in these
expressions.
● When the visitor encounters a Let, it assumes that the variable is global, unless it is
inside a function. Thus, variables defined in the top-level Let expressions are global.
● Function parameters and variables inside a function body are local and not included in
the set of global variables.
Conclusion
This visitor traverses the program’s structure, collects all variables that are declared in Let
expressions (and thus global), and ignores function parameters and locally scoped variables
within functions. It ensures that global variables are recognized correctly and doesn’t classify
local variables by mistake.
Why did we need to assume that all variables have different names?
1. The visitor accumulates variable names in a set, which automatically handles duplicates.
For example, if the same variable name appears multiple times, it will only be stored
once in the set.
2. However, the problem arises when you have shadowing—that is, the same variable
name is reused in different scopes. In your language, a variable declared in an inner
scope (such as within a function or a nested Let expression) could have the same name
as a variable declared in an outer scope. When this happens:
○ The inner variable shadows the outer one, meaning the inner variable is the only
one visible within the inner scope.
○ The outer variable is inaccessible while the inner one is in scope.
Example of Shadowing
Here’s an example where variables have the same name but different scopes:
let
val x = 1 (* Outer x: global *)
in
let
val x = 2 (* Inner x: local to this scope, shadows outer x *)
in
x (* Refers to the inner x *)
end
end
The current approach doesn't account for scope properly. It simply adds every variable name
it encounters in a Let expression to the set of global variables, without checking if the variable
is shadowing an existing one.
● If two variables have the same name but are in different scopes, the visitor will treat
them as the same variable because they share the same name.
● In reality, the inner variable should be treated as local, while the outer variable remains
global. But since the visitor only tracks names, it cannot make this distinction.
If you have variables with the same name in different scopes, the visitor might:
1. Incorrectly classify local variables as global, because it cannot tell when a name is
local (shadowing an outer variable).
2. Lose track of the original global variable if it is shadowed by a local one, since both
will be merged into a single entry in the set.
To handle shadowing correctly, you would need to track scopes explicitly in your visitor. When entering a new scope (like in a function or nested Let), you would need to:
1. Keep track of the currently visible variables.
2. Distinguish between variables declared in the current scope and those declared
in outer scopes.
A typical way to do this is by using a symbol table or some form of environment that keeps
track of variables and their scopes as the visitor traverses the program. Another approach is simply to rename all the variables, ensuring that every definition gets a different name (see the sketch after the summary below). That's relatively easy to implement in a language like our dialect of SML.
Summary
● The assumption that all variable names are unique is necessary because the visitor
doesn’t distinguish between variables in different scopes.
● If variable names are reused (shadowing), the visitor will treat them as the same
variable, which can lead to misclassification.
● To handle this more robustly, you would need to add scope-awareness to your visitor,
which can be done using a mechanism like a symbol table.
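The renaming idea can itself be implemented as one more visitor over the AST. Below is a minimal sketch (the class name VisitorRenamer and its fresh helper are illustrative, not part of the original code); it threads an environment mapping source names to fresh names, so that after renaming every binder is unique:

class VisitorRenamer(Visitor):
    def __init__(self):
        self.counter = 0

    def fresh(self, name):
        # Produce a name that no other binder uses.
        self.counter += 1
        return f"{name}_{self.counter}"

    def visit_var(self, exp, env):
        # A use is rewritten to whatever its binder was renamed to.
        return Var(env.get(exp.identifier, exp.identifier))

    def visit_num(self, exp, env):
        return Num(exp.num)

    def visit_let(self, exp, env):
        new_name = self.fresh(exp.identifier)
        new_def = exp.exp_def.accept(self, env)  # the definition sees the outer scope
        new_env = {**env, exp.identifier: new_name}
        return Let(new_name, new_def, exp.exp_body.accept(self, new_env))

    def visit_fn(self, exp, env):
        new_formal = self.fresh(exp.formal)
        new_env = {**env, exp.formal: new_formal}
        return Fn(new_formal, exp.body.accept(self, new_env))

    def visit_app(self, exp, env):
        return App(exp.function.accept(self, env),
                   exp.actual.accept(self, env))

Running e.accept(VisitorRenamer(), {}) first makes the unique-names assumption of VisitorLocClassifier hold by construction.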
Ok, but how can we generate code for the variables? Something that we
will need is to map these variables to memory addresses, right?
Right! And we can do that by adding this capability to our code generator, e.g.:
class Program:
    ...
    def get_address(self, name):
        # Look up the address bound to 'name' (assuming an internal
        # 'addresses' dictionary initialized elsewhere in the class).
        return self.addresses[name]
    def set_address(self, name, addr):
        # Bind 'name' to the memory address (or register) 'addr'.
        self.addresses[name] = addr
The get_address and set_address methods will be necessary for managing the memory
locations of variables during code generation. Let's break down their roles:
1. set_address(name, addr)
This method is responsible for associating a variable's name (name) with a memory address
(addr). When generating code for expressions that introduce new variables (like in
visit_let), you need to allocate space for the variable, and this method stores the address
where that variable is stored (e.g., on the stack or in a static region).
Why is it necessary?
● Variable Definition: When a let introduces a variable, the code generator must remember where it placed the value, so that every later reference finds the same location.
● Consistency: Recording the address at the point of definition guarantees that all uses of the variable read from and write to one agreed-upon place.
2. get_address(name)
This method retrieves the memory address associated with a given variable name (name).
When generating code to access a variable (e.g., in visit_var), you need to know the
memory location of that variable, and get_address provides that information.
Why is it necessary?
● Variable Access: To load or store a value in a variable, you need to know its location in
memory. get_address looks up the address associated with the variable and ensures
the code can access the correct location when referring to the variable.
● Correct Code Generation: If you don't know where a variable is stored, you can't
generate the right instructions to manipulate it (e.g., reading it from or writing it to the
stack). This method ensures the code generator knows exactly where each variable
lives.
Example Workflow:
When you encounter a Let expression (e.g., let x = ...), visit_let allocates memory for
x, computes its address, and calls set_address("x", addr). Later, when x is referenced
(e.g., in a Var expression), visit_var calls get_address("x") to fetch its memory
address, which is used to generate the load/store instruction.
Without these methods, there would be no way to track where variables are stored, which would
make it impossible to generate correct assembly instructions for variable access.
Summary:
These methods are essential for managing the mapping between variable names and their
corresponding memory addresses during code generation.
When a new variable is introduced via a let expression, visit_let is responsible for generating code for its definition, choosing where the variable will live, and recording that decision.
The Workflow:
○ visit_let generates code for the defining expression, allocates a location for the new variable, and records it with prog.set_address.
○ When the variable is later referenced in the code (via Var nodes), visit_var retrieves its memory address using prog.get_address and generates the appropriate load or store instruction to access it.
Variables in programs need to be mapped to specific memory locations so that their values can
be stored and retrieved. The stack is typically used for local variables (in let bindings and
function parameters), while global/static variables are assigned fixed memory locations.
visit_let is the place where this decision-making happens.
In summary:
● visit_let handles allocating memory for let-bound variables and assigns them a
stack or static memory address. It uses prog.set_address to store this information so
that later access (in visit_var) can retrieve the correct memory location using
prog.get_address.
Can you recap what the code generated for a function would look like?
Sure. The figure below shows an example already seen in the last class:
In this example, notice that the generated code reserves a single stack slot, which is used to save the return address.
But, why are we only saving the return address on the stack? What about
the local variables and the parameters that we might have in the function
body?
The reason you're only saving one slot on the stack (for the return address) during function code generation is that the local variables aren't directly allocated space on the stack at this stage. Instead, they are stored in what you called "virtual registers." These virtual registers
represent placeholders for values that will later be mapped to either physical registers or stack
slots during a phase called register allocation.
Virtual Registers
In the code generation phase, compilers typically don't immediately assign physical registers or
stack locations to variables. Instead, variables are represented as virtual registers, which
provide a flexible abstraction for storing intermediate values without worrying about physical
resources like registers or memory. Virtual registers are unlimited in number, unlike physical
registers, which are limited by the hardware.
Register Allocation
Later in the compilation process, during register allocation, the compiler analyzes the virtual
registers and decides where to store each value:
1. Physical Registers: If there are enough physical registers available on the machine, the
values in virtual registers can be directly mapped to physical registers, which is faster for
computation.
2. Stack (spill slots): If there aren’t enough physical registers, some of the values will be
"spilled" to the stack. This means that the compiler reserves space on the stack to store
certain local variables or temporary values that cannot fit in the physical registers.
The return address must always be saved because, when the function finishes, it needs to jump
back to where it was called from. For the local variables and intermediate values, at this stage,
they are treated as virtual registers. Their exact location (whether they end up in physical
registers or the stack) will be determined during the register allocation phase.
Beyond that, saving the return address guarantees that the function returns safely, especially when considering recursive calls, function pointers, and variable lifetimes, as we saw in our last class. Here are the key reasons:
Most computer architectures (like RISC-V) follow a defined calling convention to ensure that
function calls work reliably, especially when multiple functions call each other. This convention
dictates:
● Where the function argument(s) should be placed (often in specific registers or on the
stack).
● Where the return address is saved (usually in a specific register, like ra in RISC-V).
● Where the return value should be placed (typically in a register like a0 in RISC-V).
The calling convention guarantees that a function can expect its argument in a known location,
and the caller can expect the return address and result to be in known locations. This is critical
because the function needs to know where to jump back to after it completes execution. Since
functions can call other functions (or themselves recursively), the return address needs to be
reliably saved and restored.
Since the calling convention needs to be adhered to regardless of the function’s implementation,
the return address and argument are typically placed in physical registers or the stack, not
virtual registers.
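A minimal RISC-V illustration of this convention, using a hypothetical one-argument function inc:

# Caller's side:
li   a0, 41       # the argument goes in a0, as the convention dictates
jal  ra, inc      # call: the return address is written into ra
# ...after the call, the result is in a0

inc:              # Callee's side:
addi a0, a0, 1    # compute the result directly into a0
jr   ra           # jump back to the saved return address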
How would you generate code for the let block, assuming that some
variables are statically allocated?
Here’s the refined visit_let method:
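A sketch of what this method might look like (the helpers prog.add_inst, prog.new_static_slot, prog.static_vars, and the instruction classes Sw and Addi are illustrative assumptions, not part of the original code):

def visit_let(self, exp, prog):
    # 1. Generate code for the definition; d_name is a virtual register.
    d_name = exp.exp_def.accept(self, prog)
    if exp.identifier in prog.static_vars:
        # 2a. Static variable: record its fixed address and store into it.
        prog.set_address(exp.identifier, prog.new_static_slot(exp.identifier))
        prog.add_inst(Sw(d_name, prog.get_address(exp.identifier)))
    else:
        # 2b. Local variable: copy the value into a virtual register
        #     named after the identifier; no stack slot is reserved.
        prog.add_inst(Addi(exp.identifier, d_name, 0))
    # 3. Generate code for the body after the binding is in place.
    return exp.exp_body.accept(self, prog)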
1. Code for exp_def: We generate code for the definition part of the let expression first,
and d_name holds the result (a virtual register) for that value.
2. Static vs. Local Variable:
○ Static variables: We are checking if the variable should be allocated in static
memory (e.g., a global variable). This part is fine because for static variables, we
would store them in a fixed location in memory.
○ Local variables: For local variables, we opt to use a virtual register. This is a
major simplification, as it avoids allocating space on the stack when the register
allocation phase can later decide the best placement for the variable.
3. Code for the body (exp_body): We generate code for the body of the let expression
after handling the binding, which is exactly the correct order.
But now we have two locations for variables: either they are in virtual
registers, or they are statically allocated. How can you produce code for
variables then?
Here is a new version of the visit_var method, which generates code for variables (the static branch is a sketch: prog.static_vars, prog.new_register, prog.add_inst, and the Lw instruction class are assumed helpers):

def visit_var(self, exp, prog):
    if exp.identifier in prog.static_vars:
        # Static variable: load its value from its fixed memory
        # address into a fresh virtual register.
        reg = prog.new_register()
        prog.add_inst(Lw(reg, prog.get_address(exp.identifier)))
        return reg
    else:
        # Local variable, already in a virtual register,
        # just return its register name
        return exp.identifier
The code you’re looking at generates RISC-V assembly instructions for loading the value of a
variable. To understand it, let's break it down into two main cases: static variables and local
variables. Each is treated differently during code generation.
1. Static Variables
A static variable is a variable whose memory location is fixed at a specific address in memory,
meaning it’s not stored in a register but in global memory. To access it, you need to generate
instructions that load the value from the memory address.
2. Local Variables
A local variable, on the other hand, is stored in a virtual register, which means it’s already in a
temporary storage location within the CPU. No memory load is required. The only task is to use
the variable’s virtual register directly, without needing any additional instructions to fetch it.
For local variables, the code simply returns the name of the register where the variable is
stored, because its value is already available. No memory access is needed, which makes local
variable access faster than static variables.
Summary
This code handles variables in two ways: static variables require loading their value from
memory using load instructions (Lw), while local variables are accessed directly from a virtual
register. This differentiation allows the program to optimize access depending on how the
variable is stored.
How does the loader "know" the addresses of these static variables, the
address of the stack, the heap, etc?
This info is actually packaged into the executable object, in a representation called The
Executable and Linkable Format.
To run a program, the operating system (OS) loader needs specific information about the
executable, including:
● Program entry point: The address where the program counter (PC) should start
execution (typically the start of the main function or initialization code).
● Memory layout:
○ Where code (e.g., .text segment) should be loaded.
○ Where initialized data (e.g., .data segment) and uninitialized data (e.g., .bss
segment) should be loaded.
○ Where writable sections like the heap will be allocated.
● Dynamic linking information (if the executable depends on shared libraries, like
libc).
● Permissions: Whether certain memory regions are read-only, writable, or executable.
This information must be organized in a standard way so the OS can properly load the program
into memory.
The ELF format is essentially a "protocol" that provides the rules and structure for packaging
this information into a binary file. When the OS loader encounters an ELF file, it knows exactly
where to look for specific details.
● ELF Header:
○ Describes the type of file (executable, shared library, or object file).
○ Specifies the target architecture (e.g., x86-64, ARM) and entry point address.
● Program Headers:
○ Tell the loader how to map the executable's segments (e.g., .text, .data,
.bss) into memory, their sizes, and permissions.
○ These are critical for runtime execution.
● Section Headers:
○ Provide detailed information for the linker and debugger (e.g., symbolic names,
relocation entries), but are typically ignored by the loader.
● Dynamic Section:
○ Contains information for the dynamic linker (e.g., library dependencies and
symbol resolution).
When a program is executed, the loader proceeds roughly as follows:
1. The OS loader reads the ELF header to verify it's an ELF file and determines what kind
of file it is.
2. It consults the program headers to map the segments into memory:
○ .text into an executable region.
○ .data into a writable region.
○ .bss into a zero-initialized region.
3. It initializes the stack and heap:
○ Stack starts at a predefined high address and grows downward.
○ Heap starts just above the program's data and grows upward.
4. It sets the PC to the entry point specified in the ELF header.
5. If the program uses shared libraries, the loader invokes the dynamic linker (also
specified in the ELF file).
Without a format like ELF, the loader wouldn't know how to:
● Locate and map the program’s code and data into memory.
● Handle initialization (e.g., zeroing the .bss segment).
● Link libraries or resolve symbols dynamically.
● Enforce memory permissions (e.g., executable code should not be writable).
ELF provides a standardized way to encode all this information, ensuring compatibility across
different systems and simplifying the design of OS loaders.
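You can inspect this information yourself with the readelf tool from GNU Binutils:

readelf -h ./a.out   # ELF header: file type, architecture, entry point
readelf -l ./a.out   # program headers: how segments are mapped into memory
readelf -S ./a.out   # section headers: .text, .data, .bss, and friends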
This modularity and standardization make ELF (and similar formats like PE on Windows or Mach-O on macOS) indispensable in modern computing.
Is it correct to say that the "ELF" representation is the last thing the
compiler produces?
Well, more or less. In Linux, the ELF (Executable and Linkable Format) is typically the last
representation that the compiler toolchain produces. However, let me break it down to
clarify the role of ELF and where it fits in the compilation process:
● The compiler proper (e.g., gcc, clang) does not directly produce the final ELF
executable. Instead, it generates:
○ Assembly Code: Textual representation of machine instructions.
○ Object Files: Binary-encoded machine code for individual source files, typically
in ELF format.
● Object files are intermediate products, not yet complete executables.
● After the compiler generates object files, the linker (ld) takes over.
● The linker combines multiple object files, resolves external symbols (e.g., function calls
to libraries), and produces the final ELF executable file.
● This final ELF file includes all the necessary sections: code, data, symbol tables, and
metadata.
● Once the linker generates the ELF file, it is ready to be loaded and executed by the
operating system.
● At this stage:
○ The executable includes instructions for how it should be loaded into memory
(e.g., .text, .data, and .bss sections).
○ The OS loader reads this ELF file and maps the segments into memory for
execution.
● Hence, ELF is the final output of the compiler toolchain before execution.
4. Alternative Outputs
● While ELF is the default output format for Linux, other formats can also be used,
depending on the target system. For instance:
○ PE (Portable Executable) is used in Windows.
○ Mach-O is used in macOS.
○ Flat binaries are possible (e.g., in embedded systems) but lack ELF-like
metadata.
Conclusion
Yes, in Linux, ELF is the final representation produced by the compiler toolchain, but it's
important to note that the full process involves multiple steps: the compiler generates object files
in ELF format, and the linker produces the final ELF executable. This standardized format
ensures compatibility with the Linux OS loader, making it a crucial part of the overall process.
Here's a figure illustrating the whole process:
Where did the ELF format come from?
ELF was introduced with UNIX System V Release 4 (SVR4) in the late 1980s.
Key Contributors
● ELF was developed under the AT&T UNIX System Laboratories (USL), a division of
AT&T responsible for advancing UNIX.
● While specific individuals involved in ELF’s creation are not often named, it was part of a
collaborative effort involving leading engineers at USL and partners like Sun
Microsystems, which contributed to SVR4 development.
Legacy
ELF’s robust design has made it the de facto standard for UNIX-like operating systems, with its
influence extending into modern computing. It is still in active use today, evolving to
accommodate new features like support for modern architectures and advanced linking
techniques.
Is it fair to say that the UNIX Operating System has then shaped the way
memory is organized today, even in non-UNIX systems?
Yes, it is absolutely fair to say that the development of UNIX has profoundly influenced the way
memory is organized in modern operating systems, including those that are not directly based
on UNIX, like Windows. Let's break this down:
UNIX introduced many foundational concepts in memory organization that are now ubiquitous,
including:
● Segmentation of Memory:
○ The division of executable memory into sections like .text (code), .data
(initialized variables), and .bss (uninitialized variables).
○ This segmentation allows efficient memory management, separation of concerns,
and enforcement of permissions (e.g., read-only for .text).
● Process Address Space:
○ UNIX popularized the idea of a process having its own virtual address space,
divided into:
■ Code Segment: For executable instructions.
■ Data Segment: For global and static variables.
■ Heap: For dynamically allocated memory.
■ Stack: For function call frames and local variables.
● Dynamic Linking:
○ Concepts like ld.so and shared libraries were refined in UNIX, enabling modular applications and efficient memory use through shared code.
While some systems are not UNIX-based, they have adopted or adapted many of its memory
management principles due to their effectiveness.
Windows
● Segmented Memory: Windows processes use a similar division into code, data, heap, and stack regions, even though the executable format (PE) differs.
● macOS uses the Mach-O binary format, but its memory organization principles are
largely derived from UNIX, since macOS is based on Darwin, a UNIX-like OS.
● Embedded operating systems often adopt simplified versions of these concepts, such as
distinct sections for code and data, because the modular design is versatile even for
constrained systems.
● Standardization:
○ The POSIX standard, derived from UNIX, defines memory-related APIs (e.g.,
mmap, brk, sbrk) that are used across many operating systems.
● Interoperability: Shared conventions for process layout make it easier to port tools, debuggers, and code across operating systems.
4. Divergence in Windows
While Windows has adopted many concepts pioneered by UNIX, there are key differences:
● PE Format:
○ Unlike ELF, Windows' PE format includes additional constructs for features like
delay-loading of libraries and COM components.
● Registry-Based Configuration:
○ Windows relies more on the Registry for dynamic linking configurations, whereas
UNIX systems often use configuration files and environment variables.
5. Conclusion
UNIX played a foundational role in shaping modern memory organization. Its principles of
memory segmentation, process isolation, and dynamic linking have become best practices in
operating system design, influencing even systems like Windows that are architecturally distinct.
This widespread adoption underscores the robustness and utility of concepts developed during
the evolution of UNIX.
How much do we owe these ideas to Ken Thompson?
Ken Thompson, along with Dennis Ritchie, created UNIX in the late 1960s at Bell Labs. Many foundational ideas in modern memory management were first implemented or refined in UNIX, including:
● UNIX introduced the idea of a process address space with distinct sections:
○ Code Segment: Executable instructions.
○ Data Segment: Global and static variables.
○ Stack Segment: For function calls and local variables.
○ This segmentation allowed better memory protection and management,
influencing how later systems, including Linux, macOS, and even Windows,
manage process memory.
Ken Thompson also contributed to the tools and philosophies that underpin memory
organization. His work indirectly influenced systems that were not UNIX-based, including
Windows:
● The segmentation of memory into code, data, and stack regions became a standard
that even Windows followed in its PE format.
● His development of dynamic linking concepts (via shared libraries) informed how
modern systems load and manage memory for shared code.
While Thompson did not invent all aspects of memory management, his work shaped how these
ideas were implemented:
● The concept of .bss for uninitialized variables originates from early UNIX.
● The UNIX exec system call, which organizes memory for processes during execution,
was designed under his direction.
Legacy
● His work on UNIX not only influenced memory organization directly but also set a design
ethos that carried through to other systems.
● Memory segmentation and executable loading mechanisms have evolved, but they still
echo the principles established in the original UNIX.
Conclusion
Ken Thompson's contributions were fundamental in defining the memory organization principles
used today. While he may not have directly worked on modern formats like ELF or the Windows
PE format, his work on UNIX established the groundwork for concepts like memory
segmentation, process isolation, and dynamic linking—concepts that persist in almost every
modern operating system.
To augment the given operational semantics with pointers, we can add rules that handle taking
the address of variables (&v) and dereferencing pointers (!v). Here's how you can extend the
semantics in your figure:
● The rule EAdr states that, given the context C mapping the variable name to location
loc, taking the address of name results in the memory location loc.
Dereferencing a pointer v means retrieving the value stored at a memory location. Notice that
we are using the SML/NJ syntax to dereference variables (with an exclamation mark).
● The rule EPtr states that, if S stores the value v at memory location loc, dereferencing
loc (i.e., !loc) will retrieve v.
Can you add new types of nodes to our abstract syntax tree, to represent
the operations of taking the address and dereferencing variables?
To add the two new operations, address of (EAdr in the figure) and dereference (EDrf), we can
define two new classes that represent these operations and update the visitor to handle them.
Here’s how to extend your code to include these operations:
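What follows is a minimal sketch in Python. The class and method names (EvalVisitor,
visit_address_of, visit_dereference) are assumptions; the actual code in the book's
repository may use different ones:

class AddressOf:
    """EAdr: the address-of operation, e.g., &(x)."""
    def __init__(self, name):
        self.name = name
    def accept(self, visitor):
        return visitor.visit_address_of(self)

class Dereference:
    """EDrf: the dereference operation, e.g., !e."""
    def __init__(self, exp):
        self.exp = exp
    def accept(self, visitor):
        return visitor.visit_dereference(self)

class EvalVisitor:
    def __init__(self):
        self.context = {}  # maps variable names to memory locations
        self.store = {}    # maps memory locations to values

    def visit_address_of(self, node):
        # EAdr: the address of a variable is the location bound to it.
        return self.context[node.name]

    def visit_dereference(self, node):
        # EPtr: evaluate the expression to a location; read the store there.
        loc = node.exp.accept(self)
        return self.store[loc]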
Notice that in this implementation, we are keeping the context and the memory (the store) as
internal states of the visitor object, instead of passing them as arguments of the visit method.
This design choice has both pros and cons, and which one is "better" often depends on the
specific context, requirements, and programming style. Let's break down some key
considerations.
Advantages of internal state:
1. Simplified Method Signatures: By encapsulating context and store as part of the
visitor's state, we no longer need to pass them explicitly to each visit method, resulting in
cleaner and shorter method signatures.
2. Readability and Maintainability: Having the context and store as internal state of
the visitor can make the code more readable, especially in larger programs. Developers
can see that these values are shared across methods, and there's less
parameter-passing noise in the code.
3. Encapsulation: Encapsulating state within the visitor keeps the state management more
contained. This reduces the risk of accidentally altering the context or store outside
of the visitor’s logic, which is helpful for maintaining consistency in state-dependent
computations.
4. Easier to Extend: Adding new methods that use context and store becomes
simpler, as you don't need to keep passing these around. It can also facilitate adding
more state-tracking features to the visitor, such as logging or tracking the evaluation
steps.
5. Statefulness: If you want to perform a sequence of evaluations where each evaluation
builds on the previous state, this approach naturally supports it. For example, in an
interpreter, it makes sense for the visitor to remember previous variable bindings and
memory updates across multiple evaluation calls.
Drawbacks of internal state:
1. Reduced Transparency: In functional design, passing context and store explicitly
makes data flow very transparent. Anyone reading the code can easily trace how state is
passed and modified at each step. In the object-oriented approach, the state is "hidden"
within the object, which can sometimes make it harder to track changes in state during
evaluation.
2. Less Pure / Functional: Functional programming favors immutability and explicit
state-passing, which can help avoid unintended side effects. By keeping context and
store inside the visitor, the design becomes less functional, potentially introducing side
effects if methods modify the state in unexpected ways.
3. Harder to Reuse Visitor for Multiple Evaluations: If you need to evaluate multiple
independent expressions in parallel or in a multi-threaded context, this approach can be
less flexible. Since the visitor holds the state, you'd need to create separate visitor
instances for each independent evaluation or reset the state each time. The functional
design, by contrast, allows for easy parallelism by passing explicit, independent states.
4. Potential for Hidden Dependencies: When state is kept internally, it can become a
hidden dependency. If you need to pass in an initial context or store or retrieve the final
state, it requires additional methods or properties on the visitor, which could add
complexity.
There isn't a definitive answer, as both designs have their use cases. In this setting, though, the stateful design fits well:
1. Interpreter-style evaluation: You’re evaluating expressions in a way that may involve
updating variables and dereferencing locations in memory. Keeping context and
store as part of the visitor’s internal state aligns with the dynamic nature of such
evaluations.
2. Flexible memory model: Since you're managing a simulated memory with locations
and values, the object-oriented design supports this abstraction naturally. Memory
updates feel more intuitive when encapsulated as internal state rather than being passed
around.
3. Readability: Your code has cleaner method signatures without repeated state
arguments. This makes the evaluator’s purpose clear, which can improve readability in
larger projects.
Ultimately, it’s a trade-off. The object-oriented approach may be better for this setting, but if you
encounter situations where independent evaluations are needed or debugging state flow
becomes difficult, revisiting a functional design might be beneficial.
Can you create some tests for our language with pointers?
Sure, see some tests in the table below:
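In code form, such tests might look like the sketch below. The constructors Assign, Add,
AddressOf, and Num match names that appear later in this chapter; Let, Dereference, and
EvalVisitor are assumed names for the remaining pieces:

# let x = 1 in !&(x) end ==> 1
exp0 = Let("x", Num(1), Dereference(AddressOf("x")))
assert exp0.accept(EvalVisitor()) == 1

# let x = 1 in &(x) := 2 end: the assignment writes 2 into x's location.
visitor = EvalVisitor()
Let("x", Num(1), Assign(AddressOf("x"), Num(2))).accept(visitor)
assert visitor.store[visitor.context["x"]] == 2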
And it would be very easy to have memory violations in this way, just like we have them in C or
in C++. For instance, consider the program below:
#include <stdio.h>

void function() {
    long v[1];
    v[3] = v[3] + 12;
}

int main() {
    int x;
    x = 13;
    function();
    x++;
    printf("%d\n", x);
}
This program might skip the increment of variable x in the main function, as the figure below
shows:
1. Out-of-Bounds Access: In function(), the array v is declared with only one element
(v[0]). When you access v[3], you’re going out of bounds, which is undefined
behavior in C. The C standard does not enforce bounds-checking on arrays, so
accessing v[3] might access or modify memory that happens to be nearby, such as the
memory location of x in main.
2. Skipping x++: On certain systems and compilers, the v[3] access in function might
actually modify x in main if they happen to be located close together on the stack. When
v[3] = v[3] + 12 executes, it could inadvertently overwrite the value of x before the
increment operation (x++). As a result, x++ might not behave as expected because x
has already been altered by function.
3. Platform-Dependent Behavior: This issue might not be immediately apparent on every
system or compiler, as it depends on how memory is laid out in each environment. Some
platforms, like OSX on ARM64 (e.g., Apple M1), might be more susceptible to this
specific behavior due to differences in memory alignment or stack organization.
This is indeed similar to what can happen in the Python evaluator with pointers. If you
accidentally assign to an unintended memory location (e.g., 0 := y in your evaluator), you
might overwrite a different part of the program’s state, leading to unpredictable results.
In contrast, languages like Java do not allow direct manipulation of memory addresses. Java’s
memory model abstracts away pointers and direct memory access, preventing programmers
from obtaining and manipulating raw memory addresses. In Java:
1. References vs. Pointers: Java uses references instead of pointers. A reference is an
abstract handle to an object in memory, but you cannot retrieve the actual memory
address or manipulate it directly. References in Java are "opaque" — you can’t perform
arithmetic on them or access memory at arbitrary locations.
2. Type Safety: Java’s type system is designed to prevent accessing out-of-bounds
memory or misusing object references. You cannot, for example, cast an integer into a
reference or add an integer offset to a reference, as you can with pointers in C.
3. Automatic Memory Management: Java has garbage collection, which further abstracts
memory management from the programmer. In C, you’re responsible for managing
memory (allocating and freeing it), while in Java, the runtime environment handles
memory allocation and deallocation.
4. Strict Array Bounds Checking: Java arrays are objects, and accessing an element
outside the array’s bounds results in an ArrayIndexOutOfBoundsException. This
bounds checking is enforced at runtime, preventing the kind of out-of-bounds access
seen in C.
The strict separation between integers and references in Java's type system, along with the
absence of pointer arithmetic, prevents the kind of vulnerabilities common in C programs. In C,
the ability to manipulate pointers and treat integers as addresses (or vice versa) gives
programmers direct control over memory, but it also makes the language more prone to errors
and security risks.
In Java, there is no way to directly translate this code because you cannot manipulate raw
memory addresses, nor can you typecast integers into references. This restriction makes Java
safer by design, as it prevents many memory-related issues at the cost of the low-level control
that C provides.
● C’s Flexibility: C’s design allows for high-performance code and is well-suited for
low-level programming, like operating systems and embedded systems, where you need
fine-grained control over memory. However, this flexibility requires careful programming
to avoid errors and security vulnerabilities.
● Java’s Safety: Java’s restrictions make it easier to avoid memory errors, which is why
it’s popular for application development, particularly in environments where safety is
essential. However, the lack of direct memory access can limit performance
optimizations and prevent certain low-level operations that are possible in C.
Conclusion
So, yes — Java’s type system and memory model are designed specifically to prevent the kind
of direct memory manipulation that’s common in C. This difference is intentional and contributes
to Java's safety and reliability at the cost of some flexibility and performance.
Well, it seems that we have just added C-style arrays to the language, as in
let x = 1 in &(x)+1 := 2 end?
Yes, you've created a mechanism that resembles C-style arrays! By allowing expressions like
&(x) + 1 := 2, you've essentially enabled pointer arithmetic, where &(x) (the address of x)
can be treated as a base address, and adding an offset (+ 1) simulates accessing an array
element at that offset.
In C, arrays and pointers are closely related, as demonstrated by the program below:
#include <stdio.h>

int main() {
    int v[3];
    v[2] = 3;
    printf("%d\n", v[2]);
    *(v + 2) = 5;
    printf("%d\n", v[2]);
    *(2 + v) = 7;
    printf("%d\n", v[2]);
    2[v] = 11;
    printf("%d\n", v[2]);
}
1. Pointer Arithmetic: In C, v[2], *(v + 2), *(2 + v), and 2[v] are all equivalent
ways to access the third element of the array v (arrays are zero-indexed). This works
because in C, v[i] is defined as *(v + i). The array name v itself acts as a pointer
to the start of the array, and pointer arithmetic allows moving to different indices.
2. Offset Simulation: In your language, &(x) + 1 := 2 does something similar by
allowing addition on the address of a variable (x), so the result &(x) + 1 represents an
offset address. The assignment := 2 then writes 2 to this computed address, just as
v[2] = 3 writes to the third element in a C array.
3. Indexing Notation Flexibility: In C, different syntaxes like 2[v] are valid because
a[b] is actually syntactic sugar for *(a + b). This flexibility comes from the fact that
a[b] and b[a] both translate to pointer arithmetic, and C treats both expressions as
equivalent. This dual interpretation allows quirky syntax like 2[v], even though it’s rarely
used.
Similarities to Arrays
By combining address-of (&) and addition, your language now supports what is effectively array
indexing, as in C. Here’s how each part maps:
● Base Address: &(x) functions like a base pointer or starting address of an "array."
● Offset Access: The addition &(x) + 1 allows moving to an offset, simulating x[1] or
an array-like structure.
● Assignment at Offset: Assign(Add(AddressOf("x"), Num(1)), Num(2))
writes 2 to the computed offset, just like assigning a value to an array index.
While this approach gives you a limited form of arrays through pointer arithmetic, it lacks some
of the functionality of true arrays, such as declared bounds, a fixed element type, and
guaranteed allocation of a contiguous block of elements.
Could you generate code for these new constructs we've added to our
language?
1. AddressOf: This construct should generate the address of a variable. In RISC-V,
addresses are usually handled with pointers in memory. Here, we'll simulate this by
mapping variables to memory addresses within the program's environment.
2. Assign: This construct will store a value in the memory address specified by an address
expression. It will use the sw instruction to store a value into memory at a computed
address.
3. Dereference: This construct will retrieve the value from a memory address. It will use
the lw instruction to load a value from memory at a specified address.
Here’s how you could implement these methods in the CodeGenVisitor class:
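A sketch of the three methods follows, assuming the visitor threads a Program object through
the visit calls, that Program offers add_inst (to append an instruction) and get_address
(defined below), and that next_var_name creates a fresh virtual register. All of these helper
names are assumptions about the surrounding code generator:

def visit_address_of(self, node, prog):
    # Load the variable's memory address into a fresh virtual register.
    reg = self.next_var_name()
    prog.add_inst(f"li {reg}, {prog.get_address(node.name)}")
    return reg

def visit_assign(self, node, prog):
    # Evaluate the address and the value; store the value with sw.
    addr = node.address_exp.accept(self, prog)
    value = node.value_exp.accept(self, prog)
    prog.add_inst(f"sw {value}, 0({addr})")
    return value

def visit_dereference(self, node, prog):
    # Evaluate the address; load the stored value with lw.
    addr = node.exp.accept(self, prog)
    reg = self.next_var_name()
    prog.add_inst(f"lw {reg}, 0({addr})")
    return reg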
Additional Assumptions
To add support for variable address allocation and retrieval, we can introduce new_address
and get_address methods in the Program class. These methods will help manage memory
addresses for variables in the program environment. Here’s how we could implement them:
1. new_address: This method will allocate a new memory address for each variable in the
program. We can use a dictionary (self.__address_map) to store variable-to-address
mappings.
2. get_address: This method will return the address previously allocated to a given
variable, so that code generation can refer to it.
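A minimal sketch of these methods, assuming addresses are handed out sequentially in
4-byte steps:

class Program:
    def __init__(self):
        self.__address_map = {}  # variable name -> memory address
        self.__next_address = 0

    def new_address(self, name):
        # Allocate a fresh address for 'name' on its first use.
        if name not in self.__address_map:
            self.__address_map[name] = self.__next_address
            self.__next_address += 4  # one 4-byte slot per variable
        return self.__address_map[name]

    def get_address(self, name):
        # Retrieve the address previously allocated to 'name'.
        return self.__address_map[name]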
In this new code generator, we still have virtual registers, but they are
mostly temporaries created during code generation. Every variable is now
mapped to memory, right?
Yes, that's exactly right! In this setup, variables are treated similarly to how LLVM handles
"alloca" instructions for stack-allocated variables. By mapping variables directly to memory
addresses, we avoid needing to assign registers to persistent variables; instead, we use virtual
registers as temporary placeholders during code generation.
1. Memory-Mapped Variables: Each variable gets a unique memory address as soon as
it's defined. This mirrors LLVM's approach when we take the address of a variable in
languages like C, where the variable must reside in memory to support address-based
operations. This setup is particularly useful for languages or scenarios where variables
might persist beyond the scope of a single operation or where we need to support
address-based manipulations.
2. Temporary Registers: Virtual registers remain for temporary values that are only
needed briefly during instruction evaluation. These temporary registers help to simplify
expression evaluation and intermediate calculations before moving results to memory if
needed. In LLVM, these would correspond to SSA (Single Static Assignment) values,
which only exist during a particular stage in code generation and don’t persist in memory.
This architecture allows you to manage both transient values efficiently with registers and more
persistent variables using memory, striking a balance between register-based operations and
memory management. It does resemble LLVM's strategy, where values are assigned to
registers or stack-allocated in memory depending on their usage.
Then, with base address plus offsets, we can implement O(1) access to
continuous arrays, right?
Yes, you're correct! Every access in the last line of this program is O(1) because of the way
arrays are laid out in memory with base addresses and fixed offsets:
int static_array() {
    int x[4] = {0, 1, 2, 3};
    int y[3][2] = {{0, 1}, {2, 3}, {4, 5}};
    int z[4][3][2] = {{{0, 1}, {2, 3}, {4, 5}}, {{6, 7}, {8, 9}, {10, 11}},
                      {{12, 13}, {14, 15}, {16, 17}}, {{18, 19}, {20, 21}, {22, 23}}};
    return x[2] + y[2][1] + z[3][2][1];
}
1. Base Address + Offset Calculation: For any statically-sized, multi-dimensional array in
C, memory is allocated as a single contiguous block. This layout allows each element to
be accessed in O(1) time by calculating an offset from the base address. The compiler
computes this offset based on the array's dimensions and the index values, so there's no
need to traverse the array to find elements.
2. Accessing Elements in Constant Time:
○ For x[2], the compiler calculates the memory address by adding 2×sizeof(int) to
the base address of x.
○ For y[2][1], it calculates the address as: base of y + (2×2+1) × sizeof(int).
Here, 2 and 1 are the indices, and 2 is the second dimension size of y.
○ For z[3][2][1], the address is calculated by: base of z + (3×3×2+2×2+1) ×
sizeof(int) where 3, 2, and 1 are the indices, and 4, 3, and 2 are the sizes of
each dimension in z.
3. Efficiency with Contiguous Memory: Since all elements are stored contiguously, each
element's position is computed directly, making all accesses O(1). This efficiency in
memory access is one reason why array indexing in C is so fast.
In summary, the base address plus offset strategy allows each element access to be
constant-time, making operations like those in x[2] + y[2][1] + z[3][2][1] O(1).
For an n-dimensional array A with dimensions d1, d2, …, dn, the memory address of the element
at indices [i1, i2, …, in] is given by:

Address(A[i1][i2]…[in]) = Base(A) + ( Σ_{k=1}^{n} i_k × Π_{j=k+1}^{n} d_j ) × sizeof(element)
where Base(A) is the array's starting address, i_k is the index in dimension k, and d_j is the size of dimension j.
This formula works by "flattening" the multi-dimensional array into a single contiguous block in
memory. The innermost dimension contributes the smallest increments, while the outermost
dimension contributes the largest.
Consider a 3D array A with dimensions d1=4, d2=3, and d3=2 (i.e., int A[4][3][2] in C).
Let’s find the address of A[i1][i2][i3].
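Plugging the dimensions into the formula (and assuming 4-byte ints), the offset of
A[i1][i2][i3] is:

Address(A[i1][i2][i3]) = Base(A) + (i1 × (d2 × d3) + i2 × d3 + i3) × 4
                       = Base(A) + (i1 × 6 + i2 × 2 + i3) × 4

For instance, A[3][2][1] lives at byte offset (3 × 6 + 2 × 2 + 1) × 4 = 23 × 4 = 92 from the base of A.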
This formula can be applied to any dimension count, making it versatile for accessing elements
in any multidimensional array stored in a contiguous block of memory. The figure below
illustrates how the formula is applied on different array shapes:
Can you show how the indices are computed using an actual assembly
file?
Sure! Let's consider the C file below, and the corresponding ARM assembly file produced by
clang -O1:
The parameters are passed on the stack, and the assembly code performs calculations to
access the correct memory locations for x[a], y[b][c], and z[d][e][f].
The dimensions A, B, C, and indices a, b, c, d, e, f are loaded from the stack into registers.
2. Accessing x[a]
ldr w14, [x3, w6, sxtw #2] ; w14 = x[a], using x3 (base address of x) and w6 (a)
● This line uses the base address of x in x3 and the index a (stored in w6).
● The sxtw modifier sign-extends the 32-bit index to 64 bits, and the #2 shift scales it by
4 bytes (the size of an int) to calculate the byte offset.
● The element x[a] is loaded into w14.
3. Accessing y[b][c]
4. Accessing z[d][e][f]
5. Final Computation
● y[b][c], z[d][e][f], and x[a] are summed up and stored in w0, which is the return
register for the result.
Each access follows the general formula we discussed, where indices are multiplied by the
sizes of the sub-dimensions to calculate their byte offsets within the memory. This structure
allows each access to be performed in constant time, with multiplications and additions that are
derived from the array dimensions.
This example highlights how each array element access in C can be efficiently compiled into
ARM assembly, using address arithmetic based on the array dimensions and index values.
All that assumes that elements in each row are stored consecutively in
memory. Is that really the case?
Yes, C does store arrays in a row-major order, meaning that elements in each row are stored
consecutively in memory. In row-major order, an entire row is laid out in memory before moving
to the next row. For example, a 2D array y[A][B] in C is stored such that y[0][0], y[0][1],
..., y[0][B-1] are followed by y[1][0], y[1][1], and so on. This ordering is a convention in
C and C-derived languages (like C++ and Java; NumPy arrays in Python also default to
row-major order).
Column-Major Order
In contrast, column-major order stores elements in each column consecutively. So, in a 2D array
y[A][B], y[0][0], y[1][0], ..., y[A-1][0] would be stored first, followed by y[0][1],
y[1][1], and so on. This arrangement is common in languages designed for mathematical and
scientific computing. The figure below illustrates the difference:
● Fortran: Fortran has stored arrays in column-major order since its earliest versions, and
numerical libraries written in it inherit this convention.
● MATLAB: MATLAB also uses column-major order, as it's geared toward matrix
operations where column-oriented access patterns are often more efficient.
Row-major vs. column-major ordering can affect performance, depending on how a computation
traverses its data.
In languages like Fortran or MATLAB, where matrix and linear algebra operations are common,
column-major order is preferred because it optimizes memory access patterns for these
applications.
So, that's why the order that we use to iterate over arrays is so important,
right? I suppose that has an impact on cache locality.
That's correct! The program below illustrates this difference:
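A program in that spirit might look like the sketch below (the array size and the timing
harness are assumptions):

#include <stdio.h>
#include <time.h>

#define N 4096
static int m[N][N]; /* ~64 MB: much larger than any cache level */

int main() {
    long sum = 0;
    clock_t start = clock();
    for (int i = 0; i < N; i++)       /* row-major traversal: the inner   */
        for (int j = 0; j < N; j++)   /* loop walks consecutive addresses */
            sum += m[i][j];
    printf("row-major:    %.3fs (sum = %ld)\n",
           (double)(clock() - start) / CLOCKS_PER_SEC, sum);

    sum = 0;
    start = clock();
    for (int j = 0; j < N; j++)       /* column-major traversal: the inner */
        for (int i = 0; i < N; i++)   /* loop jumps N ints per access      */
            sum += m[i][j];
    printf("column-major: %.3fs (sum = %ld)\n",
           (double)(clock() - start) / CLOCKS_PER_SEC, sum);
    return 0;
}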
The difference in timing between row-major and column-major summation in your program is
due to the way data is accessed in memory and how it interacts with the cache.
In C, arrays are stored in row-major order, meaning that elements in each row are stored
consecutively in memory. When you iterate over the array in row-major order (with the outer
loop iterating over rows and the inner loop iterating over columns), you're accessing elements in
the same row consecutively, which aligns with how the data is stored in memory.
Because modern CPUs load data from memory in blocks (cache lines), accessing memory
sequentially in the same row means fewer cache misses. The CPU can load an entire cache
line containing multiple elements at once, and as long as the accesses remain within that line,
the elements are readily available in cache.
When you switch to column-major order (with the outer loop iterating over columns and the
inner loop iterating over rows), you're effectively jumping from one row to the next for each
element in the column. This results in accessing memory locations that are not contiguous,
causing frequent cache misses. Every time you move to a new row in the column, the CPU has
to load a new cache line, leading to more memory accesses and slower performance.
In your example, this difference is due to two factors. First, the increased number of cache misses in the
column-major traversal, where the CPU has to fetch data from main memory more often.
Second, the possibility of vectorizing the code in the row-major version of the program. We shall
talk a bit about vectorization in our class about compiler optimizations.
When working with large arrays in languages that use row-major order (like C), accessing data
in a row-major pattern (inner loop over columns, outer loop over rows) is typically much faster
due to better cache utilization. In contrast, languages like Fortran, which store arrays in
column-major order, would benefit from column-major access patterns. Understanding the
memory layout and access patterns in your chosen language can have a substantial impact on
performance, especially for large data sets.
The difference in performance between the IKJ and IJK loop orders likely arises from how
each ordering affects memory access patterns, which can influence cache efficiency.
2. Memory Access Patterns:
○ Accessing elements in A row by row, as done in IKJ, can minimize cache misses
compared to IJK, which might access elements in a less contiguous manner
depending on the size of Y and the row layout.
3. Loop Interchange Optimization:
○ Some loop orders naturally lend themselves to compiler optimizations. The IKJ
loop order may better align with optimizations that the compiler can apply, leading
to faster execution times.
4. Dependency Patterns:
○ By accessing C[i][j] in the innermost loop, IKJ minimizes dependencies on
other memory locations outside of the current calculation. In IJK, the
accumulation in C[i][j] happens after the innermost loop completes, which
might introduce slight inefficiencies.
In summary, IKJ is generally more cache-friendly, particularly for large matrices, because it
minimizes cache misses by reusing C[i][j] and accessing A in a sequential memory order.
This could explain why it performs faster than the IJK ordering in your tests.
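For reference, the two orderings can be sketched as follows (the matrix names and the fixed
size N are assumptions):

#define N 512

/* IJK: the innermost loop walks B column-wise (poor locality). */
void matmul_ijk(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* IKJ: A[i][k] stays fixed in the inner loop, and both B and C are
   walked row-wise, matching C's row-major layout. */
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i][j] += A[i][k] * B[k][j];
}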
Loop interchange is not always safe and can only be applied when certain conditions are met.
Specifically, we can interchange loops only if it does not alter the semantics (i.e., the result) of
the code. The main considerations are the loop bounds (they must not depend on each other)
and the data dependences between iterations.
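As a sketch, the first (safe) case is an element-wise computation such as:

/* Every iteration touches only its own A[i][j], B[i][j], and C[i][j]. */
for (int i = 0; i < n; i++)
    for (int j = 0; j < m; j++)
        C[i][j] = A[i][j] + B[i][j];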
In this example, we can safely interchange the i and j loops, as each iteration only accesses
A[i][j], B[i][j], and C[i][j], all of which are independent of each other.
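The second case involves a loop bound that depends on the outer index and a cross-row
dependence; a sketch:

/* The bound of j depends on i, and each iteration reads a value
   produced in the previous row of C. */
for (int i = 1; i < n; i++)
    for (int j = 0; j <= i; j++)
        C[i][j] = C[i-1][j] + 1;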
In this case:
● The range of j depends on i, so interchanging the loops would change the number of
iterations in each loop, resulting in incorrect behavior.
● Furthermore, there’s a data dependency on C[i-1][j], which means each iteration
relies on the previous row of C being computed. This dependency would be violated if we
interchanged i and j.
Summary
Modern compilers can sometimes perform loop interchange automatically if they detect it is both
safe and beneficial, but understanding these principles allows programmers to manually apply it
where necessary or to guide the compiler to optimize effectively.
To understand tiling, let’s look at a matrix multiplication example. In a naive implementation, the
entire matrices A and B might be accessed in ways that cause many cache misses, especially if
the matrices are too large to fit in the cache. By tiling, we divide the computation into smaller
blocks that fit in the cache, reducing the need to reload data.
With tiling, we process the matrices in sub-blocks, so that data loaded into the cache for one tile
can be reused within that tile before moving to the next one. A tiled version might look like this:
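As a sketch (assuming square n-by-n matrices and a TILE_SIZE chosen to fit the cache):

#define TILE_SIZE 32

void matmul_tiled(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int ii = 0; ii < n; ii += TILE_SIZE)
        for (int kk = 0; kk < n; kk += TILE_SIZE)
            for (int jj = 0; jj < n; jj += TILE_SIZE)
                /* Multiply one TILE_SIZE x TILE_SIZE block: */
                for (int i = ii; i < ii + TILE_SIZE && i < n; i++)
                    for (int k = kk; k < kk + TILE_SIZE && k < n; k++)
                        for (int j = jj; j < jj + TILE_SIZE && j < n; j++)
                            C[i][j] += A[i][k] * B[k][j];
}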
In this example:
● We split the loops into tiles of size TILE_SIZE x TILE_SIZE, processing each
sub-block independently.
● By processing smaller chunks that fit in cache, we increase the likelihood that A, B, and
C values will stay in the cache until they are no longer needed.
For a compiler to apply tiling automatically, it needs to determine whether tiling is both safe and
beneficial for the specific code being optimized. The benefits and challenges below summarize
the main considerations.
Benefits:
● Improved Cache Utilization: Tiling enhances data locality, allowing data to remain in
the cache longer. This reduces cache misses and improves memory bandwidth usage.
● Higher Performance: By accessing data in blocks, tiling can significantly speed up
computations on large data sets, particularly for memory-bound applications like matrix
multiplication.
Challenges:
● Choosing the Optimal Tile Size: The ideal tile size depends on the hardware’s cache
size, cache line size, and associativity. Picking a tile size that’s too large or too small can
negate the benefits of tiling.
● Increased Code Complexity: Tiling can make loop structures more complex, which
may complicate debugging and maintenance. Some compilers may avoid tiling if the
code complexity outweighs the performance benefits.
● Alignment Constraints: In some cases, tiled loops may introduce alignment issues,
particularly if tile boundaries do not align perfectly with data structures. This can
introduce overhead when handling edge cases.
Summary
Tiling is a powerful optimization for improving cache efficiency in programs with large,
multi-dimensional data. While compilers can apply tiling automatically, they must carefully
analyze data dependencies, loop structures, and cache characteristics to determine when tiling
is both safe and beneficial. Some high-level compilers and libraries (like BLAS for linear
algebra) use tiling extensively, but compilers like GCC or LLVM typically require complex
analyses or profile information to apply it effectively.
Can you show how tiling would work on our matrix multiplication example?
Sure, we can apply tiling to matrix multiplication. The code below shows two tiled versions: in
the first, only the innermost loop is tiled; in the second, all three loops are tiled:
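A sketch of the two versions (the TILE_SIZE constant and the square-matrix layout are
assumptions):

#define TILE_SIZE 32

/* Version 1: only the innermost loop is tiled. */
void matmul_tile_inner(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int jj = 0; jj < n; jj += TILE_SIZE)
                for (int j = jj; j < jj + TILE_SIZE && j < n; j++)
                    C[i][j] += A[i][k] * B[k][j];
}

/* Version 2: all three loops are tiled. */
void matmul_tile_all(int n, double A[n][n], double B[n][n], double C[n][n]) {
    for (int ii = 0; ii < n; ii += TILE_SIZE)
        for (int kk = 0; kk < n; kk += TILE_SIZE)
            for (int jj = 0; jj < n; jj += TILE_SIZE)
                for (int i = ii; i < ii + TILE_SIZE && i < n; i++)
                    for (int k = kk; k < kk + TILE_SIZE && k < n; k++)
                        for (int j = jj; j < jj + TILE_SIZE && j < n; j++)
                            C[i][j] += A[i][k] * B[k][j];
}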
Usage Notes
Both of these versions use tiles to improve cache efficiency by accessing chunks of memory in a
way that increases the likelihood of reuse within each tile, reducing cache misses. Adjusting the
tile size can further improve performance depending on the matrix size and cache structure of
your system.
● When you declare a struct, the fields are stored in memory sequentially, in the order they
are declared. For example, in this struct:
struct Point {
    int x;
    int y;
};
● if Point is stored at a memory address 0x1000, then x will be stored at 0x1000, and y
will follow immediately after at 0x1004 (assuming int is 4 bytes).
● Unlike arrays, structs may have padding added between fields to align them according
to the system's memory alignment requirements. This alignment helps improve
performance, as misaligned data accesses can be slower on many architectures.
● For instance, consider this struct:
struct Mixed {
    char a;
    int b;
    char c;
};
● Depending on the compiler and platform, this struct might have padding between a and
b, as well as between b and c, resulting in a larger memory footprint than the raw sum of
each field's size. For example, a might occupy 1 byte, then 3 bytes of padding might
follow, and b will start on a 4-byte boundary.
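These offsets can be inspected with the standard offsetof macro; the values in the comments
are typical for a platform with 4-byte int alignment, not guarantees:

#include <stdio.h>
#include <stddef.h>

struct Mixed {
    char a;
    int  b;
    char c;
};

int main() {
    printf("sizeof(struct Mixed) = %zu\n", sizeof(struct Mixed)); /* often 12, not 6 */
    printf("offset of b = %zu\n", offsetof(struct Mixed, b));     /* often 4 */
    printf("offset of c = %zu\n", offsetof(struct Mixed, c));     /* often 8 */
    return 0;
}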
● Arrays: All elements are of the same type and size, so the address of any element can
be computed by simple arithmetic: address = base_address + index *
element_size.
● Structs: The fields within a struct can have different sizes, so the address of each field is
specific to its offset within the struct. The offset is determined by the order of fields and
any padding applied by the compiler.
● Structs are accessed by field names, not by indices as with arrays. This makes working
with structs more flexible when different data types are involved.
● For example, given struct Point p;, you access fields with p.x and p.y, and you
don’t have to worry about calculating the offsets manually.
5. Arrays of Structs
● When you create an array of structs, such as struct Point points[10];, each
Point in the array is stored consecutively in memory, much like an array of arrays would
be. However, within each struct instance, padding might still be present, which can
impact the memory layout.
Summary
While both arrays and structs in C are stored contiguously in memory, the key differences are:
● Arrays are contiguous sequences of identical elements, while structs can contain
different types and may include padding for alignment.
● Arrays allow element access by index, while structs use field names for access, with
fixed offsets for each field.
This contiguous layout and predictable access pattern make structs similar to arrays in terms of
memory organization but with additional complexity due to padding and mixed data types.
That means that accesses to fields within a struct are O(1), just like in
arrays?
Accesses to fields within a struct are indeed O(1), just like array accesses. In fact, struct field
access can sometimes be even faster than array access, because there is no need for
multiplication.
Here’s why:
1. Struct Access Uses Fixed Offsets:
○ When you access a field such as p.x, the compiler
translates that to a memory address calculation based on a fixed offset from the
base address of p.
○ For example, if x is at offset 0 and y is at offset 4 in struct Point, accessing
p.y will simply involve fetching the value at p + 4 (no need for runtime offset
calculations).
2. Array Access Involves Multiplication:
○ In arrays, even though access is also O(1), the memory address of an element is
calculated at runtime as base_address + index * element_size. The
index * element_size multiplication is necessary because the index
is not fixed at compile time (it can vary at runtime). This multiplication can
add a tiny bit of overhead, especially in cases where the element size is not a
power of two and cannot be optimized by the compiler.
3. Efficient Struct Access:
○ Struct access is often as fast as, or even slightly faster than, array access
because the compiler directly knows the memory offsets of fields. For example,
accessing p.x or p.y in struct Point doesn’t require index calculations or
multiplications, just simple addition based on precomputed offsets.
Example Comparison
struct Point {
    int x;
    int y;
};

struct Point p;
p.x = 10; // Access using a fixed offset
The compiler directly translates p.x access to an instruction that refers to p’s base address with
an offset, without any need for multiplication.
int arr[10];
arr[i] = 10; // Access with runtime index calculation
Summary
● Both struct and array accesses are O(1), but struct field access is often slightly
simpler and potentially faster, as it avoids runtime multiplication.
● No Multiplication: For structs, the offsets are determined at compile time, so each field
access is a simple addition rather than an index-based calculation.
This is part of why structs are well-suited for data with fixed layouts where different types are
used, as they allow fast, predictable access patterns without the need for indexing or runtime
calculations.
So, does that mean that you can access the fields of a struct using an
offset, instead of the field's name?
Yes, that's possible. The program below initializes the fields of a struct by manipulating
memory directly, bypassing the usual field names:
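The idea can be sketched as follows, using memcpy and offsetof to write through raw offsets
(the struct name and the values are illustrative):

#include <stdio.h>
#include <stddef.h>
#include <string.h>

struct Point {
    int x;
    int y;
};

int main() {
    struct Point p;
    char *base = (char *)&p;  /* treat the struct as a raw block of bytes */
    int ten = 10, twenty = 20;
    /* Write to the fields through their offsets, bypassing the names: */
    memcpy(base + offsetof(struct Point, x), &ten, sizeof(int));
    memcpy(base + offsetof(struct Point, y), &twenty, sizeof(int));
    printf("%d %d\n", p.x, p.y); /* 10 20 */
    return 0;
}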
In this example, we are leveraging struct memory layout and field offsets to directly modify
the contents of a struct in memory.
Using this technique, you're effectively doing "manual memory management," treating the struct
as a raw memory block. This is powerful but requires precise knowledge of struct layout and
should be used carefully.
For GCC and Clang, you can use the __attribute__((packed)) directive to pack the
struct, which tells the compiler to eliminate padding between fields.
#include <stdio.h>

struct __attribute__((packed)) MyStruct {
    char a;   /* 1 byte  */
    double b; /* 8 bytes */
    int c;    /* 4 bytes */
};

int main() {
    struct MyStruct s;
    char *base = (char *)&s;
    printf("size = %zu, base = %p\n", sizeof(struct MyStruct), (void *)base);
    return 0;
}
In this version, the total size of MyStruct is 13 bytes: the raw sum of the field sizes
(1 + 8 + 4 = 13), with no padding between them.
Important Considerations
Using packing, you eliminate padding, allowing exact control over memory layout, but it’s best
used sparingly due to the potential performance impact.
Is there a general theory that helps us understand how loops can be safely
transformed?
Yes, there is one such theory: the Polyhedral Model, or Theory of Polytopes! This is a
mathematical framework often used to reason about and optimize loop nests in programs. It
provides a way to model computations and data accesses in loops as geometric objects,
enabling systematic exploration of transformations to improve performance on modern
hardware. Let's break it down and relate it to your question about loop transformations.
At its core, the polyhedral model represents iterations of loop nests and their data accesses
using integer points within multi-dimensional spaces. These spaces, called iteration domains,
are bounded by systems of linear inequalities.
1. Iteration Domain: Each loop nest is modeled as a convex polytope, where each
dimension corresponds to a loop index (see the example after the list of transformations
below).
2. Access Functions: Array accesses are modeled as affine functions of the loop indices,
describing which memory cells each iteration touches.
3. Dependences: Relations between iterations that constrain the orders in which they can
execute.
Within this framework, classic loop transformations become geometric operations:
1. Loop Fusion: Merging two loops into a single loop can be visualized as combining their
iteration domains into a single polytope. This can improve data locality by reducing the
distance between dependent operations.
2. Loop Tiling (or Blocking): Divides the iteration space into smaller, rectangular tiles
(sub-polytopes) to improve cache reuse by focusing computations on smaller blocks of
data at a time.
3. Loop Skewing: Modifies the iteration domain by applying an affine transformation to
change the execution order, useful for parallelism or eliminating dependencies.
4. Loop Parallelization: Splits the iteration domain into independent regions
(sub-polytopes) that can be executed in parallel, guided by the dependence analysis.
5. Loop Interchange: Rotates the polytope, changing the nesting order of loops. This can
optimize memory access patterns.
6. Loop Unrolling: Expands iterations within the polytope to expose more instructions for
optimization or pipelining.
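For instance, the loop nest "for (i = 0; i < N; i++) for (j = 0; j <= i; j++)" has the
triangular iteration domain {(i, j) : 0 <= i <= N-1, 0 <= j <= i}: a polytope bounded by the linear
inequalities i >= 0, i <= N-1, j >= 0, and j <= i. Every transformation listed above is an operation
on such sets of integer points.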
Many loop transformations can indeed be explained with the polyhedral model because they
boil down to manipulating iteration domains or access functions. By visualizing loops
geometrically, we gain clarity on how changes impact dependencies, parallelism, and
performance.
Beyond the polyhedral model, Paul Feautrier made important contributions to broader areas in
computer science.
Legacy
Paul Feautrier's work is foundational in the field of loop optimization and automatic
parallelization, particularly in the context of polyhedral theory. His ideas continue to underpin
many modern compiler frameworks that aim to optimize performance-critical code for parallel,
vector, and memory-efficient execution.
Comparison
Aspect | Procedural Style | Object-Oriented Style
Data and Behavior | Data (Node) is passive; functions operate on it. | Data and behavior are encapsulated in objects.
Error Handling | Explicit checks (NULL) with manual errors. | Exceptions for invalid operations.
Extendability | Adding new types of nodes is cumbersome. | Adding new node types is easy via inheritance.
Both implementations achieve the same functionality, but the OO approach adheres to the
principle of “intelligent types” and makes extending or modifying the program more intuitive. On
the other hand, the procedural approach may be simpler and more efficient in cases where
polymorphism and extensibility are not needed.
In OOP, because objects know how to perform actions related to their type, the
implementation details of these actions can be hidden from the outside world. This is different
from procedural programming, where external functions operate on data structures, and you
must ensure that those functions are always used correctly.
In a procedural style, the internal structure of the type is exposed, and external functions must
manipulate the data directly. For example:
#include <stdio.h>

typedef struct {
    int elements[100];
    int size;
} List;

// An external function reads the struct's fields directly:
int size(List *list) {
    return list->size;
}

// Main function
int main() {
    List myList = {{1, 2, 3}, 3};
    printf("Size: %d\n", size(&myList));
    return 0;
}
Here, the internal structure of List is exposed: any code can read or modify elements and
size directly, and functions like size() must be kept consistent with that representation by hand.
In OOP, the details of how the size is stored or computed can be hidden from the user:
#include <iostream>

class List {
private:
    int elements[100];
    int size;
public:
    List() : size(0) {}
    void add(int value) { elements[size++] = value; }
    int getSize() const { return size; }
};
// Main function
int main() {
    List myList;
    myList.add(1);
    myList.add(2);
    myList.add(3);
    std::cout << "Size: " << myList.getSize() << std::endl;
    return 0;
}
Here:
● The data (elements and size) is private, meaning the user cannot access it directly.
● The type itself (List) knows how to handle its data through the methods add() and
getSize().
Key Takeaway
Encapsulation is so natural in OOP because all actions on a type are internal to the type
itself. The type acts as a self-contained unit that "knows" how to handle its data.
By encapsulating both data and behavior, OOP simplifies the process of designing robust and
reusable components.
Modifying a program typically involves two costs:
1. Changing the module you want to modify: This is the direct cost of updating the part
of the code that implements a feature.
2. Adjusting the effects of the modification on other modules: This happens when your
change affects other parts of the program, causing a cascade of updates.
Encapsulation, a key principle of object-oriented programming, helps eliminate the second cost
by isolating the implementation details of a module from the rest of the program. This is
achieved by exposing a stable interface while hiding the inner workings of a module.
In C, the length of a string is determined by finding the position of the special character \0 (null
terminator) in memory. Programs have two main ways to find a string's length: they can call the
library function strlen(), or they can scan memory themselves, looking for the \0 byte.
Programs that scan memory directly depend on the representation of strings; if that
representation ever changed, they would break. In contrast, programs that consistently use
strlen() would remain unaffected. Why? Because the interface (strlen) stays the same,
even if the implementation of how the length is determined changes.
Encapsulation in Action
Encapsulation is a way to enforce this separation between interface and implementation. When
a program interacts only with a public interface — like strlen() — and doesn’t rely on
hidden details, we gain several benefits:
Think of an encapsulated module as a black box:
● From the outside, you only see the buttons and levers (the public interface).
● You don’t know or care what machinery is inside the box or how it works.
● As long as the buttons and levers don’t change, the black box can be updated,
optimized, or completely redesigned internally, and everything that interacts with it will
continue to work.
Now imagine a system without a black box: if every user could tinker with the internal machinery
directly, any change would require you to ensure that all users’ code still works. This is what
happens in C when people bypass strlen() and rely on \0.
class String {
private:
    char* data;
    size_t length;
public:
    size_t getLength() const {
        return length;
    }
};
Here:
● The data and length are private, so no external code can directly modify or access
them.
● The program must use getLength() to retrieve the length, which means the
implementation of how length is stored can change freely (e.g., we could calculate it
dynamically or store it as a property).
By enforcing this structure, object-oriented languages make it easier to write code that is robust
to changes and cheaper to maintain.
Key Takeaway
Encapsulation is a powerful tool that allows us to write software that is easier to maintain, adapt,
and future-proof.
1. Instantiable Namespaces
At the heart of OOP is the ability to create instantiable namespaces: constructs such as
classes or structs that combine data and methods. Each such namespace can be instantiated
many times, and each instance carries its own copy of the state. For example, in a class like
Car, we might bind attributes such as speed and methods such as accelerate():
class Car {
public:
    int speed;
    void accelerate() { speed += 10; }
};
Each instance of Car represents a specific object with its own state and behaviors.
2. Late Binding
Late binding is the ability to determine, at runtime, the specific implementation of a behavior
(method) to invoke on an object. This feature enables polymorphism, one of the most powerful
tools in OOP:
● It allows different objects to respond in their own way to the same message (method
call).
● Programmers can write code that operates on general types (e.g., base classes or
interfaces) without needing to know the specific type of the object at compile time.
For example:
class Animal {
public:
    virtual void makeSound() { std::cout << "Some generic sound\n"; }
};

class Dog : public Animal { // a Cat subclass would be defined analogously
public:
    void makeSound() override { std::cout << "Woof!\n"; }
};

void interactWithAnimal(Animal* a) {
    a->makeSound(); // Late binding decides the behavior at runtime.
}
Here, calling interactWithAnimal with a Dog or Cat object leads to different outputs
because the actual method implementation is determined during execution.
These two features work together to enable the key benefits of OOP:
● Instantiable namespaces let us group data and methods into objects, representing
real-world entities and encapsulating complexity.
● Late binding ensures flexibility by allowing programs to handle different types of objects
in a unified way, supporting extensibility and polymorphism.
Without either of these, the fundamental advantages of OOP — such as code reuse, modularity,
and flexibility — are severely diminished. While other concepts like inheritance or encapsulation
are common in OOP, they are not strictly essential and depend on the specific language or
design.
Key Takeaway
In short, OOP rests on two essential features:
1. Instantiable namespaces for creating objects that combine data and behavior.
2. Late binding for determining object-specific behavior dynamically at runtime.
These features form the foundation of OOP and distinguish it from other programming
paradigms. They allow us to design systems that are more modular, adaptable, and expressive,
providing a powerful way to model the world in code.
The essence of object-oriented programming (OOP) lies in the encapsulation of data and
behavior, rather than in the specific syntax or constructs provided by a language. While
languages like C++, Java, and Python provide first-class support for object-oriented features
through constructs such as classes and structs, the core idea of combining state (data) with
behavior (functions) into "objects" can be implemented in many ways, even in languages that
don't explicitly support OOP. The figure below shows some examples:
Broader Implications:
The ability to simulate OOP in non-OOP languages reinforces the idea that OOP is a paradigm,
not a language feature. It showcases the universality of fundamental programming concepts,
such as the binding of state to behavior and the hiding of implementation details behind an
interface.
In practice, though, the choice to use a formal OOP syntax often depends on factors like ease of
maintenance, team familiarity with the paradigm, and available tooling. However, as our
example demonstrated, languages with even minimal syntactical support for functions and data
structures can implement OOP effectively.
Let's create a simple class in C++. How would it look in C?
Well, there is not much to it, if the class has no virtual methods. Here's an equivalent
implementation of a simple class in C++ and in C:
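As a sketch of the correspondence (the class name Counter and its methods are illustrative,
not the book's actual example):

#include <stdio.h>

/* C++ version (no virtual methods):
 *   class Counter {
 *     int count;
 *   public:
 *     Counter() : count(0) {}
 *     void inc() { count++; }
 *     int get() const { return count; }
 *   };
 */

/* Equivalent C: the object becomes a struct, and each method becomes a
 * function that receives the "this" pointer explicitly. */
typedef struct {
    int count;
} Counter;

void Counter_init(Counter *self) { self->count = 0; }
void Counter_inc(Counter *self)  { self->count++; }
int  Counter_get(const Counter *self) { return self->count; }

int main() {
    Counter c;
    Counter_init(&c);
    Counter_inc(&c);
    printf("%d\n", Counter_get(&c)); /* 1 */
    return 0;
}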
That does not explain virtual methods though. What are virtual methods?
Virtual methods are a key feature of object-oriented programming that allow programs to
achieve dynamic dispatch, enabling flexibility, extensibility, and elegant design.
A virtual method is a function or method in a base class that can be overridden by derived
classes. When a method is marked as virtual, the method call is determined at runtime, based
on the type of the actual object (not the type of the reference or pointer). This is what we call
dynamic dispatch.
● If a method is virtual, the runtime system decides which implementation of the method
to invoke based on the actual type of the object.
● If a method is not virtual, the compiler determines the method to call at compile time,
based on the type of the reference or pointer.
Examples
In C++, you must explicitly declare a method as virtual in the base class for it to support
dynamic dispatch. If you don't, the method is resolved statically (at compile time).
#include <iostream>
using namespace std;

class Animal {
public:
    virtual void speak() { cout << "Animal speaks" << endl; }
    void walk() { cout << "Animal walks" << endl; } // Non-virtual method
};

class Dog : public Animal {
public:
    void speak() override { cout << "Dog barks" << endl; } // Overrides speak
};
int main() {
    Animal* animal = new Dog();
    animal->speak(); // Dog barks (dynamic dispatch)
    animal->walk();  // Animal walks (static dispatch)
    delete animal;
}
In Java, all methods are virtual by default (unless declared final or static). This means
method calls are resolved dynamically.
class Animal {
    // Every method is virtual by default:
    void speak() { System.out.println("Animal speaks"); }
    void walk() { System.out.println("Animal walks"); }
}
Virtual methods matter for several reasons:
1. Flexibility:
○ The same call site can trigger different behaviors, depending on the runtime type
of the object (polymorphism).
2. Extensibility:
○ New subclasses can be added without changing the code that uses the base
class.
3. Encapsulation:
○ Virtual methods help encapsulate the details of specific implementations while
exposing a common interface. This makes code easier to read and maintain.
4. Elegant Design:
○ Virtual methods embody the "Open/Closed Principle": systems should be open to
extension but closed to modification. You can define a base class and extend its
functionality without changing its code.
The beauty of virtual methods lies in their ability to simplify complex systems while maintaining
flexibility. Consider a simple example: a graphics application.
#include <iostream>

class Shape {
public:
    virtual void draw() const = 0; // Pure virtual method
};
class Circle : public Shape {
public:
    void draw() const override { std::cout << "Drawing Circle\n"; }
};
class Square : public Shape {
public:
    void draw() const override { std::cout << "Drawing Square\n"; }
};

void render(const Shape& shape) { shape.draw(); } // Dynamic dispatch.

int main() {
    Circle c;
    Square s;
    render(c); // Drawing Circle
    render(s); // Drawing Square
}
Without virtual methods, you would need complex conditionals (if or switch) to check the
type of the object and call the appropriate method, violating key principles of OOP. With virtual
methods, adding new shapes is effortless—just define a new subclass!
Trade-offs
1. Performance:
○ Virtual methods involve a slight runtime cost because they rely on a vtable (a
table of function pointers) for dynamic dispatch. However, this cost is negligible in
most cases.
2. Explicit Control (C++ vs. Java):
○ In C++, you have fine-grained control over which methods are virtual, allowing for
optimization where dynamic behavior isn't needed.
○ In Java, while the always-virtual approach simplifies the language, it may
introduce unnecessary overhead when dynamic dispatch isn't required.
Conclusion
Virtual methods are a powerful mechanism that enable polymorphism and make object-oriented
design both flexible and extensible. Whether controlled explicitly (as in C++) or implicitly (as in
Java), they allow for more modular, maintainable, and scalable systems. They elegantly balance
the need for abstraction with the ability to extend functionality, embodying the true spirit of
object-oriented programming!
In C++, virtual methods are implemented using vtables (virtual function tables). A vtable is
essentially a lookup table containing pointers to the virtual methods of a class. In C, we simulate
this concept by using function pointers within a struct. There are several possible
implementations, in this case: either we can add to the object struct a single pointer to a table of
function pointers (as C++ actually does), or we can add the function pointers directly to the
struct that represents the objects. Let's follow the second approach:
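As a sketch of that second approach (using the Animal and Elf names from the walkthrough
that follows; the exact fields are assumptions), the function pointers live inside each object:

#include <stdio.h>

typedef struct Animal {
    int age;
    void (*inc_age)(struct Animal*); // "Virtual methods" as function pointers.
    int (*get_age)(struct Animal*);
} Animal;

void animal_inc_age(Animal* self) { self->age += 1; }
int animal_get_age(Animal* self) { return self->age; }

// Elf "inherits" from Animal: the Animal fields are a prefix of Elf,
// so an Elf* can be used wherever an Animal* is expected.
typedef struct Elf {
    Animal base;
    const char* name;
} Elf;

void elf_inc_age(Animal* self) { self->age += 7; } // Overrides inc_age.

int main() {
    Animal a = { 0, animal_inc_age, animal_get_age };
    Elf e = { { 0, elf_inc_age, animal_get_age }, "Legolas" };
    Animal* zoo[] = { &a, (Animal*)&e };
    for (int i = 0; i < 2; i++) {
        zoo[i]->inc_age(zoo[i]);                 // Late binding.
        printf("%d\n", zoo[i]->get_age(zoo[i])); // Prints 1, then 7.
    }
    return 0;
}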
Before walking through the C code that simulates the C++ virtual method behavior, let's recall how the C++ version behaves.
● Virtual Methods (inc_age, get_age): These methods are marked virtual, which
means that they can be overridden in subclasses and that the actual method invoked will
be determined at runtime, based on the actual type of the object. This is the essence of
dynamic dispatch in object-oriented programming.
● vtable (implicit): C++ automatically creates a vtable for the Animal class to hold
pointers to the virtual methods. When we call inc_age or get_age on an object, the
correct method is looked up in the vtable at runtime.
Conclusion:
In this C code, the method pointers (inc_age and get_age) inside the Animal struct mimic
the behavior of virtual methods in C++. These pointers allow us to dynamically call the
appropriate function, which is similar to how C++ determines the method to call through the
vtable. By using this technique, you can implement object-oriented behavior, like polymorphism,
even in C, which doesn't have built-in support for classes or virtual methods.
How It Works:
Key Insight:
This simplicity is what makes the idea of methods elegant: it’s a natural extension of the concept
of a function, tailored to interact with objects!
● Both Animal and Elf share the same interface (Animal struct with inc_age and
get_age function pointers).
● However, the behavior of inc_age differs for Animal and Elf because the inc_age
pointer is assigned different functions at runtime (animal_inc_age for Animal and
elf_inc_age for Elf).
3. Dynamic Assignment
At runtime, when you call inc_age, the program dynamically decides which function to execute
based on the specific object type (Animal or Elf).
● An Animal instance is created, and its inc_age behaves as expected (increment by 1).
● An Elf instance is created, and its inc_age behaves differently (increment by 7),
demonstrating late binding.
1. Common Interface: Both Animal and Elf are represented by the same struct type
(Animal), which acts as a common interface.
2. Dynamic Dispatch:
○ At runtime, the program uses the function pointer stored in the Animal struct to
determine which inc_age method to call.
○ This decision is not made at compile-time but rather at runtime, based on the
specific object being pointed to (Animal or Elf).
3. Flexibility:
○ This mechanism allows the program to treat both Animal and Elf objects
uniformly (as Animal*) while preserving their distinct behaviors.
Late binding is crucial for implementing dynamic dispatch, which is a key concept in
object-oriented programming. It allows different types of objects to define their own behavior
while still being treated as instances of a common base type, exactly as the Animal and Elf
example demonstrates.
So, does this virtual table really take some space then, in the class that
contains a virtual method?
Yes, it does! Consider the example below:
The reason an instance of MyString2 is larger than an instance of MyString1 lies in the
presence of a virtual method and the implementation of virtual tables (vtables) in C++.
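To see the same effect in C terms (a sketch of the idea, since the MyString1 and MyString2
names come from the C++ listing; exact sizes are platform-dependent), compare a struct with
and without a function pointer playing the role of the hidden vtable pointer:

#include <stdio.h>

typedef struct {
    char data[8];
} MyString1; // No virtual methods: the object holds only its data.

typedef struct {
    void (*to_upper)(void*); // Stand-in for the hidden vptr.
    char data[8];
} MyString2;

int main() {
    // On a typical 64-bit machine this prints 8 and 16: the extra
    // pointer costs one (aligned) machine word per object.
    printf("%zu %zu\n", sizeof(MyString1), sizeof(MyString2));
    return 0;
}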
Key Observations
The presence of virtual methods introduces a per-object space overhead, but that is the price
of the powerful dynamic dispatch mechanism they provide. This is why objects with virtual
methods are generally larger than those without.
Shared Layout
In C++:
● When a derived class inherits from a base class, the base class's members are placed
at the beginning of the derived class's memory layout. This ensures that the base
class's fields and methods are at consistent offsets, regardless of whether the object is
accessed as a base class or a derived class.
In the C code:
● We manually ensure that Elf starts with the same layout as Animal. This is achieved
by including the same fields (age and method pointers) in Elf as in Animal, which
guarantees that casting between Animal* and Elf* works correctly.
In C++:
● Dynamic dispatch is achieved through a vtable and vptr mechanism. The base class
has a hidden pointer (vptr) to its vtable, and the derived class overrides the relevant
entries in the vtable.
In the C code:
● Dynamic dispatch is achieved by storing function pointers directly in each object's
struct; overriding a method amounts to assigning a different function to that pointer.
In C++:
● Derived classes can add new fields and methods while still preserving the base class
layout.
In the C code:
● Elf extends Animal by adding new fields (name) and methods (set_name and
get_name). We ensure compatibility by keeping the Animal fields and method pointers
at the same offsets in Elf.
In C++:
● The compiler automatically manages the layout of base and derived classes, as well as
dynamic dispatch via vtables.
In C:
● We have to manually lay out the structs so that the base fields come first, initialize
each object's function pointers, and write the casts between Animal* and Elf*.
Conclusion
However, in C++, the compiler automates all of this, making the code cleaner, safer, and more
maintainable!
This layout discipline is the essence of the Liskov Substitution Principle. In simpler terms, if a
class S is a subclass of class T, then anywhere you can use an object of type T, you should be
able to use an object of type S without breaking the program.
This principle ensures that inheritance promotes true substitutability, enabling polymorphism and
modular code design.
Behavioral subtyping imposes requirements such as:
○ The subclass must preserve the invariants of the superclass (conditions that
must always be true for an object's state).
Real-World Significance
1. Code Reusability: Subclasses can be used in place of their superclasses without
introducing bugs.
2. Polymorphism: The code remains modular and flexible, supporting late binding
effectively.
3. Robust Design: It avoids surprises for developers, as derived classes behave
consistently with their base classes.
Adhering to the Liskov Substitution Principle (LSP) is not an automatic property of every
class hierarchy. A class hierarchy can fail to meet LSP when a subclass violates the behavioral
expectations established by its superclass. Such violations typically arise when the subclass
alters the semantics of inherited methods or introduces constraints that are incompatible with
the parent class's guarantees, as in the example below:
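As a sketch of such a hierarchy (the names are illustrative; where the original example throws
an exception, this C version simply aborts at runtime):

#include <stdio.h>
#include <stdlib.h>

typedef struct Bird {
    void (*fly)(struct Bird*);
} Bird;

void sparrow_fly(Bird* b) { printf("Flying!\n"); }

void penguin_fly(Bird* b) {
    // Violates the contract of Bird: instead of flying, it fails.
    fprintf(stderr, "Penguins cannot fly!\n");
    exit(1);
}

void migrate(Bird* b) { b->fly(b); } // Expects fly() to succeed.

int main() {
    Bird sparrow = { sparrow_fly }; // Substitutable for Bird.
    Bird penguin = { penguin_fly }; // Breaks migrate's expectations.
    migrate(&sparrow);
    migrate(&penguin); // Aborts the program.
    return 0;
}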
In the example, the Sparrow class adheres to LSP because it maintains the expected behavior
of its superclass: calling fly() on a Sparrow instance produces the same type of behavior
one would expect from any Bird. However, the Penguin class violates LSP because it
redefines fly() in a way that contradicts the behavior promised by the Bird class. Specifically,
instead of performing an action (flying), the method throws an exception. This divergence
means that a Penguin cannot be substituted for a Bird in contexts where the fly() method
is expected to work without failure.
So, this trick of having the layout of the superclass being a prefix of the
layout of the subclass enables the substitution principle, right?
Yes, the idea of having the superclass's layout as a prefix of the subclass's layout is the
cornerstone of how a lower-level implementation of an object-oriented language enables this
form of subtyping.
The key idea is structural compatibility: the fields of the superclass (Animal) are arranged as
a prefix of the subclass (Elf). This means the memory layout of an Elf object begins with the
same fields (in the same order) as an Animal. As a result:
1. Pointer Compatibility: When a pointer to an Elf is cast to a pointer to an Animal, the
casted pointer points to the Animal-compatible part of the Elf. The fields and method
pointers of Animal remain accessible as if the object were just an Animal.
2. Polymorphism Through Function Pointers: In this implementation, each struct
contains function pointers for its methods (e.g., get_age and inc_age). This design
allows overriding superclass methods in the subclass (e.g., Elf overrides inc_age).
When a function like get_age is called on an Animal pointer, it uses the method
pointer stored in the object, whether the object is an Animal or an Elf. This achieves
polymorphic behavior.
Summary
The C implementation achieves subtyping by designing structs with the superclass as a prefix of
the subclass. Function pointers enable polymorphism, allowing subclass methods to override
superclass methods. This design is a simulation of inheritance and demonstrates how principles
from object-oriented programming can be implemented in a procedural language like C.
So, in the end, we can find the target of any method call with two
dereferences. Is that really the case?
This principle is indeed true for statically typed languages that support virtual method tables
(v-tables).
Efficiency
This two-step process ensures that finding the correct function to execute in a polymorphic call
involves just two pointer dereferences:
1. The first dereference retrieves the address of the v-table from the object.
2. The second dereference retrieves the function pointer from the v-table.
● The offset for each virtual method in the v-table is fixed at compile time, so there’s no
runtime searching involved.
● This constant-time access is why virtual method calls are highly efficient and well-suited
for object-oriented programming.
Example:
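The C sketch below makes the two dereferences explicit, with a hypothetical Dog class
implementing eat (an explicit v-table rather than the compiler-generated one):

#include <stdio.h>

// One v-table per class: a table of function pointers.
typedef struct AnimalVTable {
    void (*eat)(void* self);
} AnimalVTable;

// Every object starts with a pointer to its class's v-table (the "vptr").
typedef struct Animal {
    const AnimalVTable* vtable;
} Animal;

void dog_eat(void* self) { printf("Dog eats\n"); }
const AnimalVTable dog_vtable = { dog_eat };

int main() {
    Animal d = { &dog_vtable };
    // Dereference 1: d.vtable. Dereference 2: vtable->eat.
    d.vtable->eat(&d);
    return 0;
}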
● For d.eat(), the compiler knows the offset of eat in the v-table.
● Regardless of whether d is an Animal, Mammal, or Dog, the mechanism ensures the
correct eat implementation is called dynamically (based on the actual type of the
object).
If virtual tables weren't used, polymorphic method calls could require more complex
mechanisms, such as walking the class hierarchy at runtime or looking methods up by name
in a hash table, as many dynamically typed languages do.
Although efficient, the two-pointer dereference adds a slight overhead compared to direct calls
(non-virtual methods), which don't require table lookups. However, the trade-off is often
acceptable for the flexibility of polymorphism.
In conclusion, polymorphic method calls in languages that use virtual tables (like C++, Java,
etc.) require just two pointer dereferences, ensuring they remain both flexible and efficient.
● Alan Kay coined the term "object-oriented programming" and envisioned OO as a tool for
user empowerment and problem-solving.
● Modern OO languages like Scala, Kotlin, and Swift integrate OO with functional
programming (FP) concepts like immutability and higher-order functions.
● Languages like Python and JavaScript offer hybrid paradigms, blending OO,
procedural, and functional styles.
Historically, the paradigm emerged from the need to:
1. Model complex systems in a way that mirrors the real world (e.g., simulations).
2. Reduce software complexity by promoting modularity and code reuse.
"The big idea is 'messaging.' The key in making great and growable systems is to
have the right metaphors."
Final Thought
OO programming evolved from a need to create natural, intuitive models of systems. Over time,
it grew into a flexible paradigm that balances structure and creativity, making it one of the most
influential concepts in software development.
● What It Is:
○ LSP, which Liskov presented in a 1987 keynote and later formalized with
Jeannette Wing in a 1994 paper, states:
"If S is a subtype of T, then objects of type T may be replaced with objects of type
S without altering the correctness of the program."
● Why It Matters:
○ This principle formalized a key aspect of object-oriented design: ensuring that
subclasses can be substituted for their parent classes without breaking
functionality.
○ LSP is a cornerstone of object-oriented programming (OOP) and is part of the
SOLID principles, a set of guidelines for designing maintainable software.
● Example:
○ If a class Bird has a method fly, then a subclass Penguin must either support
the concept of fly or not be considered a subclass of Bird. Violating LSP leads
to unexpected behavior.
5. Broader Influence
● Education:
○ Liskov is a professor at MIT and has mentored generations of computer
scientists.
● Philosophy:
○ She advocated for modularity, abstraction, and correctness in software
design, which became universal principles in software engineering.
Barbara Liskov’s work fundamentally shaped how we design and think about software.
Concepts like modularity, abstraction, OOP principles, and fault tolerance in distributed
systems owe much to her innovations. Even though her languages (CLU, Argus) aren’t widely
used today, their ideas became the bedrock of modern programming languages and practices.
Heap Allocation
Code: https://github.com/pronesto/DCC053_Public/tree/main/21_HeapAllocation
1. Static Allocation
● What it is: Variables are allocated a fixed memory location during program compilation
and remain in that location for the program's entire runtime.
● Where it happens: In the data or BSS segments of memory.
● Lifetime: The entire program execution.
● Scope: Can be local to a file, block, or global (depending on the declaration).
Examples:
Static local variables: Declared with the static keyword inside a function; they retain their
value across function calls:
void myFunction() {
static int counter = 0; // Retains value between calls.
counter++;
}
● Advantages:
○ Predictable, fixed lifetime.
○ No dynamic management required.
● Disadvantages:
○ Memory is reserved for the entire program, even if the variable is not actively
used.
2. Stack Allocation
● What it is: Variables are allocated in a last-in, first-out (LIFO) manner from the program's
stack when a function is called. They are automatically deallocated when the function
returns.
● Where it happens: In the stack segment of memory.
● Lifetime: Limited to the scope of the function in which they are declared.
Examples:
Local variables: Declared inside a function or block without the static keyword.
void myFunction() {
// Automatically deallocated when function exits.
int localVar = 5;
}
● Advantages:
○ Fast allocation and deallocation (just moving the stack pointer).
○ Memory is automatically managed.
● Disadvantages:
○ Limited memory size (stack overflow can occur if too many variables are
allocated).
○ Variables cannot outlive their function scope (no persistence).
3. Heap Allocation
● What it is: Memory is manually allocated and freed by the programmer at runtime using
functions like malloc(), calloc(), realloc(), and free().
● Where it happens: In the heap segment of memory.
● Lifetime: Controlled by the programmer; it persists until explicitly deallocated with
free().
● Scope: Determined by the pointer used to access the allocated memory.
Examples:
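For instance, a minimal sketch of the usual pattern:

#include <stdlib.h>

void example() {
    int* arr = malloc(10 * sizeof(int)); // Allocated on the heap.
    if (arr != NULL) {
        arr[0] = 42; // Use the memory...
        free(arr);   // ...and release it explicitly when done.
    }
}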
● Advantages:
○ Flexibility to allocate memory as needed.
○ Lifetime of variables can extend beyond the function scope that allocated them.
● Disadvantages:
○ Slower than stack allocation due to manual management.
Comparison
Aspect       Static Allocation          Stack Allocation              Heap Allocation
Lifetime     Entire program execution   Scope of the function         Until explicitly freed
Management   Automatic (compiler)       Automatic (stack pointer)     Manual (programmer)
Speed        Fast (fixed address)       Very fast                     Slower (bookkeeping)
Typical use  Globals, static locals     Local variables               Dynamic data structures
When you create local variables using stack allocation, they are automatically deallocated
once the function exits. If you need data that persists after the function ends, you can't use the
stack:
int* createArray() {
// Allocate memory on the heap.
int* arr = malloc(10 * sizeof(int));
// The data persists even after this function exits.
return arr;
}
Without the heap, you wouldn't be able to return dynamically allocated arrays or objects.
The size of stack-allocated variables must generally be known at compile time (though some
compilers support Variable Length Arrays). For cases where the size of the data is determined
at runtime, the heap is essential.
int n;
scanf("%d", &n);                    // Size known only at runtime...
int* arr = malloc(n * sizeof(int)); // ...so the array must live on the heap.
Heap-allocated memory allows data to be shared between functions, threads, or even across
program components. As long as you have a reference (pointer) to the memory, you can access
it:
// A possible implementation of setData: it allocates an integer on
// the heap and hands it back to the caller via the pointer parameter.
void setData(int** ptr) {
    *ptr = malloc(sizeof(int));
    **ptr = 42;
}

void useData() {
    int* data;
    setData(&data);
    // Data persists even after `setData` exits.
    printf("%d\n", *data);
    free(data);
}
The stack is limited in size and intended for small, short-lived data structures. Allocating large
arrays or objects on the stack risks stack overflow. Heap allocation enables handling larger data
structures without exhausting stack memory:
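For instance, a buffer of several megabytes is safe on the heap but risky on the stack (a
sketch):

#include <stdlib.h>

void process() {
    // char big[10 * 1024 * 1024]; // Would likely overflow the stack!
    char* big = malloc(10 * 1024 * 1024); // Fine on the heap.
    if (big) {
        // ... use the buffer ...
        free(big);
    }
}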
The heap offers the programmer full control over the lifetime of the data. You can allocate
memory when needed and free it when no longer required, optimizing memory usage.
Conclusion
Without the heap, many complex programs (like those managing dynamic data structures,
persistent objects, or large datasets) would be much harder to write efficiently!
Can you write a simple C program that illustrates how the heap is
managed?
Sure. Here's a very simple implementation of the heap, which assumes that every word is
eight bits long, so that sizes and offsets fit in a uint8_t:
#include <stdio.h>
#include <stdint.h>
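// Reconstruction sketch: the heap's definitions do not appear above;
// HEAP_SIZE = 128 and this three-field header are assumptions chosen to
// match the walkthrough later in this chapter.
#define HEAP_SIZE 128

typedef struct {
    uint8_t size;    // Usable bytes in the block, excluding the header.
    uint8_t is_free; // 1 if the block is free, 0 if allocated.
    uint8_t next;    // Offset of the next block (0 marks the end).
} BlockHeader;

uint8_t heap[HEAP_SIZE]; // The heap: a fixed-size array of bytes.
uint8_t free_list = 0;   // Offset of the first block.

// Creates the initial block: the whole heap as a single free block.
void init_heap() {
    BlockHeader* first = (BlockHeader*)heap;
    first->size = HEAP_SIZE - sizeof(BlockHeader);
    first->is_free = 1;
    first->next = 0;
}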
// malloc implementation.
void* my_malloc(uint8_t size) {
    uint8_t current_offset = free_list;
    while (current_offset < HEAP_SIZE) {
        BlockHeader* current = (BlockHeader*)&heap[current_offset];
        if (current->is_free && current->size >= size) {
            // Found a suitable block; split off the tail if it can
            // hold a header plus at least one byte.
            if (current->size > size + sizeof(BlockHeader)) {
                uint8_t new_offset = current_offset + sizeof(BlockHeader) + size;
                BlockHeader* new_block = (BlockHeader*)&heap[new_offset];
                new_block->size = current->size - size - sizeof(BlockHeader);
                new_block->is_free = 1;
                new_block->next = current->next;
                current->next = new_offset;
                current->size = size;
            }
            current->is_free = 0;
            return &heap[current_offset + sizeof(BlockHeader)];
        }
        current_offset = current->next;
        if (current_offset == 0) {
            break; // Reached the end of the list.
        }
    }
    return NULL; // No suitable block found.
}
// free implementation.
void my_free(void* ptr) {
    if (!ptr) {
        return;
    }
    uint8_t block_offset = (uint8_t*)ptr - heap - sizeof(BlockHeader);
    BlockHeader* block = (BlockHeader*)&heap[block_offset];
    block->is_free = 1;
}

// Prints the properties of every block in the heap.
void print_heap() {
    uint8_t current_offset = free_list;
    printf("Heap state:\n");
    while (current_offset < HEAP_SIZE) {
        BlockHeader* current = (BlockHeader*)&heap[current_offset];
        printf("Block at offset %u: size=%u, is_free=%d, next=%u\n",
               current_offset, current->size, current->is_free, current->next);
        current_offset = current->next;
        if (current_offset == 0) {
            break; // End of the list.
        }
    }
}
The heap is simulated as a fixed-size array of bytes (heap). Each memory block in the heap
has a header (BlockHeader), which stores:
1. size: The size of the block's usable space (in bytes), excluding the header.
2. is_free: A flag indicating whether the block is free (1) or allocated (0).
3. next: The offset (relative to the start of the heap) of the next block in the list.
This metadata helps organize the heap into a linked list of blocks.
In this context, a "pointer" is represented by an offset into the heap array. The allocator starts
with a free_list, which is the offset of the first block. Each block header has a next field that
points to the next block's offset, forming a linked list.
By traversing this list, the allocator finds free blocks or reaches the end (next = 0).
When the program begins, the entire heap is one large free block. The init_heap function
creates this initial block by:
1. Setting its size to the total heap size minus the size of the header.
2. Marking the block as free (is_free = 1).
3. Setting the next field to 0, indicating the end of the list.
The my_malloc function finds a suitable block to allocate memory and optionally splits the
block into smaller ones:
1. Search for a free block: It traverses the linked list of blocks using the next pointer,
checking each block's is_free flag and size.
2. Block splitting: If the block is larger than the requested size by at least the size of a
header, the block is split:
○ The first part is allocated and updated with the requested size.
○ The remaining portion becomes a new free block with its own header.
3. Marking the block: The selected block is marked as allocated (is_free = 0).
4. Return pointer: The function returns a pointer to the usable memory (the address after
the header).
The my_free function releases a block by:
1. Calculating the block's offset in the heap from the pointer given to the function.
2. Setting the block's is_free flag to 1.
Two helper functions make the state of the heap visible:
1. print_heap traverses the list and prints the properties (size, is_free, next) of each
block. This is useful to understand the current organization of the heap.
2. dump displays the raw byte values of the heap array in chunks, offering a low-level view
of the memory state.
Important Principles
1. Block Splitting: Efficient memory usage is achieved by splitting large free blocks when
the requested size is smaller.
2. Linked List Traversal: Allocation relies on traversing the linked list, which may lead to
performance issues if the list becomes long.
This simple allocator demonstrates the core ideas behind dynamic memory management, laying
the foundation for more sophisticated techniques like best-fit or buddy allocation algorithms.
The cast is only correct if:
● The memory at that location is properly aligned for the type being cast to.
● There is enough space at that address to hold the entire BlockHeader structure.
Is this safe? Is this verified?
Well, not by the language runtime! Neither alignment nor bounds are verified by the code the
C compiler generates.
Verifying these properties is the responsibility of the programmer. Proper implementations would
include checks for these conditions before performing the cast or accessing memory.
The expression (BlockHeader*)&heap[current_offset] combines three operations:
1. Array indexing (heap[current_offset]): This designates the memory location of the
current_offsetth byte in the array.
2. Address-of operation (&heap[current_offset]): This gives the address of the byte
at the computed index.
3. Casting ((BlockHeader *)): This tells the compiler to interpret the address as
pointing to a BlockHeader.
What happens if there are not enough bytes for the struct?
If there are not enough bytes in the array starting at current_offset to cover the
BlockHeader structure, the code will cause undefined behavior when attempting to access
memory beyond the bounds of the array. This could manifest as crashes, silent memory
corruption, or bugs that surface far from the offending access.
In summary, the cast is permissible in C but relies on assumptions about alignment and bounds
that are not inherently checked by the compiler. Proper verification is necessary to ensure safety
and correctness.
int main() {
init_heap();
void* a = my_malloc(20);
void* b = my_malloc(30);
my_free(a);
void* c = my_malloc(10);
my_free(b);
}
This figure illustrates the state of the heap after executing the sequence of memory allocation
and deallocation operations in the provided program. Each subfigure represents a snapshot of
the heap at different points in the program's execution.
Initial State
The first subfigure shows the state of the heap after the init_heap function is called. At this
point, the entire heap is a single free block, and the free list (FL) points to it.
After my_malloc(20), that block is split into an allocated block of 20 bytes and a smaller
free block. The free list (FL) now points to the free block starting at the offset after the
allocated block.
● The allocator traverses the free list to find a block that can satisfy the request.
● The free block of 102 bytes is split into an allocated block of 30 bytes and a new free
block of size 69 bytes.
● The allocated block is marked as not free (is_free = 0), while the remaining free
block is added back to the free list.
The free list (FL) points to the newly created free block starting at the offset after the allocated
block.
● The block that was allocated with 20 bytes is returned to the free list.
● Its is_free field is updated to 1, marking it as free.
● The block is now part of the free list, linked to the remaining free blocks in the heap.
The free list (FL) points to the newly freed block, which links to the next free block in the list.
● The allocator searches the free list for a block large enough to accommodate 10 bytes.
● The block of 20 bytes (previously freed) is split into an allocated block of 10 bytes and a
free block of 7 bytes.
● The allocated block is marked as not free (is_free = 0), while the remaining portion
remains free.
The free list now begins at the free block of 7 bytes, which links to the next free block in the
heap.
At this point, the free list contains three free blocks: 7 bytes, 56 bytes, and 69 bytes. The
allocator could potentially coalesce adjacent free blocks into larger ones to reduce
fragmentation, though this implementation does not include that feature.
1. Internal Fragmentation
● Definition: Internal fragmentation occurs when the allocated memory block is larger than
the memory actually requested by the program. The "extra" memory inside the block is
wasted because it cannot be used for any other purpose.
● Cause in this system: If the allocator rounds up requests to the nearest block size or
includes metadata inside the block (e.g., size, is_free, next), this can lead to
unused space within an allocated block. However, we do not face internal fragmentation
in our implementation of the heap, because blocks are sized exactly to match the
request plus metadata. The flip side is that our implementation makes no attempt to
keep allocated addresses aligned in memory.
2. External Fragmentation
● Definition: External fragmentation occurs when free memory is split into many small,
non-contiguous blocks, so that large requests cannot be satisfied even though the
total amount of free memory would suffice.
For example:
● In the final state of the heap in the figure, the free list contains blocks of 7 bytes, 56
bytes, and 69 bytes.
● If a request for 80 bytes is made, it will fail despite there being 132 bytes of free memory
in total. None of the individual free blocks is large enough to satisfy the request.
Why does fragmentation matter?
1. Wasted Memory: Fragmentation reduces the effective usable memory. Even though the
heap has free space, the allocator might not be able to use it efficiently.
2. Allocation Failures: Programs may fail to allocate memory for large requests, even
when enough total free memory exists.
3. Performance Overhead: Searching through a fragmented free list can increase
allocation time.
How can fragmentation be reduced?
● Coalescing: Merge adjacent free blocks into a single larger block whenever possible.
For example, if two free blocks are adjacent in memory, they can be combined to reduce
fragmentation.
● Compaction: Move allocated blocks closer together to create larger contiguous free
blocks. However, this is usually impractical for heaps due to the complexity of moving
allocated memory.
● Better Allocation Strategies: Use algorithms like best fit (allocate the smallest block
that fits the request) or buddy allocation (divide memory into power-of-2 sizes) to reduce
fragmentation.
By addressing fragmentation, the heap system can maintain better utilization and serve larger
allocation requests more effectively.
What Changed?
The my_free function now tries to merge a freed block with an adjacent free block.
Benefits of Coalescing
Coalescing ensures that the heap remains efficient even after multiple allocation and
free operations.
int main() {
init_heap();
void* a = my_malloc(20);
void* b = my_malloc(30);
// First figure:
my_free(a);
// Second figure:
void* c = my_malloc(10);
// Third figure:
my_free(b);
return 0;
}
The provided figure shows the state of the heap memory at three different stages during the
execution of the program. The changes are a result of allocating and freeing memory blocks
using the given my_malloc and my_free functions. Below is an explanation of each stage:
First Figure:
● Operation: my_free(a)
● The block allocated for a (size 20) is marked as free.
● However, the coalescing does not occur because the next block (allocated for b, size 30)
is not free.
● Observation: Coalescing only happens if the next block is free, which prevents
combining these two blocks.
Second Figure:
● Operation: my_malloc(10)
● A new block is allocated, splitting the previously freed block (size 20).
● The block is divided into two parts:
○ A new allocated block (size 10).
○ A remaining free block (size 10 + metadata overhead).
● The b block remains unchanged.
Third Figure:
● Operation: my_free(b)
● The block allocated for b (size 30) is marked as free.
● The allocator now attempts to coalesce this block with the next free block (size 10).
● Observation: Coalescing succeeds in this case because the next block is free. The two
blocks are merged into a single free block (size 30 + 10 + metadata).
Key Takeaway:
● In the first figure, no coalescing occurs when a is freed because the subsequent block
(b) is not free.
● The design simplifies coalescing by only considering adjacent free blocks (the current
block and its next neighbor). However, it can lead to fragmentation if there are
interleaving allocations and free blocks in the heap.
// free implementation with coalescing.
void my_free(void* ptr) {
    if (!ptr) return;
    uint8_t block_offset = (uint8_t*)ptr - heap - sizeof(BlockHeader);
    BlockHeader* block = (BlockHeader*)&heap[block_offset];
    block->is_free = 1;
    // Coalesce with the next block, if it is free.
    if (block->next != 0) {
        BlockHeader* next = (BlockHeader*)&heap[block->next];
        if (next->is_free) {
            block->size += sizeof(BlockHeader) + next->size;
            block->next = next->next;
        }
    }
    // Coalesce with the previous block: walk the list to find it.
    uint8_t current_offset = free_list;
    while (current_offset < HEAP_SIZE) {
        BlockHeader* current = (BlockHeader*)&heap[current_offset];
        if (current->next == block_offset) {
            // Found the previous block.
            if (current->is_free) {
                current->size += sizeof(BlockHeader) + block->size;
                current->next = block->next;
            }
            break;
        }
        current_offset = current->next;
        if (current_offset == 0) break;
    }
}
In this context, moving memory coalescing to the language runtime means shifting the
responsibility for optimizing memory layout (e.g., merging free blocks to reduce fragmentation)
from user-defined functions like my_free to the runtime system of the programming language.
○ Memory coalescing would be part of the runtime's strategy for managing memory
efficiently without programmer intervention.
3. Abstraction Layer:
○ The runtime provides an abstraction over the underlying system, making
programs portable and less dependent on platform-specific details.
● The runtime would periodically scan and optimize the memory heap (e.g., by combining
adjacent free blocks).
● This could be done:
○ During garbage collection, where unused memory is reclaimed.
○ When memory allocation fails due to fragmentation.
○ At specific intervals or idle periods, as decided by the runtime's internal
heuristics.
In Our Case
● Your my_free function would no longer need to worry about coalescing memory
immediately.
● The runtime (a simulated one for teaching purposes) would handle coalescing at its
discretion, through a function like your coalesce_memory.
This approach ties into the concept of managed runtimes like the JVM, which students will
likely encounter in high-level languages, while still giving them hands-on experience with
low-level memory concepts in C.
So, some languages have a heavy runtime, like Java, and others have a
light runtime, like C?
Exactly! Programming languages differ in the complexity and functionality of their runtime
environments, which can be classified as heavy or light based on the responsibilities they take
on. Let's break this down:
These languages come with a runtime environment that manages many aspects of program
execution for the developer. This leads to easier programming but at the cost of performance
overhead and complexity.
● Java: The JVM performs garbage collection, JIT compilation, and memory optimization.
● Python: The CPython runtime interprets code, handles garbage collection, and
manages dynamic typing.
● C#: The CLR provides garbage collection, JIT compilation, and advanced runtime
features like LINQ.
These languages have minimal runtime environments, providing low-level control and
requiring the programmer to manage most aspects of program execution.
● C: The runtime is little more than libc and basic startup code.
● C++: Adds slightly more runtime overhead than C (e.g., dynamic_cast, exceptions,
virtual function tables), but still lightweight compared to Java or Python.
Comparison
Aspect       Heavy Runtime (e.g., Java)                Light Runtime (e.g., C)
Complexity   Larger and more complex runtime systems.  Minimal runtime with low-level control.
By learning about memory coalescing and runtime management in C, we are gaining insight into
how lightweight runtimes work and laying the groundwork for understanding how heavy
runtimes automate these processes.
Implementing a coalesce_memory function that traverses the list of blocks and performs
coalescing when possible is a great way to simulate the behavior of a managed runtime. This
"deferred" approach mirrors how garbage collectors in managed runtimes (e.g., Java Virtual
Machine or .NET) often perform compaction or defragmentation in the background.
Updated Code
void coalesce_memory() {
    // Start from the beginning of the heap.
    BlockHeader* block = (BlockHeader*)heap;
    while (1) {
        if (block->next == 0) {
            break; // Reached the end of the list.
        }
        BlockHeader* next = (BlockHeader*)&heap[block->next];
        if (block->is_free && next->is_free) {
            // Merge the two adjacent free blocks...
            block->size += sizeof(BlockHeader) + next->size;
            block->next = next->next;
            // ...and stay on this block: it may absorb further neighbors.
        } else {
            block = next;
        }
    }
}
Example Usage
int main() {
    init_heap();
    void* a = my_malloc(20);
    void* b = my_malloc(30);
    my_free(a);
    my_free(b);
    coalesce_memory(); // Deferred coalescing, at the runtime's discretion.
    return 0;
}
● Pros:
○ Simplifies my_free.
○ Coalescing happens only when necessary, reducing overhead during frequent
malloc/free operations.
● Cons:
○ Increased fragmentation until coalesce_memory is called.
○ The runtime requires additional logic to decide when to invoke coalescing.
Key Lesson:
You can find the source code for the malloc and free functions in the GNU C Library (glibc).
This library is a critical component of most Linux systems, providing essential system functions
like memory allocation.
While the exact implementation details can be complex, the core idea behind malloc and free
is to manage a heap, a region of memory that can be dynamically allocated and deallocated.
1. malloc:
○ When you call malloc(size), it searches the heap for a free block of memory
that is large enough to accommodate the requested size.
○ If a suitable block is found, it is allocated to the caller, and the remaining free
space (if any) is marked as available for future allocations.
○ If no suitable block is found, the system may request more memory from the
operating system.
2. free:
○ When you call free(ptr), it marks the memory block pointed to by ptr as free.
○ This freed block may be merged with adjacent free blocks to form a larger free
block, improving memory utilization.
A Note on Complexity:
The actual implementation of malloc and free in glibc is significantly more intricate, involving
various memory management techniques like:
● Memory Pooling: Dividing the heap into smaller pools to optimize allocation and
deallocation.
● Coalescing: Merging adjacent free blocks to reduce fragmentation.
● Memory Alignment: Ensuring that allocated blocks are aligned to specific boundaries to
optimize performance.
Memory Leaks:
1. A memory leak happens when allocated memory is not properly freed after it is no
longer needed.
void problem() {
int* i = (int*) malloc(sizeof(int));
*i = 3;
printf("%d\n", *i);
}
2. The malloc allocates memory, but there’s no free call to release it. Each time
problem() is called, more memory is allocated and never returned, eventually
exhausting the available memory.
Dangling Pointers:
3. A dangling pointer occurs when memory is freed but the pointer still refers to that
memory location, leading to unpredictable behavior if it is accessed again.
void dangling() {
int* i = (int*) malloc(sizeof(int));
int* j;
*i = 3;
free(i);
j = (int*) malloc(sizeof(int));
*j = 8;
printf("%d\n", *i);
}
4. The free(i) releases the memory allocated for i, but the pointer i still points to that
memory. When j is allocated, it reuses the same memory location. Writing to j
overwrites i, causing undefined behavior.
Managed runtimes (e.g., Java Virtual Machine, Python interpreter) handle memory allocation
and deallocation automatically, freeing the programmer from pairing every allocation with an
explicit deallocation.
Summary
Managed memory systems simplify programming and eliminate many common bugs, like
memory leaks and dangling pointers. However, they come at the cost of performance overhead,
reduced control, and increased runtime complexity. The choice between managed and
unmanaged systems depends on the application's requirements. For example:
● A real-time system with strict performance needs might favor unmanaged memory.
● A web application, where developer productivity and stability are crucial, might benefit
more from a managed runtime.
Garbage collection (GC) hides the details of manual memory
management (like malloc and free in C) from the programmer, reducing the risk of errors
such as double frees, dangling pointers, or forgetting to free memory.
1. Allocation: Memory is allocated from the heap for objects or data structures when
needed by the program.
2. Tracing or Counting: A mechanism tracks whether memory is still in use. This can
involve:
○ Reference Counting: Keeping a count of how many references point to an
object, as we discussed earlier.
○ Tracing: Using algorithms like mark-and-sweep or generational GC to traverse
memory and identify "live" (in use) versus "dead" (unreachable) objects.
3. Reclamation: When memory is determined to be unused, it is reclaimed and returned to
the heap for future allocations.
Common garbage collection strategies include:
● Reference Counting: Simple but can't handle cyclic references (e.g., two objects
referencing each other but not used elsewhere).
● Mark-and-Sweep: A tracing algorithm where reachable objects are "marked," and
unmarked objects are swept up as garbage.
● Generational GC: Divides memory into regions (young, old) to optimize for the fact that
most objects die young.
● Copying GC: Divides the heap into two spaces and copies live objects to the other
space, compacting memory in the process.
● Stop-the-World GC: Pauses the program while garbage collection is running.
● Incremental or Concurrent GC: Runs alongside the program to minimize pauses.
However, GC can also introduce runtime overhead, potentially causing pauses in program
execution. Managed runtimes like the JVM, the .NET CLR, and CPython strive to minimize
this impact through efficient GC designs.
In reference counting, every object is associated with a counter: the reference
count tracks how many references (or pointers) currently point to the object. When the reference
count drops to zero, the object is no longer accessible and can be safely reclaimed.
1. Increment the Count: When a new reference to an object is created (e.g., assigning a
pointer or variable), the reference count of the object is incremented.
2. Decrement the Count: When a reference is removed (e.g., when a pointer goes out of
scope or is reassigned), the reference count is decremented.
3. Reclaim Memory: If the reference count of an object reaches zero, the object is
considered unreachable and is immediately freed.
Reference counting has three main strengths:
1. Simplicity: The implementation is straightforward, and memory can often be reclaimed
immediately when objects are no longer in use.
2. Deterministic: Since objects are deallocated as soon as their reference count reaches
zero, memory cleanup is predictable, which is particularly useful in systems requiring
low-latency operations.
3. Incremental: Unlike some other GC techniques (e.g., mark-and-sweep), reference
counting does not require pausing the program to traverse the entire memory space.
Despite these strengths, reference counting has two classic weaknesses:
1. Cyclic References:
Example:
struct Object {
    struct Object* ref;
};

// Hypothetical helper that builds two mutually-referencing objects.
void make_cycle() {
    struct Object* a = malloc(sizeof(struct Object));
    struct Object* b = malloc(sizeof(struct Object));
    a->ref = b; // b's reference count rises to 2.
    b->ref = a; // a's reference count rises to 2.
}
// Once we leave the scope of local variables `a` and `b`, the
// chunks of memory they were referencing are no longer accessible.
// However, neither is freed because they reference each other!
○ Even though a and b are no longer used, their reference counts never drop to
zero, resulting in a memory leak.
2. Runtime Overhead:
○ Every reference assignment, argument pass, and scope exit must update a
counter, adding overhead to the program's normal execution.
Despite its limitations, reference counting is still used in scenarios where its simplicity and
deterministic behavior are advantageous; CPython's interpreter and C++'s std::shared_ptr
are two well-known examples.
To overcome its weaknesses, reference counting can be combined with other techniques:
1. Cycle Detection: Periodically run an additional garbage collection algorithm (e.g.,
mark-and-sweep) to identify and clean up cyclic references.
2. Hybrid Approaches: Combine reference counting for deterministic cleanup with tracing
garbage collection for cycles and memory compaction.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
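// Reconstruction sketch: the block header here is assumed to be the same
// as before, extended with a ref_count field.
#define HEAP_SIZE 128

typedef struct {
    uint8_t size;      // Usable bytes in the block, excluding the header.
    uint8_t is_free;   // 1 if the block is free, 0 if allocated.
    uint8_t next;      // Offset of the next block (0 marks the end).
    uint8_t ref_count; // Number of live references into this block.
} BlockHeader;

uint8_t heap[HEAP_SIZE];
uint8_t free_list = 0;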
// malloc implementation.
void* my_malloc(uint8_t size) {
    uint8_t current_offset = free_list;
    while (current_offset < HEAP_SIZE) {
        BlockHeader* current = (BlockHeader*)&heap[current_offset];
        if (current->is_free && current->size >= size) {
            // Found a suitable block.
            uint8_t remaining = current->size - size;
            if (remaining > sizeof(BlockHeader)) {
                // Split off the tail as a new free block.
                uint8_t new_offset = current_offset + sizeof(BlockHeader) + size;
                BlockHeader* new_block = (BlockHeader*)&heap[new_offset];
                new_block->size = remaining - sizeof(BlockHeader);
                new_block->is_free = 1;
                new_block->next = current->next;
                new_block->ref_count = 0;
                current->next = new_offset;
                current->size = size;
            }
            current->is_free = 0;
            current->ref_count = 1;
            return &heap[current_offset + sizeof(BlockHeader)];
        }
        current_offset = current->next;
        if (current_offset == 0) {
            break; // Reached the end of the list.
        }
    }
    // No suitable block found.
    return NULL;
}
// Decrements the reference count of a block; frees the block when the count
// reaches 0.
void decrement_ref(void* ptr) {
if (!ptr)
return;
uint8_t block_offset = (uint8_t*)ptr - heap -
sizeof(BlockHeader);
BlockHeader* block = (BlockHeader*)&heap[block_offset];
if (block->ref_count > 0) {
block->ref_count--;
if (block->ref_count == 0) {
// Free the block when the reference count reaches 0.
my_free(ptr);
}
}
}
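// Reconstruction sketch: the increment half of the pair, written to mirror
// decrement_ref above.
// Increments the reference count of the block that ptr points into.
void increment_ref(void* ptr) {
    if (!ptr)
        return;
    uint8_t block_offset = (uint8_t*)ptr - heap -
                           sizeof(BlockHeader);
    BlockHeader* block = (BlockHeader*)&heap[block_offset];
    block->ref_count++;
}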
This implementation incorporates reference counting for memory management while retaining
the basic structure of your heap allocator. It includes functions to increment and decrement
reference counters and automatically frees memory when the reference count reaches zero.
Adding reference counting to a heap involves several key modifications to track and manage the
number of references to each allocated memory block. Here's an explanation of the changes
needed:
5. Update Pointer-Handling Code
○ Ensure that all functions handling pointers (e.g., copying or assigning pointers)
appropriately call increment_ref and decrement_ref as needed to maintain
the correct reference counts.
6. Coalescing Free Blocks
○ The logic for coalescing free blocks (merging adjacent free blocks) can remain
mostly unchanged but must be carefully applied to ensure it doesn't disrupt the
reference-counting system.
7. Handle Null or Invalid Pointers
○ Both increment_ref and decrement_ref should include checks to handle
NULL pointers or invalid memory regions gracefully, avoiding potential errors.
8. Debugging and Monitoring
○ Add debugging tools, such as a print_heap function, to display the reference
counts of blocks alongside their other metadata. This can help identify issues like
memory leaks (blocks with a reference count greater than zero but no valid
references).
These changes allow the heap to manage memory more effectively, ensuring blocks are freed
only when no references remain. However, this approach has limitations, such as failing to
handle cyclic references, which would require a more advanced garbage collection technique to
resolve.
The increment_ref function is called when a new reference to an object is created. This
typically happens in the following scenarios:
● Assignment:
When a pointer or reference variable is assigned to point to an existing object, the
reference count of that object is incremented. For example, if b is assigned to point to a,
increment_ref is called for the object pointed to by a.
● Passing an Object to a Function:
When an object is passed as an argument to a function, a new reference to it is created
inside the callee, so the object's reference count is incremented.
The decrement_ref function is called when a reference to an object is removed or goes out of
scope. This happens in scenarios such as:
● Reassigning a Reference:
When a variable holding a reference to an object is reassigned to point to a different
object, the reference count of the old object is decremented. For example, if b is
reassigned to point to c, decrement_ref is called for the object previously pointed to
by b, and increment_ref is called for the object pointed to by c.
● Object Destruction:
When a data structure (e.g., an array or a hash table) holding references to objects is
destroyed, decrement_ref is called for all the objects it references.
In a language with automatic reference counting, the compiler inserts these calls for you:
● When you assign a value or pass it as an argument, the compiler inserts a call to
increment_ref.
● When a variable is reassigned or goes out of scope, the compiler inserts a call to
decrement_ref.
When decrement_ref reduces the reference count of an object to zero, the garbage collector
immediately deallocates the memory for the object: the block is returned to the free list, and
the counts of any objects it referenced are decremented in turn, which may cascade into
further deallocations.
1. Cyclic References: If objects form a reference cycle, their reference counts never reach
zero, leading to memory leaks. Special techniques (e.g., cycle detection or weak
references) are needed to address this.
2. Overhead: Incrementing and decrementing counters on every reference manipulation
adds computational overhead, which can slow down execution compared to manual
memory management.
In summary, the increment_ref and decrement_ref functions are invoked implicitly by the
runtime system during common operations like assignments, function calls, and variable scope
changes. This automation makes reference counting easy to use but comes with trade-offs in
terms of performance and handling edge cases like cycles.
Overview
The memory is managed using reference counts. Each object in memory has an associated
reference count (indicated in the red box). This count tracks how many active references point
to the object. When the reference count drops to zero, the garbage collector deallocates the
object since it is no longer in use.
You can refer to these as "invisible instructions" because they are not explicitly visible to the
programmer in the source code. Their presence is inferred by the compiler based on how the
program manipulates references. This abstraction allows programmers to focus on writing code
without worrying about manually managing memory in many cases. However, these "invisible
instructions" are crucial for ensuring that the runtime system keeps track of reference counts
correctly and deallocates memory when it's no longer needed.
Can I say that the more "invisible instructions" a language has, the more
"dynamic" it is?
Yes, you're absolutely on the right track! The proportion of "invisible instructions" to "visible
instructions" could indeed be seen as a proxy for the dynamicity or runtime complexity of a
programming language. These "invisible instructions" represent the work that the runtime
system or compiler generates to manage the execution environment, ensuring correctness,
safety, and flexibility. Let's delve into this idea:
Languages that require more runtime checks, management, and guarantees tend to rely
heavily on these invisible instructions. They handle tasks that are either not specified explicitly
by the programmer or are abstracted away to improve usability and safety. A language's
dynamicity increases with the complexity and number of such operations, as it reflects a higher
degree of runtime intervention.
These "invisible instructions" manifest in many contexts: array bounds checks, dynamic type
tests, reference-count updates, and garbage-collection barriers are typical examples.
If we could measure the ratio of invisible to visible instructions in a program, it would give an
idea of how much the language/runtime abstracts from the programmer:
● Low Ratio (Static and Predictable): Languages like C or Assembly have minimal
runtime support. Most instructions are directly mapped to what the programmer writes.
Dynamicity is low.
● Moderate Ratio (Hybrid Models): Languages like C++, Rust, or Go offer safety
features such as bounds checking or concurrency primitives but limit runtime
intervention.
● High Ratio (Dynamic and Managed): Languages like Python, Java, or JavaScript rely
heavily on invisible instructions for runtime type checking, memory management, safety,
and flexibility.
Potential as a Metric
1. Dynamicity Isn't Just Quantity: Some languages insert a few highly complex invisible
instructions (e.g., JVM's bytecode verification) rather than numerous simpler ones.
2. Optimization Reduces Dynamism: Many "invisible instructions" can be optimized away
(e.g., just-in-time compilation may remove bounds checks for loops with known ranges).
3. Context Matters: The ratio may vary significantly depending on the program's domain
(e.g., numerical computing versus web applications).
Conclusion
Yes, you can use the proportion of invisible to visible instructions as a measure of a language's
runtime dynamicity, but it should be seen as one facet of the broader runtime behavior. This
perspective helps highlight the trade-offs between performance, safety, and abstraction in
programming languages.
Mark-and-sweep is one of the simplest and most foundational garbage collection algorithms. It
works in two phases: marking and sweeping. Here's how it operates:
1. Marking Phase:
○ Starting from a set of root references (e.g., global variables and local variables
on the stack), the garbage collector traverses the object graph and marks every
object it can reach.
2. Sweeping Phase:
○ After identifying all live objects, the garbage collector scans through the heap
memory and deallocates any object that was not marked during the marking
phase.
○ This frees up memory space that can later be reused by the program.
Imagine you have a program managing a heap with objects A, B, C, and D. Here's an example
of how the algorithm would work:
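Since the original figure is not reproduced here, the following minimal C sketch plays out one
such scenario: A points to B, while C and D form a cycle that is unreachable from the root.
Each object has at most one outgoing reference for brevity, and the sweeper prints instead of
calling free:

#include <stdbool.h>
#include <stdio.h>

#define HEAP_SIZE 4

typedef struct Obj {
    const char *name;
    bool marked;
    struct Obj *ref;               /* at most one outgoing reference */
} Obj;

/* The "heap": objects A, B, C, and D. */
Obj heap[HEAP_SIZE] = {
    {"A", false, NULL}, {"B", false, NULL},
    {"C", false, NULL}, {"D", false, NULL},
};

void mark(Obj *o) {
    while (o && !o->marked) {      /* follow references, marking as we go */
        o->marked = true;
        o = o->ref;
    }
}

void sweep(void) {
    for (int i = 0; i < HEAP_SIZE; i++)
        if (!heap[i].marked)
            printf("sweeping %s\n", heap[i].name);  /* would call free() */
        else
            heap[i].marked = false;                 /* reset for the next cycle */
}

int main(void) {
    heap[0].ref = &heap[1];        /* A -> B */
    heap[2].ref = &heap[3];        /* C -> D */
    heap[3].ref = &heap[2];        /* D -> C: a cycle, but unreachable */
    mark(&heap[0]);                /* marking phase: marks A and B */
    sweep();                       /* sweeping phase: reclaims C and D */
    return 0;
}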
Advantages of Mark-and-Sweep
1. Handles Cycles: Unlike reference counting, mark-and-sweep can handle cyclic
references (e.g., two objects referencing each other but unreachable from the roots).
2. Simple to Implement: Its conceptual model is straightforward and easy to implement.
3. No Reference Counting Overhead: Unlike reference counting, there’s no need to
increment or decrement counters during normal execution.
Disadvantages of Mark-and-Sweep
1. Pauses: The program is typically stopped while the collector traverses the heap
("stop-the-world" pauses), which hurts responsiveness.
2. Fragmentation: Freed objects leave holes in the heap, so memory can become
fragmented unless a compacting phase is added.
1. Java (JVM)
● Garbage Collector: Generational Garbage Collector (e.g., G1 GC, Parallel GC, ZGC,
Shenandoah).
● How It Works:
○ Memory is divided into generations: Young Generation (for short-lived objects)
and Old Generation (for long-lived objects).
○ Short-lived objects are collected frequently in the Young Generation, which
avoids scanning the entire heap.
○ Advanced garbage collectors like G1 GC and ZGC focus on minimizing pause
times for large-scale applications.
● Why?
○ Java programs often create many short-lived objects (e.g., temporary objects in
loops), so generational collection is efficient.
○ JVM targets enterprise applications, where responsiveness and scalability are
crucial.
2. Python (CPython)
● CPython relies primarily on reference counting, supplemented by a generational cycle
detector that reclaims reference cycles.
4. Go (Golang)
● Go uses a concurrent mark-and-sweep collector designed to keep pause times low,
even for large heaps.
6. Rust
● How It Works:
○ Rust uses a concept called ownership with borrow checking at compile time to
manage memory.
○ The compiler enforces strict rules about when memory can be accessed or freed,
ensuring safety without a runtime garbage collector.
● Why?
○ Rust is designed for systems programming, where deterministic performance and
minimal runtime overhead are essential.
9. Swift
● Swift uses Automatic Reference Counting (ARC): the compiler inserts the
retain/release operations at compile time.
10. Erlang
● Erlang gives each lightweight process its own private heap, collected independently,
which keeps collection pauses short.
Summary
● Generational GC (Java, C#) is great for managing short-lived and long-lived objects
efficiently.
● Reference Counting (Python, Swift) provides immediate deallocation but struggles with
cycles.
● Concurrent GC (Go) minimizes pause times for real-time systems.
● No GC (Rust) gives ultimate control and predictability at the cost of manual memory
management.
Each strategy reflects trade-offs between performance, responsiveness, simplicity, and safety!
Broader Contributions
McCarthy's work on garbage collection was transformative because it laid the foundation for
decades of advancements in memory management. Nearly all modern garbage collection
techniques (e.g., generational, concurrent, incremental) build on ideas rooted in McCarthy's
original invention.
Without McCarthy's contribution, programming languages might have evolved differently. His
invention of garbage collection popularized higher-level programming languages by abstracting
away low-level details like memory allocation and deallocation. For these reasons, John
McCarthy is a central figure in the history of garbage collectors and programming languages!
David Ungar is a highly influential figure in computer science, particularly in the fields of
programming languages, memory management, and software performance. Here are some of
his major contributions:
● Adaptive Optimization: Ungar's research into optimizing dynamic languages like Self
led to advancements in adaptive compilation, a technique that analyzes program
behavior at runtime and dynamically optimizes performance-critical paths.
○ This research inspired many modern JIT systems, such as the ones used in
Java, Python, and JavaScript engines like V8 and SpiderMonkey.
● Innovative User Interfaces: Ungar's work on Self extended beyond the language itself
to include the design of highly interactive and intuitive programming environments.
● In more recent years, Ungar has focused on research related to leveraging multicore
processors for high-performance software.
● His work explores how language runtime systems and programming paradigms can be
adapted to take better advantage of parallel computing architectures.
● Ungar has been recognized for his contributions with various honors, including the ACM
Programming Languages Achievement Award in 2009.
● His work remains a cornerstone of modern programming language research and
development.
Legacy
David Ungar’s influence spans both theoretical and practical domains of computer science. His
work on memory management, programming language design, and performance optimization
has made dynamic languages more powerful and accessible, impacting millions of developers
worldwide.
1. Loop Optimizations
● Loop Unrolling: Duplicates the loop body multiple times to reduce the overhead of loop
control and increase instruction-level parallelism.
● Loop Fusion: Combines adjacent loops that iterate over the same range to reduce loop
overhead and improve cache performance.
● Loop Invariant Code Motion: Moves computations that are constant across iterations
outside the loop.
● Loop Tiling: Breaks large loops into smaller blocks to improve data locality and cache
usage.
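As a concrete illustration of the first of these transformations, here is loop unrolling done by
hand (a sketch; compilers apply this automatically and pick their own unroll factors):

int sum_array(const int *a, int n) {
    int sum = 0;
    int i = 0;
    /* Unrolled by 4: one loop test per four elements instead of one per element. */
    for (; i + 3 < n; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; i++)
        sum += a[i];
    return sum;
}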
2. Inline Expansion
● Replaces a function call with the function's body. This transformation has two benefits.
First, it eliminates the overhead of the call, which is particularly important for small
functions. Second, and most importantly, inlining makes other compiler optimizations
context-sensitive, meaning that they can benefit from information that is available at the
point where the function is called.
3. Constant Folding
● Evaluates expressions whose operands are known at compile time (e.g., replacing
3 * 4 with 12).
4. Dead Code Elimination
● Removes code that does not affect the program's output, such as variables or
statements that are never used.
5. Common Subexpression Elimination
Identifies and eliminates repeated computations of the same expression. For example:
// Before:
a = b * c + d * e + b * c;
// After:
temp = b * c;
a = temp + d * e + temp;
6. Strength Reduction
Replaces expensive operations with cheaper equivalents. For example, instead of computing
y = i * 5 with a multiplication on every iteration, the compiler maintains a running sum:
int y = 0;
for (int i = 0; i < n; i++) {
    // here y == i * 5, with no multiplication
    y += 5;
}
7. Register Allocation
● Maps program variables to the limited set of CPU registers, spilling the excess to
memory.
8. Peephole Optimization
Focuses on small sequences of instructions, replacing inefficient patterns with more efficient
ones. For example, replacing the pair:
mov a, b
mov b, a
with the single instruction:
xchg a, b
● Tail Call Optimization: Converts recursive tail calls into loops. This optimization has two
main benefits. First, it saves stack space: whereas we would need multiple activation
records for the function, now we need only one! Second, it enables loop optimizations to
kick in! Most compilers perform this transformation automatically.
● Interprocedural Optimizations: optimizations that analyze and modify code across
multiple functions or modules. Examples include:
○ Inline Function Expansion
○ Constant Propagation Across Function Boundaries
12. Vectorization
● Transforms scalar operations into vector operations that can take advantage of SIMD
(Single Instruction, Multiple Data) hardware capabilities.
● Instruction Scheduling: reorders instructions to avoid pipeline stalls in the CPU while
preserving the program's semantics.
Levels of Optimization
These optimizations are applied automatically by modern compilers, but developers can also
fine-tune or disable specific optimizations as needed for debugging or performance-critical
applications.
Optimizing a program is an investment: the more time we spend on preparation, the better the
program can perform during execution. However, this investment only pays off if:
● The program runs frequently (e.g., system utilities like ls, cp).
● The program has a long execution lifespan (e.g., server applications, simulations).
● Low preparation time: Shell scripts (e.g., Bash) are designed for rapid iteration and
minimal preparation. They delegate the heavy lifting to other programs (e.g., GNU Core
Utils), making extensive optimization unnecessary.
● High preparation time: Compiled languages like C or Rust prioritize execution
efficiency, making them ideal for programs executed millions of times or in
performance-critical environments.
● For short-lived or rarely executed programs, the preparation cost outweighs potential
execution savings, so minimal optimization is warranted.
● For programs executed at scale or over long durations, the execution savings far exceed
the preparation cost, justifying significant optimization efforts.
This tension between preparation time and execution time is well reflected in the design and
purpose of programming languages:
The figure contrasts the trade-off between program preparation time (time to compile or
otherwise prepare the code for execution) and program execution time (time taken to actually
run the program). The x-axis represents the preparation time, while the y-axis represents the
execution time. It compares approaches for running C programs (compiled using clang or gcc)
and JavaScript programs (executed using various compilation modes of the V8 engine). The
figure below shows some actual data, obtained from this website. The figure compares the
compilation time and the running time of three different programs (P3_SL, P1_A and P4_SL are
randomly generated programs) compiled with gcc at its many optimization levels on a typical
Intel i7:
Key Observations:
● C Compilation illustrates a static optimization strategy: you decide upfront how much
preparation (compilation) time you're willing to invest, based on expectations of how the
program will run.
● JavaScript (V8) demonstrates a dynamic optimization approach: it begins with minimal
preparation and adjusts on the fly as the program runs longer or more often, balancing
the tension between preparation and execution over time.
Practical Implications:
In contrast, C operates under a static optimization strategy, where you choose the level of
optimization before execution. The decision is influenced by the program's lifecycle:
● Development: Frequent recompilation makes fast compile times and debuggable builds
(-O0) more valuable than runtime performance.
● Production: Programs that will run many times or handle heavy workloads benefit from
higher optimization (-O2 or -O3), as the compile-time cost is outweighed by the runtime
gains.
The multiple optimization levels in gcc and clang exist because real-world scenarios often
require nuanced trade-offs between compilation time, binary size, and execution speed. Here's
why we have four levels:
While it may seem simpler to have just -O0 (no optimization) and -O3 (maximum optimization),
the real world often demands intermediate trade-offs for the following reasons:
Summary
The intermediate levels (-O1 and -O2) allow developers to choose optimizations tailored to
specific scenarios. This granularity reflects the diversity of software development needs, ranging
from quick debugging (-O0) and iterative testing with moderate optimizations (-O1) to
performance-critical production builds (-O2) and highly optimized compute-intensive
applications (-O3). This flexibility ensures that developers can strike the right balance between
preparation and execution times, even in statically optimized languages like C.
Notice, however, that the difference between the optimization levels depends heavily on the
compiler. In GCC, it's easier to notice a performance difference between the code produced at
the -O2 and -O3 optimization levels, as the figure below, taken from this paper, suggests:
However, similar experiments performed on clang/LLVM seem to yield very similar performance
results between these two optimization levels, as seen in this article by Lucas Victor, which
compiles large, randomly generated programs:
So, an optimization in, say, clang, is carried out by an LLVM pass. What is
this? Which kinds of passes exist?
You're exactly right! In LLVM (Low-Level Virtual Machine), an optimization pass is a modular
unit of work that analyzes or transforms the Intermediate Representation (IR) of a program.
Passes in LLVM are the building blocks for both program analysis and transformation. Let’s
dive into the details.
Passes can:
1. Analyze the program to gather information (e.g., dependency graphs, alias information).
2. Transform the program (e.g., eliminate unused code, inline functions).
3. Be combined to form pass pipelines, where the results of one pass feed into others,
forming a sequence of optimizations and analyses.
LLVM passes are typically categorized based on their purpose and how they operate:
1. Analysis Passes
● These gather information about the program but do not modify it.
● Other optimization passes rely on the results of these analysis passes to make informed
decisions.
● Alias Analysis (AA): Determines if two pointers might refer to the same memory.
● Loop Analysis: Analyzes properties of loops, such as induction variables and bounds.
● Call Graph Analysis: Builds a graph of function calls in the program.
● Dominance Analysis: Computes dominance relationships between basic blocks in a
control flow graph.
These analyses provide key insights about the structure, dependencies, and behavior of the
code.
2. Transform Passes
● These modify the IR to improve performance, reduce code size, or enhance other
program properties.
● Transform passes can use information provided by analysis passes to guide their
changes.
● Dead Code Elimination (DCE): Removes instructions or blocks of code that have no
effect on the program’s output.
● Loop Unrolling: Expands loops to reduce loop overhead and enable further
optimizations.
● Function Inlining: Replaces a function call with the actual body of the function to reduce
call overhead.
● Instruction Combining: Merges instructions (e.g., combining x = a + b; y = x *
2 into y = (a + b) * 2) for efficiency.
3. Utility Passes
● These provide supporting functionality rather than optimizations proper. Examples
include the IR verifier (verify) and passes that print the IR or analysis results for
debugging.
LLVM has a Pass Manager that orchestrates the execution of passes. Here’s how passes
interact:
● Constant Propagation:
○ Analysis: Find all variables that are constants throughout their lifetime.
○ Transformation: Replace those variables with their constant values.
● Dead Store Elimination:
○ Analysis: Use alias analysis to find memory writes that are never read.
○ Transformation: Remove those writes.
Summary
1. Analysis passes gather information about the program but don’t change it.
2. Transform passes use that information to optimize the program.
This separation allows LLVM to build efficient and flexible optimization pipelines tailored to
different goals, whether it’s reducing binary size, improving runtime performance, or preparing
code for specific hardware platforms.
1. Front-End Optimizations
The front-end is responsible for parsing the source code, performing lexical and syntactic
analysis, and converting it into an Intermediate Representation (IR). Optimizations here are
usually language-specific and aim to clean up or simplify the input program.
Advantages:
● Works close to the source language, so it can exploit language-level semantics (e.g.,
early constant folding, desugaring, or language-specific simplifications).
However, because the front-end works at a higher level of abstraction (close to the original
source code), it’s limited in its ability to perform deep, target-independent optimizations.
2. Middle-End Optimizations
The middle-end is where the bulk of optimizations happen. Here, the program is represented in
a mostly target-independent intermediate representation (IR), such as LLVM IR. Since this
stage is meant to be independent of both the input language and the target hardware,
optimizations are focused on improving program performance and efficiency in a generalized
way.
● Loop Optimizations:
○ Loop Unrolling: Expands the loop body to reduce loop overhead.
○ Loop-Invariant Code Motion (LICM): Moves calculations out of loops if their
results don’t change.
● Data Flow Optimizations:
○ Dead Code Elimination (DCE): Removes unused variables or instructions.
○ Constant Propagation: Replaces variables with known constant values.
● Control Flow Optimizations:
○ Branch Prediction: Reorganizes branches to improve runtime performance.
○ Simplification of Control Flow Graphs (CFGs).
● Memory Optimizations:
○ Alias Analysis: Determines whether pointers can refer to the same memory.
○ Scalar Replacement: Replaces memory accesses with registers when possible.
Advantages:
● Target-independent: the same optimizations can be reused for every source language
and target architecture that shares the IR.
3. Back-End Optimizations
The back-end generates target-specific code (e.g., x86, ARM) from the IR. Optimizations here
take the target architecture’s characteristics into account, focusing on register allocation,
instruction scheduling, and code generation.
● Instruction Scheduling:
○ Rearranges instructions to minimize pipeline stalls in the CPU.
● Register Allocation:
○ Maps variables to a limited number of CPU registers.
● Peephole Optimization:
○ Looks for small sequences of instructions that can be replaced with more efficient
ones (e.g., combining two instructions into one).
● Vectorization:
○ Converts scalar operations to vector operations, taking advantage of SIMD
(Single Instruction, Multiple Data) instructions.
● Target-Specific Optimizations:
○ Exploiting hardware-specific features, like specialized instructions or caches.
Advantages:
● Tailors the code to the specific hardware, squeezing out the last bit of performance.
● Ensures the generated binary adheres to the target platform's constraints (e.g., calling
conventions, instruction set).
● The front-end knows the most about high-level language semantics, so it can leverage
this information for optimizations like constant folding or inline expansion.
● The middle-end focuses on platform-independent improvements, which allows the
compiler to reuse these optimizations across many target architectures.
● The back-end is where the hardware constraints are considered, so it’s the only stage
that can fully exploit platform-specific features.
Summary
While most optimizations occur in the middle-end, each stage of the compiler contributes to the
overall process. Together, these stages ensure the program is both efficient and correct for the
target platform.
opt vs clang
● clang:
○ A compiler driver that handles the entire compilation process, from front-end
parsing of source code to back-end code generation and optimization.
○ clang invokes the LLVM middle-end optimizations as part of its pipeline.
● opt:
○ A standalone tool that applies analysis and transformation passes directly to
LLVM IR (textual .ll files or bitcode).
○ Specifically used for the middle-end stage. It is most useful for developers
working with LLVM IR directly, allowing them to apply and debug optimization
passes.
● Transform Passes:
○ mem2reg: Promote memory variables to SSA registers.
○ loop-unroll: Unroll loops.
○ gvn: Perform Global Value Numbering.
○ inline: Inline function calls.
○ simplifycfg: Simplify control flow graphs.
● Analysis Passes:
○ instcount: Count instructions in the IR.
○ loop-info: Print loop information.
○ callgraph: Generate the program's call graph.
Example Workflow
The figure below illustrates a typical usage of opt, showing the transformation pipeline:
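Since the figure is not shown here, a typical command-line workflow looks roughly like this (the
pass names are real, but the exact flags vary across LLVM versions):

# 1. Lower C to LLVM IR (disable the optnone attribute that -O0 adds):
clang -O0 -Xclang -disable-O0-optnone -S -emit-llvm input.c -o input.ll
# 2. Run selected middle-end passes on the IR:
opt -S -passes='mem2reg,instcombine,simplifycfg' input.ll -o input.opt.ll
# 3. Compile the optimized IR to an executable:
clang input.opt.ll -o a.out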
Conclusion
The opt tool is an integral part of LLVM's middle-end, enabling you to experiment with and
debug optimization passes directly on IR. It complements tools like clang, which abstract away
the middle-end optimizations for the typical user.
So, how can I see all the passes that opt runs on a program, at a given
optimization level?
Starting with LLVM 13, you can use the -print-pipeline-passes option to directly list all
passes in the optimization pipeline.
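For example, one plausible invocation (the pipeline-string syntax may vary across LLVM
versions):

opt -passes='default<O2>' -print-pipeline-passes -disable-output input.ll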
It seems that each level higher than O0 uses lots of different passes. Who
determines these sequences of passes?
Determining the best sequence of optimization passes in a compiler is an incredibly challenging
problem due to several factors:
1. Number of Passes: Modern compilers like Clang offer a large number of optimization
passes, each targeting different aspects of performance or code size.
2. Order Sensitivity: The effectiveness of these passes often depends on their ordering,
as the output of one pass serves as the input for subsequent ones.
3. Iterative Applications: Some passes can be applied multiple times, interleaved with
other passes, to maximize their effectiveness.
4. Program-Specific Variability: The optimal sequence for one program may differ
significantly from another, depending on their structure and behavior.
In Clang, the optimization levels (-O1, -O2, -O3, etc.) are carefully crafted to balance these
challenges and provide reasonable defaults for general-purpose programming. These
sequences are designed through extensive experimentation and are continuously refined by the
LLVM community.
Certain LLVM passes naturally complement each other and are often grouped together: for
example, function inlining (inline) exposes opportunities for scalar replacement (sroa) and
mem2reg, whose output in turn feeds value numbering (gvn) and control-flow simplification
(simplifycfg).
Community-Driven Refinements
Finding effective pass sequences requires a blend of expertise and empirical studies.
Researchers and practitioners collaborate to refine these sequences, as exemplified in works
like the paper New Optimization Sequences for Code-Size Reduction for the LLVM Compilation
Infrastructure.
LLVM-based compilers like Swift and Rust configure their optimization pipelines to align with
their language-specific requirements:
Custom Pipelines
Not all languages adopt LLVM's default pipelines. For instance, the Burst compiler for C#
abandoned the default LLVM pipeline in favor of a custom one tailored to its requirements (See
the bits around "we long since abandoned the default LLVM pass pipeline for a custom one").
This demonstrates the importance of understanding the trade-offs and benefits of different
pipeline configurations for each use case.
To understand the LLVM pass pipelines better, you can take a look at LLVM's default pass
sequences and similar resources.
Conclusion
The design of optimization pipelines is a blend of art and science. While Clang’s default
pipelines are rooted in years of experimentation, ongoing research and language-specific
adaptations continue to drive innovation in this field.
-O0 (No Optimization)
● Description:
○ -O0 focuses on fast compilation with no optimizations applied.
○ Only essential passes for correctness are included:
■ verify: Ensures the IR is valid.
■ always-inline: Forces inlining of functions marked as
always_inline.
■ Minimal other passes (e.g., CoroConditionalWrapper for coroutine-related
transformations).
● Use Case:
○ Ideal for debugging and development, as the output closely mirrors the source
code.
○ Simplifies debugging by preserving variable names and function calls.
-O1 (Light Optimizations)
● Passes:
○ Includes a wide range of optimizations for:
■ Control flow simplification: simplifycfg
■ Memory optimization: mem2reg, sroa (promotes memory to registers).
■ Dead code elimination: deadargelim.
■ Function inlining: Limited to small or mandatory cases.
■ Loop transformations: Basic loop rotation, unswitching.
■ Call graph-based optimizations: cgscc, inline for necessary cases.
● Description:
○ Performs lightweight optimizations to reduce execution time without significantly
increasing compilation time or binary size.
○ Intermediate transformations include dead code elimination, early constant
propagation, and some inlining.
● Use Case:
○ Suitable for early testing when faster execution is desired without incurring
significant compile-time overhead.
-O2 (Moderate Optimizations)
● Passes:
○ Builds on -O1 and includes:
■ Global Value Numbering (gvn): Removes redundant computations.
■ Aggressive inlining and loop transformations: Additional loop unrolling
(loop-unroll), vectorization (loop-vectorize), and speculative
execution.
■ More advanced control flow optimizations: Jump threading, correlated
propagation.
■ Scalar Replacement of Aggregates (sroa): Breaks down memory
objects into scalars.
■ Dead store elimination (dse): Removes unnecessary memory writes.
● Description:
○ Focuses on a balance between compile time, runtime performance, and
binary size.
○ Includes almost all major optimizations except the most aggressive ones from
-O3.
● Use Case:
○ Default choice for production builds, providing a good balance of performance
and compilation time.
-O3 (Aggressive Optimizations)
● In contrast to gcc, there is not much difference between clang -O2 and clang -O3. In
fact, Emery Berger et al. have reported that there's basically no statistical difference
between these two optimization levels. See, for instance, this video (at 23:58).
Key Differences
● -O0: For debugging, where fast compilation and a faithful source-to-binary mapping are
more important than runtime performance.
● -O1: For development and testing phases when you want some runtime improvement
without much additional compile-time cost.
● -O2: The go-to optimization level for production, striking a balance between compilation
time and runtime performance.
● -O3: For highly compute-intensive applications where maximum runtime performance is
required, even at the cost of compilation time and binary size.
Would you like me to elaborate on specific passes or discuss more about trade-offs in real-world
scenarios?
Can you show me an example that illustrates the difference between -O2
and -O3?
Absolutely! Let's use gcc instead of clang in this case, as the difference is much more
noticeable:
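The original listing appeared in a figure; the dot function under discussion is presumably a
standard dot product, something like:

int dot(int *a, int *b, int n) {
    int result = 0;
    for (int i = 0; i < n; i++)
        result += a[i] * b[i];
    return result;
}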
When you compile your dot function with -O3 instead of -O2, the generated code grows
significantly due to vectorization, loop unrolling, and the generation of multiple execution
paths to maximize runtime performance. Let’s analyze what happens and why the code grows
so much at -O3.
For example:
movdqu (%rdi,%rax), %xmm1 ; Load multiple elements into a SIMD register.
pcmpeqd %xmm2, %xmm0 ; Compare SIMD vectors for equality.
pandn %xmm1, %xmm0 ; Perform vectorized operations.
● SIMD instructions process multiple elements at once, but preparing data for SIMD and
gathering results afterward require additional instructions.
● For example:
○ Loading data into SIMD registers (movdqu).
○ Masking operations (pcmpeqd, pandn).
○ Shuffling or reordering operations to align data (pshufd).
● The vectorized loop (.L4) is more complex than the scalar loop in -O2 because it uses
instructions optimized for parallel processing.
● Vectorized instructions often require aligned memory accesses. The compiler generates
code to:
○ Check if the arrays are aligned.
○ Handle unaligned memory accesses gracefully.
For example:
cmpq $30, %r8 ; Check if alignment and size allow vectorization.
je .L3 ; If not, fallback to scalar code.
4. Branch Handling
● The -O3 code also contains extra branches that select, at run time, among the
vectorized path, the alignment fix-up code, and the scalar fallback.
At -O2: The scalar loop directly operates on one element at a time, making the code compact.
At -O3: The loop is transformed into a SIMD loop that processes multiple elements per
iteration.
Summary
At -O3, the compiler trades larger, more complex code for higher throughput. This trade-off is
worth it for performance-critical workloads on large datasets, but it might be
excessive for smaller inputs or memory-constrained environments. Would you like further details
on specific instructions or optimizations?
Are there ways for the program to "talk" to the compiler about how to
optimize the code? At least in C or similar system languages?
Certainly! Improving the output of a C compiler like GCC through better source code practices is
a fascinating area of discussion. Here are additional strategies to help the compiler generate
better code:
Instead of macros, prefer inline functions for small, frequently used operations. Inline
functions are type-checked, evaluate their arguments exactly once, and still give the compiler
the freedom to inline the call.
Example:
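A minimal sketch (SQUARE and square are illustrative names):

/* A macro expands textually and evaluates its argument twice:
   SQUARE(i++) increments i twice. */
#define SQUARE(a) ((a) * (a))

/* An inline function is type-checked, evaluates its argument once,
   and the compiler can still inline the call away. */
static inline int square(int a) { return a * a; }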
hot/cold functions: Mark functions likely to be called frequently (hot) or rarely (cold).
void critical_function() __attribute__((hot));
void error_handler() __attribute__((cold));
noinline: Prevent inlining for debugging or when inlining could negatively affect cache locality.
Compilers optimize for locality, but writing code that naturally adheres to memory access
patterns helps. Strategies include:
● Structure of Arrays (SoA) over Array of Structures (AoS) for SIMD-friendly code.
● Group frequently accessed variables together to minimize cache line misses.
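A small sketch of the AoS-versus-SoA distinction (type and function names are illustrative):

/* Array of Structures: the x, y, z of each point are interleaved,
   so a loop over all x fields strides through memory. */
struct PointAoS { float x, y, z; };

/* Structure of Arrays: all x values are contiguous, which is
   friendlier to the cache and to SIMD loads. */
struct PointsSoA {
    float x[1024];
    float y[1024];
    float z[1024];
};

float sum_x(const struct PointsSoA *p, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += p->x[i];              /* contiguous, easily vectorized */
    return s;
}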
You're already familiar with restrict, but ensure that its use is backed by rigorous analysis.
Compilers are cautious with optimizations around pointers due to aliasing concerns.
Example:
void add_arrays(
int *restrict a,
int *restrict b,
int *restrict c, size_t n
) {
for (size_t i = 0; i < n; i++) {
c[i] = a[i] + b[i];
}
}
Write loops in a way that enables the compiler to unroll them and apply vectorization:
Example:
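The original example was lost in extraction; a loop in this spirit might be (the function is
illustrative):

#include <stddef.h>

/* A countable loop with independent iterations and no aliasing hazards
   is a prime candidate for auto-vectorization. */
void scale(float *restrict dst, const float *restrict src, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = 2.0f * src[i];
}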
Compilers like GCC (-O2 or -O3) will attempt vectorization automatically if the loop structure is
simple enough.
7. Reduce Indirection
Avoid excessive levels of pointer dereferencing or linked data structures if arrays or flat data
structures can suffice. Flat structures are often easier for compilers to optimize.
Example:
struct FlatArray {
int data[100];
};
Inline calculations or use compiler-inlinable helper functions for small computations. Minimize
calls to external functions, as they break optimization opportunities like inlining and instruction
scheduling.
Avoid marking variables volatile unless necessary (e.g., memory-mapped I/O, signal
handlers). volatile prevents optimization, including register allocation and instruction
reordering.
Code that invokes undefined behavior (UB) often prevents the compiler from applying
aggressive optimizations. Examples of UB include signed integer overflow, out-of-bounds array
accesses, and dereferencing dangling pointers.
Run the program with typical workloads and use GCC's profiling tools:
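With GCC, profile-guided optimization typically proceeds in three steps (program and input
names are illustrative):

# 1. Build an instrumented binary.
gcc -O2 -fprofile-generate prog.c -o prog
# 2. Run it on representative inputs; this writes *.gcda profile files.
./prog typical_input.txt
# 3. Rebuild, letting GCC consult the collected profile.
gcc -O2 -fprofile-use prog.c -o prog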
Write conditional statements to favor predictable execution paths, especially in critical sections.
Example:
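The original example was lost; one standard way to express this in GCC or Clang is the
__builtin_expect builtin (handle_error and process are illustrative names):

#include <stdio.h>

void handle_error(void) { fprintf(stderr, "error!\n"); }

void process(int err) {
    /* Tell the compiler the error branch is unlikely, so the hot
       path stays straight-line and well-predicted. */
    if (__builtin_expect(err != 0, 0)) {
        handle_error();
    }
}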
Function pointers can degrade performance as the compiler often cannot predict which function
will be called, impacting branch prediction and instruction prefetching.
Example:
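A small illustrative sketch (all names are hypothetical):

static int add(int a, int b) { return a + b; }

/* Direct call: the target is known, so the compiler can inline it. */
int direct(int x, int y) { return add(x, y); }

/* Indirect call: the target is only known at run time, which blocks
   inlining and hurts branch prediction. */
int indirect(int x, int y, int (*op)(int, int)) { return op(x, y); }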
These techniques require understanding the workload and profiling output to maximize benefits.
Combining them with careful benchmarking and testing can significantly improve the efficiency
of compiled code.
Let's call the function without the restrict keyword dot0 and let's call the function with that
keyword dot1. See below the two CFGs that we might get with the following compilation lines:
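The command lines were part of the original figure; they were presumably along these lines
(the flags here are an educated guess):

clang -O2 -S -emit-llvm dot.c -o dot.ll
opt -passes=dot-cfg -disable-output dot.ll   # writes one .dot CFG file per function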
You will notice that we have two kinds of loops in each CFG. Each loop has a different purpose:
● Loops 1 and 3 are vectorized loops for dot0 and dot1, respectively, triggered when
iterations exceed vector lane requirements and memory safety checks (if any) pass.
● Loops 2 and 4 are fallback loops handling small iteration counts, aliasing (for dot0),
and leftover iterations.
● The removal of runtime checks in dot1 relies on the programmer’s guarantee, which
must be upheld to ensure correctness.
When the restrict keyword is not present, the compiler assumes that the arrays may alias
(i.e., overlap in memory). This uncertainty requires the inclusion of runtime memory checks to
ensure correctness under all scenarios.
Before the vectorized loop (Loop 1) runs, a vector.memcheck block verifies that the memory
regions accessed through the different pointers (e.g., a, b, and r) do not overlap.
If these checks fail, execution falls back to the scalar loop (Loop 2), ensuring correctness under
all circumstances.
The restrict keyword guarantees to the compiler that the arrays do not alias. This assurance
allows the compiler to skip runtime memory checks and optimize the code more aggressively.
Key Differences
● No Runtime Memory Checks: Unlike dot0, dot1 does not require the
vector.memcheck block because the restrict keyword guarantees non-overlapping
memory.
● Simplified Control Flow: The removal of memory checks reduces the overhead,
resulting in a leaner and more efficient CFG.
While dot0 guarantees correctness even with overlapping arrays (via runtime checks), dot1
relies on the programmer to uphold the restrict contract. If the programmer violates this
contract (e.g., by passing overlapping arrays to dot1), the behavior of the program is undefined
and may produce incorrect results.
Conclusion
The use of the restrict keyword in dot1 enables the compiler to generate more efficient
code by eliminating runtime memory checks and simplifying the control flow. However, this
optimization comes with the responsibility of ensuring that the contract of non-aliasing is upheld.
In contrast, dot0 provides robustness at the cost of additional runtime checks and a more
complex control flow.
By carefully analyzing the control flow graphs, we can see how the presence or absence of
restrict shapes the compiler’s optimization decisions and impacts performance and
correctness guarantees.
But, even without restrict, clang can still vectorize the function! How's
that possible?
Clang is able to do it with a runtime check! The vector.memcheck block in LLVM IR checks
for potential memory aliasing conflicts between the input arrays (a, b, and r) to determine
whether the vectorized loop can safely run. If a conflict is detected, it redirects the control flow to
a scalar fallback loop that remains correct under aliasing.
This logic ensures that vectorization occurs only when the input arrays are guaranteed not to
overlap, preserving correctness.
1. Static Analysis
● At compile time, the compiler uses alias analysis to try to prove that the pointers cannot
overlap; when it succeeds, no runtime check is needed.
2. Dynamic Analysis
● When the static analysis is inconclusive, the compiler emits runtime checks (such as the
vector.memcheck block) that compare the actual pointer ranges before choosing a
code path.
● Static Analysis Strengths: It's fast and avoids runtime overhead but limited by
conservative assumptions (e.g., it may not know pointer relationships or sizes at compile
time).
● Dynamic Analysis Strengths: It's precise, as it evaluates actual memory regions at
runtime, but introduces some overhead.
This combination gives the compiler the best of both worlds:
1. Safety: The program behaves correctly even when memory aliasing occurs.
2. Performance: It uses vectorized code paths whenever possible, maximizing speed.
This dual approach is central to modern compiler optimizations and is well-captured in the
techniques described in the Runtime Pointer Disambiguation paper. It's one of the reasons
LLVM-based compilers like Clang are so effective in balancing safety and performance!
So, code optimization is the art of moving into compilation time as much of
the computations that we could do at execution time?
Well, in a way! Compilers are very good at solving, at compilation time, computations that
would otherwise be carried out at execution time. Consider the following example, taken from
Stefanos Baziotis' webpage:
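The figure is not reproduced here; the function in question is presumably the standard Collatz
step counter, something like:

unsigned collatz(unsigned n) {
    unsigned count = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        count++;
    }
    return count;
}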
The puzzling behavior that the figure shows is due to an optimization that the LLVM community
likes to call Scalar Evolution. But even without it, a compiler can still achieve a similar result via
a combination of constant propagation, loop unrolling, and dead-code elimination. Let me
explain what's happening in detail:
● Constant Propagation and Folding: When the collatz function is invoked with a
constant argument (like 6 or 250), the compiler tries to evaluate the result at compile
time. This is part of constant propagation and folding, where the compiler replaces
computations with their results to generate more efficient code.
● Loop Unrolling and Iteration Limits: For certain inputs, the compiler attempts to
evaluate loops at compile time. However, this has practical limits:
○ If the loop is simple and the number of iterations is small, the compiler fully
evaluates it.
○ If the loop becomes too complex (e.g., too many iterations), the compiler gives up
and emits the original loop code.
For collatz(6), scalar evolution simulates the function at compile time, fully computing the
number of iterations (8), because the Collatz sequence for 6 is relatively short:
6 → 3 → 10 → 5 → 16 → 8 → 4 → 2 → 1. Thus, clang is able to replace the call to
collatz(6) with a single instruction, e.g.:
mov eax, 8
ret
For collatz(250), the sequence is longer: 250 → 125 → 376 → 188 → 94 → 47 → 142 → ...
→ 1. Here, the compiler likely hits its threshold on how many iterations scalar evolution will
simulate. Instead of fully evaluating the function, it emits optimized loop code, but
does not compute the result at compile time.
For collatz(249), the sequence is much shorter (47 iterations), so the compiler successfully
evaluates it at compile time and replaces the loop with a constant:
mov eax, 47
ret
● Clang 17.0.1: This version doesn't aggressively optimize the collatz loop at -O1. It
emits runtime loop code for all cases.
● Clang 18.1.0: This version likely has improvements in constant propagation and partial
evaluation, enabling it to evaluate short Collatz sequences at compile time.
● Optimization Level: Higher levels (-O2, -O3) often lead to more aggressive
optimizations.
● Hardcoded Limits: There are usually fixed thresholds for the number of iterations or
computation steps the compiler will attempt to evaluate.
For example:
● LLVM has a threshold for loop unrolling and constant folding controlled by internal
parameters like UnrollThreshold or ConstExprLimit.
● If these thresholds are exceeded, the compiler stops partial evaluation and emits the
loop code.
Key Takeaway
What you're observing is a perfect example of the compiler solving problems at compile time
when it’s computationally feasible and beneficial for performance. However, practical limits exist
to prevent excessive compile times, which is why the behavior changes for larger or more
complex inputs!
But a compiler can't solve everything at compilation time, right? What are
the limits of optimizing compilers?
Great question! Optimizing compilers are powerful tools, but they have inherent limits stemming
from both theoretical and practical considerations. Let’s break this down:
For loops with bounds and computations that are statically known (e.g., summing integers from
1 to 10), a compiler can replace the loop with a constant. However:
● If the loop depends on runtime values or inputs, the compiler generally can’t remove it
outright, because doing so risks changing the program’s behavior.
● If the loop’s termination condition can’t be proven at compile-time, the compiler may
conservatively leave it unchanged.
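For instance, mainstream compilers at -O1 and above typically fold the following function into a
single return 55 (the function name is illustrative):

int sum_to_ten(void) {
    int s = 0;
    for (int i = 1; i <= 10; i++)
        s += i;                    /* bounds and computation are statically known */
    return s;
}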
Bottom Line
Optimizing compilers can perform impressive transformations, but they are fundamentally
limited by undecidability (no analysis can predict the behavior of every program on every
input), by the obligation to preserve the program's observable behavior, and by practical limits
on compilation time. These limits ensure that optimizations are effective yet safe, and that
compilers terminate within reasonable bounds.
1. Creation of LLVM
● Lattner began LLVM as a research project at the University of Illinois, conceiving it as a
set of modular, reusable compiler libraries built around a common intermediate
representation.
2. Development of Clang
● At Apple, he led the development of Clang, an LLVM front end for C, C++, and
Objective-C that offers fast compilation and precise diagnostics.
● Promoted open collaboration by making LLVM, Clang, and Swift widely accessible as
open-source projects.
● Inspired the development of other tools and languages that build on or extend LLVM,
such as Julia, Kotlin Native, and the Rust compiler.
● Played a leadership role at companies like Google (TPU development) and Tesla
(Autopilot software) to integrate machine learning into high-performance systems.
● Focused on AI infrastructure and compiler technologies that optimize machine learning
workflows.
● Influenced modern computer science education through LLVM and its use in teaching
compiler design and optimization.
● Contributed to industry adoption of LLVM in production environments, from browsers
(e.g., Chrome) to high-performance computing systems.
● Advocated for modular design principles in compiler infrastructure, which have been
widely adopted in subsequent tools and frameworks.
● Demonstrated how such modularity accelerates innovation in language design and
software tooling.
Chris Lattner’s work on LLVM and its ecosystem revolutionized compiler design, while his
creation of Swift brought modern, performant programming to the masses. His contributions
continue to influence both academia and industry.
Data-Flow Analyses
Code: https://github.com/pronesto/DCC053_Public/tree/main/23_DataFlow
1. Static Analysis
● Control Flow Analysis (CFA): Builds a control flow graph (CFG) to determine the flow
of execution between basic blocks. This is crucial for optimizations like dead code
elimination and loop unrolling.
● Data Flow Analysis: Tracks how data values propagate through the program. Examples
include reaching definitions, live variable analysis, and available expressions.
● Alias Analysis: Determines whether two pointers or references can point to the same
memory location. Precise alias analysis is critical for optimizations like loop invariant
code motion and memory access reordering.
● Type Analysis: Identifies the types of variables and expressions, enabling type-specific
optimizations like vectorization.
● Loop Dependency Analysis: Determines dependencies between iterations of loops,
enabling parallelization or vectorization while avoiding data races.
● Memory Dependency Analysis: Ensures that optimizations like instruction reordering
or prefetching do not violate memory consistency.
2. Intermediate Representations
● Compilers translate source code into an IR (e.g., LLVM IR, HIR/MIR in Rust) that is
simpler and more uniform. This facilitates analysis and enables transformations without
worrying about high-level syntactic details or low-level machine specifics.
3. Symbol Tables
● The compiler uses symbol tables to track variables, functions, and their properties (e.g.,
type, scope, storage duration). This information ensures correct variable access and
assists in optimizations like constant propagation.
4. Profiling Information
● Profiling data can guide optimizations by identifying frequently executed code paths ("hot
spots"). Techniques like feedback-directed optimization (FDO) and profile-guided
optimization (PGO) use runtime data to improve decisions such as inlining and loop
unrolling.
5. Language-Level Hints and Annotations
● Language features like const, restrict, or final provide explicit guarantees about
immutability or aliasing. These hints allow compilers to apply more aggressive
optimizations safely.
● Pragmas or compiler-specific annotations (e.g., #pragma in C/C++) let developers
provide additional guidance.
6. Speculative Analysis
● Some optimizations are speculative and depend on assumptions that can be checked at
runtime. For example, speculative execution or value speculation optimizations are
guarded by runtime checks to ensure safety.
7. Compiler Heuristics
● Compilers often use heuristics based on common patterns and trade-offs, such as when
deciding whether to inline a function or allocate variables in registers.
8. Safety Guarantees
● Optimizations are always validated against the program's semantic rules to ensure
correctness. For example, constant folding or strength reduction is only applied when the
transformations are mathematically sound under the target language's rules.
For instance, consider the control-flow graph below. How does the compiler
know that two variables can occupy the same register?
The compiler determines whether two variables can occupy the same register using liveness
analysis, a key part of register allocation. Here's how it works in the context of the control-flow
graph (CFG) you provided:
1. Liveness Analysis
● Definition: A variable is live at a point in the program if its value is used later without
being overwritten. The compiler calculates this information for each variable at every
program point.
● Steps:
1. Compute live-out information for each basic block in the CFG. This includes
variables that are live immediately after the block.
2. Propagate the live-out information backward through the block to determine
which variables are live at each point.
2. Interference Graph
● The compiler builds a graph whose nodes are variables, with an edge between two
variables whenever they are live at the same time. Variables connected by an edge
"interfere" and cannot share a register.
3. Register Allocation
● Graph Coloring Algorithm: The compiler assigns registers to variables using a graph
coloring algorithm on the interference graph. Two interfering variables get different
"colors" (registers), while non-interfering variables can share the same "color."
● Variable x:
○ It is live throughout the loop, as it is read in multiple conditions and updated
repeatedly. It likely interferes with many other variables.
● Variable y:
○ It is used within the loop but redefined on each iteration. If it is not live outside its
use within the loop, it may not interfere with other variables, like z.
● Variable z:
○ Similar to y, z has a limited scope of use within the loop. It may be live at
different times than y.
In this way, the compiler can determine that y and z may share the same register if their
lifetimes do not overlap.
Optimization Opportunities
● By analyzing the CFG, the compiler might also perform optimizations like:
○ Reusing registers for temporary values like y and z.
○ Reducing memory access by keeping frequently used values, like x, in a register
across iterations.
Definition of Liveness
A variable is live at a given program point if:
1. It holds a value that might be used later, i.e., the value is "alive" and necessary for the
correctness of the program.
2. Its value has not been overwritten by another assignment before that use.
In other words, a variable is live if there is a path in the control-flow graph (CFG) from the
current point to a point where the variable is used, and the variable is not redefined along that
path.
Formalization
Writing use(q) for the variables read at point q and def(r) for those assigned at point r, a
variable v is live at p if there is a CFG path from p to some q with v ∈ use(q) and v ∉ def(r) for
every point r strictly between p and q.
Example
a = 5;
b = 10;
c = a + b; // Use of a, b: their last uses, so both die here
d = c * 2; // Use of c: its last use
Liveness analysis is a data-flow analysis performed on the control-flow graph (CFG) of the
program. Here's the high-level process:
1. For each basic block b:
○ Define:
■ use: The set of variables used before any redefinition in the block.
■ def: The set of variables defined (assigned a value) in the block.
○ Compute:
■ live-in[b] = use[b] ∪ (live-out[b] - def[b])
■ live-out[b] = ∪ live-in[s], for each successor s of the block b.
2. Iterate these equations until the sets stabilize (a fixed point is reached).
● A variable like x is live at every program point where its value might still influence the
program's output. This includes:
○ The test conditions x > 1, y > 3, and z > 0.
○ Any assignment or computation where x is used, like y = x / 2.
● Register Allocation: Variables live at the same time (overlapping lifetimes) cannot
share a register.
● Dead Code Elimination: Variables that are not live at any program point can be safely
removed, as their computation has no effect on the program's outcome.
How can the compiler find out the program points where a variable is alive?
Liveness analysis works in three steps; the figure below illustrates the first and the last of them:
● Goal: Derive equations for each program point based on the data-flow relationships.
● For each program point, you compute the IN and OUT sets:
○ IN[p]: The set of variables that are live before the program point p.
○ OUT[p]: The set of variables that are live after the program point p.
● These sets are computed using two relationships: IN[p] = USE[p] ∪ (OUT[p] ∖ DEF[p]),
and OUT[p] = ⋃ IN[s] for each successor s of p.
● The equations are solved iteratively (iterative data-flow analysis) until a fixed point is
reached (i.e., the IN and OUT sets no longer change).
● This process propagates the liveness information throughout the control-flow graph:
○ Backward in the case of liveness, as liveness is a backward analysis (it
depends on later program points).
● Once the equations stabilize, the computed IN and OUT sets are associated with each
program point, as shown in the figure's second part.
● This information is now ready for practical use, such as register allocation or dead code
elimination.
● Variables in the IN set at any point are alive before that program point.
● Variables in the OUT set at any point are alive after that program point.
This approach makes the concept of liveness analysis systematic and easy to implement for
compilers. Would you like a worked-out example with equations?
The figure presents an elegant way to solve liveness analysis using Prolog, where the
equations for liveness are translated directly into logical predicates. This approach leverages
Prolog's natural ability to express relationships and recursive computation, closely mirroring the
structure of data-flow equations.
Translation to Prolog:
1. Each data-flow equation becomes a predicate: in(P, IN) and out(P, OUT) relate a
program point P to its IN and OUT sets.
2. Prolog recursively evaluates these predicates, computing IN and OUT sets for all
program points.
Running the analysis:
1. A query like the following computes the IN and OUT sets for all program points:
?- member(P, [1, 2, 3, 4]), in(P, IN), out(P, OUT).
2. The solution provides the IN and OUT sets for each program point:
P = 1, IN = [], OUT = [y];
1. Variable Use/Definition: The diff operation removes killed variables, while union
adds used variables.
2. Control Flow: The recursive relationships between OUT and IN mimic the propagation
of live variables through the control-flow graph.
3. Data Flow: The computed IN and OUT sets align with the possible executions of the
program, reflecting which variables are live at each point.
Prolog is cool, but can you show me how to solve these equations in a
more "imperative" setting?
There are several ways to solve these equations in a more imperative setting. The simplest
approach is to use something called "chaotic iterations". Essentially, go over all the equations in
your list of equations and evaluate all of them. If there is any change on any IN or OUT set,
repeat this process; otherwise, stop it. It's like the code below:
def abstract_interp(equations):
    env = {eq.name(): set() for eq in equations}
    changed = True
    while changed:
        changed = False
        for eq in equations:
            if eq.eval(env):
                changed = True
    return env
1. Initialization:
○ Each equation (representing a program point) starts with its associated set (IN or
OUT) initialized to an empty set or some default value, stored in the env
dictionary.
2. Iterative Updates:
○ The equations are evaluated in some order (usually corresponding to the order of
program points).
○ Each equation modifies the environment env (i.e., updates the IN or OUT sets)
based on the constraints of the data-flow analysis (e.g., IN[p] = (OUT[p] \
DEF[p]) ∪ USE[p] for liveness).
○ If any set changes (e.g., a new variable is added to an IN or OUT set), the
changed flag is set to True, ensuring another iteration.
3. Termination:
○ The iterations continue until no IN or OUT set changes in a complete pass over
the equations, at which point the solution has reached a fixed point.
● The term "chaotic iterations" refers to the fact that the equations can be processed in
any order, and the order of evaluation doesn't affect correctness (though it may affect
performance).
● In practice, heuristics like evaluating equations in reverse post-order (for backward
analyses like liveness) can speed up convergence; so can a worklist, as sketched below.
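A worklist algorithm re-evaluates only the equations whose inputs may have changed. Below is
a minimal sketch built on the same eq.name()/eq.eval(env) interface used above; the deps map,
which records which equations read each set, is an assumption of this sketch:

from collections import deque

def worklist_interp(equations, deps):
    # deps[name] lists the equations that read the set called `name`;
    # only those can be affected when that set changes.
    env = {eq.name(): set() for eq in equations}
    work = deque(equations)
    while work:
        eq = work.popleft()
        if eq.eval(env):  # the set named eq.name() changed
            for dependent in deps[eq.name()]:
                if dependent not in work:
                    work.append(dependent)
    return env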
● Each equation in the list corresponds to the constraints for a program point.
● env tracks the state of the IN and OUT sets.
● eq.eval(env) updates the sets for a program point based on the current state of the
environment, and returns True if it caused a change.
● The sets we are working with (IN and OUT) are drawn from a finite domain of values. For
example:
○ Variables in the program are finite.
○ IN[p] and OUT[p] are subsets of all variables.
● This means there are only a finite number of possible states for IN and OUT.
● Each equation defines how IN and OUT are updated, often using a transfer function.
These transfer functions must be monotonic, meaning that if the input to the function
grows (in terms of set inclusion), the output will also grow or remain the same.
● Since the transfer functions are monotonic and the lattice of values is finite, the process
will converge to a fixed point (where no further changes occur).
1. At every iteration, the IN and OUT sets can only grow (for "may" analyses like liveness
and reaching definitions, which start from empty sets) or shrink (for "must" analyses like
available expressions, which start from the full set).
2. Since the sets are subsets of a finite set (the program's variables), there is a maximum
size they can reach.
3. Once all IN and OUT sets stop changing (no new variables are added or removed), the
process halts.
For example:
● If a program has 3 variables (x, y, z), the possible values of IN[p] and OUT[p] for any
program point are subsets of {x, y, z}, i.e., there are 2^3 = 8 possible states for
each set.
● Since each set changes monotonically and is drawn from a finite collection of states, the
process reaches a fixed point after a finite number of steps.
○ Transfer functions guarantee that the values in the lattice move in a single
direction: they only grow (for "may" analyses) or only shrink (for "must" analyses).
○ Since the lattice has finite height, the process eventually reaches a point where
further updates do not change the IN or OUT sets.
5. Practical Example
6. Summary
The process of solving data-flow equations (like in liveness analysis) terminates because:
If you apply these principles correctly, you can be sure the iterative process will halt and give the
correct result. Let me know if you'd like further clarifications or examples!
1. What Is a Lattice?
A lattice is a partially ordered set (poset) that satisfies the following conditions:
● Partial Order: Elements in the set are ordered by some relation ≤ (e.g., "subset of"
for sets).
● Join (Least Upper Bound): For any two elements a and b, there is a unique smallest
element c such that a ≤ c and b ≤ c. This is called the join (a ∨ b).
● Meet (Greatest Lower Bound): For any two elements a and b, there is a unique
largest element d such that d ≤ a and d ≤ b. This is called the meet (a ∧ b).
Example: The set of subsets of a finite set (e.g., {x, y, z}) forms a lattice, ordered by inclusion.
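A quick way to see this is to enumerate the subsets and check the lattice laws mechanically; a
small illustrative Python script (not from the book's repository):

from itertools import combinations

# The 8 subsets of {x, y, z}, ordered by inclusion, with
# join = union and meet = intersection.
universe = ["x", "y", "z"]
subsets = [frozenset(c) for r in range(len(universe) + 1)
           for c in combinations(universe, r)]
for a in subsets:
    for b in subsets:
        join, meet = a | b, a & b
        assert a <= join and b <= join and join in subsets
        assert meet <= a and meet <= b and meet in subsets
print(len(subsets))  # 8: the lattice is finite, so its height is finite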
In data-flow analysis, the lattice represents the domain of values that we are computing. Each
program point maps to an element in this lattice (e.g., the set of live variables at that point for
liveness analysis). The structure of the lattice ensures:
1. Monotonicity: The equations used in the analysis respect the lattice order. For example,
if OUT grows, IN will also grow or stay the same.
2. Fixed Points: The iterative process used in chaotic iterations converges to a fixed point
because the lattice is finite and the operations respect the partial order.
Different data-flow problems (e.g., liveness analysis, reaching definitions, constant propagation)
can be expressed in terms of lattices:
For example:
b) Termination
A lattice with a finite height guarantees that the iterative process converges because:
1. The process can only move upwards (or downwards, for some analyses) in the lattice.
2. There is a maximum height, so the process terminates.
c) Correctness
The fixed point computed in the lattice corresponds to the solution to the data-flow equations,
ensuring that the analysis gives the correct result.
1. Domain of Values: The lattice represents the possible states for each program point
(e.g., sets of variables).
2. Transfer Functions: Define how states evolve, respecting monotonicity.
3. Convergence Guarantee: A finite-height lattice ensures that chaotic iterations terminate
at a fixed point.
4. General Applicability: Different analyses (liveness, reaching definitions, etc.) use
different lattices but follow the same principles.
Can you show how a lattice models the values computed by our liveness
analysis?
Let's work through this example below, step by step:
Data-Flow Equations
The monotone data-flow equations describe how the IN and OUT sets evolve:
● IN[n] and OUT[n] represent the live variables before and after a program point n,
respectively.
● IN[n] = (OUT[n] ∖ DEF[n]) ∪ USE[n]
● OUT[n] = ⋃ IN[s], for each successor s of n
We solve the equations iteratively, starting with OUT[n]={} for all instructions and updating until a
fixed point is reached.
Initialization
Start with IN[n] = OUT[n] = {} for every instruction n.
Iteration 1
1. d3: print(a)
○ USE = {a}, DEF = {}
○ IN[d3] = (OUT[d3] ∖ DEF) ∪ USE = {} ∪ {a} = {a}
2. d4: print(b)
○ USE = {b}, DEF = {}
○ IN[d4] = (OUT[d4] ∖ DEF) ∪ USE = {} ∪ {b} = {b}
3. d2: a > b?
○ Successors: d3 and d4.
○ OUT[d2] = IN[d3] ∪ IN[d4] = {a} ∪ {b} = {a, b}
○ IN[d2] = (OUT[d2] ∖ DEF) ∪ USE = {a, b} (d2 defines nothing; its uses, a and b,
are already in OUT[d2]).
4. d1: b = read()
○ Successor: d2.
○ OUT[d1] = IN[d2] = {a, b}
○ IN[d1] = (OUT[d1] ∖ DEF) ∪ USE = ({a, b} ∖ {b}) ∪ {} = {a}.
5. d0: a = read()
○ Successor: d1.
○ OUT[d0] = IN[d1] = {a}.
○ IN[d0] = (OUT[d0] ∖ DEF) ∪ USE = ({a} ∖ {a}) ∪ {} = {}.
Iteration 2
We repeat the process, but now there are no changes to any of the IN or OUT sets. The fixed
point is reached, so we stop.
4. Final Results
● IN[d0] = {}
● OUT[d0] = {a}
● IN[d1] = {a}
● OUT[d1] = {a,b}
● IN[d2] = {a,b}
● OUT[d2] = {a,b}
● IN[d3] = {a}
● OUT[d3] = {}
● IN[d4] = {b}
● OUT[d4] = {}
We have been talking about liveness analysis, but there are many other
data-flow analyses, right?
Yes, there are many more! Liveness analysis is just one example of many data-flow analyses.
These analyses are designed to gather information about program behavior by solving systems
of equations over program control flow graphs. Here's an overview of a few common data-flow
analyses, along with their key equations:
1. Reaching Definitions
Purpose: Determines which definitions of variables may reach each program point.
Equations:
● IN[n] = ⋃ OUT[p], for each predecessor p of n.
● OUT[n] = GEN[n] ∪ (IN[n] ∖ KILL[n])
2. Available Expressions
Purpose: Determines which expressions are guaranteed to have already been computed and
not invalidated by redefinition.
● Direction: Forward.
● Key Sets:
○ GEN[n]: Expressions computed at n.
○ KILL[n]: Expressions invalidated at n.
○ IN[n]: Available expressions at the start of n.
○ OUT[n]: Available expressions at the end of n.
Equations:
● IN[n] = ⋂ OUT[p], for each predecessor p of n.
● OUT[n] = GEN[n] ∪ (IN[n] ∖ KILL[n])
3. Very Busy (Anticipated) Expressions
Purpose: Determines expressions that will definitely be computed in the future, along all paths
from a program point.
Equations:
● OUT[n] = ⋂ IN[s], for each successor s of n.
● IN[n] = GEN[n] ∪ (OUT[n] ∖ KILL[n])
4. Liveness Analysis
Purpose: Determines which variables are "live" (potentially used in the future) at each program
point.
● Direction: Backward.
● Key Sets:
○ USE[n]: Variables used at node n.
○ DEF[n]: Variables defined at node n.
○ IN[n]: Live variables at the start of n.
○ OUT[n]: Live variables at the end of n.
Equations:
● OUT[n] = ⋃ IN[s], for each successor s of n.
● IN[n] = USE[n] ∪ (OUT[n] ∖ DEF[n])
● Lattice of Values: Each analysis draws its facts from a lattice.
○ Example: Power set of variables (for liveness), constant values (for constant
propagation), etc.
Let's see one more data-flow analysis. First, can you explain which kind of
information you need to optimize the program below?
Variable z in the program above is "dead": it's not used anywhere, except to redefine itself. To
demonstrate that variable z is dead, we can use reaching-definition analysis.
Step 1: Definitions of z
Reaching-definition analysis tells us where a variable is defined and whether that definition
"reaches" a certain point in the program. For variable z, definitions occur at the following points:
1. d5: z = x - 4
2. d8: z = z - 1
Step 2: Liveness of z
A variable is live at a program point if its current value is used later in the program along any
path. If the value of z is not used anywhere after its definition, the variable is considered dead.
1. The last use of z in the program should be identified to determine if any of its definitions
reach that use.
2. Looking at the program and CFG:
Use Case:
● It is useful for detecting dead code, and doing constant propagation, for instance.
● Example: If a definition of a variable does not reach any usage, it may indicate that this
definition is redundant.
1. Data-Flow Equations
● IN[n] = ⋃ OUT[p], over the predecessors p of n.
● OUT[n] = GEN[n] ∪ (IN[n] ∖ KILL[n])
To solve these equations:
1. Initialize IN and OUT for all program points to empty sets.
2. Iterate over the CFG until all IN and OUT sets stabilize (i.e., no further changes occur).
Example
Program:
1: x = 5
2: y = x
3: x = 10
4: print(y)
CFG:
1→2→3→4
Analysis:
1. Definitions:
○ d1: x = 5
○ d2: y = x
○ d3: x = 10
2. GEN and KILL:
○ GEN[d1] = {d1}; KILL[d1] = {d3} (the other definition of x)
○ GEN[d2] = {d2}; KILL[d2] = {} (no other definition of y)
○ GEN[d3] = {d3}; KILL[d3] = {d1} (kills the previous definition of x)
3. Equations:
○ OUT[1] = GEN[1] ∪ (IN[1] \ KILL[1]) = {d1}
○ IN[2] = OUT[1] = {d1}
○ OUT[2] = GEN[2] ∪ (IN[2] \ KILL[2]) = {d1, d2}
○ IN[3] = OUT[2] = {d1, d2}
○ OUT[3] = GEN[3] ∪ (IN[3] \ KILL[3]) = {d2, d3}
○ IN[4] = OUT[3] = {d2, d3}
4. Final Results:
○ IN[1] = {}; OUT[1] = {d1}
○ IN[2] = {d1}; OUT[2] = {d1, d2}
○ IN[3] = {d1, d2}; OUT[3] = {d2, d3}
○ IN[4] = {d2, d3}; OUT[4] = {d2, d3}
Key Observations
The definition of z at d8 is dead, because it does not reach any instruction that uses variable z.
The only use of z is in d8 itself; however, the definition of z at d8 is killed by d5 before it reaches
that use. Because d8 is dead, so is d5: once we remove d8 from the CFG, there will be no other
use of variable z, and we can safely eliminate d5, where variable z is defined.
The semantics of the five instructions in this intermediate representation (IR) are as follows:
1. add(dst, src0, src1, next):
○ Semantics: Compute the sum of the values in src0 and src1, store the result in
dst, and continue with the instruction labeled next.
S0[src0] = v0
S0[src1] = v1
v = v0 + v1
S1 = S0 ∪ {dst: v}
S1 ⊢ interp(next) ⇒ S2
○ This means the interpretation of add updates the state by adding a new binding
for dst and proceeds to interpret the next instruction.
2. mul(dst, src0, src1, next):
○ Semantics: Compute the product of the values in src0 and src1, store the
result in dst, and continue with the instruction labeled next.
S0[src0] = v0
S0[src1] = v1
v = v0 × v1
S1 = S0 ∪ {dst: v}
S1 ⊢ interp(next) ⇒ S2
3. lth(dst, src0, src1, next):
○ Semantics: Set dst to true if the value in src0 is less than the value in src1, and
continue with the instruction labeled next.
S0[src0] = v0
S0[src1] = v1
v0 < v1
S1 = S0 ∪ {dst: true}
S1 ⊢ interp(next) ⇒ S2
4. geq(dst, src0, src1, next):
○ Semantics: Set dst to true if the value in src0 is greater than or equal to the
value in src1, and continue with next.
S0[src0] = v0
S0[src1] = v1
v0 ≥ v1
S1 = S0 ∪ {dst: true}
S1 ⊢ interp(next) ⇒ S2
5. bt(cond, next_true, next_false):
○ Semantics: If the value of cond is true, continue with next_true; otherwise,
continue with next_false.
S0[cond] = true
S0 ⊢ interp(next_true) ⇒ S1

S0[cond] = false
S0 ⊢ interp(next_false) ⇒ S1
In this representation:
Example Walkthrough
Setup
Execution
● Start at p.
○ eval: p = (2 < 3) = true.
● Branch to n_min since p is true.
○ eval: answer = n + zero = 2.
● Since n_min has no successor, the program terminates.
Result
The final environment is {"m": 3, "n": 2, "zero": 0, "p": true, "answer": 2}.
The fib function receives an input n, and creates the necessary instructions to compute the
n-th Fibonacci number!
We can start implementing the IN and OUT sets. These sets contain pairs like (V, L0), meaning
that variable V is defined at L0. We can implement a general "environment" that associates
each program point (the ID of an instruction) with an IN and an OUT set, e.g.:
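For instance, a minimal sketch of such an environment (the concrete representation in the
book's repository may differ):

env = {
    # Map the name of each IN/OUT set to the data-flow facts known so far:
    "IN_0": set(), "OUT_0": set(),
    "IN_1": set(), "OUT_1": set(),
}
env["OUT_0"].add(("x", 0))  # the pair (x, 0): variable x is defined at point 0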
In this case, we use strings to represent the keys of IN and OUT sets. Thus, the IN set at
program point 0 will be accessed by the key "IN_0". Given these considerations, an equation is
an object that we can "evaluate" to produce an updated IN (or OUT) set, e.g.:
from abc import ABC, abstractmethod

class DataFlowEq(ABC):
    """
    A class that implements a data-flow equation. The key trait of a data-flow
    equation is an `eval` method, which evaluates that equation. The evaluation
    of an equation might change the environment that associates data-flow facts
    with identifiers.
    """
    def __init__(self, instruction):
        """
        Every data-flow equation is produced out of a program instruction. The
        initialization of the data-flow equation verifies if, indeed, the input
        object is an instruction.
        """
        assert isinstance(instruction, Inst)
        self.inst = instruction
    @abstractmethod
    def name(self) -> str:
        """
        The name of a data-flow equation is used to retrieve the data-flow
        facts associated with that equation in the environment.
        """
        raise NotImplementedError

    @abstractmethod
    def eval_aux(self, data_flow_env) -> set:
        """
        This method determines how each concrete equation evaluates itself.
        In a way, this design implements the 'template method' pattern. In other
        words, the DataFlowEq class implements a concrete method eval, which
        calls the abstract method eval_aux. It is the concrete implementation of
        eval_aux that determines how the environment is affected by the
        evaluation of a given equation.
        """
        raise NotImplementedError

    def eval(self, data_flow_env) -> bool:
        # A sketch of the concrete template method described above: evaluate
        # the equation and report whether the environment changed.
        old_set = data_flow_env[self.name()]
        data_flow_env[self.name()] = self.eval_aux(data_flow_env)
        return data_flow_env[self.name()] != old_set
Equations have two concrete representations: we have equations that compute IN sets, and
equations that compute OUT sets. The former follows below:
def name_in(ID):
    """
    The name of an IN set is always ID + _IN. Eg.:
        >>> Inst.next_index = 0
        >>> add = Add('x', 'a', 'b')
        >>> name_in(add.ID)
        'IN_0'
    """
    return f"IN_{ID}"
class IN_Eq(DataFlowEq):
    """
    This abstract class represents all the equations that affect the IN set
    related to some program point.
    """
    def name(self):
        return name_in(self.inst.ID)
class ReachingDefs_IN_Eq(IN_Eq):
    """
    This concrete class implements the meet operation for reaching-definition
    analysis. The meet operation produces the IN set of a program point. This
    IN set is the union of the OUT set of the predecessors of this point.
    """
    def eval_aux(self, data_flow_env):
        """
        The evaluation of the meet operation over reaching definitions is the
        union of the OUT sets of the predecessors of the instruction.
        Example:
            >>> Inst.next_index = 0
            >>> i0 = Add('x', 'a', 'b')
            >>> i1 = Add('x', 'c', 'd')
            >>> i2 = Add('y', 'x', 'x')
            >>> i0.add_next(i2)
            >>> i1.add_next(i2)
            >>> df = ReachingDefs_IN_Eq(i2)
            >>> sorted(df.eval_aux({'OUT_0': {('x', 0)}, 'OUT_1': {('x', 1)}}))
            [('x', 0), ('x', 1)]
        """
        solution = set()
        for inst in self.inst.preds:
            solution = solution.union(data_flow_env[name_out(inst.ID)])
        return solution

    def __str__(self):
        """
        The name of an IN set is always ID + _IN.
        Example:
            >>> Inst.next_index = 0
            >>> i0 = Add('x', 'a', 'b')
            >>> i1 = Add('x', 'c', 'd')
            >>> i2 = Add('y', 'x', 'x')
            >>> i0.add_next(i2)
            >>> i1.add_next(i2)
            >>> df = ReachingDefs_IN_Eq(i2)
            >>> str(df)
            'IN_2: Union( OUT_0, OUT_1 )'
        """
        preds = ', '.join([name_out(pred.ID) for pred in self.inst.preds])
        return f"{self.name()}: Union( {preds} )"
And the equations that compute OUT sets are implemented as follows:
def name_out(ID):
    """
    The name of an OUT set is always ID + _OUT. Eg.:
        >>> Inst.next_index = 0
        >>> add = Add('x', 'a', 'b')
        >>> name_out(add.ID)
        'OUT_0'
    """
    return f"OUT_{ID}"
class OUT_Eq(DataFlowEq):
    """
    This abstract class represents all the equations that affect the OUT set
    related to some program point.
    """
    def name(self):
        return name_out(self.inst.ID)
class ReachingDefs_Bin_OUT_Eq(OUT_Eq):
    """
    This concrete class implements the equations that affect OUT facts of the
    reaching-definitions analysis for binary instructions. These instructions
    have three fields: dst, src0 and src1; however, only the first is of
    interest for these equations.
    """
    def eval_aux(self, data_flow_env):
        """
        Evaluates this equation, where:
        OUT[p] = (v, p) + (IN[p] - (v, _))
        Example:
            >>> Inst.next_index = 0
            >>> i0 = Add('x', 'a', 'b')
            >>> df = ReachingDefs_Bin_OUT_Eq(i0)
            >>> sorted(df.eval_aux({'IN_0': {('x', 1), ('y', 2)}}))
            [('x', 0), ('y', 2)]
        """
        in_set = data_flow_env[name_in(self.inst.ID)]
        new_set = {(v, p) for (v, p) in in_set if v != self.inst.dst}
        return new_set.union([(self.inst.dst, self.inst.ID)])

    def __str__(self):
        """
        A string representation of a reaching-defs equation representing
        a binary instruction. Eg.:
            >>> Inst.next_index = 0
            >>> add = Add('x', 'a', 'b')
            >>> df = ReachingDefs_Bin_OUT_Eq(add)
            >>> str(df)
            'OUT_0: (x, 0) + (IN_0 - (x, _))'
        """
        kill_set = f" + ({name_in(self.inst.ID)} - ({self.inst.dst}, _))"
        gen_set = f"({self.inst.dst}, {self.inst.ID})"
        return f"{self.name()}: {gen_set}{kill_set}"
class ReachingDefs_Bt_OUT_Eq(OUT_Eq):
    """
    This concrete class implements the equations that affect OUT facts of the
    reaching-definitions analysis for branch instructions. These instructions
    do not affect reaching definitions at all. Therefore, their equations are
    mostly treated as identity functions.
    """
    def eval_aux(self, data_flow_env):
        """
        Evaluates this equation. Notice that the reaching definition equation
        for a branch instruction is simply the identity function:
        OUT[p] = IN[p]
        Example:
            >>> Inst.next_index = 0
            >>> i0 = Bt('x')
            >>> df = ReachingDefs_Bt_OUT_Eq(i0)
            >>> sorted(df.eval_aux({'IN_0': {('x', 1), ('y', 2)}}))
            [('x', 1), ('y', 2)]
        """
        return data_flow_env[name_in(self.inst.ID)]

    def __str__(self):
        """
        A string representation of a reaching-defs equation representing a
        branch. Eg.:
            >>> Inst.next_index = 0
            >>> i = Bt('x')
            >>> df = ReachingDefs_Bt_OUT_Eq(i)
            >>> str(df)
            'OUT_0: IN_0'
        """
        return f"{self.name()}: {name_in(self.inst.ID)}"
def reaching_defs_constraint_gen(insts):
    """
    Builds a list of equations to solve Reaching-Definition Analysis for the
    given set of instructions.
    Example:
        >>> Inst.next_index = 0
        >>> i0 = Add('c', 'a', 'b')
        >>> i1 = Mul('d', 'c', 'a')
        >>> i2 = Lth('e', 'c', 'd')
        >>> i0.add_next(i2)
        >>> i1.add_next(i2)
    """
    # One IN equation and one OUT equation per instruction; branch
    # instructions use the identity OUT equation (a sketch; the code in the
    # book's repository may differ in details).
    in_eqs = [ReachingDefs_IN_Eq(i) for i in insts]
    out_eqs = [ReachingDefs_Bt_OUT_Eq(i) if isinstance(i, Bt)
               else ReachingDefs_Bin_OUT_Eq(i) for i in insts]
    return in_eqs + out_eqs
Example
The figure below shows the equations that are produced for a simple program. The program
computes Fibonacci numbers, but its exact semantics will bear no consequence on this
presentation. All that we need are the equations.
The OUT sets are shown in red, and the IN sets are shown in blue. Once we have all these
equations, we can solve them using chaotic iterations, or any faster algorithm. The figure below
shows a solution to reaching definitions:
1. IN sets: Represent definitions that reach the entry of a program point. These are
definitions that may influence the computation at that point.
2. OUT sets: Represent definitions that reach the exit of a program point. These are
definitions that may propagate to subsequent program points.
The solution provides information about how variable definitions flow through the program.
Specifically:
1. Definition Reachability: The IN set shows all the definitions that might affect the
current instruction. If a variable is used at a program point, its definition must appear in
the IN set.
2. Propagation: The OUT set determines which definitions leave the program point and
may affect downstream instructions.
3. Control Flow Influence: The IN and OUT sets reflect the possible paths of program
execution in the control-flow graph (CFG).
For example:
● At program point 3, the IN set {(count, 0), (fib, 2), (pred, 1)} means that the definition of
count at 0, the definition of fib at 2, and the definition of pred at 1 might all influence the
computation of aux = zero + fib.
Summary
1. Meaning: It shows which variable definitions are available at each point in the program,
based on the CFG.
2. Utility: It supports optimizations, debugging, and program analysis by exposing how
definitions influence computations.
3. Execution Relation: It accounts for all possible paths in the CFG, ensuring correctness
even for complex control flows.
What do you mean when you say that the Reaching-Definition Analysis is
conservative?
When I say that reaching-definition analysis is conservative, I mean that it is designed to
over-approximate the possible definitions that might reach each program point. This ensures
that the analysis is safe and correct for all possible program executions. In essence,
conservative analyses guarantee that no potential behaviors of the program are missed, even at
the cost of including some that are not possible in practice.
"May" Analysis
Example
In reaching-definition analysis:
● If a variable x has definitions at points P1 and P2, and the analysis concludes that both
definitions reach a program point P3, this means:
○ At least one of these definitions may influence P3 depending on the path
taken.
○ This does not imply that both definitions simultaneously influence P3 in a single
execution.
"Must" Analysis
● A "must" analysis computes facts that hold on all possible executions of the program.
● In the context of definitions, a "must-reach" analysis would determine which definitions
of a variable are guaranteed to reach a program point on all possible paths.
● This is stricter and less inclusive than a "may" analysis.
Key Differences
● Purpose: A "may" analysis includes everything that might happen; a "must" analysis
includes only what always happens.
● Example property: All definitions that may reach a point ("may") versus the definitions
that must reach a point ("must").
Example
1: x = 5
2: if (cond) { x = 10 }
3: y = x
"May" Analyses
"Must" Analyses
● Dual Nature: Often, "may" and "must" analyses are duals of each other:
○ A "must" analysis can sometimes be derived from the complement of a "may"
analysis and vice versa.
● Example:
○ "May reach" definitions describe what might reach a program point.
Summary
Can you state some properties that are true about the results of a
Reaching-Definition Analysis?
Yes, there are many such properties. Find a few below:
Theorem 1 (Affectation): If an instruction at a program point P computes a variable a from
variables b and c (e.g., a = b + c), then only the definitions of b and c present in the IN set at P
can affect the value computed for a.
Proof Sketch:
1. By definition, the IN set at P includes all definitions that reach the entry of P.
2. Reaching definition analysis is conservative, meaning if a definition does not reach P, it
is guaranteed not to affect the program at P.
3. Therefore, only the definitions of b and c in the IN set at P can contribute to the
computation of a.
Utility:
This theorem helps in understanding data provenance and supports optimizations like constant
propagation. If b or c have a single reaching definition, their values can be substituted directly.
If no successor of a program point Q has the definition (x, Q) in its IN set, then the definition of
x at Q is dead and can be safely eliminated.
Proof Sketch:
1. The OUT set at Q represents all definitions that propagate from Q.
2. If no successor of Q has (x,Q) in its IN set, the definition at Q is unused downstream.
3. Such definitions can be eliminated without changing the program's behavior.
Utility:
If a variable x has exactly one definition (x, Q) that reaches program point P,
then x's value at P is determined solely by the computation at Q.
Proof Sketch:
1. Reaching definition analysis tracks all definitions that can reach P.
2. If x has exactly one reaching definition (x,Q), no other definitions can interfere or
override x before P.
3. Therefore, x's value at P must be the result of the computation at Q.
Utility:
This property supports constant propagation and partial redundancy elimination. For
example, the computation of x at Q can be propagated to P to reduce redundant evaluations.
If a definition (x, Q) does not appear in the IN set at program point P, then it cannot influence
any computation at P.
Proof Sketch:
1. Definitions not in the IN set at P do not propagate to P due to intervening redefinitions or
control-flow barriers.
2. As a result, (x,Q) has no influence on computations at P.
Utility:
This theorem is essential for understanding the scope of definitions and avoiding incorrect
assumptions about variable dependencies.
If program point P has multiple predecessors, its IN set for variable x is the
union of the OUT sets of x's definitions from all predecessors.
Proof Sketch:
1. The IN set at P is defined as the union of the OUT sets of all predecessor points.
2. This ensures that all definitions flowing into P from distinct paths are captured.
Utility:
This theorem explains the control flow-sensitive nature of reaching definition analysis and
helps in reasoning about programs with branches or loops.
For a variable x defined inside a loop, the definition may reach the loop
header if it is not overwritten by subsequent iterations.
Proof Sketch:
1. In a loop, the OUT set of the loop body propagates to the loop header.
2. If x's definition is not killed within the loop, it will appear in the IN set of the loop header,
potentially affecting subsequent iterations.
Utility:
This theorem supports loop optimizations like loop invariant code motion and partial
redundancy elimination.
At program point P, two definitions (x, Q_1) and (x, Q_2) can coexist in the IN
set if and only if Q1 and Q2 belong to different control-flow paths leading to P.
Proof Sketch:
Utility:
These theorems show how reaching definition analysis provides powerful insights into program
behavior and supports a variety of optimizations and correctness guarantees. Would you like me
to expand on any specific theorem or its practical applications?
There are so many analyses, but they are also very similar! Is there a way
to implement them all by just varying small parts of them?
Yes! All these analyses are instances of the Data-Flow Framework! Analyses like reaching
definitions, liveness analysis, and available expressions are examples of analyses that fit
within this data-flow framework. Let me explain what a data-flow framework is, its structure,
and why it’s useful in compiler design.
The data-flow framework is a general abstraction used to perform static analyses of programs.
It defines a systematic way to collect information about program behaviors by solving data-flow
equations over a control-flow graph (CFG).
Key Components
1. Input:
○ CFG: The control-flow graph of the program.
○ Analysis Specification: Defines:
■ Direction (forward or backward).
■ Transfer functions for nodes.
■ Meet operator for combining paths.
2. Equations:
○ For each node in the CFG, compute:
■ IN[n] from the OUT sets of the predecessors, then OUT[n] via the
transfer function (forward analysis).
■ OUT[n] from the IN sets of the successors, then IN[n] via the transfer
function (backward analysis).
1. Generality:
○ The framework supports a wide range of analyses, as long as the problem can
be expressed in terms of data flow through the CFG.
2. Safety:
○ The framework guarantees a safe approximation of program behaviors,
typically through over- or under-approximation.
3. Convergence:
○ By using a finite lattice for data-flow values and monotonic transfer functions, the
framework ensures convergence to a fixed point.
Example 1: Reaching Definitions
● Direction: Forward.
● Transfer Function: OUT[n] = GEN[n] ∪ (IN[n] \ KILL[n])
● Meet Operator: Union ("may" analysis).
Example 2: Liveness Analysis
● Direction: Backward.
● Transfer Function: IN[n] = USE[n] ∪ (OUT[n] \ DEF[n])
● Meet Operator: Union ("may" analysis).
Example 3: Available Expressions
● Direction: Forward.
● Transfer Function: OUT[n] = GEN[n] ∪ (IN[n] ∩ ~KILL[n])
● Meet Operator: Intersection ("must" analysis).
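To see how little varies between these instances, consider a single solver parameterized by
direction and meet operator. This is a minimal sketch (the function and parameter names are
illustrative, not the book's API, and boundary conditions are simplified):

def solve(nodes, preds, succs, gen, kill, forward=True, may=True,
          universe=frozenset()):
    # May-analyses start from empty sets; must-analyses start from the
    # universal set, so that intersections can only remove facts.
    init = set() if may else set(universe)
    IN = {n: set(init) for n in nodes}
    OUT = {n: set(init) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            neighbors = preds[n] if forward else succs[n]
            sets = [OUT[m] if forward else IN[m] for m in neighbors]
            if sets:
                meet = set.union(*sets) if may else set.intersection(*sets)
            else:
                meet = set()  # boundary node: no facts flow in
            flow = gen[n] | (meet - kill[n])
            src, dst = (IN, OUT) if forward else (OUT, IN)
            if meet != src[n] or flow != dst[n]:
                src[n], dst[n] = meet, flow
                changed = True
    return IN, OUT

With gen = GEN, kill = KILL, forward = True, and may = True, this instantiates reaching
definitions; with gen = USE, kill = DEF, and forward = False, it instantiates liveness; with
may = False and universe holding every expression, it instantiates available expressions.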
1. Optimizations:
○ Dead code elimination (e.g., from liveness analysis).
○ Common subexpression elimination (e.g., from available expressions).
○ Register allocation (e.g., from liveness analysis).
2. Program Verification:
○ Ensuring variables are initialized before use (e.g., definite assignment analysis).
3. Code Refactoring:
○ Identifying redundant computations or unused variables.
● The data-flow framework can express both may and must analyses:
○ May Analysis: Typically uses the union operator in the meet (e.g., reaching
definitions, liveness).
○ Must Analysis: Typically uses the intersection operator in the meet (e.g.,
available expressions).
Summary
The data-flow framework provides a unifying structure for many program analyses. By defining
the control flow, data-flow values, transfer functions, and meet operators, it systematically
computes information about program behavior. This framework is crucial for enabling safe and
effective compiler optimizations.
1. John Cocke
● At IBM, Cocke pioneered many of the first global program optimizations, laying practical
groundwork for data-flow analysis.
2. Gary Kildall
● In 1973, Kildall introduced the seminal formalization of data-flow analysis in his paper
A Unified Approach to Global Program Optimization.
● Kildall's algorithm provided a systematic way to compute data-flow information using the
concepts of lattices, transfer functions, and fixed-point iteration. This became the
foundation for modern data-flow frameworks.
3. Frances E. Allen
● Allen's early papers on control-flow and data-flow analysis (e.g., Control Flow Analysis,
1970) systematized the use of graphs in program optimization; she later received the
Turing Award for this body of work.
4. Kenneth Zadeck
● In the 1970s and 1980s, Zadeck and others refined the practical application of data-flow
frameworks, focusing on efficiency and scalability in compilers.
5. Others
● Contributions from researchers like Patrick Cousot and Radhia Cousot (abstract
interpretation) and Aho, Sethi, and Ullman (Dragon Book) helped formalize and
popularize data-flow frameworks in the broader programming languages and compiler
community.
Kildall's 1973 paper is often considered the formal birth of the data-flow framework because:
1. It provided the theoretical foundation for solving data-flow equations iteratively using a
fixed-point computation.
2. It introduced the use of lattices to model data-flow facts and guarantee convergence.
3. It unified multiple data-flow analyses (e.g., reaching definitions, liveness, available
expressions) under a common theoretical framework.
Why Is It Important?
1. Global Optimizations: Efficiently propagating information across the entire program, not
just within basic blocks.
2. Generalization: A single mathematical structure for many analyses, making compilers
modular and extensible.
3. Theoretical Guarantees: Convergence and correctness via well-defined mathematical
properties like monotonicity and finite lattices.
Legacy
The framework is now a core part of compiler theory and has influenced other areas of program
analysis, including abstract interpretation and model checking. Researchers like Allen and
Kildall have left a lasting mark on how compilers analyze and optimize code.
Is this the same Kildall from the TV show? What else did he do for
computer science?
Yes, Gary Kildall, the same visionary computer scientist who contributed to the development of
the data-flow framework, is also famously known for his pioneering work in personal
computing and for founding the company Digital Research Inc. (DRI). He was indeed one of
the hosts of the PBS series The Computer Chronicles.
Beyond his foundational work in compiler theory, Kildall made significant contributions across
various domains in computer science and was an advocate for science outreach. Here's a
summary of his contributions:
● Kildall's 1973 paper, A Unified Approach to Global Program Optimization, laid the
groundwork for data-flow frameworks, as discussed earlier.
● His work provided theoretical tools still used in modern compiler design.
● Kildall is best known for creating CP/M (Control Program for Microcomputers) in
1974.
○ CP/M was one of the first operating systems for personal computers and became
a de facto standard in the early days of microcomputing.
○ It introduced the concept of a BIOS (Basic Input/Output System), which
allowed CP/M to run on various hardware platforms, paving the way for
cross-platform software development.
● CP/M heavily influenced MS-DOS, which Microsoft developed after a controversial
decision by IBM not to adopt Kildall's system.
● Kildall foresaw the potential of graphical interfaces for personal computing and
developed GEM (Graphical Environment Manager).
○ GEM was an early GUI for PCs and inspired subsequent developments in
graphical operating systems.
● Kildall developed one of the first implementations of network file sharing with MP/M (an
extension of CP/M for multitasking).
● He contributed to the development of CD-ROM technology, demonstrating its potential
for data distribution and multimedia applications.
● In the 1980s, Kildall hosted the PBS series The Computer Chronicles:
○ This popular TV show explained computer technology to a broad audience,
making concepts accessible to both experts and laypeople.
○ Kildall's approachable style and passion for teaching helped demystify personal
computing during its infancy.
● He was a strong advocate for democratizing access to computing and believed in the
power of computers to transform society.
Despite his achievements, Kildall's work is often overshadowed by more publicized figures like
Bill Gates and Steve Jobs. Nevertheless, his contributions to computer science remain
profound:
● Standardization: His work on CP/M and BIOS established conventions that shaped the
PC industry.
● Vision: Kildall's foresight in areas like cross-platform software, GUIs, and networking
influenced countless innovations.
● Science Outreach: His efforts on The Computer Chronicles showcased his dedication
to educating the public about technology.
Conclusion
Gary Kildall was a polymath in computer science, contributing to fields as diverse as compiler
theory, operating systems, networking, and science communication. His passion for both
innovation and outreach left an indelible mark on the history of computing. Would you like to
explore any specific aspect of his work further?
Constant Propagation
Example
int x = 10;
int y = x + 5;
int z = y * 2;
After propagating and folding constants, the compiler can rewrite the program as:
int x = 10;
int y = 15; // Replaced x + 5 with 15
int z = 30; // Replaced y * 2 with 30
Challenges:
● Control Flow: In programs with loops or conditionals, constant propagation can become
more complex.
● Aliasing and Pointers: When pointers or aliasing are involved, it can be hard to
determine if a variable truly holds a constant value.
How does the compiler know that a variable is a constant? Is it via some
data-flow analysis?
Yes, the compiler uses data-flow analysis to determine whether a variable is constant at
specific points in the program. The specific type of data-flow analysis used for this purpose is
called constant propagation analysis. Here's how it works in detail:
The compiler tracks the possible values of variables as they "flow" through the control flow
graph (CFG) of the program. For constant propagation, the goal is to identify variables that are
provably constant at each point in the program.
Could you give me an example showing how this data-flow analysis works?
Yes, the figure below represents the process of constant propagation analysis:
1. Initialization:
○ Variables are initially assigned UNDEF until they are assigned specific values in
the program.
2. Data-Flow Analysis:
○ The figure tracks how constants propagate through the program. For example:
■ a = 1 sets a to the constant 1.
■ c = a + 10 propagates the constant value 11 to c.
■ When control flow merges (e.g., at the loop), the state reflects the most
conservative approximation using a join operation:
■ If the variable is constant on one path but not on the other, the
result is NAC (Not A Constant).
3. Iterative Refinement:
○ The process continues iteratively, updating states until reaching a fixed point
where no further changes occur.
○ In loops, variables like a and b may become NAC because their values cannot be
determined to be constant across iterations.
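The join operation at merge points is tiny; here is an illustrative Python sketch of the
constant-propagation join (the UNDEF and NAC markers mirror the figure; the function name is
an assumption of this sketch):

UNDEF, NAC = "UNDEF", "NAC"

def join(a, b):
    # Combine the abstract values of one variable arriving from two paths.
    if a == UNDEF:
        return b          # UNDEF means "no information yet"
    if b == UNDEF:
        return a
    if a == b:
        return a          # the same constant on both paths
    return NAC            # different values: not a constant

# E.g., join(1, 1) == 1; join(1, 2) == NAC; join(UNDEF, 7) == 7.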
Constant propagation analysis does not execute the code but symbolically evaluates it using the
CFG and abstract states. It uses transfer functions and the lattice of values (as described
earlier) to approximate the values of variables.
The dense approach requires maintaining O(|V| * |P|) space to store all this information, where
|V| is the number of variables and |P| is the number of program points.
○ If you have 1,000 variables and 10,000 program points, the dense
implementation would require space proportional to 10,000,000 entries, even
though only a fraction of these might be relevant!
4. Memory Usage:
○ Modern programs are large and may have thousands of variables, making this
approach infeasible due to high memory consumption.
○ The space used to store abstract states could otherwise be used for more critical
tasks, like storing actual program data or performing other optimizations.
5. Inefficiency for Sparse Data:
○ In many cases, the number of variables that actually change at any given
program point is small. This makes the dense representation overkill because
we're maintaining and updating information that isn't necessary.
Consider an Analogy:
Imagine you’re tracking the position of all the students in a university across a campus map at
every second of the day. For most students, their position won’t change very often, but you’d still
waste time and storage keeping track of everyone constantly, instead of just focusing on the few
who are actually moving.
In summary, the dense approach is expensive because it doesn't take advantage of the
sparsity of actual changes in variable states, leading to significant overhead in both space and
computation.
The inefficiency of the dense approach comes from unnecessarily tracking the state of every
variable at every program point. But here's an important observation: whether a variable is
constant or not depends only on the point where it is defined—not on every subsequent
program point.
Think of kitchen jars, each with a unique label:
● If every jar has a unique label based on its contents (e.g., “sugar,” “flour”), you always
know what’s inside by looking at the label.
● There’s no need to check every cabinet or recipe step to figure out what’s in each
jar—it’s consistent because of the unique label.
Similarly, if every variable in a program is uniquely tied to its definition, its property (constant or
not) is immediately clear and doesn’t need to be re-evaluated everywhere.
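Said differently, a sparse analysis keeps one abstract fact per definition instead of one per
(variable, program point) pair. A hypothetical sketch of that bookkeeping, once every definition
has a unique name:

# One abstract value per uniquely named definition: O(|V|) entries,
# instead of the dense O(|V| * |P|) table.
abstract_value = {
    "a0": 1,      # a0 = 1: a constant
    "b0": "NAC",  # b0 = read(): not a constant
    "c0": 11,     # c0 = a0 + 10: folded using a0's constant
}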
But, how can we ensure that every variable has a single name? Renaming
variables in straight-line code is easy. But what about code with branches?
Yes, renaming variables within straight-line code (code without branches) is fairly easy. See how
that could be done in the figure below:
Let's break down the figure and explain why renaming variables in straight-line code is simple
but becomes challenging when control flow introduces branches.
Straight-line Code
L0: a = read()
L1: b = read()
L2: a = a + b
L3: ret a
We see a simple sequence of instructions without any branching. Renaming variables here is
straightforward: just rename each definition (a = ..., b = ...) with a counter that you
increment every time the variable is redefined. The sequence then becomes:
L0: a0 = read()
L1: b0 = read()
L2: a1 = a0 + b0
L3: ret a1
Each variable has a unique name (a0, a1, b0), making it clear where values are defined and
used. There's no ambiguity or overlap.
L0: a = read()
L1: b = read()
L2: brq a b L4
L3: b = b + 1
L4: ret b
The control flow diverges at L2 with a conditional branch (brq a b L4). Depending on the
condition, execution either jumps directly to L4 or falls through to L3, which redefines b.
At L4, it's unclear whether to use b0 (from before the branch) or b1 (from L3). We cannot
assign a unique name to b here because it depends on the program's execution path. This
ambiguity demonstrates why branches complicate variable renaming.
So, we could rename variables within straight-line code with counters. How
would you implement that?
Here's the Python implementation of functions that rename variables by adding counters to their
names:
def initialize(variables):
    """
    Initializes the Count and Stack data structures for renaming variables.

    Args:
        variables: A list of all variables in the program.

    Returns:
        Count: A dictionary mapping each variable to its count (initially 0).
        Stack: A dictionary mapping each variable to a stack containing [0].
    """
    Count = {var: 0 for var in variables}
    Stack = {var: [0] for var in variables}
    return Count, Stack
def rename_basic_block(block, Count, Stack):
    """
    Renames the variables of a basic block, using counters.

    Args:
        block: A list of instructions in the basic block.
        Count: A dictionary mapping each variable to its current count.
        Stack: A dictionary mapping each variable to its stack of indices.

    Returns:
        A renamed version of the block.
    """
    for instruction in block:
        # Replace all uses of variables in the instruction:
        for x in get_used_variables(instruction):
            i = top(Stack[x])
            replace_variable_use(instruction, x, f"{x}{i}")
        # Replace the definitions, giving each one a fresh index
        # (replace_variable_def is assumed to mirror replace_variable_use):
        for x in get_defined_variables(instruction):
            Count[x] += 1
            Stack[x].append(Count[x])
            replace_variable_def(instruction, x, f"{x}{Count[x]}")
    return block
Explanation
● initialize:
○ Sets up the Count and Stack data structures. Count tracks how many times a
variable has been defined so far, and Stack keeps the history of indices for each
variable.
● rename_basic_block:
○ Processes a list of instructions in a basic block. For each instruction:
■ It retrieves the variables that are used and defined
(get_used_variables and get_defined_variables).
■ It replaces each use and definition with a unique name based on the
current index from Stack and updates Count and Stack accordingly.
But the problem with control flow still remains. How can we deal with
programs that have branches?
When programs have branches, we encounter a problem: a variable may have multiple
definitions, depending on the execution path. As an example, consider the figure below:
This "multiplexer" doesn’t exist as a physical construct in the program; instead, it exists in the
abstract representation of the program.
To handle this, the compiler inserts a phi-function, which:
● Combines multiple possible values of a variable into a single new variable, depending on
the control flow.
● Is placed at points where control flow merges (e.g., at L4):
b2 = phi(b1, b0)
Semantics of Phi-Functions
1. Reading Variables: When control flow reaches a join point with phi-functions, we
determine the incoming path that led to this point. Each phi-function then reads the
variable from the corresponding input column based on this path. For example, if the
flow comes from the second branch in the figure, we select values like a1, b1, and c1.
2. Assigning Variables: After reading the required values, the phi-functions assign these
values in parallel to their respective target variables. This ensures the semantics remains
consistent even when variables are permuted, solving the so-called "swap problem."
Implementation in Python
This behavior can be implemented in Python using two classes: Phi and PhiBlock.
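A minimal sketch of the idea behind these two classes (names and details are illustrative; the
code in the book's repository is richer):

class Phi:
    # dst receives the value of srcs[i], where i selects the incoming path.
    def __init__(self, dst, srcs):
        self.dst = dst
        self.srcs = srcs

class PhiBlock:
    # Groups phi-functions so that they are evaluated in parallel.
    def __init__(self, phis, selector_ids):
        self.phis = phis
        # selector_ids[i] identifies the predecessor that feeds column i:
        self.selector_ids = selector_ids

    def eval(self, env, path_id):
        column = self.selector_ids.index(path_id)
        # Read every source first, then write: this parallel semantics is
        # what solves the "swap problem" mentioned above.
        values = [env[phi.srcs[column]] for phi in self.phis]
        for phi, value in zip(self.phis, values):
            env[phi.dst] = value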
Phi Class: This models an individual phi-function, mapping a destination variable to a selection
of source variables depending on the program path. Example:
a = Phi("a", ["a0", "a1", "a2"])
● Here, the variable a will be assigned the value of a0, a1, or a2, depending on the
incoming path.
PhiBlock Class: This groups multiple phi-functions into a single entity, ensuring all
phi-functions at a join point are evaluated in parallel. For example, in the figure, a, b, and c are
updated together based on the active incoming path. Example:
aa = PhiBlock([a0, a1], [10, 31])
● In this case, the phi-block ensures that values from a selected column are copied
simultaneously to their target variables.
● Dynamic Path Resolution: The program's path dictates which set of source variables is
selected.
Practical Example
The example from the code illustrates how PhiBlock resolves phi-functions dynamically:
Given an environment where e.set("a0", 1) and e.set("a1", 3), the evaluation depends on the
incoming path: along one path the phi-block performs the parallel copies a0 = a0 and a1 = a1,
while along the other it performs a0 = a1 and a1 = a0, swapping both values simultaneously.
This structured implementation ensures correctness and flexibility in handling control flow joins
in a representation that we call Static Single-Assignment form!
● Nodes in the CFG represent basic blocks (sequences of instructions with no branching
except at the end).
● Edges represent possible paths of control flow between blocks.
Conclusion
In SSA form:
● Every variable has exactly one definition site, making data flow clear and unambiguous.
● Every definition dominates its uses, ensuring that variables are always defined before
they are used.
This representation is not only theoretically elegant but also highly practical, as it simplifies
many compiler optimizations and analyses.
1. Unique Naming: In SSA form, every variable is given a unique name at each point of
assignment, even if the same logical variable (like x) is reassigned multiple times in
different scopes or branches. Instead of overwriting a variable, SSA introduces a new
version of it.
2. No Shadowing: Because each assignment produces a new, uniquely named variable,
there's no risk of shadowing. In your language, even if the same variable is used in
nested scopes, each instance will have a different name in SSA form.
3. Global vs. Local Identification: In SSA form, since all variables have distinct names, it
becomes straightforward to differentiate between global and local variables. Global
variables will remain consistent across function definitions and scopes, while local
variables will have unique, scope-specific names.
Now, the outer x1 and the inner x2 have unique names, making it easy for your visitor to
classify them correctly. The outer x1 can be classified as a global variable, and x2 as local.
● Global variables would be the ones that are introduced outside of functions (in your
case, outside of anonymous Fn expressions or Let bindings inside functions).
● Local variables would be the ones declared inside a function or inner scope and would
have distinct names in SSA form.
You wouldn't need to worry about shadowing or scope tracking anymore, as each variable will
already have a unique name.
Once you've transformed your program into SSA form, you could modify your visitor to easily
classify global variables. Since SSA guarantees unique names, it can directly look at where
variables are defined without worrying about shadowing.
Conclusion
Yes, converting the program to SSA form is a very elegant solution for dealing with the
problem of variable shadowing. It simplifies the task of distinguishing global and local variables
because each variable has a unique name, eliminating the ambiguity caused by shadowing.
This makes your visitor's job much easier, allowing it to correctly classify variables without
needing to track scopes explicitly.
When the program is compiled without optimizations, variables are often mapped to memory
locations (e.g., stack slots), and phi-functions are not really necessary (for memory is not in SSA
form! Only virtual registers!). So, in our figure, when the program is compiled without the
-mem2reg pass:
When the mem2reg pass is applied, LLVM starts mapping variables onto virtual registers:
LLVM inserts phi-functions at the start of the loop to select the correct value for these variables.
The SSA representation after mem2reg might look like:
● Before mem2reg: Variables are in memory. LLVM does not represent memory locations
in SSA form: only virtual registers.
● After mem2reg: Variables are promoted to SSA form with phi-functions resolving
control-flow merges.
The transformation enables LLVM to leverage the full power of SSA for optimization while
maintaining the ability to represent code without SSA when necessary.
Conclusion
Variables in the LLVM IR are inherently SSA-based. After applying transformations like
mem2reg, more variables are represented as virtual registers. The phi-functions introduced
during this process handle control-flow merges, ensuring the correctness of data flow in SSA
form. This design is essential for LLVM's optimizations and efficient code generation.
In SSA form:
This property allows us to map the constant status of a variable directly to its name in SSA form,
rather than to a pair of the variable and a specific program point.
● In SSA form, each variable is uniquely defined, and its constant status is tied to that
definition.
● Instead of tracking the constant status for every variable at every program point, we only
need to track it once per variable, at its definition.
● This reduces the storage complexity to O(|V|), as the number of definitions is equal to
the number of unique variables.
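As a rough sketch of the difference in storage, using a hypothetical variable x1 that is known to hold 10 (the point labels are illustrative):

# Dense analysis: one lattice fact per (variable, program point) pair.
dense = {("x1", "L1"): 10, ("x1", "L2"): 10, ("x1", "L3"): 10}
# Sparse analysis on SSA form: one lattice fact per variable name.
sparse = {"x1": 10}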
The figure below illustrates the benefit of the SSA representation, showing the differences
between the dense and the sparse implementations of constant propagation:
● Improved Performance:
○ Sparse analysis reduces the computational overhead of traversing the
control-flow graph to update constant statuses for each program point.
Conclusion
But we are still left with the problem of how to convert a program to SSA
Form if the program has branches. What would be the steps to do that?
Converting a program into Static Single Assignment (SSA) form involves a series of
transformations to ensure that every variable in the program has a single definition and that
every definition of a variable dominates all its uses. This transformation requires addressing
the complexities introduced by branching control flow, where a variable may have multiple
potential definitions depending on the execution path.
○ After renaming, each variable in the program has a single definition site and a
unique name.
3. Preservation of Semantics:
Conclusion
The process of converting a program to SSA form is grounded in ensuring the dominance of
variable definitions over their uses. This requires careful handling of phi-functions, which use
variables based on control-flow edges rather than within blocks. These transformations enable
powerful compiler optimizations by creating a representation where data flow is explicit and
unambiguous.
This core property, that definitions dominate their uses, seems to break with
the edges of phi-functions, doesn't it?
Yes, that's a bit weird. The parameters of a phi-function are treated differently from standard
uses of variables:
● A variable used in a phi-function is associated with a specific incoming edge to the block,
not the block as a whole.
● This edge-based usage ensures that the definition of the variable dominates its "use" in
the control-flow graph, preserving the SSA property.
The figure below illustrates the special case of uses of variables within phi-functions:
These conditions ensure that a phi-function is placed at z only when z is the first point in the
CFG where control flow from x and y converges and both definitions of b are still live. This
placement guarantees the SSA property by resolving ambiguity about which definition of b to
use at z.
● The cubic complexity O(|V|^3) becomes impractical for large programs with many blocks
and variables.
● While a single DFS is manageable, performing |V| DFS traversals quickly adds up in
terms of both time and memory usage.
Implications
Even though this method avoids explicitly enumerating all paths from x to z and from y to z,
the need to perform multiple graph traversals for each variable still makes it computationally expensive. This
inefficiency highlights the need for a more targeted approach to determine where phi-functions
should be placed.
The core insight of the algorithm is to leverage dominance relationships in the control-flow
graph (CFG) to efficiently determine where phi-functions are required, without exploring all
possible paths explicitly. The intuition is based on the concept of the dominance frontier.
The dominance frontier (DF) of a node x in a CFG is the set of nodes z such that:
1. x dominates a predecessor of z (i.e., there’s at least one control-flow edge entering z
that comes from a block dominated by x).
2. x does not dominate z itself.
In other words:
● The dominance frontier identifies where control flow merges, and the dominance of a
definition no longer extends unambiguously to all subsequent blocks.
● These are precisely the locations where multiple definitions of a variable could reach and
where phi-functions are needed to resolve ambiguity.
● If a variable b is defined in block x, and control flow can reach block z through multiple
paths (e.g., one path where b was defined in x, and another where it was defined
elsewhere), then z lies in the dominance frontier of x.
● By computing the dominance frontiers of all blocks, the algorithm efficiently identifies all
points where phi-functions are necessary for every variable.
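Stated as a predicate, the two conditions above could be checked directly, as in the sketch below; here `dominates` is an assumed oracle answering dominance queries, and `predecessors` maps each node to its CFG predecessors:

def in_dominance_frontier(z, x, predecessors, dominates):
    """
    Returns True if z is in the dominance frontier of x:
    (1) x dominates some predecessor of z, and
    (2) x does not dominate z itself.
    """
    dominates_some_pred = any(dominates(x, p) for p in predecessors[z])
    return dominates_some_pred and not dominates(x, z)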
This iterative process ensures that all necessary phi-functions are placed, even in complex
CFGs with nested and interconnected control flow.
Conclusion
The algorithm introduced by Cytron et al. is grounded in the observation that the dominance
frontier of a block concisely captures where ambiguity arises in variable definitions. By iterating
over these frontiers, the algorithm efficiently determines the placement of phi-functions, paving
the way for SSA form construction. This elegant use of dominance relationships has made the
algorithm a foundational tool in modern compiler design.
● The root node always dominates itself, as there is no path to any other node that
bypasses it.
● Intuitively, if a node d dominates a node n, then d must be encountered in the control
flow before n on any possible execution path.
Example
To determine the set of dominators for each node in the CFG, we can use the following iterative
algorithm based on equations:
Iterative Computation
● Initialization:
○ For the root node s0: D[s0] = {s0}.
○ For all other nodes n: D[n] is initialized to all nodes in the CFG.
D[n] = {all nodes in the CFG} for n ≠ s0
● Iteration:
○ For each node n, repeatedly update D[n] using the equation: D[n] = {n} ∪
(intersection of D[p] for each p in pred[n])
○ The intersection operator ensures that the dominator set for n becomes
progressively smaller with each iteration.
● Termination:
○ The algorithm stops when no dominator set changes during an iteration.
1. Convergence:
○ Each update to D[n] only makes the set smaller, eventually stabilizing when no
further updates occur.
2. Correctness:
○ By starting with the largest possible sets and refining them iteratively, the
algorithm ensures that all nodes satisfying the dominance property are included
in D[n] while eliminating any nodes that do not dominate n.
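A direct transcription of these equations into Python might look like this (a sketch; the CFG is assumed to be given as a collection of nodes and a map from each node to its predecessors, with every non-root node reachable from the root):

def compute_dominators(nodes, preds, root):
    """
    Iteratively computes D[n], the set of dominators of each node n,
    using the equations above.
    """
    dom = {n: set(nodes) for n in nodes}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == root:
                continue
            # D[n] = {n} union the intersection of D[p] for p in pred[n];
            # every non-root node is assumed to have predecessors.
            new_dom = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new_dom != dom[n]:
                dom[n] = new_dom
                changed = True
    return dom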
Example
The figure below shows, for each node of a rooted CFG, its set of dominators. Thus, node a,
being the root, is dominated only by itself, i.e., D[a] = {a}. Node b, in turn, is dominated by
nodes a and b itself, and so on and so forth:
Key Insights
Practical Importance
● They are used to build dominator trees, which represent the hierarchical dominance
relationships in a CFG.
● They help identify natural loops, critical for loop optimizations.
● They are the basis for computing dominance frontiers, which are key to placing
phi-functions in SSA form.
Recursive Computation
DF[x] = DF_local[x].union(
w for c in children[x] for w in DF_up[c]
)
Example Walkthrough
For a node e with children f and g in the dominator tree, see its dominance frontier in the figure
below:
The dominance frontier of a node in the control-flow graph (CFG) identifies the points where
control flow merges and ambiguity about variable definitions can arise. Let's use the figure
below to understand why the dominance frontier is the place to insert phi-functions, assuming a
well-formed program where every variable is defined before it is used:
Ambiguity at Node k
Summary
Phi-functions are inserted on the dominance frontier of a node because these are the first points
where control flow from different paths merges, and ambiguity about variable definitions can
arise. In the case of node e:
This placement ensures that every use of v in the program can be unambiguously resolved to
the correct definition.
Could you provide some Python code to compute the dominance frontier of
a node?
Here's the Python code, assuming a class-based structure for handling the dominance tree and
related computations:
class DominanceTree:
    def __init__(self, cfg, idom_map, dom_tree):
        """
        Initialize the dominance tree representation.

        Args:
            cfg: The control-flow graph.
            idom_map: A dictionary mapping each node to its immediate
                dominator.
            dom_tree: A dictionary mapping each node to its children in
                the dominator tree.
        """
        self.cfg = cfg
        self.idom_map = idom_map
        self.dom_tree = dom_tree
        self.dominance_frontier = {node: set() for node in cfg.nodes}
    def immediate_dominator(self, node):
        """
        Args:
            node: The node for which to find the immediate dominator.

        Returns:
            The immediate dominator of the node.
        """
        return self.idom_map.get(node, None)
    def dominates(self, dom_node, target_node):
        """
        Args:
            dom_node: The potential dominator node.
            target_node: The target node.

        Returns:
            True if `dom_node` dominates `target_node`, False otherwise.
        """
        # This should be implemented based on the dominator tree structure
        pass
    def compute_dominance_frontier(self, node):
        """
        Args:
            node: The node for which to compute the dominance frontier.
        """
        S = set()
        # (Body reconstructed from the DF_local/DF_up recursion above;
        # the cfg object is assumed to provide a successors() method.)
        # DF_local: CFG successors of `node` that `node` does not
        # immediately dominate.
        for succ in self.cfg.successors(node):
            if self.immediate_dominator(succ) != node:
                S.add(succ)
        # DF_up: frontier nodes of the children in the dominator tree
        # that `node` does not strictly dominate.
        for child in self.dom_tree.get(node, []):
            self.compute_dominance_frontier(child)
            for w in self.dominance_frontier[child]:
                if w == node or not self.dominates(node, w):
                    S.add(w)
        self.dominance_frontier[node] = S
Explanation
1. A Phi-Function Creates a New Definition:
○ This new definition might need to be "joined" with other definitions downstream in
the control-flow graph (CFG).
2. Chain Reaction of Phi-Function Placement:
○ The new definition from the phi-function could make other nodes meet the criteria
for needing a phi-function.
○ This process continues until no further phi-functions are required, forming a
"chain reaction."
3. Example:
Suppose a variable v is defined at node e, and its dominance frontier includes node k. A
phi-function is placed at k: v = phi(v:e, v:j, v:i)
○ This phi-function introduces a new definition v:k.
○ If another node (e.g., l) lies in the dominance frontier of k, and there is a use of
v in l, then we must insert another phi-function at l to resolve the ambiguity.
To handle this chain reaction systematically, we compute the iterated dominance frontier. This
concept ensures that all necessary phi-functions are placed by iterating the dominance frontier
computation until no new definitions require further phi-functions.
Definition: The iterated dominance frontier (IDF) for a variable v is the transitive
closure of the dominance frontier:
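In code, this closure can be computed with a simple worklist over precomputed dominance frontiers (a sketch; DF maps each node to its dominance frontier):

def iterated_dominance_frontier(def_sites, DF):
    """
    Computes the IDF of a set of definition sites: the limit of
    applying DF repeatedly until no new node is discovered.
    """
    idf = set()
    worklist = list(def_sites)
    while worklist:
        n = worklist.pop()
        for z in DF.get(n, set()):
            if z not in idf:
                idf.add(z)
                # z now holds a phi-function, i.e., a new definition,
                # so its own frontier must be processed too.
                worklist.append(z)
    return idf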
The figure below shows the iterated dominance frontier of four different nodes:
Conclusion
The iterated dominance frontier is a critical concept for constructing SSA form. By iteratively
placing phi-functions and recomputing dominance frontiers, it handles the cascading effect of
new definitions, ensuring that every use of a variable is unambiguously tied to a unique
dominating definition. This systematic approach is what makes SSA construction efficient and
robust for complex programs.
When a variable x is defined at a specific node (in this case, node f), phi-functions must be
inserted at nodes within the iterated dominance frontier (IDF) of f. This is necessary to ensure
that every use of x is properly dominated by its unique definition, as required by the SSA form.
1. Node f:
○ The initial definition of x is placed here.
2. First Pass (Dominance Frontier):
○ The dominance frontier of f is calculated. Nodes d and h are identified, as they
are control-flow merge points where definitions from f reach multiple paths.
3. Second Pass (Iterated Dominance Frontier):
○ Adding the phi-functions at d and h creates new definitions of x, which propagate
further.
○ The dominance frontier of these newly added definitions expands the scope to
nodes like e, l, and k, requiring additional phi-functions.
4. Iterative Process:
○ This process continues until no new nodes meet the criteria for needing
phi-functions. In the figure, nodes d, h, e, l, and k collectively form the iterated
dominance frontier of f.
● Guarantees SSA Correctness: Without the IDF, some uses of x would not be
dominated by their definitions, breaking the core SSA Property.
● Handles Complex Control Flow: CFGs with loops and branches (like in this example)
require iterative placement of phi-functions to ensure all potential paths are covered.
● Minimizes Redundancy: The IDF places phi-functions only where absolutely necessary,
avoiding unnecessary overhead.
class CFG:
    def __init__(self):
        # A dictionary mapping each node to its set of original variable
        # definitions
        self.Aorig = {}
        # A dictionary mapping each node to its dominance frontier
        self.DF = {}
        # A dictionary to store inserted phi-functions for each node
        self.Aphi = {}
        # A dictionary mapping variables to their definition sites
        self.defsites = {}
        # A graph representation where each node has a list of predecessors
        self.predecessors = {}

    def place_phi_functions(self):
        # Step 1: Initialize defsites
        for node, variables in self.Aorig.items():
            for var in variables:
                if var not in self.defsites:
                    self.defsites[var] = set()
                self.defsites[var].add(node)
        # Step 2 (reconstructed from the explanation below): worklist
        # algorithm that inserts phi-functions on the dominance
        # frontiers of the definition sites of each variable
        for var, sites in self.defsites.items():
            W = list(sites)
            while W:
                n = W.pop()
                for y in self.DF.get(n, set()):
                    if var not in self.Aphi.get(y, set()):
                        self.insert_phi(y, var)
                        self.Aphi.setdefault(y, set()).add(var)
                        # The phi-function is a new definition of var,
                        # so y must be re-processed, unless var was
                        # defined there
                        if var not in self.Aorig.get(y, set()):
                            W.append(y)
Explanation:
1. Initialization:
○ Aorig: Stores the original variables defined at each node.
○ defsites: Maps each variable to the nodes where it is defined.
○ DF: Precomputed dominance frontier for each node.
○ Aphi: Keeps track of nodes where phi-functions are inserted.
2. Worklist Algorithm:
○ For each variable, a worklist W is initialized with its definition sites.
○ The dominance frontier of each node in the worklist is examined, and
phi-functions are inserted as needed.
3. Insert Phi-Functions:
○ The insert_phi method simulates the insertion of phi-functions by generating
a string with the appropriate number of arguments based on the node's
predecessors.
4. Mock CFG:
○ An example CFG structure is provided with nodes, dominance frontiers, and
predecessor relationships.
The algorithm iterates over nodes in the control flow graph (CFG). Since the CFG has a finite
number of nodes, the possible iterations are inherently bounded. Each node can only be added
to the worklist W a limited number of times.
3. Worklist Is Finite
● The worklist W is initialized with the definition sites of a variable a, which is finite because
there are a finite number of nodes in the CFG and each node can define only a finite
number of variables.
● Nodes are only added back to the worklist if:
○ A phi-function is inserted at a node y, and
○ The variable a is not already defined in Aorig[y].
● Thus, the additions to W are limited by the size of the CFG and the number of variables.
● For a given node n, the dominance frontier DF[n] contains a finite number of nodes
because it is derived from the CFG, which has a finite size. Therefore, the inner loop
for y in DF[n]: iterates over a finite set of nodes.
● In each iteration, at least one node n is removed from the worklist W. This ensures that
the algorithm progresses and does not get stuck in an infinite loop.
High-Level Reasoning
Imagine a CFG with 10 nodes, where each node can define up to 3 variables:
def rename(node, CFG, Count, Stack):
    """
    Renames the variables in a node and, recursively, in the nodes it
    dominates. (The function name is hypothetical: the original
    signature is elided in this excerpt, and is reconstructed here
    from the docstring.)

    Args:
        node: The current node in the control flow graph (CFG) being
            processed.
        CFG: The control flow graph represented as a dictionary where
            keys are nodes and values are lists of successors.
        Count: A dictionary mapping each variable to its current count.
        Stack: A dictionary mapping each variable to a stack of
            assigned indices.
    """
    # Rename variables in the current basic block
    rename_basic_block(node['instructions'], Count, Stack)
    # (Elided: update the phi-functions of the successors of `node`
    # and recursively process the children of `node` in the dominator
    # tree; see the key points below.)
    # Pop variable stacks for all variables defined in the current block
    for instruction in node['instructions']:
        for variable in get_defined_variables(instruction):
            Stack[variable].pop()
def top(stack):
    """
    Returns the top element of a stack without removing it.

    Args:
        stack: A list used as a stack of variable indices.
    Returns:
        The top element of the stack.
    """
    return stack[-1]
Key Points:
1. CFG Representation: The control flow graph (CFG) is represented as a dictionary where
each node contains its instructions, successors, predecessors, and phi-functions.
2. Phi-Function Update: For each successor, the phi_functions field is updated with
the correct renamed variables.
3. Recursive Traversal: The function traverses the CFG recursively, processing each node
and its children.
4. Stack Cleanup: After processing a node, the stacks of variables defined in that node are
popped to maintain the correct state.
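For completeness, here is a sketch of how fresh names could be generated during renaming (the helper name is hypothetical; it mirrors the Count and Stack structures used above):

def new_name(variable, Count, Stack):
    """
    Creates a fresh SSA name for a variable and pushes its index onto
    the variable's stack, so that uses in dominated blocks can find
    the current version of the variable.
    """
    Count[variable] = Count.get(variable, 0) + 1
    index = Count[variable]
    Stack.setdefault(variable, []).append(index)
    return f"{variable}{index}"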
An important part of this transformation is inserting phi-functions at the appropriate points in the
control flow graph (CFG). The algorithm uses the concept of the dominance frontier to
determine where these phi-functions should be placed. By iterating over the dominance frontiers
with a worklist, the algorithm ensures that phi-functions are inserted in all necessary locations to
accurately represent the variable's flow in SSA form. You can find the interface of many of the
key data structures in the LLVM code base, e.g.:
This approach aligns with the standard method of placing phi-functions in SSA construction,
ensuring that variables have a single assignment point while correctly modeling the program's
control flow. However, the code has changed over time. It's much easier to recognize the
algorithm we just described in the old implementation of PromoteMemoryToRegister.cpp.
Minimal SSA form refers to the placement of phi-functions such that the program satisfies the
SSA property—each variable has a single, unique definition—even if some of the phi-functions
inserted are unnecessary. The algorithm ensures that phi-functions are inserted at every point
required by the dominance frontier of a definition. This placement is considered "minimal"
according to the definition of SSA because it guarantees the fewest possible phi-functions that
uphold the correctness of SSA form. However, this approach does not consider whether the
variable is actually live at the program point where the phi-function is inserted.
As a result, minimal SSA form can include phi-functions for variables that are not live (i.e.,
no longer used after their definition). These dead variables may create unnecessary
overhead during optimization and code generation. For example, in the control flow graph
below, a phi-function for variable i would exist at L1 simply because L1 is in the dominance
frontier of L2 and there exists a definition of i at L2, even if i is not used later.
Pruned SSA form improves upon minimal SSA form by adding a liveness check during the
phi-function placement phase. In this version, phi-functions are only inserted at points where the
variable is both required by the dominance frontier and alive. A variable is alive at a program
point if there exists a path from that point to a later use of the variable without an intervening
redefinition.
By eliminating unnecessary phi-functions for dead variables, pruned SSA form produces a more
compact and efficient SSA representation. This optimization is particularly important in
large-scale programs, as it can reduce the number of phi-functions and simplify subsequent
compiler optimizations.
Practical Implications
While minimal SSA form is simpler to compute, pruned SSA form is often preferred in practical
compilers because it avoids redundant operations and improves efficiency. The distinction
between these forms is important for compiler developers to understand when implementing
SSA construction algorithms.
Summary
● Minimal SSA Form ensures correctness with the smallest number of phi-functions
required by the dominance frontier, regardless of liveness.
● Pruned SSA Form refines this by checking variable liveness, reducing the number of
phi-functions and avoiding those for dead variables.
The distinction between these two forms underscores the importance of combining correctness
with efficiency in compiler design.
An SSA program is said to be in Conventional SSA Form (CSSA) if all phi-related variables
have non-overlapping live ranges. When this condition is met, SSA-elimination typically
proceeds without issues. However, problems can arise when the CSSA form is broken. Let’s
explore the issue using the figure:
Example Walkthrough
This is the so-called lost-copy problem: without care, the value that the copy should preserve
is lost.
Example Walkthrough
If the copies are implemented sequentially, the parallel semantics of
the program are broken. Specifically, x3 is overwritten before its value is copied to y3,
leading to incorrect results.
5. (e): Breaking Critical Edges Does Not Solve the Problem
Unlike the lost-copy problem, breaking critical edges does not resolve the swap problem.
Even after splitting edges, the issue of overwriting x3 before copying it to y3 persists.
One standard solution is to implement the parallel copy with a temporary variable:
○ t1 = x3
○ x3 = y3
○ y3 = t1
Another approach is to restore the program to CSSA form. This involves inserting additional
temporary locations before and after each phi-function, effectively isolating the live ranges of
phi-related variables. However, this method can be computationally expensive and may
introduce significant overhead.
Eliminating phi-functions inherently requires implementing parallel copies. Parallel copies can
be realized using a combination of swap and copy instructions. Architectures may provide
specialized instructions to swap the contents of two registers, or we can use techniques like the
3-xor trick for efficient swapping.
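For instance, the 3-xor trick swaps two values without a temporary. Sketched in Python (in practice these would be three machine instructions):

# Swap r1 and r2 with three XOR operations and no temporary:
r1, r2 = 0b1010, 0b0110
r1 ^= r2  # r1 now holds r1 ^ r2
r2 ^= r1  # r2 = r2 ^ (r1 ^ r2), i.e., the original r1
r1 ^= r2  # r1 = (r1 ^ r2) ^ r1, i.e., the original r2
assert (r1, r2) == (0b0110, 0b1010)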
Conclusion
The swap problem demonstrates the importance of preserving CSSA properties during
SSA-elimination. While breaking critical edges can address simpler issues like the lost-copy
problem, the swap problem necessitates careful handling through temporary variables or
parallel copy implementations. By addressing these challenges, compilers ensure the
correctness and efficiency of the transformed code.
The idea of giving each variable a single name appears in the paper "Code Motion of
Control Structures in High-Level Languages".
This work was presented, in 1986, at the 13th ACM SIGPLAN-SIGACT Symposium on Principles of
Programming Languages (POPL). The authors developed SSA form as a way to simplify
program analysis and optimization by transforming the program so that each variable is
assigned exactly once. This transformation made data flow analysis more efficient and enabled
powerful optimizations like constant propagation, dead code elimination, and common
subexpression elimination.
Their algorithm was further refined by Cytron et al. in 1991 in their paper "Efficiently Computing
Static Single Assignment Form and the Control Dependence Graph". Cytron’s work introduced
the notion of dominance frontiers as a mechanism to compute the placement of phi-functions
efficiently, which is a critical step in converting programs to SSA form. This refinement
significantly reduced the complexity of SSA construction, making it practical for real-world
compilers.
In short:
Zadeck seems to have done quite a lot in this domain! Are you familiar with
his work?
Yes, F. Kenneth Zadeck is a highly influential figure in computer science, particularly in the
areas of program analysis, compiler optimization, and formal methods. His work has had a
lasting impact on both theoretical and practical aspects of compiler design, software
engineering, and optimization.
1. Static Single Assignment (SSA) Form (with Barry K. Rosen and Mark N.
Wegman)
● SSA form is one of Zadeck's most famous contributions. As mentioned earlier, SSA
simplifies the process of program optimization by ensuring that each variable is assigned
only once. This form enables efficient data flow analysis and various optimizations, such
as constant propagation, dead code elimination, and more. Zadeck, along with his
collaborators, introduced the original idea and demonstrated its power in simplifying
compiler optimization tasks.
used in conjunction with the dominance tree to perform optimizations like loop invariant
code motion and predicate splitting. His work in this area enhanced how compilers
analyze control flow and improve the execution of programs.
● Zadeck’s research also focused on data flow analysis, which involves tracking the flow
of data through a program to optimize it. This includes techniques like reaching
definitions, live variable analysis, and available expressions. His contributions in this
area have made it easier to reason about how data moves through programs and to
apply optimizations based on that analysis.
● Zadeck's contributions extend beyond compiler optimization into the field of formal
methods, where he has worked on rigorous techniques for specifying, analyzing, and
verifying software systems. His work on program correctness and formal verification has
influenced how we think about program reliability and security.
● Zadeck’s work has had a significant influence on compiler construction and optimization
tools used in the industry. Many modern compilers, such as GCC, LLVM, and others,
incorporate ideas from his research. His contributions have helped shape the
development of efficient and reliable compilers used in a wide range of applications, from
general-purpose programming to embedded systems and high-performance computing.
Key Papers:
Recognition:
● Zadeck’s work has been recognized by the computer science community through
numerous awards, citations, and honors. His research has had a profound impact on
how compilers optimize programs and how software reliability is ensured.
This task is more than a simple mapping; it combines several complex decision problems with
interesting theoretical foundations:
that every variable remains in one or more registers throughout its lifetime. This involves
ensuring that the allocation does not exceed the available number of registers at any
point in the program.
These problems are interrelated and, in many cases, computationally challenging. Solving them
efficiently is critical to generating high-performance code. Modern compilers use a mix of
heuristics and algorithmic approaches—often based on graph coloring, linear programming, and
other optimization techniques—to tackle these challenges and deliver code that makes the most
efficient use of the limited registers available.
Accessing a register takes a single clock cycle, while accessing memory can take tens to hundreds of cycles,
depending on cache performance. By ensuring that frequently accessed variables are
kept in registers, register allocation significantly reduces the time spent on memory
operations, boosting execution speed.
In summary, register allocation is a keystone of compiler design. Its importance stems not only
from the performance gains it directly enables but also from its role in unlocking the potential of
other optimizations. As Hennessy and Patterson argue, register allocation is foundational to the
pursuit of efficient, optimized code—a goal at the heart of every modern compiler.
This method is particularly well-suited for just-in-time (JIT) compilers and other scenarios where
compilation speed is critical. Linear scan treats register allocation as a coloring problem on
interval graphs, where the intervals represent the live ranges of variables. The algorithm assigns
registers by scanning live ranges linearly, reusing registers whenever possible. While its results
are usually worse than those of more complex methods, linear scan is fast and straightforward,
making it a popular choice in practical compilers like LLVM and Java's HotSpot VM.
2. Graph-Coloring-Based Allocation
Graph coloring is one of the most well-known approaches to register allocation. Variables' live
ranges are represented as vertices in an interference graph, with edges connecting variables
whose live ranges overlap. Allocating registers then becomes a graph-coloring problem, where
the goal is to assign colors (registers) to vertices without adjacent vertices sharing the same
color. This approach, pioneered by Chaitin, remains a cornerstone of register allocation,
although it can be computationally expensive due to the NP-hard nature of graph coloring.
3. ILP-Based Allocation
ILP-based register allocation formulates the problem as an integer optimization task, where
constraints model the relationships between variables, registers, and memory. The objective is
to minimize cost functions, such as the number of spills or memory accesses. Although ILP
provides optimal solutions, its high computational complexity makes it suitable primarily for
scenarios where compilation time is less critical, such as ahead-of-time (AOT) compilation for
embedded systems.
4. Puzzle-Based Allocation
This approach conceptualizes register allocation as a puzzle. The live ranges of variables are
treated as pieces that must be fit onto a board representing the bank of registers. This method
visualizes and resolves allocation challenges as a series of placement decisions, making it
particularly appealing for solving complex allocation problems that involve registers with multiple
sizes, as in the x86 architecture, where registers such as AH and AL can be combined to form
AX, e.g.:
6. Constraint Solving
The diversity of approaches to register allocation reflects its complexity and central importance
in compiler design. From theoretical models like graph coloring and ILP to practical methods like
linear scan and heuristic-based algorithms, the choice of approach depends on the specific
requirements of the compilation process, such as speed, code quality, and target architecture
constraints. As research continues, new methods and hybrid solutions will likely emerge, further
enriching the toolbox available to compiler developers.
○ SSA form inherently shortens variable live ranges, reducing the overlap between
live ranges in the interference graph.
○ Shorter live ranges increase the allocator’s flexibility, allowing it to map different
parts of the same variable more effectively.
3. Decoupled Allocation and Spilling:
○ Pre-SSA allocation enables a two-phase strategy:
■ First, spill variables until the program reaches a state where register
assignment is feasible.
■ Then, perform register assignment separately, focusing on mapping
variables to physical registers without additional spill concerns.
The decision between pre-SSA and post-SSA allocation depends on the compiler's design
priorities:
● Post-SSA is more common due to its simplicity and the extensive body of knowledge
available.
● Pre-SSA is advantageous in scenarios where the interference graph's chordal nature
can be exploited for efficiency or where shorter live ranges are critical.
Both approaches demonstrate the flexibility and adaptability of modern compiler design in
addressing the challenges of register allocation.
○ An induced cycle is a cycle in the graph such that no edges exist between
non-consecutive vertices in the cycle.
This efficient solution for SSA-form programs makes it a valuable property in register allocation,
especially when leveraging the simplicity and structure of SSA. Yet, notice that we are talking
only about register assignment! Problems like the minimization of spilling or the maximization of
coalescing remain NP-complete.
But what do you do with the phi-functions? The phi-related variables are
mapped to registers, but we still need to implement these instructions with
actual assembly instructions.
Handling φ-functions during SSA-based register allocation is indeed one of the trickiest parts of
the whole thing. After register allocation, φ-functions need to be translated into actual assembly
instructions, which typically involves implementing parallel copies. Here’s how this is managed:
Each phi-function is replaced by a sequence of copy instructions that move values to their assigned registers.
phi-functions are associated with registers, then we must take these locations into
consideration when choosing the right instructions to replace phi-functions:
In practice, handling φ-functions efficiently after register allocation combines clever algorithmic
design with low-level architectural features to balance correctness and performance:
5. Since the interference graph is the intersection graph of live range subtrees on the
dominance tree, and the intersection graph of subtrees of a tree is chordal, the
interference graph of an SSA-form program is chordal.
But how can we use all that to build a register allocator? You mention that
there are several problems to be solved...
Let's use the register allocation algorithm described in the paper "Register Allocation via the
Coloring of Chordal Graphs". The figure below highlights the key steps of the algorithm.
High-Level View:
This algorithm has been designed to work on chordal interference graphs. It can be applied to
a non-SSA-form program, as long as its interference graph is chordal. The algorithm consists of
two main phases:
1. Mapping to memory: the algorithm decides which variables, if any, will be mapped to
memory. This is done by gauging the size of the maximum clique in the interference
graph. Whenever a variable is mapped to memory, it's removed from the program, and
accesses to it are replaced with loads and stores.
2. Mapping to registers: at this point, the program is guaranteed to be K-colorable, and
variables are mapped to "colors" (registers) via a greedy coloring algorithm. Because the
maximum clique in the graph has no more than K vertices in this phase, spilling is no
longer necessary.
1. Build: The Build stage is where the interference graph is created. Imagine each
variable as a node in a graph. If two variables' live ranges overlap (meaning they are
"alive" at the same time and thus cannot occupy the same register), an edge is drawn
between their corresponding nodes. This graph is the core structure for determining
which variables can potentially share registers and which cannot.
2. MaxClique: Next, the MaxClique stage does a clever analysis to understand the
maximum register pressure. It does this by:
○ Using an algorithm called Maximum Cardinality Search to create a special
ordering of the nodes.
○ Then, using that order to identify the size of the largest clique (a fully connected
sub-graph where every node is connected to every other node) in the graph. This
size represents the maximum number of variables simultaneously "live" and thus,
potentially competing for registers.
3. Spill: If the size of the maximum clique found is greater than the number of available
registers (K), then the graph is un-colorable in the number of available registers. The
Spill step kicks in. This is where the algorithm removes nodes from the graph that
prevent the graph from being K-colorable. To "remove" a variable we insert loads/stores,
to map it to memory. This introduces the concept of spilling. The algorithm identifies the
best nodes to spill, and introduces code to save these variables to memory, freeing up a
register. The algorithm continues spilling nodes and re-building the interference graph in
a cycle, until a graph is found where the maximum clique is no greater than K.
4. Color: Once the maximum clique size is at or below K, the Color stage begins. Here, the
algorithm greedily assigns registers to variables in the order determined in the
MaxClique step. Since the graph is now chordal and the maximal clique is of size K or
less, the greedy coloring algorithm guarantees a coloring using a maximum of K
registers.
5. Coalesce: The Coalesce stage is an optimization step. It tries to merge the nodes of
variables related by copy instructions (like v := u). If v and u are never live at the
same time, they can use the same register. This reduces the number of required
registers.
6. SSA Elimination: The SSA Elimination step happens towards the end and transforms
the phi-functions into parallel copy instructions, which can be directly executed in the
machine.
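Putting the six stages together, a top-level driver could be organized as in the sketch below. All the helper names are hypothetical placeholders for the stages just described:

def allocate_registers(program, K):
    """
    High-level driver for the chordal-graph allocator: spill until the
    largest clique fits in K registers, then color and coalesce.
    """
    while True:
        graph = build_interference_graph(program)          # 1. Build
        order = maximum_cardinality_search(graph)          # 2. MaxClique
        if max_clique_size(graph) <= K:
            break
        program = spill_cheapest_variable(program, graph)  # 3. Spill
    coloring = greedy_coloring_in_seo(graph, order)        # 4. Color
    coloring = coalesce_copies(program, graph, coloring, K)  # 5. Coalesce
    return eliminate_phi_functions(program, coloring)      # 6. SSA Elimination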
● Chordal Graph Advantage: The algorithm relies heavily on the structure of chordal
graphs because they have the nice property that optimal coloring can be achieved in
polynomial time.
● Iterative Approach: The algorithm is iterative due to the spill process. If the initial
interference graph is too dense (more live ranges than registers), the algorithm needs to
remove variables (through spilling), rebuild, and try again.
1. Spilling: Initially, we can aggressively spill variables to reduce the number of live
variables at any given point (MaxLive) until it aligns with the desired register
budget (K). This effectively reduces the size and complexity of the interference
graph.
This decoupled approach simplifies the register allocation process by separating the potentially
complex and iterative spilling decisions from the register assignment itself. The main advantage
of this approach is simplicity: we can do register assignment without worrying about mapping
the excess of variables to memory as Florent Bouchez explains in his paper, "A Study of Spilling
and Coalescing in Register Allocation as Two Separate Phases".
1. Liveness Analysis: The initial step is to perform liveness analysis on the SSA-form
program. Liveness analysis determines which variables are "live" at each program point.
A variable is considered live at a point if its value might be used in the future, as we had
seen in our class about data-flow analyses.
In essence, the build phase translates the liveness information into a graphical representation of
register allocation conflicts:
Notice that we can still obtain chordal interference graphs, even if the original program is not in
SSA form. The figure above shows an example of such a situation. Indeed, many real-world
programs have chordal interference graphs, even after SSA elimination.
● Every use of a variable is dominated by its unique definition. This means that if a
variable is used at a certain point, its definition must have occurred earlier in the
program's execution path.
Algorithm:
We can leverage this property to perform a backward traversal of the Control Flow Graph (CFG)
of the SSA-form program.
def live_analysis(program):
    """
    Computes liveness information for each statement in the
    SSA-form program.

    Args:
        program: The SSA-form program represented as a data structure
            (e.g., a list of statements, where each statement has
            'uses', 'defs' and 'predecessors' attributes).

    Returns:
        A dictionary where keys are statements and values are sets of
        live variables at the entry of each statement.
    """
    live_vars = {statement: set() for statement in program}
    # Backward traversal: each use makes its variable live, and the
    # worklist pushes liveness to predecessors until the (unique)
    # definition of the variable is reached.
    for statement in reversed(program):
        for use_var in statement.uses:
            worklist = [statement]
            while worklist:
                stmt = worklist.pop()
                if use_var in live_vars[stmt]:
                    continue
                live_vars[stmt].add(use_var)
                for pred_stmt in stmt.predecessors:
                    # If not defined in the predecessor, keep going up
                    if use_var not in pred_stmt.defs:
                        worklist.append(pred_stmt)
    return live_vars
Explanation:
1. Initialization:
○ Create an empty dictionary live_vars to store the set of live variables at the
entry of each statement.
2. Backward Traversal:
○ Iterate through the program statements in reverse order (from the end to the
beginning).
3. Mark Uses as Live:
○ For each variable used in the current statement, add it to the live_vars set for
that statement.
4. Propagate Liveness:
○ For each predecessor statement of the current statement:
■ If the used variable is not defined in the predecessor statement:
■ Add the used variable to the live_vars set of the predecessor
statement. This ensures that the variable remains live along the
execution path leading to the current statement.
And how do we get the size of the maximum clique in the chordal graph?
In chordal graphs, the size of the maximum clique can be efficiently determined by leveraging
the concept of simplicial vertices and simplicial elimination orderings.
Simplicial Vertex:
def maximum_cardinality_search(graph):
    """
    Computes a simplicial elimination ordering using
    Maximum Cardinality Search.

    Args:
        graph: The input graph represented as a dictionary mapping
            each vertex to the set of its neighbors.

    Returns:
        A list representing the simplicial elimination ordering.
    """
    labels = {v: 0 for v in graph}
    ordering = []
    remaining = set(graph)
    while remaining:
        # Pick the unordered vertex with the most ordered neighbors
        v = max(remaining, key=lambda u: labels[u])
        ordering.append(v)
        remaining.remove(v)
        # Bump the label of every unordered neighbor of v
        for neighbor in graph[v]:
            if neighbor in remaining:
                labels[neighbor] += 1
    return ordering
○ The maximum clique size in the chordal graph is equal to this maximum
label plus 1!
Rationale:
● The MCS algorithm assigns labels to vertices based on the number of already-ordered
neighbors.
● In a chordal graph, the maximum label assigned to a vertex during MCS represents the
size of the maximum clique that includes that vertex.
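Concretely, we can piggyback the clique computation on the MCS traversal itself (a sketch; as before, the graph is assumed to be a dictionary from each vertex to its set of neighbors):

def max_clique_size(graph):
    """
    Size of the largest clique of a chordal graph: one plus the
    largest label observed during Maximum Cardinality Search.
    """
    labels = {v: 0 for v in graph}
    best = 0
    remaining = set(graph)
    while remaining:
        v = max(remaining, key=lambda u: labels[u])
        best = max(best, labels[v])
        remaining.remove(v)
        for neighbor in graph[v]:
            if neighbor in remaining:
                labels[neighbor] += 1
    return best + 1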
The sequence of figures below shows how maximum cardinality search finds a simplicial
elimination ordering for the interference graph of our running example:
In this example, the simplicial elimination ordering will be T7, R1, R2, T1, R5, R4, T8, R6 and
T9, as we had seen before. If we remove the nodes in the reverse simplicial ordering, then,
whenever we remove a node, all its neighbors in the remaining graph form a clique.
Ok, I can find out the size of the largest clique of a chordal interference
graph in polynomial time. But how does that help me with spilling?
Sebastian Hack has a cute theorem in his PhD Dissertation that helps with that!
By using the size of the largest clique as a guide, we can strategically reduce register pressure,
enabling efficient register allocation while balancing computational constraints.
The following formula, adapted from Andrew Appel's book Modern Compiler Implementation in
Java, provides a structured way to compute spill costs:
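The metric is usually stated as follows (a sketch; occurrences inside loops are weighted more heavily, and the total is divided by the degree of the variable's node in the interference graph):

def spill_cost(uses_defs_outside_loops, uses_defs_inside_loops, degree):
    """
    Appel-style spill metric: the lower the cost, the better the
    candidate for spilling.
    """
    return (uses_defs_outside_loops + 10 * uses_defs_inside_loops) / degree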
Variables with lower spill costs are prioritized for spilling, as they are expected to cause less
runtime overhead. The figure below shows how this formula could be applied to our running
example:
Iterative Process
By systematically choosing spill candidates with minimal impact, we ensure efficient register
allocation while minimizing the runtime cost of spilling.
To assign registers to variables, we can use greedy coloring! Here is a mock Python
implementation of the greedy coloring algorithm:
def greedy_coloring(S, V, E):
    """Colors the vertices of (V, E), visiting them in the
    order S (e.g., a simplicial elimination ordering)."""
    m = {}  # Maps each vertex to its color (i.e., its register).
    for v in S:
        # Find colors used by the already-colored neighbors of v
        neighbor_colors = {m[u] for u in V if u in m and
                           ((v, u) in E or (u, v) in E)}
        # Assign the lowest available color
        m[v] = next(c for c in range(len(V))
                    if c not in neighbor_colors)
    return m
For chordal graphs, if we follow a Simplicial Elimination Ordering (SEO), greedy coloring
yields an optimal coloring.
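For instance, here is a quick sanity check, reusing maximum_cardinality_search and the greedy_coloring routine above on a small, hypothetical chordal graph (a triangle plus one pendant vertex):

# A chordal graph: the triangle {0, 1, 2} plus the pendant vertex 3.
graph = [[1, 2], [0, 2], [0, 1, 3], [2]]
V = list(range(len(graph)))
E = {(u, v) for u in V for v in graph[u]}

seo = maximum_cardinality_search(graph)  # [0, 1, 2, 3]
m = greedy_coloring(seo, V, E)           # {0: 0, 1: 1, 2: 2, 3: 0}
# Three colors are used: exactly the size of the largest clique.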
After spilling, the interference graph is guaranteed to be K-colorable, where K is the number of available registers, and greedy coloring then produces a valid and efficient register assignment for the program without additional spilling. Thus, notice that it is vital to reduce the size of the largest clique to at most K, the number of registers in the target architecture. For instance, if we go back to the interference graph of our original example, you will see that we need at least four different colors:
However, if we try the same greedy coloring algorithm on the graph that is produced after the
spilling phase (assuming three registers), then we are guaranteed to find a coloring with at most
three colors, e.g.:
Post-Assignment Coalescing
The algorithm modifies the interference graph G and attempts to merge nodes corresponding to variables involved in copy instructions while maintaining colorability.
"""
V, E = G
# Start with a copy of G
G_prime = (V.copy(), E.copy())
for copy in L:
# Extract variables from "x = y"
x, y = parse_copy_instruction(copy)
return G_prime
1. Input: The list of copy instructions L, the interference graph G, its coloring m, and the number of registers K.
2. Finding Unused Colors: For each copy instruction x=y, compute the colors used by the
neighbors of x and y in the graph. If there exists a color c that is unused in these
neighborhoods, proceed with coalescing.
3. Coalescing:
○ Create a new node xy to represent the merged variables.
○ Assign xy the color c and update the graph to reflect this merge.
○ Replace occurrences of x and y in the copy instructions with xy.
○ Remove x and y from the graph.
4. Repeat: Process all copy instructions in L.
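The helpers neighbors and merge_nodes used in the sketch above are not part of the original pseudocode; here is one hypothetical way to write them, assuming V is a set of node names and E a set of ordered pairs:

def neighbors(G, v):
    # All nodes that share an edge with v (hypothetical helper).
    V, E = G
    return ({b for (a, b) in E if a == v} |
            {a for (a, b) in E if b == v})

def merge_nodes(G_prime, m, x, y, c):
    # Step 3: merge x and y into a new node xy with color c.
    # (Renaming x and y to xy in the remaining copy
    # instructions, as the step also requires, is omitted.)
    V, E = G_prime
    xy = f"{x}{y}"
    V.add(xy)
    for (a, b) in list(E):
        if a in (x, y) and b not in (x, y):
            E.add((xy, b))
        if b in (x, y) and a not in (x, y):
            E.add((a, xy))
    E -= {(a, b) for (a, b) in set(E)
          if a in (x, y) or b in (x, y)}
    V -= {x, y}
    m[xy] = c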
The figure below shows the effect of eliminating copies in the program after register assignment:
Conclusion:
So, is it more important to avoid spills than to remove copy instructions?
Yes, avoiding spilling is considered more important than removing copy instructions in many different academic works, such as Bouchez's "On the Complexity of Register Coalescing". This approach ensures stable and predictable performance by minimizing expensive memory accesses while still trying to reduce copy instructions when feasible. It reflects a practical compromise between aggressive optimization (copy removal) and maintaining efficient register allocation.
But, can't spilling happen during the SSA elimination phase? What if the
arguments of phi-functions end up mapped to different stack slots?
It depends on the program's representation! This issue is discussed in detail in the paper "SSA
Elimination after Register Allocation". Here's the breakdown:
● Converting the program to Conventional SSA (CSSA) form, by splitting the live ranges of φ-related variables (e.g., inserting a copy like a3 = a2), ensures that these variables are handled cleanly.
○ The result is that all variables related to a φ-function can naturally be assigned the same slot, and no extra memory transfer instructions are needed during SSA elimination.
● Arguments of φ-functions are naturally assigned the same stack slot if they are ever
spilled.
● Memory transfers can thus be entirely avoided, reducing the need for further spilling.
In the end, by converting a program to CSSA form before spilling, the complexity and overhead
associated with SSA elimination can be greatly reduced.
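As a schematic illustration (block and variable names here are hypothetical), consider a φ-function whose operands are defined in two different predecessor blocks:

B1: a1 = ...        B2: a2 = ...
B3: a3 = φ(a1, a2)

Converting the program to CSSA form splits the live ranges by inserting one copy at the end of each predecessor:

B1: a1 = ...        B2: a2 = ...
    b1 = a1             b2 = a2
B3: a3 = φ(b1, b2)

The φ-related variables b1, b2, and a3 now have short live ranges that never overlap, so, if they are spilled, all of them can share a single stack slot, and the φ-function vanishes without any extra memory transfers.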
What are these stack slots you mentioned? I mean: where are spilled
variables stored?
Spilled variables are mapped to stack memory. Spilling occurs when there are not enough
registers to hold all the live variables at a given point in the program. The compiler generates
additional load and store instructions to move the spilled variable’s value between the stack and
registers as needed.
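For instance, assuming a spilled variable whose stack slot sits at offset -8 from the frame pointer s0 (both the offset and the register choices here are hypothetical), the RISC-V spill code would look like this:

sw   t0, -8(s0)   # spill: store the value into its stack slot
# ... t0 is reused for other values here ...
lw   t0, -8(s0)   # reload: bring the value back before its next use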
But what about ordinary local variables: are they kept in registers or on the stack?
Both mappings are possible! This is determined by the optimization level and by the number of available registers:
And what about global variables?
Global variables are not spilled to the stack. Instead, they are stored in static memory, which is
allocated in the program’s data segment. Here’s why:
● Global variables have a lifetime that spans the entire execution of the program.
● They are accessible from anywhere in the program, so they are placed in a fixed
memory location (e.g., .data or .bss segments).
Finding a spill location for a global variable is unnecessary because its memory location is already predetermined. We can still operate on globals in registers, e.g., bring them into registers and keep them there for as long as possible. But, ultimately, they need to be stored back into their original memory location. In other words, any assignment to a global variable must change the state of the static memory that holds it.
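For instance, incrementing a hypothetical global counter in RISC-V brings its value into a register, operates on it there, and writes the result back to its fixed address:

la   t0, counter    # load the address of the global (a hypothetical symbol)
lw   t1, 0(t0)      # bring its value into a register
addi t1, t1, 1      # operate on it while it stays in a register
sw   t1, 0(t0)      # store it back into static memory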
Heap-allocated variables, in turn, follow different rules:
● These variables are explicitly allocated and freed by the programmer (e.g., using
malloc/free in C or new/delete in C++).
● Their values reside in the heap, and only pointers to them are usually manipulated in
registers.
It doesn’t make sense to talk about spilling a heap-allocated variable itself because spilling
refers to moving a register-held value to memory. However, the pointer to a heap-allocated
variable could be spilled if it is stored in a register and register pressure is too high.
Summary
If we assume that we are using a compiler like gcc or clang, then this is what happens: spilling is a concept closely tied to register allocation, and it applies to values temporarily held in registers, not to variables that live permanently in memory (like globals or heap-allocated variables).
The Go compiler handles φ-functions similarly to the approach discussed in these lecture notes:
● It ensures that all operands of a φ-function can be mapped to the same physical location,
whether in a register or memory.
● If the program is not in Conventional Static Single Assignment (CSSA) form, the Go
compiler splits the live ranges of φ-related variables to resolve potential conflicts.
For spilled variables, the Go compiler uses the stackalloc.go file to assign stack slots. A few
notable aspects include:
● Shared Stack Slots: The Go compiler optimizes stack usage by allowing multiple SSA
values to share the same stack slot, provided their lifetimes do not overlap.
● Interference Graph for Optimization: To minimize stack allocation costs, the Go
compiler leverages the interference graph of SSA variables, ensuring that stack slots are
reused efficiently.
Further Reading
For a detailed explanation of the Go compiler's register allocation process, Vladimir Makarov’s
article offers an excellent overview. Makarov highlights the unique aspects of Go’s SSA-based
allocation strategy and its practical implications.
In 1981, Gregory Chaitin introduced the idea of modeling register allocation as a graph
coloring problem in his seminal paper, Register Allocation via Coloring.
However, at the time, they did not yet show that SSA-form programs inherently produce chordal interference graphs.
Chaitin's proof of NP-completeness for register allocation only works for general programs: it would not work for SSA-form programs, whose interference graphs are chordal and can therefore be colored in polynomial time.
Gregory Chaitin is a pioneering figure in both theoretical computer science and mathematics.
Born in 1947 in the United States, Chaitin demonstrated exceptional talent in mathematics and
computer science from a young age. He is widely recognized for his groundbreaking work in
algorithmic information theory (AIT), a field that blends computation with information theory
to explore the limits of what can be computed and described. Over the years, Chaitin has made
contributions to both theoretical and practical aspects of computer science.
Chaitin spent much of his career at IBM Research, where his ideas influenced numerous areas
of computer science. In addition to his technical achievements, he is a prolific author and
engaging speaker, bringing abstract concepts to broader audiences through books and lectures.
Contributions to Science
His popular books, such as Meta Math! The Quest for Omega, make these ideas accessible to broader audiences.
Legacy
Gregory Chaitin’s work has left an indelible mark on both theoretical computer science and
practical software engineering. His contributions to register allocation have shaped compiler
technology, making programs run faster and more efficiently. Simultaneously, his work in AIT has deepened our understanding of computation, randomness, and mathematical truth, inspiring many researchers.