Compiler Design Notes
1. What is a Compiler?
A compiler is a software tool that translates code written in a high-level programming language (like
C++ or Java) into machine code (binary code) understandable by a computer's CPU.
Unlike an interpreter, which translates code line-by-line, a compiler converts the entire code at once,
creating an executable file that can run independently of the original code.
2. Importance of Compilers
Compilers enable code portability, allowing programs written in high-level languages to run on different
hardware architectures.
They optimize code for faster execution and efficient use of resources, which is crucial in performance-
sensitive applications.
3. Phases of a Compiler
Compilers work in multiple stages or phases. Each phase processes the code in steps, transforming it
progressively closer to machine code.
Lexical Analysis:
• The first phase: it reads the source code character by character and groups the characters into meaningful units called tokens (covered in detail in Part 2).
Syntax Analysis:
• This phase, also known as "parsing," checks if tokens form valid expressions based on the
language's grammar rules.
• The syntax analyzer builds a tree structure called a *parse tree* or *syntax tree* to represent
the hierarchical structure of the source code.
• If the syntax analyzer encounters code that doesn't follow the grammar rules, it generates syntax
errors.
Semantic Analysis:
• In this phase, the compiler checks the code for semantic consistency and meaning.
• It verifies variable declarations, type checking, and ensures operations are performed on
compatible data types.
• Example: Checking if a variable is declared before use or if we’re adding two compatible types,
like integers, rather than incompatible ones, like an integer and a string.
Intermediate Code Generation:
• The compiler translates the syntax tree into intermediate code, which is simpler than the source code but not yet machine code.
• This intermediate code is platform-independent, meaning it can be translated further into
machine code for different hardware architectures.
Code Optimization:
• In this phase, the compiler improves the intermediate code for better performance and efficient
resource use.
• Techniques include eliminating redundant instructions, minimizing memory usage, and reducing
execution time.
• Optimization aims to make the code run faster without changing its functionality.
Code Generation:
• The compiler converts the optimized intermediate code into machine code specific to the target
CPU.
• The generated machine code can then be executed directly by the computer’s hardware.
• Throughout these phases, the compiler maintains a *symbol table* to keep track of variable
names, their data types, scope, and memory locations.
• Additionally, the compiler performs *error handling* to detect and report errors in the source
code. It may also attempt error recovery to continue processing the code.
4. Types of Compilers
Single-pass Compiler: Processes the code in a single pass through all phases. This is faster but may have limited error detection capabilities.
Multi-pass Compiler: Analyzes the code in multiple passes, allowing for more in-depth error checking
and optimization.
Cross-compiler: Generates code for a different platform than the one it runs on.
Just-In-Time (JIT) Compiler: Compiles code at runtime, commonly used in environments like Java Virtual
Machine (JVM).
Compiler vs. Interpreter:
Compiler: Translates the entire program at once, producing an executable file that runs independently.
Interpreter: Translates and executes code line by line, which is slower but more flexible, making it ideal for scripting and testing.
Part 2: Lexical Analysis
• Lexical analysis is the initial phase in the compilation process. It reads the source code character by character and groups these characters into meaningful sequences, known as tokens.
• The primary goal is to simplify the parsing process by reducing the complexity of analyzing raw
text.
• Tokens are the smallest units in a source code that have a meaningful role. Examples include
keywords (`int`, `return`), identifiers (variable names like `x` or `total`), literals (`5`, `"hello"`),
operators (`+`, `-`, `*`), and punctuation (`;`, `(`, `)`).
• Each token represents a distinct element of the programming language and is associated with a
specific class (e.g., identifier, keyword, operator).
Lexeme: A lexeme is the actual string of characters in the source code that matches a specific token.
For instance, in the code `int num = 5;`, the lexeme `int` corresponds to the token type "keyword."
Pattern: A pattern is the rule that specifies how a particular lexeme should look. These patterns are
defined using regular expressions or grammar rules. For example, the pattern for an identifier might be
"a sequence of letters and numbers that starts with a letter."
Tokenizing: Breaking down the source code into tokens based on language rules.
Classifying Tokens: Categorizing tokens based on their type, such as keywords, identifiers, literals, or
operators.
Skipping White Space and Comments: Lexical analyzers ignore white space, tabs, and comments as
they don’t contribute to the executable code.
Error Detection: During lexical analysis, certain errors, such as illegal characters or invalid identifiers,
can be identified early. Error handling at this phase allows the compiler to highlight mistakes in the code,
like unrecognized symbols.
Regular expressions are used to define patterns for various tokens in the source code. Each regular
expression specifies the allowed structure for tokens like identifiers, numbers, or keywords.
Example:
• Identifiers: `[a-zA-Z_][a-zA-Z0-9_]*` (A sequence that starts with a letter or underscore and can
be followed by letters, numbers, or underscores).
• Numbers: `[0-9]+(\.[0-9]+)?` (An integer or a decimal number).
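As a small illustration, these patterns can drive a tokenizer directly. The sketch below uses Python's re module; the operator, punctuation, and keyword sets beyond the two patterns above are assumptions added for the example (the number pattern uses a non-capturing group so the group names stay unambiguous).

import re

# Token patterns: the identifier and number patterns from above, plus assumed
# operator, punctuation, and whitespace patterns for this sketch.
TOKEN_SPEC = [
    ("NUMBER",     r"[0-9]+(?:\.[0-9]+)?"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("OPERATOR",   r"[+\-*/=]"),
    ("PUNCT",      r"[;(),]"),
    ("SKIP",       r"[ \t\n]+"),        # white space is ignored
    ("MISMATCH",   r"."),               # anything else is a lexical error
]
KEYWORDS = {"int", "float", "return", "if", "while"}   # assumed keyword set
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Yield (token_class, lexeme) pairs for the given source string."""
    for match in MASTER_RE.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue                                    # drop white space
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"                            # reclassify reserved words
        if kind == "MISMATCH":
            raise SyntaxError(f"illegal character {lexeme!r}")
        yield (kind, lexeme)

print(list(tokenize("int num = 5;")))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'num'), ('OPERATOR', '='), ('NUMBER', '5'), ('PUNCT', ';')]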
Lexical analyzers often use finite automata (FA) to recognize tokens. Finite automata are state
machines that process input symbols (characters) to determine whether a sequence belongs to a specific
token class.
Deterministic Finite Automaton (DFA): Has only one possible state for each input, making it simpler
but less flexible.
Nondeterministic Finite Automaton (NFA): Allows multiple transitions for a single input, offering
more flexibility but is generally converted into a DFA for efficiency.
DFAs are typically generated from regular expressions and used by the lexical analyzer to match
lexemes against token patterns.
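For instance, the identifier pattern corresponds to a two-state DFA: a start state and an accepting state reached after the first letter or underscore. The sketch below simulates that DFA directly; the state numbering and the use of Python's isalpha/isalnum as stand-ins for the letter and digit classes are assumptions made for the example.

# DFA for identifiers: [a-zA-Z_][a-zA-Z0-9_]*
# State 0 = start, state 1 = accepting (at least one valid character read).
def is_identifier(lexeme):
    state = 0
    for ch in lexeme:
        if state == 0 and (ch.isalpha() or ch == "_"):
            state = 1                     # first character: letter or underscore
        elif state == 1 and (ch.isalnum() or ch == "_"):
            state = 1                     # later characters: letter, digit, or underscore
        else:
            return False                  # no transition defined: reject the input
    return state == 1                     # accept only if we stopped in the accepting state

print(is_identifier("total_1"))   # True
print(is_identifier("1total"))    # False: starts with a digit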
7. Lexical Errors
Common lexical errors include illegal characters, malformed identifiers, and unterminated strings or comments. Lexical analyzers usually skip or report these errors, allowing the compiler to alert the programmer to issues in the code format.
Lexical Analyzer Generators (Lex and Flex)
Lexical analyzers can be generated using tools like Lex (the original Unix tool) or Flex (Fast Lexical Analyzer). These tools accept regular expressions and generate the code needed for token recognition.
Advantages of using Lex/Flex include automatic handling of token generation, efficiency, and standard
error handling routines.
Part 3: Syntax Analysis
Syntax analysis, also known as parsing, is the second phase in the compilation process. Its goal is to determine whether the sequence of tokens (generated in the lexical analysis phase) forms a valid sentence in the programming language.
This phase verifies the syntactic structure of the code according to the language's grammar, ensuring it
adheres to rules like the placement of keywords, operators, and symbols.
The parser processes tokens to build a parse tree, which represents the hierarchical structure of
statements in the source code.
Parse Tree: A tree structure that visually represents the grammatical structure of a source code
according to the rules of the language's grammar.
Example: In the expression `a = b + c;`, the parse tree would show `=` as the root with branches for `a`
and the expression `b + c`.
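One simple way to encode such a tree inside a compiler is as nested nodes; the tuple format and the small printing helper below are just an illustration, not a prescribed representation.

# Syntax tree for `a = b + c;`: '=' at the root, with 'a' on the left and the
# '+' subtree on the right.
tree = ("=", "a", ("+", "b", "c"))

def show(node, depth=0):
    """Print each node indented according to its depth in the tree."""
    if isinstance(node, tuple):
        op, left, right = node
        print("  " * depth + op)
        show(left, depth + 1)
        show(right, depth + 1)
    else:
        print("  " * depth + node)

show(tree)   # prints '=' at the root, with 'a' and the '+' subtree indented beneath it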
Syntax analyzers use context-free grammar (CFG) to define the language rules. CFG consists of rules
specifying how tokens can be combined to form valid code statements.
Components of CFG:
• Terminals: Basic symbols or tokens from the source code (e.g., `int`, `+`, `;`).
• Non-terminals: Variables representing patterns or groups of terminals and other non-terminals
(e.g., `expression`, `statement`).
• Productions: Rules that define how non-terminals can be expanded into terminals and other
non-terminals.
• Start Symbol: The non-terminal symbol from which parsing begins, representing the entire code
or program.
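A small illustrative grammar for arithmetic expressions (not taken from the notes) shows all four components together:

expr -> expr + term | term
term -> term * factor | factor
factor -> ( expr ) | id

Here `expr`, `term`, and `factor` are non-terminals; `+`, `*`, `(`, `)`, and `id` are terminals; each line is a production; and `expr` is the start symbol.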
4. Types of Parsers
Top-Down Parsers:
Begin parsing from the start symbol and attempt to derive the input string by expanding the
grammar rules.
Bottom-Up Parsers:
• Start parsing from the input tokens and attempt to build the parse tree upwards by reducing
tokens to non-terminals.
• LR parsers (like SLR, LALR, and Canonical LR parsers) are common types of bottom-up parsers.
They parse input from left to right and construct a rightmost derivation in reverse.
Recursive-Descent Parsing:
• A straightforward method of top-down parsing that uses a set of recursive functions to process input according to the grammar rules (see the sketch below).
• Easy to implement, but limited to simpler (non-left-recursive) grammars.
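A minimal recursive-descent parser for a tiny, non-left-recursive expression grammar might look like the sketch below; the grammar, the token format, and the class name are assumptions chosen for the example.

# Grammar (no left recursion, so each rule becomes one function):
#   expr   -> term  (('+' | '-') term)*
#   term   -> factor (('*' | '/') factor)*
#   factor -> NUMBER | '(' expr ')'
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens              # e.g. ["3", "+", "4", "*", "2"]
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.eat(op)
            node = (op, node, self.term())    # build the syntax tree as we go
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.peek()
            self.eat(op)
            node = (op, node, self.factor())
        return node

    def factor(self):
        tok = self.peek()
        if tok == "(":
            self.eat("(")
            node = self.expr()
            self.eat(")")
            return node
        if tok is not None and tok.isdigit():
            self.eat(tok)
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")

print(Parser(["3", "+", "4", "*", "2"]).expr())   # ('+', '3', ('*', '4', '2'))

Because `*` and `/` are handled one level deeper in the grammar than `+` and `-`, operator precedence falls out of the grammar itself.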
Predictive Parsing:
A form of top-down parsing that chooses which production to apply by looking ahead at the next input token(s), avoiding backtracking (e.g., LL(1) parsers).
Shift-Reduce Parsing:
A type of bottom-up parsing where tokens are shifted onto a stack and then reduced based on the
grammar rules.
Example parsers using shift-reduce techniques include LR parsers, SLR (Simple LR) parsers, and LALR (Look-Ahead LR) parsers.
During syntax analysis, the parser detects errors when the sequence of tokens does not match any
valid pattern in the grammar.
Types of syntax errors include missing operators, unbalanced parentheses, and incorrect keyword
usage.
Error Recovery Strategies:
Panic Mode Recovery: Discards tokens until it finds a known synchronization point (like a semicolon).
Phrase-Level Recovery: Replaces or inserts tokens to correct errors and continue parsing.
Error Productions: Special grammar rules added to handle common syntax errors.
Global Correction: Finds the minimum set of changes needed to make the input syntactically correct,
though this is complex and rarely used.
7. Ambiguity in Grammar
A grammar is ambiguous if it allows multiple parse trees (or derivations) for the same string.
Ambiguity can cause issues in parsing since the compiler may not know which parse tree to follow,
leading to inconsistent interpretations of code.
Removing Ambiguity:
Rewriting the grammar or applying constraints can help eliminate ambiguity, ensuring a single parse
tree for each valid input.
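As an illustration (not taken from the notes), the grammar
E -> E + E | E * E | id
is ambiguous: the string `id + id * id` has two parse trees, one that groups the `+` first and one that groups the `*` first. Rewriting it with one non-terminal per precedence level removes the ambiguity:
E -> E + T | T
T -> T * F | F
F -> id
Now `id + id * id` has exactly one parse tree, with `*` binding more tightly than `+`.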
Part 4: Semantic Analysis
• Semantic analysis is the third phase in the compilation process, following lexical and syntax analysis.
• Its goal is to ensure the code makes logical sense by checking that expressions, statements, and
program constructs follow semantic rules.
• This phase checks for errors that syntax analysis cannot detect, like variable type mismatches,
undeclared variables, and improper use of operators.
Semantic errors occur when code violates the language's logical rules.
Examples include type mismatches, the use of undeclared variables, and operations applied to incompatible operands.
Semantic analysis helps in identifying these errors before the code moves to further phases, which is
critical to preventing logical flaws in the compiled code.
3. Type Checking
Type checking is a fundamental part of semantic analysis. It ensures variables and expressions are used
in ways consistent with their data types.
Types can include `int`, `float`, `boolean`, etc., and each operation must conform to type constraints.
For instance, adding two integers is valid, but adding an integer to a string may not be.
Static Type Checking: Performed at compile time, catching type errors before the program runs.
Dynamic Type Checking: Performed at runtime; useful in languages that allow flexible typing, like Python.
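A minimal sketch of a static type check for the `+` operator; the type names and the particular rule table are assumptions chosen for the example.

# Result types for '+', indexed by the operand types.
# ('int', 'str') is deliberately missing: adding an integer to a string is a semantic error.
PLUS_RULES = {
    ("int", "int"): "int",
    ("int", "float"): "float",
    ("float", "int"): "float",
    ("float", "float"): "float",
}

def check_plus(left_type, right_type):
    """Return the result type of `left + right`, or report a type error."""
    try:
        return PLUS_RULES[(left_type, right_type)]
    except KeyError:
        raise TypeError(f"cannot add {left_type} and {right_type}")

print(check_plus("int", "int"))     # int
print(check_plus("int", "float"))   # float (an implicit conversion widens the int)
# check_plus("int", "str") would raise: cannot add int and str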
The compiler maintains a symbol table as it processes the code, tracking identifiers (like variable
names, function names) and their attributes, such as type, scope, and memory location.
Scope: Refers to the context within which a variable or identifier is accessible. Semantic analysis
enforces scoping rules to prevent variables from being used outside their allowed areas.
Local Scope: A variable declared within a function or block is limited to that scope.
Global Scope: Variables declared outside all functions are accessible globally within the program.
Binding: The association of variables with specific memory locations or types. Semantic analysis
manages binding to avoid conflicts in memory allocation.
Implicit Type Conversion (Coercion): The compiler automatically converts data from one type to another where necessary, like converting an integer to a float during a mathematical operation.
Explicit Type Conversion (Casting): The programmer manually changes a data type using casting syntax, such as converting a float to an integer explicitly.
Semantic analysis ensures that type conversions and casts are valid and consistent with language rules.
Some compilers begin generating intermediate code during semantic analysis. This code is an
abstraction between source code and machine code, often in a form like three-address code.
The intermediate code is designed to be platform-independent, ensuring it can later be translated into
different machine codes.
Semantic analysis tools can automatically detect common semantic errors, reducing the manual effort
required by the programmer.
Methods like attribute grammars can attach additional rules or properties to grammar productions,
further enhancing semantic checks.
Part 5: Intermediate Code Generation
Intermediate code generation translates the source code into an intermediate, platform-independent
representation, which simplifies the final conversion to machine code.
This intermediate code acts as a bridge, allowing optimization before generating machine-specific
instructions.
Advantages include portability across different hardware architectures and easier optimization.
Easy to Translate: Should easily convert to machine code or assembly language for the target machine.
Optimizable: Should facilitate optimizations like reducing redundant computations and improving
memory usage.
Platform-Independent: Helps in writing a single compiler front end that can target multiple platforms
with different back ends.
Three-Address Code (TAC):
TAC breaks down expressions into simple instructions that each contain at most three addresses (two operands and one result).
Example: the statement `a = b + c * d` becomes:
t1 = c * d
t2 = b + t1
a = t2
TAC simplifies analysis and optimization because each instruction is relatively atomic and easy to
manipulate.
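A sketch of how a compiler might emit TAC from a nested expression tree; the tuple tree format and the temporary-name counter are assumptions made for the example.

from itertools import count

_temps = count(1)   # generator for fresh temporary names t1, t2, ...

def gen_tac(node, code):
    """Emit TAC for `node` into `code`; return the name that holds its value."""
    if isinstance(node, str):
        return node                        # a plain variable is already an address
    op, left, right = node
    left_name = gen_tac(left, code)
    right_name = gen_tac(right, code)
    temp = f"t{next(_temps)}"              # fresh temporary for this result
    code.append(f"{temp} = {left_name} {op} {right_name}")
    return temp

code = []
code.append(f"a = {gen_tac(('+', 'b', ('*', 'c', 'd')), code)}")   # a = b + c * d
print("\n".join(code))
# t1 = c * d
# t2 = b + t1
# a = t2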
Quadruples:
Each instruction in this form has four fields: operation, argument 1, argument 2, and result.
Example: the statement `a = b + c` is represented as:
(+, b, c, a)
Quadruples provide a convenient way to represent TAC operations with clear, separate fields for
operation and arguments.
Triples:
Each instruction has only three fields: an operation and two arguments; there is no explicit result field.
Instead, results are referenced by the instruction's position, reducing the need for temporary variables.
Example: for `b + c * d`:
(0) (*, c, d)
(1) (+, b, #0)
Here `#0` refers to the result of instruction 0. Triples simplify memory management by avoiding the need for extra variable storage, but they make it harder to track results in larger expressions.
Conditional Statements:
In TAC, conditional statements like `if` and `while` translate into sequences that jump to specific
labels based on conditions.
Example: `if (a < b) c = d;` becomes:
if a < b goto L1
goto L2
L1: c = d
L2:
Loop Statements:
Loops like `while` and `for` are translated into labels with conditional jumps that facilitate repeating
instructions until a condition is met.
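For example, a loop such as `while (i < n) i = i + 1;` might be translated along these lines (the label names are illustrative):
L1: if i < n goto L2
goto L3
L2: i = i + 1
goto L1
L3:
The test at L1 decides whether to enter the body at L2 or exit to L3; the jump back to L1 repeats the test after each iteration.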
Functions are represented with instructions that handle parameter passing, function entry, and exit,
allowing arguments to be passed in registers or memory based on the calling conventions of the
language.
Parameter Passing: Instructions for loading argument values into designated locations.
Optimization: Intermediate code simplifies transformations like code motion and common
subexpression elimination, leading to optimized final code.
Retargetability: A single front end (lexical, syntax, and semantic analysis) can be combined with multiple
back ends (code generators for various platforms), making compiler development efficient.
Part 6: Code Optimization
Code optimization is a compilation phase aimed at enhancing the performance and efficiency of the intermediate code without altering its functionality.
It improves the runtime speed, reduces memory usage, and makes better use of system resources.
Machine-Independent Optimization:
These optimizations are applied to the intermediate code and are independent of the target machine.
Examples include common subexpression elimination, loop optimization, and dead code elimination.
Machine-Dependent Optimization:
These optimizations are specific to the target machine’s architecture and instruction set.
Constant Folding:
This technique precomputes constant expressions at compile time, replacing them with their
computed values.
Common Subexpression Elimination:
If an expression appears multiple times and yields the same result, it is computed once, and subsequent uses refer to the precomputed result.
Dead Code Elimination:
This removes code that does not affect the program's outcome, such as variables or expressions whose results are never used.
Loop Optimization:
Loop Unrolling: Expands loops by replicating the loop body, reducing the number of iterations and
branch instructions.
Loop Invariant Code Motion: Moves calculations that yield the same result in every iteration
outside the loop.
Induction Variable Elimination: Simplifies loop variables that change in predictable ways to reduce
computations.
Peephole Optimization:
A local optimization technique that examines and optimizes small sets of instructions (or
“peepholes”) in the code.
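A minimal peephole-style pass over three-address instructions, here performing constant folding; the instruction format is an assumption made for the example.

# Each instruction is (result, left_operand, operator, right_operand).
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def fold_constants(instructions):
    """Replace operations on two constants with their precomputed value."""
    folded = []
    for result, left, op, right in instructions:
        if isinstance(left, (int, float)) and isinstance(right, (int, float)):
            folded.append((result, OPS[op](left, right)))    # e.g. t1 = 12
        else:
            folded.append((result, left, op, right))         # leave it unchanged
    return folded

print(fold_constants([("t1", 3, "*", 4), ("t2", "x", "+", "t1")]))
# [('t1', 12), ('t2', 'x', '+', 't1')]

A fuller optimizer would then propagate `t1 = 12` into later uses (constant propagation), but that is beyond this sketch.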
Data flow analysis is used to gather information about how data is used within a program, aiding in
optimizing code.
Reaching Definitions: Identifies statements where a variable’s value is defined and where that value
could be used.
Live Variable Analysis: Determines which variables hold values that might be needed later, helping to
eliminate unnecessary computations and dead code.
Register Allocation:
Allocates program variables to the limited number of CPU registers, reducing memory access time.
Graph Coloring: A method for register allocation where variables are assigned registers without conflicts,
minimizing the number of required registers.
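A greedy sketch of this idea: variables that are live at the same time interfere and must receive different registers. The interference graph and register names below are invented for the example, and the simple greedy strategy stands in for a full graph-coloring allocator.

# Interference graph: an edge means two variables are live at the same time
# and therefore cannot share a register.
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
registers = ["R1", "R2", "R3"]

def allocate(graph, colors):
    """Give each variable the first register not already used by a colored neighbour."""
    assignment = {}
    for var in graph:
        taken = {assignment[n] for n in graph[var] if n in assignment}
        free = [r for r in colors if r not in taken]
        if not free:
            raise RuntimeError(f"{var} must be spilled to memory")   # ran out of registers
        assignment[var] = free[0]
    return assignment

print(allocate(interference, registers))
# {'a': 'R1', 'b': 'R2', 'c': 'R3', 'd': 'R1'}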
6. Optimization Constraints
While optimization enhances code, it must not alter the intended output or introduce errors.
Optimization techniques are selected based on trade-offs in compile time, code size, and execution
speed, as excessive optimization might increase compile time without substantial performance gain.
Part 7: Code Generation
Code generation is the final phase of the compiler, where optimized intermediate code is converted into
machine code, specific to the architecture of the target machine.
This machine code is directly executable by the CPU and can run without the need for the original source
code.
The generated code should be efficient, minimizing resource use and optimizing execution time.
It should be correct and match the functionality specified by the source code, without introducing
unintended behavior.
Code generation should ideally use as few instructions and memory as possible, particularly in
resource-constrained environments.
Code generation must account for specific features of the target CPU, such as:
Instruction Set: The set of instructions that the CPU can execute, such as arithmetic operations,
load/store operations, and branching.
Registers: Limited storage within the CPU for holding variables and intermediate results.
Memory Hierarchy: Efficiently accessing memory, particularly for frequently accessed variables or large
data structures.
Instruction Selection: Determines the specific machine instructions that correspond to each
intermediate code instruction.
This step chooses the best instructions to minimize the number of operations and optimize execution
time.
Register Allocation: Registers provide fast access to data compared to memory, so assigning frequently used variables to registers can speed up execution.
Graph coloring algorithms are commonly used to manage register allocation efficiently.
Instruction Ordering:
Arranges instructions in an optimal sequence to avoid pipeline stalls, unnecessary delays, or memory
access conflicts.
Direct Translation:
Converts intermediate code instructions directly into target machine instructions without complex
transformations. This approach is simple but may produce less optimized code.
Template-Based Translation:
Uses templates for different patterns of intermediate code instructions, matching each template with corresponding machine code.
Example: An addition operation in intermediate code, such as `t1 = a + b`, could map to a machine-
specific template like `ADD R1, R2`.
Peephole Optimization:
Applied during or after code generation, this technique improves sequences of machine instructions in small windows (or "peepholes").
Removing redundant instructions: Eliminating instructions that don’t impact the final outcome.
Strength Reduction: Replacing complex operations with simpler ones, such as shifting for multiplication
by powers of two.
Jump Optimization: Simplifying or removing unnecessary jump instructions, which improves control
flow.
Example: for the statement `a = b + c * d`:
Intermediate Code:
t1 = c * d
t2 = b + t1
a = t2
Generated Machine Code (assuming a hypothetical architecture with registers R1, R2, and R3):
LOAD R1, c
MUL R1, d
LOAD R2, b
ADD R2, R1
STORE a, R2
Efficient Register Usage: Deciding which variables to store in registers versus memory, as registers are
limited.
Handling Complex Expressions: Breaking down expressions efficiently to minimize instructions and
register/memory usage.
Control Flow Management: Translating high-level control statements (loops, conditionals) into machine-
level jumps, branches, and calls.
Generated code must be thoroughly tested to ensure that it matches the intended functionality of the
source code.
Techniques include validating outputs, checking edge cases, and comparing performance with expected benchmarks.
Part 8: Symbol Table and Error Handling
The symbol table is a crucial data structure maintained by the compiler throughout the compilation
process. It stores information about identifiers (variables, functions, classes, etc.) found in the source
code.
The symbol table helps the compiler efficiently look up attributes of each identifier, such as type,
scope, and memory location.
The symbol table is organized to allow quick insertion, deletion, and lookup of symbols.
Stored attributes include scope levels, which indicate where an identifier is accessible (e.g., local, global).
Usage:
In semantic analysis, the symbol table checks if variables are declared before use.
In code generation, it provides memory locations for variables and tracks function parameters.
Scopes help the compiler determine which instances of identifiers to use, especially when variables
share names in different parts of the code.
Block Scope: Variables declared within a block are only accessible in that block.
Global Scope: Variables declared globally are accessible throughout the program.
The compiler may use a *stack of symbol tables* to handle nested scopes, pushing a new table when
entering a new scope and popping it when leaving.
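A sketch of such a stack of symbol tables; the class and method names are assumptions chosen for the example.

class ScopedSymbolTable:
    """A stack of dictionaries; the innermost scope is searched first on lookup."""
    def __init__(self):
        self.scopes = [{}]                      # start with the global scope

    def enter_scope(self):
        self.scopes.append({})                  # push a new table when entering a block

    def exit_scope(self):
        self.scopes.pop()                       # pop it when leaving the block

    def declare(self, name, info):
        self.scopes[-1][name] = info            # declare in the current (innermost) scope

    def lookup(self, name):
        for scope in reversed(self.scopes):     # search inner scopes before outer ones
            if name in scope:
                return scope[name]
        raise NameError(f"'{name}' is not declared")

symbols = ScopedSymbolTable()
symbols.declare("x", {"type": "int", "scope": "global"})
symbols.enter_scope()
symbols.declare("x", {"type": "float", "scope": "local"})
print(symbols.lookup("x"))    # the local x shadows the global one
symbols.exit_scope()
print(symbols.lookup("x"))    # back to the global x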
4. Error Handling in Compilation
Error handling ensures that errors in the code are detected and reported clearly, allowing programmers
to correct issues without the compiler halting abruptly.
Lexical Errors: Invalid characters, incorrect identifiers, etc., detected in the lexical analysis phase.
Syntax Errors: Violations of language grammar (e.g., missing semicolons, unbalanced parentheses),
detected in syntax analysis.
Semantic Errors: Logical inconsistencies like type mismatches, undeclared variables, or incompatible
operations, detected in semantic analysis.
5. Error Recovery Strategies
Panic-Mode Recovery: The compiler discards input symbols until it finds a known synchronization point (such as a semicolon), allowing parsing to continue from a stable state.
Phrase-Level Recovery:
Replaces or inserts tokens to recover from an error and continue parsing, useful for correcting small
syntax errors.
Error Productions:
Special grammar rules added to handle common errors, allowing the compiler to recognize and
correct predictable mistakes.
Global Correction:
Attempts to find the minimum number of changes needed to make the program valid, though it is
computationally expensive and rarely used.
6. Error Reporting
The compiler should provide informative error messages to help programmers understand and fix
issues.
Location: The place in the source (e.g., the line number) where the error was detected.
Cause and Suggestion: A brief description of the error and potential fixes.
T-Diagrams and Bootstrapping in Compiler Design
T-diagrams are visual tools used to represent compilers and their components. They illustrate the
relationships between languages, source code, and compilers.
• The top represents the source language (the language in which the source code is written).
• The middle represents the compiler.
• The bottom represents the target language (the language into which the source code is
translated).
T-diagrams help show the process of compiling code from one language to another, especially across
different machines or languages.
Source Language
| Compiler |
Target Language
For example, a C compiler that translates C code to assembly language could be depicted as:
C Source Code
| C Compiler |
Assembly Language
3. Using T-Diagrams for Bootstrapping
Bootstrapping is a method for creating a compiler for a programming language by using a compiler that
is partially or fully written in the language it is intended to compile.
It is commonly used when developing a new language or when creating a self-hosting compiler (a
compiler that can compile its own source code).
1. Cross-Compiler Stage:
- Write a cross-compiler: a compiler, written in a known language (like C), that translates the new language (e.g., Language X) into machine code.
- This compiler runs on an existing machine and outputs machine code that runs on the target machine.
- T-diagram representation:
Language X Source
| Cross-Compiler (written in C) |
Target Machine Code
2. Self-Hosting Stage:
- The compiler for Language X is then rewritten in Language X itself and compiled with the cross-compiler.
- Once the Language X compiler can compile its own source code, it becomes self-hosting.
3. Full Bootstrap:
- The language can now compile and execute its own compiler, enabling further development and
optimization.
- This final self-hosting compiler is often optimized and used as the primary compiler for the
language.
4. Why Use Bootstrapping?
Bootstrapping is essential for developing new languages and improving existing compilers.
It also ensures that a compiler can generate its own updates and enhancements, enabling independent
growth of the language ecosystem.
In compiler design, an input program that appears syntactically correct may still contain semantic errors,
which a compiler must detect to prevent runtime failures. This type of error-checking requires deeper
analysis than just syntax; the compiler needs a broader understanding of the program’s behavior, data,
and structures.
Context-sensitive analysis: Unlike parsing (which checks the structure), this analysis goes further,
checking if data types are used correctly and ensuring that each statement adheres to the language's
semantic rules.
Semantic elaboration: Compilers use this concept to gather information about the program’s types,
variables, scope, and control flow, which are all essential to producing an executable output.
Compilers typically perform context-sensitive analysis using one of two approaches:
1. Attribute grammars: A formal approach that attaches attributes and evaluation rules to grammar productions (covered in section 4.3).
2. Ad hoc syntax-directed translation: Informal, flexible methods where specific code snippets manage checks directly during parsing.
Type systems in programming languages assign a "type" to each data value. The type, as an abstract
category, defines a set of characteristics common to all values of that type (e.g., integers, characters,
booleans). Understanding types is critical for ensuring a program's safety, expressiveness, and efficiency.
Types also enhance efficiency by allowing the compiler to optimize code based on data types. More broadly, a well-designed type system offers several benefits:
1. Runtime Safety: A robust type system helps in catching errors before execution, reducing runtime
issues such as illegal memory access or data misinterpretation.
2. Expressiveness: Types allow designers to include complex behaviors, such as overloading, where an
operator’s meaning changes based on the operands’ types.
3. **Optimizing Code**: Knowledge of types enables more efficient code generation, as certain checks
can be bypassed or simplified.
Type construction rules: Defines how new types (like arrays or structures) are built from base types.
Type equivalence: Determines when two types are considered the same.
Type inference rules: Allows the compiler to deduce types for expressions based on context.
A complete type system involves several components, each defining a unique part of how types operate
and interact within a programming language. Here are the main components:
1. Base Types:
These types are atomic and serve as building blocks for more complex types.
2. Type Constructors:
Mechanisms for building new types from existing ones. Examples include array types (`int[]`), pointer types (`int*`), and function types (e.g., `void(int, float)`).
3. Type Equivalence:
Name Equivalence: Two types are equivalent if they share the same name.
Structural Equivalence: Two types are equivalent if they have the same structure, even if they have
different names.
4. Type Inference:
Deduction of the types of expressions and variables based on context, often without explicit type
declarations.
Type inference uses a set of rules to determine the types of expressions based on the operations
performed.
5. Type Checking:
- Verifying that operators and functions are applied to operands of acceptable types. For instance, addition may only be allowed between compatible numeric types.
A well-defined type system enhances a compiler's ability to detect errors early, optimize code, and
ensure program correctness.
4.3.1 Evaluation Methods
In attribute grammars, evaluation methods are approaches to computing the values of attributes. There
are three main evaluation methods:
1. Dynamic Methods:
- These methods adjust the evaluation order based on the structure of the attributed parse tree.
- Dynamic methods are like dataflow computers, where each rule "fires" once all input values are
available.
2. Oblivious Methods:
- Evaluation order is fixed and does not depend on the grammar's structure or specific parse tree.
- Techniques include left-to-right passes, right-to-left passes, and alternating passes until all attributes
are evaluated.
- This method is simple to implement but may not be as efficient as methods that take the specific tree
structure into account.
3. Rule-Based Methods:
- These methods rely on static analysis to determine an efficient evaluation order based on the
grammar structure.
- Rules are applied in an order determined by the parse tree's structure, allowing a more optimized
evaluation.
- Tools may perform static analysis to create a fast evaluator for attribute grammars.
4.3.2 Circularity
Circular dependencies in attribute grammars occur when an attribute depends on itself, either directly or
indirectly. This results in a **cyclic dependency graph**, causing issues in evaluation since the required
values cannot be computed.
Avoidance of Circularity:
- Some grammars, like those limited to synthesized attributes, inherently prevent circularity.
- Restricting the grammar to specific types of attributes or configurations can avoid cycles.
Handling Circularity:
- If a circular grammar is unavoidable, some methods attempt to assign values to all attributes, even if
they are involved in cycles. This may involve iterative evaluation methods.
4.3.3 Extended Examples
The authors provide extended examples to demonstrate how attribute grammars handle complex cases, including:
- For typed languages, a compiler must infer types for expressions by examining the types of operands
and operators.
- Each node in an expression tree represents an operation or value, and type inference functions (e.g.,
`F+`, `F-`, `F*`, `F/`) determine the result type for each operation.
- Attribute grammars can estimate execution costs by associating each operation with a "cost"
attribute.
- This cost attribute accumulates through the parse tree, allowing a cumulative estimation of resource
use for code segments.
- For example, loading a variable, performing arithmetic, and storing a result each adds a cost,
simulating the runtime behavior of the compiled code.
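As a rough illustration of such a synthesized cost attribute, each node's cost can be computed from its children's costs plus the cost of its own operation; the particular cost values below are invented for the example.

# Invented unit costs: loading a variable, one arithmetic operation, one store.
LOAD_COST, ARITH_COST, STORE_COST = 1, 1, 1

def cost(node):
    """Synthesized attribute: a node's cost is its children's costs plus its own."""
    if isinstance(node, str):
        return LOAD_COST                       # a leaf variable must be loaded
    op, left, right = node
    return cost(left) + cost(right) + ARITH_COST

expr = ("+", "b", ("*", "c", "d"))             # b + c * d
print(cost(expr) + STORE_COST)                 # 3 loads + 2 operations + 1 store = 6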
4.3.4 Problems with the Attribute-Grammar Approach
While attribute grammars provide a systematic way to perform context-sensitive analysis, they face
several practical limitations:
1. Non-Local Information:
- Attribute grammars work best for localized information, where dependencies flow from parent to child nodes or vice versa.
- For problems that require sharing information across distant nodes in the parse tree, attribute
grammars must rely on "copy rules," which increase complexity and reduce efficiency.
2. Storage Management:
- Attribute evaluation often creates numerous temporary values and instances of attributes.
- Efficient management of these values is necessary to avoid excessive memory use, but in many cases,
attribute grammars do not handle storage management optimally.
3. Locating Results for Later Phases:
- Results from attribute evaluation are stored throughout the parse tree.
- If later phases need these results, the compiler must perform tree traversals to retrieve them, which can slow down compilation.
- Some solutions, like using global tables to store attributes, violate the functional paradigm of attribute
grammars.
- Although practical, these ad hoc methods introduce dependencies that complicate the attribute
grammar framework.
The following notes cover the introductory topics from Chapter 5 of *Engineering a Compiler* by Keith Cooper and Linda Torczon.
5.1 Introduction to Intermediate Representations
An Intermediate Representation (IR) is a structure that compilers use to model the source code for
transformations and optimizations. IRs serve as a middle layer between source code and target machine
code. IRs allow compilers to standardize processes such as analysis, optimization, and code generation
without referring back to the original source code.
Characteristics of IRs:
Expressiveness: IRs must encode all program information required by various compiler passes.
Efficiency: Operations on the IR should be fast, as the compiler repeatedly accesses it.
Human Readability: IRs should be understandable so developers can analyze and debug compilation
processes.
IRs can be categorized based on their structural organization, level of abstraction, and naming
conventions:
1. Structural Organization:
Graphical IRs: Represent code as graphs, making complex relationships easy to analyze.
Linear IRs: Resemble assembly language, following a sequence of operations with an explicit order.
Hybrid IRs: Combine linear IRs for simple instruction sequences and graphical IRs for control flow
representation.
2. Level of Abstraction:
- IRs vary from high-level, close to the source language, to low-level, closer to machine code.
- Example: Array access could appear as a high-level operation or broken down into individual machine
instructions in a low-level IR.
3. Naming Discipline:
Explicit Naming: Uses variable names and labels directly in the IR.
Implicit Naming: Uses positions, such as stack positions in stack-machine IRs, instead of explicit variable
names.
Graphical IRs represent program components and their relationships as nodes and edges in graphs. They
make control flow, data dependencies, and other relationships more intuitive to visualize and optimize.
Parse Trees:
- Parse trees display the hierarchical structure of the source code according to grammar rules. Every
grammar symbol has a corresponding node in the tree.
- For example, an expression like `a * 2 + a * 2 * b` would create a large parse tree with nodes for every
grammar rule applied.
Abstract Syntax Trees (ASTs):
- ASTs are simplified parse trees that abstract away unnecessary nodes, focusing on the essential structural elements of the code.
- ASTs are smaller and more efficient, retaining only nodes relevant to program structure (e.g.,
operators, variables) without redundant intermediate steps.
Directed Acyclic Graphs (DAGs):
- DAGs further optimize ASTs by merging identical subexpressions. For instance, `a * 2` used in multiple parts of an expression would be stored once, with references to it instead of duplicating the expression.
Linear IRs represent instructions in a sequential list, resembling assembly code for an abstract machine.
This structure mirrors both source code and machine code, making it ideal for low-level optimizations.
- Example: stack-machine code for `a * 2 + b`:
push a
push 2
mul
push b
add
Here, the operands are pushed onto the stack, and operations like `mul` and `add` consume values from
the stack’s top, simplifying instruction formats and memory usage.
Three-address code is a popular linear IR where each instruction has at most three components (two
operands and one result). It simplifies translating high-level constructs into sequences of low-level
operations.
- Example: for `b + a * 2`:
t1 = a * 2
t2 = b + t1
TAC uses temporary variables to hold intermediate values, making the code easy to optimize and
translate into assembly code.
Symbol tables track variables, functions, and other identifiers throughout compilation. They are essential
for name resolution, type checking, and managing variable scopes.
5.5.1 Hash Tables for Symbol Tables
Hash Tables: Used to implement symbol tables due to their fast, average-case lookup times.
- In a hashed symbol table, each identifier is mapped to an index, enabling constant-time access for
retrieving and storing symbol information.
- Hash tables handle collisions through techniques such as chaining or open addressing to ensure
efficiency even with many entries.
Insertion: Adds a new symbol (variable, function) to the table with details such as type and scope.
Lookup: Retrieves symbol information based on name, helping the compiler check for declarations and
type consistency.
Deletion: Removes symbols that go out of scope, managing memory and ensuring that outdated
symbols are no longer accessible.
Nested Scopes:
- Symbol tables must handle multiple layers of scope, such as functions within functions.
- Common approaches include linked lists of symbol tables, where each new scope pushes a new table
onto the stack, and each table points back to its parent scope.
The following sections cover the more advanced topics from Chapter 5.
5.2.2 Graphs
Graphs provide a versatile and abstract way of representing program behavior, especially useful when
representing dependencies and control flow beyond syntax. There are different types of graphs used in
compiler design, and they play specific roles in program analysis and optimization.
1. Control-Flow Graph (CFG)
Definition: A control-flow graph (CFG) is a directed graph where nodes represent basic blocks
(sequences of instructions with a single entry and exit point) and edges represent possible control
transfers between them.
Purpose: CFGs show how control moves from one part of a program to another, which is essential for
analyzing the program’s structure and optimizing it by identifying redundant or unreachable code.
Structure: Nodes correspond to basic blocks, and edges to the possible transfers of control between them.
- CFGs are particularly helpful in constructing efficient code sequences by enabling loop optimization, inlining, and other control-flow modifications.
2. Data-Dependence Graph (DDG)
Definition: A data-dependence graph (DDG) captures data dependencies between operations. Nodes represent operations or instructions, and edges represent dependencies where one instruction requires the result of another.
Purpose: Useful for optimizations that improve data flow efficiency, such as instruction scheduling and
parallelization.
Example: If one instruction computes a value that another instruction needs, an edge is drawn to show
that dependency. This is critical in managing the order of operations and reducing execution stalls.
3. Directed Acyclic Graph (DAG)
Definition: A DAG is an abstract syntax tree (AST) with sharing: identical subtrees are instantiated once and shared among multiple parents.
Purpose: DAGs help in identifying common subexpressions and optimizing repeated computations by
ensuring that identical computations are not duplicated in memory or processing.
Example: In the expression `a * 2 + a * 2 * b`, DAGs represent `a * 2` only once, allowing the result to
be reused across the expression without re-evaluation.
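A sketch of how that sharing can be implemented: a table keyed by (operator, child ids) returns an existing node when the same subexpression is built again, so `a * 2` is created only once. The node encoding is an assumption made for the example.

# nodes[i] is either ('leaf', name, None) or (op, left_id, right_id).
nodes = []
table = {}     # maps a node description to its id, enabling sharing

def make_node(key):
    if key not in table:           # reuse an existing node if we have seen this key before
        table[key] = len(nodes)
        nodes.append(key)
    return table[key]

def build(expr):
    """Build a DAG node for a nested-tuple expression and return its id."""
    if isinstance(expr, str):
        return make_node(("leaf", expr, None))
    op, left, right = expr
    return make_node((op, build(left), build(right)))

# a * 2 + a * 2 * b: the subexpression a * 2 is built once and shared.
root = build(("+", ("*", "a", "2"), ("*", ("*", "a", "2"), "b")))
print(nodes)   # the ('*', ...) node for a * 2 appears only once in the list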
4. Call Graph
Definition: A call graph represents the calling relationships among functions or procedures in a program.
Each node corresponds to a procedure, and each edge corresponds to a call site where one procedure
calls another.
Purpose: Helps in interprocedural analysis and optimization by providing insights into how functions
interact and potentially affect each other.
Challenges:
Ambiguous Call Sites: Some calls, such as those involving procedure-valued parameters or dynamic
dispatch in object-oriented languages, are hard to resolve statically and may require runtime
information.
Linear codes, unlike graphs, are sequences of instructions that mirror the linear execution order of
assembly or machine code. Linear representations are simple but need additional mechanisms for
encoding control flow.
- TAC is a common linear representation that breaks down complex expressions into three-address
statements, where each statement has at most three operands (e.g., `t1 = a + b`).
- TAC is helpful for representing complex expressions in a way that aligns with machine instruction
formats and allows for easy optimization and transformation.
Quadruples: These store each instruction as a record with four fields: operation, two arguments, and
a result location.
Triples: Similar to quadruples but without an explicit field for the result, making storage more
efficient but harder to manage in complex expressions.
Control-Flow in Linear IRs
- In linear representations, control flow (branches, loops) is typically represented by jumps and branch
instructions.
- These instructions demarcate basic blocks in the code, making it easier for the compiler to identify
and manipulate sections of code.
Efficient naming of temporary values generated during compilation is essential for optimizing memory
and register usage.
Naming Schemes: Compiler-generated names (like `t1`, `t2`) are often assigned to intermediate results.
These schemes influence the compiler’s ability to optimize code.
Register Allocation: Proper naming allows temporary values to be efficiently mapped to CPU registers,
reducing the need for memory access.
Example: If two intermediate expressions calculate the same result, they can share a single temporary
name, reducing redundant calculations.
Static single-assignment (SSA) form is a powerful intermediate representation that simplifies data-flow analysis by ensuring that each variable is assigned exactly once.
Structure:
- Each variable has a unique version for every assignment (e.g., `a1`, `a2` for two assignments to `a`).
- The SSA form introduces **Φ (phi) functions** to merge variable values in different branches of
code, preserving consistency across control flow paths.
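For example, an assignment to `a` on two branches might be rewritten along these lines (illustrative pseudocode):
if (flag) a1 = 1
else a2 = 2
a3 = Φ(a1, a2)
b = a3 + 5
Here `a3` takes `a1`'s value if control arrived from the first branch and `a2`'s value otherwise, so every later use of `a` refers to exactly one well-defined name.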
Advantages:
- **Simplifies Optimization**: SSA makes it easier to identify and eliminate redundant computations,
as each variable’s value is clear and traceable.
Improves Register Allocation: By knowing the exact life of each variable, SSA helps the compiler
allocate registers more efficiently, reducing memory usage.
Symbol tables are crucial for managing variable and function names during the compilation process.
They store data such as scope, data type, and memory location for each identifier.
Hierarchical Organization: Symbol tables may be organized hierarchically to represent nested scopes, allowing quick lookup and scope management.
Insertion: Adds new identifiers (e.g., variables, functions) to the table when they are declared.
Deletion: Removes identifiers from the table when they go out of scope, freeing memory.