
Computer Science 382

Lecture Notes

Computer Architecture
and Organization

Shudong Hao

September 4, 2024
Computer Science 382

Disclaimer
The lecture notes have not been subjected to the usual scrutiny reserved for formal publications. They may not
be distributed outside this class without the permission of the Instructor.

Colophon
This document was typeset with the help of KOMA-Script and LaTeX using the kaobook class.
The source code of this book is available at:
https://github.com/fmarotta/kaobook

Edition History
1st edition: August 2022.
2nd edition: August 2023.
Contents
Contents iii

1 Fundamentals 1
1.1 Number Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.3 Binaries and Decimals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3.1 Unsigned Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3.2 Signed Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Binaries and Hexadecimals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.5 Binary Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5.1 Fixed Width Binary Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5.2 Shifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.5.3 Bit-wise Logical Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Basic Components of Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Central Processing Unit (CPU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1.1 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1.2 Arithmetic Logic Unit (ALU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Random Access Memory (RAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Peripheral Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Computer Abstractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 A Peek Into Memory with C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Quick Start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1.1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1.2 Binary Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1.3 Formatted I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1.4 goto Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2.1 Reference and Dereference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2.2 Pointers and Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.2.3 Endianness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2.4 Arrays and Pointer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.2.5 Null-Terminated Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Quick Check Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 Instruction Set Architecture 25


2.1 Instruction Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 Accessing Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.1 Load . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.2 Store . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Moving Constants & Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Data Processing Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.1 Arithmetic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4.2 Logic Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5 Flow Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.5.1 Program Counter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5.2 Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.2.1 Unconditional Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.5.2.2 Conditional Branching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5.2.3 Comparison with Zero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.2.4 Condition Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.2.5 More Conditional Branch Instructions . . . . . . . . . . . . . . . . . . . . . . . 40
2.5.3 Loops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.5.4 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.4.1 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.4.2 Larger Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.6 Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.6.1 Runtime Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.6.2 Procedure Call Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.6.2.1 Return Address . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.6.2.2 Passing Arguments and Return Values . . . . . . . . . . . . . . . . . . . . . . . 49
2.6.2.3 Creating Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.6.2.4 Leaf and Non-Leaf Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.6.2.5 Resolving Register Usage Conflicts . . . . . . . . . . . . . . . . . . . . . . . . . 56
2.6.2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6.3 Recursive Procedures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.6.4 Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
2.7 Quick Check Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3 Microprocessor Design 67
3.1 Fundamentals of Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.1.1 Logic Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.1.2 Combinational Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.2.1 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.2.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.2.3 Arithmetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.3 Sequential Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.1.3.1 SR Latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.3.2 D Latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.3.3 Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.1.4 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.1.5 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 From Assembly to Machine Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.1 Arithmetic/Logic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.1.1 With Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.1.2 With Immediates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.2 Memory Accessing Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.3 Branching Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3 A Single-Cycle Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.2 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.3 Stages of an Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.3.1 Stage 1: Instruction Fetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.3.2 Stage 2: Instruction Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.3.3 Stage 3: Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.3.4 Stage 4: Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.3.3.5 Stage 5: Writing Back . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.4 A Pipelined Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4.1 Operating a Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4.1.1 Pipeline Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.4.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.2 From Single-Cycle to Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4.3 Adding Pipeline Registers to Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.5 Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.1 Data Hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.1.1 Stalling the Pipeline Manually . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.5.1.2 Stalling the Pipeline Automatically . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.5.1.3 Forwarding Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5.2 Control Hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.5.2.1 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5.2.2 Static Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5.2.3 Dynamic Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.7 Quick Check Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

4 Memory System 119


4.1 Memory Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
4.1.1 Locality of Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.1.2 Caching and Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.2 Cache Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
4.2.1 Memory Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.2.2 Adding Cache to the Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.2.3 Cache Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.2.3.1 Direct-Mapped Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
4.2.3.2 A Real-World Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.2.3.3 Associative Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.2.3.4 Write Transactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.2.4 Multi-Level Caches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
4.2.5 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.2.6 Writing Cache-Friendly Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.3 Virtual Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.3.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
4.3.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
4.3.3 Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.3.3.1 Translation with Page Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.3.3.2 Accelerating Translation with TLB . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.3.4 From Virtual Address to Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.4 Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.5 Quick Check Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

Appendix 153
A C Language in Action 155

B ARMv8 Assembly in Action 157


B.1 Program Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
B.1.1 Our First Assembly Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
B.1.2 Segments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
B.1.3 Declaring Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
B.1.3.1 Loading Labeled Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
B.1.3.2 Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
B.1.3.3 Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
B.1.3.4 Alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
B.1.3.5 Repetitive Initialization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
B.1.3.6 Reserving Empty Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
B.2 Linking and Executing Assembly Programs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
B.2.1 General Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
B.2.2 Listing Files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
B.2.3 Using External Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
B.2.3.1 Examples of Using printf() . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
B.2.3.2 Pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
B.2.3.3 Linking C Library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
B.3 Debugging using gdb . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
B.3.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
B.3.2 Start Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B.3.3 Debugging Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B.3.3.1 Breakpoints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
B.3.3.2 Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B.3.3.3 Panel Focus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B.3.4 Printing Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
B.3.5 Inspecting Condition Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
B.3.6 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.3.6.1 Nothing Showed Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.3.6.2 Breakpoint Not Set Exactly at a Label . . . . . . . . . . . . . . . . . . . . . . . . 174
B.3.6.3 No Such File or Directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
B.3.6.4 Cannot Link -lc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
B.4 Floating Point Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.4.1 Basic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.4.1.1 Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.4.1.2 Moving Real Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.4.1.3 Converting Precisions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
B.4.2 Printing Using printf() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
B.4.3 Debugging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

Alphabetical Index 181

ARM Assembly Directives & Instructions 183


List of Figures

1.1 How a discrete value can be extracted from continuous signals. In this example, we can read the binary
number 010 based on changes of the current on a wire. . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Zero extension. In each example, we extend an unsigned binary number of four bits to one byte. . 2
1.3 Sign extension, which copies the MSB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 For non-negative numbers, signed and unsigned representations are identical. When the MSB is treated
as a sign bit, the same binary pattern will be mapped to a positive number in the unsigned range. . . . . 4
1.5 Truncating the MSB of a (d + 1)-bit binary number may result in positive or negative overflow. There is
no effect on numbers in the range [−2^(d−1), 2^(d−1) − 1]. . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 We apply a logical left shift of three bits to the binary number 10001100b (top). If the size of
the binary is limited (bottom), the MSBs are discarded. . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 We apply a logical right shift of three bits to the binary number 10001100b (top). If the size of
the binary is limited (bottom), the LSBs are discarded. . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 In an arithmetic right shift, we pad the MSBs with copies of the original number's MSB. We only show
the versions where the binary size is limited. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.9 A typical von Neumann model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.10 Visualization of RAM. Notice that each byte (8 bits) has a unique address. The addresses range from
0x00...0 to 0xFF...F. What's the size of this RAM? . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.11 Abstractions of computer systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.1 Register file of the ARM architecture. X-registers store 64 bits. The lower 32 bits of each X-register
can also be used independently as a W-register. However, the upper 32 bits cannot be
used independently. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 LDR grabs a few bytes (determined by the size of the destination register) starting from the address
calculated by base+simm9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 STR works in the opposite direction of LDR. Notice how the same value of X9 is stored differently on
little- and big-endian machines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Using W and X registers for addition. If W registers are used, the upper 32 bits will be cleared.
The gray boxes are W registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Program flow of the example. Instruction B modifies PC based on its target, which makes the pro-
gram skip some instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 CBNZ will check the register value; if it’s not zero, it’ll branch to the instruction tagged as L1. . . . . 35
2.7 Without unconditional branch B, the program flow would be wrong when variable a is zero. . . . . 37
2.8 A loop is simply a backward branch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.9 LDRB loads one byte into the lowest byte of a W-register. . . . . . . . . . . . . . . . . . . . . . . 44
2.10 Each long integer takes eight bytes, so the starting address of each element is 8*i, where i is the
index of the element. In this figure, LDR copies eight bytes starting from the address
X9+X13 into X12. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.11 Visualization of a virtual memory space for a program. Our assembly code and global variables are
loaded into this space straight out of the executable file. At run time, the heap and stack
grow towards each other as our program calls procedures or allocates space dynamically. For the
stack, the bottom is actually at the high address, while the top is at the low address, so you can think
of it as an upside-down stack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.12 If we treat function/procedure calls as stacking some “blocks”, the process of procedure calls looks
much like pushing blocks onto the stack. At point (2), fun2() has returned to fun1(), and therefore its
block is removed from the top of the stack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 a and b are the two inputs of the and gate, and a&b is the output. The and gate constantly and almost
immediately reflects changes of the input, with a small amount of delay that is negligible. . . 68
3.2 On the left, signal a has been branched into two; on the right, a and b are separate signals without
relations, indicated by a line hop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 The combinational logic for comparing two bits. When the two bits are equal, the output is 1; oth-
erwise it’s 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 The parallel sets of wires on the left are called buses, where each wire transfers one bit of data, and
all the wires transfer data at the same time. To make the graph clearer, we made lines from input b
blue. Each pair of input bits uses the bit equality logic in Figure 3.3 to compare. . . . . . . . . . . . 69
3.5 The combinational logic for selecting one of the inputs as the output. Input s acts as a “switch”,
or “control”. When s == 1 (asserted), input b passes through the multiplexer; when s == 0
(deasserted), input a passes through. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 The inputs a and b, as well as the output, are all 64-bit double words. The same control signal controls
all the bits of an input, which makes the logic choose one of the inputs for every bit, and thus choose
one double word to pass through. We highlighted the wires for input b and its corresponding control
signal wires. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7 Bit adder, where input y is marked as red, x as blue, the carry-in signal cin as black. . . . . . . . . 71
3.8 Full 64-bit adder using the bit adder design from Figure 3.7. From bit 0 to bit 62, each adder’s carry-
out flag cout[i] will be used as carry-in flag for bit i+1’s adder. . . . . . . . . . . . . . . . . . . . 72
3.9 The left shows a combinational logic, whereas the right shows a sequential logic. In the combina-
tional logic, both outputs p and q respond to changes of the input in almost instantly, and thus we are
not able to “store” the output. In the sequential logic, however, one temporary change in the input
in triggers a permanent change in the outputs, making them stay, and thus be “stored”. . . 73
3.10 A simple SR latch and an example of a timing diagram. Time (1) shows the setting state, (2) the resetting
state, and (3) the latched state. A temporary change in either R or S makes the change in the output stay,
and thus be stored. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.11 A D latch has a clock C that controls when the data D is allowed to pass through and cause a change in
the output Q+. When C is 1, the status is called “latching”, where the output Q+ responds to changes of the
input data D. When C is 0, the status is “storing”, and Q+ keeps its value regardless of changes
of the input D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.12 An edge-triggered latch, or flip-flop: when C rises, the trigger T temporarily rises to high
voltage, allowing Q+ to store the value of the input data D at that moment. Afterwards, T drops back down,
and no matter how the input C changes, Q+ stays stable. . . . . . . . . . . . . . . . . . . . . . . . . 75
3.13 A register implemented using edge-triggered latches. One latch can store one bit of data, and all the
64 bits will be updated all together when the clock rises. . . . . . . . . . . . . . . . . . . . . . . . . 76
3.14 A little more detailed register file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.15 Control bus and address bus are unidirectional, while data bus is bidirectional. . . . . . . . . . . . 77
3.16 Encodings of arithmetic and logic instructions with register operands. . . . . . . . . . . . . . . . . 79
3.17 Encodings of arithmetic and logic instructions with register operands and immediates. . . . . . . . 80
3.18 Encodings of memory accessing instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.19 Encodings of branching instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.21 A clock controls the sequential logic, so each clock cycle makes one pass of the data. . . . . . . . . 83
3.20 A single-cycle implementation of datapath. Black lines are data signals, while blue lines are control
signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.22 Stage 1: instruction fetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.23 Stage 2: decoding. Fields in an instruction are sent to different parts of the register file. The opcode
is sent to a control unit that generates control signals. . . . . . . . . . . . . . . . . . . . . . . . . 85
3.24 Stage 3 — execution — consists of two tasks: updating PC for branching instructions, and computing
through ALU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.25 Stage 4: memory access. Memory has two control signals, MemWrite and MemRead. . . . . . . . . 89
3.26 Stage 5: writing back. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.27 A summary of five stages each instruction goes through in the datapath, with description language
on the side. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.28 An example of combinational logic, where the input is three-bit, and the output D is one bit. When
the clock rises, the output D is written to the register. . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.29 A sequence of three inputs, 110, 010, 110, with an unpipelined version. . . . . . . . . . . . 93
3.30 A three-way pipeline structure where each stage runs one input. . . . . . . . . . . . . . . . . . . . . 93
3.31 We added two registers between the three stages as “barriers”, to make sure the signals in each stage
will not be interrupted or overwritten. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.32 When we separate the combinational logic into three stages, the clock cycle can be shortened to only
execute one stage. At the peak of the system, between times 2 and 3, all logic gates are working on
different instructions, which greatly improves throughput and reduces resource waste. . . . . 94
3.33 It is very challenging to separate combinational circuits into stages with equal latency. Thus, the
clock cycle needs to be long enough to cover the slowest stage, which limits the throughput of the
entire system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.34 The horizontal axis is a time line. At the top we run one instruction through all five stages at a time, so
it takes much longer to complete all three instructions. At the bottom, we run multiple instructions
at the same time, which resembles a pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.35 A detailed datapath with pipeline registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.36 A pipeline diagram showing the progression of instruction executions. . . . . . . . . . . . . . . . . 98
3.37 A sequence that can lead to a data hazard, due to dependencies between instructions. Register X2 in
the first instruction is the destination, but also one of the source operands in the third instruction. At
cycle 4, instruction 3 has already read X2's old value, which hasn't been updated by instruction 1
yet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.38 Because there's no dependency on X2 in the STR instruction, we swap it with ADD to align its ID
stage with SUB's WB stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.39 Inserting NOP instruction allows us to delay instructions that depend on the completion of earlier
instructions. In this example, to make SUB ’s WB and ADD ’s ID stages align, we only need to add
one NOP . Certainly in some cases more NOP s may be needed. . . . . . . . . . . . . . . . . . . . . 101
3.40 As long as there’s a data hazard, we’d postpone the instruction and its following instructions by
stalling them in place, and let NOP s move along the pipeline. Once all data hazards have been
resolved, stalled instructions can restart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.41 Hazard detection unit receives signals from multiple stages (representing multiple instructions), and
stalls instructions currently at the ID and IF stages, and inserts a bubble into the EX stage. . . . . . . . . 104
3.42 The forwarding unit will take signals from ME and WB stages, and overwrite ALU operands. . . . 107
3.43 At cycle 3, the value of X1 has been determined. Assume it is zero, then we need to branch to .L1 .
The two instructions AND and ORR that are already in the pipeline will not continue executing;
instead, we flush them and let NOP s move along the pipeline. . . . . . . . . . . . . . . . . . . . . . 108
3.44 Two-bit predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.1 The time gap between accessing DRAM/SRAM (the memory) and CPU cycle time is getting larger
through the years. Figure borrowed from Computer Systems: A Programmer’s Perspective. . . . . . . 119
4.2 Row-major languages store all elements in each row together. . . . . . . . . . . . . . . . . . . . . . 120
4.3 Column-major languages store all elements in each column together. . . . . . . . . . . . . . . . . . . . 121
4.4 Memory hierarchy. Figure borrowed from Computer Systems: A Programmer’s Perspective. . . . . . . 122
4.5 Traditional bus structure between CPU chip and main memory. . . . . . . . . . . . . . . . . . . . . 123
4.6 A CPU chip with one level of cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.7 An illustration of how cache works in a toy example. . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.8 A cache with 𝑆 sets, 𝐸 lines per set, and 𝐵 bytes stored per line. Note all the lines in the cache have
the same structure as shown in line 0; due to illustration purposes we only show structures of line 0. 126
4.9 For convenience, two boxes are used for storing two cards each. Box x only stores cards starting
with digit x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.10 After ♥00 was requested. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.11 After ♦01 was requested. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.12 Three possible cache organizations with a total capacity of eight bytes, and with lines that can store
two bytes of data each. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.13 Simplified illustration of ARM processor chip with multi-level caches. . . . . . . . . . . . . . . . . 134
4.14 Matrix multiplication on Core i7. A large miss rate leads to more cycles needed to run each inner
loop iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.15 Two programs need 60 bytes of memory space, but the actual memory has only 48 bytes. . . . . . . 139
4.16 Only loading parts of the programs allows them to fit into the small physical memory at the same time. 139
4.17 When other parts of the programs are needed, we just swap them in. . . . . . . . . . . . . . . . . . 140
4.18 To fit multiple processes’ virtual memory space into a small physical memory, the system needs to
split virtual memory spaces into pages, and only load the pages that contain needed data and code
into physical memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.19 Adding memory management unit (MMU) to CPU chip. . . . . . . . . . . . . . . . . . . . . . . . . 142
4.20 TLB as a cache for page table entries on MMU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.21 A translation lookaside buffer (TLB) is simply a SRAM cache that caches page table entries. . . . . 146
4.22 Physical memory retrieves and stores pages from the hard drive, and sends lines to the cache. The
cache then sends words to the registers, where the data are requested by the ALU. . . . . . . . . . 149

B.1 Using .balign exp will align the next data at the address of multiple of exp , with zeros as
padding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
B.2 Interface of running gdb for assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B.3 An example of different memory examining format. . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.4 Converting between double precision real numbers and integers. . . . . . . . . . . . . . . . . . . . 176

List of Tables

1.1 Data size (64-bit machines). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2


1.2 Binary operators in C language. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.3 Commonly used format specifiers in C. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4 Array access methods. The two methods on each row are equivalent. . . . . . . . . . . . . . . . . . 19

2.1 Conditional branch instructions B.cond and condition code checking. . . . . . . . . . . . . . . . . . 40

3.1 ALU operations for each instruction in the instruction set architecture. . . . . . . . . . . . . . . . . 88

4.1 A typical memory hierarchy in detail. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122


4.2 Notations in cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.3 Contents of TLB, page table (first 32 pages), and the cache in the example. All numbers are hexadec-
imal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

B.1 Directives for different C data types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158


Fundamentals 1

1.1 Number Representations . . . . 1
1.2 Basic Components of Microprocessors . . . . 8
1.3 Computer Abstractions . . . . 10
1.4 A Peek Into Memory with C . . . . 11
1.5 Quick Check Solutions . . . . 20

1.1 Number Representations

The first thing we need to understand is that everything on our computers that we are so used to — graphics, documents, windows, etc. — comes down to something physical, i.e., digital signals. Interestingly though, digital signals are very simple: they can only transfer currents. How can we interpret those signals for our use then?

One idea is to take the voltage as numbers. If at a certain moment the voltage is very high, we consider that it's expressing the number 1; if later it drops to a very low level, we interpret it as the number 0.

One problem though — the voltage is not stable due to the materials of the wire, temperature, and so on. It can stay roughly at a certain level with small up-and-downs, so we can use a threshold instead. Figure 1.1 shows this idea.

Figure 1.1: How a discrete value can be extracted from continuous signals. In this example, we can get a binary number 010 based on the current change on a wire.

With this in mind, let's look at number representations.

1.1.1 Notations

The integer number system we're most familiar with is decimal. In our course, we will also work with other systems, mostly binary and hexadecimal.

When writing a decimal number, we usually add a letter D at the end, e.g., 100D , -382D . The letter B will be used as a suffix or prefix for binary numbers: 100B , 100010B , 0b1010 , etc. For hexadecimal, we can add H at the end, such as 1AB5H , but we can also add 0x as a prefix: 0x1AB5 . From now on, even when we don't specify which system a number is written in, you should be able to recognize it correctly based on its suffix or prefix.

1.1.2 Binary Numbers

Binary numbers are also sometimes referred to as "binary patterns" or "binary sequences". Binaries simply consist of the digits 0 and 1. Each digit is also called a bit. When we write a binary number such as 100101001000 , the left-most bit is called the most significant bit (MSB), while the right-most is the least significant bit (LSB). [1]

[1]: Later we'll talk about most/least significant bytes, whose acronyms are also MSB and LSB. When unspecified, whether B is for bit or byte should be clear based on context.

Because machines usually store multiple bits together, there are also some binary-related sizes or units we need to know about.

A byte is always 8 bits, and this is the unit we're going to use most frequently. [2] When we have more bytes, the unit definition can vary among different machines. On a 64-bit machine (which is the most popular these days), a word has 32 bits (or 4 bytes), while a half word has 16 bits (or 2 bytes). 64 bits is a double word, or quadword (for historical reasons). See Table 1.1 for a summary.

[2]: Fun fact: 4 bits is called a nibble. We rarely use that though.

Table 1.1: Data size (64-bit machines).

    Data          # bits
    nibble             4
    byte               8
    half word         16
    word              32
    double word       64
    quadword          64

We can go larger for sure, but when there are too many bytes, we use another system, the IEC prefixing system, which is similar to scientific notation but with powers of 2: [3]

    Kilobyte (KB)  = 2^10 bytes    Megabyte (MB) = 2^20 bytes
    Gigabyte (GB)  = 2^30 bytes    Terabyte (TB) = 2^40 bytes
    Petabyte (PB)  = 2^50 bytes    Exabyte (EB)  = 2^60 bytes
    Zettabyte (ZB) = 2^70 bytes    Yottabyte (YB) = 2^80 bytes

[3]: The definition of large data sizes has varied across different systems and scenarios. Strictly speaking, one kilobyte literally means "a thousand bytes", which is 10^3 = 1,000 bytes. To accommodate binary conventions, the IEC (International Electrotechnical Commission) created a similar standard, making 1,024 bytes one Kibibyte – a shorthand for "Kilo-binary-byte". These different notations have been used non-strictly, though, and for representing the capacity of random access memory, it is typical to use "Kilobyte" to refer to what is actually a "Kibibyte". To avoid confusion, we will follow the table in the textbook. See more information here: https://www.wikiwand.com/en/Orders_of_magnitude_(data).

1.1.3 Binaries and Decimals

Recall that all numbers are represented as binary numbers, which can be directly mapped to the change of currents on a wire. We as humans, however, rarely use binary numbers, and we're more familiar with decimal numbers. Therefore, conversion between binaries and decimals is inevitable.

Before we start, think about how we represent decimal numbers. [4] We can have a positive number such as +382, but also a negative one like −220. In machines, how can we represent the plus and minus signs?

Let's start with unsigned conversion, which is the simplest case, and one you're probably already familiar with.

[4]: We only talk about integers here, not floating points.

1.1.3.1 Unsigned Integers

• Conversion to Decimals.
Given a binary number of d bits, r_{d-1} r_{d-2} ··· r_1 r_0, where each r_i is a single bit 0 or 1, the decimal number is calculated as follows:

    U = r_{d-1} · 2^{d-1} + r_{d-2} · 2^{d-2} + ··· + r_1 · 2^1 + r_0 · 2^0    (1.1)
      = Σ_{i=0}^{d-1} r_i · 2^i.    (1.2)

You can see that all digits of the binary number are used to convert into the decimal number.
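Equation (1.1) translates almost directly into code. Below is a minimal C sketch (the function name bin_to_unsigned is our own, not from the course materials) that evaluates a string of '0'/'1' characters, MSB first:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Evaluate Equation (1.1): each loop step promotes the accumulated
 * weights by one power of 2, then adds the next bit r_i. */
uint64_t bin_to_unsigned(const char *bits) {
    uint64_t u = 0;
    for (size_t i = 0; i < strlen(bits); i++) {
        u = u * 2 + (uint64_t)(bits[i] - '0');
    }
    return u;
}
```

For instance, bin_to_unsigned("100101001000") adds up 2^11 + 2^8 + 2^6 + 2^3, exactly the summation in Equation (1.2).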

• Extension.
Later in this course we'll usually need to extend a binary number to a larger size, e.g., extend an 8-bit number to a 16-bit number while keeping its value the same. For unsigned numbers, we can just do zero extension, because no matter how many zeros we add to the front, in Equation (1.1) they don't carry any weight eventually. See Figure 1.2 for an example.

Figure 1.2: Zero extension. In each example, we extend an unsigned binary number of four bits to one byte: 1001 becomes 0000 1001, and 0111 becomes 0000 0111.
• Data Range.
When it comes to binary, another thing we usually talk about is the range of the number. Let U^d_max and U^d_min denote the maximal and minimal numbers an unsigned integer of d bits can represent; then we have:

    U^d_max = 2^d − 1,    (1.3)
    U^d_min = 0.          (1.4)
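Equation (1.3) can be computed with a single shift, as in this small C sketch (umax is a hypothetical helper name; it assumes 0 < d < 64 so the shift stays in range):

```c
#include <assert.h>
#include <stdint.h>

/* Equation (1.3): the largest unsigned d-bit value is 2^d - 1.
 * Valid for 0 < d < 64. */
uint64_t umax(unsigned d) {
    return (UINT64_C(1) << d) - 1;
}
```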

1.1.3.2 Signed Integers

Signed numbers are not very complicated either. The only difference is we
use the MSB as the sign: 1 for negative, and 0 for positive.

• Conversion to Decimals.
When converting a signed binary into a decimal, we simply make the weight of the MSB negative:

    S = −r_{d-1} · 2^{d-1} + r_{d-2} · 2^{d-2} + ··· + r_1 · 2^1 + r_0 · 2^0    (1.5)
      = −r_{d-1} · 2^{d-1} + Σ_{i=0}^{d-2} r_i · 2^i.    (1.6)

The representation of signed numbers is called two's complement. [5] So... does 1000b represent −0?? Let's keep reading.

[5]: There's also one's complement if you're interested.
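Equations (1.5)–(1.6) can also be checked mechanically. The sketch below (our own helper, not part of the course code) interprets the low d bits of a value as a two's complement number:

```c
#include <assert.h>
#include <stdint.h>

/* Equation (1.6): the MSB carries weight -2^(d-1); the remaining
 * d-1 bits contribute their usual positive weights.
 * Assumes 1 < d < 64. */
int64_t signed_value(uint64_t bits, unsigned d) {
    uint64_t msb  = (bits >> (d - 1)) & 1;
    int64_t  rest = (int64_t)(bits & ((UINT64_C(1) << (d - 1)) - 1));
    return rest - (int64_t)msb * (INT64_C(1) << (d - 1));
}
```

For example, signed_value(0xA, 4) interprets 1010b as −8 + 2 = −6.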

ψ Caution!
A common mistake is to simply take the MSB as a sign and apply Equation (1.1) to the rest of the bits. For example, 1010b is −6 in decimal based on our equation above. However, if you just take the leading 1 as a negative sign, and convert 010b using Equation (1.1), you'll get −2, which is obviously wrong.

• Signed Extension.
Now that we're dealing with signed numbers whose MSB is a sign, when we extend the numbers we cannot simply add zeros to the front. For example, if we want to extend 1010b to a byte with zero extension, we'll get 00001010b, which is +10, apparently a different number.

But what if we just do zero extension and make the highest bit the sign, as in 10001010b? Do the conversion using Equation (1.5) and you'll see that's not right either.

We're getting so close though. 10001010b is −118, and our target number is −6. If we can add +112 to 10001010b , we'll get the correct number.

With a bit of guessing, we notice that in the number 10001010b, the three contiguous zeros take exactly a weight of +112, so why don't we just make them all 1s, as in 11111010b ? This is exactly the signed extension!

Formally, for a signed extension, if the original number's MSB is 1, we just pad all 1s to the front; otherwise all 0s. This way, we compensate for the negative increase of the weight due to the extension. See Figure 1.3 for an example.

Figure 1.3: Signed extension, which copies the MSB: 1001 becomes 1111 1001, and 0111 becomes 0000 0111.
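Here is a minimal C sketch of this rule (the function name is ours; casting int8_t to int16_t in C performs the same extension automatically):

```c
#include <assert.h>
#include <stdint.h>

/* Signed extension of a byte to 16 bits, following Figure 1.3:
 * copy the MSB into all of the newly added high bits. */
uint16_t sign_extend_8_to_16(uint8_t b) {
    uint16_t ext = b;            /* zero extension first          */
    if (b & 0x80)                /* MSB is 1: pad 1s to the front */
        ext |= 0xFF00;
    return ext;                  /* MSB is 0: keep the zero pad   */
}
```

Note that 0x8A is the 10001010b pattern from the text: reinterpreted as a signed 16-bit value after extension, it is still −118.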
• Data Range.
Let S^d_max and S^d_min denote the maximal and minimal numbers a signed d-bit binary can represent, respectively:

    S^d_max = U^{d-1}_max = 2^{d-1} − 1,    (1.7)
    S^d_min = −2^{d-1}.                     (1.8)

Figure 1.4 shows a mapping from signed numbers to unsigned numbers.

Figure 1.4: For non-negative numbers, signed and unsigned have no difference. When considering the MSB as a sign bit, the same binary pattern will be mapped to a positive number in the unsigned range.

In fact, there's a small trick to convert decimals to two's complement quickly. Say we want to know the 4-bit two's complement of decimal −5. First, we get the unsigned binary representation of its absolute value 5, which is 0b0101 . Then flip every bit to get 0b1010 and add 1 to it, which gives us 0b1011 , and that's exactly the two's complement of −5. If it's a positive number, then the two's complement is identical to the unsigned binary. [6]

[6]: Can you think of a justification for why this trick works?

1.1.4 Binaries and Hexadecimals

One nice thing about conversion between hexadecimals and binaries is that they can be mapped directly: every four-bit binary corresponds to a hexadecimal digit:

    1010 → A    1011 → B    1100 → C
    1101 → D    1110 → E    1111 → F

Thus, if you want to convert a binary 000101101010011111010100 to hexadecimal, you just need to chop it into four-bit chunks:

    0001 0110 1010 0111 1101 0100

and convert every four-bit chunk into its mapped hexadecimal digit:

    0001 0110 1010 0111 1101 0100
      ↓    ↓    ↓    ↓    ↓    ↓
      1    6    A    7    D    4

Then we get the hexadecimal number 0x16A7D4.


Given a signed hexadecimal number, if the leading digit is greater than 7,
we say the number has to be negative. Can you explain why?
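Both the flip-and-add-one trick and the leading-digit observation are easy to verify in C. A small sketch with our own helper names, assuming 4-bit and 8-bit widths respectively:

```c
#include <assert.h>
#include <stdint.h>

/* The flip-and-add-one trick: two's complement of a 4-bit value.
 * Masking with 0xF keeps the result within four bits. */
uint8_t twos_complement_4bit(uint8_t x) {
    return (uint8_t)((~x + 1u) & 0xFu);
}

/* A signed two-digit hex byte is negative iff its MSB is 1,
 * i.e., iff its leading hex digit is greater than 7. */
int is_negative_hex8(uint8_t x) {
    return (x >> 4) > 0x7;
}
```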

˛ Quick Check 1.1


1. For each of the following statements, state if it’s true or false,
and explain why:
a) Depending on the context, the same sequence of bits
may represent different things;
b) If you interpret an N-bit two's complement number as an unsigned number, negative numbers would be smaller than positive numbers.
2. For the following questions, assume an 8-bit integer and an-
swer each one for the case of an unsigned number and two’s
complement number. Indicate if it cannot be answered with
a specific representation:
a) What is the largest integer? What is the result of adding
one to that number?
b) How would you represent the numbers 0, 1, and -1?
c) How would you represent 17 and -17?
3. What is the least number of bits needed to represent the fol-
lowing ranges using any number representation scheme:
a) 0 to 256;
b) -7 to 56;
c) 64 to 127 and -64 to -127;
d) Address every byte of a 12 TB chunk of memory.

B See solution on page 20.

1.1.5 Binary Operations

Calculations between binary numbers are bit-wise operations, and they're not that different from decimal calculations. In computer systems, however, each binary number has a fixed width, e.g., 64 bits, so the result might differ from what we'd expect mathematically, or not make sense at all.

1.1.5.1 Fixed Width Binary Arithmetic

Binary arithmetic operations such as addition, subtraction, multiplication, and division follow the same rules as for decimal numbers. We only look at addition here, but the situations we discuss apply to the other three operations as well.

Let's look at an example first, where we restrict both operands and the result to be only one byte. Assume operand a is 0b1111 1111 , while b is 0b0000 0001 . When we perform a + b by hand, we know we need to carry one bit of 1 to the front, so we have a + b = 0b1 0000 0000 . However, remember we want to restrict the result to only one byte, so we can only save the lowest eight bits, which makes a + b = 0b0000 0000 :

    1111 1111
  + 0000 0001
  -----------
  1 0000 0000   (the carry-out 1 is discarded, leaving 0000 0000)

In this case, when the result has a carry (or a borrow in subtraction), we say a carry happened. When we interpret the operands as unsigned integers, we see this results in 255 + 1 == 0 .
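We can reproduce this one-byte arithmetic in C using fixed-width types (a small sketch; add8 is our own name):

```c
#include <assert.h>
#include <stdint.h>

/* One-byte addition: casting back to uint8_t keeps only the lowest
 * eight bits, so the carry out of the byte is discarded. */
uint8_t add8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}
```

Interpreted as unsigned, add8(255, 1) is 0; the bit pattern from add8(127, 127), reinterpreted as signed, reads as −2, which is the overflow case discussed next.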
Now, what if we interpret the binary numbers as signed integers? Let's look at the following example:

    0111 1111
  + 0111 1111
  -----------
  0 1111 1110

This time, we add two 0b0111 1111 together, and get 0b1111 1110 . If we interpret them as signed numbers, this is basically 127+127 == -2 . Now this doesn't make sense mathematically — how come adding two positive numbers results in a negative number? In this case, we say there's an overflow.

Assume the lengths of the operands are d bits; when we perform addition on them, we can end up with a (d+1)-bit number, so we have to get rid of the MSB. As shown in Figure 1.5, if the result is between −2^{d−1} and 2^{d−1} − 1 (inclusive), truncating the MSB doesn't affect our result. However, if it's in [−2^d, −2^{d−1} − 1], truncating the MSB will take us into the positive range. This is called negative overflow. Similarly, when truncating a number in the range of [2^{d−1}, 2^d − 1], we end up having a positive overflow. Note that overflow is meaningless when we interpret the numbers as unsigned.

Figure 1.5: Truncating the MSB of a (d+1)-bit binary will result in positive or negative overflow. There's no effect on any number between [−2^{d−1}, 2^{d−1} − 1].

˛ Quick Check 1.2
Is the following statement true or false, and why? It is possible to get an overflow error when adding two signed numbers of opposite signs.
B See solution on page 21.

1.1.5.2 Shifting

As the name suggests, shifting is simply to shift every bit several digits
over. There are two directions: left and right, meaning we can shift the
number to the left or the right.

• Logical shift left:
In logical shift left, we move every bit of the original number to the left, and patch zeros at the end. Figure 1.6 (top) shows an example, where we shift left by three bits. One interesting fact about shifting left is, if we shift a binary number b left by n bits, the result is equivalent to b ⋅ 2^n . You can verify this: shifting left by 2 bits on binary 100b gives 10000b , which is 100b ⋅ 2^2 . This is an important property actually. Multiplication is a complicated operation, hardware-wise, and is very slow. However, shifting is fast and simple. Later in this course you'll see lots of places where we use shifting instead of multiplication.

Figure 1.6: We apply a three-bit logical shift left to the binary number 10001100b (top), giving 10001100000b. If the size of the binary is limited (bottom), the MSBs are discarded, leaving 01100000b.

When we restrict the length of the binary result, as in Figure 1.6 (bottom), we will discard the most significant bits that fall outside the binary size.
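The shift-as-multiplication property is easy to confirm in C (a tiny sketch; the helper name is ours, and it assumes no significant bits are shifted off the top):

```c
#include <assert.h>
#include <stdint.h>

/* Shifting left by n multiplies by 2^n, as long as the result
 * still fits in 64 bits. */
uint64_t mul_pow2(uint64_t b, unsigned n) {
    return b << n;
}
```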
• Logical shift right:
Shifting right needs more care, since the MSB can be a sign bit. If we apply the same rule as in logical shift left, we have logical shift right, where we shift every bit to the right, discard the least significant bits, and pad zeros to the beginning. It's simply the opposite of logical shift left. A similar example is shown in Figure 1.7.

Figure 1.7: We apply a three-bit logical shift right to the binary number 10001100b (top). If the size of the binary is limited (bottom), the LSBs are discarded, leaving 00010001b.

Even though the bit-wise operations of logical shift left and right are opposites, shifting right cannot be regarded as dividing by two. This can be easily verified from the examples shown in Figure 1.7. The most obvious thing to notice is that, if the number is a signed integer, shifting right might even change the sign of the number. This is true when the MSB of the original binary number is 1. We call this "logical" because it simply operates based on how we move the bits around, without thinking about the values of the numbers.

To make shifting right similar to division, we'd need the arithmetic shift right operation.

• Arithmetic shift right:
Arithmetic shifting is similar to logical, except that we pad with copies of the original number's MSB. An example is shown in Figure 1.8. If the original number's MSB is 1, we pad 1s to the front, to preserve the negative sign of the number. If it's 0, then it's the same as a logical shift.

Figure 1.8: In arithmetic shift right, we pad with copies of the original number's MSB. We only show the versions where the binary size is limited: 10001100b becomes 11110001b, while 00011100b becomes 00000011b.

˛ Quick Check 1.3
Assume we have a binary number 1100 0101 . How can we apply shifting operations on this number so we can get 0000 0101 ? That is, only preserve the lowest four bits and zero out the rest.
B See solution on page 21.

1.1.5.3 Bit-wise Logical Operation
Another common type of binary operation is the logical operation, such as and , or , and xor (exclusive-or). For this kind of operation, we just need to align the two binary operands and apply the operation bit-wise on every bit. Same as what you learned in Discrete Structures, 1 and 1 is 1 ; 1 and 0 is 0 , etc.

One interesting application of logical operations that we’d like to mention


is masking. In some situations we’d like to only extract several bits of a
number. For example, we have a binary number 11001010 , and we want
to extract the 6th bit and zero out all other bits, so that we have 01000000 .
We can certainly use masking to achieve this.

Notice both 0 and 0 and 0 and 1 result in 0 , so we can create a bi-


nary number where only the bit we want to extract is 1 and all the others
are zero. Then we perform bit-wise and operation between the original
number and this new number we created.
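The masking idea can be written as a one-line C helper (our own function name, for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Extract bit `pos` of `x` and zero out all other bits: AND with a
 * mask that has a single 1 at position `pos`. */
uint8_t extract_bit(uint8_t x, unsigned pos) {
    uint8_t mask = (uint8_t)(1u << pos);
    return (uint8_t)(x & mask);
}
```

For example, extract_bit(0xCA, 6) keeps only the 6th bit of 11001010b, yielding 01000000b as in the example above.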

˛ Quick Check 1.4


Assume we have a binary number 1100 0101 . How can we ap-
ply only bit-wise logical operations on this number so we can get
0000 0101 ? That is, only preserve the lowest four bits and zero
out the rest.
B See solution on page 21.

1.2 Basic Components of Microprocessors

Most of our modern computers are built on the simple model from John
von Neumann, called the von Neumann model or Princeton architecture.
Figure 1.9 shows us a visualization of this model. For our course, we fo-
cus on Central Processing Unit (Chapter 3) and Random Access Memory
(Chapter 4) most of the time. I/O devices will be mentioned, but not dis-
cussed in depth.

1.2.1 Central Processing Unit (CPU)

As its name suggests, CPU is the core of all computers, since it’s where
the actual calculation happens. Everything we see eventually needs to go
through CPU to calculate results. Roughly speaking, the two most impor-
tant components of CPU are registers and arithmetic logic unit (ALU).

1.2.1.1 Registers

Consider a simple expression 2 + 3 : the numbers 2 and 3 are the operands,


while the plus sign is the operator. In CPU, the operands need to be stored
and ready in registers, so that they can be sent to ALU for calculation.

Most machines have a set of registers for us to use, and this set is
usually called a register file. You can think of registers as just small
storage units, each holding one binary sequence.

In our course, we’ll be using ARMv8 architecture, where we have 32 gen-


eral purpose registers, and each of them can store 64 bits. We call it general
purpose, because they can be used by us to store any binary sequence we

Figure 1.9: A typical von Neumann model: the CPU (register file, ALU,
control unit, clock), RAM, and I/O devices, connected by the data,
address, and control buses.

want. There are also special registers whose stored values we cannot modify,
and we’ll introduce them later.

1.2.1.2 Arithmetic Logic Unit (ALU)

Once the operands are ready in the registers, they’ll be sent to the ALU for
the real calculation. The ALU receives two operands, and does simple calcu-
lations as specified. In fact, the ALU is very simple: the only calculations it
can do are arithmetic (addition and subtraction) and bit-wise logic ( and ,
or , etc). Later in this course, we will understand how this simple
elementary-school math in the ALU leads to something so magical and powerful.

1.2.2 Random Access Memory (RAM)

Most of our computers are stored-program computers, meaning our pro-


grams are stored as data inside the computer. Typically, a random access
memory, or RAM, is connected to CPU to store them.

 Data Storage.
You can think of the RAM as a large array or list of bytes. It’s byte-
addressed, meaning each byte in the memory has its own address.
The addresses typically range from 0 to a very large number, depend-
ing on the size of the RAM. For example, with a 4GB RAM, we can
store 4 ⋅ 2^30 bytes, and therefore the address range will be [0, 2^32 − 1].
Of course, 4GB is much smaller than what a modern computer
actually needs. Later in this course, we will learn a very ingenious
idea about virtual memory and physical memory. For now, their
difference doesn’t really matter.

Figure 1.10 shows us a visualization of memory storage, where you


can see every eight bits, i.e., every byte, has a unique address. As you
can see, we usually use hexadecimals to represent addresses. Now
with this address, we can get any number we like, or store a number
to any address we like.

 Data Transfer.
Note here RAM is not inside CPU; thus, if we want to do some cal-
culations using CPU on data from RAM, we need to move the data
from RAM to registers first. With Figure 1.9, you probably have
guessed that CPU sends an address to RAM through address bus,
and then RAM will send the data stored at that address back to CPU
through data bus.

Buses are just a collection of wires used to transfer data between
the CPU and other devices, where each wire transfers a single bit of
zero or one. Remember in Figure 1.1 we said we can extract a discrete
value from the continuous change of signals on a wire. That is true of
course, but what if we want to transfer one byte? It’s a waste of time
to use just one wire, because we’d have to wait for 8 seconds (assum-
ing each second transfers one bit) to get the number. What makes
sense is to group 8 wires together, and transfer the 8 bits on them all
at once, one bit per wire. This group of 8 wires is called a bus.

Figure 1.10: Visualization of RAM. Notice each byte (8 bits) has a
unique address. The addresses range from 0x00...0 to 0xFF...F.
What’s the size of this RAM?

The width of the buses can vary based on their needs. For 32-bit
machines, an address bus has a width of 32, and thus it can address
at most 4GB of memory. On 64-bit machines, the address bus has a
width of 64, which makes the addressable space 2^64 bytes, or 16EB.
The width of the control bus depends on how many signals the control
unit needs to send to different parts of the machine. We’ll learn about
this in Chapter 3.

1.2.3 Peripheral Devices

CPU and RAM are the main focus of our course, but for a real computer
that’s obviously far from complete. First of all, remember the storage of
memory is very limited, and therefore we can’t store all our data/programs
in it. In fact, only when we start executing a program will it be loaded
into RAM. Also, we need to connect monitors, keyboards, etc. We call these
peripheral devices, and from Figure 1.9, we see that they are mostly I/O
devices, including hard drives.

˛ Quick Check 1.5


Now let’s do a simple quick check to make sure you understand
everything so far.
True or False?

1. All data stored in the CPU has a unique address;


2. Random access memory stores all the files on one’s com-
puter;
3. Random access memory is byte-addressed;
4. If we want to do a calculation such as 24−13, we store the two
operands into RAM, and transfer them to ALU to perform the
calculation. Once ALU computes the result, it’ll be directly
transferred back to RAM;
5. In a program written in Java, for example, we declare inte-
gers, characters, floating points, etc., for variables. RAM also
stores them in the forms of integers, characters, and floating
points.

B See solution on page 21.

1.3 Computer Abstractions

With the lowest-level organization introduced, let’s go a bit higher.


One of the greatest ideas in computer science is abstraction. It’s not easy to
go from circuits to the fancy stuff we see on our computers, and therefore,
from the bottom (hardware), computer scientists built one layer at a time,
which eventually transforms digital signals into everything we’re familiar
with today.

In computer science, abstraction means the underlying details are removed
or ignored, so that it provides a convenient interface for upper-level appli-
cations. For example, when you write a Java program and want to print
something out, you’d use the function System.out.println() . Now, you
think that’s straightforward, right? But remember, when you want to print
something to the screen, which is physical, something has to manage how the
actual hardware brings this string to the screen. However, as a programmer,
did you really care about that? No, the only thing you did was call the
println() function, and this is because this function is exactly an
abstraction of the job you want to do.
Figure 1.11 shows us the layers between the actual hardware (digital logic)
and our level (high-level language). You can see this abstraction greatly re-
duces our work and expedites the development of computer science. Now
if we run a C or Java program, we really don’t have to think about how
different wires or buses transfer 0s or 1s; all we need to do is focus on
the task itself.

Figure 1.11: Abstractions of computer systems.
  High-Level Language ............ Not covered
  Assembly Language .............. Chapter 2 & Appendix
  Operating System ............... Not covered
  Instruction Set Architecture ... Chapter 2
  Microarchitecture .............. Chapters 3, 4, 5
  Digital Logic .................. Chapter 3

Our course is on the level of microarchitecture, so not too low, not too
high. Each layer in the abstraction is, however, closely coupled with its
neighbors, so it’s impossible to look at one layer entirely in isolation.
Therefore, we’ll have to look at its neighbors as well. In Chapter 2, we
start with instruction set architecture and focus on assembly language to
get a hang of how CPUs actually execute programs. Then in Chapter 3,
we’ll first briefly talk about digital logic, and then start building our
microprocessor with different components.

1.4 A Peek Into Memory with C

To prepare you for the madness later in this course, let’s start from the
layer you’re most familiar with, which is the high-level language. Not all
high-level languages are created equal, though. Our purpose is not to
learn a new language; instead we want to use a language that’s high-level
enough for understanding but also closely connected to low-level hard-
ware design as an entry point. C clearly wins in this respect.

1.4.1 Quick Start

C programs start with main() function, which is the entry point of a C


program. The simplest C program looks like this:

1 int main() {
2 return 0;
3 }

Variable and function declarations are very similar to Java, so we will skip
that part. One difference is that C is not an object-oriented language, so
functions and variables don’t have access modifiers such as public ,
private , etc.

1.4.1.1 Data Types

In C, we have several basic data types: char , short int , int , long int ,
float , and double . The difference among the three types of ints is
they have different numbers of bits to represent an integer, so they will
have different ranges of representation. Same for float and double ,
though the result is they have different precision.

In fact, char types are also integers, but can only store eight bits (see
Section 1.4.2.2). The integers they store represent a character’s ASCII code,
so the following two declarations are equivalent:

1 char c = 'A';
2 char c = 65;

because character 'A' has an ASCII code of 65.

Notice here we don’t have string types. Since a string is basically an array
of characters, we can declare it using three different methods:

1 char str[] = "Hello!";


2 char* str2 = "Hello!";
3 char str3[] = {'H','e','l','l','o','!',0};

The difference between the first two is the place where the string is stored,
and we can just ignore it now. Notice for the last method, we add a 0 at
the end. This is a null terminator, marking the end of the string. If we use
the first two methods, a null terminator will be automatically attached, so
we do not need to add it manually. The null terminator can be declared
using integer value of 0, or a character '\0' .

Integer data types (including char ) by default are signed. You can add
modifier unsigned to make them unsigned. For example,

1 int var1 = -253;


2 unsigned int var2 = -253;
3 printf("var1 = %d, var2 = %u\n", var1, var2);

We’ll introduce printf() in the next section, but based on the output,
you can see they have different values, even if we declare them to be the
same. Actually you can verify that, if an integer takes four bytes, −253 is
indeed 4294967043 when interpreted as an unsigned integer.

1.4.1.2 Binary Operations

C language can deal with binary numbers as well. We show the operators
and their descriptions in Table 1.2.
Table 1.2: Binary operators in C language.

  Operator   Description
  <<         Logical shift left
  >>         Arithmetic shift right
  &          Bit-wise and
  |          Bit-wise or
  ^          Bit-wise xor
  ~          Bit-wise negation

Notice that, different from C++, << and >> are not used as
stream operators in C, so we cannot print something to the terminal using
them. They are simply used on numbers to manipulate bits. Also, &
and | are different from the logical operators && and || . Please be careful
when you use them.

1 char a = 10; /* a = 0000 1010b */


2 char b = 20; /* b = 0001 0100b */
3 char c = a & b; /* c = 0000 0000b */
4 char d = a << 3; /* d = 0101 0000b */

where left operand of << is the original number, and right operand is the
number of bits we want to shift.

˛ Quick Check 1.6


Assume a = 0b1000 1011, b = 0b0011 0101, and c = 0b1111
0000. Calculate the results of the following C statements:

1. a & b;
2. a | c;
3. a | 0;
4. a | (b >> 5);
5. a & (b >> 5);
6. ~((b | c) & a);

B See solution on page 22.

1.4.1.3 Formatted I/O

A very common thing to do is to print something to the terminal. To use


I/O utilities in your C code, you should include the header file as the first
line:

1 #include <stdio.h>

If it’s just a string, such as those declared in the code listing above, we can
use puts() function:

1 puts(str);

If we want to print, say, a variable, we’d need to use formatted output,
printf() . The first argument of printf() is a string that represents
what the output should look like. For every variable, we’d need to put
a “placeholder” there, called a format specifier. Then the following argu-
ments would be the variables we want to print. For example:

1 printf("Variable a = %d\n", a);

where we assume there’s a variable a whose value is 10. Then the output
will be like: Variable a = 10 , with a new line. You can see that %d
is replaced by the value of a when printed out. Different types require
different format specifiers. The common ones are shown in Table 1.3.

Table 1.3: Commonly used format specifiers in C.

  %d    Decimal signed integers
  %u    Decimal unsigned integers
  %f    Float numbers
  %lf   Doubles
  %c    Characters
  %s    Null-terminated strings
  %p    Pointers

In C, we cannot print an entire array out with one single printf() ; in-
stead, we need to loop over the array and print each element individu-
ally:

1 double arr[5] = {3.1, 4.223, 5.152, 10.01 , 2};
2 for (int i = 0; i < 5; i ++) {
3     printf("The %d-th element is %lf\n", i, arr[i]);
4 }

However, the only exception is strings. If we have declared a string, we


can just print it out using one printf() with %s:

1 char str[] = "Hello!";


2 printf("The string is %s\n", str);

1.4.1.4 goto Statement

We’re fairly familiar with typical control structures such as loops and
if-else , so we’ll skip them here, but bring up a unique and “infamous”
keyword in C, called goto . We say it’s infamous, because it has been
criticized for so long.

What it does is very simple: you mark a line of your C code with a label,
and you can use goto to change your program to execute the line with
the label. For example:

1 #include <stdio.h>
2 int main() {
3 int a = 10;
4 a = 20;
5 goto L1;
6 a = 40;
7 L1: printf("a = %d\n", a);
8 return 0;
9 }

In this example, we mark the line with printf() using a label L1 . After
we assigned 20 to a , we used a goto statement to jump to L1 . As
you see, in the print out string, value of a is 20, instead of 40, because
the statements between goto and the destination label L1 were simply
skipped.

Except that you cannot label a statement that declares a variable,
there’s no limit on where you can put a label. You can certainly put a label
before the goto statement like this:

1 int a = 10;
2 L1: a = 20;
3 goto L1;
4 a = 40;
5 printf("a = %d\n", a);
6 return 0;

But what will happen? You’re stuck in an infinite loop! You can even do
this:

1 L1: goto L1;

But as you can see, the goto statement causes problems: the program logic
and algorithm become very unclear, the control is not structured, and it
makes the program difficult to debug and understand. That’s why it’s
been so widely criticized, and almost everyone advises against using
it. Please do not use goto in any of your assignments/labs/exams in this
course unless specifically asked.

So why do we introduce it here? Later when we learn assembly language,


which is unstructured, this type of jumping between statements using la-
bels is the only way we can control the program flow. In other words,
everything in assembly is “ goto ”, so it’d be better to know this from a
language we’re familiar with.

1.4.2 Pointers

Every variable, and even our code, is stored in memory when the pro-
gram executes. If it’s in memory, apparently it’ll be stored as binary numbers,
and every byte will have an address (see Figure 1.10). In C, we can
explicitly use those addresses, which is a major difference from the
languages you have learned so far. The addresses in C are called point-
ers. Pointers “point” to a variable in our program, and have a type
closely related to the type of the variable they point to.

1.4.2.1 Reference and Dereference

Look at the following program:



1 int main() {
2 int var = 10;
3 int* ptr_var = &var;
4 return 0;
5 }

In this example, we declared an integer var , and it’ll be stored some-


where in the memory when we’re running this program. What if we want
to know where it’s stored? To get a variable’s address, we use & , the
address-of operator, in front of it. To store this address into a variable, we
first need to consider the type of var . Since it’s an integer, we’ll declare
a pointer of integer, int* , to store the address of var .

Given a pointer, if we want to get the value stored at the address, we can
dereference the pointer by using * operator:

1 int main() {
2 int var = 10;
3 int* ptr_var = &var;
4 int deref = *ptr_var;
5 return 0;
6 }

When running this program, on line 3, we’ll have ptr_var to store the
address of var , say 0xffff1000 . Then on line 4, we use * to derefer-
ence the address, which gives us the value stored there, and assign it back
to variable deref . Thus, the value of deref is 10.

Everything in our program will be stored in the memory when executing,


meaning pointers themselves will also be stored just like any other vari-
ables in the memory. If they are also in the memory, that means they also
have their own addresses. So how do we get the addresses of pointers, or
addresses of addresses?

Following the example above, and based on our current discussion, it’s
not difficult to derive the double pointer:

1 int** pp_var = &ptr_var;

1.4.2.2 Pointers and Memory

You probably already know it, but different variable types take different
numbers of bytes in memory: 7

char 1 byte      short int 2 bytes    int 4 bytes
float 4 bytes    long int 8 bytes     double 8 bytes
Any type of pointer: 8 bytes

7: Be careful: int has only 4 bytes, but long int has 8!

When we store a variable, its bytes will be stored in contiguous bytes in
memory. We know that each byte has its own address, so if an integer takes
4 bytes, it spans 4 addresses. Why, then, does a pointer hold only one
address for one variable? Here’s the thing: when a variable takes multiple
bytes, the address (pointer) of this variable is the lowest address.

For example, assume var is an integer, stored in addresses 0x1000 ,


0x1001 , 0x1002 , and 0x1003 . In this case, the address (pointer) of
var , i.e., &var is the lowest address which is 0x1000 .

This leads to another question. We know different data types can
occupy different numbers of bytes. When you dereference a pointer, how
does the system know whether you want, say, an integer or a double?
Remember pointers also have types, e.g., int* and double* . If you’re
dereferencing an int* , the system will group four contiguous bytes starting
from the address indicated by the int* and interpret them as an integer.
Thus, the type of the pointer determines how many bytes it needs to
retrieve from the address.

1.4.2.3 Endianness

Almost all the memories are byte-addressed, meaning each byte has a
unique address. We also know some data types occupy multiple bytes.
The question is, how do you store these different bytes in the memory, i.e.,
in what order?

Say we have an integer int var = 366154 , whose hexadecimal is 0x5964A .


Since each integer takes four bytes and each byte has two hexadecimal dig-
its, we can separate the bytes as follows:

0x00 0x05 0x96 0x4A

Now these four bytes will be stored in the memory when the program is
executing, and each byte has its own address. Let’s assume the integer’s
address &var is 0x1000 . Question is, which byte has what address?

There are two ways to manage this. First, we can let the lowest byte occupy
the lowest address, called little endian:

Address 0x1000 0x1001 0x1002 0x1003


Data 0x4A 0x96 0x05 0x00

Of course we can also let the lowest byte occupy the highest address, which
is called big endian:

Address 0x1000 0x1001 0x1002 0x1003


Data 0x00 0x05 0x96 0x4A

The endianness is specific to machines, not something we can control. Most
of the machines we’re using are little endian machines, and so in this
course, unless otherwise specified, we assume little endianness. You can
check the endianness of your machine by running the following program: 8

1 #include <stdio.h>
2 int main() {
3     int var = 366154;
4     printf("%p: 0x%X\n", &var, var);
5     unsigned char* p = (unsigned char*)&var;
6     for (int i = 0; i < sizeof(int); i ++)
7         printf("%p: 0x%X\n", p+i, p[i]);
8     return 0;
9 }

8: In the code, we first print out the address and the value of var using
hexadecimal format (%X). To view each byte, we convert the pointer of var
to unsigned char* , since each char takes one byte (using unsigned avoids
sign extension when a byte such as 0x96 is printed). Operator sizeof(int)
returns the number of bytes that an int takes, so we iterate from the
lowest address to the highest, and print one byte at a time.

Also notice that regardless of endianness, the address of a variable stays
the same. For the above example, under both big and little endianness, the
address of var is 0x1000 .
Ÿ Caution!
Note that endianness only happens within multi-byte variables. For an
array, all the elements are still placed from low address to high address;
it’s within each element that there might be endianness differences among
different machines. In other words, arrays always grow from low address
to high address, while individual elements grow differently based on
endianness.

1.4.2.4 Arrays and Pointer Arithmetic

When we want to print out the address of a variable, we use the & operator,
e.g., int* p = &var . However, to show the address of an array or a string,
we do not need to use the address-of operator, because the variable name of
an array or a string is already the starting address of it:
1 double arr[5] = {3.1, 4.223, 5.152, 10.01 , 2};
2 printf("%p\n", arr);

In fact, the name of an array is also the address of its first element, i.e.,
arr == &arr[0] .

Another important operation is called pointer arithmetic. Given a pointer,


we can add or subtract a number from it (called offset) to obtain a new
address. Following the code segment above, since arr is the starting
address of the array, we can add the index of an element to arr and get
the address of that element:

1 for (int i = 0; i < 5; i ++) {


2 printf("%p\n", arr + i);
3 }

When running this program, you’ll notice that the addresses on two con-
secutive lines differ by eight bytes. This is because arr is a double*
and each double takes eight bytes, so adding 1 to the pointer will pro-
duce an offset of eight bytes. More generally, for a pointer p to type x
and an integer i , when we compute p + i , the new address will actually
be i * sizeof(x) bytes higher than p .

We have also learned dereferencing a pointer to get the value stored at that
address, so here’s another way to print all elements out:

1 for (int i = 0; i < 5; i ++) {


2 printf("%lf\n", *(arr + i));
3 }

What we are doing here is first calculating the address of the i-th element
by arr + i . Then we dereference the address (pointer) by *(arr + i) .
Table 1.4 summarizes the equivalency between the two methods so far
related to arrays.

Table 1.4: Array access methods. The two methods on each row are equivalent.

  arr[0]     *arr
  &arr[0]    arr
  arr[i]     *(arr + i)
  &arr[i]    arr + i

˛ Quick Check 1.7
Can you write a program to show all the bytes in the following
array individually, in the order they are stored?

1 double arr[5] = {3.1, 4.223, 5.152, 10.01 , 2};

B See solution on page 22.

1.4.2.5 Null-Terminated Strings

In Section 1.4.1.1 we mentioned that strings need to have a null-terminator


at the end. A null terminator is simply a char whose ASCII value is 0. The
reason why we need a null terminator is mainly because of how C handles
strings and output.

We know that to print out an array we need to use for loops, where we’ll
specify the number of iterations we want to repeat. One exception, how-
ever, is strings, where we can just use %s to print all the characters in the
char array:

1 char str[] = "Hello!";


2 printf("%s", str);

Note that printf() receives str , which is the starting address of the
char array, and so it’ll start and continue printing one byte after another.
When will it stop, then? That’s where the null terminator comes into play:
printf() will start printing given the starting address of the string, keep
printing, until there’s a null terminator.

What if we don’t have the null terminator? Sometimes it appears fine,
because the memory right after the string happens to contain a 0, which
acts as the null terminator. But try the following program:

1 #include <stdio.h>
2 int main() {

3 char a[8] = {'H','e','l','l','o','!','!','!'};


4 char b[8] = {'W','o','r','l','d','!','!','!'};
5 printf("%s", a);
6 return 0;
7 }

And you’ll see the output is Hello!!!World!!! , which is because a is
not null-terminated, so printing runs past a into b . It stops after b
simply because the byte in memory after b ’s last “ ! ” happens to
be 0 .

1.5 Quick Check Solutions

Quick Check 1.1

1. For each of the following statements, state if it’s true or false, and
explain why it’s true or false:
a) Depending on the context, the same sequence of bits may rep-
resent different things;
 True. The same bits can be interpreted in many different
ways with the exact same bits! It can be from an unsigned
number to a signed number or even a program. It is all
dependent on your interpretation.
b) If you interpret a 𝑁 bit two’s complement number as an un-
signed number, negative numbers would be smaller than pos-
itive numbers.
 False. In two’s complement, the MSB is always 1 for a neg-
ative number. This means ALL two’s complement nega-
tive numbers will be larger than the positive numbers when
interpreted as unsigned.
2. For the following questions, assume an 8-bit integer and answer
each one for the case of an unsigned number and two’s complement
number. Indicate if it cannot be answered with a specific represen-
tation:
a) What is the largest integer? What is the result of adding one to
that number?
 Unsigned: 255, 0;
 Two’s Complement: 127, -128.
b) How would you represent the numbers 0, 1, and -1?
 Unsigned: 0b0000 0000 , 0b0000 0001 , not possible;
 Two’s Complement: 0b0000 0000 , 0b0000 0001 , 0b1111 1111 .
c) How would you represent 17 and -17?
 Unsigned: 0b0001 0001 , not possible;
 Two’s Complement: 0b0001 0001 , 0b1110 1111 .
3. What is the least number of bits needed to represent the following
ranges using any number representation scheme:
a) 0 to 256;

 In general 𝑛 bits can be used to represent at most 2^𝑛 distinct
things. As such 8 bits can represent 2^8 = 256 numbers.
However, this range actually contains 257 numbers so we
need 9 bits.
b) -7 to 56;
 Range of 64 numbers which can be represented through 6
bits as 2^6 = 64.
c) 64 to 127 and -64 to -127;
 We are representing 128 numbers in total which requires 7
bits.
d) Address every byte of a 12 TB chunk of memory.
 Since a TB is 2^40 bytes and the factor of 12 needs 4 bits, in total
we need 44 bits, as 2^43 bytes < 12 TB < 2^44
bytes.

Quick Check 1.2

Is the following statement true or false, and why? It is possible to get an


overflow error when adding two signed numbers of opposite signs.

 False. Overflow errors only occur when the correct result of the ad-
dition falls outside the range of [−2^(𝑑−1) , 2^(𝑑−1) − 1]. Adding numbers
of opposite signs will not result in numbers outside of this range.

Quick Check 1.3

Assume we have a binary number 1100 0101 . How can we apply shift-
ing operations on this number so we can get 0000 0101 ? That is, only
preserve the lowest four bits and zero out the rest.

 First shift left four bits, then shift right four bits.

Quick Check 1.4

Assume we have a binary number 1100 0101 . How can we apply only
bit-wise logical operations on this number so we can get 0000 0101 ?
That is, only preserve the lowest four bits and zero out the rest.

 Let the number AND with mask 0b0000 1111 .

Quick Check 1.5

Now let’s do a simple quick check to make sure you understand every-
thing so far.

True or False?

1. All data stored in CPU has a unique address;



 False. Data in the CPU are stored in registers, and registers
are referred to by names, not addresses.
2. Random access memory stores all the files on one’s computer;
 False. Files are stored on secondary storage.
3. Random access memory is byte-addressed;
 True. Each byte has a unique address.
4. If we want to do a calculation such as 24 − 13, we store the two
operands into RAM, and transfer them to ALU to perform the cal-
culation. Once ALU computes the result, it’ll be directly transferred
back to RAM;
 False. We first need to transfer them to registers, and ALU will
retrieve them from the registers. Once the result is calculated,
it’ll have to be written back to a register first, and then store
that data back to RAM.
5. In a program written in Java, for example, we declare integers, char-
acters, floating points, etc., for variables. RAM also stores them in
the forms of integers, characters, and floating points.
 False. All data stored in RAM – whether real data or a ma-
chine instruction (code) – are just binary numbers.

Quick Check 1.6

Assume a = 0b1000 1011 , b = 0b0011 0101 , and c = 0b1111 0000 .


Calculate the results of the following C statements:
1. a & b;
 0b0000 0001

2. a | c;
 0b1111 1011

3. a | 0;
 0b1000 1011

4. a | (b >> 5);
 0b1000 1011

5. a & (b >> 5);


 0b0000 0001

6. ~((b | c) & a);


 0b0111 1110

Quick Check 1.7

Can you write a program to show all the bytes in the following array indi-
vidually, in the order they are stored?

1 double arr[5] = {3.1, 4.223, 5.152, 10.01 , 2};

 Solution:

1 #include <stdio.h>
2 int main() {
3     double arr[5] = {3.1, 4.223, 5.152, 10.01 , 2};
4     unsigned char* p = (unsigned char*)arr; /* unsigned avoids sign extension */
5     for (size_t i = 0; i < sizeof(double)*5; i ++) {
6         printf("0x%x\n", *(p+i));
7         // p[i] is also correct!
8     }
9     return 0;
10 }
Instruction Set Architecture 2
In this chapter we will discuss the ARMv8 instruction set architecture. From
now on, I want you to forget all the magical things you learned about pro-
gramming languages.

2.1 Instruction Format . . . . . 25
2.2 Accessing Memory . . . . . 26
2.3 Moving Constants & Registers . . . . . 29
2.4 Data Processing Operations . . . . . 30
2.5 Flow Control . . . . . 32
2.6 Procedures . . . . . 47
2.7 Quick Check Solutions . . . . . 59

2.1 Instruction Format

Instructions consist of four parts and have the following format:

1 Label: Mnemonic Operand_1, Operand_2, ... // Comments

where the label marks the address of the instruction in memory, which is op-
tional; the mnemonic is the text representation of the specific operation we want
the instruction to carry out (think about the plus sign in 22 + 33), which is
required for any instruction; operands are separated by commas, and the
number of operands needed is determined by the actual mnemonic for
that instruction (it can vary from 0 to 3). The operands can be numbers
(called immediates) and/or registers. Lastly, line comments are marked
by a double slash, and block comments can be used with a pair of /* ...
*/.

Register Names

In ARMv8, we have 32 general purpose registers, each of which can store


64 bits, and one of them is a zero register whose value is always 0 and
cannot be changed. Figure 2.1 shows the complete register file in ARM
architecture.
Those register names can be used as operands. If W-registers are used, only
the lowest half 32 bits will be operated on; if X-registers are used, all 64 bits
will be operated on.

[Figure 2.1 lists the register file: 64-bit registers X0 through X30 plus the
zero register XZR, with the lower 32 bits of each accessible as W0 through
W30 and WZR respectively.]

Figure 2.1: Register file of ARM architecture. X-registers can store 64 bits. The lowest half 32 bits for each X-register can also be used
independently as W-registers. However, the upper half 32 bits cannot be used independently.

2.2 Accessing Memory

Recall that in Section 1.2.1 of Chapter 1, we made it clear that if we want
the ALU to perform a calculation, we have to move the data from RAM
to registers first. So in this section, let's see how we can achieve that;
these will be our first assembly instructions! (Exciting!)

2.2.1 Load

To move data from memory to registers, we use load instructions: LDR .


There are many versions of LDR, but let’s start with two most common
ones:

 LDR Wt, [base,simm9] :


loads a word from memory addressed by base+simm9 to Wt;
 LDR Xt, [base,simm9] :
loads a doubleword from memory addressed by base+simm9 to Xt.

Xt represents the destination register, whereas base is a register that stores
the base address in memory. simm9 stands for a signed immediate 9-bit in-
teger, which is used as an offset from the base address: it indicates how
many bytes past the base address we want to start grabbing data.

Example 2.1
An integer is currently stored at memory address of
0xFFFFFFFFEB101010 , and this address has already been
stored in register X9 . Please load this integer into the 11-th
register.

Solution: First, notice the data we want to load is an integer, which


takes four bytes, the size of a word. Therefore, the destination
register should be a W -register, e.g., W10 .

Next, remember the address of a variable is its lowest ad-


dress. Thus, the integer is stored at 0xF..FEB101010 to
0xF..FEB101013 . The starting address 0xF..FEB101010 has
already been stored in X9 , so X9 is our base.

When we use W -registers as the destination, LDR will grab four bytes
starting from base+simm9 . In this example, if we leave simm9 zero (no
offset from base), it'll bring the integer into W10 . In summary,
the instruction should be:

1 LDR W10, [X9]


2 // or LDR W10, [X9, 0]
3 // or LDR W10, [X9, #0]

The idea of this example is shown in Figure 2.2, where we included


another example LDR W10,[X9,2] .

[Figure 2.2 shows two panels, both with X9 = 0xFFFFFFFFEB101010. Left:
LDR W10,[X9,0] grabs the four bytes at addresses 0x...10 through 0x...13
(values 12, AB, 00, A4), so X10 becomes 0x00000000A400AB12. Right:
LDR W10,[X9,2] starts at X9+2, grabbing the bytes at 0x...12 through
0x...15 (values 00, A4, 23, 01), so X10 becomes 0x000000000123A400.]

Figure 2.2: LDR will grab a few bytes (determined by the size of the destination register) from the starting address, calculated by base+simm9.

Pay attention to how we move individual bytes into the register:
the byte at the lowest address is stored in the lowest byte of the register.
This is little endian. If the machine uses big endianness, the
order is reversed.

We can also load a byte or a half word (2 bytes) into a register, but with
caution. Remember the register operands are either X-registers (8 bytes)
or W-registers (4 bytes), so if we want to grab only one or two bytes from
memory, we need to also extend the data to match the size of the desti-
nation. The instructions that can do this for us have the same format for
operands as in LDR, but the mnemonic can have variations:

LDR[S]{H|B} Wt, [base, simm9]

The [S] in the mnemonic is optional; if used, it’ll sign extend the data to
match the destination register; otherwise it’ll zero extend the data. The
{H|B}, required, stands for either move half-word (H) or a byte (B). For ex-
ample, LDRSH W10, [X9,-24] will grab two bytes starting from address
X9-24, and sign extend this two byte data to four bytes, and store it into
W10.

Additionally, the third operand can also be a register:

 LDR Wt, [Xn,Xm] :


loads a word from memory addressed by Xn+Xm to Wt;
 LDR Xt, [Xn,Xm] :
loads a doubleword from memory addressed by Xn+Xm to Xt.

The suffix mentioned above can also be applied to these two forms of the
instruction.

2.2.2 Store

Moving data from memory to registers is to load data, whereas moving


data from a register to a location in memory is to store data. The basic

mnemonic for storing data is STR .

Since the destination is now memory, which is byte-addressed, we don't
need to think about extensions such as byte to doubleword.

 STR Wt, [base,simm9] :


stores a word from Wt to memory addressed by base+simm9;
 STR Xt, [base,simm9] :
stores a double word from Xt to memory addressed by base+simm9;
 STRB Wt, [base,simm9] :
stores a byte from Wt to memory addressed by base+simm9;
 STRH Wt, [base,simm9] :
stores a halfword from Wt to memory addressed by base+simm9.

Remember, assembly is a very simple language; it does exactly what you
tell it to do. If you use an STR instruction to store a doubleword start-
ing at address base+simm9, all 8 bytes in memory from base+simm9
to base+simm9+7 will be overwritten. The old values won't be saved anywhere.

Example 2.2
Currently X9 stores a quad number 0x12345678ABCDEF00 ,
and we want to store this number into the memory address
0xF..FEB101010. Currently X10 is storing this memory address.
Write the instruction to achieve this, and draw the memory layout.
Assume we’re using a little-endian machine.

Solution: The instruction is very straightforward:

1 STR X9, [X10]


2 // or STR X9, [X10, 0]
3 // or STR X9, [X10, #0]

When drawing memory layout, notice we assume this is a little-


endian machine. Therefore, the lowest byte in the register will be
stored at the lowest address, which is the base address. We show
both little and big endian in Figure 2.3.

Similarly, the third operand can also be a register:

 STR Wt, [Xn,Xm] :


stores a word from Wt to memory addressed by Xn+Xm;
 STR Xt, [Xn,Xm] :
stores a doubleword from Xt to memory addressed by Xn+Xm.

The suffix mentioned above can also be applied to these two forms of the
instruction.

[Figure 2.3 shows STR X9,[X10,0] with X9 = 0x12345678ABCDEF00 and
X10 = 0xFFFFFFFFEB101010 on little- and big-endian machines. Little
endian: the lowest byte 00 goes to the lowest address 0x...10, followed
upward by EF, CD, AB, 78, 56, 34, and finally 12 at 0x...17. Big endian:
the order is reversed, with 12 at 0x...10 and 00 at 0x...17.]

Figure 2.3: STR is the opposite direction of LDR. Notice how the same value of X9 is stored differently on little and big endian machines.

Quick Check 2.1

1. True or False and why: assume X0 points to the start of an
array called arr . LDR X1,[X0,3] will load arr[3] to X1 ;


2. Assume we have an array of long integers
long int arr[3] , and the value of arr is stored in
register X9 . Write a sequence of ARMv8 instructions to
move all elements in the array from memory to register X10
and move it back, one at a time.

See solution on page 59.

2.3 Moving Constants & Registers

Later you’ll find out that moving data from memory to registers is not al-
ways the best choice. Sometimes we just want to set some integer number
to a register, something as simple as giving a variable a value. The instruc-
tion we can use is MOV .

 MOV Xn, simm64 : moves 64-bit signed immediate to register Xn ;


 MOV Wn, simm32 : moves 32-bit signed immediate to register Wn ,
while zeroing out the highest 32 bits of the corresponding X-register.

We can also “move” data from one register to another:

 MOV Xdst, Xsrc : copies data from register Xsrc to Xdst ;


 MOV Wdst, Wsrc : copies data from register Wsrc to Wdst .

Notice two things here: first, the destination and source registers should
match in size; second, it's called "moving", but really it copies data from
source to destination. This means after the instruction, the data will have
two copies: one in the destination, and the other in the source.

For example, MOV X9,X10 will copy the data from X10 to X9 . If be-
fore X10 stores an integer value of 382, after the instruction, both X9
and X10 will have value of 382. The old value in X9 will be completely
overwritten.

Quick Check 2.2


1. What’s the value of X9 after each of the following instruc-
tion in hexadecimal?

1 MOV X9, -9
2 MOV W9, 10

2. Assume X9 has a value of 0xFFFF101029AB00FE, while


X10 has a value of 0x33F59077F2FFABCD. What are the
values of X9 and X10 after the following instruction:
MOV W10, W9 ?
3. Which of the following instructions are invalid and why?
a) MOV W10, WZR
b) MOV X20, [X22,10]
c) MOV 382, X20
d) MOV XZR, 382
e) MOV 382, 392

See solution on page 60.

2.4 Data Processing Operations

Once the data have been moved to registers, either from an immediate
number or memory, we are ready to perform real computations.

2.4.1 Arithmetic Operations

Let's start with the simplest arithmetic: addition and subtraction. The
mnemonics for these two operations are very straightforward: ADD and
SUB .

 ADD Wd,Wn,Wm : Wd = Wn + Wm ;
 ADD Xd,Xn,Xm : Xd = Xn + Xm ;
 ADD Wd,Wn,simm12 : Wd = Wn + simm12;
 ADD Xd,Xn,simm12 : Xd = Xn + simm12.

The subtraction instructions have the same format except the mnemonic
is SUB , so we'll skip them here.

When using W registers, however, we have to be careful. For example, assume
we have X0 = 0x2222222222222222 and X1 = 0x3333333333333333 ,
both full 64-bit numbers. When we use X registers to do addition, e.g.,
ADD X2,X1,X0 , X2 will be 0x5555555555555555 without any problem.
If we use ADD W2,W1,W0 , however, the highest 32 bits will be cleared
(set to zero), and only the lowest 32 bits will be used. Therefore, in this case,
X2 will be 0x0000000055555555 . See Figure 2.4 for an illustration. 1

[Figure 2.4 shows X0 = 0x2222222222222222 and X1 = 0x3333333333333333 .
ADD X2,X0,X1 leaves X2 = 0x5555555555555555 , while ADD W2,W0,W1
leaves X2 = 0x0000000055555555 .]

Figure 2.4: Using W and X registers for addition. If W registers are used, the
highest 32 bits will be cleared out. The gray boxes are W registers.

1: This is very different than other machines such as x86, where operating on
low bits of a register does not change the high bits at all. Therefore, reading
documentation carefully is important; sometimes it's also good to just play
around and find out small details like this.

A typical workflow is: we load the operands from memory into registers
first, then use ADD or SUB to perform the calculation. The ALU sends the
result to the destination register, so we'll have to store the result back
from the register to memory.

Example 2.3
Assume a long integer var is stored at address ptr_var in mem-
ory. The value of ptr_var has been loaded into register X9 .
Write a sequence of ARMv8 assembly to add 1 to var . This is
similar to translating the following C code into assembly:

1 long int var;


2 long int* ptr_var = &var;
3 (*ptr_var) ++;

Solution: To perform any type of calculation, the operands have


to be inside the registers. Since we want to perform addition on
var which is currently inside the memory, we need to first load
it into a register, say X10 : LDR X10,[X9] . For the addition, one
operand now is already X10 , but the other operand is number 1,
which is an immediate, so we can use ADD with an immediate
operand.

Now that two of the operands of the addition are ready, we


also need to decide which register to store the result of addition.
Assume we want to save it to X12 : ADD X12,X10,1 .

After this instruction, we are not done yet, because the result
currently resides in a register, but what we want is to update the
variable which resides in memory. Therefore, we need to move the
result back by using STR instruction. Notice that at this moment
X9 still stores the address of var , so if we write back there, it’ll
overwrite the old value of var : STR X12,[X9] .

In sum, the sequence looks like this:

1 LDR X10, [X9]


2 ADD X12, X10, 1
3 STR X12, [X9]

Remarks: You can certainly set the destination register for ADD to
X10 , which will overwrite the old value of X10 , basically like
x10 = x10 + 1 . However, the fact that X10 was updated with a new
value does not mean the value of var in memory will also
be updated. Therefore, you have to STR the result back manually.

We don’t usually use multiplication and division in our course, but we


will show the instructions here for reference:

 MUL Xd,Xn,Xm : Xd = Xn * Xm;



 SDIV Xd,Xn,Xm : signed integer division, Xd = Xn / Xm;


 UDIV Xd,Xn,Xm : unsigned integer division, Xd = Xn / Xm.

2.4.2 Logic Operations

Another type of calculation that the ALU can perform is logic operations,
which are basically bit-wise operations. The instructions we can use are:

 AND Xd,Xn,Xm : bit-wise and on registers Xn and Xm and store


the result in Xd ;
 ORR Xd,Xn,Xm : bit-wise or on registers Xn and Xm and store the
result in Xd ;
 EOR Xd,Xn,Xm : bit-wise exclusive-or on registers Xn and Xm
and store the result in Xd .

All these instructions can also be used for W -registers, but all three operands
need to match the size.

In addition to those mentioned, we also sometimes want to shift the bits in


a register, left or right. Recall that we have two types of shifting—logical
and arithmetic.

Correspondingly, our instructions have these versions as well:

 ASR Xd, Xn, Xm : arithmetic shift right Xd = Xn >> Xm ;


 LSL Xd, Xn, Xm : logical shift left Xd = Xn << Xm ;
 LSR Xd, Xn, Xm : logical shift right Xd = Xn >> Xm .

As discussed in Section 1.1.5.2, if we want to multiply a number by 2^n,
the fastest way is to shift left using LSL . Also, register Xm in the three
instructions above can be replaced by an immediate number.

Quick Check 2.3


Congratulations! You have learned enough assembly instructions
to write many programs! See if you can write sequences of
assembly instructions for the following problem.

(Fall 21 quiz) Write the corresponding ARMv8 assembly code for


the following C statement. Assume that the variable f is in register
X20 , and the base address of array A is in register X21 . A is an
array of integers. A[0] = f + A[5];
See solution on page 60.

2.5 Flow Control

One huge difference between the high-level languages you have learned
and assembly is that the former is highly structured, whereas the latter is
just a sequence of instructions. Therefore, repeat after me: "I will forget
about everything I learned about program structures, such as for-loops,
if-else statements, and so on."

The assembly program simply executes one instruction after another se-
quentially, unless we explicitly instruct it to jump to a specific
instruction. Assembly programs don't have "memory" either: you cannot
expect the program to magically jump back to somewhere without your
instruction.

Once we’re clear on this, let’s talk about program counter first.

2.5.1 Program Counter

The name of program counter (PC) might be a little bit confusing; its job
is not exactly to "count" the program. PC is a 64-bit special register inside
the CPU, and no, we cannot directly modify the data inside it.
Note: PC is not one of the 32 general purpose registers we introduced in
Section 2.1; therefore it cannot be directly referenced. For example, instruc-
tions such as ADD PC,PC,4 and LDR X0,[PC] are wrong.

You probably already know this: when we're executing our program, the
assembly instructions are also stored in RAM. Because everything stored
in RAM has an address, each instruction in our assembly programs also
has its own address.

For us humans, when we read/write an assembly program, we know after
one instruction we just go to the next one, but how does the machine know?

Here’s what happens: the CPU will store the address of its next instruc-
tion to be executed into PC. After the current instruction has finished, the
CPU will simply go to RAM and retrieve the instruction pointed to by PC. 2

2: Disclaimer: this is an extremely simplified example. Later in Chapter 3,
you'll see it's more complicated. However, to not get confused, it's entirely
ok to think about it this way now.

Example 2.4
Consider the following assembly sequence,

1 ADD X9, X10, X11 /* 0xFFFF0040 */


2 SUB X11, X12, X9 /* 0xFFFF0044 */
3 STR X9, [X11, 0] /* 0xFFFF0048 */
4 LDR X9, [X12, 8] /* 0xFFFF004C */

When executing this program, all instructions will be stored in
memory contiguously. Because each instruction takes four bytes,
the addresses of these instructions are multiples of four.

Before executing the first instruction ADD , the PC has a value


of 0xFFFF0040 . When the ADD instruction starts running,
PC is automatically incremented by four bytes, which makes it
0xFFFF0044 . When ADD finishes, CPU looks at the address in-
side PC which is 0xFFFF0044 , and goes to that memory location
to fetch the next instruction, which is SUB .

When SUB starts running, PC is again incremented by four bytes,


making it 0xFFFF0048 that points to the next instruction. As you
see, the only thing that the CPU looks at is PC; it’ll simply fetch

whatever pointed by PC.

We cannot manually give PC a value like moving some value to a register,


because PC determines the program flow, but sometimes we just want to
skip some instructions, which surely needs us to “modify” PC. This can
be achieved by using special instructions.

2.5.2 Branching

Assembly programs are sequentially executed unless specified. To branch


to somewhere else in the program, we have to resort to a special set of
instructions that can modify PC.

2.5.2.1 Unconditional Branching

The simplest way is to use the B instruction (which stands for branching).
The syntax is B target where target is the target instruction's memory ad-
dress. The problem is we don't really know the memory address of each
instruction when the program is running. Therefore, we can use a label (if you
forgot about labels, go back to Section 2.1).

Example 2.5
Consider the following assembly sequence,

1 ADD X9, X10, X11 /* 0xFFFF0040 */


2 B L1 /* 0xFFFF0044 */
3 SUB X11, X12, X9 /* 0xFFFF0048 */
4 L1: STR X9, [X11, 0] /* 0xFFFF004C */
5 LDR X9, [X12, 8] /* 0xFFFF0050 */

where we added a label L1 for the STR instruction. On line 2,


B L1 means we want to execute the instruction pointed by L1
next instead of its following instruction SUB .
When executing the instruction B L1 , because the instruction says
"now you need to go to L1 ", PC will be changed to the address of
L1 which is 0xFFFF004C , instead of the address of SUB which
is 0xFFFF0048 .

Therefore, after the B instruction is done, the CPU fetches the target
instruction at L1 (the STR ), and starts running from there. See
Figure 2.5 for an illustration.

[Figure 2.5 shows two flows over the sequence ADD, B L1, SUB, L1: STR, LDR.
Sequential execution visits every instruction in order; with unconditional
branching, B L1 skips SUB and continues at L1: STR.]

Figure 2.5: Program flow of the example.
Instruction B modifies PC based on its target, which makes the program skip some
instructions.

In addition to B , if the address has already been stored in a register, say
X20 , we can use another instruction BR (branch from register): BR X20 .

[Figure 2.6 shows, side by side, the C code

    long int a;
    if (a == 0) {
        a = a + 1;
    }
    a = a - 1;

its flowchart (testing a == 0), and the assembly translation:

    LDR X9, [X0]
    CBNZ X9, L1
    ADD X9, X9, 1
    L1: SUB X9, X9, 1
    STR X9, [X0]
]

Figure 2.6: CBNZ will check the register
value; if it's not zero, it'll branch to the
instruction tagged as L1.

2.5.2.2 Conditional Branching

The unconditional branch could be useful sometimes, but if we want to


write something like if-else , we’ll have to use conditional branch.

Consider the following C code snippet:

1 long int a;
2 if (a == 0) a ++;
3 a --;

There are two possible paths for the program: one goes through if block,
while the other one doesn’t. If we draw a flowchart, it’ll be like the middle
diagram in Figure 2.6.

Given this flowchart, we just need to translate each part into their cor-
responding assembly instructions. Assume the address of variable a is
stored in X0 , and we want to load this variable into X9 . The if block
is simple: we just need to add 1 to register X9 . The instruction SUB is
the destination of branching (for the case when a!=0 ), so we need to add
a label in front of it, say L1 .

Now we only have one problem left: how to translate the condition, i.e.,
the if statement, and this is where we’ll introduce the first set of condi-
tional branching instructions. Currently, variable a is in X9 , so we just
need to check if X9 is zero or not; if it is zero, we keep moving on to the
next instruction (ADD); otherwise we “jump” or branch to the instruction
labeled as L1. The instruction to do this is CBNZ (conditionally branch if
not zero):

1 CBNZ Xt, Label

This instruction will check the value inside Xt . If it’s not zero, it’ll move
the address of the instruction tagged with Label to PC, and thus branches
to that instruction. If it is zero, it’ll just move to the very next instruction.

Another related instruction is CBZ , which, as you probably figured, will
jump to the destination if the register it's checking is zero.

Now that we’ve introduced the branching instructions (at least a small
portion of it), there are some common beginner mistakes that we need to
be careful of.

Let’s re-write the C code in Figure 2.6 using CBZ instruction. Following
our steps before, the instruction ADD is the target of CBZ , so it’s natu-
ral to give it a label, say L2 . So someone came up with the following
sequence:

1 LDR X9, [X0] /* long int a; */


2 CBZ X9, L2 /* if (a == 0) goto L2; */
3 L2: ADD X9, X9, 1 /* a = a + 1; */
4 L1: SUB X9, X9, 1 /* a = a - 1; */

Let’s walk through this sequence. If X9 is indeed zero, the flow of in-
struction would be LDR → CBZ → ADD → SUB, which is correct. If X9 is
not zero, the flow is also LDR → CBZ → ADD → SUB, which is not what we
want obviously. The problem is we didn’t skip the ADD instruction, so a
quick and simple fix: after CBZ, add an unconditional branch B to jump to
L1 :

1 LDR X9, [X0] /* long a; */


2 CBZ X9, L2 /* if (a == 0) goto L2; */
3 B L1 /* else goto L1; */
4 L2: ADD X9, X9, 1 /* a = a + 1; */
5 L1: SUB X9, X9, 1 /* a = a - 1; */

Now the flow when a != 0 is correct: LDR → CBZ (failed) → B → SUB.

Example 2.6
Translate the following C code into assembly:

1 long int a;
2 if (a == 0) a = a + 2;
3 else a = a - 2;
4 a = a * 4;

As usual, we first draw the flowchart of the C code, as in Figure


2.7. Notice this time because the statement a = a - 2 is in the
else block, after we finish the if block, we’ll have to skip the
statements in the else block.

With the flowchart, we can clearly see there are two branches, and
thus two destinations: one is a = a - 2 for the else case, and the
other is a = a * 4 when the if block has finished. Therefore, when
translating to assembly, we will add two labels for each of the
destinations, L1 and L2.

We use CBNZ to test the if statement like what we did before.


After the ADD instruction, we reach the end of the if block, and need
to skip the else block and jump to L2. This time, the branching is
unconditional: it doesn't depend on a specific value or register, and
therefore we use the unconditional branch B L2 to force the program
to dodge the else case.

[Figure 2.7 shows the C code, its flowchart, and two assembly translations
side by side. The correct translation:

    LDR X9, [X0]
    CBNZ X9, L1
    ADD X9, X9, 2
    B L2
    L1: SUB X9, X9, 2
    L2: LSL X9, X9, 2
    STR X9, [X0]

The wrong translation omits B L2 , so when a is zero the flow falls through
the if block into the else block:

    LDR X9, [X0]
    CBNZ X9, L1
    ADD X9, X9, 2
    L1: SUB X9, X9, 2
    LSL X9, X9, 2
    STR X9, [X0]
]

Figure 2.7: Without unconditional branch B, the program flow would be wrong when variable a is zero.

Common mistakes: as shown in the rightmost diagram in Figure
2.7, a common mistake is forgetting to add the unconditional
branch at the end of the if block. Remember, assembly is very
innocent: if you don't tell it to branch/jump to somewhere,
it'll just execute the very next instruction. Without B, you can see
the red flow passes through all instructions, and thus the wrong
result. No, assembly does not recognize if-else, and does not
remember which instruction is the end of what block. All the
control is in your hand, not the machine's.

Take-away: unlike high-level languages which usually have {} to


determine the flow of the program, assembly will always execute
the instructions by order unless we explicitly specify the flow.

2.5.2.3 Comparison with Zero

In the previous example, we compared variable a with zero, but what if we
want to check whether two variables are equal, say a == b ? A very straight-
forward way is to simply store the difference between a and b in a regis-
ter, and use CBZ or CBNZ .

Consider following code:

1 if (a == b) goto L1;
2 else goto L2;

It can be easily translated into the following assembly segment, assuming


a is in X9 and b in X10 :

1 SUB X0, X9, X10 // X0 = a - b;


2 CBZ X0, L1 // if (a - b == 0) goto L1;
3 B L2 // else goto L2;

This version of the assembly translation has its limitations: (1) we have to
store the subtraction result in a register, which can be wasteful, and (2) CBZ
or CBNZ can only branch on whether the register operand is zero or not,
but can't branch on whether it's positive or negative. Thus, we introduce
another more extensible way of branching.

Conditional branching can be broken into just two steps: (1)
compare the two numbers, and (2) branch to the destination based on some
criterion. We have an instruction for step (1), which is CMP :

 CMP Wn, Wm : subtract Wm from Wn ;


 CMP Xn, Xm : subtract Xm from Xn .

Next, we use a new condition branch instruction B.EQ (branch if equal),


or B.NE (branch if not equal). Thus, the above example can be translated
to:

1 CMP X9, X10 // Compare a and b


2 B.EQ L1 // if (a == b) goto L1;
3 B L2 // else goto L2;

The meaning of the sequence above seems more human friendly, doesn’t
it?

What’s more convenient with CMP instruction is that it can also directly
compare with an immediate number:

 CMP Wn, imm12 : subtract imm12 from Wn ;


 CMP Xn, imm12 : subtract imm12 from Xn .

Thus, if we want to compare, say X9 and 100 , we can simply do CMP X9,100 ,
instead of moving 100 to a register and subtracting two registers and
then using CBZ / CBNZ .

At this point, you probably would be wondering: it looks like CMP in-
struction is also doing subtraction, but where does it save the result? And
since B.EQ is a separate instruction, how does it know which two regis-
ters we are comparing? The magic behind these is condition codes.

2.5.2.4 Condition Codes

Inside the CPU, there's a special register called CPSR, or current program sta-
tus register, which stores 32 bits. Different from the registers we have seen,
these 32 bits are used individually, and are typically set directly by the proces-
sor (not us!). We only care about the highest four bits, which correspond
to the four condition codes respectively:

Bit:      31         30      29       28          27-0
Flag:     N          Z       C        V           Other flags
Meaning:  Negative   Zero    Carry    Overflow

When a condition code has a value of 1, we say it’s set; otherwise it’s
clear.

The four condition codes indicate the following situations:

 N : negative (for signed number), a copy of the MSB of the calcu-


lated result;
 Z : set if the result is zero; otherwise clear;
 C : set if there’s a carry for unsigned numbers;
 V : set if there’s an overflow for signed numbers.

In sum, condition codes reflect the status change of the instruction just
being executed, or some attributes of the result of the operation. However,
not all instructions are designed to change condition codes. So far the only
instruction we learned that changes the codes is CMP . Arithmetic/logic
instructions such as ADD and ORR do not modify condition codes. If we
want them to set condition codes, we need to use the following instructions,
which explicitly set condition codes: ADDS , SUBS , and ANDS . 3

3: None of the logic instructions including shifting modifies condition
codes, except ANDS .
Now let’s go back to this sequence:

1 CMP X9, X10 // Compare a and b


2 B.EQ L1 // if (a == b) goto L1;
3 B L2 // else goto L2;

During CMP , the CPU subtracts X10 from X9 , but does not store the result
anywhere. If the result is zero (meaning X9 is equal to X10 ), the Z flag
in CPSR will be set; otherwise it'll be cleared. Then we move on to the
next instruction B.EQ . This instruction only inspects the current value of
the Z flag: if it's set, it'll branch to L1 ; otherwise it'll execute the following
instruction, which in this case is B L2 .

˛ Quick Check 2.4


For the following assembly program, write the values of condition codes N , Z , C , and V after the execution of every instruction. Assume at the beginning, all four codes are cleared.

1 MOV X9, 9
2 MOV X10, 10
3 CMP X9, X10
4 ADD X9, X9, 10
5 SUB X10, X10, 3
6 CMP X9, X10
7 CMP X10, X9

B See solution on page 61.


40 2 Instruction Set Architecture

Table 2.1: Conditional branch instructions B.cond and condition code checking.

                Signed numbers                       Unsigned numbers
Comparison      Instruction   CC Test                Instruction   CC Test
=               B.EQ          Z == 1                 B.EQ          Z == 1
≠               B.NE          Z == 0                 B.NE          Z == 0
<               B.LT          N != V                 B.LO          C == 0
≤               B.LE          ~(Z == 0 && N == V)    B.LS          ~(Z == 0 && C == 1)
>               B.GT          (Z == 0 && N == V)     B.HI          (Z == 0 && C == 1)
≥               B.GE          N == V                 B.HS          C == 1

2.5.2.5 More Conditional Branch Instructions

So now the magic of CMP has been revealed: the CMP instruction simply does the subtraction and changes the condition codes correspondingly. This provides us great ability to extend comparison and branching to more complicated cases such as if (a < b) and so on.

Ÿ Caution!
The CMP instruction changes all condition codes based on the result, not just the Z flag.

Consider the following C code:

1 if (a < b) goto L1;
2 else goto L2;

Assume variables a and b are in registers X9 and X10 . The first step is to CMP X9,X10 . Next, we use a conditional branch instruction, B.LT (branch if less than). Thus, the assembly sequence looks like this:

1 CMP X9, X10 // Check X9 - X10 (a - b)
2 B.LT L1     // if (a - b < 0) goto L1;
3 B L2        // else goto L2;

Remember, the CMP instruction simply compares the two operands; it doesn’t make any decisions; it is the B.LT instruction that changes the flow of the program. We also have other instructions for “greater than”, “greater than or equal”, etc. See Table 2.1.

One possible question is: when comparing with ≤, for example, we have signed numbers and unsigned numbers, so how does the computer know whether a number is signed or unsigned? The answer is, it doesn’t know, or rather it doesn’t need to know. Remember we emphasized this from the very beginning: everything inside memory, whether it’s code or data, and no matter what kind of data, is just a binary sequence. Surely when we write a C program, we can specify something like: int a = -1; unsigned int b = 4294967295; which makes variable a a signed number and b an unsigned number. However, if you take a look at how they are stored in the machine, both of them are exactly the same: 0xFFFFFFFF . The machine has no idea which one is what.

Then who decides which number is signed or unsigned? Of course it’s us. If we treat the numbers to compare as unsigned, then just use, for example, B.LS ; if we want to take them as signed, then use B.LE . From Table 2.1, you see both B.LE and B.LS compare ≤; the only difference is that these two instructions check different condition codes. Instructions such as CMP will modify the condition codes regardless; then it’s up to us how we want to treat our numbers and thus which instruction to choose.

ą Example 2.7
Note for this example, we just pretend X -registers are 8-bit wide for convenience; they are of course 64 bits in fact. Assume currently in register X9 we have the data 1111 1011 , whereas in X10 we have 0000 0101 . Consider the following assembly code:

1 SUBS X11, X9, X10
2 B.LT L1
3 B L2
4 ...
5 L1: ...
6 ...
7 L2: ...

After running the instruction SUBS X11,X9,X10 , we have the following values in the registers:

Register   Data        Signed   Unsigned
X9         1111 1011   −5       251
X10        0000 0101   5        5
X11        1111 0110   −10      246

N = 1, Z = 0, C = 1, V = 0

Because X11 is not zero and there’s no overflow, both Z and V are clear. The MSB of X11 is 1, and it’ll be copied to N , meaning if the result were treated as a signed number, it would be negative. Lastly, C is also set because there’s a carry out of the MSB during the subtraction (i.e., no borrow occurred for the unsigned subtraction).

Now, on line 2, we used B.LT instruction, meaning we want to


treat X9 and X10 as signed numbers. B.LT then will examine
condition codes N and V. In this example, N != V , so the
program will successfully jump to label L1 .

If we, however, use B.LO on line 2 instead, the instruction ignores


N and V, and only checks if C is equal to zero. In this example,
C is 1, so the instruction will not jump to L1 , and will keep
executing line 3 instead, which brings us to label L2 .

From this example, you can see that when running instructions that set condition codes, the machine has no idea if the operands are signed or unsigned at all. It’ll do its job and set the condition codes based merely on bit operations. It is our job to decide to treat them as signed or unsigned, and use instructions accordingly.

Figure 2.8: A loop is simply a backward branching. The figure shows the C code with goto , its flow chart, and two equivalent assembly versions:

C code:
    long int i = 0;
    Begin: if (i - 10 < 0) { goto Inc; }
           else { goto Out; }
    Inc:   i += 1;
           goto Begin;
    Out:

Assembly:
           MOV X9, 0
           MOV X10, 10
    Begin: CMP X9, X10
           B.LT Inc
           B Out
    Inc:   ADD X9,X9,1
           B Begin
    Out:

Assembly (another ver.):
           MOV X9, 0
           MOV X10, 10
    Begin: CMP X9, X10
           B.GE Out
           ADD X9,X9,1
           B Begin
    Out:

˛ Quick Check 2.5


(Fall 21 Quiz) Write an ARMv8 assembly code that sets x to y +
10, if y is equal to 1, otherwise sets x to y/8. Assume that x and
y are in X19 and X20 , and are treated as signed integers. Note:
you don’t need multiplication or division instructions to finish this
question at all.
B See solution on page 62.

2.5.3 Loops

You probably already figured out that loops are basically just branching,
but backward instead of forward. Let’s look at the following C code:

1 long int i = 0;
2 while (i < 10) i = i + 1;

Using a C keyword goto , we can rewrite the code as follows:

1 long int i = 0;
2 Begin: if (i - 10 < 0) goto Inc;
3 else goto Out;
4 Inc: i = i + 1;
5 goto Begin;
6 Out:

This way, we convert the loop into a conditional branch we’re familiar
with, and therefore the translation to assembly is straightforward. See Fig-
ure 2.8 for the assembly code.

However, there are some minor things in Figure 2.8 that are worth mentioning. This time, we didn’t use LDR in the beginning. This is fine; just assume we’re going to use X9 for variable i . Before the comparison, we also set X10 to 10, but you can certainly just do CMP X9,10 , or use SUBS instead, e.g., SUBS X0,X9,X10 . Lastly, we use B.LT because when we declare the long int type in C, it’s a signed number by default.

2.5.4 Arrays

Dealing with arrays almost always needs loops. Let’s start with the simplest case, i.e., character arrays (or strings), since each character takes only one byte.

2.5.4.1 Strings

Say we have five characters, "qwert" , stored in RAM. We want to add 2 to each of the characters, so that it becomes "sygtv" . The C code for this task is not difficult:

1 char str[] = "qwert";
2 int i = 0;
3 while (*(str+i) != 0) {
4     *(str + i) = *(str + i) + 2;
5     i ++;
6 }

where we used the fact that all strings should end with a null-terminator
0. Now let’s write the assembly for this task!

1. Analysis:
   From our C knowledge, we already know str is a pointer, and its value is the address of the first byte of the array. Therefore, str is the base address of the array. Because each character takes one byte, the index i of each character is also the offset of that character from the base address. Thus, (str+i) is the address of the i -th character.

   Register   Data
   X9         base address of str
   X11        index i
   W12        i -th element *(str+i)

2. Planning:
   Now, let’s assume the address of str is already loaded into X9 . In summary, we have the register usage shown in the table above.

The pseudocode or assembly algorithm should look like this:



[Figure 2.9: LDRB loads one byte into the lowest byte in a W -register. In the figure, X9 holds the base address of "qwert" and X11 holds the index 2; LDRB W12,[X9,X11] copies the byte 'e' ( 0x65 ) from memory into the lowest byte of W12 , zero-extending into the register.]

Algorithm 1: Pseudocode for iterating strings


1 Set X11 to zero for the counter i ⇒ MOV X11,0 ;
2 repeat
3 Load *(str+i) into W12 ⇒ LDRB W12,[X9,X11] ;
4 if W12 is zero then
5 Break the loop;
6 Add 2 to *(str+i) ⇒ ADD W12,W12,2 ;
7 Store W12 back to memory ⇒ STRB W12,[X9,X11] ;
8 Update X11 ( i++ ) ⇒ ADD X11,X11,1 ;
9 until Loop broken;

3. Assembly:
All the direct translations of assembly have been added to the pseu-
docode for clarity. Now the only thing left is to complete the loop:

1  MOV X11, 0               // index i = 0;
2
3  Loo: LDRB W12, [X9, X11] // Load one byte *(str+i);
4       // LDRSB is fine too;
5       CMP W12, 0          // compare *(str+i) and 0;
6       B.EQ End            // if(*(str+i)==0) goto End
7       ADD W12, W12, 2     // else W12 = W12 + 2;
8       STRB W12, [X9, X11] // Update *(str+i);
9       ADD X11, X11, 1     // i ++;
10      B Loo               // Branch back
11 End:

We also use Figure 2.9 to illustrate the step of LDRB in the assembly code.

Note integers are usually stored in W -registers, because each integer takes four bytes. To simplify, we use an X -register for index i , and it won’t cause a problem in this example. However, to move individual bytes, we have to use W -registers with LDRB and STRB instructions.

2.5.4.2 Larger Types

Strings are relatively easy, because each element of the array takes only
one byte, and the index of the array can thus be conveniently used as offset
from the base address. When a data type takes more than one byte, it needs
to be planned more carefully.
Now we’ll modify the example in the previous section with a long int
type array, where each element takes eight bytes. The task is still to add 2
to every element in the array:

1 long int arr[] = {30, -23, 100, 99, 10};
2 for (int i = 0; i < 5; i ++) {
3     *(arr + i) = *(arr + i) + 2;
4 }

Let’s follow the three steps again — analysis, planning, and assembly —
to translate this C code into assembly.

1. Analysis:
   From the previous example, we’re already fairly familiar with arrays, so no doubt arr is the base address of the array. What’s different here is that the index i cannot be used as the offset anymore. For a string, moving the index by one also means offsetting to the next byte, and therefore the index could conveniently double as the offset.

   In this example, however, each element takes eight bytes, meaning when we move on to the next element (say arr[i+1] ), we should skip the eight bytes of the current element ( arr[i] ) so that we can land on the first byte of the next element. If we still use i as the offset, we are actually only moving to the next byte, instead of the next element.

2. Planning:
   Based on the analysis above, we are clear that before using LDR or STR , we have to calculate the offset correctly. Given an index i , if each element takes eight bytes, the correct offset should be i * 8 . Thus, we have the planning for registers shown below.

   Register   Data
   X9         base address of arr
   X11        index i
   X12        i-th element *(arr+i)
   X13        offset i*8

   The pseudocode is very similar to the previous example, except that it needs an additional step to calculate the offsets correctly for each element. Also notice that this time, because we’re using the full eight bytes, we use LDR and STR instructions.

[Figure 2.10: Each long integer takes eight bytes, so the starting address of each element is 8*i where i is the index of the element. In this figure, LDR will copy eight bytes starting from the address pointed to by X9+X13 into X12 .]

Algorithm 2: Pseudocode for iterating long integers


1 Set X10 to the array length 5 → MOV X10,5 ;
2 Set X11 to zero for the counter i → MOV X11,0 ;
3 while X11 < X10 do
4 Calculate offset → LSL X13,X11,3 ;
5 Load *(arr+i) into register X12 →
LDR X12,[X9,X13] ;
6 Add 2 to *(arr+i) → ADD X12,X12,2 ;
7 Store X12 back to memory → STR X12,[X9,X13] ;
8 Update X11 (i++) → ADD X11,X11,1 ;

3. Assembly:
With the analysis and planning, translating to assembly is very straight-
forward:

1  MOV X11, 0                // index i = 0;
2  Begin: CMP X11, 5         // compare i and 5;
3         B.GE End           // if (i >= 5) goto End;
4         LSL X13, X11, 3    // offset = i * 8;
5         LDR X12, [X9, X13] // Load one double word;
6         ADD X12, X12, 2    // Add 2;
7         STR X12, [X9, X13] // Update *(arr+i);
8         ADD X11, X11, 1    // i ++;
9         B Begin            // Branch back
10 End:

Also see Figure 2.10 for an illustration.

˛ Quick Check 2.6


(Fall 21 Final) Assume there’s an array:
long int arr[] = {12, 34, 56, 78}
Write an assembly to store the numbers in the reverse order in an-
other array arr2 . Suppose arr is already in X9 , and arr2 in
X10 . You have to use a loop to complete the task.
B See solution on page 62.

2.6 Procedures

In this section we’re going to learn how to write procedures in assembly.

Before talking about procedures, let’s review the memory address space
first. Whenever we start a program, we assume that the program takes the
entire virtual memory space from address 0x00..0 to 0xFF..F in the
memory. At the bottom of this space, we have .text segment that stores
our assembly code and read only data (e.g., string literals). Going up, we
have .data segment that stores global variables. After .data segment
we also have .bss used for uninitialized data. These parts are basically
pre-determined during compilation time, meaning when generating an
object file, the layout and data in these segments are already arranged in
the object file.

The rest of the space will be used during run time, meaning they’re used
for storing and managing data that only occurs when the program is actu-
ally executing, such as local variables. The start of the area is called heap,
which is used for dynamic allocations during run time. The end of the
area is called stack, used for procedure calls. Since they occupy the two
ends of the empty space of the memory, they grow towards each other, or
towards the center. Figure 2.11 shows a bit more detailed visualization.
[Figure 2.11: Visualization of a virtual memory space for a program, from the stack bottom at 0xFFF...F down to 0x000...0 : stack (with SP , the stack pointer, marking its top), empty space, heap, read/write segment ( .data , .bss ), read-only segment ( .text , .rodata ), and an unused region. Our assembly code and global variables will be loaded to this space straight out of the executable file. During run time, the heap and stack grow towards each other as our program calls a procedure, or allocates space dynamically. For the stack, the bottom is actually at the high address, while the top is at the low address, so you can take it as an upside-down stack.]

Since we’re going to write procedures, we’ll naturally focus on the stack area, and see how calling or returning from a procedure manages the stack.

2.6.1 Runtime Stack

Through Data Structures you have learned that a stack is a First-In-Last-Out (FILO) structure, and the main operations for stacks are push and pop. Push is to add something to the top of the stack, whereas pop is to take the top element of the stack out.

The stack is also the structure for procedure calls. Intuitively, there are a lot of similarities in behavior between stacks and procedures. Let’s look at the following example.

Assume we have a C code with three functions called funx() (x = 1, 2, 3). main() calls fun1() , and fun1() calls fun2() and fun3() :

1 int main() {       1 void fun1() {
2     fun1();        2     fun2();
3     return 0;      3     fun3();
4 }                  4 }

Formally, fun1() is the caller of fun2() and fun3() , and these two functions are the callees of fun1() . Assume every time we call a procedure, we put a “block” that represents the callee at the top of a stack, and every time we return from a procedure, we take that block out. Figure 2.12 shows us a timeline from when fun2() was called to when fun1() returned to main() .

[Figure 2.12: If we treat function/procedure calls as stacking some “blocks”, the process of procedure calls looks really like pushing blocks to the stack. Across time stamps ❶–❻, fun1() ’s block sits on top of main() ’s, and the blocks for fun2() and fun3() are pushed and popped in turn. At point ❷, fun2() returned to fun1() , and therefore its block is removed from the stack top.]

Now think of the “blocks” as boxes where each procedure stores its own local variables. These boxes, called procedure frames, or simply frames, are exactly how the system manages stack space for procedure calls, as in Figure 2.11.

One question is then, are those frames created automatically when we call
a procedure? The answer is no. Frames are nothing but a designated area
on stack for procedure management; it is basically in our head. When
we call a procedure or return from a procedure, we have to manually
allocate/de-allocate the frame area on stack. We don’t need to do this in
high-level languages but in assembly we have to manage these areas.

2.6.2 Procedure Call Conventions

Simply put, procedures in assembly are just branches with return. Let’s
look at the following example.

1 Proc:   MOV X0, 0
2         MOV X1, 1
3         B Ret_Pt
4 _start: B Proc
5 Ret_Pt: MOV X1, 1
6 ...

In the code listing above, line 1 to line 3 can be treated as a procedure called
Proc . You can see it’s nothing special but a branch.

We start running our program from label _start , and branch to Proc .
This looks like we’re “calling” the procedure Proc . Then at line 3, we un-
conditionally branch to Ret_Pt , which is the next instruction of B Proc ,
and move on from there. This is like we are returning from the procedure
Proc .
2.6 Procedures 49

Question: what if we forgot to write line 3, and didn’t branch back to Ret_Pt ? What will happen?

This code is a very simple procedure call (though not complete), but it shows us the first essential element of a procedure call: the return address.

2.6.2.1 Return Address

What’s special about procedure is, after the callee finishes, the program
will return back to where it was called in the caller, and keep executing
from there. Since in assembly every instruction has an address, we need
to know where we should go back to from the procedure. In the example
above, instruction B Proc is the calling point of Proc , and so when
returning from the callee, its following instruction on line 5 is the return
point, which is labeled as Ret_Pt .

Using a label to mark the return address is an OK option, but not the best. If you have a billion functions, you’d have to create a billion labels for their return addresses. Too many labels make programs hard to follow. Plus, there are also portability issues.

A much better option is to use the BL instruction (branch and link). As the name suggests, it does two things: branch to a label, and link the return address. Using the BL instruction, we modify the example code:

1 Proc:   MOV X0, 0
2         MOV X1, 1
3         RET
4 _start: BL Proc
5         MOV X1, 1
6 ...

What the BL Proc instruction does is first store the address of its following instruction (i.e., the return address) into register X30 , and then copy the address labeled by Proc to PC, so that we can branch to the procedure Proc . Notice by convention, X30 (which can also be referred to as LR ) is called the link register, whose purpose is exactly to store the return address. Therefore, do not use X30 for other purposes.

The second change we made is on line 3, where we now use the RET instruction, which obviously stands for “return”. Because X30 is used by default for the return address, the RET instruction will simply copy X30 to PC, so that the next instruction executed is the one at the return address.

Ÿ Caution!
Note PC (program counter) and X30 / LR (link register) are two different registers. PC is automatically set by the machine or by instructions such as RET , while X30 can be modified by us manually, or automatically by using instructions such as BL . Ę Remember: BL changes X30 and PC, while RET only changes PC.

2.6.2.2 Passing Arguments and Return Values

The second essential element of a procedure is passing arguments/return values. In the code example above, we didn’t pass any arguments, but in practice it is quite common.

The quickest way to pass arguments is through registers, because it doesn’t involve reading/writing memory. Again, by convention, registers X0 to X7 are used to pass arguments. Of course there’s no hard requirement that you have to pass arguments through those eight registers, but portability/compatibility becomes the issue again, so we’ll just ask you to only use X0 to X7 to pass procedure arguments.

For return values, since usually one result is returned, we simply use X0
to store return value.

ą Example 2.8
(Fall 21 Homework) Write an assembly program that translates the
following C code:

1 long int a = 10;
2 long int b = 20;
3 long int c;
4 long addition(long int a, long int b) { return a + b; }
5 int main() {
6     c = addition(a, b);
7 }

Assume addresses of variables a , b , and c are stored in registers X9 , X10 , and X11 , respectively.

Solution: First step is still to write LDR instructions for variables


a and b . We don’t need to LDR variable c because its value is
not stored in the memory yet. We’ll use _start as the entrance
of the main procedure:

1 _start: LDR X0, [X9]  // Load a to X0
2         LDR X1, [X10] // Load b to X1

Before we branch to the procedure, we need to move arguments


we want to pass to registers X0 – X7 . In this example, luckily,
when we load the two variables they are already in X0 and X1 ,
so we don’t need to move them anymore. The only thing we need
is to branch to the procedure addition , so the next instruction
would be BL addition .

The instruction following this BL is where execution resumes after the callee returns; that’s also where we grab the return value from the callee. We’ll store this into variable c in memory, so a natural step is to use STR : STR X0,[X11] . In summary, the main procedure looks like this:

1 _start: LDR X0, [X9]  // Load a to X0
2         LDR X1, [X10] // Load b to X1
3         BL addition   // Call addition() function
4         STR X0, [X11] // store return value to c

The procedure addition should also be easy. Since we followed


the calling convention by passing two arguments through X0 and
X1 , the procedure can just use them directly. Also because the
return value is stored in X0 , we can just let the destination of ADD
instruction be X0 . Thus, we have:

1 addition: ADD X0, X0, X1 // a = a + b;
2           RET            // Return to the caller

Note that the placement of the main procedure and addition in your source code doesn’t matter; either can come first. In the above code we did not add appropriate terminating statements in the main procedure, so it could cause a problem if you put addition below STR . Please refer to Section B.1.1 in the appendix for more details on how to terminate a program.

2.6.2.3 Creating Frames

In the previous examples we didn’t create any frames for procedure calls, because the registers are enough to perform what we want. More often than not, though, registers are not enough to store all the data needed by procedures. Thus, the purpose of creating frames is to store procedure arguments and local variables.

Remember, however, that frames are merely a way of managing procedure calls; the computer system has no idea what a frame is, or where the frame boundaries are. That means it is our job to know where the frame is. So how?

Frames are just an area on stack in the memory, and again, everything in
the memory has an address. If, for a procedure call, we know where the
first and last bytes of its frame are at, we are able to draw a boundary and
claim something like, the area between xyz and abc is the frame area for
this procedure call.

Back to Figure 2.11, we see that there’s a stack pointer that points to the top of the stack, and this is exactly the “boundary” we need. The stack pointer stores the address of the current lowest byte of the stack, and lives in register SP (which is not one of the 32 general purpose registers). Note that this register can certainly be used for other purposes and the system doesn’t prevent you from doing so, but as a good practice and for the sanity of your code (and your mental health), do not use SP for anything other than storing the stack top address.

The other side of the “boundary” for procedure frames is the frame pointer, stored in X29 or referred to as FP . Thus, the bytes between FP and SP form the area we visualize as the procedure frame.

Are SP and FP automatically changed when we BL to a procedure? Sorry, nope! We have to change FP and SP on our own. In fact we only need to change SP , since it points to the stack top and is thus more important; changing FP is optional, as long as you remember the size (i.e., how many bytes) of the procedure frames.

ą Example 2.9
Let’s use this example to show how procedure frames work in de-
tail. We’ll start from a C code again:

1 void proc(long int rando) {
2     long int x = rando;
3     return;
4 }
5
6 long int fun(long int a, long int b) {
7     return a + b;
8 }
9
10 int main() {
11     proc(20);
12     long int y = fun(2,3);
13 }

The main procedure and fun() should be easy to produce at this


point:

1 fun:    ADD X0, X0, X1 // a + b
2         RET
3
4 _start: MOV X0, 20     // Pass 20 as parameter
5         BL proc        // Call proc(20)
6         MOV X0, 2      // Parameter 1 <- 2
7         MOV X1, 3      // Parameter 2 <- 3
8         BL fun         // Call fun(2,3)
9         STR X0, [X9]   // Store long int y,
10                       // assume &y is in X9

Note again, on line 9, we followed the calling convention that re-


turn value will be stored in X0 .
In proc() , we assign the argument rando passed to the proce-
dure to the local variable x . Because x is local, we have to store
it on stack, or more specifically, the procedure frame of proc()
on stack. So let’s create a frame for it now!

First thing: how many bytes do we need for the frame? Local variable x is of long int type, so we only need 8 bytes. Currently SP is pointing to the stack top. To create a frame for proc() on stack, we need to subtract from SP because, remember, the stack grows towards lower addresses. At the end, before we return, we need to “de-allocate” the frame, so we add the corresponding number of bytes back. Thus, we have the following framework:

1 proc: SUB SP, SP, 8        // Allocating frame:
2                            // SP = SP - 8
3
4       /* --- Main procedure body --- */
5
6       ADD SP, SP, 8        // De-allocating frame:
7                            // SP = SP + 8
8       RET

The procedure body is very simple, which is just to copy rando to


x . Because we followed the calling convention (very important!),
we know rando is already in X0 before calling proc() . To put
it together, we have the following for proc :

1 proc: SUB SP, SP, 8        // Allocating frame:
2                            // SP = SP - 8
3
4       STR X0, [SP]
5
6       ADD SP, SP, 8        // De-allocating frame:
7                            // SP = SP + 8
8       RET

Here we use SP as the base address and store X0 at SP .

˛ Quick Check 2.7


Translate the following C code into assembly, with correct calling
convention and procedure frame creation.

1 void init_array(long int length, long int val) {
2     long int arr[length] = {val};
3     return;
4 }
5
6 void init_num(int a, int b) {
7     int num = a + b;
8     return;
9 }
10
11 int main() {
12     init_array(20, -1);
13     init_num(10,20);
14     exit(0);
15 }

B See solution on page 62.

2.6.2.4 Leaf and Non-Leaf Procedures

In the example above, both proc() and fun() are called leaf proce-
dures, because they didn’t call any other procedures. The name could be
more intuitive if you think procedure callings as a tree structure, where a
callee is the caller’s child. If a procedure calls other procedures, the caller
will be referred as non-leaf procedures, and this is where we need to be
careful.

Suppose we have changed the C code to the following:

1 void proc() {
2 fun();
3 return;
4 }
5

6 void fun() {
7 return;
8 }
9

10 int main() {
11 proc();
12 }

In this case, proc() becomes a non-leaf procedure. What’s special about non-leaf procedures?

The following assembly code is our first try:

1 proc:   BL fun    // call fun();   0x1000
2         RET       // return;       0x1004
3
4 fun:    RET       // return;       0x1008
5
6 _start: BL proc   // call proc();  0x100C
7         MOV X0, 0 //               0x1010

This is very simple and intuitive, because it’s a line-by-line translation of the C code. What could go wrong then? Let’s start walking through the instructions from _start . We also add each instruction’s address in memory in the comments.

On line 6 where we executed BL proc , PC will be modified to the address


of the first instruction in target proc , which is 0x1000 . Meanwhile, the
return address is automatically stored into X30 , the link register.

We then proceed to proc() on line 1. The instruction BL fun will mod-


ify PC to the address of the first instruction of fun() , which is 0x1008 ,
and save the return address 0x1004 to X30 . Here’s the problem! You
can see at this point, X30 is overwritten. Surely we can correctly return
from fun() — RET on line 4 will copy X30 (currently 0x1004 ) to PC
so that we can go back to the caller proc() , but what about returning
to main() ? We execute RET on line 2, which was supposed to take us
back to line 7, but currently X30 does not point to 0x1010 — it’s still
0x1004 . We’re stuck in here.

The lesson learned here is, the machine doesn’t remember whatever value
you had in a register. If you overwrite X30 , you overwrite it, and execut-
ing RET doesn’t bring the old value back.

To solve this, we have to avoid overwriting X30 . Since X30 is automat-


ically set by BL instruction, the only way for us is to store X30 in the
frame of the caller’s procedure on stack before using BL , and load it back
to X30 before RET .

The following code is the correct version:

1 proc:   SUB SP, SP, 8  // Create a frame to store X30
2         STR X30, [SP]  // Store X30 in the frame
3
4         BL fun         // Call fun()
5
6         LDR X30, [SP]  // Restore X30 back
7         ADD SP, SP, 8  // De-allocate frame
8         RET            // Return
9
10 fun:   RET            // Return
11
12 _start: BL proc       // Call proc()
13        MOV X0, 0      // Random instruction

We only stored X30 in the frame of proc() , because it’s a non-leaf procedure whose return address is at risk of being lost. For fun() , since it’s a leaf procedure, we didn’t store its X30 .

˛ Quick Check 2.8


(Fall 22 Midterm 1) Translate the following C procedure into
assembly with correct calling convention and procedure
frame creation. Note you can assume long int dog() and
long int bunny() have already been implemented properly
with correct calling convention. No comments needed.

1 long int cat(long int x, long int y) {


2 if (x < y) return dog();
3 else return bunny();
4 }

Note: long int cat() itself is also a procedure – it needs to fol-


low calling convention as well.
B See solution on page 64.

2.6.2.5 Resolving Register Usage Conflicts

It is very common to have multiple nested procedure calls and even to call existing libraries. Because assembly is so simple, it doesn’t do any memory management or computing resource management for us, so we have to be careful ourselves.

One very important problem is register conflicts. We can call as many procedures as we like, but there’s only one set of registers. When the caller and callee want to use the same registers, we have to resolve the conflicts. From previous sections, we saw that X30 is a great example, where both procedures need it for the return address. We resolved it by storing it on stack, and loading it back. This is the strategy we’re going to use very often.

To avoid conflicts, both the caller and the callee are responsible for saving
registers.

 Callee: the callee needs to store the callee-saved registers, X19 to X30 , on the stack, and restore SP before returning. Because the caller expects these registers to be unchanged after a procedure call, the callee needs to restore (meaning, LDR back from memory to registers) these values before return;
 Caller: the caller needs to save caller-saved registers on the stack, because there's no guarantee that the callee will not change the values of these registers. All the registers that are not callee-saved registers are caller-saved registers.

The following table shows us the work both the caller and the callee need
to do around the point of procedure call and return.

Of course we don’t have to save all the 19 caller-saved registers every time
we branch; we just need to save those that we want to use later.

Caller Callee
Before branching Save X0 ... X18 ×××
After entering callee ××× Save X19 ... X29 , X30
Before return ××× Restore X19 ... X29 , X30
After returning to the caller Restore X0 ... X18 ×××

2.6.2.6 Summary

Writing procedures needs careful planning, so we’ll make a summary here


to show general steps we can take when writing a procedure.

1. Calculate the procedure frame size ( frame_size ), and subtract it


from SP : SUB SP,SP,frame_size ;
2. Store registers on stack:
a) For non-leaf procedures, store X30 : STR X30,[SP,offset] ;
b) If the current procedure needs to change any values in X19 ... X29 ,
store them in the frame as well;
3. Do the procedure thing;
4. Put return value back to X0 ;
5. Restore stored registers in step 2: LDR X30,[SP,offset] , etc;
6. Add frame_size back to SP : ADD SP,SP,frame_size ;
7. RET back to the caller.

2.6.3 Recursive Procedures

The best way to illustrate the importance of following the calling conven-
tion is to examine recursive procedures. Let’s start with a very simple
task:

long int facto(long int num) {
    if (num == 1) return 1;
    else return num * facto(num - 1);
}

int main() {
    long int x = facto(3);
    exit(0);
}

where we use facto() to calculate a factorial recursively.

First thing to notice is facto() can be both a leaf and a non-leaf procedure: when num == 1 it's a leaf procedure, otherwise it's non-leaf. To make things consistent, we treat this procedure as non-leaf. To create a frame, we need to calculate the frame size. There are no local variables in the procedure, so the only thing we need to save on the stack is X30 , and the frame size is thus eight bytes.

We write the following framework for facto() including the base case, which is the simplest. Due to the calling convention, we know that num is stored in X0 , so we just need to compare it with constant 1. If they are equal, we just need to return 1. Because the return value goes in X0 and X0 is already 1, we don't need to do anything but restore X30 and return.

facto:  SUB SP, SP, 8
        STR X30, [SP]

        /* Base case */
        CMP X0, 1
        B.EQ _end

        /* Recursive case */

_end:   LDR X30, [SP]
        ADD SP, SP, 8
        RET

Now let’s fill in the recursive case. We need to call facto() and pass
num-1 . Currently num is in X0 , so naturally we just do SUB X0,X0,1
and BL facto . The problem is, when we return back from the callee and
need to do multiplication, X0 has been changed by the callee. Therefore,
we need to save X0 somewhere. Can we use registers? No, because all
the calls to facto() share the same set of registers, and it’ll inevitably be
overwritten. Therefore, the answer is stack, because every call has its own
stack frame.

In fact, you can see because num is used both for multiplication and for
passing new argument to recursive calls, it can be viewed as a local vari-
able as well. Remember local variables are also needed to be stored on
stack.

The following is the correct code with comments:

facto:  SUB SP, SP, 16    // 8 bytes for X30, and
                          // 8 bytes for num
        STR X30, [SP]     // Store X30

        /* Base case */
        CMP X0, 1         // if (num == 1)
        B.EQ _end         //     goto _end;

        /* Recursive case */
        STR X0, [SP, 8]   // Store num to [SP+8]
        SUB X0, X0, 1     // X0 = num - 1
        BL facto          // call facto(num - 1)
        LDR X1, [SP, 8]   // Restore num to X1
        MUL X0, X0, X1    // X0 = X0 * X1
                          //    = facto(num-1) * num

_end:   LDR X30, [SP]     // Restore X30
        ADD SP, SP, 16    // De-allocate frame
        RET               // return;

˛ Quick Check 2.9


Briefly answer the following questions on calling conventions:

1. How do we pass arguments into procedures?


2. How are values returned by procedures?
3. What is SP and how should it be used in the context of pro-
cedures in ARM assembly?
4. Which values need to be saved by the caller, before jumping to
a procedure using BL ?
5. Which values need to be restored by the callee, before return-
ing from a procedure?
6. In a bug-free program, which registers are guaranteed to be
the same after a function call? Which registers aren’t guaran-
teed to be the same?

B See solution on page 65.

2.6.4 Reference

For detailed procedure call standards for ARMv8 architectures, there’s no


reference better than the official documentation released by ARM. Please
refer to file Procedure Call Standard for the ARM⃝
R
64-bit Architecture (AArch64)
here: https://github.com/ARM-software/abi-aa/releases.

2.7 Quick Check Solutions

Quick Check 2.1

1. True or False and why: assume X0 points to the start of an array


called arr . LDR X1,[X0,3] will load arr[3] to X1 ;
 False. We list the correct instructions for all types of arrays:
• char : LDRSB W1, [X0, 3] ;
• unsigned char : LDRB W1, [X0, 3] ;
• short int : LDRSH W1, [X0, 6] ;
• unsigned short int : LDRH W1, [X0, 6] ;
• int (both signed and unsigned): LDR W1, [X0, 12] ;
• long int (both signed and unsigned): LDR X1, [X0, 24] .

2. Assume we have an array of long integers long int arr[3] , and


the value of arr is stored in register X9 . Write a sequence of
ARMv8 instructions to move all elements in the array from memory
to register X10 and move it back, one at a time.

 Notice arr is the name of an array, so its value is the base address of the array. Because the array is of long int type, each element takes eight bytes. Therefore we have the following solution:

LDR X10, [X9]      // Load arr[0] to register
STR X10, [X9]      // Store arr[0] back to memory
LDR X10, [X9, 8]   // Load arr[1]
STR X10, [X9, 8]   // Store arr[1]
LDR X10, [X9, 16]  // Load arr[2]
STR X10, [X9, 16]  // Store arr[2]

Quick Check 2.2

1. What’s the value of X9 after each of the following instruction in


hexadecimal?

MOV X9, -9
MOV W9, 10

 After the first instruction, X9 is 0xfffffffffffffff7 ; after


the second instruction, X9 is 0xA , because the highest 32 bits
will be zeroed out.
2. Assume X9 has a value of 0xFFFF101029AB00FE , while X10 has
a value of 0x33F59077F2FFABCD . What are the values of X9 and
X10 after the following instruction: MOV W10, W9 ?

 X10 is 0x29AB00FE since MOV with W registers will zero out the high 32 bits. X9 is 0xFFFF101029AB00FE .
3. Which of the following instructions are invalid and why?
a) MOV W10, WZR
 Valid. Afterwards W10 will be zero.
b) MOV X20, [X22,10]
 Invalid. MOV instruction can only copy either immediate
number or another register to a register; it cannot directly
move data from memory to a register.
c) MOV 382, X20
 Invalid. The destination cannot be an immediate.
d) MOV XZR, 382
 Valid. However, XZR will stay zero.
e) MOV 382, 392
 Invalid. The destination cannot be an immediate.

Quick Check 2.3

(Fall 21 quiz) Write the corresponding ARMv8 assembly code for the fol-
lowing C statement. Assume that the variable f is in register X20 , and

the base address of array A is in register X21 . A is an array of integers.


A[0] = f + A[5];

 Each element is an integer, so it takes four bytes. The correct offset


for A[5] therefore is 20. Solution:

LDR W22, [X21, 20] // Load A[5] to W22 (A holds 4-byte integers)
ADD W23, W22, W20  // W23 = f + A[5]
STR W23, [X21]     // Store W23 back to A[0]

Quick Check 2.4

For the following assembly program, write the values of condition codes
N , Z , C , and V after the execution of every instruction. Assume at
beginning all four codes are cleared.

MOV X9, 9
MOV X10, 10
CMP X9, X10
ADD X9, X9, 10
SUB X10, X10, 3
CMP X9, X10
CMP X10, X9

 There are several keys to this question. First, remember not all in-
structions set condition codes. In this question, only CMP will change
the condition codes. For instructions that do not change condition
codes, the values of N , Z , C , and V will stay the same. Second,
we need to be clear about binary operations to know which instruc-
tion might set which codes.

 Thus, we have the following solution:

N Z C V
MOV X9,9 0 0 0 0
MOV X10,10 0 0 0 0
CMP X9,X10 1 0 0 0
ADD X9,X9,10 1 0 0 0
SUB X10,X10,3 1 0 0 0
CMP X9,X10 0 0 1 0
CMP X10,X9 1 0 0 0

 One thing to notice is CMP X9,X10 changed C to 1. What that instruction does is to subtract X10 from X9 , which is 19 − 7 = 12. The machine operates it as 19 + (−7). The binary for 19 is 0b00...10011 , while for −7 it is 0b11...11001 . If you add them together, you'll see there's a carry out of the most significant bit.

Quick Check 2.5

(Fall 21 Quiz) Write an ARMv8 assembly code that sets x to y + 10 , if


y is equal to 1, otherwise sets x to y/8 . Assume that x and y are in
X19 and X20 , and are treated as signed integers. Note: you don’t need
multiplication or division instructions to finish this question at all.

 The following is a possible version, as there are many other ways to write it. Note y/8 is computed here as arithmetic shifting right by 3 bits, y >> 3 , because y is treated as a signed integer. (Strictly speaking, for negative values not divisible by 8 the arithmetic shift rounds toward negative infinity, while C division truncates toward zero.) If y were treated as an unsigned integer, logical shifting right would be used instead.

      MOV X0, 1
      CMP X20, X0      // Compare y and 1
      B.NE Else        // if (y != 1) goto Else
      ADD X19, X20, 10 // x = y + 10
      B End
Else: ASR X19, X20, 3  // y/8 == y >> 3
End:

Quick Check 2.6

(Fall 21 Final) Assume there’s an array: long int arr[] = {12,34,56,78} .


Write an assembly to store the numbers in the reverse order in another ar-
ray arr2 . Suppose arr is already in X9 , and arr2 in X10 . You have
to use a loop to complete the task.

 The following is a possible answer. Here we calculate the offsets di-


rectly without using an index first. The stopping condition is when
X1 — the offset for arr2 — is smaller than 0.

      MOV X0, 0        // offset for arr
      MOV X1, 24       // offset for arr2

Loop: LDR X2, [X9, X0]
      STR X2, [X10, X1]
      ADD X0, X0, 8
      SUB X1, X1, 8
      CMP X1, XZR
      B.GE Loop        // keep looping until X1 < 0
End:  ...

Quick Check 2.7

Translate the following C code into assembly, with correct calling conven-
tion and procedure frame creation.

void init_array(long int length, long int val) {
    long int arr[length] = {val};
    return;
}

void init_num(int a, int b) {
    int num = a + b;
    return;
}

int main() {
    init_array(20, -1);
    init_num(10, 20);
    exit(0);
}

 The key to solving this problem is calculating the correct frame size
for each procedure. init_array() needs to host an array of length
long integers, so the frame size is 8*length bytes. init_num()
has a local variable of type int , so the frame size should be 4 bytes.
Also notice that both parameters of init_num() are 4-byte integers,
so we need to operate on W -registers.

init_array:
        LSL X2, X0, 3     // X2 = total bytes of arr
                          //    = frame size
        SUB SP, SP, X2    // Allocate frame

        MOV X3, 0         // X3 = index
loop:   CMP X3, X0        // compare index & length
        B.EQ exit         // if (length == index)
                          //     goto exit;

        LSL X4, X3, 3     // X4 = offset = index*8
        STR X1, [SP, X4]  // Store X1 (val) to arr[index]
        ADD X3, X3, 1     // X3 ++;
        B loop            // goto loop;

exit:   ADD SP, SP, X2    // De-allocate frame
        RET               // return;

init_num:
        SUB SP, SP, 4     // Allocate frame

        ADD W2, W0, W1    // W2 = a + b;
        STR W2, [SP]      // num = W2;

        ADD SP, SP, 4     // De-allocate frame
        RET               // return;

_start: MOV X0, 20        // arg1 = X0 = 20;
        MOV X1, -1        // arg2 = X1 = -1;
        BL init_array     // call init_array();

        MOV W0, 10        // arg1 = W0 = 10;
        MOV W1, 20        // arg2 = W1 = 20;
        BL init_num       // call init_num();

        MOV X0, 0         // arg1 = X0 = 0;
        MOV X8, 93        // syscall #93
        SVC 0             // sys call

Quick Check 2.8

(Fall 22 Midterm 1) Translate the following C procedure into assembly


with correct calling convention and procedure frame creation. Note you
can assume long int dog() and long int bunny() have already been
implemented properly with correct calling convention. No comments needed.

long int cat(long int x, long int y) {
    if (x < y) return dog();
    else return bunny();
}

Note: long int cat() itself is also a procedure – it needs to follow calling convention as well.

 Note that cat() is a non-leaf procedure without local variables,


so we need to allocate 8 bytes on the stack to store X30 . Here’s a
possible solution:

cat:      SUB SP, SP, 8   // Allocate 8-byte
                          // frame for cat()
          STR X30, [SP]   // Store X30
          CMP X0, X1      // Compare x - y
          B.LT call_dog   // if (x < y)
                          //     goto call_dog;
          BL bunny        // else bunny();

exit_cat: LDR X30, [SP]   // Restore X30
          ADD SP, SP, 8   // De-allocate
          RET             // Return

call_dog: BL dog          // Call dog();
          B exit_cat      // goto exit_cat;

 We created a label call_dog , because remember B.LT is not a


procedure call instruction: it doesn’t save return address to X30 .
Therefore, we have to use a separate label to call the procedure.

 Notice in the C code, the return values of dog() and bunny() are
also used as return value for cat() . After returning from dog()
and bunny() , the return value is already stored in X0 due to call-
ing convention, so we can just restore X30 , deallocate the frame,
and return from cat() .

Quick Check 2.9

Briefly answer the following questions on calling conventions:


1. How do we pass arguments into procedures?
 We put the first 8 arguments in registers X0 – X7 ; any additional arguments need to be put on the stack with an alignment of eight bytes.
2. How are values returned by procedures?
 Return value is stored in register X0 .
3. What is SP and how should it be used in the context of procedures
in ARM assembly?
 SP stores the address of the current stack top. When entering a procedure, we need to subtract a multiple of 16 to make the call frame for that procedure. When leaving, we need to "de-allocate" the frame by adding the bytes back to SP .
4. Which values need to be saved by the caller, before jumping to a procedure using BL ?
 Caller saves any registers from X0 to X18 as necessary.
5. Which values need to be restored by the callee, before returning from
a procedure?
 Callee saves registers X19 – X29 and specifically X30 , the re-
turn address.
6. In a bug-free program, which registers are guaranteed to be the same
after a procedure call? Which registers aren’t guaranteed to be the
same?
 Registers X19 – X30 will be the same because they are callee
saved. All other registers could be different.
3 Microprocessor Design

3.1 Fundamental of Logics . 67
3.2 From Assembly to Machine Code . 78
3.3 A Single-Cycle Datapath . 82
3.4 A Pipelined Datapath . 92
3.5 Hazards . 98
3.6 Performance Evaluation . 111
3.7 Quick Check Solutions . 114

The two most essential components of a microprocessor are registers and the arithmetic logic unit (ALU), corresponding to two distinct functionalities: storing temporary data, and performing arithmetic calculations. In this chapter, we will start with logic fundamentals to look at small electronic devices used for building up a microprocessor, and then design a simple microprocessor that can execute a sequence of assembly instructions. After discussing the flaws of the design, we improve our model in terms of efficiency as well as preventing errors.

3.1 Fundamental of Logics

3.1.1 Logic Gates

Here we introduce four simple gates: and , or , xor (exclusive or), and not . Each of them takes one input or two, and produces one output. 1 We can build much more complicated logic using only these gates. 2

1: Sometimes you will see gates with more than two inputs. These are simply stacked gates. For example, if an and gate has three inputs a , b , and c , the output is calculated by a&b first, and then &c , i.e., (a&b) & c .

2: Fun fact: Charles Sanders Peirce and Henry M. Sheffer both showed that all gates can be constructed by just NOR gates alone (or NAND).

[Figure: the gate symbols, producing a & b (and), a | b (or), !a (not), and a ^ b (xor).]

In the figure above, each wire is either zero or one. In other words, each
wire transfers one bit.

One thing we need to remember is those gates (almost) instantly respond to signal changes. Assume at one moment, we have 1 (high voltage) and 0 (low voltage) passing through the and gate, and the output is 0 . If at some point we change the low voltage input to high voltage, the output will instantly change to 1 . 3

3: Strictly speaking, there's a tiny teeny amount of delay, but that amount of time can be neglected without any problem for now.

In Figure 3.1, we show a timing diagram where a&b changes as a and b change their values as the inputs of the and gate. At timestamp ❶, input a has high voltage (consider it as signal 1 ) while b has low voltage (as 0 ), and therefore a&b is 0 . At the end of ❷, input b raised its voltage to 1 , so a&b started reflecting this change to 1 . The small amount of delay between ❷ and ❸ is called rising delay. Similarly, the delay between ❹ and ❺ is called falling delay. This amount of delay is almost negligible, so we roughly consider the change of the output of these gates to be almost instant.

[Figure 3.1: a and b are the two inputs of the and gate, and a&b is the output. The and gate constantly and almost immediately reflects the change of the input, with a small amount of delay which is negligible.]

[Figure 3.2: On the left, signal a has been branched into two; on the right, a and b are separate signals without relations, indicated by a line hop.]

Line Notations

You probably already know this but just a small refresher — as in Figure 3.2 left, the small dot means the two wires come from the same wire, so they have the same signals. On the right, the two wires have no relations as there's a "line hop".

3.1.2 Combinational Logic

Now that we have the fundamental gates, we can use them to build more complicated logics. When we combine these gates and the signals flow in one direction without loops, it's called a combinational logic. Because there are no loops in combinational logics, the signal changes are almost immediate, and will respond to input changes constantly.

3.1.2.1 Comparison

Let’s create a very simple combination logic, where we compare if two bits
are the same. If they are the same, we output a signal of 1 ; otherwise a
0 . The combinational logic for bit equality is shown in Figure 3.3. Both
a and b are input, and using a truth table we can easily verify that when
a==b , output is 1 ; otherwise it’s 0 .
a a&b
Now imagine we want to compare if two double words (64 bits) are equal.
One way to do this is that for each second, we pass one bit of each number
!a to the bit equality logic as inputs, and take notes on the output. After
b !a&!b one minute and four seconds of waiting, we got 64 bits of outputs. If all
!b
of them are 1 ’s, we know these two numbers are the same; otherwise it’s
(a&b)|(!a&!b)
different. A major flaw of this method is we have to wait for 64 seconds! So
Figure 3.3: The combinational logic for why don’t we compare all 64 bits at a time? Figure 3.4 shows this idea.
comparing two bits. When the two bits
are equal, the output is 1; otherwise it’s
0.
In Figure 3.4, we use a[i] to denote the i -th bit of the double word a .
On the left, the two sets of parallel wires transfer a and b , one bit each
3.1 Fundamental of Logics 69

| <latexit sha1_base64="KXtoaDp0L9PNRTA1TjqbPOoQqAY=">AAACInicbVDLSgMxFM3Ud31VXboJFsFVmRGfO9GNSwWrhU4pmcytDc0kQ3JHLcN8ixt/xY0LRV0Jfoxp7UKtB0IO59ybm3uiVAqLvv/hlSYmp6ZnZufK8wuLS8uVldVLqzPDoc611KYRMQtSKKijQAmN1ABLIglXUe9k4F/dgLFCqwvsp9BK2LUSHcEZOqldOQwzFYOJDOOQh12bDu5dSIqinYcId5jHOnOP0VttYjpUEHNWOL9S9Wv+EHScBCNSJSOctStvYax5loBCLpm1zcBPsZUzg4JLKMphZsGN77FraDqqWAK2lQ9XLOimU2La0cYdhXSo/uzIWWJtP4lcZcKwa/96A/E/r5lh56CVC5VmCIp/D+pkkqKmg7xoLAxwlH1HGDfC/ZXyLnNxoUu17EII/q48Ti63a8Febed8p3p0PIpjlqyTDbJFArJPjsgpOSN1wsk9eSTP5MV78J68V+/9u7TkjXrWyC94n18GXaZ1</latexit>
a[63]
double word a

... eq[63]
{z

b[63]
}

...
a[1] Eq

...
eq[1]
b[1]
Figure 3.4: The parallel sets of wires on
the left are called buses, where each wire
a[0] transfers one bit of data, and all the wires
<latexit sha1_base64="dYRhVRGrdYApVtyZZMOtrv9VbG8=">AAACInicbVDLSgMxFM3Ud31VXboJFsFVmRGfO9GNSwWrhU4pmcytDc0kQ3JHLcN8ixt/xY0LRV0Jfoxp7UKtB0IO59ybm3uiVAqLvv/hlSYmp6ZnZufK8wuLS8uVldVLqzPDoc611KYRMQtSKKijQAmN1ABLIglXUe9k4F/dgLFCqwvsp9BK2LUSHcEZOqldOQwzFYOJDOOQh12bDu5dSIqinYcId5jHOnOP0VttYjpUEPOocH6l6tf8Ieg4CUakSkY4a1fewljzLAGFXDJrm4GfYitnBgWXUJTDzIIb32PX0HRUsQRsKx+uWNBNp8S0o407CulQ/dmRs8TafhK5yoRh1/71BuJ/XjPDzkErFyrNEBT/HtTJJEVNB3nRWBjgKPuOMG6E+yvlXebiQpdq2YUQ/F15nFxu14K92s75TvXoeBTHLFknG2SLBGSfHJFTckbqhJN78kieyYv34D15r977d2nJG/WskV/wPr8AB+Smdg==</latexit>
|
double word b

transfer data at the same time. To make


...

the graph clearer, we made lines from in-


{z

eq[0] put b blue. Each pair of input bits uses


b[0] the bit equality logic in Figure 3.3 to com-
}

pare.

wire. These two are the buses. Moving towards right, we see each bit
from each number is directed to a corresponding bit-equality logic, and
produces the output eq[i] . Then all the outputs from 64 bit-equality
logics are pushed into a final and gate where a single bit Eq (either 1
or 0 ) is generated.

3.1.2.2 Selection

Another simple but important combinational logic is to select one of two (or multiple) inputs as an output. Consider it like a faucet where we can get hot and cold water. The "inputs" are hot and cold water, but we can only have one output, and it's either one of them. The logic that chooses one of the inputs as the output is called a multiplexer (usually denoted as MUX ). Let's start with the bit multiplexer.

In Figure 3.5, we show the logic of a bit-multiplexer. In addition to input signals a and b , we also need to introduce a control signal s . This control signal is like a switch that controls which input can pass through the multiplexer and become the output. Usually, when a control signal is 1 , we say the signal is asserted or set; otherwise deasserted or clear. By using a truth table, it's not difficult to notice the output (!s&a)|(s&b) is b when s == 1 ; otherwise it's a .

[Figure 3.5: The combinational logic for selecting one of the inputs as the output. Input s acts as a "switch", or "control". When s == 1 (asserted), input b passes through the multiplexer; when s == 0 (deasserted), input a passes through.]

Notice in this multiplexer, technically we have three inputs: a , b , and s , but we call a and b the input in the sense of "data input", while s is the "control signal input". In the future, we will usually use "input" to refer to data input, but its actual meaning should be clear based on context.

We show the logic for selecting one of the two double words in Figure 3.6. Similar to the equality logic, here for each bit of the input, we use a multiplexer to select. One thing we need to notice is in this example, the control signal has the same value for all the bits of each double word, because we want to select all bits of a double word, so they should have the same control signal. The output, instead of being just one bit, contains 64 bits, and they are equal to either a or b , depending on the control signal s .

[Figure 3.6: The inputs a and b, as well as the output, are all 64-bit double words. The same control signal controls all the bits of an input, which makes the logic choose one of the inputs for every bit, and thus choose one double word to pass through. The wires for input b and its corresponding control signal wires are highlighted.]

In Figure 3.6, we only need one bit of control signal. In some cases we
need more than one bit, and thus all the control signals together form a
control bus.

N-Way multiplexer

If we need to select more than two inputs, we'd need to create an 𝑁-way multiplexer. Notice that when there are 𝑁 inputs, one bit of control signal is apparently not enough. For example, if 𝑁 = 4, we'd need two-bit control signals, so that 00 , 01 , 10 , and 11 will each choose one of the inputs as the output. In general, the number of control signal bits can be calculated as

𝑆 = ⌈log₂ 𝑁⌉    (3.1)

where ⌈𝑥⌉ is the ceiling function that takes the nearest integer above 𝑥.

3.1.2.3 Arithmetics

Recall back in Section 1.2.1.2 we introduced the Arithmetic Logic Unit (ALU), the core part of the processor, as it does all the calculations. One of the operations an ALU can perform is addition, i.e., adding two inputs together. Since the ALU itself is also combinational, in this section we'll see how we can use those simple logic gates to build a logic that can perform addition operations.

The logic for the adder is also built on logic operations. Without thinking about how the gates are organized, let's start with a simple truth table first. Assume the two inputs are x and y . When doing addition, we need to add a carry-in flag, called cin . Since both x and y are one-bit data, if the addition result takes two bits, say z[1]z[0] , the leading bit z[1] will be the carry-out flag cout , while the last bit z[0] is the addition result, denoted as s .

x  y  cin | cout  s
0  0   0  |  0    0
0  0   1  |  0    1
0  1   0  |  0    1
0  1   1  |  1    0
1  0   0  |  0    1
1  0   1  |  1    0
1  1   0  |  1    0
1  1   1  |  1    1

Once the input and output are determined using the truth table, the rest of the work is simply to design a combination using all the possible gates, to make sure it can produce the correct output given each input. As long as it can produce correct values, any combination is valid, though we surely favor simpler designs. Figure 3.7 is one of the possible designs, where we use an xor gate.

Similar to how we built up a 64-bit multiplexer, when there are two double words a and b , we align the bits of the two numbers, and send a pair of bits into a bit adder. See Figure 3.8. What's different here is, from LSB to MSB, the carry flag from bit i will also be used as the carry flag for bit i+1 , just like calculating by hand. The first carry flag can simply be zero, and the carry flag produced by the MSB, i.e., cout[63] , will be sent to the CPSR's carry flag. (Still remember condition codes?)

3.1.3 Sequential Logic

When introducing combinational logic, we are clear that it doesn’t contain


any cycles: the output of a gate will not become its input at some point. We
also emphasized that combinational logic will constantly change accord-
ing to its input signals. The question is, how can we store data, instead
of simply calculating data? In this section we introduce sequential logic,
which exactly serves this purpose.

Figure 3.9 shows us a very simple sequential logic, called a bistable element, which we compare with a similar combinational logic. On the left, the logic is acyclic with two outputs, p and q , and an input signal in . As shown in the timing diagram on the right, when we give in a high voltage pulse, both p and q respond to the input with a small amount of delay, but soon change back. Therefore, the input in is temporary, and so are the outputs. What if we want the outputs to stay where they are? This means we want to store the outputs.

cin
y s

cout
Figure 3.7: Bit adder, where input y is
marked as red, x as blue, the carry-in sig-
nal cin as black.
72 3 Microprocessor Design

Figure 3.8: Full 64-bit adder using the bit adder design from Figure 3.7. From bit 0 to bit 62, each adder's carry-out flag cout[i] will be used as carry-in flag for bit i+1's adder. The final carry-out, cout[63], is the C flag.

On the right of Figure 3.9, we added a branch from q back to the input
of p, making a loop. This is not combinational anymore; it's sequential
instead. In the timing diagram, we see that even if we give a small and
temporary high voltage to in, both q and p keep the change after the
signal from in disappears. This is because of the loop: the output of q
serves as the input of p, so it doesn't rely on the signal in anymore.

Also notice p and q respond to in with a more noticeable delay, which
is called propagation delay.

For a real-life example, think about a faucet again. The combinational logic
is like an automatic faucet that can sense your hands: if your hands are
close, there's water; if you take your hands away, it stops. The faucet
constantly responds to our "input", the hands. What if we want the water
to keep flowing? We just change to a regular faucet, where we only need to
turn on the water manually once, and it'll keep flowing until we manually
turn it off. In this scenario, the regular faucet only responds to our "input"
once, and keeps the change.

This bistable element is very simple but shows an important idea of sequential
logic. One flaw of the circuit in Figure 3.9, however: how can we change the
outputs p and q back?
3.1 Fundamental of Logics 73


Figure 3.9: The left shows a combinational logic, whereas the right shows a sequential logic. In the combinational logic, both outputs p and
q respond to the change of input in almost instantly, and thus we are not able to “store” the output. In the sequential logic, however, one
temporary change in the input in will trigger the permanent change in the outputs, making them stay, and thus to be “stored”.

3.1.3.1 SR Latch

SR latch, or Set-Reset latch, allows us to change the output anytime, as
shown in Figure 3.10. Here we have two inputs, S and R, representing
"set" and "reset" respectively. The outputs are denoted by Q+ and Q-.
It's not difficult to notice that Q+ and Q- always have opposite values. 4

4: Note Q- always outputs the negation of Q+, so we'll simply ignore it.
In the timing diagram in Figure 3.10, we give S a temporary high voltage
at timestamp (1) and thus the output Q+ is changed to 1. At timestamp
(2), if we want to change Q+ back to 0, we can give R a high voltage (i.e.,
“reset” the value of Q+ ). All the changes in the outputs are permanent,
unless we trigger either S or R .

SR latches have four states, three of which are shown as the three timestamps
in Figure 3.10: (1) setting, (2) resetting, and (3) latched or stored.
The fourth state, where we give high voltage to both S and R, is called
metastable, and usually causes errors.
We can write a truth table for the SR latch as well, but it's a little bit
different: we let q denote the actual value (either 1 or 0) of the output Q+.
See the table below:

S R | Q+ Q- | State
0 0 | q  !q | Latched (Stored)
0 1 | 0  1  | Resetting
1 0 | 1  0  | Setting
1 1 | –  –  | Metastable/Error

As we see, unlike the truth tables we are familiar with, where all the outputs
have determined values of either 0 or 1, the latched state has q and !q in it.
This is because the output depends on its actual value before switching to this
state, instead of on the inputs S and R.

Figure 3.10: A simple SR latch and an example of a timing diagram. Time (1) is the state of setting, (2) resetting, and (3) latched. The temporary change in either R or S will make the change in the output stay, and thus be stored.
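The truth table can be turned into a tiny Python model. This is a behavioral sketch, not a gate-level one, and the metastable row is modeled as an error:

```python
def sr_latch(S, R, q):
    """One step of an SR latch; q is the current value of Q+.
    Returns the new (Q+, Q-)."""
    if S and R:
        raise ValueError("metastable: S = R = 1 causes an error")
    if S:               # setting
        q = 1
    elif R:             # resetting
        q = 0
    # S = R = 0: latched, q keeps its previous value
    return q, 1 - q

q, _ = sr_latch(1, 0, 0)   # timestamp (1): set, Q+ becomes 1
q, _ = sr_latch(0, 0, q)   # latched: Q+ stays 1
q, _ = sr_latch(0, 1, q)   # timestamp (2): reset, Q+ back to 0
```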

3.1.3.2 D Latch

We’re getting closer and closer to the actual implementation of storage


devices in microprocessors! SR latch is great because it allows us to change
74 3 Microprocessor Design

Figure 3.11: D latch has a clock C to con- R


trol when the data D is allowed to pass D
(Data) Q+ C
through and to cause change in the out-
put Q+. C is 1, the status is called “latch-
ing”, where output Q+ responds to the D
change of the input data D. When C is
0, the status is “storing”, and Q+ stays/- Q-
stores the value regardless of changes of C S Q+
input D. (Clock) ❶ ❷ ❸ ❹ ❺ ❻

the output arbitrarily and make the change stay. Two flaws of SR latch,
however:

1. S and R are like switches; where do we send and store actual data?
2. S and R can be used at any time, but when there are many SR
latches, how can we make sure they are synchronized, or on the
same page?

These questions bring us to a better design, called the D latch, where D
stands for data. D latches explicitly address the two flaws mentioned above.

In a synchronous setting, a circuit that consists of many smaller sequential
logic components is controlled by a single clock. This single clock ticks at
a constant speed, which is determined by the slowest propagation delay in the
circuit. In D latches, in addition to an input signal D for one bit of data,
we also add a new input signal C, which stands for clock. See Figure 3.11.

The clock signal C controls when the input data D can pass through and
cause changes in the output Q+. In Figure 3.11 at timestamp (1), C gives
a high voltage, which brings the D latch to a state called latching. At this
moment, Q+ will change its value according to the input data D. As long
as C stays at high voltage, Q+ will always respond to changes of D.
Observe that in the figure Q+ and D have the same waveforms between (1)
and (5).

At timestamp (5), C falls back, bringing the D latch to a state of storing.


This means that we have stored the input D in Q+ . After this, as long as
C stays at 0, no matter how we change D , Q+ will not change (e.g., see
timestamp (6)); it’s storing the value of input data D at the moment that
the clock falls.
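The latching/storing behavior above can be captured in a one-line behavioral model (a sketch, not a gate-level implementation):

```python
def d_latch(C, D, q):
    """D latch: while C is 1 (latching), Q+ follows D;
    while C is 0 (storing), Q+ keeps its stored value."""
    return D if C else q

q = 0
q = d_latch(1, 1, q)   # latching: Q+ follows D, becomes 1
q = d_latch(1, 0, q)   # still latching: Q+ follows D, becomes 0
q = d_latch(0, 1, q)   # storing: D changes, but Q+ stays 0
```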

We can similarly write a truth table, as the one on the side, where d can be
either 0 or 1, and the value of q depends on the value of d at the moment
when C drops.

C D | Q+ Q- | State
0 d | q  !q | Storing
1 d | d  !d | Latching

3.1.3.3 Flip-Flops

Last stop before we get to the real thing in microprocessors! D latches


allow us to control the output with a data input, which is good, but notice
that as long as C stays at high voltage, Q+ will change based on D . This
is not expected, because any interference of the input signal will affect the
output, making it unstable to store. Imagine you send an input data of 1,
and during latching mode, the signal of input becomes unstable and so

Figure 3.12: An edge-triggered latch, or a flip-flop: when C rises, the trigger T will temporarily rise to high voltage, allowing Q+ to store the value of input data D at that moment. Afterwards, T drops back down, and no matter how input D changes, Q+ stays stable.

does the output Q+. So how can we make sure the output Q+ stays stable
during the latching state?

The solution is edge-triggered latches, or flip-flops, where the output
changes only when signal C is on the rising edge. The output will not
change even while C stays at 1.

We use Figure 3.12 to show the logic of a flip-flop and an example timing
diagram. The major change is that we added a trigger T, which is essentially
just an and gate. Both inputs of the trigger come from the clock signal C.
If you look closely, after passing through three not gates, one input is !C,
while the other is C, so the trigger will be C & !C = 0 eventually. Then why
do we want to create a trigger this way?

Remember we said we only want the output to change when C is on the
rising edge. In this case, trigger T controls whether Q+ will change with D,
so we only want T to be at high voltage for a very short amount of time,
i.e., when the clock signal becomes 1. So how can we create such a small
pulse to allow Q+ to store D and stay stable afterwards? We just need to
make sure T becomes 1 when the clock rises, and quickly switches back to 0.

From the beginning, we mentioned that logic gates constantly respond to
input changes, but with a small amount of delay, meaning the more gates a
signal needs to pass through, the longer the delay. There you go! When
C rises, one of its two branches will quickly reach the trigger, at which
moment the trigger will rise to high voltage. The other branch of C will
arrive shortly after, with a delay, because it has more gates to pass through.
It will, however, eventually arrive, and thus bring the trigger back to low
voltage. This gap between the arrivals of the two inputs gives us the chance
to change Q+ and store its value, and make it stay stable afterwards.
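Behaviorally, the edge-triggered latch can be modeled by remembering the previous clock level and updating only on a 0-to-1 transition. This sketches the behavior, not the gate-delay trick itself:

```python
class FlipFlop:
    """D flip-flop: Q+ captures D only on the rising edge of C."""
    def __init__(self):
        self.q = 0
        self.prev_c = 0

    def step(self, C, D):
        if C == 1 and self.prev_c == 0:   # rising edge: the trigger pulse fires
            self.q = D
        self.prev_c = C                   # a steady high or low level changes nothing
        return self.q

ff = FlipFlop()
ff.step(1, 1)   # rising edge: Q+ stores 1
ff.step(1, 0)   # C stays high, D changes: Q+ keeps 1
ff.step(0, 0)   # falling edge: no change
```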

3.1.4 Registers

Finally it’s time to talk about the real thing! In microprocessors, the hard-
ware we use to store data is registers, our old friend from assembly. Now
that we know one flip-flop can store one bit of data, storing a double word
is just too straightforward. In fact, as shown in Figure 3.13, a register is a
group of flip-flops where each of them stores one bit. Notice that all the

Figure 3.13: A register implemented using edge-triggered latches. One latch can store one bit of data, and all 64 bits will be updated together when the clock rises.

flip-flops in the register are controlled by the same clock signal. After all,
you don't want to store several bits first and wait until the next rising edge
of the clock to store the other bits.

In the ARMv8 architecture, we have 32 general purpose registers, all controlled
by one clock signal. All registers are encapsulated in a group called the
register file, and in our architecture one microprocessor has one register file.
Figure 3.14 shows an illustration of a register file.

Each register file has two read ports and one write port. Here based on
Figure 3.14, we declare the signals with “variable-style” names, so we can
refer back to them later.

On the write port, we have three input signals: RegDataW for the actual
data we want to store into the register; WriteReg for the destination register
number, i.e., where we want to store RegDataW; and RegWrite, which
indicates whether we want to write to a register or not.

Input WriteReg, a 5-bit value, is sent to a decoder, whose output is one-hot.
For example, if WriteReg = 00111b, the output of the decoder will be all
zeros except the wire connected to register X7, because 00111b is 7 in
decimal. Because of the and gates, only X7 will be written once RegWrite
changes to 1, meaning we do want to write to the register.
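The decoder plus the and gates can be sketched as a function producing 32 one-hot enable wires, with the gating by RegWrite folded in (the function name is ours, for illustration):

```python
def write_enables(write_reg, reg_write):
    """5-to-32 decoder gated by RegWrite: at most one enable wire is 1."""
    return [1 if (reg_write and i == write_reg) else 0 for i in range(32)]

enables = write_enables(0b00111, 1)   # only register X7's enable wire is 1
```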

Figure 3.14: A little more detailed register file.

Each register has three inputs. C and D correspond to the inputs in
Figure 3.13, while E (enable) controls whether any data can be written to
the register even when the clock rises. Therefore, writing to registers is
sequential logic.

On the right side of Figure 3.14, we have two read ports. The two register
numbers, ReadReg1 and ReadReg2, are used as "selectors" to select data
from the corresponding registers. The data are denoted as RegData1 and
RegData2, respectively. Reading registers is more like combinational logic,
meaning as soon as ReadReg1 and ReadReg2 receive changes, the
corresponding data will be read almost instantly. This is different from
writing to a register, since the read ports have no control wire that connects
to the clock.
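Putting the two kinds of ports together, a behavioral model of the register file might look like the sketch below. The hardwired-zero behavior of X31/XZR is deliberately not modeled here:

```python
class RegisterFile:
    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        # combinational: data appears almost instantly, no clock involved
        return self.regs[read_reg1], self.regs[read_reg2]

    def clock_rise(self, reg_write, write_reg, reg_data_w):
        # sequential: a write only happens on a rising edge, and only
        # when RegWrite is 1
        if reg_write:
            self.regs[write_reg] = reg_data_w

rf = RegisterFile()
rf.clock_rise(1, 7, 42)          # write 42 into X7 on the clock edge
data1, data2 = rf.read(7, 0)     # read X7 and X0 at any time
```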

3.1.5 Memory

Random Access Memory (RAM), or simply memory, also uses sequential
logic to store data, even though it's not inside the CPU. Figure 3.15 shows
an illustration of memory.

In practice, memory has three buses—data, address, and control buses.
The data bus is bidirectional: it can send data to memory, or read data
from memory. This means that during each access to memory, we can only
either read data or write data—we can't do both at the same time!
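A behavioral sketch of this single shared data bus follows, with MemRead and MemWrite standing in for the control bus. The 64-bit little-endian access width is an assumption carried over from earlier chapters, not something this section fixes:

```python
class Memory:
    def __init__(self, size=1024):
        self.data = bytearray(size)

    def access(self, address, mem_read, mem_write, write_data=None):
        """One memory access: either a read or a write, never both."""
        assert not (mem_read and mem_write), "the data bus is shared"
        if mem_read:
            return int.from_bytes(self.data[address:address + 8], "little")
        if mem_write:
            self.data[address:address + 8] = write_data.to_bytes(8, "little")

mem = Memory()
mem.access(0x10, 0, 1, write_data=1234)   # write cycle
value = mem.access(0x10, 1, 0)            # separate read cycle
```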

The address bus carries a memory location. During a read, the data at the
address on the address bus will be put on the data bus and sent out. Similarly,
during a write, the data on the data bus will be sent into the memory location
specified by the address bus. Either reading or writing is controlled by the
control bus. The width of the control bus — the number of bits — depends
on the specific architecture. In our design, we only need two bits.

Figure 3.15: Control bus and address bus are unidirectional, while the data bus is bidirectional.

3.1.6 Summary

The most essential components of a microprocessor have been introduced
in this section. In sum, each microprocessor has two basic functionalities—
computing and temporary storage. The component used for computing
is the arithmetic logic unit (ALU), implemented using combinational circuits.

The ALU can perform basic arithmetic, and we showed the operation of
addition in Figure 3.8. Registers are used for temporary data storage,
implemented by sequential circuits (Figure 3.13). Writing data to a register
is controlled by a clock, and only happens when the clock signal has a rising
edge. Reading data from registers, however, is similar to combinational
logic, and can happen at any time.

3.2 From Assembly to Machine Code

Before we dive into the real construction of our CPU, we need to consider
how we can let it recognize our program, or instructions. Surely we
understand that ADD X0,X0,X1 adds X1 to X0, but those digital circuits
only recognize high and low voltage, or at best 0s and 1s. The first
step, therefore, is to translate our text code (assembly) into digital
signals (binary 0s and 1s) so that they can be pushed into the circuits to
perform operations. Once the circuits finish responding to the signals,
we interpret the resulting 0s and 1s to know what happened.

Each instruction is unique: they have different operands and operators;
but some instructions have similar attributes: they perform similar
operations. For example, both ADD X0,X0,X1 and ADDS X20,X21,0xff14
calculate the sum of two numbers and store it to a register. When
translating those instructions into binary, we need to make sure they are
distinct enough so the machine can perform the requested job precisely,
but also similar enough to reduce repeated work in the hardware design.
This "translation" from a text assembly instruction to binary machine
code is called encoding.

Since we are using ARM assembly, we will look at the encodings designed
by ARM, so there is no need to design our own. In the following, we will only
study the encodings of the most frequently used instructions. They are
sufficient to serve the purpose of the discussion of CPU design. We also
removed and modified some of the fields in the encodings to make them easier
and more straightforward. To see the complete sets of encodings, you can visit
the ARM documentation. 5

5: See https://developer.arm.com/documentation/ddi0602/2022-03/Base-Instructions.

Each ARM assembly instruction takes four bytes (32 bits), and can be roughly
separated into two fields. The leading bits are called the opcode, which is
unique to different mnemonics. The rest of the bits are used for operands,
such as encoding register read/write numbers, immediates, addresses, etc.

3.2.1 Arithmetic/Logic Instructions

Let’s start with the ones we’re most familiar with—arithmetic and logic
instructions.

Bits 31:21 hold the opcode, bits 20:16 Rm (2nd register), bits 15:10 the fixed pattern 111000, bits 9:5 Rn (1st register), and bits 4:0 Rd (destination register).

Instruction      opcode (bits 31:21)
ADD Rd,Rn,Rm     10001011001
ADDS Rd,Rn,Rm    10101011001
SUB Rd,Rn,Rm     11001011001
SUBS Rd,Rn,Rm    11101011001
AND Rd,Rn,Rm     10001010000
ANDS Rd,Rn,Rm    11101010000
ORR Rd,Rn,Rm     10101010000

Figure 3.16: Encodings of arithmetic and logic instructions with register operands.

3.2.1.1 With Registers

The first case is when all the sources are registers, including ADD[S], SUB[S],
AND[S], and ORR. Looking at these instructions, we notice they all take
three register operands: destination Rd, source 1 Rn, and source 2 Rm.
Thus, we show their encodings in Figure 3.16.
It is straightforward that all the register fields take five bits, since we have
32 general purpose registers, which can be encoded in five bits. Note that
here Rm refers to the register number or ID, instead of the data inside it.
The bits 15:10 are the same, 111000, for all these instructions. 6

6: In fact, bits 15:10 contain a 3-bit option and a 3-bit immediate. To keep
things simple and focused, we don't really use or care about these fields, so
we can simply let option in all such instructions be 111, while immediate is
000. As to why we use these two values, if you're interested, feel free to
ask me, or visit the ARM documentation.

It's not difficult to find some patterns in the opcode. For example, it is
clear that bit 30 indicates which operation we want: 0 for addition, 1 for
subtraction. Also, bit 29 is 1 if we want to set condition codes, or 0
otherwise.

˛ Quick Check 3.1
Remember CMP Rn,Rm compares two registers and sets condition
codes. What it actually does is perform a subtraction between
Rn and Rm, and send the result to XZR (register X31). In other
words, it's just an alias for the instruction SUBS XZR,Rn,Rm. Please
translate CMP X1,X0 into machine code based on the discussion
above.
B See solution on page 114.
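The encodings in Figure 3.16 can be packed mechanically from the fields. The sketch below does exactly that; the helper name `encode_reg` is ours, not ARM's:

```python
# opcode values (bits 31:21) taken from Figure 3.16
OPCODES = {
    "ADD":  0b10001011001, "ADDS": 0b10101011001,
    "SUB":  0b11001011001, "SUBS": 0b11101011001,
    "AND":  0b10001010000, "ANDS": 0b11101010000,
    "ORR":  0b10101010000,
}

def encode_reg(mnemonic, rd, rn, rm):
    """Pack opcode | Rm | 111000 | Rn | Rd into one 32-bit instruction."""
    return (OPCODES[mnemonic] << 21) | (rm << 16) | (0b111000 << 10) \
        | (rn << 5) | rd

# CMP X1,X0 is an alias for SUBS XZR,X1,X0, where XZR is X31
cmp_x1_x0 = encode_reg("SUBS", 31, 1, 0)
```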

3.2.1.2 With Immediates

The general structure of instructions with immediates is still the same—a
few bits of opcode followed by operands, as in Figure 3.17. Notice Rd
is the destination register, Rn the source register, and imm11 is an 11-bit
immediate number.

Bits 31:21 hold the opcode, bits 20:10 imm11 (an 11-bit immediate), bits 9:5 Rn (1st register), and bits 4:0 Rd (destination register).

Instruction          opcode (bits 31:21)
ADD Rd,Rn,imm11      10010001000
ADDS Rd,Rn,imm11     10110001000
SUB Rd,Rn,imm11      11010001000
SUBS Rd,Rn,imm11     11110001000
ORR Rd,Rn,imm11      10110010000

Figure 3.17: Encodings of arithmetic and logic instructions with immediates.

˛ Quick Check 3.2

When looking at machine code, you found a binary sequence:

1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0

What assembly instruction does it represent?

B See solution on page 114.

Bits 31:21 hold the opcode, bits 20:10 imm11 (an 11-bit immediate), bits 9:5 Rn (base register), and bits 4:0 Rt (target register).

Instruction           opcode (bits 31:21)
LDR Rt,[Rn,imm11]     11111000010
STR Rt,[Rn,imm11]     11111000000

Figure 3.18: Encodings of memory accessing instructions.

3.2.2 Memory Accessing Instructions

Here we mainly focus on LDR and STR, with only 64-bit registers supported.
In these two instructions, as shown in Figure 3.18, the 11-bit immediate
number indicates the offset, while Rn is the base register. Since Rt could
be a source or destination register, we call it the "target register".

3.2.3 Branching Instructions

Branching instructions follow the same format. To keep our instruction


set consistent with the datapath, we only consider the instructions shown
in Figure 3.19.

Notice the immediate number encoded in bits [20:10] is a PC-relative
offset. For example, assume the current instruction is B L1, located at
address x, and the instruction labeled L1 is at address y. Then the imm11
field of B L1 is y-x (truncated to 11 bits). This is done automatically
by the assembler.

For bits [4:0], instruction CBZ does have Rt in it, so it'll just be the
register number. For B, since no register is involved, we simply put all
zeros in Rt. 7 Instructions BL and RET use X30 by default, so 11110
is hardcoded into Rt.

7: Note that technically all zeros in Rt represent register X0, but this
instruction doesn't even need to read a register, so we're fine.

Bits 31:21 hold the opcode, bits 20:10 imm11 (an 11-bit PC-relative offset), bits 9:5 the fixed pattern 00000, and bits 4:0 Rt (target register; B uses Rt = 00000, while BL and RET use Rt = 11110).

Instruction       opcode (bits 31:21)
B label           00010100000
BL label          10010100000
CBZ Rt,label      10110100000
CBNZ Rt,label     01010100000
RET               11010110010

Figure 3.19: Encodings of branching instructions.

ą Example 3.1
Assume we have the following segment of assembly code with
each instruction’s address in the comments:

1 ...
2 ...
3 loop: SUB X5, X5, 1 // 0x1000
4 LDRB W1, [SP, X5] // 0x1004
5 CBZ X5, exit // 0x1008
6 B loop // 0x100C
7 exit: MOV X0, 0 // 0x1010
8 MOV X8, 93 // 0x1014
9 SVC 0 // 0x1018

Please translate instructions at 0x1008 and 0x100C to binary


machine code.

Solution: The instruction at 0x1008 is CBZ, whose target is the
instruction at address 0x1010, labeled exit. The offset therefore
is 0x1010 - 0x1008 = 0x8. The Rt in this instruction is
X5, so based on Figure 3.19, we have the following binary code:

10110100000 00000001000 00000 00101
  (opcode)    (imm11)           (Rt)

The instruction at 0x100C is B, whose target is the instruction
at address 0x1000, labeled loop. Notice that in this case it
is branching backwards, so the offset is negative:
0x1000 - 0x100C = -0xC, which truncated to 11 bits in two's
complement is 11111110100, or 0x7F4. For B, Rt is always zero,
so the binary code is:

00010100000 11111110100 00000 00000
  (opcode)    (imm11)           (Rt)

3.3 A Single-Cycle Datapath

3.3.1 Preliminaries

Sometimes a line of code speaks louder than a thousand words, so to better
illustrate the idea of the datapath, in the following we will use a "Python-like"
language. It has similarities in syntax to Python, but doesn't necessarily
follow correct syntax. Our focus is not writing a correct Python program,
but making the datapath clearer.

Notations and Description Language

We use R[x] to denote the 64-bit data inside register x, and M[a]
the data at address a inside memory. For multibit data D, D[a:b]
indicates bits a to b (inclusive). For example, R[0][5:2] is the 2nd to
the 5th bits inside register X0, while M[R[1]][31:0] is the data at the
address indicated by the lower word of the data inside X1.

We often use a compact if-else statement, such as data = x if b else y,
where data is x if b is True or 1; otherwise data is y. This
can also be nested, such as data = x if b else y if a else z: if
b == 1, data = x; else if a == 1, data = y; otherwise data = z.

Wirings

In the illustrations of the datapath, we have two types of wiring. Blue lines
transfer control signals, usually one bit of data, but could be more.
Control signals are used to guide the CPU to perform specific operations, and
are closely related to the mnemonics/operators of each instruction.
Black lines are used for data signals, where real data (such as operands)
are transferred. Data wires are usually parallel lines (buses) where each
line transfers one bit, and all bits are transferred at once. For example,
data buses and address buses can transfer 64 bits at a time. To simplify the
drawings, we use only thin lines to indicate such signals or buses, but it
should be clear that they have a wider bandwidth. This way, we
focus on the flow of data in the system by removing details about the actual
width of those wires.

3.3.2 Datapath

Figure 3.20 shows our first attempt at building the datapath, where we’ll
run one assembly instruction at a time. We call this model single-cycle
implementation. This is absolutely over-simplified and inefficient, but it
shows us the most important ideas of designing a processor.
In Figure 3.20, we separate memory into instruction and data memory,
only for illustration purposes. It should be clear that there's only one
memory, where both instructions and data are stored together but in
different regions. Instructions are in the .text segment, and regular data
are in .data, .bss, or the stack area of the memory.

Figure 3.20: A single-cycle implementation of datapath. Black lines are data signals, while blue lines are control signals.

Clocking Methodology

In digital logic design, a circuit consists of a mix of combinational and
sequential components. As discussed earlier, the main purpose of combinational
logic is to compute, while that of sequential logic is to store and update
data. In Figure 3.21, we show a simple circuit with one sequential and one
combinational component. Since writing to the sequential component is
controlled by a clock and only happens at the rising edge, we make one pass
of the data through the circuit in each clock cycle.

Figure 3.21: A clock controls the sequential logic, so each clock cycle makes one pass of the data.

Comparing Figure 3.20 with the idea above, we see similarities. Following
the arrows, we see that the flow of the data starts from PC, and ends
at the register file for writing, after passing through data memory. This route

suggests one processing of an instruction. Clearly, an instruction (as a
binary machine code) will pass through a number of combinational logic
blocks (ALU, adders, register file read ports, etc.), as well as sequential
logic (the register file write port). Recall that the write port of the
register file is considered sequential and thus is controlled by a clock.
Therefore, in this design, the processing of each instruction — from
beginning to end — takes only one clock cycle. This is also why we call
this model a single-cycle datapath.

3.3.3 Stages of an Instruction

Each instruction, from being obtained from memory, to finished executing,


will follow the five stages as it passes through the datapath:

1. (IF) Instruction Fetching: the instruction is obtained from memory;


2. (ID) Instruction Decoding: the fetched instruction will pass differ-
ent fields to different data signals, and the opcode will be used for
converting into control signals. Register data will also be read;
3. (EX) Execution: ALU will perform the operation based on the de-
coded instruction, and produce the result;
4. (ME) Memory Access: Sending data to memory or reading data
from memory;
5. (WB) Writing Back: Write result back to register.
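To see the five stages end to end, here is a minimal Python walk-through for the register-form ADD only. The field positions come from Figure 3.16; everything else, control signals included, is stripped away, so this is a sketch rather than the full datapath:

```python
def run_one(instr_mem, pc, regs):
    """One pass of a single ADD Rd,Rn,Rm through the five stages."""
    # IF: fetch the 32-bit instruction at PC, and compute PC + 4
    instr = instr_mem[pc]
    next_pc = pc + 4
    # ID: split out the register fields and read the register file
    n, m, d = (instr >> 5) & 0x1F, (instr >> 16) & 0x1F, instr & 0x1F
    data1, data2 = regs[n], regs[m]
    # EX: the ALU computes the sum
    alu_out = data1 + data2
    # ME: ADD does not access data memory (stage passed, not skipped)
    # WB: write the ALU result back to Rd
    regs[d] = alu_out
    return next_pc

regs = [0] * 32
regs[1], regs[2] = 10, 32
add_x0_x1_x2 = (0b10001011001 << 21) | (2 << 16) | (0b111000 << 10) | (1 << 5)
next_pc = run_one({0x1000: add_x0_x1_x2}, 0x1000, regs)
# regs[0] is now 42, and next_pc is 0x1004
```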

ą Example 3.2
Describe what happens in the five stages for instruction
ADD X0,X1,X2 .
IF The binary machine code of ADD X0,X1,X2 is fetched
from instruction memory;
ID The opcode field is used to generate control signals, while
registers X1 and X2 are read from register file;
EX ALU receives the two operands, read from X1 and X2 ,
and produce R[1] + R[2] ;
ME Since this instruction does not access memory, this stage
was passed (but not skipped!);
WB The result produced by ALU, R[1] + R[2] , is written
back to register X0 .
Remarks. One thing we need to notice is, for the ADD instruction, the ME
stage was passed, but not skipped. This is true for all instructions
that do not need memory access. We defined five stages for all
instructions for the purpose of consistency, but they don't necessarily
do anything in some stages.

3.3.3.1 Stage 1: Instruction Fetching

This is the first stage for an instruction, see Figure 3.22. PC , the program
counter, sends the address of the instruction to be executed, and the actual

Figure 3.22: Stage 1: instruction fetching.

Figure 3.23: Stage 2: decoding. Fields in an instruction are sent to different parts of the register file: n = I[9:5], m = I[20:16], t = I[4:0], d = I[4:0]. The opcode I[31:21] is sent to a control unit that generates the control signals (Reg2Loc, PBr, CBr, UBr, LinkReg, MemToReg, MemRead, MemWrite, ALUsrc, ALUop, RegWrite). The data on the second read port is R[m] if Reg2Loc else R[t].

instruction will be read from memory and sent out from Data on the data
bus. We denote the retrieved instruction as I .

At the same time, notice PC is also sent to an adder. This is to calculate


the address of the next instruction, and update PC . The adder will add
four bytes because each instruction takes four bytes in memory, so adding
four will make us move on to the next instruction.

At this time, PC will not be updated yet, because the next instruction
could be a branch somewhere. The real address of next instruction cannot
be decided until EX stage, so we’ll talk about it then. For now, what’s
happening in IF stage can be described as:

I = M[PC]
nextPC = PC + 4

3.3.3.2 Stage 2: Instruction Decoding

The decoding stage takes the instruction I we just read from instruction memory and sends its different fields to different data wires, as well as reading register data. We use Figure 3.23 to show some details.

The highest 11 bits are used as the opcode . This is sent to the control unit, a combinational logic that converts the opcode into the different control signals used by all parts of the datapath. We will discuss them in the stages where they are actually used.

The register file has two read ports. The first read port takes I[9:5] , which encodes a register number, denoted as n . This is the same for all instructions. The data read from this register is denoted as R[n] .

The second read port needs some work, though. For some instructions, such as ADD Rd,Rn,Rm , the second read register is Rm , encoded in I[20:16] . For other instructions, however, such as LDR Rt,[Rn,imm] , the second read register Rt is encoded in I[4:0] . Thus, before sending the register number to ReadReg2 , we need a control signal Reg2Loc that selects one of them through a multiplexer. When it is zero, it selects I[4:0] ; otherwise I[20:16] .

In summary, this stage can be described as follows:

# A group of control signals defined in a 'class'
# All values are initialized as zeros
# (RegWrite and LinkReg are included here because later stages use them)
class sigCntl:
    Reg2Loc,PBr,CBr,UBr,MemToReg,MemRead,MemWrite,ALUsrc = 0,0,0,0,0,0,0,0
    RegWrite,LinkReg = 0,0
    ALUop = 0b00

# Obtain opcode from I, and generate control signals
# Control() function returns an object of sigCntl
opcode = I[31:21]
sigCntl = Control(opcode)

# Read registers
n, m, t, d = I[9:5], I[20:16], I[4:0], I[4:0]
ReadReg1 = n
ReadReg2 = m if sigCntl.Reg2Loc else t
RegData1 = R[ReadReg1]
RegData2 = R[ReadReg2]

In the pseudocode above, we first create a class called sigCntl to group all the individual control signals together, initializing them all as zeros. Then, after obtaining the opcode , we use Control() to generate an object of sigCntl based on the opcode , as the control unit is a combinational logic and thus behaves like a function.
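To make this concrete, here is a toy sketch of what Control() might produce for just two opcodes. The mnemonic strings stand in for the 11 opcode bits, and the exact signal settings are illustrative, chosen to match how each signal is described in this section:

```python
# Hypothetical, minimal control unit covering only ADD and LDR.
class sigCntl:
    def __init__(self):
        self.Reg2Loc = self.PBr = self.CBr = self.UBr = 0
        self.MemToReg = self.MemRead = self.MemWrite = self.ALUsrc = 0
        self.RegWrite = self.LinkReg = 0
        self.ALUop = 0b00

def Control(opcode):
    sig = sigCntl()                # everything defaults to zero
    if opcode == "ADD":
        sig.Reg2Loc = 1            # second read register is Rm = I[20:16]
        sig.ALUop = 0b10           # arithmetic/logic operation
        sig.RegWrite = 1           # write the ALU result to Rd
    elif opcode == "LDR":
        sig.ALUsrc = 1             # second ALU operand is the immediate
        sig.ALUop = 0b00           # address calculation (add)
        sig.MemRead = 1            # read data memory
        sig.MemToReg = 1           # write-back value comes from memory
        sig.RegWrite = 1           # write the loaded value to Rt
    return sig
```

Because the control unit is combinational, calling Control() twice with the same opcode always yields the same signal values.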

We have not yet talked about WriteReg , RegDataW , and RegWrite . Register writing doesn't happen until the last stage, so we'll leave it to the end.

3.3.3.3 Stage 3: Execution

This is the main stage where calculation happens. We have to remember that the “calculation” here is not only for arithmetic/logic instructions such as ADD and ORR . Other instructions also need calculations. For example, when we use LDR Rt,[Rn,imm] , we need to calculate the memory address Rn+imm . This address calculation is also done by the ALU (otherwise where else could it be done?).

As shown in Figure 3.24, there are two sub-jobs done in this stage: computing and PC updating. We will look at computing first, since that happens before PC updating.

Computing

The main function of the ALU is to perform simple calculations, so naturally the ALU takes two inputs, which we refer to as inputA and inputB . inputA is simple: it is transferred from RegData1 . For some instructions, however, we are not operating on two registers; instead we operate on one register and one immediate, such as LDR Rt,[Rn,imm] . Therefore, inputB needs a signal ALUsrc to determine which source — RegData2 or imm (encoded in I[20:10] ) — to use as the second operand. In our description language, it can be done as: inputB = imm if ALUsrc else RegData2 .

The calculation result from the ALU is denoted as ALUout . Recall that we mentioned the condition codes inside the CPU when using instructions such as CMP . In our case, we only care about the Z flag, so we assume the ALU only produces Z .

There are multiple possible calculations, however, such as ADD , ORR , etc. How does the ALU know which one to perform? We need a control signal called action , which is generated from both opcode and ALUop through the ALU Control unit. Based on the different values of action , the ALU performs different calculations and produces the corresponding result.

We design signal action with four bits, and define eight possible operations for the ALU in Table 3.1. (As to why we use 0000 to indicate AND operations and so on: the reason is usually practical, for easier circuit design.)

Thus, for this part, we can describe it in the following code:

inputA = RegData1
inputB = imm if sigCntl.ALUsrc else RegData2
action = ALUControl(sigCntl.ALUop, opcode)
ALUout, Z = ALU(action, inputA, inputB)

Figure 3.24: Stage 3 — execution — consists of two tasks: updating PC for branching instructions, and computing through the ALU.

Table 3.1: ALU operations for each instruction in the instruction set architecture.

        ALUop   action   ALU operation
  AND    10      0000    ALUout = inputA & inputB
  ORR    10      0001    ALUout = inputA | inputB
  SUB    10      0011    ALUout = inputA - inputB
  ADD    10      0010    ALUout = inputA + inputB
  LDR    00      0010    ALUout = inputA + inputB
  STR    00      0010    ALUout = inputA + inputB
  B      01      0111    pass inputB, i.e., ALUout = inputB
  BL     01      0111    pass inputB, i.e., ALUout = inputB
  CBZ    01      0111    pass inputB, i.e., ALUout = inputB
  RET    01      0111    pass inputB, i.e., ALUout = inputB
  ANDS   11      1000    ALUout = inputA & inputB (set Z)
  ADDS   11      1010    ALUout = inputA + inputB (set Z)
  SUBS   11      1011    ALUout = inputA - inputB (set Z)
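Table 3.1 maps directly onto a pair of functions in the description language. This is a toy model: representing the opcode as a mnemonic string is my simplification (the real ALU Control decodes the 11 opcode bits), and only the Z flag is produced:

```python
def ALUControl(ALUop, opcode):
    """Convert the 2-bit ALUop (plus, for R-type, the opcode) into a 4-bit action."""
    if ALUop == 0b00:                       # LDR/STR: address calculation
        return 0b0010
    if ALUop == 0b01:                       # B/BL/CBZ/RET: pass inputB through
        return 0b0111
    if ALUop == 0b10:                       # arithmetic/logic instructions
        return {"AND": 0b0000, "ORR": 0b0001,
                "SUB": 0b0011, "ADD": 0b0010}[opcode]
    # ALUop == 0b11: flag-setting variants
    return {"ANDS": 0b1000, "ADDS": 0b1010, "SUBS": 0b1011}[opcode]

def ALU(action, inputA, inputB):
    """Return (ALUout, Z) following Table 3.1."""
    if action in (0b0000, 0b1000):
        ALUout = inputA & inputB
    elif action == 0b0001:
        ALUout = inputA | inputB
    elif action in (0b0011, 0b1011):
        ALUout = inputA - inputB
    elif action in (0b0010, 0b1010):
        ALUout = inputA + inputB
    elif action == 0b0111:
        ALUout = inputB                     # pass-through for branches
    else:
        raise ValueError("unknown action")
    return ALUout, int(ALUout == 0)         # Z is 1 when the result is zero
```

For example, an LDR with base register value 8 and immediate 16 drives the ALU with action 0010 and produces the address 24.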

PC Updating

In Section 3.3.3.1, we learned that after sending the current PC to the instruction memory, we use an adder to add four to PC so that we can move on to the next instruction. That is true when all instructions are sequential, but we also need to consider branching instructions.

In general, we have four possible options for updating PC : (1) the instruction four bytes after (the nextPC calculated in the IF stage); (2) the target instruction of an unconditional branch; (3) the target of a conditional branch; and (4) the target instruction pointed to by the link register X30 . So at the rightmost of Figure 3.24, we see a multiplexer whose output is sent to PC .

This multiplexer receives three inputs:

1. nextPC , meaning there's no branch and we'll move to the very next instruction;
2. BrPC , the address of the target instruction, calculated by adding PC and the immediate together, to perform either a conditional or unconditional branch. Recall that in branch instructions such as CBZ X0,label and B label , label is encoded as a PC-relative offset in the immediate field of the binary machine code. Thus, the target address BrPC is calculated as: BrPC = PC + I[20:10] ;
3. ALUout , which is used for procedure return. In the instruction RET , the return address is read from register X30 as RegData2 , and passed through the ALU as ALUout .

Which input is selected to pass through and update PC is determined by


three control signals: PBr , UBr , and CBr . PBr is used for procedure
return, i.e., selecting ALUout to update PC . UBr is for unconditional
branch, while CBr is for conditional. For instructions B target and
BL target , UBr ’s value is 1 whereas CBr is 0. For instructions such as
CBZ , UBr is 0 and CBr is 1.
3.3 A Single-Cycle Datapath 89

Figure 3.25: Stage 4: memory access. Memory has two control signals, MemWrite and MemRead .

For conditional branches, notice that a signal itself is not enough: to branch
or not depends on Z flag as well. Thus, we can express the final branch
signal Br as: Br = UBr | (CBr & Z) .
In summary:

BrPC = PC + imm
Br = sigCntl.UBr | (sigCntl.CBr & Z)
PC = ALUout if sigCntl.PBr \
     else (BrPC if Br else nextPC)
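The PC-update logic above is easy to sanity-check when wrapped in a function. The signature is my own packaging of the signals described in this section:

```python
def next_pc_value(PC, imm, nextPC, ALUout, Z, PBr=0, UBr=0, CBr=0):
    """Select the next value of PC from the three multiplexer inputs."""
    BrPC = PC + imm              # PC-relative branch target
    Br = UBr | (CBr & Z)         # take the branch?
    if PBr:                      # procedure return: target comes from the ALU
        return ALUout
    return BrPC if Br else nextPC
```

For instance, with PC = 100 and imm = 40, a CBZ whose register is zero (Z = 1, CBr = 1) jumps to 140, while the same instruction with Z = 0 falls through to nextPC = 104.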

3.3.3.4 Stage 4: Memory Access

Not all instructions access memory, but they all still pass through this stage. Figure 3.25 shows an illustration of this stage. Note that we mentioned in Section 3.1.5 and Figure 3.15 that there's only one set of data buses, which can transfer data to and from memory bidirectionally. For clarity, in our datapath we separate the data bus into WriteDataM and ReadDataM , two unidirectional buses.

For instructions that do need memory, such as LDR and STR , we use MemWrite and MemRead to control whether we're reading or writing memory. The memory operation is summarized in the following table:

  MemWrite   MemRead   Operation
      0          0     No operation
      0          1     ReadDataM = M[Address]
      1          0     M[Address] = WriteDataM
      1          1     Invalid
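The table can be modeled with a tiny helper in the description language. Representing memory M as a Python dict is my simplification:

```python
def data_memory(M, Address, WriteDataM, MemWrite=0, MemRead=0):
    """Perform at most one memory operation; return ReadDataM (or None)."""
    if MemWrite and MemRead:
        raise ValueError("invalid: MemWrite and MemRead both asserted")
    if MemWrite:
        M[Address] = WriteDataM    # store
        return None
    if MemRead:
        return M[Address]          # load
    return None                    # no operation
```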

Notice the input Address comes from ALUout , the output of the ALU. This is because instructions such as LDR Rt,[Rn,imm] need to use the ALU to calculate the address, i.e., ALUout = Rn + imm . However, not all ALU outputs represent a memory address, so we also branch ALUout around the memory to bypass it.

The input WriteDataM comes from the register read RegData2 . For example, in STR X0,[X1,0] , RegData2 is R[0] , which will be written to the

Figure 3.26: Stage 5: writing back.

memory location ALUout = R[1]+0 . For LDR , ReadDataM is the data read from memory, at the address calculated by the ALU.

As we see on the right of Figure 3.25, at the end of the ME stage we have two outputs: one from the ALU, ALUout , and another from data memory, ReadDataM . Since all we do at this stage is access memory, we will discuss how the two outputs are handled in the next section.

3.3.3.5 Stage 5: Writing Back

The update of registers happens in the last stage, as in Figure 3.26. In this stage, we have RegWrite to control whether we need to write to a register, depending on the instruction. Following stage 4, we have ALUout , ReadDataM , and nextPC , so we need to choose which one to push into the register as RegDataW . We use two signals called MemToReg and LinkReg , and the last stage can be described as:

RegDataW = nextPC if sigCntl.LinkReg \
           else ReadDataM if sigCntl.MemToReg \
           else ALUout
R[WriteReg] = RegDataW if sigCntl.RegWrite else R[WriteReg]

Notice that if RegWrite is 0, meaning we don't need to write to a register, R[WriteReg] does not change; hence the last else case.
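The write-back selection can be wrapped into a small, testable function. Passing the register file R in as a plain list is my simplification:

```python
def write_back(R, WriteReg, ALUout, ReadDataM, nextPC,
               RegWrite=0, MemToReg=0, LinkReg=0):
    """Choose RegDataW and, if RegWrite is set, store it into R[WriteReg]."""
    if LinkReg:                   # BL: save the return address
        RegDataW = nextPC
    elif MemToReg:                # LDR: value comes from data memory
        RegDataW = ReadDataM
    else:                         # arithmetic/logic: value comes from the ALU
        RegDataW = ALUout
    if RegWrite:
        R[WriteReg] = RegDataW
```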

As an example, LDR X0,[X1] reads data from address R[1] and writes it to register X0 . Therefore, MemToReg is 1, and ReadDataM is selected to write to the register. On the other hand, for ADD X0,X0,X1 , we want to write the addition result R[0]+R[1] back to X0 , so MemToReg is 0, and ALUout is written back.

Signal LinkReg is used specifically for the instruction BL , where we need to write the return address, i.e., PC+4 , back to register X30 .

Now that the last stage has finished, and PC has been updated in stage 3,
we are ready to start the next instruction and walk through the five stages
again.

Figure 3.27: A summary of the five stages each instruction goes through in the datapath, with the description language on the side. The description language in the figure reads:

IF:
    I = M[PC]
    nextPC = PC + 4

ID:
    opcode = I[31:21]
    sigCntl = Control(opcode)
    n, m, t, d = I[9:5], I[20:16], I[4:0], I[4:0]
    imm = I[20:10]
    ReadReg1 = n
    ReadReg2 = m if Reg2Loc else t
    RegData1 = R[ReadReg1]
    RegData2 = R[ReadReg2]
    WriteReg = d

EX:
    inputA = RegData1
    inputB = imm if ALUsrc else RegData2
    action = ALUControl(sigCntl.ALUop, opcode)
    ALUout, Z = ALU(action, inputA, inputB)
    Br = UBr | (CBr & Z)
    BrPC = PC + imm
    if PBr: PC = ALUout
    else: PC = BrPC if Br else nextPC

ME:
    Address = ALUout
    WriteDataM = RegData2
    if MemWrite and not MemRead: M[Address] = WriteDataM
    elif MemRead and not MemWrite: ReadDataM = M[Address]
    elif not MemRead and not MemWrite: pass
    else: error

WB:
    if LinkReg: RegDataW = nextPC
    else: RegDataW = ReadDataM if MemToReg else ALUout
    R[WriteReg] = RegDataW if RegWrite else R[WriteReg]

3.3.4 Summary

We use Figure 3.27 to summarize the five stages, where the datapath dia-
gram and description language are side by side.

3.4 A Pipelined Datapath

The sequential implementation is a simple model and shows important ideas and concepts in building a datapath. It is a very inefficient one, however: when an instruction goes through later stages such as ME, all the other components sit idle while the following instruction waits, which is a waste of resources. In this section, we'll modify our model into a pipeline to address this problem.

3.4.1 Operating a Pipeline

Before we measure the efficiency of a pipeline, we'll define some terminology.

- Throughput: the number of instructions the system completes per unit of time. We express throughput in units of giga-instructions per second (abbreviated GIPS), or billions of instructions per second;
- Latency: the total time required to perform a single instruction from beginning to end. We typically use picoseconds (abbreviated “ps”), or 10^-12 seconds, to measure latency. For example, 250 ps = 0.25 ns = 250 × 10^-12 s.

To fully understand pipelining in a datapath, let's start with a very simple example, where we simply send some input data to a combinational logic that converts it into output data, which we store in a register. See Figure 3.28.

In Figure 3.28, the input has three bits, s , b , and a , and the component surrounded by dotted lines is the combinational logic, whose output is D . Recall from Section 3.1.3.3 that the register is also controlled by a clock signal C — the data is written into the register only on a rising edge of the clock signal.

Assume it takes 300 ps to get D from s , b , a , which is the delay of the combinational logic, and it takes 20 ps to push the data into the register. In Figure 3.29, we show a sequence of three inputs, 110 , 010 , 110 , with an unpipelined version. A clock cycle is the period between two consecutive rising edges. In order to process one input per clock cycle, we have to leave sufficient time for the delay caused by the combinational logic, so each clock cycle takes 300+20 = 320 ps. To get output from the three inputs, we need to wait 320 ps × 3 = 960 ps.

However, if we look at the combinational logic closely, we'll notice that while the first input is passing through the two AND gates, the NOT gate is

Figure 3.28: An example of combinational logic, where the input is three bits and the output D is one bit. When the clock rises, the output D is written to the register.

Figure 3.29: A sequence of three inputs, 110 , 010 , 110 , with an unpipelined version.

idle. Since the second input will use this NOT gate anyway, why don't we send input 2 there already? To clearly show this idea, as in Figure 3.30, we separate the combinational circuits into three stages, where each stage's input is the previous one's output. Once input 1 has passed through stage 1 and is currently in stage 2, we can send input 2 to stage 1, and so on. Now you can see that at the peak of the execution, we will have three input signals running through the combinational circuits at the same time. This is usually called a three-way pipeline.

3.4.1.1 Pipeline Registers

The idea is simple and straightforward, but there are problems. When input 1 moves to stage 2, we were supposed to send input 2 to stage 1. However, notice that only signal s goes through stage 1; signals a and b hit the AND gates right away, but those are currently used by input 1. Because combinational logic responds to any change of its input signals immediately, input 2 is now in fact in both stages 1 and 2, while the data from input 1 has been overwritten and lost.

Therefore, we need to put a “barrier” between the stages, as if telling the signals: “hold on a second; let's make sure the previous signals have finished the job!” This “barrier” can be easily implemented using registers.

Figure 3.30: A three-way pipeline structure where each stage runs one input.

Figure 3.31: We added two registers between the three stages as “barriers”, to make sure the signals in each stage will not be interrupted or overwritten.

Figure 3.32: When we separate the combinational logic into three stages, the clock cycle can be shortened to execute only one stage. At the peak of the system, between times 2 and 3, all logic gates are working on different inputs, which greatly improves the throughput and reduces resource waste.

As shown in Figure 3.31, we inserted two registers, A and B, between the three stages, called pipeline registers. Register A stores 4 bits of data, while B stores 2 bits. On a rising edge of the clock, they store the output from the previous stage, and then provide it as the input for the next stage. Because registers only write data on the rising edge of the clock, even if the previous stage changes its signals during the cycle, it won't overwrite the existing data in the registers.
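This edge-triggered behavior can be modeled with a few lines of Python (a sketch; the class and method names are mine):

```python
class PipelineRegister:
    """State that changes only on the clock's rising edge."""
    def __init__(self, value=0):
        self.value = value      # what the next stage sees for the whole cycle
        self._next = value      # what the previous stage is currently producing

    def set_input(self, value):
        # The previous stage's combinational output may change freely
        # during the cycle without disturbing self.value.
        self._next = value

    def rising_edge(self):
        # On the clock edge, latch the input for the next stage.
        self.value = self._next
```

Until rising_edge() is called, set_input() has no visible effect on value, which is exactly the "barrier" behavior described above.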
Now that we have separated the combinational circuits into three stages, each stage has a shorter delay, so we assume each stage takes 100 ps. Since we added two registers, we also need to consider the delay of writing to registers; we assume the delay is the same for all registers, which is 20 ps.

We show the timeline in Figure 3.32 using this three-way pipeline, where each clock cycle executes one stage. Each cycle takes 120 ps (100 ps for the combinational circuits, and 20 ps for writing registers). The first input finishes after three cycles, and each of the other two inputs finishes one cycle later, so completing the three inputs takes five cycles, 120 ps × 5 = 600 ps. Compared to the unpipelined version in Figure 3.29, which takes 960 ps, the total delay has been greatly reduced.

One thing we do need to notice is that in the pipelined version, each individual input's latency has actually increased, from 320 ps to 120 ps × 3 = 360 ps. Regardless, the pipeline is still the more efficient model, because its throughput is higher: once the pipeline is full, it completes one input every 120 ps rather than every 320 ps.

Quick Check 3.3


True or false, and why?

1. By pipelining the CPU datapath, each instruction will exe-


cute faster, resulting in a speed-up in performance;
2. A pipelined CPU datapath results in instructions being exe-
cuted with higher latency and higher throughput.

See solution on page 114.

3.4.1.2 Limitations

Even though pipeline is a great idea, it has its limitations as well. In our
previous example, when we separate the combinational logic into three

Figure 3.33: It is very challenging to separate combinational circuits into stages with equal latency. Thus, the clock cycle needs to be long enough to cover the slowest stage, which limits the throughput of the entire system.

stages, we assumed the delay of each stage is equal. This, however, is rarely the case; typically, some stages take longer than others. Notice in Figure 3.31 that we use one clock signal to control all registers, in order to synchronize all stages. If some stages take longer than others, the clock has to wait for them, and all the other stages sit idle.

In other words, the shorter a clock cycle is, the shorter the latency of each instruction. However, to synchronize all stages, the clock cycle is determined by the slowest stage. In Figure 3.33, we assume stage 3 takes the longest time to complete. Therefore, the clock cycle needs to cover the latency of stage 3 in order to synchronize all the other stages, and we see that both stages 1 and 2 are idle for quite some time, which is a waste of resources.

Example 3.3
(Fall 22 Midterm 2) We currently have a four-way pipeline circuit, where the delays of the stages are 25 ps, 50 ps, 200 ps, and 75 ps, respectively; the delays include the sequential logic.

1. How long is the clock cycle in this pipeline?


2. With the pipeline above, what’s the latency of one pass of
data?

Solution: As discussed, the clock cycle is determined by the longest stage of the pipeline. Therefore, in this example, the clock cycle should be 200 ps.

Each pass of data goes through all four stages, so it takes four clock cycles in total. Since each clock cycle is 200 ps, the total latency of one pass of data is 200 ps × 4 = 800 ps.
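Answers like these can be checked with a couple of lines of the description language (the function name is mine; the throughput conversion assumes one pass completes per cycle once the pipeline is full):

```python
def pipeline_timing(stage_delays_ps):
    """Return (clock cycle, latency of one pass, throughput in GIPS)."""
    cycle = max(stage_delays_ps)            # cycle must cover the slowest stage
    latency = cycle * len(stage_delays_ps)  # one stage per cycle, every stage
    throughput_gips = 1000 / cycle          # one result per cycle, in 10^9/s
    return cycle, latency, throughput_gips
```

For the example above, pipeline_timing([25, 50, 200, 75]) gives a 200 ps cycle and an 800 ps latency.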

Another limitation is register overhead. One might propose that, since a shorter clock cycle is better and some stages take a long time, we should separate the combinational logic into more stages, i.e., deepen the pipeline.

For example, we can deepen the pipeline by separating a combinational logic into as many stages as possible, where each stage takes a much shorter time, resulting in a shorter clock cycle. In addition, we can process more inputs at the peak.

Remember, however, that every time we separate a combinational logic



Figure 3.34: The horizontal axis is a timeline. At the top we run one instruction through all five stages at a time, so it takes much longer to complete all three instructions. At the bottom, we run multiple instructions at the same time, which resembles a pipeline.

into two subparts, we need to add a pipeline register in between. Storing data into a pipeline register also has latency, even if it is only 20 ps. With only one stage, we spend about 6% of the time loading registers; with six stages, we spend about 30% of the time. Since efficiency is important, we do not want to spend that much time just loading registers. Therefore, finding a balance is important to address this issue.
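The 6% and 30% figures can be reproduced directly, assuming the 300 ps of combinational logic from the earlier example is split evenly across the stages:

```python
def register_overhead(n_stages, total_logic_ps=300, reg_ps=20):
    """Fraction of each clock cycle spent writing the pipeline register."""
    stage_ps = total_logic_ps / n_stages   # even split of the logic delay
    cycle = stage_ps + reg_ps              # each cycle: one stage + register write
    return reg_ps / cycle
```

register_overhead(1) gives 20/320, about 6%, while register_overhead(6) gives 20/70, about 29%, matching the text.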

In our course, we do not go into depth about how those limitations are
addressed. Instead, we focus on how creating a pipeline can make our
single-cycle datapath more efficient.

3.4.2 From Single-Cycle to Pipeline

In the previous example, the analogy to our sequential datapath is obvious. Our datapath can be treated entirely as a combinational logic separated into smaller parts, where only the last stage — writing back — is sequential.

To apply pipelining to the datapath of five stages, we use Figure 3.34 to show the idea. In the figure, we have a sequence of three instructions to execute. Using the sequential model from the previous section, one instruction takes over the entire clock cycle, so the second instruction cannot start executing until its previous instruction has finished all stages.

If we “stack” them, as at the bottom of Figure 3.34, we see that it takes much less time to complete the entire sequence. For example, at time (1), LDR is in the ID stage, and at this time PC is idle, so why don't we start putting the ADD instruction in the IF stage? Now at time (1), we have two instructions running at the same time.

Now that the big picture of constructing a pipeline is clear, we’re going
into details on the datapath in the next section.
3.4 A Pipelined Datapath 97

3.4.3 Adding Pipeline Registers to Datapath

Similar to the simply circuit example, we need to add pipeline registers be-
tween every pair of stages (except WB and IF). The pipeline registers store
different values, depending on the data that each stage needs to store. We
name them using the names of the stages they insert: IFID, IDEX, EXME,
and MEWB. In Figure 3.35, we add these four pipeline registers, where we
also show the data stored in them.

Figure 3.35: A detailed datapath with pipeline registers.

Control signals also need to be pipelined, but not all signals are used in every stage. Based on the stages in which they are used, we group them into EX, ME, and WB. EX contains PBr , CBr , UBr , ALUsrc , and ALUop . The ME and WB signals are not used in the EX stage, so we pass them on to the

  cycle:             1  2  3  4  5  6  7  8  9
  LDR X10, [X1,40]  IF ID EX ME WB
  SUB X11, X2, X3       IF ID EX ME WB
  ADD X12, X3, X4           IF ID EX ME WB
  LDR X13, [X1,48]              IF ID EX ME WB
  ADD X14, X5, X6                   IF ID EX ME WB

Figure 3.36: A pipeline diagram showing the progression of instruction executions.

next pipeline register. ME contains MemWrite and MemRead , while WB contains LinkReg , MemToReg , and RegWrite .

We will use a similar description language in the following sections. For example, IDEX.nextPC indicates the data nextPC currently in register IDEX ; EXME.MemRead is the signal MemRead in EXME ; and so on.

We will also use a pipeline diagram in examples to show the instructions and stages in each cycle, much like Figure 3.34. Figure 3.36 shows a pipeline diagram of a sequence of instructions. We mark the cycle number at the top from 1 to 9, and each cycle can have at most five instructions running at different stages at the same time, since our datapath has five stages.

3.5 Hazards

Pipelining is efficient, but if we pay a little attention to the details, we'll realize it can lead to wrong executions. Based on the errors they lead to, we categorize hazards into two types, data hazards and control hazards, and we'll discuss them in detail in this section.

3.5.1 Data Hazard

A data hazard occurs when wrong values are read or written in the pipeline. Let's look at a simple example in Figure 3.37.

We assume that in the beginning, we have the following register values:

   X0   X1   X2   X3   X11   X12
   50   10   10   20  -125   180

At cycle 3, instruction 1, SUB , has read the register values from X1 and X3 , so at the end of the EX stage, ALUout is the result of R[1]-R[3] = -10 , which will be written to X2 at the end of the WB stage. As seen in Figure 3.37, the result -10 is carried through ME and WB, and X2 is only updated at the end of cycle 5. Before that, X2 's value is always 10, the old value.

Now this is the key point of the data hazard. At cycle 4, instruction 3, the ADD that uses X2 as one of its source operands, is already in the ID stage, where the old value of X2 is read. In other words, the new value hasn't

  cycle:                  1  2  3  4  5  6  7  8  9
  1: SUB X2, X1, X3      IF ID EX ME WB
  2: ADD X10, X11, X12       IF ID EX ME WB
  3: ADD X0, X0, X2              IF ID EX ME WB
  4: STR X20, [X20,20]               IF ID EX ME WB

Figure 3.37: A sequence that can lead to a data hazard, due to dependencies between instructions. Register X2 in the first instruction is the destination, but also one of the source operands in the third instruction. At cycle 4, instruction 3 has already read X2 's old value, but it hasn't been updated by instruction 1 yet.

even been updated yet. This wrong value is carried into the EX stage, and surely the calculation result is wrong: it's supposed to be 40, but instead we get 60.

Had we used the sequential model, this problem would have been avoided, because in the sequential model an instruction starts fetching and reading registers only after the previous instructions have written back to the registers.

3.5.1.1 Stalling the Pipeline Manually

Our first solution is the most straightforward one—stall the pipeline. Re-
call that the problem we have is the instruction that needs to read an up-
dated register starts executing too fast. Then why not let it wait one cycle
or two? That’s exactly how stalling works.

Following the example above, let's see how we can manually implement stalling. Simply put, what we want is to delay the execution of the ADD instruction, but we have two questions. First, delay to which cycle? As in Figure 3.37, we see that in cycle 5 the SUB instruction has finished updating X2 . Because register reading is combinational — the read port reads the updated value as soon as the register value has changed — as long as we align instruction 3's ID stage with instruction 1's WB stage, we're good.

The second question is what to insert: the pipeline constantly fetches instructions from PC, so we need to add some instruction between instructions 1 and 3. It cannot, however, do anything other than keep the stages busy, and it cannot interfere with the logic of the program.

  cycle:                  1  2  3  4  5  6  7  8
  1: SUB X2, X1, X3      IF ID EX ME WB
  2: ADD X10, X11, X12       IF ID EX ME WB
  3: STR X20, [X20,20]           IF ID EX ME WB
  4: ADD X0, X0, X2                  IF ID EX ME WB

Figure 3.38: Because there's no dependency on X2 in the STR instruction, we swap it with ADD to align ADD 's ID stage with SUB 's WB stage.

Solution 1: Rearrange instructions. This is a useful trick, but it is not applicable to all programs. For the program in the above example, if we swap instructions 3 and 4, the problem is solved. See Figure 3.38.

Again, in some cases, this solution cannot be used. For example, for a
sequence such as the following:

ADD X2, X2, X0
SUB X2, X2, SP
LDR X0, [X2,48]

this trick doesn't work anymore, because there is a data dependency between every pair of instructions; no instructions can be swapped.

Solution 2: Insert NOP instructions as necessary. NOP means “no operation”; its only purpose is to “waste” some cycles so that some instructions can be delayed. This instruction does nothing but take a place in the pipeline stages.

In Figure 3.39 we added one NOP between the two ADD instructions. We only need one NOP in this example, because as long as ADD 's ID stage is aligned with SUB 's WB stage, we're all good. More NOP s would only waste cycles unnecessarily. With this, we can see that the position of the NOP is not unique — we could just as well add the NOP right after the SUB instruction, and it would have the same effect of delaying the pipeline.

  cycle:                  1  2  3  4  5  6  7  8  9
  1: SUB X2, X1, X3      IF ID EX ME WB
  2: ADD X10, X11, X12       IF ID EX ME WB
  3: NOP                         IF ID EX ME WB
  4: ADD X0, X0, X2                  IF ID EX ME WB
  5: STR X20, [X20,20]                   IF ID EX ME WB

Figure 3.39: Inserting a NOP instruction allows us to delay instructions that depend on the completion of earlier instructions. In this example, to align SUB 's WB and ADD 's ID stages, we only need to add one NOP . Certainly in some cases more NOP s may be needed.

One thing we’d like to mention about NOP . In the detailed boxes in Figure
3.39, notice in cycles 4 and 5, NOP ’s ID and EX stages are exactly like
the ones in its previous ADD instruction. This is because, like we said,
NOP instruction does not do anything in the pipeline—meaning it doesn’t
change or alter what’s already there. When its previous instruction moves
along the pipeline to the next stage, the data/control signals stay there
unless its next instruction moves in and overwrites them. NOP doesn’t
do anything, so the data will not be changed.

Remarks. Note that the two solutions presented above are not mutually exclusive. Sometimes re-arranging the instruction sequence is not enough, and so we'd have to insert NOPs as well.

˛ Quick Check 3.4


(Fall 21 Midterm 2) Rearrange the following instructions to avoid
data hazard. You can insert NOP s as needed.

1 LDR X1, [X0, 24]


2 ADD X2, X1, X0
3 SUB SP, SP, 16
4 BL proc

Note: explain why you rearrange the instructions the way you did.
Without explanation you’ll only get half of the points.
B See solution on page 115.

3.5.1.2 Stalling the Pipeline Automatically

It’ll be really exhausting if resolving data hazards relies on programmers


entirely—the programmers should focus on the task itself, and let the ma-
chine resolve these hazards. Therefore, we can create a combinational cir-
cuits called hazard detection unit to automatically detect data dependen-
cies between stages, and stall instructions as necessary.

Let’s consider the example presented in Figure 3.40. At cycle 4, instruction


1 is at ME stage while instruction 3 at ID stage. There’s data dependency
between them, so we need to make sure instruction 3 reads the correct
value of X1 . After detecting this hazard, instruction 3 has been stalled
in the ID stage in the next cycle. Because at cycle 5, instruction 3 is still
at ID stage, its following instruction STR is held in IF stage as well. Our
pipeline has five stages and they should all be occupied. If instruction 3
cannot move to EX stage, what should we put in there? The processor will
automatically insert a “bubble”, which is basically a NOP instruction, and
it’ll pass EX, ME, and WB stages in the following cycles.

Inserting one bubble is not enough, though, because we notice there’s an-
other data hazard for X2 at cycle 5. In this case, we keep stalling instruc-
tions 3 and 4, and insert another bubble. At cycle 6, both X1 and X2 have
been updated, so instruction 3, which is currently held at ID stage, can read
both X1 and X2 successfully. Thus, from cycle 7, we move instructions
3 and 4 along the pipeline, and the data hazard has been resolved.

Now let’s analyze in what situation will there be data hazard. From previ-
ous examples and Figure 3.40, we have seen that typically, if one instruc-
tion’s trying to read a register that hasn’t been updated from previous in-
structions, there’s data hazards. In other words, the instruction at ID stage
relies on its previous instructions whose RegWrite control signal is 1, and
yet they haven’t reached to the WB stage. Such instructions are either in
EX or ME stages. Using description language, we can have the following
condition to check data hazards:

1 if IDEX.RegWrite:
2 if IDEX.WriteReg == IFID.ReadReg1 or \
3 IDEX.WriteReg == IFID.ReadReg2:
4 # Stall the pipeline
5 if EXME.RegWrite:
6 if EXME.WriteReg == IFID.ReadReg1 or \
7 EXME.WriteReg == IFID.ReadReg2:
8 # Stall the pipeline

With this pseudocode, implementing a hazard detection unit is very straightforward — we just need to create a combinational circuit that can stall

Figure 3.40: As long as there's a data hazard, we'd postpone the instruction and its following instructions by stalling them in place, and let NOPs move along the pipeline. Once all data hazards have been resolved, stalled instructions can restart. The four snapshots in the figure show: ❶ a data hazard is detected: X1 is needed in instruction 3, but hasn't been updated by instruction 1 yet; ❷ instruction 3 is stalled in the ID stage for cycle 5, its following instruction is held in the IF stage, and a "bubble" is inserted into the EX stage; ❸ the hazard on X1 has been resolved, but there's another data hazard for X2, so instructions 3 and 4 are stalled for another cycle; ❹ by cycle 6 all data hazards have been resolved, so instructions 3 and 4 can move on.

the pipeline. In Figure 3.41, we added a hazard detection unit, which receives signals from multiple stages.

Once data hazard detected, this hazard detection unit does two jobs:

1. Stall the instructions currently at the ID and IF stages. If a data hazard has been detected, it'll prevent PC and the values in the IFID register from being updated. The instruction at the IF stage will be fetched again, since PC is not updated. The instruction at the ID stage will stop updating the IFID register's values from the next instruction, and will stay the same. Thus, both instructions are held without moving along the pipeline;

2. Insert a bubble into the EX stage. Before the EX stage starts, the EX, ME, and WB signals in IDEX represent the instruction that we want to stall. If there's a data hazard, we cannot let them move to the next stage. Therefore, instead of passing the control signals generated by the control unit to IDEX directly, we send them to a multiplexer, and let the hazard detection unit decide if those signals can be passed to IDEX. If there's a data hazard, the multiplexer passes all-zero signals to IDEX instead, which act as a NOP instruction.
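These two jobs can also be sketched in the description language. The function below is our own simplified model, with pipeline registers as plain dictionaries and everything outside the stalling behavior omitted:

```python
# Our own sketch of one clock edge with stalling. On a detected hazard,
# PC and IFID keep their old values, and IDEX receives all-zero control
# signals (a bubble, i.e., a NOP).
def clock_edge(state, hazard_detected, fetched, decoded, control):
    if hazard_detected:
        # Job 1: hold PC and IFID (they simply keep their current
        # values), so the IF and ID stages repeat next cycle.
        # Job 2: insert a bubble into EX by zeroing the control signals.
        state["IDEX"] = {"EX": 0, "ME": 0, "WB": 0}
    else:
        state["PC"] += 4                          # move to the next instruction
        state["IFID"] = fetched                   # IF -> ID
        state["IDEX"] = dict(decoded, **control)  # ID -> EX with real signals
    return state

state = {"PC": 100, "IFID": {"opcode": "ADD"}, "IDEX": {}}
state = clock_edge(state, hazard_detected=True,
                   fetched=None, decoded=None, control=None)
print(state["PC"])    # still 100: the instruction at IF is re-fetched
print(state["IDEX"])  # {'EX': 0, 'ME': 0, 'WB': 0}: a bubble
```

The dictionary fields and function signature are our own modeling convention; the hardware equivalents are write-enable signals on PC and IFID, plus the multiplexer in front of IDEX.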

Figure 3.41: The hazard detection unit receives signals from multiple stages (representing multiple instructions), stalls the instructions currently at the ID and IF stages, and inserts a bubble into the EX stage.

3.5.1.3 Forwarding Data

Stalling a pipeline can surely avoid data hazards, but it also slows down the pipeline. In many cases, forwarding data right back from the EX stage is faster. Thus, we create another combinational circuit called the forwarding unit, as shown in Figure 3.42. The main function of this unit is to detect if there's a data hazard. If there is, it'll overwrite the data read from the registers. This way, a new value will be used as the input operand(s) of the ALU, instead of the old, not-yet-updated values from the register read.

For the two operands of the ALU, we use a four-way multiplexer to select one of three possible inputs: one from the ALU result of the previous instruction, EXME.ALUout ; one from the written-back value RegDataW ; and one from the register read, either RegData1 or RegData2 .

Since it’s a four-way multiplexer (see Section 3.1.2.2), we’d need two bits
of signals for selection, which are also the output of the forwarding unit.
We call them FA and FB , used for operands A and B of ALU. Thus, we
can establish a truth table for the multiplexers as follows:

     Value   Forwarding source   Explanation
FA   00      IDEX.RegData1       A is from the register file
     01      EXME.ALUout         A is forwarded from previous ALU result
     10      RegDataW            A is forwarded from RegDataW
FB   00      IDEX.RegData2       B is from the register file
     01      EXME.ALUout         B is forwarded from previous ALU result
     10      RegDataW            B is forwarded from RegDataW

The description language for the multiplexer can be written this way:

1 if FA == 0b00: inputA = RegData1


2 elif FA == 0b01: inputA = EXME.ALUout
3 else: inputA = RegDataW

which is the same for operand B as well.

Here are some details we still need to work on:

A. Conditions for Forwarding


Now that the outputs of the forwarding unit and their situations have been identified, the next step is to consider when to forward, and design corresponding conditions for the unit. Let's focus on the case where FA==01 first, i.e., the operand is forwarded from the previous ALU result. If the ALU result of instruction i will be used as one of the operands for the following instruction i+1 , apparently i 's WriteReg matches one of the operands of i+1 . Therefore, we have:⁹

9: We only write conditions for operand A; those for operand B can be easily derived.
1 if EXME.RegWrite and \
2 EXME.WriteReg == IDEX.ReadReg1 and \
3 EXME.WriteReg != 31:
4 FA = 0b01

Notice in the condition, we added EXME.WriteReg != 31 , because


X31 is XZR and it will never be overwritten, and thus there’s no
data hazard involved with this register.
The second possible condition is FA==10 , meaning operand A is for-
warded from RegDataW , which is one of the three signals— MEWB.nextPC ,
MEWB.ReadDataM from memory read, and MEWB.ALUout from pre-
vious ALU result. Following the code above, we can keep checking
the conditions as follows:

1 elif MEWB.RegWrite and \


2 MEWB.WriteReg != 31 and \
3 MEWB.WriteReg == IDEX.ReadReg1:
4 FA = 0b10
5 else:
6 FA = 0b00

B. Multiple Possible Forwarding


Notice in the code listing above we put the checking from ME stage
in elif instead of an independent if statement. It means that if
we can receive forwarded data from EX stage, we will not consider
the possibility from ME stage anymore.
Look at the following example:

1 ADD X1, X1, X2


2 ADD X1, X1, X3
3 ADD X1, X1, X4

When the third ADD instruction is at the ID stage, we have new X1 values forwarded from both the EX and ME stages, corresponding to the results of the second and the first instructions. So which one should we use? Apparently we should use the newest one, i.e., the one calculated by the second ADD . By using elif for the ME stage condition, we successfully avoid this problem.
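Putting the conditions for both operands together (including the symmetric checks for operand B that the note above leaves as an exercise), the forwarding unit's logic can be written as a runnable sketch. Modeling the pipeline registers as dictionaries is our own convention:

```python
# Runnable sketch of the forwarding unit's combinational logic. Pipeline
# registers are dictionaries; field names follow the text. Returns the
# two-bit selects (FA, FB) for the ALU operand multiplexers.
def forward(idex, exme, mewb):
    def select(read_reg):
        if exme["RegWrite"] and exme["WriteReg"] != 31 \
                and exme["WriteReg"] == read_reg:
            return 0b01  # forward from the previous ALU result
        elif mewb["RegWrite"] and mewb["WriteReg"] != 31 \
                and mewb["WriteReg"] == read_reg:
            return 0b10  # forward from RegDataW
        else:
            return 0b00  # use the value read from the register file
    return select(idex["ReadReg1"]), select(idex["ReadReg2"])

# The three-ADD example above: when the third ADD is decoded, X1 must
# come from the EX/ME register (the newest value), not from ME/WB.
idex = {"ReadReg1": 1, "ReadReg2": 4}
exme = {"RegWrite": 1, "WriteReg": 1}
mewb = {"RegWrite": 1, "WriteReg": 1}
print(forward(idex, exme, mewb))  # (1, 0): FA = 01, FB = 00
```

Note the elif gives the EX/ME check priority over ME/WB, exactly as discussed, and the != 31 tests exclude XZR.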

C. Removing Wires for Stalling

After adding the forwarding unit, we can see that it is a faster option, because it doesn't waste cycles or slow down our pipeline. However, there's one situation in which forwarding doesn't work. Let's look at the following sequence:

1 LDR X0, [X1, 48]


2 ADD X2, X0, X0

where the destination register of LDR is exactly both operands for the next instruction. When the ADD instruction is at the ID stage and needs to read the correct value of X0 , LDR is still reading the memory, and will not extract the correct value of X0 until the next cycle. However, after LDR reads the correct value and moves to the WB stage, ADD has already moved to the EX stage, which uses the wrong value of X0 . Forwarding in this case fails, and apparently the only solution is to insert a bubble between the two instructions.

As shown in Figure 3.42, we removed some of the wires for the hazard detection unit, because the forwarding unit has been added and it's more efficient. We did save one case for it, where it receives IDEX.MemRead and WriteReg , just to deal with the load-use situation we discussed above. Thus, we modify the description language in Section 3.5.1.2:

1 if IDEX.MemRead:
2 if IDEX.WriteReg == IFID.ReadReg1 or \
3 IDEX.WriteReg == IFID.ReadReg2:
4 # Stall the pipeline

Figure 3.42: The forwarding unit will take signals from ME and WB stages, and overwrite ALU operands.

3.5.2 Control Hazard

Control hazard refers to the wrong execution of programs (not control sig-
nals), which is mainly due to branching. Consider the code sequence:

1 CBZ X1, .L1


2 AND X12, X2, X5
3 ORR X13, X6, X2
4 ADD X14, X2, X2
5 .L1: LDR X4, [X7, 48]

If X1 is not zero, everything is fine; but if it is zero, after executing CBZ we should execute LDR . Now put this into the pipeline, and we'll see the problem. The value of X1 can only be determined at the EX stage.¹⁰ At this time, the instructions AND and ORR have already been put into the pipeline. If X1 is zero, we have two instructions in the pipeline that are not supposed to be there.

10: X1 is read at the ID stage, but whether it's zero or not cannot be detected until the EX stage, when the ALU receives X1 and produces the zero flag.

A simple solution is just to cancel the wrong instructions in the pipeline when we notice there's a control hazard. This operation of "canceling" is called a flush. This is pretty much like inserting bubbles and NOPs, which prevents wrong instructions from moving on to the next stages in the pipeline. See Figure 3.43.

Figure 3.43: At cycle 3, the value of X1 has been determined. Assume it is zero; then we need to branch to .L1 . The two instructions AND and ORR that are already in the pipeline will not continue executing; instead, we flush them and let NOPs move along the pipeline.

3.5.2.1 Branch Prediction

The inserting-bubble method we used above seems to work fine; it wastes only two cycles, and only about 50% of the time (when the branch is actually taken). Notice, however, that the reason we waste only two cycles is that we have five stages, and the value of a register can be confirmed as early as the third stage. Modern processors have longer and deeper pipelines, i.e., more stages. If a processor has 𝑁 stages, and the value of a register can be confirmed at the 𝑁/2 stage (in the middle), then we'll have to waste 𝑁/2 cycles. For a large 𝑁, this apparently is not ideal at all. In other words, the penalty for control hazards becomes intolerable as the pipeline deepens. Therefore, we introduce two common practices for modern processors: static and dynamic prediction.

3.5.2.2 Static Prediction

Static prediction is based on the characteristics of the programs, and hap-


pens at compilation time. Several simple methods include:

 Always taken;
 Always not taken;
 BTFN: backward taken, forward not taken. For example, at the end
of a loop body, we always predict that it’ll branch back, because a
loop typically runs multiple iterations. If the branch is not part of a
loop but to a procedure or other part of the code, we always predict
that it’ll not take the branch. In other words:
• Predict branch taken if it’s going backwards;
• Predict branch not taken if it’s going forwards.

One typical problem is to calculate the misprediction rate of a certain strategy, which is calculated by

    MR = (Total number of wrong predictions) / (Total number of predictions)    (3.2)

ą Example 3.4
The following is the example from Section 2.5.4.1, and assume the
string is qwert . Please calculate the misprediction rate using dif-
ferent static prediction strategies: always taken, always not taken,
and BTFN.

1 MOV X11, 0 // index i = 0;


2 Loo: LDRB W12, [X9, X11] // Load one byte *(str+i);
3 // LDRSB is fine too;
4 CMP W12, 0 // compare *(str+i) and 0;
5 B.EQ End // if(*(str+i)==0) goto End
6 ADD W12, W12, 2 // else W12 = W12 + 2;
7 STRB W12, [X9, X11] // Update *(str+i);
8 ADD X11, X11, 1 // i ++;
9 B Loo // Branch back
10 End:

Solution: It is easier to write out all the actual branching for all
iterations first, and then compare them with different branch
prediction strategies. One thing to be aware of is that only condi-
tional branching causes control hazard; unconditional branching
instructions such as B , BL , and RET do not cause control hazard.
In this example, therefore, we only need to consider B.EQ .

The actual branching outcomes are:

N N N N N T
↓ ↓ ↓ ↓ ↓ ↓
q w e r t '\0'

where N and T stand for not-taken and taken, respectively.

For the strategy of "always-taken", the prediction sequence is TTTTTT , so the misprediction rate is 5/6. For "always-not-taken", the predictions are NNNNNN , so MR = 1/6. Lastly, for BTFN, we notice that B.EQ is going forward, so all predictions should be not taken. Thus, the misprediction rate for BTFN is the same as "always-not-taken", which is 1/6.

3.5.2.3 Dynamic Prediction

Dynamic prediction uses hardware to record the behavior of every branch — whether it's taken or not — and makes decisions based on the current status; this happens during run time. There are two typical methods: the Last-time Predictor and the Two-bit Predictor.

The last-time predictor indicates which direction the branch went the last time it executed. For example, if the program actually took the branch last time, this time it will predict taken as well. If there's a loop with 𝑁 iterations, it is easy to see that only the first and last iterations will have a misprediction, so MR = 2/𝑁.

The problem with the last-time predictor is that it changes state too quickly. A two-bit predictor is introduced to address this problem, at the cost of more complicated hardware. It records the branch behavior and predicts the branch based on its past two predictions and the actual behaviors. Figure 3.44 shows the state machine for the two-bit predictor.
3.44 shows the state machine for two-bit predictor.
Assume in iteration 𝑖, the predictor is at the strongly taken state in Figure 3.44 and predicts the branch is taken, but it's actually not taken. In iteration 𝑖 + 1, instead of predicting not taken immediately as the last-time predictor would, the two-bit predictor will enter the weakly taken state and still predict taken. If it's still a misprediction this time, i.e., actually not taken, then in iteration 𝑖 + 2 it will predict not taken, having entered the weakly not taken state.

Research found that the two-bit predictor has only a 10%–15% misprediction rate, which is a huge boost.

Figure 3.44: Two-bit predictor. Its four states are Strongly Taken, Weakly Taken, Weakly Not Taken, and Strongly Not Taken; each actual "taken" outcome moves the state one step toward Strongly Taken, and each actual "not taken" outcome moves it one step toward Strongly Not Taken.
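The state machine in Figure 3.44 can be simulated directly. In the sketch below (our own encoding), the state is a saturating counter: 0 and 1 are the not-taken states, 2 and 3 the taken states, with 3 being strongly taken:

```python
# Two-bit (saturating counter) predictor: 0 = strongly not taken,
# 1 = weakly not taken, 2 = weakly taken, 3 = strongly taken.
def simulate(outcomes, state=3):
    predictions = []
    for actual in outcomes:  # outcomes: one "T"/"N" per executed branch
        predictions.append("T" if state >= 2 else "N")
        if actual == "T":
            state = min(3, state + 1)  # move toward strongly taken
        else:
            state = max(0, state - 1)  # move toward strongly not taken
    return predictions

# Starting strongly taken and seeing two not-taken outcomes in a row:
# predict T (drop to weakly taken), predict T again (drop to weakly not
# taken), then finally predict N -- the scenario described above.
print(simulate("NNN"))  # ['T', 'T', 'N']
```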
˛ Quick Check 3.5
Assume the following piece of code that iterates through two large
arrays, j and k , each populated with completely random pos-
itive integers. The code has two branches (labeled B1 and B2 ).
When we say that a branch is taken, we mean that the code inside
the curly brackets is executed. Assume the code is run to comple-
tion without any errors. For the following questions, assume that
this is the only block of code that will ever be run, and the loop-
condition branch ( B1 ) is resolved first in the iteration before the
if-condition branch ( B2 ). N and X are unspecified non-zero in-
tegers.

1 for (int i = 0; i < N; i++) { /* B1 */


2 /* TAKEN PATH for B1 */
3 if (i % X == 0) { /* B2 */
4 j[i] = k[i] - i; /* TAKEN PATH for B2 */
5 }
6 }

You are running the above code on a machine with a two-bit pre-
dictor, initialized to the state of Strongly Taken.

1. Assuming that 𝑁 is larger than 10, after running the loop for
10 iterations, you observe that the branch predictor mispre-
dicts 0% of the time. What is the value of 𝑋 ?
2. What is the prediction accuracy of the branch predictor if
𝑁 = 20 and 𝑋 = 2?

B See solution on page 116.

3.6 Performance Evaluation

Throughout this chapter, we have seen that in CPU design it is important to control the period of each clock. Each clock cycle should be long enough to accommodate all stages, but not so long as to introduce unnecessary delays. Thus, in performance evaluation, we first care about the clock period, which is the time spent on each clock cycle.

What we are most familiar with, especially when buying a computer, is the reciprocal of the clock period, called the clock rate or clock frequency:

    clock rate = 1 / clock period.    (3.3)

This measurement denotes how many cycles a CPU can run each second, and its unit is Hz. For example, for a computer with a clock period of 250 ps = 2.5 × 10⁻¹⁰ s, its clock rate is

    clock rate = 1 / (2.5 × 10⁻¹⁰ s) = 4 × 10⁹ Hz    (3.4)
               = 4 × 10³ MHz                         (3.5)
               = 4 GHz.                              (3.6)

What does this number even mean? Consider computer A with 4GHz
clock rate and B with 2GHz. The number tells us that computer A can
have 4 × 109 cycles in each second, while B has only 2 × 109 cycles per
second. If both A and B are using the CPU we designed in this chapter,
meaning they have five stages, we see that computer A can load twice
the amount of instructions than B in each second. Thus, given the same
amount of time, A can accomplish more tasks than B.
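As a quick sanity check of the conversion in Equations (3.4)–(3.6), here is a tiny sketch (the helper name is ours):

```python
# Clock rate is the reciprocal of the clock period (Eq. 3.3).
def clock_rate_hz(period_seconds):
    return 1.0 / period_seconds

rate = clock_rate_hz(250e-12)  # a 250 ps clock period, as in the example
print(rate / 1e9)              # about 4, i.e., 4 GHz
```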

A more straightforward measurement is to simply run the same program on two computers, and see which one runs faster. Let's assume all other factors are the same for both computers, and the only thing that's affecting their performance is the CPU design. The time spent on a program
is called CPU time. Think about this — to execute a program is basically
to execute a sequence of assembly instructions. If a program consists of
𝐼 instructions, and each instruction needs CPI cycles to finish, and each
cycle takes 𝐶 seconds (the clock period) to finish, the CPU time is simply
𝐼 × 𝐶𝑃 𝐼 × 𝐶 .

Thus, we derive the equation for the CPU time of a program:

    CPU time = 𝐼 × 𝐶𝑃𝐼 × 𝐶               (3.7)
             = (𝐼 × 𝐶𝑃𝐼) / Clock rate    (3.8)

where 𝐼 is the number of instructions in the program, 𝐶𝑃𝐼 the number of clock cycles per instruction, and 𝐶 the clock period.

This equation also outlines the factors that can improve a computer’s per-
formance:

 For CPI, we prefer a smaller number of cycles for each instruction. Take our previous CPU design as an example. For our pipelined design, because each of the five stages costs one cycle, the CPI is 5. However, in our non-pipelined design, each instruction goes through all five stages in one cycle, so its CPI is 1. So why don't we just use one cycle and the non-pipelined design? Because there's another factor in the equation: clock rate;

 For clock rate, the higher the better, because that means given one
second we can process more instructions. Again, for our non-pipelined
design, because one cycle needs to deal with all five stages, each cy-
cle needs much longer time, which greatly reduces the clock rate.
Therefore, for CPU designers, finding a balance between smaller
CPI but also larger clock rate is important to achieve better perfor-
mance. So as programmers, if we cannot improve the hardware de-
sign, what can we do? Optimize our algorithms!

 The only factor we can control as programmers is instruction count


𝐼 . Surely modern compilers are very smart—when they compile our
high-level language programs into assembly they optimize a lot to
reduce 𝐼 . However, no matter how smart a compiler is, it can never
optimize a linear search to compete with binary search, and that’s
why algorithms are important to improve performance as well, but
from a software perspective.

ą Example 3.5
We designed a processor A with 2GHz clock rate, and when we
run a performance test program on processor A, the CPU time is
10𝑠. Our competing company designed a processor B, and when
we run the same test program on processor B, it takes only 6𝑠. Our
CEO was not happy about this, so he hired a spy to find out what
makes their processor so fast. Unfortunately, the company keeps its secret so well that the only thing the spy can get is that their clock cycles per instruction 𝐶𝑃𝐼𝐵 is 1.2 times our 𝐶𝑃𝐼𝐴. So how fast must computer B's clock be?

Solution: Let 𝐶𝑃𝐼𝐴 be the cycles per instruction for processor A. Based on the question, we can easily get the total number of instructions in the program:

    𝐼 = (CPU time × Clock rate) / 𝐶𝑃𝐼𝐴    (3.9)
      = (10 s × 2 GHz) / 𝐶𝑃𝐼𝐴             (3.10)
      = (2 × 10¹⁰) / 𝐶𝑃𝐼𝐴.                (3.11)

Because 𝐶𝑃𝐼𝐵 = 1.2 ⋅ 𝐶𝑃𝐼𝐴, the clock rate of processor B can be calculated as:

    Clock rate𝐵 = (𝐼 × 𝐶𝑃𝐼𝐵) / CPU time           (3.12)
                = (𝐼 × 1.2 × 𝐶𝑃𝐼𝐴) / 6 s          (3.13)
                = (𝐼 × 𝐶𝑃𝐼𝐴 × 1.2) / 6 s          (3.14)
                = (2 × 10¹⁰ × 1.2) / 6 s = 4 GHz.  (3.15)

The calculation above is a simplified version, though, because in modern


processors, not all instructions take the same number of cycles to finish.
For example, a floating point calculation takes longer than an integer one.
Therefore, a more detailed measurement should be used.

Based on the different numbers of cycles instructions take, we separate them into different classes. Assume there are 𝐾 instruction classes, and for a program there are 𝐼𝑘 instructions of class 𝑘. The clock cycles needed for this program can be calculated as:

    Clock cycles = ∑_{k=1}^{K} 𝐼𝑘 ⋅ 𝐶𝑃𝐼𝑘    (3.16)

The weighted average CPI can be calculated as

    𝐶𝑃𝐼 = Clock cycles / 𝐼                    (3.17)
        = (∑_{k=1}^{K} 𝐼𝑘 ⋅ 𝐶𝑃𝐼𝑘) / 𝐼         (3.18)
        = ∑_{k=1}^{K} (𝐶𝑃𝐼𝑘 ⋅ 𝐼𝑘 / 𝐼)         (3.19)

where 𝐼 is the total number of instructions in the program, and 𝐼𝑘 / 𝐼 is called the relative frequency of instruction class 𝑘.
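As a concrete illustration of Equations (3.16)–(3.19), here is a sketch with an invented instruction mix; the class names, counts, and per-class CPIs are made up for the example:

```python
# Weighted-average CPI per Eq. (3.19). The instruction mix is invented.
from fractions import Fraction

classes = {  # class: (instruction count I_k, CPI_k)
    "ALU":    (50, 1),
    "Load":   (30, 5),
    "Branch": (20, 2),
}

total_i = sum(i for i, _ in classes.values())
clock_cycles = sum(i * cpi for i, cpi in classes.values())  # Eq. (3.16)
weighted_cpi = Fraction(clock_cycles, total_i)              # Eq. (3.17)

print(clock_cycles)  # 50*1 + 30*5 + 20*2 = 240
print(weighted_cpi)  # 240/100 = 12/5, i.e., 2.4 cycles per instruction
```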

3.7 Quick Check Solutions

Quick Check 3.1

Remember CMP Rn,Rm is to compare two registers, and set condition


codes. What it actually does, is to perform subtraction between Rn and
Rm, and send the result to XZR (register X31 ). In other words, it’s just an
alias for instruction SUBS XZR,Rn,Rm . Please translate CMP X1,X0 into
machine code based on the discussion above.

 Based on the description, CMP X1,X0 is equivalent to SUBS XZR,X1,X0 .


From Figure 3.16 we have the opcode: 11101011001. Register X1
has an encoding of 00001, X0 of 00000, and destination register
XZR of 11111. Putting them together, we have the encoding for
CMP X1,X0 : 11101011001 00000 111000 00001 11111

Quick Check 3.2

When looking at machine code, you found a binary sequence:

1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 1 0 1 0

What assembly instruction does it represent?

 The highest 11 bits are 11010001000. Based on Figure 3.17, the in-
struction that matches the opcode is SUB Rd,Rn,imm11 . The next
11 bits are 00000001000, which represents the immediate number 8.
The next 5 bits, 10010, are the operand register X18 , and the last 5 bits, 01010, the destination register X10 . Therefore, the exact instruction is: SUB X10, X18, 8 .

Quick Check 3.3

True or false, and why?

1. By pipelining the CPU datapath, each instruction will execute faster,


resulting in a speed-up in performance;
 False. Because we implement registers between each stage of
the datapath, the time it takes for an instruction to finish execut-
ing will be longer than the single-cycle datapath we were first
introduced with. A single instruction will take multiple clock
cycles to get through all the stages, with the clock cycle based
on the stage with the longest timing.
2. A pipelined CPU datapath results in instructions being executed
with higher latency and higher throughput.

 True. Recall that latency is the time for one instruction to finish,
while throughput is the number of instructions processed per
unit time. Pipelining results in a higher throughput because
more instructions are run at once. At the same time, latency
is also higher as each individual instruction may take longer
from start to finish because each cycle must last as long as the
longest cycle. Additionally, hazards may be introduced.

Quick Check 3.4

(Fall 21 Midterm 2) Rearrange the following instructions to avoid data haz-


ard. You can insert NOPs as needed.

1 LDR X1, [X0, 24]


2 ADD X2, X1, X0
3 SUB SP, SP, 16
4 BL proc

Note: explain why you rearrange the instructions the way you did. With-
out explanation you’ll only get half of the points.

 Notice the first two instructions have data dependency ( X1 ), so


we can switch ADD and SUB . After switching, however, it is not
enough to avoid data hazard because we need at least two cycles in
between. Therefore, we add NOP . The following is the assembly
code after re-arrangement.

1 LDR X1, [X0, 24]


2 SUB SP, SP, 16
3 NOP
4 ADD X2, X1, X0
5 BL proc

 One common mistake is to put ADD after BL :

1 LDR X1, [X0, 24]


2 SUB SP, SP, 16
3 BL proc
4 ADD X2, X1, X0

Indeed it looks like now between LDR and ADD there are enough
cycles to avoid data hazard. However, notice after SUB instruction
we branched to a procedure proc . From what's given in the question, it's never clear if we need to use X2 in the procedure. If we calculate X2 after returning, the procedure might have used the wrong value of X2 . Therefore, remember: when you re-arrange instructions, anything before BL should stay before BL . Same for
everything after BL . And of course here we assume there’s no data
hazard in proc .

Quick Check 3.5

Assume the following piece of code that iterates through two large arrays,
j and k , each populated with completely random positive integers. The
code has two branches (labeled B1 and B2 ). When we say that a branch
is taken, we mean that the code inside the curly brackets is executed. As-
sume the code is run to completion without any errors. For the following
questions, assume that this is the only block of code that will ever be run,
and the loop-condition branch ( B1 ) is resolved first in the iteration be-
fore the if-condition branch ( B2 ). N and X are unspecified non-zero
integers.

1 for (int i = 0; i < N; i++) { /* B1 */


2 /* TAKEN PATH for B1 */
3 if (i % X == 0) { /* B2 */
4 j[i] = k[i] - i; /* TAKEN PATH for B2 */
5 }
6 }

You are running the above code on a machine with a two-bit predictor,
initialized to the state of Strongly Taken.

1. Assuming that 𝑁 is larger than 10, after running the loop for 10 iter-
ations, you observe that the branch predictor mispredicts 0% of the
time. What is the value of 𝑋 ?
 𝑋 = 1. This is because the two-bit predictor is initialized to the Strongly Taken state, so both B1 and B2 have to be taken all the time for the first 10 iterations. The only condition that makes i % X == 0 for all i = 0, 1, 2, …, 9 is 𝑋 = 1.
2. What is the prediction accuracy of the branch predictor if 𝑁 = 20
and 𝑋 = 2?
 When 𝑁 = 20, in total we have to make 41 predictions: 21 for
B1 (where 20 for the actual iterations and 1 for the last check),
and 20 for B2 . Since 𝑋 = 2, let’s write out the branching
outcomes for the first few iterations and see if there’s a pattern:

              i = 0     i = 1     i = 2     i = 3
              B1   B2   B1   B2   B1   B2   B1   B2   ...
Outcome       T    T    T    N    T    T    T    N
Prediction    T𝑠   T𝑠   T𝑠   T𝑠   T𝑤   T𝑠   T𝑠   T𝑠

where we use T𝑠 and T𝑤 to indicate the states of Strongly Taken and Weakly Taken, respectively. From the table, it is clear that all branches are predicted to be taken, and a misprediction only happens at B2 when 𝑖 is an odd number. Since there are 10 odd numbers during the 20 iterations, we have 10 mispredictions.

 However, do not forget that after the last iteration, branch B1 is not taken but is predicted to be taken as well! Therefore, we have 10 mispredictions for B2 and 1 misprediction for B1 . So the misprediction rate is:

    MR = (10 + 1) / 41 ≈ 0.268,    (3.20)

and since the question asks for prediction accuracy, the answer is 1 − 0.268 ≈ 0.732.
4 Memory System

4.1 Memory Hierarchy
4.2 Cache Memory
4.3 Virtual Memory
4.4 Reference
4.5 Quick Check Solutions

In previous chapters we discussed the processor. Now we are going to move on to the memory. Even though it's not part of the processor, it plays a critical role in our datapath and processor design. Thus, in this chapter, we're going to learn about memory technologies.

4.1 Memory Hierarchy

Throughout Chapter 3, we learned that utilizing a pipeline in the datapath can greatly improve the efficiency and throughput of our processor. Specifically, since we separate the entire execution of an instruction into five stages, at its peak, we can have five instructions running at the same time.

Recall that in a pipelined datapath, the clock cycle is determined by the longest stage, which is usually the bottleneck of the entire system. For example, in our datapath, among the five stages, the ME stage, where we need to read or write memory, is the slowest. It is too slow, actually, compared to reading or writing registers. As shown in Figure 4.1, we see that the CPU cycle time is getting shorter, which is a good thing because it means the delay of each instruction is shorter. However, the memory access time (DRAM) didn't get much shorter, and so the gap between CPU and memory is getting larger. 1

1: The memory we usually talk about is DRAM (dynamic RAM). Later we will also introduce SRAM (static RAM), but it's not used for main memory.

To address this, let's start from a programmer's view, and then look at how hardware designers take advantage of this observation to design faster access devices.

[Figure 4.1 plots, against year, the disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time, in ns.]

Figure 4.1: The time gap between accessing DRAM/SRAM (the memory) and CPU cycle time is getting larger through the years. Figure borrowed from Computer Systems: A Programmer's Perspective.

4.1.1 Locality of Reference

Let's start with a very simple example, where we assume there's an integer array a of length n:

    int sum = 0;
    for (int i = 0; i < n; i++) sum += a[i];
    return sum;

This example sums over all the elements in the array a, using a for-loop to add each element to the variable sum. So far we should be pretty familiar with how arrays are stored in memory. In this example, all the elements in array a are stored next to each other. We are also familiar with the calculation that happens in the processor: the variable sum is an integer stored in memory, and each time we add a[i] to it, we need to load sum into the processor.

Our memory is a very large device that can store lots of data, but when we look at the example above, we notice:

- The data we're dealing with, i.e., a[i], are stored right next to each other, instead of being all over the place;
- The variables sum and i are used in each iteration, frequently.

These two characterize common patterns in our programs: spatial locality and temporal locality.

More formally, spatial locality means that data with nearby addresses tend to be used (referenced) close together in time, such as the array elements; temporal locality means that recently referenced data are likely to be referenced again in the near future, such as sum in our example.

Language Reference Patterns

Optimizing code with locality in mind is a good practice for programmers. Compare the following two functions in C language. Which one has good locality and why?

    int sum_array_rows(int a[M][N]) {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    int sum_array_cols(int a[M][N]) {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

[Figure 4.2 shows the row-major layout: a[0][0] … a[0][N-1] (0th row), then a[1][0] … a[1][N-1] (1st row), …, then a[M-1][0] … a[M-1][N-1] ((M-1)-th row), stored consecutively.]

Figure 4.2: Row-major languages store all elements in each row together.

To compare them, we need to know how two-dimensional arrays are stored. C is a row-major order language, meaning for a two-dimensional array, it stores rows next to each other, so the elements in the same row are grouped together. See Figure 4.2 for an illustration. Languages like FORTRAN are column-major order languages: they store the elements in each column together (Figure 4.3).

Since we're comparing C functions, we need to look at the row-major arrangement. Apparently, sum_array_rows() has good locality, because the inner loop accesses elements stored together. We usually call this stride-1 reference.

The inner loop of sum_array_cols() needs to skip over 𝑁 elements in each iteration, so the elements it accesses are far away from each other. It has a stride-𝑁 reference.

However, this is not to say sum_array_cols() does not have good locality in every language. In column-major order languages, it has better locality than sum_array_rows(). So knowing the language behavior is the first step in determining locality.

[Figure 4.3 shows the column-major layout: a[0][0] … a[M-1][0] (0th column), then a[0][1] … a[M-1][1] (1st column), …, then a[0][N-1] … a[M-1][N-1] ((N-1)-th column), stored consecutively.]

Figure 4.3: Column-major languages store all elements in each column together.

Once this observation has been made, smart hardware designers thought: if the data are used frequently and together, why don't we load all of them just once into another device that's faster than memory, and put them back into memory once we're done? That's exactly what caching does.

4.1.2 Caching and Hierarchy

The idea of caching is not new, actually. Think about this: our programs are stored on the hard drive, but why are they sent to memory when executing? From Figure 4.1, we see that disk seek time is almost 10^9 times slower than CPU cycle time, so if the data and the code we need to use in a program stayed on the hard drive, the delay would be intolerable. Thus, we store the code and data we're going to use in this program in memory, to accelerate the execution. We can think of memory as "caching" the disk.

The reality about storage devices is that usually the faster the device is, the more expensive it is to store one byte, and therefore we tend to store less data in that device. Registers are built inside the CPU, and so they are the fastest storage. Due to the cost, however, we cannot have a large number of registers, so in our ARM architecture there are only 32 of them. 2

2: Of course the design of the CPU is also one of the factors. It's entirely possible to have thousands of registers built into a processor, assuming you're a billionaire and can afford them. However, remember if you have those registers, you also need to build an entirely new instruction set architecture to operate them, which is already a large project. Even if you finished it, probably no one else could afford to use this processor because it's too expensive. After all, how many billionaires do we have in total on this earth?

Table 4.1: A typical memory hierarchy in detail.

Cache type            What's cached?        Where is it cached?   Latency (cycles)  Managed by
Registers             4–8 byte words        CPU core              0                 Compiler
TLB                   Address translations  On-chip TLB           0                 Hardware MMU
L1 cache              64-byte blocks        On-chip L1            4                 Hardware
L2 cache              64-byte blocks        On-chip L2            10                Hardware
Virtual memory        4-KB pages            Main memory           10^2              Hardware + OS
Buffer cache          Parts of files        Main memory           10^2              OS
Disk cache            Disk sectors          Disk controller       10^5              Disk firmware
Network buffer cache  Parts of files        Local disk            10^7              NFS client
Browser cache         Web pages             Local disk            10^7              Web browser
Web cache             Web pages             Remote server disks   10^9              Web proxy server

[Figure 4.4 shows a pyramid of storage devices: L0 registers at the top; L1, L2, and L3 on-chip SRAM caches; L4 main memory (DRAM); L5 local secondary storage (local disks, SSD, etc.); and L6 remote secondary storage (Web servers, cloud storage, etc.). CPU registers hold words retrieved from the L1 cache; each cache holds lines retrieved from the level below; main memory holds disk blocks; local disks hold files retrieved from remote storage. Devices toward the top are smaller, faster, and more expensive per byte; devices toward the bottom are larger, slower, and cheaper per byte.]

Figure 4.4: Memory hierarchy. Figure borrowed from Computer Systems: A Programmer's Perspective.

If we organize these devices in a picture, we'll have a pyramid-like diagram, shown in Figure 4.4, which is usually called the memory hierarchy. As we see, the device at level 𝑘 is a cache of the device at level 𝑘 + 1. As 𝑘 increases, the devices get cheaper and larger, but also slower. The device at level 𝑘 stores frequently used data from level 𝑘 + 1 as a buffer, so that the device at level 𝑘 − 1 can get faster access. If the data requested by level 𝑘 − 1 is not at level 𝑘, device 𝑘 will retrieve it from level 𝑘 + 1.

Table 4.1 also shows a more detailed hierarchy.

4.2 Cache Memory

All of our running programs reside in the main memory, including the
variables used in our programs. So far it should be very clear that in order

[Figure 4.5 shows the CPU chip (register file, ALU, and bus interface) connected by the system bus to an I/O bridge, which connects by the memory bus to main memory (DRAM).]

Figure 4.5: Traditional bus structure between CPU chip and main memory.

to do any computation, we need to load them from memory into the registers. However, from Table 4.1 we see that in order to retrieve data from main memory, we have to wait about 100 clock cycles. Therefore, we add a device called a cache on the processor, as a buffer between memory and the processor. 3 This cache is even more expensive than memory to produce, so we can only hold a small amount of data inside it. When we execute a program, the entire program is still stored in memory, but if we're going to deal with an array, we can copy the array into the cache, so the CPU doesn't have to go to memory each time it needs the data.

3: The term cache, or caché, has two meanings in our discussion. In general, it simply means "buffer". For example, "memory is a cache of the hard drive". It can also mean a specific type of device called a cache, which is used specifically between memory and CPU. Since these are based on static RAM (SRAM), we will call them SRAM caches when there could be some confusion. Otherwise we'll simply use the term cache, and it should be clear based on context.

In this section, we'll start with a brief overview of main memory and its operations, and then move on to the cache.

4.2.1 Memory Transactions

The memory we talked about is usually called Random Access Memory (RAM). It's called "random" not because its data are random; it means that given an address, we can get the data stored at that address right away. This is different from other types of storage devices, where we'd have to read from the beginning. 4

4: Think about cassettes, if you know what those are!

RAM is traditionally packaged as a chip, where each chip contains multiple cells. Each cell stores one bit of data. RAM comes in two varieties: static (SRAM) and dynamic (DRAM). The main memory we talk about is formed by connecting multiple such DRAM chips together. Thus, we will also sometimes use DRAM to refer to main memory.

Figure 4.5 shows a typical bus structure between the CPU chip and the main memory. The bus interface inside the CPU chip allows extension of the internal bus (from the register file to the bus interface) to connect with I/O devices as well as the main memory. There are also some other components in this structure:

- System bus contains three major parts: control bus, address bus, and data bus;
- I/O controller connects I/O devices to the system bus to be controlled and used by the CPU chip;
- Memory bus contains address, data, and control buses as well, which were covered earlier in Chapter 3. The role of the control bus here is to indicate whether this is a read or write transaction.

Read Transaction

The read transaction between the CPU and the main memory is mostly done by instructions such as LDR. For example, given an instruction LDR X10,[X9] , the value of X9 is read from the register file first. Then it's put on the system bus and memory bus through the bus interface and I/O bridge. Next, main memory retrieves the data stored at that address, puts the data on the memory bus, and transfers it back through the system bus to be written to register X10 .

Write Transaction

Similarly, the STR instruction invokes a write transaction. For example, for STR X10,[X9] , the value in X9 is put on the address bus, and the data in X10 on the data bus, to be transferred to the main memory.

[Figure 4.6 shows the same bus structure as Figure 4.5, with an SRAM cache added on the CPU chip between the register file and the bus interface.]

Figure 4.6: A CPU chip with one level of cache.

4.2.2 Adding Cache to the Processor

Since memory transactions are too slow for the processor, we add another type of memory called a cache on the CPU chip. This cache is formed by a collection of static RAM (SRAM), which is more expensive but faster. It's also smaller than the main memory, so it can only hold a tiny subset of the data we use in a program. 5 In Figure 4.6, we added a cache on the CPU chip between the register file and the bus interface.

5: Through the years, engineers realized that one level of cache is not enough, so they created three levels of cache, named L1, L2, and L3, as in Figure 4.4. In fact there are more levels, but here one is enough for understanding the concept.

Let's start with a toy example to see what this cache actually does. As in Figure 4.7, in the beginning, the cache is empty, and all the data needed are inside the main memory. Assume now the CPU executes an instruction LDR X0,[X1] . Because the data M[X1] is not in the cache, we have to go to the main memory to retrieve it. The key point is that when retrieving data from memory, we don't just bring the requested data back; instead, we bring a "block" of data back. In other words, in addition to copying the requested data, we also copy the data stored next to it back into the cache. For example, if we want to load the double word stored at addresses 0x1000 through 0x1007, we copy all 64 bytes from 0x1000 up to (but not including) 0x1040 back to the cache, which contains eight double words.

Next time, as long as the data requested by CPU are in that address range
(0x1000—0x1040), since they are already copied in the cache, there’s no
need to go to the main memory anymore; we can simply retrieve the data
from the cache. When this happens, we say it’s a hit.

[Figure 4.7 shows four snapshots of the CPU chip (SRAM cache, register file, ALU) and main memory (DRAM):]

❶ In the beginning, the cache is empty, and all the data needed are inside the main memory.
❷ The CPU requests data in main memory, but it's not in the cache, so the "block" that contains the requested data is copied into the cache.
❸ As long as the data requested by the CPU are in the cache (called a hit), we don't need to go to main memory anymore, and can simply go to the cache to retrieve the data.
❹ If the CPU requests data that's not in the cache (called a miss), the "block" that contains the data in the main memory will be copied into the cache, overwriting data in the cache.

Figure 4.7: An illustration of how cache works in a toy example.

However, if the data requested is not in the cache, we'd have to go to main memory again, copy the "block" back to the cache, and possibly overwrite data already in the cache. The case where the data requested is not in the cache is called a miss.

Now with this idea, it’s no wonder why we prefer good spatial locality in
our programs. Cache stores a chunk of memory data each time, so if every
LDR in our program is requesting data that are close to each other, we can
easily and rapidly get them just from the cache, without going further to
the main memory. This is also why we prefer stride-1 reference in arrays,
because cache copies consecutive elements in an array all at once.

4.2.3 Cache Organization

The "blocks" we mentioned earlier are, technically, called lines in the cache. In Figure 4.8 we show the general organization of a cache. A cache can be characterized by three parameters: 𝑆, 𝐸, and 𝐵. Each cache has 𝑆 = 2^𝑠 sets, where each set has 𝐸 = 2^𝑒 lines. In each line, there's a valid bit v to indicate whether the data stored in this line is valid or not. 6 A tag is also used for identification of the data. 7 Then there are 𝐵 = 2^𝑏 bytes that store the actual cached data we copied from the main memory. Thus, we see that the capacity of the cache is 𝑆 × 𝐸 × 𝐵 bytes.

6: For now we can just assume that if there's data stored in the line, it is valid, so the valid bit is set. Otherwise, it's invalid, and the valid bit is clear.
7: If this is not clear for now, keep reading; we'll have an example soon to help you understand.

As we know, when the CPU executes an instruction such as LDR X1,[X0] , it's requesting the data stored at the address indicated by X0 . When we

[Figure 4.8 shows the cache as 𝑆 sets, set 0 through set S-1; each set contains lines 0 through E-1, and each line holds a valid bit v, a tag, and data bytes 0 through B-1.]

Figure 4.8: A cache with 𝑆 sets, 𝐸 lines per set, and 𝐵 bytes per line. Note all the lines in the cache have the same structure as shown for line 0; for illustration purposes we only show the structure of line 0.

have a cache between the processor and the main memory, the first step
in this case is to see if the data is already in the cache. So given an address
issued by CPU, how can we get the corresponding data from the cache?

In fact, the address sent out by CPU can be chopped into three fields:

𝑡 bits 𝑠 bits 𝑏 bits


tag set index block offset

The first 𝑡 bits are used for the tag; next 𝑠 bits for indexing to a specific set;
and last 𝑏 bits for block offset.

Given an address, there are several steps to retrieve the data:

- Use the set index in the address to locate a specific set;
- Compare the tags in all the lines in that set with the one in the address:
  • If there's a line whose tag matches ours, we have a hit. The data is then located in that line, starting from the block offset;
  • If none of the lines has a matching tag, we have a miss. Then we'll need to go to the main memory and copy the data back.

Quick Check 4.1

Assume we have three caches with different organization parameters. In the following table, 𝑚 is the number of bits of a memory address, while 𝐶 is the capacity of the cache (in bytes). The rest of the parameters are the same as we used in this section. Please fill in the table below.

Cache   𝑚    𝐶       𝐵    𝐸    𝑆    𝑡    𝑠    𝑏
1       32   1,024   4    1
2       32   1,024   8    4
3       32   1,024   32   32

See solution on page 150.

In the following sections, we will discuss two special types of caches, and
use concrete examples to see how cache works.

4.2.3.1 Direct-Mapped Cache

The simplest cache is called direct-mapped cache, where each set has only
one line, i.e., 𝐸 = 1. We’ll create a toy direct mapped cache to show how
it works. In this toy example, we assume:

 The main memory uses 4-bit addresses, so it can store 16 bytes of


data. Each byte has an unique address;
 Our mini cache is direct mapped, so one line per set;
 The cache has in total 4 sets, so 𝑠 = 2;
 Each set can store 2 byte of data, so 𝑏 = 1;
 Each time we retrieve one byte of data.

Because each address has 4 bits, we use 𝑠 = 2 bits for set index and 𝑏 = 1
bit for block offset, which leaves us one bit for the tag, i.e., 𝑡 = 1. We’ll also
follow the convention from last chapter, and use M[x] to denote the data
in memory at address of x.
8: You can think that we are running
Now let’s start requesting data! Assume the CPU requests data at the a sequence of assembly instruc-
following addresses one at a time: 8 0b0000, 0b0001, 0b0111, 0b1000, tions such as LDRB W1,[0b0000] ,
0b0000. LDRB W2,[0b0001] , etc.

We first parse their addresses into tag, set index, and block offset as fol-
lows:

Address Tag Set index Block offset


0b0000 0 00 0
0b0001 0 00 1
0b0111 0 11 1
0b1000 1 00 0
0b0000 0 00 0

In the beginning, the cache is empty:

    set 00:  v=0  tag=–  data: –  –
    set 01:  v=0  tag=–  data: –  –
    set 10:  v=0  tag=–  data: –  –
    set 11:  v=0  tag=–  data: –  –

(The main memory holds M[0b0000] through M[0b1111] at addresses 0b0000 through 0b1111.)

0b0000 The first address requested is 0b0000. Based on our discussion,


we have a tag 0b0, set index 0b00, and block offset 0b0. Set 00 has
nothing in it, so we have a miss. We’ll go to memory address 0b0000
to retrieve both M[0b0000] and M[0b0001] back, because each line
in the cache can store two bytes. After this, our cache looks like this:

    set 00:  v=1  tag=0  data: M[0b0000]  M[0b0001]
    set 01:  v=0  tag=–  data: –  –
    set 10:  v=0  tag=–  data: –  –
    set 11:  v=0  tag=–  data: –  –

Now that we have the data in the cache, we will load one byte starting from offset 0 in the block. Thus, we take M[0b0000] back to the register file.
0b0001 The second address is 0b0001. Its set index is 0b00, so we index into set 00. At this point, we notice the valid bit is set, and its tag 0 matches the tag from our address, so we have a hit! Next, the block offset is 1, so we'll retrieve the data from the second byte in the line, which is M[0b0001].
0b0111 This time the set index becomes 0b11, so we have a miss. We go to the main memory, and copy both M[0b0110] and M[0b0111] back to the cache. 9 After copying the data from the main memory, we set the valid bit, and set the tag to 0. Since the block offset is 1, we again load the second byte in the line, M[0b0111], into the register file. The cache at this point looks like this:

9: Note here we didn't copy M[0b0111] and M[0b1000] back, because remember the last bit of an address in this example indicates the block offset. Given an address, we want to copy the data at the addresses with the same tag and same set index, but starting from block offset 0. This way we carry an entire line into the cache.

    set 00:  v=1  tag=0  data: M[0b0000]  M[0b0001]
    set 01:  v=0  tag=–  data: –  –
    set 10:  v=0  tag=–  data: –  –
    set 11:  v=1  tag=0  data: M[0b0110]  M[0b0111]

0b1000 This address has a set index of 0b00, so we index into set 00, and
proceed to compare the tag. The tag in the set 00 is 0, which doesn’t
match the tag in our address, so the data currently in this set is not
the one we want, and we have a miss. We need to go to the main
memory, and copy M[0b1000] and M[0b1001] back. Since each set
has only one line, we’ll have to evict the data currently residing in
the set, and replace it with the new data we just brought back from
the main memory. Thus, M[0b0000] and M[0b0001] are overwritten
by M[0b1000] and M[0b1001], and the tag is updated to 1. After
these operations, the cache looks like this:

    set 00:  v=1  tag=1  data: M[0b1000]  M[0b1001]
    set 01:  v=0  tag=–  data: –  –
    set 10:  v=0  tag=–  data: –  –
    set 11:  v=1  tag=0  data: M[0b0110]  M[0b0111]

Then we can send the first byte from the line back to the processor.
0b0000 Lastly, this address again indexes into set 00, but in the last access we replaced the data in set 00 with tag 1, so we have a miss again, unfortunately. We'll have to repeat the procedure described above: bring the data back from the main memory, overwrite the current data in the set, and update the tag.

4.2.3.2 A Real-World Example

In the following, to help you understand the concept, we'll use a real-world example. Of course, if the procedure of a cache read we discussed in the previous section is clear to you, you can skip this section.

Let's say you're working at a casino, and your job is to provide one of the eight cards to a customer: ♥00, ♥01, ♥10, ♥11, ♦00, ♦01, ♦10, and ♦11.

Those cards are stored in the stockroom, but you're working at the front desk. You realized that it would be too much hassle to go back to the stockroom every time a customer comes and asks for a card. You're very smart, so you prepared two boxes at the front desk: box 0 and box 1. When a customer asks for a card, you check if it's already in the box or not. If it is, then you just give it to the customer; otherwise you go back to the stockroom, bring it back, and put it in the box.

[Figure 4.9: For convenience, two boxes are used for storing two cards each. Box x only stores cards starting with digit x.]

The rule for using the boxes is simple. The first digit on the card represents which box. For example, if a card is 0x, you put it into box 0; if it's 1x, you put it into box 1. Because a box can fit two cards at a time, you decided to bring back the two cards with the same first digit. Thus, the second digit represents which card: x0 means it's the first card in the box; x1 the second card.

Now let's get started!

♥00 The first customer asked for a ♥00 card. Since you just started working, there was nothing in the boxes—you had a miss. You went back to the stockroom, brought both ♥00 and ♥01 back, and put them in box 0, because both of them start with 0. ♥00 means it's the first card in box 0, so you grabbed it and handed it to the customer. The customer used the card and gave it back to you;

[Figure 4.10: After ♥00 was requested.]

♦11 The next customer asked for a ♦11 card. You checked box 1 but nothing was there—you had a miss again. You went back to the stockroom and grabbed both ♦10 and ♦11, because both of them start with 1. After putting them in the box, you realized ♦11 is the second card in box 1, so you took it to the customer;
♥01 This customer wanted a ♥01. The first thing you checked was box 0, and yes, there are cards there. Then you needed to make sure the cards there are in the ♥ suit, and fortunately they are, so we have a hit! You gladly picked the second card in box 0 and handed it to the customer. No need to go back to the stockroom!
♦01 This time, a customer wanted a ♦01. You went to box 0, and there were indeed some cards there. However, when you compared their suits, you realized the ones in the box (♥) are not what was requested (♦), so you had a miss again. You had to go back to the stockroom and grab both ♦00 and ♦01. Box 0 can only fit two cards, so you swapped out the two cards in the box for the new cards you brought.

[Figure 4.11: After ♦01 was requested.]

In this example, a customer's request contains three pieces of information: suit, box number, and card number. If we treat the requests as three-bit addresses, and the cards as the data requested, we have the perfect analogy:

- suit → tag;
- box number → set index;
- card number → block offset.

4.2.3.3 Associative Cache

Table 4.2: Notations in cache.

𝐶   Cache capacity in bytes
𝑆   Number of sets
𝐸   Number of lines per set
𝐵   Number of bytes per line

In a direct-mapped cache, we saw that when there's a conflict (the tag of the address doesn't match the one in the line), we have to go to the main memory and replace the data in the cache. You might want to ask: what if each set, instead of having only one line, had two lines? One line has a tag of 0, while the other has 1. In that case, we don't need to evict any line, and both M[0b0000,0b0001] and M[0b1000,0b1001] will be placed in the cache. This is exactly the idea behind associative caches.

As discussed earlier, a cache can be characterized by a tuple of parameters (𝑆, 𝐸, 𝐵), and the capacity in bytes is 𝐶 = 𝑆 × 𝐸 × 𝐵. See Table 4.2 for a small summary. Given a fixed capacity 𝐶, we can adjust the three parameters to have different types of caches. When 𝐸 = 1, meaning each set has only one line, we have a direct-mapped cache; when 𝑆 = 1, we have a fully associative cache where we only have one set; any other combination of 𝑆, 𝐸, and 𝐵 is called an E-way set associative cache.

Let's look at an example. Assume the memory addresses are 4-bit, and the total capacity of the cache is 𝐶 = 8. We also assume each line can store two bytes of data, so the LSB of a memory address is always the block offset. The three possible cache organizations are shown in Figure 4.12.

[Figure 4.12 shows a direct-mapped cache (four sets of one line), a 2-way set associative cache (two sets of two lines), and a fully associative cache (one set of four lines).]

Figure 4.12: Three possible cache organizations with a total capacity of eight bytes, and with lines that can store two bytes of data each.

- Direct-mapped cache: Because 𝐸 = 1, we'll have 𝐶 = 8 = 𝑆 × 𝐵. Because 𝐵 = 2, we have 𝑆 = 4 sets in total. Thus, for a four-bit
4.2 Cache Memory 131

address, the MSB is the tag, the LSB the block offset, and the middle two bits are the set index;
- 2-way set associative cache: We can reduce the number of sets by half, which makes 𝑆 = 2 sets. To keep 𝐵 = 2 bytes per line, we have to have two lines per set to keep the capacity unchanged. Thus, 𝑆 = 𝐸 = 𝐵 = 2. Now that we only have two sets, we need one bit for the set index. The LSB is still the block offset, so the most significant two bits are used for the tag;
- Fully associative cache: We have only one set in this case, so there's no need to use any bit of the address as a set index. To keep the capacity unchanged, we need four lines, and since the LSB is the block offset and we don't need a set index, the most significant three bits of the address are used for the tag.

There are two possible problems that come with E-way set associative caches. Take the 2-way cache above as an example.

The first problem: if the data requested is not in the cache and there are empty lines in the set, which line should we put the data in? A simple solution is just to put it in the next available line.

Another problem arises when there's a conflict. Because each line has two bits for the tag, there are four possible tags for each set in total. However, because we only have two lines per set, there's a chance that neither line's tag matches the tag in the address. So we need to replace one line, but which line?

There are several simple algorithms. For example, we can evict the earliest used line, assuming it won't be used again very soon. We can also evict a random line and hope for the best. In our class, we'll replace the earliest used line.

Quick Check 4.2

The following problem concerns basic cache lookups.

- The memory is byte addressable;
- Physical addresses are 13 bits wide;
- The cache is 2-way set associative, with a 4-byte line size and 16 total lines.

In the following tables, all numbers are given in hexadecimal.

The contents of the cache are as follows:

2-way Set Associative Cache


0 1 2 3 0 1 2 3
Set Tag V Bytes Tag V Bytes
0 09 1 86 30 3F 10 00 0 99 04 03 48
1 45 1 60 4F E0 23 38 1 00 BC 0B 37
2 EB 0 2F 81 FD 09 0B 0 8F E2 05 BD
3 06 0 3D 94 9B F7 32 1 12 08 7B AD
4 C7 1 06 78 07 C5 05 1 40 67 C2 3B
5 71 1 0B DE 18 4B 6E 0 B0 39 D3 F7
6 91 1 A0 B7 26 2D F0 0 0C 71 40 10
7 46 0 B1 0A 32 0F DE 1 12 C0 88 37

Part 1

The box below shows the format of a physical address. Indicate (by
labeling the diagram) the fields that would be used to determine
the following:
O The block offset within the cache line
I The cache index
T The cache tag
12 11 10 9 8 7 6 5 4 3 2 1 0

Part 2

For the given physical address, indicate the cache entry accessed and the cache byte value returned in hex. Indicate whether a cache miss occurs. If there is a cache miss, enter "-" for "Cache Byte returned".
Physical address: 0x0E34

1. Physical address in binary (one bit per box)


12 11 10 9 8 7 6 5 4 3 2 1 0

2. Physical memory reference:


Parameter Value
Byte offset 0x
Cache Index 0x
Cache Tag 0x
Cache Hit? (Y/N)
Cache Byte returned 0x

B See solution on page 150.



4.2.3.4 Write Transactions

The operations we discussed above are limited to read transactions, i.e.,
LDR instructions, since they are the easiest way to demonstrate the idea behind
caches. In this section we will briefly discuss write transactions, but they're
not our focus.
The main issue with write transactions is due to multiple copies throughout
the entire system. Think about this: suppose the processor executes an
instruction STR X0,[0x1000] , which needs to write the data stored in X0
to memory address 0x1000. If the data stored at this address has already
been copied into the cache, the problem is: do we update only the
cache, only the memory, or both?
If we only update the cache, we will have two different values for the same
address in the cache and the memory. If later this cache line is replaced,
the new value will be gone, and the memory will still store the old value.
If we only update the memory, the delay is too long, and there will also be
data inconsistency: if the next instruction is LDR X0,[0x1000] , because
the data is already in the cache, we will only load the cached value into the
processor, and it'll be the old value.
For write transactions we also have hits and misses, i.e., depending on whether
the destination of the STR instruction has already been copied into the cache
or not, and the rule is the same as for read transactions.
We have two ways to deal with a write hit:

▶ Write-through: writes directly to the main memory as well as the cache;
▶ Write-back: updates the cache only first, and only updates the main memory when the line is replaced.

We also have two ways to deal with a write miss:

▶ Write-allocate: copies the line into the cache from the main memory first, and then updates the data in the cache;
▶ No-write-allocate: writes straight to the main memory without loading into the cache first.
Typically, designers follow a specific combination when dealing with write
transactions. The common combinations are write-through + no-write-allocate,
and write-back + write-allocate. With the first option, both the hit and
miss cases deal with the main memory, while the second option deals with
the cache first only, and only interacts with the main memory when the cache
line is replaced.
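As a rough sketch, the two combinations can be contrasted on a single direct-mapped line. Everything here (the `cache_line_t` fields, the toy `mem[]` array, and the `dirty` bit) is our own illustration of the policies described above, not a real cache implementation.

```c
#include <string.h>

#define LINE_SIZE 4

typedef struct {
    int valid, dirty;
    unsigned tag;                     /* here simply addr / LINE_SIZE */
    unsigned char bytes[LINE_SIZE];
} cache_line_t;

unsigned char mem[1 << 13];           /* toy 8 KB main memory */

/* Write-through + no-write-allocate: a hit updates both copies; a miss
   writes memory only and never brings the line into the cache. */
void write_through(cache_line_t *line, unsigned addr, unsigned char v) {
    unsigned tag = addr / LINE_SIZE;
    if (line->valid && line->tag == tag)
        line->bytes[addr % LINE_SIZE] = v;   /* hit: update the cache too */
    mem[addr] = v;                           /* always update memory */
}

/* Write-back + write-allocate: a miss first loads the line (writing a
   dirty old line back to memory); the write itself touches only the
   cache and marks the line dirty, so memory is updated later, on eviction. */
void write_back(cache_line_t *line, unsigned addr, unsigned char v) {
    unsigned tag = addr / LINE_SIZE;
    if (!(line->valid && line->tag == tag)) {
        if (line->valid && line->dirty)      /* evict the old dirty line */
            memcpy(&mem[line->tag * LINE_SIZE], line->bytes, LINE_SIZE);
        memcpy(line->bytes, &mem[tag * LINE_SIZE], LINE_SIZE);
        line->tag = tag;
        line->valid = 1;
    }
    line->bytes[addr % LINE_SIZE] = v;
    line->dirty = 1;                         /* memory copy is now stale */
}
```

Note how after a `write_back` miss the cache holds the new value while `mem[]` still holds the old one; the two only agree again once the dirty line is evicted.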

4.2.4 Multi-Level Caches

As mentioned before, one level of cache between the main memory and the
processor is good enough for our understanding of the concept of cache,
but it's not practical in reality. Therefore, we see in Figure 4.4 that we
have three levels of L1, L2, and L3 caches. The L𝑘 cache holds parts of the data
copied from the L𝑘+1 cache as a buffer, and L𝑘 is smaller, faster, and more
expensive. Let's take a quick look at how ARM chips organize multi-level
caches in Figure 4.13.

Figure 4.13: Simplified illustration of an ARM processor chip with multi-level caches.

You usually see “8-core” or “quad-core” when you buy a computer. Each
core is just a CPU that contains the basic elements we have learned. In
modern architectures, the L1 cache is actually split into two, called the d-cache
for caching data and the i-cache for instructions. This accelerates data
processing further than using one L1 cache for both data and instructions.

When we have multiple cores on one processor package, we also have an
L2 cache that connects all the cores. It's larger and a bit slower. Sometimes
you see computers that contain multiple processor packages, which can
be further connected by one L3 cache. We call this “external” because it's
not on any processor chip.

4.2.5 Evaluation Metrics

To evaluate cache performance, the most commonly used metric is the miss rate:

Miss rate = (times data is not found in the cache) / (total memory references)    (4.1)

The best way to understand the miss rate and how to calculate it is through a
concrete example.

ą Example 4.1
A bitmap image is composed of pixels. Each pixel in the image is
represented as four values: three for the primary colors (red, green
and blue – RGB) and one for the transparency information defined
as an alpha channel.

Assume we will use a direct-mapped cache of 128 bytes with 8-byte
blocks. The pixel type and the matrix we're going to use are defined
as follows:

typedef struct {
    unsigned char r;
    unsigned char g;
    unsigned char b;
    unsigned char a;
} pixel_t;

pixel_t pixel[16][16];

Also assume that

▶ sizeof(unsigned char) == 1 ;
▶ pixel begins at memory address 0;
▶ The cache is initially empty;
▶ Variables i,j are stored in registers and any access to these variables does not cause a cache miss.

What’s the miss rate for writes to the pixel given the following
code?

for (i = 0; i < 16; i++){
    for (j = 0; j < 16; j++){
        pixel[i][j].r = 0;
        pixel[i][j].g = 0;
        pixel[i][j].b = 0;
        pixel[i][j].a = 0;
    }
}

Before we start calculating the miss rate, we need to determine the
memory and cache structure. Notice that the code shown is in C,
which is a row-major order language, so we can depict the storage
of the matrix pixel as follows:

[0][0]   [0][1]   [0][2]        [15][14]  [15][15]
r g b a  r g b a  r g b a  …    r g b a   r g b a
0        4        8             1016      1020       (byte addresses 0..1023)

Since it's assumed that the matrix begins at memory address 0, we
see that the last byte of the matrix resides at memory address
16 × 16 × 4 − 1 = 1023 (decimal).

Another important thing we need to determine is the cache
organization. The cache we're going to use is direct-mapped, so
𝐸 = 1. Each line has 8 = 2³ bytes, so 𝑏 = 3. With the total capacity
𝐶 = 128 = 2⁷ bytes, we have 𝑠 = 7 − 3 = 4 index bits and 𝑆 = 2⁴ = 16 sets.

Now let's see how to calculate the miss rate for this example. In the
inner loop, we have four writes (assigning zeros to the members
of the struct), so in each inner iteration we have four memory accesses
in total. Also notice that because each element takes four bytes,
while each line in the cache can store eight bytes, we can store two
consecutive elements in each line. We denote them as [i][j] and
[i][j+1].

In the loop over j, the first write, pixel[i][j].r = 0 , will definitely
be a miss, either due to the empty cache in the beginning,
or line replacement. After this miss, all eight bytes starting from
&pixel[i][j] will be carried into the cache, meaning all the following
seven writes—g,b,a for [i][j] and r,g,b,a for [i][j+1]—will
be hits. So we can certainly rewrite the code to the following and
it's identical in terms of cache and memory access:

for (i = 0; i < 16; i++){
    for (j = 0; j < 16; j += 2){
        pixel[i][j].r = 0; // miss
        pixel[i][j+1].r = 0; // hit
        pixel[i][j].g = 0; pixel[i][j+1].g = 0; // hit
        pixel[i][j].b = 0; pixel[i][j+1].b = 0; // hit
        pixel[i][j].a = 0; pixel[i][j+1].a = 0; // hit
    }
}

What we can conclude from the code above is, for every eight memory
accesses in the inner loop, we will have one miss. The outer loop
doesn't really matter here, because it simply scales the number of accesses by
16. Therefore, the miss rate is clearly 1/8 = 0.125, or 12.5%.
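We can double-check this result mechanically. The short simulator below models the direct-mapped cache from the example (128 bytes, 8-byte blocks, so 16 sets) and replays the 1,024 byte writes in the order the loop issues them; the simulator itself is our own sketch, not part of the example.

```c
#include <assert.h>

#define BLOCK 8                  /* bytes per line */
#define SETS  16                 /* 128-byte cache / 8-byte lines */

/* Replay the 16x16 pixel writes (one byte per write, addresses 0..1023,
   in the order r, g, b, a) on a direct-mapped cache and count misses. */
int count_misses(void) {
    unsigned tag[SETS];
    int valid[SETS] = {0};
    int misses = 0;
    for (int addr = 0; addr < 16 * 16 * 4; addr++) {
        unsigned set = (addr / BLOCK) % SETS;
        unsigned t   = addr / BLOCK / SETS;
        if (!valid[set] || tag[set] != t) {   /* cold or replacement miss */
            valid[set] = 1;
            tag[set]   = t;
            misses++;
        }
    }
    return misses;
}
```

This reports 128 misses out of 1,024 accesses, i.e., the 1/8 = 12.5% computed above.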

˛ Quick Check 4.3


Given the following chunk of code, analyze the miss rate given that
we have a byte-addressed computer with a total memory of 1 MB.
It also features a 16 KB direct-mapped cache with 1 KB blocks.
Assume that the cache begins cold (empty).

#define NUM_INTS 8192 // 2^13
int A[NUM_INTS]; // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128) {
    A[i] = i; // Line 1
}
for (i = 0; i < NUM_INTS; i += 128) {
    total += A[i]; // Line 2
}

1. How many bits make up a memory address on this computer?
2. How many bits are used for Tag, Index, and Offset in the cache?
3. Calculate the cache miss rate for the line marked Line 1;
4. Calculate the cache miss rate for the line marked Line 2.

B See solution on page 151.

4.2.6 Writing Cache-Friendly Code

You might be wondering: “sure, theoretically we know how the cache works
and so on, but does that really make a difference? For all the programs
we've written it seems that everything just happens so fast.” On the other
hand, probably the only difference you've noticed is the Big-O notation
learned in your algorithms class, where you were taught that the time
complexity increases as you have larger inputs.

When you were learning Big-O notation, recall that a big assumption in the
beginning is that hardware differences are ignored, so it's a purely
theoretical abstraction. Remember, however, that programs do not run
in your imagination; they eventually rely on the hardware in your
laptop! Since we have learned how to calculate miss rates, why don't we look
at an example where the programs have identical time complexity but different
miss rates, and see what their performance is in reality.

Our task is very simple—matrix multiplication, given two long int matrices
a and b of the same size 𝑛 × 𝑛. Each element takes eight bytes. Based
on the math definition, we can quickly write out a segment that performs
the multiplication:

/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Another way to perform matrix multiplication is to fix each element a[i][k]


in matrix a, and iterate over each row in matrices b and c:

/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}

Or, we can fix each element in matrix b while iterating over each column in
matrices a and c:

/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}

The three methods above are mathematically equivalent, and have the
same complexity of 𝑂(𝑛³). Let's analyze their miss rates first. Assume
the block size of the cache is 32 bytes and is not large enough to store
multiple rows. We also assume the dimension 𝑛 of the matrices is very large,
meaning each row needs to be moved into the cache multiple times.

ijk The inner loop accesses elements in a and b each time. Notice that for
a[i][k], the row number i is fixed while the column number k constantly
changes from 0 to n-1. Also, because the block size of the cache is
32 bytes and each element takes 8 bytes, each line can hold four elements.
For every four elements, the first one will be a miss due
to the empty cache or line replacement, while the other three will
be hits. Therefore, the miss rate for matrix a is 1/4 = 0.25. As to matrix
b, notice it accesses elements column-wise, so the miss
rate is simply 1. The inner loop does not involve matrix c, so we can
ignore it.

kij In this code, we first notice that the inner loop does not involve matrix
a, so we skip it. Element b[k][j] is iterating over all the elements
in each row, so its miss rate is simply 1/4 = 0.25 (see analysis in the
ijk case for matrix a). Similarly, matrix c iterates over all elements
in a row, so its miss rate is also 0.25.

jki Lastly, for the jki case, matrix b is not present in the inner loop, so we
ignore it. For both matrices a and c, the inner loop iterates over all
the elements in one column. You've probably already sensed that
this is a bad idea, and yes, the miss rate for both is 1.

If we take the average miss rate of the three methods, we will have 0.625 for
ijk, 0.25 for kij, and 1 for jki. So in terms of miss rate, it's obvious that
kij is the best way and jki the worst. But is that really the case?

The authors of Computer Systems: A Programmer's Perspective have run the
code above on an Intel Core i7 machine, and recorded the number of cycles
it takes for each inner loop iteration as the matrix dimension 𝑛 goes up.
Figure 4.14 shows the result. As expected, the larger the miss rate, the more
cycles it needs per inner loop iteration. As 𝑛 increases, the cycles needed
also increase. Notice how kij almost stays the same.

Figure 4.14: Matrix multiplication on Core i7. A large miss rate leads to more cycles needed to run each inner loop iteration.
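The per-access analysis can be packaged into a tiny model: under the stated assumptions (8-byte elements, 32-byte blocks, matrices far larger than the cache), a stride-1 access misses once every four elements and a column-wise access misses every time. The functions below simply average the two inner-loop accesses of each variant; they are our own summary of the analysis, not measured data.

```c
#include <assert.h>

#define SEQ (1.0 / 4.0)   /* row-wise (stride-1) access: one miss per block */
#define COL 1.0           /* column-wise (stride-n) access: always a miss */

double miss_rate_ijk(void) { return (SEQ + COL) / 2; } /* a row-wise, b column-wise */
double miss_rate_kij(void) { return (SEQ + SEQ) / 2; } /* b and c both row-wise */
double miss_rate_jki(void) { return (COL + COL) / 2; } /* a and c both column-wise */
```

A model this small obviously ignores conflict misses and the write policy, but it is enough to rank the three loop orders.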

Of course an efficient algorithm of low complexity is still desired, but
remember it's still an abstraction; eventually the computation happens in
the hardware. An experienced programmer should not only write code
with low complexity, but also have knowledge of the cache and write
cache-friendly code.

4.3 Virtual Memory

In the previous section, we saw that the CPU requests data by issuing a memory
address, and the address is parsed to check the cache first. The addresses
received by the cache are actually physical addresses—meaning those
are real addresses used in the real main memory.

Why do we emphasize physical? Our laptops now usually have 16GB or at
least 8GB of RAM, which is the physical size of the main memory, shared
by all programs running in the system. Even if 16GB seems a lot, it is far
from sufficient to run multiple real applications on our laptops. For
example, right now, just the music player on my macOS laptop needs 392GB of
memory space, not to mention all the other apps running on my laptop that
are using the same main memory. My memory is only 16GB, so how is a
music player that needs 392GB of memory even able to execute? And how
do all the concurrently running programs squeeze into a memory as small
as 16GB?

Again, let's think about locality. We mentioned that locality is an important
feature of computer programs, which doesn't only apply to the
program data, but also to the program code itself. For example, within a small
amount of time, the assembly instructions running in a program tend
to cluster together (they are stored close to each other). Even with
branching instructions, you don't always jump all over the place, right?
Notice two key points there: “a small amount of time”, and “cluster”. Let's
start with a motivating example.

4.3.1 A Motivating Example

Figure 4.15: Two programs need 60 bytes of memory space, but the actual memory has only 48 bytes.

Assume our memory can store only 48 bytes, as in Figure 4.15, but
we want to run two programs at the same time. For the two programs
shown in Figure 4.15, we see that there are in total 15 instructions. If each
instruction takes 4 bytes, to load them all into the physical memory we
need 15 × 4 = 60 bytes at least. So how do we fit both programs into such a small
physical memory?

Because of locality, we know that the instructions a program executes
in a small amount of time are usually stored next to each other in memory.
So why don't we just load parts of the program into the memory first? During
that small amount of time, just that part of the program is enough to
execute. As shown in Figure 4.16, we only load program 1's and program
2's first four instructions into the physical memory, which will fill the
entire memory nicely.

Figure 4.16: Only loading parts of the programs allows them to fit into a small physical memory at the same time.

When the loaded instructions have all been executed,
we take them out, and swap in the other parts of the programs, so we'll
end up in a situation as in Figure 4.17.

Figure 4.17: When other parts of the programs are needed, we just swap them in.

This is such a simple thing we do almost every day. There might be lots of
books at your home, but your bag can only take probably five books max,
so before you go to campus, you'd take only the books you need that day.
If the next day you're going to use a different book, you'll take out
the one you're not going to use, and put the new book into your bag. The
books are like parts of the program, and the bag is the physical memory.

4.3.2 Definitions

The example in the previous section is a very simple one, but it illustrates
the idea behind virtual memory well. Remember, however, the example is
only simplified. In fact, the areas of a process—text, data, stack, and heap—
will all be separated into “parts” and loaded into the physical memory, not
just the text segment.

We, as programmers, write our code without even thinking about memory
space limits. This is because the operating system provides a memory
management mechanism called virtual memory. From our point of view,
each of our programs can take a basically infinitely large memory space.¹⁰
So all the statements involving “memory address” so far (except the ones in
the cache), e.g., “pointers represent memory addresses”, “LDR loads data
at a memory address to a register”, etc., refer to virtual memory. If a
virtual memory address has 𝑛 bits, each program will have the same set of
𝑁 = 2ⁿ unique virtual addresses {0, 1, …, 2ⁿ − 1}, which is called a
virtual address space.

10: Technically not infinitely large; the typical virtual memory size for a program is 4GB on Linux, but it can certainly be extended as needed.

From the machine's view, however, the physical memory is small and limited.
The physical memory is also byte-addressed, meaning each byte has
an address. We call this address a physical address—a real address, not
virtual, not imaginary. If a physical address has 𝑚 bits, the set of 𝑀 = 2ᵐ
unique addresses {0, 1, …, 2ᵐ − 1} is called the physical address space. Based
on our analysis, we notice that 𝑁 ≫ 𝑀.

Obviously, fitting all the programs into one small physical memory requires
some arrangement, as we showed in Section 4.3.1. Briefly, we chop each
program's virtual memory space into smaller pieces, called virtual pages,
and only load some virtual pages into the physical memory. When they
reside in the physical memory, they are called physical pages.

Let's look at Figure 4.18 for an illustration. In this example, we have two
processes sharing one physical memory. Each of the processes has a complete
virtual memory space, from 0 to 𝑁 − 1. Their virtual memory spaces
also have complete organizations including the text and data segments, stack,
and heap. This is what we see for each process.

Figure 4.18: To fit multiple processes' virtual memory space into a small physical memory, the system needs to split virtual memory spaces into pages, and only load the pages that contain needed data and code into physical memory.

Due to the space limit, it's impossible to fit both of these processes into one
physical memory, so we chop the virtual memory spaces into pages, and
only when the data and code in a page are needed do we move the page into the
physical memory. In the figure, we see that two pages of process 1 and
three pages of process 2 are currently being used. If the processor needs
data or code in a page that's not currently in the physical memory, it needs
to find a page to swap out. If it couldn't find such a page, then a delay will
happen. Remember when you opened too many apps on your laptop and it
got stuck? Yeah, that's basically what happened.
got stuck? Yeah that’s basically what happened.

Notice some details in the figure:

▶ Physical pages are not dedicated to a specific process; that's why we see the pages from the two processes all over the place, intertwined in the physical memory;
▶ There's no order in physical memory. A page at a low virtual memory address doesn't always end up at a low physical memory address. For example, the page that contains the stack of process 1 is at a lower physical address, while the page with the text segment is at a higher physical address;
▶ The pages are of equal size, so they are not aligned to a specific segment. For example, we see that one of the pages in process 2 holds some part of the heap, and also some part of the data segment.

Now the ultimate question is, how does the processor know which physical
address corresponds to which virtual memory address in which
process?! As to which process is running, that's related to context switching, a
topic you'll learn in operating systems, so we'll skip it here and only focus
on the first question.

4.3.3 Address Translation

The procedure of finding a physical memory address given a virtual ad-


dress is called address translation. More formally, assume the virtual
memory space is V = {0, 1, … , 𝑁 − 1} and the physical memory space
P = {0, 1, … , 𝑀 − 1}. Address translation can be defined as a function
such that

MAP ∶ V ↦ P ∪ {∅} (4.2)

Thus, given a virtual address 𝑎, we have 𝑎′ = MAP(𝑎) where 𝑎′ is the trans-


lated physical address. If 𝑎′ ∈ P , the virtual address can be successfully
mapped to physical memory, and the data can be found there. If 𝑎′ = ∅,
the virtual address cannot be mapped to physical memory. This situation
happens when, for example, we are trying to use a variable that's currently
in a virtual page, but that page hasn't been loaded into physical memory.
Thus, the first case is called a page hit, while the second a page fault.

Figure 4.19: Adding the memory management unit (MMU) to the CPU chip.

4.3.3.1 Translation with Page Table

The translation from a virtual memory address to a physical address is
performed by a special piece of hardware called the memory management unit (MMU).
As shown in Figure 4.19, we add the MMU to the CPU chip between the
processor (register file and ALU) and the SRAM cache. The MMU receives a virtual
address request issued by the CPU, translates it to a physical address, and
sends it to the cache. The MMU's main job is to do the address translation to
enable multiple tasks/processes to share the main memory. It's managed
and programmed by the operating system.

The address translation is nothing fancy—it's simply a one-to-one mapping
using a look-up table called the page table. The page table is stored
in the main memory, and looks like this:

Valid bit Physical Page Number (PPN)


0 0x0000
1 0x0110
1 0xCC00
⋮ ⋮

Each row is called a page table entry, and has two fields: a valid bit, and
a physical page number (PPN). The index of each page table entry is called
the virtual page number (VPN). For example, if given a virtual address we
know its VPN is 2, then by looking up the table above, we know that
this virtual page is located at the address starting from 0xCC00 in the physical
memory. The valid bit indicates whether the page is actually in the physical
memory. If it's 1, we'll have a page hit; otherwise we'll have
a page fault.

When there's a page fault, the main memory will pick a page called the victim
page, and evict it to secondary storage, such as the hard drive. Then
the page we need will be brought into the main memory from the hard
drive, and the address will be translated again by the MMU.
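The hit/fault logic can be sketched as follows. The `pte_t` struct mirrors the two-field entries shown earlier; `evict_victim` and `load_from_disk` are hypothetical stand-ins for the operating system's work, reduced to stubs so the sketch is self-contained.

```c
#include <assert.h>

#define NPAGES 8

typedef struct { int valid; unsigned ppn; } pte_t;

static pte_t page_table[NPAGES];   /* all entries start invalid */
int fault_count = 0;

/* Stand-ins for the OS: free a physical page (here always page 0) and
   pretend to copy the virtual page in from the hard drive. */
static unsigned evict_victim(void) { return 0; }
static void load_from_disk(unsigned vpn, unsigned ppn) { (void)vpn; (void)ppn; }

/* Look up a VPN; on a page fault, evict a victim, load the needed page,
   patch the page table entry, and the lookup then succeeds. */
unsigned lookup_ppn(unsigned vpn) {
    if (!page_table[vpn].valid) {          /* page fault */
        fault_count++;
        unsigned ppn = evict_victim();
        load_from_disk(vpn, ppn);
        page_table[vpn].ppn = ppn;
        page_table[vpn].valid = 1;
    }
    return page_table[vpn].ppn;            /* now a page hit */
}
```

The first access to a page faults; repeating the same access hits, because the entry is now valid.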

Here's a real-world example to help you understand. Assume you have
multiple pages of a spreadsheet, where each page has 100 rows. What's
the easiest way to locate a specific row? You don't want to say, get me
row 4,552 counting from the very first row. Instead, it'll be easier to say, get me row
52 on page 45, right? So you index to page 45, and count from the first row
on that page to row 52. Notice how we determine the location: the first two
digits, 45, are the page number, and 52 is an offset. Bingo! That's exactly how
we translate a virtual address to a physical address.
A virtual address of 𝑛 bits can be split into two parts:

Bits n-1..p Bits p-1..0


Virtual page number (VPN) Virtual page offset (VPO)

Given an 𝑛-bit virtual address, the MMU first splits it into two parts. The
highest 𝑛 − 𝑝 bits are used as the index to a row in the page table, and that's the
virtual page number. If the valid bit is 1, we have a page hit, and we
retrieve the physical page number of that row. The virtual page offset is the
lowest 𝑝 bits, and it will simply be used as the physical page offset (PPO)
without change. So we attach PPO to PPN, and we have the physical
address.

The contents of physical or virtual pages are simply bytes, so the page
offset indicates which byte in that page. For example, if PPO = 0, we want
the first byte of that page; if PPO = 1600, we want the 1,601st byte, and so
on. Since we need 𝑝 bits to represent the offset, and each byte has its own
address, it is clear that each page stores 2ᵖ bytes. Therefore,
we call 𝑃 = 2ᵖ the page size.
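In code, the split-and-reattach procedure is just bit manipulation. The sketch below assumes a 1,024-byte page (𝑝 = 10) and a made-up four-entry page table; both are hypothetical values chosen only for illustration.

```c
#include <assert.h>

#define P_BITS 10                         /* page size 2^10 = 1024 bytes */

typedef struct { int valid; unsigned ppn; } entry_t;

/* A made-up page table, purely for illustration. */
static const entry_t toy_table[4] = {
    {0, 0x0}, {1, 0x5}, {1, 0x7}, {0, 0x9}
};

/* VPN = the high bits, used to index the table; VPO = the low p bits,
   reused unchanged as the PPO. Returns -1 on a page fault.
   (Assumes vpn < 4, the size of the toy table.) */
long va_to_pa(unsigned va) {
    unsigned vpn = va >> P_BITS;
    unsigned vpo = va & ((1u << P_BITS) - 1);
    if (!toy_table[vpn].valid)
        return -1;                        /* page fault */
    return ((long)toy_table[vpn].ppn << P_BITS) | vpo;   /* PPN ++ PPO */
}
```

For instance, a virtual address with VPN 1 and offset 0x123 maps to physical page 5 with the same offset.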

˛ Quick Check 4.4


Assume the physical address has 32 bits, while the virtual address
has 64 bits. Based on different page sizes, determine the number of bits
needed to represent VPN, VPO, PPN, and PPO.

P VPN VPO PPN PPO


1KB
2KB
4KB
16KB

B See solution on page 152.

ą Example 4.2
Now let’s look at an example of translating virtual address to phys-
ical address. Assume:

▶ The memory is byte addressable;
▶ Virtual addresses are 16 bits wide;
▶ Physical addresses are 14 bits wide;
▶ The page size is 1,024 bytes.

In the following tables, all numbers are given in hexadecimal. The


contents of the page table for the first 15 pages are as follows:

Page Table
VPN  PPN  Valid    VPN  PPN  Valid    VPN  PPN  Valid
00   2    0        05   B    0        0A   3    0
01   5    1        06   D    1        0B   1    1
02   7    1        07   7    1        0C   0    1
03   9    0        08   C    0        0D   D    0
04   F    1        09   3    0        0E   0    0

What’s the physical address of virtual address 0x2F09?

Solution: The first step is to write the virtual address in binary:


15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 1 0 1 1 1 1 0 0 0 0 1 0 0 1
Then we need to decide various parameters for this virtual address.
Because the page size is 1,024 = 2¹⁰ bytes, both VPO and PPO need
10 bits to represent. The virtual address is 16 bits wide, so the leading 6
bits will be used for the VPN. Therefore, the virtual address is parsed
as the following:
15 14 13 12 11 10 | 9 8 7 6 5 4 3 2 1 0
 0  0  1  0  1  1 | 1 1 0 0 0 0 1 0 0 1
└──────VPN──────┘   └───────VPO────────┘

We convert the VPN to hexadecimal and get VPN = 0x0B. According
to the page table, the corresponding PPN for VPN 0x0B is 1, and
since its valid bit is 1, we have a page hit!

Physical addresses are 14 bits wide. Because the VPO is identical to
the PPO, meaning it takes 10 bits, we only need four bits for the PPN.
Therefore, we zero-extend the PPN to 4 bits, 0b0001, and attach it
to the VPO:

13 12 11 10 | 9 8 7 6 5 4 3 2 1 0
 0  0  0  1 | 1 1 0 0 0 0 1 0 0 1
└───PPN───┘  └───────PPO────────┘
This is the physical address in binary, so the last step is to simply
convert it into hexadecimal, and we get a physical address of
0x709.

4.3.3.2 Accelerating Translation with TLB

Notice that the page table is nothing special: it resides in the main memory
just like any other data, such as our code, or other process data. Therefore,
it'll be cached in the L1 cache (and so L2, L3, etc.) as well, which means
it'll be evicted and replaced if the cache is full and we need to store other
data in the cache. However, remember the special role of the page table: it
is used for translating addresses, which is needed almost all the time, so
we certainly don't want it to be replaced too often. Otherwise the delay of
retrieving a page table entry would be too long.

Similar to the idea of caching memory data in an SRAM cache, we also add
a small SRAM cache called the translation lookaside buffer (TLB) to the MMU,
as in Figure 4.20. Essentially this TLB is also a set-associative cache; each
line of this cache stores one page table entry.

We're fairly familiar with SRAM caches by now, and the indexing rule for a
line is the same for the TLB. Because page table entries can be determined by
the VPN, we separate the VPN into two parts: the TLB tag (TLBT) and the TLB index
(TLBI). These two parts will be used to index a page table entry (a line) in
the TLB:

Bits n-1..p+t Bits p+t-1..p Bits p-1..0


TLB tag TLB index Virtual page offset (VPO)
⏟⏟⏟⏟⏟⏟⏟
Virtual page number (VPN)

Here we use the middle 𝑡 bits to indicate the TLB set index, so in total we have
𝑇 = 2ᵗ sets. Figure 4.21 shows the idea of the TLB, which, as you see, is just an
SRAM cache where the cached data are page table entries.
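Extracting the two parts is again simple bit manipulation. The sketch below assumes 𝑝 = 9 offset bits and 𝑡 = 1 index bit (so a 2-set TLB); these numbers are just one possible configuration, not a fixed rule.

```c
#include <assert.h>

#define P_BITS 9    /* page offset bits (512-byte pages) */
#define T_BITS 1    /* TLB index bits: 2^1 = 2 sets */

/* The VPN is everything above the page offset; its low t bits select
   the TLB set, and the remaining high bits form the TLB tag. */
unsigned tlb_index(unsigned va) { return (va >> P_BITS) & ((1u << T_BITS) - 1); }
unsigned tlb_tag(unsigned va)   { return va >> (P_BITS + T_BITS); }
```

With a different page size or TLB geometry, only the two `#define` constants would change; the shifting-and-masking pattern stays the same.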

˛ Quick Check 4.5


Are the following statements true or false? Why?

1. If a page table entry can not be found in the TLB, then a page
fault has occurred;
2. The virtual and physical page number must be the same size;
3. The virtual address space is limited by the amount of mem-
ory in the system;
4. The page table is accessed before the cache.

B See solution on page 152.

4.3.4 From Virtual Address to Data

In this section, we will look at a concrete example to see what actually
happens during the entire address translation process. Let's start with

CPU chip
MMU
TLB
Main memory
Register (DRAM)
ALU
file

Cache Memory
(SRAM) bus
System bus
Bus interface I/O bridge Figure 4.20: TLB as a cache for page table
entries on MMU.

Figure 4.21: A translation lookaside buffer (TLB) is simply a SRAM cache that caches page table entries.

something we're most familiar with. Say there's a statement in a C
program we wrote:

char c = *ptr; // ptr is a char pointer

where ptr is a char pointer that points to a byte in the virtual memory.
During compilation, this line will be translated into assembly:

LDRB W0, [X1]

where we assume X1 stores the value of ptr, which is a virtual memory
address. From what we learned in Chapter 3, we know that during the EX
stage, the value of X1 will be passed through the ALU, and used as the memory
address for a memory read transaction. Suppose the value stored in X1
is 0x1DDE. To retrieve the data stored at this virtual memory address, we
need to do address translation.

Now assume the machine we run the above code on has the following
configuration:

▶ The memory is byte addressable;
▶ Virtual addresses are 16 bits wide;
▶ Physical addresses are 13 bits wide;
▶ The page size is 512 bytes;
▶ The TLB is 8-way set associative with 16 total entries;
▶ The cache is 2-way set associative, with a 4-byte line size and 16 total lines.

At the time when we execute the LDRB instruction, the contents of the TLB,
the page table for the first 32 pages, and the cache are as shown in Table
4.3.

Step 1: Decide bit representations of the virtual address.

Before we translate the address, we need to know which part of the
virtual address is the VPN, and which part is the VPO. This information is
determined by the organization of the page table as well as the TLB.

From the description, we know that the page size is 512 = 2⁹ bytes, so
we need 9 bits to represent page offsets, and therefore bits [8..0] are the
VPO. Since the virtual addresses are 16 bits wide, the leading seven
4.3 Virtual Memory 147

Table 4.3: Contents of TLB, page table (first 32 pages), and the cache in the example. All numbers are hexadecimal.

TLB Page Table


Index Tag PPN Valid VPN PPN Valid VPN PPN Valid
0 09 4 1 00 6 1 10 0 1
12 2 1 01 5 0 11 5 0
10 0 1 02 3 1 12 2 1
08 5 1 03 4 1 13 4 0
05 7 1 04 2 0 14 6 0
13 1 0 05 7 1 15 2 0
10 3 0 06 1 0 16 4 0
18 3 0 07 3 0 17 6 0
1 04 1 0 08 5 1 18 1 1
0C 1 0 09 4 0 19 2 0
12 0 0 0A 3 0 1A 5 0
08 1 0 0B 2 0 1B 7 0
06 7 0 0C 5 0 1C 6 0
03 1 0 0D 6 0 1D 2 0
07 5 0 0E 1 1 1E 3 0
02 2 0 0F 0 0 1F 1 0
2-way Set Associative Cache
Index Tag Valid 0 1 2 3 Tag Valid 0 1 2 3
0 19 1 99 11 23 11 00 0 99 11 23 11
1 15 0 4F 22 EC 11 2F 1 55 59 0B 41
2 1B 1 00 02 04 08 0B 1 01 03 05 07
3 06 0 84 06 B2 9C 12 0 84 06 B2 9C
4 07 0 43 6D 8F 09 05 0 43 6D 8F 09
5 0D 1 36 32 00 78 1E 1 A1 B2 C4 DE
6 11 0 A2 37 68 31 00 1 BB 77 33 00
7 16 1 11 C2 11 33 1E 1 00 C0 0F 00

bits [15...9] is VPN.

The VPN consists of the TLBI and the TLBT. From the table, we see that the TLB has only two sets, so one bit is enough to index into either. Thus, in the virtual address, bit [9] is the TLBI, while bits [15..10] are the TLBT.

Step 2: Write out and parse the virtual address in binary.

Once the components of the virtual address have been identified, we need to write the address in binary and recognize each component. In this example, we assume the virtual address sent by the CPU is 0x1DDE. Thus, we parse it as follows:

         TLBT = bits [15..10], TLBI = bit [9]

bit:     15 14 13 12 11 10  9    8  7  6  5  4  3  2  1  0
value:    0  0  0  1  1  1  0    1  1  1  0  1  1  1  1  0
         \--------VPN--------/  \-----------VPO-----------/

From this we have TLBT = 0x07, TLBI = 0x0, VPN = 0x0E, and VPO = 0x1DE.

Step 3: Index into the TLB.

Given the TLBI, we index into set 0 of the TLB, and see if any line has a TLBT of 0x07. Unfortunately, no line's tag matches, so we have a TLB miss. Therefore, we'll have to use the VPN to check the page table. When there's a TLB miss, the MMU will go to the main memory and fetch the page table entry given the VPN.

Based on Table 4.3, we see that the page table entry for VPN 0x0E has a PPN of 0x1. Fortunately, its valid bit is 1, so we have a page hit instead of a page fault. The MMU will take this entry back and use it to form the physical memory address.

Step 4: Translate into the physical address.

Now that we have retrieved the PPN, the MMU attaches it to the VPO to form the physical address. Since the page offset is 9 bits wide, the physical address is (0x1 << 9) | 0x1DE = 0x3DE; note that we cannot simply glue the hex digits of the PPN and VPO together. This is the real address in physical memory where the byte our variable ptr points to actually lives.

Step 5: Write out and parse the physical address in binary.

Once again, we need to determine the components of the physical address in order to use the cache. From the last step, we know the physical address is 0x3DE. For the cache, we need to determine the bits for the tag, the set index, and the block offset. Because the cache has 8 = 2^3 sets, we need 3 bits for the set index. Each line holds 4 = 2^2 bytes, so we need 2 bits for the block offset. The remaining 13 − (3 + 2) = 8 bits are used for the tag. Thus, we write out the address in binary:
         Tag = bits [12..5], Set = bits [4..2], Block = bits [1..0]

bit:     12 11 10  9    8  7  6  5  4  3  2  1  0
value:    0  0  0  1    1  1  1  0  1  1  1  1  0
         \---PPN---/   \-----------PPO----------/

Therefore, the tag is 0x1E, the set index 0x7, and the block offset 0x2.

Step 6: Visit the cache.

Given the parsed physical address, we visit the cache. In set 7, the second line has tag 0x1E, which matches the tag in our address, and its valid bit is 1, so we have a cache hit! The block offset is 0x2, so the byte we retrieve is 0x0F. This byte will be brought back to the CPU and stored in register W0 . Now we know that in our C program in the beginning, variable c will store 0x0F.

Note that if we had a cache miss, we'd have to go to the main memory and evict one of the lines in set 7 to make room. This could involve two operations: writing the evicted line back to the main memory (if the cache is write-back and the line is dirty), and copying the line we want into the cache.

Figure 4.22: Physical memory retrieves and stores pages from the hard drive, and sends lines to the cache. The cache then sends words to the registers, where the data are requested by the ALU.

4.3.5 Summary

After all this mess, let's take a look at the big picture of how data move around inside our computers, shown in Figure 4.22.

We write our code and compile it into executables. Those executables are stored on the hard drive, and hold the complete image of virtual memory. From our perspective, each program has its own, identical virtual address space.

When we invoke a program, parts of that virtual memory will be carried into physical memory as pages. Then some parts of the data in those pages will be carried into the cache as lines. When the CPU requests a piece of data and we have a cache hit, the data will be moved into registers as words (4 bytes), and loaded as operands to the ALU for calculation. This should give you a clearer picture of what's happening behind the scenes.

As to the details of how exactly these things happen, they're way beyond our scope, and one course is not enough to cover them in detail. If you are interested, you might want to take systems courses to study them in depth, and I encourage you to, because it's really fun!

4.4 Reference

Of course, the contents of this chapter have been extremely simplified to make sure you focus on the most important concepts without being distracted by too many details and technical discussions. For those who are interested in learning more in depth about ARM CPU chips, however, I recommend visiting the official website of ARM, specifically this page: https://developer.arm.com/documentation/den0024/a/Caches. You can see it includes content such as caches, virtual memory, address translation, and so on. It has a lot more detail, but you'll realize that with the concepts learned in this chapter as a foundation, it is not very difficult to hop on and explore further.

4.5 Quick Check Solutions

Quick Check 4.1

Assume we have three caches with different organization parameters. In


the following table, 𝑚 is the number of bits of a memory address, while
𝐶 the capacity of the cache (in bytes). The rest of the parameters are the
same as we used in this section. Please fill in the table below.

Cache 𝑚 𝐶 𝐵 𝐸 𝑆 𝑡 𝑠 𝑏
1 32 1,024 4 1 256 22 8 2
2 32 1,024 8 4 32 24 5 3
3 32 1,024 32 32 1 27 0 5

Quick Check 4.2

The following problem concerns basic cache lookups.

- The memory is byte addressable;
- Physical addresses are 13 bits wide;
- The cache is 2-way set associative, with a 4-byte line size and 16 total lines.

In the following tables, all numbers are given in hexadecimal. The con-
tents of the cache are as follows:

2-way Set Associative Cache

Set  Tag V  Bytes: 0 1 2 3     Tag V  Bytes: 0 1 2 3
0 09 1 86 30 3F 10 00 0 99 04 03 48
1 45 1 60 4F E0 23 38 1 00 BC 0B 37
2 EB 0 2F 81 FD 09 0B 0 8F E2 05 BD
3 06 0 3D 94 9B F7 32 1 12 08 7B AD
4 C7 1 06 78 07 C5 05 1 40 67 C2 3B
5 71 1 0B DE 18 4B 6E 0 B0 39 D3 F7
6 91 1 A0 B7 26 2D F0 0 0C 71 40 10
7 46 0 B1 0A 32 0F DE 1 12 C0 88 37

Part 1

The box below shows the format of a physical address. Indicate (by label-
ing the diagram) the fields that would be used to determine the following:
O The block offset within the cache line
I The cache index
T The cache tag

12 11 10 9 8 7 6 5 4 3 2 1 0
T T T T T T T T I I I O O

Part 2

For the given physical address, indicate the cache entry accessed and the
cache byte value returned in hex. Indicate whether a cache miss occurs. If
there is a cache miss, enter “-” for “Cache Byte returned”.

Physical address: 0x0E34

1. Physical address in binary (one bit per box)


12 11 10 9 8 7 6 5 4 3 2 1 0
0 1 1 1 0 0 0 1 1 0 1 0 0

2. Physical memory reference:


Parameter Value
Byte offset 0x0
Cache Index 0x5
Cache Tag 0x71
Cache Hit? (Y/N) Y
Cache Byte returned 0x0B

Quick Check 4.3

Given the following chunk of code, analyze the miss rate, given that we have a byte-addressed computer with a total memory of 1 MB. It also features a 16 KB direct-mapped cache with 1 KB blocks. Assume that the cache begins cold (empty).

#define NUM_INTS 8192 // 2^13
int A[NUM_INTS]; // A lives at 0x10000
int i, total = 0;
for (i = 0; i < NUM_INTS; i += 128) {
    A[i] = i; // Line 1
}
for (i = 0; i < NUM_INTS; i += 128) {
    total += A[i]; // Line 2
}

1. How many bits make up a memory address on this computer?
   - We take log2(1 MB) = log2(2^20) = 20.
2. How many bits are used for Tag, Index, and Offset in the cache?
   - Offset = log2(1 KB) = log2(2^10) = 10;
   - Index = log2(16 KB / 1 KB) = log2(16) = 4;
   - Tag = 20 − 4 − 10 = 6.

3. Calculate the cache miss rate for the line marked Line 1;
   - The integer accesses are 4 × 128 = 512 bytes apart, which means there are 2 accesses per block. The first access in each block is a compulsory cache miss, but the second is a hit because A[i] and A[i+128] are in the same cache block. Thus, we end up with a hit rate of 50%. The miss rate is simply 50% as well.
4. Calculate the cache miss rate for the line marked Line 2.
   - The size of A is 8192 × 4 = 32768 bytes. This is exactly twice the size of our cache. At the end of Line 1, we have the second half of A inside our cache, but Line 2 starts with the first half of A. Thus, we cannot reuse any of the cache data brought in from Line 1 and must start from the beginning. Our hit rate is therefore the same as Line 1, since we access memory in exactly the same way. We don't have to consider cache hits for total, as the compiler will most likely store it in a register. Thus, we end up with a hit rate (and therefore a miss rate) of 50%.

Quick Check 4.4

Assume the physical address has 32 bits, while the virtual address 64 bits.
Based on different page sizes, determine the number of bits needed to rep-
resent VPN, VPO, PPN, and PPO.

P VPN VPO PPN PPO


1KB 54 10 22 10
2KB 53 11 21 11
4KB 52 12 20 12
16KB 50 14 18 14

Quick Check 4.5

Are the following statements true or false? Why?

1. If a page table entry cannot be found in the TLB, then a page fault has occurred;
   - False. The TLB acts as a cache for the page table, so an entry can be valid in the page table but not stored in the TLB. A page fault occurs either when a page cannot be found in the page table or when its entry's valid bit is not set.
2. The virtual and physical page number must be the same size;
   - False. There could be fewer physical pages than virtual pages. However, the page size does need to be the same.
3. The virtual address space is limited by the amount of memory in the system;
   - False. The physical address space is limited by the amount of physical memory in the system; the size of the virtual address space is set by the OS.
4. The page table is accessed before the cache.
   - True. The CPU issues a virtual address to memory, and the cache requires a physical address, so the page table must be accessed first in order to translate the virtual address to a physical one.
Appendix A
C Language in Action

This chapter serves as a quick-and-dirty introduction to getting started with C language programs.

Compile, Link, and Execute

Our C program should be stored as a source file, with extension .c . Open a terminal, and cd to the directory where the source code is located. The first step is to use gcc to compile the code:

$ gcc hello.c -c

This command will let gcc compile our source code hello.c , and gen-
erate an object file hello.o .

Next step is to link all the object files. Since in this example we only have
one object file, we can simply use

$ gcc hello.o

which will generate an executable file, whose name is a.out by default, if there are no errors in our code. If there are multiple object files generated from different source files, say hello.o , test.o , and demo.o , the command line would be

$ gcc hello.o test.o demo.o

Then we can run the executable:

$ ./a.out

Note that ./ in the beginning is necessary.

If we want to generate an executable with another, cuter name instead of the default a.out , we need to use flags. For example, if we want to name our executable cute , then we should do:

$ gcc hello.o -o cute

To run this, type the following in the terminal:

$ ./cute

Note that the output file name ( cute in this example) has to follow -o , but -o output_name can be anywhere after the command gcc . Remember, the values of the flags and the flags themselves always go together as a pair.

Usually the steps above can be simplified to just one command:

$ gcc hello.c

which will compile (without generating an object file explicitly), link, and generate an executable. For the rest of the course, feel free to use this command to speed up your workflow, but for this lab, you have to know how to use the flags to do the compilation and linking separately. This will help you later when we write assembly programs.

Flags

You can have many flags attached to the command gcc . You can view
them using the --help flag. This is a flag without any values:

$ gcc --help

Some commonly used flags are:

- -Wall : short for "warn all", which will show all the warnings (they are not errors);
- -g : use this if we want to debug our code using gdb . More on this later;
- -O : capital O, for optimizing our code. It doesn't modify our source code; instead, it only generates a more efficient executable file.

So, to generate an efficient C executable called too_cute that can be debugged, the command line looks like:

$ gcc -O -o too_cute -g hello.c


Appendix B
ARMv8 Assembly in Action

B.1 Program Structure · B.2 Linking and Executing Assembly Programs · B.3 Debugging using gdb · B.4 Floating Point Operations

B.1 Program Structure

ARM assembly programs are stored in files with extension .s . A complete assembly program has two major segments: .text , where we mark the beginning of our code, and .data , which indicates the global data stored in the program.

B.1.1 Our First Assembly Program

The following program is the simplest assembly program, where we just


exit the program:

1 .text
2 .global _start
3

4 _start:
5 MOV X0, 0 /* status <- 0 */
6 MOV X8, 93 /* exit() is system call #93 */
7 SVC 0 /* invoke system call */

Line 5 – 7 are the standard procedure to exit any assembly program. What it does is actually make a system call that ends the execution of the program. We will see other examples in the future, but for now, you only need to remember to always put these three lines at the end of your program, to make sure it can exit successfully.

On line 4 we have a label _start , which is where all assembly programs start executing. You can write it anywhere in your code, but it always marks the first instruction in your program.¹ On line 2, we see we declared _start as a global label, using .global . The first line .text marks the text segment, meaning all the lines after it (until other segments) are assembly code.

1: Different machines use different labels to mark the start of the program. For example, some machines use main or _main . For the virtual machine we are using, it's _start . If you work in different environments, you need to check this information first.

Both .global and .text are called directives. They are not part of the executable—they cannot be translated into machine code and put into the CPU to execute; they are part of the syntax of the assembler, and help the assembler manage the organization of the program, or serve other purposes. All directives start with a dot.

B.1.2 Segments

Assembly programs are managed by segments. In the example above


we’ve seen a .text segment. There are also other segments that might
be useful in your program.

- .data :
  This segment is used to store global variables, meaning they can be accessed by any procedure in the program. Local variables used in procedures/functions shouldn't be declared here; instead, they should be stored directly on the stack in their frames;

- .bss :
  This segment is also used to store global variables, but it's usually used for uninitialized data. For example, you might want to reserve a space of 100 bytes in case your program needs to store some data. In that case, you can declare that empty space of 100 bytes in the .bss segment.

Let’s see an example below.

.text
.global _start

_start:
    MOV X0, 0  /* status <- 0 */
    MOV X8, 93 /* exit() is system call #93 */
    SVC 0      /* invoke system call */

.data
hello_str: .ascii "Hello World!\n\0"
arr:       .dword 13, 24, 1024

In this example, we declared a string hello_str and an array of double words arr in the .data segment.

B.1.3 Declaring Data

Again, we emphasize that there's no data type or variable type in assembly language, or in the main memory. Everything is a byte, and different data types are simply different groupings and interpretations when they are brought back to higher-level languages. For the convenience of programmers, however, assemblers typically provide a set of directives that look like declarations of data types. These directives have two purposes: for us, we don't need to care about the internal binary representation of data; for the assembler, they determine how to translate the data into bytes. For example, if we declare a double word of number zero, the assembler will make the area 64 bits of zeros; if we declare a character of zero, then it'll simply make one byte of zeros.

The format to declare data is as follows:

label: .type_directive data1, data2, data3, ...

Here label is used as the address of the first data we declared after it; .type_directive is where we declare the type of the data; and the values of the data can be stored after the directive, separated by commas. A brief summary of different C data types and corresponding assembly directives can be found in Table B.1.

Table B.1: Directives for different C data types.

C type   Directive
char     .byte
int      .int , .word
long     .quad , .dword
float    .single , .float
double   .double
char[]   .string , .ascii

B.1.3.1 Loading Labeled Data

We usually declare data in the .data segment as follows:

.data
hello: .quad 1024

To use this data, the first step is to load its address into a register, using the ADR instruction:

ADR X0, hello

where we load the memory address labeled hello into X0 . Then we can use the LDR instruction to load the data into, say, register X1 :

LDR X1, [X0]

Note that we cannot use a label itself as a base address. For example, this is wrong: LDR X1,[hello] .

Trick: You can use ADR Xm, . (a dot as the second operand) to load the current PC, i.e., the address of this ADR instruction, into register Xm .

B.1.3.2 Strings

Strings are declared using .string or .ascii , with double-quoted characters, such as:

str1: .string "Hello"
str2: .ascii "Hello\0"

Notice that for str2 we added a null-terminator at the end. If we declare the string using .string , the assembler will automatically add a null-terminator at the end of the string. Alternatively, we can also use .asciz , where z stands for zero (the null-terminator):

str1: .asciz "Hello" /* Same as .string */

B.1.3.3 Arrays

You probably have already noticed that there’s no “array” type in assem-
bly. If we want to declare an array of integers, for example, we simply put
all numbers in a row, separated by commas:

arr: .quad 12, 34, 56, 78, 90

Here we declare an “array” of five long integers. Recall from your data
structure class that arrays are simply a list of elements that are logically
stored one after another. In fact, when we put a declaration like the one
above in assembly, all the elements are exactly stored next to each other,
from low address to high.

If we want to index into the array for an element, we need to calculate the
byte offset from the base. In the example above, label arr points to the
address of its first element, so to load an element to a register, we need to
first load the base address, and then use the offset to load the element:

ADR X0, arr      /* X0 stores the base address of arr */
LDR X1, [X0, 16] /* The offset of arr[2] is 2*8 = 16  */

What if the offset is outside the array's boundary? Again, the structure "array" is an abstraction used by us; the internal system of the computer does not know anything about "arrays", and therefore there's no such thing as an array boundary. So if we load from an offset that falls outside the array, the assembler does not issue warnings, and you will probably still be able to run your program without problems. However, as to what data you actually loaded into the register and whether the actual result of your program is correct or not... who knows? Therefore, we as assembly programmers need to carefully manage our own data.

Example B.1
Given the following .data segment, draw a memory layout to show how these data are organized. Assume the lowest address of the .data segment is 0x1000 , and the machine is little-endian.

.data
str: .string "Hello"
arr: .quad 80302, 01230, 07030
vec: .int -1000

We show the layout of the memory below, where each row contains eight bytes. The addresses increase from left to right within a row, and from top to bottom across rows. All the numbers shown are hexadecimal.

0x1000:  48 65 6C 6C 6F 00 AE 39     str at 0x1000, arr at 0x1006
0x1008:  01 00 00 00 00 00 98 02     arr+8 at 0x100E
0x1010:  00 00 00 00 00 00 18 0E     arr+16 at 0x1016
0x1018:  00 00 00 00 00 00 18 FC     vec at 0x101E
0x1020:  FF FF



The data in the .data segment always start from a low address, and are stored one right next to another. We first have str , which is a string, so each character in the string will be stored as its ASCII code. Note that at the end we have an additional null-terminator, because we used the .string directive. If we had used .ascii , the null-terminator would not be there.

Next, we have three double words, and they will be stored right after the string. (Note that 01230 and 07030 have leading zeros, so the assembler reads them as octal: 664 and 3608 in decimal.) Notice we're using little-endianness, meaning the least significant byte is at the lowest address. After the three numbers, we have an integer, which only takes four bytes.

Previously we mentioned that there's no such thing as an array boundary. As in this example, arr can also be referenced as str+6 (i.e., base address at str with six bytes of offset); or arr+32 is actually pointing to the first byte of vec , which we would consider "out of boundary".

Assembly doesn't do element checks either; once the numbers have been stored there, they are nothing but bytes. For example, if you want to get the first element of arr , but for some reason you added 2 to the base address like this:

ADR X0, arr
ADD X0, X0, 2
LDR X1, [X0]

assembly wouldn't give you an error saying, "hey, that's the wrong address!". It'll just do whatever you told it to do. In that case, X1 does not store the actual first element of the array, which is 80302 ; instead it stores the most significant six bytes of the first element, and the least significant two bytes of the second:

01 00 00 00 00 00 98 02

so your X1 will end up being 186899384535875585 . This is because X0 , after you accidentally added 2, holds address 0x1008 , and LDR loads a double word (eight bytes) starting from the base address.

B.1.3.4 Alignment

In Example B.1, you probably noticed that when we drew the memory layout we put eight bytes on each row, and each row starts at an address that is a multiple of eight. The byte pointed to by the label arr , however, starts at 0x1006 , which is not divisible by eight, i.e., not aligned. Having data aligned at a memory boundary in many cases will increase machine performance greatly. On some machines, accessing unaligned data can even generate errors.

Figure B.1: Using .balign exp will align the next data at an address that is a multiple of exp , with zeros as padding.

0x1000:  48 65 6C 6C 6F 00 00 00     str, then two bytes of padding
0x1008:  AE 39 01 00 00 00 00 00     arr
0x1010:  98 02 00 00 00 00 00 00
0x1018:  18 0E 00 00 00 00 00 00
0x1020:  18 FC FF FF                 vec

To align data to a boundary, we can use .balign directive:

.balign exp

which will make the data start at the next address of multiples of exp .
For example, if .balign 8 is used, the location of the next data will start
at an address of a multiple of 8.

We can modify the .data segment in Example B.1 to the following:

.data
str: .string "Hello"
.balign 8
arr: .quad 80302, 01230, 07030
vec: .int -1000

The layout of the memory is then as shown in Figure B.1. Notice that after the end of str we have two bytes of zeros as padding, and arr starts at 0x1008 , which is a multiple of eight.

B.1.3.5 Repetitive Initialization

In some cases we'd like to initialize an array where every element has the same value. In C, we would write out each element:

int arr[10] = {20, 20, 20, 20, 20, 20, 20, 20, 20, 20};

where we declare an array of 10 integers, all of them 20. (Note that int arr[10] = {20}; would only set the first element to 20; C zero-fills the rest.)

In assembly, we will use a pair of directives: .rept and .endr . The following example shows how to use this pair:

.data
arr:
.rept 10
.int 20
.endr

This also generates 10 integers of 20. Basically, first write the label for
the variable; then use .rept num to specify how many times you want
to repeat the following declarations, and write normal data declarations
after it as usual. Lastly, remember to use .endr to close the repetition.

Another easier way is to use .fill directive:

.fill repeat, size, value

which will generate repeat number of elements; each of the elements


takes size bytes, and has the value of value . For example, again, if we
want to initialize an array of 10 integers of 20, here’s how to do it:

arr: .fill 10, 4, 20

where 4 stands for the size of each element, which in this case is the size of an integer.

B.1.3.6 Reserving Empty Space

Empty spaces are usually declared in the .bss segment, since it's used for uninitialized data. In fact it's nothing special; if we want to reserve a space of 100 bytes, it's the same as declaring 100 bytes of zeros. Two directives can be used:

.space size, fill
.skip size, fill

Here size is the total size, not the size of an individual element, and fill is the data for every byte in that space. Say we want to reserve an empty space of 100 bytes in the .bss segment:

.bss
empty_arr: .skip 100, 0

This fills every byte of the 100 bytes with zero. It's the same as:

empty_arr: .skip 100
// or empty_arr: .space 100

meaning fill can be omitted if we just want to fill with zeros. And of course, it's also the same as:

empty_arr: .fill 100, 1, 0

There's no restriction on which directive to use in which segment, so you just need to be clear about which parameter stands for what.

Quick Check B.1

We know each byte takes 8 bits, and so if it's an unsigned number, the range is from 0 to 255. Now let's do something interesting:

.bss
empty_arr: .skip 100, 1024

and see what happens.

B.2 Linking and Executing Assembly Programs

B.2.1 General Workflow

Typically, the three steps to execute an assembly program are:

- Assemble the source code using an assembler;
- Link the object files generated by the assembler;
- Execute the binary from the linker using an emulator.

Assume we have an assembly source code called demo.s . The first step
is to generate an object file using the aarch64 assembler:

$ aarch64-linux-gnu-as demo.s -o demo.o

If there’s no error message, we’re all good, and the next step is to link object
files to generate an executable, using the linker:

$ aarch64-linux-gnu-ld demo.o

In case we have multiple object files needed to link to generate an exe-


cutable, we can just list all of them here:

$ aarch64-linux-gnu-ld obj1.o obj2.o obj3.o

The output executable name by default is a.out , but it can be renamed by passing -o <newname> to the linker. To execute it, use the QEMU emulator:

$ qemu-aarch64 a.out

B.2.2 Listing Files

Listing files are very useful for examining the format of the object file or even the final executable. A listing file shows the encoding of each instruction one by one, and how the instructions are arranged and organized in memory when loaded. Assume we have a source file named demo.s . To generate the listing file, we can use the following command:

$ aarch64-linux-gnu-as demo.s -a=demo.lst

where we generate a listing file called demo.lst . We usually use lst


as the file extension for listing files, but it’s not required.

The following is an example of a listing file:²

2: You can compare the memory content of the .data segment in this listing file with the example in Section B.1.3.3. It helps you verify the memory layout in that example.

AARCH64 GAS  demo.s  page 1

 1                   .text
 2                   .global _start
 3
 4                   _start:
 5 0000 000080D2         mov x0, #0
 6 0004 A80B8052         mov x8, #93
 7 0008 010000D4         svc #0
 8
 9                   .data
10 0000 48656C6C     str: .string "Hello"
10      6F00
11 0006 AE390100     arr: .quad 80302, 01230, 07030
11      00000000
11      98020000
11      00000000
11      180E0000
12 001e 18FCFFFF     vec: .int -1000

AARCH64 GAS  demo.s  page 2

DEFINED SYMBOLS
demo.s:4   .text:0000000000000000  _start
demo.s:5   .text:0000000000000000  $x
demo.s:10  .data:0000000000000000  str
demo.s:11  .data:0000000000000006  arr
demo.s:12  .data:000000000000001e  vec

NO UNDEFINED SYMBOLS

On page 1, we see the memory content and the assembly code listed side by side. For the .text segment, each instruction's four-byte encoding is shown next to it. For the .data segment, the binary representations of the data are shown. The four hex digits in front of them are offsets relative to the beginning of the segment. Also notice that directives and labels do not have corresponding memory content in the listing file. This again shows us that they are not part of the program executed by the CPU.

Page 2 summarizes defined and undefined symbols. If, say, we have BL foo in our code but we never labeled anything foo , the symbol foo will appear under "undefined symbols".

One thing about listing files is that they will not show all of the data declared repetitively. For example, if we declare something like .fill 200, 8, 100 , the listing file will only show the first few bytes.

B.2.3 Using External Libraries

We can certainly use some of the functions from the C library in our assem-
bly program, since those C functions eventually need to be compiled into
assembly, after all.

B.2.3.1 Examples of Using printf()

- Simple Printing
  It is not fun to print values directly in assembly, so one common situation is to print variable values using printf() . Since printf() is still a procedure, we follow the same rules as for branching to procedures. Let's look at the first example below.

1 /* print_msg.s */
2 .text
3 .global _start
4 .extern printf
5

6 _start:
7 ADR X0, hello_str
8 BL printf
9 MOV X0, 0 /* Exit */
10 MOV X8, 93
11 SVC 0
12

13 .data
14 hello_str:
15 .ascii "Hello World!\n\0"

In the code listing above, notice we declare printf() as an external symbol using the .extern directive. This is to tell the assembler, "it's ok if you didn't see this symbol defined in this file. It's external, so it'll be in some other files, and will be resolved during linking."³

3: The ARM assembler treats all undeclared symbols as external, so it wouldn't be a problem if you didn't use .extern printf in your code. However, we do recommend using it to make the code more readable and maintainable.

The call of printf() follows the standard procedure call as described in Section 2.6.2.6. Thus, before we branch to printf() , we need to pass the parameters in registers X0 to X7 , and even on the stack if more parameters are needed. The string we want to print out is declared as hello_str , so on line 7, we pass its address to X0 as the first and only parameter for printf() in this example. Then we just need to branch and link using BL .

• Formatted Printing
Here's another example. Assume we want to print out the value of
register X20 . Written in C, the code should look like this, assuming
X20 stores the value of variable a :

1 long a;
2 printf("The value in X20 is %ld", a);

Here we have two parameters. The first is a formatted string, and


the second is a register value. For the first parameter, we need to
create a string in the data segment:

1 .data
2 check_str: .ascii "The value in X20 is %ld\n\0"

For the second parameter, we simply need to copy X20 ’s data to


X1 . The code looks like this:

1 /* print_num.s */
2 .text
3 .global _start
4 .extern printf
5

6 _start:
7 ADR X0, check_str
8 MOV X1, X20
9 BL printf
10 MOV X0, 0 /* Exit */
11 MOV X8, 93
12 SVC 0
13

14 .data
15 check_str: .ascii "The value in X20 is %ld\n\0"

• More Arguments
Since printf() and all C functions follow the procedure call stan-
dards, they can only use the first eight registers for passing parame-
ters. If we want to pass more than eight parameters, we have to use
the stack. In the following example, we want to print five characters
and their ASCII codes. With the formatted string, we have to pass 11
parameters, so X0 – X7 are not enough. Therefore, we have to
utilize the stack space:

1 /* more_args.s */
2 .text
3 .global _start
4 .extern printf
5

6 _start: ADR X0, check_str


7 MOV X1, 49
8 MOV X2, 49
9 MOV X3, 60
10 MOV X4, 60
11 MOV X5, 80
12 MOV X6, 80
13 MOV X7, 100
14 MOV X8, 100
15 MOV X9, 120
16 MOV X10, 120
17

18 // Allocate more space for passing parameters
19 SUB SP, SP, 32
20
21 STR X8, [SP, 0]
22 STR X9, [SP, 8]
23 STR X10, [SP, 16]
24 BL printf
25
26 // De-allocate the stack
27 ADD SP, SP, 32
28
29 MOV X0, 0
30 MOV X8, 93
31 SVC 0
32

33 .data
34 check_str:
35 .ascii "ASCII code of char %c is %d, %c is %d, "
36 .ascii "%c is %d, %c is %d, %c is %d.\n\0"

On line 19, we allocate space on the stack for the three additional
parameters. For C functions, regardless of data types, each parameter on
the stack takes eight bytes, so the three parameters need 24 bytes; we
round the allocation up to 32 because AArch64 requires SP to stay 16-byte
aligned. Lastly, remember to de-allocate the frame after returning from
printf() , as on line 27.

This is yet another example showing that the machine doesn't care about
specific data types; it all depends on our own interpretation of the data.
For example, we pass the same number 49 to X1 and X2 . Inside the
machine, their representations are exactly the same, but when they are
printed out, we see one as the character 1 and the other as the number
49 . This is because we used different specifiers: %c and %d. Thus, when
they are printed out, the printf() function takes the bytes needed for
each specifier (1 byte for %c and 4 for %d), and presents them to the
terminal.

B.2.3.2 Pitfalls

One common mistake (more common than you think!) when using C li-
brary functions, especially printf() , is forgetting about procedure call
conventions. From the last section it is clear that we need to use X0 – X7 and
possibly stack space for passing arguments. One thing we typically
forget is that upon returning from the procedure, the values in X0 – X7
might have been changed by the procedure.
Some students put a value in X7 and called printf() . After that,
they found they had lost the data stored in X7 . This is because X7
is a caller-saved register; it is the caller's responsibility to save it in
case the called procedure needs to use it. Additionally, procedures
store their return values in X0 as well. To review this topic, see Sec-
tion 2.6.2.5.

This is not to say that all register values will be changed, but there is no
guarantee that they won't be, because, especially for external procedures,
you never know which registers will be used. Therefore, before branching
to any procedure, remember to save the data from X0 – X7 somewhere,
and restore it after the return.

B.2.3.3 Linking C Library

If the assembly code uses the standard C library, we need to use the -lc
flag to link it:

1 $ aarch64-linux-gnu-ld demo.o -lc

where we assume the object file is called demo.o .

When executing a program that has been linked with the C library, we need
to dynamically link it as well: 4

4: On some machines there's no need to dynamically link the library, so if
you can run the executable perfectly fine without linking it, you don't
have to.

1 $ qemu-aarch64 -L /usr/aarch64-linux-gnu/ a.out

B.3 Debugging using gdb

B.3.1 Installation

The regular gdb we used for debugging C programs cannot be used here,
because our program can only be executed in the QEMU emulator, and the
architecture is different. We will have to install a gdb that supports
multiple architectures.

In our virtual environment, we can simply use the following command to


install:

1 $ sudo apt-get install gdb-multiarch

We will use gdb to refer to gdb-multiarch from now on.


B.3.2 Start Debugging

We'll use gdb and qemu together, so we need to open two terminals at
the same time: one for gdb to step through, and the other for qemu to
provide an emulated environment.

To debug an assembly program, say demo.s , we must add the -g
flag when assembling:

1 $ aarch64-linux-gnu-as demo.s -g -o demo.o

Then link the object file as usual:

1 $ aarch64-linux-gnu-ld demo.o

which will generate an executable a.out .

We start QEMU first:

1 $ qemu-aarch64 -g 1234 a.out

where the number 1234 is arbitrary; it's a port for gdb to connect to. Once
this command is entered, it'll freeze on the terminal and wait for us to start gdb.
Now leave it there (do not close it), and we can go to the other terminal
and start gdb:

1 $ gdb-multiarch --nh -q a.out -ex 'set disassemble-next-line on'
↪ -ex 'target remote :1234'
↪ -ex 'set solib-search-path /usr/aarch64-linux-gnu-lib/'
↪ -ex 'layout regs'

Note there’s a space between remote and :1234 . The flags -ex are
the commands we want gdb to execute in the beginning. If you don’t add
these flags when invoking gdb, you’ll have to type them once you’re in the
gdb environment. The interface you see as in Figure B.2 is called TUI (Text
User Interface).

B.3.3 Debugging Commands

B.3.3.1 Breakpoints

The advantage of a debugger is that we can set a breakpoint somewhere
in our code, so that the program pauses there and we can see what's
going on in the program.

When gdb starts, the program is paused somewhere in the static library.
We can use the following command to set a breakpoint at the beginning of
our program:
Figure B.2: Interface of running gdb for assembly.

1 b _start

Then we can use continue command (or simply c ) to reach our entry
point. Other labels can be set as breakpoint as usual.

Note: if after setting the breakpoint at _start and using continue you
notice the breakpoint is not exactly at the label _start , see the
Troubleshooting section.

B.3.3.2 Steps

Most of the commands are pretty much the same as those we've been using
for debugging a C program in gdb. As a reminder, when we want to go into
a procedure, we use step or s . If the procedure is from
a library, such as printf() , it's not a good idea to step into it, so we
use next or n instead.

B.3.3.3 Panel Focus

When we enter gdb with the assembly code and register group laid out, the
default focus is the assembly code. Thus, pressing the up/down keys on the
keyboard, or scrolling with a mouse, only works on the assembly
code panel. To change focus to the register group to view all registers, we
can use the command focus regs . To change back to the assembly code panel,
simply use focus asm .

B.3.4 Printing Memory

We quite often need to examine the data stored in memory. The
syntax is as follows:

1 x/<length><format><unit> address

The length parameter specifies how much data we want to print starting
from address . It can be a positive or negative integer.

The format parameter tells gdb in what format we want to see the
data. For example, if we pass x , it will print the data in hexadecimal;
if we pass d , it will print it in decimal.

The unit parameter specifies how to group the data and interpret it.

For the format and unit parameters, there are many options. Please refer
to the gdb documentation at
https://sourceware.org/gdb/onlinedocs/gdb/Memory.html#Memory as well as
https://sourceware.org/gdb/onlinedocs/gdb/Output-Formats.html#Output-Formats.

The address is the starting address in memory. It can be a label, or a


hexadecimal address, or a register:

1 # Print 3 bytes in hexadecimal starting from address 0x54320:
2 x/3xb 0x54320
3
4 # Print 2 bytes in hexadecimal from the stack pointer:
5 x/2xb $sp
6
7 # Print 2 bytes in decimal from the address stored in x10:
8 x/2db $x10
9
10 # Print 5 bytes in character from the label hello:
11 x/5cb &hello
12
13 # Print the content in a string from label hello until '\0':
14 x/s &hello

In the example in Figure B.3, we declared a string str_src in the .data


segment. See how different parameters display the same content.

B.3.5 Inspecting Condition Codes

To see condition codes, we can observe cpsr field in the register group.
CPSR stands for Current Program Status Register. CPSR is a 32 bit register,
where different flags or conditions take different bits. We only care about
the highest four bits: N, Z, C, and V:

Figure B.3: An example of different memory examining formats.

Bit:    31  30  29  28   27 – 0
Flag:    N   Z   C   V   Other flags

The value displayed for cpsr in the register group panel is usually in
hexadecimal or decimal, so individual bits are not obvious. We
can use the following command to print out the value in binary format:

1 p/t $cpsr

where p stands for print and t for binary (two's complement); we then
only look at the most significant four bits.

B.3.6 Troubleshooting

B.3.6.1 Nothing Showed Up

If, at the beginning of gdb , the register panel and assembly code panel are
both empty, do not worry: you just need to set a breakpoint at _start
by typing

1 b _start

and then continue executing by typing c . Then you'll see everything
there.

B.3.6.2 Breakpoint Not Set Exactly at a Label

You might notice that sometimes, even if you set the breakpoint at _start ,
gdb actually sets the breakpoint a few bytes after the label. 5

5: This typically doesn't happen on M1-based macOS, so try to set a
breakpoint at _start at first, and if it's not exactly at the label, then
proceed to use the actual address to set the breakpoint.

To set a breakpoint at the first instruction correctly, you can use the com-
mand info files to find out the actual address of _start :

1 Symbols from "/home/shudong/demo/a.out".


2 Remote serial target in gdb-specific protocol:
3 Debugging a target over a serial line.
4 While running this, GDB does not access memory from...
5 Local exec file:
6 `/home/shudong/demo/a.out', file type
↪ elf64-littleaarch64.
7 Entry point: 0x400204
8 --Type <RET> for more, q to quit, c to continue without
↪ paging--

Notice in the output above, on line 7, we have the entry point address
0x400204 , which is the location of the label _start . Then we can set the
breakpoint there:

1 b *0x400204

B.3.6.3 No Such File or Directory

If you run the qemu-aarch64 command to start debugging but get
an error message like this:

1 /lib/ld-linux-aarch64.so.1: No such file or directory

you need to run the following command to install the aarch64 ver-
sion of gcc :

1 $ sudo apt install gcc-aarch64-linux-gnu

B.3.6.4 Cannot Link -lc

Again, this means you forgot to install the aarch64 version of gcc :

1 $ sudo apt install gcc-aarch64-linux-gnu


B.4 Floating Point Operations

Real numbers are a different species from integers: they're en-
coded in a different way; they are stored in a separate group of registers
instead of X0 – X30 ; they are calculated in a unit called the FPU (floating
point unit), not the ALU (!); and they use a totally different set of
instructions.

• Registers:
Real numbers in ARM have 5 types of precision, but we mostly use 2
of them: single (encoded in 32 bits) and double (64 bits). Single pre-
cision numbers correspond to the float type in C, and double preci-
sion numbers are, well, double . Correspondingly,
the registers that hold them are Hn (half precision), Sn (single),
and Dn (double), where n , ranging from 0 to 31, is the register
number.

• Usage:
Like integer registers, we use D0 – D7 to pass parameters to proce-
dures as well as to store return values. D8 – D15 are preserved across
calls, and D16 – D31 can be used as temporary registers. The same
applies to single and half precision registers.

B.4.1 Basic Instructions

Real numbers have a separate set of instructions, but fortunately these
are very similar to the ones we have seen so far.

B.4.1.1 Arithmetic

Typically, when using floating point registers, the instruction has an F


prefix. For example:

1 FADD S0, S2, S3 // S0 = S2 + S3


2 FADD D19, D12, D12 // D19 = D12 + D12

However, load and store are the same as before.

B.4.1.2 Moving Real Numbers

You can move an immediate real number into an S or D register using
FMOV , but there is a restriction. Based on the ARMv8 manual, only
numbers that can be expressed as ±(n/16) × 2^r , where n ∈ [16, 31] and
r ∈ [−3, 4], can be moved. Small numbers such as x.0 (integers), x.5 ,
and x.25 are typically fine.

Therefore, we recommend that you just store all real numbers in the .data
segment, and load them into registers, and use FMOV between registers.
Assume we have a number declared as such:
1 .data
2 pi: .float 3.1415

Then to move this number into a register, we should do:

1 ADR X0, pi
2 LDR S0, [X0]
3 FMOV S1, S0

B.4.1.3 Converting Precisions

Sometimes we'd like to cast a single precision number to a double precision
number, or to an integer, or vice versa. We can't just FMOV an S register
to a D register. The instruction we need to use is FCVT :

1 FCVT D0, S0 // Upcast, gain precision
2 FCVT S1, D1 // Downcast, lose precision
3 SCVTF D2, X2 // Convert an integer to a real number
4 FCVTZS X3, D3 // Convert a real number to an integer
↪ (the fraction part is removed)

Figure B.4: Converting between double precision real numbers and integers
( SCVTF D0, X0 and FCVTZS X0, D0 ).

See Figure B.4 for an illustration. For other instructions, the best resource
out there is ARM64's reference sheet:
https://developer.arm.com/documentation/100076/0100/a64-instruction-set-reference/a64-floating-point-instruction

B.4.2 Printing Using printf()

Printing is a little bit tricky. As usual, we need to load the string address
into X0 , but the rest of the parameters are passed in D0 – D7 .

1 .data
2 fmt_str: .ascii "%lf %lf\n\0"
3 number1: .double 3.1415
4 number2: .double 10
5 ...
6 ADR X0, fmt_str // Load address of the string
7 ADR X1, number1 // Load address of number1
8 LDR D0, [X1] // Load number1 to D0
9 ADR X1, number2
10 LDR D1, [X1]
11 BL printf

When printf() recognizes %lf, it doesn’t go to X1 to fetch the number;


instead it goes to D0 .

If you use printf() to print both integer values ( int , char , long int )
and floating point values, move corresponding numbers into their regis-
ters in order:
1 .data
2 fmt_str: .ascii "%d = %lf, %d = %lf\n\0"
3 ...
4 ADR X0, fmt_str
5 LDR X1, [...] // integer #1
6 LDR D0, [...] // floating point #1
7 LDR X2, [...] // integer #2
8 LDR D1, [...] // floating point #2

Caution! In case you declare your number as .float like the following:

1 .data
2 fmt_str: .ascii "%f\n\0"
3 fpnum: .float 3.14

you would need to cast fpnum from .float to .double because printf()
doesn’t check S registers at all, and automatically uses double precision
as output format (even if your format is %f instead of %lf):

1 ADR X0, fmt_str


2 ADR X1, fpnum
3 LDR S0, [X1]
4 FCVT D0, S0
5 BL printf

Also, if you declare your number as .float , you cannot load it directly
into D registers.
Thus, the easiest way to avoid these situations is to just declare your
numbers as .double and load them into D registers all the time.

B.4.3 Debugging

The floating point registers cannot be viewed in the register panel, so we
have to use p to print them out:

1 p/f $d0

where /f is to print out the register value as a floating point.


Alphabetical Index

.data segment, 47
.text segment, 47
abstraction, 10
address translation, 141
address-of, 16
arithmetic logic unit (ALU), 8
asserted, 69
base address, 26
bistable element, 71
bit, 1
breakpoint, 170
BTFN, 108
bus, 9
  address bus, 9, 123
  control bus, 70, 123
  data bus, 9, 123
  memory bus, 123
  system bus, 123
buses, 69
byte, 1
byte-addressed, 9
cache, 123, 124
  direct-mapped, 127
  E-way set associative, 130
  fully associative, 130
  line, 125
  set, 125
  tag, 125
  valid bit, 125
callee, 47
callee-saved registers, 56
caller, 47
caller-saved registers, 56, 169
carry, 5
clear, 39, 69
clock cycle, 92
clock frequency, 111
clock period, 111
clock rate, 111
column-major order language, 121
combinational logic, 68
compilation time, 47
control signals, 69, 82
control unit, 85
CPSR, 38, 172
CPU time, 111
data hazard, 98
data signals, 82
deasserted, 69
dereference, 16
directives, 157
encoding, 78
endian
  big endian, 17
  little endian, 17
evict, 128
extension
  signed extension, 3
  zero extension, 2
falling delay, 67
flip-flops, 75
flush, 107
format specifier, 14
formatted output, 14
forwarding unit, 104
frame pointer, 52
frames, 48
hazard detection unit, 102
heap, 47
hit, 124
I/O controller, 123
immediate, 25
label, 25
Last-time Predictor, 110
latch
  D latch, 74
  edge-triggered latches, 75
  flip-flops, 75
  latching, 74
  Set-Reset latch, 73
  SR latch, 73
latency, 92
leaf procedures, 54
least significant bit (LSB), 1
locality, 120
  spatial locality, 120
  temporal locality, 120
memory hierarchy, 122
memory management unit, 142
misprediction rate, 109
miss, 125
miss rate, 134
mnemonic, 25
most significant bit (MSB), 1
multiplexer, 69
nibble, 1
no-write-allocate, 133
non-leaf procedures, 54
null terminator, 12
offset, 18, 26
opcode, 78
operands, 25
overflow, 6
  negative overflow, 6
  positive overflow, 6
page fault, 142
page hit, 142
page size, 143
page table, 142
page table entry, 142
physical address, 140
physical address space, 140
physical page number, 142
physical pages, 140
picoseconds, 92
pointer arithmetic, 18
pointers, 15
pop, 47
Princeton architecture, 8
procedure frames, 48
program counter (PC), 33
propagation delay, 72
push, 47
read ports, 76
register, 8
  CPSR, 38, 172
  current program status register, 38, 172
  general purpose registers, 8
  link register, 49
  pipeline registers, 94
  register file, 8, 76
register overhead, 95
relative frequency, 113
return address, 49
rising delay, 67
row-major order language, 121
run time, 47
sequential logic, 71
set, 39, 69
single-cycle implementation, 82
stack, 47
stack pointer, 51
stored-program computers, 9
storing, 74
stride-1 reference, 121
strongly not taken, 110
strongly taken, 110
three-way pipeline, 93
throughput, 92
translation lookaside buffer, 145
two's complement, 3
Two-bit Predictor, 110
virtual address space, 140
virtual memory, 140
virtual memory space, 47
virtual page number, 142
virtual pages, 140
von Neumann model, 8
weakly not taken, 110
weakly taken, 110
weighted average CPI, 113
word, 2
  double word, 2
  half word, 2
  quadword, 2
write port, 76
write-allocate, 133
write-back, 133
write-through, 133
ARM Assembly Directives & Instructions

Instructions:
ADD, 30; ADDS, 39; ADR, 159; AND, 32; ANDS, 39; ASR, 32; B, 34; B.cond, 40;
B.EQ, 38; B.LT, 40; B.NE, 38; BL, 49; BR, 34; CBNZ, 35; CBZ, 35; CMP, 38;
DIV, 31; EOR, 32; FADD, 175; FCVT, 176; FCVTZS, 176; FMOV, 175; LDR, 26;
LDRB, 27; LDRH, 27; LDRSB, 27; LDRSH, 27; LSL, 32; LSR, 32; MOV, 29; MUL, 31;
ORR, 32; RET, 49; SCVTF, 176; SDIV, 31; STR, 28; STRB, 28; STRH, 28; SUB, 30;
SUBS, 39

Directives:
.balign, 162; .bss, 158; .data, 158; .endr, 162; .extern, 166; .fill, 163;
.global, 157; .rept, 162; .skip, 163; .space, 163; .text, 157
