Computer Architecture and Organization: Lecture Notes
Shudong Hao
September 4, 2024
Computer Science 382
Disclaimer
The lecture notes have not been subjected to the usual scrutiny reserved for formal publications. They may not
be distributed outside this class without the permission of the Instructor.
Colophon
This document was typeset with the help of KOMA-Script and LaTeX using the kaobook class.
The source code of this book is available at:
https://github.com/fmarotta/kaobook
Edition History
1st edition: August 2022.
2nd edition: August 2023.
Contents
Contents iii
1 Fundamentals 1
1.1 Number Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Binary Numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.3 Binaries and Decimals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3.1 Unsigned Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.3.2 Signed Integers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.4 Binaries and Hexadecimals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.5 Binary Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5.1 Fixed Width Binary Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.5.2 Shifting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.1.5.3 Bit-wise Logical Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2 Basic Components of Microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1 Central Processing Unit (CPU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1.1 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.1.2 Arithmetic Logic Unit (ALU) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.2 Random Access Memory (RAM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.3 Peripheral Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Computer Abstractions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 A Peek Into Memory with C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Quick Start . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1.1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1.2 Binary Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1.3 Formatted I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1.4 goto Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.2 Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2.1 Reference and Dereference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.2.2 Pointers and Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.4.2.3 Endianness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2.4 Arrays and Pointer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
1.4.2.5 Null-Terminated Strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.5 Quick Check Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3 Microprocessor Design 67
3.1 Fundamental of Logics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.1.1 Logic Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.1.2 Combinational Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.2.1 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.1.2.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.1.2.3 Arithmetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.1.3 Sequential Logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.1.3.1 SR Latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.3.2 D Latch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.1.3.3 Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.1.4 Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.1.5 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.1.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.2 From Assembly to Machine Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.1 Arithmetic/Logic Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.2.1.1 With Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.1.2 With Immediates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.2.2 Memory Accessing Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.2.3 Branching Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.3 A Single-Cycle Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.2 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
3.3.3 Stages of an Instruction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.3.1 Stage 1: Instruction Fetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
3.3.3.2 Stage 2: Instruction Decoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.3.3.3 Stage 3: Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.3.3.4 Stage 4: Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
3.3.3.5 Stage 5: Writing Back . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.4 A Pipelined Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4.1 Operating a Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.4.1.1 Pipeline Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.4.1.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.2 From Single-Cycle to Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.4.3 Adding Pipeline Registers to Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.5 Hazards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.1 Data Hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.5.1.1 Stalling the Pipeline Manually . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.5.1.2 Stalling the Pipeline Automatically . . . . . . . . . . . . . . . . . . . . . . . . . 102
3.5.1.3 Forwarding Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
3.5.2 Control Hazard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
3.5.2.1 Branch Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5.2.2 Static Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.5.2.3 Dynamic Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
3.6 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
3.7 Quick Check Solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
Appendix 153
A C Language in Action 155
List of Figures
1.1 How a discrete value can be extracted from continuous signals. In this example, we can get a binary
number 010 based on the current change on a wire. . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Zero extension. In each example, we extend an unsigned binary number of four bits to one byte. . 2
1.3 Signed extension, which copies MSB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 For non-negative numbers, signed and unsigned have no difference. When considering the MSB as a
sign bit, the same binary pattern will be mapped to a positive number in the unsigned range. . . . . 4
1.5 Truncating the MSB of a (𝑑+1)-bit binary will result in positive or negative overflow. There's no effect
on any number in [−2^(𝑑−1), 2^(𝑑−1) − 1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.6 We apply three bits of logical shift left operation to binary number 10001100b (top). If the size of
the binary is limited (bottom), the MSBs are discarded. . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.7 We apply three bits of logical shift right operation to binary number 10001100b (top). If the size of
the binary is limited (bottom), the LSBs are discarded. . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.8 In arithmetic shift right, we pad MSBs with copies of the original number's MSB. We only show the
versions where the binary size is limited. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.9 A typical von Neumann model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.10 Visualization of RAM. Notice each byte (8 bits) has a unique address. The addresses range from
0x00...0 to 0xFF...F. What's the size of this RAM? . . . . . . . . . . . . . . . . . . . . . . 9
1.11 Abstractions of computer systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Register file of the ARM architecture. X-registers can store 64 bits. The lower 32 bits of each X-
register can also be used independently as a W-register. However, the upper 32 bits cannot be
used independently. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2 LDR will grab a few bytes (determined by the size of the destination register) from the starting ad-
dress, calculated by base+simm9. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 STR is the opposite direction of LDR. Notice how the same value of X9 is stored differently on little
and big endian machines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Using W and X registers for addition. If W registers are used, the highest 32 bits will be cleared
out. The gray boxes are W registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 Program flow of the example. Instruction B modifies PC based on its target, which makes the pro-
gram skip some instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6 CBNZ will check the register value; if it’s not zero, it’ll branch to the instruction tagged as L1. . . . . 35
2.7 Without unconditional branch B, the program flow would be wrong when variable a is zero. . . . . 37
2.8 A loop is simply a backward branching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.9 LDRB loads one byte into the lowest byte in a W -register. . . . . . . . . . . . . . . . . . . . . . . . 44
2.10 Each long integer takes eight bytes, so the starting address of each element is 8*i where i is the
index of the element. In this figure, LDR will copy eight bytes starting from address pointed by
X9+X13 to X12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.11 Visualization of a virtual memory space for a program. Our assembly code and global variables will
be loaded to this space straight out of the executable file. During run time, the heap and stack are
growing towards each other as our program calls a procedure, or allocates space dynamically. For
stack, the bottom is actually at the high address, while the top is at the low address, so you can take
it as an upside-down stack. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.12 If we treat function/procedure calls as stacking some “blocks”, the process of procedure calls looks
really like pushing blocks to the stack. At point (2), fun2() returned to fun1(), and therefore its
block is removed from the stack top. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
3.1 a and b are the two inputs of the and gate, and a&b is the output. The and gate constantly and almost
immediately reflects changes of the input, with a small, negligible amount of delay. . . . . . . . . . 68
3.2 On the left, signal a has been branched into two; on the right, a and b are separate signals without
relations, indicated by a line hop. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3 The combinational logic for comparing two bits. When the two bits are equal, the output is 1; oth-
erwise it’s 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 The parallel sets of wires on the left are called buses, where each wire transfers one bit of data, and
all the wires transfer data at the same time. To make the graph clearer, we made lines from input b
blue. Each pair of input bits uses the bit equality logic in Figure 3.3 to compare. . . . . . . . . . . . 69
3.5 The combinational logic for selecting one of the inputs as the output. Input s acts as a “switch”,
or “control”. When s == 1 (asserted), input b passes through the multiplexer; when s == 0
(deasserted), input a passes through. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.6 The inputs a and b, as well as the output, are all 64-bit double words. The same control signal controls
all the bits of an input, which makes the logic choose one of the inputs for every bit, and thus choose
one double word to pass through. We highlighted the wires for input b and its corresponding control
signal wires. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.7 Bit adder, where input y is marked as red, x as blue, the carry-in signal cin as black. . . . . . . . . 71
3.8 Full 64-bit adder using the bit adder design from Figure 3.7. From bit 0 to bit 62, each adder’s carry-
out flag cout[i] will be used as carry-in flag for bit i+1’s adder. . . . . . . . . . . . . . . . . . . . 72
3.9 The left shows a combinational logic, whereas the right shows a sequential logic. In the combina-
tional logic, both outputs p and q respond to the change of input in almost instantly, and thus we are
not able to “store” the output. In the sequential logic, however, one temporary change in the input
in will trigger the permanent change in the outputs, making them stay, and thus to be “stored”. . . 73
3.10 A simple SR latch and an example of timing diagram. Time (1) is the state of setting, (2) for resetting,
and (3) for latched. The temporary change in either R or S will make the change in the output stay,
and thus to be stored. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.11 D latch has a clock C to control when the data D is allowed to pass through and to cause change in
the output Q+. When C is 1, the status is called “latching”, where output Q+ responds to the change of
the input data D. When C is 0, the status is “storing”, and Q+ stays/stores the value regardless of changes
of input D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.12 An edge-triggered latch, or a flip-flop: when C rises, the trigger T will temporarily rise to high
voltage, allowing Q+ to store the value of input data D at that moment. Afterwards, T drops back down,
and no matter how input C changes, Q+ stays stable. . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.13 A register implemented using edge-triggered latches. One latch can store one bit of data, and all the
64 bits will be updated all together when the clock rises. . . . . . . . . . . . . . . . . . . . . . . . . 76
3.14 A little more detailed register file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.15 Control bus and address bus are unidirectional, while data bus is bidirectional. . . . . . . . . . . . 77
3.16 Encodings of arithmetic and logic instructions with register operands. . . . . . . . . . . . . . . . . 79
3.17 Encodings of arithmetic and logic instructions with register operands and immediates. . . . . . . . 80
3.18 Encodings of memory accessing instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.19 Encodings of branching instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.21 A clock controls the sequential logic, so each clock cycle makes one pass of the data. . . . . . . . . 83
3.20 A single-cycle implementation of datapath. Black lines are data signals, while blue lines are control
signals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.22 Stage 1: instruction fetching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.23 Stage 2: decoding. Fields in an instruction are sent to different parts of the register file. The opcode
is sent to a control unit that generates control signals. . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.24 Stage 3 — execution — consists of two tasks: updating PC for branching instructions, and computing
through ALU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.25 Stage 4: memory access. Memory has two control signals MemWrite and MemRead . . . . . . . . 89
3.26 Stage 5: writing back. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.27 A summary of five stages each instruction goes through in the datapath, with description language
on the side. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
3.28 An example of combinational logic, where the input is three-bit, and the output D is one bit. When
the clock rises, the output D is written to the register. . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.29 A sequence of three inputs: 110 , 010 , 110 with an unpipelined version. . . . . . . . . . . . . . 93
3.30 A three-way pipeline structure where each stage runs one input. . . . . . . . . . . . . . . . . . . . . 93
3.31 We added two registers between the three stages as “barriers”, to make sure the signals in each stage
will not be interrupted or overwritten. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
3.32 When we separate the combinational logic into three stages, the clock cycle can be shortened to only
cover one stage. At the peak of the system, between time 2 and 3, all logic gates are working on
different instructions, which greatly improves the throughput and reduces resource waste. . . . . . 94
3.33 It is very challenging to separate combinational circuits into stages with equal latency. Thus, the
clock cycle needs to be long enough to cover the slowest stage, which limits the throughput of the
entire system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3.34 The horizontal axis is a time line. At the top we run one instruction through all five stages at a time, so
it takes much longer to complete all three instructions. At the bottom, we run multiple instructions
at the same time, which resembles a pipeline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.35 A detailed datapath with pipeline registers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.36 A pipeline diagram showing the progression of instruction executions. . . . . . . . . . . . . . . . . 98
3.37 A sequence that can lead to data hazard, due to dependencies between instructions. Register X2 in
the first instruction is the destination, but also one of the source operands in the third instruction. At
cycle 4, instruction 3 has already read X2 ’s old value, but it hasn’t been updated from instruction 1
yet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.38 Because there’s no dependency on X2 in instruction STR , we swap it with ADD to align its ID
stage with SUB ’s WB stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.39 Inserting NOP instruction allows us to delay instructions that depend on the completion of earlier
instructions. In this example, to make SUB ’s WB and ADD ’s ID stages align, we only need to add
one NOP . Certainly in some cases more NOP s may be needed. . . . . . . . . . . . . . . . . . . . . 101
3.40 As long as there’s a data hazard, we’d postpone the instruction and its following instructions by
stalling them in place, and let NOP s move along the pipeline. Once all data hazards have been
resolved, stalled instructions can restart. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
3.41 The hazard detection unit receives signals from multiple stages (representing multiple instructions),
stalls the instructions currently in the ID and IF stages, and inserts a bubble into the EX stage. . . . 104
3.42 The forwarding unit will take signals from ME and WB stages, and overwrite ALU operands. . . . 107
3.43 At cycle 3, the value of X1 has been determined. Assume it is zero, then we need to branch to .L1 .
The two instructions AND and ORR that are already in the pipeline will not continue executing;
instead, we flush them and let NOP s move along the pipeline. . . . . . . . . . . . . . . . . . . . . . 108
3.44 Two-bit predictor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.1 The time gap between accessing DRAM/SRAM (the memory) and CPU cycle time is getting larger
through the years. Figure borrowed from Computer Systems: A Programmer’s Perspective. . . . . . . 119
4.2 Row-major languages store all elements in each row together. . . . . . . . . . . . . . . . . . . . . . 120
4.3 Column-major languages store all elements in each column together. . . . . . . . . . . . . . . . . . . 121
4.4 Memory hierarchy. Figure borrowed from Computer Systems: A Programmer’s Perspective. . . . . . . 122
4.5 Traditional bus structure between CPU chip and main memory. . . . . . . . . . . . . . . . . . . . . 123
4.6 A CPU chip with one level of cache. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.7 An illustration of how cache works in a toy example. . . . . . . . . . . . . . . . . . . . . . . . . . . 125
4.8 A cache with 𝑆 sets, 𝐸 lines per set, and 𝐵 bytes of data per line. Note all the lines in the cache have
the same structure as shown in line 0; for illustration purposes we only show the structure of line 0. 126
4.9 For convenience, two boxes are used for storing two cards each. Box x only stores cards starting
with digit x. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.10 After ♥00 was requested. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.11 After ♦01 was requested. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.12 Three possible cache organizations with a total capacity of eight bytes, and with lines that can store
two bytes of data each. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
4.13 Simplified illustration of ARM processor chip with multi-level caches. . . . . . . . . . . . . . . . . 134
4.14 Matrix multiplication on Core i7. A large miss rate leads to more cycles needed to run each inner
loop iteration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.15 Two programs need 60 bytes of memory space, but the actual memory has only 48 bytes. . . . . . . 139
4.16 Only loading parts of the programs allows them to fit into small physical memory at the same time. 139
4.17 When other parts of the programs are needed, we just swap them in. . . . . . . . . . . . . . . . . . 140
4.18 To fit multiple processes’ virtual memory space into a small physical memory, the system needs to
split virtual memory spaces into pages, and only load the pages that contain needed data and code
into physical memory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.19 Adding memory management unit (MMU) to CPU chip. . . . . . . . . . . . . . . . . . . . . . . . . 142
4.20 TLB as a cache for page table entries on MMU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.21 A translation lookaside buffer (TLB) is simply an SRAM cache that caches page table entries. . . . . 146
4.22 Physical memory retrieves and stores pages from the hard drive, and sends lines to the cache. The
cache then sends words to the registers, where the data are requested by the ALU. . . . . . . . . . 149
B.1 Using .balign exp will align the next data at the address of multiple of exp , with zeros as
padding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
B.2 Interface of running gdb for assembly. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
B.3 An example of different memory examining formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
B.4 Converting between double precision real numbers and integers. . . . . . . . . . . . . . . . . . . . 176
List of Tables
3.1 ALU operations for each instruction in the instruction set architecture. . . . . . . . . . . . . . . . . 88
One problem though: the voltage is not stable, due to the material of the
wire, temperature, and so on. It can stay roughly at a certain level with
small ups and downs, so we can use a threshold instead. Figure 1.1 shows
this idea.
When writing a decimal number, we usually add a letter D at the end, e.g.,
100D, -382D. The letter B is used as a suffix or prefix for binary
numbers: 100B, 100010B, 0b1010, etc. For hexadecimal, we can add
H at the end, such as 1AB5H, but we can also add 0x as a prefix: 0x1AB5.
From now on, even without specifying which base we're using, you should
be able to recognize numbers correctly based on their suffix or prefix.
A byte is always 8 bits, and this is the unit we're going to use most fre-
quently.² When we have more bytes, the unit definition could vary among

²: Fun fact: 4 bits is called a nibble. We rarely use that though.
Conversion to Decimals.
Given a binary number of 𝑑 bits 𝑟_{𝑑−1} 𝑟_{𝑑−2} ⋯ 𝑟_1 𝑟_0, where each 𝑟_𝑖 is a
single bit 0 or 1, the decimal number is calculated as follows:

𝑛 = 𝑟_{𝑑−1} ⋅ 2^{𝑑−1} + 𝑟_{𝑑−2} ⋅ 2^{𝑑−2} + ⋯ + 𝑟_1 ⋅ 2^1 + 𝑟_0 ⋅ 2^0.   (1.1)

You can see that all digits of the binary number are used to convert
into the decimal number.
Extension.
Later in this course we'll usually need to extend a binary number
to a larger size, e.g., extend an 8-bit number to a 16-bit number while
keeping its value the same. For unsigned numbers, we can just do
zero extension, because no matter how many zeros we add to the
front, in Equation (1.1) they don't carry any weight eventually. See
Figure 1.2 for an example.

Figure 1.2: Zero extension. In each example, we extend an unsigned binary number of four bits to one byte (1001 → 0000 1001, 0111 → 0000 0111).
Data Range.
When it comes to binary, another thing we usually talk about is the
range of the number. Let 𝑈^𝑑_max and 𝑈^𝑑_min denote the maximal and
minimal number an unsigned integer of 𝑑 bits can represent; then
we have:

𝑈^𝑑_max = 2^𝑑 − 1,   (1.3)
𝑈^𝑑_min = 0.   (1.4)
Signed numbers are not very complicated either. The only difference is that
we use the MSB as the sign: 1 for negative, and 0 for positive.
Conversion to Decimals.
When converting a signed binary into a decimal, we simply make
the weight of the MSB negative:

𝑛 = −𝑟_{𝑑−1} ⋅ 2^{𝑑−1} + 𝑟_{𝑑−2} ⋅ 2^{𝑑−2} + ⋯ + 𝑟_1 ⋅ 2^1 + 𝑟_0 ⋅ 2^0.   (1.5)

This representation of signed numbers is called two's complement.⁵
So... does 1000b represent −0? Let's keep reading.

⁵: There's also one's complement if you're interested.
Signed Extension.
Now that we're dealing with signed numbers whose MSB is a sign,
when we extend the numbers we cannot simply add zeros to the
front. For example, if we want to extend 1010b to a byte with zero
extension, we'll get 00001010b, which is +10, apparently a different
number.

But what if we just do zero extension and make the highest bit the
sign, as in 10001010b? Do the conversion using Equation (1.5) and
you'll see that's not right either.

With a bit of guessing, we notice that in the number 10001010b, the three
contiguous zeros take exactly a weight of +112, so why don't we
just make them all 1s, as in 11111010b? This is exactly the signed
extension!

ψ Caution!
A common mistake is to simply take the MSB as a sign and apply
Equation (1.1) to the rest of the bits. For example, 1010b is −6 in
decimal based on our equation above. However, if you just take the
leading 1 as a negative sign and convert 010b using Equation (1.1),
you'll get −2, which is obviously wrong.
Data Range.
Let 𝑆^𝑑_max and 𝑆^𝑑_min be the maximal and minimal numbers a signed 𝑑-bit
binary can represent, respectively:

𝑆^𝑑_max = 𝑈^{𝑑−1}_max = 2^{𝑑−1} − 1,   (1.7)
𝑆^𝑑_min = −2^{𝑑−1}.   (1.8)
Figure 1.4 shows a mapping from signed numbers to unsigned numbers.

[Figure 1.4 here: the signed range laid out against the unsigned number line, from 000...00 (zero) up through 𝑈^𝑑_max = 111...11; the pattern 100...00 = 𝑈^{𝑑−1}_max + 1 corresponds to 𝑆^𝑑_min.]

First, we get the unsigned binary representation of its absolute value 5,
which is 0b0101. Then flip every bit to 0b1010 and add 1 to it, which
gives us 0b1011, and that's exactly the two's complement of −5. If it's a
positive number, then the two's complement is identical to the unsigned
binary.⁶
Figure 1.4: For non-negative numbers, signed and unsigned have no difference.
When considering the MSB as a sign bit, the same binary pattern will be
mapped to a positive number in the unsigned range.

One nice thing about conversion between hexadecimals and binaries is
they can be mapped directly: every four-bit binary corresponds to a
hexadecimal digit:

1010 → A    1011 → B    1100 → C
1101 → D    1110 → E    1111 → F
6: Can you think of a justification why
this trick works?
Thus, if you want to convert a binary 000101101010011111010100 to hex-
adecimal, you just need to chop it into four-bit chunks: 0001 0110 1010
0111 1101 0100, which gives 0x16A7D4.
Calculations between binary numbers are bit-wise operations, and they're not
that different from decimal calculations. In computer systems, however,
each binary number has a fixed width, e.g., 64 bits, so the result might
differ from the true mathematical value, or not make sense mathematically.
Let’s look at an example first, where we restrict both operands and the re-
sult to be only one byte. Assume operand a is 0b1111 1111 , while b
is 0b0000 0001 . When we perform a + b by hand, we know we need
to carry one bit of 1 to the front, so we have a + b = 0b1 0000 0000 .
However, remember we want to restrict the result to only one byte, we can
only save the lowest eight bits, which makes is a + b = 0b0000 0000 :
1111 1111
+ 0000 0001
1
0000 0000
In this case, when the result has a carry (or a borrow in subtraction) out of
the available width, we say a carry happened. When we interpret the operands
as unsigned integers, we see this results in 255 + 1 == 0 .
0111 1111
+ 0111 1111
0
1111 1110
This time, we add two 0b0111 1111 together, and get 0b1111 1110 . If
we interpret them as signed numbers, this is basically 127+127 == -2 .
Now this doesn’t make sense mathematically — how come adding two
positive numbers results in a negative number? In this case, we say there’s
an overflow.
Assume the lengths of the operands are 𝑑 bits, and when we perform addi-
tion on them we may end up with a (𝑑+1)-bit number, so we have to get rid
of the MSB. As shown in Figure 1.5, if the result is between −2^(𝑑−1) and
2^(𝑑−1) − 1 (inclusive), truncating the MSB doesn't affect our result.
However, if it's in [−2^𝑑, −2^(𝑑−1) − 1], truncating the MSB will take us
to the positive range. This is called negative overflow. Similarly,
truncating a number in the range of [2^(𝑑−1), 2^𝑑 − 1] leaves us with a
positive overflow. Note that overflow is meaningless when we interpret
the numbers as unsigned.

Figure 1.5: Truncating the MSB of a (𝑑+1)-bit binary will result in positive
or negative overflow. There's no effect on any number between
[−2^(𝑑−1), 2^(𝑑−1) − 1].

˛ Quick Check 1.2

Is the following statement true or false, and why? It is possible to
get an overflow error when adding two signed numbers of oppo-
site signs.
B See solution on page 21.
1.1.5.2 Shifting
As the name suggests, shifting is simply to shift every bit several digits
over. There are two directions: left and right, meaning we can shift the
number to the left or the right.
Even though the bit-wise operations of logical shift left and right are
the opposite of each other, shifting right cannot be regarded as dividing by
two. This can be easily verified from the examples shown in Figure 1.7.
The most obvious thing to notice is that, if the number is a signed in-
teger, shifting right might even change the sign of the number. This
is true when the MSB of the original binary number is 1. We call

[Figure: bit-indexed diagrams of shifting the 8-bit pattern 1000 1100 left and right.]
Most of our modern computers are built on the simple model from John
von Neumann, called the von Neumann model or Princeton architecture.
Figure 1.9 shows us a visualization of this model. For our course, we fo-
cus on Central Processing Unit (Chapter 3) and Random Access Memory
(Chapter 4) most of the time. I/O devices will be mentioned, but not dis-
cussed in depth.
As its name suggests, the CPU is the core of every computer, since it's where
the actual calculation happens. Everything we see eventually needs to go
through the CPU to produce results. Roughly speaking, the two most impor-
tant components of the CPU are the registers and the arithmetic logic unit (ALU).
1.2.1.1 Registers
Most machines have a set of registers for us to use, and this set is
usually called a register file. You can think of registers as small storage
units, each holding one binary sequence.
Figure 1.9: A typical von Neumann model. The central processing unit (CPU),
containing the register file, the ALU, the clock, and the control unit, is
connected to random access memory (RAM) and to I/O devices through three
shared buses: the data bus, the address bus, and the control bus.
want. There are also special registers whose stored values we cannot
modify, and we'll introduce them later.
Once the operands are ready in the registers, they'll be sent to the ALU for
the real calculation. The ALU receives two operands and does simple
calculations as specified. In fact, the ALU is very simple: the only
calculations it can do are arithmetic (addition and subtraction) and
bit-wise logic ( and , or , etc.). Later in this course, we will understand
how this simple elementary-school math in the ALU leads to something so
magical and powerful.
Data Storage.
You can think of the RAM as a large array or list of bytes. It's byte-
addressed, meaning each byte in the memory has its own address.
The addresses typically range from 0 to a very large number, depend-
ing on the size of the RAM. For example, a 4GB RAM stores
4 ⋅ 2^30 bytes, and therefore the address range will be [0, 2^32 − 1].
Apparently, this 4GB memory is never large enough for a modern
computer; in fact it is much, much smaller than what we actually need.
Later in this course, we will learn about the ingenious ideas of virtual
memory and physical memory. For now, their difference doesn't
really matter.
Data Transfer.
Note that RAM is not inside the CPU; thus, if we want the CPU to do some
calculations on data from RAM, we need to move the data from RAM to
registers first. From Figure 1.9, you have probably guessed that the CPU
sends an address to RAM through the address bus, and then RAM sends
the data stored at that address back to the CPU through the data bus.
to use just one wire, because we’d have to wait for 8 seconds (as-
sume each second transfers one bit) to get the number. What makes
sense is to group 8 wires together, and transfer the 8 bits on them all
at once, one bit per wire. This group of 8 wires is called a bus.
The width of the buses can vary based on their needs. For 32-bit
machines, an address bus has a width of 32, and thus it can address
at most 4GB of memory. On 64-bit machines, the address bus has a
width of 64, which makes the addressable space 2^64 bytes, or 16EB.
The width of the control bus depends on how many signals the control
unit needs to send to different parts of the machine. We'll learn about
this in Chapter 3.
CPU and RAM are the main focus of our course, but for a real computer
that's obviously far from complete. First of all, remember the storage of
memory is very limited, and therefore we can't store all our data/programs
in it. In fact, a program is loaded into RAM only when we start executing
it. Also, we need to connect monitors, keyboards, etc. We call these
peripheral devices, and from Figure 1.9, we see that they are mostly I/O
devices, including hard drives.
from the bottom (hardware), computer scientists built one layer on top
at a time, which eventually transforms digital signals into everything we're
familiar with today.
To prepare you for the madness later in this course, let's start from the
layer you're most familiar with, which is the high-level language. Not all
high-level languages are created equal, though. Our purpose is not to
learn a new language; instead we want an entry point that's high-level
enough for understanding but also closely connected to low-level hard-
ware design. C is the clear winner from this perspective.
int main() {
    return 0;
}
Variable and function declarations are very similar to Java, so we will skip
that part. One difference is that C is not an object-oriented language, so
functions and variables don't have access modifiers such as public ,
private , etc.
In C, we have several basic data types: char , short int , int , long int ,
float , and double . The difference among the three types of ints is that
they use different numbers of bits to represent an integer, so they have
different ranges of representation. The same goes for float and double ,
where the result is that they have different precision.
In fact, char types are also integers, but can only store eight bits (see
Section 1.4.2.2). The integers they store represent a character’s ASCII code,
so the following two declarations are equivalent:
char c = 'A';
char c = 65;
Notice here we don’t have string types. Since a string is basically an array
of characters, we can declare it using three different methods:
The difference between the first two is the place where the string is stored,
and we can just ignore it now. Notice for the last method, we add a 0 at
the end. This is a null terminator, marking the end of the string. If we use
the first two methods, a null terminator will be automatically attached, so
we do not need to add it manually. The null terminator can be declared
using integer value of 0, or a character '\0' .
Integer data types (including char ) by default are signed. You can add
modifier unsigned to make them unsigned. For example,
We’ll introduce printf() in the next section, but based on the output,
you can see they have different values, even if we declare them to be the
same. Actually you can verify that, if an integer takes four bytes, −253 is
indeed 4294967043 when interpreted as an unsigned integer.
C language can deal with binary numbers as well. We show the operators
and their descriptions in Table 1.2.
Table 1.2: Binary operators in C language.

Operator   Description
<<         Logical shift left
>>         Arithmetic shift right
&          Bit-wise and
|          Bit-wise or
^          Bit-wise xor
~          Bit-wise negation

Notice that, unlike in C++, << and >> are not used as stream operators
in C, so we cannot print something to the terminal using them. They are
simply used to manipulate the bits of numbers. Also, & and | are different
from the logical operators && and || . Please be careful when you use them.

You can certainly apply them to any numbers you like. See the following
example:
where the left operand of << is the original number, and the right operand
is the number of bits we want to shift.
1. a & b;
2. a | c;
3. a | 0;
4. a | (b >> 5);
5. a & (b >> 5);
6. ~((b | c) & a);
1 #include <stdio.h>
If it’s just a string, such as those declared in the code listing above, we can
use puts() function:
1 puts(str);
where we assume there's a variable a whose value is 10. Then the output
will be: Variable a = 10 , with a new line. You can see that %d
is replaced by the value of a when printed out. Different types require
different format specifiers. The common ones are shown in Table 1.3.
Table 1.3: Commonly used format specifiers in C.

%d    Decimal signed integers
%u    Decimal unsigned integers
%f    Float numbers
%lf   Doubles
%c    Characters
%s    Null-terminated strings
%p    Pointers

In C, we cannot print an entire array out with one single printf() ; in-
stead, we need to loop over the array, and print each element individu-
ally:

double arr[5] = {3.1, 4.223, 5.152, 10.01, 2};
for (int i = 0; i < 5; i++) {
    printf("The %d-th element is %lf\n", i, arr[i]);
}
We’re fairly familiar with some typical control structures such as loops and
if-else , so we’ll skip them here, but bring a unique and “infamous”
keyword in C, called goto . We say it’s infamous, because it has been
criticized for so long.
What it does is very simple: you mark a line of your C code with a label,
and you can use goto to change your program to execute the line with
the label. For example:
#include <stdio.h>
int main() {
    int a = 10;
    a = 20;
    goto L1;
    a = 40;
L1: printf("a = %d\n", a);
    return 0;
}
In this example, we mark the line with printf() using a label L1 . After
we assigned 20 to a , we used a goto statement to jump to L1 . As
you see, in the print out string, value of a is 20, instead of 40, because
the statements between goto and the destination label L1 were simply
skipped.
Except that you cannot label a statement that declares a variable,
there's no limit on where you can put a label. You can certainly put a label
before the goto statement like this:
int a = 10;
L1: a = 20;
goto L1;
a = 40;
printf("a = %d\n", a);
return 0;
But what will happen? You’re stuck in an infinite loop! You can even do
this:
But as you can see, goto statements cause problems: the program logic
and algorithm become very unclear, the control flow is unstructured, and
the program becomes difficult to debug and understand. That's why it's
been criticized so hard and so widely, and almost everyone advises against
using it. Please do not use goto in any of your assignments/labs/exams in
this course unless specifically asked.
1.4.2 Pointers
Every variable, and even our code, is stored in memory when the pro-
gram executes. Since it's in memory, it's stored as binary numbers,
and every byte has an address (see Figure 1.10). In C, we
can explicitly use those addresses, which is a major difference from the
languages you have learned so far. The addresses in C are called point-
ers. Pointers are "pointing" to a variable in our program, and have a type
closely related to the type of the variable they point to.
1 int main() {
2 int var = 10;
3 int* ptr_var = &var;
4 return 0;
5 }
Given a pointer, if we want to get the value stored at the address, we can
dereference the pointer by using * operator:
1 int main() {
2 int var = 10;
3 int* ptr_var = &var;
4 int deref = *ptr_var;
5 return 0;
6 }
When running this program, on line 3, ptr_var will store the
address of var , say 0xffff1000 . Then on line 4, we use * to derefer-
ence the address, which gives us the value stored there, and we assign it
back to the variable deref . Thus, the value of deref is 10.
Following the example above, and based on our current discussion, it’s
not difficult to derive the double pointer:
This leads to another question. We know different data types can
occupy different numbers of bytes. When you dereference a pointer, how
does the system know whether you want, say, an integer or a double? Remember
pointers also have types, e.g., int* and double* . If you're dereferenc-
ing an int* , the system will group four contiguous bytes starting from
the address indicated by the int* and interpret them as an integer. Thus, the
type of the pointer determines how many bytes it needs to retrieve from
the address.
1.4.2.3 Endianness
Almost all the memories are byte-addressed, meaning each byte has a
unique address. We also know some data types occupy multiple bytes.
The question is, how do you store these different bytes in the memory, i.e.,
in what order?
Now these four bytes will be stored in the memory when the program is
executing, and each byte has its own address. Let’s assume the integer’s
address &var is 0x1000 . Question is, which byte has what address?
There are two ways to manage this. First, we can let the lowest byte occupy
the lowest address, called little endian:
Of course we can also let the lowest byte occupy the highest address, which
is called big endian:
In fact, the name of an array is also the address of its first element, i.e.,
arr == &arr[0] .
When running this program, you'll notice that the addresses on two con-
secutive lines differ by eight bytes. This is because arr is a double*
and each double takes eight bytes, so adding 1 to the pointer pro-
duces an offset of eight bytes. More generally, for a pointer p of type x
and an integer i , when we do p + i , the new address will actually be
i * sizeof(x) bytes higher than p .
We have also learned dereferencing a pointer to get the value stored at that
address, so here’s another way to print all elements out:
We know that to print out an array we need to use for loops, where we’ll
specify the number of iterations we want to repeat. One exception, how-
ever, is strings, where we can just use %s to print all the characters in the
char array:
Note that printf() receives str , which is the starting address of the
char array, and it'll print one byte after another, starting there.
When will it stop, then? That's where the null terminator comes into play:
printf() starts printing at the given starting address of the string and keeps
printing until it reaches a null terminator.
What if we don’t have the null terminator? Sometimes it’s fine, because
there’s already no data stored at the end of the string, and therefore its
value is just 0, which happens to be the null terminator. But try the follow-
ing program:
1 #include <stdio.h>
2 int main() {
1. For each of the following statements, state if it’s true or false, and
explain why it’s true or false:
a) Depending on the context, the same sequence of bits may rep-
resent different things;
True. The exact same bits can be interpreted in many different
ways: as an unsigned number, as a signed number, or even as a
program. It all depends on your interpretation.
b) If you interpret a 𝑁 bit two’s complement number as an un-
signed number, negative numbers would be smaller than pos-
itive numbers.
False. In Two’s Complement, the MSB is always 1 for a neg-
ative number. This means ALL Two’s Complement nega-
tive numbers will be larger than the positive numbers.
2. For the following questions, assume an 8-bit integer and answer
each one for the case of an unsigned number and two’s complement
number. Indicate if it cannot be answered with a specific represen-
tation:
a) What is the largest integer? What is the result of adding one to
that number?
Unsigned: 255, 0;
Two’s Complement: 127, -128.
b) How would you represent the numbers 0, 1, and -1?
Unsigned: 0b0000 0000 , 0b0000 0001 , not possible;
Two’s Complement: 0b0000 0000 , 0b0000 0001 , 0b1111 1111 .
c) How would you represent 17 and -17?
Unsigned: 0b0001 0001 , not possible;
Two’s Complement: 0b0001 0001 , 0b1110 1111 .
3. What is the least number of bits needed to represent the following
ranges using any number representation scheme:
a) 0 to 256; this needs 9 bits, since the range contains 257 distinct values and 2^8 = 256 < 257 ≤ 2^9.
False. Overflow errors only occur when the correct result of the ad-
dition falls outside the range of [−2𝑑−1 , 2𝑑−1 − 1]. Adding numbers
of opposite signs will not result in numbers outside of this range.
Assume we have a binary number 1100 0101 . How can we apply shift-
ing operations on this number so we can get 0000 0101 ? That is, only
preserve the lowest four bits and zero out the rest.
First shift left four bits, then shift right four bits.
Assume we have a binary number 1100 0101 . How can we apply only
bit-wise logical operations on this number so we can get 0000 0101 ?
That is, only preserve the lowest four bits and zero out the rest.
Now let’s do a simple quick check to make sure you understand every-
thing so far.
True or False?
2. a | c;
0b1111 1011
3. a | 0;
0b1000 1011
4. a | (b >> 5);
0b1000 1011
Can you write a program to show all the bytes in the following array indi-
vidually, in the order they are stored?
Solution:
#include <stdio.h>
int main() {
    double arr[5] = {3.1, 4.223, 5.152, 10.01, 2};
    unsigned char* p = (unsigned char*)arr;  /* unsigned so %x won't sign-extend */
    for (size_t i = 0; i < sizeof(double)*5; i++) {
        printf("0x%x\n", *(p+i));
        /* p[i] is also correct! */
    }
    return 0;
}
Instruction Set Architecture 2

In this chapter we will discuss the ARMv8 instruction set architecture. From
now on, I want you to forget all the magical things you learned about pro-
gramming languages.

2.1 Instruction Format

Instructions consist of four parts and have the following format:
where label marks the address of the instruction in memory, and is op-
tional; mnemonic is the text representation of the specific operation we want
the instruction to carry out (think of the plus sign in 22 + 33), and is
required for any instruction; operands are separated by commas, and the
number of operands needed is determined by the actual mnemonic of
the instruction (it can vary from 0 to 3). The operands can be numbers
(called immediates) and/or registers. Lastly, line comments are marked
by a double slash, and block comments can be written with a pair of /* ...
*/.
Register Names

Figure 2.1: Register file of ARM architecture. The X-registers X0 ... X29
can each store 64 bits. The lowest 32 bits of each X-register can also be used
independently as the corresponding W-register ( W0 ... W29 ). However, the
upper 32 bits cannot be used independently.
2.2.1 Load
ą Example 2.1
An integer is currently stored at memory address of
0xFFFFFFFFEB101010 , and this address has already been
stored in register X9 . Please load this integer into the 11-th
register.
Figure 2.2: LDR will grab a few bytes (determined by the size of the destination register) from the starting address, calculated by base+simm9. For example, LDR W10,[X9,0] reads from starting address X9+0, while LDR W10,[X9,2] reads from starting address X9+2 (with X9 = 0xFFFFFFFFEB101010 here).
We can also load a byte or a half word (2 bytes) into a register, but with
caution. Remember the register operands are either X-registers (8 bytes)
or W-registers (4 bytes), so if we want to grab only one or two bytes from
memory, we need to also extend the data to match the size of the desti-
nation. The instructions that can do this for us have the same format for
operands as in LDR, but the mnemonic can have variations:
The [S] in the mnemonic is optional; if used, it’ll sign extend the data to
match the destination register; otherwise it’ll zero extend the data. The
{H|B}, required, stands for either move half-word (H) or a byte (B). For ex-
ample, LDRSH W10, [X9,-24] will grab two bytes starting from address
X9-24, and sign extend this two byte data to four bytes, and store it into
W10.
The suffix mentioned above can also be applied to these two forms of the
instruction.
2.2.2 Store
ą Example 2.2
Currently X9 stores a quad number 0x12345678ABCDEF00 ,
and we want to store this number into the memory address
0xF..FEB101010. Currently X10 is storing this memory address.
Write the instruction to achieve this, and draw the memory layout.
Assume we’re using a little-endian machine.
The suffix mentioned above can also be applied to these two forms of the
instruction.
The value in X9 : 12 34 56 78 AB CD EF 00, with X10 = 0xFFFFFFFFEB101010 in both cases.

Little endian:                       Big endian:
Address               Data          Address               Data
0xFFFFFFFF EB101017   12            0xFFFFFFFF EB101017   00
0xFFFFFFFF EB101016   34            0xFFFFFFFF EB101016   EF
0xFFFFFFFF EB101015   56            0xFFFFFFFF EB101015   CD
0xFFFFFFFF EB101014   78            0xFFFFFFFF EB101014   AB
0xFFFFFFFF EB101013   AB            0xFFFFFFFF EB101013   78
0xFFFFFFFF EB101012   CD            0xFFFFFFFF EB101012   56
0xFFFFFFFF EB101011   EF            0xFFFFFFFF EB101011   34
0xFFFFFFFF EB101010   00            0xFFFFFFFF EB101010   12

Figure 2.3: STR is the opposite direction of LDR. Notice how the same value of X9 is stored differently on little and big endian machines.
Later you’ll find out that moving data from memory to registers is not al-
ways the best choice. Sometimes we just want to set some integer number
to a register, something as simple as giving a variable a value. The instruc-
tion we can use is MOV .
Notice two things here: first, the destination and source registers should
match in size; second, it's called "moving", but really it copies data from
source to destination. This means after the instruction, the data will exist in
two copies: one in the destination, and the other in the source.

For example, MOV X9,X10 will copy the data from X10 to X9 . If X10
previously stored the integer value 382, then after the instruction, both X9
and X10 will have the value 382. The old value in X9 will be completely
overwritten.
MOV X9, -9
MOV W9, 10
Once the data have been moved to registers, either from an immediate
number or memory, we are ready to perform real computations.
Let's start with the simplest arithmetic: addition and subtraction. The
mnemonics for these two operations are very straightforward: ADD and
SUB .

The subtraction instructions have the same format except the mnemonic
is SUB , so we'll skip them here.

When using W registers, however, we have to be careful. For example, as-
sume we have X0 = 0x2222222222222222 and X1 = 0x3333333333333333 ,
both full 64-bit numbers. When we use X registers to do addition, e.g.,
ADD X2,X1,X0 , X2 will be 0x5555555555555555 without any prob-
lem. If we use ADD W2,W1,W0 , however, the highest 32 bits will be cleared
(set to zero), and only the lowest 32 bits will be used. Therefore, in this case,
X2 will be 0x0000000055555555 . See Figure 2.4 for an illustration.1

Figure 2.4: Using W and X registers for addition. If W registers are used, the
highest 32 bits will be cleared out. The gray boxes are W registers.

1: This is very different from other machines such as x86, where operating on
the low bits of a register does not change the high bits at all. Therefore,
reading documentation carefully is important; sometimes it's also good to just
play around and find out small details like this.
2.4 Data Processing Operations 31
Example 2.3
Assume a long integer var is stored at address ptr_var in mem-
ory. The value of ptr_var has been loaded into register X9 .
Write a sequence of ARMv8 assembly to add 1 to var . This is
similar to translating the following C code into assembly:
After the ADD instruction, we are not done yet, because the result currently resides in a register, but what we want is to update the variable, which resides in memory. Therefore, we need to move the result back by using the STR instruction. Notice that at this moment
X9 still stores the address of var , so if we write back there, it’ll
overwrite the old value of var : STR X12,[X9] .
Remarks: You can certainly set the destination register for ADD to
X10 , which will overwrite the old value of X10 , basically like
x10 = x10 + 1 . However, the fact that X10 was updated with a new value does not mean the value inside the memory variable var will also be updated. Therefore, you have to STR the result back manually.
Another type of calculation that the ALU can perform is logic operations, which are basically bit-wise operations. The instructions we can use are:
All these instructions can also be used for W -registers, but all three operands
need to match the size.
One huge difference between the high-level languages you have learned
and assembly is that the former is highly structured, whereas the latter is
just a sequence of instructions. Therefore, repeat after me → “I will forget
2.5 Flow Control 33
The assembly program simply executes one instruction after another se-
quentially, unless we explicitly instruct the program to jump to a specific
instruction. Assembly programs don’t have “memory” either: you cannot
expect the program will magically jump back to somewhere without your
instruction.
Once we’re clear on this, let’s talk about program counter first.
The name of program counter (PC) might be a little bit confusing; its job
is not exactly to “count” the program. PC is a 64-bit special register inside
CPU, and no we cannot modify the data inside it.
Caution!
PC is not one of the 32 general purpose registers we introduced in Section 2.1, therefore it cannot be directly referenced. For example, instructions such as ADD PC,PC,4 and LDR X0,[PC] are wrong.

You probably already know this: when we're executing our program, the assembly instructions are also stored in RAM. Because everything stored in RAM has an address, each instruction in our assembly programs also has its own address.

For us humans, when we read/write an assembly program, we know after one instruction we just go to the next one, but how does the machine know? Here's what happens: the CPU will store the address of its next instruction to be executed into PC. After the current instruction has finished, the CPU will simply go to RAM and retrieve the instruction pointed to by PC. 2

2: Disclaimer: this is an extremely simplified example. Later in Chapter 3, you'll see it's more complicated. However, to not get confused, it's entirely ok to think about it this way now.

Example 2.4
Consider the following assembly sequence,
2.5.2 Branching
The simplest way is to use the B instruction (which stands for branching). The syntax is B target where target is the target instruction's memory address. The problem is we don't really know the memory address of each instruction when the program is running. Therefore, we can use a label (if you forgot about labels, go back to Section 2.1).
Example 2.5
Consider the following C code,
1 long int a;
2 if (a == 0) a ++;
3 a --;
There are two possible paths for the program: one goes through if block,
while the other one doesn’t. If we draw a flowchart, it’ll be like the middle
diagram in Figure 2.6.
Given this flowchart, we just need to translate each part into their cor-
responding assembly instructions. Assume the address of variable a is
stored in X0 , and we want to load this variable into X9 . The if block
is simple: we just need to add 1 to register X9 . The instruction SUB is
the destination of branching (for the case when a!=0 ), so we need to add
a label in front of it, say L1 .
Now we only have one problem left: how to translate the condition, i.e.,
the if statement, and this is where we’ll introduce the first set of condi-
tional branching instructions. Currently, variable a is in X9 , so we just
need to check if X9 is zero or not; if it is zero, we keep moving on to the
next instruction (ADD); otherwise we “jump” or branch to the instruction
labeled as L1. The instruction to do this is CBNZ (compare and branch if
not zero):
This instruction will check the value inside Xt . If it’s not zero, it’ll move
the address of the instruction tagged with Label to PC, and thus branches
to that instruction. If it is zero, it’ll just move to the very next instruction.
Another related instruction is CBZ , which, as you probably figured, jumps
to the destination if the register it's checking is zero.
Now that we’ve introduced the branching instructions (at least a small
portion of it), there are some common beginner mistakes that we need to
be careful of.
Let’s re-write the C code in Figure 2.6 using CBZ instruction. Following
our steps before, the instruction ADD is the target of CBZ , so it’s natu-
ral to give it a label, say L2 . So someone came up with the following
sequence:
Let’s walk through this sequence. If X9 is indeed zero, the flow of in-
struction would be LDR → CBZ → ADD → SUB, which is correct. If X9 is
not zero, the flow is also LDR → CBZ → ADD → SUB, which is not what we
want obviously. The problem is we didn’t skip the ADD instruction, so a
quick and simple fix: after CBZ, add an unconditional branch B to jump to
L1 :
Example 2.6
Translate the following C code into assembly:
1 long int a;
2 if (a == 0) a = a + 2;
3 else a = a - 2;
4 a = a * 4;
With the flowchart, we can clearly see there are two branches, and
thus two destinations: one is a = a - 2 for the else case, and the
other is a = a * 4 after the if case has finished. Therefore, when
translating to assembly, we will add two labels, L1 and L2, one for
each destination.
Figure 2.7: Without unconditional branch B, the program flow would be wrong when variable a is zero.
to skip the else block and jump to L2. This time, the branching is
unconditional: it doesn't depend on a specific value or register,
and therefore we use the unconditional branch B L2 to force the
program to dodge the else case.
1 if (a == b) goto L1;
2 else goto L2;
This version of the assembly translation has its limitations: (1) we have to
store the subtraction result in a register, which can be wasteful, and (2) CBZ
or CBNZ can only branch based on whether the register operand is zero, but
can't branch based on whether it's positive or negative. Thus, we introduce another,
more extensible way of branching.
The steps of conditional branching can be simplified to just two steps: (1)
compare the two numbers, and (2) branch to destination based on some
criteria. We have an instruction for step (1), which is CMP :
The meaning of the sequence above seems more human friendly, doesn’t
it?
What’s more convenient with CMP instruction is that it can also directly
compare with an immediate number:
Thus, if we want to compare, say X9 and 100 , we can simply do CMP X9,100 ,
instead of moving 100 to a register and subtracting two registers and
then using CBZ / CBNZ .
At this point, you probably would be wondering: it looks like CMP in-
struction is also doing subtraction, but where does it save the result? And
since B.EQ is a separate instruction, how does it know which two regis-
ters we are comparing? The magic behind these is condition codes.
Inside CPU, there’s a special register called CPSR, or current program sta-
tus register, which stores 32 bits. Different than the registers we have seen,
these 32 bits are used individually, and typically directly set by the proces-
sor (not us!). We only care about the highest four bits, which correspond
to the four condition codes respectively:
Bit:     31         30     29     28         27–0
Flag:    N          Z      C      V          Other flags
Meaning: Negative   Zero   Carry  Overflow
When a condition code has a value of 1, we say it’s set; otherwise it’s
clear.
In sum, condition codes reflect the status change of the instruction just
being executed, or some attributes of the result of the operation. However,
not all instructions are designed to change condition codes. So far the only
instruction we learned that changes the codes is CMP . Arithmetic/logic
instructions such as ADD and ORR do not modify condition codes. If we
want them to set condition codes, we need to use the following instructions,
which explicitly set condition codes: ADDS , SUBS , and ANDS . 3

3: None of the logic instructions, including shifting, modifies condition codes, except ANDS .
Now let’s go back to this sequence:
During CMP , CPU subtracts X10 from X9 , but does not store the result
anywhere. If the result is zero (meaning X9 is equal to X10 ), the Z flag
in CPSR will be set; otherwise it’ll be cleared. Then we move on to the
next instruction B.EQ . This instruction only inspects the current value of
Z flag: if it’s set, it’ll branch to L1 ; otherwise it’ll execute the following
instruction, which in this case is B L2 .
1 MOV X9, 9
2 MOV X10, 10
3 CMP X9, X10
4 ADD X9, X9, 10
5 SUB X10, X10, 3
6 CMP X9, X10
7 CMP X10, X9
Table 2.1: Conditional branch instructions B.cond and condition code checking.
So now the magic of CMP has been revealed: the CMP instruction simply does
the subtraction and changes the condition codes correspondingly. This
provides us a great ability to extend comparison and branching to more
complicated cases such as if (a < b) and so on.

Caution!
The CMP instruction changes all condition codes based on the result, not just the Z flag.

Consider the following C code:

Assume variables a and b are in registers X9 and X10 . The first step
is to CMP X9,X10 . Next, we use a conditional branch instruction B.LT
(branch if less than). Thus, the assembly sequence looks like this:
One possible question is, for example, when comparing ≤, we have signed
numbers and unsigned numbers, so how does the computer know if
a piece of data is signed or unsigned? The answer is, it doesn't know, or rather it doesn't
need to know. Remember we emphasized this from the very beginning: everything
inside memory, whether it's code or data, and no matter what kind
of data, is just a binary sequence. Surely when we write a C program, we can
specify something like: int a = -1; unsigned int b = 4294967295;
which makes variable a a signed number and b an unsigned number.
However, if you take a look at how they are stored in the machine, both of
them are exactly the same: 0xFFFFFFFF . The machine has no idea which
one is what.
Then who decides which number is signed or unsigned? Of course it's us.
If we treat the numbers to compare as unsigned, then just use, for example,
B.LS ; if we want to take them as signed, then use B.LE . From Table 2.1,
you see both B.LE and B.LS compare ≤; the only difference
is that these two instructions check different condition codes. Instructions
such as CMP will modify the condition codes regardless; then it's up to us
how we want to treat our numbers and thus which instruction to choose.
Example 2.7
Note for this example, we just pretend X -registers are 8 bits wide
for convenience; they are of course 64 bits in fact. Assume that register
X9 currently holds the data 1111 1011 whereas X10 holds
0000 0101 . Consider the following assembly code:
From this example, you can see when running instructions that set
condition codes, the machine has no idea if the operands are signed
or unsigned at all. It’ll do its job and set the condition codes based
merely on bit operations. It is our job to decide to treat them as
2.5.3 Loops
You probably already figured out that loops are basically just branching,
but backward instead of forward. Let’s look at the following C code:
1 long int i = 0;
2 while (i < 10) i = i + 1;
1 long int i = 0;
2 Begin: if (i - 10 < 0) goto Inc;
3 else goto Out;
4 Inc: i = i + 1;
5 goto Begin;
6 Out:
This way, we convert the loop into a conditional branch we’re familiar
with, and therefore the translation to assembly is straightforward. See Fig-
ure 2.8 for the assembly code.
However, there are some minor details in Figure 2.8 that are worth mentioning. This time, we didn't use LDR in the beginning. This is fine, and just
2.5.4 Arrays
Dealing with arrays almost always needs loops. Let's start with the simplest
case, i.e., character arrays (or strings), since each character takes only one
byte.
2.5.4.1 Strings
where we used the fact that all strings should end with a null-terminator
0. Now let’s write the assembly for this task!
1. Analysis:
From our C knowledge, we already know str is a pointer, and its
value is the address of the first byte of the array. Therefore, str
is the base address of the array. Because each character takes one
byte, the index i of each character is also the offset of that char-
acter from the base address. Thus, (str+i) is the address of the
i -th character.
2. Planning:
Now, let's assume the address of str is already loaded into X9 . In
summary, we have the register usage shown in the table on the side.

Register | Data
X9       | base address of str
X11      | index i
W12      | i -th element *(str+i)
Figure 2.9: LDRB loads one byte into the lowest byte in a W -register.
3. Assembly:
All the direct translations of assembly have been added to the pseu-
docode for clarity. Now the only thing left is to complete the loop:
We also use Figure 2.9 to illustrate the step of LDRB in the assembly code.
Note integers are usually stored in W -registers, because each integer takes
four bytes. To simplify, we use an X register for the index i , and it won't cause
a problem in this example. However, to move individual bytes, we have
to use W -registers with the LDRB and STRB instructions.
Strings are relatively easy, because each element of the array takes only
one byte, and the index of the array can thus be conveniently used as offset
from the base address. When a data type takes more than one byte, it needs
to be planned more carefully.
Now we’ll modify the example in the previous section with a long int
type array, where each element takes eight bytes. The task is still to add 2
to every element in the array:
Let’s follow the three steps again — analysis, planning, and assembly —
to translate this C code into assembly.
1. Analysis:
From the previous example, we've become fairly familiar with arrays, so
no doubt arr is the base address of the array. What's different here
is that the index i cannot be used as the offset anymore. When it was a string,
moving the index by one also meant moving to the next
byte, and therefore the index could conveniently be used as the offset
as well. Here, each element takes eight bytes, so the offset must be
computed from the index.
2. Planning:
Based on the analysis above, we are clear that before using LDR or
STR , we have to calculate the offset correctly. Given an index i , if
each element takes eight bytes, the correct offset should be i * 8 .
Thus, we have the planning for registers as shown on the side.
The pseudocode is very similar to the previous example, except that
it needs an additional step to calculate the offset correctly for each
element. Also notice that this time, because we're using the full eight bytes,
we use the LDR and STR instructions.

Register | Data
X9       | base address of arr
X11      | index i
X12      | i-th element *(arr+i)
X13      | offset i*8
Figure 2.10: Each long integer takes eight bytes, so the starting address of each element is 8*i where i is the index of the element. In this figure, LDR X12,[X9,X13] will copy eight bytes starting from the address pointed to by X9+X13 into X12 .
3. Assembly:
With the analysis and planning, translating to assembly is very straight-
forward:
2.6 Procedures
Before talking about procedures, let’s review the memory address space
first. Whenever we start a program, we assume that the program takes the
entire virtual memory space from address 0x00..0 to 0xFF..F in the
memory. At the bottom of this space, we have .text segment that stores
our assembly code and read only data (e.g., string literals). Going up, we
have .data segment that stores global variables. After .data segment
we also have .bss used for uninitialized data. These parts are basically
pre-determined during compilation time, meaning when generating an
object file, the layout and data in these segments are already arranged in
the object file.
The rest of the space will be used during run time, meaning they’re used
for storing and managing data that only occurs when the program is actu-
ally executing, such as local variables. The start of the area is called heap,
which is used for dynamic allocations during run time. The end of the
area is called stack, used for procedure calls. Since they occupy the two
ends of the empty space of the memory, they grow towards each other, or
towards the center. Figure 2.11 shows a bit more detailed visualization.
Since we're going to write procedures, we'll naturally focus on the stack area, and see how calling or returning from a procedure manages the stack.

Figure 2.11: Visualization of a virtual memory space for a program. From the bottom (0x000...0) up: the read-only segment (.text, .rodata) and the read/write segment (.data, .bss), both loaded straight out of the executable file; then the heap; then unused space; and finally the stack, whose bottom sits at the top of the address space (0xFFF...F) while SP (the stack pointer) marks the stack top. During run time, the heap and stack grow towards each other as our program calls a procedure, or allocates space dynamically. For the stack, the bottom is actually at the high address, while the top is at the low address, so you can take it as an upside-down stack.

Through Data Structures you have learned that a stack is a First-In-Last-Out (FILO) structure, and the main operations for stacks are push and pop. Push is to add something to the top of the stack, whereas pop is to take the top element of the stack out.

The stack is also the structure for procedure calls. Intuitively, there are a lot of similarities between the behaviors of stacks and procedures. Let's look at the following example.

Assume we have C code with three functions called funx() (x = 1, 2, 3). main() calls fun1() , and fun1() calls fun2() and fun3() :

1 int main() {
2     fun1();
3     return 0;
4 }

1 void fun1() {
2     fun2();
3     fun3();
4 }
Formally, fun1() is the caller of fun2() and fun3() , and these two
functions are the callee of fun1() . Assume every time we call a proce-
dure, we put a “block” that represents the callee at the top of a stack, and
every time we return from a procedure, we take that block out. Figure 2.12
Figure 2.12: If we treat function/procedure calls as stacking some “blocks”, the process of procedure calls looks really like pushing blocks
to the stack. At point (2), fun2() returned to fun1(), and therefore its block is removed from the stack top.
shows us a timeline from when fun2() was called to when fun1() returned
to main() .
Now think of the "blocks" as boxes where each procedure stores its own local
variables. These boxes, called procedure frames, or simply frames, are ex-
actly how the system manages stack space for procedure calls, as in Figure
2.11.
One question is then, are those frames created automatically when we call
a procedure? The answer is no. Frames are nothing but a designated area
on stack for procedure management; it is basically in our head. When
we call a procedure or return from a procedure, we have to manually
allocate/de-allocate the frame area on stack. We don’t need to do this in
high-level languages but in assembly we have to manage these areas.
Simply put, procedures in assembly are just branches with return. Let’s
look at the following example.
In the code listing above, line 1 to line 3 can be treated as a procedure called
Proc . You can see it’s nothing special but a branch.
We start running our program from label _start , and branch to Proc .
This looks like we’re “calling” the procedure Proc . Then at line 3, we un-
conditionally branch to Ret_Pt , which is the next instruction of B Proc ,
and move on from there. This is like we are returning from the procedure
Proc .
2.6 Procedures 49
This code is a very simple (though incomplete) procedure call, but
it shows us the first essential element of a procedure call: the return address.
What’s special about procedure is, after the callee finishes, the program
will return back to where it was called in the caller, and keep executing
from there. Since in assembly every instruction has an address, we need
to know where we should go back to from the procedure. In the example
above, instruction B Proc is the calling point of Proc , and so when
returning from the callee, its following instruction on line 5 is the return
point, which is labeled as Ret_Pt .
Using a label to mark the return address is an ok option, but not the best.
If you have a billion functions, you'd have to create a billion labels
for their return addresses. Too many labels only make programs hard to
follow. Plus, there are also portability issues.
What BL Proc instruction does is first push the address of its following
instruction (i.e., the return address) into register X30 , and then copy the
address labeled by Proc to PC, so that we can branch to the procedure
Proc . Notice that, by convention, X30 (which can also be referred to as LR ) is called the
link register, whose purpose is exactly to store the return address. Therefore,
do not use X30 for other purposes.
The second change we did is on line 3, where we now use the RET instruction,
which obviously stands for "return". Because X30 is used by default for the
return address, the RET instruction will simply copy X30 to PC, so that the next
instruction executed is the one at the return address.

Caution!
Note PC (program counter) and X30 / LR (link register) are two different registers. PC is automatically set by the machine or by instructions such as RET , while X30 can be modified by us manually, or automatically by using instructions such as BL . Remember: BL changes X30 and PC, while RET only changes PC.

2.6.2.2 Passing Arguments and Return Values

The second essential element of a procedure is passing arguments/return
values. In the code example above, we didn't pass any arguments, but in
practice it is quite common.
For return values, since usually one result is returned, we simply use X0
to store the return value.
Example 2.8
(Fall 21 Homework) Write an assembly program that translates the
following C code:
In the previous examples we didn’t create any frames for procedure calls,
because the registers are enough to perform what we want. More than
often, though, registers are never enough to store all the data needed for
procedures. Thus, the purpose of creating frames is to store procedure
arguments and local variables.
Frames are just an area on the stack in memory, and again, everything in
memory has an address. If, for a procedure call, we know where the
first and last bytes of its frame are, we are able to draw a boundary and
claim something like: the area between xyz and abc is the frame area for
this procedure call.
Back to Figure 2.11, we see that there's a stack pointer that points to the
top of the stack, and this is exactly the "boundary" we need. The stack pointer
stores the address of the current lowest byte of the stack, and resides in
register SP (which is not one of the 32 general purpose registers). Note
that this register can certainly be used for other purposes and the system
doesn't prevent you from doing so, but as a good practice, and for the sake of
your code (and your mental health), do not use SP for anything other than storing the
stack top address.
The other side of the “boundary” for procedure frames is frame pointer,
stored in X29 or referred as FP . Thus, the bytes between FP and SP
form the area we visualize as procedure frame.
Example 2.9
Let’s use this example to show how procedure frames work in de-
tail. We’ll start from a C code again:
10 int main() {
11 proc(20);
12 long int y = fun(2,3);
13 }
First thing: how many bytes do we need for the frame? Local vari-
11 int main() {
12 init_array(20, -1);
13 init_num(10,20);
14 exit(0);
15 }
In the example above, both proc() and fun() are called leaf proce-
dures, because they don't call any other procedures. The name becomes
more intuitive if you think of procedure calls as a tree structure, where a
callee is the caller's child. If a procedure calls other procedures, the caller
is referred to as a non-leaf procedure, and this is where we need to be
careful.
1 void proc() {
2 fun();
3 return;
4 }
5
6 void fun() {
7 return;
8 }
9
10 int main() {
11 proc();
12 }
This is very simple and intuitive, because it's what we translate line-
by-line from the C code. What could go wrong then? Let's start walking
through the instructions from _start . We also add each instruction's
address in memory in the comments.
The lesson learned here is, the machine doesn’t remember whatever value
you had in a register. If you overwrite X30 , you overwrite it, and execut-
ing RET doesn’t bring the old value back.
We only stored X30 in the frame of proc() , because it’s a non-leaf pro-
cedure whose return address is at risk of being lost. For fun() , since it’s
It is very common to have multiple nested procedure calls and even to call
existing libraries. Because assembly is so simple it doesn’t do any memory
management or computing resource management for us, we’d have to be
careful ourselves.
One very important problem is register conflicts. We can call as many pro-
cedures as we like, but there's only one set of registers. When the caller
and callee want to use the same registers, we have to resolve the con-
flict. From previous sections, we saw that X30 is a great example, where
both procedures need it for the return address. We resolved it by storing it
on the stack, and loading it back. This is a strategy we're going to use very
often.
To avoid conflicts, both the caller and the callee are responsible for saving
registers.
The following table shows us the work both the caller and the callee need
to do around the point of procedure call and return.
Of course we don’t have to save all the 19 caller-saved registers every time
we branch; we just need to save those that we want to use later.
Caller Callee
Before branching Save X0 ... X18 ×××
After entering callee ××× Save X19 ... X29 , X30
Before return ××× Restore X19 ... X29 , X30
After returning to the caller Restore X0 ... X18 ×××
2.6.2.6 Summary
The best way to illustrate the importance of following the calling conven-
tion is to examine recursive procedures. Let’s start with a very simple
task:
6 int main() {
7 long int x = facto(3);
8 exit(0);
9 }
First thing to notice is that facto() can be both a leaf and a non-leaf procedure:
when num == 1 it's a leaf procedure; otherwise it's non-leaf. To make
things consistent, we treat this procedure as non-leaf. To create a frame,
we need to calculate the frame size. There are no local variables in the procedure,
so the only thing we need to save on the stack is X30 , and the frame size is
thus eight bytes.
We write the following framework for facto() , including the base case,
which is the simplest. Due to the calling convention, we know that num is
stored in X0 , so we just need to compare it with the constant 1. If it's equal,
we just need to return 1. Because we need to store the return value in X0 and
X0 is already 1, we don't need to do anything but restore X30 and
return.
4 /* Base case */
5 CMP X0, 1
6 B.EQ _end
7
8 /* Recursive case */
9
Now let’s fill in the recursive case. We need to call facto() and pass
num-1 . Currently num is in X0 , so naturally we just do SUB X0,X0,1
and BL facto . The problem is, when we return back from the callee and
need to do multiplication, X0 has been changed by the callee. Therefore,
we need to save X0 somewhere. Can we use registers? No, because all
the calls to facto() share the same set of registers, and it’ll inevitably be
overwritten. Therefore, the answer is stack, because every call has its own
stack frame.
In fact, you can see because num is used both for multiplication and for
passing new argument to recursive calls, it can be viewed as a local vari-
able as well. Remember local variables are also needed to be stored on
stack.
5 /* Base case */
6 CMP X0, 1 // if (num == 1)
7 B.EQ _end // goto _end;
8
9 /* Recursive case */
10 STR X0, [SP, 8] // Store num to [SP+8]
11 SUB X0, X0, 1 // X0 = num - 1
12 BL facto // call facto(num - 1)
13 LDR X1, [SP, 8] // Restore num to X1
14 MUL X0, X0, X1 // X0 = X0 * X1
15 // X0 = facto(num-1) * num
16
2.7 Quick Check Solutions 59
2.6.4 Reference
Notice arr is the name of an array, so its value is the base address
of the array. Because the array is of long int type, each element
takes eight bytes. Therefore we have the following solution:
1 MOV X9, -9
2 MOV W9, 10
(Fall 21 quiz) Write the corresponding ARMv8 assembly code for the fol-
lowing C statement. Assume that the variable f is in register X20 , and
For the following assembly program, write the values of condition codes
N , Z , C , and V after the execution of every instruction. Assume at
beginning all four codes are cleared.
1 MOV X9, 9
2 MOV X10, 10
3 CMP X9, X10
4 ADD X9, X9, 10
5 SUB X10, X10, 3
6 CMP X9, X10
7 CMP X10, X9
There are several keys to this question. First, remember not all in-
structions set condition codes. In this question, only CMP will change
the condition codes. For instructions that do not change condition
codes, the values of N , Z , C , and V will stay the same. Second,
we need to be clear about binary operations to know which instruc-
tion might set which codes.
N Z C V
MOV X9,9 0 0 0 0
MOV X10,10 0 0 0 0
CMP X9,X10 1 0 0 0
ADD X9,X9,10 1 0 0 0
SUB X10,X10,3 1 0 0 0
CMP X9,X10 0 0 1 0
CMP X10,X9 1 0 0 0
1 MOV X0, 1
2 CMP X20, X0 // Compare y and 1
3 B.EQ Else
4 ADD X19, X20, 10 // x = y + 10
5 B End
6 Else: ASR X19, X20, 3 // y/8 == y >> 3
7 End:
Translate the following C code into assembly, with correct calling conven-
tion and procedure frame creation.
11 int main() {
12 init_array(20, -1);
13 init_num(10,20);
14 exit(0);
15 }
The key to solving this problem is calculating the correct frame size
for each procedure. init_array() needs to host an array of length
long integers (where length is a parameter), so the frame size is 8*length bytes. init_num()
has a local variable of type int , so the frame size should be 4 bytes.
Also notice that both parameters of init_num() are 4-byte integers,
so we need to operate on W -registers.
1 init_array:
2 LSL X2, X0, 3 // X2 = total bytes of arr
3 // = frame size
4 SUB SP, SP, X2 // Allocate frame
5
20 init_num:
21 SUB SP, SP, 4 // Allocate frame
22
Note: long int cat() itself is also a procedure – it needs to follow call-
ing convention as well.
Notice in the C code, the return values of dog() and bunny() are
also used as return value for cat() . After returning from dog()
and bunny() , the return value is already stored in X0 due to call-
ing convention, so we can just restore X30 , deallocate the frame,
and return from cat() .
Here we introduce four simple gates: and , or , xor (exclusive or), and
not . Each of them takes one or two inputs, and produces one output. 1
We can build much more complicated logic using only these gates. 2

1: Sometimes you will see gates with more than two inputs. These are simply
stacked gates. For example, if an and gate has three inputs a , b , and c , the
output is calculated by a&b first, and then &c , i.e., (a&b) & c .

2: Fun fact: Charles Sanders Peirce and Henry M. Sheffer both showed that all
gates can be constructed by just NOR gates alone (or NAND).

[Figure: the four gates, with inputs a (and b ) and outputs a & b , a | b , !a , and a ^ b .]
In the figure above, each wire is either zero or one. In other words, each
wire transfers one bit.
[Figure 3.1: a and b are the two inputs of the and gate, and a&b is the output; the horizontal axis is time, the vertical axis voltage. The and gate constantly and almost immediately reflects the change of the inputs, with a small amount of delay which is negligible.]
[Figure 3.2: On the left, signal a has been branched into two; on the right, a and b are separate signals without relations, indicated by a line hop.]
Line Notations
You probably already know this but just a small refresher — as in Figure
3.2 left, the small dot means the two wires come from the same wire, so
they have the same signals. On the right, the two wires have no relations,
as there's a "line hop".
Now that we have these fundamental gates, we can use them to build
more complicated logic. When we combine these gates and the signals
flow in one direction without loops, the result is called a combinational logic.
Because there are no loops in combinational logic, the signal changes are
almost immediate, and the outputs respond to input changes constantly.
3.1.2.1 Comparison
Let's create a very simple combinational logic, where we compare whether two bits
are the same. If they are the same, we output a signal of 1 ; otherwise a
0 . The combinational logic for bit equality is shown in Figure 3.3. Both
a and b are inputs, and using a truth table we can easily verify that when
a==b , the output is 1 ; otherwise it's 0 .
Now imagine we want to compare whether two double words (64 bits) are equal.
One way to do this is that for each second, we pass one bit of each number
to the bit-equality logic as inputs, and take notes on the output. After
one minute and four seconds of waiting, we get 64 bits of outputs. If all
of them are 1 's, we know the two numbers are the same; otherwise they're
different. A major flaw of this method is we have to wait for 64 seconds! So
why don't we compare all 64 bits at a time? Figure 3.4 shows this idea.

[Figure 3.3: The combinational logic for comparing two bits, (a&b)|(!a&!b) . When the two bits are equal, the output is 1; otherwise it's 0.]
In Figure 3.4, we use a[i] to denote the i -th bit of the double word a .
On the left, the two sets of parallel wires transfer a and b , one bit each
wire. These two are the buses. Moving towards the right, we see each bit
from each number is directed to a corresponding bit-equality logic, which
produces the output eq[i] . Then all the outputs from the 64 bit-equality
logics are pushed into a final and gate, where a single bit Eq (either 1
or 0 ) is generated.

[Figure 3.4: The parallel sets of wires on the left are called buses, where each wire transfers one bit of data, and all the wires together carry the double words to compare.]
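In C, the same structure collapses to bitwise operators: ~(a ^ b) evaluates (a&b)|(!a&!b) for all 64 bit positions at once, and the final and gate is the check that every result bit is 1 . A small sketch (the function names are ours, not from the notes):

```c
#include <assert.h>
#include <stdint.h>

/* One-bit equality, exactly as in Figure 3.3: (a&b) | (!a&!b). */
static int bit_eq(int a, int b) {
    return (a & b) | (!a & !b);
}

/* 64-bit equality as in Figure 3.4: bit_eq on every bit pair, then
 * AND all 64 outputs together into the single bit Eq. */
static int word_eq(uint64_t a, uint64_t b) {
    uint64_t eq = ~(a ^ b);      /* eq[i] = 1 iff a[i] == b[i]  */
    return eq == UINT64_MAX;     /* the final 64-input and gate */
}
```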
3.1.2.2 Selection
When a signal is 1 , we say the signal is asserted or set; otherwise deasserted or
cleared. By using a truth table, it's not difficult to notice that when s == 1 , the output is
b ; otherwise it's a .

[Figure 3.5: The combinational logic for selecting one of the inputs as the output. Input s acts as a "switch", or "control". When s == 1 (asserted), input b passes through the multiplexer; when s == 0 (deasserted), input a passes through.]

Notice in this multiplexer, technically we have three inputs: a , b , and
s , but we call a and b the input in the sense of "data input", while s
the "control signal input". In the future, we will usually use "input" to refer
to data input, but its actual meaning should be clear based on context.
We show the logic for selecting one of two double words in Figure
3.6. Similar to the equality logic, here for each bit of the input, we use
a multiplexer to select. One thing we need to notice is that in this example,
the control signal has the same value for all the bits of each double word,
because we want to select all bits of a double word, so they should have
the same control signal. The output, instead of being just one bit, contains
64 bits, and is equal to either a or b , depending on the control
signal s .

[Figure 3.6: A 64-bit multiplexer: 64 one-bit multiplexers, one per bit of a and b , all driven by the same control signal s .]
In Figure 3.6, we only need one bit of control signal. In some cases we
need more than one bit, and thus all the control signals together form a
control bus.
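A one-bit multiplexer computes (s & b) | (!s & a) , and the 64-bit version of Figure 3.6 simply fans the same s out to every bit. A small C sketch of both (the function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* One-bit multiplexer from Figure 3.5: s==1 selects b, s==0 selects a. */
static int bit_mux(int s, int a, int b) {
    return (s & b) | (!s & a);
}

/* 64-bit multiplexer from Figure 3.6: the same control bit s drives all
 * 64 one-bit multiplexers, so the whole double word a or b is selected. */
static uint64_t word_mux(int s, uint64_t a, uint64_t b) {
    uint64_t mask = s ? ~(uint64_t)0 : 0;  /* s fanned out to every bit */
    return (mask & b) | (~mask & a);
}
```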
N-Way multiplexer
If we need to select among more than two inputs, we'd need to create an 𝑁 -way
multiplexer. Notice that when there are 𝑁 inputs, one bit of control signal
is apparently not enough. For example, if 𝑁 = 4, we'd need two-bit con-
trol signals, so that 00 , 01 , 10 , and 11 each choose one of the inputs
as the output. In general, the number of control signal bits can be calculated
as
𝑆 = ⌈log2 𝑁 ⌉ (3.1)
where ⌈𝑥⌉ is the ceiling function that takes the nearest integer above 𝑥.
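Equation 3.1 can be computed without floating-point logarithms by searching for the smallest 𝑆 with 2^𝑆 ≥ 𝑁, as in this C sketch (the function name is ours):

```c
#include <assert.h>

/* S = ceil(log2(N)): smallest S whose 2^S control patterns cover N inputs. */
static int mux_select_bits(int n) {
    int s = 0;
    while ((1 << s) < n)   /* stop at the first s with 2^s >= n */
        s++;
    return s;
}
```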
3.1.2.3 Arithmetics
The logic for adder is also built on logic operations. Without thinking
about how the gates are organized, let’s start with a simple truth table
first. Assume the two inputs are x and y . When doing addition, we
need to add a carry-in flag, called cin . Since both x and y are one
bit data, if the addition result takes two bits, say z[1]z[0] , the leading
bit z[1] will be the carry-out flag cout , while the last bit z[0] is the
addition result, denoted as s .
x y cin | cout s
0 0 0 | 0 0
0 0 1 | 0 1
0 1 0 | 0 1
0 1 1 | 1 0
1 0 0 | 0 1
1 0 1 | 1 0
1 1 0 | 1 0
1 1 1 | 1 1

Once the input and output are determined using the truth table, the rest of
the work is simply to design a combination using all the possible gates, to
make sure it can produce the correct output given each input. As long as
it can produce correct values, any combination is valid, though we surely
favor simpler designs. Figure 3.7 is one of the possible designs, where we
use an xor gate.

Similar to how we built up a 64-bit multiplexer, when there are two double
words a and b , we align the bits of the two numbers, and send each pair
of bits into a bit adder. See Figure 3.8. What's different here is, from LSB
to MSB, the carry flag from bit i will also be used as the carry flag for bit
i+1 , just like calculating by hand. The first carry flag can simply be
zero, and the carry flag produced by the MSB, i.e., cout[63] , will be sent to
the CPSR's carry flag. (Still remember condition codes?)
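The truth table is exactly a one-bit full adder: s = x ^ y ^ cin , and cout is the majority of the three inputs. Chaining 64 of them as in Figure 3.8 gives a ripple-carry adder, sketched below in C (the names are ours; real ALUs use faster carry schemes):

```c
#include <assert.h>
#include <stdint.h>

/* One-bit full adder, matching the truth table above. */
static void bit_adder(int x, int y, int cin, int *cout, int *s) {
    *s    = x ^ y ^ cin;                       /* sum bit              */
    *cout = (x & y) | (x & cin) | (y & cin);   /* majority of inputs   */
}

/* 64-bit ripple-carry adder (Figure 3.8): cout of bit i feeds cin of
 * bit i+1; the first carry in is zero, the last carry out is returned. */
static uint64_t ripple_add(uint64_t a, uint64_t b, int *cout63) {
    uint64_t sum = 0;
    int carry = 0;
    for (int i = 0; i < 64; i++) {
        int s;
        bit_adder((int)((a >> i) & 1), (int)((b >> i) & 1), carry, &carry, &s);
        sum |= (uint64_t)s << i;
    }
    *cout63 = carry;
    return sum;
}
```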
Figure 3.9 shows us a very simple sequential logic, called a bistable element,
where we compare it with a similar combinational logic. On the left, the
logic is acyclic with two outputs, p and q , and an input signal in . As
shown in the timing diagram on the right, when we give in a high-voltage
pulse, both p and q respond to the input with a small amount of delay,
but soon change back. Therefore, the input in is temporary, and so are
the outputs. What if we want the outputs to stay where they are? This means
we want to store the outputs.
[Figure 3.7: Bit adder, where input y is marked in red, x in blue, and the carry-in signal cin in black.]
[Figure 3.8: A 64-bit adder built from 64 bit adders. Each bit adder takes a[i] , b[i] , and cin[i] , and produces s[i] and cout[i] ; cout[i] feeds cin[i+1] , and the outputs form the double word s .]
On the right of Figure 3.9, we added a branch from q back to the input
of p , making a loop. This is not combinational anymore; it's sequential
instead. In the timing diagram, we see that even if we give a small and temporary high
voltage to in , both q and p keep the change after the
signal from in disappears. Apparently, this is because of the
loop, where the output of q serves as the input of p , so it doesn't rely on
signal in anymore.

For a real-life example, think about a faucet again. The combinational logic
is like the automatic faucet that can sense your hands. If your hands are
close, there's water; if you take your hands away, it stops. The faucet
constantly responds to our "input" — our hands. What if we want the water
to keep flowing? We just change to a regular faucet, where we only need to
turn on the water manually once, and it'll keep flowing until we manually
turn it off. In this scenario, the regular faucet only responds to our "input"
once, and will keep the change.
This bistable element is very simple but shows an important idea of se-
quential logic. One flaw of the design in Figure 3.9, however, is: how can we
change the outputs p and q back?
[Figure 3.9: The left shows a combinational logic, whereas the right shows a sequential logic. In the combinational logic, both outputs p and q respond to the change of input in almost instantly, and thus we are not able to "store" the output. In the sequential logic, however, one temporary change in the input in will trigger a permanent change in the outputs, making them stay, and thus be "stored".]
3.1.3.1 SR Latch
SR latches have four states, three of which are shown as the three timestamps
in Figure 3.10: (1) setting, (2) resetting, and (3) latched or stored.
The state where we give high voltage to both S and R is called metastable,
which usually causes errors.

We can write a truth table for the SR latch as well, but it's a little bit different,
where we let q denote the actual value (either 1 or 0) of the output Q+ .
See the table below:

S R Q+ Q- State
0 0 q !q Latched (Stored)
0 1 0 1 Resetting
1 0 1 0 Setting
1 1 – – Metastable/Error

[Figure 3.10: A simple SR latch and an example of a timing diagram. Time (1) is the state of setting, (2) resetting, and (3) latched. A temporary change in either R or S will make the change in the output stay, and thus be stored.]

As we see, unlike the truth tables we are familiar with, where all the outputs
have determined values of either 0 or 1, the latched state has q and
!q in it. This is because the output depends on its actual value before
switching to this state, instead of on the inputs S and R .
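Assuming the cross-coupled NOR implementation shown in Figure 3.10, the latch can be simulated in C by iterating the two gates a few times so the outputs settle, mimicking gate delay (the struct and function names are ours):

```c
#include <assert.h>

/* A NOR-based SR latch (Figure 3.10). */
typedef struct { int q, qn; } sr_latch;

static void sr_step(sr_latch *l, int s, int r) {
    for (int i = 0; i < 4; i++) {    /* a few passes to settle */
        int q  = !(r | l->qn);       /* Q+ = NOR(R, Q-) */
        int qn = !(s | l->q);        /* Q- = NOR(S, Q+) */
        l->q  = q;
        l->qn = qn;
    }
}
```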
3.1.3.2 D Latch
With an SR latch, we can now change the output arbitrarily and make the change stay. Two flaws of the SR latch,
however:
1. S and R are like switches; where do we send and store actual data?
2. S and R can be used at any time, but when there are many SR
latches, how can we make sure they are synchronized, or on the
same page?
The clock signal C controls when the input data D can pass through and
cause changes in the output Q+ . In Figure 3.11 at timestamp (1), C gives
a high voltage, which brings the D latch to a state called latching. At this
moment, Q+ will change its value according to the input data D . As long
as C stays at high voltage, Q+ will always respond to the change of D .
Observe that in the figure Q+ and D have the same waves between (1)
and (5).
We can similarly write a truth table, as the one on the side, where d can be
either 0 or 1, and the value of q depends on the value of d at the moment
when C drops.

C D Q+ Q- State
0 d q !q Storing
1 d d !d Latching

3.1.3.3 Flip-Flops
[Figure 3.11: A D latch, with data input D , clock input C , and outputs Q+ and Q- .]

[Figure 3.12: An edge-triggered latch, or a flip-flop, where when C rises, the trigger T will temporarily rise to high voltage, allowing Q+ to store the value of input data D at that moment. Afterwards, T drops back down, and no matter how input C changes, Q+ stays stable.]
As long as C stays high, whenever D changes, so does the output Q+ . So how can we make sure that during the latching state the
output Q+ stays stable?

We use Figure 3.12 to show the logic of a flip-flop and an example of a timing
diagram. The major change is we added a trigger T , which is essentially
just an and gate. Both inputs of the trigger come from the enable
signal C . If you look closely, after passing through three not gates, one
input is !C , while the other is C , so the trigger will be C & !C = 0
eventually. Then why do we want to create a trigger this way?

From the beginning, we mentioned that logic gates constantly respond
to input changes, but with a small amount of delay, meaning the more gates a
signal needs to pass, the longer the delay. There you go! When
C rises, one of its two branches will quickly reach the trigger, at which
moment the trigger will rise to high voltage. The other branch of C will ar-
rive shortly after but with a delay, because it has so many gates to pass through.
It will, however, eventually arrive, and thus bring the trigger back to low
voltage. This gap between the arrivals of the two inputs gives us the chance to change
Q+ , store its value, and make it stay stable afterwards.
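The trigger can be modeled in C by remembering a one-step-old copy of C to stand in for the three-gate delay; this is a deliberate simplification, and the names are ours:

```c
#include <assert.h>

/* The trigger of Figure 3.12: T = C & !C_delayed. On a rising edge the
 * fresh branch of C is already 1 while the delayed, inverted branch is
 * still 1, so T pulses high for one step and is 0 otherwise. */
typedef struct { int prev_c; } edge_trigger;

static int trigger(edge_trigger *e, int c) {
    int t = c & !e->prev_c;   /* high only on the 0 -> 1 edge */
    e->prev_c = c;
    return t;
}
```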
3.1.4 Registers
Finally it's time to talk about the real thing! In microprocessors, the hard-
ware we use to store data is registers, our old friends from assembly. Now
that we know one flip-flop can store one bit of data, storing a double word
is straightforward. In fact, as shown in Figure 3.13, a register is a
group of flip-flops where each of them stores one bit. Notice that all the
flip-flops in the register are controlled by the same clock signal. After all,
you don't want to store several bits first and wait until the next rising edge of
the clock to store the other bits.

[Figure 3.13: A register implemented using edge-triggered latches, one per bit of the double word d . One latch can store one bit of data, and all 64 bits will be updated together when the clock rises.]
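Behaviorally, the whole register of Figure 3.13 reduces to "latch d on a rising clock edge, return the stored value otherwise", as in this C sketch (the names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit register: d is latched only on a rising clock edge;
 * the stored value is readable at any time. */
typedef struct { uint64_t r; int prev_clk; } reg64;

static uint64_t reg_tick(reg64 *reg, int clk, uint64_t d) {
    if (clk && !reg->prev_clk)   /* rising edge: all 64 flip-flops latch d */
        reg->r = d;
    reg->prev_clk = clk;
    return reg->r;
}
```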
Each register file has two read ports and one write port. Here based on
Figure 3.14, we declare the signals with “variable-style” names, so we can
refer back to them later.
On the write port, we have three input signals: RegDataW for the actual
data we want to store into the register; WriteReg for the destination reg-
ister number, i.e., where do we want to store RegDataW ; and RegWrite ,
used for indicating if we want to write to a register or not.
Input WriteReg , a 5-bit number, is sent to a decoder, whose output is one-
hot. For example, if WriteReg = 00111b , the output of the decoder will
be all zeros except on the wire going to register X7 , because 00111b is
7 in decimal. Because of the and gates, only X7 will be written once
RegWrite changes to 1 , meaning we do want to write to the register.
[Figure 3.14: A little more detailed register file. On the write port, a decoder turns WriteReg into an enable E for one of X0 – X31 , gated by RegWrite and the clock; on the read ports, ReadReg1 and ReadReg2 drive multiplexers that put the selected registers' data on RegData1 and RegData2 .]
Each register has three inputs. C and D correspond to the inputs in Fig-
ure 3.13, while E (enable) controls if any data can be written to the regis-
ter even when the clock rises. Therefore, writing registers is a sequential
logic.
On the right side of Figure 3.14, we have two read ports. The two regis-
ter numbers, ReadReg1 and ReadReg2 , are used as "selectors" to select
data from the corresponding registers. The data are denoted as RegData1
and RegData2 , respectively. Reading registers is more like a combi-
national logic, meaning as soon as ReadReg1 and ReadReg2 receive
changes, the corresponding data will be read almost instantly. This is dif-
ferent from writing to a register, since the read ports have no control wire
that connects to the clock.
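The two combinational read ports and the gated write port can be sketched in C as follows. Treating register 31 as XZR (reads as zero, writes discarded) is our assumption, carried over from the discussion of XZR (register X31 ) in the encoding section; the names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* A sketch of the register file ports in Figure 3.14. */
typedef struct { uint64_t x[32]; } regfile;

/* Write port: takes effect only when RegWrite is asserted. */
static void rf_write(regfile *rf, int reg_write, int write_reg, uint64_t data) {
    if (reg_write && write_reg != 31)   /* X31 as XZR: writes discarded */
        rf->x[write_reg] = data;
}

/* Read ports: combinational, available at any time. */
static uint64_t rf_read(const regfile *rf, int read_reg) {
    return read_reg == 31 ? 0 : rf->x[read_reg];
}
```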
3.1.5 Memory
The address bus is used to specify a memory location. During reading, the data at
the address on the address bus will be put on the data bus and sent out. Similarly,
during writing, the data on the data bus will be sent into the memory location
specified by the address bus. Either reading or writing is controlled by the
control bus. The width of the control bus — the number of bits — depends
on how many control signals the design needs.

[Figure 3.15: Control bus and address bus are unidirectional, while the data bus is bidirectional. The address bus is 64-bit.]

3.1.6 Summary

A microprocessor needs components for computing and temporary storage. The component used for computing
is the arithmetic logic unit (ALU), implemented using combinational circuits.
The ALU can perform basic arithmetic, and we showed the operation of ad-
dition in Figure 3.8. Registers are used for temporary data storage, imple-
mented by sequential circuits (Figure 3.13). Writing data to a register is
controlled by a clock, and only happens when the clock signal has a rising
edge. Reading data from registers, however, is similar to combinational
logic, and can happen at any time.
Before we dive into the real construction of our CPU, we need to consider
how we can first let it recognize our program, or instructions. Surely we
understand that ADD X0,X0,X1 adds X1 to X0 , but those digital circuits
only recognize high and low voltage, or at best 0s and 1s. The first
step for us, therefore, is to translate our text code (assembly) into digital
signals (binary 0s and 1s) so that they can be pushed into the circuits to
perform operations. Once the circuits finish responding to the signals,
we interpret the resulting 0s and 1s to know what happened.
Since we are using ARM assembly, we will look at the encodings designed
by ARM, so there is no need to design our own. In the following, we will only
study the encodings of the most frequently used instructions. They are sufficient
to serve the purpose of the discussion of CPU design. We also removed
and modified some of the fields in the encodings to make them easier and
more straightforward. To see the complete sets of encodings, you can visit
the ARM documentation. 5

5: See https://developer.arm.com/documentation/ddi0602/2022-03/Base-Instructions.
Each ARM assembly instruction takes four bytes (32 bits), and can be roughly
separated into two fields. The leading bits are called the opcode , which is
unique to each mnemonic. The rest of the bits are used for operands,
such as encoding register read/write numbers, immediates, addresses,
etc.
Let’s start with the ones we’re most familiar with—arithmetic and logic
instructions.
3.2 From Assembly to Machine Code 79
Bits:  31–21 opcode | 20–16 Rm | 15–10 111000 | 9–5 Rn | 4–0 Rd

Instruction        opcode (bits 31–21)
ADD  Rd,Rn,Rm      1 0 0 0 1 0 1 1 0 0 1
ADDS Rd,Rn,Rm      1 0 1 0 1 0 1 1 0 0 1
SUB  Rd,Rn,Rm      1 1 0 0 1 0 1 1 0 0 1
SUBS Rd,Rn,Rm      1 1 1 0 1 0 1 1 0 0 1
AND  Rd,Rn,Rm      1 0 0 0 1 0 1 0 0 0 0
ANDS Rd,Rn,Rm      1 1 1 0 1 0 1 0 0 0 0
ORR  Rd,Rn,Rm      1 0 1 0 1 0 1 0 0 0 0

Rm : 2nd source register; Rn : 1st source register; Rd : destination register.

Figure 3.16: Encodings of arithmetic and logic instructions with register operands.
The first case is all the sources are registers, including ADD[S] , SUB[S] ,
AND[S] , and ORR . Looking at these instructions, we notice they all take
three register operands: destination Rd , source 1 Rn , and source 2 Rm .
Thus, we show their encodings in Figure 3.16.
It is straightforward that all the register fields take five bits, since we have
32 general purpose registers, which can be encoded in five bits. Note that
here Rm refers to the register number or ID, instead of the data inside of
them. Bits 15:10 are the same ( 111000 ) for all these instructions. 6

6: In fact, bits 15:10 contain a 3-bit option and a 3-bit immediate. To keep
things simple and focused, we don't really use or care about these fields, so we can
simply let option in all such instructions be 111 , while the immediate is 000 .
As to why we use these two values, if you're interested, feel free to ask me, or
visit the ARM documentation.

It's not difficult to find some patterns in the opcode . For example, it is
clear that bit 30 indicates which operation we want: 0 for addition,
1 for subtraction. Also, bit 29 is 1 if we want to set condition
codes, or 0 otherwise.

˛ Quick Check 3.1

Remember CMP Rn,Rm is to compare two registers and set condi-
tion codes. What it actually does is perform a subtraction between
Rn and Rm , and send the result to XZR (register X31 ). In other
words, it's just an alias for the instruction SUBS XZR,Rn,Rm . Please
translate CMP X1,X0 into machine code based on the discussion
above.
B See solution on page 114.
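Packing the fields of Figure 3.16 into a 32-bit word is just shifts and ORs. A C sketch (the function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a register-form instruction per Figure 3.16:
 * opcode(31:21) | Rm(20:16) | 111000(15:10) | Rn(9:5) | Rd(4:0). */
static uint32_t encode_reg(uint32_t opcode11, unsigned rm, unsigned rn, unsigned rd) {
    return (opcode11 << 21) | (rm << 16)
         | (0x38u << 10)   /* bits 15:10 = 111000 */
         | (rn << 5) | rd;
}
```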
The general structure of instructions with immediates is still the same — a
few bits of opcode followed by operands, as in Figure 3.17. Notice Rd
is the destination register, Rn the source register, and imm11 an 11-bit
immediate number.
Bits:  31–21 opcode | 20–10 imm11 | 9–5 Rn | 4–0 Rd

Instruction            opcode (bits 31–21)
ADD  Rd,Rn,imm11       1 0 0 1 0 0 0 1 0 0 0
ADDS Rd,Rn,imm11       1 0 1 1 0 0 0 1 0 0 0
SUB  Rd,Rn,imm11       1 1 0 1 0 0 0 1 0 0 0
SUBS Rd,Rn,imm11       1 1 1 1 0 0 0 1 0 0 0
ORR  Rd,Rn,imm11       1 0 1 1 0 0 1 0 0 0 0

imm11 : 11-bit immediate; Rn : 1st source register; Rd : destination register.

Figure 3.17: Encodings of arithmetic and logic instructions with register operands and immediates.
Bits:  31–21 opcode | 20–10 imm11 | 9–5 Rn | 4–0 Rt

Instruction            opcode (bits 31–21)
LDR Rt,[Rn,imm11]      1 1 1 1 1 0 0 0 0 1 0
STR Rt,[Rn,imm11]      1 1 1 1 1 0 0 0 0 0 0

imm11 : 11-bit immediate; Rn : base register; Rt : target register.

Figure 3.18: Encodings of load and store instructions.
Here we mainly focus on LDR and STR with only 64-bit registers sup-
ported. In these two instructions, as shown in Figure 3.18, the 11-bit im-
mediate number indicates offset, while Rn is the base register. Since Rt
could be source or destination register, we call it “target register”.
Bits:  31–21 opcode | 20–10 imm11 (PC-relative offset) | 9–5 00000 | 4–0 Rt

Instruction         opcode (bits 31–21)      Rt
B    label          0 0 0 1 0 1 0 0 0 0 0    00000
BL   label          1 0 0 1 0 1 0 0 0 0 0    11110
CBZ  Rt,label       1 0 1 1 0 1 0 0 0 0 0    Rt
CBNZ Rt,label       0 1 0 1 0 1 0 0 0 0 0    Rt
RET                 1 1 0 1 0 1 1 0 0 1 0    11110
ą Example 3.1
Assume we have the following segment of assembly code with
each instruction’s address in the comments:
1 ...
2 ...
3 loop: SUB X5, X5, 1 // 0x1000
4 LDRB W1, [SP, X5] // 0x1004
5 CBZ X5, exit // 0x1008
6 B loop // 0x100C
7 exit: MOV X0, 0 // 0x1010
8 MOV X8, 93 // 0x1014
9 SVC 0 // 0x1018
CBZ X5, exit:
10110100000 00000001000 00000 00101
(opcode)    (imm11)           (Rt)

B loop:
00010100000 11111110100 00000 00000
(opcode)    (imm11)           (Rt)
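The imm11 fields above are two's-complement byte offsets, so they can be produced by subtracting the addresses and keeping the low 11 bits. A C sketch (the function name is ours):

```c
#include <assert.h>
#include <stdint.h>

/* Encode a PC-relative byte offset into the 11-bit imm11 field
 * as a two's-complement value, following Example 3.1. */
static uint32_t imm11_offset(long target, long pc) {
    return (uint32_t)(target - pc) & 0x7FF;   /* keep the low 11 bits */
}
```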
3.3.1 Preliminaries
We use R[x] to denote the 64-bit data inside register x , while M[a]
denotes the data at address a inside memory. For a multibit datum D , D[a:b]
indicates bits a down to b (inclusive). For example, R[0][5:2] is the 2nd to
the 5th bits inside register X0 , while M[R[1]][31:0] is the data at the
address indicated by the lower word of the data inside X1 .
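The D[a:b] notation can be mirrored by a small C helper (the name is ours) that shifts bits b .. a to the low end and masks:

```c
#include <assert.h>
#include <stdint.h>

/* D[a:b]: bits a down to b of d, inclusive, shifted to the low end. */
static uint64_t bits(uint64_t d, int a, int b) {
    int width = a - b + 1;
    uint64_t mask = (width == 64) ? ~(uint64_t)0       /* avoid 1 << 64 */
                                  : (1ULL << width) - 1;
    return (d >> b) & mask;
}
```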
3.3.2 Datapath
Figure 3.20 shows our first attempt at building the datapath, where we’ll
run one assembly instruction at a time. We call this model single-cycle
implementation. This is absolutely over-simplified and inefficient, but it
shows us the most important ideas of designing a processor.
In Figure 3.20, we separate memory into instruction and data memory,
only for illustration purposes. It should be clear that there’s only one
memory, where both instructions and data are stored together but in dif-
ferent regions. Instructions are in .text segments, and regular data are
in .data , .bss , or stack area of the memory.
[Figure 3.20: The single-cycle datapath. The PC and an adder compute nextPC = PC + 4 ; the instruction memory returns the instruction I ; the register file reads ReadReg1 / ReadReg2 and writes WriteReg / RegDataW ; the ALU (with its ALU control) takes RegData1 and either RegData2 or an immediate, selected by ALUsrc ; the data memory takes the Address , WriteDataM , and ReadDataM signals. Control signals include Reg2Loc , ALUop , RegWrite , PBr , CBr , UBr , MemRead , and MemWrite .]
Clocking Methodology

Comparing Figure 3.20 with the idea above, we see similarities. Follow-
ing the arrows, we see that the flow of the data starts from PC, and ends
at the register file for writing, after passing through data memory. This route
takes one clock cycle.

[Figure 3.21: A clock controls the sequential logic, so each clock cycle makes one pass of the data.]
ą Example 3.2
Describe what happens in the five stages for instruction
ADD X0,X1,X2 .
IF The binary machine code of ADD X0,X1,X2 is fetched
from instruction memory;
ID The opcode field is used to generate control signals, while
registers X1 and X2 are read from the register file;
EX The ALU receives the two operands, read from X1 and X2 ,
and produces R[1] + R[2] ;
ME Since this instruction does not access memory, this stage
is passed (but not skipped!);
WB The result produced by the ALU, R[1] + R[2] , is written
back to register X0 .
Remarks. One thing we need to notice is that, for the ADD instruction, the ME
stage was passed, but not skipped. This is true for all instructions
that do not need memory access. We defined five stages for all in-
structions for the purpose of consistency, but they don't necessarily
do anything in some stages.
This is the first stage for an instruction, see Figure 3.22. PC , the program
counter, sends the address of the instruction to be executed, and the actual
[Figure 3.22: Stage 1: instruction fetching. PC sends an address to the instruction memory, whose Data port returns the instruction I ; an adder computes PC + 4 as nextPC .]
[Figure 3.23: Stage 2: decoding. Fields in an instruction are sent to different parts of the register file: n = I[9:5] to ReadReg1 ; m = I[20:16] and t = I[4:0] to a multiplexer, driven by Reg2Loc , in front of ReadReg2 (so RegData2 is R[m] if Reg2Loc , else R[t] ); and d = I[4:0] to WriteReg . The opcode = I[31:21] is sent to a control unit that generates the control signals PBr , CBr , UBr , LinkReg , MemToReg , MemRead , MemWrite , ALUsrc , ALUop , Reg2Loc , and RegWrite .]
instruction will be read from memory and sent out from Data on the data
bus. We denote the retrieved instruction as I .
At this time, PC will not be updated yet, because the next instruction
could be a branch somewhere. The real address of next instruction cannot
be decided until EX stage, so we’ll talk about it then. For now, what’s
happening in IF stage can be described as:
I = M[PC]
nextPC = PC + 4
The decoding stage is to take the instruction I we just read from instruc-
tion memory and send different fields to different data wires, as well as
reading register data. We use Figure 3.23 to show some details.
The highest 11 bits are used as opcode . This will be sent to the control
unit, a combinational logic, that converts opcode into different control
signals used by all parts of the datapath. We will discuss them in the stages
when they are actually used.
The register file has two read ports. The first read port takes I[9:5] ,
which encodes the register number, denoted as n . This is the same for all
instructions. The data read from this register is denoted as R[n] .
The second read port needs some work, though. For some instructions
such as ADD Rd,Rn,Rm , the second read register is Rm , encoded in I[20:16] .
For other instructions, however, such as LDR Rt,[Rn,imm] , the second read
register Rt is encoded in I[4:0] . Thus, before sending the register num-
ber to ReadReg2 , we need a control signal Reg2Loc that selects one of
them through a multiplexer. When it’s zero, it selects I[4:0] ; otherwise
I[20:16] .
# Read registers
n, m, t, d = I[9:5], I[20:16], I[4:0], I[4:0]
ReadReg1 = n
ReadReg2 = m if sigCntl.Reg2Loc else t
RegData1 = R[ReadReg1]
RegData2 = R[ReadReg2]
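As a concrete sketch (not from the notes), the field extraction above can be written as a small Python helper, treating the instruction I as a 32-bit integer, where I[hi:lo] means bits hi down to lo inclusive:

```python
# A sketch of the ID-stage field extraction using shifts and masks.
def bits(I, hi, lo):
    return (I >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode(I):
    opcode = bits(I, 31, 21)   # highest 11 bits
    n      = bits(I, 9, 5)     # first read register
    m      = bits(I, 20, 16)   # second read register (R-format)
    t = d  = bits(I, 4, 0)     # Rt / destination register
    imm    = bits(I, 20, 10)   # 11-bit immediate
    return opcode, n, m, t, d, imm
```

For example, decoding an instruction word whose opcode field is 11010001000, with n = 18, d = 10, and imm = 8, recovers exactly those field values.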
As shown in Figure 3.24, there are two sub-jobs done in this stage: computing
and PC updating. We will look at computing first, since that happens
before PC updating.
Computing
There are multiple possible calculations, however, such as ADD , ORR , etc.
How does the ALU know which one to perform? We would need a control signal
called action , which is generated based on both opcode and ALUop
through ALU Control. Based on different values of action , the ALU will
perform different calculations and produce the corresponding result. We as-
sume only the Z flag in the condition codes will be generated, for simplicity.
We design signal action with four bits, and the eight possible
operations for the ALU are listed in Table 3.1.⁸
⁸: As to why we use 0000 to indicate AND operations and so on, the reason is
usually practical, for easier circuit design.
Thus, for this part, we can describe it in the following code:
inputA = RegData1
inputB = imm if sigCntl.ALUsrc else RegData2
action = ALUControl(sigCntl.ALUop, opcode)
ALUout,Z = ALU(action, inputA, inputB)
Table 3.1: ALU operations for each instruction in the instruction set architecture.

Instruction  ALUop  action  ALU operation
AND          10     0000    ALUout = inputA & inputB
ORR          10     0001    ALUout = inputA | inputB
SUB          10     0011    ALUout = inputA - inputB
ADD          10     0010    ALUout = inputA + inputB
LDR          00     0010    ALUout = inputA + inputB
STR          00     0010    ALUout = inputA + inputB
B            01     0111    pass inputB, i.e., ALUout = inputB
BL           01     0111    pass inputB, i.e., ALUout = inputB
CBZ          01     0111    pass inputB, i.e., ALUout = inputB
RET          01     0111    pass inputB, i.e., ALUout = inputB
ANDS         11     1000    ALUout = inputA & inputB (set Z)
ADDS         11     1010    ALUout = inputA + inputB (set Z)
SUBS         11     1011    ALUout = inputA - inputB (set Z)
PC Updating
In general, we have four possible options for updating PC : (1) the instruc-
tion four bytes after ( nextPC calculated in IF stage); (2) the target instruction
of an unconditional branch; (3) the target of a conditional branch; and (4) the target
instruction pointed to by the link register X30 . So at the rightmost of Figure
3.24, we see there’s a multiplexer whose output is sent to PC .
1. nextPC , meaning there’s no branch and we’ll move to the very next
instruction;
2. BrPC , the address of target instruction calculated by adding PC
and immediate together, to perform either conditional or uncondi-
tional branch. Recall that in branch instructions such as CBZ X0,label
and B label , label is encoded as a PC-relative offset in the bi-
nary machine code in the immediate field. Thus, the target address
BrPC is calculated as: BrPC = PC + I[20:10] ;
3. ALUout , which is used for procedure return. In instruction RET ,
the return address is read from register X30 as RegData2 , and
passed through ALU as ALUout .
For conditional branches, notice that a signal itself is not enough: to branch
or not depends on Z flag as well. Thus, we can express the final branch
signal Br as: Br = UBr | (CBr & Z) .
In summary:
BrPC = PC + imm
Br = sigCntl.UBr | (sigCntl.CBr & Z)
PC = ALUout if sigCntl.PBr \
     else (BrPC if Br else nextPC)
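The PC-update selection above can be sketched as a small Python function. This is a model of the multiplexer logic described in the text, not circuit-level code:

```python
# Sketch of the PC-update selection at the end of EX. PBr selects the
# procedure-return target (which arrives through the ALU); otherwise a
# taken branch selects BrPC, and the default is the fall-through nextPC.
def next_pc(PBr, UBr, CBr, Z, ALUout, BrPC, nextPC):
    Br = UBr | (CBr & Z)      # final branch decision
    if PBr:
        return ALUout         # RET: target read from X30, passed via ALU
    return BrPC if Br else nextPC
```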
Not all instructions access memory, but they still pass through this stage.
Figure 3.25 shows an illustration of this stage. Note that we mentioned
in Section 3.1.5 and Figure 3.15 that there’s only one set of data bus, and
it can transfer data to and from memory bidirectionally. For clarity, in our
datapath, we separate the data bus into WriteDataM and
ReadDataM , two unidirectional buses.
For those instructions that do access memory, such as LDR and STR , we use MemWrite
and MemRead to control whether it’s reading or writing memory. The memory
operation is summarized in the following table:

MemWrite  MemRead  Memory operation
1         0        M[Address] = WriteDataM
0         1        ReadDataM = M[Address]
0         0        no memory access
1         1        error
Notice the input Address comes from ALUout , the output of ALU. This
is because instructions such as LDR Rt,[Rn,imm] need to use ALU to
calculate the address, i.e., ALUout = Rn + imm . However, not all ALU
outputs represent a memory address, so we branch ALUout to bypass the
memory as well.
Input WriteDataM comes from register read RegData2 . For example,
in STR X0,[X1,0] , RegData2 is R[0] , which will be written to the
memory.
As we see on the right of Figure 3.25, at the end of the MEM stage we have
two outputs: one from ALU ALUout , and another from data memory
ReadDataM . Since all we do at this stage is to get access to memory, we
will discuss how we handle the two outputs in the next section.
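The memory-stage behavior can be sketched as a small Python function, with memory modeled as a dict from addresses to values; the names follow the description language:

```python
# A sketch of the ME stage: ALUout serves as the address, and
# MemRead/MemWrite select the operation.
def mem_stage(MemRead, MemWrite, Address, WriteDataM, M):
    ReadDataM = None
    if MemWrite and not MemRead:
        M[Address] = WriteDataM        # STR: write register data to memory
    elif MemRead and not MemWrite:
        ReadDataM = M[Address]         # LDR: read memory at the address
    elif MemRead and MemWrite:
        raise ValueError("MemRead and MemWrite cannot both be set")
    return ReadDataM                   # None when memory is bypassed
```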
Now that the last stage has finished, and PC has been updated in stage 3,
we are ready to start the next instruction and walk through the five stages
again.
Figure 3.27: A summary of five stages each instruction goes through in the datapath, with description language on the side.

IF:
I = M[PC]
nextPC = PC + 4

ID:
opcode = I[31:21]
sigCntl = Control(opcode)
n, m, t, d = I[9:5], I[20:16], I[4:0], I[4:0]
imm = I[20:10]
ReadReg1 = n
ReadReg2 = m if Reg2Loc else t
RegData1 = R[ReadReg1]
RegData2 = R[ReadReg2]
WriteReg = d

EX:
inputA = RegData1
inputB = imm if ALUsrc else RegData2
action = ALUControl(sigCntl.ALUop, opcode)
ALUout,Z = ALU(action, inputA, inputB)
Br = UBr | (CBr & Z)
BrPC = PC + imm
if PBr: PC = ALUout
else: PC = BrPC if Br else nextPC

ME:
Address = ALUout
WriteDataM = RegData2
if MemWrite and not MemRead: M[Address] = WriteDataM
elif MemRead and not MemWrite: ReadDataM = M[Address]
elif not MemRead and not MemWrite: pass
else: error

WB:
if LinkReg: RegDataW = nextPC
else: RegDataW = ReadDataM if MemToReg else ALUout
R[WriteReg] = RegDataW if RegWrite else R[WriteReg]
3.3.4 Summary
We use Figure 3.27 to summarize the five stages, where the datapath dia-
gram and description language are side by side.
In Figure 3.28, the input has three bits, s , b , and a , and the component sur-
rounded by dotted lines is the combinational logic, whose output is D .
Recall from Section 3.1.3.3 that the register is also controlled by a clock signal
C : the data will be written into the register only when there’s a rising edge
of the clock signal.
Assume it takes 300ps to get D from sba , which is the delay of the combi-
national logic, and it takes 20ps to push the data into the register. In Figure
3.29, we show a sequence of three inputs: 110 , 010 , 110 with an un-
pipelined version. A clock cycle is the period between two consecutive
rising edges. In order to process one instruction at a time in every clock
cycle, we have to make sure we leave sufficient time for the delay caused
by the combinational logic, so each clock cycle takes 300+20=320ps. To get
output from the three inputs, we need to wait 320ps×3=960ps.
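The timing arithmetic above can be sketched in a few lines of Python, using the delays from the text (300 ps of combinational logic plus 20 ps to latch a register). The pipelined version below assumes the logic splits into equal stages, which is an idealization:

```python
# Sketch: total time to process k inputs, unpipelined vs. a 3-stage pipeline.
def unpipelined_time(k, logic=300, reg=20):
    return k * (logic + reg)              # 320 ps per input, one at a time

def pipelined_time(k, stages=3, logic=300, reg=20):
    cycle = logic // stages + reg         # 120 ps, assuming equal stage delays
    return (stages + k - 1) * cycle       # fill the pipeline, then 1 result/cycle
```

With the three inputs of Figure 3.29, the unpipelined version takes 960 ps, while the three-stage pipeline finishes in 600 ps, even though each single input now spends 360 ps end to end.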
Figure 3.28: An example of combinational logic, where the input is three-bit
and the output D is one bit. When the clock rises, the output D is written to the
register.
3.4 A Pipelined Datapath 93
Figure 3.29: A sequence of three inputs, 110 , 010 , 110 , with an unpipelined
version. Each clock cycle is 320ps.
idle. Since the second input will use this not gate anyway, why don’t
we send input 2 there already? To clearly show this idea, as in Figure
3.30, we separate the combinational circuits into three stages, where each
stage’s input is the previous one’s output. Once input 1 has passed through stage
1 and is currently in stage 2, we can send input 2 to stage 1, and so on.
Now you can see at the peak of the execution, we will have three input
signals running through the combinational circuits at the same time. This
is usually called a three-way pipeline.
The idea is simple and straightforward, but there are problems. When
input 1 moves to stage 2, we were supposed to send input 2 to stage 1.
However, notice only signal s will go through stage 1; signals a and
b will hit the and gate right away, but currently it’s used by input
1. Because combinational logic responds to any input change
immediately, now in fact input 2 is in both stages 1 and 2, while the data
from input 1 has been overwritten and lost.
Figure 3.31: We added two registers between the three stages as “barriers”, to
make sure the signals in each stage will not be interrupted or overwritten.
3.4.1.2 Limitations
Even though pipeline is a great idea, it has its limitations as well. In our
previous example, when we separate the combinational logic into three
stages, we assume the delay of each stage is equal. This, however, is rarely
the case. Typically, some stages take longer time than others. Notice in
Figure 3.31, we use one clock signal to control all registers, for the purpose
of synchronization of all stages. If some stages take longer than others, the
clock has to wait for it, and all the other stages are idle as well.
In other words, the shorter a clock cycle is, the shorter the latency of each
instruction. However, to synchronize all stages, the clock cycle is deter-
mined by the slowest stage. In Figure 3.33, we assume stage 3 takes the
longest time to complete. Therefore, the clock cycle needs to cover the la-
tency of stage 3 in order to synchronize all other stages, and we see that
both stages 1 and 2 are idle for quite some time, which is a waste of re-
sources.
Example 3.3
(Fall 22 Midterm 2) We currently have a four-way pipeline circuit,
where the delay of each stage is 25ps, 50ps, 200ps, and 75ps, respec-
tively, and the delay includes sequential logic.
Each processing of data will pass all four stages, so it takes four clock
cycles in total. Since each clock cycle is 200ps, determined by the slowest stage,
the total latency of one pass of data is 200ps × 4 = 800ps.
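The arithmetic of Example 3.3 can be checked with a short sketch in Python:

```python
# Sketch of the Example 3.3 calculation: the clock cycle is set by the
# slowest stage, and one pass of data takes one cycle per stage.
stage_delays = [25, 50, 200, 75]            # ps, including sequential logic
clock_cycle = max(stage_delays)             # 200 ps
latency = clock_cycle * len(stage_delays)   # 800 ps for one pass of data
```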
Figure 3.34: The horizontal axis is a time line, for the sequence LDR X0,[X1,24] ;
ADD X0,X0,X1 ; STR X0,[X1,24] . At the top we run one instruction through all five stages at a time, so it takes much longer to
complete all three instructions. At the bottom, we run multiple instructions at the same time, which resembles a pipeline.
In our course, we do not go into depth about how those limitations are
addressed. Instead, we focus on how creating a pipeline can make our
single-cycle datapath more efficient.
To apply pipelining to the datapath of five stages, we use Figure 3.34 to show
this idea. In the figure, we have a sequence of three instructions to execute.
Using the sequential model from the previous section, one instruction takes
over the entire sequence (clock cycle), so the second instruction cannot
start executing until its previous instruction has finished all stages.
Now that the big picture of constructing a pipeline is clear, we’re going
into details on the datapath in the next section.
Similar to the simple circuit example, we need to add pipeline registers be-
tween every pair of stages (except WB and IF). The pipeline registers store
different values, depending on the data that each stage needs to store. We
name them using the names of the stages they sit between: IFID, IDEX, EXME,
and MEWB. In Figure 3.35, we add these four pipeline registers, where we
also show the data stored in them.
Control signals also need to be pipelined, but not all signals will be used
in every stage. Based on the stages they are used in, we group them into EX,
ME, and WB. EX contains PBr , CBr , UBr , ALUsrc , and ALUop . ME
and WB signals are not used in the EX stage, so we pass them on to the
next pipeline register.
3.5 Hazards
Pipelining is efficient, but if we pay a little attention to the details,
we’ll realize it can lead to wrong executions. Based on the error it can lead
to, we categorize hazards into two types: data hazards and control hazards, and
we’ll discuss them in detail in this section.
A data hazard occurs when wrong values in the pipeline are read or written.
Let’s look at a simple example in Figure 3.37.
The initial register values in this example:
X0   X1   X2   X3   X11   X12
50   10   10   20   -125  180
Now this is the key point of the data hazard. At cycle 4, instruction 3 ( ADD ),
which uses X2 as one of the source operands, is already in the ID stage
where the old value of X2 is read. In other words, the new value hasn’t
even been updated back yet. This wrong value is carried to the EX stage, and
surely the calculation result is wrong: it’s supposed to be 40, but now we
see it’s 60.
Figure 3.37: A sequence that can lead to data hazard, due to dependencies be-
tween instructions. Register X2 in the first instruction is the destination, but
also one of the source operands in the third instruction. At cycle 4, instruction
3 has already read X2 ’s old value, but it hasn’t been updated from instruction 1
yet.
Had we used the sequential model, this problem would’ve been avoided,
because in the sequential model an instruction starts fetching and reading reg-
isters only after previous instructions have written back to registers.
Our first solution is the most straightforward one—stall the pipeline. Re-
call that the problem we have is the instruction that needs to read an up-
dated register starts executing too fast. Then why not let it wait one cycle
or two? That’s exactly how stalling works.
Following the example above, let’s see how we can manually implement
stalling. Simply put, what we want is to delay the execution of the instruc-
tion ADD , but we have two questions. First, delay to which cycle? As in
Figure 3.37, we see that in cycle 5 the SUB instruction has finished updating
X2 . Because reading registers is a combinational circuit—the read port
will read the updated value as soon as the register value has changed—as
long as we let instruction 3’s ID stage align with instruction 1’s WB stage,
we’re good.
The second question is, the pipeline constantly fetches instructions from PC,
so we need to add some instruction between instructions 1 and 3. It can-
not, however, do anything other than keeping the stages busy, and it cannot
interfere with the logic of the program.
Figure 3.38: Because there’s no dependency on X2 in instruction STR , we
swap it with ADD to align its ID stage with SUB ’s WB stage.
Again, in some cases, this solution cannot be used. For example, for a
sequence such as the following:
this trick doesn’t work anymore, because we see data dependency between
every pair of instructions; no other instructions can be swapped.
In Figure 3.39 we added one NOP between the two ADD instructions. We
only need one NOP inserted in this example, because as long as ADD ’s ID
stage is aligned with SUB ’s WB stage, we’re all good. More NOP s only
waste cycles unnecessarily. With this, we can see that the position of NOP
is not unique—we can certainly add one NOP right after SUB instruction,
and it’ll have the same effect on delaying the pipeline.
Figure 3.39: Inserting a NOP instruction allows us to delay instructions that depend
on the completion of earlier instructions. In this example, to make SUB ’s WB and
ADD ’s ID stages align, we only need to add one NOP . Certainly in some cases
more NOP s may be needed.
One more thing we’d like to mention about NOP : in the detailed boxes in Figure
3.39, notice that in cycles 4 and 5, NOP ’s ID and EX stages are exactly like
the ones in its previous ADD instruction. This is because, like we said, the
NOP instruction does not do anything in the pipeline—meaning it doesn’t
change or alter what’s already there. When its previous instruction moves
along the pipeline to the next stage, the data/control signals stay there
unless the next instruction moves in and overwrites them. NOP doesn’t
do anything, so the data will not be changed.
Remarks. Note that the two solutions presented above are not mutually
exclusive. Sometimes re-arranging the instruction sequence is not enough,
and so we’d have to insert NOP s as well.
Note: explain why you rearrange the instructions the way you did.
Without explanation you’ll only get half of the points.
B See solution on page 115.
Inserting one bubble is not enough, though, because we notice there’s an-
other data hazard for X2 at cycle 5. In this case, we keep stalling instruc-
tions 3 and 4, and insert another bubble. At cycle 6, both X1 and X2 have
been updated, so instruction 3, which is currently held at ID stage, can read
both X1 and X2 successfully. Thus, from cycle 7, we move instructions
3 and 4 along the pipeline, and the data hazard has been resolved.
Now let’s analyze in what situations there will be a data hazard. From previ-
ous examples and Figure 3.40, we have seen that typically, if one instruc-
tion is trying to read a register that hasn’t been updated by previous in-
structions, there’s a data hazard. In other words, the instruction at ID stage
relies on previous instructions whose RegWrite control signal is 1, and
yet they haven’t reached the WB stage. Such instructions are either in the
EX or ME stages. Using description language, we can have the following
condition to check data hazards:
if IDEX.RegWrite:
    if IDEX.WriteReg == IFID.ReadReg1 or \
       IDEX.WriteReg == IFID.ReadReg2:
        # Stall the pipeline
if EXME.RegWrite:
    if EXME.WriteReg == IFID.ReadReg1 or \
       EXME.WriteReg == IFID.ReadReg2:
        # Stall the pipeline
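The condition above can be sketched as a runnable Python function, modeling each pipeline register as a dict; the names follow the description language:

```python
# A stall is needed when an older instruction still in EX or ME will
# write a register that the instruction currently in ID wants to read.
def needs_stall(IDEX, EXME, IFID):
    for stage in (IDEX, EXME):
        if stage["RegWrite"] and \
           stage["WriteReg"] in (IFID["ReadReg1"], IFID["ReadReg2"]):
            return True
    return False
```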
The example sequence:
1: MOV X1, 10
2: MOV X2, 20
3: ADD X0, X1, X2
4: STR X5, [X6,0]
❶ Data hazard detected: X1 is needed in instruction 3, but hasn’t been updated
from instruction 1 yet. ❷ Instruction 3 is stalled in ID stage for cycle 5, and its
following instruction is held in IF stage as well.
Figure 3.40: As long as there’s a data hazard, we’d postpone the instruction and its
following instructions by stalling them in place, and let NOP s move along the
pipeline. Once all data hazards have been resolved, stalled instructions can restart.
Once a data hazard is detected, this hazard detection unit does two jobs:
2. Insert a bubble into the EX stage. Before the EX stage starts, the EX, ME, and WB
signals in IDEX represent the instruction that we want to stall.
If there’s a data hazard, we cannot let them move to the next stage.
Therefore, instead of passing control signals generated by the control
unit to IDEX directly, we send them to a multiplexer, and let the
hazard detection unit decide if those signals can be passed to IDEX .
If there’s a data hazard, the multiplexer will pass all-zero signals to
IDEX instead, which act as a NOP instruction.
Figure 3.41: The hazard detection unit receives signals from multiple stages (representing multiple instructions), stalls instructions currently
at the ID and IF stages, and inserts a bubble into the EX stage.
Stalling a pipeline can surely avoid data hazards, but it also slows down
the pipeline. In many cases, forwarding data right from the EX stage back is
faster. Thus, we create another combinational circuit called the forwarding
unit, as shown in Figure 3.42. The main function of this unit is to detect
if there’s a data hazard. If there is, it’ll overwrite the data read from the
registers. This way, a new value will be used as the input operand(s) of the
ALU, instead of the old, not-yet-updated values from the register read.
For the two operands of the ALU, we use a four-way multiplexer to se-
lect one of three possible inputs: one from the ALU result of the pre-
vious instruction, EXME.ALUout ; one from the written-back value
RegDataW ; and one from the register read, either RegData1 or RegData2 .
Since it’s a four-way multiplexer (see Section 3.1.2.2), we’d need two bits
of selection signals, which are also the output of the forwarding unit.
We call them FA and FB , used for operands A and B of the ALU. Thus, we
can establish a truth table for the multiplexers as follows:
The description language for the multiplexer can be written this way:
if IDEX.MemRead:
    if IDEX.WriteReg == IFID.ReadReg1 or \
       IDEX.WriteReg == IFID.ReadReg2:
        # Stall the pipeline
A control hazard refers to the wrong execution of programs (not of control sig-
nals), which is mainly due to branching. Consider the code sequence:

                     1   2   3
1: CBZ X1, .L1       IF  ID  EX
2: AND X12, X2, X5       IF  ID
• Always taken;
• Always not taken;
• BTFN: backward taken, forward not taken. For example, at the end
of a loop body, we always predict that it’ll branch back, because a
loop typically runs multiple iterations. If the branch is not part of a
loop but to a procedure or other part of the code, we always predict
that it’ll not take the branch. In other words:
  • Predict branch taken if it’s going backwards;
  • Predict branch not taken if it’s going forwards.
Example 3.4
The following is the example from Section 2.5.4.1, and assume the
string is qwert . Please calculate the misprediction rate using dif-
ferent static prediction strategies: always taken, always not taken,
and BTFN.
Solution: It is easier to write out all the actual branching for all
iterations first, and then compare them with different branch
prediction strategies. One thing to be aware of is that only condi-
tional branching causes control hazard; unconditional branching
instructions such as B , BL , and RET do not cause control hazard.
In this example, therefore, we only need to consider B.EQ .
N N N N N T
↓ ↓ ↓ ↓ ↓ ↓
q w e r t '\0'
this happens during run time. There are two typical methods: Last-time
Predictor, and Two-bit Predictor.
The last-time predictor predicts the direction the branch went the last time it exe-
cuted. For example, if last time the program actually took the branch, this
time it will predict taken as well. If there’s a loop with 𝑁 iterations, it
is easy to see that only the first and last iterations will have a misprediction,
so MR = 2/𝑁.
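The MR = 2/𝑁 result can be checked with a short simulation. This sketch assumes the predictor starts out predicting "not taken" (left over from the previous run of the loop), which is what makes exactly the first and last iterations mispredict:

```python
# A sketch of the last-time predictor on a loop-back branch that is
# taken N-1 times and then not taken once.
def last_time_mr(N):
    outcomes = [True] * (N - 1) + [False]   # actual branch outcomes
    pred, miss = False, 0                    # assumed initial prediction
    for actual in outcomes:
        miss += (pred != actual)
        pred = actual                        # predict last observed direction
    return miss / N
```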
Assume in iteration 𝑖, the predictor is at the strongly taken state in Figure 3.44
and predicts the branch is taken, but it’s actually not taken. In iteration 𝑖 + 1,
instead of predicting not taken immediately as in the last-time predictor,
the two-bit predictor will enter the weakly taken state and still predict taken.
If it’s still a misprediction this time, i.e., actually not taken, then in iteration 𝑖 + 2,
it will predict not taken, and enter the weakly not taken state.
You are running the above code on a machine with a two-bit pre-
dictor, initialized to the state of Strongly Taken.
1. Assuming that 𝑁 is larger than 10, after running the loop for
10 iterations, you observe that the branch predictor mispre-
dicts 0% of the time. What is the value of 𝑋 ?
2. What is the prediction accuracy of the branch predictor if
𝑁 = 20 and 𝑋 = 2?
3.6 Performance Evaluation 111
clock rate = 1 / clock period. (3.3)
This measurement denotes how many cycles a CPU can run each second,
and its unit is Hz. For example, for a computer with clock period of 250 ps =
2.5 × 10⁻¹⁰ s, its clock rate is
clock rate = 1 / (2.5 × 10⁻¹⁰ s) = 4 × 10⁹ Hz (3.4)
           = 4 × 10³ MHz (3.5)
           = 4 GHz. (3.6)
What does this number even mean? Consider computer A with 4GHz
clock rate and B with 2GHz. The number tells us that computer A can
have 4 × 109 cycles in each second, while B has only 2 × 109 cycles per
second. If both A and B are using the CPU we designed in this chapter,
meaning they have five stages, we see that computer A can load twice
as many instructions as B in each second. Thus, given the same
amount of time, A can accomplish more tasks than B.
This equation also outlines the factors that can improve a computer’s per-
formance:
For clock rate, the higher the better, because that means given one
second we can process more instructions. Again, for our non-pipelined
design, because one cycle needs to deal with all five stages, each cy-
cle needs much longer time, which greatly reduces the clock rate.
Therefore, for CPU designers, finding a balance between smaller
CPI but also larger clock rate is important to achieve better perfor-
mance. So as programmers, if we cannot improve the hardware de-
sign, what can we do? Optimize our algorithms!
Example 3.5
We designed a processor A with 2GHz clock rate, and when we
run a performance test program on processor A, the CPU time is
10𝑠. Our competing company designed a processor B, and when
we run the same test program on processor B, it takes only 6𝑠. Our
CEO was not happy about this, so he hired a spy to find out what
makes their processor so fast. Unfortunately, the company keeps
its secret so well that the only thing the spy could get is that their
clock cycles per instruction 𝐶𝑃𝐼𝐵 is 1.2 times our 𝐶𝑃𝐼𝐴 . So
how fast must computer B’s clock be?
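One way to work out Example 3.5 is through CPU time = (instruction count × CPI) / clock rate, noting that the same test program (same instruction count 𝐼) runs on both processors:

```python
# Sketch of the Example 3.5 calculation.
rate_A = 2e9                  # 2 GHz
time_A, time_B = 10, 6        # seconds
cycles_A = time_A * rate_A    # total cycles on A, i.e., I x CPI_A
cycles_B = 1.2 * cycles_A     # CPI_B = 1.2 x CPI_A, same instruction count
rate_B = cycles_B / time_B    # required clock rate of B: about 4 GHz
```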
Clock cycles = ∑ₖ₌₁ᴷ 𝐼ₖ ⋅ 𝐶𝑃𝐼ₖ (3.16)

𝐶𝑃𝐼 = Clock cycles / 𝐼 (3.17)
    = (∑ₖ₌₁ᴷ 𝐼ₖ ⋅ 𝐶𝑃𝐼ₖ) / 𝐼 (3.18)
    = ∑ₖ₌₁ᴷ (𝐶𝑃𝐼ₖ ⋅ 𝐼ₖ / 𝐼) (3.19)

where 𝐼 is the total number of instructions in the program, and 𝐼ₖ/𝐼 is called
the relative frequency of instruction 𝑘.
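Equation 3.19 says the overall CPI is a weighted average over instruction classes. A small sketch, with a class mix and per-class CPIs made up purely for illustration:

```python
# Overall CPI as a weighted average over instruction classes (Eq. 3.19).
# The counts and per-class CPIs below are hypothetical.
instr_counts  = {"ALU": 500, "load/store": 300, "branch": 200}
cpi_per_class = {"ALU": 1,   "load/store": 2,   "branch": 3}
I = sum(instr_counts.values())
clock_cycles = sum(instr_counts[k] * cpi_per_class[k] for k in instr_counts)
CPI = clock_cycles / I        # each class weighted by I_k / I
```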
11010001000 00000001000 10010 01010
The highest 11 bits are 11010001000. Based on Figure 3.17, the in-
struction that matches the opcode is SUB Rd,Rn,imm11 . The next
11 bits are 00000001000, which represents the immediate number 8.
The next 5 bits 10010 give the operand register X18 , and the last 5 bits
01010 the destination register X10 . Therefore, the exact instruction
is: SUB X10, X18, 8 .
True. Recall that latency is the time for one instruction to finish,
while throughput is the number of instructions processed per
unit time. Pipelining results in a higher throughput because
more instructions are run at once. At the same time, latency
is also higher, as each individual instruction may take longer
from start to finish because each clock cycle must last as long as the
slowest stage. Additionally, hazards may be introduced.
Note: explain why you rearrange the instructions the way you did. With-
out explanation you’ll only get half of the points.
Indeed it looks like now between LDR and ADD there are enough
cycles to avoid the data hazard. However, notice that after the SUB instruction
we branched to a procedure proc . From what’s given in the ques-
tion, it’s never clear if we need to use X2 in the procedure. If we
calculate X2 after returning, the procedure might have used the
wrong value of X2 . Therefore, remember, when you re-arrange in-
structions, anything before BL should stay before BL . Same for
everything after BL . And of course here we assume there’s no data
hazard in proc .
Assume the following piece of code that iterates through two large arrays,
j and k , each populated with completely random positive integers. The
code has two branches (labeled B1 and B2 ). When we say that a branch
is taken, we mean that the code inside the curly brackets is executed. As-
sume the code is run to completion without any errors. For the following
questions, assume that this is the only block of code that will ever be run,
and the loop-condition branch ( B1 ) is resolved first in the iteration be-
fore the if-condition branch ( B2 ). N and X are unspecified non-zero
integers.
You are running the above code on a machine with a two-bit predictor,
initialized to the state of Strongly Taken.
1. Assuming that 𝑁 is larger than 10, after running the loop for 10 iter-
ations, you observe that the branch predictor mispredicts 0% of the
time. What is the value of 𝑋 ?
𝑋 = 1. This is because the two-bit predictor is initialized to the
Strongly Taken state, so both B1 and B2 have to be taken all
the time for the first 10 iterations. The only condition that can
make 𝑖 % 𝑋 = 0 for all 𝑖 = 0, 1, 2, …, 9 is 𝑋 = 1.
2. What is the prediction accuracy of the branch predictor if 𝑁 = 20
and 𝑋 = 2?
When 𝑁 = 20, in total we have to make 41 predictions: 21 for
B1 (where 20 for the actual iterations and 1 for the last check),
and 20 for B2 . Since 𝑋 = 2, let’s write out the branching
outcomes for the first few iterations and see if there’s a pattern:
             i = 0    i = 1    i = 2    i = 3
             B1  B2   B1  B2   B1  B2   B1  B2   ...
Outcome      T   T    T   N    T   T    T   N
Prediction   T𝑠  T𝑠   T𝑠  T𝑠   T𝑤  T𝑠   T𝑠  T𝑠
MR = (10 + 1) / 41 ≈ 0.268. (3.20)
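The 11-out-of-41 count can be checked by simulating the predictor directly. This sketch assumes, as in the worked table, a single shared 2-bit counter (states 0 to 3, predicting taken when the state is at least 2, initialized to Strongly Taken), and that B2 is taken when the loop index is a multiple of 𝑋, consistent with the outcome pattern above:

```python
# Simulate the two-bit predictor over B1 (loop condition) and B2 (if).
def mispredictions(N, X):
    outcomes = []
    for i in range(N):
        outcomes.append(True)         # B1: loop continues, taken
        outcomes.append(i % X == 0)   # B2: assumed if-branch condition
    outcomes.append(False)            # final B1 check: loop exits
    state, miss = 3, 0                # start at Strongly Taken
    for taken in outcomes:
        miss += ((state >= 2) != taken)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return miss, len(outcomes)
```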
Memory System 4
4.1 Memory Hierarchy . . . . 119
4.2 Cache Memory . . . . . . 122
4.3 Virtual Memory . . . . . . 139
4.4 Reference . . . . . . . . . . 149
4.5 Quick Check Solutions . 150
In previous chapters we discussed the processor. Now we are going to
move on to the memory. Even though it’s not part of the processor, it plays a
critical role in our datapath and processor design. Thus, in this chapter,
we’re going to learn about memory technologies.
Let's start from a programmer's view, and then look at how hardware
designers take advantage of what we observe to design faster access
devices.
Let’s start with a very simple example, where we assume there’s an integer
array a of length n:
```c
int sum = 0;
for (int i = 0; i < n; i++) sum += a[i];
return sum;
```
This example sums over all the elements in the array a, using a for-loop
to add each element to the variable sum. By now we should be
pretty familiar with how arrays are stored in memory: in this example,
all the elements in array a are stored next to each other. We are also
familiar with the calculation happening in the processor; the variable
sum is an integer stored in the memory, and each time we add a[i]
to it, we need to load sum into the processor.
Our memory is a very large device that can store lots of data, but when we
look at the example above, we notice:
The data we’re dealing with, i.e., a[i], are stored right next to each
other, instead of being all over the place;
The variables sum and i are used in each iteration, frequently.
More formally, spatial locality means the data with nearby addresses tend
to be used (referenced) close together in time, such as the array elements;
temporal locality means recently referenced data are likely to be refer-
enced again in the near future, such as sum in our example.
Compare the following two functions in C language. Which one has good
locality and why?
```c
int sum_array_rows(int a[M][N]) {
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

int sum_array_cols(int a[M][N]) {
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```
Since we're comparing C functions, we need to look at the row-major
arrangement: the inner loop of sum_array_rows() accesses elements stored
together. We usually call this stride-1 reference.

The inner loop of sum_array_cols() needs to skip over 𝑁 elements in
each iteration, so the elements it accesses are far away from each
other. It has a stride-𝑁 reference.

However, this is not to say sum_array_cols() does not have good locality
in every language. In column-major order languages, it has better locality
than sum_array_rows(). So knowing the language's behavior is the first
step to determining locality.
Figure 4.3: Column-major languages store all elements in each column together.

Once this observation has been made, smart hardware designers thought:
if the data are used frequently and together, why don't we load all of them
just once into another device that's faster than memory, and put them back
into memory once we're done? That's exactly what caching does.
The idea of caching is not new, actually. Think about this: our programs
are stored on the hard drive, but why are they sent to memory when
executing? From Figure 4.1, we see that disk seek time is almost 10⁹ times
slower than CPU cycle time, so if the data and the code we need to use
in a program were stored on the hard drive, the delay would be intolerable.
Thus, we store the code and data we're going to use in this program in
memory, to accelerate the execution. We can think of memory as "caching"
the disk.

The reality about storage devices is that usually the faster the device is, the
more expensive it is to store one byte, and therefore we tend to store less
data in that device. Registers are built inside the CPU, and so they are the
fastest storage. Due to the cost, however, we cannot have a large amount of
registers, so in our ARM architecture there are only 32 of them. 2

2: Of course, the design of the CPU is also one of the factors. It's entirely
possible to have thousands of registers built into a processor, assuming
you're a billionaire and can afford them. However, remember that if you had
those registers, you would also need to build an entirely new instruction set
architecture to operate them, which is already a large project. Even if you
finished it, probably no one else could afford to use this processor because
it's too expensive. After all, how many billionaires do we have in total on
this earth?
Figure 4.4: The memory hierarchy. From top to bottom: L0 registers (inside
the CPU, holding words), L1–L3 caches (SRAM, holding lines), L4 main
memory (DRAM, holding disk blocks), and L5 local secondary storage (local
disks, SSD, etc., holding files). Devices toward the top are smaller, faster,
and more expensive per byte; devices toward the bottom are larger, slower,
and cheaper per byte.
All of our running programs reside in the main memory, including the
variables used in our programs. So far it should be very clear that, in order
to perform any computation, data must constantly travel between the CPU
and the main memory.

Figure 4.5: Traditional bus structure between CPU chip and main memory.
The CPU chip contains the register file, the ALU, and a bus interface; the
bus interface connects through the system bus to an I/O bridge, which
connects through the memory bus to the DRAM main memory.
Figure 4.5 shows a typical bus structure between the CPU chip and the main
memory. The bus interface inside the CPU chip allows the internal
bus (from the register file to the bus interface) to be extended to connect
with I/O devices as well as the main memory. There are also some other
components in this structure:

- System bus: contains three major parts: control bus, address bus,
  and data bus;
- I/O controller: connects I/O devices to the system bus so they can be
  controlled and used by the CPU chip;
- Memory bus: contains address, data, and control buses as well, which
  were covered earlier in Chapter 3. The role of the control bus here is
  to indicate if this is a read or write transaction.
Read Transaction

The read transaction between the CPU and the main memory is mostly
performed by instructions such as LDR. For example, given an instruction
LDR X10,[X9], the value of X9 is read from the register file first. Then it's
put on the system bus and memory bus through the bus interface and I/O
bridge. Next, the main memory retrieves the data stored at that address
and puts the data on the memory bus; the data then transfers back through
the system bus and is written to register X10.
Write Transaction

Similarly, the STR instruction invokes a write transaction. For example, for
STR X10,[X9], the address in X9 is put on the address bus, and the data
in X10 on the data bus, to be transferred to the main memory.
Figure 4.6: A CPU chip with one level of cache. The cache (SRAM) sits on
the CPU chip between the register file/ALU and the bus interface, which
connects through the system bus, I/O bridge, and memory bus to the DRAM
main memory.
Since memory transactions are too slow for the processor, we add another
type of memory called cache on the CPU chip. This cache is formed by a
collection of Static RAM (SRAM), which is more expensive but faster. It's
also smaller than the main memory, so it can only hold a tiny subset of the
data we use in a program. 5 In Figure 4.6, we added a cache on the CPU
chip between the register file and the bus interface.

5: Through the years, engineers realized that one level of cache is not
enough, so they created three levels of cache, named L1, L2, and L3, as in
Figure 4.4. In fact there are more levels, but here one is enough for
understanding the concept.

Let's start with a toy example to see what this cache actually does. As in
Figure 4.7, in the beginning, the cache is empty, and all the data needed are
inside the main memory. Assume now the CPU executes an instruction
LDR X0,[X1]. Because the data M[X1] is not in the cache, we have to
go to the main memory to retrieve it. The thing is, when retrieving
data from memory, we don't just bring the requested data back; instead,
we bring a "block" of data back. In other words, in addition to copying the
requested data back, we also copy the data stored next to it into the
cache. For example, if we want to load the double word at address 0x1000
to 0x1008, we copy all the data from 0x1000 to 0x1040 back to the cache,
which contains eight double words.
Next time, as long as the data requested by CPU are in that address range
(0x1000—0x1040), since they are already copied in the cache, there’s no
need to go to the main memory anymore; we can simply retrieve the data
from the cache. When this happens, we say it’s a hit.
Figure 4.7:
❶ In the beginning, the cache is empty, and all the data needed are inside the main memory.
❷ The CPU requests data in the main memory, but it's not in the cache, so the "block" that contains the requested data is copied into the cache.
❸ As long as the data requested by the CPU are in the cache (called a hit), we don't need to go to main memory anymore, and can simply go to the cache to retrieve the data.
❹ If the CPU requests data that's not in the cache (called a miss), the "block" that contains the data in the main memory will be copied into the cache, overwriting part of the cache.
However, if the data requested is not in the cache, we'd have to go to main
memory again, copy the "block" back to the cache, and possibly overwrite
the data in the cache. The case where the data requested is not in the
cache is called a miss.
Now with this idea, it’s no wonder why we prefer good spatial locality in
our programs. Cache stores a chunk of memory data each time, so if every
LDR in our program is requesting data that are close to each other, we can
easily and rapidly get them just from the cache, without going further to
the main memory. This is also why we prefer stride-1 reference in arrays,
because cache copies consecutive elements in an array all at once.
Figure 4.8: General cache organization. A cache is organized as 𝑆 sets, with
𝐸 lines per set, and each line stores 𝐵 bytes of data along with a valid bit v
and a tag. Note all the lines in the cache have the same structure as shown
in line 0; for illustration purposes we only show the structure of line 0.
Now that we have a cache between the processor and the main memory,
the first step is to see if the data is already in the cache. So given an address
issued by the CPU, how can we get the corresponding data from the cache?
In fact, the address sent out by the CPU can be chopped into three fields:
the first 𝑡 bits are used for the tag; the next 𝑠 bits for indexing into a specific
set; and the last 𝑏 bits for the block offset.
  Cache | 𝑚  | 𝐶     | 𝐵  | 𝐸  | 𝑆 | 𝑡 | 𝑠 | 𝑏
  ------+----+-------+----+----+---+---+---+---
  1     | 32 | 1,024 | 4  | 1  |   |   |   |
  2     | 32 | 1,024 | 8  | 4  |   |   |   |
  3     | 32 | 1,024 | 32 | 32 |   |   |   |

See solution on page 150.
In the following sections, we will discuss two special types of caches, and
use concrete examples to see how cache works.
The simplest cache is called a direct-mapped cache, where each set has only
one line, i.e., 𝐸 = 1. We'll create a toy direct-mapped cache to show how
it works. In this toy example, we assume:
Because each address has 4 bits, we use 𝑠 = 2 bits for set index and 𝑏 = 1
bit for block offset, which leaves us one bit for the tag, i.e., 𝑡 = 1. We’ll also
follow the convention from last chapter, and use M[x] to denote the data
in memory at address of x.
Now let's start requesting data! Assume the CPU requests data at the
following addresses one at a time: 8 0b0000, 0b0001, 0b0111, 0b1000,
0b0000.

8: You can think of this as running a sequence of assembly instructions
such as LDRB W1,[0b0000], LDRB W2,[0b0001], etc.
We first parse the addresses into tag, set index, and block offset as follows:

  Address   Tag   Set index   Block offset
  0b0000    0     00          0
  0b0001    0     00          1
  0b0111    0     11          1
  0b1000    1     00          0
  0b0000    0     00          0

0b0000 The first address indexes into set 00. The cache is empty, so the
valid bit is not set and we have a miss. We go to the main memory, copy
M[0b0000] and M[0b0001] into the line, set the valid bit, and set the tag
to 0.
Now that we have the data in the cache, we will load one byte starting
from offset 0 in the block. Thus, we take M[0b0000] back to the
register file.
0b0001 The second address is 0b0001. Its set index is 0b00, so we index
into set 00. At this point, we notice the valid bit is set, and its tag 0
matches the tag from our address, so we have a hit! Next, the block
offset is 1, so we’ll retrieve the data from the second byte in the line,
which is M[0b0001].
0b0111 This time the set index becomes 0b11, so we have a miss. We go
to the main memory, and copy both M[0b0110] and M[0b0111] back
to the cache. 9 After copying the data from the main memory, we set the
valid bit, and set the tag to 0. Since the block offset is 1, we again load
the second byte in the line, M[0b0111], to the register file. The cache
at this point looks like this:

9: Note here we didn't copy M[0b0111] and M[0b1000] back, because the
last bit of an address in this example indicates the block offset. Given an
address, we want to copy the data at addresses with the same tag and the
same set index, but starting from block offset 0. This way we can carry an
entire line into the cache.
  set 00: v=1, tag=0, M[0b0000] M[0b0001]
  set 01: (empty)
  set 10: (empty)
  set 11: v=1, tag=0, M[0b0110] M[0b0111]
0b1000 This address has a set index of 0b00, so we index into set 00, and
proceed to compare the tag. The tag in the set 00 is 0, which doesn’t
match the tag in our address, so the data currently in this set is not
the one we want, and we have a miss. We need to go to the main
memory, and copy M[0b1000] and M[0b1001] back. Since each set
has only one line, we’ll have to evict the data currently residing in
the set, and replace it with the new data we just brought back from
the main memory. Thus, M[0b0000] and M[0b0001] are overwritten
by M[0b1000] and M[0b1001], and the tag is updated to 1. After
these operations, the cache looks like this:
  set 00: v=1, tag=1, M[0b1000] M[0b1001]
  set 01: (empty)
  set 10: (empty)
  set 11: v=1, tag=0, M[0b0110] M[0b0111]
Then we can send the first byte from the line back to the processor.
0b0000 Lastly, this address again indexes into set 00, but in the last access
we replaced the data in set 00 with tag 1, so unfortunately we have a miss
again. We'll have to repeat the procedure described above:
bring the data back from the main memory, overwrite the current
data in the set, and update the tag.
In the following, to help you understand the concept, we'll use a real-world
example. Of course, if the procedure of cache reads discussed in the
previous section is clear to you, you can skip this section.
Let’s say you’re working at a casino, and your job is to provide one of
the eight cards to a customer: ♥00, ♥01, ♥10, ♥11, ♦00, ♦01, ♦10, and 00 01 10 11
♦11.
Those cards are stored in the stockroom but you’re working at the front
Box 0 Box 1
desk. You realized that it’ll be too much hassle if you go back to the back-
room every time a customer comes and asks for a card. You’re very smart, Figure 4.9: For convenience, two boxes
are used for storing two cards each. Box
so you prepared two boxes at the front desk: box 0 and box 1. When a cus- x only stores cards starting with digit x.
tomer asks for a card, you check if it’s already in the box or not. If it is, then
you just give it to the customer; otherwise you go back to the stockroom,
bring it back, and put it in the box.
The rule of using the box is simple. The first digit on the card represents
which box. For example, if a card is 0x, you put that into box 0; if it’s
1x you put it into box 1. Because the box can fit two cards at a time, you
decided to bring two cards with the same first digit back to the box. Thus,
the second digit represents which card. For example, x0 means it’s the
first card in the box; x1 the second card.
Now let's get started!

♥00 The first customer asked for a ♥00 card. Since you just started working,
there's nothing in the boxes—you had a miss. You went back to
the stockroom, and brought both ♥00 and ♥01 back and put them
in box 0, because both of them start with 0. ♥00 means it's the 1st
card in box 0, so you grabbed it and handed it to the customer. The
customer used the card and gave it back to you;

Figure 4.10: After ♥00 was requested.
♦11 The next customer asked for a ♦11 card. You checked box 1 but noth-
ing there—you had a miss again. You went back to the stockroom,
and grabbed both ♦10 and ♦11 back, because both of them start
with 1. After putting them in the box, you realize ♦11 is the second
card in box 1, so you took it to the customer;
♥01 This customer wanted a ♥01. First thing you checked is box 0, and
yes there are cards there. Then you need to make sure the cards
there are in the ♥ suit, and fortunately they are, so we have a hit!
You gladly picked the second card in box 0, and handed it to the
customer. No need to go back to the stockroom!
♦01 This time, a customer wanted a ♦01. You went to box 0, and there
were some cards there indeed. However, when you compared their
suits, you realized the ones in the box (♥) are not what's requested (♦),
so you had a miss again. You had to go back to the stockroom, and
grabbed both ♦00 and ♦01. Box 0 can only fit two cards,
so you swapped the two cards in the box out, for the new cards
you brought.

Figure 4.11: After ♦01 was requested.

In this example, a customer's request contains three pieces of information:
suit, box number, and card number. If we treat the requests as three-bit
addresses, and the cards as the data requested, we have the perfect analogy:
- suit → tag;
- box number → set index;
- card number → block offset.
address, MSB is the tag, and LSB the block offset, and the middle
two bits are set indices;
2-way set associative cache: We can reduce the number of sets by half,
which makes 𝑆 = 2 sets. To keep 𝐵 = 2 bytes per line, we have
to have two lines per set to keep the capacity unchanged. Thus,
𝑆 = 𝐸 = 𝐵 = 2. Now that we only have two sets, we need just one
bit for the set index. The LSB is still the block offset, so the most
significant two bits are used for the tag;

Fully associative cache: We have only one set in this case, so there's
no need to use any bit of the address as a set index. To keep the
capacity unchanged, we need four lines, and since the LSB is the block
offset and we don't need a set index, the most significant three bits of
the address are used for the tag.
There are two possible problems that come with E-way set associative caches.
Take the 2-way cache above as an example.

The first problem: if the data requested is not in the cache and there are
empty lines in the set, which line should we put the data in? A simple
solution is just to put it in the next available line.

Another problem arises when there's a conflict. Because each line has two
bits for the tag, there are four possible tags for each set in total. However,
because we only have two lines per set, there's a chance that neither line's
tag matches the tag in the address. So we need to replace one line, but
which line?

There are several simple algorithms. For example, we can take the earliest-used
line out, assuming it won't be used again very soon. We can also
take a random line out and hope for the best. In our class, we'll replace
the earliest-used line.
Part 1
The box below shows the format of a physical address. Indicate (by
labeling the diagram) the fields that would be used to determine
the following:
O The block offset within the cache line
I The cache index
T The cache tag
12 11 10 9 8 7 6 5 4 3 2 1 0
Part 2
For the given physical address, indicate the cache entry accessed
and the cache byte value returned in hex. Indicate whether a cache
miss occurs. If there is a cache miss, enter "-" for "Cache Byte returned".

Physical address: 0x0E34
As mentioned before, one level of cache between the main memory and the
processor is good enough for understanding the concept of cache,
but it's not practical in reality. Therefore, we see in Figure 4.4 that we
have three levels of caches: L1, L2, and L3. The L𝑘 cache holds part of the
data in the level below it.
Figure 4.13: Simplified illustration of an ARM processor chip with multi-level
caches. Each core (0 to N−1) has its own ALU, L1 d-cache, and L1 i-cache;
the cores share an internal L2 cache, and connect through a bus to an
external L3 cache and the main memory.
You usually see "8-core" or "quad-core" when you buy a computer. Each
core is just a CPU that contains the basic elements we have learned. In
modern architectures, the L1 cache is actually split into two, called the
d-cache for caching data and the i-cache for instructions. This further
accelerates data processing compared with using one L1 cache for both
data and instructions.
To evaluate cache performance, the most commonly used metric is the miss
rate: the number of misses divided by the total number of memory accesses.
The best way to understand miss rate and how to calculate it is through a
concrete example.
Example 4.1
A bitmap image is composed of pixels. Each pixel in the image is
represented as four values: three for the primary colors (red, green
and blue – RGB) and one for the transparency information defined
as an alpha channel.
The definition of a pixel and the matrix we're going to use
is as follows:
```c
typedef struct {
    unsigned char r;
    unsigned char g;
    unsigned char b;
    unsigned char a;
} pixel_t;

pixel_t pixel[16][16];
```
- sizeof(unsigned char) == 1;
- pixel begins at memory address 0;
- The cache is initially empty;
- Variables i, j are stored in registers, and any access to these
  variables does not cause a cache miss.
What’s the miss rate for writes to the pixel given the following
code?
[Figure: the pixel array laid out in memory as consecutive bytes
r g b a, r g b a, …, at addresses 0 through 1023.]
Now let’s see how to calculate the miss rate for this example. In the
inner loop, we have four writes (assigning zeros to the members
136 4 Memory System
What we can conclude from the code above is, for every eight mem-
ory access in the inner loop, we will have one miss. The outer loop
doesn’t really matter here, because it’s simply scaling the access by
16. Therefore, clearly the miss rate is 18 = 0.125 or 12.5%.
3. Calculate the cache miss rate for the line marked Line 1;
4. Calculate the cache miss rate for the line marked Line 2.
When you were learning Big-O notation, recall that a big assumption in the
beginning was that hardware differences are ignored, so it's a purely
theoretical abstraction. Remember, however, that your programs are not
running in your imagination; they eventually rely on the hardware of your
laptop! Since we have learned how to calculate miss rates, why don't we
look at an example where the programs have identical time complexity but
different miss rates, and see what their performance is in reality.
Our task is very simple—matrix multiplication, given two long int matrices
a and b of the same size 𝑛 × 𝑛. Each element takes eight bytes. Based
on the math definition, we can quickly write out a segment that performs
the multiplication:
```c
/* ijk */
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        sum = 0.0;
        for (k = 0; k < n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```
```c
/* kij */
for (k = 0; k < n; k++) {
    for (i = 0; i < n; i++) {
        r = a[i][k];
        for (j = 0; j < n; j++)
            c[i][j] += r * b[k][j];
    }
}
```
Or, we can fix each element in matrix b while iterating over each column in
matrices a and c:

```c
/* jki */
for (j = 0; j < n; j++) {
    for (k = 0; k < n; k++) {
        r = b[k][j];
        for (i = 0; i < n; i++)
            c[i][j] += a[i][k] * r;
    }
}
```
The three methods above are mathematically equivalent, and have the
same complexity of 𝑂(𝑛³). Let's analyze their miss rates first. Assume
the block size of the cache is 32 bytes, and the cache is not large enough to
store multiple rows. We also assume the dimension 𝑛 of the matrices is very
large, meaning each row needs to be moved into the cache multiple times.
ijk The inner loop accesses elements in a and b each time. Notice for
a[i][k], the row number is fixed while the column number k constantly
changes from 0 to n-1. Also, because the block size of the cache is
32 bytes and each element takes 8 bytes, we can hold four elements
at a time. For every four elements, the first one will be a miss due
to an empty cache or line replacement, while the remaining three will
be hits. Therefore, the miss rate for matrix a is 1/4 = 0.25. As to matrix
b, notice it accesses each element column-wise, so the miss
rate is simply 1. The inner loop does not involve matrix c, so we can
ignore it.
kij In this code, we first notice that the inner loop does not involve matrix
a, so we skip it. Element b[k][j] is iterating over all the elements
in each row, so its miss rate is simply 1/4 = 0.25 (see analysis in the
ijk case for matrix a). Similarly, matrix c iterates over all elements
in a row, so its miss rate is also 0.25.
jki Lastly, for the jki case, matrix b is not present in the inner loop, so we
ignore it. For both matrices a and c, the inner loop iterates over all
the elements in one column. You've probably already sensed that
this is a bad idea, and yes, the miss rate for both is 1.

[Figure: measured cycles per inner loop iteration for the ijk, kij, and jki
versions.]

If we take the average miss rate of the three methods, we have 0.625 for
ijk, 0.25 for kij, and 1 for jki. So in terms of miss rate, it's obvious that
kij is the best way and jki the worst. But is that really the case?
In the previous section, we saw that the CPU requests data by issuing a
memory address, and the address is parsed to check the cache first. The
addresses received by the cache are actually physical addresses—meaning
they are real addresses used in the real main memory.
In a small amount of time, the assembly instructions running in a program
tend to cluster together (they are stored close to each other). Even if we
have branching instructions, you don't always jump all over the place,
right? Notice two key points there: "a small amount of time", and
"cluster". Let's start with a motivating example.

Figure 4.15: Two programs (Program 1 with eight instructions, Program 2
with seven) and a small physical memory with eight 4-byte slots at
addresses 0x00 through 0x1C.
Assume we want to run two programs at the same time. For the two
programs shown in Figure 4.15, we see that there are in total 15
instructions. If each instruction takes 4 bytes, to load them all into the
physical memory we need 15 × 4 = 60 bytes at least. So how do we fit both
programs into a small physical memory?

Figure 4.16: Only loading parts of the programs allows them to fit into a
small physical memory at the same time.

Because of locality, we know that the instructions a program executes
within a small amount of time are usually stored next to each other in
memory. So why don't we just load parts of each program into the memory
first? During that small amount of time, just that part of the program is
enough to execute. As shown in Figure 4.16, we only load program 1's and
program 2's first four instructions into the physical memory, which fills
the entire memory nicely. When the loaded instructions have all been
executed,
we take them out, and swap in the other parts of the programs, so we’ll
end up in a situation as in Figure 4.17.

Figure 4.17: After the swap, the physical memory holds the remaining
instructions of both programs.

This is such a simple thing we do almost every day. There might be lots of
books at your home, but your bag can only take probably five books max,
so before you go to campus, you'd take only the books you need that day.
If the next day you're going to use a different book, you'll take out the one
you're not going to use, and put the new book into your bag. The books
are like parts of the program, and the bag is the physical memory.
The example in the previous section is a very simple one, but it illustrates
the idea behind virtual memory well. Remember, however, that the example
is simplified. In fact, the areas of a process—text, data, stack, and heap—
will all be separated into "parts" and loaded into the physical memory, not
just the text segment.
We, as programmers, write our code without even thinking about memory
space limits. This is because the operating system provides a memory
management mechanism called virtual memory. From our point of view,
each of our programs can take a basically infinitely large memory space. 10
So all the statements involving "memory address" so far (except the ones
about the cache), e.g., "pointers represent memory addresses", "LDR loads
data at a memory address to a register", etc., refer to virtual memory. If a
virtual memory address has 𝑛 bits, each program will have the same set of
𝑁 = 2^𝑛 unique virtual addresses {0, 1, …, 2^𝑛 − 1}, which is called a
virtual address space.

10: Technically not infinitely large; the typical virtual memory size for a
program is 4GB on Linux, but it can certainly be extended as needed.
From the machine’s view, however, the physical memory is small and lim-
ited. The physical memory is also byte-addressed, meaning each byte has
an address. We call this address physical address—a real address, not vir-
tual, not imaginary. If a physical address has 𝑚 bits, the set of 𝑀 = 2𝑚 − 1
unique addresses {0, 1, … , 2𝑚 −1} is called physical address space. Based
on our analysis, we notice that 𝑁 >> 𝑀 .
Obviously, fitting all the programs into one small physical memory needs
some arrangement, as we showed in Section 4.3.1. Briefly, we chop each
program's virtual memory space into smaller pieces, called virtual pages,
and only load some virtual pages into the physical memory. When they
reside in the physical memory, they are called physical pages.
Let’s look at Figure 4.18 for an illustration. In this example, we have two
processes share one physical memory. Each of the processes has a com-
plete virtual memory space, from 0 to 𝑁 − 1. Their virtual memory spaces
also have complete organizations including text and data segment, stack,
and heap. This is what we see for each process.
Due to space limit, it’s impossible to fit all of these two processes into one
physical memory, so we chop the virtual memory spaces into pages, and
only the data and code in a page is needed do we move the page into the
physical memory. In the figure, we see that two pages of process 1 and
Figure 4.18: To fit multiple processes' virtual memory spaces into a small
physical memory, the system needs to split virtual memory spaces into
pages, and only load the pages that contain needed data and code into
physical memory.
three pages of process 2 are currently being used. If the processor needs
data or code in a page that's not currently in the physical memory, it needs
to find a page to swap out. If it can't find such a page, a delay will
happen. Remember when you open too many apps on your laptop and it
gets slow? That's this kind of delay at work.
- Physical pages are not dedicated to a specific process, and that's why
  we see the pages from the two processes all over the place, intertwined
  in the physical memory;
- There's no order in physical memory. A page at a low virtual memory
  address doesn't always end up at a low physical memory address. For
  example, the page that contains the stack of process 1 is at a lower
  physical address, while the page with the text segment is at a higher
  physical address;
- The pages are of equal size, so they are not aligned to a specific
  segment. For example, we see that one of the pages in process 2 has
  some part of the heap, and also some part of the data segment.
Now the ultimate question is: how does the processor know which physical
address corresponds to which virtual memory address in which
process?! As to which process is running, that's related to context switching, a
topic you'll learn in operating systems, so we'll skip it here and focus only
on the first question.
[Figure 4.19: Adding a memory management unit (MMU) to the CPU chip.]
Each row is called a page table entry, and has two fields: a valid bit, and a physical page number (PPN). The index of each page table entry is called the virtual page number (VPN). For example, if for a given virtual address we know its VPN is 2, then by looking up the table above, we know that this virtual page is located at the address starting from 0xCC00 in the physical memory. The valid bit indicates whether the page is actually in the physical memory. If it's 1, we'll have a page hit; otherwise we'll have a page fault.
When there's a page fault, the system will pick a page in main memory, called the victim page, and evict it to secondary storage, such as the hard drive. Then the page we need will be brought into main memory from the hard drive, and the address will be translated again by the MMU.
Given an n-bit virtual address, the MMU first splits it into two parts. The highest n - p bits are used as the index to a row in the page table; that's the virtual page number. If the valid bit is 1, we have a page hit, and we retrieve the physical page number from that row. The virtual page offset is the lowest p bits, and it is simply used as the physical page offset (PPO) without change. So we attach the PPO to the PPN, and we have the physical address.
The contents of physical or virtual pages are simply bytes, so the page offset indicates which byte in that page. For example, if PPO = 0, we want the first byte of that page; if PPO = 1600, we want the 1601st byte, and so on. Since we need p bits to represent the offset, and each byte has its own address, it is clear that a page stores 2^p bytes. Therefore, we call P = 2^p the page size.
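The splitting and reattaching described above can be sketched in C. This is a minimal sketch: the page size (p = 10) and the page table contents are hypothetical, and valid bits and page faults are ignored.

```c
#include <stdint.h>

/* Hypothetical page size: p = 10, so P = 2^10 = 1024 bytes per page. */
enum { P_BITS = 10 };

/* Toy page table: index = VPN, value = PPN (made-up contents). */
static const uint64_t page_table[4] = { 0x0, 0x5, 0x7, 0x9 };

static uint64_t vpn_of(uint64_t va) { return va >> P_BITS; }               /* high n-p bits */
static uint64_t vpo_of(uint64_t va) { return va & ((1u << P_BITS) - 1); }  /* low p bits */

/* Swap the VPN for its PPN; the offset passes through unchanged. */
static uint64_t translate(uint64_t va) {
    return (page_table[vpn_of(va)] << P_BITS) | vpo_of(va);
}
```

For example, virtual address 0x809 has VPN 2 and VPO 9; with the toy table above it translates to (0x7 << 10) | 9 = 0x1C09.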
Example 4.2
Now let's look at an example of translating a virtual address to a physical address. Assume the page table is:

VPN  PPN  Valid  |  VPN  PPN  Valid  |  VPN  PPN  Valid
00   2    0      |  06   B    0      |  0A   3    0
01   5    1      |  07   D    1      |  0B   1    1
02   7    1      |  08   7    1      |  0C   0    1
03   9    0      |  09   C    0      |  0D   D    0
04   F    1      |  0A   3    0      |  0E   0    0

13 12 11 10  9  8  7  6  5  4  3  2  1  0
 0  0  0  1  1  1  0  0  0  0  1  0  0  1
(PPN over the high bits, PPO over the low bits)

This is the physical address in binary, so the last step is to simply convert this into hexadecimal, and we get a physical address of 0x709.
Notice that page tables are nothing special: they reside in the main memory just like any other data, such as our code, or other process data. Therefore, a page table will be cached in the L1 cache (and L2, L3, etc.) as well, which means it'll be evicted and replaced if the cache is full and we need to store other data in the cache. However, remember the special role of the page table: it is used for translating addresses, which is needed almost all the time, so we certainly don't want it to be replaced too often. Otherwise the delay of retrieving page table entries would be too long.
Similar to the idea of caching memory data in an SRAM cache, we also add a small SRAM cache called the translation lookaside buffer (TLB) to the MMU, as in Figure 4.20. Essentially this TLB is also a set-associative cache; each line of this cache stores one page table entry.
We're fairly familiar with SRAM caches by now, and the indexing rule for a line is the same for the TLB. Because a page table entry is determined by its VPN, we separate the VPN into two parts: the TLB tag (TLBT) and the TLB index (TLBI). These two parts are used to index a page table entry (a line) in the TLB.
Here we use the middle t bits of the virtual address (the lowest t bits of the VPN) as the TLB set index, so in total we have T = 2^t sets. Figure 4.21 shows the idea of the TLB, which, as you see, is just an SRAM cache where the cached data are page table entries.
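The VPN split can be sketched in C; the configuration here (t = 1, i.e. two sets) is a hypothetical choice for illustration.

```c
#include <stdint.h>

/* Hypothetical TLB geometry: 2^t sets with t = 1, i.e. two sets. */
enum { T_BITS = 1 };

/* The low t bits of the VPN pick the set... */
static uint64_t tlb_index(uint64_t vpn) { return vpn & ((1u << T_BITS) - 1); }

/* ...and the remaining high bits form the tag. */
static uint64_t tlb_tag(uint64_t vpn) { return vpn >> T_BITS; }
```

With two sets, VPN 0x0E splits into TLBI 0x0 and TLBT 0x07.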
1. If a page table entry can not be found in the TLB, then a page
fault has occurred;
2. The virtual and physical page number must be the same size;
3. The virtual address space is limited by the amount of mem-
ory in the system;
4. The page table is accessed before the cache.
[Figure 4.20: TLB as a cache for page table entries on the MMU.]
[Figure 4.21: A translation lookaside buffer (TLB) is simply a SRAM cache that caches page table entries.]
where ptr is a char pointer that points to a byte in the virtual memory.
During compilation, this line will be translated into assembly:
Now assume the machine we run the above code on has the following configurations.
At the time we execute the LDRB instruction, the contents of the TLB, the page table for the first 32 pages, and the cache are as shown in Table 4.3.
Table 4.3: Contents of TLB, page table (first 32 pages), and the cache in the example. All numbers are hexadecimal.
The VPN consists of the TLBT and TLBI. From the table, we see that the TLB has only two sets, so one bit is enough to index either one. Thus, in the virtual address, bit [9] is the TLBI, while bits [15...10] are the TLBT:

15 14 13 12 11 10 |  9 |  8  7  6  5  4  3  2  1  0
 0  0  0  1  1  1 |  0 |  1  1  1  0  1  1  1  1  0
(TLBT: bits 15-10; TLBI: bit 9; VPN: bits 15-9; VPO: bits 8-0)

From this we have TLBT = 0x07, TLBI = 0x00, VPN = 0x0E, and VPO = 0x1DE.
Based on Table 4.3, we see that the page table entry for VPN 0x0E has a PPN of 0x1. Fortunately, its valid bit is 1, so we have a page hit, instead of a page fault. The MMU takes this entry and translates the virtual address into a physical address.
Therefore, the tag is 0x1E, the set index 0x7, and the block offset 0x2.
Note that if we had a cache miss, we'd have to go to the main memory and evict the line currently in set 7. This involves two operations: writing that line in the cache back to the main memory (assuming a write-back cache), and copying the line we want into the cache.
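The tag/set/offset split used here can be sketched in C; the geometry below (b = 2 offset bits, s = 3 set-index bits) is an assumption inferred from the example's numbers.

```c
#include <stdint.h>

/* Assumed cache geometry: 4-byte blocks (b = 2), 8 sets (s = 3). */
enum { B_BITS = 2, S_BITS = 3 };

static uint64_t block_offset(uint64_t pa) { return pa & ((1u << B_BITS) - 1); }
static uint64_t set_index(uint64_t pa)    { return (pa >> B_BITS) & ((1u << S_BITS) - 1); }
static uint64_t cache_tag(uint64_t pa)    { return pa >> (B_BITS + S_BITS); }
```

With this geometry, the physical address 0x3DE splits into tag 0x1E, set index 0x7, and block offset 0x2, matching the example above.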
[Figure 4.22: Physical memory retrieves and stores pages from the hard drive, and sends lines to the cache. The cache then sends words to the registers, where the data are requested by the ALU.]
4.3.5 Summary
After all this mess, let's take a look at the big picture of how data are moved around inside our computers, in Figure 4.22.
We write our code and compile it into executables. Those executables are stored on the hard drive, and hold the complete image of virtual memory. From our perspective, each program has its own, identical virtual address space.
As to the details of how exactly these things happen, that's way beyond our scope, and one course is not enough to cover them. If you are interested in this, you might want to take systems courses to study them in depth, and I encourage you to, because it's really fun!
4.4 Reference
Cache   m    C      B    E    S    t    s   b
1       32   1,024  4    1    256  22   8   2
2       32   1,024  8    4    32   24   5   3
3       32   1,024  32   32   1    27   0   5
In the following tables, all numbers are given in hexadecimal. The con-
tents of the cache are as follows:
Part 1
The box below shows the format of a physical address. Indicate (by labeling the diagram) the fields that would be used to determine the following:
O The block offset within the cache line
I The cache index
T The cache tag
12 11 10 9 8 7 6 5 4 3 2 1 0
T T T T T T T T I I I O O
4.5 Quick Check Solutions
Part 2
For the given physical address, indicate the cache entry accessed and the
cache byte value returned in hex. Indicate whether a cache miss occurs. If
there is a cache miss, enter “-” for “Cache Byte returned”.
Given the following chunk of code, analyze the miss rate, given that we have a byte-addressed computer with a total memory of 1 MB. It also features a 16 KB direct-mapped cache with 1 KB blocks. Assume that the cache begins cold (empty).
Tag = 20 - 4 - 10 = 6.
3. Calculate the cache miss rate for the line marked Line 1;
The integer accesses are 4 x 128 = 512 bytes apart, which means there are 2 accesses per block. The first access in each block is a compulsory cache miss, but the second is a hit, because A[i]
Assume the physical address has 32 bits, while the virtual address has 64 bits. Based on different page sizes, determine the number of bits needed to represent the VPN, VPO, PPN, and PPO.
1 $ gcc hello.c -c
This command will let gcc compile our source code hello.c , and gen-
erate an object file hello.o .
The next step is to link all the object files. Since in this example we only have one object file, we can simply use
1 $ gcc hello.o
1 $ ./a.out
1 $ ./cute
Note that the output file name ( cute in this example) has to follow -o ,
but -o output_name can be anywhere after the command gcc . Re-
member, the values of the flags and the flags themselves always go to-
gether as a pair.
Usually the steps above can be simplified to just one command:
1 $ gcc hello.c
which will compile (without generating an object file explicitly), link, and
generate an executable. For the rest of the course, feel free to use this com-
mand to speed up your workflow, but for this lab, you have to know how
to use the flags to do the compilation and link separately. This will help
you later when we write assembly programs.
Flags
You can have many flags attached to the command gcc . You can view
them using the --help flag. This is a flag without any values:
1 $ gcc --help
-Wall : short for warning all, which will show all the warnings (they
are not errors);
-g : use this if we want to debug our code using gdb . More on this
later;
-O : capital O, for optimizing our code. It doesn't modify our source code; instead, it only generates a more efficient executable.
1 .text
2 .global _start
3
4 _start:
5 MOV X0, 0 /* status <- 0 */
6 MOV X8, 93 /* exit() is system call #93 */
7 SVC 0 /* invoke system call */
Lines 5 - 7 are the standard procedure to exit any assembly program. What they do is make a system call that ends the execution of the program. We will see other examples in the future, but for now, you only need to remember to always put these three lines at the end of your program, to make sure it can exit successfully.
On line 4 we have a label, _start , which is where all assembly programs start executing. You can write it anywhere in your code, but it always marks the first instruction in your program. 1 On line 2, we see we declared _start as a global label, using .global . The first line, .text , marks the text segment, meaning all the lines after it (until other segments) are assembly code.
1: Different machines use different labels to mark the start of the program. For example, some machines use main or _main . For the virtual machine we are using it's _start . If you work in different environments, you need to check this information first.
Both .global and .text are called directives. They are not part of the executable—they cannot be translated into machine code and put into the CPU to execute; they are part of the syntax of the assembler, and help the assembler manage the organization of the program, or serve other purposes. All directives start with a dot.
B.1.2 Segments
.data :
This segment is used to store global variables, meaning they can be
accessed by any procedures in the program. Local variables used
in procedures/functions shouldn’t be declared here; instead, they
should be directly stored on stack in their frames;
.bss :
This segment is also used to store global variables, but it's usually used for uninitialized data. For example, you may want to reserve a space of 100 bytes in case your program later needs to store some data. In this case, you can declare that empty space of 100 bytes in the .bss segment.
1 .text
2 .global _start
3
4 _start:
5 MOV X0, 0 /* status <- 0 */
6 MOV X8, 93 /* exit() is system call #93 */
7 SVC 0 /* invoke system call */
8
9 .data
10 hello_str: .ascii "Hello World!\n\0"
11 arr: .dword 13, 24, 1024
of the data can be stored after the directive, separated by commas. A brief
summary of different C data types and corresponding assembly directives
can be found in Table B.1.
1 .data
2 hello: .quad 1024
To use this data, the first step is to load its address into a register, using the ADR instruction:
Note that we cannot use a label itself as a base address. For example, this is wrong: LDR X1, [hello] .
B.1.3.3 Arrays
You probably have already noticed that there’s no “array” type in assem-
bly. If we want to declare an array of integers, for example, we simply put
all numbers in a row, separated by commas:
Here we declare an “array” of five long integers. Recall from your data structures class that an array is simply a list of elements that are logically stored one after another. In fact, when we put a declaration like the one above in assembly, all the elements are stored exactly next to each other, from low address to high.
If we want to index into the array for an element, we need to calculate the byte offset from the base. In the example above, the label arr points to the address of its first element, so to load an element into a register, we need to first load the base address, and then use the offset to load the element:
What if the offset is outside the array's boundary? Again, the structure “array” is an abstraction used by us; the internal system of the computer does not know anything about “arrays”, and therefore there's no such thing as an array boundary. So if we use an offset that lands outside of the array, the assembler does not issue warnings, and you will probably still be able to run your program without problems. However, as to what data you actually loaded into the register and whether the actual result of your program is correct or not... who knows? Therefore, we as assembly programmers need to carefully manage our own data.
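This byte-offset arithmetic is exactly what a C compiler generates for arr[i]; a small sketch (the array contents here are made up for illustration):

```c
long arr[5] = { 10, 20, 30, 40, 50 };

/* arr[i] is just *(base address + i * sizeof(long)),
   i.e. a byte offset of 8 * i from the label arr. */
long load_element(int i) {
    return *(long *)((char *)arr + i * sizeof(long));
}
```

Note that, just as in assembly, nothing here stops you from passing an out-of-range index; C calls that undefined behavior.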
Example B.1
Given the following .data segment, draw a memory layout to
show how these data are organized. Assume the lowest address of
.data segment is 0x1000 , and the machine is little-endian.
1 .data
2 str: .string "Hello"
3 arr: .quad 80302, 01230, 07030
4 vec: .int -1000
0x1020:  FF FF                     (through 0x1027)
0x1018:  00 00 00 00 00 00 18 FC   (through 0x101F)   arr+16
0x1010:  00 00 00 00 00 00 18 0E   (through 0x1017)   arr+8
0x1008:  01 00 00 00 00 00 98 02   (through 0x100F)
0x1000:  48 65 6C 6C 6F 00 AE 39   (through 0x1007)
The data in the .data segment always start from the low address, and are stored one right next to another. We first have str , which is a string, so each character in the string will be stored as its ASCII code. Note that at the end we have an additional null terminator, because we used the .string directive. If we had used .ascii , the null terminator would not be there.
Next, we have three double words, and they will be stored right after the string. Notice we're using little-endianness, meaning the least significant byte is at the lowest address. After the three numbers, we have an integer, which only takes four bytes.
assembly wouldn’t give you an error saying, “hey that’s the wrong
address!”. It’ll just do whatever you told it to do. In that case, X1
does not store the actual first element of the array which is 80302 ;
instead it stores the most significant six bytes of the first element,
and the least significant two bytes of the second:
1 01 00 00 00 00 00 98 02
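You can reproduce this byte-by-byte view in C. The first element, 80302, is 0x139AE, so on a little-endian machine its low-order bytes come first in memory:

```c
#include <stdint.h>
#include <string.h>

/* Copy out the raw bytes of a 64-bit value as they sit in memory. */
void bytes_of(uint64_t x, unsigned char out[8]) {
    memcpy(out, &x, 8);
}
/* On a little-endian machine, bytes_of(80302, b) yields
   b = { 0xAE, 0x39, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00 },
   matching the AE 39 01 00 ... bytes in the layout above. */
```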
B.1.3.4 Alignment
In Example B.1, you probably noticed that when we draw the memory
layout we put eight bytes on each row, and each row starts at an address
of multiples of eight. The byte pointed by the label arr , however, start
at 0x1006 , which is non divisible by eight, i.e., not aligned. Having data
aligned at a memory boundary in many cases will increase machine per-
formance greatly. In some machine if the data are not aligned it can even
162 B ARMv8 Assembly in Action
0x1020 18 FC FF FF 0x1027
0x1018 18 0E 00 00 00 00 00 00 0x101F
0x1010 98 02 00 00 00 00 00 00 0x1017
0x1008 AE 39 01 00 00 00 00 00 0x100F
arr
Figure B.1: Using .balign exp will 0x1000 48 65 6C 6C 6F 00 00 00 0x1007
align the next data at the address of mul-
tiple of exp , with zeros as padding. str
padding
generate errors.
1 .balign exp
which will make the next data item start at the next address that is a multiple of exp . For example, if .balign 8 is used, the next data item will start at an address that is a multiple of 8.
1 .data
2 str: .string "Hello"
3 .balign 8
4 arr: .quad 80302, 01230, 07030
5 vec: .int -1000
The layout of the memory is then shown in Figure B.1. Notice that after the end of str we have two bytes of zeros as padding, and arr starts at 0x1008 , which is a multiple of eight.
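The rounding that .balign performs is the usual power-of-two alignment trick; a sketch in C:

```c
#include <stdint.h>

/* Round addr up to the next multiple of exp (exp must be a power
   of two); this is where .balign exp places the next data item. */
uint64_t balign(uint64_t addr, uint64_t exp) {
    return (addr + exp - 1) & ~(exp - 1);
}
```

For the example above, balign(0x1006, 8) gives 0x1008, while an already-aligned address is left unchanged.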
In some cases we’d like to initialize an array with the same values. In C,
this is how we do it:
1 .data
2 arr:
3 .rept 10
4 .int 20
5 .endr
This also generates 10 integers, each with the value 20. Basically: first write the label for the variable; then use .rept num to specify how many times you want to repeat the following declarations, and write normal data declarations after it as usual. Lastly, remember to use .endr to close the repetition.
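The C equivalent of this .rept block is simply an array where every element gets the same value:

```c
/* Same effect as:  .rept 10 / .int 20 / .endr  */
int repeated[10];

void init_repeated(void) {
    for (int i = 0; i < 10; i++)
        repeated[i] = 20;
}
```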
where 4 stands for the size of each element, in this case the size of an integer.
Empty spaces are usually declared in the .bss segment, since it's used for uninitialized data. In fact there's nothing special about them; if we want to reserve a space of 100 bytes, it's the same as declaring 100 bytes of zeros. Two directives can be used:
Here size is the total size, not the size of an individual element, and fill is the data for every byte in that space. Say we want to reserve an empty space of 100 bytes in the .bss segment:
1 .bss
2 empty_arr: .skip 100, 0
This is to fill every byte of the 100 bytes with zero. It’s the same as:
meaning fill can be ignored if we just want to fill with zeros. And of
course, it’s the same as follows:
1 .bss
2 empty_arr: .skip 100, 1024
Assume we have an assembly source file called demo.s . The first step is to generate an object file using the aarch64 assembler:
If there’s no error message, we’re all good, and the next step is to link object
files to generate an executable, using the linker:
1 $ aarch64-linux-gnu-ld demo.o
1 $ qemu-aarch64 a.out
Listing files are very useful for examining the format of the object file or even the final executable. A listing file shows you the encoding of each instruction one by one, and how the instructions are arranged and organized in memory when the program is loaded.
The following is an example of a listing file: 2
2: You can compare the memory content of the .data segment in this listing file with the example in Section B.1.3.3. It helps you verify the memory layout in that example.

 1  AARCH64 GAS  demo.s  page 1
 2     1                      .text
 3     2                      .global _start
 4     3
 5     4                  _start:
 6     5 0000 000080D2        mov x0, #0
 7     6 0004 A80B8052        mov x8, #93
 8     7 0008 010000D4        svc #0
 9     8
10     9                      .data
11    10 0000 48656C6C    str: .string "Hello"
12    10      6F00
13    11 0006 AE390100    arr: .quad 80302, 01230, 07030
14    11      00000000
15    11      98020000
16    11      00000000
17    11      180E0000
18    12 001e 18FCFFFF    vec: .int -1000
19

28  NO UNDEFINED SYMBOLS
On page 1, we see the memory content and the assembly code are listed
side by side. For .text segments, each instruction’s four-byte encoding
is shown next to it. For .data segment, the binary representations of the
data are shown. The four digits in front of them are offsets relative to the
beginning of the segment. Also notice directives and labels do not have
corresponding memory content in the listing file. This also shows us that
they are not part of the program that can be executed by the CPU.
Page 2 summarizes defined and undefined symbols. If, say, we have BL foo
in our code but we never labeled anything foo , the symbol foo will ap-
pear under “undefined symbols”.
One thing about listing files is that they will not show all the data declared repetitively. For example, if we declare something like .fill 200, 8, 100 , the listing file will only show the first few bytes.
We can certainly use some of the functions from the C library in our assembly program, since those C functions eventually need to be compiled into assembly, after all.
Simple Printing
It is not fun to print values directly in assembly, so one common situation is to print variable values using printf() . Since printf() is still a procedure, we follow the same rules as branching to procedures. Let's look at the first example below.
1 /* print_msg.s */
2 .text
3 .global _start
4 .extern printf
5
6 _start:
7 ADR X0, hello_str
8 BL printf
9 MOV X0, 0 /* Exit */
10 MOV X8, 93
11 SVC 0
12
13 .data
14 hello_str:
15 .ascii "Hello World!\n\0"
Formatted Printing
Here's another example. Assume we want to print out the value of register X20 . Written in C, the code should look like this, assuming X20 stores the value of variable a :
1 long a;
2 printf("The value in X20 is %ld", a);
1 .data
2 check_str: .ascii "The value in X20 is %d\n\0"
1 /* print_num.s */
2 .text
3 .global _start
4 .extern printf
5
6 _start:
7 ADR X0, check_str
8 MOV X1, X20
9 BL printf
10 MOV X0, 0 /* Exit */
11 MOV X8, 93
12 SVC 0
13
14 .data
15 check_str: .ascii "The value in X20 is %d\n\0"
More Arguments
Since printf() and all C functions follow the procedure call standards, they can only use the first eight registers for passing parameters. If we want to pass more than eight parameters, we have to use the stack. In the following example, we want to print five characters and their ASCII codes. With the formatted string, we will have to pass 11 parameters, so X0 – X7 apparently are not enough. Therefore, we have to utilize the stack space:
1 /* more_args.s */
2 .text
3 .global _start
4 .extern printf
5
10 MOV X4, 60
11 MOV X5, 80
12 MOV X6, 80
13 MOV X7, 100
14 MOV X8, 100
15 MOV X9, 120
16 MOV X10, 120
17
29 MOV X0, 0
30 MOV X8, 93
31 SVC 0
32
33 .data
34 check_str:
35 .ascii "ASCII code of char %c is %d, %c is %d, "
36 .ascii "%c is %d, %c is %d, %c is %d.\n\0"
This is yet another example to show you that the machine doesn't care about specific data types; it all depends on our own interpretation of the data. For example, we pass the same number 49 to X1 and X2 . Inside the machine, their representations are exactly the same, but when they are printed out, we see one is the character 1 and the other is the number 49 . This is because we used different specifiers: %c and %d. Thus, when they are printed out, the printf() function takes the bytes needed for each specifier (1 byte for %c and 4 for %d), and presents them to the terminal.
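The same effect is easy to see from C: the identical argument value, shown through two different specifiers:

```c
#include <stdio.h>

void show_char(void) {
    int n = 49;
    /* One bit pattern, two interpretations: %c prints the character
       whose code is 49, %d prints the decimal value itself. */
    printf("%c %d\n", n, n);   /* prints: 1 49 */
}
```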
B.2.3.2 Pitfalls
One common mistake (more common than you think!) when using C library functions, especially printf() , is forgetting about the procedure call conventions. From the last section it is clear that we need to use X0 – X7 and possibly stack space for passing arguments. One thing that we typically forget is that upon returning from the procedure, the values of X0 – X7 might have been changed by the procedure.
Some students put a value in X7 and called printf() . Afterwards, they found out that they lost the data stored in X7 . This is because X7 is a caller-saved register; it is the caller's responsibility to save such registers in case the called procedure needs to use them. Additionally, procedures store their return values in X0 as well. To review this topic, see Section 2.6.2.5.
This is not to say that all the register values will be changed, but there's no guarantee that they won't be, because, especially for external procedures, you never know which registers will be used. Therefore, before branching to any procedure, remember to save the data from X0 – X7 somewhere, and restore them after the return.
If the assembly code uses the standard C library, we need to use the -lc flag to link it:
where we assume the object file is called demo.o .
When executing a program that's been linked with the C library, we need to dynamically link it as well. 4
4: On some machines there's no need to dynamically link the library, so if you can run the executable perfectly fine without linking it, you don't have to.
B.3.1 Installation
The regular gdb we used for debugging C programs cannot be used here, because our program can only be executed in the QEMU emulator, and the architecture is different. We will have to install a gdb that can be used for multiple different architectures.
We'll use gdb and qemu together, so we need to open two terminals at the same time: one for gdb to step through, and the other for qemu to provide an emulated environment. If we want to use gdb to debug an assembly file, we need to add the -g flag to the assembler.
1 $ aarch64-linux-gnu-ld demo.o
where the number 1234 is arbitrary—it’s a port for gdb to connect. Once
this command is in, it’ll freeze on the terminal and wait for us to start gdb.
Now leave it there (do not close it), and we can go to the other terminal
and start gdb:
Note there’s a space between remote and :1234 . The flags -ex are
the commands we want gdb to execute in the beginning. If you don’t add
these flags when invoking gdb, you’ll have to type them once you’re in the
gdb environment. The interface you see as in Figure B.2 is called TUI (Text
User Interface).
B.3.3.1 Breakpoints
When gdb starts, the program is paused somewhere in the static library. We can use the following command to set a breakpoint at the beginning of our program:
1 b _start
Then we can use the continue command (or simply c ) to reach our entry point. Other labels can be set as breakpoints as usual.
Note: if after setting the breakpoint at _start and using continue you notice the breakpoint is not exactly at the label _start , see the Troubleshooting section.
B.3.3.2 Steps
Most of the commands are pretty much the same as what we've been using for debugging a C program in gdb. As a reminder, when we want to go into a procedure, we use step or s . If the procedure is from a library, such as printf() , it's not a good idea to step into it, so we can use next or n .
When we enter gdb with assembly code and register group laid out, the
default focus is the assembly code. Thus, if we press up/down keys on the
keyboard, or scroll up/down using a mouse, it only works on the assembly
code panel. To change focus to register group to view all registers, we
can use command focus regs . To change back to assembly code panel,
simply use focus asm .
It is quite often that we need to examine the data stored in memory. The
syntax is as follows:
1 x/<length><format><unit> address
The length parameter specifies how much data we want to print starting from address . It can be a positive or a negative integer.
The format parameter tells gdb in what format we want to see the data. For example, if we pass x , it will print the data in hexadecimal. If we pass d , it will print in decimal.
For the format and unit parameters, there are many options. Please re-
fer to the gdb documentation on https://sourceware.org/gdb/onlinedocs/
gdb/Memory.html#Memory as well as https://sourceware.org/gdb/onlinedocs/
gdb/Output-Formats.html#Output-Formats.
To see the condition codes, we can observe the cpsr field in the register group. CPSR stands for Current Program Status Register. It is a 32-bit register, where different flags or conditions take different bits. We only care about the highest four bits: N, Z, C, and V:

31 30 29 28 | 27 – 0
 N  Z  C  V | Other flags

The value displayed in the register group panel for cpsr is usually in hexadecimal or decimal, so it's not very obvious to see the individual bits. We can use the following command to print out the value in binary format:

1 p/t $cpsr

where p stands for print, and t for binary (two's complement); we then only look at the most significant 4 bits.
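Decoding those four bits by hand is mechanical; a small C sketch of the extraction:

```c
#include <stdint.h>

/* N, Z, C, V live in bits 31..28 of CPSR, in that order. */
int flag_n(uint32_t cpsr) { return (cpsr >> 31) & 1; }
int flag_z(uint32_t cpsr) { return (cpsr >> 30) & 1; }
int flag_c(uint32_t cpsr) { return (cpsr >> 29) & 1; }
int flag_v(uint32_t cpsr) { return (cpsr >> 28) & 1; }
```

For example, a cpsr value of 0x60000000 means Z and C are set while N and V are clear.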
B.3.6 Troubleshooting
If at the beginning of gdb the register panel and the assembly code panel are both empty, do not worry – you just need to set a breakpoint at _start by typing
1 b _start
You might notice that sometimes even if you set the breakpoint at _start , gdb actually sets the breakpoint a few bytes after the label _start . 5
5: This typically doesn't happen on M1-based macOS, so try to set a breakpoint at _start first, and if it's not exactly at the label, then proceed to use the actual address to set the breakpoint.
To set a breakpoint at the first instruction correctly, you can use the command info files to find out the actual address of _start :
Notice in the output above, on line 7, we have the address of the entry point, 0x400204 , which is the location of the label _start . Then we can set the breakpoint there:
1 b *0x400204
If you run the qemu-aarch64 command to start debugging, but you get an error message like this:
you would need to run the following command to install the aarch64 version of gcc :
Real numbers are a different species from integer numbers: they're encoded in a different way; they are stored in a separate group of registers, instead of X0 – X31 ; they are calculated in a unit called the FPU (floating point unit), not the ALU (!); and they use a totally different set of instructions.
Registers:
Real numbers in ARM have five types of precision, but we mostly use two of them: single (encoded in 32 bits) and double (64 bits). Single precision numbers correspond to the float type in C, and double precision numbers are, well, double . Correspondingly, the registers that hold them are Hn (half precision), Sn (single), and Dn (double), where n , ranging from 0 to 31, is the register number.
Usage:
Like integer registers, we use D0 – D7 to pass parameters to procedures as well as store return values. D8 – D15 are preserved across calls, and D16 – D31 can be used as temporary registers. The same applies to single and half precision registers.
Real numbers have a separate set of instructions, but fortunately they are very similar to the ones we know so far.
B.4.1.1 Arithmetic
We recommend that you just store all real numbers in the .data segment, load them into registers, and use FMOV between registers. Assume we have a number declared as such:
1 .data
2 pi: .float 3.1415
1 ADR X0, pi
2 LDR S0, [X0]
3 FMOV S1, S0
SCVTF D0, X0
Sometimes we'd like to cast a single precision floating point number to a double precision number, or to an integer, or vice versa. We can't just FMOV an S register to a D register. The instruction we need to use is FCVT :
See Figure B.4 for an illustration. For other instructions, the best resource
out there is ARM64’s reference sheet: https://developer.arm.com/documentation/
100076/0100/a64-instruction-set-reference/a64-floating-point-instruction
1 .data
2 fmt_str: .ascii "%lf %lf\n\0"
3 number1: .double 3.1415
4 number2: .double 10
5 ...
6 ADR X0, fmt_str // Load address of the string
7 ADR X1, number1 // Load address of number1
8 LDR D0, [X1] // Load number1 to D0
9 ADR X1, number2
10 LDR D1, [X1]
11 BL printf
If you use printf() to print both integer values ( int , char , long int ) and floating point values, move the corresponding values into their registers in order:
1 .data
2 fmt_str: .ascii "%d = %lf, %d = %lf\n\0"
3 ...
4 ADR X0, fmt_str
5 LDR X1, [...] // integer #1
6 LDR D0, [...] // floating point #1
7 LDR X2, [...] // integer #2
8 LDR D1, [...] // floating point #2
1 .data
2 fmt_str: .ascii "%f\n\0"
3 fpnum: .float 3.14
you would need to convert fpnum from single to double precision, because printf() doesn't check the S registers at all, and automatically uses double precision as the output format (even if your specifier is %f instead of %lf):
And if you declare your number as .float , you cannot load it directly into D registers.
Thus, the easiest way to avoid these situations is to just declare your numbers as .double , and load them into D registers all the time.
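The reason printf() behaves this way is visible from C itself: in a variadic call, a float argument undergoes the default argument promotion to double, so printf() only ever receives doubles (D registers on ARM64):

```c
#include <stdio.h>

void show_float(void) {
    float f = 3.5f;
    /* f is promoted to double before printf() sees it, which is why
       printf() reads D registers even for a %f specifier. */
    printf("%f\n", f);   /* prints: 3.500000 */
}
```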
B.4.3 Debugging
To print the value of a floating point register in gdb, use the float format with the print command:
1 p/f $d0