Unit 1: Computer Architecture
21EC74H6
Dr. P.N.JAYANTHI
Asst.Prof, Dept. of ECE,
RVCE.
1
SYLLABUS
2
References
3
Differences between Computer Architecture
& Computer Organization
Computer Architecture is a functional description of requirements
and design implementation for the various parts of a computer.
It deals with the functional behavior of computer systems, and it is
decided before the computer organization when designing a computer.
Computer architecture refers to the design of the internal workings of a
computer system, including the CPU, memory, and other hardware
components.
It involves decisions about the organization of the hardware, such as the
instruction set architecture, the data path design, and the control unit
design. Computer architecture is concerned with optimizing the
performance of a computer system and ensuring that it can execute
instructions quickly and efficiently.
Architecture describes what the computer does.
4
Advantages of Computer Architecture
Performance Optimization: A well-designed architecture can
significantly improve the efficiency of a system.
Flexibility: It provides the ability to adapt to new technologies
and to accommodate different hardware components.
Scalability: Designs can make provision for future expansion and
the addition of components.
Disadvantages of Computer Architecture:
Complexity: Design and optimization can be a challenging,
time-consuming task.
Cost: High-performance architectures often require expensive
equipment and parts, which makes them more costly.
5
Computer Organization
Computer Organization is decided after the Computer
Architecture. It describes how the operational attributes are
linked together to realize the architectural specification.
Computer Organization deals with structural relationships.
The organization describes how it does it.
6
Advantages of Computer Organization
Practical Implementation: It gives a concrete account of the
physical layout of the computer system.
Cost Efficiency: Good organization helps avoid wasting
resources, resulting in a reduction in costs.
Reliability: Organization helps guarantee that the same work
produces the same results.
Disadvantages of Computer Organization
Hardware Limitations: The physical components available
constrain the systems that can be implemented, limiting
performance.
Less Flexibility: The organization is far more fixed and harder
to change once it has been set.
7
Computer System
8
COMPUTER ARCHITECTURE vs COMPUTER ORGANIZATION
Architecture describes what the computer does; Organization describes how it does it.
Computer Architecture deals with the functional behavior of computer systems; Computer Organization deals with structural relationships.
Architecture deals with high-level design issues; Organization deals with low-level design issues.
For designing a computer, its architecture is fixed first; the organization is decided after its architecture.
Computer Architecture is also called Instruction Set Architecture (ISA); Computer Organization is frequently called microarchitecture.
Computer Architecture comprises logical functions such as instruction sets, registers, data types, and addressing modes; Computer Organization consists of physical units like circuit designs, peripherals, and adders.
Architecture makes the computer's hardware visible to the software; Organization offers details on how well the computer performs.
Architecture coordinates the hardware and software of the system; Organization handles the segments of the network in the system.
9
RISC vs CISC
10
Load-Store architecture
The Load-Store architecture is a type of CPU design that
differentiates it from other architectures, like the CISC.
It is commonly associated with RISC architecture, where
instructions are streamlined and optimized for efficiency.
Load instruction brings data from memory into a CPU
register.
Store instruction transfers data from a CPU register to
memory.
11
Load-Store architecture
Separation of Memory and ALU Operations:
In Load-Store architectures, the CPU separates memory access
instructions from computation (arithmetic/logic) instructions.
Only load and store instructions can access memory, and all other
instructions operate only on data within the CPU registers
Efficiency through Reduced Instructions:
• By limiting memory access to load and store instructions, this
architecture reduces the complexity of the CPU’s instruction
set, streamlining the pipeline stages, which can speed up
instruction processing.
12
Load-Store architecture
Improved Pipeline and Parallelism:
Load-Store architectures facilitate instruction pipelining by
separating data movement (load/store) from computation,
allowing multiple instructions to be processed concurrently.
Since each instruction generally takes the same amount of time,
this leads to fewer pipeline stalls, increasing the CPU's efficiency
Registers-Based Computation:
All computation (arithmetic/logic) operations are carried out on
data in registers, which are much faster to access than memory.
This design encourages using more registers and managing data
within the CPU, minimizing costly memory accesses.
13
LDR & STR examples
Generally, LDR is used to load something from memory into
a register, and STR is used to store something from a register
to a memory address.
LDR R2, [R0] @ [R0] - the source address is the value found in R0.
STR R2, [R1] @ [R1] - the destination address is the value found in R1.
LDR operation: loads the value at the address found in R0 into the register R2.
STR operation: stores the value found in R2 to the memory address found in R1.
14
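These semantics can be sketched as a small Python simulation (a simplified model, not real ARM behavior; the addresses and values below are made up for illustration):

```python
# Word-addressed memory modeled as a dict; three registers R0..R2.
memory = {0x100: 42}                       # assume address 0x100 holds 42
regs = {"R0": 0x100, "R1": 0x200, "R2": 0}

def ldr(rd, rn):
    """LDR rd, [rn]: load the value at the address held in rn into rd."""
    regs[rd] = memory[regs[rn]]

def str_(rs, rn):
    """STR rs, [rn]: store the value in rs to the address held in rn."""
    memory[regs[rn]] = regs[rs]

ldr("R2", "R0")    # R2 <- mem[0x100], i.e. R2 becomes 42
str_("R2", "R1")   # mem[0x200] <- R2, i.e. memory gets 42
```

Running the two calls mirrors the LDR/STR pair on the slide: the value travels from memory, through a register, back to a different memory address.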
Architecture vs. Micro architecture
15
Architecture (Instruction Set Architecture (ISA))
Definition: Architecture, or Instruction Set Architecture (ISA), is the abstract
model of a computer. It defines what the processor can do and how it does it.
The ISA serves as a blueprint for software developers and hardware engineers,
establishing the instructions that a processor can execute, such as arithmetic
operations, data manipulation, control flow, and memory access.
Components:
Instruction Set: The set of operations the processor can perform.
Registers: The number, type, and purpose of registers accessible to the
programmer.
Memory Model: How memory is accessed and organized.
Data Types: Defines data sizes and structures.
Examples: x86, ARM, MIPS, and RISC-V are some well-known ISAs.
Architecture is primarily concerned with functionality and acts as an interface
between hardware and software. It is independent of the physical implementation.
16
Microarchitecture
Microarchitecture is the detailed, lower-level design that defines how a
particular processor (or implementation) of an ISA is constructed.
It deals with the actual organization and operational details of the
CPU and covers the physical arrangement of components like pipelines, ALUs,
caches, and execution units.
Components:
Pipelines: Defines how instructions are processed in stages.
Caches: Levels of memory caches (L1, L2, L3) to optimize data access.
Execution Units: ALUs, FPUs and other functional units.
Branch Prediction and Speculative Execution: Techniques to improve
performance.
Examples: Intel’s Haswell, AMD’s Zen, and ARM’s Cortex microarchitectures
are examples of how different manufacturers implement the same or similar
ISAs differently.
Microarchitecture is implementation-specific and focuses on optimizing the
execution of instructions defined by the ISA, aiming to balance
speed, power efficiency, and physical constraints.
17
Machine models
Machine models in computing refer to abstract frameworks
or theoretical models used to understand and design
computing systems, helping to simulate, analyse, and
implement algorithms. These models play a crucial role in
both theoretical computer science and practical computing,
as they influence how processors, memory, and computation
workflows are organized.
18
1. Von Neumann Model
This is the classic model for most modern computers, named after
John von Neumann. It consists of a single memory space for
storing both instructions and data, and a central processing
unit (CPU) that sequentially processes instructions.
Components:
Memory: Holds both instructions and data.
Control Unit: Interprets instructions from memory.
ALU :Executes operations.
Input/Output: Interfaces for user interaction and data transfer.
Characteristics: Sequential instruction processing, a shared
memory space for instructions and data (leading to the "von
Neumann bottleneck"), which can slow down performance
due to limited data throughput.
19
2. Harvard Architecture
This model separates the memory for instructions and data,
allowing the CPU to access both simultaneously.
Components:
Separate Memory Banks: Separate instruction and data
memory.
Control Unit and ALU: As in the von Neumann model.
Characteristics: The separation allows parallelism in
fetching instructions and data, reducing bottlenecks. It’s
common in embedded systems and certain digital signal
processors.
20
3. Parallel Models
These models consider multiple processors or cores to achieve
concurrent processing.
SIMD (Single Instruction, Multiple Data): A single control
unit issues one instruction that operates on multiple data points
simultaneously. Ideal for tasks like graphics processing and
scientific computing.
MIMD (Multiple Instruction, Multiple Data): Each
processor executes its own set of instructions independently.
Common in multi-core CPUs and distributed computing systems.
SISD (Single Instruction, Single Data): The traditional,
sequential model used in single-core processors.
MISD (Multiple Instruction, Single Data): Rarely used,
where multiple instructions operate on a single data stream.
21
Machine(Data Path) Models
● Data in registers, memory, or stack can form operands for ALU, which decides
the type of machine model to follow.
Stack machine model:
• The stack machine performs operations by pushing operands (data or
variables) onto a stack and then applying operations on the values at
the top of the stack.
• Simple model with no explicit operands specified in ALU instructions,
unless it is a multilevel stack (more than two locations).
Ex: the 8087 floating-point coprocessor, used in conjunction
with the 8086 processor.
Ex: the Java virtual machine and Python's interpreter work on the
stack machine model.
22
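A minimal sketch of a stack machine in Python, in the spirit of the JVM and CPython bytecode interpreters mentioned above (the opcode names here are illustrative, not the real bytecodes):

```python
# Evaluate (2 + 3) * 4 on a tiny stack machine: operands are pushed,
# and ALU operations implicitly take their inputs from the stack top.
def run(program):
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

result = run([("push", 2), ("push", 3), ("add",),
              ("push", 4), ("mul",)])
# result is (2 + 3) * 4 = 20
```

Note that `add` and `mul` carry no explicit operands, which is exactly the property the slide highlights for stack machine models.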
Machine (Data Path) Models
Accumulator
• Accumulator machines were among the earliest computer
architectures, like the EDSAC and the IBM 701.
• Instructions are simple, operating directly with the
accumulator and memory.
• Instructions are with a single named operand.
Register-Memory
• This model represents the architecture of many real-world
computer systems like x86.
• Instructions are with 2 or 3 named operands.
Register-Register
• This model represents the load-store architecture of
modern systems like ARM.
• Instructions are with 2 or 3 named operands.
23
Machine(Data Path) Models: Typical Program sequence
Let us take an operation: C=A+B
24
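The typical sequences for C = A + B under each data-path model can be sketched as follows (standard textbook pseudocode; the mnemonics are illustrative, not from a specific ISA):

```
Stack machine:    Accumulator:    Register-Memory:    Register-Register (load-store):
push A            load  A         load  r1, A         load  r1, A
push B            add   B         add   r3, r1, B     load  r2, B
add               store C         store r3, C         add   r3, r1, r2
pop  C                                                store r3, C
```

The stack machine names no operands for `add`; the accumulator names one; the register-memory model lets an ALU instruction touch memory directly; the register-register model confines memory access to loads and stores.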
Instruction Set Architecture (ISA)
ISA is the interface between software and hardware, defining the supported instructions,
data types, registers, memory addressing modes, and I/O mechanisms.
Each ISA has a unique set of characteristics that influences how effectively it can handle
different types of computation and how it balances performance, energy efficiency, and
simplicity.
Key Characteristics of an ISA
Instruction Set:
Types of Instructions: Defines arithmetic, logical, data transfer, control, and floating-
point instructions.
Instruction Length: Fixed-length or variable-length instructions. Fixed-length
instructions (like in RISC) simplify decoding, while variable-length instructions (like in
CISC) provide flexibility.
Format: Specifies how operands and operation codes are structured in instructions.
Registers:
General-Purpose Registers (GPRs): Registers that can be used for various purposes,
providing faster access than memory.
Special-Purpose Registers: These may include program counters, status registers, and
others, optimized for specific operations.
Register Count: Impacts performance and complexity; more registers generally mean
more data can be processed without needing memory access.
25
Instruction Set Architecture (ISA)
Memory Addressing Modes:
Specifies how instructions reference memory. Common modes include immediate, direct,
indirect, register, and displacement addressing.
Complex addressing modes, as in CISC, allow versatile data manipulation, while RISC
designs like RISC-V often use simpler modes to speed up instruction decoding and
execution.
Data Types:
Defines supported data types, such as integer, floating-point, and sometimes more
specialized data types like packed or SIMD data types.
The types supported impact the ISA's utility for various applications, e.g., floating-point
support for scientific computation.
Instruction Decoding:
Complexity of Decoding: Affects CPU design; simpler decoding is a characteristic of
RISC ISAs, which can speed up the pipeline.
Control Flow Instructions: Defines branches, loops, and jumps. Branch prediction
optimizations depend on control instructions' predictability.
Power Efficiency:
Simpler ISAs (RISC) tend to consume less power due to fewer instructions, fixed-length
encoding, and streamlined decoding.
26
Types of ISA
27
RISC-V
• RISC-V (pronounced "risk-five") is an ISA standard
– An open source implementation of a reduced instruction set computing (RISC)
based instruction set architecture (ISA)
– There was RISC-I, II, III, IV before
• Most ISAs: X86, ARM, Power, MIPS, SPARC
– Commercially protected by patents
– Preventing practical efforts to reproduce the computer systems.
• RISC-V is open
– Permitting any person or group to construct compatible computers
– Use associated software
• Originated in 2010 by researchers at UC Berkeley
– Krste Asanović, David Patterson and students
• In 2017, version 2 of the userspace ISA was fixed
– User-Level ISA Specification v2.2
– Draft Compressed ISA Specification v1.79 (https://riscv.org/)
– Draft Privileged ISA Specification v1.10 (https://en.wikipedia.org/wiki/RISC-V)
28
29
Key Characteristics of RISC-V
Modular Design:
RISC-V follows a modular approach, allowing implementers to add or omit
specific features as needed.
The base ISA (RV32I for 32-bit, RV64I for 64-bit, RV128I for 128-bit)
provides basic integer instructions.
Additional modules (extensions) support features like floating-point
arithmetic (F, D), atomic operations (A), and vector processing (V).
Simplified Instruction Set:
RISC-V follows RISC principles with a small, fixed set of instructions that
are easy to decode and execute.
Fixed-Length Instructions: RISC-V uses 32-bit fixed-length instructions
in the base ISA, simplifying instruction fetch and decode stages.
32 General-Purpose Registers:
RISC-V uses 32 GPRs for each base ISA (RV32, RV64, and RV128). These
registers allow efficient data handling without frequent memory access,
which is especially beneficial for high-performance computing.
30
Key Characteristics of RISC-V
Support for Extensions:
The base RISC-V ISA is minimal, but a wide array of extensions allows it
to be tailored for different applications.
Common Extensions:
M (Multiply/Divide): Adds support for integer multiplication and
division.
F and D (Single- and Double-Precision Floating Point): Adds
support for floating-point calculations, critical for scientific and engineering
applications.
A (Atomic Instructions): Adds atomic read-modify-write operations,
essential for multithreading.
C (Compressed Instructions): Reduces code size, improving efficiency,
especially in memory-constrained environments.
V (Vector Extension): Enables SIMD (Single Instruction, Multiple Data)
operations, suitable for data-parallel tasks like machine learning and image
processing.
31
Key Characteristics of RISC-V
Scalability and Interoperability:
RISC-V supports 32-bit, 64-bit, and 128-bit address spaces, providing scalability across
devices.
Modular extensions ensure compatibility, enabling devices of different capabilities to run
the same software while tailoring hardware to specific application needs.
Simplified Load/Store Architecture:
RISC-V follows a load/store architecture where only load and store instructions access
memory, while arithmetic instructions operate solely on registers. This approach
simplifies pipelining and optimizes performance.
Efficient and Power-Aware:
RISC-V’s minimalist design, fixed-length instructions, and load/store architecture
contribute to high power efficiency, making it suitable for embedded and low-power
applications.
Open-Source and Customizable:
Unlike proprietary ISAs like ARM or x86, RISC-V is an open standard, fostering
innovation and custom hardware designs without licensing fees. This openness has led to
rapid adoption in academia and industry, as it encourages experimentation and
customization.
32
RISC-V
RISC-V’s design aligns with traditional RISC principles,
prioritizing simplicity, modularity, and flexibility.
Its open-source nature, coupled with an extensible design,
makes it versatile for a wide range of applications, from small
embedded devices to large-scale data centers.
This flexibility allows for targeted optimizations, balancing
performance, power consumption, and cost for various
computing environments.
33
Applications
The application options are endless for the RISC-V ISA:
•Wearables, Industrial, IoT, and Home Appliances. RISC-V processors are ideal for
meeting the power requirements of space-constrained and battery-operated designs.
•Smartphones. RISC-V cores can be customized to handle the performance needed to
power smartphones, or can be used as part of a larger SoC to handle specific tasks for
phone operation.
•Automotive, High-Performance Computing (HPC), and Data Centres. RISC-V
cores can handle complex computational tasks with customized ISAs, while RISC-V
extensions enable development of simple, secure, and flexible cores for greater energy
efficiency.
•Aerospace and Government. RISC-V offers high reliability and security for these
applications.
34
Pipelining
40
Simple RISC
Instructions encoding
41
Simple RISC Processor Design
● The approach to designing the processor is to divide the processing into stages.
42
Simple RISC Processor Design
MA (Memory Access) Stage
• Interfaces with the memory system
• Executes a load or a store
RW (Register Write) Stage
• Writes to the register file
• In the case of a call instruction, it writes the
return address to register, ra
43
SimpleRISC Processor Datapath
The EX Stage
Contains an Arithmetic-Logical Unit (ALU). This unit can
perform all arithmetic operations (add, sub, mul, div, cmp,
mod) and logical operations (and, or, not)
Contains the branch unit for computing the branch condition
(beq, bgt).
Contains the flags register (updated by the cmp instruction)
44
SimpleRISC Processor Pipelined Datapath
IF Stage
46
Simple RISC Processor Pipelined Data path
47
Unpipelined Datapath for MIPS
[Figure: single-cycle MIPS datapath. The PC addresses the instruction memory; the GPR register file (rs1, rs2, ws, wd) and immediate extender (Imm Ext) feed the ALU, whose operation is selected by an ALU Control block; the data memory handles loads and stores. Control signals: PCSrc (br, rind, jabs, pc+4), RegWrite, MemWrite, WBSrc.]
49
Pipelined Datapath
[Figure: the same datapath divided by pipeline registers into fetch, decode & register-fetch, execute, memory, and write-back phases.]
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
t_C > max {t_IM, t_RF, t_ALU, t_DM, t_RW} (= t_DM probably)
50
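The effect of this inequality can be illustrated with assumed per-stage delays (the numbers below are made up for illustration, not taken from the slides):

```python
# Per-stage delays in picoseconds (illustrative values).
t = {"IM": 200, "RF": 100, "ALU": 150, "DM": 250, "RW": 100}

# A single-cycle (unpipelined) clock must cover every stage in sequence.
t_unpipelined = sum(t.values())   # 800 ps

# A pipelined clock only needs to cover the slowest stage (t_DM here).
t_pipelined = max(t.values())     # 250 ps

speedup = t_unpipelined / t_pipelined
# 800 / 250 = 3.2x ideal speedup (ignoring latch overhead and stalls)
```

In practice, latch setup/hold overhead and pipeline stalls reduce this ideal figure, which is why the slides later discuss hazards and interlocks.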
Pipelined Control
[Figure: the pipelined datapath with a hardwired controller generating the control signals for each phase (fetch, decode & register-fetch, execute, memory, write-back).]
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
t_C > max {t_IM, t_RF, t_ALU, t_DM, t_RW} (= t_DM probably)
52
Hardwired control in pipelining
Speed: Hardwired control is faster than microprogrammed control because it uses
combinational logic to produce control signals directly based on the current state and
input. This speed is crucial in pipelined processors, where each instruction stage needs
quick control signal generation to avoid slowing down the pipeline.
Predictability: Hardwired control is deterministic, meaning that it’s more
straightforward to design and predict behavior for each pipeline stage. This predictability
helps in minimizing stalls or hazards (such as data hazards, control hazards, or structural
hazards) within the pipeline.
Lower Latency: The primary goal in pipelining is to improve the throughput of the
instruction execution process. Hardwired control contributes to this by ensuring minimal
latency in signal generation, helping each pipeline stage transition smoothly from one to
the next without delays.
Simplicity in Execution: Hardwired control is typically simpler to implement in terms
of the circuitry needed for straightforward, well-defined control tasks in each pipeline
stage, making it an efficient choice for processors with simpler instruction sets, like
RISC.
Although hardwired control is fast and efficient, it can become complex and
difficult to modify if the instruction set is extensive or if additional
functionalities are needed, as adding control paths requires physical changes in
the wiring.
For more complex processors or those that need flexibility (like supporting multiple
instruction sets), microprogrammed control may be a better fit.
53
Pipelined Control
[Figure: the pipelined datapath with a hardwired controller, repeated.]
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
t_C > max {t_IM, t_RF, t_ALU, t_DM, t_RW} (= t_DM probably)
However, CPI will increase unless instructions are pipelined
54
CPI Examples
[Figure: execution timeline for a microcoded machine. Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles.]
61
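The CPI of the microcoded machine above follows directly from the per-instruction cycle counts (a quick check of the arithmetic):

```python
# Cycle counts for the three instructions on the slide.
cycles = [7, 5, 10]

# CPI = total cycles / number of instructions
cpi = sum(cycles) / len(cycles)
# cpi = 22 / 3, approximately 7.33 cycles per instruction
```

A pipelined machine aims instead for a CPI approaching 1, at the cost of handling hazards between overlapping instructions.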
Design of a Pipeline
Splitting the Data Path
We divide the data path into 5 parts : IF, OF,
EX, MA, and RW
Timing
We insert latches (registers) between
consecutive stages
4 Latches → IF-OF, OF-EX, EX-MA, and MA-RW
At the negative edge of a clock, an instruction
moves from one stage to the next
72
Simple RISC Processor Design
● The approach to designing the processor is to divide the processing into stages.
73
The Instruction Packet
What travels between stages ?
ANSWER : the instruction packet
Instruction Packet
Instruction contents
Program counter
All intermediate results
Control signals
Every instruction moves with its entire state, no
interference between instructions
74
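The instruction packet described above can be sketched as a record type; the field names below are illustrative, not the exact signal names of the SimpleRISC design:

```python
from dataclasses import dataclass, field

@dataclass
class InstructionPacket:
    """State that travels with an instruction between pipeline stages."""
    pc: int                    # program counter of this instruction
    instruction: int           # raw instruction contents
    op1: int = 0               # intermediate results read/computed so far
    op2: int = 0
    alu_result: int = 0
    control: dict = field(default_factory=dict)  # decoded control signals
    is_bubble: bool = False    # set to squash the instruction (nop)

# Each stage receives one packet per cycle and passes it on, so no state
# is shared between in-flight instructions.
pkt = InstructionPacket(pc=0x2000, instruction=0x1234)
pkt.control["isWb"] = True
```

Because every instruction carries its entire state, stages never need to reach into another instruction's data, which is the "no interference" property the slide states.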
Pipelined Data Path with
Latches
Latches
75
Instruction Fetch
79
Simple RISC
Basic Instruction Format
5-bit opcode field
imm: immediate field
80
OF Stage
[Figure: OF stage. The control unit decodes the instruction; the immediate and branch-target unit computes the immediate and branch target; the register file is read.]
81
OF Stage
82
EX Stage
[Figure: EX stage. Inputs arrive from the OF-EX register (pc, op2, aluSignals, control); the ALU computes aluResult; the branch unit (isBeq, isBgt, isUBranch, isRet) uses the flags register to compute isBranchTaken and select branchPC.]
83
MA Stage
[Figure: MA stage. The EX-MA register (pc, aluResult, op2, instruction, control) feeds the memory unit through mar (memory address register) and mdr (memory data register); isLd and isSt control loads and stores to the data memory.]
84
RW Stage
[Figure: RW stage. A multiplexer controlled by isLd and isCall selects the ALU result, the loaded data, or the return address pc + 4; the result is written to register rd, or to the return address register ra (register 15) for a call, when isWb is enabled.]
85
[Figure: the complete SimpleRISC datapath (source: Smruti R. Sarangi), showing the IF, OF, EX, MA, and RW logic with control signals such as isSt, isRet, isWb, isImmediate, immx, aluSignals, isBeq, isBgt, isUBranch, isBranchTaken, isLd, and isCall.]
86
Abridged Diagram
[Figure: abridged diagram. Instruction memory, register file (op1, op2), ALU unit, data memory, and register write-back.]
87
Instructions Interact With Each Other
in Pipeline
• Data Hazard: An instruction depends on a
data value produced by an earlier instruction
• Control Hazard: Whether or not an
instruction should be executed depends on a
control decision made by an earlier instruction
• Structural Hazard: An instruction in the
pipeline needs a resource being used by
another instruction in the pipeline
88
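The data-hazard subclasses (RAW, WAR, WAW, introduced on the following slides) can be sketched as a small check, assuming each instruction is summarized by its destination register and set of source registers (a simplification; real pipelines track this per stage):

```python
# Classify the hazard(s) between an earlier and a later instruction,
# each given as (destination_register, set_of_source_registers).
def classify(earlier, later):
    e_dst, e_src = earlier
    l_dst, l_src = later
    hazards = set()
    if e_dst in l_src:
        hazards.add("RAW")   # later reads what earlier writes
    if l_dst == e_dst and e_dst is not None:
        hazards.add("WAW")   # both write the same register
    if l_dst in e_src:
        hazards.add("WAR")   # later writes what earlier reads
    return hazards

# add r1, r2, r3 followed by sub r4, r1, r2: [2] reads r1 before
# [1] has written it, a RAW hazard.
raw = classify(("r1", {"r2", "r3"}), ("r4", {"r1", "r2"}))
```

In the in-order pipeline on these slides only RAW hazards cause wrong results, but the same check identifies WAR and WAW orderings that out-of-order machines must also respect.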
Pipeline Diagram
[1]: add r1, r2, r3
[2]: sub r4, r5, r6
[3]: mul r8, r9, r10

Clock cycles: 1   2   3   4   5   6   7
IF:          [1] [2] [3]
OF:              [1] [2] [3]
EX:                  [1] [2] [3]
MA:                      [1] [2] [3]
RW:                          [1] [2] [3]
89
Example
[1]: add r1, r2, r3
[2]: sub r4, r2, r5
[3]: mul r5, r8, r9

Clock cycles: 1   2   3   4   5   6   7
IF:          [1] [2] [3]
OF:              [1] [2] [3]
EX:                  [1] [2] [3]
MA:                      [1] [2] [3]
RW:                          [1] [2] [3]
90
Data Hazards
[Figure: pipeline diagram for [1]: add r1, r2, r3 followed by a dependent instruction; the consumer reaches the OF stage before [1] has written r1, illustrating a data hazard.]
91
Data Hazard
Definition: A hazard is defined as the possibility of erroneous execution of an
instruction in a pipeline. A data hazard represents the possibility of erroneous
execution because of the unavailability of data, or the availability of incorrect
data.
92
Other Types of Data Hazards
Our pipeline is in-order
93
WAW Hazards & WAR Hazards
WAW hazard:
[1]: add r1, r2, r3
[2]: sub r1, r4, r3
Instruction [2] cannot write the value of r1 before instruction [1]
writes to it → will lead to a WAW hazard
WAR hazard:
[1]: add r1, r2, r3
[2]: add r2, r5, r6
Instruction [2] cannot write the value of r2 before instruction [1]
reads it → will lead to a WAR hazard
94
Control Hazards
95
Control Hazard – Pipeline Diagram

[1]: beq .foo
[2]: mov r1, 4
[3]: add r2, r4, r3

Clock cycles: 1   2   3   4   5   6   7
IF:          [1] [2] [3]
OF:              [1] [2] [3]
EX:                  [1] [2] [3]
MA:                      [1] [2] [3]
RW:                          [1] [2] [3]
96
Control Hazards
The two instructions fetched immediately after a branch
instruction might have been fetched incorrectly.
These instructions are said to be on the wrong path
A control hazard represents the possibility of erroneous
execution in a pipeline because instructions in the wrong
path of a branch can possibly get executed and save their
results in memory, or in the register file
97
Structural Hazards
A structural hazard may occur when two instructions have a
conflict on the same set of resources in a cycle
Example :
Assume that we have an add instruction that can read
one operand from memory
add r1, r2, 10[r3]
99
Solutions in Software
Data hazards
1. Insert nop instructions, reorder code
[1]: add r1, r2, r3
[2]: sub r3, r1, r4
100
2. Code Reordering
Original code:
add r1, r2, r3
add r4, r1, 3
add r8, r5, r6
add r9, r8, r5
add r10, r11, r12
add r13, r10, 2

Reordered code (dependent pairs separated, so no nop is required):
add r1, r2, r3
add r8, r5, r6
add r10, r11, r12
add r4, r1, 3
add r9, r8, r5
add r13, r10, 2
101
Control Hazards—Delayed
branch
Trivial Solution : Add two nop
instructions after every branch
Better solution :
Assume that the two instructions fetched
after a branch are valid instructions
These instructions are said to be in the delay
slots
Such a branch is known as a delayed branch
102
3. Delay Slots
Before:
add r1, r2, r3
add r4, r5, r6
b .foo
add r8, r9, r10

After (the two add instructions fill the delay slots):
b .foo
add r1, r2, r3
add r4, r5, r6
add r8, r9, r10
103
Hardware solution for
pipeline hazards
104
Why interlocks ?
We cannot always trust the compiler to do a good
job, or even introduce nop instructions correctly.
Compilers now need to be tailored to specific
hardware.
We should ideally not expose the details of the
pipeline to the compiler (might be confidential
also)
Hardware mechanism to enforce correctness →
interlock
105
Two kinds of Interlocks
Data-Lock
Do not allow a consumer instruction to move
beyond the OF stage till it has read the correct
values. Implication : Stall the IF and OF stages.
Branch-Lock
We never execute instructions in the
wrong path.
The hardware needs to ensure
both these conditions.
106
Comparison between
Software and Hardware
Attribute: Software vs. Hardware (with interlocks)
Portability: Software schemes are limited to a specific processor;
with interlocks, programs can be run on any processor irrespective
of the nature of the pipeline.
Branches: In software it is possible to have no performance penalty
by using delay slots; hardware needs to stall the pipeline for 2
cycles in our design.
RAW hazards: In software it is possible to eliminate them through
code scheduling; hardware needs to stall the pipeline.
Performance: Software performance is highly dependent on the nature
of the program; the basic version of a pipeline with interlocks is
expected to be slower than the version that relies on software.
107
Conceptual Look at Pipeline
with Interlocks
[1]: add r1, r2, r3
[2]: sub r4, r1, r2
108
Example
[1]: add r1, r2, r3
[2]: sub r4, r1, r2

Clock cycles: 1   2   3   4   5   6   7   8   9
IF:          [1] [2]
OF:              [1] [2] [2] [2] [2]
EX:                  [1]             [2]
MA:                      [1]             [2]
RW:                          [1]             [2]

([2] stalls in the OF stage during cycles 3 to 6, until [1] writes r1 to the register file.)
109
A Pipeline Bubble
A pipeline bubble is inserted into
a stage, when the previous stage
needs to be stalled
It is a nop instruction
To insert a bubble
Create a nop instruction packet
OR, Mark a designated bubble bit to 1
110
Bubbles in the Case of a Branch Instruction

[1]: beq .foo
[2]: add r1, r2, r3
[3]: sub r4, r5, r6
....
.foo:
[4]: add r8, r9, r10

Clock cycles: 1   2   3   4   5   6   7   8
IF:          [1] [2] [3] [4]
OF:              [1] [2]     [4]
EX:                  [1]         [4]
MA:                      [1]         [4]
RW:                          [1]         [4]

(The branch [1] is taken in the EX stage in cycle 3; [2] and [3] are converted to bubbles, and [4] is fetched from .foo in cycle 4.)
111
Control Hazards and Bubbles
We know that an instruction is a branch
in the OF stage
When it reaches the EX stage and the
branch is taken, let us convert the
instructions in the IF, and OF stages to
bubbles
Ensures the branch-lock condition
112
Ensuring the Data-Lock Condition
When an instruction reaches the
OF stage, check if it has a conflict
with any of the instructions in
the EX, MA, and RW stages
If there is no conflict, nothing
needs to be done
Otherwise, stall the pipeline (IF
and OF stages only)
113
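The data-lock check described above can be sketched as follows (a simplification of the actual comparator logic; the register names and list shape are illustrative):

```python
# Stall decision for the data-lock interlock: stall if the instruction
# in OF reads a register that any instruction in EX, MA, or RW will write.
def must_stall(of_sources, downstream_dests):
    """of_sources: registers read by the instruction in the OF stage.
    downstream_dests: destination registers of the instructions currently
    in the EX, MA, and RW stages (None for instructions with no write)."""
    return any(d in of_sources for d in downstream_dests if d is not None)

# sub r4, r1, r2 in OF while add r1, ... is still in EX -> stall
stall = must_stall({"r1", "r2"}, ["r1", None, None])
```

When the check fires, the pipeline disables writes to the PC and the IF-OF register and injects a bubble into OF-EX, exactly as the next slides describe.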
How to Stall a Pipeline ?
Disable the write functionality of :
The IF-OF register
and the Program Counter (PC)
To insert a bubble
Write a bubble (nop instruction) into the OF-EX register
114
Data Path with Interlocks (Data-Lock)
[Figure: pipelined datapath with a data-lock unit. It compares the instruction in the OF stage with those in the EX, MA, and RW stages; on a conflict it asserts stall for the PC and the IF-OF register and inserts a bubble into the OF-EX register.]
115
Ensuring the Branch-Lock Condition

Option 1:
  Use delay slots (interlocks not required).
Option 2:
  Convert the instructions in the IF and OF stages to bubbles once a
  branch instruction reaches the EX stage.
  Start fetching from the next PC (not taken) or the branch target (taken).
Ensuring the Branch-Lock Condition - II

Option 3:
  If the branch instruction in the EX stage is taken, invalidate the
  instructions in the IF and OF stages, and start fetching from the branch
  target. Otherwise, do not take any special action.
  This method is also called predict not-taken (we shall use this method
  because it is more efficient than option 2).
Data Path with Interlocks

[Figure: data path with branch interlocks. The isBranchTaken signal
generated by the branch unit in the EX stage is used to invalidate
(convert to bubbles) the instructions in the IF-OF and OF-EX registers
when a branch is taken.]
Measuring Performance

What do we mean by the performance of a processor?
ANSWER: almost nothing.
What should we ask instead?
  What is the performance with respect to a given program or a set of
  programs?
Performance is inversely proportional to the time it takes to execute a
program.
Computing the Time a Program Takes

τ = #seconds
  = (#seconds / #cycles) × (#cycles / #instructions) × #instructions
  = (1/f) × CPI × #insts
  = (CPI × #insts) / f

where f is the clock frequency and CPI is the average number of cycles
per instruction.
The Performance Equation

P ∝ (IPC × f) / #insts
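As a quick sanity check of the time equation, here is a toy calculation with made-up numbers (10⁹ dynamic instructions, CPI = 1.5, f = 2 GHz):

```python
# A toy instance of τ = (CPI × #insts) / f with made-up numbers.
insts = 10**9                 # dynamic instruction count
cpi = 1.5                     # average cycles per instruction
f = 2e9                       # clock frequency in Hz (2 GHz)
tau = cpi * insts / f         # execution time in seconds
print(tau)                    # → 0.75
```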
Number of Instructions (#insts)

Static instruction: the binary (executable) of a program contains a list
of static instructions.
Dynamic instruction: a running instance of a static instruction, created
by the processor when the instruction enters the pipeline.
Number of Instructions (#insts) – 2

Dead code removal
  Programmers often write code that does not determine the final output.
  This code is redundant; it can be identified and removed by the compiler.
Function inlining
  Very small functions have a lot of overhead → call and ret instructions,
  register spilling and restoring.
  Paste the code of the callee into the code of the caller (known as
  inlining).
Computing the CPI

CPI for a single-cycle processor = 1.
For a k-stage pipeline executing n instructions:

CPI = (n + k - 1) / n
Computing the Maximum Frequency

Let the maximum amount of time that it takes to execute any instruction be
t_max (also known as the algorithmic work), and let l be the latch delay.
In the case of a pipeline, let us assume that all k pipeline stages are
balanced. We thus have:

t_stage = t_max/k + l
1/f = t_max/k + l

The minimum cycle time (1/f) is equal to t_stage. Let us thus assume that
our cycle time is as low as possible.
Performance of an Ideal Pipeline

Let us assume that the number of instructions (n) is a constant.

P = f / CPI
  = (1 / (t_max/k + l)) / ((n + k - 1)/n)
  = n / ((t_max/k + l) · (n + k - 1))
  = n / ((n - 1)·t_max/k + (t_max + l·n - l) + l·k)
Optimal Number of Pipeline Stages

∂/∂k [ (n - 1)·t_max/k + (t_max + l·n - l) + l·k ] = 0
⇒ -(n - 1)·t_max/k² + l = 0
⇒ k = √((n - 1)·t_max / l)

k is proportional to √t_max and inversely proportional to √l.
As we increase the latch delay, we should have fewer pipeline stages: we
need to minimise the time wasted in accessing latches.
As the number of instructions (n) tends to ∞, the ideal number of pipeline
stages also tends to ∞.
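A numerical check of the optimum derived above, with made-up values (t_max = 10 ns of algorithmic work, l = 0.1 ns of latch delay, n = 1000 instructions):

```python
# Evaluate k = sqrt((n - 1) * t_max / l) for sample (made-up) parameters.
from math import sqrt

t_max = 10e-9      # total algorithmic work (seconds)
l = 0.1e-9         # latch delay (seconds)
n = 1000           # number of instructions

k_opt = sqrt((n - 1) * t_max / l)
print(round(k_opt))        # → 316 stages for this idealised workload
```

The unrealistically deep optimum is an artifact of the ideal (stall-free) model; stalls pull the optimum down, as the next slides show.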
A Non-Ideal Pipeline

Our ideal CPI is 1 (CPI_ideal = 1). However, in reality, we have stalls:

CPI = CPI_ideal + stall_rate × stall_penalty

Let us assume that the stall rate (r) is a function of the program and its
pattern of dependences, and that the stall penalty is proportional to the
number of pipeline stages (k). Both these assumptions are strictly not
correct; they are being used to make a coarse-grained mathematical model.

CPI = (n + k - 1)/n + rck

r → stall rate, c → constant of proportionality
Mathematical Model

P = f / CPI
  = (1 / (t_max/k + l)) / ((n + k - 1)/n + rck)
  = n / ((n - 1)·t_max/k + (rcn·t_max + t_max + l·n - l) + l·k·(1 + rcn))

∂/∂k [ (n - 1)·t_max/k + (rcn·t_max + t_max + l·n - l) + l·k·(1 + rcn) ] = 0
⇒ -(n - 1)·t_max/k² + l·(1 + rcn) = 0
⇒ k = √((n - 1)·t_max / (l·(1 + rcn))) ≈ √(t_max / (l·r·c))   (as n → ∞)
Implications

For programs with a lot of dependences (high value of r) → use fewer
pipeline stages.
For a pipeline with forwarding → c is smaller (than for a pipeline that
just has interlocks), so it requires a larger number of pipeline stages
for optimal performance.
The optimal number of pipeline stages is directly proportional to
√(t_max / l). This explains why the number of pipeline stages has remained
more or less constant for the last 5-10 years: this ratio is not changing
significantly.
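The implications above can be illustrated numerically with k ≈ √(t_max / (l·r·c)). All the parameter values below are made up for the example; the point is only that a smaller stall-penalty constant c (as with forwarding) pushes the optimum towards deeper pipelines.

```python
# Sensitivity sketch of k ≈ sqrt(t_max / (l * r * c)) with made-up numbers.
from math import sqrt

t_max, l, r = 10e-9, 0.1e-9, 0.2   # algorithmic work, latch delay, stall rate
for c, label in [(1.0, "interlocks only"), (0.3, "with forwarding")]:
    k = sqrt(t_max / (l * r * c))
    print(label, round(k, 1))      # interlocks only → 22.4, forwarding → 40.8
```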
Example

Consider two programs that have the following characteristics.

[Table: instruction types and their fractions, for Program 1 and
Program 2]
Example

CPI = CPI_ideal + stall_rate × stall_penalty
Performance, Architecture, Compiler

[Diagram: performance (P) is determined by the frequency (f) and the IPC.
f depends on the technology and the architecture; the IPC depends on the
compiler and the architecture.]

• Manufacturing technology affects the speed of transistors, and in turn
  the speed of combinational logic blocks and latches.
• Transistors are steadily getting smaller and faster.
• Consequently, the total algorithmic work (t_max) and the latch delay (l)
  are also steadily reducing.
• Hence, it is possible to run processors at higher frequencies, leading
  to improvements in performance.
• Manufacturing technology exclusively affects the frequency at which we
  can run a processor. It does not have any effect on the IPC, or the
  number of instructions.
Constraints for pipelining

• Note that the overall picture is not as simple as we described: we also
  need to consider power and complexity issues.
• Typically, implementing a pipeline beyond 20 stages is very difficult
  because of the increase in complexity.
• Secondly, most modern processors have severe power and temperature
  constraints. This problem is also known as the power wall.
• It is often not possible to ramp up the frequency, because we cannot
  afford the increase in power consumption.
• As a rule of thumb, power increases as the cube of frequency. Hence,
  increasing the frequency by 10% increases the power consumption by more
  than 30%, which is prohibitively large.
• Designers are thus increasingly avoiding deeply pipelined designs that
  run at very high frequencies.
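The cube rule of thumb quoted above is easy to check:

```python
# Power ~ f**3 rule of thumb: effect of a 10% frequency increase.
f_ratio = 1.10                      # 10% higher clock frequency
power_ratio = f_ratio ** 3          # power grows roughly as the cube
print(round((power_ratio - 1) * 100, 1))   # → 33.1 (percent extra power)
```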
Consider a pipelined processor with the following four stages: Instruction
Fetch (IF), Instruction Decode and Operand Fetch (ID), Execute (EX), and
Write Back (WB).
The IF, ID, and WB stages take one clock cycle each to complete the
operation. The number of clock cycles of the EX stage depends on the
instruction: the ADD and SUB instructions need 1 clock cycle, and the MUL
instruction needs 3 clock cycles in the EX stage. Operand forwarding is
used in the pipelined processor. What is the number of clock cycles taken
to complete the following sequence of operations?

ADD R2, R1, R0
MUL R4, R3, R2
SUB R6, R5, R4
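One way to work out the answer is to simulate the schedule. The sketch below is a simplified model that tracks only EX occupancy and forwarding-based operand readiness (IF/ID structural conflicts are ignored, which does not change the result for this sequence); the helper `schedule` and its (dest, srcs, EX latency) encoding are made up for the example.

```python
# Simulate a 4-stage pipeline (IF, ID, EX, WB) with operand forwarding
# from the end of EX. IF/ID/WB take 1 cycle; EX latency is per-instruction.

def schedule(instrs):
    ready = {}        # register -> cycle after which it can be forwarded
    ex_free = 0       # last cycle in which the EX stage was busy
    finish = []
    for i, (dest, srcs, lat) in enumerate(instrs):
        id_end = i + 2                    # IF in cycle i+1, ID in cycle i+2
        ex_start = max(id_end, ex_free,
                       max((ready.get(r, 0) for r in srcs), default=0)) + 1
        ex_end = ex_start + lat - 1
        ex_free = ex_end
        ready[dest] = ex_end              # value forwardable after EX ends
        finish.append(ex_end + 1)         # WB takes one more cycle
    return finish

prog = [("R2", ["R1", "R0"], 1),          # ADD R2, R1, R0
        ("R4", ["R3", "R2"], 3),          # MUL R4, R3, R2
        ("R6", ["R5", "R4"], 1)]          # SUB R6, R5, R4
print(schedule(prog))                     # → [4, 7, 8]
```

The last instruction completes write-back in cycle 8, so the sequence takes 8 clock cycles.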
A processor executes instructions without pipelining in 10
cycles per instruction. A pipelined version of the processor
splits execution into 5 stages, each taking 2 cycles. Calculate
the speedup for executing 100 instructions on the pipelined
processor compared to the non-pipelined processor. Assume
no stalls or hazards.
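Under the usual assumptions (the first instruction passes through all 5 stages at 2 cycles each, then one instruction completes every stage time), the calculation is:

```python
# Speedup of the 5-stage pipelined processor over the 10-cycle
# non-pipelined one, for 100 instructions with no stalls.
n = 100
cycles_nonpipe = n * 10                    # 1000 cycles
cycles_pipe = 5 * 2 + (n - 1) * 2          # 10 + 198 = 208 cycles
speedup = cycles_nonpipe / cycles_pipe
print(round(speedup, 2))                   # → 4.81
```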
A pipelined processor has 6 stages, each taking 2 ns. Due to
hazards and stalls, only 80% of the pipeline is utilized.
a) What is the effective throughput in instructions per
second?
b) What would the throughput be if the pipeline were fully
utilized?
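A worked sketch of both parts, assuming that in steady state the pipeline completes one instruction per stage time and that utilization scales the throughput linearly:

```python
# Effective and ideal throughput of a 6-stage pipeline with 2 ns stages.
t_stage = 2e-9                 # 2 ns per stage
ideal = 1 / t_stage            # part (b): ~5e8 instructions/s (500 MIPS)
effective = 0.80 * ideal       # part (a): ~4e8 instructions/s (400 MIPS)
print(effective, ideal)
```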
Thank you