DigitalLogic ComputerOrganization L23 Multicore Handout
COMPUTER ORGANIZATION
Lecture 23: Multicore
ELEC3010
ACKNOWLEDGEMENT
2
COVERED IN THIS COURSE
Digital logic:
❑ Binary numbers and logic gates
❑ Boolean algebra and combinational logic
❑ Sequential logic and state machines
❑ Binary arithmetic
❑ Memories
4
MOTIVATION EXAMPLE 2
[Figure: smartphone teardown. Visible components: Qualcomm Snapdragon X55M 5G modem and RF front end, LG Super Retina XDR OLED display, 16-core Neural Engine with AMX blocks (fabricated by TSMC), LG Innotek sensor modules and 12MP cameras, 4GB/6GB Micron LPDDR4X RAM, 64GB/128GB/256GB Samsung NAND flash, and a battery and power module.]
5
MOTIVATION EXAMPLE 3
6
INCREASING CLOCK FREQUENCIES
7
IMPROVING IPC VIA ILP
You’ve seen:
❑ Exploiting intra-instruction parallelism:
  • Pipelining (decode A while fetching B)
You haven’t seen:
❑ Exploiting instruction-level parallelism (ILP):
  • Multiple issue (2-wide, 4-wide, etc.)
    • Statically detected by compiler (VLIW)
    • Dynamically detected by HW
      ➢ Dynamic scheduling (out-of-order, OoO)
8
STATIC MULTIPLE ISSUE
a.k.a. Very Long Instruction Word (VLIW)
Compiler groups instructions to be issued together
▪ Packages them into “issue slots”
11
SCHEDULING EXAMPLE
Schedule this for dual-issue
Loop: lw   t0, 0(s1)        # t0 = array element
      add  t0, t0, s2       # add with s2
      sw   t0, 0(s1)        # store result
      addi s1, s1, -4       # decrement pointer
      bne  s1, zero, Loop   # branch if s1 != 0

      ALU/branch            Load/store        cycle
Loop: nop                   lw t0, 0(s1)      1
      addi s1, s1, -4       nop               2
      add  t0, t0, s2       nop               3
      bne  s1, zero, Loop   sw t0, 4(s1)      4
What is the IPC of this machine?
(A) 0.8 (B) 1.0 (C) 1.25 (D) 1.5 (E) I don’t know
12
DYNAMIC MULTIPLE ISSUE
a.k.a. superscalar processor (cf. Intel)
• CPU chooses multiple instructions to issue each cycle
• Compiler can help by reordering instructions...
• ...but the CPU resolves hazards
13
DYNAMIC SCHEDULING
14
IMPROVING IPC VIA TLP
Exploiting Thread-Level parallelism
Hardware multithreading to improve utilization:
• Multiplexing multiple threads on single CPU
• Three types:
  • Coarse-grain (has a preferred thread)
  • Fine-grain (round-robin between threads)
  • Simultaneous (SMT, e.g. hyperthreading)
15
WHAT IS A THREAD?
16
THREAD MEMORY LAYOUT
[Figure: thread memory layout. Threads 1–3 each have their own PC, SP, and stack (Stack 1, Stack 2, Stack 3); the Data and Insns (code) segments are shared by all threads.]
17
THREAD EXAMPLES
int e;

int main() {
    int x[10], j, k, m;
    j = f(x, k);
    m = g(x, k);
}
21
WHY MULTICORE?
Single-Core:                    Performance 1.0x   Power 1.0x
Single-Core Overclocked +20%:   Performance 1.2x   Power 1.7x
22
POWER EFFICIENCY
CPU                 Year   Clock Rate   Pipeline Stages   Issue Width   Out-of-order/Speculation   Cores   Power
i486                1989     25 MHz      5                 1             No                         1        5 W
Pentium             1993     66 MHz      5                 2             No                         1       10 W
Pentium Pro         1997    200 MHz     10                 3             Yes                        1       29 W
P4 Willamette       2001   2000 MHz     22                 3             Yes                        1       75 W
UltraSparc III      2003   1950 MHz     14                 4             No                         1       90 W
P4 Prescott         2004   3600 MHz     31                 3             Yes                        1      103 W
Core                2006   2930 MHz     14                 4             Yes                        2       75 W
Core i5 Nehalem     2010   3300 MHz     14                 4             Yes                        1       87 W
Core i5 Ivy Bridge  2012   3400 MHz     14                 4             Yes                        8       77 W
23
PARALLEL PROGRAMMING
Multicore difficulties
• Partitioning work
• Coordination & synchronization
• Communications overhead
• How do you write parallel programs?
25
LOAD BALANCING
26
AMDAHL’S LAW
❑ Amdahl’s Law is named after Gene Amdahl, who presented it in 1967.
❑ It states that if P is the proportion of a program that can be
  parallelized, and 1-P is the proportion that must remain serial, then
  the maximum speedup S(N) achievable with N processors is:
      S(N) = 1 / ((1-P) + P/N)
❑ As the number of cores increases...
  ▪ time to execute the parallel part? Goes to zero.
  ▪ time to execute the serial part? Stays the same.
  ▪ The serial part eventually dominates.
27
AMDAHL’S LAW
28
CAN YOU DO IT?
Which code is parallelizable?

C.
    int i;
    float *a, *b, *c, tmp;
    ...
    for (i = 0; i < N; i++) {
        tmp = a[i] / b[i];
        c[i] = tmp * tmp;
    }
29
CAN YOU DO IT?
31
BEFORE NEXT CLASS
• Textbook: 8.4
• Next time:
Virtual Memory
32