DigitalLogic ComputerOrganization L23 Multicore Handout
COMPUTER ORGANIZATION
Lecture 23: Multicore
ELEC3010
ACKNOWLEDGEMENT
2
COVERED IN THIS COURSE
Digital logic:
❑ Binary numbers and logic gates
❑ Boolean algebra and combinational logic
❑ Sequential logic and state machines
❑ Binary arithmetic
❑ Memories
4
MOTIVATION EXAMPLE 2
[Figure: smartphone teardown. Visible components: Qualcomm Snapdragon X55M 5G modem and RF front end, LG Super Retina XDR OLED display, 16-core Neural Engine with AMX blocks (fabricated by TSMC), LG Innotek sensor modules and 12MP cameras, 4GB/6GB Micron LPDDR4X RAM, 64GB/128GB/256GB Samsung NAND flash, and a battery and power module.]
5
MOTIVATION EXAMPLE 3
6
INCREASING CLOCK FREQUENCIES
7
IMPROVING IPC VIA ILP
You’ve seen:
❑ Exploiting intra-instruction parallelism:
  • Pipelining (decode A while fetching B)
You haven’t seen:
❑ Exploiting instruction-level parallelism (ILP):
  • Multiple issue (2-wide, 4-wide, etc.)
    • Statically detected by compiler (VLIW)
    • Dynamically detected by HW
      ➢ Dynamic scheduling (out-of-order, OoO)
8
STATIC MULTIPLE ISSUE
a.k.a. Very Long Instruction Word (VLIW)
Compiler groups instructions to be issued together
▪ Packages them into “issue slots”
11
SCHEDULING EXAMPLE
Schedule this for dual-issue
Loop: lw   t0, 0(s1)        # t0 = array element
      add  t0, t0, s2       # add with s2
      sw   t0, 0(s1)        # store result
      addi s1, s1, -4       # decrement pointer
      bne  s1, zero, Loop   # branch if s1 != 0

      ALU/branch            Load/store        cycle
Loop: nop                   lw t0, 0(s1)      1
      addi s1, s1, -4       nop               2
      add  t0, t0, s2       nop               3
      bne  s1, zero, Loop   sw t0, 4(s1)      4
What is the IPC of this machine?
(A) 0.8 (B) 1.0 (C) 1.25 (D) 1.5 (E) I don’t know
12
DYNAMIC MULTIPLE ISSUE
a.k.a. superscalar processor (cf. Intel)
• CPU chooses multiple instructions to issue each cycle
• Compiler can help by reordering instructions...
• ...but the CPU resolves hazards
13
DYNAMIC SCHEDULING
14
IMPROVING IPC VIA TLP
Exploiting Thread-Level parallelism
Hardware multithreading to improve utilization:
• Multiplexing multiple threads on single CPU
• Three types:
  • Coarse-grain (has a preferred thread)
  • Fine-grain (round-robin between threads)
  • Simultaneous (SMT, e.g. hyperthreading)
15
WHAT IS A THREAD?
16
THREAD MEMORY LAYOUT
[Figure: thread memory layout. Threads 1–3 each have their own PC, SP, and stack (Stack 1, Stack 2, Stack 3); the Data and Insns (code) segments are shared by all threads.]
17
THREAD EXAMPLES
int e;

int main() {
    int x[10], j, k, m;
    j = f(x, k);
    m = g(x, k);
}
21
WHY MULTICORE?
Single-Core:                    Performance 1.0x   Power 1.0x
Single-Core Overclocked +20%:   Performance 1.2x   Power 1.7x
22
POWER EFFICIENCY
CPU                 Year   Clock Rate   Pipeline Stages   Issue Width   Out-of-order/Speculation   Cores   Power
i486                1989     25 MHz      5                 1             No                         1        5 W
Pentium             1993     66 MHz      5                 2             No                         1       10 W
Pentium Pro         1997    200 MHz     10                 3             Yes                        1       29 W
P4 Willamette       2001   2000 MHz     22                 3             Yes                        1       75 W
UltraSparc III      2003   1950 MHz     14                 4             No                         1       90 W
P4 Prescott         2004   3600 MHz     31                 3             Yes                        1      103 W
Core                2006   2930 MHz     14                 4             Yes                        2       75 W
Core i5 Nehalem     2010   3300 MHz     14                 4             Yes                        1       87 W
Core i5 Ivy Bridge  2012   3400 MHz     14                 4             Yes                        8       77 W
23
PARALLEL PROGRAMMING
Multicore difficulties
• Partitioning work
• Coordination & synchronization
• Communications overhead
• How do you write parallel programs?
25
LOAD BALANCING
26
AMDAHL’S LAW
❑ Amdahl’s Law is named after Gene Amdahl, who presented it in 1967.
❑ It states that if P is the proportion of a program that can be
  parallelized, and 1-P is the proportion that must remain serial, then
  the maximum speedup S(N) achievable with N processors is:
      S(N) = 1 / ((1-P) + P/N)
❑ As the number of cores increases...
  ▪ time to execute the parallel part? Goes to zero.
  ▪ time to execute the serial part? Stays the same.
  ▪ The serial part eventually dominates.
27
AMDAHL’S LAW
28
CAN YOU DO IT?
Which code is parallelizable?

C.
    int i;
    float *a, *b, *c, tmp;
    ...
    for (i = 0; i < N; i++) {
        tmp = a[i] / b[i];
        c[i] = tmp * tmp;
    }
29
CAN YOU DO IT?
31
BEFORE NEXT CLASS
• Textbook: 8.4
• Next time:
Virtual Memory
32