EE108b Lecture 8
Pipelined Processor
Christos Kozyrakis
http://ee108b.stanford.edu
EE108b – Winter 2014 – Lecture 08
Announcements
- Upcoming deadlines
  - HW2, PA2, Lab2
- Midterm exam: Monday 2/20, 6pm-9pm
  - Included: lectures 1-9
  - Closed books, 1 page of notes, green page, calculator
  - Catch up with reading material; utilize office hours
- Review session on Friday
  - 2.15-3.15pm, Gates B01
2
Review: Single Cycle Processor
[Figure: single-cycle processor datapath with control unit. Control signals shown: RegDst, Jump, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite.]
3
Review: Single Cycle Processor
- Pros
  - Simple
  - CPI = 1
- Cons
  - Cycle time is the worst-case path → long cycle times
    - Worst case = ?
  - Hardware is underutilized
    - ALU and memory used only for a fraction of the clock cycle
    - Not well amortized!
  - Best possible CPI is 1
4
Key Tools for System Architects
1. Pipelining
2. Parallelism
3. Out-of-order execution
4. Prediction
5. Caching
6. Indirection
7. Amortization
8. Redundancy
9. Specialization
10. Focus on the common case
5
Pipelining
- Overlapping execution
- Helps throughput, not latency
- Potential speedup = number of pipe stages
- Pipeline rate limited by slowest stage
- Unbalanced pipe stages reduce speedup
- Fill/drain time reduces speedup
6
Pipelining the Processor
- 5 stages, one clock cycle per stage
  - IF: instruction fetch from memory
  - ID: instruction decode & register read
  - EX: execute operation or calculate address
  - MEM: access memory operand
  - WB: write result back to register
7
Pipelining the Processor
- Overlap instructions in different stages
- All hardware used all the time
- Clock cycle is fast
- CPI is still 1
8
Pipeline Datapath
[Figure: pipelined datapath]
9
Load: Stage 1 (IF)
[Figure: pipelined datapath with lw in the Instruction Fetch stage]
10
Load: Stage 2 (ID)
[Figure: pipelined datapath with lw in the Register Fetch stage]
11
Load: Stage 3 (EX)
[Figure: pipelined datapath with lw in the Execute stage]
12
Load: Stage 4 (MEM)
[Figure: pipelined datapath with lw in the Memory stage]
13
Load: Stage 5 (WB)
[Figure: pipelined datapath with lw in the Write Back stage]
14
Pipeline Control
- Need to control functional units
  - But they are working on different instructions!
- Not a problem
  - Just pipeline the control signals along with data
  - Make sure they line up
- Using labeling conventions often helps
  - Instruction_rf – means this instruction is in RF
  - Every time it gets flopped, it changes pipestage
  - Make sure the right signals go to the right places
15
Control Signals
- Same control unit generates signals in ID stage
  - Control signals for EX
    - (ExtOp, ALUSrc, …) used 1 cycle later
  - Control signals for Mem
    - (MemWr, Branch) used 2 cycles later
  - Control signals for WB
    - (MemtoReg, RegWr) used 3 cycles later
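The timing above can be sketched as three pipeline registers that each pass the control word one stage further per clock. This is a minimal illustration, not the course's implementation: the function name `run_control_pipeline` and the dictionary layout are assumptions made here, and the signal values are placeholders.

```python
def run_control_pipeline(control_words):
    """Trace which control fields each stage sees, cycle by cycle.

    control_words[i] is the control word produced in ID during cycle i,
    written as a dict with "EX", "MEM" and "WB" fields (hypothetical layout).
    """
    id_ex = ex_mem = mem_wb = None  # contents of the three pipeline registers
    trace = []
    # Three extra empty cycles let the last instruction drain out of the pipe.
    for cycle, word in enumerate(list(control_words) + [None] * 3):
        trace.append({
            "cycle": cycle,
            "EX": id_ex["EX"] if id_ex else None,      # decoded 1 cycle ago
            "MEM": ex_mem["MEM"] if ex_mem else None,  # decoded 2 cycles ago
            "WB": mem_wb["WB"] if mem_wb else None,    # decoded 3 cycles ago
        })
        # Clock edge: each register latches the word from the stage before it.
        mem_wb, ex_mem, id_ex = ex_mem, id_ex, word
    return trace


# A single load decoded in cycle 0 (signal values are illustrative only).
lw_word = {"EX": {"ALUSrc": 1}, "MEM": {"MemWr": 0}, "WB": {"MemtoReg": 1}}
trace = run_control_pipeline([lw_word])
for row in trace:
    print(row)
```

In the printed trace, the EX field shows up in cycle 1, the MEM field in cycle 2, and the WB field in cycle 3, matching the "1, 2, 3 cycles later" rule.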
16
Pipelined Control
[Figure: pipelined control. Control signals (ExtOp, ALUSrc, ALUOp, …) are carried through the IF/ID, ID/Ex, Ex/MEM and MEM/WB pipeline registers.]
17
Putting it All Together: Pipelined Processor
[Figure: pipelined datapath with control. The WB, M and EX control fields travel through the ID/EX, EX/MEM and MEM/WB registers. Signals shown: PCSrc, RegWrite, Branch, MemWrite, MemRead, MemtoReg, ALUSrc, ALUOp, RegDst.]
18
MIPS ISA designed for pipelining
- All instructions are 32 bits
  - Easier to fetch and decode in one cycle
  - c.f. x86: 1- to 17-byte instructions
- Few and regular instruction formats
  - Can decode and read registers in one step
- Load/store addressing
  - Can calculate address in 3rd stage, access memory in 4th stage
- Alignment of memory operands
  - Memory access takes only one cycle
19
Pipeline Performance
- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined with single-cycle processor
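As a quick check of these assumptions, the sketch below derives both clock periods from the stage latencies. The mapping of latencies to stage names (register read in ID, register write in WB) and the function names are assumptions made for illustration.

```python
# Stage latencies in picoseconds: 100ps for register read (ID) and
# register write (WB), 200ps for the other stages.
STAGE_PS = {"IF": 200, "ID": 100, "EX": 200, "MEM": 200, "WB": 100}


def single_cycle_period(stages):
    """The clock must cover an instruction that uses every stage (e.g. lw)."""
    return sum(stages.values())


def pipelined_period(stages):
    """Each stage gets one cycle, so the slowest stage sets the clock."""
    return max(stages.values())


print(single_cycle_period(STAGE_PS))  # 800
print(pipelined_period(STAGE_PS))     # 200
```

The pipelined clock is 4x faster rather than 5x because the 100ps stages still occupy a full 200ps cycle.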
20
Pipeline Performance
[Figure: instruction timing diagram. Single-cycle (Tc = 800ps).]
21
Pipeline Speedup
- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease
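A small sketch of why the ideal speedup is only approached for long instruction streams: the first instruction still needs one cycle per stage to fill the pipeline. It assumes perfectly balanced stages and no hazards; the function names are illustrative.

```python
def total_time(n_instr, n_stages, stage_time, pipelined):
    """Time to run n_instr instructions with perfectly balanced stages."""
    if pipelined:
        # n_stages cycles for the first instruction (fill),
        # then one more cycle for each remaining instruction.
        return (n_stages + n_instr - 1) * stage_time
    # Non-pipelined: each instruction occupies the whole datapath.
    return n_instr * n_stages * stage_time


def speedup(n_instr, n_stages, stage_time=1):
    return (total_time(n_instr, n_stages, stage_time, pipelined=False)
            / total_time(n_instr, n_stages, stage_time, pipelined=True))


for n in (1, 10, 1000, 1_000_000):
    print(n, round(speedup(n, n_stages=5), 3))
```

A single instruction sees no speedup at all (its latency is unchanged), while a very long stream approaches the number of stages.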
22
But Something Feels Wrong
- Why stop at 5 pipeline stages?
  - If pipelining improves Tclock & CPI = 1
  - We should keep subdividing the cycle
- Three issues
  - Some things have to complete in a cycle
  - CPI is not really one
  - Cost (area and power)
23
Quiz
- Ignoring all other issues, what is the highest clock frequency you can achieve with pipelining?
  - Lowest clock cycle time?
- What are the limiting factors?
24
Pipeline Hazards
- Situations that prevent starting the next instruction in the next cycle
  - Lead to CPI > 1
- Structural hazard
  - A required resource is busy
- Data hazard
  - Must wait for previous instructions to produce/consume data
- Control hazard
  - Next PC depends on previous instruction
25
Structural Hazards
- Resource conflict
  - Two instructions use same hardware in the same cycle
- Example: pipeline with a single unified memory
  - No separate instruction & data memories
  - Load/store requires data access
  - One instruction would have to stall for that cycle
    - Which one?
  - Would cause a pipeline "bubble"
- Other examples
  - Functional units that are not fully pipelined (mult, div)
26
Avoiding Structural Hazards
1. Do nothing (performance hit)
2. Replicate resources
  - Separate instruction/data memories, multiported memories, …
3. Design away the structural stall
  - Use resource once per instruction, always in the same stage
  - Example of bad pipeline arrangement
    - Load uses Register File's Write port during its 5th stage
              1    2      3    4    5
      Load    IF   RF/ID  EX   MEM  WB
    - R-type uses Register File's Write port during the 4th stage
              1    2      3    4
      R-type  IF   RF/ID  EX   WB
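The bad arrangement can be checked mechanically: if a load writes in its 5th stage and an R-type in its 4th, an R-type issued one cycle after a load reaches the write port in the same cycle. The sketch below assumes one instruction issues per cycle; the names `WB_STAGE` and `write_port_conflicts` are hypothetical.

```python
# Stage number (1-based) in which each instruction type writes the
# register file in the "bad" arrangement.
WB_STAGE = {"load": 5, "rtype": 4}


def write_port_conflicts(program, wb_stage):
    """Return {cycle: [kinds]} for cycles with more than one register write.

    program is a list of instruction kinds issued in consecutive cycles,
    starting at cycle 1.
    """
    writers = {}
    for issue_cycle, kind in enumerate(program, start=1):
        write_cycle = issue_cycle + wb_stage[kind] - 1
        writers.setdefault(write_cycle, []).append(kind)
    return {c: kinds for c, kinds in writers.items() if len(kinds) > 1}


print(write_port_conflicts(["load", "rtype"], WB_STAGE))
# Writing in the same stage for every instruction type removes the conflict.
print(write_port_conflicts(["load", "rtype"], {"load": 5, "rtype": 5}))
```

The first call reports both instructions writing in cycle 5; the second, with every type writing in stage 5, reports no conflict.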
27
Structural Hazard Example
- Consider a load followed immediately by an ALU operation
  - Register file only has a single write port
  - But need to write the results of the ALU and the memory back
28
Delayed Write-back in 5-stage Pipeline
- Delay R-type register write by one cycle
  - Does this increase the CPI of instruction?
  - What is the cost?
          1    2      3    4    5
  R-type  IF   RF/ID  EX   MEM  WB
- Dependencies are a property of your program (always there)
- Dependencies may lead to hazards on a specific pipeline
30
Dependency Examples
- True dependency => RAW hazard
    addu $t0, $t1, $t2
    subu $t3, $t4, $t0
- Output dependency => WAW hazard
    addu $t0, $t1, $t2
    subu $t0, $t4, $t5
- Anti dependency => WAR hazard
    addu $t0, $t1, $t2
    subu $t1, $t4, $t5
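The three cases differ only in which register fields overlap, which a short sketch can make explicit. The `Instr` tuple and the `dependencies` function are illustrative names, and the sketch ignores the special behaviour of $zero.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: Tuple[str, ...]


def dependencies(first: Instr, second: Instr) -> Set[str]:
    """Classify the register dependencies of `second` on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")  # true dependency: second reads what first writes
    if first.dest == second.dest:
        found.add("WAW")  # output dependency: both write the same register
    if second.dest in first.srcs:
        found.add("WAR")  # anti dependency: second overwrites a source of first
    return found


addu = Instr("addu", "$t0", ("$t1", "$t2"))
print(dependencies(addu, Instr("subu", "$t3", ("$t4", "$t0"))))  # {'RAW'}
print(dependencies(addu, Instr("subu", "$t0", ("$t4", "$t5"))))  # {'WAW'}
print(dependencies(addu, Instr("subu", "$t1", ("$t4", "$t5"))))  # {'WAR'}
```

The function reports dependencies, which are a property of the program; whether each one becomes a hazard depends on the pipeline.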
31
Analyzing the Problem
- Can an output dependency cause a WAW hazard in 5-stage pipeline?
- Can an anti-dependency cause a WAR hazard in 5-stage pipeline?
- Are these answers universally true?
32
Dealing with RAW Hazards
- Must keep our "promise" in the instruction set
  - Each instruction fully completes before the next one starts
  - All RAW dependencies are respected
- Pipelining may break this promise
  - Overlapping i and j
  - i writes late in the pipeline (WB); j reads early (ID)
- Must ensure that programmers cannot observe this behavior
  - Without necessarily reverting to single-cycle design…
33
RAW Hazard Example
- Dependencies backwards in time are hazards
Time (clock cycles) 0    1    2    3    4    5    6    7    8
add r1, r2, r3      Im   Reg  ALU  Dm   Reg
sub r4, r1, r3           Im   Reg  ALU  Dm   Reg
and r6, r1, r7                Im   Reg  ALU  Dm   Reg
or  r8, r1, r9                     Im   Reg  ALU  Dm   Reg
xor r10, r1, r11                        Im   Reg  ALU  Dm   Reg
(Instructions in program order. Stages: IF = Im, ID/RF = Reg, EX = ALU, MEM = Dm, WB = Reg.)
34
Solutions for RAW Hazards
- Delay the reading instruction until data is available
  - Also called stalling or inserting pipeline bubbles
- How can we delay the younger instruction?
  - Compiler inserts independent work or NOPs ahead of it
    - NOP example: or $0, $0, $0
    - Disadvantage: pipeline-specific binary program
  - Hardware inserts NOPs as needed (interlocks)
    - Advantage: correct operation for all programs/pipelines
    - Disadvantage: may miss some optimization opportunities
- Most modern machines
  - Hardware inserts NOPs but compiler may try to minimize need
35
Data Hazard - Stalls
- Eliminate reverse time dependency by stalling
Time (clock cycles) 0      1      2      3      4      5      6      7      8
add r1, r2, r3      Im     Reg    ALU    Dm     Reg
sub r4, r1, r3             Im     bubble bubble bubble Reg    ALU    Dm     Reg
and r6, r1, r7                                         Im     Reg    ALU    Dm
or  r8, r1, r9                                                Im     Reg    ALU
(Instructions in program order. Stages: IF = Im, ID/RF = Reg, EX = ALU, MEM = Dm, WB = Reg.)
36
How to Stall the Pipeline
- Discover need to stall when 2nd instruction is in ID stage
  - Repeat its ID stage until hazard resolved
  - Let all instructions ahead of it move forward
  - Stall all instructions behind it
1. Force control values in ID/EX register to a NOP instruction
  - As if you fetched or $0, $0, $0
  - When it propagates to EX, MEM and WB, nothing will happen
2. Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
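The two steps can be sketched as the update rule for PC, IF/ID and ID/EX at one clock edge. This is a simplified model under stated assumptions: instructions are plain strings, the dictionary keys and the `clock_edge` name are hypothetical, and the later pipeline registers (which keep advancing normally) are not modeled.

```python
NOP = "nop"  # stands in for the control values of `or $0, $0, $0`


def clock_edge(state, fetched, stall):
    """Return the next values of PC, IF/ID and ID/EX after one clock edge.

    state   -- dict with keys "pc", "if_id", "id_ex"
    fetched -- instruction read from memory at state["pc"] this cycle
    stall   -- True when the instruction in ID must repeat its ID stage
    """
    if stall:
        # Step 1: a NOP goes into ID/EX. Step 2: PC and IF/ID keep their
        # values, so the same instructions are decoded and fetched again.
        return {"pc": state["pc"], "if_id": state["if_id"], "id_ex": NOP}
    return {"pc": state["pc"] + 4, "if_id": fetched, "id_ex": state["if_id"]}


state = {"pc": 8, "if_id": "sub r4, r1, r3", "id_ex": "lw r1, 0(r2)"}
stalled = clock_edge(state, "and r6, r1, r7", stall=True)
resumed = clock_edge(stalled, "and r6, r1, r7", stall=False)
print(stalled)
print(resumed)
```

After the stalled edge, sub is still in IF/ID and a NOP sits in ID/EX; one edge later sub moves on and the and instruction finally enters IF/ID.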
37
Performance Effect
- Stalls can have a significant effect on performance
- Consider the following case
  - The ideal CPI of the machine is 1
  - A RAW hazard causes a 3 cycle stall
- If 40% of the instructions cause a stall?
  - The new effective CPI is 1 + 3 × 0.4 = 2.2
  - And the real % is probably higher than 40%
- You get less than ½ the desired performance!
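The arithmetic above generalizes to any stall length and stall frequency. The sketch below uses an illustrative function name and assumes every stalling instruction pays the same penalty.

```python
def effective_cpi(ideal_cpi, stall_cycles, stall_fraction):
    """Average cycles per instruction when a fraction of instructions stall."""
    return ideal_cpi + stall_cycles * stall_fraction


cpi = effective_cpi(ideal_cpi=1.0, stall_cycles=3, stall_fraction=0.4)
relative_performance = 1.0 / cpi  # compared with the ideal CPI of 1

print(f"effective CPI = {cpi:.2f}")                         # effective CPI = 2.20
print(f"relative performance = {relative_performance:.2f}")  # relative performance = 0.45
```

A relative performance of about 0.45 is the "less than ½" in the last bullet.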
38
Reducing Stalls
- Key: when can you say new data is actually available?
- In the 5-stage pipeline
  - After WB stage?
  - During WB stage?
    - Register file is typically fast
    - Write in the first half, read in the second half
  - After EX stage?
39
Decreasing Stalls: Fast RF
- Register file writes on first half and reads on second half
Time (clock cycles) 0      1      2      3      4      5      6      7
add r1, r2, r3      Im     Reg    ALU    Dm     Reg
sub r4, r1, r3             Im     bubble bubble Reg    ALU    Dm     Reg
and r6, r1, r7                                  Im     Reg    ALU    Dm
or  r8, r1, r9                                         Im     Reg    ALU
(Instructions in program order. Stages: IF = Im, ID/RF = Reg, EX = ALU, MEM = Dm, WB = Reg.)
40
Performance Effect
- Stalls can have a significant effect on performance
- Consider the following case
  - The ideal CPI of the machine is 1
  - A RAW hazard causes a 2 cycle stall
- If 40% of the instructions cause a stall?
  - The new effective CPI is 1 + 2 × 0.4 = 1.8
  - And the real % is probably higher than 40%
- You get a little more than ½ the desired performance!
41
Reducing Stalls – one step beyond
- Key is to be careful about when
  - Data is actually available as output
  - Data is actually required as an input
- In our example:
  - Data becomes available when add finishes EX stage
    - Cycle 2
  - Data needed by sub at the beginning of its EX stage
    - Cycle 3 (the soonest possible)
- If you can use this value, the stall for ALU is zero!
  - Fastest, but requires more hardware – called forwarding
  - Aka bypassing, short-circuiting
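The three options for the add/sub pair can be compared with one formula: the stall is the gap between the first cycle in which the consumer can obtain the value and the cycle in which it wants it. The cycle numbers below assume add is fetched in cycle 0 (so its EX is cycle 2 and its WB is cycle 4) and sub one cycle later; the function name is illustrative.

```python
def stall_cycles(first_usable_cycle, wanted_cycle):
    """Bubbles needed before the consumer can proceed (never negative)."""
    return max(0, first_usable_cycle - wanted_cycle)


SUB_ID, SUB_EX = 2, 3  # cycles in which sub would decode / execute unstalled

# Register file written at the end of WB (cycle 4): readable from cycle 5.
print(stall_cycles(first_usable_cycle=5, wanted_cycle=SUB_ID))  # 3
# Fast register file: written in the first half of cycle 4, read in the second.
print(stall_cycles(first_usable_cycle=4, wanted_cycle=SUB_ID))  # 2
# Forwarding: result leaves EX at the end of cycle 2, usable by EX in cycle 3.
print(stall_cycles(first_usable_cycle=3, wanted_cycle=SUB_EX))  # 0
```

Forwarding helps twice: the value is usable earlier, and the consumer needs it later (in EX rather than ID).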
42
Decreasing Stalls: Forwarding
- "Forward" the data to the appropriate unit
Time (clock cycles) 0    1    2    3    4    5    6    7    8
add r1, r2, r3      Im   Reg  ALU  Dm   Reg
sub r4, r1, r3           Im   Reg  ALU  Dm   Reg
and r6, r1, r7                Im   Reg  ALU  Dm   Reg
or  r8, r1, r9                     Im   Reg  ALU  Dm   Reg
xor r10, r1, r11                        Im   Reg  ALU  Dm   Reg
(Instructions in program order. Stages: IF = Im, ID/RF = Reg, EX = ALU, MEM = Dm, WB = Reg. The forwarding arrows of the original figure are not recoverable.)
43
Forwarding Limitation: Load-Use Case
- Data is not available yet to be forwarded
Time (clock cycles) 0    1    2    3    4    5    6    7
lw  r1, 0(r2)       Im   Reg  ALU  Dm   Reg
sub r4, r1, r6           Im   Reg  ALU  Dm   Reg
and r6, r1, r7                Im   Reg  ALU  Dm   Reg
or  r8, r1, r9                     Im   Reg  ALU  Dm   Reg
(Instructions in program order. Stages: IF = Im, ID/RF = Reg, EX = ALU, MEM = Dm, WB = Reg.)
44
Load-Use Case: Hardware Stall
- A pipeline interlock checks and stops the instruction issue
Time (clock cycles)
lw  r1, 0(r2)       Im     Reg    ALU    Dm     Reg
sub r4, r1, r3             Im     Reg    bubble ALU    Dm     Reg
and r6, r1, r7                    Im     bubble Reg    ALU    Dm     Reg
or  r8, r1, r9                                  Im     Reg    ALU    Dm     Reg
(Instructions in program order. Stages: IF = Im, ID/RF = Reg, EX = ALU, MEM = Dm, WB = Reg.)
45
Identifying the Forwarding Datapaths
- Identify all stages that produce new values
  - EX and MEM
- All stages after first producer are sources of forwarding data
  - MEM, WB
- Identify all stages that really consume values
  - EX and MEM
- These stages are the destinations of forwarded data
- Add multiplexor for each pair of source/destination stages
- Consider both possible instruction operands
46
Forwarding Paths: Partial
47
Forwarding Control
- Pass register numbers along pipeline
  - e.g., ID/EX.RegisterRs = register number for Rs in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
- Data hazards possible when
  - 1a. EX/MEM.RegisterRd == ID/EX.RegisterRs   (fwd from EX/MEM pipeline reg)
  - 1b. EX/MEM.RegisterRd == ID/EX.RegisterRt   (fwd from EX/MEM pipeline reg)
  - 2a. MEM/WB.RegisterRd == ID/EX.RegisterRs   (fwd from MEM/WB pipeline reg)
  - 2b. MEM/WB.RegisterRd == ID/EX.RegisterRt   (fwd from MEM/WB pipeline reg)
48
Forwarding Control
- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And if Rd for that instruction is not $zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0
- And if forwarding instruction is not a load in MEM stage
  - EX/MEM.MemToReg == 0
  - This is a case we have to stall…
49
Forwarding Control (Stall Case not Shown)
- EX hazard
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRs))
      ForwardA = 10
    if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
        and (EX/MEM.RegisterRd == ID/EX.RegisterRt))
      ForwardB = 10
- MEM hazard
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and (MEM/WB.RegisterRd == ID/EX.RegisterRt))
      ForwardB = 01
50
Double Data Hazard
- Consider the sequence:
    add $1,$1,$2
    sub $1,$1,$3
    or $1,$1,$4
- Both hazards occur
  - Want to use the most recent result from the sub
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn't true
51
Forwarding Control (Revised)
- MEM hazard
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd == ID/EX.RegisterRs))
        and (MEM/WB.RegisterRd == ID/EX.RegisterRs))
      ForwardA = 01
    if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
        and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
                 and (EX/MEM.RegisterRd == ID/EX.RegisterRt))
        and (MEM/WB.RegisterRd == ID/EX.RegisterRt))
      ForwardB = 01
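The EX and revised MEM conditions can be folded into one priority check per operand: test the EX/MEM register first, and fall back to MEM/WB only if it does not match. The sketch below is one possible software rendering, not the hardware itself. The field names are hypothetical, and the value 00 for "no forwarding, use the register file" is an assumption not stated on the slide.

```python
from dataclasses import dataclass


@dataclass
class PipeRegs:
    id_ex_rs: int           # ID/EX.RegisterRs
    id_ex_rt: int           # ID/EX.RegisterRt
    ex_mem_reg_write: bool  # EX/MEM.RegWrite
    ex_mem_rd: int          # EX/MEM.RegisterRd
    mem_wb_reg_write: bool  # MEM/WB.RegWrite
    mem_wb_rd: int          # MEM/WB.RegisterRd


def forward_select(regs: PipeRegs, src: int) -> int:
    """Mux select for one ALU operand whose source register number is src."""
    if regs.ex_mem_reg_write and regs.ex_mem_rd != 0 and regs.ex_mem_rd == src:
        return 0b10  # EX hazard: take the most recent result from EX/MEM
    if regs.mem_wb_reg_write and regs.mem_wb_rd != 0 and regs.mem_wb_rd == src:
        return 0b01  # MEM hazard, reached only when there is no EX hazard
    return 0b00      # assumed encoding for "no forwarding"


def forwarding_unit(regs: PipeRegs):
    """Return (ForwardA, ForwardB)."""
    return forward_select(regs, regs.id_ex_rs), forward_select(regs, regs.id_ex_rt)


# or $1,$1,$4 in EX, sub $1,$1,$3 in MEM, add $1,$1,$2 in WB.
double_hazard = PipeRegs(id_ex_rs=1, id_ex_rt=4,
                         ex_mem_reg_write=True, ex_mem_rd=1,
                         mem_wb_reg_write=True, mem_wb_rd=1)
print(forwarding_unit(double_hazard))  # (2, 0), i.e. ForwardA = 10, ForwardB = 00
```

Checking EX/MEM first is what implements the "only fwd if EX hazard condition isn't true" rule: the or instruction receives the result of sub, not the older result of add.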
52
Datapath with Forwarding
53
Load-Use Data Hazard
[Figure: load-use data hazard. Need to stall for one cycle.]
54
Load-Use Hazard Detection
- Check when use instruction is decoded in ID stage
- ALU register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
    ID/EX.MemRead and
      ((ID/EX.RegisterRt == IF/ID.RegisterRs) or
       (ID/EX.RegisterRt == IF/ID.RegisterRt))
- If detected, stall and insert bubble
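The detection condition translates directly into a boolean function. The sketch below uses hypothetical parameter names that mirror the pipeline register fields above.

```python
def load_use_hazard(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """True when the instruction in ID needs the result of a load still in EX."""
    return bool(id_ex_mem_read
                and (id_ex_rt == if_id_rs or id_ex_rt == if_id_rt))


# lw r1, 0(r2) is in EX (its destination Rt is r1); sub r4, r1, r3 is in ID.
print(load_use_hazard(id_ex_mem_read=True, id_ex_rt=1, if_id_rs=1, if_id_rt=3))   # True
# The same register match after a non-load instruction needs no stall.
print(load_use_hazard(id_ex_mem_read=False, id_ex_rt=1, if_id_rs=1, if_id_rt=3))  # False
```

When the function returns True, the pipeline stalls for one cycle and inserts a bubble; after that the loaded value can be forwarded from the MEM/WB register.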
55
Datapath with Hazard Detection
56
Example: Load-Use Stall
[Figure: datapath snapshot. Instructions in flight, left to right: sub r4, r1, r3; lw r1, 0(r2).]
57
Example: Load-Use Stall, 1 cycle later
[Figure: datapath snapshot. Instructions in flight, left to right: sub r4, r1, r3; nop; lw r1, 0(r2).]
58
Looking Ahead
- Compilers and data hazards
- Control hazards
- Exceptions and interrupts
- Advanced pipelining (CPI < 1.0)
59