Computer Architecture
EE-371/CS-330
                                                             Spring 2019
                                                         Hasan Baig
                                                [email protected]
                                                           Habib University
The contents of these lecture slides are prepared with the help of the official lecture slides of the book
“Computer Organization and Design – RISC-V edition” by Patterson and Hennessy.
Recap
Performance Issues
      • Longest delay determines clock period
          – Critical path: load instruction
          – Instruction memory → register file → ALU → data
            memory → register file
      • Not feasible to vary period for different
        instructions
      • Violates design principle
          – Making the common case fast
      • We will improve performance by pipelining
Pipelining   Restaurant Analogy
 [Figure: a buffet/delivery restaurant as a pipeline. Three tasks share the
 same stations — a dine-in customer grabs food and dines in, a delivery guy
 delivers food, a worker purchases groceries. Stations (time units): token
 counter 1, take/give order 1, cash counter 2, kitchen / grab food / check
 groceries 2, dining hall / delivery address 10. Without pipelining, each
 process executes one at a time.]
Pipelining   Laundry Analogy
    An implementation technique in which multiple instructions are
    overlapped in execution
     [Figure: laundry pipeline timeline.]
     Four loads:
         Speedup = 8/3.5 = 2.3
     Non-stop:
         Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
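The two speedup numbers above can be checked with a short sketch (assumed, per the analogy: 4 laundry stages of 0.5 hours each, so one load takes 2 hours sequentially):

```python
# Laundry-analogy speedup sketch (assumption: 4 stages, 0.5 h per stage).
def sequential_time(n_loads, stage_time=0.5, stages=4):
    # Each load finishes completely before the next starts.
    return n_loads * stages * stage_time

def pipelined_time(n_loads, stage_time=0.5, stages=4):
    # The first load takes stages * stage_time; each later load adds one stage_time.
    return stages * stage_time + (n_loads - 1) * stage_time

print(sequential_time(4) / pipelined_time(4))      # 8 / 3.5 ≈ 2.29
print(sequential_time(10_000) / pipelined_time(10_000))  # approaches 4 (the stage count)
```

For large n the ratio 2n / (0.5n + 1.5) tends to 4, matching the "non-stop" formula on the slide.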
Pipelining
      Five stages, one step per stage
       1. IF: Instruction fetch from memory
       2. ID: Instruction decode & register read
       3. EX: Execute operation or calculate address
       4. MEM: Access memory operand
       5. WB: Write result back to register
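The overlap of these five stages can be sketched as a simple cycle diagram (an illustrative sketch; ideal pipeline with no hazards, one instruction entering per cycle):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(n_instructions):
    """One row per instruction, one column per cycle; '--' marks idle cycles."""
    rows = []
    for i in range(n_instructions):
        # Instruction i enters IF in cycle i, then advances one stage per cycle.
        rows.append(["--"] * i + STAGES + ["--"] * (n_instructions - 1 - i))
    return rows

for row in pipeline_diagram(3):
    print(" ".join(f"{s:>3}" for s in row))
```

Reading down any column shows that every stage is busy with a different instruction once the pipeline is full.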
Pipelining    Example
    • Assume time for stages is
           – 100ps for register read or write
           – 200ps for other stages
    • Compare pipelined datapath with single-cycle
      datapath
   Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
   ld         200 ps        100 ps          200 ps   200 ps          100 ps           800 ps
   sd         200 ps        100 ps          200 ps   200 ps                           700 ps
   R-format   200 ps        100 ps          200 ps                   100 ps           600 ps
   beq        200 ps        100 ps          200 ps                                    500 ps
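A minimal sketch recomputing the table's totals from the assumed stage times (200 ps for fetch, ALU, and memory access; 100 ps for register read/write):

```python
# Stage times as assumed in the example above.
STAGE_TIME = {"IF": 200, "reg_read": 100, "ALU": 200, "MEM": 200, "reg_write": 100}

# Which stages each instruction class actually uses.
USES = {
    "ld":       ["IF", "reg_read", "ALU", "MEM", "reg_write"],
    "sd":       ["IF", "reg_read", "ALU", "MEM"],
    "R-format": ["IF", "reg_read", "ALU", "reg_write"],
    "beq":      ["IF", "reg_read", "ALU"],
}

totals = {instr: sum(STAGE_TIME[s] for s in stages) for instr, stages in USES.items()}
print(totals)  # {'ld': 800, 'sd': 700, 'R-format': 600, 'beq': 500}

# Single-cycle clock must fit the slowest instruction; pipelined clock the slowest stage.
print(max(totals.values()), max(STAGE_TIME.values()))  # 800 200
```

This is exactly why the single-cycle design clocks at 800 ps while the pipelined one clocks at 200 ps.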
Pipelining    Example
                                Single-cycle (Tc = 800 ps)
Pipelining    Example
                               Pipelined (Tc = 200 ps)
Pipelining     Speedup
      • If all stages are balanced
             – i.e., all take the same time
              – Time between instructions (pipelined)
                = Time between instructions (nonpipelined) / Number of stages
      • If not balanced, speedup is less
      • Speedup due to increased throughput
             – Latency (time for each instruction) does not
               decrease
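The balanced/unbalanced rule above can be sketched as follows (assumed numbers: the 800 ps single-cycle datapath from the earlier example, split into 5 stages):

```python
# Sketch of the pipelined instruction period (time between instructions).
def pipelined_period(t_nonpipelined, n_stages, slowest_stage=None):
    if slowest_stage is None:
        # Balanced stages: divide the non-pipelined time by the stage count.
        return t_nonpipelined / n_stages
    # Unbalanced stages: the clock must instead fit the slowest stage.
    return slowest_stage

print(pipelined_period(800, 5))       # balanced ideal: 160.0 ps
print(pipelined_period(800, 5, 200))  # stages of up to 200 ps: 200 ps
```

With 200 ps stages the achieved speedup is 800/200 = 4x rather than the ideal 5x, illustrating why unbalanced stages reduce the speedup.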
Pipelining   Speedup
                        [Figure: execution timelines — non-pipelined total
                        T = 2400 ps vs. pipelined total T = 1400 ps.]
Pipelining       Speedup
   Instructions = 1,000,000
   Pipelined:      each instruction adds 200 ps
                   Total time = 1,000,000 × 200 + 1400 = 200,001,400 ps
   Non-pipelined:  each instruction adds 800 ps
                   Total time = 1,000,000 × 800 + 2400 = 800,002,400 ps
Latency   Exercise
Latency      Solution
1. R-type: 30 + 250 + 150 + 25 + 200 + 25 + 20 = 700 ps
2. ld:     30 + 250 + 150 + 25 + 200 + 250 + 25 + 20 = 950 ps
3. sd:     30 + 250 + 150 + 200 + 25 + 250 = 905 ps
4. beq:    30 + 250 + 150 + 25 + 200 + 5 + 25 + 20 = 705 ps
5. I-type: 30 + 250 + 150 + 25 + 200 + 25 + 20 = 700 ps
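A sketch that recomputes the latencies above by summing the component delays along each instruction's path (the delay values are taken directly from the solution lines):

```python
# Per-instruction datapath component delays (ps), copied from the solution above.
PATHS = {
    "R-type": [30, 250, 150, 25, 200, 25, 20],
    "ld":     [30, 250, 150, 25, 200, 250, 25, 20],
    "sd":     [30, 250, 150, 200, 25, 250],
    "beq":    [30, 250, 150, 25, 200, 5, 25, 20],
    "I-type": [30, 250, 150, 25, 200, 25, 20],
}

for instr, delays in PATHS.items():
    print(f"{instr}: {sum(delays)} ps")
# R-type 700, ld 950, sd 905, beq 705, I-type 700
```

The longest path (ld, 950 ps) would set the single-cycle clock period for this datapath.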
Recap   Problems in single-cycle processor
    • Longest delay determines clock period
        – Critical path: load instruction
        – Instruction memory → register file → ALU → data
          memory → register file
    • Not feasible to vary period for different
      instructions
    • Violates design principle
        – Making the common case fast
    • We will improve performance by pipelining
Recap
Quick Review   Stages in Processor
     Five stages, one step per stage
      1. IF: Instruction fetch from memory
      2. ID: Instruction decode & register read
      3. EX: Execute operation or calculate address
      4. MEM: Access memory operand
      5. WB: Write result back to register
Recap
    • Register file data is read in the second half of each clock cycle
    • Register file data is written in the first half of each clock cycle
      (so a value written in one cycle can be read in that same cycle)
Pipelining and ISA Design
     • RISC-V ISA designed for pipelining
          – All instructions are 32-bits
              • Easier to fetch and decode in one cycle
               • cf. x86: 1- to 17-byte instructions
          – Few and regular instruction formats
              • Can decode and read registers in one step
Pipelining   Hazards
      • Situations that prevent starting the next
        instruction in the next cycle → hazards
      • Structural hazard
           – A required resource is busy
     • Data hazard
          – Need to wait for previous instruction to complete
            its data read/write
     • Control hazard
          – Deciding on control action depends on previous
            instruction
Pipelining   Structural Hazards
 When a planned instruction cannot execute in the proper clock cycle
 because the hardware does not support the combination of
 instructions that are set to execute.
    • Conflict for use of a resource
    • In RISC-V pipeline with a single memory
        – Load/store requires data access
        – Instruction fetch would have to stall for that cycle
             • Would cause a pipeline “bubble”
    • Hence, pipelined datapaths require separate
      instruction/data memories
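The single-memory conflict above can be sketched as follows (illustrative: in an ideal 5-stage pipeline, the instruction fetched in cycle t is instruction t, while the instruction in MEM is instruction t − 3, so a ld/sd in MEM collides with a fetch):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def memory_conflicts(program, shared_memory=True):
    """Cycles in which IF and MEM both need the (single) memory."""
    conflicts = []
    if not shared_memory:
        return conflicts  # separate instruction/data memories: no conflict possible
    for cycle in range(len(program) + len(STAGES) - 1):
        in_mem = cycle - 3   # index of the instruction in MEM this cycle
        in_if = cycle        # index of the instruction in IF this cycle
        if 0 <= in_mem < len(program) and in_if < len(program):
            if program[in_mem] in ("ld", "sd"):
                conflicts.append(cycle)  # data access and fetch collide
    return conflicts

print(memory_conflicts(["ld", "add", "sub", "sd", "add"]))  # → [3]
```

Each listed cycle would force the fetch to stall, creating a pipeline bubble; with separate memories the list is always empty.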
Pipelining   Data Hazards
   Data hazards occur when the pipeline must be stalled because one
   step must wait for another to complete.
                            add   x19, x0, x1
                            sub   x2, x19, x3
Pipelining     Data Hazards
      • Use result when it is computed
             – Don’t wait for it to be stored in a register
             – Requires extra connections in the datapath
  Forwarding - Also called bypassing. A method of resolving a data hazard by retrieving the
  missing data element from internal buffers rather than waiting for it to arrive from
  programmer-visible registers or memory.
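The forwarding decision can be sketched as the classic Patterson & Hennessy hazard conditions (a simplified sketch: register numbers and pipeline-register names are illustrative; x0 is never forwarded because it is hardwired to zero):

```python
def forward_source(ex_mem_rd, ex_mem_regwrite, mem_wb_rd, mem_wb_regwrite, rs):
    """Where the EX stage should take source operand `rs` from."""
    # EX hazard: the instruction one ahead will write the register we need.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return "EX/MEM"   # result computed last cycle, not yet written back
    # MEM hazard: the instruction two ahead will write it.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return "MEM/WB"
    return "REGFILE"      # no hazard: the normal register-file read is correct

# add x19, x0, x1 followed immediately by sub x2, x19, x3:
print(forward_source(ex_mem_rd=19, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False, rs=19))  # EX/MEM
```

The EX/MEM check comes first because the more recent result must win when both older instructions write the same register.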
Pipelining   Data Hazards
      • Forwarding paths are valid only if the destination stage is
        later in time than the source stage
      • Source – Output of MEM in first instruction
      • Destination - Input to EX stage
Pipelining     Data Hazards
      • Load-use Data Hazards
      • Can’t always avoid stalls by forwarding
             – If value not computed when needed
             – Can’t use forwarding backward in time!
Pipelining    Data Hazards
      Code Scheduling to avoid stalls
      • Reorder code to avoid use of load result in the
        next instruction
      • C code for a = b + e; c = b + f;
              Assume all variables are in memory, addressable at offsets from x31

              Original (13 cycles):            Reordered (11 cycles):
                  ld   x1, 0(x31)                  ld   x1, 0(x31)
                  ld   x2, 8(x31)                  ld   x2, 8(x31)
                  (stall)                          ld   x4, 16(x31)
                  add  x3, x1, x2                  add  x3, x1, x2
                  sd   x3, 24(x31)                 sd   x3, 24(x31)
                  ld   x4, 16(x31)                 add  x5, x1, x4
                  add  x5, x1, x4                  sd   x5, 32(x31)
                  (stall)
                  sd   x5, 32(x31)
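The 13- vs 11-cycle counts can be reproduced with a sketch (simplifying assumptions: 5-stage pipeline with full forwarding, so total cycles = 4 fill cycles + instruction count + one bubble per load immediately followed by a use of its result):

```python
def cycle_count(program, stages=5):
    """program: list of (opcode, destination, sources) tuples."""
    cycles = (stages - 1) + len(program)
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "ld" and prev[1] in cur[2]:
            cycles += 1  # load-use hazard: one bubble even with forwarding
    return cycles

original = [("ld", "x1", []), ("ld", "x2", []),
            ("add", "x3", ["x1", "x2"]), ("sd", None, ["x3"]),
            ("ld", "x4", []), ("add", "x5", ["x1", "x4"]),
            ("sd", None, ["x5"])]

reordered = [("ld", "x1", []), ("ld", "x2", []), ("ld", "x4", []),
             ("add", "x3", ["x1", "x2"]), ("sd", None, ["x3"]),
             ("add", "x5", ["x1", "x4"]), ("sd", None, ["x5"])]

print(cycle_count(original), cycle_count(reordered))  # 13 11
```

Moving the third ld up separates both loads from their uses, removing both bubbles without changing program behavior.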
Pipelining     Control Hazards
     • Also called Branch Hazard
     • Branch determines flow of control
             – Fetching next instruction depends on branch
               outcome
             – Pipeline can’t always fetch correct instruction
                • Still working on ID stage of branch
     • In RISC-V pipeline
             – Need to compare registers and compute target
               early in the pipeline
             – Add hardware to do it in ID stage
Pipelining   Control Hazards
      Stall on branch
      • Wait until branch outcome determined before
        fetching next instruction
Pipelining     Control Hazards
       Branch Prediction
      • Longer pipelines can’t readily determine
        branch outcome early
             – Stall penalty becomes unacceptable
      • Predict outcome of branch
             – Only stall if prediction is wrong
      • In RISC-V pipeline
             – Can predict branches not taken
             – Fetch instruction after branch, with no delay
Pipelining   Control Hazards
       Branch Prediction
Pipelining     Control Hazards
       More-Realistic Branch Prediction
      • Static branch prediction
             – Based on typical branch behavior
             – Example: loop and if-statement branches
                • Predict backward branches taken
                • Predict forward branches not taken
      • Dynamic branch prediction
             – Hardware measures actual branch behavior
                • e.g., record recent history of each branch
             – Assume future behavior will continue the trend
                • When wrong, stall while re-fetching, and update history
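Recording recent history can be sketched with a 2-bit saturating counter, a common dynamic scheme (used here as an illustrative assumption, not a claim about any specific processor): two consecutive mispredictions are needed to flip the prediction, so a loop-closing branch mispredicts only at loop exit.

```python
class TwoBitPredictor:
    def __init__(self):
        self.state = 0  # 0,1 = predict not-taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Saturating counter: move toward "taken" or "not taken", clamped to [0, 3].
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

p = TwoBitPredictor()
outcomes = [True] * 9 + [False]   # a loop branch taken 9 times, then the loop exits
wrong = 0
for taken in outcomes:
    if p.predict() != taken:
        wrong += 1
    p.update(taken)

print(wrong)  # 3: the first two taken branches (while warming up) and the final exit
```

A 1-bit predictor on the same sequence would also mispredict twice per loop execution in a nested loop, which is why the 2-bit version is preferred.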
Pipelining   Summary of Overview
   • Pipelining improves performance by increasing
     instruction throughput
       – Executes multiple instructions in parallel
       – Each instruction has the same latency
   • Subject to hazards
        – Structural, data, control
   • Instruction set design affects complexity of
     pipeline implementation
Pipelining   Activity
For each code sequence below, state whether it must stall, can avoid stalls using
only forwarding, or can execute without stalling or forwarding.