Lecture # Pipelining

The document discusses the key operations of a processor, including R-type, load, store, branch, and jump instructions, along with their implementation in the datapath. It explains the inefficiencies of single-cycle implementations and introduces pipelining as a more efficient alternative that increases instruction throughput. Additionally, it covers pipeline hazards, including structural, data, and control hazards, and their solutions such as forwarding and branch prediction.

Uploaded by

adeenhassan7575

The Processor

Computer Organization and Assembly Language


Key Operations:
• R-type (add, sub):
• Fetch, Decode, Execute (ALU), Write back
• Load (lw):
• Fetch, Decode, ALU address calculation, Memory read, Write back
• Store (sw):
• Fetch, Decode, ALU address calculation, Memory write
• Branch (beq):
• Fetch, Decode, Compare (ALU zero check), Update PC if zero
• Jump (j):
• Fetch, Decode, Compute jump address, Update PC
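As a quick summary, the stage sequences above can be written down in a short sketch (the stage names here are informal shorthand, not official datapath signal names):

```python
# Stage sequence for each instruction class, as listed above.
# A minimal sketch; stage names are shorthand for the datapath steps.
STAGES = {
    "R-type": ["Fetch", "Decode", "Execute", "WriteBack"],
    "lw":     ["Fetch", "Decode", "AddrCalc", "MemRead", "WriteBack"],
    "sw":     ["Fetch", "Decode", "AddrCalc", "MemWrite"],
    "beq":    ["Fetch", "Decode", "Compare", "UpdatePC"],
    "j":      ["Fetch", "Decode", "JumpAddr", "UpdatePC"],
}

# Only loads and stores touch data memory; only loads and
# R-type instructions write a result back to the register file.
for op, stages in STAGES.items():
    print(f"{op:7s} -> {' -> '.join(stages)}")
```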
Building the Datapath
Implementing Jump Instruction
• Jump Instruction

• Jump vs Branch Instruction


• The jump instruction looks like a branch instruction but computes the target PC differently and is not conditional
• How is the jump instruction implemented?
• The jump instruction carries a 26-bit address field, yet a full 32-bit target PC must be computed
• Computation of the target address:
• The lower 28 bits of the target come from the 26-bit address field of the jump instruction
• Those 26 bits are shifted left by 2 to make 28 bits (appending 00 as the low-order bits)
• The upper 4 bits of the target remain the same as the upper 4 bits of the current value of PC + 4
• Thus, the jump instruction can be implemented by concatenation:
• Target = { upper 4 bits of PC + 4, lower 26 bits of the jump address, 00 }
Building the Datapath
Implementing Jump Instruction
• Let’s assume that the given initial PC value is 0x0008 0000 (written out to a full 32 bits)
• Now execute a jump instruction:
• Current value of PC = 0x0008 0024
• Address field of the jump = 0x0020000 (26 bits)
• Shifted left by 2 bits: 0x0020000 × 4 = 0x0080000 (28 bits)
• Target address = upper 4 bits of PC + 4, concatenated with the 28 bits after the left shift
= 0b0000 ++ 0b0000 0000 1000 0000 0000 0000 0000
= 0b0000 0000 0000 1000 0000 0000 0000 0000
= 0x0008 0000
which is the required target address
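The computation above can be checked with a short Python sketch (the helper function is illustrative, not part of the lecture's datapath):

```python
def jump_target(pc_plus_4: int, addr_field_26: int) -> int:
    """Concatenate the upper 4 bits of PC+4 with the 26-bit address
    field shifted left by 2 (appending 00 as the low-order bits)."""
    upper4 = pc_plus_4 & 0xF0000000               # keep bits 31:28 of PC+4
    lower28 = (addr_field_26 << 2) & 0x0FFFFFFF   # 26 bits -> 28 bits
    return upper4 | lower28

# Worked example from the slide: PC = 0x00080024, address field = 0x0020000
assert jump_target(0x00080024 + 4, 0x0020000) == 0x00080000
print(hex(jump_target(0x00080028, 0x0020000)))  # 0x80000
```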
Building the Datapath
Implementing Jump Instruction (Figure Explanation)
• An additional multiplexor (at the upper right) is used to
choose between the jump target and either the branch
target or the sequential instruction following this one
• This multiplexor is controlled by the jump control signal
• The jump target address is obtained by shifting the
lower 26 bits of the jump instruction left 2 bits,
effectively adding 00 as the low-order bits, and then
concatenating the upper 4 bits of PC + 4 as the high-
order bits, thus yielding a 32-bit address.
Why a Single-Cycle Implementation Is Not
Used Today
• Although the single-cycle design will work correctly, it would not be
used in modern designs because it is inefficient
• Why is the single-cycle design not used today?
• The clock cycle must have the same length for every instruction in this single-cycle design
• And, of course, the longest possible path in the processor determines the clock cycle
• Which instruction has the longest path, i.e., involves all the functional elements of the processor?
• The load word (lw) instruction uses five functional units in series:
• the instruction memory, the register file, the ALU, the data memory, and the register file again
• Although the CPI is 1, the overall performance of a single-cycle
implementation is likely to be poor, since the clock cycle is too long
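A minimal sketch of why the longest path sets the clock, assuming illustrative unit delays (200 ps for a memory access or ALU operation, 100 ps for a register-file access) consistent with the 800 ps lw time used later in the lecture:

```python
# Assumed unit delays in picoseconds (illustrative textbook values).
DELAY = {"imem": 200, "reg_read": 100, "alu": 200, "dmem": 200, "reg_write": 100}

# Functional units each instruction class uses, in series.
PATH = {
    "R-type": ["imem", "reg_read", "alu", "reg_write"],
    "lw":     ["imem", "reg_read", "alu", "dmem", "reg_write"],
    "sw":     ["imem", "reg_read", "alu", "dmem"],
    "beq":    ["imem", "reg_read", "alu"],
}

path_time = {op: sum(DELAY[u] for u in units) for op, units in PATH.items()}
clock = max(path_time.values())   # the longest path determines the cycle
assert path_time["lw"] == 800 and clock == 800
print(path_time)                  # lw's 800 ps dominates; beq needs only 500 ps
```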
Why a Single-Cycle Implementation Is Not
Used Today
• The penalty for using the single-cycle design with a
fixed clock cycle is significant, but might be considered
acceptable for this small instruction set
• However, for a floating-point unit or an instruction set with more complex instructions, this single-cycle design wouldn’t work well at all.
• Reason:
• Because we must assume that the clock cycle is equal to the worst-
case delay for all instructions
• It’s useless to try implementation techniques that reduce the delay of
the common case but do not improve the worst-case cycle time
Why a Single-Cycle Implementation Is Not
Used Today
• Solution:
• Pipelining uses a datapath very similar to the single-cycle datapath but is much more efficient, achieving a much higher throughput
• Pipelining improves efficiency by executing multiple
instructions simultaneously
• Pipelining is an implementation technique in which
multiple instructions are overlapped in execution
Pipelining
(Figure Explanation)
• Pipelining Example
• Anyone who has done a lot of laundry has intuitively used pipelining
• However, the nonpipelined approach to laundry would be as
follows:
1. Place one dirty load of clothes in the washer.
2. When the washer is finished, place the wet load in the dryer.
3. When the dryer is finished, place the dry load on a table and fold.
4. When folding is finished, ask your roommate to put the clothes away.
Pipelining
Figure Explanation
• Ann, Brian, Cathy, and Don each have dirty clothes to
be washed, dried, folded, and put away
• The washer, dryer, “folder,” and “storer” each take 30
minutes for their task
• Sequential laundry takes 8 hours for 4 loads of wash, while
pipelined laundry takes just 3.5 hours
• We show the pipeline stage of different loads over time
by showing copies of the four resources on this two-
dimensional timeline, but we really have just one of
each resource.
Pipelining
• Why is pipelining faster?
• It is faster for many loads as everything is working in parallel
thus more loads are finished per hour
• What does pipelining not do?
• It would not decrease the time to complete one load of
laundry
• Pipelining improves throughput of our laundry system.
• However, when we have many loads of laundry to do, the improvement in throughput decreases the total time to complete all the laundry
Pipelining
• MIPS instructions classically take five steps:
1. Fetch instruction from memory.
2. Decode the instruction and read registers
• The regular format of MIPS instructions allows reading and decoding to
occur simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
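These five steps overlap in a pipeline: instruction i occupies step s during cycle i + s (in an ideal pipeline with no hazards). A minimal sketch of the resulting timeline:

```python
# Conventional names for the five steps: IF, ID, EX, MEM, WB.
PIPE_STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timeline(n_instructions: int):
    """Return {cycle: [(instr, stage), ...]} for an ideal 5-stage pipeline."""
    table = {}
    for i in range(n_instructions):
        for s, name in enumerate(PIPE_STAGES):
            table.setdefault(i + s, []).append((i, name))
    return table

t = timeline(3)
# In cycle 2, instruction 0 is in EX, 1 in ID, 2 in IF: all three overlap.
assert t[2] == [(0, "EX"), (1, "ID"), (2, "IF")]
# The last of the 3 instructions writes back in cycle (3 - 1) + 4 = 6.
assert max(t) == 6
```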
Pipelining
Example
• Compare the average time between instructions of a single-
cycle implementation to a pipelined implementation
• Assumptions:
• The operation times for the major functional units are assumed to be:
• 200 ps for a memory access (instruction or data)
• 200 ps for an ALU operation
• 100 ps for a register file read or write
• In the single-cycle model, every instruction takes exactly one clock cycle, so the clock cycle must be stretched to accommodate the slowest instruction.
Pipelining
Example
• Single Cycle Design
• The single-cycle design must allow for the slowest instruction, which is lw, so the time required for every instruction is 800 ps
• The single-cycle design thus takes the worst-case clock cycle of 800 ps, even though some instructions can be as fast as 500 ps
Pipelining
Example
• Pipelined Execution
• The pipelined clock cycle must accommodate the worst-case stage time of 200 ps, even though some stages take only 100 ps.
• Thus, pipelining offers a fourfold improvement in the time between instructions (800 ps down to 200 ps)
• The time between the first and fourth instructions is 3 × 200 ps, or 600 ps
Pipelining
• What would happen if we increased the number of
instructions?
• Extend the previous figures to 1,000,003 instructions
• We would add 1,000,000 instructions in the pipelined example
• each instruction adds 200 ps to the total execution time
• Total execution = 1,000,000 × 200 ps + 1400 ps = 200,001,400 ps
• In the nonpipelined example
• we would add 1,000,000 instructions, each taking 800 ps
• Total execution time = 1,000,000 × 800 ps + 2400 ps =
800,002,400 ps
• Under these conditions
• the ratio of the total execution times, nonpipelined to pipelined, is 800,002,400 ps / 200,001,400 ps ≈ 4.00, close to the fourfold ratio of the clock cycles
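The arithmetic above can be checked with a short sketch (the 1400 ps and 2400 ps completion times for the first three instructions come from the lecture's figures):

```python
def total_time(n_extra, per_instr_ps, first_three_done_ps):
    """Total time when n_extra instructions follow the three in the figures."""
    return n_extra * per_instr_ps + first_three_done_ps

# Each added instruction costs one more cycle (200 ps) when pipelined,
# or one more full instruction time (800 ps) when nonpipelined.
pipelined    = total_time(1_000_000, 200, 1400)
nonpipelined = total_time(1_000_000, 800, 2400)
assert pipelined == 200_001_400 and nonpipelined == 800_002_400
print(f"speedup ≈ {nonpipelined / pipelined:.2f}")  # approaches 4.00
```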
Pipelining
• Pipelining improves performance by increasing
instruction throughput, as opposed to decreasing the
execution time of an individual instruction, but
instruction throughput is the important metric because
real programs execute billions of instructions.
Pipeline Hazards
• There are situations in pipelining when the next
instruction cannot be executed in the following clock
cycle
• These events are called hazards, and there are three
different types.
1. Structural hazards
2. Data hazards
3. Control hazards
Pipeline Hazards
Structural Hazard
• Structural hazard means that the hardware cannot
support the combination of instructions that we want to
execute in the same clock cycle
• For example,
• A structural hazard in the laundry room would occur if we used
a combined washer-dryer unit instead of a separate washer
and dryer machine
• or if our roommate was busy doing something else and
wouldn’t put clothes away
Pipeline Hazards
Structural Hazard
• Suppose we had a single memory unit (combined instruction and data memory)
• With a fourth instruction in the pipeline, in the same clock cycle the
• first instruction is accessing data from that memory, while the
• fourth instruction is fetching an instruction from the same memory
• Without two memories, our pipeline could have a structural hazard.
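A tiny sketch of when this conflict occurs, assuming an ideal five-stage timeline where fetch is stage 0 and the data-memory access is stage 3, and (pessimistically) that every instruction accesses data memory:

```python
# With a single shared memory, instruction fetch (stage 0) and data
# access (stage 3) collide whenever both happen in the same cycle.
# Instruction i occupies stage s in cycle i + s (ideal pipeline).
def fetch_mem_conflict_cycles(n_instructions):
    fetch  = {i + 0 for i in range(n_instructions)}   # IF cycles
    memacc = {i + 3 for i in range(n_instructions)}   # MEM cycles
    return sorted(fetch & memacc)

# The 4th instruction (index 3) fetches in cycle 3, exactly when the
# 1st instruction (index 0) accesses data memory: a structural hazard.
print(fetch_mem_conflict_cycles(5))  # -> [3, 4]
```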
Pipeline Hazards
Data Hazard
• Data hazards occur when the pipeline must be stalled
because one step must wait for another to complete
• For example,
• You found a sock at the folding station for which no match
existed. One possible strategy is to run down to your room
and search through your clothes bureau to see if you can find
the match
• Obviously, while you are doing the search, loads that have completed drying and are ready to fold, as well as those that have finished washing and are ready to dry, must wait
Pipeline Hazards
Data Hazard
• In a computer pipeline, data hazards arise from the
dependence of one instruction on an earlier one that is still in
the pipeline
• Problem
• An Add instruction followed immediately by a subtract instruction
that uses the sum ($s0):
add $s0, $t0, $t1
sub $t2, $s0, $t3
• Without intervention, a data hazard could severely stall the pipeline
• The add instruction doesn’t write its result until the fifth stage,
meaning that we would have to waste three clock cycles in the
pipeline.
Pipeline Hazards
Data Hazard
• Solution
• There is no need to wait for the complete instruction execution
• Adding extra hardware to retrieve the missing item early from
the internal resources is called forwarding or bypassing
• For the code sequence
add $s0, $t0, $t1
sub $t2, $s0, $t3
• As the ALU creates the sum for the add, we can supply it as an input for
the subtract
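A minimal sketch of the forwarding decision for this add/sub pair (register names as Python strings; the helper function and its signature are illustrative, not real hardware or a real API):

```python
# EX-to-EX forwarding sketch: if the previous instruction's ALU result
# register matches one of the current instruction's sources, take the
# operand from the forwarding path instead of the register file.
def pick_operand(src_reg, prev_dest, prev_alu_result, regfile):
    if prev_dest is not None and src_reg == prev_dest:
        return prev_alu_result            # forwarded, no stall needed
    return regfile[src_reg]               # normal register-file read

regs = {"$t0": 7, "$t1": 5, "$s0": 0, "$t3": 2}
# add $s0, $t0, $t1  -> ALU produces 12, destined for $s0
alu_out, dest = regs["$t0"] + regs["$t1"], "$s0"
# sub $t2, $s0, $t3  -> $s0 must come from the forwarding path,
# since the register file still holds the stale value 0
a = pick_operand("$s0", dest, alu_out, regs)
b = pick_operand("$t3", dest, alu_out, regs)
assert (a, b) == (12, 2) and a - b == 10  # sub sees the fresh sum
```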
Pipeline Hazards
Data Hazard
• Valid Case:
• Forwarding paths are valid only if the destination stage is later in time than
the source stage

• Invalid Case:
• There cannot be a valid forwarding path from the output of the memory
access stage in the first instruction to the input of the execution stage of
the following, since that would mean going backward in time
Pipeline Hazards
Data Hazard
• For the code sequence:
lw $s0, 20($t1)
sub $t2, $s0, $t3
• Suppose the first instruction were a load of $s0 instead of an add
• The data value is available only after the fourth (memory) stage, which is too late to forward to the next instruction’s execute stage
• Hence, even with forwarding, we would have to stall one stage for a load-use data hazard; the stall is officially called a pipeline stall, but is often given the nickname bubble.
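The load-use rule can be sketched as follows (a hypothetical helper; opcode strings and register names are illustrative):

```python
# Load-use hazard sketch: a load's data arrives after the MEM stage,
# one cycle too late for the next instruction's EX stage, so one
# bubble must be inserted even with forwarding.
def stalls_needed(producer_op, producer_dest, consumer_sources):
    if producer_dest in consumer_sources:
        return 1 if producer_op == "lw" else 0  # ALU results forward in time
    return 0

# lw $s0, 20($t1) followed by sub $t2, $s0, $t3 -> one bubble
assert stalls_needed("lw",  "$s0", ["$s0", "$t3"]) == 1
# add $s0, ... followed by the same sub -> forwarding avoids the stall
assert stalls_needed("add", "$s0", ["$s0", "$t3"]) == 0
```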
Pipeline Hazards
Control Hazards
• A control hazard arises when a decision must be made based on the result of one instruction while others are still executing
• For example, a branch instruction:
• The instruction after the branch must be fetched on the very next clock cycle
• But the pipeline cannot possibly know what the next instruction should be, since it has only just received the branch instruction from memory
• The decision still has to be made based on the operands of the branch instruction
• One solution is to stall immediately after fetching a branch,
• waiting until the pipeline determines the outcome of the branch and knows what instruction address to fetch from
Pipeline Hazards
Control Hazards
• If we cannot resolve the branch in the second stage, as is often the case for longer pipelines, then we’d see an even larger slowdown if we stall on branches
• Consequences:
• The cost of this option is too high for most computers to use
• Second Solution
• Predict: Computers do indeed use prediction to handle
branches
• One simple approach is to always predict that branches will be untaken
• When you’re right, the pipeline proceeds at full speed
• Only when branches are taken does the pipeline stall
Pipeline Hazards
Control Hazards
• A more sophisticated version of branch prediction would
have some branches predicted as taken and some as
untaken
• In programs, the branches at the bottom of loops jump back to the top of the loop
• Since such branches are likely to be taken and they branch backward, we could always predict taken for branches that jump to an earlier address
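This backward-taken, forward-not-taken heuristic can be sketched as follows (the addresses are illustrative):

```python
# Static prediction sketch: predict "taken" for branches that jump to
# an earlier address (loop back-edges), "not taken" for forward branches.
def predict_taken(branch_pc: int, target_pc: int) -> bool:
    return target_pc < branch_pc    # backward branch -> likely a loop

# A loop-closing branch at 0x0040 jumping back to the loop top at 0x0010
assert predict_taken(0x0040, 0x0010) is True
# A forward branch skipping ahead is predicted untaken
assert predict_taken(0x0040, 0x0080) is False
```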
Pipeline Hazards
Control Hazards
• Third solution: the delayed branch, which is actually used by the MIPS architecture to deal with branches.
• The delayed branch always executes the next sequential instruction, with
the branch taking place after that one instruction delay.
• It is hidden from the MIPS assembly language programmer because the assembler
can automatically arrange the instructions to get the branch behavior desired by
the programmer.
• MIPS software will place an instruction immediately after the delayed branch
instruction that is not affected by the branch, and a taken branch changes the
address of the instruction that follows this safe instruction.
• In our example, the add instruction before the branch in Figure 4.31 does not affect the branch and can be moved after the branch to fully hide the branch delay.
• Since delayed branches are useful only when the branch delay is short, no processor uses a delayed branch of more than one cycle.
• For longer branch delays, hardware-based branch prediction is usually used.
