Pipelining - Part1
An Overview of Pipelining
• Pipelining is an implementation technique in which multiple
instructions are overlapped in execution.
• Anyone who has done a lot of laundry has intuitively used pipelining.
The non-pipelined approach to laundry would be as follows:
1. Place one dirty load of clothes in the washer.
2. When the washer is finished, place the wet load in the dryer.
3. When the dryer is finished, place the dry load on a table and fold.
4. When folding is finished, ask your roommate to put the clothes
away.
• The pipelined approach takes much less time, as Figure 4.25 shows.
An Overview of Pipelining
• All steps—called stages in pipelining—are operating concurrently.
• As long as we have separate resources for each stage, we can pipeline
the tasks.
An Overview of Pipelining
• If all the stages take about the same amount of time and there is
enough work to do,
• then the speed-up due to pipelining is equal to the number of stages in the
pipeline, in this case four: washing, drying, folding, and putting away.
• Therefore, pipelined laundry is potentially four times faster than
nonpipelined:
• 20 loads would take about 5 times as long as 1 load (with pipelining), while 20
loads of sequential laundry take 20 times as long as 1 load.
An Overview of Pipelining
• Notice that at the beginning and end of the workload in the pipelined
version in Figure 4.25,
• the pipeline is not completely full;
• this start-up and wind-down affects performance when the number of tasks is
not large compared to the number of stages in the pipeline.
• If the number of loads is much larger than 4, then the stages will be full most
of the time and the increase in throughput will be very close to 4, as the sketch
below illustrates.
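To make the laundry arithmetic concrete, here is a minimal sketch in Python (the function names are illustrative), assuming four equal-length stages measured in generic stage-times:

```python
# Sketch: sequential vs. pipelined laundry time, assuming equal-length stages.
def sequential_time(loads, stages=4, stage_time=1):
    # Each load must finish all four stages before the next load starts.
    return loads * stages * stage_time

def pipelined_time(loads, stages=4, stage_time=1):
    # The first load takes `stages` stage-times; each later load finishes
    # one stage-time after the previous one.
    return (stages + loads - 1) * stage_time

one_load = sequential_time(1)                    # 4 stage-times for a single load
print(sequential_time(20) / one_load)            # 20.0  -> 20 times as long as 1 load
print(pipelined_time(20) / one_load)             # 5.75  -> "about 5 times" as long
print(sequential_time(20) / pipelined_time(20))  # ~3.5  -> approaches 4 as loads grow
```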
Additional Stages (other processors)
• Fetch instruction (FI)
  • Read the next expected instruction into a buffer
• Decode instruction (DI)
  • Determine the opcode and the operand specifiers
• Calculate operands (CO)
  • Calculate the effective address of each source operand
  • This may involve displacement, register indirect, indirect, or other forms of
    address calculation
• Fetch operands (FO)
  • Fetch each operand from memory
  • Operands in registers need not be fetched
• Execute instruction (EI)
  • Perform the indicated operation
• Write operand (WO)
  • Store the result in memory
An Overview of Pipelining
• MIPS instructions classically take five steps:
1. Fetch instruction from memory.
2. Read registers while decoding the instruction.
• The regular format of MIPS instructions allows reading and decoding to occur
simultaneously.
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.
• Hence, the MIPS pipeline we explore has five stages.
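As a rough illustration of how these five steps overlap, the sketch below prints which stage each of three instructions occupies in each clock cycle. It is a simplified model that ignores hazards; the stage names IF, ID, EX, MEM, WB are the conventional abbreviations for the five steps listed above.

```python
# Sketch: cycle-by-cycle stage occupancy of an ideal five-stage pipeline.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def print_occupancy(num_instructions):
    total_cycles = len(STAGES) + num_instructions - 1
    for i in range(num_instructions):
        # Instruction i enters IF in cycle i and advances one stage per cycle.
        row = ["    "] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:<4}"
        print(f"instr {i}: " + " ".join(row))

print_occupancy(3)   # three instructions finish in 5 + 3 - 1 = 7 cycles
```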
Single-Cycle versus Pipelined Performance
• The operation times for the major functional units in MIPS are 200 ps for
memory access, 200 ps for ALU operation, and 100 ps for register file read or
write.
• In the single-cycle model, every instruction takes exactly one clock cycle,
• so the clock cycle must be stretched to accommodate the slowest instruction.
• Figure 4.26 shows the time required for each of the eight instructions.
• The single-cycle design must allow for the slowest instruction; in Figure 4.26
it is lw.
• So the time required for every instruction is 800 ps.
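Using the unit times quoted above (200 ps memory access, 200 ps ALU, 100 ps register read or write), the per-class totals behind Figure 4.26 can be reconstructed as below. Only the 800 ps lw total is quoted in the text, so the other sums are illustrative and should be checked against the figure.

```python
# Sketch: single-cycle instruction times built from the functional-unit latencies (in ps).
MEM, ALU, REG = 200, 200, 100

times = {
    "lw":       MEM + REG + ALU + MEM + REG,  # fetch + reg read + address + data access + reg write = 800
    "sw":       MEM + REG + ALU + MEM,        # 700: no register write
    "R-format": MEM + REG + ALU + REG,        # 600: no data memory access
    "beq":      MEM + REG + ALU,              # 500: no data access, no register write
}
print(times)
print("single-cycle clock must cover the slowest:", max(times.values()), "ps")  # 800 ps (lw)
```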
Single-Cycle versus Pipelined Performance
• Figure 4.27 compares non-pipelined and pipelined execution of three
load word instructions.
• All the pipeline stages take a single clock cycle,
• so the clock cycle must be long enough to accommodate the slowest
operation.
• Just as the single-cycle design must take the worst-case clock cycle of 800 ps,
even though some instructions can be as fast as 500 ps,
• the pipelined execution clock cycle must have the worst-case clock cycle of 200 ps,
even though some stages take only 100 ps.
Single-Cycle versus Pipelined Performance
• Thus, the time between the first and fourth instructions in the nonpipelined
design is 3 × 800 ps, or 2400 ps.
• Pipelining still offers a fourfold performance improvement:
• the time between the first and fourth instructions is 3 × 200 ps, or 600 ps (see
the sketch below).
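A quick check of the 2400 ps and 600 ps figures, multiplying the instruction-to-instruction interval of each design by the three gaps between the first and fourth instructions:

```python
# Sketch: time between the 1st and 4th instruction in each design (in ps).
gaps = 3                      # three intervals separate instructions 1 and 4
single_cycle_interval = 800   # worst-case instruction time
pipelined_interval = 200      # worst-case stage time, i.e. the pipeline clock cycle

print(gaps * single_cycle_interval)  # 2400 ps, non-pipelined
print(gaps * pipelined_interval)     #  600 ps, pipelined
```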
• We can turn the pipelining speed-up discussion above into a formula. If the
stages are perfectly balanced,
• then the time between instructions on the pipelined processor—assuming ideal
conditions—is equal to:

Time between instructions (pipelined) = Time between instructions (nonpipelined) / Number of pipe stages
Single-Cycle versus Pipelined Performance
• Under ideal conditions and with a large number of instructions, the speed-up
from pipelining is approximately equal to the number of pipe stages;
• a five-stage pipeline is nearly five times faster.
Single-Cycle versus Pipelined Performance
• The example shows, however, that the stages may be imperfectly balanced.
• Moreover, pipelining involves some overhead.
• Thus,
• the time per instruction in the pipelined processor will exceed the minimum possible, and
• speed-up will be less than the number of pipeline stages.
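Plugging the numbers from this example into the formula shows both effects at once: a perfectly balanced five-stage pipeline would give an interval of 800 / 5 = 160 ps, but the 200 ps memory and ALU stages set the real clock, so the achieved speed-up is 4 rather than 5. A minimal sketch:

```python
# Sketch: ideal vs. achieved time between instructions for the five-stage MIPS pipeline.
nonpipelined_interval = 800   # ps, single-cycle clock
stages = 5
slowest_stage = 200           # ps, memory access and ALU dominate

ideal_interval = nonpipelined_interval / stages   # 160 ps if stages were perfectly balanced
actual_interval = slowest_stage                   # 200 ps in practice
print(ideal_interval, actual_interval)
print("speed-up:", nonpipelined_interval / actual_interval, "out of", stages, "stages")  # 4.0 out of 5
```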
Single-Cycle versus Pipelined Performance
• Pipelining improves performance by
• increasing instruction throughput, as opposed to decreasing the execution
time of an individual instruction.
• But instruction throughput is the important metric because real
programs execute billions of instructions.
Pipelining
• Pipelining increases the number of simultaneously executing
instructions and the rate at which instructions are started and
completed.
• Pipelining does not reduce the time it takes to complete an individual
instruction, also called the latency.
• For example,
• the five-stage pipeline still takes 5 clock cycles for the instruction to
complete.
• Pipelining improves instruction throughput rather than individual
instruction execution time or latency.
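The latency/throughput distinction can be put in numbers: in the ideal (hazard-free) five-stage pipeline every instruction still needs 5 cycles from fetch to write-back, but once the pipeline is full one instruction completes per cycle. A small sketch under those ideal assumptions:

```python
# Sketch: latency stays fixed while throughput approaches one instruction per cycle.
stages = 5

def total_cycles(n):
    # Ideal pipeline: `stages` cycles to fill, then one completion per cycle.
    return stages + n - 1

for n in (1, 10, 1000):
    print(f"{n:>5} instructions: latency = {stages} cycles each, "
          f"throughput = {n / total_cycles(n):.3f} instructions/cycle")
```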
Speed up
Consider the execution of m tasks (instructions) on an n-stage (n-unit) pipeline.
n + m - 1 time units are required to complete the m tasks, versus n × m time units
without pipelining.
Speed-up S = (n × m) / (n + m - 1)
Throughput
Throughput = m / (n + m - 1) tasks completed per time unit
Efficiency
Efficiency E = S / n = m / (n + m - 1)
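The three quantities above can be computed directly from n and m; a minimal sketch assuming one time unit per stage:

```python
# Sketch: speed-up, throughput, and efficiency of an n-stage pipeline running m tasks.
def pipeline_metrics(n, m):
    pipelined = n + m - 1         # time units with pipelining
    nonpipelined = n * m          # time units without pipelining
    speedup = nonpipelined / pipelined
    throughput = m / pipelined    # tasks completed per time unit
    efficiency = speedup / n      # fraction of the ideal n-fold speed-up achieved
    return speedup, throughput, efficiency

print(pipeline_metrics(5, 1000))  # speed-up approaches 5 and efficiency approaches 1 as m grows
```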
Designing Instruction Sets for Pipelining
• Even with this simple explanation of pipelining, we can get insight
into the design of the MIPS instruction set, which was designed for
pipelined execution.
Designing Instruction Sets for Pipelining
• First
• All MIPS instructions are of the same length.
• This restriction makes it much easier to fetch instructions in the first pipeline stage and
to decode them in the second stage.
• In an instruction set like the x86, where instructions vary from 1 byte to 15 bytes,
pipelining is considerably more challenging.
• Recent implementations of the x86 architecture actually translate x86 instructions into simple
operations that look like MIPS instructions and then pipeline the simple operations rather
than the native x86 instructions!
Designing Instruction Sets for Pipelining
• Second
• MIPS has only a few instruction formats, with the source register fields being
located in the same place in each instruction.
• This symmetry means that the second stage can begin reading the register
file at the same time that the hardware is determining what type of
instruction was fetched.
• If MIPS instruction formats were not symmetric, we would need to split stage
2, resulting in six pipeline stages.
Designing Instruction Sets for Pipelining
• Third
• Memory operands only appear in loads or stores in MIPS.
• This restriction means we can use the execute stage to calculate the memory
address and then access memory in the following stage.
• If we could operate on the operands in memory, as in the x86, stages 3 and 4
would expand to an address stage, memory stage, and then execute stage.
Designing Instruction Sets for Pipelining
• Fourth
• Operands must be aligned in memory.
• Hence, we need not worry about a single data transfer instruction requiring two data
memory accesses;
• the requested data can be transferred between processor and memory in a single
pipeline stage.