Single-Cycle versus Pipelined Performance
To make this discussion concrete, let's create a pipeline. In this example, and in
the rest of this chapter, we limit our attention to seven instructions: load word
(lw), store word (sw), add (add), subtract (sub), AND (and), OR (or), and
branch if equal (beq).
Compare the average time between instructions of a single-cycle
implementation, in which all instructions take one clock cycle, with that of a pipelined
implementation. Assume that the operation times for the major functional
units in this example are 200 ps for memory access for instructions or data,
200 ps for ALU operation, and 100 ps for register file read or write. In the
single-cycle model, every instruction takes exactly one clock cycle, so the
clock cycle must be stretched to accommodate the slowest instruction.
Figure 4.28 shows the time required for each of the seven instructions. The
single-cycle design must allow for the slowest instruction (lw in Figure 4.28),
so the time required for every instruction is 800 ps. As in
Figure 4.27, Figure 4.29 compares nonpipelined and pipelined execution of
three load register instructions. In the nonpipelined design, each instruction
starts 800 ps after the previous one, so the time between the first and fourth
instructions is 3 × 800 ps, or 2400 ps.
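The single-cycle arithmetic can be checked with a short sketch. The latencies come from the text and the per-instruction unit usage follows Figure 4.28; the names (LATENCY_PS, UNITS_USED, instruction_time_ps, single_cycle_clock_ps) are hypothetical, chosen only for illustration, and the zero-delay assumption for multiplexors and control is carried over from the figure caption.

```python
# Functional-unit latencies in picoseconds, as stated in the text.
LATENCY_PS = {
    "instruction_fetch": 200,  # memory access for the instruction
    "register_read": 100,
    "alu": 200,
    "data_access": 200,        # memory access for data
    "register_write": 100,
}

# Units each instruction class uses, following Figure 4.28.
# The dictionary keys are illustrative labels, not names from the source.
UNITS_USED = {
    "lw": ["instruction_fetch", "register_read", "alu",
           "data_access", "register_write"],
    "sw": ["instruction_fetch", "register_read", "alu", "data_access"],
    "r_format": ["instruction_fetch", "register_read", "alu",
                 "register_write"],
    "beq": ["instruction_fetch", "register_read", "alu"],
}


def instruction_time_ps(name):
    """Sum the latencies of the units one instruction class passes through."""
    return sum(LATENCY_PS[unit] for unit in UNITS_USED[name])


def single_cycle_clock_ps():
    """The single-cycle clock must fit the slowest instruction class."""
    return max(instruction_time_ps(name) for name in UNITS_USED)


totals = {name: instruction_time_ps(name) for name in UNITS_USED}
clock = single_cycle_clock_ps()
gap_first_to_fourth = 3 * clock  # three clock periods separate instructions 1 and 4

print(totals)
print(clock, gap_first_to_fourth)
```

Under these assumptions the slowest class is lw, so the clock period equals its total and every other instruction wastes the difference.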
All the pipeline stages take a single clock cycle, so the clock cycle must be
long enough to accommodate the slowest operation. Just as the single-cycle
design must take the worst-case clock cycle of 800 ps, even though some
instructions can be as fast as 500 ps, the pipelined execution clock cycle must
have the worst-case clock cycle of 200 ps, even though some stages take only
100 ps. Pipelining still offers a fourfold performance improvement: the time
between the first and fourth instructions is 3 × 200 ps, or 600 ps.
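The fourfold figure can be reproduced with a minimal sketch, assuming an ideal pipeline with no stalls and one stage per functional unit. The stage times are those given in the text; the names (STAGE_PS, pipelined_clock_ps, start_gap_ps) are hypothetical.

```python
# Stage times in picoseconds: instruction fetch, register read, ALU,
# data access, register write (values from the text).
STAGE_PS = [200, 100, 200, 200, 100]


def pipelined_clock_ps(stages):
    """Every stage takes one clock cycle, so the clock fits the slowest stage."""
    return max(stages)


def start_gap_ps(clock_ps, n_instructions):
    """Time from the start of the first instruction to the start of the n-th,
    assuming one instruction starts per clock cycle (no stalls)."""
    return (n_instructions - 1) * clock_ps


# Single-cycle clock: lw passes through all five units, so it sets the period.
single_clock = sum(STAGE_PS)
pipe_clock = pipelined_clock_ps(STAGE_PS)

single_gap = start_gap_ps(single_clock, 4)
pipe_gap = start_gap_ps(pipe_clock, 4)
speedup = single_gap / pipe_gap

print(single_clock, pipe_clock)
print(single_gap, pipe_gap, speedup)
```

The speedup is the ratio of the two clock periods, which is why it stays at four rather than five: the 100 ps register stages are padded out to the 200 ps clock.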
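Instruction class            | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time
-----------------------------|-------------------|---------------|---------------|-------------|----------------|-----------
Load word (lw)               | 200 ps            | 100 ps        | 200 ps        | 200 ps      | 100 ps         | 800 ps
Store word (sw)              | 200 ps            | 100 ps        | 200 ps        | 200 ps      |                | 700 ps
R-format (add, sub, and, or) | 200 ps            | 100 ps        | 200 ps        |             | 100 ps         | 600 ps
Branch (beq)                 | 200 ps            | 100 ps        | 200 ps        |             |                | 500 ps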
FIGURE 4.28 Total time for each instruction calculated from the time for each component.
This calculation assumes that the multiplexors, control unit, PC accesses, and sign extension unit have no
delay.