MIPS Superscalar Simulator

The document describes a software simulator that was developed to model the behavior of a superscalar processor pipeline capable of fetching and committing two instructions per clock cycle. The simulator consists of an assembler, memory components, and a superscalar processor model. It simulates hazard detection, forwarding, stalling, reordering, cancellation, and branch prediction. Benchmarks are run on the simulator to verify functionality and measure performance in clock cycles per instruction between 0.7-1.1, showing accurate modeling.


Software Simulation and Modeling of a MIPS Superscalar Computer Architecture
Doo H. Kim, Matthew Zhu
Group # 02, CSCI 5593, 05/03/2016

A simulator was developed which models the behavior of a processor with a superscalar
pipeline that is capable of fetching and committing two instructions in a single clock cycle.
The software simulator consists of an assembler, memory components, and a superscalar
processor. Hazard detection, data forwarding, pipeline stalling, instruction reordering,
instruction cancellation, and a predict-not-taken branch scheme are simulated.
Benchmarks are assembled and executed using the simulator to verify functional
correctness and to assess timing. The simulator is found to execute the benchmarks
correctly, with a cycles per instruction ratio between 0.7 and 1.1.

Nomenclature
CPI    = Cycles Per Instruction
DE     = instruction DEcode
EX     = EXecute
funct  = function code
IF     = Instruction Fetch
MEM    = MEMory access
MIPS   = Microprocessor without Interlocked Pipeline Stages
NOP    = No OPeration
opcode = operation code
RAW    = Read After Write
rd     = destination register
rs     = source register
rt     = second source register / target register
shamt  = shift amount
VLIW   = Very Long Instruction Word
WAR    = Write After Read
WAW    = Write After Write
WB     = register WriteBack

I. Introduction

Computer architects have constantly strived to improve the performance of computers since their
advent. Modern computer architectures use methods such as superpipelining, superscalar pipelining, and VLIW
processing in order to enhance computational performance [1]. Exploring the design space of computer
architectures is difficult because hardware prototypes are time consuming and expensive to develop, and
analytical models tend to be inaccurate. Because of this, favorable designs are modeled in software and analysis is
done by simulation. Well-designed architectural simulators are flexible, parameterizable, and accurate, and have short
evaluation and development times. Simulators are used to evaluate various components of a computer architecture,
such as branch predictors, caches, and instruction pipelines.

CSCI 5593

Spring 2016

Figure 2. Various Simulator Organizations

Figure 1. MIPS Instruction Format

Simulators can be divided into two main categories: functional simulators and cycle-accurate simulators. The
purpose of a functional simulator is to validate the correctness of a design and to produce traces, that is,
sequences of instructions or memory addresses, for use in a cycle-accurate simulator component. Cycle-accurate
simulators are used to determine the practical instruction throughput of a processor. Simulators can also be
subdivided into user-level simulators and full-system simulators; full-system simulators account for the effects of
operating systems, whereas user-level simulators only account for application and system library code.
Execution-driven simulators consist of both functional and timing components [2]. How the two components
interact varies between simulators; various simulator organizations are given in Fig. 2. In order to achieve the goal
of flexibility, it is common for simulators to decouple their functional and timing components. In an integrated
simulator, changing something in one component may require changing something in the other component.
Simulator implementations often include assemblers and instruction parsers. Implementing these
components requires knowledge of the characteristics of the instruction set architecture. For this project, a MIPS
instruction set architecture is used, for which instructions are generally categorized into the three types depicted by Fig. 1.
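As a rough illustration of the three instruction types, the sketch below splits a 32-bit word into the fields listed in the nomenclature. The bit positions follow the standard MIPS32 encoding. The function name `decode`, the dictionary layout, and the simplification that every opcode other than 0, 2, and 3 is an I-type instruction with a sign-extended immediate are assumptions made for this example; they are not taken from the simulator.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into its fields (illustrative sketch)."""
    opcode = (word >> 26) & 0x3F               # bits 31-26
    if opcode == 0:                            # R-type: register operands plus funct
        return {
            "type": "R",
            "opcode": opcode,
            "rs": (word >> 21) & 0x1F,         # bits 25-21
            "rt": (word >> 16) & 0x1F,         # bits 20-16
            "rd": (word >> 11) & 0x1F,         # bits 15-11
            "shamt": (word >> 6) & 0x1F,       # bits 10-6
            "funct": word & 0x3F,              # bits 5-0
        }
    if opcode in (2, 3):                       # J-type: j and jal
        return {"type": "J", "opcode": opcode, "target": word & 0x03FFFFFF}
    # Assumption: everything else is treated as I-type with a sign-extended immediate.
    immediate = word & 0xFFFF
    if immediate & 0x8000:
        immediate -= 0x10000
    return {
        "type": "I",
        "opcode": opcode,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "immediate": immediate,
    }


add = decode(0x012A4020)    # add  $t0, $t1, $t2
addi = decode(0x2128FFFF)   # addi $t0, $t1, -1
jump = decode(0x08000010)   # j with a 26-bit target field of 0x10
```

An assembler performs the inverse mapping, packing the parsed fields into a word according to the instruction type.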

II. Related Work


Many simulators have been developed, not only for the variety of computer architectures in use but
also for theoretical and exploratory designs. Among these are SimpleScalar, the PTLsim family, and Simple-VLIW.
SimpleScalar is a well-known simulator used in academia that supports the Alpha instruction set and a custom PISA
instruction set. One of its limitations is that the instruction sets it uses are uncommon and require
outdated compilers.
The PTLsim family includes simulators that are comprehensive models for x86 instruction set architectures. The
original version of PTLsim, introduced in 2005, was single threaded and only accounted for userspace processor
features. Eventually, PTLsim/X was developed; it utilizes the Xen hypervisor to aid in full-system simulation
involving operating system features. With this version, it was determined that full-system simulation is essential,
as 15% of the CPU cycles for the execution of a particular benchmark were spent in kernel mode [3]. Simulators that
later emerged from the PTLsim family include MPTLsim and MARSS.
One of the few available simulators for VLIW architectures is Simple-VLIW. This simulator consists of an
assembler, locator, simulation engine, and its own benchmark suite tailored for VLIW architectures. Its overall
software organization is largely based on SimpleScalar [4]. Since VLIW architectures are effective in the field of
digital signal processing, benchmarks representing fast Fourier transforms and binary pattern recognition are
included in the suite.

Figure 3. Assembler Class Diagram

III. Implementation
There are three main components in the simulator built for this project: an assembler, a memory model, and a
superscalar processor. The class diagram for the assembler is shown in Fig. 3. At the start of the simulation, the path
of an assembly file is passed to a class designed to read a source file line by line. This class maintains the list of
instructions and a list of the subset of instructions that are labeled. It ultimately enables the simulator to
construct a list of simulated instruction objects from the instruction list that it returns. The instruction class
delegates line parsing tasks to an instruction parser in order to initialize its fields. The instruction parser
maintains a list of instructions with labels in order to obtain branch target addresses for branch instructions.
There is also a class which defines the three types of MIPS instructions as well as the valid operation and
function codes.

Figure 4. Memory Model
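The label bookkeeping performed by the assembler can be pictured with a short sketch. The version below reads source lines, strips comments, and records the instruction index that each label refers to so that a branch can later look up its target. The function name `resolve_labels`, the `#` comment syntax, and the use of instruction indices rather than byte addresses are assumptions made for illustration, not details of the project's assembler.

```python
def resolve_labels(lines):
    """Return (instructions, labels) where labels maps a name to an instruction index."""
    labels = {}
    instructions = []
    for line in lines:
        text = line.split("#", 1)[0].strip()     # drop comments and whitespace
        if not text:
            continue
        if ":" in text:                          # "label:" optionally followed by an instruction
            label, text = text.split(":", 1)
            labels[label.strip()] = len(instructions)
            text = text.strip()
            if not text:
                continue
        instructions.append(text)
    return instructions, labels


source = [
    "start: addi $1, $0, 2",
    "# a comment line",
    "beq $1, $0, done",
    "addi $1, $1, -1",
    "done:",
    "sw $1, 0($0)",
]
instructions, labels = resolve_labels(source)
```

With the labels resolved, a branch such as `beq $1, $0, done` can be given the index stored under `done` as its target.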
The MIPS architecture does not include an explicit NOP instruction, but one is necessary for initialization and
for cancelling instructions that were fetched after a mispredicted branch. A NOP can be represented by a shift left
logical from register zero to register zero with a shift amount of zero. The operation code, function code, and all
other fields of this instruction are zero.

Figure 5. Pipeline Stages
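The NOP encoding can be checked with a few lines of code. The sketch below packs R-type fields into a word and shows that a shift left logical of register zero by zero is the all-zero word. The helper name `encode_r_type` is hypothetical; the field positions and the function code 0 for shift left logical are those of the standard MIPS32 encoding.

```python
def encode_r_type(rs, rt, rd, shamt, funct):
    """Pack R-type fields (opcode 0) into a 32-bit instruction word."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


SLL_FUNCT = 0  # function code of shift left logical

# sll $zero, $zero, 0: every field is zero, so the whole word is zero.
nop = encode_r_type(rs=0, rt=0, rd=0, shamt=0, funct=SLL_FUNCT)

# For contrast, sll $t0, $t1, 2 uses rt=9, rd=8, shamt=2.
shift = encode_r_type(rs=0, rt=9, rd=8, shamt=2, funct=SLL_FUNCT)

print(f"{nop:#010x} {shift:#010x}")  # 0x00000000 0x00094080
```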
The memory model for the simulator, shown in Fig. 4, consists of main memory and a register file. Data can be
transferred between the two by load word and store word instructions. Load word instructions transfer data from
main memory to registers, whereas store word instructions transfer data from registers to main memory.
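A minimal sketch of such a memory model is given below. The class name `MemoryModel`, the memory size, word-aligned byte addressing, and the method names are assumptions chosen for the example. Register zero is kept at zero, as in the MIPS architecture.

```python
class MemoryModel:
    """Main memory plus a 32-entry register file (illustrative sketch)."""

    def __init__(self, num_words=1024):
        self.memory = [0] * num_words      # word-addressable backing store
        self.registers = [0] * 32

    def _index(self, address):
        # Assumption: byte addresses must be aligned to a 4-byte word.
        if address % 4 != 0:
            raise ValueError("unaligned word address")
        return address // 4

    def write_register(self, index, value):
        if index != 0:                     # register zero always reads as zero
            self.registers[index] = value & 0xFFFFFFFF

    def load_word(self, rt, base, offset):
        """lw rt, offset(base): main memory -> register."""
        address = self.registers[base] + offset
        self.write_register(rt, self.memory[self._index(address)])

    def store_word(self, rt, base, offset):
        """sw rt, offset(base): register -> main memory."""
        address = self.registers[base] + offset
        self.memory[self._index(address)] = self.registers[rt]


model = MemoryModel()
model.write_register(2, 8)       # base address
model.write_register(3, 10)      # value to store
model.store_word(3, 2, 4)        # memory word at byte address 12 receives 10
model.load_word(4, 2, 4)         # register 4 receives the same word back
```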
On each clock cycle, instructions advance to the next pipeline stage, and the pipeline stages are
processed in order from WB to IF. This concept is demonstrated by Fig. 5, which highlights the pertinent stages for a
single cycle. Figure 6 illustrates the forwarding model for the simulator pipeline. Instructions that are fetched three
or more cycles apart do not require forwarding, since registers are written in the first half of a cycle in WB and read
in the second half of a cycle in DE. This behavior is intrinsic to the simulator, as it processes WB before it processes
DE for any given cycle. If an instruction has a true dependence on another instruction, then the two instructions must
not travel through the pipeline simultaneously, since it is impossible to forward between two instructions that
traverse the same stages in the same cycle. If there is a true dependence and the instructions are one or two cycles
apart, then forwarding can be done from EX to EX or from MEM to EX.
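The forwarding rules above can be summarized as a function of how many cycles separate a producer from a dependent consumer. In the sketch below, the function name `forwarding_path` and the returned labels are hypothetical. The mapping of a one-cycle gap to EX-to-EX forwarding and a two-cycle gap to MEM-to-EX forwarding is the conventional reading for a five-stage pipeline and is assumed here rather than stated explicitly in the text.

```python
def forwarding_path(cycles_apart):
    """Classify how a true (RAW) dependence is satisfied, given the fetch distance."""
    if cycles_apart < 0:
        raise ValueError("the producer must be fetched no later than the consumer")
    if cycles_apart == 0:
        # Same fetch group: no forwarding is possible, so the pair must be split.
        return "issue separately"
    if cycles_apart == 1:
        return "EX->EX"        # assumed: result taken from the producer's EX output
    if cycles_apart == 2:
        return "MEM->EX"       # assumed: result taken from the producer's MEM output
    # Three or more cycles: WB has already written the register before DE reads it.
    return "register file"


paths = [forwarding_path(distance) for distance in range(5)]
```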
Instructions that load a word from memory must be accounted for in data hazard detection and forwarding
implementation. If an instruction has a data dependence on another instruction in the previous cycle which loads
data from memory, then it is cancelled when it reaches the memory stage and the pipeline is stalled for a cycle. This
is because it is impossible to forward from an instruction that reads from memory, as it does not have the data to
forward until after it has passed the memory stage.
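The load-use condition can be expressed as a small predicate. This is only a sketch: the `Instr` tuple, its field names, and the function name `needs_load_stall` are hypothetical, and the check covers just the case described above, a consumer fetched one cycle after a load that writes one of its source registers.

```python
from collections import namedtuple

# Hypothetical minimal instruction record: mnemonic, destination, source registers.
Instr = namedtuple("Instr", "op dest sources")


def needs_load_stall(older, younger, cycles_apart):
    """True if `younger` must wait a cycle for data that `older` loads from memory."""
    return (
        older.op == "lw"
        and cycles_apart == 1
        and older.dest != 0                 # register zero never carries a dependence
        and older.dest in younger.sources
    )


load = Instr("lw", 7, (2,))                 # lw  r7, 0(r2)
add = Instr("add", 9, (7, 8))               # add r9, r7, r8
addi = Instr("addi", 7, (2,))               # addi r7, r2, 1
```

A dependent instruction that follows an arithmetic producer such as `addi` does not need the stall, because the result is already available for forwarding after EX.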
Figure 7 is the class diagram for the superscalar processor of the simulator. Simulated instructions are wrappers
constructed with instruction objects from the assembler. In addition to the fields of the original instruction object,
simulated instructions also keep track of data dependences with a reference to an instance of a forward class as well
as calculated effective addresses, branch conditions,
whether or not they have been reordered, and buffered
register values.
All of the pipeline stages share a common superclass
which specifies that each pipeline stage maintains
references to instructions that are currently in the stage.
Each stage has a function called process, which performs
the functionality of the stage. In addition, the fetch stage
uses private functions to reorder instructions in its
instruction window and ensure that there are no hazards
before reordering instructions.
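The shared stage superclass and the WB-to-IF processing order can be sketched as follows. The class names `Stage` and `Pipeline`, the methods `advance` and `tick`, and the use of plain strings as instructions are assumptions for illustration. The stages here only pass instructions along, without hazards, stalls, or reordering.

```python
class Stage:
    """Common base: every stage holds the instructions currently inside it."""

    def __init__(self, name):
        self.name = name
        self.instructions = []          # at most two in a two-wide pipeline

    def process(self, pipeline):
        # Hypothetical default behavior: hand the instructions to the next stage.
        pipeline.advance(self)


class Pipeline:
    ORDER = ["IF", "DE", "EX", "MEM", "WB"]

    def __init__(self, program):
        self.program = list(program)
        self.stages = {name: Stage(name) for name in self.ORDER}
        self.retired = []

    def advance(self, stage):
        position = self.ORDER.index(stage.name)
        if position == len(self.ORDER) - 1:
            self.retired.extend(stage.instructions)       # WB commits
        else:
            successor = self.stages[self.ORDER[position + 1]]
            successor.instructions = stage.instructions
        stage.instructions = []

    def tick(self):
        # Process WB first and IF last, so each stage is emptied before it is refilled.
        for name in reversed(self.ORDER):
            self.stages[name].process(self)
        self.stages["IF"].instructions = self.program[:2]  # fetch up to two
        del self.program[:2]


pipeline = Pipeline(["i1", "i2", "i3", "i4"])
for _ in range(5):
    pipeline.tick()
in_writeback = list(pipeline.stages["WB"].instructions)
for _ in range(2):
    pipeline.tick()
```

Processing the stages in reverse order means that no extra buffering is needed between stages: by the time a stage receives new instructions, its previous occupants have already moved on.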
Figure 6. Pipeline Forwarding Model

The simulator also maintains a hazard list; at the end of every clock cycle, the most recently fetched instructions
are added to the end of the list, as shown in Fig. 8. This list is used in the decode stage to determine whether
there is a RAW hazard between the instructions in the stage and the instructions that passed through the stage within
the previous two cycles. If there is, a forwarding flag maintained by the simulated instruction is set so that it
obtains forwarded data at the beginning of the execution stage.
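A possible shape for the hazard list is sketched below. The class name `HazardList`, the use of a bounded `deque` to keep only the last two fetch groups, and the representation of each instruction by its destination register number are assumptions made for the example; the simulator's own list may be organized differently.

```python
from collections import deque


class HazardList:
    """Remembers destination registers of the most recent fetch groups."""

    def __init__(self, window=2):
        # Assumption: only the last `window` cycles matter for forwarding.
        self.groups = deque(maxlen=window)

    def end_of_cycle(self, fetched_destinations):
        """Append the destinations of the instructions fetched this cycle."""
        self.groups.append(list(fetched_destinations))

    def raw_hazard(self, sources):
        """Return how many cycles back the nearest producer is, or None."""
        for cycles_back, group in enumerate(reversed(self.groups), start=1):
            for destination in group:
                if destination != 0 and destination in sources:
                    return cycles_back
        return None


hazards = HazardList()
hazards.end_of_cycle([2, 3])            # two instructions writing r2 and r3
hazards.end_of_cycle([4])               # one instruction writing r4
one_back = hazards.raw_hazard({4})
two_back = hazards.raw_hazard({2})
hazards.end_of_cycle([5])               # the group writing r2 and r3 falls out of the window
expired = hazards.raw_hazard({2})
```

In this sketch, a non-`None` result is what would set the forwarding flag, and the distance it returns selects the forwarding source.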

Figure 7. Processor Class Diagram

Figure 8. Hazard Detection

IV. Results and Analysis


To validate functional correctness and obtain timing characteristics of the simulator, specific sequences of
instructions are added to benchmarks that are executed by the simulator. The simulator is considered to be
functionally correct only if it arrives at the expected outcome once a benchmark execution is completed. Figure 9
shows one benchmark on the left and its data dependence graph on the right. For this particular benchmark, a
sequential execution of the instructions leads to a final value of 2 in the r1 register. If any of the dependences
leading to the final value of r1 is not observed, then the result is potentially incorrect. Two instructions are
fetched at each cycle until data dependences are encountered.

Figure 9. Assembly Benchmark and Dependence Graph

The first data dependence encountered during execution is between the store word instruction and the add
immediate instruction. The two instructions cannot enter the pipeline simultaneously, since the value in r2 must
be updated before the store word instruction determines its effective address. As indicated by the graph, instruction
5 must write after instruction 4 reads, but it is still possible to fetch instruction 5 in the cycle before
instruction 4, since instruction 4 will perform its decode while instruction 5 is in its execute stage and has not yet
written back to a register. Once the add immediate instruction is moved before the two store word instructions, the
processor must use the original value of 10 for r3 for the first store word instruction and the new value of 30 for r3
for the second store word instruction. In addition, forwarding needs to occur between the reordered add immediate
instruction and the second store word instruction.
After the store word instructions, a branch is encountered. No other instruction may enter the pipeline in the
same cycle as a branch instruction. The first branch is not taken, and since the simulator predicts branches as not
taken, execution proceeds normally. There is a data dependence between instructions 10 and 11, so instruction 10
enters the pipeline alone. The second branch condition evaluates to true, so the branch is mispredicted and the
instructions in the preceding two stages must be converted into NOPs. If the instructions between the branch and the
target were not skipped, the final result would not match the expected value from executing the instructions
sequentially. The load word instruction on line 16 needs to forward from MEM to EX for the add instruction on line 19.
The final instruction depends on r9, which in turn depends on the load instruction having correctly forwarded its
value for r7. Upon execution of the benchmark, the simulation completed with the expected final value of 2 in r1 and a
CPI of 0.9. This verifies that the processor model of the simulator is capable of forwarding and reordering
instructions in a functionally correct manner.
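The cancellation step for a mispredicted branch can be illustrated with a short sketch. It assumes that the branch outcome is known while the branch is in EX, so that the two preceding stages are DE and IF; the text does not state the resolving stage explicitly. The function name `resolve_branch`, the dictionary of stages, and the string `"nop"` are likewise hypothetical.

```python
NOP = "nop"


def resolve_branch(stages, taken, target_pc, fallthrough_pc):
    """Apply a predict-not-taken policy and return the next fetch address.

    `stages` maps "IF" and "DE" to the lists of instructions fetched after the branch.
    """
    if not taken:
        # Prediction was correct: younger instructions stay in the pipeline.
        return fallthrough_pc
    # Misprediction: everything fetched after the branch becomes a NOP.
    for name in ("IF", "DE"):
        stages[name] = [NOP] * len(stages[name])
    return target_pc


not_taken = {"IF": ["i12", "i13"], "DE": ["i11"]}
next_pc_a = resolve_branch(not_taken, taken=False, target_pc=64, fallthrough_pc=44)

mispredicted = {"IF": ["i15", "i16"], "DE": ["i14"]}
next_pc_b = resolve_branch(mispredicted, taken=True, target_pc=64, fallthrough_pc=56)
```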
Other benchmarks are also executed in order to assess the timing of the simulated processor. Multiple
benchmarks were executed, with resulting CPI values between 0.7 and 1.1. This is a reasonable result for a superscalar
processor that ideally fetches and commits two instructions per cycle, which corresponds to an ideal CPI of 0.5.
Because of branch mispredictions and data dependences, it is not always possible to achieve this ideal. In general,
the simulated superscalar processor exceeds the ideal performance of a scalar pipelined processor, whose CPI is at
best 1.0.
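The CPI figures quoted above follow from a simple ratio of cycles to committed instructions. The sketch below shows the arithmetic; the cycle and instruction counts are hypothetical values chosen only to reproduce a CPI of 0.9 and are not measurements from the project.

```python
def cpi(cycles, instructions):
    """Cycles per instruction for a completed run."""
    if instructions <= 0:
        raise ValueError("at least one instruction must be committed")
    return cycles / instructions


IDEAL_SCALAR_CPI = 1.0            # one instruction committed per cycle
IDEAL_DUAL_ISSUE_CPI = 1.0 / 2    # two instructions committed per cycle

# Hypothetical counts: 100 instructions committed in 90 cycles.
measured = cpi(cycles=90, instructions=100)

# A value above 1 means the run beat an ideal scalar pipeline.
speedup_over_ideal_scalar = IDEAL_SCALAR_CPI / measured
```

A run at the upper end of the reported range, with a CPI of 1.1, would give a ratio below 1 and would therefore fall short of the ideal scalar pipeline.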

V. Conclusion
Superscalar processor simulators are complex to design and implement. Each pipeline stage has a distinct role,
but must coordinate with other pipeline stages in order to correctly simulate behaviors such as data forwarding,
branch prediction, instruction cancelling, instruction reordering, hazard detection, and pipeline stalling. These
behaviors must be accurately simulated in order for the execution of the benchmarks to produce expected results.
The simulator developed in this project was functionally validated and timing assessments were made by executing
various benchmarks. Although the simulated superscalar processor does not achieve an ideal value of 0.5 for the
CPI, it generally outperforms an ideal scalar pipelined processor.

References
[1] J. L. Gaudiot et al., "Techniques to Improve Performance Beyond Pipelining: Superpipelining, Superscalar, and VLIW",
Advances in Computers, vol. 63, 2005, pp. 1-34.
[2] L. Eeckhout, "Simulation", in Computer Architecture Performance Evaluation Methods, San Rafael, Morgan & Claypool,
2010, ch. 5, pp. 49-62.
[3] M. T. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator", IEEE Int. Symp.
Performance Analysis of Systems & Software (ISPASS), San Jose, CA, 2007, pp. 23-34.
[4] P. Wang et al., "Simple-VLIW: A Fundamental VLIW Architectural Simulation Platform", IEEE Asia Simulation
Conference. System Simulation and Scientific Computing (ICSC), Beijing, 2008, pp. 1258-1266.
[5] J. J. Yi et al., "Simulation of Computer Architectures: Simulators, Benchmarks, Methodologies, and Recommendations",
IEEE Transactions on Computers, vol. 55, 2006, pp. 268-280.
[6] D. A. Penry, "A Single-Specification Principle for Functional-to-Timing Simulator Interface Design", IEEE Int. Symp.
Performance Analysis of Systems & Software (ISPASS), Austin, TX, 2011, pp. 186-196.
[7] H. Zeng et al., "MPTLsim: A Cycle-Accurate, Full-System Simulator for x86-64 Multicore Architectures with Coherent
Caches", ACM SIGARCH Computer Architecture News, vol. 37, 2009, pp. 2-9.
[8] A. Patel et al., "MARSSx86: A Full System Simulator for x86 CPUs", Dept. of Computer Science, State University of
New York at Binghamton, 2011.
[9] Miloš Bečvář and Stanislav Kahánek, "VLIW-DLX Simulator for Educational Purposes", WCAE '07 Proceedings of the
2007 Workshop on Computer Architecture Education, 2007, pp. 8-13.
[10] J. M. Colmenar et al., "An Overview of Computer Architecture and System Simulation", SCS M&S Magazine, 2011.
[11] "MIPS Reference Sheet", www-inst.eecs.berkeley.edu, 2016. [Online]. Available:
http://www-inst.eecs.berkeley.edu/~cs61c/resources/MIPS_help.html.
[12] "MIPS Assembly - Wikibooks, open books for an open world", En.wikibooks.org, 2016. [Online]. Available:
https://en.wikibooks.org/wiki/MIPS_Assembly.
[13] "CS161: MIPS Instruction Reference", Alumni.cs.ucr.edu, 2016. [Online]. Available:
http://alumni.cs.ucr.edu/~vladimir/cs161/mips.html.
[14] "NOP", Wikipedia, 2016. [Online]. Available: https://en.wikipedia.org/wiki/NOP.
