Design and Implementation of a 32-bit ISA RISC-V
Abstract—In this paper, a 32-bit datapath based on the RISC-V RV32I instruction set architecture has been designed. Furthermore, through analysis of the function and theory of the RISC-V instruction set, the processor has been optimized by designing a six-stage folded pipeline core. Finally, the design has been examined on current industry-standard FPGA boards. Synthesis and implementation on two boards of the Virtex family have been compared on the basis of utilization reports and power reports. The comparison shows that the more heavily configurable board, i.e. the Virtex UltraScale, delivers a better power-performance trade-off at a higher cost.

I. INTRODUCTION
RISC-V is an open-source ISA that enables a new world of processor design through open collaboration, creating a pool of technology. It offers a 32- or 64-bit address space and consists of a small core instruction set and a compressed ISA, designed for special-purpose and standard extensions. It is effectively a base ISA, which is present in any implementation, and optional extensions can be added for more functionality. The base integer ISA is similar to previous-generation RISC ISAs, except that it has no branch delay slots and it supports variable-length instruction encodings. Because its instruction set is deliberately minimal, the base is a reasonable target for compilers, linkers, assemblers and operating systems with supervisor-level operations.

The processor design is based on the industry-standard instruction set from the RISC-V organization. It can be parameterized for 32- or 64-bit data. The processor designed in this project is based on the open-source RV12 processor by RoaLogic. The processor is single-core and based on RISC-V ISA version 2.2. [1] Here the design is focused on the core of the processor, which is heavily configurable and has precise interrupts, single-cycle execution, and parameterized cache memory size and architecture. The most important features of the processor are:
• Optimizing folded six-stage pipeline
• Optional Branch Prediction Unit

The implementation of this processor is intended to have a small silicon footprint. Gated clock design reduces power. Since the design is extensively configurable, it offers power, utilization and performance trade-offs, making the core highly optimizable. This is the focus of this project, in which the power and utilization of the design are compared.

II. DESIGN OF THE CORE
A. Execution Pipelining
The processor comprises a six-stage folded pipeline, which differs from the basic RISC pipeline. A classic RISC pipeline has five stages: Fetch, Decode, Execute, Memory Access and Write-Back. [2] This core design has six stages, with an additional “Pre-Decode” stage (Fig. 1), and is modified so that the instruction fetch takes two cycles, allowing time to re-encode 16-bit compressed instructions and to predict jumps and branches.

Fig. 1. Core Execution Pipelining

Fig. 2. Execution stages overlapping

The major modification in the execution pipelining is that the Execute and Write-Back stages are folded into the Memory Access stage. Instruction handling is optimized in the Decode stage to reduce stalls and improve the CPI (cycles per instruction) of the processor. [3]

This modified pipeline overlaps the execution stages and completes one instruction per clock cycle, i.e. up to five operations can be in progress at the same time (Fig. 2).

The “classic RISC processor” consists of Instruction Fetch, Instruction Decode, Execute, Memory Access and Write-Back. The target of the five stages was to complete one instruction per cycle. Since deepening the pipeline increases the instruction throughput of the central processing unit, work has been done on the six-stage pipelined processor.

During decoding, the processor determines whether a register that the instruction needs to read is being written by a different operation in the execution stage. If so, the instruction is stalled by the Control
Authorized licensed use limited to: Carleton University. Downloaded on May 30,2021 at 12:13:43 UTC from IEEE Xplore. Restrictions apply.
unit for one clock cycle, as is the Fetch stage, so that no data is overwritten. This is known as a CPU stall.

Here, the Decode stage helps to hide CPU stalls and to increase the number of instructions completed per cycle by letting stalls, memory accesses and execution overlap.

The instruction fetch stage is followed by the Pre-Decode stage, which relies heavily on the compressed instruction extension of the RISC-V ISA. [4]

Fig. 3. Execution Pipeline

The pipelining (Fig. 3) in this core is as follows:

• Instruction Fetch
The Fetch stage loads a parcel from the program memory. A parcel is a field that may contain more than one instruction. The address of the parcel is held in the program counter. The program counter can be updated at any point unless there is a CPU stall. It is either 32 or 64 bits wide. After the pipeline is flushed, the program counter restarts from the given default location.

• Pre-Decode
The fetched instructions first need to be expanded from their 16-bit compressed form into the base RISC-V instruction set. After that, instructions that modify the program counter, such as jump-and-link and branch, are processed. This avoids waiting for the Execute stage to trigger the change and reduces the need to flush the pipeline. The destination address is predicted either by the optional branch predictor or statically on the basis of the offset value. [5]

• Instruction Decode
The Instruction Decode stage accesses the register file, reads the values in it, checks whether the opcodes are valid, decodes the immediate values, and confirms that the operands are available for execution.

• Execute
The Execute stage carries out the operations required by the information passed from the preceding Decode stage; this is where the actual computation occurs. It contains several execution units, each with a distinct function. The ALU is responsible for Boolean and integer arithmetic operations. The bit shifter performs shifts and rotations. The Multiplier unit computes signed/unsigned multiplication. The Divider unit handles signed or unsigned division. The Load/Store Unit accesses the data memory to load or store data. The Branch Unit computes jump and branch addresses and validates the predicted branches. [6]

• Memory Access
In this stage the data memory is accessed. The Memory Access stage provides the time needed for a memory read or load/store access to finish. The address, data and control signals for a memory access are generated during the Execute stage. The memory latches these signals and then performs the actual access, so the read data is not available until one clock cycle later. Without a dedicated stage, this would fall at the end of the Write-Back stage, too late for the result to be written. Therefore, the MEM stage is essential. [7]

B. External Bus Interfaces
1) AHB-Lite Memory using AMBA 3
The processor core design uses an AHB-Lite memory. It is a memory-access bus interface with a very wide range of parameter support. All of its ports conform to AMBA 3 AHB-Lite. Its generic design allows physical implementation on a board through controlled hardware synthesis, as well as simulation support based on the behavioural HDL.

The AHB-Lite Memory is a parameterized and configurable IP that allows a designer to connect internal device memory to AHB-Lite based hosts. The dimensions of the memory, which has a registered output stage, are set via parameters.

2) Interrupts
The processor supports multiple external interrupts and operates in conjunction with an external PLIC (Platform-Level Interrupt Controller).

Dedicated pins on the processor core present the interrupt to the processor, which then expects the PLIC to supply the identifier of the interrupt source at the appropriate interrupt vector upon a call.

III. IMPLEMENTATION
The design has been examined on two different FPGA boards: Virtex-7 (Product Part: xq7vx690trf1930-2I) and Virtex UltraScale (Product Part: xcvu440-flga2892-2-e).

The schematic diagrams generated for the design on Virtex-7 (Fig. 5) and UltraScale (Fig. 6), along with the RTL design of the core (Fig. 4), are shown.

Fig. 4. RTL design of the core
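As an illustration of the Pre-Decode step described in Section II-A, the following minimal Python sketch shows how a 16-bit compressed instruction can be recognised and expanded to its 32-bit equivalent. It is a software model only, not the RTL of this core. It covers a single instruction (C.ADDI), and the function names `is_compressed`, `sign_extend` and `expand_c_addi` are hypothetical names introduced here for illustration. The bit layouts follow the published RISC-V compressed-instruction encoding.

```python
def is_compressed(parcel: int) -> bool:
    """RVC rule: an instruction is 16 bits wide if its two lowest bits are not 0b11."""
    return (parcel & 0b11) != 0b11


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def expand_c_addi(half: int) -> int:
    """Expand C.ADDI (quadrant 01, funct3 000) into the 32-bit ADDI rd, rd, imm."""
    if (half & 0b11) != 0b01 or ((half >> 13) & 0b111) != 0b000:
        raise ValueError("not a C.ADDI encoding")
    rd = (half >> 7) & 0x1F                                     # rd/rs1 in bits 11:7
    imm6 = (((half >> 12) & 0x1) << 5) | ((half >> 2) & 0x1F)   # imm[5] | imm[4:0]
    imm12 = sign_extend(imm6, 6) & 0xFFF                        # widen to the 12-bit I-type field
    # I-type layout: imm[11:0] | rs1 | funct3 | rd | opcode (0x13 = OP-IMM)
    return (imm12 << 20) | (rd << 15) | (0b000 << 12) | (rd << 7) | 0x13


# c.addi a0, 1 is encoded as 0x0505 and expands to addi a0, a0, 1.
print(hex(expand_c_addi(0x0505)))
```

A real pre-decoder would handle every compressed opcode and pass 32-bit instructions through unchanged; the sketch only shows the length check and one expansion.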
Power reports can be broadly classified into the Physical Domain and the Functional Domain. The Physical Domain of power deals with the board shape, size, power delivery and thermal power dissipation system. The Functional Domain depends more on the design, i.e. the utilization of area, I/O signal interference, etc. This bit of theory is essential to understanding the core difference between the two boards used in this project.
The utilization reports have a great impact on the power usage, or on-chip power, of both devices. The power report produced during synthesis gives the power estimated to be used or dissipated when using a specific device or vendor.
Power reports have been monitored and refined during synthesis of the project, without a placement procedure. This is known as Post Synthesis in Vivado.
The power report obtained after implementation is called ‘Post Placement’ and is generated once the netlist components have actually been placed into the FPGA resources. It is more precise because it draws on a detailed utilization report and configuration. This stage also verifies the routing and the best- and worst-case gate delays. In short, power analysis is most accurate at this stage, before testing on an actual board. [8]
Fig. 5. Implemented design of the core on Virtex-7
The utilization reports have a great impact on the on-chip power of both devices. The Virtex UltraScale is a very large device with a powerful set of specifications. This is reflected in the immense I/O resources it provides: 1456 I/O pins are available, but the core design uses only 941 of them, giving 64% utilization. By comparison, the Virtex-7 has 1000 available I/O pins, which gives a much higher utilization of 94%. [9]
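The I/O utilization figures quoted above can be checked with a short calculation. The Python sketch below is only a worked example. The pin counts come from the text, the figure of 941 pins used on the Virtex-7 is an assumption inferred from the reported 94% of 1000 pins, and the helper name `io_utilization` is hypothetical.

```python
# I/O pin counts reported in the text; 941 used pins on the Virtex-7 is an
# assumption inferred from the stated 94% of 1000 available pins.
PINS_USED = 941
AVAILABLE_PINS = {"Virtex UltraScale": 1456, "Virtex-7": 1000}


def io_utilization(used: int, available: int) -> float:
    """Return I/O utilization as a percentage of the available pins."""
    if available <= 0:
        raise ValueError("available pin count must be positive")
    return 100.0 * used / available


for board, total in AVAILABLE_PINS.items():
    print(f"{board}: {io_utilization(PINS_USED, total):.1f}% of {total} I/O pins")
```

This gives roughly 64.6% for the Virtex UltraScale and 94.1% for the Virtex-7, consistent with the 64% and 94% figures in the utilization report.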
Fig. 7. Utilization report
REFERENCES
[1] A. Waterman and K. Asanovic, "The RISC-V Instruction Set Manual, Volume I: User-Level ISA," SiFive Inc. and CS Division, EECS Department, University of California, Berkeley, 2017.
[2] P. Maillard, J. Arver, C. Smith, O. Ballan, M. J. Hart and Y. P. Chen, "Test Methodology & Neutron Characterization of Xilinx 16nm Zynq® UltraScale+™ Multi-Processor System-on-Chip (MPSoC)," 2018 IEEE Radiation Effects Data Workshop (REDW), Waikoloa Village, HI, 2018, pp. 1-4.
[3] I. Kuroda, E. Murata, K. Nadehara, K. Suzuki, T. Arai and A. Okamura, "A 16-bit parallel MAC architecture for a multimedia RISC processor," 1998 IEEE Workshop on Signal Processing Systems. SIPS 98. Design and Implementation (Cat. No.98TH8374), Cambridge, MA, USA, 1998, pp. 103-112.
[4] K. Patsidis, D. Konstantinou, C. Nicopoulos and G. Dimitrakopoulos, "A low-cost synthesizable RISC-V dual-issue processor core leveraging the compressed Instruction Set Extension," Microprocessors and Microsystems, vol. 61, pp. 1-10, Sept. 2018.
[5] Mukherjee et al., "A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor," Proceedings of the 36th Annual International Symposium on Microarchitecture (MICRO), pp. 1-5, 2015.
[6] S. Islam, D. Chattopadhyay, M. K. Das, V. Neelima and R. Sarkar, "Design of High-Speed-Pipelined Execution Unit of 32-bit RISC Processor," 2006 Annual IEEE India Conference, New Delhi, 2006, pp. 1-5.
[7] A. Oleksiak, S. Cieślak, K. Marcinek and W. A. Pleskacz, "Design and Verification Environment for RISC-V Processor Cores," 2019 MIXDES - 26th International Conference "Mixed Design of Integrated Circuits and Systems", Rzeszów, Poland, 2019, pp. 206-209.
[8] D. K. Dennis et al., "Single cycle RISC-V micro architecture processor and its FPGA prototype," 2017 7th International Symposium on Embedded Computing and System Design (ISED), Durgapur, 2017, pp. 1-5.
[9] Y. Jing, C. Y. Meng, M. T. Reyes, J. O. Yang, P. F. Salinas and G. Tan, "Electrical diagnosis of temperature-dependent global clock failures using probeless isolation and pattern commonality analysis," 2012 19th IEEE International Symposium on the Physical and Failure Analysis of Integrated Circuits, Singapore, 2012, pp. 1-6.