Processor Design From Leon3 Extension Final Report
General-Purpose Processor
Islamabad, Pakistan
This project explores the design of a Reduced Instruction-Set Computer based 32-bit General-
Purpose Processor. A simple 32-bit RISC processor has been evaluated and tested on an FPGA. For
the initial proof of concept, an open-source 8-bit processor was selected, and synthesized to run on
an FPGA running custom code and instructions written in C. Thereafter, the specifications were
upgraded to implement a 32-bit processor. Several open-source and commercial processors were
studied. Once a decent level of understanding of the architecture was achieved, an open-source
processor was selected, synthesized, and tested. Finally, a new peripheral was integrated into the
processor to enhance the processor's capabilities, and to adapt it for better performance as needed
for different applications.
The objective is to explore the foundations of processor design and to obtain an architecture to
which further components can be added for improved and increased functionality.
Brief
The purpose of the project is to explore the design of a 32-bit general-purpose processor that is
synthesizable on an FPGA.
Deliverables
The report starts by discussing the basic structure of a processor in Chapter 2. Chapter 3 discusses
the bare minimum design of a processor and builds an 8-bit processor as a proof of concept. It is
synthesized, mapped, and tested on an FPGA. The processor is then modified and tested again.
Based on the architecture of Chapter 3, Chapter 4 moves to a 32-bit RISC architecture with a more
complex structure. Again, the 32-bit processor is synthesized, mapped, and tested on an FPGA.
Finally, ways to incorporate custom peripherals into the processor are explored in Chapter 5, leading
to the conclusion of the report in Chapter 6.
Introduction:
A processor is a device capable of manipulating information in a way specified by a set of
instructions. This instruction set, in turn, defines the capabilities of the processor. A sequence of
these instructions forms a machine-controlled program. Each family of processors has a different
instruction set, so the functionality varies between families. The sequence of instructions may be
changed as needed to change the application.
A deviation from the von Neumann architecture is the Harvard architecture, in which instructions and
data have different memory spaces, with separate address, data and control buses for each memory
space. This has a number of advantages, notably that instruction and data fetches can occur
concurrently. In the current study, we will stick with a RISC processor conforming to the Harvard
architecture.
1. Decoder
2. Memory
3. Bus
4. Peripherals
5. Arithmetic Logic Unit (ALU)
Other add-ons found on high-performance and commercial processors may include advanced pipelining,
a Floating Point Unit, co-processors, high-performance buses, etc.
Instruction Decoder:
The Instruction Decoder reads the next instruction from memory (at the address given by the program
counter, which is then incremented) and sends the component pieces of that instruction to the
Arithmetic Logic Unit (ALU) for execution. For each machine-language instruction, the control unit
produces the sequence of pulses on each control signal line required to implement that instruction
(and to fetch the next instruction). Many processors are designed with single-cycle execution.
The RISC instruction decoder is typically a very simple device. Because RISC instruction words have
a fixed length, the positions of the fields are fixed, and the processor reads the entire
instruction into the instruction register. We can therefore decode an instruction simply by
separating the machine word in the instruction register into its fields.
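As an illustration of how fixed field positions make decoding simple, the following minimal sketch splits a 32-bit word into fields with shifts and masks. It assumes the SPARC V8 format-3 layout (the ISA adopted later in this report); the type and function names are hypothetical and are not taken from any processor source code.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a SPARC V8 format-3 instruction word (arithmetic, logical,
 * load/store). The struct and function names are illustrative only. */
typedef struct {
    uint8_t op;      /* bits 31:30, instruction format */
    uint8_t rd;      /* bits 29:25, destination register */
    uint8_t op3;     /* bits 24:19, operation selector */
    uint8_t rs1;     /* bits 18:14, first source register */
    uint8_t i;       /* bit 13, 1 = second operand is an immediate */
    uint8_t rs2;     /* bits 4:0, second source register (used when i == 0) */
    int32_t simm13;  /* bits 12:0, sign-extended immediate (used when i == 1) */
} format3_fields;

/* Split a 32-bit instruction word into its fixed-position fields. */
static format3_fields decode_format3(uint32_t word)
{
    format3_fields f;
    f.op  = (uint8_t)((word >> 30) & 0x3u);
    f.rd  = (uint8_t)((word >> 25) & 0x1Fu);
    f.op3 = (uint8_t)((word >> 19) & 0x3Fu);
    f.rs1 = (uint8_t)((word >> 14) & 0x1Fu);
    f.i   = (uint8_t)((word >> 13) & 0x1u);
    f.rs2 = (uint8_t)(word & 0x1Fu);

    /* Sign-extend the 13-bit immediate: bit 12 is the sign bit. */
    int32_t imm = (int32_t)(word & 0x1FFFu);
    f.simm13 = (imm & 0x1000) ? imm - 0x2000 : imm;

    /* Postcondition: the immediate fits in 13 signed bits. */
    assert(f.simm13 >= -4096 && f.simm13 <= 4095);
    return f;
}
```

For example, the word 0x86004002 (add %g1, %g2, %g3) decodes to op = 2, rd = 3, op3 = 0, rs1 = 1, i = 0 and rs2 = 2. No lookup or sequencing is needed to find the fields, which is why the hardware decoder can be little more than wiring.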
Bus:
A bus is a communication system that physically links the components to each other, allowing the
transfer of control information or data between these components. Communication between the
functional blocks of single-chip systems was first provided by bus-based architectures. There are
many open-source and commercial bus architectures, including OpenCores’ Wishbone, Altera’s
Avalon, ARM’s AMBA and IBM’s CoreConnect. We will draw a comparison between them later.
Arithmetic Logic Unit (ALU):
The ALU performs operations on the one (or two) operands selected by the instruction decoder and
read from memory. The inclusion of inverters on the inputs enables the same ALU hardware to
perform subtraction (adding an inverted operand with a carry-in of 1) as well as the NAND and NOR
operations.
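The effect of the input inverters can be sketched in a few lines of C. This is a behavioural model under stated assumptions, not a description of any particular ALU: the operation codes and function names are hypothetical, and only AND, OR and ADD are modelled as native operations.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical operation select codes for a 32-bit ALU model. */
typedef enum { ALU_AND, ALU_OR, ALU_ADD } alu_op;

/* Model of an ALU whose two inputs pass through optional inverters.
 * carry_in feeds the adder only. */
static uint32_t alu(alu_op op, uint32_t a, uint32_t b,
                    int invert_a, int invert_b, unsigned carry_in)
{
    assert(carry_in <= 1u);
    uint32_t x = invert_a ? (uint32_t)~a : a;
    uint32_t y = invert_b ? (uint32_t)~b : b;
    switch (op) {
    case ALU_AND: return x & y;
    case ALU_OR:  return x | y;
    case ALU_ADD: return x + y + carry_in;
    }
    return 0;
}

/* a - b = a + ~b + 1 (two's complement) */
static uint32_t alu_sub(uint32_t a, uint32_t b)  { return alu(ALU_ADD, a, b, 0, 1, 1); }
/* ~(a | b) = ~a & ~b (De Morgan) */
static uint32_t alu_nor(uint32_t a, uint32_t b)  { return alu(ALU_AND, a, b, 1, 1, 0); }
/* ~(a & b) = ~a | ~b (De Morgan) */
static uint32_t alu_nand(uint32_t a, uint32_t b) { return alu(ALU_OR, a, b, 1, 1, 0); }
```

Subtraction, NOR and NAND therefore need no dedicated datapath: each is one of the three native operations with the inverters (and, for subtraction, the carry-in) enabled.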
Introduction:
To understand how a processor works and how it can be synthesized onto an FPGA, we chose an
open-source core compatible with the Intel 8051 architecture. Many open-source and commercial IP
cores are available. Open-source 8051 IP cores include Oregano Systems mc8051 and OpenCores’ T51
and 8051, while commercial IP cores include Evatronix R8051XC2, e8051 and Digital Core Design
DP8051CPU.
Of all the above-mentioned 8051 cores, the R8051XC2 is claimed to be the fastest and a fully
configurable 8051, achieving a speed of 350 MHz. However, its code is not open source and is meant
for commercial purposes. For educational use, the candidates were the cores from OpenCores and
Oregano Systems. The cores from OpenCores had the disadvantage that they were not easy to synthesize
and the documentation provided was not helpful. Thus, the VHDL core for the 8051 microcontroller
from Oregano Systems was chosen.
2. To compile C programs for the 8051, Keil C51 was installed, which has a built-in target
specification for the Oregano 8051 core. After building a C file (for example, BLINKY.c or
Fibonacci.c), the corresponding .hex file was created.
The top module for the 8051 was written in VHDL. It instantiates the components of the 8051 core as
well as the memories: a 128 x 8 RAM, a 64k x 8 ROM and a 64k x 8 external RAM. The memories were
created with the Core Generator in Xilinx ISE. The configuration for the 128 x 8 bit RAM is as follows:
Also, a Phase Locked Loop (PLL) from the Xilinx Core Generator was used to reduce the clock from the
FPGA system clock of 100 MHz to the desired frequency (11.675, 25 or 40 MHz). Its component was also
instantiated in the top module.
The architecture of the top module generated by PlanAhead (pre-synthesis) is shown below:
To convert a HEX file to a COE file, some open-source tools are available, but most are not
compatible with 64-bit Windows. For this purpose, an alternative set of tools (numbers 3 and 4 in the
tools section) was introduced, which converts the HEX file to a bin file and then the bin file to
COE. These tools are run from the Command Prompt in Windows as shown:
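Independently of the specific tools used, the two file formats involved are simple enough to sketch. The code below is not the tool chain used in this project; it is a minimal illustration, with hypothetical function names, of parsing one Intel HEX record (":LLAAAATT" followed by data bytes and a checksum) and of writing a byte image in the Xilinx COE syntax.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Convert one hexadecimal digit to its value, or -1 if it is not a digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Parse one Intel HEX record of the form ":LLAAAATT<data>CC".
 * On success, stores the load address and record type, copies the data
 * bytes into data[] and returns the data byte count (0..255).
 * Returns -1 on a malformed record or a checksum mismatch. */
static int parse_hex_record(const char *line, uint16_t *addr,
                            uint8_t *type, uint8_t data[255])
{
    size_t len = strlen(line);
    if (len < 11 || line[0] != ':') return -1;

    uint8_t raw[4 + 255 + 1];   /* count, address (2), type, data, checksum */
    size_t nbytes = 0;
    for (size_t i = 1; i + 1 < len && nbytes < sizeof raw; i += 2) {
        int hi = hex_digit(line[i]), lo = hex_digit(line[i + 1]);
        if (hi < 0 || lo < 0) break;          /* stop at a line ending */
        raw[nbytes++] = (uint8_t)(hi * 16 + lo);
    }
    if (nbytes != (size_t)raw[0] + 5) return -1;

    /* All bytes, including the checksum, must sum to 0 modulo 256. */
    uint8_t sum = 0;
    for (size_t i = 0; i < nbytes; i++) sum = (uint8_t)(sum + raw[i]);
    if (sum != 0) return -1;

    *addr = (uint16_t)((raw[1] << 8) | raw[2]);
    *type = raw[3];
    memcpy(data, raw + 4, raw[0]);
    return raw[0];
}

/* Write a byte image as a COE memory initialization file (radix 16). */
static void write_coe(FILE *out, const uint8_t *image, size_t n)
{
    assert(n > 0);
    fprintf(out, "memory_initialization_radix=16;\n");
    fprintf(out, "memory_initialization_vector=\n");
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%02X%s\n", image[i], (i + 1 < n) ? "," : ";");
}
```

A complete converter would loop over all records, place the data of type-00 records at their load addresses in a ROM-sized buffer, stop at the type-01 end-of-file record, and then call write_coe on the buffer.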
To deal with this problem, a PLL core was introduced between the FPGA clock and the 8051 core clock.
The resultant clock was set to 11.675 MHz. The core schematic is as follows:
Simulation of mc8051:
A local testbench was created for the Fibonacci.c file, which was loaded into a ROM in the same way
as in the synthesis process. It was then simulated using Xilinx ISim. The output integer values were
observed on Port 0 (p0_o).
As an example, the 25 MHz synthesizable core was chosen. In this core, the file mc8051_p.vhd contains
a parameter named 'C_IMPL_N_TMR'. It can take values from 1 to 256, and its default value is 1.
We changed its value to 2, which generated 2 extra timer units, 1 additional serial port and 1
additional external interrupt source. The initial peripheral diagram (pre-synthesis) is shown below:
Introduction:
The complexity of designing processors has increased over time. Designing each and every hardware
component of the system from scratch soon became impractical and expensive for most designers.
Therefore, the idea of using pre-designed and pre-tested IP cores in designs became an attractive
alternative. Softcore processors are processors whose architecture and behavior are fully described
using a synthesizable Hardware Description Language (HDL) such as Verilog or VHDL. They can be easily
synthesized to an FPGA or ASIC.
Evaluation of Processors:
Many 32-bit processors are available, such as Altera Nios II, Xilinx MicroBlaze, Tensilica Xtensa,
OpenCores OpenRISC 1200 and Gaisler Leon 3. An overall comparison between them is given below:
                      Nios II      MicroBlaze   Xtensa       OpenRISC 1200   Leon 3
Max Frequency (MHz)   200 (FPGA)   200 (FPGA)   350 (ASIC)   300 (ASIC)      400 (ASIC), 125 (FPGA)
Pipeline Stages       6            3            5            5               7
From the above table, we can easily assess that each processor has its advantages and
disadvantages. Xtensa offers unlimited ISA customization, but it is not open source and is
expensive. Similarly, OpenRISC has open-source code but is difficult to use with the given
technology. Leon 3, despite its limited ISA customization, excels in all other departments. However,
there are other issues to be explored as well, such as bus architecture, software tools and a
compliant ISA.
From here also, we can see that AMBA from ARM has quite a lot of advantages. However, Wishbone
has the edge of being adopted as the primary bus for most open-source designs. AMBA is the bus
architecture used by Leon 3. We will look at it in more detail later.
In view of the above comparisons, we can safely use Leon 3 as our 32-bit processor and as the
baseline for designing our own custom processor.
SPARC is an instruction set architecture (ISA), derived from a RISC lineage. As an architecture, SPARC
allows for a spectrum of chip and system implementations at a variety of price/performance points
for a range of applications, including scientific/engineering, programming, real-time, and
commercial. SPARC was designed as a target for optimizing compilers and easily pipelined hardware
implementations. SPARC implementations provide exceptionally high execution rates and short
time-to-market development schedules. Its advantages are:
• Open architecture without patent or license fees unlike Intel, MIPS and ARM
• Well designed and documented
• Easy to implement
• Established software standard
Leon 3 Introduction:
The LEON3 is a synthesisable VHDL model of a 32-bit processor compliant with the SPARC V8
architecture. The model is highly configurable, and particularly suitable for system-on-a-chip (SoC)
designs. The full source code is available, allowing free and unlimited use for research and
education. The LEON3 processor has the following features:
• Compliant with SPARC V8 ISA
• Advanced 7-stage Pipeline
• Hardware Multiply, Divide and MAC units
• High Performance Pipelined Floating Point Unit (FPU)
• Harvard Architecture (Separate Instruction and Data Cache)
• AMBA 2.0 AHB Bus Interface
• On-Chip Debug Support
• Multiprocessor Support
• Power Down and Clock Gating
• Fault tolerant version available for High Performance space applications
• Extensively configurable
• Tools available like simulators, compilers, debuggers and kernels
Leon 3 consists of the following subsystems:
➢ Integer Unit (based on 7-Stage Pipeline Harvard Architecture)
➢ Cache (Data and Instruction)
➢ Floating Point Unit & Co-processor
➢ Hardware Multiplier and Divider
➢ Memory Management Unit
➢ Debug Support Unit
➢ Interrupt Controller
Integer Unit:
It implements the full SPARC V8 standard, including hardware multiply and divide instructions. The
implementation is focused on high performance and low complexity. The number of register
windows is configurable within the limit of the SPARC standard (2 - 32), with a default setting of 8.
The pipeline consists of 7 stages with separate instruction and data cache interfaces (Harvard
architecture). The 7-stage pipeline is shown below:
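Apart from the pipeline, the register windows mentioned above can be illustrated with a short sketch. In SPARC V8, each window sees 8 globals, 8 outs, 8 locals and 8 ins, and the outs of one window are the ins of the adjacent one. The flat register-file layout and the function names below are assumptions made for illustration; they do not describe how the LEON3 integer unit is implemented.

```c
#include <assert.h>

/* Map an architectural register number (0..31) seen in window `cwp` to an
 * index in a flat physical register file with `nwindows` register windows.
 * Layout assumed here: indices 0..7 hold the globals, followed by
 * nwindows * 16 windowed registers arranged as a ring. */
static unsigned phys_reg(unsigned cwp, unsigned r, unsigned nwindows)
{
    assert(nwindows >= 2 && nwindows <= 32);
    assert(cwp < nwindows && r < 32);
    if (r < 8)
        return r;                 /* %g0..%g7 are shared by all windows */
    /* r = 8..15 are the outs, 16..23 the locals, 24..31 the ins */
    return 8 + (cwp * 16 + (r - 8)) % (nwindows * 16);
}

/* SAVE decrements the current window pointer modulo nwindows. */
static unsigned cwp_after_save(unsigned cwp, unsigned nwindows)
{
    assert(cwp < nwindows);
    return (cwp + nwindows - 1) % nwindows;
}
```

With the default of 8 windows, the caller's %o0 in window 3 and the callee's %i0 after a SAVE (window 2) map to the same physical register, which is how arguments are passed without copying. The total register count in this layout is 8 + 16 * nwindows, i.e. 136 registers for 8 windows.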
Memory Controller:
The memory controller handles a memory bus hosting PROM, memory-mapped I/O devices,
asynchronous static RAM (SRAM) and synchronous dynamic RAM (SDRAM). The controller acts as a
slave on the AHB bus. The function of the memory controller is programmed through memory
configuration registers 1, 2 & 3 (MCR1, MCR2 & MCR3), which are accessed through the APB (Advanced
Peripheral Bus). The memory bus supports four types of devices: PROM, SRAM, SDRAM and local I/O.
The memory bus can also be configured in 8- or 16-bit mode for applications with low memory and
performance demands. The controller decodes three address spaces (PROM, I/O and RAM) whose
mapping is determined through VHDL generics (parameters). The following diagram shows the different
connections:
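The decoding of the three address spaces described above can be illustrated with a small sketch. The address/mask values below are an assumption: they correspond to the commonly used LEON memory map (PROM at 0x00000000, I/O at 0x20000000, RAM at 0x40000000) and should be checked against the generics actually set in the design. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SPACE_NONE, SPACE_PROM, SPACE_IO, SPACE_RAM } mem_space;

/* One decoded area: the top 12 address bits are compared with `addr`
 * under `mask`, in the style of an address/mask generic pair. */
typedef struct { uint16_t addr; uint16_t mask; mem_space space; } area;

/* Assumed values for illustration, not read from any design. */
static const area areas[] = {
    { 0x000, 0xE00, SPACE_PROM },   /* 0x00000000 - 0x1FFFFFFF */
    { 0x200, 0xE00, SPACE_IO   },   /* 0x20000000 - 0x3FFFFFFF */
    { 0x400, 0xC00, SPACE_RAM  },   /* 0x40000000 - 0x7FFFFFFF */
};

/* Return the address space that a 32-bit bus address falls into. */
static mem_space decode_space(uint32_t address)
{
    uint16_t top = (uint16_t)(address >> 20);
    for (size_t k = 0; k < sizeof areas / sizeof areas[0]; k++) {
        /* A base address must not have bits set outside its mask. */
        assert((areas[k].addr & (uint16_t)~areas[k].mask & 0xFFFu) == 0);
        if (((top ^ areas[k].addr) & areas[k].mask) == 0)
            return areas[k].space;
    }
    return SPACE_NONE;
}
```

Changing a mapping then only means changing one address/mask pair, which mirrors how the VHDL generics move an address space without touching the controller logic.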
The AMBA AHB is the high-performance system backbone bus. It is intended for high-performance, high
clock frequency system modules. It supports the efficient connection of processors, on-chip
memories and off-chip external memory interfaces with low-power peripheral macro-cell functions.
AHB is also specified to ensure ease of use in an efficient design flow using synthesis and
automated test techniques.
AMBA APB is optimized for minimal power consumption and reduced interface complexity to
support peripheral functions. The APB is intended for low-power peripherals. APB can be used in
conjunction with either version of the system bus.
After the library is installed, a toolchain is required to use Leon 3. It is available for both
Windows and Linux. However, Windows is preferred due to ease of installation. The toolchain includes:
➢ Bare-C Compiler (BCC)
➢ Boot-Prom Builder (mkprom2)
➢ RTEMS Leon Cross Compiler (RTEMS)
➢ GRMON Debug Tool (GRMON2 Evaluation version)
➢ TSIM Simulator (Evaluation Version)
In the Windows environment, these tools are installed through a single installer known as GRTOOLS,
whereas in Linux every tool has to be installed separately. Also, the environment variables in
Windows are set automatically during installation. For the Bare-C Compiler, Eclipse Kepler version
1.6 is installed during installation.
Besides ease of installation, I preferred Windows because the tools for synthesis and simulation
(Xilinx and ModelSim) were already installed and their environment variables were set. On Linux, all
these tools would have had to be installed from scratch.
In Windows, we install Cygwin to replicate the Linux environment. During installation, make sure to
install Tcl/Tk, which is required to launch the GUI. With Cygwin installed, it is time to configure
Leon using the xconfig tool. Cygwin can be launched from the Desktop; an XWin server is also
required for display. After XWin is successfully launched, the following command is written in
Cygwin to export the display to the XWin server:
Here we can see that we are in the target design ML50x. Typing xconfig in the Cygwin shell launches
the xconfig GUI, as shown below:
Synthesis: Target technology for FPGA and other technology-related configurations. In this case, it is Xilinx.
Board Selection: FPGA Board (Xilinx ML507) or ASIC Technology
Clock Generation: PLL generated for FPGA Board. Default is 60 MHz for 100 MHz Board.
Processor: Main Processor configuration like number of processors, Integer Unit, FPU, MMU Configuration
L2 Cache
AMBA Bus Configuration
Debug Link
Peripherals: Memory Controller, On-Chip RAM/ROM, Ethernet, UART, Timer, VGA and Keyboard Interface, PCI Express
VHDL Debugging
This default configuration is known as the Minimal Processor. First we will simulate and synthesize
the Minimal Processor, and then move on to higher-performance configurations.
Now, we call on ModelSim for simulation. ModelSim is not launched as a separate application, because
it crashes when launched that way from Cygwin. Instead, it is invoked as vsim.exe, which runs in the
background.
After successful bit file generation, the selected file is loaded into the FPGA (Virtex-5 ML507).
Software can now be loaded onto and debugged on the Leon 3 synthesized in the FPGA.
bp 0 1 1 Branch Prediction
The area utilization and timing analysis for each processor are summarized in the table below:
BCC is a cross-compiler for LEON3 processors. It is based on the GNU compiler tools and the Newlib
standalone C-library. The cross-compiler system allows compilation of both tasking and non-tasking
C and C++ applications. It supports hard and soft floating-point operations, as well as SPARC V8
multiply and divide instructions.
#include <stdio.h>

int main(void)
{
    /* Print a greeting on the standard output */
    printf("Hello World\n");
    return 0;
}
The compiler takes the hello.c file and compiles it to the output hello.exe. This executable can be
loaded into the FPGA using two methods:
1. GRMON Debugger
2. MKPROM2 PROM Programmer
We will use the JTAG link, which is also used for programming the bit file into the FPGA. However, a
compatible driver must be installed for GRMON to use it. After the JTAG link is successfully
established, the GRMON shell is launched in the Command Prompt:
hello.exe compiled with the Bare-C Compiler can be loaded into the Leon 3 on the FPGA using the GRMON
debug link, and its output is written back to the GRMON shell:
hello.exe compiled using the Bare-C Compiler can again be loaded and checked here:
Introduction:
Using the knowledge of the Leon 3 processor, we now extend our work to customizing this processor.
We will study the factors and variables essential to the design of this processor. Each form of
customization requires a different kind of understanding. To add a peripheral,
we need:
Library Structure
Understanding and working of AMBA APB bus
VHDL Generics and link with Leon3mp.vhd (Top Module)
xconfig GUI Customization
Library Structure:
The automatic generation of compile scripts searches for VHDL libraries in the file lib/libs.txt, and in
lib/*/libs.txt. The libs.txt files contain paths to directories containing IP cores to be compiled into
the same VHDL library. The name of the VHDL library is the same as the directory. The main libs.txt
(lib/libs.txt) provides mappings to libraries that are always present in the existing library, or which
depend on a specific compile order (the libraries are compiled in the order they appear in libs.txt):
Each directory specified in libs.txt contains the file dirs.txt, which contains paths to sub-
directories containing the actual VHDL code. Each of the sub-directories appearing in dirs.txt
should contain the files vhdlsyn.txt and vhdlsim.txt. The file vhdlsyn.txt contains the names of the
files which should be compiled for synthesis (and simulation), while vhdlsim.txt contains the names of
the files which should only be used for simulation. The files are compiled in the order they appear,
with the files in vhdlsyn.txt compiled before the files in vhdlsim.txt.
Why is this important? When scripts are generated for synthesis or simulation, the library is
loaded with each file required for the processor and assembly system. When we create or add a new
peripheral, we update these scripts accordingly. This is done by adding the new peripheral file to
the target vhdlsyn.txt file. The resulting script is shown below:
An access to the AHB slave input (AHBI) is decoded and an access is made on the APB bus. The APB
master drives a set of signals grouped into a VHDL record called APBI, which is sent to all APB slaves.
The combined address decoder and bus multiplexer controls which slave is currently selected. The
output record (APBO) of the active APB slave is selected by the bus multiplexer and forwarded to the
AHB slave output (AHBO).
An example IP core is written with an APB interface. The IP core has one memory-mapped 32-bit register
that will be reset to zero. The register can be read or written from register address offset 0. The
core’s base address, mask and bus index settings are configurable via VHDL generics (pindex, paddr,
pmask). The paddr and pmask VHDL generics are propagated to the APB bridge via the apbo.pconfig
signal and the index is propagated via the apbo.pindex signal. These values are then used by the APB
Many other open-source or commercial IP cores are also available that can be added. The problem is
that not every IP core uses an AMBA interface. For example, the cores available on OpenCores follow
the Wishbone bus architecture. For such cores, we sometimes need to create a Wishbone-to-AMBA wrapper.
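To make the role of the paddr and pmask generics of the example APB core more concrete, the following sketch models the slave selection and shows how software would reach the register at offset 0. It rests on two assumptions that should be checked against the actual design: that the bridge compares bus address bits 19:8 with paddr under pmask, and that the AHB/APB bridge sits at 0x80000000. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base address of the AHB/APB bridge. */
#define APB_BRIDGE_BASE 0x80000000u

/* Return 1 if an access to `address` selects the slave configured with the
 * 12-bit `paddr`/`pmask` generics: address bits 19:8 are compared with
 * paddr, and only the bit positions set in pmask take part. */
static int apb_slave_selected(uint32_t address, uint16_t paddr, uint16_t pmask)
{
    assert(paddr <= 0xFFFu && pmask <= 0xFFFu);
    uint16_t field = (uint16_t)((address >> 8) & 0xFFFu);
    return ((field ^ paddr) & pmask) == 0;
}

/* Byte address of the core's register at offset 0. */
static uint32_t apb_core_base(uint16_t paddr)
{
    assert(paddr <= 0xFFFu);
    return APB_BRIDGE_BASE | ((uint32_t)paddr << 8);
}

/* On the target, the register is accessed through a volatile pointer so
 * that the compiler does not optimize the bus accesses away. */
static void reg_write(volatile uint32_t *reg, uint32_t value) { *reg = value; }
static uint32_t reg_read(const volatile uint32_t *reg) { return *reg; }
```

With a hypothetical paddr of 0x00C and a pmask of 0xFFF, the core occupies the 256 bytes starting at 0x80000C00. On the target, a Bare-C program would write the register with reg_write((volatile uint32_t *)0x80000C00, value) and read it back with reg_read on the same pointer.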
xconfig extension:
This module comes last, but it uses information from all the previous work, leading to the
customization of the GUI shown. Each core has a set of files that are used to generate the core’s
xconfig menu entries. As an example we will look at the apb_example. The xconfig files are typically
located in the same directory as the core’s HDL files (but this is not a requirement).
The first line defines a boolean option that will be saved in the variable CONFIG_I2CAHB. This will be
rendered as a yes/no question in the menu. If this constant is set to yes (‘y’) then the user will be
able to select two more configuration options. First the width, which is defined as an integer (int),
and the interrupt mask which is defined as a hexadecimal value (hex). The GUI has a help option for
Now, these variables can be used to generate cores for apb_example in the same way as shown in
Fig: 46. The modified xconfig is shown below:
A basic processor architecture was explored based on the 8-bit 8051 instruction-set architecture. It
was simulated, synthesized, and mapped on an FPGA. Then custom C code was executed on the
processor and its performance was measured.
Then a 32-bit SPARC V8 instruction-set architecture was explored, which included more complex
components like caches, a memory controller, and AMBA buses. This was again synthesized and
mapped on an FPGA for testing.
Finally, the capabilities of the processor were enhanced by designing and interfacing a custom
memory mapped peripheral with the AMBA bus of the system, thus incorporating the desired
functionality into the existing system.
The architecture was found to be suitable for implementation in ASICs and FPGAs, and is flexible
enough to incorporate custom peripherals.