SOC Verification Debugging Material 1739857112
DESIGN VERIFICATION
The major advantage of using an ASIC is that the overall design can be made
into one integrated circuit, reducing the number of additional circuits
required. Modern ASIC designs generally include 32-bit processors, memory
blocks and other large building blocks; such a modern ASIC is known as an
SoC (System-on-a-Chip). Because the cost of developing an ASIC is high,
ASIC designs are usually undertaken only for high-volume products with
large production runs. Today, many digital ASIC designers use hardware
description languages (HDLs) such as Verilog and VHDL to describe the
function and design of the ASIC.
Three types of gate-array-based ASICs are Channeled Gate Arrays,
Channelless Gate Arrays and Structured Gate Arrays. In a Channeled Gate
Array, the spaces for interconnect between the rows of cells are fixed in
height, while in a Channelless Gate Array there is no predefined space
between the rows of cells. A Structured Gate Array, or Embedded Gate Array,
combines features of both standard-cell-based and gate-array-based ASICs.
Programmable ASIC
Programmable ASICs are classified into Programmable Logic Devices and
Field Programmable Gate Arrays.
ASIC Design
There are various steps involved in ASIC chip design. A brief description
is given below.
1. Design Entry: The designer starts the design with a text description or
a system-level language such as an HDL or C.
2. Logic Synthesis: Logic synthesis produces a netlist consisting of the
description and interconnection of logic cells.
3. System Partitioning: A large design is partitioned into smaller,
ASIC-sized pieces.
4. Prelayout Simulation: Prelayout simulation checks whether the design
functions correctly.
5. Floorplanning: The arrangement of the blocks present in the netlist on
the chip is planned.
6. Placement: The cells within each block are placed.
7. Routing: The necessary interconnections between the cells are made.
8. Circuit Extraction: The integrated-circuit layout is translated back
into an electrical circuit.
9. Postlayout Simulation: The final layout of the design is checked.
Applications of ASIC Technology
Application-Specific Integrated Circuits find many applications in the
medical, industrial, automotive and sensor fields. Today ASIC chips are
used in satellites, modems, personal computers and more. The Electronic
Odometer and the Engine Monitor are ASIC products suited to automotive
applications: an Electronic Odometer records the mileage of a vehicle,
while an Engine Monitor and Warning Light Controller monitors parameters
such as the temperature and voltage of a vehicle. ASICs are also widely
used in industrial applications; some ASIC-based industrial products are
the Micro-Power 555 Programmable Timer, thermal controllers and 8-bit
microcontrollers. In medical applications, products include biometric
monitors and hearing aids. Many ASIC products are also appearing for
security applications, one of them being RFID tags. Finally, ASICs can
serve many applications, and in the near future we can expect lower-cost
ASIC technology.
A device programmer blows fuses on the PLD to control the operation of each
gate. Inexpensive software tools are used for quick development, simulation
and testing, so the design cost is comparatively low. Another important
advantage is that customers can modify their designs as requirements
change.
The most commonly used Programmable Logic Devices are listed here.
The above figure shows the block diagram of a PROM. It consists of a fixed
AND-gate array followed by a programmable OR-gate array. The AND-gate array
is used as the address decoder, selecting the corresponding address
location based on the input address provided to it. Data is stored in the
OR-gate array, which contains programmable fuses that can be burned off
depending on the data values to be stored.
The internal structure of each block in the PROM is shown in the above
figure. Here, A and B are the address inputs and Y is the data output. The
AND arrays are fixed so as to select each row (address location) for the
corresponding inputs. As shown in the figure, the data in each memory
location is determined by the fuses in the OR array. If a fuse is not
burned off, the signal sent through the row reaches the output, indicating
logic one; when the fuse is burned off, the signal cannot reach the data
output and is read as logic zero. Thus, in a PROM, binary data is stored
using fuses.
In this example PROM, a single bit of data can be stored in each memory
location (fuse position). Since there are four address locations, two
address input bits are required to select them. The working of a PROM is
best understood with an example. Consider the address input “00”. As shown
in the above figure, the inverted values of both address inputs are fed to
the first AND gate, so both of its inputs are now at logic one and the
first address location is selected. The data in the first address location
is then determined by the presence or absence of fuses in the OR array; as
in the figure above, if the fuse at this location is burned off, the output
data is ‘0’. Similarly, in this example, logic one is stored in the second
and third locations and logic zero in the fourth location.
To use a PROM as a logic device, its address pins serve as the inputs of
the logic function and the data output is the function output. The output
for each input combination is stored in the corresponding memory location.
For example, an XOR gate can be implemented by storing its output values in
the respective address locations.
We know that the output of a two-input XOR gate is logic one if exactly one
of its inputs is at logic one. The table shows a two-bit PROM with four
address locations (‘00’, ‘01’, ‘10’, ‘11’) in which the data ‘0’, ‘1’, ‘1’
and ‘0’ are stored. Therefore, if the input (the PROM address) is ‘00’, the
data ‘0’ stored in memory location ‘00’ is fetched and output. Similarly,
we get logic one at the output for the input combinations ‘01’ and ‘10’,
and logic zero for ‘11’. Thus the device works like a two-input XOR gate.
Any other two-input logic function can be implemented with this PROM by
changing the data stored in the memory. Now refer back to the figure of the
example discussed earlier.
If there are m address bits, then 2^m locations can be addressed. These 2^m
locations can hold 2^(2^m) different combinations of stored values, so
2^(2^m) logic functions can be implemented using an m-bit PROM. In our
example we used a simple PROM with two address bits, which means 2^(2^2) =
16 different data combinations can be stored in it, each combination being
a particular logic function.
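The PROM-as-logic-device idea above can be sketched in a few lines of code. This is an illustrative model only (the function and variable names are my own, not from the material): the address bits act as the logic inputs, the stored bit is the function output, and with m = 2 address bits there are 2^(2^2) = 16 possible contents, i.e. 16 distinct two-input logic functions.

```python
# Sketch of a PROM used as a combinational logic device: the address
# decoder selects a row, and the stored bit at that row is the output.

def make_prom(contents):
    """contents[address] holds the single data bit for that location."""
    def prom(a, b):
        address = (a << 1) | b      # address decoder: row select from A, B
        return contents[address]
    return prom

# Store the XOR truth table at addresses 00, 01, 10, 11.
xor_gate = make_prom([0, 1, 1, 0])

for a in (0, 1):
    for b in (0, 1):
        assert xor_gate(a, b) == (a ^ b)

# With two address bits there are 2**(2**2) possible contents,
# i.e. 16 distinct two-input single-output logic functions.
print(2 ** (2 ** 2))  # -> 16
```

Changing the list passed to `make_prom` realizes any other two-input gate, e.g. `[0, 0, 0, 1]` for AND, exactly as the text describes.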
To implement this, we only need the inputs A and B’ in the first (top) AND
gate and A’ and B in the second (bottom) AND gate. The remaining two inputs
of each AND gate are not needed, so the corresponding fuses are burned off.
A two-input XOR gate implemented using Programmable Array Logic is shown
below.
(c). Programmable Logic Array (PLA)
A Programmable Logic Array, or PLA, is used to implement logic functions in
digital circuits. The structure has a programmable AND matrix, a
programmable OR matrix, and input and output buffers. The block diagram of
a PLA device is shown below.
Both the original and inverted value of each PLA input are provided by the
input buffers. The inputs to both the AND and OR gates pass through fuses
and can therefore be burned off depending on our requirements. The
structure of a PLA with all fuses intact (before programming) is shown
below.
A two-input PLA structure can be used to realize any two-input logic gate.
For that, the fuses which are not required to realize the particular logic
function are burned off. For example, an XOR gate realized using a
programmable logic device is shown below.
The characteristic equation of an XOR gate consists of two min terms:
Y = A’B + AB’. Each AND gate is used to generate a particular min term, and
the required min terms are selected using the fuses at the inputs of the OR
gate.
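The AND-plane/OR-plane split can be sketched as follows. This is an illustrative model, not a real device netlist: each product term models one AND-plane row left after burning the unneeded input fuses, and the `or_fuses` list models which OR-plane fuses are kept intact.

```python
# Sketch of a 2-input PLA: the AND plane generates min terms and the
# intact fuses in the OR plane select which min terms are summed.

def pla(and_plane, or_fuses, a, b):
    """and_plane: product terms, each a function of (a, b);
    or_fuses: 1 where the fuse is left intact (min term selected)."""
    minterms = [term(a, b) for term in and_plane]
    return int(any(m and f for m, f in zip(minterms, or_fuses)))

# Product terms remaining after programming for XOR: A'B and AB'.
and_plane = [
    lambda a, b: (not a) and b,   # A'B
    lambda a, b: a and (not b),   # AB'
]
or_fuses = [1, 1]                 # both min terms feed the OR gate

for a in (0, 1):
    for b in (0, 1):
        print(a, b, pla(and_plane, or_fuses, a, b))
```

Burning a fuse (setting its entry in `or_fuses` to 0) drops that min term from the sum, which is exactly how a different function would be programmed.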
(b). D-flip-flop,
(c). Multiplexers.
As in a PAL, OR gates are used to sum the min terms from the outputs of the
AND gates. An OLMC cell contains a D flip-flop, which is used to implement
sequential circuits. The multiplexers in the OLMC cells select whether the
signal is routed to the external output or to the feedback path. They are
also used to select between the sequential and non-sequential outputs,
taken from the output and the input of the D flip-flop respectively,
depending on the requirement.
Complex designs are first divided into small sub-functions. Logic blocks
are used to implement these sub-functions, and the connections are made
using programmable interconnects.
The figure shows the architecture of a programmed FPGA. The required
sub-functions are implemented in individual logic blocks, which are then
programmed and interconnected using switch boxes.
Logic Block
Logic blocks in the FPGA are used to implement the sub-functions. Any type
of logic circuit, both combinational and sequential, can be implemented
using a logic block; for this reason, logic blocks are commonly referred to
as configurable logic blocks (CLBs).
A simple logic block consists of a lookup table (LUT), a register and a
multiplexer, as shown in the figure. An SRAM is used to implement the
lookup table, so the desired logic function can be realized by varying the
data stored in the SRAM. The output of the lookup table is fed to both the
multiplexer and a D flip-flop. The D flip-flop is used to register (delay)
the output, and depending on the application the multiplexer selects either
the LUT output or the registered output. Therefore, by using the select
input of the multiplexer, we can implement both combinational and
sequential circuits with logic blocks. Many such logic blocks are
configured and finally interconnected through the switch boxes to build the
desired complex circuit.
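The LUT / flip-flop / multiplexer structure just described can be sketched as a small behavioral model. This is an illustrative sketch under my own naming (there is no standard `LogicBlock` class); the SRAM contents are the truth table, and the `registered` flag plays the role of the multiplexer select input.

```python
# Behavioral sketch of one configurable logic block (CLB):
# SRAM-backed LUT -> D flip-flop, with a mux choosing between the
# combinational (LUT) output and the registered output.

class LogicBlock:
    def __init__(self, lut_contents, registered=False):
        self.lut = lut_contents        # SRAM contents: the truth table
        self.registered = registered   # multiplexer select input
        self.q = 0                     # D flip-flop state

    def evaluate(self, a, b):
        d = self.lut[(a << 1) | b]     # LUT output (combinational path)
        out = self.q if self.registered else d
        self.q = d                     # flip-flop captures on the clock edge
        return out

# Configured as a combinational AND gate (select the LUT output):
clb = LogicBlock([0, 0, 0, 1], registered=False)
print([clb.evaluate(a, b) for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))])
# -> [0, 0, 0, 1]
```

Setting `registered=True` makes the same block behave sequentially: each call returns the value captured by the flip-flop on the previous clock, which is how the mux select turns one block into either a combinational or a sequential element.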
Compared with other programmable logic devices, an FPGA has very high logic
density: a single FPGA chip contains from ten thousand to eight million
gates, so more complex logic circuits can be implemented using FPGAs.
FPGAs also support concurrent (parallel) processing, which can be faster
and more efficient than sequential, pipelined architectures.
The design flow of an SoC aims at developing both the hardware and the
software of the SoC design. In general, it involves the strategies
discussed later in this section: architecture, design for test, validation,
synthesis and backend, integration, and on-chip isolation.
Advantages of SoC
Low power.
Low cost.
High reliability.
Small form factor.
High integration levels.
Fast operation.
Greater design flexibility.
Small size.
Disadvantages of SoC
Fabrication cost.
Increased complexity.
Time to market demands.
More verification.
Architecture Strategy
Design for Test Strategy
Validation Strategy
Synthesis and Backend Strategy
Integration Strategy
On chip Isolation
Architecture Strategy
The kind of processor used to design the SoC is an important factor to
consider, as is the choice of bus to be implemented.
Design for Test Strategy
Most common physical defects are modeled as faults, and the necessary
circuits included in the SoC design help in checking for these faults.
Validation Strategy
The validation strategy for SoC designs involves two major issues: first,
we have to verify the IP cores; second, we need to verify the integration
of the system.
Synthesis and Backend Strategy
Many physical effects have to be considered while planning SoC synthesis
and the backend flow, such as IR drop, crosstalk, 3D noise, antenna effects
and EMI. In order to tackle these issues, chip planning, power planning,
DFT planning, clock planning, and timing and area budgeting are required at
an early stage of the design.
Integration Strategy
In the integration strategy, all of the strategies listed above have to be
considered and assembled to produce a smooth overall flow.
On chip Isolation
In on-chip isolation, effects such as the impact of the process technology,
grounding effects, guard rings, shielding and on-chip decoupling are to be
considered.
What is a chiplet?
A chiplet is a sub-processing unit, usually controlled by an I/O controller
chip on the same package. Chiplet design is a modular approach to building
processors. Both AMD and Intel, the current major CPU manufacturers, are
adopting chiplet designs for their current product lineups. Chiplets help
increase production by way of better silicon yields: higher yields and
modular assembly mean that producing high-core-count parts wastes less
silicon.
Here is an IBM POWER5 MCM from 2004. Note that there are two types of
chips on the substrate.
The 4 rectangular chips in the corners contain two POWER5 CPU cores each,
for a total of 8 cores. The 4 larger square chips in the center contain 9MB of
last-level cache each, for a total of 36MB. The tiny squares carpeting the
design are capacitors, probably for smoothing out power delivery.
Here is an AMD Epyc MCM from 2017. Each of the 4 rectangular chips contains
8 cores, for a total of 32 cores. There are no dedicated cache chips; the
last-level cache is split among the 4 chips. There is only one type of
chip, and half the number of chips compared to the IBM POWER5.
Here’s the upcoming AMD Epyc 2 “chiplet”-based MCM that will be released
in 2019. AMD has significantly increased the number of chips on the package
from 4 to 9. The 8 tiny chips contain 8 cores each, for a total of 64 cores. The
large chip in the center contains I/O and the last-level cache. Interestingly,
the 8 tiny chips are built on TSMC 7nm technology, whereas the large I/O die
is built on TSMC 14nm. Since 14nm is significantly more mature and cheaper
to manufacture than 7nm, this allows AMD to further save on costs. With this
type of assembling strategy, AMD can produce Epyc 2 server SKUs with
anywhere between 8 and 64 cores by removing or disabling the tiny chips.
The downside is that there is an additional latency and power penalty when
communicating off-chip, but this might not matter depending on the
application.
However, at a high level, there is no real difference between the packaging
used for the Epyc 2 and the packaging used for the IBM POWER5. Both have
lots of chips (8 vs 9) on the same package, and both have two different types
of chips (cores vs cache/IO). It’s therefore likely that “chiplet” is just a
marketing term for an updated version of something done for decades.
Chiplet designs simply put multiple die into the same package.
AMD introduced the CCX configuration, in which the die consists of separate
CCD and IOD blocks.
The 1st Gen Ryzen architecture was relatively simple: an SoC design with
everything from cores to I/O and controllers on the same die. The CCX
concept was introduced, wherein CPU cores were grouped into four-core units
and combined using the Infinity Fabric. Two quad-core CCXs formed a die.
It’s important to note that even though CCXs were introduced, the consumer
Ryzen chips were still monolithic single-die designs. Furthermore, although
the L3 cache was shared across all the cores in a CCX, each core had its
own slice. Accessing another core’s slice of the last-level cache (LLC) was
relatively slower, even more so if it was in the other CCX. This caused
poor performance in latency-sensitive applications like gaming.
Things largely remained the same with Zen+ (a node shrink), but Zen 2 was a
major upgrade. It was the first chiplet-based design for consumer CPUs,
featuring two compute dies, or CCDs, and one I/O die. AMD added a second
CCD on the Ryzen 9 parts for core counts never before seen on the consumer
front.
The 16MB L3 cache was more accessible (read: faster) for all the cores on
the CCX, greatly improving gaming performance. The I/O die was separated,
and the Infinity Fabric was upgraded. At this point, AMD was slightly
slower in gaming but offered superior content-creation performance compared
with rival Intel Core chips.
Zen 3 further refined the chiplet design, eliminating the CCX and merging the
eight cores and 32MB cache into one unified CCD. This drastically reduced
cache latency and simplified the memory sub-system. For the first time,
AMD’s Ryzen processors offered better gaming performance than archrival
Intel’s. Zen 4 makes no notable changes to the CCD design other than making
them smaller.
At low core counts, a monolithic approach works fine. This largely explains
why Intel’s mainstream consumer CPU line had, until Ryzen, topped out at 4
cores. Increasing the core count on a monolithic chip dramatically
increases costs.
Mathematically, a per-core defect rate (say ten percent) compounds for
every additional core on a monolithic die, to the point that with, say, a
28-core Xeon, Intel has to throw away one or two defective chips for every
usable one, since all 28 cores have to be functional. Costs don’t just
scale linearly with core count; they scale exponentially because of this
wastage.
The economics are much more forgiving with the chiplet approach, as costs
scale roughly linearly with core count. Because AMD’s wastage depends only
on its ability to produce a functional block of at most four cores (a
single CCX), it doesn’t have to throw out massive stocks of defective
CPUs.
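The yield argument above can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch, not foundry data: it assumes each core independently has a ten percent chance of being defective (the figure mentioned above) and ignores binning, redundancy and real defect-density models.

```python
# Toy yield model: a die is good only if every core on it is good.
# Assumes an independent 10% defect probability per core (illustrative).

p_good_core = 0.90

def monolithic_yield(cores):
    # All cores on one monolithic die must be functional.
    return p_good_core ** cores

def chiplet_yield(cores_per_chiplet=4):
    # Only a small 4-core block (one CCX) must come out fully functional.
    return p_good_core ** cores_per_chiplet

print(f"28-core monolithic yield: {monolithic_yield(28):.1%}")  # ~5.2%
print(f"4-core chiplet yield:     {chiplet_yield():.1%}")       # ~65.6%
```

Under this toy model the monolithic yield collapses geometrically with core count, while the chiplet yield stays fixed at the per-CCX figure and the cost of extra cores grows only with the number of chiplets, which is the linear-versus-exponential contrast the text describes.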
The second advantage comes from AMD’s ability to leverage those defective
chips themselves. Whereas Intel mostly throws them out, AMD disables cores
on a per-CCX basis to achieve different core counts.
Intel released its first major chiplet-based, or tiled, CPUs as Xeon
Sapphire Rapids a while back. The 14th Gen Core processors codenamed
“Meteor Lake” will be the first chiplet lineup for Team Blue’s consumer
line. They are expected to hit the market in the final quarter of 2023.
Chapter 3
SOC Architecture & Functionality
SOC Architecture
SoC stands for System On Chip. It is a small integrated chip that contains all
the required components and circuits of a particular system. The
components of SoC include CPU, GPU, Memory, I/O devices, etc.
The I/O die is manufactured using the GF 12nm process, while the CCD is
manufactured on TSMC 7nm. This is done because I/O doesn’t scale (or at
least doesn’t scale as well as the cores), so manufacturing it on a cheaper
and more mature process allows AMD to get higher yields, reserving the TSMC
7nm process for the CPU cores, which benefit the most from it.
AMD’s Ryzen CPUs are made up of core complexes called CCDs and/or CCXs. But
what is a CCX, and how is it different from a CCD in an AMD processor?
Let’s have a look. There are many factors responsible for AMD’s recent
success in the consumer market, but the chiplet, or MCM (Multi-Chip
Module), design is at the heart of it.
It allowed AMD to increase the core counts to never-before-seen figures in
the consumer market and set the groundwork for a revolution of sorts.
The Ryzen 9 5950X features 16 cores, while the Threadripper flagship, the
3990X, boasts an insane core count of 64, the same as the Epyc Rome parts.
This means that, at any given price point, AMD can deliver more cores, more
threads, and therefore better multi-threaded performance than Intel can,
even after a number of price cuts.
There are pros and cons to having the CCX as Ryzen’s basic functional unit.
A negative is that the baseline cost of manufacturing can be somewhat high,
since AMD needs to pay for a minimum of four cores. However, this is offset
by the fact that Team Red salvages partially functional CCXs with, say, two
or three working cores to create different SKUs. For example, the Ryzen 5
3600 features two CCXs, each of which has one core disabled, for a total of
6 functional cores.
However, while the CCX is the basic unit of silicon, at an architectural
level a CCD, or Core Chiplet Die, is the lowest level of abstraction. A CCD
consists of two CCXs paired together using the Infinity Fabric
interconnect. All Ryzen parts, even quad-core parts, ship with at least one
CCD; they just have a differing number of cores disabled per CCX.
System Hub
The SysHub IP is responsible for routing downstream requests from the host
to different system clients: PCIe devices, internal graphics, and the
various clients attached to the System Hub. It also routes upstream DMA
requests from internal devices to the IOHUB or PCIe. Key components include
the nBIF core, which contains the PCI capability registers for the root
complex and internal endpoints and exposes these capabilities to the
operating system. Another component is the GDC (global data converter),
which converts between the AXI and SDP protocols and merges ATHUB
transactions into the scalable I/O network.
The final component is the System Hub core, which consists of multiple
NIC400 blocks that combine client requests according to their VC
assignment.
For AMD APU models from 2011 until 2016, AMD marketed its chipsets as
Fusion Controller Hubs (FCH), and implemented the FCH across its product
range in 2017 alongside the release of the Zen architecture. Before then,
only APUs used FCHs, while its other CPUs still used a northbridge and
southbridge.
Graphics (GFX)
The focus for the STX graphics engine is to substantially increase
performance over the previous-generation mainstream APU, Phoenix, by
bringing in all the PPA features from our mobile product lineup, targeting
leading graphics performance in its product segments.
Performance is primarily measured by 3DMark benchmark tests, which are
important for PC products. Performance is also gauged by a suite of recent
GFX benchmarks such as Aztec, and 3D games including DOTA, Ashes of the
Singularity, Fallout 4, etc.
DF Components
Coherent Master (CCM, GCM)
Interfaces to any client which may cache data, for example the CPU and GPU.
IO Master/Slave (IOM/IOS)
Interface block to the IOHUB. IOM contains master functionality – for
example DMA accesses from an IO device. IOS contains slave functionality,
for accesses from the CPU to a device.
Multi-Socket Bridge (CAKE)
Extends the transport layer off chip, for example for a chiplet-based
design. Not used by STX.
Power Management, Interrupts, Etc. (PIE)
The PIE’s primary function is fabric power management. It sequences fabric
frequency changes and controls up to 4 power zones within the fabric. It
also handles interrupts, register access, debug, RAS functionality and the
loopback controller.
The major IPs within NBIO sub-system include IOHUB, PCIe Controllers,
nBIF, and System Hub.
Chapter 4
SOC Verification
Any chip, whether a simple embedded microcontroller or a complex
system-on-a-chip (SoC), will have one or more processors. Figure 1 shows a
complex electronic system composed of both the hardware and software needed
for electronic devices like smartphones.
The hardware is made up of a complex SoC that incorporates almost all the
components needed for the device. In the case of a smartphone, we integrate
all the hardware components, called IPs (Intellectual Properties), such as
CPUs, GPUs, DSPs, application processors, interface IPs like USB, UART,
SPI, I2C and GPIO, and subsystems like system controllers, memories with
controllers, Bluetooth and WiFi, to create the SoC. Using an SoC helps to
reduce the size and power consumption of the device while improving its
performance.
Black-box method
The black-box method treats the DUT as a black box, with no knowledge of
its internal structure. We are not concerned with how the internal
structure of the design is maintained or changed as long as the external
functionality works as expected (as per the requirements); knowing what the
design does is more important than knowing how it does it. This is the most
widely used test method for system and acceptance tests, as it doesn’t
require professionals with coding knowledge and it provides an external
perspective on the DUT (that of an end user who has no knowledge of the
actual code).
An SOC has several design blocks (IPs) integrated together, so a testbench
for an SOC needs to be more complex and requires several such VIPs to
stimulate and verify the different interfaces.
In SOC verification all the IPs are interconnected, which is why SOC
verification mostly deals with connectivity tests: you need to check
whether the data and control paths are clear, and you need to check them
through the entire hierarchy. The interconnections across all the IPs need
to be verified.
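A connectivity check of the kind described can be sketched as a reachability problem. This is a simplified illustration (the IP names, the `connections` map and `required_paths` are invented for the example, not taken from a real testbench): each IP-to-IP integration is a directed edge, and the test confirms every required data or control path exists through the hierarchy.

```python
# Sketch of an SOC connectivity test: model IP integrations as a
# directed graph and verify each required path is reachable.

from collections import deque

connections = {                         # directed edges: source -> sinks
    "CPU": ["Interconnect"],
    "DMA": ["Interconnect"],
    "Interconnect": ["Memory", "UART", "SPI"],
}

required_paths = [("CPU", "UART"), ("DMA", "Memory"), ("CPU", "SPI")]

def reachable(src, dst):
    """Breadth-first search from src, following the connection edges."""
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in connections.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

for src, dst in required_paths:
    status = "OK" if reachable(src, dst) else "MISSING"
    print(f"{src} -> {dst}: {status}")
```

In a real flow these checks are driven by the integration specification and run in the testbench (often as formal connectivity checks or directed simulations), but the underlying question is the same: is every required path through the hierarchy actually wired up?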
After all VIPs are integrated into the testbench, the actual verification
of the DUT (IP/SOC) still needs to follow a verification plan that captures
the design features: simulating the DUT and all interactions of its various
features, debugging failures, collecting coverage, and making sure the DUT
is functionally correct.
The only difference is in the intent of the specification. With IP, the intent is
to provide a specification to be manufactured into a physical design. With
VIP, the intent is to provide a specification for verification of the
functionality of a design block or full system.