
SOC DESIGN VERIFICATION

Different types of ASIC Chips


Chiplet vs Monolithic
SOC Components
CCD vs IOD
SOC | IP Subsystem level Verification
Chapter 1
Introduction
1.1 What is ASIC ?
Application-specific integrated circuits are highly specialized devices. Unlike
other devices, ASICs are non-standard integrated circuits constructed for one
specific application only. Work on ASICs began in the early 1980s. Today,
ASICs are changing the way electronic systems are designed, manufactured,
and marketed, so it is important in this challenging design world to understand
the nature, options, design methodologies, and costs of ASIC technology. Some
examples of ASIC chips include chips for satellites, chips designed to run a
cell phone, Bitcoin miners, and the chip used in a voice recorder. Some ICs
that are not ASICs include memory chips like ROM and DRAM, and generic ICs at
the SSI, MSI and LSI levels.

The major advantage of using an ASIC is that the overall design can be
integrated into one chip, reducing the number of additional circuits required.
Modern ASIC designs generally include 32-bit processors, memory blocks and
other large building blocks; such a modern ASIC is known as an SoC
(System-on-a-Chip). Usually, ASIC designs are undertaken only for products
with a large production run: since the up-front cost of an ASIC is high, it is
recommended only for high-volume products. Today, many digital ASIC designers
use hardware description languages (HDLs) like Verilog and VHDL to describe
the function and design of an ASIC chip.
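
As a small illustration of HDL-based design entry, here is a minimal Verilog
sketch of a 2-to-1 multiplexer (the module and signal names are illustrative):

    module mux2 (
      input  wire a,    // data input 0
      input  wire b,    // data input 1
      input  wire sel,  // select line
      output wire y     // selected data output
    );
      // y follows b when sel is high, otherwise a
      assign y = sel ? b : a;
    endmodule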

1.2 Types of ASIC


There are basically three types of ASIC chip designs: Full Custom Design,
Semi-Custom Design and Programmable ASIC.
Full Custom Design
Full-custom design is a design methodology for integrated circuits in which
the resistors, transistors, capacitors, digital logic and analog circuits are
all individually positioned in the circuit layout. Full-custom designs are
often referred to as "handcrafted" designs. A microprocessor is a classic
example of a full-custom IC. The manufacturing and design of such ICs is very
expensive; eight weeks is a typical manufacturing lead time.

Maximum performance, minimized area and the highest degree of flexibility are
the major features of this design style. However, the risk is higher because
the whole design is untested and is not built from library elements that have
been proven before. It also requires highly skilled designers and can take
years to complete. Full-custom design is therefore best suited to products
where the effort can be concentrated on one specific, high-value application.

Semi- Custom Design


Semi-custom design is an alternative to full-custom design. Here, components
from a standard library are used for the design. Thus, in semi-custom ASIC
designs, all logic cells are predesigned and only some mask layers are
customized. The advantage of using predesigned logic cells from a library is
that it makes semi-custom ASIC design much easier. Standard cell libraries are
themselves usually designed using full-custom techniques. There are basically
two types of semi-custom ASICs: standard-cell-based ASICs and gate-array-based
ASICs.

Standard Cell Based ASICs


A standard-cell-based ASIC uses predesigned logic cells such as logic gates,
flip-flops, multiplexers and demultiplexers. These predesigned logic cells are
known as standard cells. Flexible blocks, also known as standard cell areas,
consist of many rows of standard cells. The standard cell areas can be used in
combination with larger predesigned cells such as microcontrollers. These
larger cells are also known as Megacells, Megafunctions, System-Level Macros,
Full-custom Blocks, Fixed Blocks, or Cores.

The major role of the ASIC designer here is the placement and interconnection
of the cells. Advantages of standard-cell-based designs include lower cost,
reduced turnaround time and lower risk compared to full-custom designs. The
disadvantage is the time and expense needed to create the standard cell
library and to fabricate these designs.

Table 1: Examples of Standard Cell Products

Gate Array Based ASIC


A gate-array-based ASIC is a prefabricated silicon chip in which transistors,
logic gates, and other active devices are placed at predefined positions and
manufactured on a wafer. The predefined pattern of a gate-array-based ASIC is
known as the Base Array, and the element or logic cell within the base array
is called the Base Cell. Gate-array-based ASICs are popularly known as Masked
Gate Arrays. Here the ASIC designer can choose predesigned logic cells from
the gate array library for easier design; the cells in the gate array library
are often called Macros. Reduced turnaround time and lower cost are its main
advantages over standard-cell and full-custom designs.
Table 2: Examples of Gate Array Products

The three types of gate-array-based ASICs are Channeled Gate Arrays,
Channelless Gate Arrays and Structured Gate Arrays. In a Channeled Gate Array,
the space for interconnect between the rows of cells is fixed in height, while
in a Channelless Gate Array there is no predefined space between the rows of
cells. The Structured (or Embedded) Gate Array combines features of both
standard-cell-based and gate-array-based ASICs.

Figure: Channelless Gate Array and Structured Gate Array

Programmable ASIC
Programmable ASICs are classified into Programmable Logic Devices and
Field Programmable Gate Arrays.

Programmable Logic Devices (PLDs)


PLDs are electronic devices used to build reconfigurable circuits. Unique
features of PLDs are fast turnaround, a large programmable interconnect, and
no customized logic cells or mask layers. PAL, PLA, GAL, ROM, PROM, EPROM,
EEPROM, UVPROM, etc. are some examples of programmable ICs.
Table 3: Examples of some PLD Products

Field Programmable Gate Arrays (FPGAs)


FPGAs are larger, more complex reconfigurable devices. Unique features of
Field Programmable Gate Arrays are programmable logic cells and programmable
interconnect; no mask layer is customized. Xilinx, Altera, QuickLogic and
Actel are some of the important FPGA companies.

Table 4: Examples of FPGA

ASIC Design
There are various steps involved in ASIC chip design. A brief description is
given below.

1. Design Entry: the designer starts the design with a text description or a
system-specification language such as an HDL or C.
2. Logic Synthesis: produces a netlist describing the logic cells and their
interconnections (a small example follows this list).
3. System Partitioning: a large design is partitioned into smaller ASIC-sized
blocks.
4. Prelayout Simulation: checks whether the design functions correctly.
5. Floorplanning: plans the arrangement of the netlist blocks on the chip.
6. Placement: places the cells within each block.
7. Routing: creates the necessary interconnections between the cells.
8. Circuit Extraction: translates the completed layout back into an electrical
circuit.
9. Postlayout Simulation: checks the final layout of the design.
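
As a sketch of what synthesis produces, the 2-to-1 multiplexer shown earlier
might map onto standard cells as below; the cell names (INV_X1, NAND2_X1) are
hypothetical and depend on the actual cell library used.

    // Hypothetical gate-level netlist for the mux2 module after synthesis.
    module mux2 (a, b, sel, y);
      input  a, b, sel;
      output y;
      wire   sel_n, n1, n2;
      INV_X1   u0 (.A(sel),             .ZN(sel_n)); // invert the select
      NAND2_X1 u1 (.A1(a),  .A2(sel_n), .ZN(n1));    // product term a & ~sel
      NAND2_X1 u2 (.A1(b),  .A2(sel),   .ZN(n2));    // product term b & sel
      NAND2_X1 u3 (.A1(n1), .A2(n2),    .ZN(y));     // NAND of NANDs = OR of terms
    endmodule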
Applications of ASIC Technology
Application-specific integrated circuits find many applications in the
medical, industrial, automotive and sensor fields. Today ASIC chips are used
in satellites, modems, personal computers, etc. Electronic odometers and
engine monitors are ASIC products suited to automotive applications: an
electronic odometer records the mileage of a vehicle, while an engine monitor
and warning-light controller tracks parameters such as temperature and
voltage. ASICs are also widely used in industry; some ASIC-based industrial
products are the micro-power 555 programmable timer, thermal controllers, and
8-bit microcontrollers. In medical applications, biometric monitors and
hearing aids are typical products. Many ASIC products are now appearing for
security applications as well, one of them being RFID tags. In short, ASICs
can serve many applications, and in the near future we can expect lower-cost
ASIC technology.

Programmable Logic Device


PLDs are semiconductor devices that can be programmed to realize a required
logic function. Because of the advantage of reprogrammability, they have
replaced special-purpose logic devices such as logic gates, flip-flops,
counters and multiplexers in many semi-custom applications. This reduces
design time and thus the time for the product to reach the market. A PLD
consists of arrays of AND and OR gates, which can be programmed to realize the
required logic function.

A device programmer blows fuses on the PLD to control the operation of each
gate. Inexpensive software tools are used for quick development, simulation
and testing, so the design cost is comparatively low. Another important
advantage is that customers can modify the design as requirements change.

The most commonly used Programmable Logic Devices are listed here

(a). Programmable Read Only Memory


Programmable read-only memory is a memory chip onto which data can be written
only once. Once data is written to a PROM, it remains there permanently, so it
is also called one-time programmable memory. The memory chip is delivered
blank and the programmer transfers the data onto it. A blank PROM consists of
many fuses which can be selectively burned out during the first programming.
PROMs are used in applications such as a computer BIOS where reprogramming is
not required.

Block Diagram of Programmable Read only memory

The above figure shows the block diagram of a PROM. It consists of a fixed
AND gate array followed by a Programmable OR gate array. AND gate array
is used as the address decoder which selects the corresponding address
location based on the input address provided to it. Data is stored in the OR
gate array. It contains programmable fuses, which can be burned off
depending on the data values that are to be stored.

Internal structure of each block in the PROM is shown in the above figure.
Here, A and B are the address inputs and Y is the data output. AND arrays
are fixed to select each row (address location) for corresponding inputs. As
shown in the figure, data in each memory location is determined by the
fuses in the OR array. If the fuse is not burned off, the charge sent through
the row is received at the output, indicating a logic one; when the fuse is
burned off, the signal cannot reach the data output and is read as logic zero.
Thus, binary data is stored in a PROM using fuses.
In this example PROM, a single data bit can be stored in each memory location.
Since there are four address locations, two address input bits are required to
select them. The working of a PROM is easier to understand through an example.
Consider the address input "00". As shown in the figure above, the inverted
values of both address inputs are given to the first AND gate, so both inputs
of the first AND gate are at logic one and the first address location is
selected. The data in the first address location is then determined by the
presence or absence of fuses in the OR array: if the fuse at this location is
burned off, the output data is '0'. Similarly, in this example, logic one is
stored in the second and third locations and logic zero in the fourth
location.

Programmable Read Only Memory as Logic Device


Programmable read-only memory can also be used as a logic device. A logic
device performs an operation on one or more inputs and produces a single
output, and that output is fixed for each input combination. Therefore, if the
device's behavior (its truth table) is stored, the PROM can act as a logic
device.

For this, the address pins of the PROM are used as the inputs of the logic
device, and the PROM data output serves as the logic output. The output for
each input combination is stored in the corresponding memory location. For
example, an XOR gate can be implemented by storing its output values in the
respective address locations.
We know that the output of a two-input XOR gate is logic one if exactly one of
its inputs is at logic one. The table below shows a two-bit PROM with four
address locations in which the data '0', '1', '1' and '0' are stored:

Address   Data
  00        0
  01        1
  10        1
  11        0

Therefore, if the input is '00' (the PROM address input), the data ('0')
stored in memory location '00' is fetched and output. Similarly, we get logic
one at the output for the input combinations '01' and '10', and logic zero for
'11'. Thus the device works like a two-input XOR gate. Any other two-input
logic device can be implemented with this PROM by changing the data stored in
the memory. Refer again to the figure from the earlier example.

If there are m address bits, then 2^m locations can be addressed. These 2^m
locations can hold 2^(2^m) different data patterns, so 2^(2^m) distinct logic
functions can be implemented using an m-input PROM. In our example, we used a
simple PROM with two address bits, so 2^(2^2) = 16 different data patterns can
be stored, each implementing a particular two-input logic function.
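
A behavioral Verilog sketch of this idea, with the XOR truth table stored as
the contents of a 4 x 1 memory (names are illustrative):

    // PROM used as a logic device: address inputs act as logic inputs,
    // and the stored data implements the XOR truth table.
    module prom_xor (
      input  wire [1:0] addr, // logic inputs A (addr[1]) and B (addr[0])
      output wire       y     // logic output
    );
      reg mem [0:3];          // one data bit per address location
      initial begin
        mem[0] = 1'b0;        // 00 -> 0
        mem[1] = 1'b1;        // 01 -> 1
        mem[2] = 1'b1;        // 10 -> 1
        mem[3] = 1'b0;        // 11 -> 0
      end
      assign y = mem[addr];   // "fetch" the stored output value
    endmodule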

(b). Programmable Array Logic (PAL)


PAL or Programmable Array Logic is used to implement logic functions in
digital circuits. The structure of a PAL can be divided into two parts: the
AND array and the OR array. The AND array is programmable, meaning the
connections to the inputs of each AND gate pass through fuses; when a
particular input is not required to implement a specific logic function, its
fuse can be burned off. The AND array is followed by a fixed OR array, which
sums the outputs of the AND gates. Input connections are fixed in the OR
array, so no changes can be made in this section of the PAL device.

Block Diagram of Programmable Array Logic


Internal structure of a two input PAL logic.

Fig. Programmable Array Logic


For example, consider an XOR gate. If the inputs are labeled A and B, then the
output is given by

Y = AB' + A'B

To implement this, we only need inputs A and B' in the first (top) AND gate
and A' and B in the second (bottom) AND gate. The remaining two inputs of each
AND gate are not needed, so their fuses are burned off. A two-input XOR gate
built using Programmable Array Logic is shown below.
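
In RTL terms, the programmed PAL simply realizes this sum of products; a
minimal Verilog equivalent (illustrative only):

    // Each wire below corresponds to one programmed AND gate in the PAL;
    // the fixed OR gate sums the two product terms.
    module pal_xor (
      input  wire a, b,
      output wire y
    );
      wire p0 = a & ~b;   // top AND gate: inputs A and B'
      wire p1 = ~a & b;   // bottom AND gate: inputs A' and B
      assign y = p0 | p1; // fixed OR array
    endmodule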
(c). Programmable Logic Array (PLA)
Programmable Logic Array or PLA is used to implement logic functions in
digital circuits. The structure has programmable AND-matrix, programmable
OR-matrix, input and output buffers. Block diagram of a PLA device is as
shown below.

Both the inverted and original values of each PLA input are provided by input
buffers. The inputs to both the AND and OR gates pass through fuses, which can
therefore be burned off depending on our requirements. The structure of a PLA
with all fuses intact (before programming) is shown below.

A two-input PLA structure can be used to realize any two-input logic gate. For
that, the fuses not required to realize the particular logic function are
burned off. For example, an XOR gate realized using a programmable logic array
is shown below. The characteristic equation of an XOR gate consists of two
minterms:

Y = AB' + A'B

Each AND gate generates a particular minterm, and the required minterms are
selected using the fuses at the inputs of the OR gate.

(d). Generic Array Logic (GAL)


A GAL or Generic Array Logic device consists of a reprogrammable PAL matrix
and a programmable output cell. GAL is an improved form of PAL which uses
electrically erasable CMOS cells instead of fuses; therefore, the AND matrix
of a GAL can be reprogrammed several times, unlike one-time-programmable PAL
devices. The AND matrix is followed by a fixed OR matrix (inside the output
cell), used to sum the minterms from the AND outputs. A block diagram of a GAL
device is shown here.
Output Logic Macrocell
Another added feature of GAL is its reprogrammable output logic, called the
OLMC (Output Logic Macrocell). The internal structure of an output cell is
shown below.

As shown in the figure, the three main components of an output cell are:

(a). N-input OR,

(b). D-flip-flop,

(c). Multiplexers.

As in a PAL, the OR gates sum the minterms from the outputs of the AND gates.
An OLMC cell contains a D flip-flop, which is used to implement sequential
circuits. The multiplexers in the OLMC select whether the signal is routed to
the external output or to the feedback path, and whether the sequential or
combinational output (taken from the output or input side of the D flip-flop)
is used, depending on the requirement.
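
A minimal Verilog sketch of the OLMC output selection (the signal names and
the single configuration bit are illustrative simplifications):

    // reg_mode chooses between the registered (sequential) and
    // combinational output of the macrocell.
    module olmc_out (
      input  wire clk,
      input  wire sop,      // sum of products from the OR gate
      input  wire reg_mode, // configuration bit: 1 = registered output
      output wire out
    );
      reg q;
      always @(posedge clk)
        q <= sop;                       // D flip-flop for sequential circuits
      assign out = reg_mode ? q : sop;  // multiplexer selects the output path
    endmodule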

(e). Complex Programmable Logic Device (CPLD)


A CPLD is a network of PLDs connected together through a switching matrix. A
general block diagram of a CPLD is shown here. The global interconnection
matrix, as shown in the figure, is reconfigurable, so we can change the
connections between the Functional Blocks depending on our requirements.
Blocks of Complex Programmable Logic Device

Each Functional Block (FB) in the CPLD contains a reprogrammable AND/OR array
along with a bank of macrocells. Therefore, multiple types of logic functions,
both combinational and sequential, can be implemented using a CPLD. As shown
in the figure, it is connected to the external world through the I/O blocks.
The entire device contains thousands to tens of thousands of logic gates, so
designs more complex than a single PLD can handle can be implemented in a
CPLD.

(f). Field-Programmable Gate Array (FPGA)


The FPGA or Field Programmable Gate Array is now mainstream in modern IC
design and verification. The circuit design is written in a hardware
description language, which is then synthesized into a bitstream that is
loaded into the FPGA. FPGAs can be used to realize anything from simple
digital logic gates to complex mathematical functions.

Logic blocks in an FPGA are arranged in a two-dimensional array, and a
hierarchy of reconfigurable interconnect is programmed to implement complex
circuits. The desired logic functions are implemented in the logic blocks,
which are then connected together using programmable switch boxes. The figure
shows an FPGA architecture containing an array of logic blocks, interconnect,
switch blocks and I/O blocks.

Complex designs are first divided into small sub-functions. Logic blocks
implement these sub-functions, and the connections between them are made using
the programmable interconnect.
The figure shows the architecture of a programmed FPGA: the required
sub-functions are implemented in individual logic blocks, which are then
interconnected using the switch boxes.

Logic Block
Logic blocks in the FPGA are used to implement the sub-functions. Any type of
logic function, combinational or sequential, can be implemented using a logic
block, so logic blocks are commonly referred to as configurable logic blocks
(CLBs). A basic logic block contains:

Lookup table (LUT): implements the combinational logic function.
Register (D flip-flop): stores the output of the lookup table.
Multiplexer: selects between the LUT output and the registered output.

A simple block diagram of a logic block consists of a lookup table, a register
and a multiplexer, as shown in the figure. SRAM is used to implement the
lookup table, so the desired logic function can be implemented by varying the
data stored in the SRAM. The output of the lookup table is fed to both the
multiplexer and the D flip-flop. The D flip-flop registers the output, and
depending on the application the multiplexer selects either the direct LUT
output or the registered output. Therefore, using the select input of the
multiplexer, we can implement both combinational and sequential circuits with
logic blocks. Many such logic blocks are configured and interconnected through
the switch boxes to build the desired complex circuit.
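
A minimal Verilog sketch of such a logic block, with the LUT modeled as a
16-bit configuration word (a simplification of the SRAM; all names are
illustrative):

    // Configurable logic block: 4-input LUT + register + output multiplexer.
    module clb (
      input  wire        clk,
      input  wire [3:0]  in,       // LUT inputs
      input  wire [15:0] lut_init, // configuration: the LUT truth table
      input  wire        use_reg,  // configuration: select registered output
      output wire        out
    );
      wire lut_out = lut_init[in]; // the inputs index into the truth table
      reg  q;
      always @(posedge clk)
        q <= lut_out;              // D flip-flop stores the LUT output
      assign out = use_reg ? q : lut_out; // sequential or combinational output
    endmodule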
Compared to other programmable logic devices, FPGAs have very high logic
density: a single FPGA chip can contain from ten thousand to eight million
gates. More complex logic circuits can therefore be implemented using FPGAs,
and their inherently parallel structure allows many operations to proceed
concurrently rather than sequentially.

After manufacturing, the customer configures the desired circuit into the
FPGA. The main advantage of an FPGA is its ability to be reprogrammed, so it
is preferred during the design phase, when the requirements change
continuously. Custom ICs, by contrast, are expensive, not programmable, and
take a long time to design. The disadvantages of FPGAs are that they are
slower and draw more power. The FPGA configuration is stored in volatile RAM,
so it is lost once power is removed. In practical applications, the
configuration is therefore stored externally in non-volatile flash memory,
from which it is automatically restored when power returns.
Chapter 2
System On Chip

What is System on Chip?


SoC, the acronym for system on chip, is an IC that integrates all the
components of a system into a single chip. It may contain analog, digital,
mixed-signal and radio-frequency functions, all on a single chip substrate.
Today, SoCs are very common in the electronics industry due to their low power
consumption, and embedded system applications make great use of them.

An SoC consists of:

Control Units: in SoCs, the major control units are microprocessors,
microcontrollers, digital signal processors, etc.
Memory Blocks: ROM, RAM, Flash memory and EEPROM are the basic memory units
inside an SoC.
Timing Units: oscillators and PLLs are the timing units of the system on chip.
Other peripherals of the SoC are counter-timers, real-time timers and
power-on-reset generators.
Analog interfaces, external interfaces, voltage regulators and power
management units form the basic interfaces of the SoC.

The SoC design flow covers the development of both the hardware and the
software of the design. In general, it consists of:

Hardware and Software Modules: hardware blocks are developed from
pre-qualified hardware elements, and software modules are integrated using a
software development environment. Hardware description languages like Verilog,
VHDL and SystemC are used for developing the modules.
Functional Verification: the SoC is verified for logical correctness before it
is handed off to the foundry.
Verify Hardware and Software Designs: for the verification and debug of the
hardware and software of SoC designs, engineers employ FPGA prototyping,
simulation acceleration, emulation and other technologies.
Place and Route: after debugging, the entire design is placed and routed onto
the integrated circuit before being sent for fabrication. Full-custom,
standard-cell and FPGA technologies are commonly used in fabrication.
Fig. Design Flow of SOC

Advantages of SoC
Low power.
Low cost.
High reliability.
Small form factor.
High integration levels.
Fast operation.
Greater design flexibility.
Small size.
Disadvantages of SoC
High fabrication cost.
Increased complexity.
Time-to-market pressure.
Greater verification effort.

SoC Design Challenges


The different SoC design challenges are given below:

Architecture Strategy
Design for Test Strategy
Validation Strategy
Synthesis and Backend Strategy
Integration Strategy
On-chip Isolation

Architecture Strategy
The kind of processor used in the SoC is an important consideration, as is the
kind of bus to be implemented.
Design for Test Strategy
Common physical defects are modeled as faults, and the necessary test circuits
are included in the SoC design to help detect them.
Validation Strategy
The validation strategy for SoC designs involves two major issues: verifying
the individual IP cores, and verifying the integration of the system.
Synthesis and Backend Strategy
Many physical effects have to be considered during SoC synthesis and backend
design, such as IR drop, crosstalk, 3D noise, antenna effects and EMI. In
order to tackle these issues, chip planning, power planning, DFT planning,
clock planning, and timing and area budgeting are required early in the
design.
Integration Strategy
In the integration strategy, all the strategies listed above have to be
considered together and assembled into a coherent overall plan.
On-chip Isolation
For on-chip isolation, effects such as the impact of process technology,
grounding effects, guard rings, shielding and on-chip decoupling have to be
considered.

What is a chiplet?
A chiplet is a sub processing unit, usually controlled by a I/O controller chip
on the same package. Chiplet design is a modular approach to building
processors. Both AMD and Intel, current major CPU manufacturers, are
adopting chiplet designs for their current product line ups. Chiplets help
increase production by way of better silicon yields. Higher yields and
modular building mean a producing high core count parts mean less waste.

Chiplets allow manufacturers to increase chip yields over monolithic CPU
designs, in which all pieces of a processing unit are built into a single
piece of silicon. The yield increase comes from the fact that a monolithic
chip with a defect on one core must either be sold as a lower model with fewer
cores or thrown out entirely. With the chiplet approach, only the single
defective chiplet is discarded, and the CPU can still be sold as the desired
model by using a good chiplet in its place.

Chiplet design is an updated version of the old idea of putting multiple
silicon chips (dies) in the same package and having them communicate through
an organic substrate or a silicon interposer. Recent advances in substrate
design have allowed much higher bandwidth between chiplets, which means the
performance penalty of going from a monolithic chip to multiple chiplets is
lower than before. However, there is still a significant latency penalty for
inter-chiplet communication. This used to be called multi-chip module (MCM)
design; the distinction between MCM and chiplet is blurry, and "chiplet" may
be largely a marketing term for multiple chips in the same package.

The goal of chiplet design is to reduce manufacturing costs by reducing the
size of each chip and reducing the number of chip types that need to be made
to satisfy the entire range of the market. In an apples-to-apples comparison,
a single die will always be superior in performance and power to a chiplet
design, but the chiplet design will likely be much cheaper to manufacture,
leading to much lower prices.

Here is an IBM POWER5 MCM from 2004. Note that there are two types of
chips on the substrate.
The 4 rectangular chips in the corners contain two POWER5 CPU cores each,
for a total of 8 cores. The 4 larger square chips in the center contain 9MB of
last-level cache each, for a total of 36MB. The tiny squares carpeting the
design are capacitors, probably for smoothing out power delivery.

Here is an AMD Epyc MCM from 2017. Each of the 4 rectangular chips contains 8
cores, for a total of 32 cores. There are no dedicated cache chips; the
last-level cache is split among the 4 chips. There is only one type of chip,
and half the number of chips compared to the IBM POWER5.

Here's the AMD Epyc 2 "chiplet"-based MCM released in 2019. AMD significantly
increased the number of chips on the package from 4 to 9. The 8 small chips
contain 8 cores each, for a total of 64 cores, while the large chip in the
center contains the I/O and the last-level cache. Interestingly, the 8 small
chips are built on TSMC 7nm technology, whereas the large I/O die is built on
a more mature 14nm process. Since 14nm is significantly more mature and
cheaper to manufacture than 7nm, this allows AMD to further save on costs.
With this assembly strategy, AMD can produce Epyc 2 server SKUs with anywhere
between 8 and 64 cores by removing or disabling the small chips. The downside
is an additional latency and power penalty when communicating off-chip, but
this may not matter depending on the application.
However, at a high level, there is no real difference between the packaging
used for the Epyc 2 and the packaging used for the IBM POWER5. Both have
lots of chips (8 vs 9) on the same package, and both have two different types
of chips (cores vs cache/IO). It’s therefore likely that “chiplet” is just a
marketing term for an updated version of something done for decades.

Difference between Monolithic & Multichiplet


Standard integrated circuit chips are monolithic, meaning that all components
are placed on a single semiconducting (silicon) die: cores, processor, data
fabric and control fabric all on one die. Drawbacks of monolithic construction
include:

Low power rating.
Poorer isolation between components.
No possibility of fabricating inductors.
A small range of values for the passive components used in the ICs.
Lack of flexibility in circuit design, since any variation in the circuit
requires a new set of masks.

Chiplet designs simply put multiple dies into the same package.

The goal of chiplet design is to reduce manufacturing costs by reducing the
size of each chip and reducing the number of chip types that need to be made
to satisfy the entire range of the market. The design is divided into a number
of dies, each containing a number of cores.

AMD introduced the CCX configuration, in which the design consists of separate
CCD and IOD blocks.

Fig. AMD Ryzen 1000 Zen 1 CCD

The 1st Gen Ryzen architecture was relatively simple: an SoC design with
everything from cores to I/O and controllers on the same die. The CCX concept
was introduced, wherein CPU cores were grouped into four-core units combined
using the Infinity Fabric. Two quad-core CCXs formed a die.
It's important to note that even though CCXs were introduced, the consumer
Ryzen chips were still monolithic single-die designs. Furthermore, although
the L3 cache was shared across all the cores in a CCX, each core had its own
slice. Accessing the last-level cache (LLC) slice of another core was
relatively slow, and even more so if it was on the other CCX. This caused poor
performance in latency-sensitive applications like gaming.

Things largely remained the same with Zen+ (a node shrink), but Zen 2 was a
major upgrade. It was the first chiplet-based design for consumer CPUs,
featuring up to two compute dies (CCDs) and one I/O die. AMD added a second
CCD on the Ryzen 9 parts for core counts never before seen on the consumer
front.
The 16MB L3 cache was more accessible (read: faster) for all the cores on the
CCX, greatly improving gaming performance. The I/O die was separated, and
the Infinity Fabric was upgraded. At this point, AMD was slightly slower in
gaming but offered superior content creation performance than rival Intel
Core chips.

Zen 3 further refined the chiplet design, eliminating the CCX and merging the
eight cores and 32MB cache into one unified CCD. This drastically reduced
cache latency and simplified the memory sub-system. For the first time,
AMD’s Ryzen processors offered better gaming performance than archrival
Intel’s. Zen 4 makes no notable changes to the CCD design other than making
them smaller.

Intel has been following a monolithic approach to processor design.


Essentially, this means that all cores, cache, and I/O resources for a given
processor are physically on the same monolithic chip. There are some clear
advantages to this approach.

The most notable is reduced latency. Since everything is on the same physical
substrate, the cores take much less time to communicate with each other,
access the cache, and access system memory. This leads to optimal performance.
If everything else is the same, the monolithic approach will always net you
the best performance. There’s a big drawback, though. This is in terms of cost
and scaling. We need to take a quick look now at the economics of silicon
yields. Strap in: things are going to get a little complicated.
Defects and Yields
When foundries manufacture CPUs (or any piece of silicon), they rarely achieve
100 percent yields. Yield refers to the proportion of usable parts made. On a
mature process node like TSMC's N7 (7nm), silicon yields can exceed 80
percent: you get a lot of usable chips with minimal wastage. The inverse is
that for every 10 chips you manufacture, you may still have to discard one or
two defective units. Each discarded unit costs money to make, so that cost has
to be factored into the final selling price.
At low core counts, a monolithic approach works fine. This largely explains
why Intel’s mainstream consumer CPU line has, until Ryzen, topped out at 4
cores. Increasing the core count on a monolithic chip dramatically increases
costs.

On a high-end monolithic die, every (or nearly every) core has to be
functional. If you're fabbing an eight-core chip and only 7 of the 8 cores
work, you still can't sell it as an eight-core part. Remember what we said
about yields being around 80 percent?

Mathematically, the per-core defect rate compounds with every additional core
on a monolithic die, to the point that with, say, a 28-core Xeon, Intel has to
throw away one or two defective chips for every usable one, since all 28 cores
have to be functional. Costs don't just scale linearly with core count: they
scale exponentially because of wastage.
The economics are much more forgiving with the chiplet approach, as costs
scale roughly linearly with core count. Because AMD only needs a functional
four-core block (a single CCX) at most, they don't have to throw out massive
stocks of defective CPUs.
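
A rough worked example (illustrative numbers, assuming independent per-core
defects): if the probability that any one core is defect-free is 90%, a
monolithic die on which all n cores must work yields 0.9^n, which is about 43%
for n = 8 but only about 5% for n = 28. With 4-core chiplets, each chiplet
yields 0.9^4, or about 66%, regardless of how many chiplets the final product
uses, which is why chiplet costs scale roughly linearly with core count.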

The second advantage comes from the ability to salvage those partially
defective dies. Whereas Intel mostly throws them out, AMD disables cores on a
per-CCX basis to create SKUs with different core counts.

Chiplet or Monolithic: Which is Better?


The chiplet approach is going to see widespread adoption in the coming years,
from both AMD and Intel, for CPUs as well as GPUs. Moore's law, which
predicted a doubling in processing power every couple of years, driven mainly
by die shrinks (56nm to 28nm, 28nm to 14nm, 14nm to 7nm), has comprehensively
slowed down.

Intel released its first major chiplet-based (tiled) CPUs as Xeon Sapphire
Rapids a while back. The 14th Gen Core processors codenamed "Meteor Lake" will
be the first chiplet lineup for Team Blue's consumer parts, expected to hit
the market in the final quarter of 2023.
Chapter 3
SOC Architecture & Functionality
SOC Architecture
SoC stands for System On Chip. It is a small integrated chip that contains all
the required components and circuits of a particular system. The
components of SoC include CPU, GPU, Memory, I/O devices, etc.

Fig. SOC Architecture

Difference between CCD & IOD


The CCD contains the CPU cores. I/O stands for Input/Output and, as the name
suggests, the IOD contains the components that connect to the rest of the
system, such as the memory controller and the PCIe controller.

The I/O die is manufactured on the GF 12nm process, while the CCD is
manufactured on TSMC 7nm. This is done because I/O doesn't scale (or at least
doesn't scale as well as the cores), so manufacturing it on a cheaper and more
mature process lets AMD improve yields and reserve the TSMC 7nm process for
the CPU cores, which benefit the most from it.

AMD’s Ryzen CPUs are made up of core complexes called CCDs and/or
CCXs. But what is a CCX and how is it different from a CCD in an AMD
processor? Let’s have a look. There are many factors responsible for AMD’s
recent success in the consumer market. But, the chiplet or MCM design
(Multi-chip Module) is at the heart of it.
It allowed AMD to increase the core counts to never-before-seen figures in
the consumer market and set the groundwork for a revolution of sorts.

The Ryzen 9 5950X features 16 cores while the Threadripper flagship, the
3990X boasts an insane core count of 64, the same as the Epyc Rome parts.
This means that, at any given price point, AMD can deliver more cores, more
threads, and therefore, better multi-threaded performance, than Intel can,
even after a number of price cuts.

What is an AMD CCD and CCX


These two functional units lie at the heart of AMD’s modular approach to
Ryzen. The basic unit of a Ryzen processor is a CCX or Core Complex, a
quad-core/octa-core CPU chiplet with a shared L3 cache. In newer Ryzen
3000 and 5000 parts, the amount of L3 is higher and it’s referred to as
“Gamecache.”

There are pros and cons to having the CCX as Ryzen's basic functional unit. A
negative is that the baseline manufacturing cost can be somewhat high, since
AMD needs to pay for a minimum of four cores. However, this is offset by the
fact that Team Red salvages partially functional CCXs with, say, two or three
functional cores to create different SKUs. For example, the Ryzen 5 5600X
features an eight-core CCD with two cores disabled, for a total of 6
functional cores.

However, while the CCX is the basic unit of silicon, at an architectural level
a CCD or Core Chiplet Die is the lowest level of abstraction. A CCD consists
of two CCXs paired together using the Infinity Fabric interconnect. All Ryzen
parts, even quad-core parts, ship with at least one CCD; they just have a
differing number of cores disabled per CCX.

SMN (System Management Network)


SMN is a packet-switched on-chip network intended for system management in
SoCs. It can be thought of as a packetized AXI4-based network. Its purpose is
to provide a scalable, flexible, and highly configurable interconnect that
interfaces with on-chip IPs through the standard AXI4 protocol, exposing none
of the details of its internal packet-switched protocol. SMN uses source
routing: the routing information is embedded in the packet header, and the
routers direct the packets solely based on this information.
The only state that must be maintained by a router is the destination output
port and VC of each ongoing packet. This output port is specified by the
routing information in the first header flit and must be maintained to direct
the remaining flits of the packet. The information is discarded once the
packet's tail flit is forwarded to the next hop.

Fig. SMN Network
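
As an illustration of the source routing described above (the field names and
widths below are purely hypothetical, not the actual SMN packet format), a
header flit might carry the full route as a list of per-hop output-port
choices that each router consumes:

    // Hypothetical source-routed header flit, for illustration only.
    typedef struct packed {
      logic [1:0]  vc;          // virtual channel assigned to this packet
      logic [23:0] route;       // eight 3-bit hops; each router pops one
      logic [5:0]  payload_len; // number of body flits that follow
    } smn_header_t;

    // Each router uses route[2:0] as its output port, then shifts the
    // route field so the next hop sees its own choice in route[2:0]:
    // next_route = {3'b000, route[23:3]};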

Multimedia Hub (MMHUB)


MMHUB provides data fabric connectivity for IPs using an AXI4-based client
interface network. The AXI4 interface at each IP connects to a physically
adjacent DAGB bridge block in MMHUB, which isolates the IP from MMHUB internal
details. Within MMHUB, this client interface can span large distances (unlike
AXI4, whose valid/ready protocol is not convenient for spanning long
distances). This allows MMHUB client IPs to be physically distant from the
MMHUB SDP DF interface port(s).

The primary functions of the MMHUB are:


o AXI4 client interface network
o GPUVM address translation
o x86/IOMMU address translation
o SDP Interface to Data Fabric
o Scheduling/Arbitration for Quality-of-Service & SDP efficiency.
o AXI4 Interface to System Management Network
Input/Output Hub (IOHUB)
At a high level, the IOHC is a crossbar which connects system devices to the
CPU, and its internals service a wide variety of system IO functions. These
functions include:
• Host (CPU) Requests: decode and route host requests to downstream clients
(PCIe, IOMMU, nBIF and clients below System Hub). If no client claims a
request while routing, the request is sent to the FCH (if present) or is
terminated internally inside the IOHC as an Unsupported Request (UR).
• Client DMA Requests: decode and route DMA requests received from the
downstream clients (PCIe and nBIF/System Hub). The IOHC forwards these
requests to the CPU or, in the case of P2P requests, to another PCIe device or
internal graphics. Unclaimed requests are discarded. DMA request types are
memory read/write.
• IO Traffic: decode and route IO messages for power management, interrupt
remapping and forwarding, legacy ATI vendor-defined NB/SB messages, and
external MCTP master/PCIe vendor-defined messages.
• P2P Requests: route DMA requests from client to client. Request types are
memory read/write.
• Trap Feature: trap NP (non-posted) requests for SMU firmware processing.

The IOMMU is a sub-IP under IOHUB. It is a system function that translates

addresses used in DMA transactions, protects memory from illegal access by I/O
devices, and remaps peripheral interrupts. It consists of a distributed
topology containing multiple IOMMU L1s (L1IMU) and a single IOMMU L2 (L2IMU).
The L1IMU contains level-1 remote TLBs close to the client and supports
processing multiple translation requests from clients concurrently. L1 misses,
as well as a few other transaction types (interrupts, PRI, ATS, etc.), are
forwarded to the L2IMU by the L1. The L2IMU contains level-2 TLBs to service
misses from the L1 TLBs. The L2IMU also hosts the central translation table
walker, which fetches configuration structures (DTE, GCR3) as well as
translation tables (v1/v2); the command processor, which services TLB
invalidations and PRI response handling; and Event/Fault/PRI recording. The
IOMMU may feature multiple instances of remote IOMMUs, consisting of a
dedicated TLB and a host-only page table walker. This module provides
dedicated host translations for real-time clients such as Display and ISP. The
main IOMMU and remote IOMMUs are viewed as a single IOMMU entity from a
software programming model point of view; OS software is not required to
perform any extra programming to enable the IOMMU in the SoC.

System Hub
The SysHub IP is responsible for routing downstream requests from the host to
different system clients: PCIe devices, internal graphics, and various clients
attached to System Hub. It also routes upstream DMA requests from internal
devices to the IOHUB or PCIe. Key components include the nBIF Core, which
contains the PCI capability registers for the root complex and internal
endpoints and exposes these capabilities to the operating system; the GDC
(global data converter), which converts between the AXI and SDP protocols and
merges ATHUB transactions onto the scalable IO network; and the System Hub
Core, which consists of multiple NIC400 blocks that combine client requests
according to VC assignment.

Fusion Controller Hub (FCH)


FCH is the Southbridge IP integrated in the SoC. It contains low bandwidth
platform I/O controllers, ACPI and system reset control logic, GPIOs, and
clock generation blocks.

AMD marketed its chipsets as Fusion Controller Hubs (FCH) for its APU models
from 2011 until 2016, then implemented the FCH across its product range in
2017 alongside the release of the Zen architecture. Before then, only APUs
used FCHs, while AMD's other CPUs still used a northbridge and southbridge.

Graphics (GFX)
The focus for the STX graphics engine is to substantially increase performance
over the previous-generation mainstream APU, Phoenix, by bringing in the PPA
features from the mobile product line-up, targeting leading graphics
performance in its product segments.
Performance is primarily measured by 3DMark benchmark tests, which are
important for PC products. Performance is also gauged by a suite of recent GFX
benchmarks such as Aztec, and by 3D games including DOTA, Ashes of the
Singularity, Fallout 4, etc.

A graphics processing unit (GPU) is a specialized electronic circuit initially


designed to accelerate computer graphics and image processing (either on
a video card or embedded on motherboards, mobile phones, personal
computers, workstations, and game consoles). After their initial design,
GPUs were found to be useful for non-graphic calculations involving
embarrassingly parallel problems due to their parallel structure. Other non-
graphical uses include the training of neural networks and cryptocurrency
mining.

Dedicated graphics processing unit


Dedicated graphics processing units are not necessarily removable, nor do they
necessarily interface with the motherboard in a standard fashion. The term
"dedicated" refers to the fact that the graphics card has RAM dedicated to the
card's use, not to the fact that most dedicated GPUs are removable. This RAM
is usually specially selected for the expected workload of the graphics card
(see GDDR). Systems with dedicated, discrete GPUs are sometimes called "DIS"
systems, as opposed to "UMA" systems (see next section). Dedicated GPUs for
portable computers are most commonly interfaced through a non-standard and
often proprietary slot due to size and weight constraints. Such ports may
still be considered PCIe or AGP in terms of their logical host interface, even
if they are not physically interchangeable with their counterparts.

Integrated graphics processing unit


Integrated graphics processing units (IGPU), integrated graphics, shared
graphics solutions, integrated graphics processors (IGP), or unified memory
architecture (UMA) use a portion of the computer's system RAM rather than
dedicated graphics memory. IGPs can be integrated onto the motherboard as part
of the (northbridge) chipset, or on the same die (integrated circuit) as the
CPU (like AMD APUs or Intel HD Graphics).
Data Fabric (DF)
The function of the data fabric is to provide CPU access to DRAM, MMIO, and
PCI configuration space; to provide a full-bandwidth data path between
graphics and memory; and to provide a data path for internal PCIe devices to
and from memory and the host x86 processor. It provides I/O-coherent access to
DMA devices in the SoC. The DF also provides services such as cache-coherent
communication between the CPU and graphics, the global ordering point for the
SoC, and Quality of Service (QoS) for both hard-real-time and soft-real-time
multimedia devices.
Fig. DF Architecture

DF Components
Coherent Master (CCM, GCM)
Interfaces to any client which may cache data, for example CPU and GPU.

Non-Coherent Master (NCM)


Interfaces to clients that do not cache data, for example MMHUB, DCN, etc.

IO Master/Slave (IOM/IOS)
Interface block to the IOHUB. IOM contains master functionality – for
example DMA accesses from an IO device. IOS contains slave functionality,
for accesses from the CPU to a device.

Transport Switch (TCDX)


Connects all masters and slaves in the system. Routes command/address and data
packets from any source to any destination using the Coherent HyperTransport+
protocol. Configurable for up to 6 ports.

Coherent Slave (CS)
Manages coherence for all the physical memory behind the CS. Connects to the
UMC to access memory. It issues probes for coherent transactions.

Non-Coherent Slave (NCS)
Handles transfer requests which are serviced by blocks other than the UMC, for
example requests targeting IO from GFX or MMHUB.

Multi-Socket Bridge (CAKE)
Extends the transport layer off chip, for example for a chiplet-based design.
Not used by STX.

Power Management, Interrupts, Etc. (PIE)
The PIE's primary function is fabric power management. It sequences fabric
frequency changes and controls up to 4 power zones within the fabric. It also
handles interrupts, register access, debug, RAS functionality and the loopback
controller.

Northbridge I/O (NBIO)


NBIO is the sub-system design hierarchy that assembles a collection of NBIO
functional blocks to build a high-performance PCIe-compatible I/O interconnect
for SoCs. In general, NBIO provides PCIe connectivity to external devices as
well as to AXI-based IPs. SDP interfaces are implemented on the Data Fabric
and FCH data paths, while SMN interfaces are used for register access through
the Remote-SMU module.

The major IPs within NBIO sub-system include IOHUB, PCIe Controllers,
nBIF, and System Hub.
Chapter 4
SOC Verification
Any chip, from a simple embedded microcontroller to a complex system-on-a-chip
(SoC), will have one or more processors. Figure 1 shows a complex electronic
system composed of both the hardware and software needed for an electronic
device like a smartphone.

Fig. Simple Electronic System

The hardware is built around a complex SoC that incorporates almost all the
components needed for the device. In the case of the smartphone, we integrate
all the hardware components, called IPs (Intellectual Properties), such as
CPUs, GPUs, DSPs, application processors, interface IPs like USB, UART, SPI,
I2C and GPIO, and subsystems like system controllers, memories with
controllers, Bluetooth, and WiFi, to create the SoC. Using an SoC helps reduce
the size and power consumption of the device while improving its performance.

The software is composed of application software and system software. The
application software provides the user interface, and the system software
provides the interface that lets application software deal with the hardware.
In the smartphone case, the application software could be mobile apps like
YouTube, Netflix and Google Maps, and the system software could be the
operating system (OS), like iOS or Android. The system software provides
everything, such as firmware and protocol stacks along with the OS, needed for
the application software to interface with the hardware. As the central
component of the system software, the OS manages multiple application threads
in parallel, memory allocation, and I/O operations.
Difference between SOC, IP and Subsystem Level Verification
SOC Level - Black box verification
IP Level - White box verification
Sub System Level - Gray box verification

Black-box method
The black-box method treats the DUT as a black box, with no knowledge of its
internal structure. We are not concerned with how the internals of the design
are built or changed as long as the outside functionality works as expected
(as per the requirements): knowing what the design does is more important than
knowing how it does it. This is the most widely used method for system and
acceptance tests, as it doesn't require professionals with coding knowledge
and it provides an external perspective of the DUT (that of an end user who
has no knowledge of the actual code).

White-box Method (structural)


White-box testing uses knowledge of the DUT's internal structure to expand
coverage and test every possible flow at the code level, for example statement
coverage, branch coverage or path coverage. It requires programming skills and
is usually preferred at the unit and integration test levels. You can call it
by different names, such as clear-box, glass-box or transparent-box, since you
can see the internal contents of the box.

Gray Box Method


As you might have guessed, this is a hybrid approach that leverages the
strengths of both: a combination of white-box and black-box testing. The
tester has partial, but not complete, knowledge of the internal structure of
the DUT.
Simple System On Chip

Difference between SOC, IP and Subsystem Level Verification

Verification of an IP/SOC design is the process of ensuring the functional

correctness of the IP or the SOC.

The majority of this verification process involves simulation-based techniques
using a framework (also known as a testbench) that consists of various
components such as stimulus generators, scoreboards/checkers, a coverage
model, etc.

A Verification IP (VIP) is a pre-defined functional block that can be inserted
into the testbench and used to simulate the design (either an IP or an SOC)
and verify its functional correctness.

Following is a simple example of a VIP connected to a simple DUT (which can be
a simple IP design block) inside a testbench.
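
Since the original diagram is not reproduced here, a minimal SystemVerilog
sketch of the same idea follows; the toy DUT and the tiny driver/checker are
illustrative stand-ins for a real design block and VIP.

    // Toy DUT: a single register stage, standing in for a simple IP block.
    module dut (
      input  logic       clk,
      input  logic [7:0] din,
      output logic [7:0] dout
    );
      always_ff @(posedge clk) dout <= din;
    endmodule

    // Toy "VIP" inside a testbench: stimulus generator plus checker.
    module tb;
      logic       clk = 0;
      logic [7:0] din, dout;

      dut u_dut (.clk(clk), .din(din), .dout(dout));
      always #5 clk = ~clk;

      initial begin
        repeat (10) begin
          din = $urandom_range(0, 255); // stimulus generator
          @(posedge clk); #1;           // wait for the DUT to capture it
          if (dout !== din)             // scoreboard/checker
            $error("mismatch: dout=%0h expected=%0h", dout, din);
        end
        $display("test done");
        $finish;
      end
    endmodule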

The development of this VIP involves developing transactions/sequences,
drivers, configuration components, an actual test plan for the interface, and
test suites. It also involves verifying the VIP itself in a standalone manner.
IP verification mostly deals with verifying the features associated with a
particular IP or protocol, and it generally involves functional testing. At
the IP level you mostly need to verify the working of the IP and all its
related features, such as clocks, resets, state machines, data packet
processing or data traffic, transaction initiation, and other features
specific to that IP.

An SOC has several design blocks (IPs) integrated together, so a testbench for
an SOC needs to be more complicated and requires several such VIPs to
stimulate and verify the different interfaces.

The following diagram shows an SOC verification testbench for a complex
processor system, which uses several independently developed VIPs integrated
to build the SOC-level testbench.

In SOC verification all the IPs are interconnected, which is why you mostly
deal with connectivity tests: you need to check whether the data and control
paths are clear, and you need to check them through the entire hierarchy. The
interconnection across all the IPs needs to be verified.
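
A sketch of one directed connectivity check (the register address and the
bus-master tasks bus_write/bus_read are hypothetical placeholders for whatever
VIP API the testbench provides):

    // Write a pattern through the interconnect to a far-end register,
    // read it back through the same hierarchy, and compare.
    task automatic check_connectivity(input logic [31:0] addr);
      logic [31:0] wdata, rdata;
      wdata = 32'hA5A5_5A5A;
      bus_write(addr, wdata); // assumed VIP task: master write
      bus_read (addr, rdata); // assumed VIP task: master read
      if (rdata !== wdata)
        $error("connectivity fail at %h: got %h", addr, rdata);
    endtask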
After all VIPs are integrated into the testbench, the actual verification of
the DUT (IP/SOC) still needs to follow a verification plan that captures the
design features, simulating the DUT and all interactions of its various
features, debugging failures, collecting coverage, and making sure the DUT is
functionally correct.

The only difference is in the intent of the specification. With IP, the intent
is to provide a specification to be manufactured into a physical design. With
VIP, the intent is to provide a specification for verifying the functionality
of a design block or a full system.
