DFT Architecture in Multimedia Design: Gil Bouganim
Gil Bouganim
DSPG
Herzlia, Israel
www.dspg.com
Table of Contents
1. Introduction
2. The IC
3. OCC concept
4. Integration of at-speed test in multi frequency, timing & power efficient design
5. Integration of 2 "scan inserted" IPs
6. Targeting for high coverage while enforced to work with many 3rd party cores
7. Lessons learned
8. References
Table of Figures
Figure 1 - block diagram
Figure 2 - IC floorplan
Figure 3 - OCC shift & capture waveforms
Figure 4 - clock scheme
Figure 5 - Flip flop with uncontrollable synchronous reset
Figure 6 - Flip flop with synchronous reset, fixed with scan_enable
Table of Tables
Table 1 - 1 OCC vs. multi OCC
DSPG provides a variety of wireless chipset solutions for converged communications at home.
At the end of 2010 DSPG taped out DMW96, a 65 nm LP multimedia IC. The DFT structures
integrated into DMW96 include OCC structures for at-speed testing and structures for scan
compression. In this paper, I cover the main aspects of our methodology, highlight the DFT
challenges we faced, and describe the lessons we learned from this project.
The DFT challenges of this project were:
1. Integration of at-speed test in a multi-frequency, timing- and power-efficient design.
In such circumstances, the DFT structure must align with the complex clock scheme of the
design. This means that at-speed considerations must be given the same high priority as
other functional considerations while the clock scheme architecture is being planned, and
must not be deferred to the implementation stages. On the other hand, the DFT structures
that are used must align with the design, not vice versa.
2. Integration of 2 "scan inserted" IPs.
There are 2 common approaches for integration of scan inserted IPs: the conservative
"isolation approach" and the progressive "full integration approach". We decided to use
both approaches at the same time.
3. Targeting high coverage while being forced to work with many 3rd party cores.
3rd party cores are imported "as is" into the design and, as such, do not allow any RTL
changes for DFT purposes. In these cases, we used the "autofix" feature of DFT Compiler
to compensate for poor testability.
[Figure 1 - block diagram. Blocks recovered from the figure labels: Cortex-A8 CPU core with
32KB I-cache, 32KB D-cache and 256K L2 cache; AXI fabric; AHB bus matrices; generic DMA engine;
USB OTG; WiFi MAC and baseband; video encoder and decoder; LCDC; external memory interface
(LPDDRII); Coresight; JTAG.]
Figure 2 – IC floorplan
An on-chip clock (OCC) controller essentially multiplexes 2 free-running clocks: the ATE
clock and the PLL clock.
The purpose of using an OCC is at-speed testing.
The OCC guarantees that shift is done with the ATE clock, and that capture (which is composed
of launch and capture cycles) is done with the PLL clock.
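The multiplexing behaviour described above can be illustrated with a small behavioural model.
This is only a sketch in Python, not the OCC implementation used in DMW96: the function names,
the sampled-waveform representation and the capture_window signal (standing in for whatever
logic lets the launch and capture pulses through) are assumptions made for illustration.

```python
def occ_clock_out(scan_enable, ate_clk, pll_clk, capture_window):
    """Return the OCC output level for one sample of its inputs.

    Simplified model: a real OCC switches between the clocks glitch-free;
    here the inputs are just sampled 0/1 levels.
    """
    if scan_enable:
        # Shift: the slow ATE clock drives the scan chains.
        return ate_clk
    # Capture: only the PLL pulses inside the window reach the flip-flops.
    return pll_clk if capture_window else 0


def count_rising_edges(samples):
    """Count 0 -> 1 transitions in a list of sampled clock levels."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev == 0 and cur == 1)


# Shift phase: the output follows the ATE clock, whatever the PLL does.
ate = [0, 1, 0, 1]
shift_out = [occ_clock_out(1, a, p, 0) for a, p in zip(ate, [1, 0, 0, 1])]

# Capture phase: the free-running PLL clock is passed for two pulses only
# (one launch pulse and one capture pulse).
pll = [0, 1] * 6
window = [0] * 4 + [1] * 4 + [0] * 4
capture_out = [occ_clock_out(0, 0, p, w) for p, w in zip(pll, window)]
pulses = count_rising_edges(capture_out)
```

With these inputs, shift_out reproduces the ATE waveform and the capture phase contains
exactly two PLL pulses.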
The implementation process was initiated after a few decisions were made.
The primary decision we had to make was which clocks should be controlled by an OCC and
which should not. Clocks not controlled by an OCC could still be checked "at-speed", but not
at their specific working frequency. The goal, then, was to use a reasonable number of OCC
structures and to cover as much of the design as possible with OCCs. We decided that any clock
with a frequency of 125 MHz or higher should be controlled by an OCC. This ensured that OCC
structures fan out to more than 80% of the flip-flops in the design, while limiting the number
of OCC structures to a reasonable 11. We distributed them as follows:
- PLL1 - system PLL: 6 OCCs (500 MHz, 250 MHz, 125 MHz clocks).
- PLL2 - CPU PLL: 2 OCCs (850 MHz, 425 MHz clocks).
- PLL3 - DRAM PLL: 1 OCC (266 MHz clocks).
- PLL4 - comm. PLL: 2 OCCs (300 MHz, 150 MHz clocks).
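The selection rule and the resulting distribution can be cross-checked with a few lines of
Python. This is a sketch only: the dictionary layout and the function name are hypothetical,
the numbers are the ones listed above, and the threshold is treated as inclusive because the
list contains 125 MHz clocks (an assumption about the intended rule).

```python
OCC_THRESHOLD_MHZ = 125  # assumed inclusive, since 125 MHz clocks are listed

# OCC count and clock frequencies per PLL, as listed in the text.
occ_plan = {
    "PLL1 (system)": {"occs": 6, "clocks_mhz": [500, 250, 125]},
    "PLL2 (cpu)": {"occs": 2, "clocks_mhz": [850, 425]},
    "PLL3 (dram)": {"occs": 1, "clocks_mhz": [266]},
    "PLL4 (comm)": {"occs": 2, "clocks_mhz": [300, 150]},
}


def needs_occ(freq_mhz):
    """Apply the selection rule: fast clocks get their own OCC."""
    return freq_mhz >= OCC_THRESHOLD_MHZ


total_occs = sum(entry["occs"] for entry in occ_plan.values())
all_listed_clocks_qualify = all(
    needs_occ(f) for entry in occ_plan.values() for f in entry["clocks_mhz"]
)
```

The per-PLL counts add up to the 11 OCCs mentioned above, and every listed clock satisfies
the rule.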
[Figure 4 - clock scheme. The diagram shows the clock generation unit: the oscillators and PLLs
(PLL1 system, PLL2 main/CPU, PLL3 DRAM, PLL4 comm), the clock dividers, clock gaters (CG), clock
muxes, bypass muxes and sync points, and the OCC blocks placed at the outputs of the clocks
that are tested at speed.]
After the 11 OCCs were placed at the output segments of the relevant clocks in the clock scheme,
we had to decide how to generate the fast functional clocks and connect them to the OCC
structures during at-speed test. The first option was to bypass the complexity of the clock
generation unit (its dividers, gaters and muxes). The second option was to force the functional
architecture to work in a manner that ensures the desired functional clocks are free running at
the input pins of the OCCs. We chose to implement the second option. The benefit of this
approach is that our system clocks work in test mode exactly as they work in functional
mode. To force the clocks to be free running in test mode, we added combinational logic to the
RTL code of the clock generation module, along the combinational path of the desired clocks.
The RTL code uses the scan_mode signal to ensure that during test mode:
- The selected clock gaters are open.
- The selected clock dividers divide continuously at the desired rate.
- Glitchless clock muxes function properly and select the desired clocks.
- PLLs: the reference pin of each PLL is connected to its primary input. The path is
unblocked, and allows a free-running clock at the PLL's reference pin.
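The effect of the scan_mode overrides can be sketched as a simple behavioural model. The code
below is Python rather than the RTL that was actually added, and every name and value in it
(ClockControls, effective_controls, the divider ratios and mux selects) is a hypothetical
stand-in chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ClockControls:
    gate_enable: bool   # clock gater enable
    divider_ratio: int  # divide-by value of the clock divider
    mux_select: int     # select input of the clock mux


def effective_controls(functional, test_setting, scan_mode):
    """Return the control values seen by the clock generation logic."""
    if not scan_mode:
        # Functional mode: register-driven values pass through unchanged.
        return functional
    # Test mode: gater forced open, divider and mux forced to fixed values
    # so that the clock at the OCC input is free running.
    return ClockControls(
        gate_enable=True,
        divider_ratio=test_setting.divider_ratio,
        mux_select=test_setting.mux_select,
    )


# Hypothetical values: after reset the clock is gated off and divided by 10.
functional = ClockControls(gate_enable=False, divider_ratio=10, mux_select=0)
# Hypothetical test-mode setting: divide by 2, select mux input 1.
test_setting = ClockControls(gate_enable=True, divider_ratio=2, mux_select=1)

in_test = effective_controls(functional, test_setting, scan_mode=True)
in_func = effective_controls(functional, test_setting, scan_mode=False)
```

The point of the model is that the functional register values are ignored only while scan_mode
is asserted, so the functional behaviour of the clock generation unit is unchanged.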
Before moving on to the implementation stage with DFT Compiler, we had to decide how many
ATE clocks were required, and whether to implement 1 large OCC unit or several OCC
units. Table 1 compares the pros and cons of each method.
Implementation in DFT Compiler:
We followed the documented commands for OCC insertion:
set_dft_configuration -clock_controller enable
set_dft_signal -view existing -type refclock -port osc_port -test_mode all
set_dft_signal -view existing -type Oscillator -port gpio0 -test_mode all ;# ATE clock
set_dft_signal -view existing -type Oscillator -hookup_pin [get_pins "buff/Z"] -test_mode all ;# PLL clock
set_dft_signal -view spec -type pll_reset -active_state 0 -test_mode all
set_dft_signal -view spec -type pll_bypass -active_state 0 -test_mode all
set_dft_clock_controller -cell_name XXX_occ \
-design_name snps_clk_mux \
-pllclocks [get_pins "buff/Z"] \
-ateclocks gpio0 \
-test_mode_port test_mode_port \
-cycles_per_clock 2 -chain_count 1
#test_mode_port is the same port that is defined as type TestMode for scan compression.
Scan inserted IPs are digital cores with complete layout that already include routed scan
chains. The DFT characteristics of the IP are determined by the IP vendor.
In DMW96, we integrated 2 IPs that were "scan inserted" by their vendors: wifi_afe and
usb_phy.
The common approaches for scan integration of "scan inserted" IPs are:
The conservative, "full isolation" approach:
There are several reasons why a backend team would consider it risky to fully integrate a
"scan inserted" IP with the IC's scan chains as if it were a common hard macro design:
- DFT aspect: since the backend team did not implement the DFT structure of the IP,
it must rely on the correctness of the ATPG netlist that is delivered by the IP vendor.
- Timing aspect: the backend team cannot run STA on the internal parts of the IP
in "test mode" and verify that it meets "test mode" timing requirements.
- Mixed-signal aspect: these IPs are usually mixed signal designs. There is always a risk
that the analog design interferes with the proper function of the digital design during
test mode.
Another reason for following this approach is that the digital cores of these IPs are usually
very small compared to the IC's digital core. The integration effort, together with the risk
that such a small digital core interferes with the functionality of the entire IC scan
architecture, makes the "full integration" approach (see below) inadvisable.
To mitigate the above risks, dedicated scan ports (scan in/out, scan clock) are
assigned to the IP, and its scan chains are not mixed with any of the IC's scan chains.
If the IP's scan chains do not work on the tester, the rest of the IC is not affected.
The progressive, "full integration" approach:
This approach trusts the IP provider and its validation of the IP's scan structure.
It is especially safe when the IP is silicon proven.
Unlike the conservative approach, in this case the IP is integrated as if it were a common
hard macro in the IC. It is best to use an ILM view for its integration, but a Liberty model
(for timing) and a CTL test model should be sufficient. Using a test model allows connecting
the scan chains of the IP with the IC's scan chains, making better use of the IC's scan
resources (primary inputs & outputs).
One more advantage this approach has over the conservative one is that it allows better
controllability and observability of the IP's input/output pins, and thus better test
coverage.
After implementation, it is necessary to verify that the integrated IP meets the IC's test
mode STA requirements. To do so, a Liberty model that represents the IP's characteristics
in test mode should be read into PrimeTime.
Our decision was to combine the above 2 approaches. This combination gave us the following
benefits:
- We made the best use of the scan signal resources we had.
- If the IP's scan chains do not work, we can remove them from the
overall set of scan chains and still check the rest of the IC on the tester.
Implementation in DFT Compiler:
#Internal_scan
set_scan_path chain65 -view spec \
-scan_data_in [get_ports gpioXX] -scan_data_out [get_ports gpioYY] \
-test_mode Internal_scan \
-ordered_elements [list u_ip0/PHY_SC0 u_ip0/PHY_SC1] -complete true
set_scan_path chain66 -view spec \
-scan_data_in [get_ports gpioKK] -scan_data_out [get_ports gpioLL] \
-test_mode Internal_scan \
-ordered_elements [list u_ip0/PHY_SC2 u_ip0/PHY_SC3] -complete true
set_scan_path chain94 -view spec \
-scan_data_in [get_ports gpioZZ] -scan_data_out [get_ports gpioTT] \
-test_mode Internal_scan \
-ordered_elements [list u_ip1_shell/inst_2_chain8 u_ip1_shell/inst_1_chain8 \
u_ip1_shell/inst_2_chain0 u_ip1_shell/inst_2_chain1 \
u_ip1_shell/inst_2_chain2 u_ip1_shell/inst_2_chain3 \
u_ip1_shell/inst_2_chain4 u_ip1_shell/inst_2_chain5 \
u_ip1_shell/inst_2_chain6 u_ip1_shell/inst_2_chain7 \
u_ip1_shell/inst_1_chain0 u_ip1_shell/inst_1_chain1 \
u_ip1_shell/inst_1_chain2 u_ip1_shell/inst_1_chain3 \
u_ip1_shell/inst_1_chain4 u_ip1_shell/inst_1_chain5 \
u_ip1_shell/inst_1_chain6 u_ip1_shell/inst_1_chain7] -complete true
#ScanCompression_mode
set_scan_path 10 -view spec -ordered_elements [list u_ip0/PHY_SC0] -complete true \
-test_mode ScanCompression_mode
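The -ordered_elements list of chain94 follows a regular pattern, so it can be generated rather
than typed by hand. The following Python helper is a hypothetical convenience, not part of the
original flow; it only reproduces the segment names and the order shown in the listing above.

```python
def ip1_ordered_elements(prefix="u_ip1_shell"):
    """Build the chain94 segment list in the order used in the listing."""
    # The two chain8 segments come first.
    elements = [f"{prefix}/inst_2_chain8", f"{prefix}/inst_1_chain8"]
    # Then chains 0..7 of instance 2, followed by chains 0..7 of instance 1.
    for inst in (2, 1):
        elements += [f"{prefix}/inst_{inst}_chain{i}" for i in range(8)]
    return elements


def as_tcl_list(elements):
    """Format the names as a Tcl [list ...] expression."""
    return "[list " + " ".join(elements) + "]"


segments = ip1_ordered_elements()
tcl_argument = as_tcl_list(segments)
```

The resulting string can be pasted after -ordered_elements; generating it this way reduces the
chance of a typing error in a long list.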
It is common, and advisable, to modify the RTL code in order to make it DFT friendly. Most
design modifications are made to allow controllability of clocks and reset signals.
RTL code that is imported from a 3rd party vendor is integrated "as is" in the IC, and cannot
be modified. This restricts the DFT engineer from improving the design's testability.
Even though DFT Compiler has supported "autofix" commands for several years, many engineers
choose not to use them, because autofix modifies the design automatically and introduces a
risk of losing control over the design's architecture.
Since we had many 3rd party cores in our design, and some did not meet our testability
criteria, we had to find a way to improve testability without making any direct RTL
modifications. The autofix commands proved to be very efficient for our needs, which were
mainly focused on controlling clock signals and making as many flip-flops as possible scannable.
One of the cases we had to solve involved a minority group of flip-flops with synchronous
reset in an otherwise asynchronous reset design. This group of hundreds of flip-flops was
disturbed during shift because their synchronous reset was uncontrollable (it was generated by
a scan element and passed through logic), so they could not function as scan elements
(see Figure 5).
Implementation in DFT Compiler:
#Enable autofix for reset signals
set_dft_configuration -fix_reset enable
#Declare scan_en as the scan enable in both the existing and the spec view
set_dft_signal -view existing -type ScanEnable -port scan_en
set_dft_signal -view spec -type ScanEnable -port scan_en
#Fix sync reset: gate the uncontrollable reset with scan_en
set_autofix_configuration -type reset -control_signal scan_en -method gate
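Why gating the reset with scan_en restores scannability can be seen in a small behavioural
model. This is a Python sketch under stated assumptions, not the logic that autofix actually
inserts: the reset is assumed to be active high, and the function names and the "fixed" flag
are hypothetical.

```python
def next_state(d, scan_in, scan_enable, sync_reset, fixed):
    """Next value of a scan flip-flop with an active-high synchronous reset."""
    # With the fix, the reset is blocked whenever scan_enable is asserted.
    reset = sync_reset and not (fixed and scan_enable)
    if reset:
        return 0
    return scan_in if scan_enable else d


def shift(bits, sync_reset, fixed):
    """Shift bits into one flip-flop and return the value it holds after each cycle."""
    return [next_state(0, b, True, sync_reset, fixed) for b in bits]


pattern = [1, 0, 1, 1]
# The reset happens to be asserted during shift (it is driven by another scan element).
broken = shift(pattern, sync_reset=True, fixed=False)
repaired = shift(pattern, sync_reset=True, fixed=True)
# In functional mode (scan_enable low) the reset still works after the fix.
functional_reset = next_state(1, 0, False, True, True)
```

Without the fix the flip-flop is cleared on every shift cycle and the pattern is lost; with the
fix the shifted data passes through, while the functional reset behaviour is preserved.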