LLM in Chip Design
Mingjie Liu§ , Teo Ene§ , Robert Kirby§ , Chris Cheng§ , Nathaniel Pinckney§ , Rongjian Liang§
Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, Bonita Bhaskaran
Bryan Catanzaro, Arjun Chaudhuri, Sharon Clay, Bill Dally, Laura Dang, Parikshit Deshpande
Siddhanth Dhodhi, Sameer Halepete, Eric Hill, Jiashang Hu, Sumit Jain, Brucek Khailany
Kishor Kunal, Xiaowei Li, Hao Liu, Stuart Oberman, Sujeet Omar, Sreedhar Pratty, Ambar Sarkar
Zhengjiang Shao, Hanfei Sun, Pratik P Suthar, Varun Tej, Kaizhe Xu, Haoxing Ren
NVIDIA
§ Equal contribution

Abstract—ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we adopt the following domain adaptation techniques: custom tokenizers, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks. Our findings also indicate that there is still room for improvement between our current results and ideal outcomes. We believe that further investigation of domain-adapted LLM approaches will help close this gap in the future.

Fig. 1: ChipNeMo Training Flow. Pretraining: LLaMA2 (7B, 13B) foundation models, trillions of tokens of internet data, 10^5 to 10^6 GPU hrs. Domain-Adaptive Pretraining: 24B tokens of chip design docs/code, thousands of GPU hrs, producing ChipNeMo Foundation Models (7B, 13B). Supervised Fine-Tuning: 128K chat instructions + 1.1K task instructions, 100+ GPU hrs, producing ChipNeMo Chat Models (7B, 13B).

I. INTRODUCTION

Over the last few decades, Electronic Design Automation (EDA) algorithms and tools have provided huge gains in chip design productivity. Coupled with the exponential increases in transistor densities provided by Moore's law, EDA has enabled the development of feature-rich complex SoC designs with billions of transistors. More recently, researchers have been exploring ways to apply AI to EDA algorithms and the chip design process to further improve chip design productivity [1] [2] [3]. However, many time-consuming chip design tasks that involve interfacing with natural languages or programming languages still have not been automated. The latest advancements in commercial (ChatGPT, Bard, etc.) and open-source (Vicuna [4], LLaMA2 [5], etc.) large language models (LLMs) provide an unprecedented opportunity to help automate these language-related chip design tasks. Indeed, early academic research [6] [7] [8] has explored applications of LLMs for generating RTL that can perform simple tasks in small design modules as well as generating scripts for EDA tools.

We believe that LLMs have the potential to help chip design productivity by using generative AI to automate many language-related chip design tasks such as code generation, responses to engineering questions via a natural language interface, analysis and report generation, and bug triage. In this study, we focus on these three specific LLM applications: an engineering assistant chatbot for GPU ASIC and Architecture design engineers, which understands internal HW designs and is capable of explaining complex design topics; EDA script generation for two domain-specific tools based on Python and Tcl for VLSI timing analysis tasks specified in English; and bug summarization and analysis as part of an internal bug and issue tracking system.

Although general-purpose LLMs trained on vast amounts of internet data exhibit remarkable capabilities in generative AI tasks across diverse domains (as demonstrated by Bubeck et al. in [9]), recent work such as BloombergGPT [10] and BioMedLLM [11] demonstrates that domain-specific LLM models can outperform a general-purpose model on domain-specific tasks. In the hardware design domain, [6] [12] showed that open-source LLMs (CodeGen [13]) fine-tuned on additional Verilog data can outperform state-of-the-art OpenAI models. Customizing LLMs in this manner also avoids security risks associated with sending proprietary chip design data to third-party LLMs via APIs. However, it would be prohibitively expensive to train domain-specific models for every domain from scratch, since this often requires millions of GPU training hours. To cost-effectively train domain-specific models, we instead propose to combine the following techniques: Domain-Adaptive Pre-Training (DAPT) [14] of foundation models with domain-adapted tokenizers, model alignment using general and domain-specific instructions, and retrieval-augmented generation (RAG) [15] with a trained domain-adapted retrieval model.
As shown in Figure 1, our approach is to start with a base foundation model and apply DAPT followed by Supervised Fine-Tuning (SFT). DAPT, also known as continued pretraining with in-domain data, has been shown to be effective in areas such as biomedical and computer science publications, news, and reviews. In our case, we construct our domain-specific pre-training dataset from a collection of proprietary hardware-related code (e.g. software, RTL, verification testbenches, etc.) and natural language datasets (e.g. hardware specifications, documentation, etc.). We clean up and preprocess the raw dataset, then continued-pretrain a foundation model with the domain-specific data. We call the resulting model a ChipNeMo Foundation Model. DAPT is done on a fraction of the tokens used in pre-training, and is much cheaper, only requiring a few thousand GPU hours. As described in Section V, we find this approach to be more effective than Parameter Efficient Training (PEFT) techniques such as LoRA [16] for our use cases.

LLM tokenizers convert text into sequences of tokens for LLM training. A domain-specific tokenizer improves the tokenization efficiency by tailoring rules and patterns for domain-specific terms such as keywords commonly found in RTL. For DAPT, we cannot retrain a new domain-specific tokenizer from scratch, since it would make the foundation model invalid. Instead of restricting ChipNeMo to the pre-trained general-purpose tokenizer used by the foundation model, we adapt the pre-trained tokenizer to our chip design dataset, only adding new tokens for domain-specific terms.

ChipNeMo foundation models are completion models which require supervised fine-tuning (SFT) to adapt to tasks such as chat. We use largely publicly available general-purpose chat instruction datasets for multi-turn chat together with a small amount of domain-specific instruction datasets to perform SFT on the ChipNeMo foundation model, which produces the ChipNeMo Chat model. We observe that SFT with a general-purpose chat instruction dataset is adequate to align the ChipNeMo foundation models with queries in the chip design domain. We also added a small amount of task-specific SFT instruction data, which further improves the alignment. We trained multiple ChipNeMo Foundation and Chat models based on variants of LLaMA2 models used as the base foundation model.

To improve performance on the engineering assistant chatbot application, we also leverage Retrieval Augmented Generation (RAG). RAG is an open-book approach for giving LLMs precise context for user queries. It retrieves relevant in-domain knowledge from its data store to augment the response generation given a user query. This method shows significant improvement in grounding the model to the context of a particular question. Crucially, we observed significant improvements in retrieval hit rate when fine-tuning a pretrained retrieval model with domain data. This led to even further improvements in model quality.

We highlight the following contributions and findings related to adapting LLMs to the chip design domain:

• We demonstrate domain-adapted LLM effectiveness on three use-cases: an engineering assistant chatbot, EDA tool script generation, and bug summarization and analysis. We achieve a score of 7.4 on a 10-point scale for engineering assistant chatbot responses based on expert evaluations, achieve more than 50% correctness in EDA script generation, and an expert evaluation rating of 4 to 5 on a 7-point scale for summarization and assignment identification tasks.
• Domain-adapted ChipNeMo models dramatically outperform all vanilla LLMs evaluated on both multiple-choice domain-specific AutoEval benchmarks and human evaluations for applications.
• For tasks where it is possible for the model to generate text from the prompt context (e.g. chat with RAG hits, summarization, code generation with provided documentation), domain adaptation closes the gap between a state-of-the-art LLaMA2 70B model and a much smaller 13B model (a small incremental training cost enables up to 5x parameter reduction for reduced inference cost).
• Customized tokenizers reduce DAPT token count by up to 3.3% without hurting effectiveness on applications.
• SFT on an additional 1.1K domain-specific instructions significantly improves application proficiency, by up to 0.33 on a 10-point scale, 18% in correctness, and 0.79 on a 7-point scale for the engineering assistant chatbot, EDA script generation, and bug summarization and analysis, respectively.
• Fine-tuning our ChipNeMo retrieval model with domain-specific data improves the retriever hit rate by 30% over a pre-trained state-of-the-art retriever, in turn improving overall quality of RAG responses.

The paper is organized as follows. Section II describes our dataset and auto evaluation benchmarks for domain knowledge verification. Section III outlines domain adaptation and training methods used including the adapted tokenizer, DAPT, SFT, and RAG. Section IV provides details of each application and the experimental setup. Section V describes the experimental results including human evaluations for each application. Section VI discusses ChipNeMo limitations and future work. Section VII describes relevant LLM methods and other work targeting LLMs for chip design. Finally, complete results along with additional model training details and examples of text generated by the application use-cases are illustrated in the Appendix.

II. DATASET

A. DAPT Dataset

During Domain-Adaptive Pre-Training (DAPT), we assemble a dataset from a combination of NVIDIA-proprietary chip design specific data sources and publicly available datasets.

Chip Design Datasets: Our internal dataset consists of a diverse range of text sources pertinent to chip design, spanning design, verification, infrastructure, and internal documentation. Table I provides a breakdown of the data collected after filtering, and the corresponding number of tokens using the LLaMA2 tokenizer. We construct the dataset by gathering all relevant internal data, then filtering by file type based
on filename extensions and by distinguishing between machine-generated and human-written content. Although we evaluated on three specific use cases, we did not specifically limit the dataset to sources known to be relevant to these use cases, since we believed that incorporating additional domain knowledge would improve performance. After collection, cleaning, and filtering, the internal data training corpus has 23.1 billion tokens. Further details of the data collection process are covered in Appendix A.

Public Datasets: We augment the chip design specific data with a sample of publicly available data from various sources, a common practice in the development of foundational large language models. Our approach was to reuse public training data from other language models, with the stipulation that it must be publicly accessible and compatible with open sourcing. These datasets exhibit a high degree of correlation with the pretraining data used in LLaMA2 [5], with the intention of preserving general knowledge and natural language capabilities during DAPT. The public datasets used by ChipNeMo can be categorized into two groups, natural language and code. For the natural language component, we draw from Wikipedia data [17], as it is widely regarded for its high data quality. For code, we leverage GitHub data [18], focusing on programming languages also present in our internal chip design dataset such as C++, Python, and Verilog. To ensure that the overall dataset is representative of pre-training distributions, we perform a sub-sampling operation that results in approximately 9.2% of the total training tokens being sampled from these public datasets, with a balanced representation of natural language and code.

Data Blend: A significant proportion of the domain data we gathered is comprised of unannotated code from diverse origins. In an effort to enhance the model's comprehension of domain-specific knowledge, we conducted downsampling of code data while concurrently upsampling natural language data, specifically design documentation, over a span of 2 to 4 training epochs. We also increased the representation of data that we deemed more pertinent to downstream applications, such as human-written EDA tool scripts. Furthermore, we incorporated publicly available domain data for 1 epoch. Details of the token distribution for training are shown in Table I.

B. SFT Instruction Data

During Supervised Fine-Tuning (SFT), we employ a general chat SFT instruction dataset that is accessible for commercial use. The dataset is comprised largely of publicly available instruction-following datasets including OASST [19], FLAN [20], P3 [21] and a small amount of a broad-domain proprietary dataset comprising various topics such as brainstorming, open-ended question answering, rewriting, summarization, etc. It is important to note that the SFT instruction data we discuss here is focused on general natural language tasks and does not contain any information or tasks related to the downstream use cases in chip design. In total, this dataset comprises 128,000 training samples.

Additionally, we meticulously assembled a domain-specific instruction dataset for aligning the model to downstream use cases. These examples have been meticulously crafted by subject matter experts and are formatted as single-turn questions and answers. Table II depicts the quantity of our domain-specific instruction dataset. It is worth noting that the total number of training samples in the domain-specific instruction dataset is quite small when compared to the extensive amount of generative chat instruction data.

C. AutoEval

In order to quickly and quantitatively assess the accuracy of various models, we established evaluation criteria structured as multiple-choice question-and-answer formats for each use case, designed to closely align with established benchmarks such as MMLU [22]. In the process of formulating these multiple-choice questions, collaboration with domain experts was pivotal. The goal was to ensure that each question included at least one complex answer choice, thereby posing a challenge to individuals with limited domain expertise. Careful attention was also given to prevent any inadvertent contamination of the questions with data from our domain-specific SFT. In addition to the per-use-case benchmarks, an additional benchmark was created for general circuit design knowledge, covering both analog and digital design topics. The number of multiple-choice questions for each evaluation benchmark is shown in Table III.

When we report results on the above benchmarks, we take average results obtained from five distinct runs to mitigate the effects of variance and noise in the testing process. Each iteration employs a set of 5-shot examples, with variations introduced across each individual run.

In addition to these domain-specific evaluation benchmarks, we also include commonly-used publicly available LLM academic benchmarks. Furthermore, we measure the model's code generation capabilities by evaluating HumanEval [23] for Python and VerilogEval [12] for Verilog.

III. CHIPNEMO DOMAIN ADAPTATION METHODS

ChipNeMo implements multiple domain adaptation techniques to adapt LLMs to the chip design domain. These techniques include custom tokenizers for chip design data, domain-adaptive pretraining with a large corpus of domain data, supervised fine-tuning with domain-specific tasks, and retrieval-augmented generation with a fine-tuned retrieval model. We will illustrate the details of each technique in this section.

A. Tokenizer

When adapting a pre-trained tokenizer, the main goals are to improve tokenization efficiency on domain-specific data, maintain efficiency and language model performance on general datasets, and minimize the effort for retraining/fine-tuning. To achieve this, we have developed a four-step approach:

• Step 1: Training a tokenizer from scratch using domain-specific data.
• Step 2: From the vocabulary of the new tokenizer, identifying tokens that are absent in the general-purpose tokenizer and are rarely found in general-purpose datasets.
Data Source      Data Percentage (%)   Data Tokens (B)   Training Percentage (%)   Training Tokens (B)
Bug Summary      9.5%                  2.4               10.0%                     2.4
Design Source    47.0%                 11.9              24.5%                     5.9
Documentation    17.8%                 4.5               34.0%                     8.2
Verification     9.1%                  2.3               10.4%                     2.5
Other            7.9%                  2.0               12.0%                     2.9
Wikipedia        5.9%                  1.5               6.2%                      1.5
GitHub           2.8%                  0.7               3.0%                      0.7
Total            100.0%                25.3              100.0%                    24.1

TABLE I: Breakdown of Data by Source. Token count measured with the original LLaMA2 tokenizer.
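The relationship between the "Data" and "Training" columns of Table I follows from the per-source epoch multipliers described in the Data Blend paragraph. The helper below is a small illustration of that bookkeeping; the multipliers in the example call are hypothetical and chosen only to show the upsampling/downsampling direction, not the actual blend used.

```python
def blend(data_tokens_b, epochs):
    """Given raw token counts (in billions) per source and target epoch multipliers,
    return training tokens and the resulting training percentage per source."""
    training = {src: data_tokens_b[src] * epochs[src] for src in data_tokens_b}
    total = sum(training.values())
    return {src: (tok, 100.0 * tok / total) for src, tok in training.items()}

# Hypothetical multipliers for illustration: documentation is upsampled,
# design source code is downsampled, and public data is seen for about one epoch.
example = blend(
    data_tokens_b={"Documentation": 4.5, "Design Source": 11.9, "Wikipedia": 1.5},
    epochs={"Documentation": 1.8, "Design Source": 0.5, "Wikipedia": 1.0},
)
print(example)
```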
Fig. 7: Retrieval Model Accuracy Comparison

Fig. 8: Human Evaluation of Different Models. Model Only represents results without RAG. RAG (Hit)/(Miss) only include questions whose retrieved passages hit/miss their ideal context; RAG (Total) includes all questions.

The queries in the Specs category are derived directly from passages in the documents, so their answers are often nicely contained in a concise passage and clearly address the query. On the other hand, the queries of the Testbench and Build categories are not directly derived from passages, so their answers were often not as apparent in the fetched passages and required more context (see Appendix C for detailed examples). This significantly contributes to the difference in retrieval quality between the categories.

We conducted evaluation of multiple ChipNeMo models and LLaMA2 models with and without RAG. The results were then scored by human evaluators on a 10-point scale and shown in Figure 8.

We made the following observations:

• RAG significantly boosts human scores. RAG improves the scores of LLaMA2-13B-Chat*, ChipNeMo-13B-Chat, and LLaMA2-70B-Chat by 3.82, 2.19, and 5.05, respectively. Note that scores are generally higher even with a RAG miss, particularly on LLaMA2 models. We hypothesize that the additional in-domain context helps to boost the performance.
• ChipNeMo-13B-Chat outperforms the similar-sized LLaMA2-13B-Chat* in model-only and RAG evaluations by 2.88 and 1.25, respectively.
• ChipNeMo-13B-Chat with RAG achieves the same score (7.4) as the 5X larger model LLaMA2-70B-Chat with RAG, where LLaMA2-70B-Chat does better in extracting answers on hits; however, domain adaptation makes up for it on the misses.
• Domain SFT helps improve the performance of ChipNeMo-13B-Chat by 0.28 (with RAG) and 0.33 (without RAG).

The complete evaluation results on all models are shown in Appendix D.

F. EDA Script Generation

In order to evaluate our model on the EDA script generation task, we created two different types of benchmarks. The first is a set of "Easy" and "Medium" difficulty tasks (1-4 line solutions) that can be evaluated without human intervention by comparing with a golden response. Due to the work required to build and evaluate these benchmarks, we only have this evaluation set for our Python task. The second set of tasks ("Hard") comes from real use case scenarios that our engineers chose. These tasks are much harder, requiring tens of lines to solve. Because these are hard to evaluate in an automatic way, we had human engineers judge the correctness between 0% and 100%. The sizes of these benchmarks are described in Table V.
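As a rough illustration of the "compare with a golden response" idea behind the automatic Easy/Medium benchmarks, the harness below runs a candidate and a golden snippet and compares what they print; the task schema, the plain-Python sandbox, and the pass criterion are assumptions made only for this sketch, not the actual benchmark code (the real scripts execute inside the EDA tool's interpreter).

```python
import contextlib
import io

def run_snippet(code: str) -> str:
    """Execute a small Python snippet and capture its printed output (sandbox assumed)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()

def pass_rate(tasks, generate_script) -> float:
    """Score generated scripts against golden responses by comparing their outputs."""
    passed = 0
    for task in tasks:  # assumed schema: {"prompt": ..., "golden": ...}
        candidate = generate_script(task["prompt"])
        if run_snippet(candidate) == run_snippet(task["golden"]):
            passed += 1
    return passed / len(tasks)
```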
Evaluation Benchmark Name                        Size
Tool1 (Python) - Automatic (Easy)                150
Tool1 (Python) - Automatic (Medium)              30
Tool1 (Python) - Human (Hard with Context)       10
Tool2 (Tcl) - Human (Hard with Context)          10

TABLE V: EDA Script Generation Evaluation Benchmarks

Work is ongoing to both increase the size and scope of these benchmarks to allow us to further improve these models.

We discovered that our models were unable to answer some of our harder tasks. The tasks required knowledge of many tool APIs, and the model seemed to be unable to decide on the proper ones while keeping the control flow properly organized. To mitigate this, we appended a human-curated context to the prompt, specific to each question. This context contained explanations of different functions or attributes needed to properly write the desired script. We only provided this for the "Hard with Context" benchmark category. This also allows us to study the possible effect of a retrieval-based solution, which we leave to future work.

As can be seen in the ablation results in Figure 9, both DAPT and domain SFT were important for our problem. Without DAPT, the model had little to no understanding of the underlying APIs and performed poorly on automatically evaluated benchmarks. Domain SFT further improved the results. We believe this is because our domain SFT data helps guide the model to present the final script in the most directly applicable fashion.

One interesting result is the LLaMA2-70B pass rate on the "Hard with Context" benchmarks. It performs better than most models on the Python tool but poorly on the Tcl tool. This is likely because, when provided with the correct context, LLaMA2-70B's superior general Python coding ability is able to solve novel problems it has not been trained on. However, the LLaMA2-70B model is unable to generalize its coding ability to the Tcl tool, likely because it has not been exposed to a large volume of Tcl code. This highlights the benefit of DAPT when it comes to low-volume or proprietary programming languages.

Fig. 9: EDA Script Generation Evaluation Results

G. Bug Summarization and Analysis

To evaluate our models on bug summarization and analysis, we have a hold-out set of 40 bugs which are ideal candidates for summarization. This includes having a long comment history or other data which makes the bugs hard for a human to quickly summarize. We then ask humans to rate both modes of summarization as well as the bug assignment the LLM suggests. The evaluation metric is based on a 7-point Likert scale. Our results are included in Figure 10.

Fig. 10: Bug Summarization and Analysis Evaluation Results

ChipNeMo-13B-Chat models outperform the base LLaMA2-13B-Chat* model for all three tasks, improving the 7-point Likert score by 0.82, 1.09 and 0.61 for technical summary, managerial summary and assignment recommendation, respectively. Domain SFT also significantly improves performance over the model without domain SFT on managerial summarization and task assignment.

We hypothesize that, contrary to the technical summarization task, whose quality and technical content are more dependent on the model's understanding of natural language semantics, managerial summary requires the model to understand how to summarize the input data while retaining key personnel/engineer names. This needs a more careful instruction-based fine-tuning of the LLM.

The LLaMA2-70B-Chat model also performs very well on all three tasks, beating the ChipNeMo-13B model on all tasks. Note that the LLaMA2-70B-Chat model also suffers from long-context challenges with a 4096-token context size; we believe effective chunk-and-combine schemes (hierarchical and incremental), the choice of instructional prompts at various stages of summarization, the choice of prompt during task assignment, and raw data formatting/pre-processing help in circumventing the long-context challenge and enable LLaMA2-70B-Chat to achieve high scores even without DAPT and domain SFT.

VI. DISCUSSION

A. Considerations for Domain Adaptation

Although domain-adapted ChipNeMo models achieve significant improvements over their corresponding foundation models, we also observe that the larger LLaMA2 70B can sometimes achieve similar accuracy as ChipNeMo, as seen in Figures 8, 9, and 10. Recent work has leveraged these powerful models to perform chip design tasks.

However, it is important to consider the cost-efficiency benefits gained from the use of a smaller model. Pope et al. demonstrate that inference costs on an 8B model are 8-12x lower than on a 62B model for equal latency targets
[35]. Furthermore, model size reduction can lead to dramatic increases in inference speed by allowing a model to fit within a single GPU or node where it otherwise could not [36]. Our ChipNeMo 13B model can be loaded within the memory of a single A100 GPU without any quantization, unlike the LLaMA2 70B model. This leads to significant inference speed increases under normal GPU operation, which can be traded off for significant inference cost reduction should the GPU be underclocked.

Thus, when deciding between the use of a larger general-purpose model versus a smaller specialized model in a production environment, the following criteria must be considered:

• Training and inference trade-off: Smaller domain-adapted models can match the accuracy of larger general-purpose models. While domain adaptation incurs additional up-front costs, the use of smaller models leads to significantly reduced operating costs.
• Uniqueness of use case: As can be seen from Figures 6, 9, and 10, domain-adapted models show the most improvement on tasks that are rarely present in the public domain, such as writing code in proprietary languages or libraries. Indeed, our data shows that even when they are provided with hand-picked contexts, large general-purpose models have difficulty matching the accuracy of domain-adapted models in such scenarios.
• Availability of domain data: Domain adaptation works best when there is a large amount of training data, i.e. billions of training tokens. This is often the case for large corporations and projects which have accumulated a large amount of internal documents and code, but not necessarily true for smaller businesses or projects.
• End use case diversity: It is possible to fine-tune a general-purpose model for a particular task, but domain-adapted models are suited for a diverse set of tasks in a domain. Although we only demonstrate three use cases for ChipNeMo models in this work, they can be readily re-used for other use cases with sufficient SFT data.

B. Performance Gap

Although ChipNeMo achieves impressive results in our selected applications as shown in Appendix E, the evaluation results for all applications still show a considerable gap with human expert performance. We are considering the following approaches to bridge this performance gap:

1) Data Collection: We can expand the DAPT dataset to include more internal proprietary data. In addition, we plan to add more task-specific instruction sets for SFT, as the evidence shows that task-specific SFT improves the evaluation results meaningfully.

2) Base Model: We expect better and larger base models can improve performance, such as LLaMA2 70B. We can also explore applying DAPT to code-specific base models such as Code LLaMA [32] for code generation tasks.

3) Training: We also plan to conduct reinforcement learning from human feedback (RLHF) [37] over the ChipNeMo chat model to make it more versatile. We plan to leverage pretrained reward models trained over general-purpose datasets. We also plan to conduct long-context training [38] to overcome the challenge where long context is needed, e.g. in the bug summarization application. In general, longer context support would help improve retrieval-based methods for chat assistance as well as code generation.

4) Retrieval: We will further investigate better RAG methods for both the engineering assistant chatbot and EDA script generation. For the engineering assistant chatbot, we can create different data stores for different application areas. We can also integrate enterprise search engines with RAG to find relevant context for a diverse set of problems. For code generation, we can investigate automated retrieval of context from existing code and documentation.

C. Agent-Based Design Methodologies

The use cases we experimented with in this work are straightforward applications of the prompt and response capability of LLMs. Agents refer to the use of an LLM to choose a sequence of actions to take, where an LLM is acting as a reasoning engine to drive outside tools. Chip design processes involve many existing EDA tools and methodologies. We believe some of these methodologies can be driven by agents powered by domain-adapted LLMs such as ChipNeMo models. We plan to work on agent-based design methodologies for verification and optimization in the future.

VII. RELATED WORKS

Many domains have a significant amount of proprietary data which can be used to train a domain-specific LLM. One approach is to train a domain-specific foundation model from scratch, e.g., BloombergGPT [10] for finance, BioMedLLM [11] for biomed, and Galactica [39] for science. These models were usually trained on more than 100B tokens of raw domain data. The second approach is domain-adaptive pretraining (DAPT) [14], which continues to train a pretrained foundation model on additional raw domain data. It shows a slight performance boost on domain-specific tasks in domains such as biomedical and computer science publications, news, and reviews. In one example, [40] continued-pretrained a foundation model on technical content datasets and achieved state-of-the-art performance on many quantitative reasoning tasks.

Retrieval Augmented Generation (RAG) helps ground the LLM to generate accurate information and to extract up-to-date information to improve knowledge-intensive NLP tasks [41]. It is observed that smaller models with RAG can outperform larger models without RAG [42]. Retrieval methods include sparse retrieval methods such as TF-IDF or BM25 [43], which analyze word statistic information and find matching documents with a high-dimensional sparse vector. Dense retrieval methods such as [44] [45] find matching documents in an embedding space generated by a retrieval model pretrained on a large corpus with or without fine-tuning on a retrieval dataset. The retrieval model can be trained standalone [44] [45] [46] or jointly with language models [47] [42]. In addition, it has been
shown that off-the-shelf general-purpose retrievers can improve a baseline language model significantly without further fine-tuning [48]. RAG is also proposed to perform code generation tasks [49] by retrieving from coding documents.

Foundation models are completion models, which have limited chat and instruction-following capabilities. Therefore, a model alignment process is applied to the foundation models to train a corresponding chat model. Instruction fine-tuning [20] and reinforcement learning from human feedback (RLHF) [37] are two common model alignment techniques. Instruction fine-tuning further trains a foundation model using instruction datasets. RLHF leverages human feedback to label a dataset to train a reward model and applies reinforcement learning to further improve models given the trained reward model. RLHF is usually more complex and resource-hungry than instruction fine-tuning. Therefore, recent studies also propose to reduce this overhead with simpler methods such as DPO [50] and SteerLM [51].

Researchers have started to apply LLMs to chip design problems. Early works such as Dave [52] first explored the possibility of generating Verilog from English with a language model (GPT-2). Following that work, [6] showed that open-source LLMs (CodeGen) fine-tuned on Verilog datasets collected from GitHub and Verilog textbooks outperformed state-of-the-art OpenAI models such as code-davinci-002 on 17 Verilog questions. [12] proposed a benchmark with more than 150 problems and demonstrated that the Verilog code generation capability of pretrained language models could be improved with supervised fine-tuning by bootstrapping with LLM-generated synthetic problem-code pairs. Chip-Chat [7] experimented with conversational flows to design and verify an 8-bit accumulator-based microprocessor with GPT-4 and GPT-3.5. Their findings showed that although GPT-4 produced relatively high-quality code, it still does not perform well enough at understanding and fixing the errors. ChipEDA [8] proposed to use LLMs to generate EDA tool scripts. It also demonstrated that a fine-tuned LLaMA2 70B model outperforms the GPT-4 model on this task.

VIII. CONCLUSIONS

We explored domain-adapted approaches to improve LLM performance for industrial chip design tasks. Our results show that domain-adaptive pretrained models, such as ChipNeMo-13B-Chat, achieve similar or better results than their base models, closing the gap with the much more powerful LLaMA2 70B model on all three use cases: engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our future work will focus on further improving ChipNeMo models and methods to make them ready for production use.

IX. ACKNOWLEDGEMENTS

The authors would like to thank: NVIDIA IT teams for their support on NVBugs integration; NVIDIA Hardware Security team for their support on security issues; NVIDIA NeMo teams for their support and guidance on training and inference of ChipNeMo models; NVIDIA Infrastructure teams for supporting the GPU training and inference resources for the project; NVIDIA Hardware design teams for their support and insight.

X. CONTRIBUTIONS

Mingjie Liu conducted DAPT and SFT model training.
Teo Ene, Robert Kirby developed inference and application evaluation infrastructure.
Chris Cheng developed the RAG framework.
Nathaniel Pinckney collected and prepared data sets for training.
Rongjian Liang developed custom tokenizers.
Siddhanth Dhodhi, Ismet Bayraktaroglu, Himyanshu Anand, Eric Hill designed the engineering assistant chatbot, provided domain instruction datasets and evaluation benchmarks, and conducted evaluation.
Parikshit Deshpande, Zhengjiang Shao, Kaizhe Xu, Jiashang Hu, Laura Dang, Xiaowei Li, Hao Liu, Ambar Sarkar developed the engineering assistant chatbot application.
Sreedhar Pratty, Kishor Kunal, Varun Tej, Sumit Jain, Sujeet Omar, Pratik P Suthar, Hanfei Sun developed the EDA scripts generation application, provided domain instruction datasets and evaluation benchmarks.
Bonita Bhaskaran, Arjun Chaudhuri, Sanmitra Banerjee developed the bug summarization and analysis application, provided domain instruction datasets and evaluation benchmarks.
Brucek Khailany, Stuart Oberman, Sharon Clay, Sameer Halepete, Bryan Catanzaro, Jonah Alben, Bill Dally advised from AI research and hardware engineering perspectives.
Haoxing Ren designed and led the research.

REFERENCES

[1] B. Khailany et al., "Accelerating chip design with machine learning," IEEE Micro, vol. 40, no. 6, pp. 23-32, 2020.
[2] H. Ren and M. Fojtik, "Invited - NVCell: Standard cell layout in advanced technology nodes with reinforcement learning," in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021.
[3] R. Roy et al., "PrefixRL: Optimization of parallel prefix circuits using deep reinforcement learning," in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021.
[4] W.-L. Chiang et al., "Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality," March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/
[5] H. Touvron et al., "Llama 2: Open foundation and fine-tuned chat models," 2023.
[6] S. Thakur et al., "Benchmarking large language models for automated Verilog RTL code generation," in 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2023, pp. 1-6.
[7] J. Blocklove et al., "Chip-Chat: Challenges and opportunities in conversational hardware design," 2023.
[8] Z. He et al., "ChatEDA: A large language model powered autonomous agent for EDA," 2023.
[9] S. Bubeck et al., "Sparks of artificial general intelligence: Early experiments with GPT-4," 2023.
[10] S. Wu et al., "BloombergGPT: A large language model for finance," 2023.
[11] MosaicML. (2022) BioMedLM: a domain-specific large language model for biomedical text. [Online]. Available: https://www.mosaicml.com/blog/introducing-pubmed-gpt
[12] M. Liu et al., "VerilogEval: Evaluating large language models for Verilog code generation," in 2023 IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2023.
[13] E. Nijkamp et al., "CodeGen: An open large language model for code with multi-turn program synthesis," ICLR, 2023.
[14] S. Gururangan et al., "Don't stop pretraining: Adapt language models to domains and tasks," 2020.
[15] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," 2021.
[16] E. J. Hu et al., "LoRA: Low-rank adaptation of large language models," CoRR, vol. abs/2106.09685, 2021. [Online]. Available: https://arxiv.org/abs/2106.09685
[17] L. Gao et al., "The Pile: An 800GB dataset of diverse text for language modeling."
[18] D. Kocetkov et al., "The Stack: 3 TB of permissively licensed source code," 2022.
[19] A. Köpf et al., "OpenAssistant conversations - democratizing large language model alignment," 2023.
[20] J. Wei et al., "Finetuned language models are zero-shot learners," 2022.
[21] V. Sanh et al., "Multitask prompted training enables zero-shot task generalization," 2022.
[22] D. Hendrycks et al., "Measuring massive multitask language understanding," 2021.
[23] M. Chen et al., "Evaluating large language models trained on code," 2021.
[24] F. Koto, J. H. Lau, and T. Baldwin, "IndoBERTweet: A pretrained language model for Indonesian Twitter with effective domain-specific vocabulary initialization," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov. 2021, pp. 10660-10668.
[25] O. Kuchaiev et al., "NeMo: A toolkit for building AI applications using neural modules," 2019.
[26] M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
[27] T. Dao et al., "FlashAttention: Fast and memory-efficient exact attention with IO-awareness," in Advances in Neural Information Processing Systems, 2022.
[28] A. Chowdhery et al., "PaLM: Scaling language modeling with pathways," 2022.
[29] Z. Ji et al., "Survey of hallucination in natural language generation," ACM Comput. Surv., vol. 55, no. 12, Mar. 2023. [Online]. Available: https://doi.org/10.1145/3571730
[30] L. Wang et al., "Text embeddings by weakly-supervised contrastive pre-training," arXiv preprint arXiv:2212.03533, 2022.
[31] L. Gao et al., "Tevatron: An efficient and flexible toolkit for dense retrieval," 2022.
[32] B. Rozière et al., "Code Llama: Open foundation models for code," 2023.
[33] J. Kirkpatrick et al., "Overcoming catastrophic forgetting in neural networks," Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521-3526, 2017.
[34] N. Reimers and I. Gurevych, "Sentence-BERT: Sentence embeddings using Siamese BERT-networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019. [Online]. Available: http://arxiv.org/abs/1908.10084
[35] R. Pope et al., "Efficiently scaling transformer inference," 2022.
[36] R. Y. Aminabadi et al., "DeepSpeed Inference: Enabling efficient inference of transformer models at unprecedented scale," 2022.
[37] L. Ouyang et al., "Training language models to follow instructions with human feedback," 2022.
[38] W. Xiong et al., "Effective long-context scaling of foundation models," 2023.
[39] R. Taylor et al., "Galactica: A large language model for science," 2022.
[40] A. Lewkowycz et al., "Solving quantitative reasoning problems with language models," 2022.
[41] P. Lewis et al., "Retrieval-augmented generation for knowledge-intensive NLP tasks," 2021.
[42] S. Borgeaud et al., "Improving language models by retrieving from trillions of tokens," 2022.
[43] S. Robertson and H. Zaragoza, "The probabilistic relevance framework: BM25 and beyond," Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333-389, Apr. 2009. [Online]. Available: https://doi.org/10.1561/1500000019
[44] V. Karpukhin et al., "Dense passage retrieval for open-domain question answering," 2020.
[45] G. Izacard et al., "Unsupervised dense information retrieval with contrastive learning," 2022.
[46] W. Shi et al., "REPLUG: Retrieval-augmented black-box language models," 2023.
[47] G. Izacard et al., "Few-shot learning with retrieval augmented language models," 2022. [Online]. Available: http://arxiv.org/abs/2208.03299
[48] O. Ram et al., "In-context retrieval-augmented language models," 2023.
[49] S. Zhou et al., "DocPrompting: Generating code by retrieving the docs," 2023.
[50] R. Rafailov et al., "Direct preference optimization: Your language model is secretly a reward model," 2023.
[51] Y. Dong et al., "SteerLM: Attribute conditioned SFT as an (user-steerable) alternative to RLHF," 2023.
[52] H. Pearce, B. Tan, and R. Karri, "DAVE: Deriving automatically Verilog from English," in Proceedings of the 2020 ACM/IEEE Workshop on Machine Learning for CAD, ser. MLCAD '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 27-32. [Online]. Available: https://doi.org/10.1145/3380446.3430634
[53] "Beautiful Soup," https://www.crummy.com/software/BeautifulSoup/, accessed: 10 Oct 2023.
[54] K. Sakaguchi et al., "WinoGrande: An adversarial Winograd schema challenge at scale," arXiv preprint arXiv:1907.10641, 2019.
[55] R. Zellers et al., "HellaSwag: Can a machine really finish your sentence?" in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[56] P. Clark et al., "Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge," 2018.
[57] G. Lai et al., "RACE: Large-scale reading comprehension dataset from examinations," 2017.

APPENDIX

A. Data Collection Process

Collection was implemented with a set of shell and Python scripts, designed to identify relevant design data and documentation, convert them to plain text if applicable, filter them using basic quality metrics, compute a checksum for precise file deduplication, and compress them for storage. The collection flow did not use off-the-shelf LLM-specific scraping and collection scripts, as we aimed to minimize space requirements through in-situ data collection of internal data sources (both networked file systems and internal web applications). For file system-based collection, data was kept in-place while being filtered for quality, instead of storing additional sets of raw data locally.

The design and verification data collection encompassed a variety of source files, including Verilog and VHDL (RTL and netlists), C++, Spice, Tcl, various scripting languages, and build-related configuration files. Data from internal web services were gathered through both REST API calls and conventional crawling, with HTML formatting being removed using the open-source BeautifulSoup [53] Python library in both instances to minimize inadvertent removal of coding examples, at the cost of introducing more boilerplate navigation bars and other HTML page elements. Our data collection flow supported conventional documentation formats, including .docx, .pptx, and .pdf, using readily available Python conversion libraries and open-source tools.

As most internal data is believed to be of high quality, minimal filtering was applied: line count filtering was used to ensure that exceedingly large or small files were excluded, and files were sorted into broad categories of manually written versus tool-generated.
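The following is a minimal sketch of the kind of in-situ collection pass described above (walk a source tree, apply a simple line-count quality filter, and deduplicate by checksum). The suffix list and thresholds are illustrative assumptions, not the actual internal scripts or file-type rules.

```python
import hashlib
from pathlib import Path

# Illustrative settings only; the real flow used internal file-type rules and
# additional quality metrics that are not reproduced here.
KEEP_SUFFIXES = {".v", ".vhd", ".cpp", ".py", ".tcl", ".md", ".txt"}
MIN_LINES, MAX_LINES = 10, 50_000

def collect(root: str):
    """Yield unique, quality-filtered text files under `root` without copying raw data."""
    seen_checksums = set()
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in KEEP_SUFFIXES:
            continue
        text = path.read_text(errors="ignore")
        n_lines = text.count("\n") + 1
        # Line count filter: exclude exceedingly small or large files.
        if not (MIN_LINES <= n_lines <= MAX_LINES):
            continue
        # Checksum for precise file deduplication.
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_checksums:
            continue
        seen_checksums.add(digest)
        yield path, text
```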
B. Domain Adaptive Pretraining (DAPT)
In this section we present detailed results on our domain
adaptive pretrained models. We also detail our ablation experi-
ments on domain adaptive pretraining.
DAPT Hyperparameters: Details presented in Table VI.
Hyperparameters       Value
Context Window        4096
Global Batch Size     256 (128)
Optimizer             distributed fused adam
Weight Decay          0.01
Betas                 0.9, 0.95 (0.9, 0.98)
Learning Rate         5e-6
Scheduler             None
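As a rough illustration of how these values map onto a training setup, the snippet below expresses the same optimizer settings as a plain PyTorch AdamW configuration. This is only a simplified sketch: ChipNeMo training used the NeMo framework with a distributed fused Adam implementation (per the acknowledgements), and `model` here is a placeholder module.

```python
import torch

# Placeholder model; in practice this would be the LLaMA2-based foundation model.
model = torch.nn.Linear(4096, 4096)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-6,            # Learning Rate
    betas=(0.9, 0.95),  # Betas
    weight_decay=0.01,  # Weight Decay
)
# Scheduler: None, i.e. a constant learning rate throughout DAPT.
```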
TABLE VII: Auto Evaluation Results. We report academic benchmark results for LLaMA2 using proprietary evaluation methods. ChipNeMo
models trained with tokenizer augmentation.
• Step 4: It is possible that some of the fetched passages are actually positive, so use the same language model to filter out the positive passages.
• Step 5: If there are not enough negative passages after this filtering process, supplement with random passages from the corpus.

For our initial research we used Vicuna [4] and Sentence Transformer [34]; however, they can easily be replaced with LLaMA2 [5] and BM25 [43] respectively to produce a retrieval model that is commercially viable.

Fig. 14: Sample Generation For Retrieval Model Training

2) Hit Quality Comparison: Not all hits are created equal. The passage in the Spec example below clearly and completely answers its query. The passage in the Build example contains the answer; however, more context is required to answer the query.

Spec Example: Hit passage clearly answers the query.

Query:
What is the XXX buffer in the YYY used for?

Hit Passage:
Since the memory subsystem is partitioned, YYY may receive read responses whose order is different from their requests. ZZZ requires strict in-order per-thread processing, thus YYY reorders the read responses to the request order before sending them back to ZZZ. YYY re-orders the read responses in a read return reorder buffer (XXX) to guarantee the ordering.

Answer: re-orders the read responses.

Build Example: Additional information is required to fully answer the query. Such as: What is a DL? How do we know Arch-Build-Hotseat-XXX is a DL?

Query:
What is the support DL for XXX build issues?

Hit Passage:
Tree Setup
Working in the XXX Mainline explains initial tree setup and build steps
Build
Arch-Build-Hotseat-XXX - Hotseat support for XXX build issues
YYY build failures

D. Additional Evaluation Data

Table XIV shows the evaluation data for all models on the engineering assistant chatbot application.

Table XV shows our evaluation results for all models on the EDA script generation task.

Table XVI shows our evaluation results for all models on the bug summarization and analysis task.

Model                          Domain SFT   Hit    Miss   ALL
LLaMA2-13B-Chat*               No           2.13   2.80   2.33
ChipNeMo-13B-Chat              No           4.64   5.40   4.88
ChipNeMo-13B-Chat              Yes          4.66   6.44   5.21
LLaMA2-13B-Chat* + RAG         No           7.68   3.72   6.46
ChipNeMo-13B-Chat + RAG        No           7.86   5.48   7.12
ChipNeMo-13B-Chat + RAG        Yes          7.86   6.36   7.40
LLaMA2-70B-Chat                No           2.36   2.32   2.35
LLaMA2-70B-Chat + RAG          No           8.38   5.20   7.40

TABLE XIV: Engineering Assistant Chatbot Human Evaluation

Model                          Tool1 (Python)                                                Tool2 (Tcl)
                               Automatic (Easy)   Automatic (Medium)   Human (Hard with Context)   Human (Hard with Context)
LLaMA2-13B-Chat*               7%                 4%                   15%                         6%
ChipNeMo-13B-Chat (noDSFT)     26%                35%                  39%                         33%
ChipNeMo-13B-Chat              44%                39%                  41%                         36%
LLaMA2-70B-Chat                3%                 4%                   49%                         18%

TABLE XV: EDA Script Generation Evaluation

TABLE XVI: Bug Summarization and Analysis Evaluation. Likert scale 1-7.
E. Examples
1) Engineering Assistant Chatbot:

Question:
Write a System Verilog assert to check if signal A is true within 5 cycles of signal B being true

Answer:
Sure, here's an example System Verilog assert to check if signal A is true within 5 cycles of signal B being true:

// Define the signals
logic a, b;
// Define the assert
assert property (@(posedge clk) a |-> ##[1:5] b);

In this example, the assert property is used to check if signal a is true, then signal b must be true within 5 cycles. The assert will fail if b is not true within 5 cycles of a being true.

Question:
why do we add retiming stages?

Answer:
Retiming stages are added to improve the timing of the design and to account for the physical distance between different parts of the chip. By adding retiming stages, we can reduce the amount of time it takes for signals to travel between different parts of the chip, which can help to improve the overall performance of the design. Additionally, retiming stages can also help to reduce the amount of noise and interference that can occur on the signals, which can further improve the performance of the design.

Question:
How do I run multiple seeds of random tests using testgen framework?

Answer:
You can run multiple seeds of random tests using the Testgen framework by using the rand_tgen target. This target allows you to specify a variety of options for the stopping criteria of random runs using the same Testgen testlist(s).

nvmk -c . rand_tgen TESTLIST=my_testlist MON=1 TGEN.LOOP=50

2) EDA Script Generation: Some function names and commands are obfuscated.

Question:
write a code in TOOL1 to print the number of flop cells in the give rectangle (0,0,100,100)

Answer:
# Sure, here is a TOOL1 code to print the number of flop cells in the given rectangle (0,0,100,100):

# This code gets the top_ref of the design
top_ref = env().get_top_reference()
# This code gets the cells of the rectangle in the top_ref
cells = env().layout().cells_in_rectangle(top_ref, rectangle(0, 0, 100, 100), hierarchical)
# This code gets the cells that are flop cells
flop_cells = cells.type("is_flop_cell")
# This code prints the number of flop cells
print(flop_cells.size())

Question:
Write a TOOL2 tcl proc to find intersecting logic from pins '$pinA' to pin '$pinB'.

Answer:
#Sure, here is the code:
proc myfunc {pinA pinB} {
    # Get all the pins in the fanout of the given pin
    set fanout_pins [fanout -from $pinA]
    # Get all the pins in the fanout of the given pin
    set fanin_pins [fanin -to $pinB]
    # Get the intersection of the two collections
    set intersection [common_logic $fanout_pins $fanin_pins]
    # Return the intersection
    return $intersection
}

3) Bug Summarization and Analysis:
Prompt:
BugId : 002
Synopsis : Request power/current measurement on fabrics using fabric test module
Module : Silicon - DFT
Description :
Hi all,
We want to evaluate silicon current (power) consumed by fabric macros in a chip. This will help us correlate this power to pre-Si power estimates.
Here are the conditions we want to look at for this data:
Fabric test module activity: 0/25/50/75/100%
Corners:
typical low voltage
typical high voltage
We want to isolate fabric current (power) as much as possible, so we want to enable only the logic needed for fabric test modules and the fabrics
themselves.
If this information can be broken down per module (there are 6 in the chip), that will help.
Severity : 5-Performance
Priority : 5-Normal
Comments :
#1 : Author : User A
Is this a requirement only for current or for future chips as well ?
#2 : Author : User B
There will likely be interest in future chips to perform this exercise as we push the boundaries of fabric power density for better area.
#3 : Author : User A
What is needed in this bug ?
@User B It is not clear.
Can you list the specific needs with timelines and priority
Technical Summary:
In NVIDIA’s internal bug management system, a bug has been filed regarding the power and current measurements of the fabrics in the latest chip.
The request is to evaluate the silicon current (power) consumed by the fabric macros in the latest chip and to correlate this power to pre-Si power
estimates. The conditions for the measurements are the fabric test module activity at 0/25/50/75/100% and the corners are typical low voltage and
typical high voltage. The data needs to be broken down per module (there are 6 in the latest chip). Only the logic needed for fabric test modules
and the fabrics themselves need to be enabled.
Managerial Summary:
User A asks User B if the need for evaluating the silicon current of fabric macros in a chip is only for that specific chip or for future ones as
well. User B replies that there will likely be interest in future chips to perform this exercise as they push the boundaries of fabric power density for
better area. User A then asks what is needed in this bug and asks for specific needs with timelines and priority.
Task Assignment:
User B