
ChipNeMo: Domain-Adapted LLMs for Chip Design

Mingjie Liu§, Teo Ene§, Robert Kirby§, Chris Cheng§, Nathaniel Pinckney§, Rongjian Liang§
Jonah Alben, Himyanshu Anand, Sanmitra Banerjee, Ismet Bayraktaroglu, Bonita Bhaskaran
Bryan Catanzaro, Arjun Chaudhuri, Sharon Clay, Bill Dally, Laura Dang, Parikshit Deshpande
Siddhanth Dhodhi, Sameer Halepete, Eric Hill, Jiashang Hu, Sumit Jain, Brucek Khailany
Kishor Kunal, Xiaowei Li, Hao Liu, Stuart Oberman, Sujeet Omar, Sreedhar Pratty, Ambar Sarkar
Zhengjiang Shao, Hanfei Sun, Pratik P Suthar, Varun Tej, Kaizhe Xu, Haoxing Ren
NVIDIA

§ Equal contribution

Abstract—ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: custom tokenizers, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our results show that these domain adaptation techniques enable significant LLM performance improvements over general-purpose base models across the three evaluated applications, enabling up to 5x model size reduction with similar or better performance on a range of design tasks. Our findings also indicate that there is still room for improvement between our current results and ideal outcomes. We believe that further investigation of domain-adapted LLM approaches will help close this gap in the future.

Fig. 1: ChipNeMo Training Flow. Pretraining: LLaMA2 (7B, 13B) foundation models are trained on trillions of tokens of internet data (10^5-10^6 GPU hrs). Domain-adaptive pretraining on 24B tokens of chip design docs/code (thousands of GPU hrs) produces the ChipNeMo foundation models (7B, 13B). Supervised fine-tuning on 128K chat instructions plus 1.1K task instructions (100+ GPU hrs) produces the ChipNeMo chat models (7B, 13B).

I. INTRODUCTION

Over the last few decades, Electronic Design Automation (EDA) algorithms and tools have provided huge gains in chip design productivity. Coupled with the exponential increases in transistor densities provided by Moore's law, EDA has enabled the development of feature-rich complex SoC designs with billions of transistors. More recently, researchers have been exploring ways to apply AI to EDA algorithms and the chip design process to further improve chip design productivity [1] [2] [3]. However, many time-consuming chip design tasks that involve interfacing with natural languages or programming languages still have not been automated. The latest advancements in commercial (ChatGPT, Bard, etc.) and open-source (Vicuna [4], LLaMA2 [5], etc.) large language models (LLMs) provide an unprecedented opportunity to help automate these language-related chip design tasks. Indeed, early academic research [6] [7] [8] has explored applications of LLMs for generating RTL that can perform simple tasks in small design modules as well as for generating scripts for EDA tools.

We believe that LLMs have the potential to help chip design productivity by using generative AI to automate many language-related chip design tasks such as code generation, responses to engineering questions via a natural language interface, analysis and report generation, and bug triage. In this study, we focus on these three specific LLM applications: an engineering assistant chatbot for GPU ASIC and Architecture design engineers, which understands internal HW designs and is capable of explaining complex design topics; EDA script generation for two domain-specific tools based on Python and Tcl for VLSI timing analysis tasks specified in English; and bug summarization and analysis as part of an internal bug and issue tracking system.

Although general-purpose LLMs trained on vast amounts of internet data exhibit remarkable capabilities in generative AI tasks across diverse domains (as demonstrated by Bubeck et al. in [9]), recent work such as BloombergGPT [10] and BioMedLLM [11] demonstrates that domain-specific LLM models can outperform a general-purpose model on domain-specific tasks. In the hardware design domain, [6] [12] showed that open-source LLMs (CodeGen [13]) fine-tuned on additional Verilog data can outperform state-of-the-art OpenAI models. Customizing LLMs in this manner also avoids security risks associated with sending proprietary chip design data to third-party LLMs via APIs. However, it would be prohibitively expensive to train domain-specific models for every domain from scratch, since this often requires millions of GPU training hours. To cost-effectively train domain-specific models, we instead propose to combine the following techniques: Domain-Adaptive Pre-Training (DAPT) [14] of foundation models with domain-adapted tokenizers, model alignment using general and domain-specific instructions, and retrieval-augmented generation (RAG) [15] with a trained domain-adapted retrieval model.

As shown in Figure 1, our approach is to start with a base foundational model and apply DAPT followed by Supervised Fine-Tuning (SFT). DAPT, also known as continued pretraining with in-domain data, has been shown to be effective in areas such as biomedical and computer science publications, news, and reviews. In our case, we construct our domain-specific pre-training dataset from a collection of proprietary hardware-related code (e.g. software, RTL, verification testbenches, etc.) and natural language datasets (e.g. hardware specifications, documentation, etc.). We clean up and preprocess the raw dataset, then continued-pretrain a foundation model with the domain-specific data. We call the resulting model a ChipNeMo Foundation Model. DAPT is done on a fraction of the tokens used in pre-training, and is much cheaper, only requiring a few thousand GPU hours. As described in Section V, we find this approach to be more effective than Parameter Efficient Training (PEFT) techniques such as LoRA [16] for our use cases.

LLM tokenizers convert text into sequences of tokens for LLM training. A domain-specific tokenizer improves tokenization efficiency by tailoring rules and patterns for domain-specific terms such as keywords commonly found in RTL. For DAPT, we cannot retrain a new domain-specific tokenizer from scratch, since it would make the foundation model invalid. Instead of restricting ChipNeMo to the pre-trained general-purpose tokenizer used by the foundation model, we instead adapt the pre-trained tokenizer to our chip design dataset, only adding new tokens for domain-specific terms.

ChipNeMo foundation models are completion models which require supervised fine-tuning (SFT) to adapt to tasks such as chat. We use largely publicly available general-purpose chat instruction datasets for multi-turn chat together with a small amount of domain-specific instruction data to perform SFT on the ChipNeMo foundation model, which produces the ChipNeMo Chat model. We observe that SFT with a general-purpose chat instruction dataset is adequate to align the ChipNeMo foundation models with queries in the chip design domain. We also added a small amount of task-specific SFT instruction data, which further improves the alignment. We trained multiple ChipNeMo Foundation and Chat models based on variants of LLaMA2 models used as the base foundation model.

To improve performance on the engineering assistant chatbot application, we also leverage Retrieval Augmented Generation (RAG). RAG is an open-book approach for giving LLMs precise context for user queries. It retrieves relevant in-domain knowledge from its data store to augment the response generation given a user query. This method shows significant improvement in grounding the model to the context of a particular question. Crucially, we observed significant improvements in retrieval hit rate when fine-tuning a pretrained retrieval model with domain data. This led to even further improvements in model quality.

We highlight the following contributions and findings related to adapting LLMs to the chip design domain:
• We demonstrate domain-adapted LLM effectiveness on three use-cases: an engineering assistant chatbot, EDA tool script generation, and bug summarization and analysis. We achieve a score of 7.4 on a 10-point scale for engineering assistant chatbot responses based on expert evaluations, more than 50% correctness in EDA script generation, and expert evaluation ratings of 4 to 5 on a 7-point scale for the summarization and assignment identification tasks.
• Domain-adapted ChipNeMo models dramatically outperform all vanilla LLMs evaluated on both multiple-choice domain-specific AutoEval benchmarks and human evaluations of the applications.
• For tasks where it is possible for the model to generate text from the prompt context (e.g. chat with RAG hits, summarization, code generation with provided documentation), domain adaptation closes the gap between a state-of-the-art LLaMA2 70B model and a much smaller 13B model (a small incremental training cost enables up to 5x parameter reduction for reduced inference cost).
• Customized tokenizers reduce the DAPT token count by up to 3.3% without hurting effectiveness on applications.
• SFT on an additional 1.1K domain-specific instructions significantly improves application proficiency by up to 0.33 points on a 10-point scale, 18% correctness, and 0.79 points on a 7-point scale for the engineering assistant chatbot, EDA script generation, and bug summarization and analysis, respectively.
• Fine-tuning our ChipNeMo retrieval model with domain-specific data improves the retriever hit rate by 30% over a pre-trained state-of-the-art retriever, in turn improving the overall quality of RAG responses.

The paper is organized as follows. Section II describes our dataset and auto evaluation benchmarks for domain knowledge verification. Section III outlines the domain adaptation and training methods used, including the adapted tokenizer, DAPT, SFT, and RAG. Section IV provides details of each application and the experimental setup. Section V describes the experimental results, including human evaluations for each application. Section VI discusses ChipNeMo limitations and future work. Section VII describes relevant LLM methods and other work targeting LLMs for chip design. Finally, complete results along with additional model training details and examples of text generated by the application use-cases are presented in the Appendix.
II. DATASET

A. DAPT Dataset

During Domain-Adaptive Pre-Training (DAPT), we assemble a dataset from a combination of NVIDIA-proprietary chip design specific data sources and publicly available datasets.

Chip Design Datasets: Our internal dataset consists of a diverse range of text sources pertinent to chip design, spanning design, verification, infrastructure, and internal documentation. Table I provides a breakdown of the data collected after filtering, and the corresponding number of tokens using the LLaMA2 tokenizer. We construct the dataset by gathering all relevant internal data, then filtering by file type, based on filename extensions, and distinguishing between machine-generated and human-written content. Although we evaluated on three specific use cases, we did not specifically limit the dataset to sources known to be relevant to these use cases, since we believed that incorporating additional domain knowledge would improve performance. After collection, cleaning, and filtering, the internal data training corpus has 23.1 billion tokens. Further details of the data collection process are covered in Appendix A.

Public Datasets: We augment the chip design specific data with a sample of publicly available data from various sources, a common practice in the development of foundational large language models. Our approach was to reuse public training data from other language models, with the stipulation that it must be publicly accessible and compatible with open sourcing. These datasets exhibit a high degree of correlation with the pretraining data used in LLaMA2 [5], with the intention of preserving general knowledge and natural language capabilities during DAPT. The public datasets used by ChipNeMo can be categorized into two groups, natural language and code. For the natural language component, we draw from Wikipedia data [17], as it is widely regarded for its high data quality. For code, we leverage GitHub data [18], focusing on programming languages also present in our internal chip design dataset, such as C++, Python, and Verilog. To ensure that the overall dataset is representative of pre-training distributions, we perform a sub-sampling operation that results in approximately 9.2% of the total training tokens being sampled from these public datasets, with a balanced representation of natural language and code.

Data Blend: A significant proportion of the domain data we gathered is comprised of unannotated code from diverse origins. In an effort to enhance the model's comprehension of domain-specific knowledge, we conducted downsampling of code data while concurrently upsampling natural language data, specifically design documentation, over a span of 2 to 4 training epochs. We also increased the representation of data that we deemed more pertinent to downstream applications, such as human-written EDA tool scripts. Furthermore, we incorporated publicly available domain data for 1 epoch. Details of the token distribution for training are shown in Table I.

Data Source Type | Data Percentage (%) | Data Tokens (B) | Training Percentage (%) | Training Tokens (B)
Bug Summary | 9.5 | 2.4 | 10.0 | 2.4
Design Source | 47.0 | 11.9 | 24.5 | 5.9
Documentation | 17.8 | 4.5 | 34.0 | 8.2
Verification | 9.1 | 2.3 | 10.4 | 2.5
Other | 7.9 | 2.0 | 12.0 | 2.9
Wikipedia | 5.9 | 1.5 | 6.2 | 1.5
Github | 2.8 | 0.7 | 3.0 | 0.7
Total | 100.0 | 25.3 | 100.0 | 24.1
TABLE I: Breakdown of Data by Source. Token count measured with the original LLaMA2 tokenizer.
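For illustration, the sketch below shows one way such a blend can be assembled by repeating each source for a fractional number of epochs and shuffling the result; the function, source names, and weights are illustrative assumptions rather than our production data pipeline.

import random

def blend_sources(sources, epoch_weights, seed=0):
    """sources: dict name -> list of documents; epoch_weights: dict name -> epochs to include."""
    rng = random.Random(seed)
    blended = []
    for name, docs in sources.items():
        epochs = epoch_weights.get(name, 1.0)
        whole, frac = int(epochs), epochs - int(epochs)
        blended.extend(docs * whole)                              # repeat for whole epochs (upsampling)
        blended.extend(rng.sample(docs, int(len(docs) * frac)))   # partial epoch (downsampling)
    rng.shuffle(blended)
    return blended

# Hypothetical weights in the spirit of the blend above: upsample design documentation,
# downsample raw design source code, include public data for a single epoch.
example_weights = {"documentation": 3.0, "design_source": 0.5, "wikipedia": 1.0, "github": 1.0}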
B. SFT Instruction Data

During Supervised Fine-Tuning (SFT), we employ a general chat SFT instruction dataset that is accessible for commercial use. The dataset is comprised largely of publicly available instruction-following datasets, including OASST [19], FLAN [20], and P3 [21], and a small amount of a broad-domain proprietary dataset comprising various topics such as brainstorming, open-ended question answering, rewriting, summarization, etc. It is important to note that the SFT instruction data we discuss here is focused on general natural language tasks and does not contain any information or tasks related to the downstream use cases in chip design. In total, this dataset comprises 128,000 training samples.

Additionally, we meticulously assembled a domain-specific instruction dataset for aligning the model to downstream use cases. These examples have been crafted by subject matter experts and are formatted as single-turn questions and answers. Table II depicts the quantity of our domain-specific instruction dataset. It is worth noting that the total number of training samples in the domain-specific instruction dataset is quite small when compared to the extensive amount of generative chat instruction data.

Domain Source | Number of Samples
Design Knowledge | 280
EDA Script Generation | 480
Bug Summarization and Analysis | 392
Total | 1152
TABLE II: Breakdown of Domain SFT Data.

C. AutoEval

In order to quickly and quantitatively assess the accuracy of various models, we established evaluation criteria structured as multiple-choice question-and-answer formats for each use case, designed to closely align with established benchmarks such as MMLU [22]. In the process of formulating these multiple-choice questions, collaboration with domain experts was pivotal. The goal was to ensure that each question included at least one complex answer choice, thereby posing a challenge to individuals with limited domain expertise. Careful attention was also given to prevent any inadvertent contamination of the questions with data from our domain-specific SFT. In addition to the per-use-case benchmarks, an additional benchmark was created for general circuit design knowledge, covering both analog and digital design topics. The number of multiple-choice questions for each evaluation benchmark is shown in Table III.

Domain Source | Number of Questions
Design Knowledge (Design) | 94
EDA Script Generation (Scripting) | 74
Bug Summarization and Analysis (Bugs) | 70
Open Domain Circuit Design (Circuits) | 227
TABLE III: Domain-specific Evaluation Benchmark.

When we report results on the above benchmarks, we take the average of results obtained from five distinct runs to mitigate the effects of variance and noise in the testing process. Each iteration employs a set of 5-shot examples, with variations introduced across individual runs.

In addition to these domain-specific evaluation benchmarks, we also include commonly used, publicly available LLM academic benchmarks. Furthermore, we measure the model's code generation capabilities by evaluating HumanEval [23] for Python and VerilogEval [12] for Verilog.
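As a rough illustration of this protocol, the sketch below builds an MMLU-style 5-shot prompt and scores the first letter of the model's completion against the gold choice; the question fields and completion interface are assumptions, not our exact harness.

# Illustrative 5-shot multiple-choice scoring. `complete` stands in for the model under test.
def format_question(q):
    choices = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", q["choices"]))
    return f"{q['question']}\n{choices}\nAnswer:"

def evaluate(complete, questions, shots):
    """complete: callable prompt -> generated text; shots: five held-out exemplar questions."""
    header = "\n\n".join(format_question(s) + " " + s["answer"] for s in shots[:5])
    correct = 0
    for q in questions:
        prediction = complete(header + "\n\n" + format_question(q)).strip()[:1]  # predicted letter
        correct += prediction == q["answer"]
    return correct / len(questions)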
III. CHIPNEMO DOMAIN ADAPTATION METHODS

ChipNeMo implements multiple domain adaptation techniques to adapt LLMs to the chip design domain. These techniques include custom tokenizers for chip design data, domain-adaptive pretraining with a large corpus of domain data, supervised fine-tuning with domain-specific tasks, and retrieval-augmented generation with a fine-tuned retrieval model. We illustrate the details of each technique in this section.

A. Tokenizer

When adapting a pre-trained tokenizer, the main goals are to improve tokenization efficiency on domain-specific data, maintain efficiency and language model performance on general datasets, and minimize the effort of retraining/fine-tuning. To achieve this, we developed a four-step approach:
• Step 1: Training a tokenizer from scratch using domain-specific data.
• Step 2: From the vocabulary of the new tokenizer, identifying tokens that are absent from the general-purpose tokenizer and are rarely found in general-purpose datasets.
• Step 3: Expanding the general-purpose tokenizer with the tokens newly identified in Step 2.
• Step 4: Initializing the embeddings of the new tokens by utilizing the general-purpose tokenizer.

Specifically, for Step 4, when a new token is encountered, it is tokenized using the pretrained general-purpose tokenizer. The embedding of the new token is determined by averaging the embeddings of the tokens generated by the general-purpose tokenizer [24], and the output layer weights are initialized to zero.

Step 2 helps maintain the performance of the pre-trained LLM on general datasets by selectively introducing new tokens that are infrequently encountered in general-purpose datasets, and Step 4 reduces the effort required for retraining/fine-tuning the LLM via initialization of the embeddings of new tokens guided by the general-purpose tokenizer.
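A compact sketch of Steps 2-4 using Hugging Face-style tokenizer and model calls is given below; ChipNeMo itself is trained with NeMo, so this is only an illustration of the initialization scheme, and the Step 2 frequency filter is omitted for brevity.

# Illustrative sketch only: extend the tokenizer, then initialize new input embeddings as the
# mean of the pieces the original tokenizer produced, with output (LM head) rows zeroed.
def augment_tokenizer(tokenizer, model, candidate_tokens):
    new_tokens = [t for t in candidate_tokens if t not in tokenizer.get_vocab()]   # Step 2 (no rarity filter here)
    pieces = {t: tokenizer(t, add_special_tokens=False)["input_ids"] for t in new_tokens}
    tokenizer.add_tokens(new_tokens)                                               # Step 3
    old_size = model.get_input_embeddings().weight.shape[0]
    model.resize_token_embeddings(len(tokenizer))
    emb = model.get_input_embeddings().weight.data
    out = model.get_output_embeddings().weight.data
    for i, tok in enumerate(new_tokens):                                           # Step 4
        emb[old_size + i] = emb[pieces[tok]].mean(dim=0)
        out[old_size + i] = 0.0
    return tokenizer, model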
B. Domain Adaptive Pretraining

In our study, we apply DAPT to the pretrained foundation base models LLaMA2 7B/13B. Each DAPT model is initialized using the weights of its corresponding pretrained foundational base model. We name our DAPT models ChipNeMo. We employ tokenizer augmentation as depicted in Section III-A and initialize the embedding weights accordingly [24]. We conduct further pretraining on domain-specific data by employing the standard autoregressive language modeling objective. All model training procedures are conducted using the NVIDIA NeMo framework [25], incorporating techniques such as tensor parallelism [26] and flash attention [27] for enhanced efficiency.

Our models undergo a consistent training regimen with similar configurations. A small learning rate of 5·10^-6 is employed, and training is facilitated using the Adam optimizer, without the use of learning rate schedulers. The global batch size is set at 256, and a context window of 4096 tokens is applied, resulting in an effective batch size of 1M tokens. Detailed training hyperparameters are provided in Appendix B. The total number of training steps is set to 23,200, equating to roughly 1 epoch of the data blend.

Fig. 2: Smoothed Training Loss for ChipNeMo with Tokenizer Augmentation.

Figure 2 illustrates the training loss of ChipNeMo under the specified hyperparameters. We do observe spikes in the training loss. In contrast to the hypothesis in [28], we postulate that in our scenario these spikes can be attributed to "bad data", since these irregularities seem to consistently occur at similar training steps for the same model, even across different model sizes. We chose not to address this issue, as these anomalies did not appear to significantly impede subsequent training steps (with no noticeable degradation in validation loss), possibly due to our application of a low learning rate.
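For reference, the recipe above can be condensed into the following illustrative configuration; the field names are ours for exposition and do not correspond to actual NeMo configuration keys.

# Condensed, illustrative view of the DAPT recipe described above (not a NeMo config file).
dapt_config = {
    "init_from": "llama2-13b",            # weights of the pretrained foundation model
    "objective": "autoregressive_lm",     # standard next-token prediction on domain data
    "optimizer": {"name": "adam", "lr": 5e-6, "lr_scheduler": None},
    "global_batch_size": 256,
    "sequence_length": 4096,              # 256 x 4096 tokens = ~1M tokens per step
    "max_steps": 23_200,                  # roughly one epoch of the data blend
    "tensor_parallelism": True,           # see [26]; parallelism degree omitted (Appendix B)
    "flash_attention": True,              # see [27]
}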
C. Supervised Fine-Tuning

After DAPT, we perform model alignment with supervised fine-tuning (SFT). We adopt the identical hyperparameter training configuration as DAPT for all models, with the exception of using a reduced global batch size of 128. All SFT data is structured according to the chat template below:

<extra_id_0>System\n{system}
<extra_id_1>User\n{user_utterance}
<extra_id_1>Assistant\n{chipnemo_response}
...

We employ an autoregressive optimization objective, implementing a strategy where losses associated with tokens originating from the system and user prompts are masked [5]. This approach ensures that during backpropagation, our focus is exclusively directed towards the optimization of answer tokens.
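A sketch of how such SFT examples can be assembled is shown below: the conversation is serialized with the template above, and labels are masked (set to -100, the usual ignore index) everywhere except assistant response tokens. The tokenizer interface is assumed to be Hugging Face-like; this is not our exact implementation.

# Illustrative construction of one SFT training example with prompt-token loss masking.
def build_sft_example(tokenizer, system, turns):
    """turns: list of (user_utterance, assistant_response) pairs."""
    input_ids, labels = [], []

    def append(text, train_on_it):
        ids = tokenizer(text, add_special_tokens=False)["input_ids"]
        input_ids.extend(ids)
        labels.extend(ids if train_on_it else [-100] * len(ids))

    append(f"<extra_id_0>System\n{system}\n", False)
    for user, assistant in turns:
        append(f"<extra_id_1>User\n{user}\n", False)      # masked: system and user prompt tokens
        append("<extra_id_1>Assistant\n", False)
        append(assistant + "\n", True)                     # optimized: answer tokens only
    return {"input_ids": input_ids, "labels": labels}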
We combine our domain SFT dataset, comprising approximately 1.1k samples, with the more extensive general chat SFT dataset of 128k samples. We then engaged in fine-tuning for a single epoch after applying a random shuffle to the data. We conducted experiments involving augmentation of the domain-specific SFT dataset for more than one epoch. However, it became evident that the model rapidly exhibited signs of overfitting when presented with in-domain questions, often repeating irrelevant answers from the domain SFT dataset.

Additionally, we conducted an additional SFT using solely the general chat dataset, excluding any domain-specific SFT data. For clarity, we designate all our ChipNeMo models as follows:
1) ChipNeMo-Chat: Models fine-tuned with both domain and general chat data;
2) ChipNeMo-Chat (noDSFT): Models fine-tuned with general chat data exclusively.

We also experimented with DAPT directly on a chat-aligned model, such as the LLaMA2-Chat model. We found that DAPT significantly degraded the model's alignment, making the resulting model useless for downstream tasks.
D. Retrieval-Augmented Generation

It is well known that LLMs can generate inaccurate text, so-called hallucination [29]. Although the phenomenon is not completely understood, we still must mitigate hallucinations since they are particularly problematic in an engineering assistant chatbot context, where accuracy is critical. Our proposal is to leverage the retrieval augmented generation (RAG) method. RAG tries to retrieve relevant passages from a database to be included in the prompt together with the question, which grounds the LLM to produce more accurate answers. We find that using a domain-adapted language model for RAG significantly improves answer quality on our domain-specific questions. Also, we find that fine-tuning an off-the-shelf unsupervised pre-trained dense retrieval model with a modest amount of domain-specific training data significantly improves retrieval accuracy. Our domain-adapted RAG implementation diagram is illustrated in Figure 3.

Fig. 3: RAG Implementation Variations

We created our domain-adapted retrieval model by fine-tuning the e5 small unsupervised model [30] with 3000 domain-specific auto-generated samples using the Tevatron framework [31]. The sample generation and training process are covered in Appendix C.

Even with the significant gains that come with fine-tuning a retrieval model, the fact remains that retrieval still struggles with queries that do not map directly to passages in the document corpus or that require more context than is present in the passage. Unfortunately, these queries are also more representative of queries that will be asked by engineers in real situations. Combining retrieval with a domain-adapted language model is one way to address this issue.
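For illustration, a minimal sketch of this retrieval flow is given below, assuming a sentence-transformers-style interface for the fine-tuned retriever; the checkpoint path, prompt wording, and generate() callable are placeholders rather than our production stack.

# Minimal RAG sketch: embed the query, take the top-8 passages by similarity, prepend them to the prompt.
import numpy as np
from sentence_transformers import SentenceTransformer

def build_index(retriever, passages):
    return retriever.encode(passages, normalize_embeddings=True)   # one vector per passage

def rag_answer(retriever, passages, index, question, generate, top_k=8):
    query_vec = retriever.encode([question], normalize_embeddings=True)[0]
    top = np.argsort(index @ query_vec)[::-1][:top_k]              # highest cosine similarity first
    context = "\n\n".join(passages[i] for i in top)
    prompt = f"Use the passages below to answer the question.\n\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)                                        # ChipNeMo chat model (or any LLM)

# retriever = SentenceTransformer("path/to/domain-finetuned-e5-small")   # hypothetical checkpoint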
IV. LLM APPLICATIONS

We conducted a survey of potential LLM applications within our design teams and categorized them into four buckets: code generation, question & answer, analysis and reporting, and triage. Code generation refers to an LLM generating design code, testbenches, assertions, internal tool scripts, etc.; Q & A refers to an LLM answering questions about designs, tools, infrastructure, etc.; analysis and reporting refers to an LLM analyzing data and providing reports; triage refers to an LLM helping debug design or tool problems given logs and reports. We selected one key application from each category to study in this work, except for the triage category, which we leave for further research. The motivation and technical details of each application are given below.

A. Engineering Assistant Chatbot

This application aims to help design engineers with answers to their architecture, design, verification, and build questions, which could significantly improve their overall productivity without impacting the productivity of others. It is observed that design engineers often enjoy brainstorming, designing hardware, and writing code, but can be slowed down waiting for answers on design knowledge they lack. Design productivity can also be enhanced by avoiding having engineers write code based on mistaken assumptions or debugging code that they are unfamiliar with. Internal studies have shown that up to 60% of a typical chip designer's time is spent in debug or checklist-related tasks across a range of topics including design specifications, testbench construction, architecture definition, and tools or infrastructure. Experts on these issues are often spread around the globe in a multinational company, such that it is not always convenient to find immediate help. Therefore, an engineering assistant chatbot based on knowledge extracted from internal design documents, code, any recorded data about designs, and technical communications such as emails and corporate instant messages could help significantly improve design productivity. We implemented this application with the domain-adapted RAG method described in Section III-D.

B. EDA Script Generation

Another common task in an industrial chip design flow is writing EDA scripts to accomplish a variety of tasks such as design implementation, introspection, and transformation. These scripts often leverage both tool-specific and custom internal script libraries. Learning these libraries, navigating tool documentation, and writing and debugging these scripts can take up a significant amount of engineering time.

LLMs have proven adept at small-scale code generation on a wide array of tasks [32], and therefore customizing these models to accelerate engineer productivity on this domain-specific task is a natural fit. In this work we focus on generating two different types of scripts from natural language task descriptions. The first are scripts which leverage Tool1, an internal Python library for design editing and analysis. The second are Tcl scripts that use the command interface provided by Tool2, which is a leading industrial static timing analysis tool.

In order to build our domain-specific fine-tuning dataset for this task, production scripts for both tools were collected from design experts. We observed that our DAPT models can generate reasonable inline comments for the code. This enabled us to use these models to improve the quality of the collected scripts by generating additional inline comments. Human experts later verified and corrected these comments and created an associated prompt. These prompt and code pairs make up the data used for DSFT in the format discussed in Section III-C.

To provide and collect feedback in the most meaningful way, we spent significant effort building the flow shown in Fig. 4, where engineers can both query the model and run generated code through the same interface. This allows us to be confident in the correctness of generated code as well as provide accurate feedback by allowing engineers to see how many corrections they might need to get a functioning script. We support Tool1 and Tool2 integration by establishing interactive connections to tool servers.

Additionally, we provide a user feedback form, allowing us to compare different models and glean valuable insights from user feedback. This valuable information can aid us in further refining our models.

Fig. 4: LLM script generator integration with EDA tools

C. Bug Summarization and Analysis

Tracking the reporting, triage, debug, and resolution of various features and bugs across stages of the production flow is a time-consuming process. Engineering managers spend a lot of time reviewing internal issue tracking databases to build an understanding of the state of the project and help speed its execution. Therefore, a tool that is able to look at all supporting information and quickly summarize both technical and managerial data, as well as suggest next steps, would boost team productivity. We focus on using LLMs to generate three different outputs: one focused on technical details, one on managerial details, and one recommending task assignment.

To study these tasks we used NVIDIA's internal bug database, NVBugs. This database is used for bug reporting, tracking, and resolution as well as general task and feature tracking across the company. We expect ChipNeMo models to perform well on this task as a large amount of bug data was included in the DAPT dataset. Additionally, we built a domain-specific SFT dataset for this task that includes examples of the bug summarization and task assignment tasks.

Often, bug descriptions contain large snippets of log files or code dumps along with long comment histories. In such cases, the bug text is too large for our LLM context windows. To work around this, we implemented two solutions. First, we found and replaced long path names with shorter aliases to allow the model to associate paths that occur in multiple places in the bug without needing to process the entire string. Second, we split the summarization task into an incremental task where the model is tasked with accumulating data across multiple summary and bug data chunks. We use a hierarchical approach where the bug is first separated into chunks that fit into the context window. Those chunks are then summarized, and the summaries are accumulated and then separated into chunks. This process is repeated until the entire set of summaries fits into a single context window and a single summary is generated. We use this same approach independent of the LLM used for summarization.
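A simplified sketch of these two workarounds follows; the path-matching heuristic, chunk sizes, and the summarize() callable are illustrative assumptions.

# Illustrative sketch: alias long paths, then hierarchically summarize until one window suffices.
# Token accounting is simplified to character counts; summarize() stands in for an LLM call.
import re

def alias_paths(text):
    aliases = {}
    for path in set(re.findall(r"/[\w./-]{40,}", text)):        # heuristic: very long slash-paths
        aliases.setdefault(path, f"<PATH_{len(aliases)}>")
    for path, alias in aliases.items():
        text = text.replace(path, alias)
    return text, aliases

def hierarchical_summary(bug_text, summarize, max_chars=12000):
    text = alias_paths(bug_text)[0]
    while len(text) > max_chars:
        chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        text = "\n".join(summarize(c) for c in chunks)           # accumulate partial summaries
    return summarize(text)                                       # final single-window summary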
V. EVALUATIONS

We evaluate our training methodology and application performance in this section. We study both 7B and 13B models in the training methodology evaluation, and only 13B models in the application performance evaluation. For comparison, we also evaluate two baseline chat models: LLaMA2-13B-Chat* and LLaMA2-70B-Chat. LLaMA2-13B-Chat* is the foundation LLaMA2 13B base model fine-tuned with our general-purpose chat instruction dataset, which is different from the original LLaMA2-13B-Chat model trained with reinforcement learning from human feedback (RLHF). We chose to do so for a fair comparison of domain-adapted models and base models under the same model alignment approach. LLaMA2-70B-Chat is the publicly released LLaMA2-Chat model trained with RLHF, which is considered the state-of-the-art (SOTA) open-source chat model.

A. Tokenizer

We adapt the LLaMA2 tokenizer (containing 32K tokens) to chip design datasets using the previously outlined four-step process. Approximately 9K new tokens are added to the LLaMA2 tokenizer. The adapted tokenizers can improve tokenization efficiency by 1.6% to 3.3% across various chip design datasets, as shown in Figure 5. We observe no obvious changes to tokenizer efficiency on public data. Importantly, we have not observed a significant decline in the LLM's accuracy on public benchmarks when using the custom augmented tokenizers, even prior to DAPT.

Fig. 5: ChipNeMo Tokenizer Augmentation Improvements.
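The efficiency numbers can be reproduced in spirit by comparing token counts over the same corpus sample, as sketched below (assuming tokenizers that expose an encode() method, as SentencePiece- or Hugging Face-style tokenizers do):

# Illustrative measurement of tokenization-efficiency improvement.
def token_reduction(original_tok, augmented_tok, documents):
    before = sum(len(original_tok.encode(d)) for d in documents)
    after = sum(len(augmented_tok.encode(d)) for d in documents)
    return 100.0 * (before - after) / before   # percent fewer tokens with the adapted tokenizer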
B. Domain Adaptive Pretraining

Figure 6 presents the outcomes for ChipNeMo models on the AutoEval benchmark for the chip design domain and on open-domain academic benchmarks. Our research findings can be summarized as follows:
1) DAPT models exhibit a slight degradation in accuracy on open-domain academic benchmarks.
2) DAPT exerts a substantial positive impact on tasks within the domain itself. This effect is manifested in significant improvements in internal design knowledge as well as general circuit design knowledge.
3) The use of larger and more performant foundational models yields better zero-shot results on domain-specific tasks. Furthermore, the employment of superior base models results in enhanced domain models post-DAPT, leading to heightened performance on in-domain tasks.
4) Improvements attributed to DAPT on in-domain tasks exhibit a positive correlation with model size, with larger models demonstrating more pronounced enhancements in domain-specific task performance post-DAPT.

Fig. 6: AutoEval Benchmark Result for ChipNeMo. (a) Chip Design Domain Benchmarks. (b) Academic Benchmarks.

C. Training Ablation Studies

For our ablation studies, we conducted multiple rounds of domain-adaptive pre-training. We provide brief summaries here and refer to Appendix B for details.

The differences between training with the augmented tokenizer and the original tokenizer appeared to be negligible. We thus primarily attribute the accuracy degradation on academic benchmarks to domain data. Moreover, the removal of the public dataset only slightly regressed most tasks, including academic benchmarks, with the exception of Verilog coding, where we observed a noticeable difference. This suggests that the inclusion of GitHub Verilog data contributed to enhanced Verilog coding capabilities, particularly when the base foundation models lacked sufficient data in this domain.

In our exploration, we experimented with employing a larger learning rate, as in CodeLLaMA [32]. We observed large spikes in training loss at the initial training steps. Although this approach eventually led to improved training and validation loss, we noted substantial degradations across all domain-specific and academic benchmarks, except on coding. We hypothesize that a smaller learning rate played a dual role, facilitating the distillation of domain knowledge through DAPT while maintaining a balance that did not veer too far from the base model, thus preserving general natural language capabilities.

We also explored the application of Parameter Efficient Fine-Tuning (PEFT) in the context of Domain-Adaptive Pretraining (DAPT). In this pursuit, we conducted two experiments involving the incorporation of LoRA adapters [16], introducing additional parameters of 26.4 million (small) and 211.2 million (large), respectively. In both instances, our findings revealed a significant accuracy gap on in-domain tasks when compared to the full-parameter DAPT approach. Furthermore, when contrasting the outcomes between small and large PEFT models, we observed a marginal enhancement in in-domain task accuracy, with large models exhibiting a slight improvement. We posit that this phenomenon may be attributed to the necessity of training a large number of parameters in order to accommodate a substantial volume of information, and to the susceptibility of PEFT models to catastrophic forgetting [33].

D. Training Cost

All models have undergone training using 128 A100 GPUs. We estimate the costs associated with domain-adaptive pretraining for ChipNeMo as illustrated in Table IV. It is worth noting that DAPT accounts for less than 1.5% of the overall cost of pretraining a foundational model from scratch.

Model Size | Pretraining | DAPT | SFT
7B | 184,320 | 2,620 | 90
13B | 368,640 | 4,940 | 160
70B | 1,720,320 | - | -
TABLE IV: Training cost of LLaMA2 models in GPU hours. Pretraining cost from [5].

E. RAG and Engineering Assistant Chatbot

We created a benchmark to evaluate the performance of design chat assistance, which uses the RAG method. This benchmark includes 88 questions in three categories: architecture/design/verification specifications (Specs), testbench and regression documentation (Testbench), and build infrastructure documentation (Build). For each question, we specify the golden answer as well as the paragraphs in the design documents that contain the relevant knowledge for the answer. These questions were created manually by designers based on a set of design documents serving as the data store for retrieval. It includes about 1.8K documents, which were segmented into 67K passages, each about 512 characters long.

First, we compare our domain-adapted retrieval model with Sentence Transformer [34] and e5 small unsupervised [30] on each category. Each model fetches its top 8 passages from the data store.

Fig. 7: Retrieval Model Accuracy Comparison

As shown in Figure 7, our domain-adapted model performed 2x better than the original e5 small unsupervised model and 30% better than the sentence transformer.
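As an illustration, the hit-rate metric underlying this comparison can be computed as sketched below, where a question counts as a hit if any of its top-8 retrieved passages matches one of its annotated gold passages (field names are assumptions):

# Illustrative top-k retrieval hit-rate computation over the 88-question benchmark.
def hit_rate(retrieve, benchmark, top_k=8):
    hits = 0
    for item in benchmark:                                  # item: {"question": ..., "gold_passage_ids": ...}
        retrieved = retrieve(item["question"], top_k)       # returns passage ids
        hits += any(p in item["gold_passage_ids"] for p in retrieved)
    return hits / len(benchmark)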
The queries in the Specs category are derived directly from passages in the documents, so their answers are often nicely contained in a concise passage and clearly address the query. On the other hand, the queries of the Testbench and Build categories are not directly derived from passages, so their answers were often not as apparent in the fetched passages and required more context (see Appendix C for detailed examples). This significantly contributes to the difference in retrieval quality between the categories.

We conducted an evaluation of multiple ChipNeMo models and LLaMA2 models with and without RAG. The results were then scored by human evaluators on a 10-point scale and are shown in Figure 8.

Fig. 8: Human Evaluation of Different Models. Model Only represents results without RAG. RAG (Hit)/(Miss) only include questions whose retrieved passages hit/miss their ideal context; RAG (Total) includes all questions.

We made the following observations:
• RAG significantly boosts human scores. RAG improves the scores of LLaMA2-13B-Chat*, ChipNeMo-13B-Chat, and LLaMA2-70B-Chat by 3.82, 2.19, and 5.05, respectively. Note that scores are generally higher even with a RAG miss, particularly on LLaMA2 models. We hypothesize that the additional in-domain context helps to boost the performance.
• ChipNeMo-13B-Chat outperforms the similarly sized LLaMA2-13B-Chat* in model-only and RAG evaluations by 2.88 and 1.25, respectively.
• ChipNeMo-13B-Chat with RAG achieves the same score (7.4) as the 5x larger model LLaMA2-70B-Chat with RAG, where LLaMA2-70B-Chat does better in extracting answers on hits; however, domain adaptation makes up for it on the misses.
• Domain SFT helps improve the performance of ChipNeMo-13B-Chat by 0.28 (with RAG) and 0.33 (without RAG).

The complete evaluation results on all models are shown in Appendix D.

F. EDA Script Generation

In order to evaluate our model on the EDA script generation task, we created two different types of benchmarks. The first is a set of "Easy" and "Medium" difficulty tasks (1-4 line solutions) that can be evaluated without human intervention by comparing with a golden response. Due to the work required to build and evaluate these benchmarks, we only have this evaluation set for our Python task. The second set of tasks ("Hard") come from real use case scenarios that our engineers chose. These tasks are much harder, requiring tens of lines to solve. Because these are hard to evaluate in an automatic way, we had human engineers judge correctness between 0% and 100%. The size of these benchmarks is described in Table V. Work is ongoing to increase both the size and the scope of these benchmarks to allow us to further improve these models.

Evaluation Benchmark Name | Size
Tool1 (Python) - Automatic (Easy) | 150
Tool1 (Python) - Automatic (Medium) | 30
Tool1 (Python) - Human (Hard with Context) | 10
Tool2 (Tcl) - Human (Hard with Context) | 10
TABLE V: EDA Script Generation Evaluation Benchmarks

We discovered that our models were unable to answer some of our harder tasks. The tasks required knowledge of many tool APIs, and the model seemed to be unable to decide on the proper ones while keeping the control flow properly organized. To mitigate this, we appended a human-curated context to the prompt, specific to each question. This context contained explanations of the different functions or attributes needed to properly write the desired script. We only provided this for the "Hard with Context" benchmark category. This also allows us to study the possible effect of a retrieval-based solution, which we leave to future work.

As can be seen in the ablation results in Figure 9, both DAPT and domain SFT were important for our problem. Without DAPT, the model had little to no understanding of the underlying APIs and performed poorly on the automatically evaluated benchmarks. Domain SFT further improved the results. We believe this is because our domain SFT data helps guide the model to present the final script in the most directly applicable fashion.

One interesting result is the LLaMA2-70B pass rate on the "Hard with Context" benchmarks. It performs better than most models on the Python tool but poorly on the Tcl tool. This is likely because, when provided with the correct context, LLaMA2-70B's superior general Python coding ability is able to solve novel problems it has not been trained on. However, the LLaMA2-70B model is unable to generalize its coding ability to the Tcl tool, likely because it has not been exposed to a large volume of Tcl code. This highlights the benefit of DAPT when it comes to low-volume or proprietary programming languages.

Fig. 9: EDA Script Generation Evaluation Results

G. Bug Summarization and Analysis

To evaluate our models on bug summarization and analysis, we have a hold-out set of 40 bugs which are ideal candidates for summarization. This includes having a long comment history or other data which makes the bugs hard for a human to quickly summarize. We then ask humans to rate both modes of summarization as well as the bug assignment the LLM suggests. The evaluation metric is based on a 7-point Likert scale. Our results are included in Figure 10.

Fig. 10: Bug Summarization and Analysis Evaluation Results

ChipNeMo-13B-Chat models outperform the base LLaMA2-13B-Chat* model on all three tasks, improving the 7-point Likert score by 0.82, 1.09, and 0.61 for technical summary, managerial summary, and assignment recommendation, respectively. Domain SFT also significantly improves performance over models without domain SFT on managerial summarization and task assignment.

We hypothesize that, in contrast to the technical summarization task, whose quality and technical content depend more on the model's understanding of natural language semantics, the managerial summary requires the model to understand how to summarize the input data while retaining key personnel/engineer names. This needs more careful instruction-based fine-tuning of the LLM.

The LLaMA2-70B-Chat model also performs very well on all three tasks, beating the ChipNeMo-13B model across all tasks. Note that the LLaMA2-70B-Chat model also suffers from long-context challenges with a 4096 context size; we believe effective chunk-and-combine schemes (hierarchical and incremental), the choice of instructional prompts at various stages of summarization, the choice of prompt during task assignment, and raw data formatting/pre-processing help in circumventing the long-context challenge and enable LLaMA2-70B-Chat to achieve high scores even without DAPT and domain SFT.
VI. DISCUSSION

A. Considerations for Domain Adaptation

Although domain-adapted ChipNeMo models achieve significant improvements over their corresponding foundation models, we also observe that the larger LLaMA2 70B can sometimes achieve accuracy similar to ChipNeMo, as seen in Figures 8, 9, and 10. Recent work has leveraged these powerful models to perform chip design tasks.

However, it is important to consider the cost-efficiency benefits gained from the use of a smaller model. Pope et al. demonstrate that inference costs on an 8B model are 8-12x lower than on a 62B model for equal latency targets [35]. Furthermore, model size reduction can lead to dramatic increases in inference speed by allowing a model to fit within a single GPU or node where it otherwise could not [36]. Our ChipNeMo 13B model can be loaded within the memory of a single A100 GPU without any quantization, unlike the LLaMA2 70B model. This leads to significant inference speed increases under normal GPU operation, which can be traded off for significant inference cost reduction should the GPU be underclocked.

Thus, when deciding between the use of a larger general-purpose model versus a smaller specialized model in a production environment, the following criteria must be considered:
• Training and inference trade-off: Smaller domain-adapted models can match the accuracy of larger general-purpose models. While domain adaptation incurs additional up-front costs, the use of smaller models leads to significantly reduced operating costs.
• Uniqueness of use case: As can be seen from Figures 6, 9, and 10, domain-adapted models show the most improvement on tasks that are rarely present in the public domain, such as writing code in proprietary languages or libraries. Indeed, our data shows that even when they are provided with hand-picked contexts, large general-purpose models have difficulty matching the accuracy of domain-adapted models in such scenarios.
• Availability of domain data: Domain adaptation works best when there is a large amount of training data, i.e. billions of training tokens. This is often the case for large corporations and projects which have accumulated a large amount of internal documents and code, but not necessarily true for smaller businesses or projects.
• End use case diversity: It is possible to fine-tune a general-purpose model for a particular task, but domain-adapted models are suited for a diverse set of tasks in a domain. Although we only demonstrate three use cases for ChipNeMo models in this work, they can be readily re-used for other use cases with sufficient SFT data.

B. Performance Gap

Although ChipNeMo achieves impressive results in our selected applications, as shown in Appendix E, the evaluation results for all applications still show a considerable gap with respect to human expert performance. We are considering the following approaches to bridge this performance gap:

1) Data Collection: We can expand the DAPT dataset to include more internal proprietary data. In addition, we plan to add more task-specific instruction sets for SFT, as the evidence shows that task-specific SFT improves the evaluation results meaningfully.

2) Base Model: We expect better and larger base models to improve performance, such as LLaMA2 70B. We can also explore applying DAPT to code-specific base models such as Code LLaMA [32] for code generation tasks.

3) Training: We also plan to conduct reinforcement learning from human feedback (RLHF) [37] over the ChipNeMo chat model to make it more versatile. We plan to leverage pretrained reward models trained over general-purpose datasets. We also plan to conduct long-context training [38] to overcome the challenge where long context is needed, e.g. in the bug summarization application. In general, longer context support would help improve retrieval-based methods for chat assistance as well as code generation.

4) Retrieval: We will further investigate better RAG methods for both the engineering assistant chatbot and EDA script generation. For the engineering assistant chatbot, we can create different data stores for different application areas. We can also integrate enterprise search engines with RAG to find relevant context for a diverse set of problems. For code generation, we can investigate automated retrieval of context from existing code and documentation.

C. Agent-Based Design Methodologies

The use cases we experimented with in this work are straightforward applications of the prompt-and-response capability of LLMs. Agents refer to the use of an LLM to choose a sequence of actions to take, where the LLM acts as a reasoning engine to drive outside tools. Chip design processes involve many existing EDA tools and methodologies. We believe some of these methodologies can be driven by agents powered by domain-adapted LLMs such as ChipNeMo models. We plan to work on agent-based design methodologies for verification and optimization in the future.
VII. RELATED WORKS

Many domains have a significant amount of proprietary data which can be used to train a domain-specific LLM. One approach is to train a domain-specific foundation model from scratch, e.g., BloombergGPT [10] for finance, BioMedLLM [11] for biomedicine, and Galactica [39] for science. These models were usually trained on more than 100B tokens of raw domain data. The second approach is domain-adaptive pretraining (DAPT) [14], which continues to train a pretrained foundation model on additional raw domain data. It shows a slight performance boost on domain-specific tasks in domains such as biomedical and computer science publications, news, and reviews. In one example, [40] continued-pretrained a foundation model on technical content datasets and achieved state-of-the-art performance on many quantitative reasoning tasks.

Retrieval Augmented Generation (RAG) helps ground the LLM to generate accurate information and to extract up-to-date information to improve knowledge-intensive NLP tasks [41]. It has been observed that smaller models with RAG can outperform larger models without RAG [42]. Retrieval methods include sparse retrieval methods such as TF-IDF or BM25 [43], which analyze word statistics and find matching documents with a high-dimensional sparse vector. Dense retrieval methods such as [44] [45] find matching documents in an embedding space generated by a retrieval model pretrained on a large corpus, with or without fine-tuning on a retrieval dataset. The retrieval model can be trained standalone [44] [45] [46] or jointly with language models [47] [42]. In addition, it has been shown that off-the-shelf general-purpose retrievers can improve a baseline language model significantly without further fine-tuning [48]. RAG has also been proposed to perform code generation tasks [49] by retrieving from coding documents.

Foundation models are completion models, which have limited chat and instruction-following capabilities. Therefore, a model alignment process is applied to the foundation models to train a corresponding chat model. Instruction fine-tuning [20] and reinforcement learning from human feedback (RLHF) [37] are two common model alignment techniques. Instruction fine-tuning further trains a foundation model using instruction datasets. RLHF leverages human feedback to label a dataset, trains a reward model, and applies reinforcement learning to further improve models given the trained reward model. RLHF is usually more complex and resource-hungry than instruction fine-tuning. Therefore, recent studies also propose to reduce this overhead with simpler methods such as DPO [50] and SteerLM [51].

Researchers have started to apply LLMs to chip design problems. Early works such as Dave [52] first explored the possibility of generating Verilog from English with a language model (GPT-2). Following that work, [6] showed that open-source LLMs (CodeGen) fine-tuned on Verilog datasets collected from GitHub and Verilog textbooks outperformed state-of-the-art OpenAI models such as code-davinci-002 on 17 Verilog questions. [12] proposed a benchmark with more than 150 problems and demonstrated that the Verilog code generation capability of pretrained language models could be improved with supervised fine-tuning by bootstrapping with LLM-generated synthetic problem-code pairs. Chip-Chat [7] experimented with conversational flows to design and verify an 8-bit accumulator-based microprocessor with GPT-4 and GPT-3.5. Their findings showed that although GPT-4 produced relatively high-quality code, it still does not perform well enough at understanding and fixing errors. ChatEDA [8] proposed to use LLMs to generate EDA tool scripts, and also demonstrated that a fine-tuned LLaMA2 70B model outperforms the GPT-4 model on this task.

VIII. CONCLUSIONS

We explored domain-adapted approaches to improve LLM performance for industrial chip design tasks. Our results show that domain-adaptively pretrained models, such as ChipNeMo-13B-Chat, achieve similar or better results than their base models, closing the gap with the much more powerful LLaMA2 70B model on all three use cases: engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our future work will focus on further improving ChipNeMo models and methods to make them ready for production use.

IX. ACKNOWLEDGEMENTS

The authors would like to thank: NVIDIA IT teams for their support on NVBugs integration; the NVIDIA Hardware Security team for their support on security issues; NVIDIA NeMo teams for their support and guidance on training and inference of ChipNeMo models; NVIDIA Infrastructure teams for supporting the GPU training and inference resources for the project; and NVIDIA Hardware design teams for their support and insight.

X. CONTRIBUTIONS

Mingjie Liu conducted DAPT and SFT model training.
Teo Ene and Robert Kirby developed the inference and application evaluation infrastructure.
Chris Cheng developed the RAG framework.
Nathaniel Pinckney collected and prepared data sets for training.
Rongjian Liang developed the custom tokenizers.
Siddhanth Dhodhi, Ismet Bayraktaroglu, Himyanshu Anand, and Eric Hill designed the engineering assistant chatbot, provided domain instruction datasets and evaluation benchmarks, and conducted evaluation.
Parikshit Deshpande, Zhengjiang Shao, Kaizhe Xu, Jiashang Hu, Laura Dang, Xiaowei Li, Hao Liu, and Ambar Sarkar developed the engineering assistant chatbot application.
Sreedhar Pratty, Kishor Kunal, Varun Tej, Sumit Jain, Sujeet Omar, Pratik P Suthar, and Hanfei Sun developed the EDA script generation application and provided domain instruction datasets and evaluation benchmarks.
Bonita Bhaskaran, Arjun Chaudhuri, and Sanmitra Banerjee developed the bug summarization and analysis application and provided domain instruction datasets and evaluation benchmarks.
Brucek Khailany, Stuart Oberman, Sharon Clay, Sameer Halepete, Bryan Catanzaro, Jonah Alben, and Bill Dally advised from AI research and hardware engineering perspectives.
Haoxing Ren designed and led the research.
[1] B. Khailany et al., “Accelerating chip design with machine learning,”
relatively high-quality codes, it still does not perform well IEEE Micro, vol. 40, no. 6, pp. 23–32, 2020.
enough at understanding and fixing the errors. ChipEDA [8] [2] H. Ren and M. Fojtik, “Invited- nvcell: Standard cell layout in advanced
proposed to use LLMs to generate EDA tools scripts. It also technology nodes with reinforcement learning,” in 2021 58th ACM/IEEE
Design Automation Conference (DAC), 2021.
demonstrated that fine-tuned LLaMA2 70B model outperforms [3] R. Roy et al., “PrefixRL: Optimization of parallel prefix circuits
GPT-4 model on this task. using deep reinforcement learning,” in 2021 58th ACM/IEEE Design
Automation Conference (DAC), 2021.
[4] W.-L. Chiang et al., “Vicuna: An open-source chatbot impressing
VIII. C ONCLUSIONS gpt-4 with 90%* chatgpt quality,” March 2023. [Online]. Available:
https://lmsys.org/blog/2023-03-30-vicuna/
We explored domain-adapted approaches to improve LLM [5] H. Touvron et al., “Llama 2: Open foundation and fine-tuned chat models,”
performance for industrial chip design tasks. Our results show 2023.
that domain-adaptive pretrained models, such as ChipNeMo- [6] S. Thakur et al., “Benchmarking large language models for automated
verilog rtl code generation,” in 2023 Design, Automation & Test in
13B-Chat, achieve similar or better results than their base Europe Conference & Exhibition (DATE), 2023, pp. 1–6.
models. Closing the gap with much more powerful LLaMA2 [7] J. Blocklove et al., “Chip-chat: Challenges and opportunities in conver-
70B model on all three use cases: engineering assistant chatbot, sational hardware design,” 2023.
[8] Z. He et al., “Chateda: A large language model powered autonomous
EDA scripts generation, and bug summarization and analysis. agent for eda,” 2023.
Our future work will focus on further improving ChipNeMo [9] S. Bubeck et al., “Sparks of artificial general intelligence: Early
models and methods to make them ready for production use. experiments with gpt-4,” 2023.
[10] S. Wu et al., “Bloomberggpt: A large language model for finance,” 2023.
[11] M. LLC. (2022) Biomedlm: a domain-specific large
IX. ACKNOWLEDGEMENTS language model for biomedical text. [Online]. Available:
https://www.mosaicml.com/blog/introducing-pubmed-gpt
The authors would like to thank: NVIDIA IT teams for [12] M. Liu et al., “VerilogEval: evaluating large language models for verilog
their support on NVBugs integration; NVIDIA Hardware code generation,” in 2023 IEEE/ACM International Conference on
Computer-Aided Design (ICCAD), 2023.
Security team for their support on security issues; NVIDIA [13] E. Nijkamp et al., “Codegen: An open large language model for code
NeMo teams for their support and guidance on training and with multi-turn program synthesis,” ICLR, 2023.
[14] S. Gururangan et al., “Don’t stop pretraining: Adapt language models to domains and tasks,” 2020.
[15] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” 2021.
[16] E. J. Hu et al., “Lora: Low-rank adaptation of large language models,” CoRR, vol. abs/2106.09685, 2021. [Online]. Available: https://arxiv.org/abs/2106.09685
[17] L. Gao et al., “The pile: An 800gb dataset of diverse text for language modeling.”
[18] D. Kocetkov et al., “The stack: 3 tb of permissively licensed source code,” 2022.
[19] A. Köpf et al., “Openassistant conversations – democratizing large language model alignment,” 2023.
[20] J. Wei et al., “Finetuned language models are zero-shot learners,” 2022.
[21] V. Sanh et al., “Multitask prompted training enables zero-shot task generalization,” 2022.
[22] D. Hendrycks et al., “Measuring massive multitask language understanding,” 2021.
[23] M. Chen et al., “Evaluating large language models trained on code,” 2021.
[24] F. Koto, J. H. Lau, and T. Baldwin, “IndoBERTweet: A pretrained language model for Indonesian Twitter with effective domain-specific vocabulary initialization,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov. 2021, pp. 10660–10668.
[25] O. Kuchaiev et al., “Nemo: a toolkit for building ai applications using neural modules,” 2019.
[26] M. Shoeybi et al., “Megatron-lm: Training multi-billion parameter language models using model parallelism,” arXiv preprint arXiv:1909.08053, 2019.
[27] T. Dao et al., “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Advances in Neural Information Processing Systems, 2022.
[28] A. Chowdhery et al., “Palm: Scaling language modeling with pathways,” 2022.
[29] Z. Ji et al., “Survey of hallucination in natural language generation,” ACM Comput. Surv., vol. 55, no. 12, Mar. 2023. [Online]. Available: https://doi.org/10.1145/3571730
[30] L. Wang et al., “Text embeddings by weakly-supervised contrastive pre-training,” arXiv preprint arXiv:2212.03533, 2022.
[31] L. Gao et al., “Tevatron: An efficient and flexible toolkit for dense retrieval,” 2022.
[32] B. Rozière et al., “Code llama: Open foundation models for code,” 2023.
[33] J. Kirkpatrick et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the National Academy of Sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
[34] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2019. [Online]. Available: http://arxiv.org/abs/1908.10084
[35] R. Pope et al., “Efficiently scaling transformer inference,” 2022.
[36] R. Y. Aminabadi et al., “Deepspeed inference: Enabling efficient inference of transformer models at unprecedented scale,” 2022.
[37] L. Ouyang et al., “Training language models to follow instructions with human feedback,” 2022.
[38] W. Xiong et al., “Effective long-context scaling of foundation models,” 2023.
[39] R. Taylor et al., “Galactica: A large language model for science,” 2022.
[40] A. Lewkowycz et al., “Solving quantitative reasoning problems with language models,” 2022.
[41] P. Lewis et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” 2021.
[42] S. Borgeaud et al., “Improving language models by retrieving from trillions of tokens,” 2022.
[43] S. Robertson and H. Zaragoza, “The probabilistic relevance framework: Bm25 and beyond,” Found. Trends Inf. Retr., vol. 3, no. 4, pp. 333–389, Apr. 2009. [Online]. Available: https://doi.org/10.1561/1500000019
[44] V. Karpukhin et al., “Dense passage retrieval for open-domain question answering,” 2020.
[45] G. Izacard et al., “Unsupervised dense information retrieval with contrastive learning,” 2022.
[46] W. Shi et al., “Replug: Retrieval-augmented black-box language models,” 2023.
[47] G. Izacard et al., “Few-shot Learning with Retrieval Augmented Language Models,” 2022. [Online]. Available: http://arxiv.org/abs/2208.03299
[48] O. Ram et al., “In-context retrieval-augmented language models,” 2023.
[49] S. Zhou et al., “Docprompting: Generating code by retrieving the docs,” 2023.
[50] R. Rafailov et al., “Direct preference optimization: Your language model is secretly a reward model,” 2023.
[51] Y. Dong et al., “Steerlm: Attribute conditioned sft as an (user-steerable) alternative to rlhf,” 2023.
[52] H. Pearce, B. Tan, and R. Karri, “Dave: Deriving automatically verilog from english,” in Proceedings of the 2020 ACM/IEEE Workshop on Machine Learning for CAD, ser. MLCAD ’20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 27–32. [Online]. Available: https://doi.org/10.1145/3380446.3430634
[53] “Beautiful Soup,” https://www.crummy.com/software/BeautifulSoup/, accessed: 10 Oct 2023.
[54] K. Sakaguchi et al., “Winogrande: An adversarial winograd schema challenge at scale,” arXiv preprint arXiv:1907.10641, 2019.
[55] R. Zellers et al., “Hellaswag: Can a machine really finish your sentence?” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019.
[56] P. Clark et al., “Think you have solved question answering? try arc, the ai2 reasoning challenge,” 2018.
[57] G. Lai et al., “Race: Large-scale reading comprehension dataset from examinations,” 2017.

APPENDIX

A. Data Collection Process

Collection was implemented with a set of shell and Python scripts, designed to identify relevant design data and documentation, convert them to plain text if applicable, filter them using basic quality metrics, compute a checksum for precise file deduplication, and compress them for storage. The collection flow did not use off-the-shelf LLM-specific scraping and collection scripts, as we aimed to minimize space requirements through in-situ data collection of internal data sources (both networked file systems and internal web applications). For file system-based collection, data was kept in place while being filtered for quality, instead of storing additional sets of raw data locally.

The design and verification data collection encompassed a variety of source files, including Verilog and VHDL (RTL and netlists), C++, Spice, Tcl, various scripting languages, and build-related configuration files. Data from internal web services were gathered through both REST API calls and conventional crawling, with HTML formatting being removed using the open-source BeautifulSoup [53] Python library in both instances to minimize inadvertent removal of coding examples, at the cost of introducing more boilerplate navigation bars and other HTML page elements. Our data collection flow supported conventional documentation formats, including .docx, .pptx, and .pdf, using readily available Python conversion libraries and open-source tools.

As most internal data is believed to be of high quality, minimal filtering was applied: line count filtering was used to ensure that exceedingly large or small files were excluded, and files were sorted into broad categories of manually written versus tool-generated.
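To make the flow concrete, the following is a minimal sketch of such a collection pass for a file-system source. The extension whitelist, line-count thresholds, and helper names are illustrative placeholders rather than the production settings; the real flow additionally compresses the converted text for storage.

import hashlib
import os
from bs4 import BeautifulSoup  # open-source HTML-to-text conversion [53]

# Hypothetical extension whitelist; the real flow covered many more formats.
TEXT_EXTS = {".v", ".vhd", ".cpp", ".tcl", ".py", ".txt", ".html", ".htm"}
MIN_LINES, MAX_LINES = 5, 50_000  # illustrative line-count quality filter

def html_to_text(html: str) -> str:
    # Keep as much inline code as possible; extra boilerplate is tolerated.
    return BeautifulSoup(html, "html.parser").get_text("\n")

def collect(root: str):
    seen_checksums = set()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.splitext(name)[1].lower() not in TEXT_EXTS:
                continue
            with open(path, "rb") as f:
                raw = f.read()
            digest = hashlib.sha256(raw).hexdigest()  # checksum for exact-duplicate removal
            if digest in seen_checksums:
                continue
            seen_checksums.add(digest)
            text = raw.decode("utf-8", errors="ignore")
            if path.endswith((".html", ".htm")):
                text = html_to_text(text)
            n_lines = text.count("\n") + 1
            if MIN_LINES <= n_lines <= MAX_LINES:  # basic quality metric
                yield path, text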
B. Domain Adaptive Pretraining (DAPT)

In this section we present detailed results on our domain adaptive pretrained models. We also detail our ablation experiments on domain adaptive pretraining.

DAPT Hyperparameters: Details are presented in Table VI.

Hyperparameters       Value
Context Window        4096
Global Batch Size     256 (128)
Optimizer             distributed fused adam
Weight Decay          0.01
Betas                 0.9, 0.95 (0.9, 0.98)
Learning Rate         5 · 10^-6
Scheduler             None

TABLE VI: DAPT and SFT hyperparameters. SFT values are shown in parentheses where they differ from DAPT.

Auto Eval Results: We present detailed results on auto evaluation benchmarks in Table VII and Table VIII. For simplicity, in the remainder of this section we present aggregated benchmark results for the ablation studies:
• Chip: We report average results on the in-domain Design, Scripting, Bugs, and Circuits benchmarks from Table III (5-shot).
• MMLU: We report the overall results on MMLU (5-shot) [22], a popular aggregated benchmark on a wide variety of subjects.
• Reasoning: We report average results on popular public benchmarks for common sense reasoning (0-shot), including Winogrande [54], hellaswag [55], ARC-easy [56], and RACE-High [57].
• Code: We report the average pass rate of coding benchmarks with greedy decoding, including HumanEval [23], VerilogEval-Machine [12], and VerilogEval-Human [12].
Tokenizer Augmentation: We experimented with DAPT using the original LLaMA2 tokenizer and the augmented tokenizer as described in Section III-A. Figure 11 depicts the smoothed training loss for ChipNeMo with the original, unmodified tokenizer. When compared with Figure 2, we observe that an augmented tokenizer has larger training loss upon initialization, due to added tokens never being observed during foundation model pretraining. Similar training loss is achieved for DAPT with 1 epoch.

Fig. 11: Smoothed Training Loss with Original LLaMA2 Tokenizer.

Table IX presents aggregated auto evaluation benchmark results. We note that careful tokenizer augmentation and weight initialization only slightly impact model performance on general academic benchmarks. DAPT significantly improved domain benchmarks with either tokenizer, including Verilog coding (no major difference in HumanEval). We conclude that augmenting the tokenizer comes with the benefit of improved tokenizer and training efficiency with no degradation of the model's general language and domain capabilities.
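As an illustration of tokenizer augmentation with averaged-embedding initialization (a common recipe, e.g. [24]), the sketch below uses the Hugging Face APIs. It is not the exact ChipNeMo procedure from Section III-A; the model name and token list are placeholders.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def augment_tokenizer(model_name: str, domain_tokens: list[str]):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    old_tok = AutoTokenizer.from_pretrained(model_name)  # keep the original tokenizer for initialization

    tok.add_tokens(domain_tokens)            # append new domain tokens
    model.resize_token_embeddings(len(tok))  # grow the embedding matrix

    emb = model.get_input_embeddings().weight.data
    with torch.no_grad():
        for t in domain_tokens:
            new_id = tok.convert_tokens_to_ids(t)
            old_ids = old_tok(t, add_special_tokens=False)["input_ids"]
            if old_ids:
                # Initialize the new row as the mean of the sub-token embeddings it replaces.
                emb[new_id] = emb[old_ids].mean(dim=0)
    return tok, model

The averaging initialization is intended to keep the augmented model's initial behavior close to the base model, rather than starting the new rows from random values.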
Public Datasets Mix-in: As introduced in Section II-A, we included public data in DAPT, sampled from commonly-used public datasets for foundation model pretraining. We primarily hoped that mixing in public data such as Wikipedia in DAPT could help “correct” disturbances brought by tokenizer augmentation and improve the general natural language capabilities of the models. We conducted another round of DAPT with tokenizer augmentation using only the domain data, training for the same number of steps, equating to roughly 1.1 epochs of the data. We found that public data mix-in slightly improves results. We present detailed results in Table X.

Learning Rate: We experimented with employing a larger learning rate, inspired by the approach used in CodeLLaMA [32]. We use similar training hyperparameters, shown in Table XI. We use a cosine schedule with 200 warm-up steps, and set the final learning rate to be 1/30th of the peak learning rate of 3 · 10^-4. We use the same batch size and number of training steps as DAPT.

Figure 12 shows the training loss for ChipNeMo-7B with augmented tokenizers including public dataset mix-in. We observed large spikes in training loss at the initial training steps, with the final training loss for the 7B model even being better than that of the 13B model trained with the original DAPT hyperparameters. However, we note substantial degradation across natural language benchmarks, as shown in Table XII, including on in-domain chip design. Coding capabilities improved, consistent with the findings of [32].

Fig. 12: Smoothed Training Loss with Larger Learning Rate. We include loss curves of suggested hyperparameters for comparison.

We highlight that our case differs from that in [32]. Although we also conduct “continued pretraining” initializing from pretrained checkpoints, we preferably want the model to maintain high degrees of performance on general capabilities, while distilling domain dataset information and knowledge (unseen in model pretraining) into model weights. In contrast, [32] use publicly available code data that predominantly lacks natural language elements, emphasizing their primary focus on coding-related tasks. We hypothesize that a smaller learning rate played a dual role for domain adaptation, facilitating the distillation of domain knowledge through DAPT while maintaining a balance that did not veer too far from the base model, thus preserving general natural language capabilities while significantly improving performance on in-domain tasks.
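For reference, the larger-learning-rate schedule of Table XI (200 warm-up steps up to a 3 · 10^-4 peak, cosine annealing down to 1 · 10^-5) can be sketched in plain PyTorch as follows. The total step count is a placeholder, and the actual runs use NeMo's built-in scheduler rather than this helper.

import math
import torch

def cosine_with_warmup(optimizer, warmup_steps=200, total_steps=10_000,
                       peak_lr=3e-4, min_lr=1e-5):
    # Returns a per-step multiplier on peak_lr (build the optimizer with lr=peak_lr).
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)          # linear warm-up
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr  # anneal to min_lr
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)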
Model         Design  Scripting  Bugs  Circuits  MMLU [22]  Winogrande [54]  hellaswag [55]  ARC-e [56]  RACE-H [57]
LLaMA2-7B     41.1    42.0       42.2  47.9      45.7       68.9             75.6            73.5        46.2
ChipNeMo-7B   57.5    49.3       42.8  49.5      44.6       67.4             76.3            73.7        46.2
LLaMA2-13B    43.6    49.6       39.7  55.5      55.4       72.1             79.3            76.3        46.7
ChipNeMo-13B  67.9    56.3       50.1  56.8      53.4       71.1             80.3            76.7        46.1

TABLE VII: Auto Evaluation Results. We report academic benchmark results for LLaMA2 using proprietary evaluation methods. ChipNeMo models trained with tokenizer augmentation.

Model         HumanEval [23]  VerilogEval-Human [12]  VerilogEval-Machine [12]
LLaMA2-7B     14.0            3.8                     24.5
ChipNeMo-7B   12.2            8.3                     28.7
LLaMA2-13B    17.1            9.0                     30.8
ChipNeMo-13B  17.7            22.4                    43.4

TABLE VIII: Coding Evaluation Results. Showing pass-rate with greedy decoding. We report results for LLaMA2 using proprietary evaluation methods. ChipNeMo models trained with tokenizer augmentation.

Model  Tokenizer  DAPT  Chip  MMLU  Reason  Code
7B     Ori.       No    43.4  45.7  66.1    14.1
7B     Aug.       No    42.7  44.6  65.9    13.9
7B     Ori.       Yes   51.2  44.8  65.7    17.6
7B     Aug.       Yes   49.8  44.6  65.8    16.4
13B    Ori.       No    47.1  55.4  68.6    18.9
13B    Aug.       No    46.0  55.1  68.6    18.4
13B    Ori.       Yes   57.7  54.0  68.4    27.2
13B    Aug.       Yes   57.8  53.4  68.5    27.8

TABLE IX: Evaluation Results on ChipNeMo models with Different Tokenizers. Aug. indicates the augmented tokenizer and Ori. indicates the original LLaMA2 tokenizer. Using the augmented tokenizer without DAPT corresponds to the model initialization as in Section III-A.

Public  Chip  MMLU  Reason  Code
No      56.9  53.0  67.5    24.1
Yes     57.8  53.4  68.5    27.8

TABLE X: Ablation on Public Dataset Mix-in with ChipNeMo-13B. Public data mix-in slightly improves results.

Hyperparameters      Value
Context Window       4096
Global Batch Size    256
Optimizer            distributed fused adam
Weight Decay         0.01
Betas                0.9, 0.95
Learning Rate (lr)   3 · 10^-4
Scheduler            CosineAnnealing
Warmup Steps         200
min lr               1 · 10^-5

TABLE XI: Training Hyperparameters with Larger Learning Rate. We adopt parameters similar to [32].
Parameter Efficient Fine-Tuning (PEFT): Parameter efficient fine-tuning freezes the pre-trained model weights and injects trainable parameters in smaller adapter models for efficient fine-tuning of downstream tasks. We explore the use of PEFT in DAPT using Low-Rank Adaptation (LoRA) [16]. Since our transformer layer implementation fuses KQV into a single projection, we add LoRA adapters for a single low-rank projection for each self-attention layer in a combined fashion. We experiment on LLaMA2-13B models with the original LLaMA2 tokenizer, using the same DAPT training setup as in Table VI. We ran two experiments, introducing additional trainable parameters of 26.4 million (small) and 211.2 million (large), respectively.

Figure 13 shows the training loss curves of the LoRA models and compares them with full parameter training. For both LoRA models, the loss quickly converges and stops decreasing beyond a certain point. Table XIII reports the evaluation results on LoRA models. Both LoRA models significantly underperform full parameter training on in-domain chip design tasks. LoRA models improve on chip design tasks compared to their non-DAPT counterparts, with the larger model exhibiting slightly better (but not significant) results.

Based on the results, we hypothesize that the observed phenomenon can be attributed to the need for a sufficiently large number of trainable model parameters capable of accommodating the substantial volume of information for DAPT. Additionally, it indicates that PEFT models with limited trainable parameters are susceptible to the challenge of catastrophic forgetting [33].
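As an illustration of the combined-LoRA setup described above, the sketch below wraps a fused QKV projection with a single low-rank adapter. Hidden size, rank, and scaling are placeholders, and the real adapters are injected into the NeMo/Megatron attention implementation rather than a standalone module.

import torch
import torch.nn as nn

class LoRAFusedQKV(nn.Module):
    """Single low-rank adapter applied to a fused QKV linear projection."""

    def __init__(self, hidden: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.qkv = nn.Linear(hidden, 3 * hidden, bias=False)  # frozen base projection
        self.qkv.weight.requires_grad_(False)
        # One shared low-rank path for Q, K and V ("combined fashion").
        self.lora_a = nn.Linear(hidden, rank, bias=False)
        self.lora_b = nn.Linear(rank, 3 * hidden, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.qkv(x) + self.scale * self.lora_b(self.lora_a(x))

Only the lora_a and lora_b weights are trained; initializing lora_b to zero makes the adapter an identity mapping at the start of training.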
Learning Rate  Chip  MMLU  Reason  Code
5 · 10^-6      49.8  44.6  65.8    16.4
3 · 10^-4      25.5  26.6  49.8    18.1

TABLE XII: Ablation on Learning Rate with ChipNeMo-7B. A larger learning rate significantly degrades performance on all language-related tasks but slightly improves coding.

Fig. 13: Smoothed Training Loss of LoRA [16]. 13B corresponds to full parameter DAPT.

Parameters  Chip  MMLU  Reason  Code
None        47.1  55.4  68.6    18.9
26.4M       49.0  55.0  68.2    13.0
211.2M      49.6  54.2  68.6    15.3
13B         57.7  54.0  68.4    27.2

TABLE XIII: Evaluation Results on LoRA Models. The first column indicates the number of trainable parameters. None indicates the LLaMA2-13B model without DAPT. 13B indicates full parameter training.
C. Retrieval Model Training

Manually generating training samples is very effort intensive, so we elected to implement a process to generate them automatically. Since we are using contrastive learning to fine-tune our model, each sample requires a set of both positive passages and negative passages, particularly hard negatives, to maximize accuracy.

1) Dataset Sampling Procedure: Figure 14 describes the steps taken to generate a sample:
• Step 1: Randomly select a passage from the document corpus
• Step 2: Use a language model (Vicuna) to generate a valid query from the passage
• Step 3: Use a pre-existing retrieval model (sentence transformer) to fetch the top-N passages from the document corpus for the query, where each passage is a potential hard negative
• Step 4: It is possible that some of the fetched passages are actually positive, so use the same language model to filter out the positive passages
• Step 5: If there are not enough negative passages after this filtering process, supplement with random passages from the corpus

Fig. 14: Sample Generation For Retrieval Model Training

For our initial research we used Vicuna [4] and Sentence Transformer [34]; however, they can easily be replaced with LLaMA2 [5] and BM25 [43], respectively, to produce a retrieval model that is commercially viable.
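The five steps above can be sketched roughly as follows. Here generate_query and is_positive are hypothetical wrappers around the language model (Vicuna in our initial setup), and the retriever name is a stand-in for the pre-existing sentence-transformer model.

import random
from sentence_transformers import SentenceTransformer, util

def build_sample(corpus, generate_query, is_positive, top_n=16, num_negatives=8):
    """corpus: list of passages; generate_query/is_positive wrap an LLM (e.g. Vicuna)."""
    retriever = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in pre-existing retriever
    corpus_emb = retriever.encode(corpus, convert_to_tensor=True)  # re-encoded here for brevity

    passage = random.choice(corpus)                                     # Step 1
    query = generate_query(passage)                                     # Step 2
    query_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=top_n)[0]  # Step 3

    negatives = []
    for hit in hits:
        candidate = corpus[hit["corpus_id"]]
        if candidate != passage and not is_positive(query, candidate):  # Step 4
            negatives.append(candidate)
    while len(negatives) < num_negatives:                               # Step 5
        negatives.append(random.choice(corpus))

    return {"query": query, "positive": passage, "negatives": negatives[:num_negatives]}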
2) Hit Quality Comparison: Not all hits are created equal. The passage in the Spec example below clearly and completely answers its query. The passage in the Build example contains the answer; however, more context is required to answer the query.

Spec Example: Hit passage clearly answers the query.

Query:
What is the XXX buffer in the YYY used for?

Hit Passage:
Since the memory subsystem is partitioned, YYY may receive read responses whose order is different from their requests. ZZZ requires strict in-order per-thread processing, thus YYY reorders the read responses to the request order before sending them back to ZZZ. YYY re-orders the read responses in a read return reorder buffer (XXX) to guarantee the ordering.

Answer: re-orders the read responses.

Build Example: Additional information is required to fully answer the query. Such as: What is a DL? How do we know Arch-Build-Hotseat-XXX is a DL?

Query:
What is the support DL for XXX build issues?

Hit Passage:
Tree Setup
Working in the XXX Mainline explains initial tree setup and build steps
Build
Arch-Build-Hotseat-XXX - Hotseat support for XXX build issues
YYY build failures

D. Additional Evaluation Data

Table XIV shows the evaluation data for all models on the engineering assistant chatbot application. Table XV shows our evaluation results for all models on the EDA script generation task. Table XVI shows our evaluation results for all models on the bug summarization and analysis task.

Model                     Domain SFT  Hit   Miss  ALL
LLaMA2-13B-Chat*          No          2.13  2.80  2.33
ChipNeMo-13B-Chat         No          4.64  5.40  4.88
ChipNeMo-13B-Chat         Yes         4.66  6.44  5.21
LLaMA2-13B-Chat* + RAG    No          7.68  3.72  6.46
ChipNeMo-13B-Chat + RAG   No          7.86  5.48  7.12
ChipNeMo-13B-Chat + RAG   Yes         7.86  6.36  7.40
LLaMA2-70B-Chat           No          2.36  2.32  2.35
LLaMA2-70B-Chat + RAG     No          8.38  5.20  7.40

TABLE XIV: Engineering Assistant Chatbot Human Evaluation.
                            Tool1 (Python)    Tool1 (Python)      Tool1 (Python)             Tool2 (Tcl)
Model                       Automatic (Easy)  Automatic (Medium)  Human (Hard with Context)  Human (Hard with Context)
LLaMA2-13B-Chat*            7%                4%                  15%                        6%
ChipNeMo-13B-Chat (noDSFT)  26%               35%                 39%                        33%
ChipNeMo-13B-Chat           44%               39%                 41%                        36%
LLaMA2-70B-Chat             3%                4%                  49%                        18%

TABLE XV: EDA Script Generation Evaluation. Scored binary for Automatic Evaluation and 0-100% for Human Evaluation.

Model                       Technical Summary  Managerial Summary  Task Assignment
LLaMA2-13B-Chat*            3.79               3.15                3.00
ChipNeMo-13B-Chat (noDSFT)  4.61               3.45                3.43
ChipNeMo-13B-Chat           4.61               4.24                3.61
LLaMA2-70B-Chat             4.92               4.58                3.91

TABLE XVI: Bug Summarization and Analysis Evaluation. Likert scale 1-7.
E. Examples

1) Engineering Assistant Chatbot:

Question:
Write a System Verilog assert to check if signal A is true within 5 cycles of signal B being true

Answer:
Sure, here's an example System Verilog assert to check if signal A is true within 5 cycles of signal B being true:

// Define the signals
logic a, b;
// Define the assert
assert property (@(posedge clk) a |-> ##[1:5] b);

In this example, the assert property is used to check if signal a is true, then signal b must be true within 5 cycles. The assert will fail if b is not true within 5 cycles of a being true.

Question:
why do we add retiming stages?

Answer:
Retiming stages are added to improve the timing of the design and to account for the physical distance between different parts of the chip. By adding retiming stages, we can reduce the amount of time it takes for signals to travel between different parts of the chip, which can help to improve the overall performance of the design. Additionally, retiming stages can also help to reduce the amount of noise and interference that can occur on the signals, which can further improve the performance of the design.

Question:
How do I run multiple seeds of random tests using testgen framework?

Answer:
You can run multiple seeds of random tests using the Testgen framework by using the rand_tgen target. This target allows you to specify a variety of options for the stopping criteria of random runs using the same Testgen testlist(s).

nvmk -c . rand_tgen TESTLIST=my_testlist MON=1 TGEN.LOOP=50

This command will run 50 iterations of the my_testlist testlist, each with a different seed. You can specify the number of iterations and the seed mode using the TGEN.LOOP and SEED options, respectively.

2) EDA Script Generation: Some function names and commands are obfuscated.

Question:
write a code in TOOL1 to print the number of flop cells in the give rectangle (0,0,100,100)

Answer:
# Sure, here is a TOOL1 code to print the number of flop cells in the given rectangle (0,0,100,100):

# This code gets the top_ref of the design
top_ref = env().get_top_reference()
# This code gets the cells of the rectangle in the top_ref
cells = env().layout().cells_in_rectangle(top_ref, rectangle(0, 0, 100, 100), hierarchical)
# This code gets the cells that are flop cells
flop_cells = cells.type("is_flop_cell")
# This code prints the number of flop cells
print(flop_cells.size())

Question:
Write a TOOL2 tcl proc to find intersecting logic from pins '$pinA' to pin '$pinB'.

Answer:
#Sure, here is the code:
proc myfunc {pinA pinB} {
    # Get all the pins in the fanout of the given pin
    set fanout_pins [fanout -from $pinA]
    # Get all the pins in the fanout of the given pin
    set fanin_pins [fanin -to $pinB]
    # Get the intersection of the two collections
    set intersection [common_logic $fanout_pins $fanin_pins]
    # Return the intersection
    return $intersection
}
3) Bug Summary and Analysis: Usernames, chip names and paths are obfuscated.
Prompt:
BugId : 002
Synopsis : Request power/current measurement on fabrics using fabric test module
Module : Silicon - DFT
Description :
Hi all,
We want to evaluate silicon current (power) consumed by fabric macros in a chip. This will help us correlate this power to pre-Si power estimates.
Here are the conditions we want to look at for this data:
Fabric test module activity: 0/25/50/75/100%
Corners:
typical low voltage
typical high voltage
We want to isolate fabric current (power) as much as possible, so we want to enable only the logic needed for fabric test modules and the fabrics
themselves.
If this information can be broken down per module (there are 6 in the chip), that will help.
Severity : 5-Performance
Priority : 5-Normal
Comments :
#1 : Author : User A
Is this a requirement only for current or for future chips as well ?
#2 : Author : User B
There will likely be interest in future chips to perform this exercise as we push the boundaries of fabric power density for better area.
#3 : Author : User A
What is needed in this bug ?
@User B It is not clear.
Can you list the specific needs with timelines and priority
Technical Summary:
In NVIDIA’s internal bug management system, a bug has been filed regarding the power and current measurements of the fabrics in the latest chip.
The request is to evaluate the silicon current (power) consumed by the fabric macros in the latest chip and to correlate this power to pre-Si power
estimates. The conditions for the measurements are the fabric test module activity at 0/25/50/75/100% and the corners are typical low voltage and
typical high voltage. The data needs to be broken down per module (there are 6 in the latest chip). Only the logic needed for fabric test modules
and the fabrics themselves need to be enabled.
Managerial Summary:
User A asks User B if the need for evaluating the silicon current of fabric macros in a chip is only for that specific chip or for future ones as
well. User B replies that there will likely be interest in future chips to perform this exercise as they push the boundaries of fabric power density for
better area. User A then asks what is needed in this bug and asks for specific needs with timelines and priority.
Task Assignment:
User B