
Architecture and System Support for Transformer Models (ASSYST), ISCA, 2023

Full Stack Optimization of Transformer Inference


Sehoon Kim*1, Coleman Hooper*1, Thanakul Wattanawong1, Minwoo Kang1, Ruohan Yan1, Hasan Genc1,
Grace Dinh1, Qijing Huang2, Kurt Keutzer1, Michael W. Mahoney1,3,4, Yakun Sophia Shao1, Amir Gholami1,3
1 University of California, Berkeley   2 NVIDIA   3 ICSI   4 LBNL

Abstract—Recent advances in state-of-the-art neural network architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications in computer vision, natural language processing, and speech recognition. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design all the way to developing dedicated domain-specific accelerators. In this work, we pursue a full-stack approach to optimizing Transformer inference. We analyze the implications of the Transformer architecture on hardware, including the impact of nonlinear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, and we use this analysis to optimize a fixed Transformer architecture. We assess the challenges of finding the right mapping and scheduling of operations for Transformer models, and we pursue neural architecture search to further optimize the Transformer network. We find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7× end-to-end speedup with minimal performance degradation for Transformer inference. More details can be found in our full paper [27], which includes (1) a comprehensive analysis of Transformer workloads, (2) an extensive survey of current hardware and software solutions for efficient Transformer inference, and (3) case studies that quantify the advantages of co-design and co-optimization techniques across the stack for Transformer inference.

I. INTRODUCTION

Deep learning models have scaled up to billions of parameters and billions of multiply-accumulate operations during both training and inference. As a result, there has been growing interest in computing these models efficiently and in deploying these compute- and memory-intensive workloads on resource-constrained edge devices. These edge devices have tight energy and memory constraints, and the corresponding applications that leverage deep learning models also often have real-time latency constraints.

The demand for fast and efficient computation, coupled with the characteristics of deep learning workloads that involve a small set of distinct operations with substantial data reuse, has led to the use of hardware accelerators. A multitude of enterprise deep learning accelerators, such as [1], [3], [17], [23], [25], [28]–[30], [37], [44], [46], have been developed and integrated into commodity hardware by industry in the past decade. This parallels the many research accelerators developed in academia [7]–[10], [16], [18]–[20], [36]. Together with hardware accelerator development, the software frameworks [2], [5], [24], [34] and compilers [6], [32], [42] for deploying various deep learning algorithms have also matured. These tools enable the execution of deep learning algorithms on accelerators, and they perform mapping optimizations to improve the performance and efficiency of the full deep learning pipeline. Nonetheless, fast-evolving deep learning algorithms keep introducing new demands for hardware and software support, as well as for their co-optimization, in order to satisfy various deployment constraints.

The recent rise in popularity of Transformers and large language models [4], [12], [14], [15], [21], [38]–[41], [43], [45] for solving various natural language processing (NLP) tasks presents a brand new set of challenges in the design of accelerators as well as frameworks. There has been an increased focus on making Transformer inference more efficient, especially due to their growing size and run-time complexity. However, there is still a lack of understanding regarding the workload characteristics of Transformer architectures, and thus of the design principles necessary for effectively running these models, when compared to the more well-known convolutional neural network (CNN) architectures. For instance, compared to conventional CNN-focused designs, Transformers are mostly composed of matrix multiplications (matmuls) together with memory-intensive nonlinear operations. In addition, the computational graph and dataflow of Transformer models are more complex than those of CNNs, with more types of operation nodes, as well as more dataflow splits and concatenations. All of these challenges require a comprehensive analysis of the current hardware and software solutions, as well as of the various design trade-offs for Transformer inference.

Our analysis yielded several key findings:

• We adapt Gemmini [19], which was originally designed for CNN workloads, for Transformer inference. Without modifications, the primary bottleneck for running Transformers on CNN accelerators is the time spent on floating-point nonlinear operations. However, by adapting Gemmini to support an integer-only BERT variant [26] and tuning the memory configuration, we improve performance by 39.6×.

• Fusing BatchNorm with the neighboring convolution in CNNs is straightforward. However, the benefit of fusing operations in the Transformer architecture with the preceding matmuls depends on the particular operation, as fusion can impose constraints on the mapping, leading to runtime costs that outweigh the gains from operator fusion.

• We apply automated neural architecture search (NAS) to search for efficient and high-performance Transformer architectures on Gemmini-driven hardware. NAS finds an architecture that improves EDP by 10.6× with minimal degradation on the target benchmark. Combined with the hardware improvements, we achieve an 88.7× end-to-end speedup.

* Equal contribution. [email protected], [email protected]
Fig. 1: Map of the computations performed in (Top) the multi-head attention (MHA) module and (Bottom) the feed-forward network (FFN) module in the Transformer encoder block.
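For readers without the rendered figure, the short sketch below restates the matmul shapes in one encoder block using the notation of Fig. 1 (d: hidden dimension, l: sequence length, h: number of heads, dFFN: FFN dimension). It is our own illustration with BERT-Base-like values, not code from the paper.

    # Matmul shapes in one Transformer encoder block, following Fig. 1's notation:
    # d = hidden dim, l = sequence length, h = heads, d_ffn = FFN dim.
    # The values are BERT-Base-like and used only for illustration.

    d, l, h, d_ffn = 768, 512, 12, 3072

    encoder_matmuls = {
        "W_Q / W_K / W_V projection":   (d, d, l),       # three (d x d) x (d x l) matmuls
        "query x key (per head)":       (l, d // h, l),  # produces an l x l score matrix
        "attention x value (per head)": (d // h, l, l),
        "W_out projection":             (d, d, l),
        "FFN W1 projection":            (d_ffn, d, l),
        "FFN W2 projection":            (d, d_ffn, l),
    }

    for name, (m, k, n) in encoder_matmuls.items():
        print(f"{name:30s} ({m} x {k}) x ({k} x {n}) -> {m} x {n}")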
II. HARDWARE ARCHITECTURE OPTIMIZATION

We first illustrate how architects familiar with mainstream accelerators for convolutional, vision-based workloads can design state-of-the-art Transformer accelerators. We start with a fairly typical CNN accelerator generated by the Gemmini [19] accelerator-generator, optimized primarily for ResNet50-like workloads, and we discuss the changes we made to this accelerator and its software stack to efficiently support Transformer workloads such as BERT. Throughout this section, we use BERT-Base as the workload. For more details, please refer to Section 3 of our full paper [27].

1) Baseline Accelerator: We first generate a fairly typical CNN accelerator with a 16×16 systolic array and the weight-stationary dataflow using the Gemmini accelerator-generator. The 8-bit integer weights and inputs are stored in a 256 kB local scratchpad memory, and the 32-bit partial sums are stored in a dual-ported 64 kB accumulator SRAM which performs matrix additions. When DNN layers are too large to fit into the local scratchpad, they fall back onto an external L2 cache and DRAM, which are shared with CPUs and other accelerators on the system-on-chip (SoC). A host CPU tiles such layers to compute the full outputs. The baseline accelerator produced by Gemmini incorporates peripheral circuitry that enables the execution of ReLU and max-pool operations, alongside integer-float multipliers that facilitate the scaling of 32-bit partial sums into 8-bit inputs for the subsequent layer. Native support for these operations is important, as it eliminates the need to offload them to the host CPU, thereby circumventing costly transfers of activations between DRAM or outer caches and the local scratchpad. Finally, note that this baseline CNN accelerator does not include any Transformer-specific features. In particular, there is no support for nonlinear normalization operations such as GELU, Softmax, or LayerNorm. Therefore, although it achieves real-time or near-real-time performance on end-to-end CNN workloads, its performance on Transformer workloads such as BERT is severely limited [19], as will be discussed in more detail.
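To make the tiling constraint concrete, the following minimal sketch (our illustration, not part of the Gemmini software stack) checks whether a matmul tile fits in the baseline 256 kB scratchpad and 64 kB accumulator, assuming 8-bit operands and 32-bit partial sums as described above; the helper name and the example shapes are assumptions.

    # Illustrative check of whether a matmul tile fits the baseline on-chip memories:
    # a 256 kB scratchpad for 8-bit inputs/weights and a 64 kB accumulator for
    # 32-bit partial sums. This is our sketch, not the actual accelerator software.

    def tile_fits(tile_m, tile_n, tile_k, sp_bytes=256 * 1024, acc_bytes=64 * 1024):
        """Return True if an (M x K) x (K x N) tile fits on-chip."""
        input_bytes = tile_m * tile_k        # 8-bit activations
        weight_bytes = tile_k * tile_n       # 8-bit weights
        partial_bytes = tile_m * tile_n * 4  # 32-bit partial sums
        return (input_bytes + weight_bytes) <= sp_bytes and partial_bytes <= acc_bytes

    # A BERT-Base FFN projection at sequence length 512 multiplies a 512 x 768
    # activation block by a 768 x 3072 weight matrix: too large for one pass.
    print(tile_fits(tile_m=512, tile_n=3072, tile_k=768))  # False -> the host must tile it
    print(tile_fits(tile_m=128, tile_n=128, tile_k=768))   # True  -> one possible tile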
2) Performance Bottlenecks: Our observations reveal that the baseline CNN accelerator, when deployed for Transformer inference, exhibits less than 1% utilization of its functional units. Although individual matmuls exhibit 74% utilization, performance is severely impeded by the nonlinear operations, which must be executed on the host CPU because they are not natively supported by the accelerator. This is further exacerbated by the fact that the nonlinear operations necessitate floating-point arithmetic. Not only is this less energy- and latency-efficient than its integer counterpart [22], it also entails dequantization and re-quantization of the activations. These overheads account for 96% of the overall execution time (Fig. 2). Given that the majority of FLOPs in Transformer inference are matmuls, the time spent on the nonlinear operations in the baseline accelerator is far from the theoretical optimum, unless further optimizations are implemented.

In contrast to the convolutions in CNNs, which exhibit high arithmetic intensity, Transformers mostly comprise matmuls, often with small and/or rectangular matrices, which translates to lower arithmetic intensities and different optimal tiling strategies. This indicates that the memory hierarchy and memory bandwidth of our baseline CNN accelerator need to be recalibrated for more efficient Transformer inference.
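As a rough back-of-the-envelope illustration of this contrast (our own sketch, not a result from the paper), one can compare the arithmetic intensity of a per-head query × key matmul against a ResNet-style convolution; the shapes below are assumed only to make the difference visible.

    # Rough arithmetic-intensity comparison (MACs per byte moved), assuming 8-bit
    # operands and 32-bit outputs. Shapes are illustrative, not taken from the paper.

    def matmul_intensity(m, k, n, in_bytes=1, out_bytes=4):
        macs = m * k * n
        data = (m * k + k * n) * in_bytes + m * n * out_bytes
        return macs / data

    def conv_intensity(h, w, cin, cout, kh=3, kw=3, in_bytes=1, out_bytes=4):
        macs = h * w * cout * cin * kh * kw
        data = (h * w * cin + kh * kw * cin * cout) * in_bytes + h * w * cout * out_bytes
        return macs / data

    # Per-head query x key matmul in BERT-Base (l = 512, d/h = 64) vs. a
    # ResNet-style 3x3 convolution on a 14 x 14 x 256 feature map.
    print(round(matmul_intensity(512, 64, 512), 1))    # ~15 MACs per byte
    print(round(conv_intensity(14, 14, 256, 256), 1))  # ~137 MACs per byte, ~9x higher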
3) Memory Configuration Re-adjustment: We observe that the performance of BERT matmul operations can be significantly improved by adjusting the sizes of the input/weight scratchpad and the partial-sum accumulator. Specifically, we find that larger accumulators with higher output reuse are better suited to several matmuls in Transformers, such as the query × key matmuls, whose l × l output activation matrices can be much larger than their l × d/h input matrices, where l, d, and h are the sequence length, hidden dimension, and number of heads, respectively. Based on this observation, we modified the CNN-optimized memory configuration of our baseline accelerator by reducing the scratchpad from 256 kB to 64 kB and increasing the accumulator from 64 kB to 256 kB. Importantly, these changes do not increase the total SRAM capacity or the total area; however, they result in a substantial 36% reduction in total matmul latency.
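The footprint argument behind this re-balancing can be checked with simple arithmetic. The sketch below, our illustration using BERT-Base-like values (l = 512, d = 768, h = 12), compares the per-head query × key input and output sizes that motivate trading scratchpad capacity for accumulator capacity.

    # Why the query x key matmul favors a larger accumulator: its l x l output of
    # 32-bit partial sums dwarfs its 8-bit l x d/h inputs. Illustrative values only.

    l, d, h = 512, 768, 12                # sequence length, hidden dim, heads

    inputs_kb = 2 * l * (d // h) / 1024   # query and key tiles, 8-bit each
    output_kb = l * l * 4 / 1024          # attention-score partial sums, 32-bit

    print(f"per-head query/key inputs: {inputs_kb:.0f} kB")  # 64 kB
    print(f"per-head score output:     {output_kb:.0f} kB")  # 1024 kB of output reuse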
4) Hardware-Software Co-Design: To alleviate the overhead incurred by runtime quantization and dequantization, as well as the offloading of nonlinear operations to the CPU, we transitioned our baseline Transformer workload from a naive BERT implementation, where only the matmuls are quantized, to an integer-only BERT variant known as I-BERT [26]. I-BERT substitutes floating-point nonlinear operations with integer polynomial approximations, which can be implemented faster and more efficiently in specialized accelerators. To incorporate I-BERT, we add new integer implementations of I-BERT's GELU, LayerNorm, and Softmax variants to our baseline CNN accelerator. The 32-bit matmul results residing in the accumulator are fed into a newly added "normalization unit", which computes the reduction operations (e.g., sum, sum-of-squares, and max) used by LayerNorm and Softmax. Multiple passes of accumulator reads are required to compute all the reductions in these operations. Subsequently, the matmul results in the accumulator undergo a final read and are fed into a set of 16 activation units, which compute I-BERT's nonlinear variants in parallel.
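To make the idea of integer-only polynomial approximation concrete, the following is a minimal sketch of the core primitive (our paraphrase, not the I-BERT reference implementation [26]); the coefficients and scale are placeholders, whereas I-BERT derives specific polynomials for GELU, Softmax, and LayerNorm and folds the output scaling into requantization.

    # Minimal sketch of integer-only evaluation of a second-order polynomial
    # a*(x + b)^2 + c, the kind of primitive I-BERT uses in place of floating-point
    # GELU/Softmax/LayerNorm kernels. Coefficients and the scale are placeholders,
    # not I-BERT's published values.

    def int_poly(q: int, scale: float, a: float, b: float, c: float):
        """Evaluate a*(x + b)^2 + c for x = q * scale using only integer ops on q."""
        q_b = round(b / scale)                # fold b into the integer domain
        q_c = round(c / (a * scale * scale))  # fold c into the integer domain
        q_out = (q + q_b) ** 2 + q_c          # integer-only arithmetic on the device
        scale_out = a * scale * scale         # handled later by requantization
        return q_out, scale_out

    # Example: a quantized activation q = 12 with scale 0.05 (i.e., x = 0.6).
    q_out, s_out = int_poly(q=12, scale=0.05, a=-0.29, b=-1.77, c=1.0)
    print(q_out * s_out)  # dequantized result, close to -0.29*(0.6 - 1.77)**2 + 1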
Fig. 2: The time breakdown of a BERT inference with a sequence length of 512, when running on (Left) the baseline CNN accelerator and (Middle) the accelerator with I-BERT's hardware/software features incorporated. (Right) The time breakdown for different sequence lengths after the change. For all sequence lengths, the total execution time is dominated by matmuls.

With these new features, overall end-to-end BERT inference performance improved by 39.6× over the baseline accelerator's initial performance. As Fig. 2 illustrates, the computational bottleneck once again became the matmuls rather than the normalization or activation functions. Quantization and dequantization are no longer necessary, and GELU can be trivially fused with the preceding matmuls so that they become one pipelined operation. When synthesized with the ASAP7 PDK [13], the new hardware units increased the total area of the accelerator by only 14%, and the GELU, LayerNorm, and Softmax operations increased the power consumption of a BERT inference by only 9.3%.

III. SCHEDULING OPTIMIZATION

In Sec. II, we demonstrated that the nonlinear operations in Transformers introduce challenges for efficient accelerator design. We further find that these operations present non-trivial challenges for scheduling as well. In this section, we provide a brief overview of those challenges. For more details, please refer to Section 5 of our full paper [27].

Generally in DNN scheduling, it is an enticing strategy to fuse relatively high-arithmetic-intensity matmuls with the following low-arithmetic-intensity normalization operations. For example, execution schedulers for CNN-type accelerators often fuse convolutions with ReLU or max-pool operations. This strategy is especially applicable to quantized workloads, where partial sums awaiting normalization are often of higher bitwidth than the final normalized outputs. Similarly, for Transformer encoders, we could overlap the execution of normalization operations (LayerNorm and Softmax) with their preceding matmuls. However, this strategy may require hardware/software changes. First, in the case of DNN accelerators like Gemmini, additional hardware support may be required for the normalization units to directly access partial sums. Second, appropriate constraints on the matmul execution schedule are necessary. In particular, the tiling factor of either output dimension of the matmul must be maximized, so that rows/columns of the output are immediately ready in the Gemmini accumulator scratchpad for computing the mean and standard deviation. We refer to this alternate scheduling approach as fusion-optimized scheduling.
Fig. 3: (Left) Impact of fusion-optimized scheduling for MHA execution. Hiding the Softmax latency via fusion-optimized scheduling improves overall MHA latency by 78%, but overlapping the Wout projection with LayerNorm can hurt total latency. (Right) Impact of fusion-optimized scheduling for the FFN matmul, which enables latency hiding of the LayerNorm operation. We observe that fusion-optimized scheduling hurts total latency by 27%. In both cases, we assume an input sequence length of 512 and an accumulator size of 256 kB.

In Fig. 3, we take a deeper look into the performance implications of fusion-optimized scheduling for the BERT-Base encoder. We model the total latency of each adjacent pair of matmul and LayerNorm/Softmax operations via Timeloop [33], with the target hardware being the I-BERT-modified Gemmini described in Sec. II. Opportunities for overlapping computations include: (1) the MHA query × key matmul and the following Softmax; (2) the MHA Wout projection and the following LayerNorm; and (3) the FFN W2 projection and the following LayerNorm. The two scheduling strategies we compare are: (1) fusion-optimized scheduling and (2) Gemmini's default heuristic-based scheduler, which greedily maximizes loop tile factors at the local SRAM level for each of the three matmul dimensions. We refer to the second, default scheduling approach as non-fused scheduling.
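The trade-off between the two schedules can be summarized with a toy cost model (our simplification, with made-up cycle counts rather than Timeloop numbers): fusion hides the nonlinear operation behind the matmul but may force a less efficient tiling, so it only pays off when the hidden latency exceeds the tiling penalty.

    # Toy cost model for fusion-optimized vs. non-fused scheduling.
    # Cycle counts are normalized and invented; the paper's numbers come from Timeloop.

    def non_fused(matmul_cycles, nonlinear_cycles):
        # The matmul finishes first, then LayerNorm/Softmax runs over the full output.
        return matmul_cycles + nonlinear_cycles

    def fused(matmul_cycles, nonlinear_cycles, tiling_penalty):
        # The nonlinear op overlaps with a matmul whose tiling is constrained by fusion.
        return max(matmul_cycles * (1 + tiling_penalty), nonlinear_cycles)

    # Query x key + Softmax: the nonlinear op is expensive, so hiding it wins.
    print(non_fused(1.0, 0.8), fused(1.0, 0.8, tiling_penalty=0.2))  # 1.8 vs 1.2

    # FFN W2 projection + LayerNorm: cheap nonlinear op, costly mapping constraint.
    print(non_fused(1.0, 0.1), fused(1.0, 0.1, tiling_penalty=0.4))  # 1.1 vs 1.4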
The left plot of Fig. 3 showcases the promise of matmul and nonlinear operator fusion within the MHA. With Gemmini on-chip scratchpad and accumulator SRAM sizes of 256 kB, we observe that it is advantageous to fuse the query × key matmuls with Softmax for each attention head and thereby hide the relatively high latency of executing the Softmax operation. Assuming an input sequence length of 512, the Softmax latency is significant compared to the matmul, taking up around 78% of the total cycles and contributing greatly to the total latency.

On the other hand, the right plot of Fig. 3 shows the results of overlapping the matmul and LayerNorm in the FFN W2 projection. Here, we observe that fusion-optimized scheduling worsens total latency by 27%. When scheduling the FFN, we find that at the BERT-Base scale it is consistently favorable to overlap the MHA query × key with the ensuing Softmax, but consistently disadvantageous to chain the FFN W2 projection matmul with LayerNorm. This is in contrast with previous studies on GPU kernel fusion for Transformers [11], [35], and it highlights how scheduling for Transformer matmuls becomes more complex when targeting different styles of custom hardware designs, including the Gemmini accelerator.

IV. NEURAL ARCHITECTURE OPTIMIZATION

Another important avenue in the full-stack optimization of DNNs is optimizing DNN architectures and tailoring them to specific hardware platforms. However, the exponential search space of DNN architectures often makes it challenging to find an optimal architecture, even without considering the underlying hardware. To address this issue, automated neural architecture search (NAS) methods have been proposed to adapt DNNs to given hardware constraints. In this regard, we apply hardware-aware NAS to search for Transformer architectures that are optimal on the Gemmini-driven accelerator, with better efficiency and performance trade-offs. For a more detailed overview of hardware-aware NAS and its application to Transformer architectures, please refer to Section 6 of our full paper [27].

1) Experiment Setup: As a baseline architecture, we use a 6-layer Transformer with all other model configurations remaining the same as in BERT-Base. We use language modeling on WikiText-2 [31] as the training objective. To evaluate model performance, we measure perplexity on the validation examples, where lower scores indicate better performance. The stand-alone baseline model was trained for 50 epochs with the Adam optimizer and a linear learning rate schedule with a peak learning rate in {5, 2, 1, 0.5} × 10⁻⁵. We use a sequence length of 512 and a batch size of 16. For NAS, we adopt the BigNAS [47] strategy to train a supernet using the same training hyperparameters as the stand-alone training. The NAS search space comprises various combinations of the number of layers in {3, 4, 5, 6}, the number of heads in {4, 6, 8, 10, 12}, the hidden dimension in [384, 768], and the FFN dimension in [768, 3072]. Subsequently, we run evolutionary search for 40 iterations with a population size of 40 and a mutation probability of 0.2 to find optimal subnets within the fully trained supernet. After every iteration, only the subnets that are Pareto-optimal in EDP (energy-delay product) and perplexity are retained. To measure the hardware cost, we use a lookup-table-based method to quickly assess the latency and energy consumption of each subnet on the target hardware, instead of time-consuming RTL simulation. The lookup table contains Timeloop-simulated [33] latency and energy numbers for each operation, which are summed to estimate the end-to-end values for an entire subnet. After the evolutionary search, the Pareto-optimal subnets are evaluated with an RTL simulator to obtain a more precise estimate of the latency. For the energy measure, we continue to use the numbers from Timeloop. For the target hardware, we use Gemmini with the optimizations applied in Sec. II.
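The search loop itself is conceptually simple. The sketch below is our paraphrase of the procedure described above (population 40, 40 iterations, mutation probability 0.2, Pareto filtering on EDP and perplexity); the helpers sample_subnet, mutate, estimate_edp, and eval_perplexity are hypothetical stand-ins for the BigNAS supernet, the Timeloop-based lookup table, and the validation pipeline.

    import random

    # Sketch of the evolutionary search described above. sample_subnet, mutate,
    # estimate_edp (lookup-table latency x energy), and eval_perplexity are
    # hypothetical stand-ins for the supernet and the Timeloop-based cost model.

    def pareto_front(candidates):
        """Keep candidates not dominated in (EDP, perplexity); lower is better for both."""
        front = []
        for c in candidates:
            dominated = any(
                o["edp"] <= c["edp"] and o["ppl"] <= c["ppl"]
                and (o["edp"] < c["edp"] or o["ppl"] < c["ppl"])
                for o in candidates
            )
            if not dominated:
                front.append(c)
        return front

    def evolutionary_search(sample_subnet, mutate, estimate_edp, eval_perplexity,
                            iterations=40, population=40, mutation_prob=0.2):
        pool = [sample_subnet() for _ in range(population)]
        survivors = []
        for _ in range(iterations):
            scored = [{"cfg": cfg, "edp": estimate_edp(cfg), "ppl": eval_perplexity(cfg)}
                      for cfg in pool]
            # Retain only the subnets that are Pareto-optimal in EDP and perplexity.
            survivors = pareto_front(scored + survivors)
            # Next generation: mutate Pareto-optimal parents.
            pool = [mutate(random.choice(survivors)["cfg"], mutation_prob)
                    for _ in range(population)]
        return survivors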
Fig. 4: (Left) EDP-perplexity, (Middle) latency-perplexity, and (Right) energy-perplexity plots of the Transformer architectures found via evolutionary search on our Gemmini hardware configuration (scratchpad: 64 kB, accumulator: 256 kB). Lower perplexity indicates better performance of the trained models. For easier comparison, we additionally plot lines illustrating +0.1 and +1 point perplexity degradation.

2) Experiment Results: We show the NAS Pareto-frontier results for EDP, latency, and energy in Fig. 4 (blue curves), where each point corresponds to a different Transformer architecture found by the evolutionary search algorithm. Additionally, we plot the stand-alone trained baseline Transformer model as a reference (× mark). As can be seen in the EDP plot (Fig. 4, Left), the NAS framework allows us to obtain multiple Transformer architectures with better hardware-cost-to-perplexity trade-offs. That is, it finds architectures with similar or even better perplexity than the baseline at smaller hardware costs.

Fig. 4 (Middle and Right) further illustrates latency and energy separately. As one can see, it is possible to attain a 1.4× reduction in latency versus the baseline Transformer with 0.1 point perplexity degradation. If one can tolerate 1 point of perplexity degradation, latency can be reduced by 2.4×. With regard to energy, one can attain a 1.6× improvement for 0.1 point perplexity degradation, and 4.4× when allowing 1 point of perplexity degradation.
Taking both together, it is possible to reduce EDP by 2.2× with just 0.1 point perplexity degradation, and by 10.6× with 1 point perplexity degradation. These examples illustrate the power of co-design in allowing practitioners to choose a combination that best matches their needs. It is important to note that this represents a single run of our co-design methodology on a specific hardware platform, and results may vary depending on the target hardware and optimization goals.
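As a quick sanity check on how these numbers compose (our arithmetic, not additional results), note that EDP is the product of energy and delay, so each EDP gain is roughly the product of the corresponding latency and energy gains:

    # EDP = energy x delay, so each EDP improvement factors roughly as the product
    # of the corresponding latency and energy improvement factors reported above.

    operating_points = {
        "+0.1 perplexity": {"latency_gain": 1.4, "energy_gain": 1.6},  # ~2.2x EDP
        "+1.0 perplexity": {"latency_gain": 2.4, "energy_gain": 4.4},  # ~10.6x EDP
    }
    for name, p in operating_points.items():
        print(name, round(p["latency_gain"] * p["energy_gain"], 1))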
V. CONCLUSION

While Transformer models have shown significant performance improvements, their growing size and run-time complexity present a critical challenge for efficient inference. In this work, we have demonstrated the benefits of a full-stack approach that leverages co-design and co-optimization techniques across the stack. We adapted a CNN-oriented accelerator for efficient Transformer inference by supporting integer-only nonlinear operations [26] and re-balancing the memory hierarchy, which yielded a 39.6× latency reduction. We also applied NAS to search for Pareto-optimal Transformer architectures in the trade-off between EDP and perplexity, leading to a 10.6× EDP reduction with minimal performance drop. Altogether, we have exhibited an 88.7× latency improvement without a noticeable performance drop compared to a naive implementation without full-stack considerations. We have also demonstrated that, unlike in CNNs, nonlinear operations in Transformers require careful consideration when performing operator fusion for custom accelerators, e.g., systolic-array-based architectures. We expect further improvements from taking this into account when designing the end-to-end full-stack optimization pipeline. We refer interested readers to our full paper [27], which includes (1) a comprehensive analysis of Transformer workloads, (2) an extensive survey of current hardware and software solutions for efficient Transformer inference, and (3) case studies that quantify the advantages of co-design and co-optimization techniques across the stack for full-stack Transformer inference.

ACKNOWLEDGEMENTS

We acknowledge gracious support from Meta and, in particular, Michael Anderson, Satish Nadathur, and Summer Deng, as well as Google Cloud, the Google TRC team, and specifically Jonathan Caton, Prof. David Patterson, and Jing Li. Prof. Keutzer's lab is sponsored by Intel Corporation, the Intel VLAB team, and the Intel One-API Center of Excellence, as well as funding through BDD and BAIR. Sehoon Kim would like to acknowledge the support from the Korea Foundation for Advanced Studies (KFAS). Amir Gholami was supported through funding from Samsung SAIT. Michael W. Mahoney would also like to acknowledge a J. P. Morgan Chase Faculty Research Award as well as the DOE, NSF, and ONR. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

REFERENCES

[1] "Edge TPU," https://cloud.google.com/edge-tpu/, accessed: 2018-12-05.
[2] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[3] D. Abts, J. Kim, G. Kimmell, M. Boyd, K. Kang, S. Parmar, A. Ling, A. Bitar, I. Ahmed, and J. Ross, "The Groq software-defined scale-out tensor streaming multiprocessor: From chips-to-systems architectural overview," in IEEE Hot Chips Symposium, 2022, pp. 1–69.
[4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," arXiv preprint arXiv:2005.14165, 2020.
[5] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," arXiv preprint arXiv:1512.01274, 2015.
[6] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze et al., "TVM: An automated end-to-end optimizing compiler for deep learning," in 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), 2018, pp. 578–594.
[7] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '14). New York, NY, USA: ACM, 2014, pp. 269–284.
[8] Y.-H. Chen, J. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in Proceedings of the International Symposium on Computer Architecture (ISCA), 2016.
[9] Y.-H. Chen, T.-J. Yang, J. Emer, and V. Sze, "Eyeriss v2: A flexible accelerator for emerging deep neural networks on mobile devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 2019.
[10] Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, "DaDianNao: A machine-learning supercomputer," in Proceedings of the International Symposium on Microarchitecture (MICRO), 2014.
[11] J. Choi, H. Li, B. Kim, S. Hwang, and J. H. Ahn, "Accelerating transformer networks through recomposing softmax layers," in International Symposium on Workload Characterization (IISWC), 2021.
[12] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., "PaLM: Scaling language modeling with pathways," arXiv preprint arXiv:2204.02311, 2022.
[13] L. Clark, V. Vashishtha, L. Shifren, A. Gujia, S. Sinha, B. Cline, C. Ramamurthya, and G. Yeric, "ASAP7: A 7-nm FinFET predictive process design kit," Microelectronics Journal, 2016.
[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[15] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat et al., "GLaM: Efficient scaling of language models with mixture-of-experts," in International Conference on Machine Learning. PMLR, 2022, pp. 5547–5569.
[16] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: Shifting vision processing closer to the sensor," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), 2015, pp. 92–104.
[17] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, "Neural acceleration for general-purpose approximate programs," in Proceedings of the International Symposium on Microarchitecture (MICRO), 2012.
[18] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: Scalable and efficient neural network acceleration with 3D memory," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017.
[19] H. Genc, S. Kim, A. Amid, A. Haj-Ali, V. Iyer, P. Prakash, J. Zhao, D. Grubb, H. Liew, H. Mao, A. Ou, C. Schmidt, S. Steffl, J. Wright, I. Stoica, J. Ragan-Kelley, K. Asanovic, B. Nikolic, and Y. S. Shao, "Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration," in Proceedings of the 58th Annual Design Automation Conference (DAC), 2021.
[20] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, "EIE: Efficient inference engine on compressed deep neural network," SIGARCH Comput. Archit. News, vol. 44, no. 3, Jun. 2016.
[21] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., "Training compute-optimal large language models," arXiv preprint arXiv:2203.15556, 2022.
[22] M. Horowitz, "1.1 Computing's energy problem (and what we can do about it)," in 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014, pp. 10–14.
[23] J. Hruska, "New Movidius Myriad X VPU packs a custom neural compute engine," 2017.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," CoRR, vol. abs/1408.5093, 2014.
[25] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, "In-datacenter performance analysis of a tensor processing unit," in 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), June 2017, pp. 1–12.
[26] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, "I-BERT: Integer-only BERT quantization," in International Conference on Machine Learning. PMLR, 2021, pp. 5506–5518.
[27] S. Kim, C. Hooper, T. Wattanawong, M. Kang, R. Yan, H. Genc, G. Dinh, Q. Huang, K. Keutzer, M. W. Mahoney et al., "Full stack optimization of transformer inference: a survey," arXiv preprint arXiv:2302.14017, 2023.
[28] S. Knowles, "Graphcore," in IEEE Hot Chips Symposium, 2021, pp. 1–25.
[29] H. Liao, J. Tu, J. Xia, and X. Zhou, "DaVinci: A scalable architecture for neural network computing," in IEEE Hot Chips Symposium, 2019, pp. 1–44.
[30] S. Lie, "Cerebras architecture deep dive: First look inside the HW/SW co-design for deep learning: Cerebras Systems," in IEEE Hot Chips Symposium, 2022, pp. 1–34.
[31] S. Merity, C. Xiong, J. Bradbury, and R. Socher, "Pointer sentinel mixture models," 2016.
[32] NVIDIA. (2018) TensorRT: https://developer.nvidia.com/tensorrt.
[33] A. Parashar, P. Raina, Y. S. Shao, Y.-H. Chen, V. A. Ying, A. Mukkara, R. Venkatesan, B. Khailany, S. W. Keckler, and J. Emer, "Timeloop: A systematic approach to DNN accelerator evaluation," in 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 2019, pp. 304–315.
[34] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[35] S. Pati, S. Aga, N. Jayasena, and M. D. Sinclair, "Demystifying BERT: Implications for accelerator design," in International Symposium on Workload Characterization (IISWC), 2021.
[36] J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He et al., "Towards artificial general intelligence with hybrid Tianjic chip architecture," Nature, vol. 572, no. 7767, pp. 106–111, 2019.
[37] R. Prabhakar and S. Jairath, "SambaNova SN10 RDU: Accelerating software 2.0 with dataflow," in IEEE Hot Chips Symposium, 2021, pp. 1–37.
[38] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," 2018.
[39] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, p. 9, 2019.
[40] J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., "Scaling language models: Methods, analysis & insights from training Gopher," arXiv preprint arXiv:2112.11446, 2021.
[41] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, "Exploring the limits of transfer learning with a unified text-to-text transformer," arXiv preprint arXiv:1910.10683, 2019.
[42] A. Sabne, "XLA: Compiling machine learning for peak performance," 2020.
[43] T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé et al., "BLOOM: A 176B-parameter open-access multilingual language model," arXiv preprint arXiv:2211.05100, 2022.
[44] F. Sijstermans, "The NVIDIA deep learning accelerator," in Hot Chips, 2018.
[45] S. Smith, M. Patwary, B. Norick, P. LeGresley, S. Rajbhandari, J. Casper, Z. Liu, S. Prabhumoye, G. Zerveas, V. Korthikanti et al., "Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model," arXiv preprint arXiv:2201.11990, 2022.
[46] E. Talpes, D. D. Sarma, G. Venkataramanan, P. Bannon, B. McGee, B. Floering, A. Jalote, C. Hsiong, S. Arora, A. Gorti et al., "Compute solution for Tesla's full self-driving computer," IEEE Micro, vol. 40, no. 2, pp. 25–35, 2020.
[47] J. Yu, P. Jin, H. Liu, G. Bender, P.-J. Kindermans, M. Tan, T. Huang, X. Song, R. Pang, and Q. Le, "BigNAS: Scaling up neural architecture search with big single-stage models," in European Conference on Computer Vision. Springer, 2020, pp. 702–717.