Compiler Hackathon Assignment II

The Compile-a-thon for the Compiler Design course requires teams of up to three students to create a compiler for a custom Processor-in-Memory architecture, focusing on AI/ML applications. Participants must submit a report detailing their design, code, and output, with specific page requirements, by the end of the event on March 30, 2025. The winning teams will have opportunities to collaborate on further research projects related to compiler development.


Compile-a-thon

Course Name: Compiler Design (B1 and B2 slot)


Course Code: BCSE307L
Semester: Winter 2024-2025
Date and Time: 29/Mar/2025, 8:00 AM to 30/Mar/2025, 8:00 AM
Mode: Online
General Instructions:
1) Each team may have a maximum of three members.
2) Each team has to upload its report via the Google Drive link
(the link will be circulated).
3) The report should consist of 10 to 15 pages.
4) The report must contain the following:
a) Design and Algorithm (2 to 3 pages)
b) Code (5 to 7 pages)
c) Output Screenshots (3 to 5 pages)
5) This will be considered for DA3.
6) Each team has to upload only one report, in PDF format only.
7) The MS Teams link will be circulated.
8) The rubric is as follows:
a) Algorithm and Design: 10 marks
b) Implementation: 35 marks
c) Output: 5 marks
Problem statement

1. Create a compiler, in the form of a translator, for a custom
Processor-in-Memory (PIM) architecture geared towards AI/ML
applications.
2. Input: a C++ program that multiplies two matrices of parameterized
sizes, with integer operands. A sketch of such an input program is
given below.
3. Output: a stream of custom-ISA-compatible instructions.
a. The compiler will integrate physical memory mapping.
b. The ISA-compatible instruction format is discussed in Section
IV-D of the attached paper on a custom ISA for the novel PIM
architecture.
4. Adopting the LLVM framework (https://llvm.org/) is
preferred.
5. Winning team(s) will be invited to collaborate on research
projects to develop a full compiler for multiple applications such
as AI/ML programs.
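For concreteness, the following is a minimal sketch of the kind of input program the compiler is expected to translate; the fixed sizes and initial values here are illustrative assumptions, since the actual input parameterizes the matrix dimensions:

```cpp
#include <cstdio>
#include <vector>

// Multiply an M x K matrix A by a K x N matrix B into an M x N matrix C.
// Integer operands; sizes are parameterized (shown here as constants).
int main() {
    const int M = 4, K = 4, N = 4;  // assumed sizes for illustration
    std::vector<std::vector<int>> A(M, std::vector<int>(K, 1));
    std::vector<std::vector<int>> B(K, std::vector<int>(N, 2));
    std::vector<std::vector<int>> C(M, std::vector<int>(N, 0));

    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                C[i][j] += A[i][k] * B[k][j];  // MAC: the operation the PIM executes

    printf("C[0][0] = %d\n", C[0][0]);
    return 0;
}
```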
2021 IEEE 39th International Conference on Computer Design (ICCD)
DOI: 10.1109/ICCD53106.2021.00022

Flexible Instruction Set Architecture for Programmable Look-up Table based Processing-in-Memory

Mark Connolly, Purab Ranjan Sutradhar, Mark Indovina, Amlan Ganguly


Department of Computer Engineering, Rochester Institute of Technology, Rochester, NY, USA
{mfc5867, ps9525, maieee, axgeec}@rit.edu

Abstract—Processing in Memory (PIM) is a recent novel computing paradigm that is still in its nascent stage of development. Therefore, there has been an observable lack of standardized and modular Instruction Set Architectures (ISA) for PIM devices. In this work, we present the design of an ISA which is primarily aimed at a recent programmable Look-up Table (LUT) based PIM architecture. Our ISA performs the three major tasks of i) controlling the flow of data between the memory and the PIM units, ii) reprogramming the LUTs to perform various operations required for a particular application, and iii) executing sequential steps of operation within the PIM device. A microcoded architecture of the Controller/Sequencer unit ensures minimum circuit overhead as well as offers programmability to support any custom operation. We provide a case study of CNN inferences, large matrix multiplications, and bitwise computations on the PIM architecture equipped with our ISA and present performance evaluations based on this setup. We also compare the performances with several other PIM architectures.

Index Terms—Instruction set architecture, microcode, processing in memory, look-up table, convolutional neural network, deep neural network, DRAM

I. INTRODUCTION

The ever-increasing growth of the processing capability and efficiency of modern processors is strikingly contrasted by the performance of the memory. The memory devices suffer from significant data-access latency and poor energy efficiency, which end up causing the major performance bottleneck in state-of-the-art computing devices. Moreover, the physical separation between the processor and the memory imposed by the von Neumann computing model essentially limits the achievable data-transfer bandwidth between these two units. This prevents a processing device from maximizing its performance and thereby creates a 'memory wall' bottleneck [1]. At the same time, the data communication between the processor and the memory chip also accounts for a dominating share of the total power consumption.

The memory wall issue is increasingly motivating investigation into alternative non-von Neumann computing models. Processing-in-memory (PIM) has lately emerged as a viable solution that minimizes latency and power consumption from the data communication by implementing the processor and the memory inside the same chip. Although PIM is not a new concept [2], it is lately being re-explored under a new light for a versatile range of applications [3].

The earliest PIM works endeavored to implement complete traditional processing units inside the memory chip [4]. However, the complication associated with implementing complex logic blocks inside the memory chip hindered the practicality of such designs. Recent PIMs are integrating processing circuitry with simpler construction deeply within the memory chip. For example, the bit-wise parallel PIM architectures [5]–[8] feature logic gates on the memory bitlines so that the data can undergo processing without leaving the memory subarray inside which it is located. Such a PIM architecture also has to be accompanied by an instruction set architecture (ISA) that enables it to access part or the entirety of the memory micro-architecture for performing the operations. This may also involve taking over the control of memory row access from the memory controller in order to perform computations [5]. Such a custom ISA also requires an accompanying software interface to receive instructions from a host device [5], [7].

Although most of the recent PIM designs mention an accompanying ISA and software interface [5], [7], there has been an observable lack of a detailed discussion of the ISA architecture in these works. A previous work called 'PIM-enabled Instructions' [9] presented a generalized, architecture-level abstraction of PIM and offered functional access to the PIM device via the memory controller. However, it is not compatible with later PIMs, which feature a significantly deeper integration of the logic with the memory micro-architecture [5], [6], [10]. These PIMs require ISA support with significantly finer-grained control over the memory architecture. This calls for an initiative towards custom-designed ISAs targeted at specific PIM architectures so as to capitalize on the strengths and quirks of each particular PIM architecture.

In this work, we present a microcoding-based programmable ISA for a recent PIM architecture called pPIM [11], [12]. Our choice of a micro-coding-based architecture for the ISA is inspired by several factors. First, a microcoding-based Controller unit is fully programmable, which allows complete functional flexibility of the operations to be performed by the PIM architecture. Moreover, this programmable nature will allow the inclusion of yet unexplored operations and applications on this platform with significant ease, which would not be possible with a custom-designed ISA. This flexibility of the proposed micro-coded ISA can be fully utilized by the programmable lookup table (LUT) based pPIM architecture.

pPIM consists of parallel processing elements (PE) termed 'clusters', each of which contains several programmable look-up table cores to perform micro-operations. With the aid of a sophisticated routing mechanism that interconnects all the LUT cores inside, each cluster of pPIM is capable of performing more complex operations such as matrix multiplications. We choose pPIM as the platform for our ISA design for four main reasons. First, pPIM is a highly flexible processing architecture with LUTs which can be programmed to perform virtually any logic/arithmetic operation. Second, the performance and energy efficiency of pPIM are impressive thanks to its unique LUT-based computing paradigm. Third, pPIM performs data communication with the adjacent memory subarrays in large batches, which allows for a comparatively simpler data organization scheme. Fourth, pPIM clusters feature an internal data-routing mechanism that, along with the programmable LUTs, allows a user to design custom operations. Our microcoding-based ISA is also programmable, which makes it highly suitable for operating the pPIM architecture.

Our ISA supports the baseline pPIM functionalities of Multiplication & Accumulation (MAC) and the ReLU activation filter with a view to performing CNN inferences. We also explore a diverse range of applications, including large matrix multiplications and bitwise logic operations. Our ISA design space leaves room for further expansion in functionality.

II. BACKGROUND & MOTIVATION

Bitwise processing PIMs make up the majority of the works in the PIM domain. These PIM architectures perform logic/arithmetic operations on each memory bitline, either by charge sharing [13], [14] or by appending logic circuitry on the local sense amplifiers [7], [10]. In order to perform such bitwise operations, multiple memory rows containing the operands are activated simultaneously. Therefore, the ISAs of these PIMs need to facilitate simultaneous activation of multiple memory rows, which generally involves the implementation of additional custom row-decoders [13].

The bitwise processing PIMs, however, are not suitable for scaling up the data-precision of operations. A few works, such as DRISA [5] and DrAcc [6], support operations at larger data-precisions, albeit at the expense of significant area overhead and operational complexity. Several bitwise processing PIM architectures rely on bit-serial computing to support larger data-precision of operations [8], [15], but at the same time also suffer from significant operational latency.

LUT-based PIMs [11], [12], [16], on the other hand, are inherently capable of performing operations with larger data granularity. The data look-ups do not require multi-row accesses, unlike the bitwise PIMs. This simplifies the complexity associated with ISA design and the ISA hardware overhead. Therefore, a LUT-based PIM has been chosen over the bitwise PIMs as the platform PIM in this work.

Moreover, the LUT-based PIMs feature individual Processing Elements (PE) that operate in parallel as in a Single-Instruction-Multiple-Data (SIMD) architecture. These PEs are also interconnected via an in-memory communication architecture [11], [12], [16] which enables them to communicate data-operands internally during the operations. Therefore, these PIMs also leverage a mechanism resembling Systolic Array architectures. However, the amalgamation of all these features into a PIM architecture is a task left for the ISA to perform. Therefore, in this work, we present an ISA for a look-up table based PIM (pPIM [11], [12]) that is inspired by the technical aspects of the aforementioned computing models and brings these features together in one single ISA design.

III. PPIM ARCHITECTURE

Our ISA design is wrapped around the pPIM architecture. pPIM is suitable for hardware acceleration of data-centric applications, especially AI applications such as DNNs & CNNs. The architecture of pPIM is presented in Figure 1 in a hierarchical manner.

The top-level element within the pPIM architecture is the pPIM cluster. Figure 1 (a) shows the arrangement of the clusters in a DRAM bank, and the architecture of a single cluster is shown in Figure 1 (b). Within the cluster are nine pPIM cores that communicate their output through an all-to-all router architecture. The micro-architecture of this router is shown in Figure 1 (c). The bulk of the processing power of the pPIM architecture is contained within the pPIM core. A core is a reprogrammable component that can facilitate any operation between a pair of 4-bit inputs and produces 8-bit outputs. Figure 1 (d) shows the architecture of a single core. The LUT in a core is implemented with eight 256-to-1 multiplexers, accompanied by eight 256-bit latch/register files. The 8 select bits of the multiplexers are controlled by two 4-bit input registers. The LUT can be reprogrammed by re-writing the latch/register files.

Through the use of multiple pPIM cores, the architecture is able to process more complex functions than a single core alone could. Complex functions can be broken down into multiple functions that can be handled individually at the pPIM core level. The router allows the orchestration of multistage data-flow schemes to implement such complex operations. A cluster can perform a chain of operations on a pair of 8-bit operands. A 16-bit Accumulator located inside a cluster captures the output of an operation so that it can be re-utilized during the operation of the following cycles if required.
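The reprogramming step described above amounts to filling the eight 256-bit latch/register files with the truth table of the desired function. The following is a minimal sketch of deriving those eight "function-words", under the assumption that function-word i holds output bit i across all 256 input combinations (the paper does not pin down this exact layout):

```cpp
#include <array>
#include <bitset>
#include <cstdint>

// Sketch: compute the eight 256-bit function-words that program one
// pPIM LUT core for a given 4-bit x 4-bit -> 8-bit operation.
// Assumed layout: function-word i holds output bit i for each of the
// 256 possible input combinations.
std::array<std::bitset<256>, 8> make_function_words(uint8_t (*op)(uint8_t, uint8_t)) {
    std::array<std::bitset<256>, 8> words{};
    for (int a = 0; a < 16; ++a) {          // first 4-bit operand
        for (int b = 0; b < 16; ++b) {      // second 4-bit operand
            int index = (a << 4) | b;       // one of the 256 LUT entries
            uint8_t result = op(static_cast<uint8_t>(a), static_cast<uint8_t>(b));
            for (int bit = 0; bit < 8; ++bit)
                words[bit][index] = (result >> bit) & 1;  // fill bit-plane
        }
    }
    return words;
}

// Example: a 4-bit multiplier core (4x4 -> 8-bit product), as used in the MAC mapping.
uint8_t mul4(uint8_t a, uint8_t b) { return static_cast<uint8_t>(a * b); }
// auto words = make_function_words(mul4);
```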
IV. INSTRUCTION SET ARCHITECTURE

A. ISA Design

The proposed ISA is primarily designed to drive the pPIM architecture for performing data-intensive applications such as CNN acceleration. For this purpose, we develop a set of instructions that are required for implementing different CNN layers, i.e. the Convolutional Layer, Activation Layer, etc. The proposed ISA is depicted in Figure 2. The ISA consists of an Instruction Register/Decoder unit, a group of Pointers, and a Controller/Sequencer unit. Decoded instruction bits are distributed among an Address Pointer, a Core Pointer, Read/Write Pointers, and the Program Counter inside the Controller/Sequencer unit.

Fig. 1. Hierarchical view of the pPIM architecture including (a) arrangement of pPIM clusters in a DRAM Bank, (b) cluster architecture, (c) cluster router design, and (d) LUT-Core architecture.
The Address pointer is used for accessing data-operands from the memory subarray. The Core pointer is used for selecting specific cores inside a cluster during the programming stage of operation. Additionally, the read/write pointers facilitate sequential reads and writes of data-operands from and to the clusters.

We have adopted a microcoding-based implementation of the Controller/Sequencer unit, where control signals are generated from a microcode table in the form of 'control-words'. These control words may perform any or all of the following operations: a) programming the core LUTs inside a cluster with newer functionalities, b) routing data-operands among the cores via the router during an operation, and c) reading/writing data from/to the memory subarray.

The choice of a microcoding-based architecture for the Controller/Sequencer unit is inspired by several factors. First, the alternative to microcoded 'control-words' would have been a 'hard-wired' logic-based ISA, which does not allow in-situ modification. This would essentially restrict the functional flexibility and scope for functional expansion of the pPIM architecture. Moreover, the adoption of such a logic-based ISA would involve a certain amount of CMOS logic circuitry, which is challenging to implement on a DRAM memory chip. In contrast, the microcoding-based Controller/Sequencer unit can be implemented as an SRAM, ROM, or non-volatile memory table. In fact, the microcode table for the Controller/Sequencer of our proposed ISA is left with empty slots which can be programmed to support an additional set of instructions that are compatible with the pPIM architecture. The microcoded Controller makes our ISA highly modular and flexible.

Fig. 2. Proposed instruction set architecture for pPIM.

B. ISA Connectivity

The connectivity pattern of the ISA units with the pPIM clusters in a DRAM bank is shown in Figure 3. Only the vertically aligned pPIM clusters are capable of having data communications via the interlinked bitlines. This essentially results in a number of 'Process Threads' consisting of vertically aligned clusters only. Each Process Thread can run in parallel and perform identical operations on entirely different sets of data-operands. This makes it a perfect candidate for implementing a SIMD processing model inside a pPIM bank. Therefore, our ISA units are designed to receive a stream of instructions from a host CPU and, in turn, drive a group of clusters each of which belongs to a different Process Thread, as shown in Figure 3.

Fig. 3. Interfacing of the proposed ISA with pPIM clusters in a bank and the host CPU.

Fig. 4. Microcoded Control-Word format (its fields include the A/B routing and enable signals, output routing, register routing and enable signals, and the Halt bit).
Fig. 6. Sequential model of a MAC operation implemented on the pPIM architecture. The inputs 'a' and 'b' indicate input to the cluster from memory, with the high and low segments of that memory dictated by the subscripts 'H' and 'L', respectively. Input and output of the accumulator is dictated by A, with the segment represented by the following number.

C. Controller/Sequencer Design

The Controller/Sequencer unit consists of a Program Counter and a microcode table that drives the control bus. The microcode table is a 2-D array of memory cells where each row contains one complete control-word. The structure of a control-word is shown in Fig. 4. Each control word contains ISA dataflow control signals, routing signals for the pPIM cluster, and register-enable signals for the registers contained within the clusters. A control word also contains a 1-bit 'stop' signal that denotes the end of an instruction. This bit resets the program counter so that it can terminate the ongoing instruction and fetch the next instruction sent from the host CPU. These signals total up to a length of 120 bits, i.e. a 15-byte control word.

In its default state, the Program Counter (PC) points to a control word that sets the cluster to an 'idle state' and waits for further instructions from the host CPU. A valid instruction from the host sets the program counter to point at the initiating control word in the microcode table. In the following clock cycles, the program counter progresses through the next consecutive control words in the table, one by one. This progression continues until it reaches a control word with the 'stop' bit set to high. In such a case, the counter is reset either to the idle state or to the initiating control word. The latter is performed when there are more instructions waiting in the queue to be executed.
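This control flow can be summarized in a short software model; the following is a minimal sketch, assuming the 120-bit control word is modeled as a bitset with the 'stop' bit at an arbitrarily chosen position and the idle-state entry at address 0 (both assumptions, not specified in the paper):

```cpp
#include <bitset>
#include <cstddef>
#include <vector>

// Minimal software model of the microcoded Controller/Sequencer:
// the Program Counter walks consecutive control-words until it sees
// one with the 'stop' bit set, then resets to the idle state.
constexpr std::size_t kStopBit = 119;      // assumed position of the 1-bit 'stop' signal
using ControlWord = std::bitset<120>;      // 120-bit (15-byte) control word

struct Sequencer {
    std::vector<ControlWord> microcode;    // the table supports up to 128 control-words
    std::size_t pc = 0;                    // entry 0 assumed to be the idle state

    // Execute one instruction: start at its entry control-word and
    // step through consecutive words until the stop bit is reached.
    void run(std::size_t entry) {
        pc = entry;
        while (true) {
            const ControlWord& cw = microcode[pc];
            // (drive routing / register-enable / memory signals from cw here)
            if (cw[kStopBit]) break;       // end of instruction
            ++pc;                          // advance to the next control-word
        }
        pc = 0;                            // synchronous reset to the idle state
    }
};
```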
D. Instructions

The pPIM ISA features a fixed-length 24-bit instruction format shown in Figure 5. The instruction-word has two distinct segments: the upper 8-bit segment is dedicated to the execution of different operations while the lower 10-bit segment is dedicated to accessing the memory. The additional 6 bits are left blank for facilitating further expansion of the functionality if required. The most significant 8-bit segment consists of a 2-bit sub-segment for defining the type of instruction, accompanied by a 6-bit pointer value. This segment of the instruction-word can be set to any of the three possible instructions: 'PROG', 'EXE', and 'END'. A PROG instruction reprograms a core identified by the pointer bits with new functionality. This is done by re-writing the latch/register files in that core. An EXE instruction is used for initiating a particular operation inside the cluster. The EXE instruction causes the Program Counter to jump to a specific control-word location in the microcode table that designates the first of a set of consecutive control-words associated with that operation. During several following clock cycles, the Program Counter increments through a number of consecutive control-words, until it reaches the final one, which is an END instruction. It designates the end of the ongoing operation by scheduling a synchronous reset of the Program Counter. The END instruction also synchronously resets the computing registers located inside the clusters (excluding the latch/register arrays inside the LUTs).

The lower 10 bits of the instruction are dedicated to managing any sort of memory accesses made by the pPIM clusters. This segment contains a read bit, a write bit, and a 9-bit row address pointer for pointing at any of the 512 memory rows in a subarray. By setting the read bit to High, data can be read from the specified row of the subarray into the read buffer via the bitlines. Conversely, by setting the write bit to High, the contents of the write buffer can be dispatched to a subarray row designated by the 9-bit row address segment of the instruction. In the event that both the read and write bits are high, the read bit is given priority: the data will first be read from the subarray row buffer by the cluster read buffer. Then the data contained in the write buffer of the cluster is written into the memory.
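The bit packing just described can be expressed directly in code; the following is a minimal sketch, assuming the field positions read off Figure 5 (reproduced just below): a 2-bit opcode and 6-bit pointer in the upper segment, RD/WR bits and a 9-bit row address in the lower segment, with the unused expansion bits held at zero:

```cpp
#include <cstdint>

// Sketch: pack one 24-bit pPIM instruction word. The field positions
// are read off Figure 5 and are therefore assumptions: op in bits
// [18:17], pointer in [16:11], RD in bit 10, WR in bit 9, row address
// in [8:0]; the remaining upper bits (the expansion field) stay zero.
enum Op : uint32_t { NOP = 0b00, PROG = 0b01, EXE = 0b10, END = 0b11 };

uint32_t encode(Op op, uint32_t ptr, bool rd, bool wr, uint32_t row) {
    return ((op  & 0x3u)   << 17) |   // 2-bit instruction type
           ((ptr & 0x3Fu)  << 11) |   // 6-bit READ/CORE pointer
           ((rd ? 1u : 0u) << 10) |   // read bit
           ((wr ? 1u : 0u) <<  9) |   // write bit
           (row & 0x1FFu);            // 9-bit row address (512 rows)
}

// Example: initiate the operation at microcode entry 3 while reading row 42.
// uint32_t instr = encode(EXE, 3, /*rd=*/true, /*wr=*/false, 42);
```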
Fig. 5. Instruction-word format:
Op encoding: 00 = NoOp, 01 = PROG, 10 = EXE, 11 = END
Bits 18-11: Operation (2-bit encoding) and READ Ptr. / CORE Ptr. (6-bit pointer)
Bits 10-0: RD bit, WR bit, and 9-bit ROW ADDRESS

E. Operation Mapping

Our proposed ISA is capable of orchestrating the routing of data-operands among the cores in a pPIM cluster in several consecutive steps to perform complex operations.

TABLE I. Microcode sequences: the portion of each control word (hex) responsible for the routing pattern of each stage of the 8-bit MAC operation.
MAC, stage 1: 00 00 00 11 8C 20 0F 00 00 00 4E 53 90 3C 00
MAC, stage 2: 00 00 80 60 00 01 50 80 04 02 80 00 05 42 20
MAC, stage 3: 30 D2 24 C0 00 01 F0 88 30 0C 00 00 05 42 21
MAC, stage 4: 18 00 02 00 00 01 10 C3 50 97 80 00 07 C2 30
MAC, stage 5: 30 06 1F 00 00 01 70 40 1D 88 80 00 04 C2 32
MAC, stage 6: 18 0C 01 E0 00 01 50 C0 14 0C 00 00 05 42 00
MAC, stage 7: 30 C0 00 00 00 01 80 40 00 00 00 00 04 02 04
MAC, stage 8: 00 00 00 00 00 00 00 03 80 00 00 00 02 01 C0
MAC, stage 9: 80 00 00 00 00 00 00 00 00 00 00 00 00 01 C8

Fig. 7. Instruction protocol for initiating MAC operations in the pPIM clusters: memory reads of the operands (RD MEM of A[0:7], B[0:7], A[8:15], B[8:15]) are interleaved with MAC execution, each launch expanding into the nine microcode steps MAC[0]-MAC[8], and the chain finishes with an END plus a write-back of the result Y[0:7] (WR MEM).

We demonstrate this capability with the example of an 8-bit unsigned Multiply-and-Accumulate (MAC) operation. The 8-bit MAC operation is the most frequently performed operation during the execution of a convolutional or fully connected layer of a CNN. The process begins with programming all the cores with the functionalities required for the operation. The programming is performed by transporting eight 256-bit function-words to the latch/register file of a particular core. These function-words cover all the possible outcomes of the 8-bit operation. For the 8-bit MAC operation, four cores are programmed as 4-bit multipliers and the other five cores are programmed as 4-bit adders. The upper 4 bits of the output of an adder core contain the zero-padded carry-out.

The 8-bit inputs, A & B, of the MAC operation are each split into pairs of 4-bit segments, A_H, A_L & B_H, B_L respectively. Partial products V_0-V_3 are generated from cross multiplication of these four 4-bit operands:

V_0 = A_L * B_L  (1)
V_1 = A_L * B_H  (2)
V_2 = A_H * B_L  (3)
V_3 = A_H * B_H  (4)

These 8-bit partial products are then aggregated in seven consecutive steps, as shown in Figure 6, to perform the MAC operation on the input pair. The 8-stage routing pattern shown in Figure 6 is encoded in the eight consecutive control-words dedicated to this operation. Table I shows the portion of each of these control words responsible for the routing patterns.
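To make the decomposition concrete, the following is a small software model of how the four partial products recombine into the full 8x8-bit product. The shift-and-add weighting follows from the segment positions; the aggregation order below is illustrative and not the exact seven-step routing of Figure 6:

```cpp
#include <cstdint>

// Software model of the multiply underlying the 8-bit MAC mapping:
// split each operand into 4-bit halves, form the four cross products
// V0..V3 (each the 8-bit output of a 4-bit multiplier core), then
// recombine them with positional weights 2^0, 2^4, 2^4, and 2^8.
uint32_t mul8_via_partial_products(uint8_t a, uint8_t b) {
    uint8_t aL = a & 0x0F, aH = a >> 4;
    uint8_t bL = b & 0x0F, bH = b >> 4;

    uint16_t V0 = aL * bL;   // (1)
    uint16_t V1 = aL * bH;   // (2)
    uint16_t V2 = aH * bL;   // (3)
    uint16_t V3 = aH * bH;   // (4)

    // a * b = V3*2^8 + (V1 + V2)*2^4 + V0; on pPIM these additions are
    // carried out by the five 4-bit adder cores over seven steps.
    return (uint32_t(V3) << 8) + ((uint32_t(V1) + V2) << 4) + V0;
}
// A MAC then adds this product into the cluster's 16-bit accumulator.
```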
The fetching of the data-operands required for the MAC operation precedes the actual MAC operation. A memory read instruction reads data-operands from the memory row and stores them in the cluster read-buffer. Once the data-operands are available, the execution of the MAC operation is initiated by reading the first control-word of the MAC operation from the microcode table via the control bus. During the next seven clock cycles, the seven following control-words for the MAC operation are read one by one. While performing the MAC operation, the ISA is also able to perform a write-back operation for depositing the results from the previous round of operations. Once finished, the output of the MAC operation is forwarded to the write-buffer, from which it is written back into a memory row later on. For the case of a chain of consecutive MAC operations, more memory read requests can be executed prior to a write-request. This protocol is outlined in Figure 7.

F. Compatible Operations

We further demonstrate the versatility of the ISA through the mapping of additional operations. Table II lists the number of steps/clock cycles required for various operations. This includes both the microcode sequences and the core configurations for programming the pPIM cores. The size of the microcode sequences is determined by the number of control words within the sequence. The size of the core configurations is determined by the number of unique functions that are used for an operation in a cluster. This demonstrates the ability of this architecture to be reconfigured for not only various applications but also precision scaling.

We also map various linear algebraic operations on the pPIM architecture, which are similar to how convolutions are performed for ML applications. In this calculation, each pPIM cluster in the architecture can be dedicated to an output value in the resulting matrix. Therefore the throughput is directly proportional to the number of pPIM clusters engaged in parallel on a particular operation.

TABLE II. pPIM-compatible operations
Operation | No. of Steps | No. of Different Core Configurations
Unsigned MAC (8-bit) | 9 | 2
Unsigned MAC (4-bit) | 5 | 2
Signed MAC (8-bit) | 13 | 5
ReLU (16-bit) | 4 | 1
ReLU (8-bit) | 2 | 1
Max Index (16-bit) | 13 | 4
Max Index (8-bit) | 7 | 4

G. Compiler Support

A low-level compiler is required to integrate the proposed ISA into the host system. The compiler performs two primary functions. First, it converts a program written in a high-level language (e.g. Python) into a stream of instruction-words in the format shown in Figure 5. Second, it optimizes the distribution of those instructions among the pPIM clusters in such a way that minimizes the overhead associated with accessing and transporting the data-operands from the memory subarrays. Figure 8 gives an overview of the interaction of such a Compiler with the proposed ISA.

In order to generate the instructions, the Compiler translates the high-level code blocks into corresponding machine codes. These codes are then appended with the addresses to the memory rows containing the operands to form the instructions.

An optimization algorithm is envisioned that tries to ensure that each instruction is executed in a pPIM cluster located next to the memory subarray that contains the corresponding operands. When such a case is not possible, the data-operands are to hop across multiple subarrays via the subarray-interlinks. The compiler sends subarray-interlink controlling bits to the ISA units for this purpose.
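A toy emitter illustrates the shape of the instruction stream such a compiler would produce for a chain of MAC operations. This sketch reuses the hypothetical encode() helper from the earlier sketch and follows the read/execute/write-back protocol of Figure 7; combining an EXE with a memory read in one word, and the row numbers and pointer values, are illustrative assumptions:

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Assumed from the earlier sketch (not part of the paper's ISA spec).
enum Op : uint32_t { NOP = 0b00, PROG = 0b01, EXE = 0b10, END = 0b11 };
uint32_t encode(Op op, uint32_t ptr, bool rd, bool wr, uint32_t row);

// Emit a toy instruction stream for a chain of MAC operations:
// read the two operand rows, launch the MAC microcode sequence,
// and end the chain with a write-back of the accumulated result.
std::vector<uint32_t> emit_mac_chain(
        const std::vector<std::pair<uint32_t, uint32_t>>& operand_rows,
        uint32_t mac_entry_ptr,   // assumed pointer to the MAC entry in the microcode table
        uint32_t result_row) {
    std::vector<uint32_t> stream;
    for (auto [rowA, rowB] : operand_rows) {
        stream.push_back(encode(NOP, 0, true, false, rowA));              // RD MEM: operand A
        stream.push_back(encode(EXE, mac_entry_ptr, true, false, rowB));  // RD MEM + MAC
    }
    stream.push_back(encode(END, 0, false, true, result_row));            // END + WR MEM: result
    return stream;
}
```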
the performance of the ISA-pPIM setup for CNN inferences
Fig. 8. Overview of the functionality of the proposed Compiler.

V. RESULTS

In this section, we discuss the performance of the ISA in conjunction with the pPIM architecture.

A. ISA and pPIM architecture characteristics

The ISA and pPIM are characterized using post-synthesis models using the Synopsys Design Compiler at the 28nm technology node. The results of the hardware synthesis of the ISA and the pPIM architecture are outlined in Table III. It can be observed that the area overhead from the ISA is minimal thanks to the microcode-based implementation of the Control Unit. The microcode table can support 128 control-words, each of 120 bits, and contributes only 15.24 μm² of area overhead. Moreover, since each ISA unit is in charge of multiple (eight) pPIM clusters, the incremental area overhead from the inclusion of the ISA is minimal (<1%). The operational clock speed of the ISA is determined by the critical latency of a pPIM Core (0.8 ns).

TABLE III. Synthesis results
Component | Delay (ns) | Power (mW) | Active Area (μm²)
PIM ISA | 0.549 | 0.155 | 968.16
PIM Core | 0.8 | 2.7 | 4616.85
PIM Cluster (MAC Operation) | 6.4 | 5.2 | 41551.66

B. System-Level Performance Evaluation

We evaluate the performance of the pPIM architecture equipped with our proposed ISA. This involves a pPIM configuration with eight parallel process threads in a bank [12], where each ISA unit is in charge of eight parallel SIMD clusters. We further scale up the evaluations for configurations with a higher number of clusters (256 & 512) [11], [12]. We evaluate the pPIM ISA for DL applications, matrix and vector operations, as well as bit-wise logical operations, to demonstrate the flexibility of the microcode-based ISA in the next subsections.

1) Performance with DL applications: The evaluation of the DL applications on the pPIM architecture is performed both for 8-bit and 4-bit fixed point precisions. We compare the performance of the ISA-pPIM setup for CNN inferences with several other computing architectures. This includes high-end computing devices such as the Intel Knights Landing server CPU (KNL) and the Nvidia Tesla P100, the AI accelerators Edge TPU & Intel Neural Compute Stick 2 [17], as well as two contemporary PIM architectures, DRISA [5] and LAcc [16].

Fig. 9 presents the comparison of throughput and power consumption respectively for AlexNet inferences on these devices. It can be observed that the two AI accelerators and the PIM devices, in general, outperform the general-purpose computing architectures of the CPU and the GPU by a huge margin, both in terms of performance and power efficiency. However, the PIM devices also perform noticeably better than the AI accelerators, TPU and Neural Compute Stick 2, at a similar range of power consumption. This highlights the merit of a PIM-based hardware solution for AI acceleration, which effectively eliminates the performance and energy bottleneck from the data communications.

Among the PIM devices, LAcc [16] and pPIM, both of which are Look-up Table (LUT) based architectures, offer very high performance for the least amount of power consumption. DRISA, which is a DRAM-based bitwise processing accelerator, outperforms LAcc and the pPIM-256 configuration, albeit at a significantly higher power consumption rate. pPIM equipped with our ISA achieves slightly better performance at a slightly lower rate of power consumption than LAcc. This is possible due to the comparatively more efficient, clustered LUT-based architecture of the pPIM, which also enables our ISA to perform massively parallel operation mapping. pPIM achieves nearly double the performance for inferences with 4-bit fixed point precision (pPIM-256A) compared to the 8-bit fixed-point operation mode. We further scaled up the pPIM architecture to 512 clusters (pPIM-512), which achieves a similar level of performance as the pPIM-256A.

Fig. 9. Comparison of (a) throughput (Frames/s) and (b) power consumption (Watts) of the ISA-equipped pPIM architecture with general-purpose processors, AI accelerators and other PIM architectures for AlexNet inferences.

Fig. 10. Inference throughput and total energy consumption for 4-bit and 8-bit precision inferences of several CNN algorithms by the pPIM architecture equipped with the proposed ISA.

Fig. 12. Evaluation of throughput (GOPs/s) for bitwise logic operations such as AND/NAND, OR/NOR, NOT, XOR/XNOR with different data-precision (4-bit to 32-bit) on 256 pPIM clusters equipped with the proposed ISA, as well as energy consumption (pJ) per operation.

Alongside AlexNet, we evaluate the performance of the pPIM-256 and pPIM-256A configurations for four other CNN algorithms: ResNet 18, ResNet 34, ResNet 50, and VGG 16. The performance throughput and total energy consumption for these inferences are presented in Figure 10. It can be observed that the maximum inference throughput and the least energy consumption are achieved for the VGG 16 inferences for both setups, due to the least computational workload from this algorithm. Overall, the efficient mapping of operations by our ISA across the parallel processing threads inside the pPIM architecture enables us to achieve a superior CNN inference throughput and energy-efficiency from this device than all the other devices in comparison.

2) Performance with matrix operations: The proposed ISA is also capable of implementing linear algebraic operations such as multiplications and additions of large-scale vectors and matrices on the pPIM architecture. For example, large-scale matrix multiplications can be performed by leveraging the same MAC Operation configuration used for CNN inferences, without any further reprogramming of the pPIM cores. We evaluate the performance of the pPIM architecture equipped with the proposed ISA for several linear algebraic operations. Our evaluations include the latency of addition and multiplication operations on a wide range of dimensions of matrices and vectors with 8-bit and 4-bit data-precision, as shown in Figure 11. Both the matrices (N×N) and the vectors (1×N) are represented on the same axis in terms of the value of N. A step-wise growth of the operational latency with the increase of the dimension of the matrices/vectors is visible in Figure 11. This can be traced back to the 256 pPIM Cluster configuration that has been used for performing this simulation. Since the PIM can perform up to 256 computations simultaneously at a time, additional calculations only impact the performance of the architecture once it reaches an additional 256 computations.
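This step-wise behavior can be captured with a one-line batching model; the following is a sketch under the assumption that each of the N² output elements occupies one cluster and the 256 clusters operate in lock-step:

```cpp
#include <cstdint>

// Step-wise latency model: with 256 clusters and one output element
// per cluster, an NxN result needs ceil(N*N / 256) sequential batches,
// so latency jumps only when N*N crosses a multiple of 256.
uint64_t mac_batches(uint64_t n, uint64_t clusters = 256) {
    return (n * n + clusters - 1) / clusters;   // ceiling division
}
// e.g. mac_batches(16) == 1 but mac_batches(17) == 2: the steps in Figure 11.
```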
Fig. 11. Evaluation of throughput of calculating various linear algebra operations on the pPIM architecture (latency in seconds vs. matrix/vector dimension N, for scalar, vector, and N×N matrix multiplication and matrix addition, at 8-bit and 4-bit precision).

3) Performance with bit-wise operations: The LUT-based architecture of pPIM can support any bit-wise logic operation, e.g. bitwise inversion, AND/NAND, OR/NOR, XOR/XNOR, etc. This is performed by programming each LUT-core of a pPIM cluster with an identical set of function-words corresponding to the specific bitwise operation. The programming is performed using a PROG instruction in the proposed ISA. Although these operations have a single-bit granularity, each LUT-core performs these operations on 4-bit segments of operands. By combining operations across multiple LUT-cores in parallel, bitwise operations can be performed on larger data operands. Figure 12 shows the performance evaluation of bitwise logic operations in the pPIM architecture with the 256-cluster configuration for operands ranging from 4-bit up to 32-bit precision. For lower precisions of operation, such as 4-bit or 8-bit, multiple bitwise operations are performed in parallel across the LUT-cores in a cluster for improved throughput. It can be observed that the pPIM architecture can achieve very high throughput (up to 2880 GOPs/s) at low energy consumption (2.16 pJ/OP) for the 4-bit bitwise operations, thanks to the single clock cycle of operational latency, irrespective of how complex the bitwise operation is.

VI. CONCLUSIONS

In this paper, we propose an ISA for reconfigurable and programmable PIM architectures through the use of microcode sequences that promote the adaptability of PIMs. The full capability of the ISA design is demonstrated in cooperation with the pPIM architecture, which delivers very high throughput and re-programmability at a low area overhead. We evaluate the ISA by comparing the combined pPIM and ISA power figures against other PIM architectures. The ISA is capable of instructing programmable processing elements at a negligible cost to the performance of the PIM, in a small form factor, so as not to detract from the DRAM architecture. The ISA also demonstrates a level of adaptability through the reformatting of microcode control words that can be reassigned to other existing PIMs.

REFERENCES
[1] S. L. et al., "Scaling the 'memory wall': Designer track," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2012.
[2] H. S. Stone, "A logic-in-memory computer," IEEE Transactions on Computers, vol. C-19, no. 1, pp. 73–78, Jan 1970.
[3] S. Bavikadi, P. R. Sutradhar, K. N. Khasawneh, A. Ganguly, and S. M. Pudukotai Dinakarrao, "A review of in-memory computing architectures for machine learning applications," ser. GLSVLSI '20. New York, NY, USA: Association for Computing Machinery, 2020, pp. 89–94. [Online]. Available: https://doi.org/10.1145/3386263.3407649
[4] D. P. et al., "Intelligent RAM (IRAM): the industrial setting, applications, and architectures," in Proceedings International Conference on Computer Design: VLSI in Computers and Processors, Oct 1997, pp. 2–7.
[5] S. L. et al., "DRISA: A DRAM-based reconfigurable in-situ accelerator," in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017.
[6] Q. D. et al., "DrAcc: a DRAM based accelerator for accurate CNN inference," in ACM/ESDA/IEEE Design Automation Conference (DAC), 2018.
[7] S. Li et al., "SCOPE: A stochastic computing engine for DRAM-based in-situ accelerator," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Oct 2018, pp. 696–709.
[8] C. E. et al., "Neural cache: Bit-serial in-cache acceleration of deep neural networks," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), June 2018, pp. 383–396.
[9] J. Ahn, S. Yoo, O. Mutlu, and K. Choi, "PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture," in 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA), June 2015, pp. 336–348.
[10] S. Angizi and D. Fan, "ReDRAM: A reconfigurable processing-in-DRAM platform for accelerating bulk bit-wise operations," in IEEE/ACM International Conference on Computer-Aided Design (ICCAD), 2019.
[11] P. R. Sutradhar, M. Connolly, S. Bavikadi, S. M. Pudukotai Dinakarrao, M. A. Indovina, and A. Ganguly, "pPIM: A programmable processor-in-memory architecture with precision-scaling for deep learning," IEEE Computer Architecture Letters, vol. 19, no. 2, pp. 118–121, 2020.
[12] P. R. Sutradhar, S. Bavikadi, M. Connolly, S. K. Prajapati, M. A. Indovina, S. M. Pudukotai Dinakarrao, and A. Ganguly, "Look-up-table based processing-in-memory architecture with programmable precision-scaling for deep learning applications," IEEE Transactions on Parallel and Distributed Systems, pp. 1–1, 2021.
[13] V. S. et al., "Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM technology," in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2017.
[14] F. Gao, G. Tziantzioulis, and D. Wentzlaff, "ComputeDRAM: In-memory compute using off-the-shelf DRAMs," in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO '52. New York, NY, USA: Association for Computing Machinery, 2019, pp. 100–113. [Online]. Available: https://doi.org/10.1145/3352460.3358260
[15] H. et al., "SIMDRAM: A framework for bit-serial SIMD processing using DRAM (extended abstract)."
[16] Q. D. et al., "LAcc: Exploiting lookup table-based fast and accurate vector multiplication in DRAM-based CNN accelerator," in ACM/IEEE Design Automation Conference (DAC), 2019.
[17] "Edge TPU performance benchmarks," Coral, November 2020. [Online]. Available: https://coral.ai/docs/edgetpu/benchmarks/
