
2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

Streamline Ring ORAM Accesses through Spatial and Temporal Optimization

Dingyuan Cao∗†, Mingzhe Zhang†, Hang Lu†, Xiaochun Ye†‡, Dongrui Fan†, Yuezhi Che§ and Rujia Wang§
∗ Tsinghua University, Beijing, China
† State Key Laboratory of Computer Architecture, ICT, CAS, Beijing, China
‡ State Key Laboratory of Mathematical Engineering and Advanced Computing, China
§ Illinois Institute of Technology

[email protected], {zhangmingzhe, luhang, yexiaochun, fandr}@ict.ac.cn, [email protected], [email protected]

Abstract—Memory access patterns could leak temporal and spatial information in a sensitive program; therefore, obfuscated memory access patterns are desired from the security perspective. Oblivious RAM (ORAM) has been the favored candidate to eliminate access pattern leakage by randomly remapping data blocks around the physical memory space. Meanwhile, accessing memory with ORAM protocols results in significant memory bandwidth overhead: for each memory request, after going through the ORAM obfuscation, the main memory needs to service tens of actual memory accesses, and only one real access out of them is useful for the program execution. Besides, to ensure the memory bus access patterns are indistinguishable, extra dummy blocks need to be stored and transmitted, which causes memory space waste and poor performance.

In this work, we introduce a new framework, String ORAM, that accelerates Ring ORAM accesses with spatial and temporal optimization schemes. First, we identify that dummy blocks can significantly waste memory space and propose a compact ORAM organization that leverages the real blocks in memory to obfuscate the memory access pattern. Then, we identify the inefficiency of current transaction-based Ring ORAM scheduling on DRAM devices and propose an effective scheduling technique that can overlap the time spent on row buffer misses while ensuring correctness and security. With minimal modification to the hardware and software, and negligible impact on security, the framework reduces execution time by 30.05% and memory space overhead by up to 40% compared to the state-of-the-art bandwidth-efficient Ring ORAM.

Index Terms—Ring ORAM, Performance, Space Efficiency, Security

(Mingzhe Zhang, Rujia Wang and Dingyuan Cao have equal contribution. This work was performed while Dingyuan Cao was an undergraduate research intern at ICT, CAS. This work is supported in part by National Natural Science Foundation of China grants No. 62002339, No. 61732018, the Strategic Priority Research Program of the Chinese Academy of Sciences under grant No. XDB44030200, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (No. 2019A07) and the CARCH Innovation Project (CARCH4506).)

I. INTRODUCTION

As protecting data security and privacy becomes increasingly critical, modern computing systems have started to equip trusted hardware to protect the computation and data from various attacks. For example, we see industry-standard trusted computing modules such as the Trusted Platform Module (TPM) [1], eXecute Only Memory (XOM) [2], Trusted Execution Technology (TXT) [3], Intel SGX [4], AMD SME [5], and ARM TrustZone [6], as well as academic secure processor prototypes such as AEGIS [7] and ASCEND [8]. The current trusted computing base (TCB) is still small and limited: it includes part of the processor plus a portion of the memory space. The majority of components in the system, such as the main memory, storage devices, and the interconnections, are still vulnerable to various attacks. For example, if the memory access patterns are leaked to the adversary, the program that is being executed may be inferred through Control Data Flow Graph reconstruction [9]. Researchers have also discovered that on a remote data storage server with searchable encryption, access patterns can still leak a significant amount of sensitive information [10]. Some recent attacks show that neural network architectures [11] and RSA keys [12] can be reconstructed through the memory access patterns.

To completely remove the potential leakage through the memory access pattern, we need to obfuscate the access patterns that the adversary may observe. Oblivious RAM (ORAM), as a cryptographic approach, provides a complete set of access and remap operations that can randomly move the data blocks in the memory to a different physical address after each access [13]. The basic idea of ORAM is to utilize dummy blocks for obfuscation. For each memory access, dummy blocks are fetched together with the real block. After each memory access, the location of the accessed block is reassigned so that the temporal and spatial access pattern can be hidden. As a result, an outside attacker cannot infer the type of access or whether the user is accessing the same data repeatedly. Because of the redundancy of dummy blocks, ORAM comes with a high overhead, which motivates ORAM designers to optimize its theoretical performance. Through decades of advances in cryptography, tree-based ORAMs show great potential to be adopted in main memory systems with relatively efficient bandwidth and storage overhead. For example, Path ORAM [14] translates one memory access into a full path operation, and has been made into a hardware prototype [15] and integrated with SGX [16]. Further, Ring ORAM [17] optimizes the online data read overhead by selectively reading blocks along the path, and reduces overall access bandwidth by 2.3× to 4× and online bandwidth by more than 60× relative to Path ORAM [17].

While Ring ORAM is efficient in terms of theoretical bandwidth overhead (log(N), where N is the number of data blocks in the ORAM tree), when implemented on a real memory system, we identify that both space and access efficiency require further optimization. Since we need to reserve and store abundant dummy blocks in the memory or storage devices, the effective utilization rate of off-chip memory decreases significantly. Also, Ring ORAM accesses show biased locality during different access phases, even when the optimal subtree layout is applied. The read path operation selectively reads blocks along a path, leading to a relatively low row buffer hit rate; the eviction operation reads and then writes a full path and shows much better utilization of the subtree layout. As a result, implementing Ring ORAM on main memory devices requires further optimizations with hardware implications to minimize additional overhead.

This motivates us to rethink the existing ORAM design: can we achieve access and space efficiency while ensuring the memory access pattern is well obfuscated? In this work, we propose String ORAM with a set of schemes aimed at achieving such goals. We first quantitatively analyze the memory space waste of state-of-the-art ORAMs due to dummy blocks. Then, we dig into the behavior of DRAM banks during each ORAM access and find the opportunity to squeeze more commands into one access. Based on these observations, we propose several schemes, which can: 1) minimize the memory space waste and reshuffle overhead by reusing real data blocks, 2) improve the ORAM operation performance by reducing memory bank idle time, and 3) achieve both space and access efficiency with a fully integrated architecture. Our contributions are as follows:

• We present an in-depth study of the memory space and access inefficiency caused by dummy data blocks in current ORAM designs.
• We propose innovative protocol-side modifications that can hide the access pattern by utilizing the existing massive real data blocks. The more compact protocol reduces memory space utilization inefficiency.
• We propose a slight modification to the DRAM command scheduler, which can issue PRE and ACT commands in advance. This helps to minimize DRAM bank idleness and return real blocks faster.
• We combine the two optimization approaches through architectural integration and evaluate our String ORAM framework against state-of-the-art optimizations.
• We show the evaluation results of improvement in performance, queuing time, and row buffer miss rate. We also evaluate the hardware modification overhead and the security of our design.

II. BACKGROUND

A. Threat Model

In this work, the system is equipped with a secure and tamper-resistant processor, which is capable of computing without information leakage [14], [17]–[19]. The off-chip memory systems are vulnerable to access pattern attacks, such as physically monitoring the visible signals on the printed circuit boards (including the motherboard and memory modules). With commodity DRAM DIMMs in the system, the address bus, the command bus, and the data bus are separate. The memory controller sends out pairs of addresses and data to the DRAM with the corresponding DRAM command, such as precharge, read/write, and activate. Therefore, the attacker can still sniff critical information through the address and command bus, even when the data bus is encrypted. By observing access patterns such as access frequency, access type (read or write), and the repeatability of accesses to the same location, the attacker can obtain leaked sensitive information from the program [10].

With emerging memory technologies and interfaces, adding additional trusted components to the memory DIMMs or data path can partially reduce the attack surface, therefore reducing the need for security protection. For example, Secure DIMM [20] assumes that the entire ORAM controller can be moved from the processor to the memory module side. D-ORAM [21] assumes that the memory module is still not trusted, but that we can leverage a secure delegator on the motherboard to facilitate the ORAM accesses. Meanwhile, ObfusMem [22] and InvisiMem [23] assume that the entire memory modules are trusted with logic inside and only care about the access pattern in between. However, such designs always require substantial modifications to the memory hardware or interface, which are less general approaches to hiding the memory access pattern. Therefore, in our threat model, we still assume the memory device is outside the trusted boundary.

B. Basics of ORAM

Oblivious RAM [13] is a security primitive that can hide the program's access pattern and eliminate the corresponding information leakage. The basic idea of ORAM is to access more data blocks than the actual data we need, and to shuffle the address space so that the program addresses appear to be random. With the ORAM controller in the secure processor, one memory access from the program is translated into an ORAM-protected sequence. The ORAM protocol guarantees that any two ORAM access sequences are computationally indistinguishable. In other words, the ORAM physical access pattern and the original logical access pattern are independent, which hides the actual data address with the ORAM obfuscation. Since all ORAM access sequences are indistinguishable, an attacker cannot extract sensitive information through the access pattern.

Tree-based ORAM schemes, such as Path ORAM [14] and Ring ORAM [17], have greatly improved the overall access and reshuffle efficiency through cryptographic innovations. Tree-based ORAM schemes are also the building blocks of several advanced ORAM frameworks, such as Obliviate [24], Taostore [25] and Zerotrace [16]. In this work, we focus on one of the most bandwidth-efficient tree-based ORAMs, Ring ORAM [17].
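To make this invariant concrete, the following is a minimal sketch of the controller-side bookkeeping (our illustration with assumed names, not code from the paper; encryption, the stash, and the actual path fetch are protocol-specific and omitted): every access consumes the block's current leaf label and immediately assigns a fresh, independent random one, so repeated accesses to the same block touch unrelated paths.

    #include <cstdint>
    #include <random>
    #include <unordered_map>

    struct OramFrontend {
        std::unordered_map<uint64_t, uint64_t> pos;  // block id -> leaf label
        std::mt19937_64 rng{2021};
        uint64_t num_leaves = 1ull << 23;            // 2^L leaves (L = 23 later)

        // Protocol-specific: read one block per bucket along the path.
        void fetch_path(uint64_t leaf, uint64_t block) { (void)leaf; (void)block; }

        // One logical access; `pos` is assumed pre-populated at init time.
        void access(uint64_t block) {
            uint64_t leaf = pos[block];        // the block's current path
            fetch_path(leaf, block);           // what the adversary observes
            pos[block] = rng() % num_leaves;   // independent random remap
        }
    };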

[Fig. 1. An example 4-level Ring ORAM structure with Z = 4 and S = 5. (Figure: the binary tree of buckets holding real and dummy blocks with one access path highlighted, next to the ORAM controller components: stash, position map, address logic, and encryption/decryption logic.)]

[Fig. 3. The DRAM-based main memory organization. (Figure: the memory controller with per-channel read/write queues, channels of ranks and banks, and the bank internals: row decoder, wordlines, bitlines, cell array, amplifiers, and row buffer.)]

Figure 1 shows the control and memory layout of Ring ORAM. The memory is organized as a binary tree, which has L + 1 levels: the root of the tree is at level 0 while the leaves are at level L. Each node of the tree is a bucket that can store multiple real and dummy data blocks. All data blocks are encrypted indistinguishably so that an adversary cannot differentiate dummy blocks from real ones. Each leaf node has a one-to-one correspondence to a path ℓ that goes from the root to the leaf node, so there are 2^L paths in total (paths 0, 1, ..., 2^L − 1). On the controller side, the ORAM interface consists of several components: the stash, the position map, the address logic, and the encryption/decryption logic. The stash is a small buffer that temporarily stores data blocks fetched from the ORAM tree. The position map is a lookup table that maps program addresses to data blocks in the tree. In the position map, each data block corresponds to a path id ℓ, indicating that it is situated in a bucket along the path ℓ.

In the Ring ORAM construction, each bucket on a binary tree node has Z + S slots and a small amount of metadata. Of these slots, Z store real data blocks and S store dummy blocks. Figure 2 shows the bucket organization of Ring ORAM. In this example, we have a bucket with Z = 4 and S = 4, and each bucket has additional metadata fields such as index, valid, real, and a counter. The valid bit identifies whether a block has been accessed, the real bit identifies which blocks in the bucket are real blocks, and the counter records how many times the bucket has been accessed. For example, in Figure 2, a dummy block at index 1 (real bit is 0) has been accessed, so its valid bit changes to 0 and the counter increases by 1. For every bucket in the tree, the physical positions of the Z + S real and dummy blocks are permuted randomly when the counter exceeds S.

[Fig. 2. Ring ORAM bucket details (Z = 4, S = 4). (Figure: per-slot index, valid, and real fields plus the per-bucket access counter.)]

The Ring ORAM operations are summarized as below:
• Read path operation reads and decrypts the metadata of all buckets along the path ℓ to determine which bucket contains the block of interest. Then the ORAM controller selects one block to read per bucket along the path. The selection is random and based on the metadata: a block that has been read before cannot be reread (by checking the valid bits). The bucket that contains the block of interest returns the real block, while all other buckets along this path return a dummy block. The blocks accessed in each bucket are marked invalid. (A minimal sketch of this per-bucket bookkeeping follows this list.)
• Eviction operation is issued after every A read path operations. For each bucket along the path, it reads all the Z real blocks, permutes them, and writes Z + S blocks back. The sole purpose of an eviction operation is to push blocks from the stash back to the binary tree. Ring ORAM adopts a deterministic eviction order, the reverse lexicographic order, so that consecutive eviction paths have fewer overlapped buckets [17].
• Early reshuffle is needed to ensure that each bucket is properly randomly shuffled. Each bucket can only be touched at most S times before an early reshuffle or eviction because, after S accesses to a bucket, all dummy blocks have been invalidated. The early reshuffle operation reads and writes buckets that have been accessed S times and resets the metadata fields.
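The per-bucket bookkeeping behind the read path operation can be sketched as follows (our illustration; the field names mirror Figure 2, and a real controller selects the dummy according to the bucket's secret permutation rather than by a linear scan):

    #include <cstdint>

    // One Ring ORAM bucket with Z real and S dummy slots (Z = S = 4 as
    // in Figure 2). All fields are (re)initialized by a shuffle.
    struct Bucket {
        static constexpr int Z = 4, S = 4;
        uint8_t  valid[Z + S];   // 1 = slot not read since the last shuffle
        uint8_t  real[Z + S];    // 1 = real block, 0 = dummy
        uint64_t index[Z + S];   // block id held by each real slot
        uint32_t counter;        // reads since the last shuffle

        // Select the slot to read during a read path operation. `wanted`
        // is the block id of interest, or -1 if it lives in another
        // bucket on the path.
        int select_and_invalidate(int64_t wanted) {
            int pick = -1;
            for (int s = 0; s < Z + S; s++) {
                if (!valid[s]) continue;
                if (real[s] && (int64_t)index[s] == wanted) { pick = s; break; }
                if (!real[s] && pick < 0) pick = s;  // remember a live dummy
            }
            valid[pick] = 0;   // a live slot is guaranteed while counter < S
            counter++;         // reaching S forces an early reshuffle
            return pick;
        }
    };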

C. Basics of DRAM

The structure of the DRAM-based main memory system is shown in Figure 3. In modern DRAM, a bank is the finest granularity that can be accessed in parallel (referred to as bank-level parallelism [26]). A group of banks in different DRAM chips forms a rank and operates in lockstep. Finally, several banks are organized into one channel, which shares one set of physical links (consisting of the command, address, and data buses) to communicate with the memory controller. Note that banks in different channels can be accessed in parallel (referred to as channel-level parallelism), while banks in one channel contend for the physical link.

The top-right portion of Figure 3 (in the blue box) shows the structure of a bank. In each bank, a two-dimensional array of DRAM cells is used to store data, and a row buffer consisting of several amplifiers is connected to the array. The array includes a series of rows, each of which can be selected via a wordline. The array also has multiple columns, and the memory cells in each column are connected to an amplifier in the row buffer via a shared bitline. The wordlines and the bitlines together determine the data to be read or written.

The middle-left portion of Figure 3 (in the green box) presents the basic structure of the memory controller. In the memory controller, one read queue and one write queue are allocated for each channel. The memory access requests from the Last Level Cache (LLC) are stored in the corresponding queues according to their target addresses and issued when the target DRAM bank is idle. Note that once the read/write request queues are full, the memory controller stops receiving incoming requests, which probably causes a pipeline stall at the processing core [27].

To serve the memory requests, the memory controller issues different commands to a bank according to its status. The commands are defined as follows:
• Activate (ACT): selects a row and copies the entire row content to the row buffer.
• Read/Write (RD/WR): accesses the data in the row buffer according to the column address.
• Precharge (PRE): de-activates the row buffer and copies its data back to the array.

Note that the ACT command can only be sent to a bank in a precharged state, i.e., after the previous row buffer content is cleared. In general, there are two types of schemes for bank access: the close-page policy and the open-page policy. With the close-page policy, the PRE command is sent immediately after the RD/WR is done. Such a method removes the PRE command from the critical path, but it also misses the opportunity to utilize locality at the row buffer. On the contrary, the open-page policy allows the row buffer to keep its data after the RD/WR commands. In this way, consecutive requests to the same row can be served one after another without PRE and ACT. However, if a request is a row buffer miss, the PRE and ACT must be sent before the data can be accessed. Such a worst-case situation is referred to as a row buffer conflict. Although the open-page policy may cause more latency for the worst-case access, it also has the opportunity to accelerate memory access if the row buffer conflict rate is low. In this paper, we assume that the DRAM modules use the open-page policy.

III. MOTIVATION

In this section, we first discuss the inefficiency of Ring ORAM when it is implemented on a DRAM-based system, from the perspective of low space utilization and high performance overhead. We then present our observations and the optimization opportunities for Ring ORAM.

A. Memory Space Waste Due to Dummy Blocks

As we introduced earlier in Section II-B, the memory organization with ORAM protection is padded with dummy blocks. The excessive dummy blocks in the memory are needed to hide where real data blocks are stored. As a result, the overall memory space utilization is reduced. We calculate the occupied capacity of real and dummy blocks under different Ring ORAM settings, as shown in Figure 4. Here, the parameters Z, S, and A are as described in Section II-B: the number of real blocks in a bucket, the number of dummy blocks in a bucket, and the eviction frequency, respectively. The relationship between S and A is theoretically defined by the equation S = A + X (X ≥ 0) in [17]. With a larger S than A, we can ensure the early reshuffle operations do not happen too frequently. All configurations have been proved to meet the security requirements of Ring ORAM and are the most bandwidth-efficient in theory [17]. To quantitatively show the capacity overhead of dummy blocks, we set the height of the ORAM tree to L = 23 and the block size to 64 Byte.

[Fig. 4. The memory space utilization of Ring ORAM [17] with different configurations. The height of the ORAM tree is 23 (L = 23), and each block is 64 Byte.]

                 Z     A     X     S
    Config-1     4     3     2     5
    Config-2     8     8     4    12
    Config-3    16    20     7    27
    Config-4    32    46    12    58

We observe the following from the experimental analysis:
1) The real block capacity grows linearly from Config-1 to Config-4: as the Z value increases from 4 to 32, the real capacity grows from 4GB to 32GB. This is because Z determines how many real blocks are stored in the ORAM tree.
2) The dummy block capacity grows super-linearly from Config-1 to Config-4, due to the S value increasing from 5 to 58. The reason S needs to be this large is the relationship defined above. Theoretically, these A and S pairs achieve the best overall bandwidth [17]; however, the memory space waste is unacceptable: with Z = 32 and S = 58, the ORAM tree requires 58GB of extra memory space for dummy blocks to provide 32GB of capacity for real blocks. Such a configuration only has a memory space efficiency of 35.56% (the proportion of real block capacity over the total memory capacity allocated to the ORAM).
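As a concrete check of the Config-4 numbers (a worked example derived from the figures above, not additional data): the tree has 2^(L+1) − 1 = 2^24 − 1 ≈ 2^24 buckets, so the real-block capacity is roughly 2^24 × 32 × 64B = 2^35 B = 32GB, the dummy-block capacity is roughly 2^24 × 58 × 64B = 58GB, and the space efficiency is 32 / (32 + 58) ≈ 35.56%.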

B. Row Buffer Conflicts with the Selective Read Operation

Ring ORAM utilizes a tree-based logical structure to ensure its theoretical security and efficiency. When the logical tree is mapped to the physical DRAM memory space, we need to consider how to utilize the parallelism in the memory system during an ORAM access. Also, different address bit striping schemes result in distinct path access patterns. The subtree layout [19] is considered the most efficient address mapping for tree-based ORAM organizations; it maximizes row buffer utilization, especially for full path accesses (such as in Path ORAM). The core idea of the subtree layout is to group the data blocks of a subtree and map them to a row buffer as a whole. As shown in Figure 5(a), the 4-level ORAM tree is horizontally divided into two layers of subtrees. Assuming that each subtree's blocks are in the same row, accessing a full path translates into 2 row accesses (16 memory accesses in total). In this case, only 2 of them are row buffer misses, and the remaining 14 blocks are all fast row buffer hits. The row buffer conflict rate is relatively low in this case.

[Fig. 5. The ineffectiveness of the subtree layout for Ring ORAM. (a) An example 4-level Ring ORAM with subtree layout. (b) Row buffer conflict rate for Ring ORAM with subtree layout, for the read path and eviction phases across the evaluated benchmarks.]

[Fig. 6. The illustration of the idle time for the ORAM based on a 4-bank DRAM: three ORAM access transactions issue precharge, activation, and read/write commands to banks 0–3, leaving idle time on lightly loaded banks.]

Although Ring ORAM is also tree-based, its unique read path operation degrades the benefits of the subtree layout. Only one block per bucket is fetched each time, so the total number of data blocks transferred is reduced. Considering the same tree configuration, as shown in Figure 5(a), a Ring ORAM read operation brings only 4 blocks in total. In this case, half of the accesses are row buffer hits, and the other half are row buffer misses; therefore, the row buffer conflict rate is increased. Such a scenario is exaggerated when we have a multi-channel, multi-bank memory system. Our experiments found that on a four-channel memory system, the row buffer conflict rate during the selective read path operation is significantly higher than during the full path eviction operation. Figure 5(b) illustrates the biased locality on the row buffer during these two distinct phases. During the read path operation, the row buffer conflict rate is around 74%; however, the full path eviction operation has a much lower conflict rate of 10%. Therefore, we find that the subtree layout is exceptionally effective for full path operations, but not enough to accelerate the selective read path operation in Ring ORAM. The read path operation is always a critical operation during execution, so its performance impact is obvious.
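The grouping itself is easy to state precisely. Below is a minimal sketch of the subtree layout idea (our illustration; H is an arbitrary subtree height, and the actual mapping in [19] also folds in the bucket size and DRAM row geometry):

    #include <cstdint>

    constexpr int L = 23;  // leaf level; the tree has L + 1 levels
    constexpr int H = 3;   // levels per subtree (illustrative); a
                           // 3-level subtree packs 7 buckets together

    // Heap index of the bucket at `level` on the path to leaf `path`.
    uint64_t bucket_index(uint64_t path, int level) {
        return (((uint64_t)1 << level) - 1) + (path >> (L - level));
    }

    // Map a bucket to (subtree id, offset inside that subtree). Buckets
    // sharing a subtree id are stored consecutively in one DRAM row, so
    // a full-path access touches only about (L + 1) / H distinct rows.
    struct SubtreeAddr { uint64_t subtree; uint32_t offset; };

    SubtreeAddr subtree_addr(uint64_t path, int level) {
        int top = (level / H) * H;            // level of the subtree root
        uint64_t root = bucket_index(path, top);
        int d = level - top;                  // depth below the root
        uint64_t within = (path >> (L - level)) & ((1ull << d) - 1);
        uint32_t offset = (uint32_t)(((1u << d) - 1) + within);
        return { root, offset };              // root id names the subtree
    }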
C. Idle Bank Time with Transaction-based Scheduling

Next, we discuss how ORAM accesses are translated into DRAM commands and scheduled by the DRAM memory controller. After checking the position map, the ORAM controller generates the physical addresses of the data blocks to be fetched along the selected path. The memory controller then translates the access sequences into memory requests containing the memory commands and addresses that the memory DIMMs are capable of understanding. For conventional programs without ORAM's protection, once the requests are in the memory controller's queue, the PRE, ACT, or RD/WR can be freely scheduled based on bank or channel idleness to maximize performance. However, one single ORAM access now consists of multiple data block accesses, and they must be issued to the memory in order and atomically. We refer to such scheduling as transaction-based scheduling [28], where a transaction means all the memory requests for the same ORAM operation. Figure 6 shows an example of three ORAM accesses to multiple memory banks. Within each ORAM access transaction, the commands include not only the actual RD/WR commands but also the PRE and ACT commands due to bank conflicts.

Algorithm 1: Transaction-based scheduler algorithm
  Input: i: current ORAM access transaction number
         n: current cycle
  Output: issue command to the DRAM module
   1  while not end of the program do
   2      if memory controller can issue a command at cycle n then
   3          check memory command queue;
   4          if there are commands ∈ transaction i then
   5              issue a command based on FR-FCFS;
   6          else
   7              continue;
   8          end
   9      end
  10      if no commands ∈ transaction i then
  11          i++;
  12      end
  13      n++;
  14  end

As a result, when the memory controller is issuing the memory requests, it has to follow the transaction-based timing constraints. The transaction-based scheduling algorithm is described in Algorithm 1. The (i + 1)-th ORAM access must wait for the completion of the i-th access before it is scheduled out to the memory.

We can observe mixed commands sent to random memory channels and banks within each ORAM transaction due to the random selective read path operation. Since the ACT and PRE are also attached to their own ORAM transaction, such commands can only start at the beginning of each transaction when there is a bank conflict. This simple transaction-based scheduling causes abundant wasted time on the memory banks. We define the memory bank idle time as the average duration for which each bank stops receiving memory commands due to the transaction-based scheduling barrier. In Figure 6, we can observe that when some memory banks have a higher workload than the others, the idle banks are not able to issue the ACT or PRE commands even though they are ready to do so, such as bank 1 in ORAM access 1 and bank 2 in ORAM access 2.

To summarize, we identify that current ORAM transaction-based scheduling can cause significant bank idleness, especially when the read path operation causes a high row buffer conflict rate, as explained in the prior section. The PRE and ACT commands do not return any data to the processor. Therefore, if we can free their scheduling from the transaction, we can significantly improve memory bank utilization and overlap the row buffer conflicts.

D. Design Opportunities

Based on the three observations above, we identify the following design opportunities:
1) Typically, Ring ORAM requires that S ≥ A, which provides abundant dummy blocks for the read path operations at the cost of storage waste. If we can reduce S and allow part of the real blocks to be accessed as dummy blocks, the memory space efficiency can be improved significantly.
2) The subtree layout can significantly promote access efficiency under the open-page policy for full path read or write operations. However, it is not efficient for the selective read path operation. Therefore, if we can change the row buffer management scheme for the read path, we are able to minimize the performance impact of the high row buffer conflict rate and the long critical path delay.
3) The transaction-based ORAM scheduling technique ensures the correctness of the ORAM protocol; however, when it comes to command-level scheduling, we find it less desirable to group the PRE and ACT within the current ORAM access transaction. If we can schedule such commands earlier, we have a higher chance to utilize the idle banks and hide the latency caused by row buffer conflicts. In this case, without reducing or changing the number of row buffer conflicts, we preserve the security and correctness of ORAM while improving performance.

The mentioned approaches, in turn, improve the efficiency of ORAM accesses from the spatial and temporal aspects. The next section describes the details of our spatial optimization through a compact ORAM bucket design and our temporal optimization with a proactive bank management scheme.

IV. DESIGN

Our ORAM framework, String ORAM, reduces the wasted memory space, the average memory request queuing time, and the row buffer pollution. The framework consists of: a) a compact ORAM bucket organization and an updated access protocol; b) a new scheduler aimed at reducing the bank idle time caused by transaction-based scheduling; and c) an integrated architecture that can support efficient memory utilization and access for ORAM.

A. Compact Bucket (CB): A Compact ORAM Organization and Access Protocol

Based on our motivations in Section III-A, the majority of the space allocated for an ORAM-protected program stores dummy data blocks, which significantly reduces the usability of the limited main memory space. As shown in Figure 7(a), Ring ORAM reserves S dummy blocks per data bucket so that it can support at most S dummy accesses before a reshuffle operation. Meanwhile, the rest of the Z real blocks may remain untouched if there is no real data access in this bucket.

[Fig. 7. The equivalent Compact Bucket design. (a) Ring ORAM bucket organization. (b) Green Dummy bucket organization: Y of the Z real blocks double as "green" dummy candidates, so only S − Y reserved dummy slots remain, tracked by a green block counter.]

Ideally, we want to minimize the dummy blocks in the bucket. A simple take is to reduce the value of S directly. However, if we only have a few dummy blocks per bucket, reshuffles would happen very frequently, and the overhead would be significant. To reduce the value of S while ensuring reshuffles happen at a similar frequency, our idea is to borrow the real blocks that are already in the bucket and treat them as green blocks, as shown in Figure 7(b).

The Compact Bucket (CB) organization in Figure 7(b) can support accesses to the bucket equivalent to those in (a). Here, we reduce the number of reserved dummy blocks to S − Y, where Y is the number of real blocks in the bucket that can be served as dummy data during a read path access. We define such blocks as green blocks, and the value Y as the CB rate. Therefore, with the help of Y green blocks, we can achieve the same number of operations per bucket as in Figure 7(a). In this example (Z = 4, S = 4, Y = 2), we limit the number of real data blocks that can be fetched as dummy blocks to 2. With the additional 2 dummy blocks reserved in the bucket, this bucket can still support up to 4 accesses before a path eviction or early reshuffle. (A minimal sketch of this selection policy follows.)
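Our illustration of the CB selection policy is below (as before, a real controller picks via the bucket's secret permutation rather than a scan, and the green counter is reset together with the other metadata on eviction or reshuffle):

    #include <cstdint>
    #include <vector>

    // Compact Bucket: S - Y reserved dummy slots plus up to Y "green"
    // real blocks usable as dummies per epoch.
    struct CompactBucket {
        std::vector<uint8_t> valid, real;  // as in the Ring ORAM bucket
        uint32_t green_used = 0;           // only log2(Y) bits in hardware

        // Pick a slot for a dummy access (the block of interest lives in
        // another bucket): prefer a reserved dummy, then fall back to a
        // green real block while the Y budget lasts.
        int pick_dummy(uint32_t Y) {
            for (size_t s = 0; s < valid.size(); s++)
                if (valid[s] && !real[s]) { valid[s] = 0; return (int)s; }
            for (size_t s = 0; s < valid.size(); s++)
                if (valid[s] && real[s] && green_used < Y) {
                    green_used++;          // the green block enters the stash
                    valid[s] = 0;
                    return (int)s;
                }
            return -1;  // unreachable before the early-reshuffle bound
        }
        void on_shuffle() { green_used = 0; }  // reset with other metadata
    };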

ORAM accesses with CB. To facilitate accesses to CB, we slightly modify the metadata in the bucket. In the original Ring ORAM, we use a counter per bucket to record how many accesses have been made to the bucket, and one bit per block to record whether it is a real or dummy block. As we need to limit the number of green blocks in a bucket, we add a green block counter to record how many green blocks have been touched. The counter size is comparably small, at log2(Y) bits. During a read path operation, the block selection can freely choose a dummy block in the bucket, or a real block if the green counter value is less than Y. During eviction and reshuffle, the green counter values are reset, just like the other metadata in the bucket.

Choosing the right Y and managing stash overflow. The following questions arise when we modify Ring ORAM into the compact format. First, can we set the value of Y as big as possible? Second, what is the consequence of setting a big Y? Third, how do we determine the best Y for a given ORAM configuration? Clearly, with CB, we may bring more than one real block per read path operation, and this adds a burden on the stash. With the same stash size, using an aggressive CB configuration with a large Y value can cause the stash to fill quickly. To address the stash overflow problem, we adopt background eviction [29], which was initially proposed for Path ORAM. When the stash size reaches a threshold, the background eviction is triggered: it halts the execution and starts to write blocks from the stash back to the ORAM. However, at this point, we may not meet the Ring ORAM eviction frequency A. If the ORAM controller issued the eviction directly without following the eviction frequency, it could leak information, such as the fact that the stash is almost full, since we would see consecutive eviction patterns instead of the multiple-read-paths-then-eviction pattern. Therefore, dummy read path operations (reading specifically dummy blocks) have to be issued until the desired interval A is reached and the eviction operation is called. In this way, our background eviction does not change the access sequences and prevents such leakage. (A sketch of this pacing appears below.)

Due to the high overhead of background eviction, it is recommended to use a modest Y value that triggers few or almost no background evictions. We analyze the tradeoffs with different Y selections and various stash sizes in the result sections.

CB benefits summary. As the spatial optimization in our framework, the space efficiency brought by CB is obvious: if we use Y real blocks in one bucket as green blocks, we can reduce the space overhead by Y blocks per bucket. If the value of Y is properly chosen (without triggering too many background evictions), the additional benefit of this scheme is that the number of dummy blocks that need to be read and written during the eviction/reshuffle phase is reduced, as is the number of blocks per path that need to be permuted. Thus, the time spent on eviction and reshuffle is reduced, and this can in turn accelerate the read path operation. ORAM accesses experience much shorter request queuing times in the memory controller.
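The pacing rule for the leakage-free background eviction above can be sketched as follows (our illustration with assumed constants; the point is that an eviction is never issued off-schedule, so the bus always shows the regular "A read paths, then one eviction" pattern regardless of stash pressure):

    #include <cstdint>

    constexpr uint32_t A = 8;                  // eviction frequency
    constexpr uint32_t STASH_THRESHOLD = 400;  // e.g. 80% of a 500-entry stash

    struct OramController {
        uint32_t stash_occupancy   = 0;
        uint32_t reads_since_evict = 0;

        void issue_dummy_read_path() { /* read only dummy/green slots */ }
        void issue_eviction()        { /* reverse-lexicographic eviction */ }

        void maybe_background_evict() {
            if (stash_occupancy < STASH_THRESHOLD) return;
            while (reads_since_evict < A) {    // pad with dummy read paths
                issue_dummy_read_path();
                reads_since_evict++;
            }
            issue_eviction();                  // now on the regular boundary
            reads_since_evict = 0;
        }
    };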

B. Proactive Bank (PB): A Proactive Memory Management Scheme for the ORAM Scheduler

As reported in Section III-B, the read path and eviction operations in Ring ORAM show distinct memory access locality. The selective block read cannot fully leverage the locality benefits of the subtree layout; therefore, a large portion of the memory accesses during the read path phase are row buffer conflicts. This means the row buffer inside each memory bank needs to be closed and then opened frequently with PRE and ACT commands. Moreover, we find that due to the transaction-based scheduling, the PRE and ACT cannot be issued ahead of each transaction.

We propose a proactive bank (PB) scheduler that separates the PRE and ACT commands from the ORAM transaction during command scheduling. Algorithm 2 shows our modified scheduling policy. Instead of staying idle and waiting for all commands of the current transaction i to finish, the PB scheduler scans the memory command queue to see if any PRE or ACT coming from transaction i + 1 can be issued ahead. In this case, when the current transaction finishes, the next transaction can directly start with RD or WR. In other words, the long row buffer miss penalty is hidden through latency overlapping.

Algorithm 2: PB scheduler algorithm
  Input: i: current ORAM access transaction number
         n: current cycle
  Output: issue command to the DRAM module
   1  while not end of the program do
   2      if memory controller can issue a command at cycle n then
   3          check memory command queue;
   4          if there are commands ∈ transaction i then
   5              issue a command based on FR-FCFS;
   6          else if there is a command ∈ transaction i + 1 then
   7              if it meets an inter-transaction row buffer conflict
                  and the command is PRE or ACT then
   8                  issue the command;
   9              end
  10          else
  11              continue;
  12          end
  13      end
  14      if no commands ∈ ORAM transaction i then
  15          i++;
  16      end
  17      n++;
  18  end

[Fig. 8. The illustration of the timing behavior of PB: inter-transaction PREs and ACTs of the next transaction are issued during the current transaction's idle time, moving the finish time earlier than the original.]
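For concreteness, the lookahead step of Algorithm 2 can be rendered in C++ as follows (our sketch, not the paper's implementation; the types and the FR-FCFS step are heavily simplified, and `row` here denotes the row that a command's transaction will access):

    #include <cstdint>
    #include <deque>

    enum class Op { PRE, ACT, RD, WR };
    struct MemCmd    { Op op; int bank; uint64_t row; uint32_t txn; };
    struct BankState { bool open = false; uint64_t open_row = 0; };

    // Pick the command to issue this cycle (nullptr = stay idle);
    // `cur` is the ORAM transaction currently being drained.
    MemCmd* pb_pick(std::deque<MemCmd>& q, BankState* banks, uint32_t cur) {
        for (auto& c : q)                       // 1) serve transaction i
            if (c.txn == cur) return &c;        //    (FR-FCFS order assumed)
        for (auto& c : q) {                     // 2) lookahead into i + 1
            if (c.txn != cur + 1) continue;
            if (c.op != Op::PRE && c.op != Op::ACT) continue;
            BankState& b = banks[c.bank];
            // inter-transaction row buffer conflict: the bank still holds
            // a row from transaction i, or is precharged and ready to ACT
            if ((c.op == Op::PRE && b.open && b.open_row != c.row) ||
                (c.op == Op::ACT && !b.open))
                return &c;                      // issue PRE/ACT early
        }
        return nullptr;
    }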
By revisiting the example in the motivation, with the PB scheduler, some of the PREs and ACTs can be issued by the memory controller ahead of their ORAM transaction, as shown in Figure 8 (marked with a red outline in the figure). Clearly, the reason such PREs and ACTs can be issued ahead of time is that the corresponding row buffer conflicts are inter-transaction. As a result, whenever these ORAM transactions are in the memory request queue, the PREs and ACTs are able to be issued. Our PB scheduler does not fetch the PREs and ACTs that are caused by intra-transaction conflicts on the same bank. For example, in ORAM access 2, the second set of PRE and ACT is still issued in order. As we do not change the access sequences for each ORAM access, such intra-transaction conflicts are inevitable.

Impact on access sequence. By scheduling the PRE and ACT memory commands out of the ORAM transaction, these commands' issue times will be earlier than the original times. PB scheduling only affects when such commands are issued; it does not change the command order or cause asynchronous data reads or writes. The actual RD and WR commands that carry data still obey the transaction-based access sequences. Besides, the row addresses associated with PRE and ACT are public information; scheduling them ahead changes no addresses and leaks no information.

PB benefits summary. PB optimizes Ring ORAM accesses in the temporal aspect. We separate the non-data-related commands from the original transaction through proactive command scheduling, hence utilizing the bank idle time to prepare fast data access for the next ORAM transaction. The row buffer miss latency can be hidden in a multi-channel, multi-bank memory system. Not only do we reduce the idle time of the memory system, but we also shorten the read path latency.

C. Architecture Integration

To support the proposed spatial and temporal optimizations, we slightly modify the ORAM interface, the bucket structure, and the DRAM command scheduler. Figure 9 shows the overall hardware architecture of our proposed framework, with the modified logic on the ORAM controller and the memory controller highlighted.

[Fig. 9. The architecture overview: the ORAM controller (position map, address logic, stash with CB support, and encryption/decryption logic) and the memory controller with per-channel read/write queues and the added PB logic.]

For the CB scheme, we modify the bucket structure and add the green block counter to record how many green block accesses have been made to the bucket and to limit the maximum to Y. In addition, the ORAM controller needs to be able to issue background evictions to mitigate the potential stash overflow caused by an aggressive Y value.

For the PB scheme, there are no modifications to the DRAM interface or DIMM side. The modification is a very lightweight scheduling policy that can be incorporated into the DRAM controller. The PB scheduler in the DRAM controller only needs to work with the ORAM interface to know the current ORAM access number and to scan the command queue to decide which command to issue.
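As a rough back-of-the-envelope check of the modification cost (our estimate, not a figure from the paper): the green block counter adds ceil(log2 Y) = 3 bits to each bucket's metadata for the default Y = 8; across the roughly 2^24 buckets of the tree this is about 6MB of in-memory metadata, a fraction of a percent of the tree's capacity, while the PB change is confined to a transaction-id compare and an extra scan of the command queue in the scheduler.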

V. SECURITY ANALYSIS

In this section, we discuss the security implications of our proposed performance optimization framework.

Claim 1: CB does not leak access pattern information or cause stash overflow. Compact Bucket (CB) aims at reducing the bucket size and allows real blocks to be used as dummy blocks during the read path operation. Therefore, it is possible to bring more than one real data block into the stash, which differs from the original Ring ORAM protocol. The stash is within the security boundary; therefore, the extra real data blocks inside the stash do not leak any information. If the additional real blocks brought into the stash are serviced by other memory requests before eviction, the execution timing of the program may differ. We argue that this is not a critical issue, since prior ORAM prefetch work [29] also brings more than one real block per read request. Moreover, without the superblock scheme, in our experiments it is rare for green blocks brought into the stash to be consumed by other memory requests before eviction. To completely remove such leakage potential, we can forbid green blocks from being directly fetched from the stash by other requests. The other issue is that the stash could fill faster and overflow. As discussed, with leakage-free background eviction (using dummy read path operations to reach the eviction interval), we can keep the stash occupancy low. The relationship between stash size, CB rate (Y), and performance is presented in Section VII.

Claim 2: PB does not leak access pattern information during scheduling. Proactive Bank (PB) is a lightweight memory scheduler that is easy to implement and only modifies the issue times of non-data-related commands (PREs and ACTs) on the memory bus. The memory access sequences on the bus, including the number of requests/commands and the order of requests/commands, remain entirely unchanged with the PB scheduler. The addresses associated with PRE and ACT are public information: PRE closes a bank and only contains the last accessed bank information, which is known since the bank has been previously accessed; ACT contains the row address for the next transaction's access, which is also public as long as the path id is determined. Whether an ORAM transaction can be accelerated depends on the bank idleness and the transaction access distribution, which are also random and public. Therefore, the memory access pattern is still computationally indistinguishable to the attacker.

Claim 3: Combining CB and PB does not leak information. Combining both schemes does not introduce additional information leakage. CB and PB streamline Ring ORAM accesses from two distinct directions, and combining them only increases the performance benefits from the spatial and temporal aspects.

VI. METHODOLOGY

We implement our proposed String ORAM framework on USIMM [30], which supports cycle-accurate simulation of a detailed DRAM-based memory system. Based on this platform, we simulate a CMP system with the parameters of a state-of-the-art commercial processor; the detailed configurations are shown in Table I. For the memory subsystem, we follow the JEDEC DDR3-1600 specification to simulate a DRAM module with 4 channels, where each channel has 8 banks. The total capacity of the DRAM module is 32GB. The address mapping follows the order "row:bank:column:rank:channel:offset", which follows the subtree layout to maximize row buffer locality [19]. The detailed parameters of the memory subsystem are shown in Table II.

TABLE I: PROCESSOR CONFIGURATION
  Frequency          3.2GHz
  # Cores            4
  Core               5-stage pipeline, OoO execution support;
                     ROB size: 128; Retire width: 4; Fetch width: 4
  Last Level Cache   4MB
  Cacheline Size     64B

TABLE II: MEMORY SUBSYSTEM CONFIGURATION
  Memory Controller
    # Memory Channels      4
    Read Queue             64 entries
    Write Queue            64 entries per channel
    Address Mapping        row:bank:column:rank:channel:offset
  DRAM Module
    Specification          DDR3-1600
    Memory Capacity        8GB (per channel)
    Ranks per Channel      1
    Banks per Rank         8
    Rows per Bank          16384
    Columns (cachelines)   128 per row
    Row Buffer Capacity    4KB

The configuration of ORAM in our framework is shown in Table III. The ORAM tree is set as Z = 8, S = 12, Y = 8, and L = 23, with a total size of 20GB, which fits into our simulated memory system. The default stash size is set at 500. In Section VII, we provide further discussion of the impact of the stash size and the CB rate (Y).

TABLE III: THE DEFAULT String ORAM CONFIGURATION
  Inherited ORAM Model          Ring ORAM [17]
  Stash Size                    500
  Data Block Size               64Byte
  Binary Tree Levels (L+1)      24
  Tree Top Cache Levels         6
  Real Blocks per Bucket (Z)    8
  Dummy Blocks per Bucket (S)   12
  CB Rate (Y)                   8

We use 10 memory-intensive applications for the evaluation. The applications are selected from PARSEC 3.0, SPEC, and BIOBENCH. For each benchmark, a methodology similar to Simpoint is used to generate a trace file consisting of 500 million instructions out of 5 billion instructions. The applications and the corresponding traces are also used in the MSC contest [31]. The applications are described in Table IV.

TABLE IV: WORKLOADS AND THEIR MPKIs
  Suite      Workload   MPKI    Workload   MPKI
  PARSEC     black       4.58   face       10.37
             ferret     10.42   fluid       4.72
             freq        4.42   stream      5.57
             swapt       5.16
  SPEC       leslie      9.45   libq       20.20
  BIOBENCH   mummer     24.07

VII. RESULTS

To evaluate String ORAM, we conduct a series of experiments. We first compare the performance of our proposed schemes with the baseline Ring ORAM, in terms of execution time, memory request queuing time, and bank idle time. After that, we provide a sensitivity study on the CB rate with a thorough analysis of the impact of stash size and background eviction rate. Lastly, we discuss the broader applicability of our proposed schemes.

A. Performance

[Fig. 10. Normalized execution time for the Baseline, CB, PB, and ALL (CB+PB) schemes, broken down into read, eviction, reshuffle, and other operations.]

We use the total execution time (including all operations: read path, eviction, early reshuffle, and other related operations) to denote system performance. As shown in Figure 10, the individual CB scheme improves performance by 11.72% on average. This is because the CB scheme reduces the total number of blocks on a path, so eviction operations take a shorter time to finish. Besides, PB provides a more significant performance improvement than CB: on average, the execution time is decreased by 18.87%. Such improvement is achieved by moving ACT and PRE commands from a future ORAM access onto banks that would otherwise sit idle while the current ORAM access finishes.

Finally, when we consider the combination of CB and PB, the total performance improvement reaches 30.05%.

In addition, Figure 10 also shows that the CB, PB, and CB+PB schemes provide a similar performance improvement range across different applications (the variation across all results is less than 0.38%). This indicates that our proposed schemes work for different applications and prevent information leakage through execution time variance.

B. Queuing Time

[Fig. 11. Normalized request queuing time for (a) the read queue and (b) the write queue, under Baseline, CB only, PB only, and ALL.]

Figure 11 presents the memory request queuing time of the different schemes. We can see that CB provides similar queuing time reductions for the read queue (10.41%) and the write queue (11.83%). The reason is that CB alleviates the access overhead on the memory channel by reducing the number of memory accesses in the eviction operation, which allows both queues more opportunities to service read path operations. On the other hand, the queuing time reduction of the read queue caused by PB is higher than that of the write queue (22.53% vs. 19.46%). This is because PB directly reduces the performance overhead of read path operations, as the read requests can be completed more quickly. Since write operations only happen during eviction and reshuffle, the queuing time reduction of the read queue indirectly helps the write queue gain benefit. Overall, the CB and PB schemes together reduce the queuing times of the read queue and write queue by 32.87% and 31.30%, respectively.

C. Bank Idle Time Reduction

[Fig. 12. Bank idle time and PB proportion: (a) average bank idle time proportion before and after PB; (b) the proportion of PRE and ACT commands issued early by PB.]

Figure 12(a) shows the average bank idle time before and after applying the PB scheme. Originally, as discussed in previous sections, DRAM banks suffer from an imbalanced workload, causing bank idleness while waiting for other banks to finish the current ORAM access. This idle time takes up 65.99% of the total execution time. With the PB scheme, the idle time of the banks is greatly reduced, to 40.72% of the execution time, enabling the banks to serve more requests than before.

Our experiments also suggest that 59.31% of PREs and 56.93% of ACTs can be issued earlier than their own transaction, as shown in Figure 12(b). These commands are overlapped with the critical path of each transaction, and as a result, data blocks can be read directly at the beginning of the transaction. The remaining commands that cannot be fetched earlier are mainly caused by intra-transaction bank conflicts.

D. CB Sensitivity Analysis

TABLE V: CB CONFIGURATIONS AND CORRESPONDING SPACE SAVING (Z = 8, S = 12, L = 23)
                        CB rate   Total Memory   Dummy Block
                                  Space (GB)     Percentage
  Baseline              Y = 0     20             60%
  Config-1              Y = 2     18             55.6%
  Config-2              Y = 4     16             50%
  Config-3              Y = 6     14             42.9%
  Config-4 (default)    Y = 8     12             33.3%

We further evaluate the effectiveness of CB with different configurations of Y. As shown in Table V, the five configurations represent different compact rates corresponding to different memory space efficiencies. A higher compact rate can significantly reduce the occupied total memory space, as well as the dummy block percentage. If we want an extremely storage-efficient ORAM construction, we may choose a higher Y value, at the cost of more frequent background evictions.

While CB is not oriented toward performance optimization, we can still observe some performance gain from the more compact bucket design. The performance gain of CB mainly comes from the eviction phase, as we reduce the number of blocks that need to be read and written. We show the performance with different CB rates in Figure 13. When the stash size is 500, which does not cause additional background evictions, Config-4 with Y = 8 achieves the best performance. The CB scheme with Y = 2 to 8 reduces total execution time by 2.02% to 11.72%. When combined with the PB scheme, the performance improvement for Y = 2 to 8 increases from 20.79% to 30.05%.

Figure 13 also shows the green blocks fetched per read. With CB, 0.17 ∼ 3.26 green blocks are brought into the stash per read path, on average. Therefore, Y cannot be set too aggressively if we do not want the stash to be filled too quickly. In the next section, we study the stash fill rate with different stash sizes.
[Fig. 13. The sensitivity study of the CB compact rate: (a) normalized execution time of CB and ALL under Baseline and Config-1 through Config-4; (b) green blocks fetched per read: Config-1: 0.167, Config-2: 0.652, Config-3: 1.638, Config-4: 3.255.]

[Fig. 14. Stash size vs. performance: (a) normalized execution time and (b) normalized eviction number for stash sizes of 200 to 500 under Baseline and Config-1 through Config-4.]

[Fig. 15. Run-time stash occupancy with different stash size configurations: (a) stash size = 200; (b) stash size = 300; (c) stash size = 400; (d) stash size = 500 (default).]

E. Stash Size vs. Eviction Overhead

We further analyze the relationship between the stash size and the additional background eviction operations under the CB scheme. Figures 14 and 15 show the additional eviction operations with different stash sizes and the dynamic stash occupancy with different Y. Clearly, a smaller stash can be filled faster by the additionally fetched real blocks, while a larger one mitigates this issue.

From the results, we observe that, although the stash occupancy increases with the Y value, in practice the stash does not blow up, thanks to the effective reverse-lexicographic eviction scheme. By properly selecting the stash size, we can mitigate the background eviction overhead when Y is large. For example, when the stash size is too small, i.e., 200, Y ≥ 6 starts to cause background evictions. However, when we have a relatively large stash size, such as 500, even with Y = 8 the background eviction is not triggered during the simulated time. The enlarged stash is still considered very small (64B × 500 = 32KB) and bounded.

VIII. RELATED WORK

We see an increasing number of architectural optimizations for ORAM recently. Firstly, since ORAM protocols generate massive data movement, placing trusted ORAM logic closer to the memory can significantly reduce the data movement between the processor and the memory. For example, SecureDIMM [20] implements a PIM-like structure that mitigates the transfer overhead by adding ORAM logic to the DRAM module. D-ORAM [21] moves the ORAM controller on board and minimizes the bandwidth interference with other applications. Secondly, the effective data fetched per ORAM access can be improved by locality-aware schemes. PrORAM [29] dynamically merges or separates consecutive data accesses with superblocks and then implements a locality-aware prefetcher for ORAM. Multi-range ORAM [32] proposes to store range data within a path and reduce the required accesses. Thirdly, the dummy blocks in ORAM protocols have a high impact on overall performance. Fork Path [18] focuses on the overlapped data during Path ORAM accesses and proposes to cache the content instead of writing it back. Similarly, Shadow Block [28] utilizes the dummy blocks to store additional copies of the real data, which transforms the dummy accesses into prefetching for real blocks. The relaxed hierarchical ORAM [33] reforms ORAM into a layered construction and uses a small ORAM as the cache of the full ORAM. Lastly, efficient ORAM scheduling schemes have been explored to maximize memory system utilization. For example, CP-ORAM [34] focuses on fairness between ORAM applications and normal applications through an application-aware scheduler. A channel imbalance-aware scheduler [35] was proposed to minimize the channel imbalance of the Ring ORAM read operation. Our proposed String ORAM is a new framework focusing on the memory space waste due to dummy blocks and the memory idle time due to ineffective locality improvement schemes designed for Path ORAM.

Instead of using ORAM, other memory access pattern obfuscation techniques have been proposed to achieve lower overhead. InvisiMem [23] and ObfusMem [22] use the logic layer on HMC and a bridge chip on NVM, respectively, to implement the memory obfuscation function and reduce the overhead of issuing multiple dummy accesses per memory request. Note that such implementations require the memory device to be partially trusted. Memcloak [36] stores the same data block multiple times at different addresses to achieve obfuscated memory accesses. Compared to these designs, we limit the TCB boundary while improving memory space utilization instead of storing multiple copies of real data.
IX. CONCLUSIONS

In this paper, we present String ORAM, a framework that accelerates Ring ORAM accesses through an integrated architecture with spatial and temporal optimizations. Through extensive experiments, we identify that dummy blocks in Ring ORAM protocols cause significant memory space waste. Further, we find that current locality optimization schemes are less effective for the Ring ORAM read operation. Therefore, we first present a compact ORAM bucket design (CB), which brings twofold benefits: reduced memory space with fewer dummy blocks, and reduced evict-path overhead with fewer blocks to shuffle. Then, we present a proactive ORAM access scheduler (PB) on the DRAM controller, which minimizes the bank idle time without modifying the access sequences of ORAM. Next, we show the integrated String ORAM architecture that supports our designs. Lastly, we evaluate our proposed framework in terms of security, performance gain, queuing time reduction, and memory bank idle time reduction.
REFERENCES

[1] S. Bajikar, "Trusted platform module (tpm) based security on notebook pcs-white paper," Mobile Platforms Group Intel Corporation, vol. 1, p. 20, 2002.
[2] D. Lie, C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, "Architectural support for copy and tamper resistant software," ACM SIGPLAN Notices, vol. 35, no. 11, pp. 168–177, 2000.
[3] D. Grawrock, The Intel safer computing initiative: building blocks for trusted computing. Intel Press Hillsboro, 2006, vol. 976483262.
[4] S. Johnson, V. Scarlata, C. Rozas, E. Brickell, and F. Mckeen, "Intel® software guard extensions: Epid provisioning and attestation services," White Paper, vol. 1, pp. 1–10, 2016.
[5] D. Kaplan, J. Powell, and T. Woller, "Amd memory encryption," White paper, 2016.
[6] "Introducing arm trustzone," https://developer.arm.com/ip-products/security-ip/trustzone, accessed: 2019-03-30.
[7] G. E. Suh, C. W. O'Donnell, and S. Devadas, "Aegis: A single-chip secure processor," IEEE Design & Test of Computers, vol. 24, no. 6, pp. 570–580, 2007.
[8] L. Ren, C. W. Fletcher, A. Kwon, M. van Dijk, and S. Devadas, "Design and implementation of the ascend secure processor," IEEE Transactions on Dependable and Secure Computing, 2017.
[9] X. Zhuang, T. Zhang, and S. Pande, "Hide: an infrastructure for efficiently protecting information leakage on the address bus," in ACM SIGPLAN Notices, vol. 39, no. 11. ACM, 2004, pp. 72–84.
[10] M. S. Islam, M. Kuzu, and M. Kantarcioglu, "Access pattern disclosure on searchable encryption: Ramification, attack and mitigation," in Network and Distributed System Security Symposium (NDSS). Citeseer, 2012.
[11] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood et al., "Deepsniffer: A dnn model extraction framework based on learning architectural hints," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 385–399.
[12] T. M. John, "Privacy leakage via write-access patterns to the main memory," 2017.
[13] O. Goldreich, "Towards a theory of software protection and simulation by oblivious rams," in Proceedings of the nineteenth annual ACM symposium on Theory of computing. ACM, 1987, pp. 182–194.
[14] E. Stefanov, M. Van Dijk, E. Shi, C. Fletcher, L. Ren, X. Yu, and S. Devadas, "Path oram: an extremely simple oblivious ram protocol," in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 299–310.
[15] C. W. Fletcher, L. Ren, A. Kwon, M. Van Dijk, E. Stefanov, D. Serpanos, and S. Devadas, "A low-latency, low-area hardware oblivious ram controller," in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 2015, pp. 215–222.
[16] S. Sasy, S. Gorbunov, and C. W. Fletcher, "Zerotrace: Oblivious memory primitives from intel sgx," IACR Cryptology ePrint Archive, vol. 2017, p. 549, 2017.
[17] L. Ren, C. W. Fletcher, A. Kwon, E. Stefanov, E. Shi, M. Van Dijk, and S. Devadas, "Constants count: Practical improvements to oblivious ram," in USENIX Security Symposium, 2015, pp. 415–430.
[18] X. Zhang, G. Sun, C. Zhang, W. Zhang, Y. Liang, T. Wang, Y. Chen, and J. Di, "Fork path: improving efficiency of oram by removing redundant memory accesses," in Proceedings of the 48th International Symposium on Microarchitecture, 2015.
[19] L. Ren, X. Yu, C. W. Fletcher, M. Van Dijk, and S. Devadas, "Design space exploration and optimization of path oblivious ram in secure processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013, pp. 571–582.
[20] A. Shafiee, R. Balasubramonian, M. Tiwari, and F. Li, "Secure dimm: Moving oram primitives closer to memory," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 428–440.
[21] R. Wang, Y. Zhang, and J. Yang, "D-oram: Path-oram delegation for low execution interference on cloud servers with untrusted memory," in High Performance Computer Architecture (HPCA), 2018.
[22] A. Awad, Y. Wang, D. Shands, and Y. Solihin, "Obfusmem: A low-overhead access obfuscation for trusted memories," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 107–119.
[23] S. Aga and S. Narayanasamy, "Invisimem: Smart memory defenses for memory bus side channel," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 94–106.
[24] A. Ahmad, K. Kim, M. I. Sarfaraz, and B. Lee, "Obliviate: A data oblivious filesystem for intel sgx," in NDSS, 2018.
[25] C. Sahin, V. Zakhary, A. El Abbadi, H. Lin, and S. Tessaro, "Taostore: Overcoming asynchronicity in oblivious data storage," in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 198–217.
[26] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, "Improving memory bank-level parallelism in the presence of prefetching," in 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2009, pp. 327–336.
[27] L. Zhang, B. Neely, D. Franklin, D. Strukov, Y. Xie, and F. T. Chong, "Mellow writes: Extending lifetime in resistive memories through selective slow write backs," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 519–531.
[28] X. Zhang, G. Sun, P. Xie, C. Zhang, Y. Liu, L. Wei, Q. Xu, and C. J. Xue, "Shadow block: Accelerating oram accesses with data duplication," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 961–973.
[29] X. Yu, S. K. Haider, L. Ren, C. Fletcher, A. Kwon, M. van Dijk, and S. Devadas, "Proram: dynamic prefetcher for oblivious ram," in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 616–628.
[30] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, "Usimm: the utah simulated memory module," University of Utah, Tech. Rep., 2012.
[31] "2012 memory scheduling championship (msc)," http://www.cs.utah.edu/~rajeev/jwac12/, accessed: 2018-11-01.
[32] Y. Che and R. Wang, "Multi-range supported oblivious ram for efficient block data retrieval," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 369–382.
[33] C. Nagarajan, A. Shafiee, R. Balasubramonian, and M. Tiwari, "Relaxed hierarchical oram," in The 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[34] R. Wang, Y. Zhang, and J. Yang, "Cooperative path-oram for effective memory bandwidth sharing in server settings," in High Performance Computer Architecture (HPCA), 2017.
[35] Y. Che, Y. Hong, and R. Wang, "Imbalance-aware scheduler for fast and secure ring oram data retrieval," in 2019 IEEE 37th International Conference on Computer Design (ICCD). IEEE, 2019, pp. 604–612.
[36] W. Liang, K. Bu, K. Li, J. Li, and A. Tavakoli, "Memcloak: Practical access obfuscation for untrusted memory," in Proceedings of the 34th Annual Computer Security Applications Conference. ACM, 2018, pp. 187–197.
