Streamline Ring ORAM Accesses Through Spatial and Temporal Optimization

Mingzhe Zhang, Rujia Wang and Dingyuan Cao have equal contribution. This work was performed while Dingyuan Cao was an undergraduate research intern at ICT, CAS. This work is supported in part by National Natural Science Foundation of China grants No. 62002339, No. 61732018, the Strategic Priority Research Program of the Chinese Academy of Sciences under grant No. XDB44030200, the Open Project Program of the State Key Laboratory of Mathematical Engineering and Advanced Computing (No. 2019A07), and the CARCH Innovation Project (CARCH4506).

Abstract—Memory access patterns can leak temporal and spatial information about a sensitive program; therefore, obfuscated memory access patterns are desirable from the security perspective. Oblivious RAM (ORAM) has been the favored candidate for eliminating access pattern leakage by randomly remapping data blocks around the physical memory space. Meanwhile, accessing memory with ORAM protocols results in significant memory bandwidth overhead: for each memory request, after going through the ORAM obfuscation, the main memory needs to service tens of actual memory accesses, of which only one is useful for the program execution. Besides, to ensure that the memory bus access patterns are indistinguishable, extra dummy blocks need to be stored and transmitted, which wastes memory space and hurts performance.

In this work, we introduce a new framework, String ORAM, that accelerates Ring ORAM accesses with spatial and temporal optimization schemes. First, we identify that dummy blocks can significantly waste memory space, and we propose a compact ORAM organization that leverages the real blocks in memory to obfuscate the memory access pattern. Then, we identify the inefficiency of current transaction-based Ring ORAM scheduling on DRAM devices and propose an effective scheduling technique that can overlap the time spent on row buffer misses while ensuring correctness and security. With minimal modifications to the hardware and software, and negligible impact on security, the framework reduces execution time by 30.05% and memory space overhead by up to 40% compared to the state-of-the-art bandwidth-efficient Ring ORAM.

Index Terms—Ring ORAM, Performance, Space Efficiency, Security

I. INTRODUCTION

As protecting data security and privacy becomes increasingly critical, modern computing systems have started to equip trusted hardware to protect computation and data from various attacks. For example, we see industry-standard trusted computing technologies such as the Trusted Platform Module (TPM) [1], eXecute Only Memory (XOM) [2], Trusted Execution Technology (TXT) [3], Intel SGX [4], AMD SME [5], and ARM TrustZone [6], as well as academic secure processor prototypes such as AEGIS [7] and ASCEND [8]. Clearly, the current trusted computing base (TCB) is still small and limited: it includes part of the processor plus part of the memory space. The majority of components in the system, such as the main memory, the storage devices, and the interconnects, are still vulnerable to various attacks. For example, if the memory access patterns are leaked to the adversary, the program being executed may be inferred through Control Data Flow Graph reconstruction [9]. Researchers have also discovered that on a remote data storage server with searchable encryption, access patterns can still leak a significant amount of sensitive information [10]. Recent attacks show that neural network architectures [11] and RSA keys [12] can be reconstructed from memory access patterns.

To completely remove the potential leakage through the memory access pattern, we need to obfuscate the access patterns that the adversary may observe. Oblivious RAM (ORAM), a cryptographic approach, provides a complete set of access and remap operations that randomly move data blocks in memory to a different physical address after each access [13]. The basic idea of ORAM is to utilize dummy blocks for obfuscation. For each memory access, dummy blocks are fetched together with the real block. After each memory access, the location of the accessed block is reassigned so that the temporal and spatial access patterns are hidden. As a result, an outside attacker cannot infer the type of access or whether the user is accessing the same data repeatedly. Because of the redundancy of dummy blocks, ORAM comes with high overhead, which motivates ORAM designers to optimize its performance. Through decades of advances in cryptography, tree-based ORAMs show great potential to be adopted in main memory systems with relatively efficient bandwidth and storage overheads. For example, Path ORAM [14] translates one memory access into a full path operation, and it has been built into a hardware prototype [15] and integrated with SGX [16]. Further, Ring ORAM [17] optimizes the online data read overhead by selectively reading blocks along the path, reducing overall access bandwidth by 2.3× to 4× and online bandwidth by more than 60× relative to Path ORAM [17].
[Figure: overview of a tree-based ORAM. A binary tree of buckets (levels 0 to 3, leaf paths 0 to 7) holds real and dummy blocks; an access path runs from the root to one leaf. The ORAM controller contains the stash, position map, address logic, and encryption/decryption (E/D) logic, translating each incoming request's leaf label into physical addresses for the memory controller.]

[Figure 3: structure of the DRAM-based memory system. The memory controller holds read/write queues and drives channels 0 to N, each with ranks of banks; inside a bank, a row decoder selects a row of cells via wordlines, and bitlines connect each column to an amplifier in the row buffer.]
In a tree-based ORAM, memory is organized as a binary tree in which the root is at level 0 and the leaves are at level L. Each node of the tree is a bucket that can store multiple real and dummy data blocks. All data blocks are encrypted indistinguishably so that an adversary cannot differentiate dummy blocks from real ones. Each leaf node has a one-to-one correspondence with the path that goes from the root to that leaf node, so there are 2^L paths in total (paths 0, 1, ..., 2^L − 1). On the controller side, the ORAM interface consists of several components: the stash, the position map, the address logic, and the encryption/decryption logic. The stash is a small buffer that temporarily stores data blocks fetched from the ORAM tree. The position map is a lookup table that maps program addresses to data blocks in the tree. In the position map, each data block corresponds to a path id, indicating that it is situated in a bucket along that path.
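To make the path bookkeeping concrete, here is a minimal sketch of how a leaf label determines the buckets an access touches, assuming the standard heap-style numbering of tree nodes; the function and layout are our illustration, not a prescribed part of the ORAM interface:

```python
def path_buckets(leaf: int, L: int) -> list[int]:
    """Tree-node indices of the L+1 buckets from the root (level 0) down
    to leaf `leaf` (level L), with node i's children at 2i+1 and 2i+2."""
    node = (1 << L) - 1 + leaf        # heap index of the chosen leaf
    path = [node]
    while node > 0:
        node = (node - 1) // 2        # climb to the parent bucket
        path.append(node)
    return path[::-1]                 # root-to-leaf order

# A tree with L = 3 has 2^3 = 8 paths; path 5 touches these 4 buckets:
assert path_buckets(5, 3) == [0, 2, 5, 12]
```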
In the Ring ORAM construction, each bucket on a binary tree node has Z + S slots and a small amount of metadata. Of these slots, Z store real data blocks and S store dummy blocks. Figure 2 shows the bucket organization of Ring ORAM. In this example, we have a bucket with Z = 4 and S = 4, and each bucket has additional metadata fields such as index, valid, real, and a counter. The valid bit identifies whether a block has been accessed, the real bit identifies which blocks in the bucket are real blocks, and the counter records how many times the bucket has been accessed. For example, in Figure 2, a dummy block at index 1 (real bit is 0) has been accessed, so its valid bit changes to 0 and the counter increases by 1. For every bucket in the tree, the physical positions of the Z + S real and dummy blocks are permuted randomly when the counter exceeds S.

• Evict path happens once every A read path accesses and is translated into full path operations. For each bucket along the path, it reads all the Z real blocks, permutes them, and writes Z + S blocks back. The sole purpose of an eviction operation is to push blocks from the stash back into the binary tree. Ring ORAM adopts a deterministic eviction order, the reverse lexicographic order, so that consecutive eviction paths have fewer overlapping buckets [17].

• Early reshuffle is needed to ensure that each bucket is properly randomly shuffled. A bucket can only be touched at most S times before an early reshuffle or eviction because, after S accesses to a bucket, all of its dummy blocks have been invalidated. The early reshuffle operation reads and writes buckets that have been accessed S times and resets their metadata fields.
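Putting the bucket rules together, the following minimal sketch (our illustration, not the paper's implementation) models the Z + S slots, the valid/real metadata, the access counter, and the reshuffle trigger from Figure 2:

```python
import random

class Bucket:
    """Per-bucket state in Ring ORAM: Z real and S dummy slots (Figure 2)."""
    def __init__(self, Z=4, S=4):
        self.S = S
        self.real = [True] * Z + [False] * S    # which slots hold real blocks
        random.shuffle(self.real)               # physical positions permuted
        self.valid = [True] * (Z + S)           # slot untouched this epoch
        self.count = 0                          # accesses since last shuffle

    def read_one(self, target=None):
        """Read one slot: the wanted real slot, or a random valid dummy."""
        if target is None:
            target = random.choice([i for i, r in enumerate(self.real)
                                    if not r and self.valid[i]])
        self.valid[target] = False              # a slot is never read twice
        self.count += 1
        return target

    def needs_reshuffle(self):
        # After S touches, all dummies may be consumed: an early reshuffle
        # (or the next eviction) re-permutes the slots and resets metadata.
        return self.count >= self.S
```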
C. Basics of DRAM

The structure of the DRAM-based main memory system is shown in Figure 3. In modern DRAM, a bank is the finest granularity that can be accessed in parallel (referred to as bank-level parallelism [26]). A group of banks in different DRAM chips constitutes a rank and operates in lockstep. Finally, several banks are organized into one channel, which shares one set of physical links (consisting of command, address, and data buses) to communicate with the memory controller. Note that banks in different channels can be accessed in parallel (referred to as channel-level parallelism), while banks in one channel contend for the physical link.

The top-right portion of Figure 3 (in the blue box) shows the structure of a bank. In each bank, a two-dimensional array of DRAM cells is used to store data, and a row buffer
consisting of several amplifiers is connected to the array. The array includes a series of rows, each of which can be selected via a wordline. The array also has multiple columns, and the memory cells in each column are connected to an amplifier in the row buffer via a shared bitline. The wordlines and the

[Figure: allocated memory capacity (GB) split into dummy and real blocks for Config-1 through Config-4.]

Config      Z    A    X    S
Config-1    4    3    2    5
Config-2    8    8    4   12
Config-3   16   20    7   27
Config-4   32   46   12   58
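As a quick worked example of where the space goes, the snippet below computes the dummy-slot share of each bucket for the four configurations in the table above, assuming the capacity chart plots exactly this real/dummy split (Z and S are the real and dummy slot counts; A is Ring ORAM's eviction rate):

```python
configs = {"Config-1": (4, 3, 5),    "Config-2": (8, 8, 12),    # (Z, A, S)
           "Config-3": (16, 20, 27), "Config-4": (32, 46, 58)}
for name, (Z, A, S) in configs.items():
    print(f"{name}: {S / (Z + S):.1%} of bucket slots hold dummy blocks")
# Config-1: 55.6%, Config-2: 60.0%, Config-3: 62.8%, Config-4: 64.4% --
# well over half of the allocated tree capacity stores dummy blocks.
```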
[Fig. 5. The ineffectiveness of the sub-tree layout for Ring ORAM: (a) a selective read path access under the subtree layout; (b) row buffer conflict rate for Ring ORAM with the subtree layout across the evaluated benchmarks.]
The subtree layout [19] ensures that each subtree's blocks are in the same row, so accessing a full path can be translated into 2 row accesses (16 memory accesses in total). In this case, only 2 of them are row buffer misses, and the remaining 14 blocks are all fast row buffer hits. The row buffer conflict rate is relatively low in this case.

Although Ring ORAM is also tree-based, its unique read path operation degrades the benefits of the subtree layout. Only one block per bucket is fetched each time, so the total number of data blocks transferred is reduced. Considering the same tree configuration, as shown in Figure 5(a), a Ring ORAM read operation brings in only 4 blocks in total. In this case, half of the accesses are row buffer hits and the other half are row buffer misses; therefore, the row buffer conflict rate is increased. Such a scenario is exaggerated when we have a multi-channel, multi-bank memory system. Our experiment found that on a four-channel memory system, the row buffer conflict rate during the selective read path operation is significantly higher than during the full path eviction operation. Figure 5(b) illustrates the biased row buffer locality during these two distinct phases. During the read path operation, the row buffer conflict rate is around 74%; however, the full path eviction operation has a much lower conflict rate of 10%. Therefore, we find that the subtree layout is exceptionally effective for full path operations, but not enough for accelerating the selective read path operation in Ring ORAM. The read path operation is always a critical operation during the execution, so its performance impact is obvious.
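The arithmetic behind these counts is easy to reproduce. Below is a small sketch that tallies row buffer misses under an idealized open-page policy, using the 4-level, 16-block path of this example with two tree levels packed per DRAM row; the row assignment is our illustrative reading of the subtree layout:

```python
def row_buffer_misses(rows_touched):
    """First touch of each row misses; repeats while it stays open hit."""
    misses, open_row = 0, None
    for row in rows_touched:
        if row != open_row:
            misses, open_row = misses + 1, row
    return misses

# Full-path access: all 16 blocks on the path, grouped into 2 rows.
full_path = [0] * 8 + [1] * 8
print(row_buffer_misses(full_path), "of", len(full_path))   # 2 of 16 miss

# Ring ORAM read path: one block per bucket, 4 blocks over the same 2 rows.
ring_read = [0, 0, 1, 1]    # levels 0-1 in row 0, levels 2-3 in row 1
print(row_buffer_misses(ring_read), "of", len(ring_read))   # 2 of 4 miss
```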
C. Idle Bank Time with Transaction-based Scheduling

Next, we discuss how ORAM accesses are translated into DRAM commands and scheduled by the DRAM memory controller. After checking the position map, the ORAM controller generates the physical addresses of the data blocks to be fetched along the selected path. The memory controller then translates the access sequences into memory requests that contain memory commands and addresses that memory DIMMs are capable of understanding. For conventional programs without ORAM's protection, once the requests are in the memory controller's queue, the PRE, ACT, or RD/WR commands can be freely scheduled based on bank or channel idleness to maximize performance. However, one single ORAM access now consists of multiple data block accesses, and they must be issued to the memory in order and atomically. We refer to such scheduling as transaction-based scheduling [28], where a transaction means all the memory requests for the same ORAM operation. Figure 6 shows an example of three ORAM accesses to multiple memory banks. Within each ORAM access transaction, the commands include not only the actual RD/WR commands but also the PRE and ACT commands caused by bank conflicts.

[Fig. 6. The illustration of the idle time for the ORAM based on a 4-bank DRAM.]

Algorithm 1: Transaction-based scheduler algorithm
  Input: i: current ORAM access transaction number; n: current cycle
  Output: Issue command to the DRAM module
  1  while not end of the program do
  2      if memory controller can issue command at cycle n then
  3          check memory command queue;
             if has commands ∈ transaction i then
  4              issue the command based on FR-FCFS;
  5          else
  6              continue;
  7          end
  8      end
  9      if no commands ∈ transaction i then
  10         i++;
  11     end
  12     n++;
  13 end

As a result, when the memory controller is issuing the memory requests, it has to follow the transaction-based timing constraints. The transaction-based scheduling algorithm is described in Algorithm 1. The (i+1)-th ORAM access must wait for the completion of the i-th access before it is scheduled out to the memory. We can observe mixed commands sent to random memory channels and banks within each ORAM transaction due to the random selective read path operation.
Since the ACT and PRE commands are also attached to their own ORAM transaction, such commands can only start at the beginning of each transaction when there is a bank conflict. This simple transaction-based scheduling causes abundant wasted time on memory banks. We define the memory bank idle time as the average duration for which each bank stops receiving memory commands due to the transaction-based scheduling barrier. In Figure 6, we can observe that when some memory banks have a higher workload than the others, the idle banks are ready to issue the ACT or PRE commands but are not able to do so, such as bank 1 in ORAM access 1 and bank 2 in ORAM access 2.

To summarize, we identify that current ORAM transaction-based scheduling can cause significant bank idleness, especially when the read path operation causes a high row buffer conflict rate, as explained in the prior section. The PRE and ACT commands do not return any data to the processor. Therefore, if we can free their scheduling from the transaction, we can significantly improve memory bank utilization and overlap the row buffer conflicts.
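As a worked example of this definition, the snippet below computes the idle time the transaction barrier creates on each bank for one transaction; the per-bank completion cycles are made-up numbers for illustration:

```python
# Each bank finishes its share of the transaction at a different cycle, but
# no bank may start work for the next transaction before the slowest is done.
busy_until = {0: 40, 1: 10, 2: 35, 3: 15}      # per-bank completion cycle
barrier = max(busy_until.values())             # transaction barrier: cycle 40
idle = {bank: barrier - t for bank, t in busy_until.items()}
print(idle)                                    # {0: 0, 1: 30, 2: 5, 3: 25}
print(sum(idle.values()) / len(idle))          # average bank idle time: 15.0
```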
D. Design Opportunities

Based on the three observations above, we identify the following design opportunities:

1) Typically, Ring ORAM requires that S ≥ A, which provides abundant dummy blocks for the read path operations at the cost of storage waste. If we can reduce S and allow part of the real blocks to be accessed as dummy blocks, the memory space efficiency can be improved significantly.

2) The subtree layout can significantly promote access efficiency under the open-page policy for full path read or write operations. However, it is not efficient for the selective read path operation. Therefore, if we can change the row buffer management scheme for the read path, we are able to minimize the performance impact of the high row buffer conflict rate and the long critical path delay.

3) Transaction-based ORAM scheduling ensures the correctness of the ORAM protocol; however, at the command level, we find it is less desirable to group the PRE and ACT commands within the current ORAM access transaction. If we can schedule such commands earlier, we have a higher chance to utilize the idle banks and hide the latency caused by row buffer conflicts. In this case, without reducing or changing the number of row buffer conflicts, we preserve the security and correctness of ORAM while improving the performance.

The mentioned approaches, in turn, improve the efficiency of ORAM access from the spatial and temporal aspects. The next section describes the details of our spatial optimization through a compact ORAM bucket design and our temporal optimization with a proactive bank management scheme.

IV. DESIGN

Our ORAM framework, String ORAM, reduces the wasted memory space, the average memory request queuing time, and the row buffer pollution. The framework consists of: a) a compact ORAM bucket organization and an updated access protocol; b) a new scheduler aiming at reducing the bank idle time caused by transaction-based scheduling; and c) an integrated architecture that supports efficient memory utilization and access for ORAM.

A. Compact Bucket (CB): A Compact ORAM Organization and Access Protocol

Based on our motivations in Section III-A, the majority of the space allocated for an ORAM-protected program stores dummy data blocks, which significantly reduces the usability of the limited main memory space. As shown in Figure 7 (a), Ring ORAM reserves S dummy blocks per data bucket so that it can support at most S dummy accesses before a reshuffle operation. Meanwhile, the rest of the Z real blocks may remain untouched if there is no real data access in this bucket.

[Fig. 7. The equivalent Compact Bucket design: (a) Ring ORAM bucket organization; (b) Green Dummy bucket organization.]

Ideally, we want to minimize the number of dummy blocks in the bucket. A simple take is to reduce the value of S directly. However, if we only have a few dummy blocks per bucket, reshuffles would happen very frequently, and the overhead would be significant. To reduce the value of S while ensuring that reshuffles happen at a similar frequency, our idea is to borrow the real blocks that are already in the bucket and treat them as green blocks, as shown in Figure 7 (b).

The Compact Bucket (CB) organization in Figure 7 (b) supports accesses to the bucket equivalent to those in (a). Here, we reduce the number of reserved dummy blocks to S − Y, where Y is the number of real blocks in the bucket that can be served as dummy data during a read path access. We define such blocks as green blocks, and the value Y as the CB rate. Therefore, with the help of Y green blocks, we can achieve the same number of operations per bucket as in Figure 7 (a). In this example (Z = 4, S = 4, Y = 2), we limit the number of real data blocks that can be fetched as dummy blocks to 2. With the additional 2 dummy blocks reserved in the bucket, this bucket can still support up to 4 accesses before a path eviction or early reshuffle.
ORAM accesses with CB. To facilitate the accesses to CB, we slightly modify the metadata in the bucket. In the original Ring ORAM, we use a counter per bucket to hold how many accesses have been made to the bucket, and one bit per block to record whether it is a real or dummy block. As we need to limit the number of green blocks in a bucket, we need to have
a green block counter to record how many green blocks have been touched. The counter size is comparably small, at log2(Y) bits. When there is a read path operation, the block selection can freely choose a dummy block in the bucket, or a real block if the green counter value is less than Y. During eviction and reshuffle, the green counter values are reset, just like the other metadata in the bucket.
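A minimal sketch of this selection rule is shown below, reusing the Bucket sketch from Section II; the CompactBucket class and its pooling of reserved dummies with unused green blocks are our illustration of the rule, not the exact hardware block-selection logic:

```python
import random

class CompactBucket(Bucket):
    """CB bucket: S - Y reserved dummies plus up to Y 'green' real blocks
    still provide S indistinguishable dummy reads per epoch."""
    def __init__(self, Z=4, S=4, Y=2):
        super().__init__(Z, S - Y)   # physically reserve only S - Y dummies
        self.S, self.Y = S, Y        # logical dummy capacity remains S
        self.green = 0               # green blocks consumed this epoch

    def read_dummy(self):
        # Candidates: untouched reserved dummies, plus untouched real blocks
        # while the green budget (tracked by the log2(Y)-bit counter) lasts.
        pool = [i for i, r in enumerate(self.real) if not r and self.valid[i]]
        if self.green < self.Y:
            pool += [i for i, r in enumerate(self.real) if r and self.valid[i]]
        slot = random.choice(pool)
        if self.real[slot]:
            self.green += 1          # one more green block used
        return self.read_one(slot)   # eviction/reshuffle later resets green
```

With Z = 4, S = 4, Y = 2 as in Figure 7 (b), the bucket stores only two physical dummies yet still serves four dummy reads before `needs_reshuffle()` fires, matching the example above.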
Choosing the right Y and managing the stash overflow. The following questions arise when we modify Ring ORAM into the compact format. First, can we set the value of Y as large as possible? Second, what is the consequence of setting a large Y? Third, how do we determine the best Y for a given ORAM configuration? Clearly, with CB, we are bringing more than one real block per read path operation, and this adds a burden on the stash. With the same stash size, an aggressive CB configuration with a large Y value can cause the stash to fill quickly. To address the stash overflow problem, we adopt background eviction [29], which was initially proposed for Path ORAM. When the stash size reaches a threshold, the background eviction is triggered: it halts the execution and starts to write blocks in the stash back to the ORAM. However, at this point, we may not meet the Ring ORAM eviction frequency A. If the ORAM controller issues the eviction directly without following the eviction frequency, it may leak information such as the stash being almost full, since we would see consecutive eviction patterns instead of the usual multiple-reads-then-eviction pattern. Therefore, dummy read path operations (reading specifically dummy blocks) have to be issued until the desired interval A is reached and the eviction operation is called. In this way, our background eviction does not change the access sequences and prevents such leakage. Due to the high overhead of background eviction, it is recommended to use a modest Y value that triggers little or no background eviction. We analyze the tradeoffs in the result sections with different Y selections and various stash sizes.
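The trigger logic described here can be sketched as a small policy function; the high-water threshold and the function shape are our own illustration, with A being Ring ORAM's eviction interval:

```python
def next_op(stash_size, threshold, reads_since_evict, A):
    """Leakage-free background eviction: never evict off-schedule."""
    if reads_since_evict == A:
        return "evict_path"          # the regular deterministic eviction slot
    if stash_size >= threshold:
        return "dummy_read_path"     # burn a read slot instead of evicting early
    return "real_read_path"          # serve the program's next request

# With A = 3 and a nearly full stash, the bus still shows A reads per evict:
ops, reads = [], 0
for _ in range(8):
    op = next_op(490, 480, reads, A=3)
    ops.append(op)
    reads = 0 if op == "evict_path" else reads + 1
print(ops)   # ['dummy_read_path', 'dummy_read_path', 'dummy_read_path',
             #  'evict_path', 'dummy_read_path', ...]
```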
CB benefits summary. As the spatial optimization in our framework, the space efficiency brought by CB is obvious: if we treat Y real blocks in each bucket as green blocks, we reduce the space overhead by Y blocks per bucket. If the value of Y is properly chosen (without triggering too many background evictions), the additional benefits of this scheme are that the number of dummy blocks that need to be read and written during the eviction/reshuffle phase is reduced, as is the number of blocks per path that need to be permuted. Thus, the time spent on eviction and reshuffle is reduced, which in turn accelerates the read path operation. ORAM accesses will experience much shorter request queuing times in the memory controller.

B. Proactive Bank (PB): A Proactive Memory Management Scheme for the ORAM Scheduler

As reported in Section III-B, the read path and eviction operations in Ring ORAM show distinct memory access locality. The selective block read cannot fully leverage the locality benefits of the subtree layout; therefore, a large portion of the memory accesses during the read path phase are row buffer conflicts. This means the row buffer inside each memory bank needs to be closed and then opened frequently with PRE and ACT commands. Moreover, we find that, due to transaction-based scheduling, the PRE and ACT commands cannot be issued ahead of each transaction.

We propose a proactive bank (PB) scheduler that separates the PRE and ACT commands from the ORAM transaction during command scheduling. Algorithm 2 shows our modified scheduling policy. Instead of staying idle and waiting for all commands of the current transaction i to finish, the PB scheduler scans the memory command queue to see if any PRE or ACT from transaction i + 1 can be issued ahead. In this case, when the current transaction is finished, the next transaction can directly start with RD or WR. In other words, the long row buffer miss penalty is hidden through latency overlapping.

Algorithm 2: PB scheduler algorithm
  Input: i: current ORAM access transaction number; n: current cycle
  Output: Issue command to the DRAM module
  1  while not end of the program do
  2      if memory controller can issue command at cycle n then
  3          check memory command queue;
             if has commands ∈ transaction i then
  4              issue the command based on FR-FCFS;
  5          else if has command ∈ transaction i + 1 then
  6              if meets inter-transaction row buffer conflict
                 and the command is PRE or ACT then
  7                  issue the command;
  8              end
  9          else
  10             continue;
  11         end
  12     end
  13     if no commands ∈ ORAM transaction i then
  14         i++;
  15     end
  16     n++;
  17 end

By revisiting the example in the motivation, with the PB scheduler, some of the PREs and ACTs can be issued by the memory controller ahead of the current ORAM transaction, as shown in Figure 8. These commands are marked with a red outline. Clearly, the reason such PREs and ACTs can be done ahead is that those row buffer conflicts are inter-transaction. As a result, whenever these ORAM transactions are in the memory request queue, the PREs and ACTs are able to be issued. Our PB scheduler does not fetch the PREs and ACTs that are caused by intra-transaction conflicts to the same bank; for example, in ORAM access 2, the second set of PRE and ACT is still issued in order. As we do not change the access sequences for each ORAM access, such intra-transaction conflicts are inevitable.

[Fig. 8. Bank timeline under the PB scheduler: PREs and ACTs for the next transaction issued ahead (marked with a red outline).]
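The difference between Algorithm 1 and Algorithm 2 reduces to one extra admission rule, sketched below as a toy predicate; the command tuples are our simplified model, and the inter-transaction row-conflict test of line 6 is elided for brevity:

```python
# A command is (transaction_id, kind, bank); PRE/ACT carry no data.
def can_issue(cmd, current_txn, proactive):
    txn, kind, _bank = cmd
    if txn == current_txn:
        return True              # ordinary FR-FCFS within the transaction
    if proactive and txn == current_txn + 1 and kind in ("PRE", "ACT"):
        return True              # PB: prepare next transaction's banks early
    return False

queue = [(1, "RD", 0), (2, "PRE", 1), (2, "ACT", 1), (2, "RD", 1)]
print([c for c in queue if can_issue(c, 1, proactive=False)])
# transaction-based: only (1, 'RD', 0) may issue now
print([c for c in queue if can_issue(c, 1, proactive=True)])
# PB: (2, 'PRE', 1) and (2, 'ACT', 1) issue ahead, while (2, 'RD', 1)
# still waits -- the data-carrying command order is unchanged
```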
Impact on access sequence. By scheduling the PRE and ACT memory commands outside of their ORAM transaction, these commands' issue times become earlier than their original times. PB scheduling only affects when such commands are issued; it does not change the command order or cause asynchronous data reads or writes. The actual RD and WR commands that carry data still obey the transaction-based access sequences. Besides, the row addresses associated with PRE and ACT are public information, so scheduling them ahead neither changes the original addresses nor leaks any information.
PB benefits summary. PB optimizes the Ring ORAM accesses in the temporal aspect. We separate the non-data-related commands from the original transaction through proactive command scheduling, hence utilizing the bank idle time to prepare fast data access for the next ORAM transaction. The row buffer miss latency can be hidden by a multi-channel, multi-bank memory system. Not only do we reduce the idle time of the memory system, but we also shorten the read path latency.

C. Architecture Integration

To support the proposed spatial and temporal optimizations, we slightly modify the ORAM interface, the bucket structure, and the DRAM command scheduler. Figure 9 shows the overall hardware architecture of our proposed framework. We highlight the modified changes on the ORAM controller and the memory controller.

[Fig. 9. The architecture overview.]

For the CB scheme, we modify the bucket structure and add the green block counter to record how many green block accesses have been made to the bucket and to limit the maximum to Y. In addition, the ORAM controller needs to be able to issue background evictions to mitigate potential stash overflow.

Claim 1: CB does not leak access pattern information, as the observable access sequence is no different from the original Ring ORAM's protocol. The stash is within the security boundary; therefore, the extra real data blocks inside the stash do not leak any information. If the additional real blocks brought into the stash are serviced by other memory requests before eviction, the execution timing of the program may differ. We argue that this is not a critical issue, since the prior ORAM prefetch work [29] also brings more than one real block per read request. Moreover, without a superblock scheme, in our experiments it is rare to see the green blocks brought into the stash consumed by other memory requests before eviction. To completely remove such leakage potential, we can force the green blocks not to be directly fetched from the stash by other requests. The other issue is that the stash could fill faster and overflow. We discuss that through leakage-free background eviction (using dummy read path operations to reach the eviction interval), we can keep the stash occupancy low. The relationship between stash size, CB rate (Y), and performance is presented in Section VII.

Claim 2: PB does not leak access pattern information during scheduling. Proactive Bank (PB) is a lightweight memory scheduler that is easy to implement and only modifies the issue times of non-data-related commands (PREs and ACTs) on the memory bus. The memory access sequences on the bus, including the number of requests/commands and the order of requests/commands, remain entirely unchanged with the PB scheduler. The addresses associated with PRE and ACT are public information: PRE closes a bank and only contains the last accessed bank information, which is known since the bank has been previously accessed; ACT contains the row address for the next transaction's access, which is also public as long as the path id is determined. Whether an ORAM
VI. EXPERIMENTAL SETUP

[Table I. Processor Configuration.] [Table III. The default String ORAM configurations.]

We use a simulation platform that models a detailed DRAM-based memory system [30]. Based on this platform, we simulate a CMP system with the parameters of a state-of-the-art commercial processor; the detailed configurations are shown in Table I. For the memory subsystem, we follow the JEDEC DDR3-1600 specification to simulate a DRAM module with 4 channels, where each channel has 8 banks. The total capacity of the DRAM module is 32 GB. The address mapping follows the order "row:bank:column:rank:channel:offset", which follows the subtree layout to maximize row buffer locality [19]. The detailed parameters for the memory subsystem are shown in Table II.

We use 10 memory-intensive applications for the evaluation. The applications are selected from PARSEC 3.0, SPEC, and BIOBENCH. For each benchmark, a methodology similar to SimPoint is used to generate a trace file consisting of 500 million instructions out of 5 billion instructions. The applications and corresponding traces are also used in the MSC contest [31]. The applications are described in Table IV.
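To illustrate the quoted mapping, the helper below splits a block address into its fields, peeling from the least-significant bits; the field widths are our assumptions consistent with the stated 4 channels, 8 banks, and 64 B blocks (rank, column, and row widths are illustrative):

```python
# "row:bank:column:rank:channel:offset" lists the most-significant field
# first, so we decode starting from the low-order offset bits.
FIELDS = [("offset", 6), ("channel", 2), ("rank", 1),
          ("column", 7), ("bank", 3), ("row", 15)]

def decode(addr: int) -> dict:
    fields = {}
    for name, bits in FIELDS:
        fields[name] = addr & ((1 << bits) - 1)
        addr >>= bits
    return fields

# Consecutive blocks spread over channels and ranks, then fill columns,
# before a new row must be opened; this is what the subtree layout exploits.
print(decode(0x2A3F40))
```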
VII. EVALUATION

A. Performance

[Fig. 10. Normalized execution time, broken down into Read, Eviction, Reshuffle, and Other, for 1. Baseline, 2. CB, 3. PB, and 4. ALL.]

We use the total execution time (including all operations: read path, eviction, early reshuffle, and other related operations) to denote the system performance. As shown in Figure 10, the individual CB scheme improves the performance by 11.72% on average. This is because the CB scheme reduces the total number of blocks on the path, so eviction operations take a shorter time to finish. Besides, PB provides a more significant performance improvement than CB: on average, the execution time is decreased by 18.87%. Such improvement is achieved by moving the ACT and PRE commands of a future ORAM access onto idle banks while waiting for the current ORAM access to finish. Finally, when we consider the combination of CB and PB, the total performance improvement reaches 30.05%.
In addition, Figure 10 also shows that the CB, PB, and CB+PB schemes provide a similar performance improvement range across the different applications (the variation of all results is less than 0.38%). This indicates that our proposed schemes work for different applications and prevent information leakage through execution time variance.
B. Queuing Time

[Fig. 11. Normalized request queuing time: (a) Read Queue; (b) Write Queue.]
Figure 11 presents the memory request queuing time of the different schemes. We can see that CB provides similar queuing time reductions for the read queue (10.41%) and the write queue (11.83%). The reason is that CB alleviates the access overhead on the memory channel by reducing the number of memory accesses in the eviction operation, which gives both queues more opportunities to service read path operations. On the other hand, the queuing time reduction that PB achieves for the read queue is higher than for the write queue (22.53% vs. 19.46%), because PB directly reduces the overhead of read path operations, whose read requests can now complete more quickly. Since write operations only happen during eviction and reshuffle, the queuing time reduction of the read queue indirectly helps the write queue gain benefit. Overall, the CB and PB schemes together reduce the queuing times of the read queue and write queue by 32.87% and 31.30%, respectively.
C. Bank Idle Time Reduction

[Fig. 12. (a) Average bank idle time proportion (Baseline vs. PB). (b) Proportion of PRE and ACT operations issued ahead by PB.]

Figure 12(a) shows the average bank idle time before and after applying the PB scheme. Originally, as discussed in the previous sections, DRAM banks suffer from an imbalanced workload, causing bank idleness while waiting for other banks to finish the current ORAM access. This idle time takes up 65.99% of the total execution time. Through the PB scheme, the idle time of the banks is greatly reduced to 40.72% of the execution time, enabling the banks to serve more requests than before.

Our experiments also suggest that 59.31% of PRE and 56.93% of ACT commands can be issued earlier than their own transactions, as shown in Figure 12(b). These commands are overlapped with the critical path of each transaction; as a result, data blocks can be read directly at the beginning of the transaction. The remaining commands that cannot be fetched earlier are mainly caused by intra-transaction bank conflicts.

D. CB Sensitivity Analysis

We further evaluate the effectiveness of CB with different configurations of Y. As shown in Table V, the five configurations represent different compact rates, corresponding to different memory space efficiencies. A higher compact rate significantly reduces the total occupied memory space, as well as the dummy block percentage. If we want an extremely storage-efficient ORAM construction, we may choose a higher Y value, at the cost of more frequent background evictions.

While CB is not oriented toward performance optimization, we can still observe some performance gain from a more compact bucket design. The performance gain of CB mainly comes from the eviction phase, as we reduce the number of blocks that need to be read and written. We show the performance with different CB rates in Figure 13. When the stash size is 500, which causes no additional background evictions, Config-4 with Y = 8 achieves the best performance. The CB scheme with Y = 2 to 8 achieves a total execution time reduction from 2.02% to 11.72%. When combined with the PB scheme, the performance improvement for Y = 2 to 8 increases from 20.79% to 30.05%.

Figure 13 also shows the green blocks fetched per read. With CB, 0.17 ∼ 3.26 green blocks are brought into the stash per read path on average. Therefore, Y cannot be set too aggressively if we don't want the stash to fill too quickly. In the next section, we study the stash fill rate with different stash sizes.
[Fig. 13. The sensitivity study of the CB compact rate: (a) Performance; (b) Eviction Number. Green blocks fetched per read: Config-1 0.167, Config-2 0.652, Config-3 1.638, Config-4 3.255.]

[Fig. 14. Stash size vs. performance: normalized execution time and eviction number for stash sizes 200, 300, 400, and 500.]

[Fig. 15. Dynamic stash occupancy for the Baseline and Config-1 through Config-4: (a) Stash Size = 200; (b) Stash Size = 300; (c) Stash Size = 400; (d) Stash Size = 500.]

E. Stash Size vs. Eviction Overhead

We further analyze the relationship between the stash size and the additional background eviction operations under the CB scheme. Figures 14 and 15 show the additional eviction operations with different stash sizes and the dynamic stash occupancy with different Y. Clearly, a smaller stash can be filled faster by the additionally fetched real blocks, yet a larger one can
IX. CONCLUSIONS

In this paper, we present String ORAM, a framework that accelerates Ring ORAM accesses through an integrated architecture with spatial and temporal optimizations. Through extensive experiments, we identify that dummy blocks in Ring ORAM protocols cause significant memory space waste. Further, we find that current locality optimization schemes are less effective for the Ring ORAM read operation. Therefore, we first present a compact ORAM bucket design (CB), which brings two benefits: reduced memory space with fewer dummy blocks, and reduced evict path overhead with fewer blocks to shuffle. Then, we present a proactive ORAM access scheduler (PB) on the DRAM controller, which minimizes the bank idle time without modifying the access sequences of ORAM. Next, we show the integrated String ORAM architecture that supports our designs. Lastly, we evaluate our proposed framework in terms of security, performance gain, queuing time reduction, and memory bank idle time reduction.

REFERENCES

[1] S. Bajikar, "Trusted platform module (tpm) based security on notebook pcs-white paper," Mobile Platforms Group, Intel Corporation, vol. 1, p. 20, 2002.
[2] D. Lie, C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, "Architectural support for copy and tamper resistant software," ACM SIGPLAN Notices, vol. 35, no. 11, pp. 168–177, 2000.
[3] D. Grawrock, The Intel safer computing initiative: building blocks for trusted computing. Intel Press Hillsboro, 2006, vol. 976483262.
[4] S. Johnson, V. Scarlata, C. Rozas, E. Brickell, and F. Mckeen, "Intel® software guard extensions: Epid provisioning and attestation services," White Paper, vol. 1, pp. 1–10, 2016.
[5] D. Kaplan, J. Powell, and T. Woller, "Amd memory encryption," White paper, 2016.
[6] "Introducing arm trustzone," https://developer.arm.com/ip-products/security-ip/trustzone, accessed: 2019-03-30.
[7] G. E. Suh, C. W. O'Donnell, and S. Devadas, "Aegis: A single-chip secure processor," IEEE Design & Test of Computers, vol. 24, no. 6, pp. 570–580, 2007.
[8] L. Ren, C. W. Fletcher, A. Kwon, M. van Dijk, and S. Devadas, "Design and implementation of the ascend secure processor," IEEE Transactions on Dependable and Secure Computing, 2017.
[9] X. Zhuang, T. Zhang, and S. Pande, "Hide: an infrastructure for efficiently protecting information leakage on the address bus," in ACM SIGPLAN Notices, vol. 39, no. 11. ACM, 2004, pp. 72–84.
[10] M. S. Islam, M. Kuzu, and M. Kantarcioglu, "Access pattern disclosure on searchable encryption: Ramification, attack and mitigation," in Network and Distributed System Security Symposium (NDSS). Citeseer, 2012.
[11] X. Hu, L. Liang, S. Li, L. Deng, P. Zuo, Y. Ji, X. Xie, Y. Ding, C. Liu, T. Sherwood et al., "Deepsniffer: A dnn model extraction framework based on learning architectural hints," in Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 2020, pp. 385–399.
[12] T. M. John, "Privacy leakage via write-access patterns to the main memory," 2017.
[13] O. Goldreich, "Towards a theory of software protection and simulation by oblivious rams," in Proceedings of the nineteenth annual ACM symposium on Theory of computing. ACM, 1987, pp. 182–194.
[14] E. Stefanov, M. Van Dijk, E. Shi, C. Fletcher, L. Ren, X. Yu, and S. Devadas, "Path oram: an extremely simple oblivious ram protocol," in Proceedings of the 2013 ACM SIGSAC conference on Computer & communications security. ACM, 2013, pp. 299–310.
[15] C. W. Fletcher, L. Ren, A. Kwon, M. Van Dijk, E. Stefanov, D. Serpanos, and S. Devadas, "A low-latency, low-area hardware oblivious ram controller," in 2015 IEEE 23rd Annual International Symposium on Field-Programmable Custom Computing Machines. IEEE, 2015, pp. 215–222.
[16] S. Sasy, S. Gorbunov, and C. W. Fletcher, "Zerotrace: Oblivious memory primitives from intel sgx," IACR Cryptology ePrint Archive, vol. 2017, p. 549, 2017.
[17] L. Ren, C. W. Fletcher, A. Kwon, E. Stefanov, E. Shi, M. Van Dijk, and S. Devadas, "Constants count: Practical improvements to oblivious ram," in USENIX Security Symposium, 2015, pp. 415–430.
[18] X. Zhang, G. Sun, C. Zhang, W. Zhang, Y. Liang, T. Wang, Y. Chen, and J. Di, "Fork path: improving efficiency of oram by removing redundant memory accesses," in Proceedings of the 48th International Symposium on Microarchitecture, 2015.
[19] L. Ren, X. Yu, C. W. Fletcher, M. Van Dijk, and S. Devadas, "Design space exploration and optimization of path oblivious ram in secure processors," in Proceedings of the 40th Annual International Symposium on Computer Architecture, 2013, pp. 571–582.
[20] A. Shafiee, R. Balasubramonian, M. Tiwari, and F. Li, "Secure dimm: Moving oram primitives closer to memory," in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2018, pp. 428–440.
[21] R. Wang, Y. Zhang, and J. Yang, "D-oram: Path-oram delegation for low execution interference on cloud servers with untrusted memory," in High Performance Computer Architecture (HPCA), 2018.
[22] A. Awad, Y. Wang, D. Shands, and Y. Solihin, "Obfusmem: A low-overhead access obfuscation for trusted memories," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 107–119.
[23] S. Aga and S. Narayanasamy, "Invisimem: Smart memory defenses for memory bus side channel," in Proceedings of the 44th Annual International Symposium on Computer Architecture, 2017, pp. 94–106.
[24] A. Ahmad, K. Kim, M. I. Sarfaraz, and B. Lee, "Obliviate: A data oblivious filesystem for intel sgx," in NDSS, 2018.
[25] C. Sahin, V. Zakhary, A. El Abbadi, H. Lin, and S. Tessaro, "Taostore: Overcoming asynchronicity in oblivious data storage," in 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 2016, pp. 198–217.
[26] C. J. Lee, V. Narasiman, O. Mutlu, and Y. N. Patt, "Improving memory bank-level parallelism in the presence of prefetching," in 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2009, pp. 327–336.
[27] L. Zhang, B. Neely, D. Franklin, D. Strukov, Y. Xie, and F. T. Chong, "Mellow writes: Extending lifetime in resistive memories through selective slow write backs," in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). IEEE, 2016, pp. 519–531.
[28] X. Zhang, G. Sun, P. Xie, C. Zhang, Y. Liu, L. Wei, Q. Xu, and C. J. Xue, "Shadow block: Accelerating oram accesses with data duplication," in 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018, pp. 961–973.
[29] X. Yu, S. K. Haider, L. Ren, C. Fletcher, A. Kwon, M. van Dijk, and S. Devadas, "Proram: dynamic prefetcher for oblivious ram," in Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015, pp. 616–628.
[30] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, "Usimm: the utah simulated memory module," University of Utah, Tech. Rep., 2012.
[31] "2012 memory scheduling championship (msc)," http://www.cs.utah.edu/~rajeev/jwac12/, accessed: 2018-11-01.
[32] Y. Che and R. Wang, "Multi-range supported oblivious ram for efficient block data retrieval," in 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020, pp. 369–382.
[33] C. Nagarajan, A. Shafiee, R. Balasubramonian, and M. Tiwari, "Relaxed hierarchical oram," in The 24th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019.
[34] R. Wang, Y. Zhang, and J. Yang, "Cooperative path-oram for effective memory bandwidth sharing in server settings," in High Performance Computer Architecture (HPCA), 2017.
[35] Y. Che, Y. Hong, and R. Wang, "Imbalance-aware scheduler for fast and secure ring oram data retrieval," in 2019 IEEE 37th International Conference on Computer Design (ICCD). IEEE, 2019, pp. 604–612.
[36] W. Liang, K. Bu, K. Li, J. Li, and A. Tavakoli, "Memcloak: Practical access obfuscation for untrusted memory," in Proceedings of the 34th Annual Computer Security Applications Conference. ACM, 2018, pp. 187–197.