0% found this document useful (0 votes)
9 views16 pages

Learning-Based Memory Allocation For C++ Server Workloads

Uploaded by

misskanagi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views16 pages

Learning-Based Memory Allocation For C++ Server Workloads

Uploaded by

misskanagi
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

Learning-based Memory Allocation

for C++ Server Workloads


Martin Maas, David G. Andersen*† , Michael Isard, Mohammad Mahdi Javanmard*‡ ,
Kathryn S. McKinley, Colin Raffel
Google Research † Carnegie Mellon University ‡ Stony Brook University

Abstract ACM Reference Format:


Modern C++ servers have memory footprints that vary widely Martin Maas, David G. Andersen, Michael Isard, Mohammad Mahdi
Javanmard, Kathryn S. McKinley, Colin Raffel. 2020. Learning-based
over time, causing persistent heap fragmentation of up to 2×
Memory Allocation for C++ Server Workloads. In Proceedings of
from long-lived objects allocated during peak memory usage. the Twenty-Fifth International Conference on Architectural Support
This fragmentation is exacerbated by the use of huge (2MB) for Programming Languages and Operating Systems (ASPLOS ’20),
pages, a requirement for high performance on large heap March 16–20, 2020, Lausanne, Switzerland. ACM, New York, NY,
sizes. Reducing fragmentation automatically is challenging USA, 16 pages. https://doi.org/10.1145/3373376.3378525
because C++ memory managers cannot move objects.
This paper presents a new approach to huge page frag-
mentation. It combines modern machine learning techniques 1 Introduction
with a novel memory manager (Llama) that manages the Optimizing interactive web services, many of which are writ-
heap based on object lifetimes and huge pages (divided into ten in C++, requires meeting strict latency requirements
blocks and lines). A neural network-based language model while minimizing resource usage. Users abandon services if
predicts lifetime classes using symbolized calling contexts. response times are too slow and data center costs are directly
The model learns context-sensitive per-allocation site life- proportional to resource usage. Multithreaded services re-
times from previous runs, generalizes over different binary quire large heaps both to minimize the number of deployed
versions, and extrapolates from samples to unobserved call- instances and to handle multiple requests simultaneously.
ing contexts. Instead of size classes, Llama’s heap is orga- Hardware has not kept pace with these demands. While
nized by lifetime classes that are dynamically adjusted based memory sizes have increased, Translation Lookaside Buffers
on observed behavior at a block granularity. (TLB) have not, because address translation is on the crit-
Llama reduces memory fragmentation by up to 78% while ical path. One solution is increasing TLB reach with huge
only using huge pages on several production servers. We ad- (2 MB) pages, i.e., each entry covers more memory. Huge
dress ML-specific questions such as tolerating mispredictions pages reduce TLB misses, improving performance by up to
and amortizing expensive predictions across application ex- 53% [33, 37]. Looking forward, 1 GB pages are already avail-
ecution. Although our results focus on memory allocation, able and variable-sized ranges can eliminate even more TLB
the questions we identify apply to other system-level prob- misses [27, 33]. Future virtual memory systems may hence
lems with strict latency and resource requirements where predominantly rely on huge pages and ranges.
machine learning could be applied. These trends require workloads to efficiently use huge
pages. While Operating Systems (OS) have explored trans-
CCS Concepts • Computing methodologies → Super- parent huge pages [37, 45], they either trade performance for
vised learning; • Software and its engineering → Allo- space, increasing the physical memory footprint by up to 23%
cation / deallocation strategies; and 69% on server workloads [37], or break up huge pages,
Keywords Memory management, Machine Learning, Life- sacrificing performance (TLB hits) and depleting contiguous
time Prediction, Profile-guided Optimization, LSTMs physical memory for all workloads on the machine [37, 45].
If the C++ memory allocator is not huge page aware, it may
* Work done while at Google. further defeat the OS. Only one C++ memory allocator in
Permission to make digital or hard copies of part or all of this work for
the literature uses huge pages, but its evaluation uses mi-
personal or classroom use is granted without fee provided that copies are crobenchmarks [36]. To our knowledge, no current memory
not made or distributed for profit or commercial advantage and that copies allocator efficiently manages memory entirely with huge
bear this notice and the full citation on the first page. Copyrights for third- pages without incurring significant fragmentation.
party components of this work must be honored. For all other uses, contact We identify a root cause of huge page fragmentation in
the owner/author(s).
long-running servers: allocations of long-lived objects at
ASPLOS ’20, March 16–20, 2020, Lausanne, Switzerland
© 2020 Copyright held by the owner/author(s).
peak memory usage. Since C++ allocators cannot move ob-
ACM ISBN 978-1-4503-7102-5/20/03. jects, using huge pages increase the probability of one long-
https://doi.org/10.1145/3373376.3378525 lived object preventing a page from being released to the
OS. For instance, if 99.99% of objects are short-lived and
their average size is 64 B, then using 4 KB pages, the prob-
ability that any given page contains a long-lived object is
less than 1% (1 − (0.9999) 4096/64 ). Using 2 MB huge pages, the
probability is 96%. Figure 1 shows that heap fragmentation
with huge pages for a production image processing service
on a synthetic workload grows over time as a function of
peak memory consumption. Many web services exhibit such
highly variable memory consumption [37, 40] and allocate Figure 1. Image server memory usage resizing groups of
critical long-lived session state. large and small images either backed by huge (red) or small
Solving this problem fundamentally depends on reason- (yellow) pages in the OS, derived from analyzing an allo-
ing about object lifetimes and grouping objects with similar cation trace in a simulator. Huge pages waste systemically
lifetimes together [4, 11, 16, 17]. Prior lifetime region and more memory and increasingly more over time.
pool memory management techniques [6, 34, 43] depend on
programmer intervention and are limited because not all life- language, we train simple language models on symbolized
times are statically known, software can change over time, calling contexts. We use a Long Short-Term Memory (LSTM)
and libraries are used in multiple contexts. Previous object recurrent neural network model to learn common and rare
lifetime predictors for C++ and garbage collected languages contexts (Section 5). Whereas other lifetime predictors are
use profiling to classify objects as short or long lived, but are simple binary classifiers for exactly matching contexts or
used in settings (such as pretenuring) where mispredictions single allocation sites [4, 11, 16], Llama’s predictor learns
are tolerable [4, 11, 16, 30]. In contrast, because every wrong multiple lifetime classes and accurately predicts unobserved
prediction may retain up to 2 MB and errors accumulate contexts because it uses program symbols, rather than match-
on long-running servers, we require an approach that does ing stack traces or hard-coded allocation sites. However, per-
not induce fragmentation upon misprediction, and need to forming inference on every allocation is too expensive, so
address the following challenges: Llama caches inferences and periodically re-evaluates them.
In contrast to C/C++ free-list allocators that organize the
Lifetime accuracy and coverage. Full coverage and per-
heap based on object size classes [5, 9, 19, 21, 38], Llama or-
fect accuracy are not achievable because exercising all
ganizes the heap based on lifetime classes. It manages huge
possible application behavior ahead-of-time is challeng-
pages by subdividing them into 8 KB blocks and 128 B lines.
ing, especially for evolving servers configured in myriad
It assigns each huge page and block a lifetime class (LC).
ways with different libraries.
Llama maintains two invariants: 1) it fills blocks with one
Overheads. Continuous profiling in deployment is not prac-
predicted lifetime class (LC) at a time and 2) this LC is the
tical because it adds 6% overhead [13, 42], which can be
same or shorter than the huge page’s LC. The huge page’s LC
more than memory allocation itself [31].
thus matches or over-predicts its blocks to tolerate mispre-
These challenges require accurate predictions in previously dictions. To limit fragmentation and handle mispredictions,
unobserved contexts and a memory manager that explicitly Llama dynamically reclassifies a huge page’s LC based on
reasons about lifetimes to recover from mispredictions. Our its observed block lifetimes.
contributions are as follows: (1) The design of a recurrent Llama assigns each huge page a predicted LC and a dead-
neural network predictor that trains on samples and general- line, by when all objects should be dead. It first fills blocks in
izes to different application versions and build configurations huge pages with objects of the same LC, marking these same-
with accurate, but not perfect prediction. (2) A novel Learned LC blocks residual. When blocks are freed, Llama aggres-
Lifetime-Aware Memory Allocator (Llama) with low fragmen- sively reuses them for predicted shorter-lived (non-residual)
tation that only uses huge pages, but subdivides them into LC blocks. These shorter-lived blocks are likely to be freed
blocks and lines. It then manages huge pages and blocks before the huge page’s deadline. This policy limits fragmen-
using their predicted and observed lifetime class. (3) Some tation without extending huge page lifetimes. If the deadline
lessons for applying ML to other systems problems. expires and any residual blocks are still live (i.e., lifetime was
To increase coverage and accuracy, the predictor can be under-predicted), Llama promotes the huge page to the next-
trained on different server versions and configurations. To longer-lived LC. If all residual blocks have been freed (i.e., life-
reduce profiling overhead, we sample allocations and frees time may be over-predicted since all live blocks have a lower
to produce training data with allocation calling context (i.e., LC than their huge page), Llama reduces the huge page’s LC
stack traces) and object lifetimes. We classify objects into and its remaining blocks become residual. Llama tracks line
lifetime classes separated by orders of magnitude: ≤ 10 ms, liveness in a block and recycles partially free blocks. Llama’s
100 ms, 1 s, 10 s, etc. Based on the insight that program sym- hierarchical heap organization (huge page, block, line) fol-
bols in stack traces carry meaning similar to words in natural lows Immix’s (block, line) mark-region design [10, 48]. Its
Figure 2 shows object lifetimes. While 92% of the over 100 M
allocations live for less than 1 s, 4% (millions) of allocations
live for over 10 s and 1% (thousands) live for over 100 s.
These long-lived objects cause excessive fragmentation.
Workloads with varying memory footprint are more suscep-
tible to this problem because small numbers of long-lived
objects on a huge page prevent reusing it for large alloca-
Figure 2. Long tail of object lifetimes from a single run; x- tions. In the image server, short-lived objects that cause the
axis is log-scale. The vast majority of objects are short-lived, heap to grow temporarily include data structures to process
but rare long-lived objects impact fragmentation. each request and image data. At the same time, it allocates
long-lived objects that are used for tracking the cluster envi-
ronment, system statistics, log tracing, and long-lived session
lifetime organization is similar to generational and lifetime-
state. Long-lived state per request is application critical and
based copying garbage collectors [8, 10, 18, 49, 51]. How-
is not the result of poor software engineering.
ever, unlike a managed runtime where GC can move objects
Highly varying memory footprints are typical of servers [37,
between regions, Llama cannot move objects and instead
40]. Fragmentation remains an open problem, recently re-
reclassifies huge pages. Llama is the first C/C++ allocator to
ported for many applications and allocators beyond TCMal-
organize the heap based on lifetime versus object size.
loc [36, 37, 46]. However, strategies for addressing fragmenta-
We prototype Llama and compare to TCMalloc, a popular
tion in these allocators are designed for 4 KB pages [5, 46]. As
and highly tuned allocator, backed by OS huge pages. Llama
our probabilistic argument in Section 1 points out, address-
never breaks up huge pages while simultaneously reducing
ing fragmentation for huge pages is fundamentally more
fragmentation by up to 78% on several production code bases.
difficult, particularly without lifetime information.
We compare Llama to Mesh [46], which uses allocation ran-
domization and page combining to combat fragmentation 2.2 Lifetime Prediction Challenges
for small pages. Using Mesh’s publicly available scripts on a
Prior work predicts object lifetime as long or short based on
worst case microbenchmark that emulates address random-
allocation site and precisely matching calling context [11, 16]
ization for long lived objects, Llama reduces fragmentation
(although Cohn and Singh did use stack data for predictions
over Mesh on huge pages by an order of magnitude. We
instead [17]). Current approaches typically store a table of
further show Llama accurately predicts new contexts, adds
allocation sites, together with a summary of observed per-
little overhead, and recovers from mispredictions. We also
site lifetimes [13]. They either 1) collect lifetime information
draw lessons for applying machine learning in other latency-
at runtime, i.e., dynamic pretenuring [16, 30] or 2) use profile-
critical systems settings.
guided optimization (PGO), collecting lifetimes offline with
special instrumentation, analyzing it offline, and then using
2 Motivation and Background
it in deployment [11]. Lifetime prediction faces the following
2.1 Server Fragmentation significant challenges:
We demonstrate huge page fragmentation on a production
Overheads. Collecting allocation lifetimes incurs a substan-
C++ image server that applies filters and transforms images.
tial overhead, e.g., stack tracing adds 14% end-to-end over-
We drive this server using a request generator that mimics
head and writing to disk further increases the cost, making
workload shifts over time. One iteration running for 448 s
continuous profiling infeasible in production. Looking up
with an average live set of 628 MB has ≈110 M allocations
a predicted object lifetime also incurs overhead, including
from ≈215 K allocation contexts. It allocates (with malloc()
recording the calling context. Table 1 shows recording the
or new) objects of different sizes and frees (with free() or
calling stack for an allocation can take an order of magnitude
delete) allocated memory using TCMalloc [21]. Like all
longer than the allocation, which is problematic. Solutions in-
C/C++ allocators, once TCMalloc places objects in virtual
clude instrumenting the stack prologue and epilogue to keep
memory, it never moves them. We extended TCMalloc to
track of the current stack through a series of bits stored in a
record every object allocation and free with the address, size,
register [12, 13, 29]. However, overheads of this approach are
thread, dynamic stack trace, and timestamp.
≈6% and higher, exceeding all the time spent in memory al-
We replay these traces in a simulator that determines
location [31]. We solve these problems by using stack height
which pages contain live objects at a given time by modeling
and object size for per-site prediction and cache lookups.
the OS giving out 4 KB or 2 MB pages for unmapped virtual
addresses. Figure 1 shows the average fragmentation (ratio Coverage and Accuracy. Encountering a sufficient fraction
of memory occupied by live pages to actual live memory) of allocation sites for accurate prediction is critical. When
is 1.03x when the OS backs memory with 4 KB pages, but collecting lifetime data online, we cannot make a prediction
increases to 2.15x with huge pages and gets worse over time. unless we have seen its context at least once. However, in
TCMalloc Fast Path (new/delete) 8.3 ns
TCMalloc Slow Path (central list) 81.7 ns
Capture full stack trace 396 ns ± 364 ns
Look up stack hash (Section 7) 22.5 ns

Table 1. Timescale comparisons


Version Difference Matching/Total # Traces
Revisions 1 week apart 20,606 / 35,336 (58.31%)
Revisions 5 months apart 127 / 33,613 (0.38%) Figure 3. Overview of our ML-based Allocator
Opt. vs. non-opt. build 43 / 41,060 (0.10%)
Table 2. Fraction of individual stack traces that match be-
tween different binary versions (using exact match of sym- 3 Overview of Lifetime Prediction
bolized function names).
We address overhead and coverage challenges by sampling
multiple executions. Sampling is suitable for both server
our example workload, 64% of distinct allocation contexts applications in datacenters and multiple runs of a popular
are seen only once and 17% of all permanent allocations (i.e., application (e.g., a web browser) on a client. We connect to
allocations that never get freed) are from contexts that are a given application for a sample period and collect lifetimes
only encountered once. PGO avoids this problem by using for a small fraction of all allocations that occur during this
profiles from previous runs, but is more difficult to apply to period (Section 4).
lifetimes than in traditional scenarios, such as inlining [15]. Sampling may not observe all allocation calling contexts
First, these decisions do not depend on the dynamic calling and we must combine samples from a heterogeneous set of
context. As such, each call site only needs to be observed different software versions, while the code bases are con-
once (in any context). In contrast, lifetime prediction requires stantly updated. We therefore cannot simply use a lookup
observing every context for full coverage. For instance, in- table, as shown in Table 2. Our solution is to use ML on the
lining data only needs to collect a single event per sample, observed samples of tokenized calling contexts (i.e., symbol-
while lifetime profiling requires observing both the alloca- ized/textual stack traces) to predict object lifetimes. We train
tion and free. As such, profiling data is more scarce in our a model that maps from calling context to lifetime, while
setting than typical PGO scenarios. generalizing to previously unseen contexts. The predictions
Instability. Stack traces are brittle when used across exe- drive our novel C++ memory allocator that organizes the
cutions. Even stack traces on the exact same binary may heap based on lifetime to reduce fragmentation. While our
differ due to address layout randomization. Using symbol prototype focuses on learning a mapping from contexts to
information, it is possible to reconstruct the original method lifetime, we could add other input features, such as perfor-
name for each stack frame, but different builds of the same mance counters or user-level statistics.
binary may still differ. For example, changing libraries can Another challenge is to perform prediction without sig-
affect inlining decisions, different compiler settings lead to nificant overhead. The allocation fast path is 8.3 ns (Table 1),
slightly different symbol names, and function names and which is too short to obtain a prediction from an ML model.
interfaces change over time. This problem also occurs when In fact, it is not even sufficient to gather all the required
collecting traces across a large number of instances of the features since collecting a deep stack trace takes 400 ns. We
same server with different build configurations and software address this problem by not invoking the model for every
versions. Table 2 shows that the fraction of matching stack allocation. Instead, we use a hashing-based mechanism (Sec-
traces between builds with even minor changes is low and tion 7) to identify previously seen contexts by using values
decreases over time. This result explains why almost all prac- that are already in registers (the return address and stack
tical lifetime predictors today use online profiling instead of pointer) to index a hash table and execute the model only if
PGO, or rely on site instead of the full dynamic stack. the lookup fails. We thus amortize model executions over the
lifetime of a long-running server. We discuss other strategies
We solve coverage and instability problems by enhancing to reduce this cost even further (Section 10). We now explain
PGO to work without observing all contexts. We design an each component in detail.
ML-based predictor that learns on calling contexts of tok-
enized class and method names to produce accurate predic-
tions for unobserved contexts. If a single binary is deployed 4 Sampling-based Data Collection
sufficiently often to achieve full coverage, our approach re- Our sampling approach periodically connects to servers (for
duces to conventional PGO. However, these situations are a time period such as ≈5 minutes) and samples a subset of
rare — most companies have different software versions in all memory allocations. Each sample includes stack trace,
production at the same time [7, 47]. object size and address at allocation and deallocation time.
Lifetime
the entire trace in simulation (Section 2). The two approaches
LSTM
Cell
LSTM
Cell
LSTM
Cell
... LSTM
Cell
produce consistent results (Section 9.3).
[x1,…,xn] [x1,…,xn] [x1,…,xn] [x1,…,xn]

Embedding Embedding Embedding Embedding

...
5 Lifetime Prediction Model
proto2 :: MessageLite )
Our goal is to predict object lifetimes based on our collection
Figure 4. LSTM-based model architecture
of past lifetime samples. As shown in Section 2, a simple
lookup table is insufficient and brittle to changes in the ap-
plication. We instead construct a dataset of samples from a
This approach follows continuous profiling tools used in range of scenarios and train a machine learning model on
production settings [31]. this dataset to generalize to previously unseen stack traces.
We integrate this approach into TCMalloc [21]. Its existing
heap profiling mechanism identifies long-lived objects well
by producing a list of sampled objects at the end of the appli- 5.1 Data Processing
cation’s execution, most of which are long-lived, including We pre-process our sample data using a distributed dataflow
their allocation sites. It misses the more prolific allocations computation framework [2, 14]. We group inputs by alloca-
of short-lived objects that are not live at the end of the pro- tion site and calculate the distribution of observed lifetimes
gram. We therefore extend the heap profiling mechanism to for each site. We use the 95th percentile 𝑇95 𝑖 of observed

record frees (deallocations) as well. We do so using hooks (i.e., lifetimes of site 𝑖 to assign a label 𝐿𝑖 ∈ {1, . . . , 7, ∞} to the
functions) that are called periodically, based on the number site such that 𝑇95𝑖 < 𝑇 (𝐿 ) = (10) 𝐿𝑖 ms. Objects the program
𝑖
of allocated bytes. These hooks incur virtually no overhead never frees get a special long-lived label ∞. This produces life-
when they are disabled. When enabled, each sampled allo- time classes of 10 ms, 100 ms, 1 s, 10 s, 100 s, 1000 s, ≥1000 s,
cation triggers TCMalloc to store it at a special address in and ∞. Our model classifies stack traces according to these
memory and then deallocation can identify those sampled labels. To ensure that our model assigns greater importance
objects and call the corresponding deallocation hook. to stack traces that occur more often, we weight each stack
We install an HTTP handler accessible by pprof [25], an trace according to the number of times it was observed and
open-source profiling and analysis tool. When invoked, the sample multiple copies for frequently occurring traces. The
handler registers two hooks, one for allocation and one for resulting datasets for our applications contain on the order
deallocation. It also allocates a new data structure (outside of of tens of thousands of elements.
the TCMalloc-managed heap) to store observed stack traces. The use of wallclock time for lifetime prediction is a de-
The allocation hook stores the allocation’s full stack trace, a parture from prior work that expresses lifetime with respect
timestamp of the allocation, object size, alignment, and the to allocated bytes [4], which can be more stable across en-
stack and processor ID of the allocation into a hash table, vironments (e.g., server types) at short timescales. We ex-
indexed by a pointer to where the object was allocated. The perimented with logical time measured in bytes, but for our
deallocation hook matches its pointer to the hash table and if server systems, wallclock time works better. We believe time
it finds an entry, records its own stack trace, timestamp and works better because 1) our lifetime classes are very coarse-
thread/CPU where the deallocation occurred. This pair of grained (10×) and absorb variations, 2) if the speed difference
entries is now stored in a different hash table, which is used between environments is uniform, nothing changes (lifetime
to deduplicate all samples. For each entry, we keep a running classes are still a factor of 10× apart). Meanwhile, variations
tally of the distribution of lifetimes, by storing the maximum, in application behavior make the bytes-based metric very
minimum, count, sum and sum of squares (the latter two brittle over long time ranges (e.g., in the image server, the
allow us to calculate mean and variance of the lifetime at sizes of submitted images, number of asynchronous external
a later point). We also store how many of these allocations events, etc. dilate logical time).
were allocated and deallocated on the same CPU or thread
(we do not currently use this information, but explain in
5.2 Machine Learning Model
Section 10 how it might be used). At the end of a sampling
period, we store the result into a protocol buffer [24]. We use a model similar to text models. First, we treat each
In deployment, we would periodically connect to servers frame in the stack trace as a string and tokenize it by splitting
in the fleet and collect samples. For this research, we run based on special characters such as: , and ::. We separate
smaller-scale experiments to understand the trade-offs of stack frames with a special token: @. We take the most com-
our approach and mostly rely on full traces collected by mon tokens and create a table that maps them to a particular
instrumenting allocation and free calls. While too expensive ID with one special ID reserved for unknown or rare tokens,
for production, this approach is useful for understanding denoted as UNK. The table size is a configuration parameter
coverage of different sampling rates (Section 9), or to replay (e.g., 5,000 covers most common tokens).
1 __gnu_cxx :: __g :: __string_base char , std :: __g :: char_traits Here lies an opportunity for the model to generalize. If the
2
char , std :: __g :: allocator char :: _M_reserve ( unsigned long )
proto2 :: internal :: InlineGreedyStringParser ( std :: __g ::
model can learn that tokens such as ParseFromArray and
basic_string char , std :: __g :: char_traits char , std :: __g :: InternalParse appear in similar contexts, it can generalize
3
allocator char *, char const *, proto2 :: internal :: ParseContext *)
proto2 :: FileDescriptorProto :: _InternalParse ( char const *,
when it encounters stack traces that it has not seen before.
proto2 :: internal :: ParseContext *) Note that our approach is not specific to LSTMs. We chose
4
5
proto2 :: MessageLite :: ParseFromArray ( void const *, int )
proto2 :: DescriptorPool :: TryFindFileInFallbackDatabase ( std ::
the LSTM architecture since it is one of the simplest se-
__g :: basic_string char , std :: __g :: char_traits char , std :: quence models, but future work could explore more sophis-
6
__g :: allocator char const ) const
proto2 :: DescriptorPool :: FindFileByName ( std :: __g ::
ticated model architectures that could incorporate more de-
basic_string char , std :: __g :: char_traits char , std :: __g :: tails of the underlying program (e.g., Graph Neural Networks
allocator char const ) const proto2 :: internal ::
AssignDescriptors ( proto2 :: internal :: AssignDescriptorsTable *)
trained on program code [3]). Our specific model architec-
7 system2 :: Algorithm_descriptor () ture is a standard single-layer LSTM with a hidden state size
8
9
system2 :: init_module_algorithm_parse ()
Initializer :: TypeData :: RunIfNecessary ( Initializer *)
of 64 (we experiment with 16 as well), embedding size of
10 Initializer :: RunInitializers ( char const *) 32, uses a softmax output, and is trained against a standard
11
12
RealInit ( char const *, int *, char *** , bool , bool )
main
cross-entropy classification loss via gradient descent. The
final state of the LSTM is passed through a fully connected
Figure 5. An example of an altered but representative stack layer. Training uses the Adam optimizer [35] with a learning
trace used to predict object lifetimes. rate of 0.001 and gradients clipped to 5.0.

We use a long short-term memory (LSTM) recurrent neu- 5.3 Model Implementation
ral network model [28]. LSTMs are typically used for se- We implement and train our model using TensorFlow [1].
quence prediction, e.g., for next-word prediction in natural Calling into the full TensorFlow stack to obtain a lifetime
language processing. They capture long-term sequential de- prediction would be prohibitively expensive for a memory
pendencies by applying a recursive computation to every allocator, so after training, we use TensorFlow’s XLA com-
element in a sequence and outputting a prediction based piler to transform the trained model into C++ code that we
on the final step. In contrast, feed-forward neural networks compile and link into our allocator directly. The model runs
like multi-layer perceptrons [23] or convolutional neural within the allocating thread. To allow multiple threads to use
networks [20, 39] can recognize local patterns, but require the model concurrently, we instantiate the model’s internal
some form of temporal integration in order to apply them to buffers multiple times and add concurrency control.
variable-length sequences.
Our choice of an LSTM is informed by stack trace structure.
6 Lifetime Aware Allocator Design
Figure 5 shows an example. Sequentially processing a trace
from top to bottom conceptually captures the nesting of This section introduces a fundamentally new design for
the program. In this case, the program is creating a string, C/C++ memory managers based on predicted object life-
which is part of a protocol buffer (“proto”) parsing operation, times. Instead of building an allocator around segmenting
which is part of another subsystem. Each part on its own is allocations into size classes [5, 9, 19, 21, 36, 38], we directly
not meaningful: A string may be long-lived or short-lived, manage huge pages and segment object allocation into pre-
depending on whether it is part of a temporary data structure dicted lifetime classes. We further divide, manage, and track
or part of a long-lived table. Similarly, some operations in huge pages and their liveness at a block and line granularity
the proto might indicate that a string constructed within it to limit fragmentation. We implement our allocator from
is temporary, but others make the newly constructed string scratch. It is not yet highly tuned, but it demonstrates the
part of the proto itself, which means they have the same potential of a lifetime-based approach. We address two chal-
lifetime. In this case, the enclosing context that generates lenges required to incorporate ML into low-level systems: 1)
the proto indicates whether the string is long or short-lived. how to deal with mispredictions and 2) prediction latencies
For our model to learn these types of patterns, it must step that are orders of magnitude longer than the typical alloca-
through the stack frames, carrying through information, and tion latency. We first describe the structure of the memory
depending on the context, decide whether or not a particular allocator, then how we make fast predictions, and follow
token is important. This capability is a particular strength of with key implementation details.
LSTMs (Figure 4). We feed the stack trace into the LSTM as a
sequence of tokens (ordered starting from the top of the trace) 6.1 Heap Structure and Concurrency
by first looking up an “embedding vector” for each token in a We design our memory manager for modern parallel soft-
table represented as a matrix 𝐴. The embedding matrix A is ware and hardware. Llama organizes the heap into huge
trained as part of the model. Ideally, 𝐴 will map tokens with pages to increase TLB reach. To limit physical fragmentation,
a similar meaning close together in embedding space, similar we divide huge pages into 8 KB blocks and track their live-
to word2vec embeddings [41] in natural language processing. ness. Llama assigns each active huge page one of 𝑁 lifetime
classes (LC), separated by at least an order of magnitude (e.g., to a local allocator, it marks the blocks open for allocation.
10 ms, 100 ms, 1000 ms, . . . , ∞). Our implementation uses a If the blocks are on an open huge page, it also marks the
maximum of 𝑁 = 7 lifetime classes. Llama exploits the large blocks residual. Residual blocks are predicted to match the
virtual memory of 64-bit architectures, as fragmentation of LC of their huge page. An active huge page may also contain
virtual memory is not a concern. Llama divides virtual mem- other live (non-residual) blocks, but these blocks will contain
ory into 16 GB LC regions, one per lifetime class. Section 8 objects of a shorter lifetime class, as explained below. Thread-
describes enhancements when an LC region is exhausted. local allocators bump-pointer allocate small objects in block
The global allocator manages huge pages and their blocks. spans. When they exhaust a span, they mark it closed.
It performs bump pointer allocation of huge pages in their Llama first fills a huge page with same LC blocks and then
initial LC regions, acquiring them from the OS. It directly transitions it from open to active. At this point, the huge
manages large objects (>= 8 KB), placing them into contigu- page contains residual blocks and maybe free blocks. Figure 6
ous free blocks in partially free huge pages or in new huge shows an illustrative, but simplified, example of the logical
pages. A huge page may contain large and small objects. LC Llama heap (huge pages and blocks) and its behavior
Llama achieves scalability on multicore hardware by us- over time. This heap has three lifetime classes, separated by
ing mostly unsynchronized thread-local allocation for small orders of magnitude. A large amount of initial allocation in
objects (<=8 KB). The global allocator gives block spans to Figure 6a, including a large object allocation into huge page
local allocators upon request. When a thread-local allocator 11 and 12, is followed by a large number of frees in Figure 6b.
allocates the first object of a given LC or it exhausts its cur- Llama returns free huge pages 2 and 6 to the OS.
rent LC block span, it requests one from the global allocator.
Block spans consist of 𝑀 blocks and reduce synchronization 6.3 Limiting Fragmentation by Recycling Blocks
with the global allocator. Our implementation uses 𝑀 = 2 Notice in Figure 6b active huge pages contain free blocks and
(16 KB block spans) with 16 KB alignment. Llama further live residual blocks of the same LC. Llama limits fragmen-
subdivides block spans into 128 B lines and recycles lines in tation by aggressively recycling such free blocks for objects
partially free block spans for small objects (see Section 6.6). in shorter LCs (except for the shortest LC, since no LC is
It tracks line and block liveness using object counters. Small shorter). Section 6.5 explains the fast bit vector operations
objects never cross span boundaries, but may cross line and that find recyclable blocks of the correct size and alignment.
block boundaries. Each thread-local allocator maintains one Given a request for LC 𝑙𝑟 , the global allocator prefers to
or two block spans per LC for small objects. use free blocks from a longer-lived active huge page (LC
Llama tracks predicted and actual block lifetimes and uses > 𝑙𝑟 ). These recycled blocks are allocated non-residual, as
them to decrease or increase their huge page’s LC. Llama illustrated in Figure 6c. If no such recyclable blocks exist, the
maintains the following invariants. 1) It allocates only objects global allocator uses block(s) from the open huge page of
of one predicted LC into a block or span at a time. 2) A huge the same LC = 𝑙𝑟 . Intuitively, if the predictor is accurate or
page contains blocks with the same or shorter predicted LC. overestimates lifetime class, the program with high probabil-
We next describe how we use LC predictions to manage ity will free shorter-lived objects on recycled blocks before
huge pages and blocks. Sections 6.3 and 6.4 describe the it frees residual blocks with the same LC as the huge page.
policies that limit fragmentation and dynamically detect Because lifetime classes are separated by at least an order
and control the impact of mispredicted lifetimes. Section 6.6 of magnitude, the allocator may reuse these blocks many
then describes how Llama uses lines to identify and recycle times while the longer-lived objects on the huge page are
memory in partially free block spans. in use, reducing the maximum heap footprint. If the predic-
tor underestimates lifetime, the objects will have more time
6.2 Lifetime-Based Huge Page Management to be freed. This design is thus tolerant of over and under
Each huge page has three states: open, active, and free. Open estimates of lifetime.
and active blocks are live. The first allocation into a huge page For large objects, the global allocator assigns blocks di-
makes it open and determines its LC. Only one huge page rectly. For example, given the heap state in Figure 6b and a
per LC is open at a time. While a huge page is open, Llama request for a two block large object with 𝑙𝑟 < 10 ms, the global
only assigns its blocks to the same LC. Llama transitions a allocator allocates it into huge page 7 with LC < 100 ms and
huge page from open to active and assigns it a deadline after marks the blocks non-residual, as illustrated in Figure 6c.
filling all its constituent blocks for the first time. The huge When Llama recycles a block span (assigning it to a local
page remains active for the rest of its lifetime. The OS backs allocator), it marks the blocks open and non-residual. The
huge pages lazily, upon first touch. A huge page is free when local allocator assigns the span to the requested LC 𝑙𝑟 , even
all its blocks are free and is immediately returned to the OS. if the span resides on a huge page assigned to a longer
All blocks in a huge page are free or live; open or closed lifetime class. The local allocator only allocates objects of
for allocation; and residual or non-residual. All blocks are this predicted lifetime 𝑙𝑟 into this span. After it fills the span
initially free. When the global allocator returns a block span with 𝑙𝑟 object allocations, it marks the blocks closed. This
● Residual Allocation O Open Huge Page Huge Page
LC Lifetime Class A Active Huge Page
object lifetimes to the next longer LC and huge pages with
blocks
LC A ① A ② A ③ A ④ O ⑤ over-predicted objects to the next shorter lifetime class.
< 10 ms
A
• • • • • •

• • • • • •
A ⑦ O
• • • • • •

• • • • • • • • • •
Huge Page
We detect under-prediction of lifetimes using deadlines.
< 100 ms
A
• • • • • •

• • • • • •
A ⑩ A
• •
⑪ O ⑫
Identifiers When a huge page becomes full for the first time, the global
<1s • • • • • • • • • • • • • • • • • • • • • allocator transitions it from open to active and assigns it a
large object allocation deadline as follows:
(a) Initial allocations. Huge pages are bump-pointer allocated into LC regions.
Each huge page is first filled with same LC blocks, marked residual with a dot.
deadline = current_timestamp + K × LC𝐻𝑢𝑔𝑒𝑃𝑎𝑔𝑒
A ① A ② A ③ A ④ O ⑤ When Llama changes the LC of a huge page, it assigns the
< 10 ms • • freed to OS • • • • • • • • • •
A ⑥ A ⑦ O ⑧ huge page a new deadline using the same calculation and the
< 100 ms • to OS
freed • • • •
A ⑨ A ⑩ A ⑪ O ⑫ new lifetime class. We experimented with 𝐾 = 2 and 𝐾 = 4.
<1s • • • • • • • • • • • • • • • •
When a huge page’s deadline expires, then the predictor
(b) After objects free, some blocks and huge pages are free (white). Llama made a mistake. To recover, Llama increases the huge page’s
immediately returns free huge pages to the OS to control maximum heap size. lifetime class and gives it a new deadline. Figure 6d depicts
After frees, Llama
< 10 ms
A returns completely
• •
① A free huge③pages

A to the OS. ④
• • • • •
O
• • • •

this case. The residual blocks in huge page 1 outlive their
< 100 ms
A
• •
⑦ O
• •
⑧ deadline and Llama increases its LC to 100 ms. A huge page
<1s
A ⑨ A ⑩ A ⑪ O ⑫ may also contain non-residual blocks which it leaves un-
• • • • • • • • • • • • • • • •
changed. Llama essentially predicts that the residual blocks
(c) Subsequent allocations of shorter LC small objects first fill free blocks in
were just mispredicted by one LC and non-residual blocks
the highest LC in A(ctive)
Llama preferentially hugeLCpages
allocates shorter blocks 9
onand 10,longer
A(ctive) andlived
then blocks
huge pages. in huge page are shorter lived than this LC. If either live longer that this
7. These blocks are not residual (no dot) and expected to be freed before the LC, this process will repeat until the blocks are freed or reach
residual blocks. O(pen) pages 5, 8, and 12 are ineligible for such allocation. the longest lived LC. This policy ensures that huge pages
< 10 ms • •
A ③

A
• • • •


O
• • • •

with under predicted objects eventually end up in the correct
< 100 ms
A
• •
① A
• •
⑦ O
• •
⑧ lifetime class, tolerating mispredictions.
<1s
A ⑨ A ⑩ A ⑪ O ⑫ Llama’s recycling mechanism works well for both accu-
• • • • • • • • • • • • • •
rate and under-predicted lifetimes. If all lifetimes are accurate
(d) When huge page 1’s deadline expires, residual blocks are still live (mis-
or under-predicted, a program will free all residual blocks be-
prediction). Llama
If residual blocks increases
on a huge page outlivethe huge
the LC, page’s
Llama LChuge
moves the bypage
one,to from
the next10 to 100
higher LC. 𝑚𝑠. fore their huge page deadline since the deadline is generous.
Residual blocks remain residual; their expected lifetime is now at least 100 𝑚𝑠. As blocks become free on active huge pages, the allocator
A ③ A ④ O ⑤ may recycle them for shorter lifetime classes, as explained
< 10 ms • • • • • • • • • •
A ⑨ A ① A ⑦ O ⑧ above. Llama may repeatedly recycle blocks on active huge
< 100 ms • • • • • • • • • •
A ⑩ A ⑪ O ⑫ pages, each time they are freed. Before the deadline expires,
<1s • • • • • • • • • • • • • •
if all blocks in the huge page are free at once, Llama simply
releases it to the OS. Otherwise given accurate or under pre-
(e) Huge page 9 only contains non-residual blocks and consequently, Llama
decreases
If all live its LC.
blocks It marks
belong all LC,
to a shorter liveLlama
blocks
movesresidual since
the huge page they
to the nextmatch or are less
shorter LC.
diction, the huge page will at some point contain only live
than the huge page’s LC. non-residual (shorter LC) blocks when the deadline expires.
Figure 6. Llama’s logical heap organization with three life- Llama will then decrease the huge page’s LC by one and
time classes (< 10 𝑚𝑠, < 100 𝑚𝑠, < 1 𝑠). Each live huge page is compute a new deadline using the current time and new LC.
A(ctive) or O(pen) and divided into blocks. Block color de- Figure 6e shows such an example. Because huge page 9
picts predicted LC or free (white). Residual blocks are marked contains only non-residual blocks, Llama decreases its LC
with a dot. Deadlines and lines are omitted. and marks all live blocks residual. With accurate and under-
predicted lifetimes, this process repeats: either the huge page
is freed or its LC continues to drop until it reaches the short-
est LC. In the shortest LC since no blocks are recycled and
policy guarantees that when a block is open for allocation, when prediction is accurate, all blocks are freed before the
it receives only one LC. deadline and the huge page is released.
Llama’s recycling policy is configurable. In the current
implementation, Llama prefers 𝑙𝑟 + 1 for large objects and 6.5 Data Structures
the longest available LC for small objects. Llama tracks liveness at the huge page, block, and line gran-
ularity. It stores metadata in small pages at the beginning
6.4 Tolerating Prediction Errors of each 16 GB LC region. Each huge page in a region corre-
Lifetime prediction will never be perfect. Llama tolerates sponds to one 256 B metadata region in the metadata page.
mispredictions by tracking block and huge page lifetimes us- Mapping between a huge page and its metadata therefore
ing deadlines. It promotes huge pages with under-predicted consists of quick bit operations.
The global allocator tracks active huge pages in a list for Allocation

each LC and open blocks in each huge page in bit vectors.


Hash
The metadata for huge pages stores bitmaps with 256 bits (< Cache Lifetime Class #1
Lifetime Class #2
one cache line). One bitmap stores whether a block is live Stack Trace Prediction Lifetime Class #3
(i.e., contains live objects). Another bitmap identifies residual Optional: Periodically discard cache
blocks that contain live objects of the same LC as the huge Model Model

page. Non-residual live blocks thus contain shorter-lived


objects. When the global allocator assigns a block from the Figure 7. High-level overview of low-latency prediction. We
same LC as the request, it marks the block residual. use the model only when the hash of the current stack trace
When Llama frees a block, it clears the corresponding is not in the cache. Discarding cache entries periodically
bits in both bitmaps. If all blocks in a huge page are free (the helps dynamically adapting to workload changes.
live bitmap is all zeros), it returns the huge page to the OS.
Otherwise, it examines the residual bitmap. If it is all zeroes,
any live blocks must contain objects with shorter predicted Llama bump-pointer allocates small objects into partially
lifetimes. Llama therefore assign the page to the next-lower free spans until it encounters an occupied line or the end of
lifetime class (huge page 9 in Figure 6d), copies the current the span. When it encounters an occupied line, it skips to
live bitmap into the residual bitmap and continues. The huge the next free line(s). For tiny objects less than or equal to
pages in the shortest LC contain no recycled blocks. the line size (128 B), if the current line has insufficient free
memory, it skips to the next free line which is guaranteed to
be sufficient, wasting some memory. For other small objects
6.6 Recycling Lines in Block Spans (> 128 B), Llama limits line fragmentation using demand
This section describes how Llama limits fragmentation by driven overflow allocation [10]. If a small (not tiny) object
further subdividing block spans into lines, recycling lines does not fit, the allocator instead obtains a second completely
in partially free spans, and using the overflow allocation free overflow span from the global allocator for this object
optimization [10]. For spans with small objects, Llama keeps and any future such allocations. It thus avoids searching for
live object counts for each line, and a count of live lines per 𝑛 > 1 free contiguous lines or wasting free lines. A local
span. Small objects occupy one or more contiguous lines and allocator may thus have two spans per LC: one partially free
only one span. Once a span is closed (filled at least once), and one overflow span. The local allocator prefers to allocate
subsequent frees may create a partially free span. in the partially free span, but once exhausted, it will fill the
Multiple threads can free objects in a span, thus counting overflow span before requesting a new span.
live objects requires synchronization. For each small object
allocation, local allocators perform an atomic increment on 7 Low-Latency and Accurate Prediction
its span’s and line(s)’s object counts. For each free, an atomic The allocator must predict object lifetimes quickly to meet
decrement is performed on the counts. Section 8 describes latency requirements. TCMalloc allocation times are <100
this synchronization in more detail. If the span count be- cycles, but even a simple neural network takes microseconds.
comes 0, the thread that drops the count to zero returns it We therefore cache predictions. Figure 7 shows how at each
to the global allocator. Free spans on active huge pages are allocation, we compute a hash of the return address, stack
immediately available for recycling. When a line count drops height and object size, and index a thread-local hashmap.
to zero, the freeing thread updates the span’s count. Because stack traces have temporal locality, we expect the
A span with free lines and live lines is partially free. The lookup will usually hit in the L1 cache. Prior work shows
global allocator recycles partially free spans only after the stack height identifies C/C++ stack traces with 68% accu-
deadline of their huge page expires. It scans the huge page racy [42]. We find adding object size increases accuracy. If
and adds any closed partially free spans to a list. When it the hash hits, we use the cached prediction. Otherwise, we
assigns spans to a thread-local allocator, it marks them as run the compiled model which takes hundreds of 𝜇𝑠 (depend-
open. A local allocator may have one or two open spans per ing on the stack depth), and store the result in the cache.
LC: one initially partially free and one initially fully free. When stack hashes with very different lifetimes alias or
Each span is exclusive to the requesting local allocator workloads change, prediction accuracy suffers. For example,
which only allocates objects of the lifetime it requested, re- if we store a hash for an allocation site that is predicted
gardless of the huge page’s LC. When a block is full, the local short-lived, but a second site, more common and long-lived,
allocator marks the block closed and releases it. Each time aliases, then Llama may allocate a large number of long-
a span is open for allocation, it only receives one LC. Each lived objects into short-lived block spans. We found that 14%
time a partially free span is opened, it may however receive of predictions disagreed with the currently cached value.
a different LC, mixing lifetimes. The LC of these objects will To address this problem, we periodically discard cached
always be shorter than the LC of the huge page. entries. Every, e.g., 1,000 cache hits, we run prediction again.
If the result agrees with the current entry, we do nothing. has at most one owner thread, which performs unsynchro-
Otherwise, we set the cache entry to the maximum lifetime nized allocation and atomic reference count increments. Since
of the old and new prediction. We use maximum because the different threads can free objects, span and line reference
allocator is more resilient to under-predicted lifetimes than count increments and decrements must be synchronized (or
over-predicted lifetimes. queued in a buffer for later processing [48]). The thread that
drops a span reference count to zero is responsible for free-
8 Implementation Details ing it. The owner of an open span increments its reference
Allocation size lookup. When freeing objects, we need to count by 1 when it acquires a span and decreases it by 1
know their size. We use a 256-entry bitmap representing when it releases it since no thread can free the span while an
each block in a huge page. We set a bit to 1 if and only if owner is still allocating into it. We apply the same technique
the corresponding block is the last block occupied by an to lines – the allocator increments the reference count for a
object. Given an address, we find the blocks it occupies by line when it partially allocates into it and then decrements
rounding it down to the closest block size and using the it when it moves on to the next line.
bitmap to find the next set bit. This approach does not work Potential optimizations include eliding the increment and
for the last object (which may span multiple huge pages). decrement pair when an object crosses lines and deferral,
We therefore store a 64-bit value in the huge page metadata, similar to reference counting Immix [48]. We note that highly
which contains the size of the last object on the huge page. tuned allocators perform a large number of additional opti-
A similar approach tracks small objects that span lines, but mizations (such as prefetching, restartable sequences, hand-
since small objects cannot straddle spans, it needs only one tuned assembly sequences [32]) that are missing from this
byte to store the number of lines occupied by an object. research allocator.
Bootstrap allocator. Llama needs some basic functionality
C-style malloc/free API. Our allocator is designed for C++,
during initialization, such as querying the binary’s symbol
but supports legacy C code, which requires storing the pre-
table. For prototyping, Llama uses a bootstrap allocator that
cise allocation size to support realloc calls. If we encounter
handles initial allocations before executing the program. Our
legacy malloc calls, we pad the object with a header that
prototype uses TCMalloc as this bootstrap allocator. The
contains the object size.
memory usage reported in this paper consists of memory
Alignment. Our allocator handles alignment and aligns all allocated by both allocators, including fragmented memory.
objects to at least 8 B. The huge page allocator handles com- Bootstrap memory is a small fraction of the heap. A full im-
mon alignments automatically, as blocks are 8 KB aligned. plementation would likely use a simpler bootstrap allocator.
For larger alignments, we increase the allocation size as
necessary and shift the start pointer to match the required 9 Evaluation
alignment. When we search for a free gap in a page, we try We evaluate Llama on four workloads. Except for Redis, they
gaps individually to find ones that fit the object with the are large production code bases:
correct alignment. Image Processing Server. A Google-internal production
image processing server that filters and transforms images.
Lifetime region management. Above, we assume one 16
We use synthetic inputs, but the fragmentation in our
GB virtual memory region per lifetime class. Llama never
experiments is consistent with production.
reuses huge page virtual memory. Even after it frees a huge
TensorFlow. The open-source TensorFlow Serving frame-
page, Llama still continues to use fresh virtual memory space
work [44] running the InceptionV3 [50] image recognition
if it needs to allocate another huge page in this region. This
model. This workload exercises libraries with complex
approach is practical because 64 bit architectures provide
memory allocation behavior, such as the Eigen linear alge-
virtual address space that exceeds the physical address space
bra library. It runs 400 batches of requests in a harness.
per-process by orders of magnitude. If we run out of virtual
Data Processing Pipeline. A Google-internal data process-
memory in a region of a given lifetime, we allocate an ad-
ing workload running word count on a 1 GB file with 100 M
ditional 16 GB virtual memory region for this lifetime class.
words. We run the entire computation in a single process,
Llama manages these regions in an array. The OS only maps
which creates very high allocator pressure, resulting in
small and huge pages when the program accesses them and
476 parallel threads and 5M allocations per second.
unmaps pages when the allocator releases them.
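The following sketch illustrates the per-line counting described above; the line size, span layout, and all names are assumptions for illustration (the span is assumed to be filled by one owning thread, while the line counts are shared with threads that free objects):

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr size_t kLineSize = 128;                  // assumed line size
constexpr size_t kSpanBytes = 8 * 1024;            // one 8 KB block span
constexpr size_t kLinesPerSpan = kSpanBytes / kLineSize;

struct BlockSpan {
  std::atomic<uint32_t> line_refs[kLinesPerSpan];  // zeroed when the span is created
  char* base = nullptr;
  size_t bump = 0;                                 // bump-pointer offset within the span
  size_t open_line = SIZE_MAX;                     // line currently being filled
};

inline size_t LineOf(size_t offset) { return offset / kLineSize; }

void* AllocateInSpan(BlockSpan& span, size_t size) {
  size_t start = span.bump, end = start + size;
  if (end > kSpanBytes) return nullptr;            // span exhausted; caller opens another
  // Each line the object touches holds a reference until the object is freed.
  for (size_t line = LineOf(start); line <= LineOf(end - 1); ++line)
    span.line_refs[line].fetch_add(1, std::memory_order_relaxed);
  // The allocator additionally pins the line it is partially filling and
  // releases the pin once the bump pointer moves on to the next line.
  size_t now_open = LineOf(end - 1);
  if (now_open != span.open_line) {
    if (span.open_line != SIZE_MAX)
      span.line_refs[span.open_line].fetch_sub(1, std::memory_order_relaxed);
    span.line_refs[now_open].fetch_add(1, std::memory_order_relaxed);
    span.open_line = now_open;
  }
  span.bump = end;
  return span.base + start;
}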
Bootstrap allocator. Llama needs some basic functionality during initialization, such as querying the binary's symbol table. For prototyping, Llama uses a bootstrap allocator that handles initial allocations before executing the program. Our prototype uses TCMalloc as this bootstrap allocator. The memory usage reported in this paper consists of memory allocated by both allocators, including fragmented memory. Bootstrap memory is a small fraction of the heap. A full implementation would likely use a simpler bootstrap allocator.

9 Evaluation

We evaluate Llama on four workloads. Except for Redis, they are large production code bases:

Image Processing Server. A Google-internal production image processing server. We use synthetic inputs, but the fragmentation in our experiments is consistent with production.

TensorFlow. The open-source TensorFlow Serving framework [44] running the InceptionV3 [50] image recognition model. This workload exercises libraries with complex memory allocation behavior, such as the Eigen linear algebra library. It runs 400 batches of requests in a harness.

Data Processing Pipeline. A Google-internal data processing workload running word count on a 1 GB file with 100 M words. We run the entire computation in a single process, which creates very high allocator pressure, resulting in 476 parallel threads and 5M allocations per second.

Redis. The open-source Redis key-value store (v. 4.0.1) running its standard redis-benchmark, configured with 5K concurrent connections and 100K operations of 1000 B. We rename its zcalloc function to avoid a name collision.

The goal of the evaluation is to 1) demonstrate this approach is promising and works on large production code bases; 2) understand trade-offs, such as the model's generalization abilities; and 3) characterize Llama. We use a workstation with a 6-core Intel Xeon E5-1650 CPU running at 3.60GHz with 64 GB of DRAM and Linux kernel version 4.19.37.
Workload | Prediction Accuracy (Weighted / Unweighted) | Final Steady-state Memory (TCMalloc / Llama / Live) | Fragmentation Reduction
Image Processing Server | 96% / 73% | 664 MB / 446 MB / 153 MB | 43%
TensorFlow InceptionV3 Benchmark | 98% / 94% | 282 MB / 269 MB / 214 MB | 19%
Data Processing Pipeline | 99% / 78% | 1964 MB / 481 MB / 50 MB | 78%
Redis Key-Value Store | 100% / 94% | 832 MB / 312 MB / 115 MB | 73%
Table 3. Summary of Model Accuracy and End-to-end Fragmentation Results
Figure 8. Llama reduces huge page (HP) fragmentation compared to TCMalloc on the Image Processing Server. TCMalloc numbers optimistically assume all free spans are immediately returned to the OS, which is not the case.

Figure 9. Llama's memory consumption with perfect lifetime predictions (using traces) is close to an oracle and closely follows the live heap size. (a) Image Processing Server; (b) TensorFlow InceptionV3 Benchmark.
These workloads stress every part of our allocator. They use 10s to 100s of threads, mix C++ and C memory allocation, use object alignment, have a large ratio of allocated to live objects, and exhibit a large amount of thread sharing. They frequently communicate objects between threads, causing the free lists to be "shuffled" and leading to fragmentation.

We believe these workloads are representative of modern C/C++ server applications. They stress the memory allocator significantly more than workloads used in some prior C/C++ memory manager evaluations, such as SPEC CPU. These patterns are similar to Java applications, illustrating the evolution of C/C++ applications and how they heavily rely on their memory managers.

9.1 End-to-end Evaluation

Table 3 shows end-to-end fragmentation improvements over TCMalloc for the four workloads (not from simulation), ranging from 19% to 78%. Figure 8 shows image processing server fragmentation as a function of time. Since vanilla TCMalloc does not support huge pages, we reconstruct the number of occupied and free huge pages from its bookkeeping information. This method is a lower bound because it does not take into account that TCMalloc does not immediately (or sometimes ever) release pages to the OS. TCMalloc's actual occupancy will be between this amount and the largest peak in the trace, depending on page release rate. Even when compared with the most optimistic variant, we eliminate 43% of the fragmentation introduced by TCMalloc for the image server (in steady state and at termination). Note these results include the memory overheads of our model.

The data processing pipeline represents a different kind of workload than the servers. While the heap size variation in servers results from changing request size patterns, the data processing pipeline's heap size varies based on its execution stages. Fragmentation occurs when long-lived outputs of a stage are allocated while the heap contains a large amount of temporary data from an active stage.

Redis illustrates the limitations of a PGO-based approach. Our model learns the difference between per-connection data (which is short-lived) and stored data (which is long-lived). However, Redis servers are often dominated by stored data, and the lifetime of these objects is entirely determined by client requests and cannot be predicted. As such, Redis represents workloads where a PGO approach alone is limited. Redis implements a feature called active defragmentation that relocates its long-lived stored data, giving the allocator an opportunity to compact memory and decrease fragmentation. Redis thus illustrates that fragmentation is a large enough problem that the developers hand-coded a mitigation. However, this approach only supports Redis's stored-data data structure, and not other objects (e.g., session state). We hypothesize that a model can be effective when combined with this mechanism to only predict lifetimes of non-Redis objects. Further, if client requests have regularity (e.g., when Redis is used as a cache), the model might be able to learn this behavior as well.
Figure 10. The lifetime model generalizes to unobserved allocation sites from different versions and compiler settings. Blue shows accuracy per stack trace, green weighted by allocations. Light/dotted data shows off-by-one accuracy. (a) Sampling Rate; (b) Workload Variations.

Figure 11. Validation of sampling and compiled model execution latency for the image processing server. (a) Sampling Validation; (b) LSTM Execution Time.
To isolate the impact of the accuracy of lifetime predictions from Llama's memory management algorithm, we measure its effectiveness with perfect predictions. We link the allocator into our simulator and run it using pre-recorded traces with perfect lifetime information. Figure 9 shows that with a perfect lifetime oracle, the average fragmentation is less than 1.1× for both workloads. This result demonstrates that Llama succeeds at packing objects into huge pages.

9.2 Model Evaluation

LSTM Model Generalization. Figure 10 shows accuracy remains high when training our model on one version of the image server and applying it to another. The same configuration in Table 2 shows almost no matching stack traces (i.e., a lookup table would not work). In contrast, the model achieves upwards of 80% accuracy when applied to the other revision, and increases to 95% when ignoring errors where the prediction is off by at most one lifetime class. We see an interesting effect for the non-optimized build. This example achieves few exact matches but higher accuracy for off-by-one errors. We hypothesize that because the non-optimized version of the code runs slower, lifetimes are consistently in a higher class than optimized code.

9.3 Sampling Effectiveness

We measure the overhead of sampling RPC latencies in the image processing server at an average of ≈5%, with large variations (1–8%). To evaluate if the sampled data and the full trace data we use elsewhere in the paper are consistent, Figure 11a shows the distribution of lifetime classes of full traces sub-sampled at 1:100 K, compared to the lifetime profiler's data. Note that this log-scale figure does not imply that the fractions of the different traces are the same, but that they are in the same order of magnitude for each of the classes, the accuracy the system needs.

To evaluate how many samples we need to construct an accurate model, we run our image processing workload 20 times for a total of 2.3 B allocations, and sample each allocation with a particular probability ranging from 1:100 to 1:1,000,000. We then compare the resulting predictions to the original training data (Figure 10). Even when only sampling every millionth allocation, the model still produces the same output as the training data 80% of the time, and almost 100% of predictions are off by at most one lifetime class. This demonstrates our model's ability to generalize.

9.4 Predictor Overheads

Latency. We next evaluate the computational performance of our model. Figure 11b shows the prediction latency with increasing stack sizes. We compare two different models to understand the trade-off space. The model we use throughout uses a 64-dimensional vector as the internal state of the LSTM. We compare to a smaller model with a 16-dimensional vector that can potentially store less information but executes more quickly. In practice, we would tune this parameter when we train an application-specific memory allocator.

Memory Consumption. We measure the memory consumption introduced by our predictor. First, our allocator loads the symbol map associated with the binary, which is 17 MB for the image processing server. Next, every instance of the model's internal buffers uses 58 KB (the number of instances limits the number of parallel threads performing prediction simultaneously). We use 64 of them (less than 4 MB of memory). Finally, the allocator maintains a map from symbols to tokens. We could fold this memory into the symbol table to eliminate most of this overhead. The allocator thus adds 56 MB for prediction for this workload, less than 2% of the maximum heap size. As we show in Section 9.1, Llama recoups this memory easily.

Stack hashing accuracy. For the image server, 95% of predictions hit in the cache, which shows stack hashing reduces model evaluations. To evaluate accuracy, we sample predictions and measure how often they disagreed with the cached value. They disagree 14% of the time, but only require updates to longer lifetime classes for 1.6% of allocation sites.
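Such a cache can be sketched as follows; the hashing scheme, the locking, and the policy of only upgrading entries to longer lifetime classes are assumptions based on the description above, not the actual implementation:

#include <cstddef>
#include <cstdint>
#include <mutex>
#include <unordered_map>

using LifetimeClass = int;  // e.g., an index into exponentially spaced lifetime classes

uint64_t HashStack(const void* const* frames, size_t depth) {
  uint64_t h = 14695981039346656037ull;          // FNV-1a over the return addresses
  for (size_t i = 0; i < depth; ++i) {
    h ^= reinterpret_cast<uint64_t>(frames[i]);
    h *= 1099511628211ull;
  }
  return h;
}

class PredictionCache {
 public:
  // Returns true and fills *cls on a hit; on a miss the caller runs the model
  // (expensive) and installs the result with Update().
  bool Lookup(uint64_t stack_hash, LifetimeClass* cls) {
    std::lock_guard<std::mutex> lock(mu_);
    auto it = cache_.find(stack_hash);
    if (it == cache_.end()) return false;
    *cls = it->second;
    return true;
  }
  void Update(uint64_t stack_hash, LifetimeClass predicted) {
    std::lock_guard<std::mutex> lock(mu_);
    auto [it, inserted] = cache_.emplace(stack_hash, predicted);
    if (!inserted && predicted > it->second) it->second = predicted;  // only move to longer classes
  }
 private:
  std::mutex mu_;
  std::unordered_map<uint64_t, LifetimeClass> cache_;
};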
9.5 Lifetime Aware Memory Allocator Performance

We now characterize Llama's performance. While our research prototype is not highly tuned, we ensure its performance is sufficient to run the full benchmarks at reasonable speed. We now discuss the prototype's bottlenecks and how a production implementation could address them. We believe that none of these bottlenecks are fundamental.
Configuration | alloc+free latency
TCMalloc fast path | 8.3 ± 0.1 ns
TCMalloc global allocator | 81.7 ± 1.0 ns
Fast path (w/o prediction) | 29.1 ± 0.9 ns
Without lines/recycling block spans | 17.1 ± 0.8 ns
With 2 threads | 28.6 ± 0.1 ns
With 4 threads | 28.7 ± 0.1 ns
Fast path (prediction cached) | 48.8 ± 0.1 ns
Fast path (run ML model, size=64) | 144.6 ± 1.5 µs
Global allocator (w/o prediction) | 52.7 ± 2.9 ns
With 2 threads | 274.5 ± 38.0 ns
With 4 threads | 802.2 ± 75.0 ns
Global allocator (prediction cached) | 88.0 ± 7.8 ns
Global allocator (run ML model, size=64) | 143.8 ± 1.2 µs
Table 4. Memory Allocator alloc+free Performance

Figure 12. Llama reduces fragmentation compared to Mesh. (a) Image Server Simulation; (b) Microbenchmark.
Production memory allocators are highly tuned and applications are often co-optimized with a particular memory allocator. Allocator optimizations include rigorous tuning of every instruction on the fast path, software prefetch instructions, use of restartable sequences to reduce synchronization overheads, size class tuning, and fine-grained locking. In contrast, our allocator contains very few optimizations and the global allocator is protected by a central lock, which is currently the main performance bottleneck. We also do not take advantage of sized deallocation (C++14). Compared to TCMalloc, which handles objects of up to 256 KB using the fast path, Llama's cut-off is 8 KB, causing a larger fraction of allocations to use the slow path. Kanev et al. describe many fast path optimizations in TCMalloc [32].
We use microbenchmarks to quantify the slowdown of allocation relative to TCMalloc for a number of common allocation paths. On average, allocation is currently 2–3× slower than TCMalloc. In practice, the memory allocator sees much less contention than in this stress test, and end-to-end slowdowns are less dramatic (Section 9.1). For example, the image server slows down ≈12.5% per query compared to TCMalloc. On the other end of the spectrum, the global lock is a problem under very high allocator pressure. For the data processing pipeline with 476 threads mapped to 6 physical cores and 5M allocations per second, Llama's performance degrades by 2.84× compared to a recent version of TCMalloc [26]. Note that TCMalloc is highly tuned and that this benchmark is limited by global synchronization in Llama and thus is particularly advantageous for TCMalloc.

The overheads in our allocator could be addressed. In the fast path, the main bottleneck is the atomic operations required to update object counts – these operations could be elided by operating entirely on thread-local counters and only writing them back when an open block span is released. In the slow path, the main bottleneck is the global lock. This is particularly pronounced when the number of threads exceeds the number of physical cores. This lock could be replaced with a readers-writer lock for the list of active pages (which is mostly read-only) and a per-huge-page lock that is only acquired when a page is updated. The list could also be implemented as a lock-free data structure. While these overheads mean that our research prototype is not production-ready, the focus of this work has been fragmentation and our prototype is suitable for evaluating it.
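A minimal sketch of the proposed fast-path optimization (not implemented in the prototype; names are illustrative, and frees arriving from other threads would still need synchronization):

#include <atomic>
#include <cstdint>

struct BlockSpanCounts {
  std::atomic<int64_t> live_objects{0};   // shared count for the span
};

thread_local int64_t tl_pending = 0;      // updates not yet published by this thread

inline void NoteAllocation() { ++tl_pending; }   // fast path: no atomic operation
inline void NoteLocalFree()  { --tl_pending; }   // frees by the owning thread

// Called when the thread closes or hands off its open block span.
inline void ReleaseSpan(BlockSpanCounts& span) {
  if (tl_pending != 0) {
    span.live_objects.fetch_add(tl_pending, std::memory_order_relaxed);  // fold in the delta once
    tl_pending = 0;
  }
}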
We also gather statistics to confirm that Llama's different behaviors and execution paths are actually exercised by our workloads. For the image processing server (spanning 130M allocations and 207 GB allocated), the allocator allocates 640 K block spans, observes expiring deadlines 1,011 times, and demotes huge pages 8,492 times, confirming that the benchmarks exercise Llama's key features.

9.6 Comparison to Mesh [46]

Fragmentation induced by long-lived objects allocated at peak memory usage is fundamental to most memory allocators, since avoiding this fragmentation requires the allocator to know at allocation time which objects are long-lived. As such, strategies such as size class tuning or best-fit allocation do not address this source of fragmentation.

A recent proposal, Mesh [46], takes a different approach and reduces fragmentation by combining (meshing) virtual pages with non-overlapping objects into the same physical page using copying and page remapping. As such, Mesh has the potential to address fragmentation caused by long-lived objects. For example, Mesh reduces fragmentation in Firefox by 16%. We compare Llama to Mesh. A challenge is that Mesh's probabilistic guarantees rely both on random allocation and on small 4 KB pages. The paper states that Mesh is not designed to work with huge pages. We thus compare with Mesh first on the Image Server using huge pages, and then on both small and huge pages using a microbenchmark that simulates varying heap sizes.

Image Server (Simulation). For the image server, we use our simulator to compute occupancy bitmaps throughout the execution and then give them as input to Mesh's analysis scripts to compute meshing opportunities, using the "greedy" mesher. Figure 12a shows Llama saves memory by a factor of 2 to 5 compared to meshing throughout the execution of the image server.
Microbenchmark. We compare Llama to Mesh and TCMalloc on small and huge pages using a microbenchmark that mimics varying heap size. The microbenchmark allocates a sequence of short-lived 64 B objects and fluctuates between a 1 MB and a 1 GB heap size. Every 10,000 allocations, it allocates a long-lived object, for a total of 5 MB of long-lived data spread out evenly across the virtual address space. It represents a stress test for the type of fragmentation that Llama and Mesh address. At the end of the execution, all but the long-lived objects are freed and we report live pages in Figure 12b for small and huge pages.
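The allocation pattern can be sketched roughly as follows; the 64 B object size, the 1 MB–1 GB fluctuation, the one-in-10,000 long-lived allocation, and the 5 MB budget come from the description above, while the driver structure itself is an assumption rather than the benchmark's source:

#include <cstdlib>
#include <vector>

int main() {
  constexpr size_t kObjSize = 64;
  constexpr size_t kLowWater  = 1ull << 20;        // ~1 MB of live short-lived data
  constexpr size_t kHighWater = 1ull << 30;        // ~1 GB
  constexpr size_t kLongLivedBudget = 5ull << 20;  // 5 MB of long-lived data in total
  std::vector<void*> short_lived, long_lived;
  size_t allocs = 0;
  bool growing = true;
  while (long_lived.size() * kObjSize < kLongLivedBudget) {
    if (growing) {
      short_lived.push_back(std::malloc(kObjSize));
      if (++allocs % 10000 == 0) long_lived.push_back(std::malloc(kObjSize));  // scattered long-lived objects
      if (short_lived.size() * kObjSize >= kHighWater) growing = false;
    } else {
      std::free(short_lived.back());
      short_lived.pop_back();
      if (short_lived.size() * kObjSize <= kLowWater) growing = true;
    }
  }
  for (void* p : short_lived) std::free(p);  // only the long-lived objects remain live
  return 0;                                  // residual fragmentation is measured here
}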
Figure 12b shows vanilla TCMalloc incurs high fragmentation. With 2 MB pages, it frees almost no pages. With 4 KB pages, it frees about half of the memory. Note that not all this fragmentation is caused by live objects, as TCMalloc has cells held in caches and free lists. In contrast, Mesh (only counting memory in MiniHeaps) reclaims most of the fragmentation in the 4 KB pages case (91.7 MB), as intended. However, when looking at 2 MB pages, this memory becomes 558 MB, confirming that Mesh works well with 4 KB pages but not 2 MB pages. Meanwhile, our allocator only uses 22 MB in both cases when supplied with correct lifetime predictions, not accounting for the bootstrap allocator or any models.

These experiments show Mesh is highly effective for addressing fragmentation with 4 KB pages. While Mesh alone does not solve the fragmentation problem for huge pages, we believe that our approach can be combined with Mesh to further reduce fragmentation. When Llama encounters long-lived blocks on mostly empty pages, the global allocator could avoid the corresponding locations on other pages, making it more likely that these pages can be meshed in the future. This approach could likely use the same bitmap-based mechanism already used by Llama.

10 Discussion

Extension to other properties. Our model predicts lifetimes, but the allocator can benefit from other properties, e.g., whether or not an object will be freed by the same thread that allocated it. This information is useful because it allows us to allocate objects that stay within the same thread in the same block span, which reduces synchronization and improves performance. As with the page allocator, we need to consider mispredictions. As we are using atomic operations to update the reference count, correction is simple. If the prediction was correct, performance improves from reduced synchronization. For incorrect predictions, we incur a minor performance loss by having to synchronize on the cache line, but these are rare if predictions are mostly correct. A more generalized predictor could inform various other memory allocation strategies (e.g., based on object sizes, alignment, freeing thread, etc.) and learn which strategy to pick for each allocation. The strategies themselves could be determined by simulating different scenarios and using techniques such as Bayesian optimization to choose among them [22].

Improving accuracy and reducing prediction costs. The cost of our model could be significantly reduced. Currently, we need to look up each stack pointer within a trace in the binary's symbol table, tokenize it, multiply the results with the embedding matrix, and feed it into the model. While we cache tokenizations of symbols, these lookups incur additional delays at runtime. Instead, we could precompute all of these steps at compile time when we build the symbol table, including the execution of the part of the model that multiplies tokens with the embedding matrix. This approach is a form of partial evaluation.
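A rough sketch of this partial-evaluation idea, with all names and data structures assumed for illustration: at build time, every symbol is tokenized and multiplied with the embedding matrix, leaving one precomputed vector per stack frame; at runtime, prediction only gathers the vectors for the frames of the current stack.

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Embedding = std::vector<float>;

struct PrecomputedTable {
  // Keyed by symbol id (or program counter); filled offline when the symbol
  // table is built, so no tokenization happens at allocation time.
  std::unordered_map<uint64_t, Embedding> frame_embedding;
};

std::vector<Embedding> GatherFeatures(const PrecomputedTable& table,
                                      const uint64_t* frames, size_t depth) {
  std::vector<Embedding> features;
  features.reserve(depth);
  for (size_t i = 0; i < depth; ++i) {
    auto it = table.frame_embedding.find(frames[i]);
    if (it != table.frame_embedding.end()) features.push_back(it->second);
    // Unknown frames could fall back to a shared "unknown symbol" embedding.
  }
  return features;  // one vector per frame instead of one per token
}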
We may also be able to reduce the latency of our model by not feeding sequences of tokens into the model but by learning an embedding for entire stack frames. This approach may reduce the LSTM length by an order of magnitude, and would be particularly effective when combined with partial evaluation. A final optimization is to memoize our hash tables across runs to avoid startup overheads.

General implications for ML for Systems. We believe that many issues this paper addresses for using ML apply to other systems problems, such as sizing queues and data structures (e.g., vectors and maps). These predictions are also latency-sensitive, can benefit from calling context, and need to tolerate mispredictions. We think a general approach to system resource management problems is to decompose the problem into a supervised learning problem that can be solved by learning from profiling data and a conventional algorithmic solution for handling mispredictions.

11 Conclusion

We show that modern ML techniques can be effectively used to address fragmentation in C++ server workloads that is induced by long-lived objects allocated at peak heap size. We use language models to predict lifetimes for unobserved allocation sites, a problem unexplored in prior lifetime prediction work. We introduce Llama, a novel memory manager that organizes the heap using huge pages and lifetime classes, instead of size classes. Llama packs objects with similar lifetimes together in the blocks of a huge page, tracks actual lifetimes, and uses them to correct for mispredictions. It limits fragmentation by filling gaps created by frees in blocks and their lines with shorter-lived objects. In this context, this work solves challenges related to applying ML to systems problems with strict resource and latency constraints.

Acknowledgements. We would like to thank our shepherd Harry Xu for his help improving the paper. We would also like to thank Ana Klimovic, Chris Kennelly, Christos Kozyrakis, Darryl Gove, Jeff Dean, Khanh Nguyen, Mark Hill, Martin Abadi, Mike Burrows, Milad Hashemi, Paul Barham, Paul Turner, Sanjay Ghemawat, Steve Hand and Vijay Reddi, as well as the anonymous reviewers, for their feedback. Finally, we would like to give credit to Rebecca Isaacs and Amer Diwan for the initial implementation of the stack hashing mechanism.
References
[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-scale Machine Learning. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 265–283. http://dl.acm.org/citation.cfm?id=3026877.3026899
[2] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1792–1803. https://doi.org/10.14778/2824032.2824076
[3] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
[4] David A. Barrett and Benjamin G. Zorn. 1993. Using Lifetime Predictors to Improve Memory Allocation Performance. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI '93). ACM, New York, NY, USA, 187–196. https://doi.org/10.1145/155090.155108
[5] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. 2000. Hoard: A Scalable Memory Allocator for Multithreaded Applications. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX). ACM, New York, NY, USA, 117–128. https://doi.org/10.1145/378993.379232
[6] Emery D. Berger, Benjamin G. Zorn, and Kathryn S. McKinley. 2002. Reconsidering Custom Memory Allocation. In Proceedings of the 17th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA '02). ACM, New York, NY, USA, 1–12. https://doi.org/10.1145/582419.582421
[7] Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, Inc.
[8] Stephen Blackburn, Richard E. Jones, Kathryn S. McKinley, and J. Eliot B. Moss. 2002. Beltway: Getting Around Garbage Collection Gridlock. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 153–164.
[9] Stephen M. Blackburn, Perry Cheng, and Kathryn S. McKinley. 2004. Myths and Realities: The Performance Impact of Garbage Collection. In Proceedings of the International Conference on Measurements and Modeling of Computer Systems, SIGMETRICS 2004. 25–36. https://doi.org/10.1145/1005686.1005693
[10] Stephen M. Blackburn and Kathryn S. McKinley. 2008. Immix: A Mark-region Garbage Collector with Space Efficiency, Fast Collection, and Mutator Performance. In Proceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '08). 22–32. https://doi.org/10.1145/1375581.1375586
[11] Stephen M. Blackburn, Sharad Singhai, Matthew Hertz, Kathryn S. McKinley, and J. Eliot B. Moss. 2001. Pretenuring for Java. In Proceedings of the 16th ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications (OOPSLA '01). ACM, New York, NY, USA, 342–352. https://doi.org/10.1145/504282.504307
[12] Michael D. Bond, Graham Z. Baker, and Samuel Z. Guyer. 2010. Breadcrumbs: Efficient Context Sensitivity for Dynamic Bug Detection Analyses. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2010, Toronto, Ontario, Canada. 13–24. https://doi.org/10.1145/1806596.1806599
[13] Rodrigo Bruno, Duarte Patricio, José Simão, Luis Veiga, and Paulo Ferreira. 2019. Runtime Object Lifetime Profiler for Latency Sensitive Big Data Applications. In Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys '19). ACM, New York, NY, USA, Article 28, 16 pages. https://doi.org/10.1145/3302424.3303988
[14] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert Henry, Robert Bradshaw, and Nathan Weizenbaum. 2010. FlumeJava: Easy, Efficient Data-Parallel Pipelines. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 363–375.
[15] Pohua P. Chang, Scott A. Mahlke, William Y. Chen, and Wen-Mei W. Hwu. 1992. Profile-guided Automatic Inline Expansion for C Programs. Software: Practice and Experience 22, 5 (1992), 349–369. https://doi.org/10.1002/spe.4380220502
[16] Daniel Clifford, Hannes Payer, Michael Stanton, and Ben L. Titzer. 2015. Memento Mori: Dynamic Allocation-site-based Optimizations. In Proceedings of the 2015 ACM SIGPLAN International Symposium on Memory Management. 105–117.
[17] David A. Cohn and Satinder P. Singh. 1997. Predicting Lifetimes in Dynamically Allocated Memory. In Advances in Neural Information Processing Systems 9, M. C. Mozer, M. I. Jordan, and T. Petsche (Eds.). MIT Press, 939–945. http://papers.nips.cc/paper/1240-predicting-lifetimes-in-dynamically-allocated-memory.pdf
[18] David Detlefs, Christine H. Flood, Steve Heller, and Tony Printezis. 2004. Garbage-first Garbage Collection. In ACM International Symposium on Memory Management (ISMM). 37–48. https://doi.org/10.1145/1029873.1029879
[19] Jason Evans. 2006. A Scalable Concurrent malloc(3) Implementation for FreeBSD. In Proceedings of the BSDCan Conference, Ottawa, Canada.
[20] Kunihiko Fukushima. 1980. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics 36, 4 (1980), 193–202.
[21] Sanjay Ghemawat and Paul Menage. 2009. TCMalloc: Thread-Caching Malloc. http://goog-perftools.sourceforge.net/doc/tcmalloc.html
[22] Daniel Golovin, Benjamin Solnik, Subhodeep Moitra, Greg Kochanski, John Karro, and D. Sculley. 2017. Google Vizier: A Service for Black-Box Optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1487–1495.
[23] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org
[24] Google. 2020. C++ Arena Allocation Guide. https://developers.google.com/protocol-buffers/docs/reference/arenas
[25] Google. 2020. pprof. https://github.com/google/pprof
[26] Google. 2020. TCMalloc. https://github.com/google/tcmalloc
[27] Swapnil Haria, Mark D. Hill, and Michael M. Swift. 2018. Devirtualizing Memory in Heterogeneous Systems. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 637–650. https://doi.org/10.1145/3173162.3173194
[28] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[29] Jipeng Huang and Michael D. Bond. 2013. Efficient Context Sensitivity for Dynamic Analyses via Calling Context Uptrees and Customized Memory Management. In ACM Conference on Object-Oriented Programming Languages and Systems (OOPSLA). 53–72.
[30] Maria Jump, Stephen M. Blackburn, and Kathryn S. McKinley. 2004. Dynamic Object Sampling for Pretenuring. In Proceedings of the 4th International Symposium on Memory Management, ISMM 2004, Vancouver, BC, Canada, October 24-25, 2004. 152–162. https://doi.org/10.1145/1029873.1029892
[31] Svilen Kanev, Juan Pablo Darago, Kim Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, and David Brooks. 2015. Profiling a Warehouse-scale Computer. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 158–169. https://doi.org/10.1145/2749469.2750392
[32] Svilen Kanev, Sam Likun Xi, Gu-Yeon Wei, and David Brooks. 2017. Mallacc: Accelerating Memory Allocation. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17). ACM, New York, NY, USA, 33–45. https://doi.org/10.1145/3037697.3037736
[33] Vasileios Karakostas, Jayneel Gandhi, Furkan Ayar, Adrián Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, and Osman S. Unsal. 2015. Redundant Memory Mappings for Fast Access to Large Memories. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, Portland, OR, USA, June 13-17, 2015. 66–78. https://doi.org/10.1145/2749469.2749471
[34] Sang-Hoon Kim, Sejun Kwon, Jin-Soo Kim, and Jinkyu Jeong. 2015. Controlling Physical Memory Fragmentation in Mobile Systems. In Proceedings of the 2015 ACM SIGPLAN International Symposium on Memory Management, ISMM 2015, Portland, OR, USA, June 13-14, 2015. 1–14. https://doi.org/10.1145/2754169.2754179
[35] Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings. http://arxiv.org/abs/1412.6980
[36] Bradley C. Kuszmaul. 2015. SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines. In Proceedings of the 2015 ACM SIGPLAN International Symposium on Memory Management, ISMM 2015, Portland, OR, USA. 41–55. https://doi.org/10.1145/2754169.2754178
[37] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J. Rossbach, and Emmett Witchel. 2016. Coordinated and Efficient Huge Page Management with Ingens. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI'16). USENIX Association, Berkeley, CA, USA, 705–721. http://dl.acm.org/citation.cfm?id=3026877.3026931
[38] Doug Lea and Wolfram Gloger. 1996. A Memory Allocator.
[39] Yann LeCun, Bernhard Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne Hubbard, and Lawrence D. Jackel. 1989. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation 1, 4 (1989), 541–551.
[40] David Lo, Liqun Cheng, Rama Govindaraju, Luiz André Barroso, and Christos Kozyrakis. 2014. Towards Energy Proportionality for Large-scale Latency-critical Workloads. In ACM International Conference on Computer Architecture (ISCA). 301–312. https://doi.org/10.1109/ISCA.2014.6853237
[41] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013).
[42] Todd Mytkowicz, Devin Coughlin, and Amer Diwan. 2009. Inferred Call Path Profiling. In Proceedings of the 24th ACM SIGPLAN Conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '09). ACM, New York, NY, USA, 175–190. https://doi.org/10.1145/1640089.1640102
[43] Khanh Nguyen, Kai Wang, Yingyi Bu, Lu Fang, Jianfei Hu, and Guoqing Xu. 2015. FACADE: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '15). ACM, New York, NY, USA, 675–690. https://doi.org/10.1145/2694344.2694345
[44] Christopher Olston, Noah Fiedel, Kiril Gorovoy, Jeremiah Harmsen, Li Lao, Fangwei Li, Vinu Rajashekhar, Sukriti Ramesh, and Jordan Soyke. 2017. TensorFlow-Serving: Flexible, High-performance ML Serving. arXiv preprint arXiv:1712.06139 (2017).
[45] Ashish Panwar, Aravinda Prasad, and K. Gopinath. 2018. Making Huge Pages Actually Useful. In Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18). ACM, New York, NY, USA, 679–692. https://doi.org/10.1145/3173162.3173203
[46] Bobby Powers, David Tench, Emery D. Berger, and Andrew McGregor. 2019. Mesh: Compacting Memory Management for C/C++ Applications. In ACM Conference on Programming Language Design and Implementation (PLDI). 333–346. https://doi.org/10.1145/3314221.3314582
[47] Chuck Rossi. 2017. Rapid Release at Massive Scale. https://engineering.fb.com/web/rapid-release-at-massive-scale/
[48] Rifat Shahriyar, Stephen M. Blackburn, Xi Yang, and Kathryn S. McKinley. 2013. Taking Off the Gloves with Reference Counting Immix. In ACM Conference on Object-Oriented Programming Languages and Systems (OOPSLA). 93–110. https://doi.org/10.1145/2509136.2509527
[49] Darko Stefanovic, Kathryn S. McKinley, and J. Eliot B. Moss. 1999. Age-Based Garbage Collection. In ACM SIGPLAN Conference on Object-Oriented Programming Languages and Systems (OOPSLA). 370–381. https://doi.org/10.1145/320385.320425
[50] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the Inception Architecture for Computer Vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[51] David M. Ungar. 1984. Generation Scavenging: A Non-Disruptive High Performance Storage Reclamation Algorithm. In Proceedings of the ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, Pittsburgh, Pennsylvania, USA, April 23-25, 1984. 157–167. https://doi.org/10.1145/800020.808261
