Learning-Based Memory Allocation For C++ Server Workloads
...
This approach follows continuous profiling tools used in production settings [31]. We integrate this approach into TCMalloc [21]. Its existing heap profiling mechanism identifies long-lived objects well by producing a list of sampled objects at the end of the application's execution, most of which are long-lived, including their allocation sites. It misses the more prolific allocations of short-lived objects that are not live at the end of the program. We therefore extend the heap profiling mechanism to record frees (deallocations) as well. We do so using hooks (i.e., functions) that are called periodically, based on the number of allocated bytes. These hooks incur virtually no overhead when they are disabled. When enabled, each sampled allocation triggers TCMalloc to store it at a special address in memory; deallocation can then identify those sampled objects and call the corresponding deallocation hook.
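The sketch below illustrates byte-triggered sampling in the spirit of these hooks. It is illustrative only: the hook types, names, and the sampling period are assumptions, not TCMalloc's actual interfaces.

#include <atomic>
#include <cstddef>
#include <new>

// Hypothetical hook signatures; TCMalloc's real hook API differs.
using AllocHook = void (*)(const void* ptr, std::size_t size);
using FreeHook  = void (*)(const void* ptr);

std::atomic<AllocHook> g_alloc_hook{nullptr};
std::atomic<FreeHook>  g_free_hook{nullptr};

// Sample roughly once every kSamplePeriod allocated bytes (assumed value).
constexpr std::size_t kSamplePeriod = 512 * 1024;
thread_local std::size_t bytes_until_sample = kSamplePeriod;

void* MaybeSampledAllocate(std::size_t size) {
  void* ptr = ::operator new(size);
  if (size >= bytes_until_sample) {
    bytes_until_sample = kSamplePeriod;
    // A real implementation would also tag the sampled object so that its
    // free can be recognized; here we only invoke the hook if one is set.
    if (AllocHook hook = g_alloc_hook.load(std::memory_order_relaxed))
      hook(ptr, size);
  } else {
    bytes_until_sample -= size;  // the only cost when sampling is disabled
  }
  return ptr;
}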
We install an HTTP handler accessible by pprof [25], an open-source profiling and analysis tool. When invoked, the handler registers two hooks, one for allocation and one for deallocation. It also allocates a new data structure (outside of the TCMalloc-managed heap) to store observed stack traces. The allocation hook stores the allocation's full stack trace, a timestamp of the allocation, object size, alignment, and the stack and processor ID of the allocation into a hash table, indexed by a pointer to where the object was allocated. The deallocation hook matches its pointer to the hash table and, if it finds an entry, records its own stack trace, timestamp, and thread/CPU where the deallocation occurred. This pair of entries is then stored in a different hash table, which is used to deduplicate all samples. For each entry, we keep a running tally of the distribution of lifetimes by storing the maximum, minimum, count, sum, and sum of squares (the latter two allow us to calculate the mean and variance of the lifetime at a later point). We also store how many of these allocations were allocated and deallocated on the same CPU or thread (we do not currently use this information, but explain in Section 10 how it might be used). At the end of a sampling period, we store the result into a protocol buffer [24].
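A minimal sketch of such a running tally follows (illustrative names; the actual record also holds stack traces, size, alignment, and CPU/thread information):

#include <algorithm>
#include <cmath>
#include <cstdint>

// Running lifetime statistics for one deduplicated sample: maximum,
// minimum, count, sum, and sum of squares, from which mean and
// variance follow.
struct LifetimeStats {
  std::uint64_t count = 0;
  double min_ms = INFINITY;
  double max_ms = 0.0;
  double sum    = 0.0;
  double sum_sq = 0.0;

  void Record(double lifetime_ms) {
    ++count;
    min_ms = std::min(min_ms, lifetime_ms);
    max_ms = std::max(max_ms, lifetime_ms);
    sum    += lifetime_ms;
    sum_sq += lifetime_ms * lifetime_ms;
  }
  double Mean() const { return count ? sum / count : 0.0; }
  double Variance() const {  // E[X^2] - E[X]^2
    if (count == 0) return 0.0;
    const double m = Mean();
    return sum_sq / count - m * m;
  }
};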
In deployment, we would periodically connect to servers in the fleet and collect samples. For this research, we run smaller-scale experiments to understand the trade-offs of our approach and mostly rely on full traces collected by instrumenting allocation and free calls. While too expensive for production, this approach is useful for understanding the coverage of different sampling rates (Section 9) or to replay traces.

5 Lifetime Prediction Model

Our goal is to predict object lifetimes based on our collection of past lifetime samples. As shown in Section 2, a simple lookup table is insufficient and brittle to changes in the application. We instead construct a dataset of samples from a range of scenarios and train a machine learning model on this dataset to generalize to previously unseen stack traces.

Figure 4. LSTM-based model architecture.

5.1 Data Processing

We pre-process our sample data using a distributed dataflow computation framework [2, 14]. We group inputs by allocation site and calculate the distribution of observed lifetimes for each site. We use the 95th percentile 𝑇95𝑖 of observed lifetimes of site 𝑖 to assign a label 𝐿𝑖 ∈ {1, . . . , 7, ∞} to the site such that 𝑇95𝑖 < 𝑇(𝐿𝑖) = 10^𝐿𝑖 ms. Objects the program never frees get a special long-lived label ∞. This produces lifetime classes of 10 ms, 100 ms, 1 s, 10 s, 100 s, 1000 s, ≥1000 s, and ∞. Our model classifies stack traces according to these labels. To ensure that our model assigns greater importance to stack traces that occur more often, we weight each stack trace according to the number of times it was observed and sample multiple copies for frequently occurring traces. The resulting datasets for our applications contain on the order of tens of thousands of elements.
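The labeling rule can be stated compactly in code. This is an illustrative reimplementation of the formula above, not the dataflow pipeline itself; kLongLived stands in for the ∞ label.

#include <cmath>
#include <cstdio>

// Assign site i the smallest class L in {1,...,7} with T95 < 10^L ms;
// the last class catches everything longer (>= 1000 s), and never-freed
// objects get the special long-lived label.
constexpr int kNumClasses = 7;
constexpr int kLongLived  = 8;  // stand-in for the "infinity" label

int LifetimeLabel(double t95_ms, bool ever_freed) {
  if (!ever_freed) return kLongLived;
  for (int L = 1; L < kNumClasses; ++L)
    if (t95_ms < std::pow(10.0, L)) return L;  // T(L) = 10^L ms
  return kNumClasses;  // >= 1000 s
}

int main() {
  std::printf("%d\n", LifetimeLabel(35.0, true));   // 2 -> "< 100 ms"
  std::printf("%d\n", LifetimeLabel(5e6, true));    // 7 -> ">= 1000 s"
  std::printf("%d\n", LifetimeLabel(0.0, false));   // 8 -> never freed
}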
The use of wallclock time for lifetime prediction is a departure from prior work that expresses lifetime with respect to allocated bytes [4], which can be more stable across environments (e.g., server types) at short timescales. We experimented with logical time measured in bytes, but for our server systems, wallclock time works better. We believe time works better because 1) our lifetime classes are very coarse-grained (10×) and absorb variations, 2) if the speed difference between environments is uniform, nothing changes (lifetime classes are still a factor of 10× apart). Meanwhile, variations in application behavior make the bytes-based metric very brittle over long time ranges (e.g., in the image server, the sizes of submitted images, number of asynchronous external events, etc. dilate logical time).

5.2 Machine Learning Model

We use a model similar to text models. First, we treat each frame in the stack trace as a string and tokenize it by splitting based on special characters such as , and ::. We separate stack frames with a special token: @. We take the most common tokens and create a table that maps them to a particular ID, with one special ID reserved for unknown or rare tokens, denoted as UNK. The table size is a configuration parameter (e.g., 5,000 covers most common tokens).
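A sketch of this tokenization follows. The exact separator set is not fully specified above, so the set below is an assumption; a vocabulary table would then map each token to an integer ID, reserving one ID for UNK.

#include <cstdio>
#include <string>
#include <vector>

// Split each frame on separator characters and join frames with the
// special "@" token.
std::vector<std::string> Tokenize(const std::vector<std::string>& frames) {
  const std::string separators = ":,()<>*& ";  // assumed separator set
  std::vector<std::string> tokens;
  for (std::size_t i = 0; i < frames.size(); ++i) {
    if (i > 0) tokens.push_back("@");  // frame boundary marker
    std::string current;
    for (char c : frames[i]) {
      if (separators.find(c) != std::string::npos) {
        if (!current.empty()) tokens.push_back(current);
        current.clear();
      } else {
        current += c;
      }
    }
    if (!current.empty()) tokens.push_back(current);
  }
  return tokens;
}

int main() {
  auto tokens = Tokenize({"proto2::MessageLite::ParseFromArray(void const*, int)",
                          "main"});
  for (const auto& t : tokens) std::printf("%s ", t.c_str());
  // prints: proto2 MessageLite ParseFromArray void const int @ main
}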
We use a long short-term memory (LSTM) recurrent neural network model [28]. LSTMs are typically used for sequence prediction, e.g., for next-word prediction in natural language processing. They capture long-term sequential dependencies by applying a recursive computation to every element in a sequence and outputting a prediction based on the final step. In contrast, feed-forward neural networks like multi-layer perceptrons [23] or convolutional neural networks [20, 39] can recognize local patterns, but require some form of temporal integration in order to apply them to variable-length sequences.

Our choice of an LSTM is informed by stack trace structure. Figure 5 shows an example. Sequentially processing a trace from top to bottom conceptually captures the nesting of the program. In this case, the program is creating a string, which is part of a protocol buffer ("proto") parsing operation, which is part of another subsystem. Each part on its own is not meaningful: A string may be long-lived or short-lived, depending on whether it is part of a temporary data structure or part of a long-lived table. Similarly, some operations in the proto might indicate that a string constructed within it is temporary, but others make the newly constructed string part of the proto itself, which means they have the same lifetime. In this case, the enclosing context that generates the proto indicates whether the string is long- or short-lived. For our model to learn these types of patterns, it must step through the stack frames, carrying through information, and, depending on the context, decide whether or not a particular token is important. This capability is a particular strength of LSTMs (Figure 4). We feed the stack trace into the LSTM as a sequence of tokens (ordered starting from the top of the trace) by first looking up an "embedding vector" for each token in a table represented as a matrix 𝐴. The embedding matrix 𝐴 is trained as part of the model. Ideally, 𝐴 will map tokens with a similar meaning close together in embedding space, similar to word2vec embeddings [41] in natural language processing.

1 __gnu_cxx::__g::__string_base<char, std::__g::char_traits<char>, std::__g::allocator<char>>::_M_reserve(unsigned long)
2 proto2::internal::InlineGreedyStringParser(std::__g::basic_string<char, std::__g::char_traits<char>, std::__g::allocator<char>>*, char const*, proto2::internal::ParseContext*)
3 proto2::FileDescriptorProto::_InternalParse(char const*, proto2::internal::ParseContext*)
4 proto2::MessageLite::ParseFromArray(void const*, int)
5 proto2::DescriptorPool::TryFindFileInFallbackDatabase(std::__g::basic_string<char, std::__g::char_traits<char>, std::__g::allocator<char>> const&) const
6 proto2::DescriptorPool::FindFileByName(std::__g::basic_string<char, std::__g::char_traits<char>, std::__g::allocator<char>> const&) const
7 proto2::internal::AssignDescriptors(proto2::internal::AssignDescriptorsTable*)
8 system2::Algorithm_descriptor()
9 system2::init_module_algorithm_parse()
10 Initializer::TypeData::RunIfNecessary(Initializer*)
11 Initializer::RunInitializers(char const*)
12 RealInit(char const*, int*, char***, bool, bool)
13 main

Figure 5. An example of an altered but representative stack trace used to predict object lifetimes.

Here lies an opportunity for the model to generalize. If the model can learn that tokens such as ParseFromArray and InternalParse appear in similar contexts, it can generalize when it encounters stack traces that it has not seen before.

Note that our approach is not specific to LSTMs. We chose the LSTM architecture since it is one of the simplest sequence models, but future work could explore more sophisticated model architectures that could incorporate more details of the underlying program (e.g., Graph Neural Networks trained on program code [3]). Our specific model architecture is a standard single-layer LSTM with a hidden state size of 64 (we experiment with 16 as well) and an embedding size of 32; it uses a softmax output and is trained against a standard cross-entropy classification loss via gradient descent. The final state of the LSTM is passed through a fully connected layer. Training uses the Adam optimizer [35] with a learning rate of 0.001 and gradients clipped to 5.0.

5.3 Model Implementation

We implement and train our model using TensorFlow [1]. Calling into the full TensorFlow stack to obtain a lifetime prediction would be prohibitively expensive for a memory allocator, so after training, we use TensorFlow's XLA compiler to transform the trained model into C++ code that we compile and link into our allocator directly. The model runs within the allocating thread. To allow multiple threads to use the model concurrently, we instantiate the model's internal buffers multiple times and add concurrency control.
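The following sketch shows one way such a concurrency scheme around an ahead-of-time-compiled model could look. CompiledLstmPredict stands in for the XLA-generated function; its signature, the instance count, and the try-lock rotation are all assumptions for illustration.

#include <algorithm>
#include <array>
#include <cstdint>
#include <mutex>

constexpr int kMaxTokens    = 128;  // assumed input length cap
constexpr int kNumClasses   = 8;    // 7 lifetime classes + long-lived
constexpr int kNumInstances = 4;    // assumed buffer pool size

// Stub standing in for the XLA-compiled model entry point.
void CompiledLstmPredict(const std::int32_t* token_ids, int n_tokens,
                         float* class_probs /* kNumClasses */) {
  for (int c = 0; c < kNumClasses; ++c) class_probs[c] = 0.0f;
}

// Each instance owns one copy of the model's scratch buffers; a mutex
// serializes its use.
struct ModelInstance {
  std::mutex mu;
  std::array<std::int32_t, kMaxTokens> input;
  std::array<float, kNumClasses> output;
};
std::array<ModelInstance, kNumInstances> g_instances;

int PredictLifetimeClass(const std::int32_t* token_ids, int n_tokens) {
  n_tokens = std::min(n_tokens, kMaxTokens);
  for (;;) {  // spin over the pool until an instance is free
    for (auto& inst : g_instances) {
      std::unique_lock<std::mutex> lock(inst.mu, std::try_to_lock);
      if (!lock.owns_lock()) continue;
      std::copy(token_ids, token_ids + n_tokens, inst.input.begin());
      CompiledLstmPredict(inst.input.data(), n_tokens, inst.output.data());
      int best = 0;  // return the most probable class (1-based)
      for (int c = 1; c < kNumClasses; ++c)
        if (inst.output[c] > inst.output[best]) best = c;
      return best + 1;
    }
  }
}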
6 Lifetime Aware Allocator Design

This section introduces a fundamentally new design for C/C++ memory managers based on predicted object lifetimes. Instead of building an allocator around segmenting allocations into size classes [5, 9, 19, 21, 36, 38], we directly manage huge pages and segment object allocation into predicted lifetime classes. We further divide, manage, and track huge pages and their liveness at a block and line granularity to limit fragmentation. We implement our allocator from scratch. It is not yet highly tuned, but it demonstrates the potential of a lifetime-based approach. We address two challenges required to incorporate ML into low-level systems: 1) how to deal with mispredictions and 2) prediction latencies that are orders of magnitude longer than the typical allocation latency. We first describe the structure of the memory allocator, then how we make fast predictions, and follow with key implementation details.

6.1 Heap Structure and Concurrency

We design our memory manager for modern parallel software and hardware. Llama organizes the heap into huge pages to increase TLB reach. To limit physical fragmentation, we divide huge pages into 8 KB blocks and track their liveness. Llama assigns each active huge page one of 𝑁 lifetime classes (LC), separated by at least an order of magnitude (e.g., 10 ms, 100 ms, 1000 ms, . . . , ∞). Our implementation uses a maximum of 𝑁 = 7 lifetime classes. Llama exploits the large virtual memory of 64-bit architectures, as fragmentation of virtual memory is not a concern. Llama divides virtual memory into 16 GB LC regions, one per lifetime class. Section 8 describes enhancements for when an LC region is exhausted.

The global allocator manages huge pages and their blocks. It performs bump-pointer allocation of huge pages in their initial LC regions, acquiring them from the OS. It directly manages large objects (≥8 KB), placing them into contiguous free blocks in partially free huge pages or in new huge pages. A huge page may contain large and small objects.

Llama achieves scalability on multicore hardware by using mostly unsynchronized thread-local allocation for small objects (<8 KB). The global allocator gives block spans to local allocators upon request. When a thread-local allocator allocates the first object of a given LC or exhausts its current LC block span, it requests one from the global allocator. Block spans consist of 𝑀 blocks and reduce synchronization with the global allocator. Our implementation uses 𝑀 = 2 (16 KB block spans) with 16 KB alignment. Llama further subdivides block spans into 128 B lines and recycles lines in partially free block spans for small objects (see Section 6.6). It tracks line and block liveness using object counters. Small objects never cross span boundaries, but may cross line and block boundaries. Each thread-local allocator maintains one or two block spans per LC for small objects.
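This geometry implies that span, huge page, and line lookups reduce to masks and shifts. A sketch with assumed constants follows (the text does not specify the huge page size; 2 MB x86-64 huge pages are an assumption):

#include <cinttypes>
#include <cstddef>
#include <cstdint>
#include <cstdio>

constexpr std::uintptr_t kHugePageSize = 2u * 1024 * 1024;  // assumed 2 MB
constexpr std::uintptr_t kBlockSize    = 8 * 1024;          // 8 KB blocks
constexpr std::uintptr_t kSpanSize     = 2 * kBlockSize;    // M = 2 -> 16 KB
constexpr std::uintptr_t kLineSize     = 128;               // 128 B lines

// 16 KB-aligned spans let a mask recover the enclosing span; the same
// trick recovers the huge page and the line index within a span.
std::uintptr_t SpanBase(std::uintptr_t p)     { return p & ~(kSpanSize - 1); }
std::uintptr_t HugePageBase(std::uintptr_t p) { return p & ~(kHugePageSize - 1); }
std::size_t    LineIndex(std::uintptr_t p)    { return (p - SpanBase(p)) / kLineSize; }

int main() {
  std::uintptr_t obj = 0x7f0000203040;
  std::printf("span=%#" PRIxPTR " hugepage=%#" PRIxPTR " line=%zu\n",
              SpanBase(obj), HugePageBase(obj), LineIndex(obj));
}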
Llama tracks predicted and actual block lifetimes and uses them to decrease or increase their huge page's LC. Llama maintains the following invariants: 1) it allocates only objects of one predicted LC into a block or span at a time; 2) a huge page contains blocks with the same or shorter predicted LC.

We next describe how we use LC predictions to manage huge pages and blocks. Sections 6.3 and 6.4 describe the policies that limit fragmentation and dynamically detect and control the impact of mispredicted lifetimes. Section 6.6 then describes how Llama uses lines to identify and recycle memory in partially free block spans.

6.2 Lifetime-Based Huge Page Management

Each huge page has three states: open, active, and free. Open and active huge pages are live. The first allocation into a huge page makes it open and determines its LC. Only one huge page per LC is open at a time. While a huge page is open, Llama only assigns its blocks to the same LC. Llama transitions a huge page from open to active and assigns it a deadline after filling all its constituent blocks for the first time. The huge page remains active for the rest of its lifetime. The OS backs huge pages lazily, upon first touch. A huge page is free when all its blocks are free and is immediately returned to the OS.
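These states and transitions are small enough to express directly; an illustrative encoding (field names assumed):

#include <cstddef>

// Illustrative huge page state machine following the rules above.
enum class PageState { kFree, kOpen, kActive };

struct HugePageStateMachine {
  PageState   state = PageState::kFree;
  int         lc = 0;
  std::size_t live_blocks = 0;

  void FirstAllocation(int lifetime_class) {  // free -> open, fixes the LC
    state = PageState::kOpen;
    lc = lifetime_class;
  }
  void FilledOnce() {   // open -> active; a deadline is also assigned here
    state = PageState::kActive;
  }
  void BlockFreed() {   // when all blocks are free, return the page to the OS
    if (live_blocks > 0 && --live_blocks == 0) state = PageState::kFree;
  }
};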
All blocks in a huge page are free or live; open or closed for allocation; and residual or non-residual. All blocks are initially free. When the global allocator returns a block span to a local allocator, it marks the blocks open for allocation. If the blocks are on an open huge page, it also marks the blocks residual. Residual blocks are predicted to match the LC of their huge page. An active huge page may also contain other live (non-residual) blocks, but these blocks will contain objects of a shorter lifetime class, as explained below. Thread-local allocators bump-pointer allocate small objects in block spans. When they exhaust a span, they mark it closed.

Llama first fills a huge page with same-LC blocks and then transitions it from open to active. At this point, the huge page contains residual blocks and maybe free blocks. Figure 6 shows an illustrative but simplified example of the logical Llama heap (huge pages and blocks) and its behavior over time. This heap has three lifetime classes, separated by orders of magnitude. A large amount of initial allocation in Figure 6a, including a large object allocation into huge pages 11 and 12, is followed by a large number of frees in Figure 6b. Llama returns free huge pages 2 and 6 to the OS.

6.3 Limiting Fragmentation by Recycling Blocks

Notice in Figure 6b that active huge pages contain free blocks and live residual blocks of the same LC. Llama limits fragmentation by aggressively recycling such free blocks for objects in shorter LCs (except for the shortest LC, since no LC is shorter). Section 6.5 explains the fast bit vector operations that find recyclable blocks of the correct size and alignment.

Given a request for LC 𝑙𝑟, the global allocator prefers to use free blocks from a longer-lived active huge page (LC > 𝑙𝑟). These recycled blocks are allocated non-residual, as illustrated in Figure 6c. If no such recyclable blocks exist, the global allocator uses block(s) from the open huge page of the same LC = 𝑙𝑟. Intuitively, if the predictor is accurate or overestimates lifetime class, the program will, with high probability, free shorter-lived objects on recycled blocks before it frees residual blocks with the same LC as the huge page. Because lifetime classes are separated by at least an order of magnitude, the allocator may reuse these blocks many times while the longer-lived objects on the huge page are in use, reducing the maximum heap footprint. If the predictor underestimates lifetime, the objects will have more time to be freed. This design is thus tolerant of over- and underestimates of lifetime.
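A sketch of this preference order for a small-object request follows (illustrative data structures, not Llama's; the configurable large-object variant described below would prefer 𝑙𝑟 + 1 instead of the longest available LC):

#include <optional>
#include <vector>

struct HugePage {
  int  lc = 0;        // lifetime class index (higher = longer-lived)
  bool open = false;  // open vs. active
  int  free_blocks = 0;
};

struct Placement { HugePage* page; bool residual; };

std::optional<Placement> ChooseBlocks(std::vector<HugePage>& pages,
                                      int lc_request, int blocks_needed) {
  // 1) Recycle: longest-lived active page with enough free blocks.
  HugePage* best = nullptr;
  for (auto& hp : pages)
    if (!hp.open && hp.lc > lc_request && hp.free_blocks >= blocks_needed)
      if (!best || hp.lc > best->lc) best = &hp;
  if (best) return Placement{best, /*residual=*/false};

  // 2) Fall back to the open huge page of the requested LC; blocks
  //    allocated here are residual (they match the page's LC).
  for (auto& hp : pages)
    if (hp.open && hp.lc == lc_request && hp.free_blocks >= blocks_needed)
      return Placement{&hp, /*residual=*/true};
  return std::nullopt;  // caller would open a new huge page
}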
For large objects, the global allocator assigns blocks directly. For example, given the heap state in Figure 6b and a request for a two-block large object with 𝑙𝑟 < 10 ms, the global allocator allocates it into huge page 7 with LC < 100 ms and marks the blocks non-residual, as illustrated in Figure 6c.

When Llama recycles a block span (assigning it to a local allocator), it marks the blocks open and non-residual. The local allocator assigns the span to the requested LC 𝑙𝑟, even if the span resides on a huge page assigned to a longer lifetime class. The local allocator only allocates objects of this predicted lifetime 𝑙𝑟 into this span. After it fills the span with 𝑙𝑟 object allocations, it marks the blocks closed. This policy guarantees that when a block is open for allocation, it receives only one LC.

Llama's recycling policy is configurable. In the current implementation, Llama prefers 𝑙𝑟 + 1 for large objects and the longest available LC for small objects.

Figure 6. Llama's logical heap organization with three lifetime classes (< 10 ms, < 100 ms, < 1 s). Each live huge page is A(ctive) or O(pen) and divided into blocks. Block color depicts predicted LC or free (white). Residual blocks are marked with a dot. Deadlines and lines are omitted.
(a) Initial allocations. Huge pages are bump-pointer allocated into LC regions. Each huge page is first filled with same-LC blocks, marked residual with a dot.
(b) After objects free, some blocks and huge pages are free (white). Llama immediately returns free huge pages to the OS to control maximum heap size.
(c) Subsequent allocations of shorter-LC small objects first fill free blocks in the highest LC in A(ctive) huge pages 9 and 10, and then blocks in huge page 7. These blocks are not residual (no dot) and are expected to be freed before the residual blocks. O(pen) pages 5, 8, and 12 are ineligible for such allocation.
(d) When huge page 1's deadline expires, residual blocks are still live (misprediction). Llama increases the huge page's LC by one, from 10 to 100 ms. Residual blocks remain residual; their expected lifetime is now at least 100 ms.
(e) Huge page 9 only contains non-residual blocks; consequently, Llama decreases its LC and marks all live blocks residual, since they match or are less than the huge page's LC.

6.4 Tolerating Prediction Errors

Lifetime prediction will never be perfect. Llama tolerates mispredictions by tracking block and huge page lifetimes using deadlines. It promotes huge pages with under-predicted object lifetimes to the next longer LC and moves huge pages with over-predicted objects to the next shorter lifetime class.

We detect under-prediction of lifetimes using deadlines. When a huge page becomes full for the first time, the global allocator transitions it from open to active and assigns it a deadline as follows:

    deadline = current_timestamp + 𝐾 × LC_HugePage

When Llama changes the LC of a huge page, it assigns the huge page a new deadline using the same calculation and the new lifetime class. We experimented with 𝐾 = 2 and 𝐾 = 4.

When a huge page's deadline expires, the predictor made a mistake. To recover, Llama increases the huge page's lifetime class and gives it a new deadline. Figure 6d depicts this case. The residual blocks in huge page 1 outlive their deadline and Llama increases its LC to 100 ms. A huge page may also contain non-residual blocks, which it leaves unchanged. Llama essentially predicts that the residual blocks were just mispredicted by one LC and that non-residual blocks are shorter-lived than this LC. If either live longer than this LC, this process repeats until the blocks are freed or reach the longest-lived LC. This policy ensures that huge pages with under-predicted objects eventually end up in the correct lifetime class, tolerating mispredictions.
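The deadline logic can be summarized as follows; this is an illustrative sketch under the stated rules, with assumed field names and promotion bounds:

#include <cstdint>

constexpr int     kNumClasses = 7;
constexpr int64_t K = 2;  // we experimented with K = 2 and K = 4

int64_t LcDurationMs(int lc) {  // the class's nominal lifetime: 10^lc ms
  int64_t d = 1;
  for (int i = 0; i < lc; ++i) d *= 10;
  return d;
}

struct HugePageMeta {
  int     lc;
  int64_t deadline_ms;
  bool    has_live_residual_blocks;
};

void AssignDeadline(HugePageMeta& hp, int64_t now_ms) {
  // deadline = current_timestamp + K × LC duration
  hp.deadline_ms = now_ms + K * LcDurationMs(hp.lc);
}

void OnDeadlineExpired(HugePageMeta& hp, int64_t now_ms) {
  if (hp.has_live_residual_blocks) {
    // Under-prediction: residual blocks outlived their class; promote.
    if (hp.lc < kNumClasses) ++hp.lc;
  } else {
    // Only shorter-lived (non-residual) blocks remain; demote, and the
    // survivors become residual for the new, shorter class.
    if (hp.lc > 1) --hp.lc;
  }
  AssignDeadline(hp, now_ms);
}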
Llama's recycling mechanism works well for both accurate and under-predicted lifetimes. If all lifetimes are accurate or under-predicted, a program will free all residual blocks before their huge page deadline, since the deadline is generous. As blocks become free on active huge pages, the allocator may recycle them for shorter lifetime classes, as explained above. Llama may repeatedly recycle blocks on active huge pages, each time they are freed. Before the deadline expires, if all blocks in the huge page are free at once, Llama simply releases it to the OS. Otherwise, given accurate or under-prediction, the huge page will at some point contain only live non-residual (shorter LC) blocks when the deadline expires. Llama will then decrease the huge page's LC by one and compute a new deadline using the current time and the new LC.

Figure 6e shows such an example. Because huge page 9 contains only non-residual blocks, Llama decreases its LC and marks all live blocks residual. With accurate and under-predicted lifetimes, this process repeats: either the huge page is freed or its LC continues to drop until it reaches the shortest LC. In the shortest LC, no blocks are recycled, so when prediction is accurate, all blocks are freed before the deadline and the huge page is released.

6.5 Data Structures

Llama tracks liveness at the huge page, block, and line granularity. It stores metadata in small pages at the beginning of each 16 GB LC region. Each huge page in a region corresponds to one 256 B metadata region in the metadata page. Mapping between a huge page and its metadata therefore consists of quick bit operations.
The global allocator tracks active huge pages in a list for ...

Figure 8. Llama reduces huge page (HP) fragmentation compared to TCMalloc on the Image Processing Server. TCMalloc numbers optimistically assume all free spans are immediately returned to the OS, which is not the case.