
Malloc(3) revisited

Poul-Henning Kamp[1]
The FreeBSD Project

ABSTRACT
malloc (3) is one of the oldest parts of the C language environment and, not surprisingly, the
world has changed a bit since it was first conceived. The fact that most UNIX kernels have
changed from swap/segment to virtual memory/page based memory management has not been
sufficiently reflected in the implementations of the malloc/free API.
A new implementation was designed, written, tested and benchmarked with an eye on the
workings and performance characteristics of modern Virtual Memory systems. It works OK.

Introduction

All but the most trivial programs need to allocate storage dynamically in addition to whatever static storage the compiler reserved at compile-time. Programming languages generally come in three flavours on this point: those which handle and hide it for the programmer, those which don't allow for it, and the C programming language. As with so many other things, the C environment hands the programmer all the raw bits to play with, and does very little to prevent the programmer from making mistakes.

A modern UNIX kernel provides three means for dynamic memory allocation: the execution stack, the heap, and mmap (2).

The Stack

The stack is usually put at the far upper end of the address-space, from where it grows down[2] as far as needed.

[Figure: address-space layout: text | data | bss | heap ->   <- stack (addresses increase to the right)]

There is no real kernel interface to the stack as such. The kernel will allocate some amount of memory for the stack, usually not even telling the process the exact amount. The process will simply try to access whatever it needs, expecting the kernel to detect the access outside the allocated memory and treat this as a request for extension. If the kernel fails to extend the stack, either because of lack of resources, lack of permissions, or because it may just be plain impossible to do in the first place, the process will usually be shot down by the kernel with a terminal signal.

In the C language there exists a little-used interface to the stack, alloca (3), which will explicitly allocate space on the stack. This is not an interface to the kernel, but an adjustment done to the stack pointer, such that space will be available and unharmed by any subroutine calls yet to be made while the context of the current subroutine is intact. As a consequence of this design there is no need for an actual "free" operator: the space is returned auto-magically when the current function returns and the stack frame is dismantled. This asymmetry is the cause of much grief, and probably the single most important reason that alloca (3) is not, and should not be, widely used.

mmap (2)

When hardware architectures which provided paging became available, a new API was added which gives the programmer detailed control over the individual pages in the process.[3] The API has two primary functions, mmap (2) and munmap (2), as well as some auxiliary functions. Unfortunately, most programs do not allocate memory in page-sized chunks, so this interface is usually only used in specialised and system applications. One typical, and probably the most widespread, use in terms of number of calls to this API is shared libraries.

The heap

The heap is an extension of the data segment of the process: it starts at the end of the bss section and extends upwards. The storage in the heap area is explicitly allocated with the system call brk (2), which takes one argument: a pointer to where the process wants the heap to end. The libc library also provides a function layered on top of brk (2) called sbrk (2), which takes as argument a (signed) increment to the current end of the heap.
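To make the heap interface concrete, here is a minimal sketch of allocating and releasing heap space directly with sbrk (2). It assumes a system that still exposes the legacy sbrk call; the 4096-byte figure and the error handling are illustrative only. The LIFO restriction discussed in the next section follows directly from the single, moving end-of-heap pointer.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>   /* sbrk(); a legacy interface, deprecated or absent on some systems */

    int
    main(void)
    {
        /* Grow the heap by 4096 bytes; sbrk returns the previous break. */
        char *p = sbrk(4096);
        if (p == (char *)-1) {
            perror("sbrk");
            return 1;
        }
        strcpy(p, "hello from the heap");
        printf("%s at %p\n", p, (void *)p);

        /* The only way to hand the space back is to move the break down
         * again, which is why releases must happen in LIFO order. */
        sbrk(-4096);
        return 0;
    }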

[1] This work was not sponsored by anybody. Poul-Henning Kamp was supported by his own daytime job. He would have loved to do this for some sponsor's money instead.
[2] A few, mostly obsolete, CPU designs can be considered antipodic in this respect.
[3] To the extent the kernel implements this API, that is. Not all kernels implement more than the bare minimum.
The kernel and memory

Brk (2) is a very inconvenient interface; for most day-to-day uses it is completely impossible to use directly. It is easy to allocate memory with it, but you can only free it again in LIFO order. As with so many other things in UNIX, it was probably defined based on what the kernel had to offer rather than on a theoretical study of what programmers needed.

Before paged and/or virtual memory systems became common, the memory management facility used for UNIX was segments. This was also very often the only available vehicle for imposing protection on various parts of memory. Depending on the hardware, segments can be anything, and consequently how the kernels exploited them varied a lot from UNIX to UNIX and from machine to machine.

Typically a process would have one segment for the text section, one for the data and bss sections combined, and one for the stack.[4]

[Figure: segment layout: text | data, bss, heap | stack (addresses increase to the right)]

In this setup all the brk (2) system call needs to do is to find the right amount of free storage, possibly moving things around in physical memory, maybe even swapping out a segment or two to make space, and change the upper limit on the data segment according to the address given.

In a more modern page-based virtual memory implementation this is still pretty much the situation, except that the granularity is now pages. The kernel finds the right number of free pages, possibly paging some pages out to free them up, and then plugs them into the page-table of the process.

Only very few programs deal with the brk (2) interface directly. The few that do usually have their own memory management facilities; LISP, MODULA-3 or FORTH interpreters and runtimes are good examples. Most other programs use the malloc (3) interface instead, and leave it to the malloc implementation to use brk (2) to get storage allocated from the kernel.

Malloc(3), realloc(3) and free(3)

The archetypical malloc (3) implementation keeps track of the memory between the end of the bss section, as defined by the _end symbol, and the current brk (2) point, using a linked list of chunks of memory. Each item on the list has a status as either free or used, a pointer to the next entry and, in most cases, to the previous one as well, to speed up inserts and deletes in the list.

[Figure: linked list of chunks in the space between _end and the brk point]

When a malloc (3) request comes in, the list is traversed from the front, and if a free chunk big enough to hold the request is found, it is returned. If the free chunk is bigger than the size requested, a new free chunk is made from the excess and put back on the list. When a chunk is free (3)'ed, the chunk is found in the list, its status is changed to free, and if one or both of the surrounding chunks are free, they are collapsed into one.

A third kind of request, realloc (3), will resize a chunk, trying to avoid copying the contents if possible. It is seldom used, and has only had a significant impact on performance in a few special situations. The typical pattern of use is to malloc (3) a chunk of the maximum size needed, read in the data, and adjust the size of the chunk to match the size of the data read using realloc (3); or alternatively, to allocate with malloc (3) a chunk which can handle a large fraction of the requests and, if this proves insufficient, reallocate with realloc (3), possibly several times, until the necessary amount of memory has been obtained.

For reasons of efficiency, the original implementation of malloc (3) put the small data structure used to contain the next and previous pointers, plus the state of the chunk, right before the chunk itself. As a matter of fact, the canonical malloc (3) implementation can be studied in the ‘‘Old Testament’’, chapter 8, verse 7.[5]
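A sketch of the kind of chunk header this archetypical implementation keeps in front of every allocation, together with the first-fit walk described above. The field names are illustrative and not taken from any particular libc.

    #include <stddef.h>

    /* Illustrative chunk header, stored immediately before the user data. */
    struct chunk {
        struct chunk *next;   /* next chunk between _end and the brk point */
        struct chunk *prev;   /* previous chunk, to speed inserts/deletes  */
        size_t        size;   /* usable bytes that follow this header      */
        int           free;   /* non-zero if the chunk is currently unused */
    };

    /* First-fit search over the list: return the first free chunk big
     * enough, or NULL so the caller can extend the heap with brk(2). */
    static struct chunk *
    first_fit(struct chunk *head, size_t size)
    {
        for (struct chunk *c = head; c != NULL; c = c->next)
            if (c->free && c->size >= size)
                return c;
        return NULL;
    }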
Various optimisations can be applied to the above basic algorithm:

• If, when freeing a chunk, we end up with the last chunk on the list being free, we can return it to the kernel by calling brk (2) with the address of that chunk, and then make the previous chunk the last on the chain by terminating its ‘‘next’’ pointer.
• A best-fit algorithm can be used instead of first-fit, at an expense in memory, because statistically there are fewer chances to move brk (2) backwards.
• Splitting the list in two, one for used and one for free chunks, to speed up the searching.
• Putting free chunks on one of several free lists, depending on their size, to speed up allocation.
• &c &c &c

[4] On some systems the text shared a segment with the data and bss, and was consequently just as writable as them. Some people will remember the undocumented way of compiling awk (1) programs: given the right option, awk (1) would load and parse the program and then write the address space from the start of the text to the top of the heap into a file. Another option read this file back in. This reduced the startup time because the program was already parsed into internal form. The initial version of this hack didn't work on machines where the text segment could not be written to. TeX and GNU Emacs are other programs which have used similar methods for similar reasons.
[5] Kernighan & Ritchie: The C Programming Language.
The problems

Even though malloc (3) is a lot simpler to use than the raw brk (2) interface, or maybe exactly because of that, a lot of problems arise from its use:

• Writing to memory outside the allocated chunk.
• Freeing a pointer to memory not allocated by malloc.
• Freeing a modified pointer.
• Freeing the same pointer more than once.
• Accessing memory in a chunk after it has been free (3)'ed.

The handling of these problems has traditionally been weak. A core-dump was the most common form of ‘‘handling’’, but in rare cases one could experience the famous ‘‘malloc: corrupt arena.’’ or similarly informative messages right before the core dump. Much worse, though, very often the program will just continue, quite possibly giving wrong results or weird behaviour.

An entirely different kind of problem is plain sloppy thinking: the manual pages clearly state that the memory returned by malloc (3) can contain any value, and that one should explicitly initialise the memory before use. Unfortunately most kernels, correctly so, zero out the storage they provide with brk (2) for security reasons, and thus the storage malloc (3) returns happens to be zeroed in many cases as well, so programmers are not particularly apt to notice that their code depends on malloc'ed storage being zero.

Malloc (3) has somewhat deserved the reputation it has gotten for being the first of ‘‘the usual suspects’’ to round up when programs act weird.

Alternative implementations

Detecting some or all of these problems was the inspiration for the first alternative malloc implementations. Since their main aim was debugging, they would often use techniques like allocating a guard zone before and after the chunk, usually filling these guard zones with some known, predictable pattern,[6] so that write accesses outside the allocated chunk could be detected as changes to these patterns with some decent probability. Another widely used technique is to use tables to keep track of which chunks are actually in which state, and so on.

This class of debugging has been taken to its practical extreme by the product ‘‘Purify’’, which does the entire memory-colouring exercise and not only keeps track of what is and what isn't in use, but also detects if the first reference is a read (which would return undefined values) and other such violations. Purify is a commercial product of high quality, and priced to reflect this.

Later, actual complete alternative implementations of malloc arrived, but many of these, as well as the code which sat comfortably in the libc library of FreeBSD, still based their workings on the basic schema mentioned previously, oblivious to the fact that in the meantime virtual memory and paging had become the standard environment rather than segments.

The most widely used ‘‘alternative’’ malloc is undoubtedly ‘‘gnumalloc’’, which has received wide acclaim and certainly runs faster than most stock mallocs. It does, however, just like most other malloc implementations, have a tendency to fare badly in cases where paging is the norm rather than the exception.

The particular malloc that prompted this work basically didn't bother reusing storage until the kernel forced it to do so by refusing further allocations with sbrk (2). That may make sense if you work alone on your own personal mainframe, but as a general policy it is much less than optimal.

In order to select a candidate amongst the various freely available implementations of malloc, I tried to benchmark them from one end to the other. This was done on a tiny laptop with only 8MB of RAM,[7] and it soon transpired that as soon as RAM was over-committed, things went downhill very fast. This prompted me to study what ‘‘performance’’ meant for a malloc implementation.

Performance

Performance for a malloc (3) has two sides:

A) How much time does it use for searching and manipulating data structures. We will refer to this as ‘‘overhead’’.
B) How well does it manage the storage. This rather vague metric we call ‘‘quality of allocation’’.

The overhead is easy to measure: just do a lot of malloc/free calls of various kinds and combinations, and compare the results. This is unfortunately the most common basis for systematic comparison of malloc implementations. I say ‘‘unfortunately’’ because it should be obvious to anybody that if you can save just one disk access, you can do almost anything you like to your internal data structures for several milliseconds and still come out faster in the end. To compound this oversight, most people who have compared malloc implementations have done so on systems where RAM was not over-committed, and consequently the implementations' abilities in this area have not been measured.
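As a concrete illustration of this kind of ‘‘overhead’’ measurement, here is a minimal benchmark skeleton: a tight loop of malloc/free calls of assorted sizes, timed with the standard clock (3). The size mix and iteration counts are arbitrary assumptions; a serious comparison would replay allocation traces from real programs and, as argued above, would also have to measure behaviour under paging.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int
    main(void)
    {
        /* Arbitrary mix of request sizes, from tiny chunks to a few pages. */
        static const size_t sizes[] = { 16, 24, 100, 512, 4096, 16384 };
        const size_t nsizes = sizeof(sizes) / sizeof(sizes[0]);
        enum { N = 100000, SLOTS = 64 };
        void *slot[SLOTS] = { 0 };

        clock_t t0 = clock();
        for (int i = 0; i < N; i++) {
            int j = i % SLOTS;
            free(slot[j]);                        /* free(NULL) is a no-op */
            slot[j] = malloc(sizes[(size_t)i % nsizes]);
        }
        for (int j = 0; j < SLOTS; j++)
            free(slot[j]);
        clock_t t1 = clock();

        printf("CPU time: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        return 0;
    }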
[6] Amongst the many creative patterns are 0xDEADBEEF, 0xC0FEBABE, 0xDEADDEAD and so on.
[7] A ‘‘GateWay 2000 Handbook’’; too bad they don't make them anymore.
The ‘‘quality of allocation’’ metric tries to measure this second aspect. It is actually horribly complex to measure. In fact, the only manageable way to measure it is to run some complex and deterministic test cases on a system where RAM is over-committed, measure the time it took, and use that as the metric.

To design an algorithm, on the other hand, an analytical attack is needed. Here is the one I used in the design of my malloc implementation.

One indicator of this quality is the size of the process; that should obviously be minimised. Another indicator is the execution time of the process. This is not an obvious indicator of quality for malloc, but people will generally agree that it should be minimised as well, and if malloc (3), as we will see shortly, can do anything to do so, it should.

In a traditional segment/swap kernel, because the entire process will either be swapped out to disk or be resident in RAM, the desirable behaviour of a process is to keep the brk (2) point as low as possible, thus minimising the size of the data/bss/heap segment, which in turn translates to a smaller process and a smaller probability of the process being swapped out. QED: faster execution time on average.

In a paging environment this is not a bad default, but a couple of details need to be looked at much more carefully. First of all, the size of a process becomes a more vague concept, since only the pages that are actually used need to be in primary storage for execution to progress, and they only need to be there when used. That implies that many more processes can fit in the same amount of primary storage, since most processes have a high degree of locality of reference and thus only need some fraction of their pages to actually do their job. From this it follows that the interesting size of the process is a subset of the total amount of virtual memory occupied by the process. This subset isn't a constant: it varies depending on the whereabouts of the process, and it may indeed fluctuate wildly over the lifetime of the process.

One of the names for this vague concept is the ‘‘current working set’’. This is a most horribly ill-defined number, but for now we can simply say that it is the number of pages the process needs in order to run at an acceptably low paging rate in a congested primary storage. If the number of pages is too small, the process will wait for its pages to be read from secondary storage much of the time; if it's too big, the space could be used better for something else. If primary storage isn't congested, this may not seem important, but many kernels today can use any available pages for disk-cache or similar functions, so from that perspective main storage is always congested.

From the view of any single process, this number of pages is of course ‘‘all of my pages’’, since this guarantees that no pages will need to be paged in or out. From the point of view of the OS, it should be tuned to maximise the total throughput of all the processes running on the machine at that time. This is usually done using various kinds of least-recently-used replacement algorithms to select page candidates for replacement.

With this miniature analysis, we can define the primary performance goal for a modern malloc (3) implementation as: minimise the number of pages accessed.

This really is the core of it all. If the number of accessed pages is smaller, then locality of reference is higher, and all kinds of caches (which is essentially what the primary storage in a VM system is) work better.

It's interesting to notice that the classical malloc, and most of the alternatives available, fail decisively according to this criterion. The information about free chunks is kept in the free chunks themselves. In other words, even though the application as such does not need these chunks of memory, the malloc implementation still does, and consequently those pages, if paged out, will not stay out longer than until the next call to malloc (3) or free (3) needs to traverse the free-list.

In some of the benchmarks this came out as all the pages being paged in every time a malloc call was made. This made as much difference as a factor of five in wall-clock time for certain scenarios.

The secondary goal is more evident: try to work in pages. That makes it easier for the kernel, and wastes less virtual memory. Most modern implementations do this when they interact with the kernel, but only a few try to avoid objects spanning pages. If an object's size is less than or equal to a page, there is no reason for it to span two pages; having objects span pages means that two pages must be paged in if that object is accessed.

Implementation

The implementation is 1136 lines of C code, and can be found in FreeBSD 2.2 and later versions of FreeBSD as src/lib/libc/stdlib/malloc.c.

The main data structure is the page-directory, which contains a void* for each page we have control over. The value can be one of the following (a sketch of this layout follows the list):

• MALLOC_NOT_MINE   Another part of the code may call brk (2) to get a piece of the cake. Consequently, we cannot rely on the memory we get from the kernel being one consecutive piece of memory, and therefore we need a way to mark such pages as "untouchable".
• MALLOC_FREE   This is a free page.
• MALLOC_FIRST   This is the first page in a (multi-)page allocation.
• MALLOC_FOLLOW   This is a subsequent page in a multi-page allocation.
• struct pginfo*   A pointer to a structure describing a partitioned page.
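A sketch of how such a page-directory can be represented in C: one slot per page, with small sentinel values distinguished from real struct pginfo pointers. The names mirror the list above, but the exact encoding and the helper below are illustrative, not a quote from malloc.c.

    #include <stddef.h>
    #include <stdint.h>

    struct pginfo;                  /* describes a partitioned (sub-page) page */

    /* Sentinel values; real struct pginfo pointers never take these values. */
    #define MALLOC_NOT_MINE  ((struct pginfo *)0)
    #define MALLOC_FREE      ((struct pginfo *)1)
    #define MALLOC_FIRST     ((struct pginfo *)2)
    #define MALLOC_FOLLOW    ((struct pginfo *)3)

    /* One entry per page between the first and last page we manage. */
    static struct pginfo **page_dir;    /* grown as the heap grows            */
    static uintptr_t       page_base;   /* address of the first managed page  */
    static size_t          page_shift;  /* log2 of the page size, e.g. 12     */

    /* Map an address to its slot in the page directory. */
    static struct pginfo **
    pdir_entry(void *ptr)
    {
        size_t idx = ((uintptr_t)ptr - page_base) >> page_shift;
        return &page_dir[idx];
    }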
In addition, there exists a linked list of small data structures that describe the free space as runs of free pages.

Notice that these structures are not part of the free pages themselves, but rather allocated with malloc, so that the free pages themselves are never referenced while they are free.

When a request for storage comes in, it will be treated as a ‘‘page’’ allocation if it is bigger than half a page. The free list will be searched, and the first run of free pages that can satisfy the request is used. The first page gets set to MALLOC_FIRST status; if more than that one page is needed, the rest of them get MALLOC_FOLLOW status in the page-directory.

If there were no pages on the free list, brk (2) will be called, the pages will get added to the page-directory with status MALLOC_FREE, and the search restarts.

Freeing an allocation of pages is done by changing their state in the page directory to MALLOC_FREE, traversing the free-pages list to find the right place for this run of pages, collapsing it with either or both of the two neighbouring entries if possible, and, if above the threshold, releasing some pages back to the kernel by calling brk (2).

If the request is less than or equal to half of a page, its size will be rounded up to the nearest power of two before being processed, and if the request is less than some minimum size, it is rounded up to that size.

These sub-page allocations are served from pages which are split up into some number of equal-size chunks. For each of these pages a struct pginfo describes the size of the chunks on this page, how many there are, how many are free and so on. The description consists of a bitmap of used chunks, and various counters and numbers used to keep track of the stuff in the page.
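A sketch of what such a struct pginfo might look like; the field names and the fixed-size bitmap are illustrative assumptions rather than the exact layout used in malloc.c.

    /* Describes one page that has been split into equal-size chunks.
     * Illustrative only; the real structure uses a variable-length bitmap. */
    struct pginfo {
        struct pginfo  *next;    /* next page with free chunks of this size  */
        void           *page;    /* address of the page being described      */
        unsigned short  size;    /* chunk size on this page (a power of two) */
        unsigned short  shift;   /* log2(size), for cheap index arithmetic   */
        unsigned short  total;   /* number of chunks on the page             */
        unsigned short  free;    /* how many of them are currently free      */
        unsigned long   bits[1]; /* bitmap recording the state of each chunk */
    };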
For each size of sub-page allocation, the pginfo structures for the pages that have free chunks in them form a list. The heads of these lists are stored in predetermined slots at the beginning of the page directory to make access fast.

To allocate a chunk of some size, the head of the list for the corresponding size is examined, and a free chunk found. The number of free chunks on that page is decreased by one and, if it reaches zero, the pginfo structure is unlinked from the list.

To free a chunk, the page is derived from the pointer, the pginfo structure is found from the page directory, the bit corresponding to this chunk is set in the bitmap, and the counter for free chunks is increased by one. If this page now has exactly one free chunk, it is linked back into the list for chunks of this size; if all chunks are free, both the page and the pginfo structure are free (3)'ed too.

To be 100% correct performance-wise, these lists should be ordered according to the recent number of accesses to each page. That information is not available, and keeping it up to date would essentially mean reordering the list on every memory reference. Instead, the lists are ordered according to the address of the pages. Other criteria have been tried, and it looks like any kind of stable and repeatable sorting of these lists results in the same performance; sorting by address statistically keeps the brk (2) point lower.

It is an interesting twist to the implementation that the struct pginfo is allocated with malloc; that is, "as with malloc", to be painfully correct. The code knows the special case where the first (couple of) allocations on the page are actually the pginfo structure itself, and deals with it accordingly. This avoids some silly "chicken and egg" issues.

Bells and whistles

brk (2) is actually not a very fast system call when you ask for storage. This is mainly because of the need for the kernel to zero the pages before handing them over. Therefore this implementation does not release heap pages until there is a large chunk to release back to the kernel; chances are pretty good that we will need them again pretty soon anyway. Since these pages are not accessed at all, they will soon be paged out and don't affect anything but swap-space usage.

The page directory is actually kept in a mmap (2)'ed piece of anonymous memory. This avoids some rather silly cases that would otherwise have to be handled when the page directory has to be extended.

One particularly nice feature is that all pointers passed to free (3) and realloc (3) can be checked conclusively for validity: first the pointer is masked to find the page. The page directory is then examined; it must contain either MALLOC_FIRST, in which case the pointer must point exactly at the page, or a struct pginfo*, in which case the pointer must point to one of the chunks described by that structure. If the pointer is found to be invalid, a warning will be printed on stderr and nothing will be done with it.
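A sketch of that validity check, restating for self-containment a few of the illustrative declarations from the earlier page-directory and pginfo sketches; the chunk-alignment test is a simplification of whatever the real code does.

    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative declarations, matching the earlier sketches. */
    struct pginfo { unsigned short size; /* chunk size on a partitioned page */ };
    extern struct pginfo **page_dir;
    extern uintptr_t       page_base;
    extern size_t          page_shift;
    #define MALLOC_NOT_MINE  ((struct pginfo *)0)
    #define MALLOC_FREE      ((struct pginfo *)1)
    #define MALLOC_FIRST     ((struct pginfo *)2)
    #define MALLOC_FOLLOW    ((struct pginfo *)3)

    /* Return non-zero if ptr could have been handed out by this allocator.
     * (A real implementation would first check that ptr lies inside the
     * managed range before indexing the page directory.) */
    static int
    pointer_is_valid(void *ptr)
    {
        size_t idx = ((uintptr_t)ptr - page_base) >> page_shift;
        struct pginfo *pi = page_dir[idx];
        uintptr_t off = (uintptr_t)ptr & (((uintptr_t)1 << page_shift) - 1);

        if (pi == MALLOC_FIRST)                 /* whole-page allocation: must  */
            return off == 0;                    /* point at the page itself     */
        if (pi == MALLOC_NOT_MINE || pi == MALLOC_FREE || pi == MALLOC_FOLLOW)
            return 0;                           /* nothing free(3)-able here    */
        return off % pi->size == 0;             /* must sit on a chunk boundary */
    }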
An environment variable, MALLOC_OPTIONS, allows the user some control over the behaviour of malloc. Some of the more interesting options are:

Abort    If malloc fails to allocate storage, core-dump the process with a message rather than expect it to handle this correctly.

Hint    Pass a hint to the kernel about pages we no longer need, using the madvise (2) system call. This allows the kernel to discard the contents of the page and reuse it as free. If this process accesses that page later on, the kernel can just map a new page into the address space. This can improve performance a fair bit in certain applications, since it has the potential to save a page-out and a page-in operation.

Realloc    Always do a free and malloc when realloc (3) is called. For programs doing garbage collection using realloc (3), this makes the heap collapse faster, since malloc will reallocate from the lowest available address. The default is to leave things alone if the size of the allocation is still in the same size-bracket.

Junk    Explicitly fill the allocated area with a particular value to try to detect whether programs rely on it being zero. The value used, 0xd0, is selected to maximise the probability of a core dump.

Zero    Explicitly zero out the allocated chunk of memory, while any space after the allocation in the chunk will be filled with the junk value, to try to detect out-of-the-chunk references.

sys-V    Quite to my surprise, there was one bit of the API which was not well agreed upon: what should realloc (3) return when given a pointer and a new size of zero? Some people expect it to return a NULL pointer, which makes sense, and some people expect it to return a valid pointer, which also makes sense. This option lets the programmer choose.

All these and a few other options can also be set in a system-wide fashion, or at compile time. They have proved very popular with developers and users alike, and in particular the 'H' option can have a decisive performance impact.

Future improvements

It is not obvious that having the free-page list is an actual benefit; it may be equally fast to just search for free pages in the page directory.

Truly transient programs like echo (1), date (1) and similar shouldn't bother with malloc/free; they should simply use sbrk (2) for their needs. Maybe a grace period should be implemented in malloc (3), so that serious memory management would only start after a certain number of chunks or bytes have actually been freed back.

Universally huge improvements in performance seem unlikely in the future unless the malloc (3) API is changed significantly, and doing so is by no means a guarantee of better performance. The main stumbling block is that it is not possible for the malloc (3) implementation to relocate in-use memory to improve locality of reference.

This is not the same as saying that a few programs out there could not use a better and more intelligent memory allocation policy.

Conclusion and experience

In general the performance differences between gnumalloc and this malloc are not that big. The major difference comes when primary storage is seriously over-committed, where gnumalloc wastes time paging in pages it is not really going to use; in such cases as much as a factor of five in time has been observed for various programs.

Several legacy programs in the BSD 4.4 Lite distribution had code that depended on the memory returned from malloc being zeroed. In a couple of cases, free (3) was called more than once for the same allocation, and a few cases even called free (3) with pointers to objects in the data section or on the stack.

A couple of users have reported that using this malloc on other platforms yielded "pretty impressive results", but no hard benchmarks have been made.

Acknowledgements & references

The first implementation of this algorithm was actually a file system, done in assembler using 5-hole ‘‘Baudot’’ paper tape for a drum storage device attached to a 20-bit germanium transistor computer with 2000 words of memory, but that was many years ago.

A lot of people have provided ideas, bug-fixes and portability changes to the code. Special thanks and mention go to: Peter Wemm, Lars Fredriksen, Keith Bostic, Dmitrij Tejblum, John-Mark Gurney, Joel Maslak, John Birrell, Warner Losh, Kaleb Keithly, Mike Pritchard, John D. Polstra and Archie Cobbs.
