The
Silent
Tower

Visual overview of a custom malloc() implementation#

This post was discussed on Hacker News.

C programmers will undoubtedly recognize the standard function malloc(), the language's memory allocator and close confident of everyone's favorite segfaults. malloc() is the mechanism through which programs obtain memory at runtime, and it often ends up being the primary way in which objects are created and managed. Considering this central status, I'm surprised that most programmers treat it as a black box. This post is a bit of an excuse to tell you more about what happens under-the-hood and look at one possible implementation.

In a typical use case, memory is allocated by calling malloc() with a requested number of bytes as parameter. malloc() returns a pointer through which the memory can be used before being returned to the system by calling free() :

1
2
3
4
5
6
7
/* Allocate memory for a structure of type [struct article] */
struct article *my_article = malloc(sizeof *my_article);

if(my_article != NULL) {
    /* ... use *my_article ... */
    free(my_article);
}

This simple piece of code raises at least three important questions:

Let's go on a little adventure to answer all of these.

TL;DR for avid readers: it's a segregated best-fit with 16 classes, single-threaded, very standard.


The heap originates from the system#

The first question “Where does the memory actually come from?” leads us into system land, since the short answer that you might expect is the operating system is providing it. That's fine and all, but how do we get from that abstract concept to an actual pointer that we can write to?

In a classical userspace#

On classical operating systems, each userspace process lives in its own address space, where its code, libraries, and data are all loaded. Having separate address spaces for each processes is a key design feature for operating systems, and I'd get lost if I talked about why this is. The TL;DR is that it avoids processes reading or writing each other's memory (which nowadays is mostly for security) and that it's managed in hardware by an MMU.

The exact layout of the address space depends on the operating system. It can be anything; frankly you could make one up for breakfast and be fine.(1) A simple example might look something like this (we'll go over the contents in a minute):

Typical process address space (arbitrary addresses for illustration).

Such an address space is called a virtual address space because it doesn't actually own any memory. The real memory in the computer (the RAM chip) exists as a big block(2) in the so-called physical address space; but of course any process with full access to the entire RAM would be cybergod, so only the kernel is allowed to use it. Normal processes get access to memory when the kernel lends them some by mapping it to their virtual address space:

Mapping of virtual pages to physical pages by the kernel.

This process is called paging. Essentially, both the virtual and physical address spaces are divided into pages (of varying size but most commonly 4 kB) and the kernel decides which pages of virtual memory should be mapped to which pages of physical memory. When the program makes a memory access, the CPU checks if the accessed pointer falls into a mapped page. If so, then the associated physical page is accessed. Otherwise, the access is streng verboten and the program crashes with a segfault.(3) This gives the kernel complete control over what memory processes can read and write, and thus achieves the separation goals I mentioned earlier.

So let's go back to the layout from earlier. In it, we find:

Since code is read-only and the data segment is static (meaning its size is fixed before running the program), the only two regions that can grow during execution are the stack and heap. The stack itself is quite small; on Linux, ulimit -s will tell you its maximum size, which is typically 8 MB. It also comes with very strict lifetime limits, since any object created there is destroyed when the function creating it returns. This leaves the heap. On your personal computer or phone, heap data accounts for a vast majority of memory used by programs. When your web browser hogs 5 GB of memory then it's basically 5 GB of heap.(4)

Like everything in the virtual address space, heap is created when the kernel maps pages of physical memory at a suitable virtual address. So the answer to the question “where does the memory come from?” is “the kernel maps new pages into the address space”. The process requests this by invoking the syscall sbrk(2), and if the kernel allows it then new pages are mapped and pointers to the pages become valid. You can see this in action in pretty much any Linux-supporting allocator. For example, the glibc allocator implements this “get more memory” primitive in morecore.c and the call boils down to an indirect invocation of sbrk(2).

So, to sum up: malloc() gets memory by asking the kernel with sbrk(2) and the kernel provides it by mapping entire pages (of 4 kB or more). Then, malloc()'s job is to split up these pages into blocks of user-requested size while keeping track of what's used and what's free.

In embedded systems#

Of course, not all systems run a full-fledged kernel with paging-based virtual memory. There are plenty of other options, but the core mechanics remain the same. On many embedded systems, a fixed region of memory is provided for the heap; in that case malloc() just cannot expand. Allocators are pretty good at dealing with fixed-, variable- and pineapple-shaped memory so they don't really care. The core of the design is really how they split and manage the memory they get.

Making variables out of raw memory using “maximal alignment”#

So far, I've only talked about memory as raw data — sequences of bytes arranged in pages that the program can read and write. But what we want from malloc() is really to create variables. The C language constantly insists on having us declare the type of variables, so how are we able to create them with malloc() from just a size?

1
2
3
4
/* Local variable requires a type... */
int i;
/* ... but somehow malloc() doesn't? */
int *j = malloc(4);

You might think that I did declare the type for the allocation by writing int *j, but that's not the case. In terms of typing the second definition really has two steps: first the compiler types malloc(4), on its own without any context, then it checks that the assignment is legal, ie. that the type of malloc(4) is convertible to int *. This is an important property of C: expressions are always typed out of context, independently from what you do with it. C++ programmers may be more familiar with this concept as “return types do not contribute to overload resolution”.(5)

What's happening here is that malloc(4) is of type void *, i.e. a universal pointer, which is convertible to any non-function pointer type. No information related to the type of j is communicated to malloc(). So the question remains: how is malloc() able to allocate suitable memory without knowing the desired type?

It turns out that suitable memory really boils down to two things. In order to contain a variable of type T, a region of memory must:

  1. Have a size at least sizeof(T);
  2. Have an alignment greater than or equal to the alignment of T.

Alignment refers to the highest power of two which divides an address. For architectural reasons values of certain basic types are constrained to be stored on aligned boundaries. Typically, primitive values occupying N bytes (with N a power of two) will be stored exclusively at addresses that are multiples of N. So a 1-byte char can be stored at any address, but a 2-byte short only on even addresses, a 4-byte int on addresses that are multiples of 4 (called “4-aligned”), etc.

All possible ways to store suitably-aligned values.

Only basic types (integers, floating-point numbers and pointers) have architectural alignment requirements, so if you work out the layout of structures, unions and other composite types, you will find that every type ends up with the alignment requirement of a basic type.(6) This means that the largest alignment requirement of any basic type (max_align_t in C++, usually 4 bytes for 32-bit architectures and 8 bytes for 64-bit architectures) is suitable for every single type in the language.

The existence of a universal alignment solves our mystery, since malloc() can simply return maximally-aligned pointers in order to accommodate any data type. Unsurprisingly, this is exactly what good ol' C99 explicitly requires it to do [ISOC99 §7.20.3.1]:

The order and contiguity of storage allocated by successive calls to the calloc, malloc, and realloc functions is unspecified. The pointer returned if the allocation succeeds is suitably aligned so that it may be assigned to a pointer to any type of object and then used to access such an object or an array of such objects in the space allocated (until the space is explicitly deallocated).

Most implementations of malloc() are parametrized by this alignment. In my case, I'm working on a 32-bit architecture so the highest alignment I have to work with is 4. As you will see shortly, this bleeds onto block and header sizes, which are going to be multiples of 4.

Tour of a segregated best-fit implementation#

Ok, so an implementation of malloc() essentially has to:

The third point is where all the variety in implementation comes from. There's about a million designs for memory allocation strategies, data structures, not to mention garbage collectors, and I'd by lying if I pretended I could give you a visual overview of that. You can find a general introduction in the fantastic paper Dynamic storage allocation: A survey and critical review [Wilson1995].

And that's about all the generalities I have for you. The rest of this article will go through an implementation of the segregated best-fit algorithm that I wrote for gint. You can find the original code in gint's repository at src/kmalloc/arena_gint.c; here I'll start from intuitions and work my way up to (hopefully) understandable versions of malloc() and free() by alternating visuals and C code.

Note that the code is meant to be readable without context but isn't quite complete (some initialization is missing mostly).

Block structure and navigating the sequence#

As is typical with general-purpose allocators, I keep track of the metadata for each allocation in a short header before the allocated space itself. Which means that the heap region is made of blocks of the following shape:

Block structure depending on use status and size.

The simplest type of block is used blocks: they consist of a 4-byte header followed by user data represented by the hashed region. malloc() returns a pointer to the user data and that's all you ever see from the outside. The header records the size of the block and information that's useful for navigating the heap to find other blocks. This is the corresponding structure definition:

1
2
3
4
5
6
7
typedef struct {
    uint            :5;
    uint last       :1;     /* Marks the last block in the sequence */
    uint used       :1;     /* Whether the block is used */
    uint prev_used  :1;     /* Whether the previous block is used (boundary tag) */
    uint size       :24;    /* Block size in bytes (< 16 MB) */
} block_t;

Free blocks have the same 4-byte header, but they also contain extra data in their footers. That's the nice thing about free blocks: no data is being stored, so we might as well use the space to maintain the data structure. prev_link and next_link are used to connect some linked lists; I'll discuss that in a minute. All we need to know for now is that these are pointers to other blocks.

Blocks are stored immediately next to each other with no spacing, forming a long sequence that covers the entire heap region (the arena), apart from a small index structure placed at the start:

Arrangement of blocks in heap memory.

The two non-trivial flags in the block_t header are both related to this sequence: last marks the last block, and prev_used is a copy of the previous block's used flag, which has the following (quite tasty!) use.

Notice how we can pretty easily navigate the sequence forwards: from a pointer to the start of a block we can find a pointer to its end (ie. to the start of the next block) because the block's size is indicated in the header.

Forward navigation in the block sequence.

The code is as straightforward as you'd expect:

1
2
3
4
5
6
block_t *next_block(block_t *b)
{
    if(b->last)
        return NULL;
    return (void *)b + sizeof(block_t) + b->size;
}

Well, it turns out that free blocks are set up so that you can also do the opposite and find the start of a free block from its end. This boils down to determining the size from the footer, which is indeed possible:

So now, when we have a pointer to the end of a free block, we can look up the low-order bit of next_link; if it's 1 the block is of size 8, otherwise we can read the size from offset -12. In both cases we can recover a pointer to the start of the block:

Backwards navigation through free blocks in the block sequence.

Of course, in order to use this for reliable backwards navigation we need to know whether the previous block is in fact free or not. This is the purpose of the prev_used flag in the block header, and the technique as a whole is known as the boundary tag optimization — paying a single bit for the luxury of navigating back through empty blocks.

1
2
3
4
5
6
7
8
block_t *previous_block_if_free(block_t *b)
{
    if(b->prev_used)
        return NULL;
    uint32_t *footer = (void *)b;
    uint32_t previous_size = (footer[-1] & 1) ? 8 : footer[-3];
    return (void *)b - previous_size - sizeof(block_t);
}

The benefit of this little operation should be clear by the next section, but if you're curious you can already try to figure it out.

Quick aside. Tightly packing blocks is an obvious choice when considering memory usage and the need to not waste any space. It also makes writing out-of-bounds of dynamically-allocated arrays quite lethal, as that immediately corrupts the next block, basically guaranteeing a crash at the next malloc() or free(). This might seem like a problem but it's actually a good thing: it makes out-of-bounds writes more visible which leads to empirically faster bug resolutions (in fact, introducing this new allocator in gint led to a mini-wave of new crashes that all turned out to be application bugs).

Merging and splitting in the sequence#

As this point you might be wondering how we get from an empty, freshly-initialized heap to the mess of a sequence that I've pictured previously, with data spread around instead of being nicely arranged on one end with a nice, large block of free memory on the other end.

The short answer is you can't avoid the mess. The user is completely in control of allocated blocks, and in particular they can free them in any order, creating holes at any moments. Since we don't know how long blocks will live when they are allocated (... the user might not either!) it's hard to plan ahead.

There's one thing we can do to keep the sequence as tidy as possible, though: we can merge adjacent free blocks to make larger blocks. This increases the chance that we can answer large malloc() requests and ensures that if the user were to free every block we'd recover one large empty block spanning the entire heap.

With this simple idea in mind, we can now work out what malloc() and free() will have to do. We'll stick to a general procedure for now and go over the actual code at the end.

malloc(N):
  1. Search for b, the smallest free block of size ≥ N in the sequence.
  2. If none is found, we're out of memory, return NULL (... or use sbrk() to get more and retry).
  3. If the size of b is strictly larger than N+4, split b in two to keep some free memory.
  4. Mark b as used and return it.

Step 1 is what makes this algorithm a best-fit, as it finds the “best” block for the allocation (the one that's as close as possible to the requested size). Confusingly this isn't necessarily the best choice(7), and there are many other strategies. Anyway, the interesting part for now is that if we decide on a block that's larger than needed, we split it off so we don't waste any memory:

Splitting allocated blocks when they're larger than needed.

The function to do that is pretty straightforward, the only subtlety being the maintenance of prev_used flags in nearby blocks.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
block_t *split(block_t *b, int N)
{
    size_t extra_size = b->size - N;
    /* Only split if there is enough room left for a new block */
    if(extra_size < sizeof(block_t) + 8)
        return NULL;

    block_t *second = (void *)b + sizeof(block_t) + N;
    second->last = b->last;
    second->used = false;
    second->prev_used = b->used;
    second->size = extra_size - sizeof(block_t);

    block_t *third = next_block(second);
    if(third)
        third->prev_used = second->used;

    b->last = false;
    b->size = N;
    return second;
}

Of course we want this splitting to be as temporary as possible; if at any point in the future both blocks are free again, we need them merged back. Fortunately this is fairly easy to track down; all cases of adjacent free blocks are caused by a free(), and we can catch 'em all if we simply merge newly-freed blocks with their neighbors.

free(ptr):
  1. Find the associated block b = ptr-4.
  2. Mark b as free.
  3. If the successor of b exists and is free, merge it into b.
  4. If the predecessor of b exists and is free, merge b into it.

In the best case where both neighbors are free, the operation looks like this:

Merging newly-freed blocks with their free neighbors.

Merging is even simpler since we erase headers instead of adding them:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
void merge(block_t *left, block_t *right)
{
  size_t extra_size = sizeof(block_t) + right->size;
  left->last = right->last;
  left->size += extra_size;

  /* Footer is updated separately */

  block_t *next = next_block(left);
  if(next)
      next->prev_used = left->used;
}

And this is why the boundary tag optimization is useful: it lets us find the previous block if it's free and merge it, even when we are dropped in the middle of the sequence by the user-provided ptr.

Segregated linked lists#

Now you could go ahead and implement step #1 of the malloc() algorithm (find the smallest block of size ≥ N) naively, by checking every block in the sequence in order. You'd get a working and standard best-fit allocator, which honestly is very much viable for small embedded systems. However, if you have a large heap (say, 1 MB) you'll end up checking a lot of blocks and programs that allocate frequently will be bottlenecked by the allocator.

Now is the time were you can go on deep, fascinating studies about the allocation patterns of various programs, where you can treat blocks with different lifetimes separately, design better APIs where more information is communicated to enable better memory planning, and otherwise grind the response time down to absolute perfection.

I'm just going to make the easy assumption that the amount of work being done with a block is correlated to its size. So small block means little work means malloc() needs to answer quickly to not be the bottleneck. Large blocks means a lot of work so it's acceptable to be slower. This is undeniably an approximation, but still generally true.

To speed up the response time for small blocks (which also happen to be the most common), I'm going to partition free blocks into linked lists of related sizes:

Segregated linked lists relative to block sequence.

These are standard doubly-linked lists using the prev_link and next_link attributes stored in free blocks' footers. Unlike what the figure suggests, blocks in each linked list do not have to sorted by increasing address, it's just way more readable when represented this way.

The implementation has 16 lists holding blocks of the following sizes, which are easily classified by a small function.

1
2
3
4
5
6
7
8
int size_class(size_t size)
{
    if(size < 64)
        return (size - 8) >> 2;
    if(size < 256)
        return 14;
    return 15;
}

Holding pointers to these lists is the purpose of the index structure I mentioned earlier. It also keeps track of heap statistics, which I've left out of this presentation for simplicity.

1
2
3
4
typedef struct {
    block_t *classes[16];   /* Entry points of segregated linked lists */
    stats_t *stats;         /* Statistics (not addressed in this article)  */
} index_t;

The big idea here is that small requests are generally served in constant time because the first 14 lists with small blocks never need to be traversed. All the blocks in there are interchangeable, so we can always just take the first. Only the last two lists need to go through an actual best-fit search, and we only use them to answer small requests if every suitable small list is empty.

I'll spare you the code for manipulating the linked lists since it's very standard. Let's just say that remove_link() removes a free block from its associated list and prepend_link() adds it back at the start. best_fit(list, N) looks for the smallest block of size ≥ N in a list.

1
2
3
4
5
6
/* Remove b from its associated linked list */
void remove_link(block_t *b, index_t *index);
/* Add b to the start of its associated linked list */
void prepend_link(block_t *b, index_t *index);
/* Find the smallest block of size ≥ N in the list */
block_t *best_fit(block_t *list, size_t N);

You've probably figured it out by now but this is where the name “segregated best-fit” comes from, as blocks of different sizes are partitioned (segregated) into appropriate doubly-linked lists.

Final implementation of malloc() and free()#

We can now write the entire malloc() and free() functions. In malloc(), we round the request size up and then look for a free block in the associated linked list. If the list is empty, we go up to the next list until we either find a large block that can be split, or realize that we're completely out of memory.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
index_t *index = /* ... */;

void *malloc(size_t size)
{
    /* Round request up to a multiple of 4 that is at least 8 */
    size = (size < 8) ? 8 : ((size + 3) & ~3);

    /* Try to find a class that has a free block available */
    block_t *alloc = NULL;
    for(int c = size_class(size); c <= 15 && !alloc; c++) {
        block_t *list = index->classes[c];
        /* For the 14 small classes any block is fine, take the first */
        alloc = (c < 14) ? list : best_fit(list, size);
    }
    if(!alloc)
        return NULL;

    /* Remove the chosen block from the index */
    remove_link(alloc, index);

    /* If it's larger than needed, split it and reinsert the leftover */
    block_t *rest = split(alloc, size);
    if(rest)
        prepend_link(rest, index);

    /* Mark the block as used and return a pointer to its data */
    block_t *next = next_block(alloc);
    alloc->used = true;
    if(next)
        next->prev_used = true;
    return (void *)alloc + sizeof(block_t);
}

In free(), we mark the block as free, merge it with its free neighbors (after removing them from their own linked lists) and finally insert the resulting block back into the index.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
void free(void *ptr)
{
    block_t *b = ptr - sizeof(block_t);
    block_t *prev = previous_block_if_free(b);
    block_t *next = next_block(b);

    /* Mark the block as free */
    b->used = false;
    if(next)
        next->prev_used = false;

    /* Merge with the next block if free */
    if(next && !next->used) {
        remove_link(next, index);
        merge(b, next);
    }
    /* Merge with the previous block if free */
    if(prev) {
        remove_link(prev, index);
        merge(prev, b);
        b = prev;
    }

    prepend_link(b, index);
}

Segregated fits are not particularly fancy allocation algorithms, but as you can see they're simple and compact enough to be explained in an Internet article (... if I've done my job right that is!). These are still non-trivial allocators and I think that's notable for a feature that's usually only presented as a mysterious black box.

Closing remarks#

Performance#

I don't have much to offer in the way of performance evaluation. To be honest, this algorithm doesn't have any particularly good qualities despite responding fast in general for small requests. In terms of fragmentation (breaking up the heap region in too many small pieces that waste memory), best-fit doesn't really shine in general, but few gint applications actually put enough pressure on the heap to fill it and run out of memory. Maybe another time!

realloc() and extensions#

If you look at src/kmalloc/arena_gint.c you will find a bunch of functions that I haven't mentioned here. For the sake of completeness, we have:

Managing multiple arenas#

As a unikernel, gint doesn't provide sbrk() and instead the heap region (the arena) is of a fixed size. However there are quite a few non-adjacent areas of memory that gint applications use, so the implementation makes sure to support multiple arenas so that independent heaps can be created. Extensions with custom allocators are also possible.

If you look at gint_malloc() and gint_free() you will see this extra parameter void *data; that's a generic user pointer passed by the arena manager to each implementation. I used it to access the index structure.

Conclusion#

Well, that's it for this time. I hope this article leaves you with good intuitions about memory management and a sense of understanding for what malloc() actually does in general.

For this article, I wanted to experiment with a mixed visuals-code approach with TikZ. It takes a long time to write up, but I like how it turned out, so I might come back to it.

References#

Dunfield2021

Jana Dunfield and Neel Krishnaswami. 2021. Bidirectional Typing. ACM Comput. Surv. 54, 5, Article 98 (June 2022), 38 pages. https://doi.org/10.1145/3450952

ISOC11

ISO/IEC 9899:2011: Information technology — Programming languages — C. International Organization for Standardization, Geneva, Switzerland. https://www.iso.org/standard/57853.html (open draft at open-std.org).

ISOC99

ISO/IEC 9899:1999: Information technology — Programming languages — C. International Organization for Standardization, Geneva, Switzerland. https://www.iso.org/standard/29237.html (open draft at open-std.org).

Wilson1995

Wilson, P.R., Johnstone, M.S., Neely, M., Boles, D. (1995). Dynamic storage allocation: A survey and critical review. In: Baler, H.G. (eds) Memory Management. IWMM 1995. Lecture Notes in Computer Science, vol 986. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60368-9_19


  1. It matters so little that Linux just puts everything at random addresses to improve security against attacks based on knowing where code is located in virtual memory. 

  2. I'm simpliyfing a bit here; at the lowest level RAM can appear fragmented in the physical address space. In fact, on x86, it can be anything; see BIOS call e820 if you have extra faith in humanity that you need to get rid of. Ultimately we don't care since everything is split into pages. 

  3. The term “segmentation fault” was inherited from the ancient time of segmentation, the predecessor to paging. If you don't know what segmentation is, then keep it this way — segmentation was terrible anyway. Page faults are also a thing but only when the kernel maps pages on-demand, which I'm not covering here. 

  4. Very large allocations on Linux are usually handled by directly asking the kernel for an entire set of pages via mmap(2). I'm putting that in the same bag as malloc(). Fun fact: 32-bit virtual address spaces are quite small so fragmentation gets spicy real fast. 

  5. This is in stark contrast with typing in functional languages, for which bottom-up type inference and top-down type checking typically complement each other in what is known as bidirectional typing [Dunfield2021]. Typing in C/C++ is bottom-up only. 

  6. You can also require a larger alignment on a type or variable with a compiler-specific attribute. This is called an over-aligned type. Crucially, local variables created with such a type will be properly aligned, but malloc() will not honor the requirement, so it has to be accounted for manually or using the C11 function aligned_alloc() [ISOC11 §7.22.3.1]. 

  7. Mostly because whenever we don't have a perfect match we split hairs off of a slightly larger block than requested and this creates a ton of very small blocks that are very hard to use.