Hawkins Allocator

Features

Early History

I needed a custom memory allocator for use in the DPU software framework I designed as part of my thesis. The built-in C++ new was unacceptable because it uses a best-fit algorithm that runs in unbounded time. The result of that effort was a less sophisticated algorithm than the one presented here, supporting only two block sizes (one large, one small).

Description of the original algorithm from my thesis

The one place where dynamic memory allocation is needed is the Scheduling System, to allow for runtime creation of Tasks. Unfortunately, the built-in C++ new operator is optimized for size efficiency and cannot meet our real-time constraints [41]. The simplest memory allocation algorithm we could use is to partition a large buffer into fixed-size chunks large enough to hold the largest possible Task. This would work fine if most Tasks were of comparable size, but we expect to have small Tasks for monitoring a single quantity and large Tasks for Event capture and histogramming, so a one-size-fits-all approach would be very inefficient. Having two buffers using two different chunk sizes could work, but we might find ourselves with no free memory in one buffer and plenty of free memory in the other. What we would like is a solution, similar to the double-ended priority queue, that allows us to collocate the small and large chunks while allowing for the inevitable fragmentation of memory.

The solution is to choose the large chunk size to be a power-of-two multiple of the small chunk size (for efficiency we choose both chunk sizes to be powers of two). The buffer is divided into large chunks that can be allocated for larger Tasks. If a smaller Task is needed, a large chunk can be subdivided into small chunks (as many as the ratio of the two sizes) and those small chunks can be handed out.
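
For concreteness, here is a minimal sketch of the size relationship in C++. The specific sizes (64-byte and 1024-byte chunks) are illustrative assumptions, not values from the thesis:

    #include <cstddef>

    // Illustrative sizes only: both chunk sizes are powers of two, and the
    // large chunk size is a power-of-two multiple of the small chunk size.
    constexpr std::size_t kSmallChunkSize = 64;
    constexpr std::size_t kLargeChunkSize = 1024;

    // A large chunk subdivides into this many small chunks.
    constexpr std::size_t kSubdivideRatio = kLargeChunkSize / kSmallChunkSize;

    static_assert((kSmallChunkSize & (kSmallChunkSize - 1)) == 0,
                  "small chunk size must be a power of two");
    static_assert((kLargeChunkSize & (kLargeChunkSize - 1)) == 0,
                  "large chunk size must be a power of two");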

[Figure 29: A typical allocation of large and small chunks in the Task heap]

The large chunks are managed by a set of THeapNode instances, each containing a bitmap that indicates which of its small chunks are in use when the large chunk is subdivided. These are created in a static array, and the index in the array is the index of the corresponding large chunk of memory being managed. Each THeapNode instance stores previous/next indices, and the instances are initially connected to form a doubly linked list containing all free large chunks. When a large chunk is needed, the head of this list is removed and the corresponding chunk is given out for use. When the first small chunk is needed, the first small chunk in the head of the free large chunk list is marked as used in the bitmap and the THeapNode instance is moved to a new list for large chunks with one small chunk in use. There are similar lists for large chunks with two, three, or any number of small chunks in use, up to and including all of them. When a new small chunk is needed, the lists are checked to find the large chunk with the fewest small chunks free; using this large chunk helps minimize memory fragmentation.
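
The description can be made concrete with a sketch. The THeapNode name comes from the thesis; everything else (field names, the 16:1 ratio, chunk count, list layout) is an assumption for illustration:

    #include <cstdint>

    constexpr int kChunkCount = 256;  // number of large chunks (assumed)
    constexpr int kRatio      = 16;   // small chunks per large chunk (assumed)
    constexpr int kNone       = -1;   // null index for the linked lists

    // One THeapNode per large chunk; the array index is also the index of the
    // large chunk of memory being managed.
    struct THeapNode {
        std::uint16_t usedMap;  // bit i set => small chunk i is in use
        int prev, next;         // doubly linked list of indices into nodes[]
    };

    THeapNode nodes[kChunkCount];

    // listHead[n] heads the list of large chunks with exactly n small chunks
    // in use; listHead[0] is the free large chunk list.
    int listHead[kRatio + 1];

    void unlink(int i, int& head) {
        THeapNode& n = nodes[i];
        if (n.prev != kNone) nodes[n.prev].next = n.next; else head = n.next;
        if (n.next != kNone) nodes[n.next].prev = n.prev;
    }

    void pushFront(int i, int& head) {
        nodes[i].prev = kNone;
        nodes[i].next = head;
        if (head != kNone) nodes[head].prev = i;
        head = i;
    }

    // Initially every large chunk is free and linked into listHead[0].
    void initHeap() {
        for (int n = 0; n <= kRatio; ++n) listHead[n] = kNone;
        for (int i = kChunkCount - 1; i >= 0; --i) {
            nodes[i].usedMap = 0;
            pushFront(i, listHead[0]);
        }
    }

    // Allocate a whole large chunk: pop the head of the free list. O(1).
    int allocateLargeChunk() {
        int i = listHead[0];
        if (i != kNone) unlink(i, listHead[0]);
        return i;  // kNone if out of memory
    }

    // Allocate a small chunk: scan from the most-used non-full list downward
    // so the large chunk with the fewest free small chunks is chosen,
    // minimizing fragmentation. The scan is bounded by kRatio, a fixed
    // constant, so the cost does not grow with the total number of chunks.
    int allocateSmallChunk() {
        for (int used = kRatio - 1; used >= 0; --used) {
            int i = listHead[used];
            if (i == kNone) continue;
            int bit = 0;
            while (nodes[i].usedMap & (1u << bit)) ++bit;  // first free small chunk
            nodes[i].usedMap |= static_cast<std::uint16_t>(1u << bit);
            unlink(i, listHead[used]);        // node now has used+1 chunks in use
            pushFront(i, listHead[used + 1]);
            return i * kRatio + bit;          // global index of the small chunk
        }
        return kNone;  // no free memory for a small chunk
    }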

To free memory, the address is used to locate the corresponding THeapNode instance (and the small chunk within it, if necessary). If a large chunk is being freed, the chunk is simply placed at the head of the free large chunk list. If a small chunk is being freed, the bitmap is updated to free the small chunk and the THeapNode instance is moved to the list corresponding to the new number of small chunks in use. This requires splicing the list around the THeapNode instance where it is removed, which is why the lists are doubly linked.
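
Continuing the same illustrative sketch, freeing is a bitmap update plus two list splices; the number of bits set in the bitmap identifies the node's current list:

    #include <bitset>

    // Free a small chunk (same illustrative structures as in the sketch above).
    void freeSmallChunk(int smallIndex) {
        int i   = smallIndex / kRatio;   // which THeapNode / large chunk
        int bit = smallIndex % kRatio;   // which small chunk within it
        int used = static_cast<int>(std::bitset<16>(nodes[i].usedMap).count());
        nodes[i].usedMap &= static_cast<std::uint16_t>(~(1u << bit));
        unlink(i, listHead[used]);        // splice out of the old list; this is
                                          // what requires doubly linked lists
        pushFront(i, listHead[used - 1]); // relink under the new in-use count
    }

    // Free a whole large chunk: just push it back onto the free list. O(1).
    void freeLargeChunk(int i) {
        nodes[i].usedMap = 0;
        pushFront(i, listHead[0]);
    }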

Allocating a large chunk, freeing a large chunk, and freeing a small chunk are all constant-time operations. Allocating a small chunk is linear in the ratio of the large and small chunk sizes (the head of each list must be checked to find the large chunk with the fewest free small chunks). Since this ratio is fixed, this is also constant time and does not scale with the total number of memory chunks.

McKusick-Karels Allocator

The algorithm I present here turns out to be very similar to the well-known McKusick-Karels allocator. The chief difference is that pages, not blocks, are placed into my primary linked lists, and each page contains a linked list of its free blocks. The memory usage is essentially the same, but additional operations are needed for allocations and deallocations. In exchange we get automatic coalescing and guaranteed constant-time operations.
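
A rough sketch of that structural difference, with all names assumed for illustration:

    // Each size class tracks pages, and free blocks are threaded through the
    // free space of their own page, rather than all free blocks of a size
    // class sitting in one global list.
    struct FreeBlock { FreeBlock* next; };

    struct Page {
        int prev, next;         // doubly linked list of pages in a size class
        FreeBlock* freeBlocks;  // singly linked list of this page's free blocks
    };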

From Design of a General Purpose Memory Allocator for the 4.3BSD UNIX Kernel

Marshall Kirk McKusick and Michael J. Karels, Proceedings of the San Francisco USENIX Conference, pp. 295-303, June 1988

Another technique to improve both the efficiency of memory utilization and the speed of allocation is to cluster same-sized small allocations on a page. When a list for a power-of-two allocation is empty, a new page is allocated and divided into pieces of the needed size. This strategy speeds future allocations as several pieces of memory become available as a result of the call into the allocator.
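
A minimal sketch of that strategy, assuming 4 KB pages and power-of-two size classes (an illustration of the idea, not the 4.3BSD code):

    #include <cstddef>
    #include <new>

    constexpr std::size_t kPageSize = 4096;  // assumed page size
    constexpr int kClassCount = 13;          // power-of-two classes up to 4096 bytes

    struct FreeBlock { FreeBlock* next; };   // threaded through the free memory
    FreeBlock* freeList[kClassCount];        // one list per power-of-two size class

    // Stand-in for the kernel page allocator.
    void* allocatePage() {
        return ::operator new(kPageSize, std::align_val_t{kPageSize});
    }

    // When the list for a size class is empty, carve a fresh page into blocks
    // of that size and thread them all onto the list: one call into the page
    // allocator makes many subsequent allocations cheap. (Classes smaller
    // than sizeof(FreeBlock) are assumed unused.)
    void refill(int sizeClass) {
        const std::size_t blockSize = std::size_t{1} << sizeClass;
        char* page = static_cast<char*>(allocatePage());
        for (std::size_t off = 0; off < kPageSize; off += blockSize) {
            FreeBlock* b = reinterpret_cast<FreeBlock*>(page + off);
            b->next = freeList[sizeClass];
            freeList[sizeClass] = b;
        }
    }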

Because the size is not specified when a block of memory is freed, the allocator must keep track of the sizes of the pieces it has handed out. The 4.2BSD user-level allocator stores the size of each block in a header just before the allocation. However, this strategy doubles the memory requirement for allocations that require a power-of-two-sized block. Therefore, instead of storing the size of each piece of memory with the piece itself, the size information is associated with the memory page. Figure 2 shows how the kernel determines the size of a piece of memory that is being freed, by calculating the page in which it resides, and looking up the size associated with that page. Eliminating the cost of the overhead per piece improved utilization far more than expected. The reason is that many allocations in the kernel are for blocks of memory whose size is exactly a power of two. These requests would be nearly doubled if the user-level strategy were used. Now they can be accommodated with no wasted memory.
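
Continuing the sketch above, the per-page size table makes freeing possible without a header. kmemsizes is the paper's name for the table; deallocate, heapBase, and the sizes are assumptions:

    constexpr std::size_t kPageCount = 1024;  // pages managed (assumed)
    unsigned short kmemsizes[kPageCount];     // block size in use on each page
    char* heapBase;                           // start of the managed region

    // Free without a size argument: compute the page the address lies in and
    // look up the size recorded for that page; no per-block header is needed.
    // (Storing log2 of the size instead would keep each entry to one byte.)
    void deallocate(void* ptr) {
        std::size_t page = (static_cast<char*>(ptr) - heapBase) / kPageSize;
        std::size_t blockSize = kmemsizes[page];
        int sizeClass = 0;
        while ((std::size_t{1} << sizeClass) < blockSize) ++sizeClass;
        FreeBlock* b = static_cast<FreeBlock*>(ptr);
        b->next = freeList[sizeClass];
        freeList[sizeClass] = b;
    }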

Initialization Values

Data Structures

Data Structure Diagram

[Diagram of Page Array, Size Array, and Head/Tail Table]

Invariants

Linked-List Diagram

[Diagram of linked list structures within the pages/blocks]

Constants

Variables

Allocation Algorithm

Deallocation Algorithm

Size Algorithm

Conclusions

The table below compares the Hawkins allocator to the McKusick-Karels allocator with and without some form of coalescing. Note that McKusick-Karels can delay coalescing, making it difficult to assign a cost to specific operations.

                                   Hawkins                          M-K                              M-K with coalescing
    Memory usage per page          1 SizeType                       1 SizeType                       1 SizeType
    Computing log2 for allocation  O{ log2( bits in SizeType ) }    O{ log2( bits in SizeType ) }    O{ log2( bits in SizeType ) }
    Allocation worst-case          O{ 1 }                           O{ 1 }                           O{ BlockCount }
    Allocation typical             O{ 1 }                           O{ 1 }                           O{ 1 }
    Deallocation                   O{ 1 }                           O{ 1 }                           depends on PageCount
    Checking the size of a block   O{ 1 }                           O{ 1 }                           O{ 1 }
    Constants in front of scaling  Generally larger                 Generally smaller                Generally smaller
    Coalesces?                     Yes                              No                               Yes (may be delayed)
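
The log2 row refers to finding the size class for an allocation request; a sketch of the usual bit-halving approach, assuming a 32-bit SizeType:

    #include <cstdint>

    // Finds the power-of-two size class for a request by binary search on the
    // bits of the size, giving the O{ log2( bits in SizeType ) } cost in the
    // table. Valid for size >= 2.
    int sizeClassFor(std::uint32_t size) {
        std::uint32_t v = size - 1;  // exact powers of two map to their own class
        int log2 = 0;
        if (v >> 16) { v >>= 16; log2 += 16; }
        if (v >> 8)  { v >>= 8;  log2 += 8;  }
        if (v >> 4)  { v >>= 4;  log2 += 4;  }
        if (v >> 2)  { v >>= 2;  log2 += 2;  }
        if (v >> 1)  {           log2 += 1;  }
        return log2 + 1;  // ceil(log2(size))
    }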

Overall the McKusick-Karels allocator is likely to be more efficient than the Hawkins allocator in run-time performance, and may also offer additional space savings if the kmemsizes[] array is changed to store log2 of the block size (useful when the maximum block size is too large to fit in a byte). The primary advantage of the Hawkins allocator is that, in cases where coalescing is needed, it provides bounded operations suitable for use in real-time applications.

Implementation

A fully portable C++ implementation and a more constrained and efficient C implementation are in the works and will be posted under an open-source license when complete.

Extensions


©2007 Donovan Hawkins <hawkins@cephira.com>

Cephira, Cephira.com, Cephira.org, Cephira.net, and CMPLT are unregistered trademarks of Donovan Hawkins