New Malloc

The Problem

DJGPP's current malloc is fast but not efficient. It is also restricted by the BSD copyright, which some people object to.

The Project

This page details a number of malloc replacements, and describes the relative performance and efficiency of each.

The test program simulates a mixed environment. I don't claim that this is representative of any application. This is just a simulated suite of calls used to compare the various mallocs fairly. The test keeps track of elapsed CPU time, total memory used, and the amount of memory the program thinks it has. To offset the benchmark overhead, the test0 case uses a set of dummy routines (malloc0.c) that do nothing (malloc returns (char *)1, free does nothing, realloc returns the pointer passed to it). The results from this test case are saved and used to adjust all other test cases, so that the time of the benchmark itself is removed from the results.

Each test case is the result of linking the test program with a different malloc routine. Each testX uses mallocX.c. All tests were performed on my 150 MHz R5000 SGI. The malloc/free test performs about 600,000 mallocs and 600,000 frees.

Generated from test0, this straight line indicates the maximum amount of memory the program uses at any time. This is done by keeping track of the size requested to malloc, any resizes, and frees.

The results from test0. Note that the times displayed on the graph are scaled to be somewhat similar to the other times, so that the viewer can correlate the memory amounts with the other results. Otherwise, all the times would be zero.

This is SGI's native malloc.

This is SGI's -lmalloc library. This one claims to be faster than the regular one.

This is GNU's malloc library.

This is the malloc that's currently in DJGPP (BSD's). 2n-4 bytes per small chunk, 2n+4092 for large chunks.

Similar algorithm to BSD's malloc, but uses binary search to find buckets and doesn't pre-allocate small chunks. Allows exactly 2n bytes in each bucket.

Like test1, but uses shift/loop to find bucket, and pre-allocates small chunks.

like test1, but uses 128 buckets instead of 32 for less waste. Oddly enough, it wastes more because there's less re-use.

This one doesn't use 2n chunks. Chunks are sized according to the request. We keep the free chunks in 2n-sized buckets to speed up finding them, though. This algorithm allows for large chunks satisfying small requests (chunks are split) and for re-combining chunks as they're freed. We also remember the last big block we split up, so we can use it again, as well as any extra memory we've got from the last sbrk.

test4 with statistics

Like test4, but we cache the bucket number in the free'd chunk, and only compute it when needed. This version also has a disabled "small request optimizer" that didn't help my benchmarks, but might help C++ programs. The alignment is 8 instead of 4, and we don't treat the sbrk block at the end of memory specially (it was used rarely, but we still had to check it each time).

The Results

The X axis is the number of elapsed seconds (minus the baseline time). The Y axis is kilobytes of core used (sbrk() change).

malloc/free malloc/realloc/free
malloc/free malloc/free/realloc

The Sources

  webmaster     delorie software   privacy  
  Copyright © 1997     Updated Nov 1997