nedalloc v1.10 beta 4 (?)

by Niall Douglas

Web site: http://www.nedprod.com/programs/portable/nedmalloc/

Trunk build status:


Enclosed is nedalloc, an alternative malloc implementation for multiple threads without lock contention based on dlmalloc v2.8.4 and a specialised user mode page allocator (Windows Vista or later only). It has the following features:

  1. A per-thread small block cache for maximum CPU scalability.
  2. A per-thread arena to minimise lock contention.
  3. The ability to patch Windows binaries to replace the C memory allocation API malloc, realloc(), free() et al such that by simply inserting nedmalloc.dll into a process one realises performance improvements without recompilation.
  4. On POSIX, it knows how to talk to valgrind so you can track memory corruption and/or memory leaks.
  5. A unique user mode page allocator implementation which delivers O(1) scaling for blocks of any size, including an O(1) very fast realloc(). Improves medium sized block (~1Mb) allocation speeds by about 25 times on current hardware. Requires Windows Vista or later only, and requires Administrator privileges as well as either UAC disabled or a UAC prompt at the start of each program run.
  6. A malloc v2 API which enables considerable improvements in efficiency by allowing client code to better inform the allocator on what (not) to do.
  7. An enhanced C++ STL allocator implementation to enable super-fast std::vector<> [unfinished]

It is licensed under the Boost Software License which basically means you can do anything you like with it. This does not apply to the malloc.c.h file which remains copyright to others. Commercial support is available from ned Productions Limited.

It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) and Apple Mac OS X (x86). It works very well on all of these and is very significantly faster than the system allocator on Windows XP and FreeBSD <v7. If you are using >= 10.6 Apple Mac OS X or you are on Windows 7 or later then you probably won't see much improvement without modifying your source to use the v2 malloc API (and kudos to Apple and Microsoft for adopting excellent allocators).

The user mode page allocator returns jaw dropping real world performance improvements but requires running the process as the superuser. Without, it still offers sizeable gains on all older operating systems and through the v2 malloc API modest gains on all very recent operating systems, especially in these situations:

  1. If you are repeatedly extending large vector arrays, you will see a LARGE improvement if you use the address space reservation features.
  2. If you do a lot of work with 16 byte aligned vectors e.g. SSE or AVX vector arrays, you will find the v2 malloc API a godsend.

Table of Contents:

  1. How to use
  2. Notes
  3. Speed Comparisons
  4. Troubleshooting
  5. Changelog

A. To use:

The quickest way is to drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project. Call nedmalloc(), nedcalloc(), nedrealloc() and nedfree() instead of your normal allocator, or nedpmalloc(), nedpcalloc(), nedprealloc() and nedpfree() if you want to segment your memory usage into pools. Make sure that you call neddisablethreadcache() for every pool you use on thread exit, and don't forget neddisablethreadcache(0) for the system pool if necessary. Run and enjoy!

To test, compile test.c (C) and test.cpp (C++). Both will run a comparison between your system allocator and nedalloc and tell you how much faster nedalloc is. They also serve as examples of usage.

If you'd like nedalloc as a Windows DLL or POSIX ELF shared object, the easiest thing to do is to use scons which comes with a myriad of build options listed using scons -h. If you want to build some MSVC project files for use with Microsoft Visual Studio then what you do is (i) install python (ii) install scons (iii) open a Visual Studio Command Box for the Visual Studio you wish to use via Start Menu => Programs => Microsoft Visual Studio XXXX => Visual Studio Tools => Visual Studio XXXX Command Prompt (iv) change directory to the nedmalloc directory (e.g. by dragging in its folder) (v) type "!MakeMSVCProjs" and hit Return. Note that for Visual Studio 2008 and later support you need scons v2.1 or later.

nedalloc comes with two new memory allocator APIs: one is for C++, and the other is for C. Full documentation for all nedalloc's APIs and features is provided in the enclosed nedalloc.chm which is in Microsoft HTML Help format (Linux and Apple Mac OS X will happily read this format too). If you don't want to use the CHM documentation, nedmalloc.h is extensively commented with doxygen markup.

A1: The C++ API:

For the v1.10 release which was generously sponsored by Applied Research Associates (USA), a C++ metaprogrammed STL allocator was designed which makes use of advanced nedalloc features to remedy many of the long standing problems and inefficiencies caused by C++'s traditional over-fondness for copying things. While its implementation is complex, usage is extremely easy - simply supply nedallocator<> as the custom allocator to STL container classes.

As nedmalloc can do even better for vector extension, nedmalloc.h also contains a nedvector<> implementation which is the standard STL vector<> implementation except that it makes use of the non-relocating facilities of realloc2() (see below). This allows nedvector<> to not need to overallocate memory (most STL vector<> implementations will overallocate by 50%) which saves a lot of memory as well as completely avoiding array copy construction which make std::vector<>::resize() so very, very slow.

Even without nedalloc's major speed improvements as a simple C style allocator, the improvements to the C++ memory infrastructure alone can generate huge performance gains.

A2: The v2 malloc C API:

[Note: This API will be completely replaced in v1.2]

For the v1.10 release which was generously sponsored by Applied Research Associates (USA), a new general purpose allocator API was designed which is intended to remedy many of the long standing problems and inefficiencies introduced by the ISO C allocator API. Internally nedalloc's implementations of nedmalloc(), nedcalloc(), nedmemalign() and nedrealloc() all call into this API:

If nedmalloc.h is being included by C++ code, the alignment and flags parameters default to zero which makes the new API identical to the old API (roll on the introduction of default parameters to C!). The ability for realloc2() to take an alignment is particularly useful for extending aligned vector arrays such as SSE/AVX vector arrays. Hitherto SSE/AVX vector code had to jump through all sorts of unpleasant hoops to maintain alignment during array extension :(.

The flags supported include the ability to zero memory, to prevent realloc2() from moving a memory block, to force mmap() to be used from the beginning (useful when you know an array will be repeatedly extended) and to cause malloc2() to reserve additional address space after the allocation such that a realloc2() up to that reserved space will be very quick. On 32 bit Windows and Linux this reservation costs no address space in your process, so using it will NOT cause premature address space exhaustion.

You should note that realloc()'s thunk to realloc2() defaults the flags to M2_RESERVE_MULT(8) i.e. if realloc() needs to allocate a block larger than mmap_threshold, it will also reserve eight times the address space of that allocation in order to make future realloc()'s up to that point much faster. This catches the vast majority of situations where large arrays are repeatedly extended.

B. Notes:

If you want the very latest version of this allocator, get it from the TnFOX GIT repository at either of (both are identical mirrors):

IF YOU THINK YOU HAVE FOUND A BUG, PLEASE CHECK ONE OF THESE REPOS FIRST BEFORE REPORTING IT!

B1: Memory Bloating

Because of how nedalloc allocates an mspace per thread, it can cause severe bloating of memory usage under certain allocation patterns. You can substantially reduce this wastage by setting DEFAULTMAXTHREADSINPOOL or the threads parameter to nedcreatepool() to a fraction of the number of threads which would normally be in a pool at once. This will reduce bloating at the cost of an increase in lock contention, with DEFAULTMAXTHREADSINPOOL=1 removing almost all bloating. If the block sizes typically allocated are less than THREADCACHEMAX, locking is avoided 90-99% of the time and if most of your allocations are below this value, you can safely set DEFAULTMAXTHREADSINPOOL or even MAXTHREADSINPOOL to one.

If you have LOTS of threads you may find that the threadcache held per thread is causing memory bloating. You can call nedtrimthreadcache() to trim the cache in a thread when you know that it won't be doing memory allocation (e.g. just before going to sleep), or alternatively you can set THREADCACHEMAXFREESPACE to something smaller than its default of 1Mb.

Lastly, some people find that memory is not returned to the system when they think it ought to be. dlmalloc only returns free memory to the system when there is DEFAULT_TRIM_THRESHOLD (default=2Mb) free in a mspace, and it only checks how much there is free outside the topmost segment every MAX_RELEASE_CHECK_RATE free()'s. In other words, if your program very rapidly deallocates an awful lot of memory and then does not call free() for some time thereafter, dlmalloc will not release memory to the system. Generally in any real world code scenario free() will be called fairly frequently, and if not then you can always force release using nedmalloc_trim().

B2: Memory Leakage

You will suffer memory leakage unless you call neddisablethreadcache() per pool for every thread which exits (unless you are using nedalloc from its DLL on Windows). This is because nedalloc cannot portably know when a thread exits and thus when its thread cache can be returned for use by other code. Don't forget pool zero, the system pool. On some POSIX threads implementations there exists a pthread_atexit() which registers a termination handler for thread exit - if you don't have one of these then you'll have to do it manually.

Equally if you use nedalloc from a dynamically loaded DLL or shared object which you later kick out of memory, you will leak memory if you don't disable all thread caches for all pools (as per the preceding paragraph), destroy all thread pools using neddestroypool() and destroy the system pool using neddestroysyspool().

B3: The Threadcache

For C++ type allocation patterns (where the same small sizes of memory are regularly allocated and deallocated as objects are created and destroyed), the threadcache always benefits performance as it will cache all malloc/free allocations under THREADCACHEMAX in size. If however your allocation patterns are different, searching the threadcache may significantly slow down your code - as a rule of thumb, if cache utilisation is below 80% (see the source for neddisablethreadcache() for how to enable debug printing in release mode) then you should disable the thread cache for that thread. You can compile out the threadcache code by setting THREADCACHEMAX to zero.

B4: Large Page support

For some applications defining ENABLE_LARGE_PAGES can give a 10-15% performance increase by having nedalloc allocate using large pages only (which are 2Mb on x86/x64). Large pages take much less space in the TLB cache and can greatly benefit programs with a large working set, particularly on 64 bit systems.

Support for large pages is limited to Linux and Windows. On Linux one must employ the libhugetlbfs library anyway as this is the "official" form of large page support, and setting it up and configuring it involves mounting a special hugetlbfs filing system. dlmalloc does not require a dependency on the libhugetlbfs headers, rather it searches for the library in the current process and if not found it silently disables support.

On Windows, large page support is only implemented on Windows Server 2003/Vista or later and they are only permitted to be allocated by users holding the "Lock pages in memory" local security setting which is DISABLED by default. Furthermore, the process using nedalloc must hold the SeLockMemoryPrivilege privilege. If you are using the DLL then the DLL attempts to enable the SeLockMemoryPrivilege during initialisation - therefore if you are not using the DLL you will have to do this manually yourself. As with Linux support, if at any stage large pages cannot be allocated, then dlmalloc silently disables support - this allows one binary to function correctly in any environment. Note that on Windows if your process allocates a lot of memory at once when the machine has been running for an extended period, then the whole computer may hang for several seconds as the Windows kernel copies memory around in order to coalesce a large page. This is a problem with the Windows kernel and its VM design, not nedmalloc! If you would like to see how large pages ought to be implemented, research how FreeBSD implemented them.

B5: Memory operation logging

It is often very useful to have a log of the memory operations which an application performs - you would be amazed at the inefficiencies in memory usage that this can reveal. nedalloc contains a very fast memory operation logger which keeps a per-thread log of selected operations, including an optional stack backtrace. On pool destruction, or nedflushlogs(), nedalloc will write out the log as a Comma Separated Value format file which can be loaded into applications such as Excel for analysis.

To use, define ENABLE_LOGGING to the bitmask of enum LogEntryType items in which you are interested, so 0xffffffff would log absolutely everything. The macro NEDMALLOC_TESTLOGENTRY, whose default is (ENABLE_LOGGING & logentrytype), is then used to determine which items should be logged. You can also enable stack backtracing on MSVC and GCC using NEDMALLOC_STACKBACKTRACEDEPTH.

B6: Windows-only features

If you are running on Windows, there are quite a few extra options available thanks to work generously sponsored by Applied Research Associates (USA):

Automatic threadcache cleanup and log output
If you build nedalloc as a DLL and link that into your application, then the DLL can trap thread exits in your application and call neddisablethreadcache() on all currently existing nedpool's for you. On process exit, the DLL will also call nedflushlogs() for you on all still extant nedpool's.
Replacing the system allocator in the whole process

If you define REPLACE_SYSTEM_ALLOCATOR when building the DLL then the DLL will replace most usage of the MSVCRT allocator (release MSVCRT, not debug MSVCRTD) within any process it is loaded into with nedalloc's routines instead, whilst remaining able to handle the odd free() of a MSVCRT allocated block allocated during CRT init. This very conveniently allows you to simply link with the nedalloc DLL and your application magically now uses it with no code changes required, and because the MSVC implementation of operators new and delete both call malloc() and free() it also covers all C++ code. The following code is suggested:

#pragma comment(lib, "nedmalloc.lib")

This asks the linker to link against nedmalloc.lib during linking - without this pragma the linker will generally leave out nedmalloc as there are no explicitly imported routines that it understands. This auto-patching feature can also be combined with Microsoft's Detours to run any arbitrary application using nedalloc instead of the system allocator:

withdll /d:nedmalloc.dll program.exe

For those not able to use Microsoft Detours, there is an enclosed unsupported/nedmalloc_loader program which does one variant of the same thing. It may or may not be useful to you - it is not intended to be maintained, and it probably doesn't work on newer systems.

The reason that only the release MSVCRT not the debug MSVCRTD is patched is twofold: (i) usually one wants the debug heap in debug builds so it does memory corruption checking and reports memory leaks and (ii) the MSVC CRT actually implements operator new and malloc using a completely different implementation based on the Windows kernel HeapAlloc() function and it does a lot of hoop jumping to handle mismatching CRT versions and lots of other stuff. You can enable patching of the debug memory allocation functions in winpatcher.c by uncommenting the relevant lines.

User mode page allocation
The user mode page allocator is a user space implementation of kernel memory page allocation made possible by misusing the Address Windowing Extensions (AWE) provided by newer versions of Microsoft Windows. AWE allows - with a bit of persuasion - direct control of the Memory Management Unit of the CPU, thus allowing memory pages to be arbitrarily remapped from one address to another. The user mode page allocator can therefore allocate memory in microseconds by simply mapping it into where it needs to be, or it can realloc() gigabytes of memory from its old location into a new bigger space in microseconds. This O(1) scaling gives processes running on the user mode page allocator an unholy speed increase which gets exponentially better the larger the data set.

Want to know more in lots of detail? Here are two academic papers on the topic:
  1. Douglas, N, (2011-May), 'User Mode Memory Page Management: An old idea applied anew to the memory wall problem', ArXiv e-prints, vol: 1105.1815.
  2. Douglas, N, (2011-May), 'User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation?', ArXiv e-prints, vol: 1105.1811.

C. Speed comparisons:

See Benchmarks.xls for details.

The enclosed test.c can do one of two things: it can be a torture test which mostly hammers realloc() or it can be a pure speed test which sticks to simple malloc() and free(). If you enable C++ mode, half of the allocation sizes will be a two power multiple less than 512 bytes (to mimic C++ stack instantiated objects) which are extremely common in C++ code.

The torture test is designed to mercilessly work realloc() which is the most complex and complete code path in any memory allocator. Most allocators have very poor realloc() performance - not so nedalloc which makes use of mremap() support on Linux and Windows. Even without mremap() support nedalloc's realloc() tends to be significantly faster than any standard allocator.

The speed test is designed to be a representative synthetic memory allocator test where most allocations follow a stack pattern. It works by randomly mixing allocations with frees with sizes being a random value less than 16Kb.

The C++ test.cpp simply benchmarks how much difference nedalloc::nedallocatorise<> makes to std::vector<> performance, particularly the performance of push_back(), pop_back() and vector assignment all of which are very common in real world code. As you will see, the STL - even with C++0x move constructor support - does not perform anywhere close to nedalloc's version which achieves its gains by simply avoiding copy and move construction completely.

The real world code results are from Tn's TestIO benchmark. This is a heavily multithreaded and memory intensive benchmark with a lot of branching and other stuff modern processors don't like so much. As you'll note, the test doesn't show the benefits of the threadcache mostly due to the saturation of the memory bus being the limiting factor.

D. Troubleshooting:

I get a quite a few bug reports about code not working properly under nedalloc. I do not wish to sound presumptuous, however in an overwhelming majority of cases the problem is in your application code and not nedalloc (see below for all the bugs reported and fixed since 2006). Some of the largest corporations and IT deployments in the world use nedalloc pre-v1.10, and pre-v1.10 has been very heavily stress tested on everything from 32 processor SMP clusters right through to root DNS servers, ATM machine networks and embedded operating systems requiring a very high uptime. The v1.10 release adds a LOT of new code and features, and hence there are quite likely a lot of new bugs in the new code.

In particular, just because it just happens to appear to work under the system allocator does not mean that your application is not riddled with memory corruption and non-ANSI usage of the API! And usually this is not your code's fault, but rather it is usually the third party libraries being used which sadly often include system libraries.

Even though debugging an application for memory errors is a true black art made possible only with a great deal of patience, intuition and skill, here is a checklist for things to do before reporting a bug in nedalloc:

  1. Make SURE you try nedalloc from GIT HEAD. For around six months of 2007 I kept getting the same report of a bug long fixed in GIT HEAD.
  2. Make SURE you try nedalloc v1.06. If it works in v1.06 but isn't working in nedalloc >= v1.10, then it's probably a bug in the new code (please report it to me!)
  3. Make use of nedalloc's internal debug routines. Try turning on full sanity checks by #define FULLSANITYCHECKS 1. Also make use of all the assertion checking performed when DEBUG is defined as 1. A lot of bug reports are made before running under a debug build where an assertion trip clearly showed the problem. Lastly, try changing the thread cache by #defining THREADCACHEMAX - this fundamentally changes how the memory allocator behaves: if everything is fine with the thread cache fully on or fully off, then this strongly suggests the source of your problem.
  4. Make SURE you are matching allocations and frees belonging to nedalloc if you are not defining REPLACE_SYSTEM_ALLOCATOR. Attempting to free a block not allocated by nedalloc will end badly, similarly passing one of nedalloc's blocks to another allocator will likely also end badly. I have inserted as many assertion and debug checks for this possibility as I can think of (further suggestions are welcome), but no system can ever be watertight. If you're using C++, make use of the C++ nedallocatorise API provided or else use some form of strong template type system to have the compiler guarantee membership of a memory pointer - see the Boost libraries, or indeed my own TnFOX portability toolkit.
  5. If you're still having problems, or more likely your code runs absolutely fine under debug builds but trips up under release which suggests a timing bug, it is time to deploy heavyweight tools. Under Linux, you should use valgrind. Under Windows, there is an excellent commercial tool called Glowcode. Any programming team serious on quality should ALWAYS run their projects through these tools before each and every release anyway - you would be amazed at what you miss during all other testing.
  6. Lastly, in the worst case scenario, consider hiring in a memory debugging expert. There are quite a few on the market and they often are authors of memory allocators. Wolfram Gloger (the author of ptmalloc) provides consulting services. My own consulting company ned Productions Limited may be able to provide such a service depending on our current workload.

I hope that these tips help. And I urge anyone considering simply dropping back to the system allocator as a quick fix to reconsider: squashing memory bugs often brings with it significant extra benefits in performance and reliability. It may cost what appears to be a lot extra now, but it usually will save itself many times its cost over the next few years. I know of one large multinational corporation who saved hundreds of millions of dollars due to the debugging of their system software performed when trying to get it working with nedalloc - they found one bug in nedalloc but over a hundred in their own code, and in the process improved performance threefold which saved an expensive hardware upgrade and deployment. The conclusion can only be that fixing memory bugs now tends to be worth it in the long run.

E. ChangeLog:

v1.10 beta 4 ?:

v1.10 beta 3 17th July 2012:

v1.10 beta 2 10th July 2012:

v1.10 beta 1 19th May 2011:

v1.06 beta 2 21st March 2010:

v1.06 beta 1 13th January 2010:

v1.05 15th June 2008:

v1.04 14th July 2007:

v1.04alpha_svn915 7th October 2006:

v1.03 10th July 2006:

v1.02 15th May 2006:

v1.01 24th February 2006:

v1.00 1st January 2006: