nedalloc v1.10 beta 4 (?)

by Niall Douglas

Web site: http://www.nedprod.com/programs/portable/nedmalloc/

Trunk build status:

Enclosed is nedalloc, an alternative malloc implementation for multiple threads without lock contention based on dlmalloc v2.8.4 and a specialised user mode page allocator (Windows Vista or later only). It has the following features:

A per-thread small block cache for maximum CPU scalability.
A per-thread arena to minimise lock contention.
The ability to patch Windows binaries to replace the C memory allocation API malloc, realloc(), free() et al such that by simply inserting nedmalloc.dll into a process one realises performance improvements without recompilation.
On POSIX, it knows how to talk to valgrind so you can track memory corruption and/or memory leaks.
A unique user mode page allocator implementation which delivers O(1) scaling for blocks of any size, including an O(1) very fast realloc(). Improves medium sized block (~1Mb) allocation speeds by about 25 times on current hardware. Requires Windows Vista or later only, and requires Administrator privileges as well as either UAC disabled or a UAC prompt at the start of each program run.
A malloc v2 API which enables considerable improvements in efficiency by allowing client code to better inform the allocator on what (not) to do.
An enhanced C++ STL allocator implementation to enable super-fast std::vector<> [unfinished]

It is licensed under the Boost Software License which basically means you can do anything you like with it. This does not apply to the malloc.c.h file which remains copyright to others. Commercial support is available from ned Productions Limited.

It has been tested on win32 (x86), win64 (x64), Linux (x64), FreeBSD (x64) and Apple Mac OS X (x86). It works very well on all of these and is very significantly faster than the system allocator on Windows XP and FreeBSD <v7. If you are using >= 10.6 Apple Mac OS X or you are on Windows 7 or later then you probably won't see much improvement without modifying your source to use the v2 malloc API (and kudos to Apple and Microsoft for adopting excellent allocators).

The user mode page allocator returns jaw dropping real world performance improvements but requires running the process as the superuser. Without, it still offers sizeable gains on all older operating systems and through the v2 malloc API modest gains on all very recent operating systems, especially in these situations:

If you are repeatedly extending large vector arrays, you will see a LARGE improvement if you use the address space reservation features.
If you do a lot of work with 16 byte aligned vectors e.g. SSE or AVX vector arrays, you will find the v2 malloc API a godsend.

Table of Contents:

How to use
- A1. The C++ API
- A2. The v2 malloc C API
Notes
- B1. Memory Bloating
- B2. Memory Leakage
- B3. The Threadcache
- B4. Large Page support
- B5. Memory operation logging
- B6. Windows-only features
Speed Comparisons
Troubleshooting
Changelog

A. To use:

The quickest way is to drop nedmalloc.h, nedmalloc.c and malloc.c.h into your project. Call nedmalloc(), nedcalloc(), nedrealloc() and nedfree() instead of your normal allocator, or nedpmalloc(), nedpcalloc(), nedprealloc() and nedpfree() if you want to segment your memory usage into pools. Make sure that you call neddisablethreadcache() for every pool you use on thread exit, and don't forget neddisablethreadcache(0) for the system pool if necessary. Run and enjoy!

To test, compile test.c (C) and test.cpp (C++). Both will run a comparison between your system allocator and nedalloc and tell you how much faster nedalloc is. They also serve as examples of usage.

If you'd like nedalloc as a Windows DLL or POSIX ELF shared object, the easiest thing to do is to use scons which comes with a myriad of build options listed using scons -h. If you want to build some MSVC project files for use with Microsoft Visual Studio then what you do is (i) install python (ii) install scons (iii) open a Visual Studio Command Box for the Visual Studio you wish to use via Start Menu => Programs => Microsoft Visual Studio XXXX => Visual Studio Tools => Visual Studio XXXX Command Prompt (iv) change directory to the nedmalloc directory (e.g. by dragging in its folder) (v) type "!MakeMSVCProjs" and hit Return. Note that for Visual Studio 2008 and later support you need scons v2.1 or later.

nedalloc comes with two new memory allocator APIs: one is for C++, and the other is for C. Full documentation for all nedalloc's APIs and features is provided in the enclosed nedalloc.chm which is in Microsoft HTML Help format (Linux and Apple Mac OS X will happily read this format too). If you don't want to use the CHM documentation, nedmalloc.h is extensively commented with doxygen markup.

A1: The C++ API:

For the v1.10 release which was generously sponsored by Applied Research Associates (USA), a C++ metaprogrammed STL allocator was designed which makes use of advanced nedalloc features to remedy many of the long standing problems and inefficiencies caused by C++'s traditional over-fondness for copying things. While its implementation is complex, usage is extremely easy - simply supply nedallocator<> as the custom allocator to STL container classes.

As nedmalloc can do even better for vector extension, nedmalloc.h also contains a nedvector<> implementation which is the standard STL vector<> implementation except that it makes use of the non-relocating facilities of realloc2() (see below). This allows nedvector<> to not need to overallocate memory (most STL vector<> implementations will overallocate by 50%) which saves a lot of memory as well as completely avoiding array copy construction which make std::vector<>::resize() so very, very slow.

Even without nedalloc's major speed improvements as a simple C style allocator, the improvements to the C++ memory infrastructure alone can generate huge performance gains.

A2: The v2 malloc C API:

[Note: This API will be completely replaced in v1.2]

For the v1.10 release which was generously sponsored by Applied Research Associates (USA), a new general purpose allocator API was designed which is intended to remedy many of the long standing problems and inefficiencies introduced by the ISO C allocator API. Internally nedalloc's implementations of nedmalloc(), nedcalloc(), nedmemalign() and nedrealloc() all call into this API:

void* malloc2(size_t bytes, size_t alignment, unsigned flags)
void* realloc2(void* mem, size_t bytes, size_t alignment, unsigned flags)

If nedmalloc.h is being included by C++ code, the alignment and flags parameters default to zero which makes the new API identical to the old API (roll on the introduction of default parameters to C!). The ability for realloc2() to take an alignment is particularly useful for extending aligned vector arrays such as SSE/AVX vector arrays. Hitherto SSE/AVX vector code had to jump through all sorts of unpleasant hoops to maintain alignment during array extension :(.

The flags supported include the ability to zero memory, to prevent realloc2() from moving a memory block, to force mmap() to be used from the beginning (useful when you know an array will be repeatedly extended) and to cause malloc2() to reserve additional address space after the allocation such that a realloc2() up to that reserved space will be very quick. On 32 bit Windows and Linux this reservation costs no address space in your process, so using it will NOT cause premature address space exhaustion.

You should note that realloc()'s thunk to realloc2() defaults the flags to M2_RESERVE_MULT(8) i.e. if realloc() needs to allocate a block larger than mmap_threshold, it will also reserve eight times the address space of that allocation in order to make future realloc()'s up to that point much faster. This catches the vast majority of situations where large arrays are repeatedly extended.

B. Notes:

If you want the very latest version of this allocator, get it from the TnFOX GIT repository at either of (both are identical mirrors):

IF YOU THINK YOU HAVE FOUND A BUG, PLEASE CHECK ONE OF THESE REPOS FIRST BEFORE REPORTING IT!

B1: Memory Bloating

Because of how nedalloc allocates an mspace per thread, it can cause severe bloating of memory usage under certain allocation patterns. You can substantially reduce this wastage by setting DEFAULTMAXTHREADSINPOOL or the threads parameter to nedcreatepool() to a fraction of the number of threads which would normally be in a pool at once. This will reduce bloating at the cost of an increase in lock contention, with DEFAULTMAXTHREADSINPOOL=1 removing almost all bloating. If the block sizes typically allocated are less than THREADCACHEMAX, locking is avoided 90-99% of the time and if most of your allocations are below this value, you can safely set DEFAULTMAXTHREADSINPOOL or even MAXTHREADSINPOOL to one.

If you have LOTS of threads you may find that the threadcache held per thread is causing memory bloating. You can call nedtrimthreadcache() to trim the cache in a thread when you know that it won't be doing memory allocation (e.g. just before going to sleep), or alternatively you can set THREADCACHEMAXFREESPACE to something smaller than its default of 1Mb.

Lastly, some people find that memory is not returned to the system when they think it ought to be. dlmalloc only returns free memory to the system when there is DEFAULT_TRIM_THRESHOLD (default=2Mb) free in a mspace, and it only checks how much there is free outside the topmost segment every MAX_RELEASE_CHECK_RATE free()'s. In other words, if your program very rapidly deallocates an awful lot of memory and then does not call free() for some time thereafter, dlmalloc will not release memory to the system. Generally in any real world code scenario free() will be called fairly frequently, and if not then you can always force release using nedmalloc_trim().

B2: Memory Leakage

You will suffer memory leakage unless you call neddisablethreadcache() per pool for every thread which exits (unless you are using nedalloc from its DLL on Windows). This is because nedalloc cannot portably know when a thread exits and thus when its thread cache can be returned for use by other code. Don't forget pool zero, the system pool. On some POSIX threads implementations there exists a pthread_atexit() which registers a termination handler for thread exit - if you don't have one of these then you'll have to do it manually.

Equally if you use nedalloc from a dynamically loaded DLL or shared object which you later kick out of memory, you will leak memory if you don't disable all thread caches for all pools (as per the preceding paragraph), destroy all thread pools using neddestroypool() and destroy the system pool using neddestroysyspool().

B3: The Threadcache

For C++ type allocation patterns (where the same small sizes of memory are regularly allocated and deallocated as objects are created and destroyed), the threadcache always benefits performance as it will cache all malloc/free allocations under THREADCACHEMAX in size. If however your allocation patterns are different, searching the threadcache may significantly slow down your code - as a rule of thumb, if cache utilisation is below 80% (see the source for neddisablethreadcache() for how to enable debug printing in release mode) then you should disable the thread cache for that thread. You can compile out the threadcache code by setting THREADCACHEMAX to zero.

B4: Large Page support

For some applications defining ENABLE_LARGE_PAGES can give a 10-15% performance increase by having nedalloc allocate using large pages only (which are 2Mb on x86/x64). Large pages take much less space in the TLB cache and can greatly benefit programs with a large working set, particularly on 64 bit systems.

Support for large pages is limited to Linux and Windows. On Linux one must employ the libhugetlbfs library anyway as this is the "official" form of large page support, and setting it up and configuring it involves mounting a special hugetlbfs filing system. dlmalloc does not require a dependency on the libhugetlbfs headers, rather it searches for the library in the current process and if not found it silently disables support.

On Windows, large page support is only implemented on Windows Server 2003/Vista or later and they are only permitted to be allocated by users holding the "Lock pages in memory" local security setting which is DISABLED by default. Furthermore, the process using nedalloc must hold the SeLockMemoryPrivilege privilege. If you are using the DLL then the DLL attempts to enable the SeLockMemoryPrivilege during initialisation - therefore if you are not using the DLL you will have to do this manually yourself. As with Linux support, if at any stage large pages cannot be allocated, then dlmalloc silently disables support - this allows one binary to function correctly in any environment. Note that on Windows if your process allocates a lot of memory at once when the machine has been running for an extended period, then the whole computer may hang for several seconds as the Windows kernel copies memory around in order to coalesce a large page. This is a problem with the Windows kernel and its VM design, not nedmalloc! If you would like to see how large pages ought to be implemented, research how FreeBSD implemented them.

B5: Memory operation logging

It is often very useful to have a log of the memory operations which an application performs - you would be amazed at the inefficiencies in memory usage that this can reveal. nedalloc contains a very fast memory operation logger which keeps a per-thread log of selected operations, including an optional stack backtrace. On pool destruction, or nedflushlogs(), nedalloc will write out the log as a Comma Separated Value format file which can be loaded into applications such as Excel for analysis.

To use, define ENABLE_LOGGING to the bitmask of enum LogEntryType items in which you are interested, so 0xffffffff would log absolutely everything. The macro NEDMALLOC_TESTLOGENTRY, whose default is (ENABLE_LOGGING & logentrytype), is then used to determine which items should be logged. You can also enable stack backtracing on MSVC and GCC using NEDMALLOC_STACKBACKTRACEDEPTH.

B6: Windows-only features

If you are running on Windows, there are quite a few extra options available thanks to work generously sponsored by Applied Research Associates (USA):

Automatic threadcache cleanup and log output

If you build nedalloc as a DLL and link that into your application, then the DLL can trap thread exits in your application and call neddisablethreadcache() on all currently existing nedpool's for you. On process exit, the DLL will also call nedflushlogs() for you on all still extant nedpool's.

Replacing the system allocator in the whole process

If you define REPLACE_SYSTEM_ALLOCATOR when building the DLL then the DLL will replace most usage of the MSVCRT allocator (release MSVCRT, not debug MSVCRTD) within any process it is loaded into with nedalloc's routines instead, whilst remaining able to handle the odd free() of a MSVCRT allocated block allocated during CRT init. This very conveniently allows you to simply link with the nedalloc DLL and your application magically now uses it with no code changes required, and because the MSVC implementation of operators new and delete both call malloc() and free() it also covers all C++ code. The following code is suggested:

#pragma comment(lib, "nedmalloc.lib")

This asks the linker to link against nedmalloc.lib during linking - without this pragma the linker will generally leave out nedmalloc as there are no explicitly imported routines that it understands. This auto-patching feature can also be combined with Microsoft's Detours to run any arbitrary application using nedalloc instead of the system allocator:

withdll /d:nedmalloc.dll program.exe

For those not able to use Microsoft Detours, there is an enclosed unsupported/nedmalloc_loader program which does one variant of the same thing. It may or may not be useful to you - it is not intended to be maintained, and it probably doesn't work on newer systems.

The reason that only the release MSVCRT not the debug MSVCRTD is patched is twofold: (i) usually one wants the debug heap in debug builds so it does memory corruption checking and reports memory leaks and (ii) the MSVC CRT actually implements operator new and malloc using a completely different implementation based on the Windows kernel HeapAlloc() function and it does a lot of hoop jumping to handle mismatching CRT versions and lots of other stuff. You can enable patching of the debug memory allocation functions in winpatcher.c by uncommenting the relevant lines.

User mode page allocation

The user mode page allocator is a user space implementation of kernel memory page allocation made possible by misusing the Address Windowing Extensions (AWE) provided by newer versions of Microsoft Windows. AWE allows - with a bit of persuasion - direct control of the Memory Management Unit of the CPU, thus allowing memory pages to be arbitrarily remapped from one address to another. The user mode page allocator can therefore allocate memory in microseconds by simply mapping it into where it needs to be, or it can realloc() gigabytes of memory from its old location into a new bigger space in microseconds. This O(1) scaling gives processes running on the user mode page allocator an unholy speed increase which gets exponentially better the larger the data set.

Want to know more in lots of detail? Here are two academic papers on the topic:

Douglas, N, (2011-May), 'User Mode Memory Page Management: An old idea applied anew to the memory wall problem', ArXiv e-prints, vol: 1105.1815.
Douglas, N, (2011-May), 'User Mode Memory Page Allocation: A Silver Bullet For Memory Allocation?', ArXiv e-prints, vol: 1105.1811.

C. Speed comparisons:

See Benchmarks.xls for details.

The enclosed test.c can do one of two things: it can be a torture test which mostly hammers realloc() or it can be a pure speed test which sticks to simple malloc() and free(). If you enable C++ mode, half of the allocation sizes will be a two power multiple less than 512 bytes (to mimic C++ stack instantiated objects) which are extremely common in C++ code.

The torture test is designed to mercilessly work realloc() which is the most complex and complete code path in any memory allocator. Most allocators have very poor realloc() performance - not so nedalloc which makes use of mremap() support on Linux and Windows. Even without mremap() support nedalloc's realloc() tends to be significantly faster than any standard allocator.

The speed test is designed to be a representative synthetic memory allocator test where most allocations follow a stack pattern. It works by randomly mixing allocations with frees with sizes being a random value less than 16Kb.

The C++ test.cpp simply benchmarks how much difference nedalloc::nedallocatorise<> makes to std::vector<> performance, particularly the performance of push_back(), pop_back() and vector assignment all of which are very common in real world code. As you will see, the STL - even with C++0x move constructor support - does not perform anywhere close to nedalloc's version which achieves its gains by simply avoiding copy and move construction completely.

The real world code results are from Tn's TestIO benchmark. This is a heavily multithreaded and memory intensive benchmark with a lot of branching and other stuff modern processors don't like so much. As you'll note, the test doesn't show the benefits of the threadcache mostly due to the saturation of the memory bus being the limiting factor.

D. Troubleshooting:

I get a quite a few bug reports about code not working properly under nedalloc. I do not wish to sound presumptuous, however in an overwhelming majority of cases the problem is in your application code and not nedalloc (see below for all the bugs reported and fixed since 2006). Some of the largest corporations and IT deployments in the world use nedalloc pre-v1.10, and pre-v1.10 has been very heavily stress tested on everything from 32 processor SMP clusters right through to root DNS servers, ATM machine networks and embedded operating systems requiring a very high uptime. The v1.10 release adds a LOT of new code and features, and hence there are quite likely a lot of new bugs in the new code.

In particular, just because it just happens to appear to work under the system allocator does not mean that your application is not riddled with memory corruption and non-ANSI usage of the API! And usually this is not your code's fault, but rather it is usually the third party libraries being used which sadly often include system libraries.

Even though debugging an application for memory errors is a true black art made possible only with a great deal of patience, intuition and skill, here is a checklist for things to do before reporting a bug in nedalloc:

Make SURE you try nedalloc from GIT HEAD. For around six months of 2007 I kept getting the same report of a bug long fixed in GIT HEAD.
Make SURE you try nedalloc v1.06. If it works in v1.06 but isn't working in nedalloc >= v1.10, then it's probably a bug in the new code (please report it to me!)
Make use of nedalloc's internal debug routines. Try turning on full sanity checks by #define FULLSANITYCHECKS 1. Also make use of all the assertion checking performed when DEBUG is defined as 1. A lot of bug reports are made before running under a debug build where an assertion trip clearly showed the problem. Lastly, try changing the thread cache by #defining THREADCACHEMAX - this fundamentally changes how the memory allocator behaves: if everything is fine with the thread cache fully on or fully off, then this strongly suggests the source of your problem.
Make SURE you are matching allocations and frees belonging to nedalloc if you are not defining REPLACE_SYSTEM_ALLOCATOR. Attempting to free a block not allocated by nedalloc will end badly, similarly passing one of nedalloc's blocks to another allocator will likely also end badly. I have inserted as many assertion and debug checks for this possibility as I can think of (further suggestions are welcome), but no system can ever be watertight. If you're using C++, make use of the C++ nedallocatorise API provided or else use some form of strong template type system to have the compiler guarantee membership of a memory pointer - see the Boost libraries, or indeed my own TnFOX portability toolkit.
If you're still having problems, or more likely your code runs absolutely fine under debug builds but trips up under release which suggests a timing bug, it is time to deploy heavyweight tools. Under Linux, you should use valgrind. Under Windows, there is an excellent commercial tool called Glowcode. Any programming team serious on quality should ALWAYS run their projects through these tools before each and every release anyway - you would be amazed at what you miss during all other testing.
Lastly, in the worst case scenario, consider hiring in a memory debugging expert. There are quite a few on the market and they often are authors of memory allocators. Wolfram Gloger (the author of ptmalloc) provides consulting services. My own consulting company ned Productions Limited may be able to provide such a service depending on our current workload.

I hope that these tips help. And I urge anyone considering simply dropping back to the system allocator as a quick fix to reconsider: squashing memory bugs often brings with it significant extra benefits in performance and reliability. It may cost what appears to be a lot extra now, but it usually will save itself many times its cost over the next few years. I know of one large multinational corporation who saved hundreds of millions of dollars due to the debugging of their system software performed when trying to get it working with nedalloc - they found one bug in nedalloc but over a hundred in their own code, and in the process improved performance threefold which saved an expensive hardware upgrade and deployment. The conclusion can only be that fixing memory bugs now tends to be worth it in the long run.

E. ChangeLog:

v1.10 beta 4 ?:

[master 726d9c7] Fixed memory corruption introduced when creating more than two nedpool's (issue #7). Thanks to mxmauro for reporting this.
[master c191ea9] Merged dlmalloc v2.8.6.
[master 06f1c70] Added support for clang, plus fixed up some compile errors in C++11.
[master 8f8256c] Added support for valgrind instrumentation so valgrind can track programs using nedmalloc.
[master 69825ca] Fixed issue #8 where memory allocated via the independent_*() functions was being incorrectly identified as system allocated. Thanks to Geri for reporting this.
[master a6a0dec] Fixed issue #10 where a failure to allocate memory on POSIX was not being trapped correctly. Thanks to btaudul for reporting this.
[master a559f9e] Fixed issue #12 where RTLD_DEFAULT was undefined. Replaced this code entirely with new code which parses /proc/meminfo for huge page size. Thanks to Geri for reporting this.
[master 9119158] Added Travis CI build bot support to nedalloc, testing gcc, clang and clang static analyser.
[master xxxxxxx] Fixed issue #14 where nedalloc was using is_pod<> instead of is_trivially_copyable<>. Thanks to JustSid for reporting this.

v1.10 beta 3 17th July 2012:

[master 5f26c1a] Due to a bug introduced in sha 7a9dd5c (17th April 2010), nedmalloc has never allocated more than a single mspace when using the system pool. This effectively had disabled concurrency for any allocation > THREADCACHEMAX (8Kb) which no doubt made nedmalloc v1.10 betas 1 and 2 appear no faster than system allocators. My thanks to the eagle eyes of Gavin Lambert for spotting this.

v1.10 beta 2 10th July 2012:

[master 51ab2a2] scons now tests for C++0x support before turning it on and tries multiple libraries for clock_gettime() rather than assuming it lives in librt. This ought to fix miscompilation on Mac OS X. Thanks to Robert D. Blanchet Jr. for reporting this.
[master b2c3517] Mac defines malloc_size to be const void *ptr, not void *ptr
[master 9333e50] Updated to use the new O(1) Cfind(rounds=1) feature in nedtries
[master 54c7e44] Avoid overflowing allocation size. Thanks to Xi Wang for supplying a patch fixing this.
[master 5b614a0] Removed __try1 and __finally1 from MinGW support as x64 target no longer supports SEH. Thanks to Geri for reporting this.
[master 48f1aa9] Tidied up bitrot which had broken compilation due to mismatched #if...#endif.

v1.10 beta 1 19th May 2011:

[master 89f1806] Moved from SVN to GIT. Bumped version to v1.10 as new ARA contract will involve significant further improvements mainly centering around realloc() performance.
[master 254fe7c] Added nedmemsize() for API compatibility with other allocators. Added DEFAULTMAXTHREADSINPOOL and set it to FOUR which is a BREAKING CHANGE from previous versions of nedalloc (which set it to 16).
[nedmalloc_fast_realloc 97d1420] Added win32mremap() implementation.
[nedmalloc_fast_realloc 8a1001e] Significantly improved test.c with new test options TESTCPLUSPLUS, BLOCKSIZE, TESTTYPE and MAXMEMORY.
[nedmalloc_fast_realloc 7ea606d] Implemented two variants of direct mremap() on Windows, one using file mappings and the other using over-reservation. The former is used on 32 bit and the latter on 64 bit.
[nedmalloc_fast_realloc 26ff9a7] Added the malloc2() interface to nedalloc.
[nedmalloc_fast_realloc 5bc5d97] Rewrote Readme.txt to become Readme.html which makes it much clearer to read.
[nedmalloc_fast_realloc 2efa595] Added doxygen markup to nedmalloc.h and a first go at a policy driven STL allocator class.
[nedmalloc_fast_realloc d851bde] Added a CHM documenting the nedalloc API.
[nedmalloc_fast_realloc dbd3991] Added a fast malloc operations logger which outputs a CSV log on process exit.
[nedmalloc_fast_realloc d6a8585] Added stack backtracing to the logger.
[master c7ea06d] Finished user mode page allocator, so merged nedmalloc_fast_realloc branch.
[master 9a8800f] Fixed small bug which was preventing the windows patcher from correctly finding the proper MSVCRT.
[master 37c58b1] Fixed leak of mutexes when using pthread or win32 mutexs as locks. Thanks to Gavin Lambert for reporting this.
[master f67e284] Fixed nedflushlogs() not actually flushing data and/or causing a segfault. Thanks to Roman Tatkin for reporting this.
[master 1324bf3] Finally got round to retiring the MSVC project files as they were sources of never ending hassle due to being out of sync with the SConstruct config. Rebuilt scons build system to be fully compatible with MSVC instead (long overdue!)
[master 068494e] As the release of v1.10 RC1 approaches, fixed a long standing problem with the binary patcher where multiple MSVCRT versions in the process weren't handled - everything was sent to one MSVCRT only, and needless to say that sorta worked sometimes and sometimes not. Now when nedmalloc passes a foreign block to the system allocator, it runs a stack backtrace to figure out what MSVCRT in the process it ought to pass it to. It's slow, but fixes a very common segfault on process exit on VS2010.
[master 4cca52c] Very embarrassingly, nedmalloc has been severely but unpredictably broken on POSIX for over a year now when built with DEBUG defined. This was turning on DEFAULT_GRANULARITY_ALIGNED whose POSIX implementation was causing random segfaults so mysterious that neither gdb nor valgrind could pick them up - in other words, the very worst kind of memory corruption: undetectable, untraceable and undebuggable. I only found them myself due to a recent bug report for TnFOX on POSIX where due to luck, very recent Linux kernels just happened by pure accident to cause this bug to manifest itself as preventing process init right at the very start - so early that no debugger could attach. After over a week of trial & error I narrowed it down to being somewhere in nedmalloc, then having something to do with DEBUG being defined or not, then two hours ago the eureka moment arrived and I quite literally did a jig around the room in joy. Problem is now fixed thank the heavens!!!
[master 3d55a01] Fixed a problem where the binary patcher was early outing too soon and therefore failing to patch all the binaries properly. It would seem that the Microsoft linker doesn't sort the import table like I had thought it did - I would guess it sorts per DLL location, otherwise is unsorted. Thanks to Roman Tatkin for reporting this bug.
[master 6c74071] Added override of _GNU_SOURCE for when HAVE_MREMAP is auto-detected. Thanks to Maxim Zakharov for reporting this issue.
[master dee2d27] Marked off the v2 malloc API as deprecated in preparation for beta release. Updated CHM documentation.

v1.06 beta 2 21st March 2010:

{ 1153 } Added detection of whether host process is using MSVCRT or MSVCRTD and the fixing up of which runtime tolerant nedalloc should use if nedalloc was linked differently. This ought to save a great deal of hassle later on by preventing failed-to-RTM user bug reports :)
{ 1154 } Fixed nedalloc trying to use MLOCK_T even when USE_LOCKS=0. Thanks to Ariel Manzur for reporting this.
{ 1155 } Fixed USE_SPIN_LOCKS=0 not compiling on Windows.
{ 1157 } Fixed bug where foreign blocks entering the threadcache weren't being marked as such, thus typically causing a segfault on process exit.
{ 1158 } Fixed compilation problems on mingw. Thanks to Amanieu d'Antras for reporting these.
{ 1159 } Released as beta2.

v1.06 beta 1 13th January 2010:

{ 1079 } Fixed misdeclaration of struct mallinfo as C++ type. Thanks to James Mansion for reporting this.
{ 1082 } Fixed dlmalloc bug which caused header corruption to mmap() allocations when running under multiple threads.
{ 1088 } Fixed assertion failure for nedblksize() with latest dlmalloc. Thanks to Anteru for reporting this.
{ 1088 } Added neddestroysyspool(). Thanks to Lars Wehmeyer for suggesting this.
{ 1088 } Fixed thread id high bit set bug causing SIGABRT on Mac OS X. Thanks to Chris Dillman for reporting this.
{ 1094 } Integrated dlmalloc v2.8.4 final.
{ 1095 } Added nedtrimthreadcache(). Thanks to Hayim Hendeles for suggesting this.
{ 1095 } Fixed silly assertion of null pointer dereference. Thanks to Ullrich Heinemann for reporting this.
{ 1096 } Fixed lots of level 4 warnings on MSVC. Thanks to Anteru for suggesting this.
{ 1098 } Improved non-nedalloc block detection to 6.25% probability of being wrong. Thanks to Applied Research Associates for sponsoring this.
{ 1099 } Added USE_MAGIC_HEADERS which allows nedalloc to handle freeing a system allocated block. Added USE_ALLOCATOR which allows the changing of which backend allocator to use (with choices between the system allocator and dlmalloc - choosing the system allocator is intended for debug situations only e.g. valgrind). Thanks to Applied Research Associates for sponsoring this.
{ 1105 } Added ability to build nedalloc as a DLL. Added support for a run time PE binary patcher which can patch all usage of the system allocator replacing it with nedalloc. Thanks to Applied Research Associates for sponsoring this.
{ 1108 } Added patcher loader which can load any arbitrary program injecting the nedalloc DLL which then patches in its replacement for the system allocator. Doesn't work on all programs, but does on most e.g. Microsoft Word. Thanks to Applied Research Associates for sponsoring this.
{ 1116 } Finished debugging and optimising the latest additions to the codebase. The patcher now works well on x64 as well as x86. Added support for large pages on Windows. Thanks to Applied Research Associates for sponsoring this.
{ 1125 } Added nedpoollist() which returns a snapshot of the nedpool's currently existing. The Windows DLL thread exit code now disables the thread cache for all currently existing nedpool's. Thanks to Applied Research Associates for sponsoring this.
{ 1126 } Added ENABLE_TOLERANT_NEDMALLOC which allows nedalloc to recognise system allocator blocks and to do the right thing with them.
{ 1139 } Added link time code generation support for Windows builds. This currently has zero performance improvement on x64 (on MSVC9) but can add 15% to x86 performance (on MSVC9). Also added scons SConstruct and SConscript files.

v1.05 15th June 2008:

{ 1042 } Added error check for TLSSET() and TLSFREE() macros. Thanks to Markus Elfring for reporting this.
{ 1043 } Fixed a segfault when freeing memory allocated using nedindependent_comalloc(). Thanks to Pavel Vozenilek for reporting this.

v1.04 14th July 2007:

Fixed a bug with the new optimised implementation that failed to lock on a realloc under certain conditions.
Fixed lack of thread synchronisation in InitPool() causing pool corruption.
Fixed a memory leak of thread cache contents on disabling. Thanks to Earl Chew for reporting this.
Added a sanity check for freed blocks being valid.
Reworked test.c into being a torture test.
Fixed GCC assembler optimisation misspecification.

v1.04alpha_svn915 7th October 2006:

Fixed failure to unlock thread cache list if allocating a new list failed. Thanks to Dmitry Chichkov for reporting this. Futher thanks to Aleksey Sanin.
Fixed realloc(0, <size>) segfaulting. Thanks to Dmitry Chichkov for reporting this.
Made config defines #ifndef so they can be overriden by the build system. Thanks to Aleksey Sanin for suggesting this.
Fixed deadlock in nedprealloc() due to unnecessary locking of preferred thread mspace when mspace_realloc() always uses the original block's mspace anyway. Thanks to Aleksey Sanin for reporting this.
Made some speed improvements by hacking mspace_malloc() to no longer lock its mspace, thus allowing the recursive mutex implementation to be removed with an associated speed increase. Thanks to Aleksey Sanin for suggesting this.
Fixed a bug where allocating mspaces overran its max limit. Thanks to Aleksey Sanin for reporting this.

v1.03 10th July 2006:

Fixed memory corruption bug in threadcache code which only appeared with >4 threads and in heavy use of the threadcache.

v1.02 15th May 2006:

Integrated dlmalloc v2.8.4, fixing the win32 memory release problem and improving performance still further. Speed is now up to twice the speed of v1.01 (average is 67% faster).
Fixed win32 critical section implementation. Thanks to Pavel Kuznetsov for reporting this.
Wasn't locking mspace if all mspaces were locked. Thanks to Pavel Kuznetsov for reporting this.
Added Apple Mac OS X support.

v1.01 24th February 2006:

Fixed multiprocessor scaling problems by removing sources of cache sloshing.
Earl Chew <earl_chew <at> agilent <dot> com> sent patches for the following:
1. size2binidx() wasn't working for default code path (non x86).
2. Fixed failure to release mspace lock under certain circumstances which caused a deadlock.

v1.00 1st January 2006:

First release