Performance Cost and Measurement of Embedded Dynamic Memory Allocation

Executive Summary: In embedded C/C++ applications, dynamic memory allocation (DMA) operations typically take on the order of 0.1 to 1 microsecond to execute, depending on the target platform and on how predictable the application’s allocation sizes are over time. Realtime embedded applications whose code invokes excessive DMA can readily lose milliseconds of performance per main loop, with the incurred penalty being distributed throughout all parts of the main loop and thus not readily detected by hierarchical performance profiling techniques.

Introduction

We’ve all heard the mantra around software performance optimization – don’t optimize prematurely. The idea is that code correctness and robustness should always take precedence over performance – which is true. However, in practice there are many domains where correctness and robustness alone are NEVER going to be good enough. Two such domains where I have spent much of my career are console video game development, and embedded systems development for medical and automotive applications. In my view, unless your computer science career happens to unfold entirely in the arena of general purpose computing platforms, performance WILL matter profoundly to the efficiency of your programs and the quality of the customer experience with your software.

In this article, I intend to present some performance statistics related to dynamic memory allocation (DMA) that highlight how important a consideration it should be in your overall application efficiency. Rarely does DMA feature on the radar of teaching material on algorithm performance and scalability, and it is true that in practice the most profound and far-reaching performance optimizations usually occur at the algorithmic level. But that guiding principle rests on a certain assumption – that performance bottlenecks are concentrated in particular pockets of source code, or particular modules, and that identifying and improving those bottlenecks is the surest way to achieve the biggest performance gains for the least effort. Again, this is not false – but it is not the whole story. It is one strategy among a suite of effective strategies to make your application perform better. In particular, it fails to recognize that some very significant performance costs can be distributed homogeneously throughout your source code – and it is into this category that DMA performance falls.

So why did I choose, for the first article in this blog, to focus on DMA as a potentially fertile area of optimization effort? Because today we as software engineers and developers increasingly take DMA completely for granted. We ‘new’ objects prodigiously with scant regard for the cumulative effect of such habits on an embedded system (you do realize that ‘new’ performs DMA, right?). Another common source of prolific DMA in C++ is the STL, which abstracts away much of the resource management complexity involved with boilerplate container types, and provides reasonably optimal general-purpose implementations. Indeed, this is one of the main reasons to use the STL. I’ll focus on C++ STL-related performance issues in a future article. For the rest of this article we will examine the difference in DMA performance across platforms, and between heap allocation libraries.

Methodology

It is my goal here to back up claims with concrete evidence, hence I will endeavor to present hard data on the cost of heap memory allocation, and compare the cost of DMA between a general purpose PC and a typical embedded platform. On each platform, I will run a test application which models 3 patterns of DMA sizes that may be found in a realtime application, and measure (approximately) the time cost of only the DMA, with other time costs being subtracted out. I will also compare, on each platform, the general purpose ‘malloc’ and ‘free’ implementations from the platform-native C standard library with an alternate implementation, Doug Lea’s malloc (or ‘dlmalloc’) version 2.8.6. It is worth mentioning that the internals of a platform-native malloc and free are expected to be platform and toolchain specific, so results reported here may be specific to the platforms tested. To gain direct insight into your own development platform, you should run your own comparative tests.

Test platforms

For the testing platforms, I have used a middle-range desktop PC, and a Raspberry Pi. These choices are intended to capture the behavior of a common general purpose desktop PC versus a typical embedded platform architecture. The Raspberry Pi, a cheap, low-powered, multicore ARM-based computing platform, is typical of today’s consumer-level embedded computing platforms.

Specifications of test PC:

  • Intel Core i5-4690K @ 3.6 GHz, 4×32 kB L1$, 4×256 kB L2$, 6 MB L3$
  • Motherboard: Intel Haswell H97 Chipset
  • RAM: 16 GB DDR3 dual channel, timing 9-9-9-24
  • OS: Windows 10 Professional, version 1703
  • Compiler: Visual Studio Community 2015 (version 14.0.25431.01 update 3), 64 bit release build configuration

Specifications of test embedded platform:

  • Raspberry Pi 3 Model B
  • Broadcom BCM2837 64 bit ARM Cortex-A53 Quad Core Processor SoC @ 1.2GHz
  • RAM: 1 GB LPDDR2-900 SDRAM
  • OS: Raspbian 4.4.50-v7+ #970 (Raspbian Jessie Lite + minimal PIXEL install)
  • Compiler: Raspbian GCC 4.9.2-10, optimization setting -O2

All tests were run using a single-threaded test application with no other substantial load on the computer. In particular, the Raspbian tests were performed at the command line without the PIXEL GUI running.

Modelling memory usage patterns

In order to model memory allocation and freeing behavior realistically, I started with these premises to guide the design of the program logic:

  • that allocation sizes are variable, and
  • that allocation and de-allocation occur out-of-order, and
  • that allocation sizes are randomized (though not necessarily equally likely, see below)

The out-of-order deallocation premise means that we don’t de-allocate in the exact reverse order of allocation. Unpredictability, and even non-determinism, in the deallocation ordering is typical of sophisticated, real world applications. If an application always deallocates in the exact opposite order to allocation, then it effectively behaves in a stack-like (LIFO – Last In, First Out) manner rather than the much more typical out-of-order behavior (pseudo-FIFO – First In, First Out with random variation). Stack-like heap allocation/deallocation patterns are of much less interest to this study, since they are relatively benign to the memory allocation algorithm.

Test case design

The test program flow was very simple – perform M=10000 main loop iterations in which memory chunks of random size ranging from 4 (0x00004) bytes up to 524,287 (0x7ffff) bytes are allocated in batches of N=100 per main loop iteration and stored in a FIFO queue of maximum length 1000 items. This gives a total of M*N = 1,000,000 allocations, and roughly 1000-fold turnover of the queue during the course of the test program execution.

Whenever the queue becomes full, 100 items are deallocated from the queue to make space for the next 100 allocations. At the end of the M iterations, the remaining allocated blocks in the queue are freed.

Here are some additional relevant implementation details for the reader to help lend credibility and clarity to my test procedure:

  • The allocation and de-allocation functions are inlined, and use a preprocessor macro to select which version of malloc and free to call, to avoid any branching overhead.
  • The test code is compiled with release (i.e. optimized) settings.
  • The test application is run from the command line interface.
  • Native OS timing functions are used in the source code to accurately compute elapsed time.
  • In order to baseline the cost of the program logic itself apart from the DMA calls, a preprocessor macro is used to effectively replace the malloc call with a NOOP. This measured program logic overhead is then subtracted from the complete run timings to isolate the cost of malloc and free.
  • The FIFO queue is implemented as a statically allocated plain C array of void* with integer counters to keep track of the queue head position and current size.  (Note: this is many multiples faster than using an STL container for the same purpose.)
  • The standard C++11 STL random distribution functions were used to generate random allocation sizes in accordance with the allocation size distributions illustrated below.
  • Even though the test application contained a single thread, it was built with the multithreaded build option. dlmalloc was configured to use thread locks during allocation.
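
To make the program flow concrete, here is a minimal sketch of the test loop, assuming illustrative names (TEST_MALLOC/TEST_FREE, nextAllocSize and the constants) rather than the exact code used to generate the results below:

#include <cstdlib>
#include <cstddef>

// The allocator under test is selected at compile time with a macro,
// avoiding branch overhead inside the measured loop.
#if defined(USE_DLMALLOC)
  extern "C" void* dlmalloc(size_t);  // assumed: dlmalloc 2.8.6 built with USE_DL_PREFIX
  extern "C" void  dlfree(void*);
  #define TEST_MALLOC(sz) dlmalloc(sz)
  #define TEST_FREE(p)    dlfree(p)
#elif defined(USE_NOOP)
  // NOOP baseline build: still evaluates the size expression, so the
  // baseline includes the cost of generating the random allocation sizes.
  #define TEST_MALLOC(sz) ((void)(sz), (void*)0)
  #define TEST_FREE(p)    ((void)(p))
#else
  #define TEST_MALLOC(sz) std::malloc(sz)
  #define TEST_FREE(p)    std::free(p)
#endif

enum { QUEUE_CAP = 1000, BATCH = 100, MAIN_LOOPS = 10000 };

static void*  queueBuf[QUEUE_CAP];   // statically allocated FIFO queue of blocks
static size_t head = 0, count = 0;   // queue head position and current size

size_t nextAllocSize();              // draws from one of the 3 size distributions

void runTest()
{
    for (int loop = 0; loop < MAIN_LOOPS; ++loop) {
        // When the queue is full, free the oldest BATCH blocks (FIFO order).
        if (count + BATCH > QUEUE_CAP) {
            for (int i = 0; i < BATCH; ++i) {
                TEST_FREE(queueBuf[head]);
                head = (head + 1) % QUEUE_CAP;
                --count;
            }
        }
        // Allocate the next batch of BATCH randomly sized blocks.
        for (int i = 0; i < BATCH; ++i) {
            queueBuf[(head + count) % QUEUE_CAP] = TEST_MALLOC(nextAllocSize());
            ++count;
        }
    }
    while (count > 0) {              // free the blocks remaining at the end
        TEST_FREE(queueBuf[head]);
        head = (head + 1) % QUEUE_CAP;
        --count;
    }
}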

Characterization of memory allocation patterns

In real applications, the size and frequency of heap allocation sizes varies greatly, depending on the resource requirements of the application and the capabilities of the host platform. It is not possible to present any particular pattern of allocation size as typical or more common than any other. Hence I have chosen 3 distinct size distributions for their ease of illustration, to show how the distribution of memory allocation sizes could vary, and then show in the reported results how the distribution can affect malloc performance.

DMA performance results have been generated using 3 distinct patterns of memory allocation size, expressed in terms of histogram distributions:

  1. Case #1: All allocations are equally likely (uniform distribution)
  2. Case #2: Smaller allocations are more likely (exponential distribution)
  3. Case #3: Intermediate sized allocations are more likely (log-normal distribution)

These 3 distribution patterns are illustrated below.

Distributions

Following are the distributions of the first 10000 allocations presented as histograms, for each of the distribution types. These are based on the first main loop iteration of the actual DMA performance measurement application. The horizontal axis represents a base-2 logarithmic scale. The number labels above each bar represent the actual count for that histogram bin.

The distributions were generated using a deterministic pseudo-random number generator and the distribution types defined in the C++11 standard library.
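
For reference, here is a hedged sketch of this kind of size generation using the C++11 <random> facilities; the distribution parameters shown are illustrative, not the exact values behind the published histograms:

#include <random>
#include <algorithm>
#include <cmath>
#include <cstddef>

static std::mt19937 rng(12345);      // fixed seed: deterministic, repeatable runs

// Clip to the common range so all 3 distributions respect the same bounds.
static size_t clipSize(double s)
{
    return static_cast<size_t>(std::min(524287.0, std::max(4.0, s)));  // 0x00004..0x7ffff
}

size_t nextUniformSize()             // Case #1: all sizes equally likely
{
    std::uniform_int_distribution<size_t> d(0x00004, 0x7ffff);
    return d(rng);
}

size_t nextExponentialSize()         // Case #2: small sizes far more likely
{
    std::exponential_distribution<double> d(1.0 / 4096.0);  // mean 4096 bytes (illustrative)
    return clipSize(4.0 + d(rng));
}

size_t nextLogNormalSize()           // Case #3: medium sizes most likely
{
    std::lognormal_distribution<double> d(std::log(64.0), 1.0);  // median 64 bytes (illustrative)
    return clipSize(d(rng));
}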

Uniform

Models all allocation sizes equally likely.  Not very realistic for most applications, but provides a useful reference point for assessing the effect of allocation size distribution on DMA performance.

[Figure: histogram of the uniform allocation size distribution]

Exponential

Models small allocations being exponentially more likely than larger allocations, on a geometric sliding scale.

[Figure: histogram of the exponential allocation size distribution]

Log-Normal

Models medium sized allocations being most likely. The distribution is asymmetric, with a peak close to 64 bytes, a head of small allocations and a tail of larger allocations.

[Figure: histogram of the log-normal allocation size distribution]

Note that since the exponential and log-normal distributions are unbounded, all allocation sizes generated from those distributions were clipped to the range 0x00004-0x7ffff, to ensure the same size range was respected across all 3 distributions.

The exact bin counts for each distribution were tested and confirmed to be identical for Windows and Raspbian.

Timing Measurement Procedure

First, the application logic cost was baselined by building the application with the NOOP allocation and running it 5 times. For each run, the application reports the overall elapsed time to execute the main test loop with each of the 3 size distributions. For each size distribution, the mean of these 5 runs was taken as the measurement of program logic overhead. Note that the program logic overhead measurements varied significantly between distributions, because this overhead includes the computation time required to generate the random allocation sizes on-the-fly.

The application was then rebuilt using the C standard library malloc and free functions. Again, the application was run 5 times and for each run a measurement of main test loop execution time was captured for each of the 3 size distributions. The mean of these 5 runs provided a raw mean execution time for each distribution, from which the corresponding baseline mean timing was subtracted to obtain a final performance measurement based on the standard library.

Then the application was rebuilt again, switching in the dlmalloc allocator in place of the standard library, and the previous procedure repeated again to get a corrected mean of 5 run times for each distribution.
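
In outline, the correction arithmetic is straightforward. Here is a sketch, using std::chrono as a portable stand-in for the native OS timers actually used:

#include <chrono>

void runTest();   // the main test loop shown earlier

// Time one complete run of the main test loop, in seconds.
double elapsedSeconds()
{
    auto t0 = std::chrono::steady_clock::now();
    runTest();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Per size distribution and per build (NOOP / native malloc / dlmalloc),
// take the mean of 5 runs, then:
//
//   corrected = mean(malloc-build runs) - mean(NOOP-build runs)   // seconds
//   usPerPair = corrected * 1e6 / 1000000.0                       // microseconds per
//                                                                 // alloc/free pair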

This entire test procedure was replicated on both Windows and Raspbian platforms to generate the complete set of DMA performance measurements presented below.

Results

The final performance results, comparing the effects of allocation size distribution, malloc library and host platform, are readily summarized in the 2 graphs below. The vertical scale indicates the mean time in microseconds to allocate and free a memory block, so smaller is better. Note the vertical scale is the same on both graphs to make cross-platform comparison easier.

Variances are not displayed, since the individual performance measurements were all highly reproducible and variance in all cases is much less than the measured differences between cases.

Windows

[Figure: Windows malloc performance comparison graph]

Raspbian

[Figure: Raspbian malloc performance comparison graph]

Commentary

Windows

On Windows, there is a very pronounced allocation performance difference between the native malloc and dlmalloc. The uniform distribution presents the most challenging behavior to the memory management algorithm, since each allocation size is totally unpredictable. For the other distributions the difference was smaller, but dlmalloc still outperformed the Windows-native C library malloc in each case.

Notice how much more efficient the memory allocation is for the exponential and log-normal distributions. Presumably the greater predictability in the allocation size (to within a factor of 2) significantly lowers the cost of performing allocations.

Raspbian

On the Raspbian platform, dlmalloc consistently outperformed the native C library malloc as well. The outperformance is most pronounced for the exponential distribution, indicating that dlmalloc is especially well suited to applications dominated by small allocation sizes.

For the exponential and log-normal distributions, these results suggest a roughly 1.4-2.0 times speedup in heap allocation performance using dlmalloc rather than the native C library.

How to use these results

Accurately profiling DMA performance can be extremely difficult, especially on platforms where instrumentation software is limited, and in the presence of multithreading. The measurements shown in these graphs provide a basis for back-of-the-envelope estimation of the total cost of heap memory allocation in an application. Since they are presented as a mean time per allocation, an estimate of the number of allocations is sufficient to estimate the total heap allocation time cost.

Firstly, capture a histogram of the allocation sizes in your application, and visualize it to see which of the distributions above it most closely resembles. Next, count (or estimate) the number of heap allocations per main loop in your application. You can then multiply your allocation count by the estimated mean time per allocation from the corresponding graph above. For example, 1000 allocation/free pairs per main loop at a mean cost of 1 microsecond each would consume 1 millisecond of allocator time per loop.
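
One hedged way to capture such a histogram, assuming your application’s allocations can be routed through a wrapper (say, a project-wide macro or an overridden operator new); countedMalloc and g_sizeBins are hypothetical names:

#include <cstdlib>
#include <cstddef>
#include <atomic>

// Bin k counts allocations of size in [2^k, 2^(k+1)); covers 1 byte to 1 MB.
static std::atomic<unsigned long> g_sizeBins[20];

void* countedMalloc(size_t size)
{
    unsigned bin = 0;
    for (size_t s = size; s > 1 && bin < 19; s >>= 1) ++bin;  // floor(log2(size))
    ++g_sizeBins[bin];
    return std::malloc(size);
}

Run this over a few representative main loops, dump g_sizeBins, and compare the shape against the three histograms above.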

Even if you’re unable, or don’t have time, to measure your application’s histogram and allocation event count, you can still use the results of this article to estimate a likely worst-case cost from an upper estimate of allocation counts.

If indeed you find that you do have a performance problem with heap allocation, consider remedial strategies:

  • replace dynamic heap allocation with statically allocated memory
  • try alternate memory allocators such as dlmalloc, or eheap [1]
  • switch the most prolific heap allocation offenders to use custom block allocators with fixed or strictly defined block sizes (a minimal sketch follows this list)
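
As an illustration of that last strategy, here is a minimal fixed-size block allocator sketch (a free list threaded through a static pool). It is illustrative only; a production version would need alignment guarantees, thread safety and exhaustion handling appropriate to the application:

#include <cstddef>

template <size_t BlockSize, size_t BlockCount>
class FixedBlockAllocator {
public:
    FixedBlockAllocator()
    {
        // Thread every block in the pool onto the free list.
        for (size_t i = 0; i < BlockCount; ++i)
            push(pool + i * BlockSize);
    }
    void* allocate()             // O(1): pop the head of the free list
    {
        if (!freeHead) return nullptr;   // pool exhausted
        void* p = freeHead;
        freeHead = *static_cast<void**>(p);
        return p;
    }
    void release(void* p)        // O(1): push the block back onto the free list
    {
        push(p);
    }
private:
    void push(void* p)
    {
        *static_cast<void**>(p) = freeHead;
        freeHead = p;
    }
    alignas(void*) unsigned char pool[BlockSize * BlockCount];
    void* freeHead = nullptr;
};

// Usage: replace a hot path's malloc/free of small objects with, e.g.:
//   static FixedBlockAllocator<64, 256> alloc;   // 256 blocks of 64 bytes
//   void* p = alloc.allocate();  /* ... */  alloc.release(p);

Because every block is the same size, allocation and deallocation are constant time and fragmentation within the pool is impossible, which is precisely what makes this approach attractive for the most prolific allocation sites.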

Conclusion

The dynamic memory allocator for your application can be a significant performance factor. The actual cost of DMA will depend on the patterns of memory allocation in your application. Because it is a highly distributed cost, it can pass unnoticed in hierarchical code performance profiling. Changing or customizing your allocators can make a big difference to DMA performance. The performance measurements presented in this article provide hard numbers that may be useful for estimating the overall potential cost of heap memory usage distributed throughout an application.

If your application’s overall use of heap memory is well controlled and amounts to merely a handful of allocations, then the performance cost is likely to be negligible. However, if the allocation count per main loop gets up into the hundreds or thousands of allocation events, then you may well have a hidden performance cost that can be remedied by reducing dynamic memory allocation or implementing custom block allocators.

Possible Extension Work

  • Extend this analysis to profile the performance of the eheap allocator [1] against CRT malloc and dlmalloc.
  • Use valgrind [2,3] to compare peak memory fragmentation across the different allocators.
  • Test alternate embedded OSes other than Raspbian.
  • Measure results for a wider range of allocation size distributions, or by sampling real allocation size distributions from actual applications.

References

[1] Ralph Moore, “eheap vs. dlmalloc” http://www.smxrtos.com/articles/eheapvdlmalloc.htm

[2] Valgrind for Raspberry Pi

[3] Using valgrind to measure memory fragmentation
