Source code for all demos can be found here.
One of the most powerfully expressive features of modern C++ is the lambda, often used with its std::function wrapper. These functional tools are readily employed in writing generic, reusable code. As asynchronous computing becomes increasingly important for making the most of our multi-threaded CPU and GPU resources, these functors also find use in asynchronous event handlers and task distribution/scheduling systems.
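For context, here is a small sketch (not from the benchmark itself) of the kind of interface std::function enables; the names such as run_all are hypothetical:

```cpp
#include <functional>
#include <iostream>
#include <vector>

// A hypothetical task queue that accepts type-erased callables via std::function.
void run_all(const std::vector<std::function<void()>>& tasks)
{
    for (const auto& task : tasks)
        task();
}

int main()
{
    std::vector<std::function<void()>> tasks;

    int counter = 0;
    // The lambda captures 'counter' by reference; std::function erases its
    // concrete type, so the queue doesn't need to know anything about the capture.
    tasks.push_back([&counter] { ++counter; });
    tasks.push_back([] { std::cout << "hello from a task\n"; });

    run_all(tasks);
    std::cout << "counter = " << counter << '\n';
}
```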
What is the cost of all these function objects? Because a fixed-size std::function can capture arbitrarily large data, at some point it must incur a heap allocation. I would expect, however, that the majority of captures are relatively small, in which case the obvious optimization is to provide a small, fixed-size storage area within the functor itself to enable stack allocation within those limits. Some quick searching appears to support this (see Further Reading below), but at what point does this degrade to heap allocation in practice, and just how bad is it?
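To make the trade-off concrete, here is a highly simplified sketch of that small-buffer idea. It is not how any real standard library implements std::function; the name small_fn and the 16-byte threshold are my own, chosen to match the capture sizes tested below.

```cpp
#include <cstddef>
#include <new>
#include <type_traits>
#include <utility>

// Minimal illustration of the small-buffer idea (not a real std::function).
// Callables whose state fits in the buffer are constructed in place; larger
// ones fall back to a heap allocation.
template <class Signature> class small_fn; // primary template left undefined

template <class R, class... Args>
class small_fn<R(Args...)>
{
    static constexpr std::size_t buffer_size = 16;

    alignas(std::max_align_t) unsigned char buffer_[buffer_size];
    void* target_ = nullptr;                       // points into buffer_ or to the heap
    R (*invoke_)(void*, Args...) = nullptr;
    void (*destroy_)(void*, bool) = nullptr;
    bool on_heap_ = false;

public:
    template <class F>
    small_fn(F f)
    {
        using Fn = std::decay_t<F>;
        // A real implementation would make this decision at compile time.
        if (sizeof(Fn) <= buffer_size && alignof(Fn) <= alignof(std::max_align_t)) {
            target_ = ::new (buffer_) Fn(std::move(f));   // in-place ("stack") storage
        } else {
            target_ = new Fn(std::move(f));               // heap fallback
            on_heap_ = true;
        }
        invoke_  = [](void* p, Args... args) -> R { return (*static_cast<Fn*>(p))(args...); };
        destroy_ = [](void* p, bool heap) {
            if (heap) delete static_cast<Fn*>(p);
            else      static_cast<Fn*>(p)->~Fn();
        };
    }

    small_fn(const small_fn&) = delete;            // copying/moving omitted for brevity
    small_fn& operator=(const small_fn&) = delete;

    ~small_fn() { if (destroy_) destroy_(target_, on_heap_); }

    R operator()(Args... args) { return invoke_(target_, args...); }
};
```

In this sketch a callable with up to 16 bytes of state never touches the heap, while anything larger pays for new/delete on every construction and destruction. That cliff is exactly what the benchmark below tries to locate for real implementations.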
In this post I’ll analyze the performance characteristics of functors of varying capture sizes to determine under what conditions we begin to observe performance degradation. It is worth noting that the performance we’re measuring is that of std::function, rather than C++ lambdas themselves. C++ lambdas’ types are unspecified, meaning they may not have the same limitations and performance characteristics as std::function on their own, but they’re also generally less useful for defining interfaces.
The test consists of creating and assigning 10,000,000 functors to an array in a tight loop, measuring total duration, with each functor capturing an integer array of varying size (up to 20 elements). The “zero size” capture is handled specially as a lambda with no array capture. The functors do nothing and return no result. After the array of functors has been created/assigned, it is iterated over and each functor is called, again measuring total duration. A “control” is also measured, where we repeat the process with a regular function pointer instead of a lambda/functor.
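A rough sketch of the measurement loop follows; the names (kCount, benchmark, payload) and output format are mine, and the actual harness may differ in details:

```cpp
#include <array>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <vector>

constexpr std::size_t kCount = 10'000'000;

// N is the number of captured ints; the "zero size" case would use a
// capture-less lambda instead of an empty array.
template <std::size_t N>
void benchmark()
{
    std::vector<std::function<void()>> functors(kCount);
    std::array<int, N> payload{};

    // Creation/assignment phase.
    auto t0 = std::chrono::steady_clock::now();
    for (auto& f : functors)
        f = [payload] { (void)payload; };   // capture the array by value; do nothing
    auto t1 = std::chrono::steady_clock::now();

    // Invocation phase.
    for (auto& f : functors)
        f();
    auto t2 = std::chrono::steady_clock::now();

    auto per_assign = std::chrono::duration<double, std::nano>(t1 - t0).count() / kCount;
    auto per_call   = std::chrono::duration<double, std::nano>(t2 - t1).count() / kCount;
    std::printf("capture %zu bytes: assign %.2f ns, call %.2f ns\n",
                N * sizeof(int), per_assign, per_call);
}

int main()
{
    benchmark<1>();   //  4 bytes
    benchmark<4>();   // 16 bytes
    benchmark<12>();  // 48 bytes
    benchmark<20>();  // 80 bytes
}
```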
Results are reported as the average time of a single iteration (total time divided by the number of elements). All tests were conducted on a Windows 10 system with an Intel Core i5-6600K processor.
Toolchain | Version | Platform | Arguments | Notes
---|---|---|---|---
gcc | 7.2.0 | mingw64 | -O2 -std=c++14 | 
clang | 5.0.0 | mingw64 | -O2 -std=c++14 | using libstdc++, not libc++
msvc (cl) | 19.11.25508.2 | x64 | /O2 /EHsc | 
We observe similar results for GCC and Clang, as we are using the same standard library for both (libstdc++); performance is quite good (~3-7ns) up to 16 bytes of capture data, jumping to ~45-70ns when allocating functors with captures >16 bytes.
MSVC performs substantially worse for small captures (~13-15ns), but performance doesn’t degrade until beyond 48 bytes of capture data. MSVC outperforms GCC/Clang (libstdc++) for captures in the (16, 48] byte range, but performs comparably (or slightly worse) for other test cases.
Control durations for this test were 0.8ns (gcc), 0.6ns (clang), and 0.3ns (msvc).
This result is particularly interesting; I expected performance degradation during the allocation test, but expected invocation to suffer no penalty from large captures. While this is what we see from GCC/Clang, MSVC jumps from ~3-4ns for small captures to ~9-10ns for captures >48 bytes. I’m not sure what is happening here, but I think it’s beyond the scope of this post.
Control durations for this test were 1.5ns for each compiler tested.
When performance is a concern, it is best to keep capture data to 16 bytes or less for optimal performance across compilers. MSVC can handle up to 48 bytes, but it has worse allocation performance for small captures (≤16 bytes) and worse invocation performance for large captures (>48 bytes).
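If you want to guard against accidentally blowing that budget, one option (my own suggestion, not part of the benchmark) is to check the size of the closure type at compile time before handing it to std::function:

```cpp
#include <cstdint>
#include <functional>
#include <utility>

// Wrap assignment so that closures larger than a chosen budget fail to compile.
// The 16-byte budget matches the libstdc++ threshold observed above; adjust per toolchain.
template <class F>
std::function<void()> make_small_task(F f)
{
    static_assert(sizeof(F) <= 16,
                  "closure exceeds the small-capture budget and will likely heap-allocate");
    return std::function<void()>(std::move(f));
}

int main()
{
    std::int64_t a = 1, b = 2;
    // Two 8-byte captures: 16 bytes on typical 64-bit platforms, so this compiles.
    auto task = make_small_task([a, b] { (void)(a + b); });
    task();

    // std::int64_t c = 3;
    // make_small_task([a, b, c] { });  // 24 bytes: would trip the static_assert
}
```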