"Premature optimization is the root of all evil". Or is it?

Master software performance in just 16 hours!
Join our Software Optimization for the Memory Subsystem Workshop taking place from May 18th to May 21st. Click here to express interest or register.

The famous sentence uttered by great computer scientis Donald Knuth still rings a bell with many software developers. But how true is it? My personal opinion is that is half-way truth, at least for C++ developers. C++ and its STL are huge; some features and coding practices are fast, others are not. Using them without regard to their performance eventually results in slow code. Even worse, it can lead to slow code that is difficult to figure out why it is slow.

In this post we will talk about coding practices and features of C++ that are slow and that would require a large rewrite to fix. You should always keep in mind that “Who repairs not his gutters repairs his whole house”!

Slow features of C++

Some features weren’t slow when they are originally introduced, but they are slow now. The computer architectures have evolved, but the beloved language features (and developers’ mindsets) haven’t changed to follow them. Features that were once good now use the underlying hardware poorly. Here are a few of them:

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us

You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Vectors of pointers

Regardless if it is std::vector<object*> or std::vector<unique_ptr<object>>, vectors of object pointers is the number one features of C++ that can lead to bad performance. Hardware is really efficient in working with vectors since the data is located at neighboring memory addresses. Vector pointers completely defeat the predictability of data layout in memory since the neighboring elements do not necessarily point to neighboring memory objects. As we measured in the post about dynamic memory in C++, in the worse case, the performance can be 10x slower than the best case!

Additionally, many small allocations will have an impact on memory fragmentation. Programs that run for a long time will especially notice this, since it can lead to slow memory allocation and lack of memory even though the system has enough of it.

C++ actually forces you to use pointers in containers since polymorphism is only available if objects are accessed through pointers. Some C++ gurus will recommend using vectors of pointers everywhere “just in case you need to add a hierarchy of objects later”. Please don’t listen to them.

Unfortunately, STL library doesn’t offer a good and simple alternative. A few ideas on how to avoid vectors of pointers:

Use STL std::vector<std::variant<…>> : even though it doesn’t use the virtual function dispatching mechanism, the replacement is good enough. It doesn’t allocate small objects on the heap and it guarantees the predictable memory layout that will positively affect your program’s performance. Learn more about it here.
Use polymorphic_array. This is a data structure we tested in one of our previous posts. It is very similar to std::vector<std::variant<...>> we just mentioned, the only difference is that dispatching is done using virtual function dispatching mechanism instead of std::visitor.
Use multivector. This is a good alternative if your container doesn’t need to be sorted. Multivector is a wrapper which holds a std::vector for each type provided as an argument. For example, multivector<circle, rectangle, triangle> holds three vectors for each of the types. If you wish to call the same method for each type in the multivector (e.g. method draw), you will use its for_each method to do that elegantly multivector.for_each([&](auto& shape) { shape.draw(bitmap); }. We wrote about it in the post about performance of virtual functions. There is a similar container in the Boost called poly_collection, but it’s more complicated and takes time to master.
Use std::vector<std::unique_ptr>> with a custom allocator. std::unique_ptr can be customized with an allocator, you can use it to make sure all the objects are allocated from a single block of memory and when they are destroyed, the whole block can be returned to the operating system.

Using std::vector<std::variant<...>> and polymorphic_array will result in larger memory consumption, but they allow for sorted containers. The good thing about them is that when the container is destroyed, the whole block of memory can be returned to the operating system. multivector and poly_collection are better choices if the container doesn’t need to be sorted since they require a minimum amount of memory and they lend themselves easily to compiler optimizations.

Large classes

Another important “anti-pattern” that you should avoid is creating large classes. Everyone has dealt with a “megaclass” or “world-class” that contains many data members. Nobody is actually sure what are the responsibilities of those classes, and nobody has the guts and nerves to refactor them.

Apart from being difficult to maintain, from the performance perspective they are a nightmare. Modern processors profit from data locality, i.e. when the data they access together are close to each other in memory. With large classes this is not possible. The class should have a clear responsibility and only contain those members which are relevant for its job. If you need more data to do another type of work, consider creating another class. Also keeping classes independent of each other will help the performance.

We investigated performance and class size in the earlier post about performance and class size. The performance difference between a small and a large class that does the same job can be substantial (in our measurements, up to four times difference in speed)!

Large virtual functions in unsorted containers

In CPU terminology there is “hot code”, code that the CPU has already seen and knows some statistics about. CPU typically runs hot code faster than the code it sees for the first time. CPU has a “limited memory”, i.e. running the new code will make it forget the statistics about the old code.

The problem with large virtual functions in unsorted containers is that the CPU always runs cold code. The CPU runs several different implementations of the same virtual functions. Every time it switches to a new implementation, it runs it for the first time, and while doing so, it forgets the statistics about the previous implementation. In the previous post about performance of virtual functions we showed the performance difference between large and small virtual functions, a large virtual function can be up to 1.6 times slower in an unsorted container compared to the sorted one.

Using multivector we mentioned a few sections earlier will help the performance because the CPU is processing data by type, first objects of type1, then objects of type2. Sorting the container by type also helps. In both cases the hardware calls the same function many times, which makes the function hot and it runs faster.

Alternatively, you can try to avoid large virtual functions, for example, by placing the common code in the base class, and then use small virtual functions to configure the common code of the base class.

Use algorithms from the STL library as much as possible

There is an excellent one-hour talk by Jonathan Boccara on YouTube about STL algorithms. In one hour he covers all the algorithms in the STL library. I highly recommend it.

The reasoning for using the STL algorithms is performance. STL algorithms are well written and well tested. That code can be easily ported to other STL implementations that are faster, run on GPUs or multiple CPUs. The same code will benefit from the improvements coming from the upcoming versions of C++ standard.

An important note about for loops. The old school developers love for (int i = 0; i < v.size(); i++) { v[i].do_something(); }; I love them too, however, for loops with explicit indexes are harder for the compiler to optimize. For instance, do elements in the array need to be processed in the order from 0 to v.size() ? It is not always clear for the compiler and might prevent some nice optimizations. Prefer instead range for or std::for_each, since these two constructs don’t imply the ordering of elements. The compiler has a better understanding of what’s going on, additionally, this code is easier to port to run on several CPU cores.

There are two other algorithms in the STL you should be aware of: std::transform and std::reduce. Both are easy to parallelize and vectorize. If there is an input vector, an output vector and a transformation (e.g. division by three), you should use std::transform. This algorithm will apply the transformation function on the input vector to create the output vector. If you need to reduce a vector of objects to a single value (e.g. calculate an average mark of all the students in the vector), you should use std::reduce. The two algorithms are key to parallelization in a framework called MapReduce, there is an excellent article about them written by ITHare.

What about other features?

C++ has many also other features related to performance that are part of standard development practice. For example, passing large objects by reference, implementing move constructors and assignment operators, etc. We wrote about some of them in the article about avoiding excessive copying in C++ , and you should definitely know and use them, however, for this is a type of optimizations you don’t need to worry too much in advance. Even if you get them wrong, they are easy to spot and fix.

You should strive to write code that is simple and easy to understand. Compilers are best at optimizing that kind of code. There is no need to do manual loop unrolling and complicate the design unnecessarily; only after the profiler has shown you the slow code, you should focus your efforts there.

Conclusion

For performance aware codes there is no such thing as “premature optimization”. There are just optimizations that are inadequate for the development phase. Using bad coding practices or bad data structures will lead to code that is slow and inefficient, and worse, more difficult to speed up. And remember, any software project can become “performance aware” as its complexity, working set and number of users grow.

If you need to rewrite large parts of your programs to make it faster, then that’s bad. In this post we focused on those features of C++ that are easy to get wrong and would require a large rewrite to fix. We didn’t talk about those features that are easy to get wrong but that are easy to fix. A lot of times you don’t know how a certain change will affect the performance, and there is no point in thinking about it before it turns out to be important.

“Premature optimization is the root of all evil”. Or is it?