We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
Classes are the normal way C++ developers organize their data. Sometimes they have few data members, other times many, depending on how we translate our problem into classes. Inside a class, developers sort the data members according to some criteria: for example, grouping them by usage pattern (those that are used together are declared together) or putting the most important ones at the beginning.
We might ask ourselves: do these choices matter for the speed of your program? Let’s find out.
Introduction
The first thing to know about your class is that its memory footprint [1] grows with the number and size of its data members. The second thing is that the compiler will lay out the data members in memory in exactly the same order you declare them. An example:
```cpp
class point {
public:
    int x;
    int y;
};

class rectangle1 {
public:
    bool visible;
    point p1;
    point p2;
};

class rectangle2 {
public:
    point p1;
    point p2;
    bool visible;
};
```
Let’s assume that sizeof(bool) is 4 and sizeof(int) is 4. The size of the class point is then 8, and the size of both rectangle1 and rectangle2 is 20. In the case of class rectangle1, the member visible will be at offset 0 from the beginning of the class, the member p1 at offset 4 and the member p2 at offset 12. In the case of class rectangle2, the member p1 will be at offset 0, the member p2 at offset 8 and the member visible at offset 16.
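If you want to verify this layout with your own compiler, a minimal sketch like the one below prints the sizes and offsets using offsetof. It assumes the class definitions above; on a typical platform sizeof(bool) is actually 1, but alignment padding still places p1 at offset 4 in rectangle1, so the numbers come out the same:

```cpp
#include <cstddef>
#include <cstdio>

class point { public: int x; int y; };
class rectangle1 { public: bool visible; point p1; point p2; };
class rectangle2 { public: point p1; point p2; bool visible; };

int main() {
    // Sizes include any alignment padding the compiler inserts.
    std::printf("sizeof(point)=%zu sizeof(rectangle1)=%zu sizeof(rectangle2)=%zu\n",
                sizeof(point), sizeof(rectangle1), sizeof(rectangle2));
    // offsetof is valid here because all three classes are standard-layout.
    std::printf("rectangle1: visible@%zu p1@%zu p2@%zu\n",
                offsetof(rectangle1, visible), offsetof(rectangle1, p1),
                offsetof(rectangle1, p2));
    std::printf("rectangle2: p1@%zu p2@%zu visible@%zu\n",
                offsetof(rectangle2, p1), offsetof(rectangle2, p2),
                offsetof(rectangle2, visible));
    return 0;
}
```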
From a functional point of view, the class size and its layout are completely irrelevant. But from a performance point of view, they do matter. To verify this, let’s run an experiment.
The experiment
To test the performance of various class sizes and memory layouts, we use the class rectangle from the previous example, but with slight modifications (more on that later). We wrote two functions: calculate_surface_all, which sums up the surface areas of all rectangles in a vector, and calculate_surface_visible, which sums up the surface areas of only the visible rectangles.
```cpp
template <typename R>
int calculate_surface_visible(std::vector<R>& rectangles) {
    int sum = 0;
    for (int i = 0; i < rectangles.size(); i++) {
        if (rectangles[i].is_visible()) {
            sum += rectangles[i].surface();
        }
    }
    return sum;
}

template <typename R>
int calculate_surface_all(std::vector<R>& rectangles) {
    int sum = 0;
    for (int i = 0; i < rectangles.size(); i++) {
        sum += rectangles[i].surface();
    }
    return sum;
}
```
The difference is that in the case of calculate_surface_all we access only the members p1 and p2 (the top-left and bottom-right points). For calculate_surface_visible we access the member visible as well.
To simulate a different memory layout of the class rectangle, we added padding between the data member visible and the other data members. We also wanted to keep the class size constant, so we added more padding at the end. In real-world code, you would have other member variables instead of padding. The definition of our class rectangle looks like this:
```cpp
template <int pad1_size, int pad2_size>
class rectangle {
private:
    bool m_visible;
    int m_padding1[pad1_size];
    point m_p1;
    point m_p2;
    int m_padding2[pad2_size];
};
```
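The definition above shows only the data members; the is_visible() and surface() accessors called by the benchmark functions are not shown. A minimal sketch of what they might look like, assuming the point class from the introduction and a surface computed from the two corner points, is:

```cpp
template <int pad1_size, int pad2_size>
class rectangle {
public:
    bool is_visible() const { return m_visible; }
    // Assumed implementation: surface area from the two corner points.
    int surface() const { return (m_p2.x - m_p1.x) * (m_p2.y - m_p1.y); }

private:
    bool m_visible;
    int m_padding1[pad1_size];  // simulates cold members between visible and the points
    point m_p1;
    point m_p2;
    int m_padding2[pad2_size];  // keeps the total class size constant
    // Note: a pad size of 0 needs a compiler extension (zero-length arrays)
    // or a template specialization without the padding member.
};
```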
The results
Class size
So, does the runtime of our two functions depend on the class size? Here are the results:
For the smallest class size (20 bytes), both functions are the fastest. As the size of the class rectangle grows, the functions take more and more time to complete. The function calculate_surface_visible is two times slower in the worst case than in the best case; the function calculate_surface_all is five times slower. Why is this so?
Memory accesses on modern CPUs go through a caching mechanism. Every time our program accesses even a single byte, the whole surrounding block of data (typically 64 bytes) is brought into the cache as well. Accesses to memory inside the same block are very fast, but accesses that fall outside the block are slow.
When the class is small (20 bytes), three instances of the class fit in a single cache block. Accessing any data member of one rectangle instance also loads the data for two additional instances; we basically get those accesses for free. As the class size grows, the hardware still loads whole blocks into the data cache, but our program never touches much of that data. This is a lost performance opportunity: the data is in the cache, but our program never uses it. Instead, we ask the hardware to load another batch of data from a different block. This is what slows down our computation for large class sizes.
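To put rough numbers on it: with 20-byte rectangles and 64-byte cache blocks, about three instances fit in a single block, so a sequential scan brings in a new block only roughly once every three rectangles. Once the class grows past 64 bytes, every instance spans at least two blocks, and the scan loads far more memory than the handful of fields it actually reads.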
Padding between the member visible and the members p1 and p2

What happens when we access two data members of the same instance, but they are not close to each other in memory? In our case, we added padding between the member visible and the members p1 and p2, and we control the size of that padding. Let’s measure how much time our functions take to perform the task, depending on the size of the padding:
The runtime of the function calculate_surface_all doesn’t depend on the padding between visible and the other data members. The runtime of the function calculate_surface_visible does depend on the padding, but only when both the padding and the class size are large. A larger gap between the member visible and the members p1 and p2 translates to slower computation. The phenomenon becomes visible for classes larger than 128 bytes and becomes more and more pronounced as the class size grows.
Conclusion
How do these results translate to performance tips for a C++ engineer? Here are a few rules of thumb if you want to achieve good performance in your C++ program:
- Focus on hot classes, classes that your program spends a lot of time processing.
- Keep hot classes small. Move all rarely used members into separate classes.
- Alternatively, extract hot data from larger classes into separate smaller classes and keep those in a dedicated vector. Don’t process them by iterating over the larger classes; process them by iterating over the dedicated vector (see the sketch after this list).
- Group together data members you access together in the class definition.
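As an illustration of the last three points, here is a minimal sketch of such a hot/cold split; the domain, the type names and the members are invented for the example:

```cpp
#include <string>
#include <vector>

// Hot data: small, densely packed, iterated over in the inner loop.
struct particle_position {
    float x, y, z;
};

// Cold data: rarely used members moved out of the hot class.
struct particle_info {
    std::string name;
    int owner_id;
};

struct particle_system {
    // Parallel vectors: element i of each vector describes the same particle.
    std::vector<particle_position> positions;  // touched every frame
    std::vector<particle_info> info;           // touched only occasionally
};

// The hot loop iterates over the small vector only, so each 64-byte cache
// block delivers several useful elements instead of mostly unused members.
float total_x(const particle_system& system) {
    float sum = 0.0f;
    for (const particle_position& p : system.positions) {
        sum += p.x;
    }
    return sum;
}
```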
In C++, small changes go a long way, and if done right they can make your program run a few times faster.
NOTE: The conclusion of this post inevitably leads to the story of the Entity-Component-System paradigm and Data-Oriented Design, which I plan to cover in one of the upcoming posts.
[1] The memory footprint of a class is the amount of memory that a single instance of the class consumes.