We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
As stated many times in the previous posts, modern CPUs are heavily memory bound. This means that the CPU often doesn’t run at full speed, having to wait for the data to be brought from the memory subsystem to the CPU registers.
The nature of many problems that developers solve results in solutions that are memory bound, and there is no way around it. Unless your code is working on simple arrays/vectors and sequentially processing simple data types (floats, ints), your program is very likely to be memory bound, i.e. it will not use the CPU's computational resources to the fullest.
One of the approaches to fix the problem of memory boundness is to decrease the dataset size. This decreases the amount of traffic between the CPU and memory subsystem, increases the efficiency of the data caches and TLB caches1 and often results in speedup.
In this post we will talk about dataset size and how it influences the performance of your program. We will also give a few common-sense ways to improve your program's speed by decreasing the dataset size. Finally, we will present experiments to quantify the speed improvements. But before we begin, a few definitions.
What is your program’s dataset?
A dataset is the collection of data your program is currently using. To illustrate, imagine a loop looking up values in a hash map. Its dataset is all the data accessed inside the loop. When the loop finishes, the program moves on to execute different code on a different dataset.
Memory consumption is the amount of memory your program needs to run. This includes both the memory for the useful data and the memory for the system allocator's metadata, class padding, etc. Although dataset size and memory consumption are correlated, and decreasing one will often decrease the other, there are certain cases where you can decrease one and increase the other2.
A few important notes about decreasing dataset size
- When talking about dataset size, we include not only the useful data, but also all the other metadata: pointers, class padding, memory allocator metadata, etc. All of this takes up space in the data cache and increases the size of the dataset.
- Some of the following recommendations also decrease the number of calls to the system allocator, which typically results in less memory fragmentation, less data traffic, fewer data cache misses and faster execution.
- Before trying these recommendations, make sure your program is actually memory bound. We use Intel's VTUNE in this post. Alternatively, you can use pmu-tools to measure this; Denis Bakhvalov's book gives information on how to check whether a program is memory bound using pmu-tools.
Techniques to decrease the dataset size
Here we list a few programming techniques to decrease the size of the dataset. You can skip this section and go right to the experiment, and come back later only if you need it. Alternatively, you can go through the techniques first.
Smaller Types
Using smaller types will automatically decrease the size of the dataset and decrease the pressure on the memory subsystem. If precision allows, you could use floats (4 bytes) instead of doubles (8 bytes), or shorts (2 bytes) instead of ints (4 bytes). If the hot loop is vectorized, the switch to a smaller type will mean that more data can be processed in a single instruction, which is also useful for code performance. We give an example in the experiment section of this post.
Pointers
If your dataset consists of pointers, then it is possible to decrease its size in a few ways. On 64-bit platforms, pointers are 8 bytes in size; on 32-bit platforms, they are 4 bytes. A very simple way to check whether your hot loop would benefit from a reduction in pointer size is to recompile the program for a 32-bit architecture (the compilation option is -m32). If this makes the hot loop faster, then your code would definitely benefit from pointer size reduction.
Another way to decrease the dataset size is to replace pointers with array indexes. The idea is quite simple and can be illustrated with an example. Let's say we want to implement a binary tree. Binary trees are notorious for memory consumption, because for each value they hold they require two pointers. A typical insertion algorithm would look something like this:
```cpp
struct binary_tree_node_t {
    binary_tree_node_t* left;
    binary_tree_node_t* right;
    int value;
};

binary_tree_node_t* alloc_and_insert(int value, binary_tree_node_t* left,
                                     binary_tree_node_t* right) {
    binary_tree_node_t* result = new binary_tree_node_t;
    result->left = left;
    result->right = right;
    result->value = value;
    return result;
}
```
On a 64-bit system, the size of this struct is 24 bytes (20 bytes of fields plus 4 bytes of padding). To allocate a node, we make a call to the system allocator through new. In addition to these 24 bytes, the memory allocator might allocate some more space for its metadata for each memory chunk, which further increases the dataset size. An alternative way to implement a binary tree is to use std::vector<binary_tree_node_t> as a backing storage. Then the code would look like this:
```cpp
struct binary_tree_node_t {
    unsigned int left;
    unsigned int right;
    int value;
};

unsigned int alloc_and_insert(int value, unsigned int left, unsigned int right,
                              std::vector<binary_tree_node_t>& storage) {
    storage.push_back({});
    unsigned int index_allocated = static_cast<unsigned int>(storage.size() - 1);
    storage[index_allocated].left = left;
    storage[index_allocated].right = right;
    storage[index_allocated].value = value;
    return index_allocated;
}
```
When rewritten like this, the size of the struct is 12 bytes. Memory allocation is performed using storage.push_back(). The pointers to the left and right nodes are replaced with indexes into the storage vector. This solution comes with both advantages and disadvantages, but for our case the decrease in dataset size is very visible and measurable, as we will see later. This approach also increases locality of reference, since all the memory for the binary tree is allocated from the same block of memory.
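To make the idea concrete, here is a hedged sketch (not taken from the original code) of how a lookup could traverse such a tree. The sentinel NULL_NODE is an assumption introduced here to play the role of nullptr:

```cpp
#include <vector>

constexpr unsigned int NULL_NODE = 0xFFFFFFFFu; // hypothetical "no child" marker

bool contains(const std::vector<binary_tree_node_t>& storage,
              unsigned int root, int value) {
    unsigned int current = root;
    while (current != NULL_NODE) {
        const binary_tree_node_t& node = storage[current];
        if (value == node.value) {
            return true;
        }
        // Follow the index instead of a pointer.
        current = (value < node.value) ? node.left : node.right;
    }
    return false;
}
```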
Pointer Tagging
On 64-bit platforms, pointers are 64 bits in size, but on most current systems only the lower 48 bits are used for addressing, so the upper 16 bits are unused. A technique called pointer tagging can be used to take advantage of this. In this case, you can use the uintptr_t type to store the value of the pointer, and then use masking and shifting to get the value of the actual pointer and the value stored in the unused part. For example:
```cpp
int* get_pointer(uintptr_t p) {
    return (int*) (p & 0xFFFFFFFFFFFF); // keep only the lower 48 bits
}

short get_value(uintptr_t p) {
    return (short) (p >> 48);           // the value stored in the upper 16 bits
}
```
Some modern CPUs have hardware support for pointer tagging (for example ARM's Top Byte Ignore feature), so when the CPU is properly configured, it will ignore the upper bits of the pointer when dereferencing it. On such hardware the function get_pointer can be implemented as a no-op.
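The complementary packing function combines the pointer and the extra value; here is a minimal sketch, assuming the upper 16 bits of the pointer really are unused:

```cpp
#include <cstdint>

uintptr_t make_tagged(int* ptr, short value) {
    // The mask keeps the lower 48 bits of the pointer; the value goes into the top 16 bits.
    return (reinterpret_cast<uintptr_t>(ptr) & 0xFFFFFFFFFFFF)
         | (static_cast<uintptr_t>(static_cast<unsigned short>(value)) << 48);
}
```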
Another interesting property of pointers is that they are almost always at least 4-byte aligned. This means that the lowest two bits of the pointer are zeros. You can use this fact to tag a pointer: if the value of the lowest two bits is 00, it is a pointer, otherwise it is something else.
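As a small sketch of this low-bit tagging (the helper names are made up for illustration, and we assume the pointers are at least 4-byte aligned):

```cpp
#include <cstdint>

uintptr_t tag_pointer(int* p, unsigned tag) {
    // The two lowest bits of an aligned pointer are free, so a 2-bit tag fits there.
    return reinterpret_cast<uintptr_t>(p) | (tag & 0x3u);
}

unsigned get_tag(uintptr_t tagged) {
    return tagged & 0x3u;
}

int* untag_pointer(uintptr_t tagged) {
    // Clear the tag bits before using the pointer.
    return reinterpret_cast<int*>(tagged & ~uintptr_t(0x3));
}
```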
Zero-overhead memory allocator
If the data structure is really big, then using a custom allocator can decrease the dataset size, because custom allocators can be implemented with zero overhead. Alternatively, you could try a different system allocator (e.g. TCMalloc, Jemalloc or MiMalloc) which has lower or no overhead.
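A minimal sketch of what "zero overhead" means here: a bump (arena) allocator that keeps no per-allocation metadata at all. This is an illustration, not the allocator used in the experiments below:

```cpp
#include <cstddef>
#include <vector>

class arena_t {
public:
    explicit arena_t(std::size_t capacity) : storage_(capacity), used_(0) {}

    // Allocate `size` bytes aligned to `alignment` (must be a power of two).
    void* allocate(std::size_t size, std::size_t alignment) {
        std::size_t aligned = (used_ + alignment - 1) & ~(alignment - 1);
        if (aligned + size > storage_.size()) {
            return nullptr;                  // arena exhausted
        }
        used_ = aligned + size;
        return storage_.data() + aligned;    // no per-chunk header, no allocator metadata
    }

    // Individual deallocation is not supported; the whole arena is released at once.
    void reset() { used_ = 0; }

private:
    std::vector<char> storage_;
    std::size_t used_;
};
```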
Booleans
To represent a boolean, the computer needs one bit. But CPUs don't readily work with individual bits and memory is difficult to manage at the bit level, so the representation of a boolean depends on the usage. A single automatic variable defined as bool in C++ will typically be 1 byte or 4 bytes in size, but if you store booleans in a std::vector<bool>, then each boolean takes only 1 bit.
If you need to store booleans in containers, you should strive for minimal memory consumption. Each boolean should be represented by 1 bit, and you can use C bit fields or unused bits in pointers to store the boolean values.
For example, in a multithreaded hash map implementation, a hash map bucket can be implemented as a pointer to the bucket data plus a mutex used to lock the bucket. The locking mechanism needs only one bit to store the lock state, and this bit can be the least significant bit of the bucket's pointer. So, a possible implementation could look like this:
```cpp
#include <atomic>
#include <cstdint>

template <typename T>
struct hash_map_bucket_t {
    // Pointer to the bucket data; bit 0 set means "unlocked", bit 0 clear means "locked".
    std::atomic<uintptr_t> data;

    T* lock() {
        uintptr_t expected = data.load() | 1;       // we expect the unlocked state
        // Spin until we atomically clear bit 0, i.e. acquire the lock.
        while (!data.compare_exchange_weak(expected, expected & ~uintptr_t(1))) {
            expected |= 1;
        }
        return reinterpret_cast<T*>(expected & ~uintptr_t(1));
    }

    void unlock() {
        data.fetch_or(1);                           // set bit 0 again: the bucket is unlocked
    }
};
```
We use the least-significant bit in hash_map_bucket_t.data to lock the bucket: if the bit is 0, the bucket is locked, otherwise it is unlocked. We need atomic operations since this variable is used for synchronization and can be modified by several threads. Introducing a bool as a separate field in hash_map_bucket_t would most likely double its size, because of padding and alignment requirements (see here). This is an efficient way to keep the bucket, and the dataset, small.
Bit Fields
Another way to decrease the size of the dataset is to use C bit fields. Bit fields allow you to explicitly control how much memory is used by each field in the struct/class. Take for example the binary tree node:
```cpp
struct binary_tree_node_t {
    uintptr_t left;
    uintptr_t right;
    int value;
};
```
We know that on 64-bit machines only the lower 48 bits of a pointer are actually used. We can use bit fields to explicitly specify the size of the pointer fields:
```cpp
struct binary_tree_node_t {
    uintptr_t left:48;
    uintptr_t right:48;
    int value;
};
```
The original struct was 24 bytes in size (20 bytes of fields plus padding); with the bit fields the struct can shrink to 16 bytes. Structs with bit fields introduce more work for the CPU, but for memory bound programs this doesn't necessarily make them slower, since the CPU is stalled waiting for data from the memory anyway.
Bit fields are especially useful if your struct/class is storing booleans or enums, since these types typically have very narrow value ranges and pack very nicely.
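As a sketch of how well such types pack (the struct and field names are made up for illustration):

```cpp
enum color_t { red, green, blue };  // three values fit comfortably in 2 bits

struct shape_t {
    unsigned color   : 2;   // holds a color_t value
    unsigned visible : 1;   // boolean flag
    unsigned filled  : 1;   // boolean flag
    unsigned z_order : 12;  // small integer in the range 0..4095
};

// On typical implementations everything packs into a single 32-bit word.
static_assert(sizeof(shape_t) == 4, "flags and small values pack into 4 bytes");
```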
Struct/Class Padding
In C/C++, each simple type needs to be properly aligned. A 2 byte type (e.g. short) needs a 2 byte alignment, a 4 byte type (e.g. float) needs 4 byte alignment, etc. When simple types are declared inside a struct/class, the alignment requirements still hold. Therefore, the compiler will insert padding: unused bytes to guarantee proper alignment. Consider the following example:
```cpp
struct my_struct_t {
    double d1;
    int i1;
    double d2;
    int i2;
};
```
At first, one would expect the size of this data structure to be 24 bytes on most architectures: 8 bytes for the two ints and 16 bytes for the two doubles. But this is not the case. All the members must be properly aligned even when the struct is stored in an array, and this is guaranteed if the offset of each field is properly aligned and the alignment of the whole struct/class is the same as the alignment of its most strictly aligned field.
If we examine this struct, the offset of d1 is 0 and this field is properly aligned. The offset of i1 is 8, and this field is properly aligned too. The offset of d2 would be 12, which is not divisible by 8, so the compiler inserts a padding of 4 bytes between i1 and d2. The offset of d2 is now 16 and this field is properly aligned. The offset of i2 is 24 and this field is properly aligned. The size of the whole struct is now 28 bytes, but the alignment requirement for the struct is 8 (the same as the alignment requirement for its biggest field, double), so the compiler adds 4 more bytes of padding at the end. The total size of the struct is 32 bytes.
Padding bytes also count toward the dataset size, since they consume memory and data cache lines. But they are not useful data, and typically (although not always) decreasing the padding will result in faster execution.
To get rid of unnecessary padding, there are two ways:
- Rearrange the fields in the struct/class so that the bigger fields are at the beginning and the smaller ones at the end (see the sketch after this list).
- Use #pragma pack to tell the compiler to use different values for data alignment and therefore make the struct/class more compact. This way is simpler, but it relies on compiler extensions that are not necessarily supported everywhere. Also, the problem with this approach is that the compiler must emit unaligned loads and stores, which can be slower on some architectures.
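Here is a small sketch of the first approach applied to my_struct_t from above: sorting the fields from largest to smallest removes both the interior and the trailing padding (the exact sizes assume a typical 64-bit ABI):

```cpp
struct my_struct_reordered_t {
    double d1;   // offset 0
    double d2;   // offset 8
    int    i1;   // offset 16
    int    i2;   // offset 20
};

// 32 bytes with the original field order, 24 bytes after reordering.
static_assert(sizeof(my_struct_reordered_t) == 24, "no padding is needed anymore");
```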
Small Buffer Optimizations
Small buffer optimizations can be used in cases where you have a struct/class that is a handle to a heap-allocated object. There are numerous examples of such classes:
- In a string implementation, the string class can contain fields such as the string size, the capacity and a pointer to the actual characters.
- In a vector implementation, the vector class can contain fields such as the vector size, the capacity and a pointer to the block containing the data.
- In a heap object implementation, the heap object class contains a pointer to the object allocated on the heap.
- In a bit field implementation, the bit field class contains the size in bits, the capacity in bytes and a pointer to the bit field storage.
For some of these applications, it is very common to have small objects. For instance, an average string is around 7 characters long. Or, in some programs, the average heap object is a so-called small integer (an integer with a value < 1 million).
For these use cases, a technique called small buffer optimization (also called small string optimization, small object optimization or small size optimization) is used to decrease the amount of memory used and also decrease the number of allocations. The idea is simple. There are two cases: the default case (with an externally allocated storage and a pointer to it) and the small case (where everything is kept locally in the struct/class). These two cases are overlapped using a C union. Have a look at the example of a string class on a 64 bit little-endian system:
```cpp
struct small_string_t {
    unsigned char size;
    char data[23];
};

struct large_string_t {
    size_t capacity;
    size_t size;
    char* data;
};

struct string_t {
    union {
        small_string_t small_string;
        large_string_t large_string;
    };
};
```
The only remaining question is how we distinguish whether the class is currently holding a small string or a large string. To do this, we use the least-significant bit of the member small_string.size, which the union overlaps with the least-significant bit of the member large_string.capacity (on a little-endian system). The capacity is always kept a multiple of 2, so its LSB is always zero. If this bit is zero, we are working with a large string; if it is one, we are working with a small string. This means the small string's size must always keep its low bit set (for example, by storing the small string's size as size * 2 + 1).
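A hypothetical pair of helpers on top of the string_t above, assuming exactly that convention (the small size stored as (size << 1) | 1); this is a sketch, not the implementation of a real string class:

```cpp
bool is_small(const string_t& s) {
    // This byte overlaps the LSB of large_string.capacity on a little-endian system.
    return s.small_string.size & 1;
}

size_t size(const string_t& s) {
    return is_small(s) ? (s.small_string.size >> 1) : s.large_string.size;
}

static_assert(sizeof(string_t) == 24, "both representations fit in 24 bytes");
```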
Another example is the heap object class. If the heap object often points to a small integer, and we know that all pointers are at least 4 bytes aligned, then:
- If the LSB of the pointer is 0, we have a real pointer pointing to an object on the object heap.
- If the LSB of the pointer is 1, we have the special case of a small integer. We shift the value right by one bit and then we can use the number.
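A minimal sketch of such a tagged handle, assuming all real heap objects are at least 4-byte aligned and the integer fits into the remaining bits (all names are made up for illustration):

```cpp
#include <cstdint>

struct object_t { /* ... */ };

using object_ref = uintptr_t;

object_ref from_small_int(intptr_t value) {
    return (static_cast<uintptr_t>(value) << 1) | 1;  // LSB = 1 marks an inline integer
}

bool is_small_int(object_ref r) { return (r & 1) != 0; }

intptr_t to_small_int(object_ref r) {
    return static_cast<intptr_t>(r) >> 1;             // undo the shift
}

object_t* to_pointer(object_ref r) {
    return reinterpret_cast<object_t*>(r);            // LSB = 0: a real heap pointer
}
```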
The experiment
Dataset size has several “interesting” effects on a program's performance, and not all of them are necessarily positive. The performance improvement will depend on the memory access pattern, the dataset size, vectorization, instruction latencies and the available instruction level parallelism. We devised two experiments to explore these effects.
For the experiments, we used Intel(R) Core(TM) i5-10210U CPU, a fairly typical laptop CPU. The source code is available in our repository.
Arithmetic intensity, data types and performance
We compare the times needed to process 128M floats (4 bytes) vs 128M doubles (8 bytes), in both vectorized and non-vectorized versions. We initialize the vectors outside of the critical loops to avoid page faults. The arrays are large in order to make sure that the data doesn't fit in the data caches. We use two test loops: the first loop is arithmetically light, and the second is arithmetically heavy.
```cpp
// Arithmetically light
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

// Arithmetically heavy
for (int i = 0; i < n; i++) {
    c[i] = sqrt(a[i] * a[i]) + sqrt(b[i] * b[i]);
}
```
The arithmetically light loop has two loads, one store and one addition. The arithmetically heavy loop has two loads, one store, two multiplications, two square roots and an addition.
Here are the runtimes, executed instruction counts and the CPI (cycles per instruction) metric:
Arithmetic Intensity and Type | Not vectorized | Vectorized
---|---|---
Light (floats) | Time: 0.11 s, Instr: 805 M, CPI: 0.4 | Time: 0.09 s, Instr: 62 M, CPI: 3.9
Light (doubles) | Time: 0.19 s, Instr: 805 M, CPI: 0.71 | Time: 0.19 s, Instr: 126 M, CPI: 3.9
Heavy (floats) | Time: 0.14 s, Instr: 1342 M, CPI: 0.35 | Time: 0.09 s, Instr: 167 M, CPI: 1.53
Heavy (doubles) | Time: 0.26 s, Instr: 1342 M, CPI: 0.67 | Time: 0.17 s, Instr: 335 M, CPI: 1.59
A few observations about these numbers:
- The number of instructions for the non-vectorized version is the same for doubles and floats. This is expected, because the number of data elements is the same.
- The number of instructions for the vectorized version is two times lower for floats than for doubles. Again expected, since an AVX vector instruction works on four doubles or eight floats simultaneously, so fewer instructions are needed for floats.
- For the vectorized light loop, the CPI metric is high and bad (3.9; the ideal is 0.25, and 0.4 would be considered good). The vectorized version executes far fewer instructions than the non-vectorized version, but because of its low hardware efficiency (high CPI), there is very little effect on the runtime.
- The non-vectorized light loop has a good CPI (0.4 or 0.71), but because of its high instruction count it is not faster than the vectorized version.
- The vectorized heavy loop is faster than its non-vectorized counterpart. Vectorization really pays off only when the arithmetic intensity is high enough.
- The CPI of the vectorized heavy loop is about 1.5; in contrast, the CPI of the vectorized light loop is 3.9. The light versions are more bound, particularly memory bound.
Unfortunately, I couldn't provide snapshots from Intel's VTUNE to prove it, but the loops with a high CPI metric have it because of the DRAM bottleneck: the CPU was mostly waiting for data from the memory. The loops with a low CPI metric have their bottlenecks elsewhere.
There are a few reasons why the float processing loops are up to two times faster than the double processing loops. For arithmetically light loops, it is because the double version fetches twice as many bytes from the memory. For vectorized loops, one instruction processes twice as many pieces of data for floats as for doubles. There is also a third reason: some instructions are slower when working with doubles. Whereas addition, subtraction and multiplication are equally fast for floats and doubles, division, modulo and square root (as well as many library calls: pow, exp, log, sin, etc.) are slower for doubles than for floats. This is another factor that influences the processing speed.
Binary tree experiment
To illustrate how a program behaves with random memory accesses, we use the binary tree. We have two identical binary trees, with identical memory layouts, but with two different binary tree node implementations: one (simple_binary_tree_node) which uses regular pointers, and another (vector_backed_binary_tree_node) which allocates memory from a std::vector and uses array indexes instead of pointers. Here are the code snippets:
```cpp
struct simple_binary_tree_node {
    using node_id = simple_binary_tree_node*;
    ...
    T value;
    simple_binary_tree_node* left;
    simple_binary_tree_node* right;
};

struct vector_backed_binary_tree_node {
    using node_id = unsigned int;
    ...
    T value;
    unsigned int left;
    unsigned int right;
};
```
To decrease the memory consumption of the tree with plain pointers, we used a custom allocator that guarantees very little memory fragmentation and no memory chunk overhead. The size of a single chunk in the first implementation is 24 bytes; the size of a single chunk in the second implementation is 12 bytes. That means that the second, vector-backed implementation uses two times less memory than the simple implementation.
The vector-backed implementation needs more instructions to do its work. Whereas the simple implementation just uses addr(next) = node->left to calculate the next element, the vector-backed implementation needs to add together the base of the std::vector and the value of left or right: addr(next) = base + sizeof(vector_backed_binary_tree_node) * node->left. For this reason, the second implementation executes about 1.38 times more instructions, which, if the hardware efficiency were the same, should make it about 1.38 times slower.
We used tcmalloc as the system allocator, and we ran two types of tests: regular and with huge pages3. We measured runtimes, instruction counts and CPI (cycles per instruction). The instruction count is important because sometimes slowdowns come from doing more work, not from being less efficient. The CPI metric is important because it measures hardware efficiency: the smaller, the better.
Please note that the lookup algorithm has very little instruction level parallelism. Between accesses to two consecutive nodes there is a long stall, because the address of the next node depends on the data of the current one and the CPU has nothing else to do in the meantime. There are ways to give the CPU more to do, but this is not the topic of this post.
The measurements
We measure the runtimes and other metrics for binary trees with 8K, 16K, 32K, …, 16M nodes. The number of lookups is always the same: 16M lookups. This means for the smallest case we will rerun the lookup procedure 2K times, and for the largest case we will run it only once.
Here is the graph with the runtimes:
The first thing to notice is that the runtime gets worse as the dataset size grows. The worst runtime is almost 10 times slower than the best runtime! There are differences in speed between the different versions for the same dataset size, but they are not large. In this case huge pages didn’t significantly influence the program’s performance.
Here is the ratio of the vector-backed tree runtime to the simple tree runtime:
Remember that the vector-backed version executes about 1.38 times more instructions than the simple version. When the dataset is really small (fits the L1 cache) or very large (doesn't fit any of the caches), the simple version is faster. But in between, the vector-backed version is up to 1.4 times faster, in spite of doing more work.
Let's observe the CPI metric. This metric is "cycles per instruction", i.e. how efficiently the program uses the hardware. The perfect value is 0.25, a value up to 0.5 would be considered good, a value above 1 is not that good, and a value over 4 can be considered a disaster. Here is the graph:
The first thing to note is that the vector-backed binary tree is always more efficient. But since it always executes more instructions than the simple version, this efficiency doesn't automatically translate into a speed boost.
The total number of executed instructions grows as well, because with more nodes the tree becomes deeper. But this growth is limited: between the smallest and the largest tree, the number of executed instructions increases by a factor of about 1.7 for both implementations. Therefore, the speed degradation can be explained mostly by the CPI getting worse, especially for the simple version.
Please note there are two reasons why the CPI is better for the vector-backed version: (1) its dataset is smaller and (2) it has more available instruction level parallelism because of the additional "work". Reason (1) is generally positive, but (2) is positive only to the extent that the additional work doesn't slow down the program.
The last, but most important, thing is confirming that the program is actually memory bound. Here is the likwid-perfctr top-down (TMA) report for the binary trees with 8K, 512K and 16M nodes:
Node count | Simple version | Vector-backed version
---|---|---
8K | Front End 24.0049%, Speculation 51.8606%, Retiring 7.1858%, Back End 16.9487% | Front End 15.7893%, Speculation 55.5708%, Retiring 10.7352%, Back End 17.9047%
512K | Front End 13.3525%, Speculation 26.8133%, Retiring 3.0412%, Back End 56.7930% | Front End 10.1397%, Speculation 35.2016%, Retiring 6.1912%, Back End 48.4674%
16M | Front End 8.6298%, Speculation 16.9045%, Retiring 1.7837%, Back End 72.6821% | Front End 5.2761%, Speculation 16.9566%, Retiring 2.7303%, Back End 75.0370%
The bottleneck in the smallest case is CPU speculation, i.e. branch prediction. The CPU guesses the branch outcome (going to the left or to the right child), and since this is a lookup of random data, it guesses badly. But as the dataset grows, the bottleneck moves to the backend (the CPU core or the memory). Here is the detailed TMA analysis for the case of 16M nodes, done by Intel's VTUNE:
The TMA analysis of the simple binary tree reveals that the CPU is predominantly DRAM bound (46.3%) and that the number of retiring slots is very low (2.2%): the program uses the hardware resources very inefficiently. The same can be said for the vector-backed binary tree, although its numbers look slightly better.
Lastly, we want to do the detailed TMA analysis for the binary tree with 512K nodes, since this is the size where the difference between the two implementations is the largest.
The main difference is the DRAM bound metric, which is 43.6% for the simple binary tree and 11.8% for the vector-backed binary tree; this is the metric that for larger trees becomes the dominant bottleneck.
Final Words
This has been an exciting adventure, and the results are very interesting. When it comes to dataset size, the general impression is that decreasing the dataset size has a positive impact on performance, but only conditionally, since it depends on several factors. Here we list them:
- If the memory access pattern is sequential (running over vectors) and the loop is arithmetically light (mostly loads and stores with little arithmetic), reducing the dataset size can result in substantial speed improvements because of the decreased pressure on the memory subsystem.
- If the loop is vectorized (and many loops with a sequential memory access pattern and simple types are), the speed improvement will probably come from processing more data per instruction, not from a decrease in memory boundness.
- If the memory access pattern is random (which is the case with pointer-chasing code, e.g. a binary tree or a hash map), then the speedup will depend on the dataset size. If the decrease in size means that more data can be packed into a faster level of cache, the speedup can be substantial. But if the dataset is really big or really small, the dataset decrease will probably not pay off.
- If the loop is arithmetically heavy, then decreasing the dataset size doesn't necessarily result in speed improvements, since the bottleneck will probably not be the memory subsystem. This is especially true if the loop is not vectorized.
- If decreasing the dataset doesn't require more data processing (more instructions), then you can expect no slowdowns, only speedups4. However, if additional instructions are needed to handle the decrease in dataset size, then you can expect slowdowns as well.
- For strided memory accesses, the reasoning will depend on the stride. A small stride will more closely resemble the sequential memory access pattern, and a large stride the random one. We didn't do any experiments on this, so this is an educated guess.
1. TLB caches are used to speed up the translation of virtual to physical addresses.
2. E.g. if your loop is processing two classes X and Y, but it only uses some of the data from Y. If you create a copy of Y's used data inside X, the total memory consumption of the program will increase (because of the copy), but the dataset size will decrease (less data is touched by the loop).
3. Huge pages are a feature of Linux that decreases the pressure on the TLB cache, which can be very important for this type of workload.
4. This can happen, for example, if you switch to a zero-overhead allocator.