We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.
This is the last memory optimization topic that we are covering on this blog. You can see the full list of all the memory subsystem optimizations we covered earlier here. It is definitely worth a read for anyone trying to improve the performance of memory-intensive software.
In this post, we cover a few remaining optimization techniques that didn’t fit into other sections, in no particular order. Enjoy!
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.
Techniques
Calculation vs Memory Lookup
One of the ways people used to speed up their loops was to use precalculated values from memory instead of calculating them when needed. For example, instead of calculating the value of sine every time you need it, you could use a lookup table to look up the value. This works well e.g. with int16_t fixed-point types, where the lookup table would have only 64K entries.
In the modern world, we need to seriously reconsider this recommendation. A CPU might need 4 cycles to fetch a value from memory, and this is in the best of scenarios. If the data is not coming from the fastest level of the data cache, this number can easily grow to 15, 20, 50, etc. On the other hand, with vectorization, a CPU can perform calculations on e.g. 8 floats at once, whereas with a memory lookup the CPU needs to perform 8 individual memory loads.
We are not saying that you should replace all memory lookups with calculations, but the bar is higher than you think: with vectorization, you can perform a lot of calculations before becoming slower than a memory lookup.
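To make the tradeoff concrete, here is a minimal sketch of the two approaches for sine. This is our own illustrative code, not taken from the post; the table size, scaling and function names are assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr double kPi = 3.14159265358979323846;

// A classic 64K-entry lookup table for the sine of a 16-bit fixed-point phase.
std::vector<float> make_sine_table() {
    std::vector<float> table(65536);
    for (size_t i = 0; i < table.size(); i++) {
        table[i] = float(std::sin(2.0 * kPi * double(i) / 65536.0));
    }
    return table;
}

// Lookup version: one memory load per value, potentially a cache miss.
float sine_lookup(const std::vector<float>& table, uint16_t phase) {
    return table[phase];
}

// Calculation version: pure computation, no memory traffic; with a vector math
// library, a loop over many phases can be vectorized.
float sine_compute(uint16_t phase) {
    return std::sin(2.0f * float(kPi) * float(phase) / 65536.0f);
}
```

Which of the two is faster depends on where the table lives in the cache hierarchy and on whether the surrounding loop vectorizes, which is exactly the point of this section.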
Flat Matrices vs C Style Matrices
A flat matrix is a matrix represented as a one-dimensional array. The address of element matrix[i, j] is calculated as matrix + i * width + j, where i is the row and j is the column.

A C-style matrix is essentially an array of pointers, where each pointer points to a row of data. So, in this scheme, if we want to access an element matrix[i, j], we do it by reading the address of the row stored at matrix + i, dereferencing it and then adding j to it. So, the address is calculated as *(matrix + i) + j.
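To make the two layouts concrete, here is a small illustrative sketch (our own code; the function and variable names are made up):

```cpp
#include <cstddef>
#include <cstdlib>

// Illustration of the two layouts described above; sizes and indices are placeholders.
void matrix_layouts(size_t height, size_t width, size_t i, size_t j) {
    // Flat matrix: one contiguous block; the element address is computed as i * width + j.
    double* flat = (double*)malloc(sizeof(double) * height * width);
    flat[i * width + j] = 1.0;                 // matrix[i, j]

    // C-style matrix: an array of row pointers; access needs an extra load of the row pointer.
    double** c_style = (double**)malloc(sizeof(double*) * height);
    for (size_t r = 0; r < height; r++) {
        c_style[r] = (double*)malloc(sizeof(double) * width);
    }
    c_style[i][j] = 1.0;                       // *(matrix + i) gives the row, then add j

    for (size_t r = 0; r < height; r++) {
        free(c_style[r]);
    }
    free(c_style);
    free(flat);
}
```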
The difference between the two approaches is similar to the difference between calculation or memory lookup. For flat matrices, we perform a calculation to get the address of the row. For C-style matrices, we perform a memory lookup.
C-style matrices let you use the matrix[i][j] notation. They also allow you to have matrices where rows have different sizes, so they can save some memory. But for performance, flat matrices are much better:
- Because calculation is much cheaper than a memory lookup, algorithms that work with flat matrices, especially algorithms that run through matrices column-wise, tend to be faster than the same algorithms running on C-style matrices.
- No overhead of an additional array pointing to rows.
- Less memory fragmentation: flat matrices can be allocated and deallocated with one call to the memory allocator (this can also be achieved with C-style matrices, but it’s trickier).
- The compiler can do more optimizations with flat matrices than with C-style matrices, because with C-style matrices rows can potentially alias one another, which prevents aggressive optimizations.
If you want the performance of flat matrices with a convenient matrix-like syntax, in C++ the simplest way is to overload T& operator()(size_t i, size_t j). Then, you can use your matrix like matrix(i, j) to get or set a value in the matrix.
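Here is a minimal sketch of such a wrapper (illustrative only; the class name and details are ours, not from the post):

```cpp
#include <cstddef>
#include <vector>

// A minimal flat-matrix wrapper along the lines described above.
template <typename T>
class FlatMatrix {
public:
    FlatMatrix(size_t rows, size_t cols) : rows_(rows), cols_(cols), data_(rows * cols) {}

    // Element access computes the address as i * cols + j: no pointer chasing.
    T& operator()(size_t i, size_t j) { return data_[i * cols_ + j]; }
    const T& operator()(size_t i, size_t j) const { return data_[i * cols_ + j]; }

    size_t rows() const { return rows_; }
    size_t cols() const { return cols_; }

private:
    size_t rows_, cols_;
    std::vector<T> data_;
};

// Usage: m(i, j) reads or writes the element in row i, column j.
// FlatMatrix<double> m(100, 100);
// m(3, 5) = 1.0;
```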
We investigate the performance of flat matrices and C-style matrices in the experiment section.
Transposing Matrices for Better Performance
When working with matrices (or any data that is stored in two dimensions, like images), software performance is much better if you process your data row-by-row instead of column-by-column. We already covered this in a lot of detail in the post For Software Performance, the Way Data is Accessed Matters! That post covers many techniques you can use to improve the performance of matrix processing code. However, there is one super-simple approach that works when your algorithm processes matrix data column-by-column.
The idea is to transpose the matrix with the column-wise accesses before doing the processing on it. Matrix transposition is itself an O(n²) operation, so for this technique to work, the algorithm you are optimizing needs to have at least O(n²) complexity. Good examples of such algorithms are matrix multiplication with complexity O(n³) or vertical Gaussian filtering with complexity O(kn²), where k is the size of the Gaussian filter.
We illustrate the technique using matrix multiplication code. Below is the code snippet:
```cpp
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        c[i * n + j] = 0.0;
        for (int k = 0; k < n; ++k) {
            c[i * n + j] += a[i * n + k] * b[k * n + j];
        }
    }
}
```
Let’s inspect the innermost loop and observe how the pointers to a, b and c change when the iterator k increases by one:
- When k increases by one, the pointer to c[i * n + j] doesn’t change. It is always hitting the same memory address. This corresponds to a constant access pattern. The compiler will probably promote the value to a hardware register instead of keeping it in memory.
- When k increases by one, the pointer to a[i * n + k] also increases by one. It is hitting the neighboring memory address. This corresponds to a sequential access pattern. Generally, this access pattern uses the memory subsystem resources optimally.
- When k increases by one, the pointer to b[k * n + j] increases by n. This corresponds to a strided access pattern, and n is the stride. This access pattern uses the memory subsystem resources poorly and is responsible for the poor performance.
The above algorithm has O(n³) complexity. A very simple approach would be to transpose the matrix b. Our code looks like this:
```cpp
double* b_transposed = (double*) malloc(sizeof(double) * n * n);

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        b_transposed[i * n + j] = b[j * n + i];
    }
}

for (int i = 0; i < n; ++i) {
    for (int j = 0; j < n; ++j) {
        c[i * n + j] = 0.0;
        for (int k = 0; k < n; ++k) {
            c[i * n + j] += a[i * n + k] * b_transposed[j * n + k];
        }
    }
}

free(b_transposed);
```
Observe that we created a new matrix called b_transposed, which is a transposed version of b. In the main loop, we are accessing b_transposed[j * n + k]. Contrary to the original pointer into b, this pointer has a sequential memory access pattern, meaning this loop has the potential to execute faster.
For this approach to yield performance improvements, the matrix dimension needs to be big enough. Otherwise, the matrices would fit into the fastest levels of the data cache and transposing would just waste time.
We experiment with various matrix sizes in the experiment section.
Sequential Reading vs Sequential Writing
Consider the following example of matrix transpose:
```cpp
void matrix_transpose_in(float* out, const float* in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            out[j * n + i] = in[i * n + j];
        }
    }
}
```
This is a very simple matrix transpose. If we analyze the access patterns of the pointers in the innermost loop in the same way as in the previous section, the access pattern for out is strided and the access pattern for in is sequential. This is one possible way to implement transposition. The other way is to exchange the outer and the inner loop; the code looks like this:
```cpp
void matrix_transpose_out(float* out, const float* in, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            out[i * n + j] = in[j * n + i];
        }
    }
}
```
For this code, the access patterns for the pointers out and in are inverted: out now has a sequential access pattern, and in has a strided access pattern.
When you are making modifications to speed up memory accesses in your code, such as loop interchange, loop tiling, etc., you might get into a situation where you cannot get rid of all the strided accesses, but you can pick between sequential data reading and sequential data writing. The question is, which one is better?
Loads are at the beginning of most dependency chains and stores are at their ends, so one would assume that efficient data loading is more important than efficient data storing. But in our experiments, sequential data storing yields better performance!
There are several reasons for this, and the most important one is write combining, a technique the hardware employs to speed up data writes. Imagine we need to modify one byte of data in memory. What happens behind the scenes is that the CPU fetches a whole block of 64 bytes, modifies one byte and then stores the block back to memory. But if the CPU is modifying the whole 64-byte block, then it can combine several stores, omit the load step and just store the modified 64-byte block!
In the experiment section we investigate the performance improvements we get from sequential writing.
Efficient Data Filtering: Data Moving vs Data Skipping
Imagine the following scenario: you have a collection of data, and you have to filter out some values depending on some condition, e.g. you want to filter out all values smaller than X because they are noise. In addition, you will be going through the same collection of data several times, performing the filtering each time with a different filtering criterion – the criterion for the next filtering pass might be calculated in the previous pass.
The question is, how do you organize your data for efficient filtering? The natural answer to this question is a linked list, but linked lists are notorious for their bad memory access pattern and low instruction-level parallelism. Therefore, we skip linked lists and try arrays instead. To implement our filtering list with arrays, we have two options:
- SKIP Implementation: Instead of storing only values in the array, we store a value and the index of the next valid value. This essentially converts our array into a linked list. To remove a value, we change the next field of its predecessor. The removal of a value is very cheap but, after the first pass, your code will be skipping over deleted values, which results in jumping around memory – and that, as we have reaffirmed many times, is bad for performance.
- COPY Implementation: this implementation is without the additional next field; the array stores only values. We delete a value at position X by simply overwriting it with some value that comes later in the array that we wish to keep (similar to the std::copy_if algorithm). The price of deletion is higher because we essentially overwrite deleted values with useful values, which involves a lot of data copying. But iterating over the array is much cheaper, since the program runs through memory sequentially.
The question is, which one of these is faster? This depends on a lot of factors:
- A small filtering list favors the SKIP implementation: since the whole list fits in the fast data cache, the overhead of deletion is not offset by the savings we get from accessing the data sequentially. A big filtering list favors the COPY implementation: the price of copying is minuscule compared to the price of random memory accesses.
- A filtering list with small types (e.g. uint8_t, float) favors the COPY implementation, while a filtering list with large types (big classes) favors the SKIP implementation. This is because smaller types are cheaper to copy.
- If the filtering list is iterated many times, this favors the COPY implementation; otherwise, the SKIP implementation would be better. The more we iterate over the filtering list and remove data, the more we suffer from data cache misses – the SKIP implementation increasingly resembles a sparsely populated array.
An additional advantage of the COPY implementation is that there exists a fast vectorized implementation for simple data types in eve::algo::copy_if.
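To make the two schemes concrete, here is a minimal sketch of both. This is our own illustrative code, not taken from the post; the function and type names are made up, and the predicate (keep values at or below a threshold) is just an example.

```cpp
#include <cstddef>
#include <vector>

// COPY: keep only the surviving values by compacting the array in place
// (the same idea as std::copy_if / std::remove_if). Deleted values are overwritten.
size_t filter_copy(std::vector<float>& values, float threshold) {
    size_t out = 0;
    for (size_t i = 0; i < values.size(); i++) {
        if (values[i] <= threshold) {          // keep values at or below the threshold
            values[out++] = values[i];
        }
    }
    values.resize(out);
    return out;
}

// SKIP: each element also stores the index of the next live element, which turns the
// array into a linked list laid out in an array. Removal just relinks; deleted slots
// stay in memory and later passes have to jump over them.
struct SkipNode {
    float value;
    size_t next;   // index of the next live element; nodes.size() acts as the end marker
};

void filter_skip(std::vector<SkipNode>& nodes, size_t& head, float threshold) {
    size_t* link = &head;                       // the "next" field that leads to the current node
    for (size_t i = head; i < nodes.size(); i = nodes[i].next) {
        if (nodes[i].value <= threshold) {
            link = &nodes[i].next;              // keep: advance the predecessor link
        } else {
            *link = nodes[i].next;              // remove: predecessor now skips this node
        }
    }
}
```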
In the experiment section we experiment with filtering lists of various sizes and patterns.
Lookup Inversion
To explain lookup inversion, consider the following scenario. We have a large unsorted array of values lookup_vector and a small array lookup_vals. We want to check if the values from lookup_vals are present in lookup_vector. The search algorithm might look something like this:
```cpp
std::vector<size_t> result(lookup_vals.size(), NOT_FOUND);

for (size_t i = 0; i < lookup_vals.size(); i++) {
    int64_t val = lookup_vals[i];
    for (size_t j = 0; j < lookup_vector.size(); j++) {
        if (val == lookup_vector[j]) {
            result[i] = j;
            break;
        }
    }
}

return result;
```
We take a value from lookup_vals and then sequentially search lookup_vector. If the value is found, we break the loop early (early break) and move on to the next value from lookup_vals.
The problem with this approach is that we are iterating over and over the large array lookup_vector, which probably means the data for lookup_vector is coming from the slower data caches or from memory.
A canonical way to improve the performance of this loop is loop tiling, which we covered in the post For Software Performance, the Way Data is Accessed Matters! But there is an alternative, simpler way. If the size of lookup_vals is much smaller than the size of lookup_vector, we can do the opposite: take values from lookup_vector and do the sequential search inside lookup_vals. The algorithm changes a bit:
```cpp
std::vector<size_t> result(lookup_vals.size(), NOT_FOUND);
size_t values_found = 0;

for (size_t i = 0; i < lookup_vector.size(); i++) {
    int64_t val = lookup_vector[i];
    if (values_found >= lookup_vals.size()) {
        break;
    }
    for (size_t j = 0; j < lookup_vals.size(); j++) {
        if (val == lookup_vals[j]) {
            result[j] = i;
            // We don't break the loop because the lookup_vals
            // array can contain duplicate values
            values_found++;
        }
    }
}

return result;
```
In this case, we don’t break early when we find a value, because the value can be repeated in lookup_vals. We can exit early once the number of values we have found equals the total number of values in lookup_vals.
The bold statement: this implementation will be faster than the original implementation if lookup_vals is much smaller than lookup_vector. Why?
Both algorithms have a similar algorithmic complexity, which is O(lookup_vals.size() * lookup_vector.size()). But in the first algorithm, we perform a small number of passes over a large array, and in the second we perform a large number of passes over a small array. Doing a large number of passes over a small array is faster than the other way around, because the small array can be cached in a fast level of the data cache.
We investigate the effect of this technique in the experiment section.
Partially Sorting the Lookup Keys
To explain this technique, let’s say we have N keys and we want to check if they are present in our binary tree, or in our sorted array using binary search. Now the bold claim: if the lookup keys are partially sorted, the lookups will be faster!
In the context of this post, we use the term partial sorting for an array that has been divided into N buckets. Within a bucket, the values are unsorted, but all values in bucket X + 1 are larger than any value in bucket X. For example:
Sorting | Array |
---|---|
Totally sorted | 1, 2, 3, 4, 5, 6, 7, 8 |
Partially sorted, four buckets | [2, 1], [4, 3], [5, 6], [8, 7] |
Partially sorted, two buckets | [1, 3, 2], [8, 7, 4, 6, 5] |
Why is lookup faster if the keys are partially sorted?
Some data structures, like binary trees or sorted arrays for binary search, have the property that values that are close to one another in value are also close to one another in memory.
- In a sorted array, keys coming from the same bucket will be located only in one part of the array, not the whole array.
- In a tree, keys coming from the same bucket will be located in only one part of the tree, not the whole tree.
So, when we perform lookups using the partially sorted key array, we get better data locality since, while the keys from a single bucket are being processed, we are accessing only a part of the data structure instead of the whole data structure. And with better locality comes better performance.
This approach works for many lookup data structures with O(log₂(n)) lookup complexity, including k-d trees (looking up points in space), graphs, etc.
Partially Sorting the Lookup Array
Partial sorting can be done in O(n) time if enough memory is available. In C++, you would allocate as many instances of std::vector as you have buckets. Then, depending on the key value, you would put each key into one of the buckets using std::vector::push_back.
After partitioning the key array into buckets, in the second part of the algorithm we recreate a partially sorted key array. We do this by visiting buckets from 0 to bucket_count, and copying the values from the buckets to the partially sorted output array. The example source code is available here.
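As an illustration, here is a hedged sketch of the bucketing just described. We assume uint32_t keys and a known key range that is split evenly across the buckets; the actual code linked above may map keys to buckets differently, and the function name is ours.

```cpp
#include <cstdint>
#include <vector>

// Partially sort the keys by distributing them into buckets and concatenating the buckets.
std::vector<uint32_t> partial_sort_keys(const std::vector<uint32_t>& keys,
                                        size_t bucket_count,
                                        uint32_t min_key, uint32_t max_key) {
    std::vector<std::vector<uint32_t>> buckets(bucket_count);
    const uint64_t range = uint64_t(max_key) - min_key + 1;

    // Pass 1: put each key into its bucket (O(n)).
    for (uint32_t key : keys) {
        size_t b = size_t((uint64_t(key) - min_key) * bucket_count / range);
        buckets[b].push_back(key);
    }

    // Pass 2: visit buckets 0..bucket_count-1 and copy their contents into the
    // partially sorted output array.
    std::vector<uint32_t> result;
    result.reserve(keys.size());
    for (const auto& bucket : buckets) {
        result.insert(result.end(), bucket.begin(), bucket.end());
    }
    return result;
}
```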
Performance Considerations
For this approach to work, a few things need to be taken into account:
- The data structure (tree or sorted array) needs to be big enough that it doesn’t fit into the L1 (or maybe L2) data cache. If the structure is small, the whole data structure will fit into the L1 cache, so partial sorting of the keys doesn’t bring any speed improvements.
- You need to pick the size of each individual bucket properly. A bucket that contains one million elements is too big and will probably touch too large a part of the data structure for it to fit into the fast data cache. A very small bucket is wasted effort, because it could have been merged with neighboring buckets for better effect.
- The difference between the minimum and maximum key in a bucket should not be too large. If it is, the key lookups will hit too large a part of the data structure and the speedup will be gone.
You probably have a hunch that creating and maintaining a partially sorted array of lookup values can be tricky. To do efficient bucketing, you need to know the statistical distribution of your lookup keys. Are all keys equally probable, or are some keys more likely to appear than others? If you don’t pay attention to this, you might end up with some buckets that are huge (which is bad for performance) and others that are almost empty (buckets that could have been merged with other buckets).
Experiments
In the experiment section we investigate how the techniques presented earlier affect software performance. All experiments were executed on an Intel(R) Core(TM) i5-10210U system running Ubuntu. We repeated each experiment 10 times and report the average numbers. Unless stated otherwise, we observed only small standard deviations in all experiments, so we don’t report them.
We measure the runtimes of the simple and the improved version, and the speedup, which is simply the ratio of the two runtimes. For some experiments we also report the instruction count, since it is important to understand whether the speed improvement comes from reducing the instruction count or from improving hardware efficiency. Finally, we report the CPI metric (Cycles Per Instruction), which measures hardware efficiency – a smaller number means better hardware efficiency.
Flat Matrices vs C-Style Matrices
We measure the performance of matrix transpose for a flat matrix and a C-style matrix, for various matrix sizes. Each experiment was repeated ten times and we report the average numbers; no big deviations were observed in any of the columns. The source code we used for the experiments is available here. The results are in the table below:
Dimension (Repeat Count) | Runtime Flat | Runtime C-Style | Speedup Flat/C-Style | Instruction Count Flat | Instruction Count C-Style | CPI Flat | CPI C-Style |
---|---|---|---|---|---|---|---|
20 (2684 K) | 0.384 s | 0.520 s | 1.35 | 5043 M | 5411 M | 0.305 | 0.382 |
50 (429 K) | 0.337 s | 0.532 s | 1.58 | 3652 M | 4429 M | 0.367 | 0.477 |
100 (107 K) | 0.392 s | 0.502 s | 1.28 | 3520 M | 4338 M | 0.443 | 0.463 |
200 (26 K) | 0.452 s | 0.521 s | 1.15 | 3460 M | 4295 M | 0.52 | 0.483 |
500 (4294) | 0.486 s | 0.554 s | 1.14 | 3219 M | 4121 M | 0.601 | 0.535 |
1000 (1073) | 0.659 s | 0.712 s | 1.08 | 3207 M | 4111 M | 0.786 | 0.672 |
2000 (268) | 1.028 s | 1.081 s | 1.05 | 3141 M | 4062 M | 1.125 | 0.923 |
5000 (42) | 1.164 s | 1.267 s | 1.09 | 3088 M | 3987 M | 1.271 | 1.075 |
If we look at the table, the runtime of the matrix transpose algorithm for flat matrices is always better than for C-style matrices. The speedup is higher for smaller matrices and decreases as the size of the matrix grows. The main reason is that the flat matrix version executes considerably fewer instructions than the C-style version.
Transposing Matrices for Better Performance
For this experiment we use the matrix multiplication example described earlier. We measure the performance of matrix multiplication without and with the transpose. The runtime for the version with transpose includes the memory allocation, the transposition and the actual algorithm; the version without transpose only runs the algorithm. The source code can be found here.
Here are the results:
Dimension (Repeat Count) | Runtime Simple | Runtime with Transpose | Speedup | CPI Simple | CPI with Transpose |
---|---|---|---|---|---|
200 (33) | 0.268 s | 0.248 s | 1.08 | 0.518 | 0.526 |
500 (2) | 0.271 s | 0.246 s | 1.1 | 0.557 | 0.555 |
1000 (1) | 1.805 s | 1.190 s | 1.52 | 0.846 | 0.644 |
2000 (1) | 36.309 s | 10.124 s | 3.59 | 2.191 | 0.637 |
The instruction count is similar for both versions, so we are not reporting it. The compiler didn’t vectorize either version; it could have vectorized the transpose version for an even better runtime.
The performance doesn’t change considerably for smaller matrices (dimensions 200 and 500). Only for larger matrices (larger than 1000×1000) does the version with transposition become considerably faster than the simple version. And the reason is improved memory efficiency: the CPI of the version with transposition is much better than the CPI of the simple version.
Sequential Reading vs Sequential Writing
For this experiment, we compare the performance of the two versions of the matrix transpose kernel described earlier. The first version performs sequential reads/strided writes and the second version performs sequential writes/strided reads. Both kernels use loop tiling to speed up the matrix transpose. We perform measurements for various matrix sizes. The source code is here, and these are the results:
Dimension (Repeat Count) | Runtime Seq. Reading | Runtime Seq. Writing | Speedup Write/Read | CPI Seq. Reading | CPI Seq. Writing |
---|---|---|---|---|---|
20 (2684 K) | 0.440 s | 0.384 s | 1.15 | 0.328 | 0.305 |
50 (429 K) | 0.388 s | 0.337 s | 1.15 | 0.391 | 0.367 |
100 (107 K) | 0.503 s | 0.392 s | 1.28 | 0.536 | 0.443 |
200 (26 K) | 0.561 s | 0.452 s | 1.24 | 0.61 | 0.52 |
500 (4294) | 0.586 s | 0.486 s | 1.21 | 0.695 | 0.601 |
1000 (1073) | 1.107 s | 0.659 s | 1.68 | 1.058 | 0.786 |
2000 (268) | 1.587 s | 1.028 s | 1.54 | 1.247 | 1.125 |
5000 (42) | 1.673 s | 1.164 s | 1.43 | 1.416 | 1.271 |
The sequential write version is always faster than the sequential read version, and the speedup is greater for large matrices. The reason is the improved efficiency (CPI) of the sequential write version.
Efficient Data Filtering: Data Moving vs Data Skipping
For this experiment, we use a filtering list consisting of floats. We use two implementations: the COPY implementation that overwrites the deleted values and the SKIP implementation that skips over them (both described earlier). We run the algorithm with several passes, and each pass removes a certain number of values. For this test, the filtering removes all values that are greater than 0.9 times the maximum value remaining from the previous pass. The source code is here.
We fix the size of the starting list to 10 M floats. We vary the number of passes. Here are the results:
Number of Passes (Remaining Values) | Runtime SKIP | Runtime COPY | Speedup | Instructions SKIP | Instructions COPY | CPI SKIP | CPI Copy |
---|---|---|---|---|---|---|---|
1 (9 M) | 0.030 s | 0.017 s | 1.76 | 131 M | 87 M | 0.881 | 0.744 |
2 (8.1 M) | 0.056 s | 0.032 s | 1.75 | 248 M | 165 M | 0.882 | 0.743 |
5 (5.9 M) | 0.125 s | 0.069 s | 1.81 | 356 M | 536 M | 0.896 | 0.753 |
10 (3.4 M) | 0.212 s | 0.108 s | 1.96 | 853 M | 566 M | 0.938 | 0.744 |
20 (1.2 M) | 0.338 s | 0.146 s | 2.31 | 1150 M | 763 M | 1.069 | 0.743 |
From the table we can make a few interesting observations:
- Even for a single pass, the COPY version is faster by a factor of 1.76. The reason is simply a smaller number of instructions!
- As the number of passes increases, so does the CPI of the SKIP implementation. This happens because the SKIP implementation has worse and worse memory characteristics as we remove data from it. The CPI of the COPY implementation remains the same.
- Generally, the speedup of the COPY implementation ranges from 1.7 to 2.3, with more passes resulting in a higher speedup.
Of course, one should not generalize: the speedup you get with floats will not be the speedup you get with big classes. Also, we didn’t test small lists, but since the performance improvement mostly comes from the reduction in instruction count, one could expect this approach to work reasonably well even for small lists.
Lookup Inversion
For the lookup inversion experiment, we use the same code as we used to demonstrate the technique. We fix the size of the bigger array to 10 M elements, and we vary the size of the small array to observe how the runtimes change. Here are the results:
Lookup Keys Count (Repeat Count) | Runtime Simple | Runtime with Inversion | Speedup | Instructions Simple | Instructions with Inversion | CPI Simple | CPI with Inversion |
---|---|---|---|---|---|---|---|
1 (10) | 0.009 s | 0.204 s | 0.04 | 16 M | 1800 M | 0.661 | 0.373 |
5 (2) | 0.045 s | 0.061 s | 0.74 | 121 M | 600 M | 0.759 | 0.338 |
20 (1) | 0.084 s | 0.053 s | 1.58 | 260 M | 500 M | 0.770 | 0.353 |
100 (1) | 0.343 s | 0.160 s | 2.14 | 1306 M | 1900 M | 0.775 | 0.309 |
500 (1) | 1.681 s | 0.714 s | 2.35 | 6.56 G | 8.9 G | 0.779 | 0.312 |
2000 (1) | 6.733 s | 2.679 s | 2.51 | 26.57 G | 35.15 G | 0.780 | 0.293 |
10000 (1) | 34.136 s | 13.835 s | 2.47 | 131.67 G | 175.15 G | 0.783 | 0.293 |
For a very small number of keys, the version with inversion is slower. The reason is that the simple version performs an early exit, and judging by the instruction count, this is precisely what happened. But as the number of lookup keys grows, the version with inversion becomes faster. Even with 20 keys, the version with inversion is 1.58 times faster than the simple version, and this number continues to grow!
Note also that all the performance improvement comes from increased hardware efficiency (note the CPI), not from a reduction in instruction count.
Partially Sorting the Lookup Keys
For this experiment, we use binary search on a sorted array to demonstrate the effectiveness of partially sorting the lookup keys. When it comes to this experiment, there are three parameters that we can modify to measure the performance of partial sorting:
- The size of the sorted array
- The number of lookup keys
- The number of buckets in the partitioned lookup key array
It is infeasible for us to test all of them in combination; therefore, we test them in isolation. We vary one parameter and keep the other two constant. The source code for the experiment is available here.
Size of Sorted Array
The size of the lookup array is fixed to 100 K uint32_t values (97.65 kB), and the bucket count is fixed to 256 buckets. We vary the size of the sorted array to see what kind of speed improvement we get depending on its size.
Sorted Array Size (Repeat Count) | Runtime Simple | Runtime PS | Speedup | CPI Simple | CPI PS |
---|---|---|---|---|---|
10 K (103) | 0.600 s | 0.351 s | 1.70 | 1.838 | 0.822 |
20 K (95) | 0.601 s | 0.374 s | 1.61 | 1.882 | 0.906 |
50 K (89) | 0.635 s | 0.410 s | 1.55 | 1.978 | 1.005 |
100 K (83) | 0.672 s | 0.419 s | 1.60 | 2.169 | 1.071 |
200 K (78) | 0.729 s | 0.423 s | 1.72 | 2.373 | 1.104 |
500 K (74) | 0.811 s | 0.440 s | 1.84 | 2.606 | 1.148 |
1 M (70) | 0.874 s | 0.456 s | 1.91 | 2.786 | 1.193 |
2 M (67) | 1.229 s | 0.505 s | 2.43 | 3.373 | 1.291 |
5 M (61) | 1.616 s | 0.630 s | 2.57 | 4.460 | 1.489 |
10 M (58) | 1.854 s | 0.736 s | 2.51 | 5.156 | 1.749 |
20 M (55) | 2.063 s | 0.843 s | 2.44 | 5.804 | 2.000 |
50 M (53) | 2.361 s | 1.068 s | 2.21 | 6.540 | 2.520 |
Speedups can be observed for all sorted array sizes, and they are generally bigger for bigger sorted arrays. The speedup comes entirely from improved hardware efficiency (lower CPI); the instruction count for the partially sorted version is even higher than for the unsorted version.
Number of Lookup Keys
For this experiment, we fix the size of the sorted array to 10 M and the bucket count to 256, and then we vary the number of lookup keys to see how performance depends on the size of the lookup array.
Lookup Keys (Repeats) | Runtime Simple | Runtime PS | Speedup | CPI Simple | CPI PS |
---|---|---|---|---|---|
500 (11.7 K) | 1.135 s | 1.229 s | 0.92 | 3.298 | 1.337 |
1000 (5.8 K) | 1.279 s | 1.260 s | 1.02 | 3.591 | 1.476 |
2000 (2.9 K) | 1.437 s | 1.174 s | 1.22 | 3.982 | 1.613 |
5000 (1.2 K) | 1.779 s | 1.118 s | 1.59 | 4.888 | 1.850 |
10 K (583) | 1.830 s | 1.069 s | 1.71 | 5.037 | 2.002 |
20 K (291) | 1.828 s | 0.973 s | 1.88 | 5.053 | 2.002 |
50 K (116) | 1.819 s | 0.819 s | 2.22 | 5.061 | 1.849 |
100 K (58) | 1.816 s | 0.726 s | 2.50 | 5.055 | 1.738 |
200 K (29) | 1.820 s | 0.637 s | 2.85 | 5.067 | 1.557 |
500 K (11) | 1.725 s | 0.496 s | 3.48 | 5.063 | 1.430 |
1 M (5) | 1.621 s | 0.424 s | 3.82 | 5.141 | 1.382 |
With 500 lookup keys, the simple version is faster. This is because the overhead of partial sorting is higher than the memory subsystem savings it brings. But as the number of keys grows, the version with partial sorting becomes faster and faster. The speed improvement predominantly comes from improved hardware efficiency, which we can see through the improved CPI numbers. For 1 M partially sorted keys, the improvement is 3.82 times!
Bucket Count
For this experiment, we fix the size of the sorted array to 10 M uint32_t values (38.147 MB) and the lookup array to 100 K uint32_t values (97.65 kB). All of our buckets are similarly sized. Each experiment was repeated 58 times. Here are the numbers:
Bucket Count | Values in a Bucket | Runtime | Speedup Runtime / Runtime Simple | CPI with PS |
---|---|---|---|---|
1 (no partial sorting) | 100K | 1.858 s | 1 | 5.168 |
2 | 50K | 1.576 s | 1.18 | 3.890 |
4 | 25K | 1.288 s | 1.44 | 3.165 |
8 | 12.5K | 1.035 s | 1.80 | 2.616 |
16 | 6250 | 0.963 s | 1.93 | 2.422 |
32 | 3125 | 0.887 s | 2.09 | 2.215 |
64 | 1562 | 0.819 s | 2.27 | 2.030 |
128 | 781 | 0.770 s | 2.41 | 1.881 |
256 | 391 | 0.733 s | 2.53 | 1.750 |
512 | 195 | 0.703 s | 2.64 | 1.625 |
1024 | 98 | 0.698 s | 2.66 | 1.523 |
2048 | 49 | 0.706 s | 2.63 | 1.425 |
4096 | 24 | 0.731 s | 2.54 | 1.319 |
Increasing the number of buckets yields better speed improvements, but only up to a certain point. The best speed improvement was for 1024 buckets; with more buckets, performance slowly degrades.
The problem with a large number of buckets is that the partial sorting itself becomes a memory-bound problem. There is a sweet spot when it comes to the number of buckets – in our case it’s 1024 – but depending on other factors it can be more or less.
Final Words
If you got to this point, congratulations, you are a diligent reader! With this post we close the topic of memory optimizations, having covered everything the author knows about making software faster by better using the memory subsystem. The full list of topics is available here.
This post focused on fringe optimization techniques that we couldn’t fit into any of the other categories, but they might be useful for some people. A small leitmotif that pops up is tradeoffs – between sequential reading and writing, copying and skipping, storing in memory and calculating on the fly. The development of modern computers means that some earlier recommendations about what is better require serious rethinking.
Lookup inversion is interesting as a technique, but I don’t think it will find serious usage. And finally, partial sorting is a relatively cheap and simple way to speed up your code without having to touch the data structure or the algorithm. The price you pay is somewhat increased memory usage, but memory is cheap nowadays and this approach can definitely pay off.