Use explicit data prefetching to process your data structures faster

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

Data structure performance is tightly connected to the way the data structure is accessed. There are two ways to access it: sequentially and randomly. Sequential access is mostly reserved for arrays, and the hardware is really good at it because it can guess which memory location the program is going to access next. But fast algorithms usually achieve their speed through random accesses, and there the hardware cannot do any data prefetching because it cannot easily guess which memory location will be accessed next. This leads to poor performance.

Basic idea

If we want to access an element, let’s say, in a hash map, we calculate its place in the hash map by applying the hash function, and then we access that element. This is random access. The hardware cannot guess which memory location will be accessed next and therefore cannot prefetch it.

If you are lucky enough that the data structure is small and fits completely in the data cache, you should see no slowdowns with random accesses. However, if the data structure is too large to fit in the cache (which typically holds a few megabytes), then we can expect with near certainty that most memory accesses will result in cache misses. To illustrate: the work needed to calculate the hash function of a hash map entry took 20 cycles, but then the CPU is stuck for 300 cycles waiting for the piece of data to be fetched from memory. Can we do better? That depends.

When accessing a large data structure in a random fashion, the CPU is mostly idle because most accesses to the hash-map entry will result in a cache miss.

If we are accessing a single element, the answer is no. But if we are trying to access several elements, then there is a trick that we can use to speed things up. Here it goes.

Modern CPUs offer explicit cache prefetching. What this means is that the program can issue a special prefetch instruction that goes to main memory, fetches a cache line, and loads it into the data cache. All this happens in the background, while the CPU is free to do other things. Once the cache line has been fetched, the CPU can access the piece of data quickly.
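For example, in GCC and Clang the prefetch instruction is exposed as the __builtin_prefetch builtin (MSVC offers _mm_prefetch from <xmmintrin.h>). A minimal, illustrative sketch:

```cpp
#include <cstddef>

// Minimal illustration of issuing an explicit prefetch with the GCC/Clang
// builtin __builtin_prefetch. The array and index here are just an example.
void prefetch_then_use(const int* data, std::size_t index) {
    __builtin_prefetch(&data[index]);  // start fetching the cache line, non-blocking
    // ... do other useful work while the cache line travels from memory ...
    int value = data[index];           // by now, ideally a cache hit
    (void)value;
}
```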

In the case of our hash map, we can split the access to the value into two parts. The first part is calculating the hash function and prefetching the entry. The second part is accessing the entry, doing the comparison, and returning the result.
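As a sketch of this split (the class name, its fixed 1024 buckets, and the method names below are illustrative assumptions, not the implementation used in the experiment later in this post), a chained hash map might expose the two parts like this:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical hash map with separate chaining: each bucket holds a small
// vector of values. The lookup is split into a prefetch part and an access part.
template <typename T>
class prefetching_hash_map {
public:
    // Part 1: compute the hash and start fetching the bucket, without touching it.
    void prefetch_bucket(const T& value) const {
        std::size_t index = hasher_(value) % buckets_.size();
        __builtin_prefetch(&buckets_[index]);   // GCC/Clang builtin
    }

    // Part 2: access the (hopefully already cached) bucket and do the comparison.
    bool contains(const T& value) const {
        std::size_t index = hasher_(value) % buckets_.size();
        for (const T& stored : buckets_[index])
            if (stored == value) return true;
        return false;
    }

private:
    std::vector<std::vector<T>> buckets_ = std::vector<std::vector<T>>(1024);
    std::hash<T> hasher_;
};
```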

Between the first part and the second part, some time needs to pass for the hardware to prefetch the cache line. It is hard (although not impossible) to find a useful thing to do during this short time if we are processing a single element.

But if we are processing a batch of elements, we can do the following: iterate through the batch, do the prefetching part for element X, and while that value is being prefetched, go and do other operations for other elements. What this other operation is depends on how we write our algorithm.

Nano threads

One way to organize the prefetch/access split is nano threads. The idea is to pick a constant, for example, 16. Then we do 16 prefetches for elements N, N+1, …, N+15 followed by 16 accesses for the same elements in the same order.

Nano threads: we first perform prefetches, then we perform accesses

It is important to pick the constant properly. Too small, and the access phase will start before the hardware completes the prefetching. Too large, and some of the prefetched values will be evicted from cache.

The approach is called nano-threads because it resembles instantiating a bunch of threads that all do the same job. Please note that no actual threading is involved; the program runs completely serially.
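A minimal sketch of the nano-threads pattern, built on the hypothetical prefetch_bucket/contains split from the earlier sketch (the batch size of 16 is just the rule-of-thumb starting point discussed below):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Nano-threads sketch: process the keys in batches of kBatch. For each batch,
// first issue all prefetches, then perform all accesses in the same order.
template <typename Map, typename T>
std::vector<bool> find_batch_nanothreads(const Map& map,
                                         const std::vector<T>& keys) {
    constexpr std::size_t kBatch = 16;   // tune experimentally
    std::vector<bool> found(keys.size());
    for (std::size_t start = 0; start < keys.size(); start += kBatch) {
        std::size_t end = std::min(start + kBatch, keys.size());
        for (std::size_t i = start; i < end; ++i)    // phase 1: prefetch
            map.prefetch_bucket(keys[i]);
        for (std::size_t i = start; i < end; ++i)    // phase 2: access
            found[i] = map.contains(keys[i]);
    }
    return found;
}
```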

Alternating prefetch/access

I named the second approach alternating prefetch/access because it does exactly that: it alternates between prefetching for element X and accessing the already prefetched element X-n.

In the alternating prefetch/access approach, the program alternates between prefetching for element X and accessing prefetched element X – n

Again it is important to pick the constant n properly for the same reason as with the previous approach.
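A corresponding sketch of the alternating approach, again assuming the same hypothetical split interface as above:

```cpp
#include <cstddef>
#include <vector>

// Alternating prefetch/access sketch: while accessing element i, prefetch the
// bucket of element i + n, so that by the time we reach it, its cache line is
// (hopefully) already in the data cache.
template <typename Map, typename T>
std::vector<bool> find_batch_alternating(const Map& map,
                                         const std::vector<T>& keys) {
    constexpr std::size_t n = 16;        // lookahead distance, tune experimentally
    std::vector<bool> found(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i + n < keys.size())
            map.prefetch_bucket(keys[i + n]);   // prefetch n elements ahead
        found[i] = map.contains(keys[i]);       // access the current element
    }
    return found;
}
```

Note that the last n elements are simply accessed without a matching prefetch, since there is nothing left to prefetch ahead of them.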

Which approach is better?

There is no clear answer. From the performance perspective, they are similar. From the simplicity point of view, that again depends on the underlying data structure. In our hash map example, the alternating prefetch/access approach was simpler and also faster, but this doesn’t necessarily have to be the case with other data structures.

For both approaches, we need to determine the value of the constant experimentally. A good rule of thumb is to start with 16 and then go up or down until you reach the best performance. Also note that the constant that works best on one system might not be the best on another. In that case, you might want to perform the calibration at runtime.
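As an illustration of such runtime calibration (a rough sketch: the callable you pass in is assumed to wrap your own batched lookup, taking the lookahead or batch constant as a runtime parameter):

```cpp
#include <chrono>
#include <cstddef>

// Runtime calibration sketch: try a few candidate constants on a sample batch
// and keep the fastest one. run_batch is any callable that performs the
// batched lookups with the given constant.
template <typename RunBatchFn>
std::size_t calibrate_lookahead(RunBatchFn run_batch) {
    std::size_t best_n = 16;
    auto best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t n : {4, 8, 16, 32, 64}) {
        auto start = std::chrono::steady_clock::now();
        run_batch(n);                                   // do the batched lookups
        auto elapsed = std::chrono::steady_clock::now() - start;
        if (elapsed < best_time) {
            best_time = elapsed;
            best_n = n;
        }
    }
    return best_n;
}
```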


What have other people already done?

Well, this idea is not new; it has been tried out and found to work. In the talk Nano-coroutines to the Rescue! (Using Coroutines TS, of Course) by G. Nishanov, the author first presents a simple binary search algorithm and then goes into detail on how to transform this algorithm by applying the principles we already discussed.

Given an element to find in a sorted array of size N, it will take approximately log2(N) steps to check if the element is in the array. Most of the checks will result in a cache miss. So the idea is to split each check into two phases: a prefetch phase and an access phase. The program then alternates between prefetching element X and accessing the already prefetched element Y.

The author shows how it is done simply with coroutines, and I suggest you watch the talk, because the concept of coroutines is fun to know and useful in this context, albeit not necessary. The resulting speed-up is what interests us: if the sorted array cannot fit in the last-level cache (which is typically a few megabytes), a plain lookup takes 26 ns. However, with batch processing and prefetching it takes 10 ns per lookup. Not bad.
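For readers who prefer plain loops, here is a rough lock-step sketch of the same idea without coroutines (this is my illustration, not the code from the talk):

```cpp
#include <cstddef>
#include <vector>

// Run many binary searches over the same sorted array at once. In every round,
// prefetch the next probe position of each still-active search, then advance
// every search by one comparison.
std::vector<bool> multi_binary_search(const std::vector<int>& sorted,
                                      const std::vector<int>& needles) {
    struct Search { std::size_t lo, hi; bool done; };
    std::vector<Search> searches(needles.size(), {0, sorted.size(), false});
    std::vector<bool> result(needles.size());

    bool any_active = !needles.empty();
    while (any_active) {
        // Phase 1: prefetch the midpoint of every active search.
        for (const Search& s : searches)
            if (!s.done && s.lo < s.hi)
                __builtin_prefetch(&sorted[s.lo + (s.hi - s.lo) / 2]);
        // Phase 2: advance every active search by one comparison.
        any_active = false;
        for (std::size_t i = 0; i < searches.size(); ++i) {
            Search& s = searches[i];
            if (s.done) continue;
            if (s.lo >= s.hi) { s.done = true; result[i] = false; continue; }
            std::size_t mid = s.lo + (s.hi - s.lo) / 2;
            if (sorted[mid] == needles[i])     { s.done = true; result[i] = true; }
            else if (sorted[mid] < needles[i]) s.lo = mid + 1;
            else                               s.hi = mid;
            if (!s.done) any_active = true;
        }
    }
    return result;
}
```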

Our experiment

To try this out, we ran our own experiment. We created a hash map called fast_hash_map that implements the following methods:

  • find_multiple_simple checks which elements of a std::vector of inputs are present in the hash map, without any prefetching
  • find_multiple_nanothreads checks which elements of a std::vector of inputs are present in the hash map, using prefetching with the nano-threads implementation
  • find_multiple_alternate checks which elements of a std::vector of inputs are present in the hash map, using prefetching with the alternating prefetch/access approach
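The exact class is not reproduced here, but based on the description the interface presumably looks roughly like this (the signatures and the value type are assumptions; only the names and the vector-in/vector-of-bools-out behavior come from the text):

```cpp
#include <vector>

// Assumed shape of the interface used in the experiment: each method takes a
// batch of values and returns, for every value, whether it is in the map.
template <typename T>
class fast_hash_map {
public:
    std::vector<bool> find_multiple_simple(const std::vector<T>& values) const;
    std::vector<bool> find_multiple_nanothreads(const std::vector<T>& values) const;
    std::vector<bool> find_multiple_alternate(const std::vector<T>& values) const;
    // ... plus the usual insert/remove operations.
};
```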

There was a question of how to deal with collisions in our hash map. A hash map entry can hold zero elements (the entry is available), one element (the entry is occupied), or more than one element (there are collisions in the entry). Many collisions slow down access to the hash map, but with a good hash function and rehashing when the hash map becomes too full, there typically shouldn’t be many collisions.

Hash maps are typically implemented using an array. A simple implementation is a data structure where each element of the array is a pointer to another array with one or more elements. This implementation has one major drawback when it comes to cache efficiency: to get to the first value in the entry, the program needs to follow two pointers, which often results in two cache misses!

Simple implementation of hash map. If the entry is occupied, we will need to follow two pointers. The first is to get the entry in the hash map array. If the count of the entry > 0, then we need to follow the second pointer. Following two pointers typically results in two cache misses.

Another, more complicated implementation stores the first element inside the entry itself, and keeps the pointers to the remaining elements in an additional array. The good thing about this approach is that in the typical case (zero or one value in the entry) we have only one cache miss. In the case of collisions, we have two cache misses.

Advanced hash map implementation. We use the additional array only to store collisions. In case of no collisions, this means one cache miss and better speed.
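In code, the difference between the two entry layouts can be sketched like this (names and layout details are illustrative assumptions, not the actual implementation):

```cpp
#include <cstddef>

// Simple layout: the bucket array stores only a pointer, so reaching the first
// value means following two pointers, which typically costs two cache misses.
template <typename T>
struct simple_entry {
    T*          elements;   // heap-allocated array of values in this entry
    std::size_t count;      // 0 = free, 1 = occupied, >1 = collisions
};

// Advanced layout: the first value is stored inline in the entry itself, so the
// common case (zero or one value) costs a single cache miss; the extra pointer
// is followed only when there are collisions.
template <typename T>
struct advanced_entry {
    T           first;      // first value, stored directly in the bucket array
    T*          overflow;   // remaining values, used only on collisions
    std::size_t count;      // 0 = free, 1 = occupied, >1 = collisions
};
```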

The input is given as a vector of values and the result is returned as a vector of bools. So how did it measure up?

We used a hash map with N entries. First we insert 0.7 * N values into it, and then we remove 0.3 * N values from it, so roughly 40% of the entries end up occupied.

Let’s see how the three implementations (STL, simple and advanced) measure against each other when no prefetching is involved.

| No. of entries and iterations | STL std::unordered_set | Simple hash map | Advanced hash map |
|---|---|---|---|
| 32 entries, 2M iterations | 1435 ms | 1824 ms | 1427 ms |
| 1M entries, 64 iterations | 2849 ms | 4161 ms | 2481 ms |
| 64M entries, 1 iteration | 5853 ms | 5838 ms | 3785 ms |
Hash map comparisons for a simple case without prefetching

For small to medium loads, the STL implementation looks good, which is reasonable to expect. The STL is meant for general-purpose programming; it is used by millions of developers, mostly on small to medium loads.

For large loads (64 million entries), the advanced hash map is the best. It is about 35% faster than both the STL and the simple hash map implementation. This example sheds light on the importance of memory layout! By decreasing the number of pointer dereferences from two to one, we achieve a large speedup. This applies to all data structures, not just hash maps. A B* tree is much faster than a red-black tree for the same reason: a B* tree keeps more than one value in each node, so it makes better use of cache lines and is shallower. Too bad that in C++, std::map and std::set are typically implemented using red-black trees “for historical reasons”.

Now, let’s see what happens when the prefetching comes into play. Here are the results for the large load:

| Hash map implementation | No prefetching | Nano-threads approach | Alternating prefetch/access approach |
|---|---|---|---|
| STL std::unordered_set | 5853 ms | n/a | n/a |
| Simple hash map | 5838 ms | 3522 ms | 3356 ms |
| Advanced hash map | 3785 ms | 2257 ms | 2009 ms |
Comparison of prefetching on a large load (64 million entries), with and without prefetching

We went from 5.8 seconds on the regular STL implementation, to 2 seconds with the advanced hash map with prefetching. A very good result. But this is the large load. What happens to small and medium loads?

| Hash map implementation | No prefetching | Nano-threads approach | Alternating prefetch/access approach |
|---|---|---|---|
| STL std::unordered_set | 1435 ms | n/a | n/a |
| Simple hash map | 1824 ms | 1910 ms | 2376 ms |
| Advanced hash map | 1427 ms | 1514 ms | 2015 ms |
Comparison of prefetching on a small load (32 entries, 2M iterations), with and without prefetching

Measuring small loads is always difficult. The batch processing algorithms need to allocate memory from the system (to store the results), and this can influence the measurement results, but we tried to make all algorithms do the same thing for the sake of fair measurements. We know from one of the previous posts that the prefetch instruction actually slows down the program when the data is already in the data cache (which is the case for a small load). So, on a small load, this optimization is actually a slowdown.

| Hash map implementation | No prefetching | Nano-threads approach | Alternating prefetch/access approach |
|---|---|---|---|
| STL std::unordered_set | 2849 ms | n/a | n/a |
| Simple hash map | 4161 ms | 2409 ms | 2481 ms |
| Advanced hash map | 3184 ms | 1731 ms | 1854 ms |
Comparison of prefetching on a medium load (1M entries, 64 iterations), with and without prefetching

For the medium load, prefetching makes sense. The gains are not as large as for the large load, but the numbers are better than with no prefetching.

As you can see, the larger the load, the more sense prefetching makes.

Final Words

This is not a technique that you will see every day, and it is not meant for everyday use. But when performance is important and the load is large, this technique can achieve great results with little effort. Keeping the CPU busy while it would otherwise be waiting for data from memory is an important way to speed up your program.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
