On this blog we write a lot about software performance, and in all our previous posts the focus was on making a piece of software run faster. In contrast, in this post we try to make our software as frugal as possible: we are not interested in top speed, but in saving memory subsystem bandwidth. This can mean several things:
- Decrease the amount of data exchanged between the memory and the CPU.
- Decrease the data cache requirements (the amount of data that needs to be stored in the cache).
- Decrease the required memory throughput (e.g. instead of 10 GB/s, the program might require 1 GB/s).
This is important for programs that run in a multithreaded environment. In a multithreaded environment, if you have two memory subsystem intensive processes A and B, you can make A faster by making B consume fewer shared memory resources: last-level cache space and memory bandwidth. When B uses too many of these resources, it is often called a noisy neighbor.
So, this post explains techniques you can use to make your programs or loops better neighbors. In essence, there are two approaches: CPU policies and modifying the program.
CPU Policies
Some CPUs, and notably server CPUs, allow you to:
- Reserve a part of the LLC per CPU core, per application, virtual machine or container.
- Limit the LLC bandwidth and memory bandwidth per CPU core, application, virtual machine or container.
These configuration options fall under Quality of Service. For example, Intel’s server CPUs offer Resource Director Technology, which allows you to monitor and limit the memory bandwidth usage, cache bandwidth usage, cache size, etc. per application, virtual machine or container. AMD server chips offer a similar technology called Platform Quality of Service Extensions.
In an ideal world, the best way to deal with noisy neighbors is to use these technologies to limit the resources allocated to them. If you have them available, feel free to stop reading now, go to Intel’s or AMD’s web site and follow the instructions there.
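As a rough illustration only (assuming a Linux system where the `resctrl` filesystem is mounted and the CPU supports memory bandwidth allocation; the group name and the 20% limit below are made-up example values), confining a noisy process could look roughly like this:

```cpp
#include <fstream>
#include <string>

// Sketch: limit a noisy process using the Linux resctrl interface.
// Assumes `mount -t resctrl resctrl /sys/fs/resctrl` has already been done,
// and that the "noisy" resource group directory has been created with mkdir.
void throttle_noisy_process(int pid) {
    // Allow the group at most ~20% of the memory bandwidth on socket 0.
    std::ofstream("/sys/fs/resctrl/noisy/schemata") << "MB:0=20\n";
    // Move the noisy process into the group.
    std::ofstream("/sys/fs/resctrl/noisy/tasks") << std::to_string(pid) << "\n";
}
```

The vendor documentation describes the exact schemata format and the capabilities available on your particular CPU; treat the above only as a hint of what the mechanism looks like.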
Modifying the Program
Sometimes you might need to take the longer route and actually modify the program to make it a better neighbor. For instance, the quality of service options might not be available on the CPU you are using. Or you might be delivering a library and want to make sure that it doesn’t consume too much of the CPU’s valuable resources.
If your goal is to decrease the pressure on memory resources, then modifying your program is a much more difficult thing to do, especially because CPUs and compilers don’t provide primitives for intentionally slowing down or limiting your program. Remember, programs were always supposed to run fast, regardless of how many resources they consume.
Techniques
Here we present techniques you can use to make your program a friendlier neighbor. Of course, everything we say here needs to be applied to the hot loops in your program, and to all of them! Only when all your hot loops are memory-resource frugal is the whole program guaranteed to be memory-resource frugal. Also, a very important note: many techniques presented here will make your program slower.
Limiting the cache pollution
Cache pollution happens when the CPU brings in data from the memory that the program doesn’t use. There are essentially three types of cache pollution:
- A cache line is brought from the memory to the CPU, but only a part of the data in the cache line is actually accessed before the cache line is evicted. E.g. the cache line size is 64 bytes, but the program consumes only 16 bytes, because that is the size of the class.
- A cache line is brought from the memory to the CPU, but the program never accesses any data in the cache line.
- A cache line is brought from the memory to the CPU, but it is accessed only once and never again. The result is that some useful data is evicted from the cache.
Being a good neighbor means using the memory resources optimally, and many posts on this blog are dedicated to this. With regards to (1), many techniques (e.g. loop interchange, loop tiling, decreasing class size, switching to Struct of Arrays, etc.) can be used to make the program more efficient; we won’t discuss them here. With regards to (2), the biggest culprit is branch speculation, and this is the topic of the next section. And with regards to (3), this typically happens because the working set is too large, and therefore it is never reused.
Cache Pollution due to Speculation
With regards to (2), the problem of bringing data to the CPU that the program never accesses is related to speculation. To explain what is going on, consider the following example:
```cpp
for (int i = 0; i < n; ++i) {
    value_exist[i] = false;

    size_t hash = get_hash(value[i]);
    if (hashset_bucket_full[hash]) {
        value_exist[i] = (value[i] == hashset_bucket_value[hash]);
    }
}
```
The above code is a very simple hash set lookup (no collisions). The values to look up are given in the array `value`. The program calculates the hash for each value on line 4 (`hash = get_hash(...)`). If the hash set bucket is full (`hashset_bucket_full[hash]`), we compare the value in the bucket to the lookup value to determine if the value is present (line 6).
Let’s assume the value `hashset_bucket_full[hash]` is not in the data cache and needs to be fetched from the main memory. But the CPU doesn’t need to wait for this value to become available before it starts loading `hashset_bucket_value[hash]` from memory. It can start speculatively executing the body of the if condition; if the speculation turns out to be correct, `hashset_bucket_value[hash]` will already be loaded.
Only when `hashset_bucket_full[hash]` arrives at the CPU does the CPU know for sure whether `hashset_bucket_value[hash]` is needed at all. Because of the speculation, there is a chance that the CPU has brought this value from the memory without need. And thus cache pollution!
In this case, a solution would be to use a speculation barrier1. When the developer puts a speculation barrier, the CPU will pause the program’s execution until all needed data is fetched or calculated, so there is no need for speculation. Sadly, there is no dedicated speculation barrier instruction, but there are a few alternatives. On X86-64, one would use the `_mm_lfence()` intrinsic, called a load fence, a subtype of memory fence. According to the documentation for `lfence`:
Perform a serializing operation on all load-from-memory instructions that were issued prior to this instruction. Guarantees that every load instruction that precedes, in program order, is globally visible before any load instruction which follows the fence in program order.
In simple words, all loads before the fence must be completed before any loads after the fence start executing2. So, the modified code would look like this:
```cpp
for (int i = 0; i < n; ++i) {
    value_exist[i] = false;

    size_t hash = get_hash(value[i]);
    if (hashset_bucket_full[hash]) {
        _mm_lfence();
        value_exist[i] = (value[i] == hashset_bucket_value[hash]);
    }
}
```
Important Note
This is the simplest way to stop speculative execution, but not the only way. Consider the following snippet of binary search:
```cpp
int binary_search(int* array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
        if (array[mid] < key)
            low = mid + 1;
        else if (array[mid] > key)
            high = mid - 1;
        else
            return mid;
    }
    return -1;
}
```
The value that the CPU speculates on in this example is `array[mid]`. But the value of `mid` depends on `low` and `high`, and the values of `low` and `high` can be speculatively calculated in the previous iteration. If the data is truly random, these speculations will often fail and the CPU will trash the caches. A way to fix this without fences is through branchless code:
```cpp
int binary_search_branchless(int* array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
        low = SELECT(array[mid] < key, mid + 1, low);
        high = SELECT(array[mid] > key, mid - 1, high);
        if (array[mid] == key) {
            return mid;
        }
    }
    return -1;
}
```
Using `SELECT`3 (a conditional move) instead of the condition creates a data dependency, and the CPU cannot speculate on what values will be in `low` or `high`. This halts the calculation of `mid`, and with it the fetching of `array[mid]`.
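To illustrate, here is a minimal sketch of how `SELECT` could be implemented with inline assembly on X86-64 (this is our own illustration, not necessarily the implementation from the repository):

```cpp
// A sketch of a branchless SELECT using GCC/Clang inline assembly on x86-64.
// The cmovne instruction copies if_true into result only when cond is nonzero;
// because the choice travels through a register instead of a branch, the
// compiler cannot turn it back into a jump and the CPU has nothing to speculate on.
static inline int SELECT(bool cond, int if_true, int if_false) {
    int result = if_false;
    __asm__("testb  %b[c], %b[c]\n\t"   // set flags from the condition byte
            "cmovne %[t], %[r]"         // result = if_true if cond != 0
            : [r] "+r"(result)
            : [c] "q"(cond), [t] "r"(if_true)
            : "cc");
    return result;
}
```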
Cache Pollution Because the Data is Accessed Only Once
If your algorithm works with large blocks of data, loading a block will evict other useful data from the cache, while the block itself is never reused. Again, this pollutes the caches for other processes running on the system.
A common way to fight this is to employ cache-bypass techniques. Probably all hardware architectures offer ways to bypass the cache. On X86-64 systems, there are two ways to do it:
- Non-temporal memory accesses. These are special types of memory accesses that bypass the cache. The data, instead of being loaded into caches, is loaded into special non-temporal buffers on the CPU.
- Non-temporal prefetches. Another approach is, before issuing a memory access, to issue a non-temporal prefetch instruction. This instruction loads the data into the cache, but marks it as the least recently used instead of the most recently used, so this piece of data will be the first to be evicted when space is needed.
Non-temporal Stores
From this point on, we focus on the X86-64 architecture. There are two types of non-temporal memory accesses: loads and stores. Unfortunately, non-temporal loads are available only for write-combine memory, which is mostly reserved for working with memory-mapped devices and typically not available in userspace4. So, for our purpose, non-temporal loads are not useful, and we won’t discuss them further.
Non-temporal stores work as expected, but with a caveat: in order for them to be efficient, you must write data in contiguous blocks. So, in the loop `for (i = 0; i < n; ++i) a[2*i] = i;`, NT stores would not be fast, as it writes every other member of array `a`, and therefore the writes are not contiguous.
Non-temporal stores are available only through compiler intrinsics. There are a few of them; for example, `_mm256_stream_si256` will store 32 bytes of data to the memory without polluting the cache. Here is an example of a `memcpy` implementation that writes data using non-temporal stores:
```cpp
void memcpy(int* dst, const int* src, size_t cnt) {
    for (size_t i = 0; i < cnt; i += 8) {
        __m256i val = _mm256_lddqu_si256((const __m256i*)(src + i));
        _mm256_stream_si256((__m256i*)(dst + i), val);
    }
}
```
This code uses two intrinsics: `_mm256_lddqu_si256` to load 32 bytes of data into a vector register, and `_mm256_stream_si256` to store the data to the destination without polluting the cache.
Non-temporal Prefetches
Since non-temporal loads are not available, let’s investigate the alternative: non-temporal prefetches. A non-temporal prefetch issued before a load or a store instruction will load the cache line into the cache, but mark it as least recently used, and therefore first to be evicted. To use them, you first issue a non-temporal prefetch, and immediately after it you insert the load or the store.
On GCC and CLANG, prefetches are available through the builtin `__builtin_prefetch(addr, rw, locality)`. The `rw` parameter is 0 if you are just loading the data without modification, or 1 if you are also modifying it. The `locality` parameter specifies the type of prefetch; the value 0 is reserved for non-temporal prefetches. Alternatively, you can use the compiler intrinsic `_mm_prefetch(addr, hint)`, and for `hint` use `_MM_HINT_NTA`.
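For illustration, here is a minimal sketch of a loop that reads a large array exactly once, using the builtin form with the non-temporal hint (the function name and the prefetch distance are our own choices):

```cpp
#include <cstddef>
#include <cstdint>

// Sums a large array that is read only once. Each cache line is prefetched with
// locality = 0 (non-temporal), so it is marked as least recently used and gets
// evicted first, polluting the cache as little as possible.
uint64_t sum_once(const uint64_t* data, size_t cnt) {
    uint64_t sum = 0;
    for (size_t i = 0; i < cnt; ++i) {
        __builtin_prefetch(data + i + 8, 0, 0);  // rw = 0 (read), locality = 0 (NTA)
        sum += data[i];
    }
    return sum;
}
```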
So, the `memcpy` example that minimally pollutes the data caches would look like this:
```cpp
void memcpy(int* dst, const int* src, size_t cnt) {
    for (size_t i = 0; i < cnt; i += 8) {
        _mm_prefetch((const char*)(src + i), _MM_HINT_NTA);
        __m256i val = _mm256_lddqu_si256((const __m256i*)(src + i));
        _mm256_storeu_si256((__m256i*)(dst + i), val);
    }
}
```
Slowing Down the Program
Another way to decrease the pressure on the memory subsystem is to slow down the hot loop. The simplest way to slow down a program is to decrease the clock speed of the CPU core it is running on. This automatically decreases the memory bandwidth (in bytes/second) that the program consumes, and also saves power.
Slowing down the program can also be done in the program itself. Consider the `memcpy` example (this time written using pseudoassembly):
```cpp
for (int i = 0; i < n; i++) {
    val = LOAD(src + i);
    STORE(dest + i, val);
}
```
Inside this loop there is a data dependency: the `STORE` depends on the `LOAD`. But many `LOAD`s can execute in parallel, because there are no loop-carried dependencies. A modern out-of-order CPU will do just that: it will execute as many loads as possible, as long as there are enough resources on the chip. So, e.g., it could have 10 pending loads, coupled with 10 pending stores.
We can decrease the pressure on the memory subsystem by forcing the CPU to execute loads/stores one by one: do not start the next load/store until the current one has finished. There are two ways to do this: (1) memory fences and (2) fake dependencies.
Memory Fences
A load/store memory fence means that the CPU needs to wait until all loads/stores in front of the fence complete, before it can start executing loads/stores after the fence. We can use memory fences to force serialization of loads and/or stores and thus make the program slower.
A very common pattern in loops is to load some data, do some processing on it, and store the results. Using a load fence is enough to serialize both loads and stores. So, serializing `memcpy` using a memory fence:
```cpp
for (int i = 0; i < n; ++i) {
    uint64_t val = src[i];
    _mm_lfence();
    dst[i] = val;
}
```
If the loop doesn’t have any loads, only stores, you can use a store fence instead. Example of `memset`:
```cpp
for (int i = 0; i < n; ++i) {
    dst[i] = v;
    _mm_sfence();
}
```
Fake Loop-Carried Dependencies
A loop that processes a linked list doesn’t profit much from out-of-order CPUs. In order to calculate the address of the next element in the list, the CPU must have processed the current element, which creates a loop-carried dependency.
A way to slow down a loop is to introduce a fake loop-carried dependency. The simplest way to achieve this is to “fake” that the array is in fact a linked list. Here we rewrite `memcpy` by introducing a fake dependency:
```cpp
uint64_t* src_ptr = src;
for (int i = 0; i < n; ++i) {
    uint64_t src_val = *src_ptr;
    dst[i] = src_val;
    src_ptr += ALWAYS_ZERO(src_val) + 1;
}
```
We converted an array access into a linked list access. The key component is the `ALWAYS_ZERO` function. It is a function that depends on `src_val`, but it always returns 0. But keep quiet about it: only we know that, the CPU doesn’t, so it will need to calculate the result in every iteration.
The macro `ALWAYS_ZERO` can be very simple, e.g. `src_val - src_val`. But it needs to be implemented in inline assembly, otherwise the compiler will optimize it out. An example implementation is available in our repository.
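For illustration, here is a minimal sketch of how such a helper could look (our own sketch, assuming GCC/Clang inline assembly; the actual implementation in the repository may differ):

```cpp
#include <cstdint>

// Computes x - x, but inside an asm block that the compiler treats as an opaque
// computation on x. It cannot fold the result to a constant 0, so the CPU has
// to wait for x (the loaded value) before it can compute the next pointer.
static inline uint64_t ALWAYS_ZERO(uint64_t x) {
    uint64_t zero = x;
    __asm__("subq %1, %0" : "+r"(zero) : "r"(x) : "cc");
    return zero;  // always 0, but only the programmer knows that
}
```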
Both memory fences and fake dependencies serialize loads and/or stores, forcing them to execute one after another. Fake dependencies, however, do not stop speculation, only memory fences do that.
Experiments
To measure the effect of the proposed techniques, we need to experiment. For all experiments, we use an Intel(R) Core(TM) i5-10210U CPU with 6 MB of last-level (shared) cache. We run with frequency scaling disabled; otherwise, when we switch from one thread to two threads, the CPU lowers the frequency even if the second thread is just a loop of NOPs.
Source code for all the tests is in our repository.
Binary Search Experiment
The first experiment is the binary search experiment. In this experiment, we run two threads. The first thread runs a simple binary search. We call this the important thread, and ideally we want to see the smallest possible decrease in its speed while the second thread is running. The second, unimportant thread runs three different flavors of binary search:
- Simple binary search: the same code as on the important thread.
- Branchless binary search: similar to the previous, but we use branchless code instead of branches.
- Branchless binary search with non-temporal prefetches: similar to the previous, but we also add non-temporal prefetches to limit cache pollution.
The size of the workload is the second parameter of our experiment. We have a total of five workload sizes: 1.5 MB, 3 MB, 6 MB, 12 MB and 24 MB. This is the size of the sorted array for both the important and the unimportant binary search.
We measure performance in lookups per second. Here are the performance numbers for the important thread, depending on the type of binary search running on the unimportant thread and workload size.
The performance drop is largest when the workload size is 6 MB. This happens because the size of the LL cache available to the important thread is effectively halved. This is also the point where our interventions on the unimportant thread, NT prefetches and branchless code, will result in the highest speed improvement on the important thread.
We also want to check what happens to the performance of the unimportant thread once we apply the interventions. Again, here are the numbers:
When the workload size is small, the branchless version is even a bit faster. This is because the branchless version avoids speculation, which is costly when the CPU makes a bad choice (and this happens often with sorting or lookup-type problems). In all other cases, making the code branchless results in a binary search slowdown, and adding NT prefetches slows it down even more. Whether the slowdown is acceptable or not is up to you to decide.
Memcpy Experiment
We perform a similar experiment as before. For the important thread we use the same binary search as in the previous example. For the unimportant thread we use `memcpy`. We wrote several versions of `memcpy` using AVX2 intrinsics:
- Simple memcpy: a basic version that uses regular loads and regular stores
- Memcpy with a fake dependency: a version of memcpy where we injected a fake loop-carried dependency, as described in the section about fake dependencies.
- Memcpy with a memory fence: a version of memcpy where we use a load fence.
- Memcpy with streaming stores (SS): a version of memcpy which uses streaming stores to avoid data cache pollution.
- Memcpy with non-temporal (NT) prefetches: a version of memcpy with NT prefetches used to avoid data cache pollution.
- Memcpy with NT prefetches and SS.
- Memcpy with NT prefetches, SS and a memory fence.
The second parameter is the workload size: 1.5 MB, 3 MB, 6 MB, 12 MB or 24 MB. For the important thread, the meaning is the same as in the previous experiment: it is the size of the sorted array. For the unimportant (`memcpy`) thread, it is the size of the source and destination buffers. Here is the throughput of the important thread:
Again, the slowdown is biggest when the workload size is 6 MB, which corresponds to the case when the LL cache capacity isn’t enough to serve both threads.
Memory fences and fake dependencies definitely fix the problem of the important thread slowing down. Non-temporal prefetches and streaming stores also help, but the fences and dependencies do better.
But the second question is: what happens to the unimportant thread? How does `memcpy` performance change depending on the fix we implemented? Again, the numbers:
Fences and fake dependencies “kill” the performance of `memcpy`. `memcpy` is a highly parallel operation with a lot of instruction-level parallelism (ILP) in it. Introducing fences or fake dependencies removes all available ILP, and for this reason we see such a huge drop in performance.
With regards to streaming stores and non-temporal prefetches, in our case with relatively large working set sizes, we even see an increase in performance compared to the baseline version.
Conclusion
When making your app frugal, the first line of action is to improve cache line utilization. Remember that data is brought from the memory to the cache in chunks of (typically) 64 bytes, and ideally the program should consume all of it. There are many techniques to achieve this, e.g. converting to SoA or using N-ary trees instead of binary trees (we plan to write about them in the future, so stay tuned).
The good thing about this approach is that it both decreases the required memory subsystem bandwidth and makes your program faster. The bad thing is that there is a limit on how memory efficient you can make your code. And this is the point where you should try out the techniques described in this post.
The next step in making your program a better neighbor is to avoid unnecessary cache pollution due to errors in branch prediction. Of course, this makes sense only if (1) the branch is not easily predicted and (2) going branchless does not result in additional loads or stores.
If a program is working with large blocks of memory, non-temporal prefetches and streaming stores will result both in improved performance and in less cache pollution. But if the program is working with small blocks of memory, NT prefetches and streaming stores will decrease cache pollution at the expense of increased activity on the memory bus. Since the memory is slower than the caches, the program using NT prefetches and streaming stores will also become slower.
Until this point, all the proposed techniques were conservative. The next techniques would be considered controversial because of the resulting slowdown. Fake dependencies will cripple the program’s performance, especially if the loop in question has a lot of ILP. The good thing about fake dependencies is that their effect is limited to the core the program is running on.
And finally, memory fences will slow down the program even more, essentially disabling all out-of-order potential of the CPU and stopping all speculation. A disadvantage of memory fences, not seen with fake dependencies, is that they can cause synchronization between caches on different cores. Therefore, you will need to carefully weigh the pros and cons of using fences.
- C++ Developer Guidance for Speculative Execution Side Channels [↩]
- It also guarantees that these loads are visible to other CPU cores, but this is not important to us [↩]
- To guarantee branchlessness, `SELECT` needs to be implemented through inline assembly or compiler intrinsics (blend instructions) [↩]
- Although there are techniques to allocate write-combine memory in userspace through special kernel drivers. [↩]