The memory subsystem from the viewpoint of software: how the memory subsystem affects software performance 2/3

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

In the previous post we started a series of experiments to see how the memory subsystem affects software performance. In a certain sense, that post is a prerequisite for this one. There we experimented with the different levels of cache, memory access patterns, load vs store performance, as well as arithmetic intensity. In this post, we experiment with the effect of cache line size, hiding memory latency, the TLB cache and large pages, cache conflicts, vectorization and branch prediction.

The Experiments

Cache Line

As mentioned in the previous post, data is fetched from the memory in chunks corresponding to the cache line size. Typically this is 64 bytes, but it can be smaller (embedded systems) or larger (high-performance systems). If the chunk size is N, then chunks are aligned at N-byte boundaries.

When a chunk of memory is brought from the memory into the caches, access to any byte inside the chunk will generally be fast. Access to a byte outside the chunk will require another memory transfer.

To test the effect of cache line size, we use a struct called test_struct and two small kernels:

template <int CacheLineSize = CACHE_LINE_SIZE>
struct test_struct {
    unsigned char first_byte;
    unsigned char padding[CacheLineSize - 2];
    unsigned char last_byte;
};

void initialize_struct(test_struct<CACHE_LINE_SIZE> * p, int n) {
    for (int i = 0; i < n; i++) {
        p[i].first_byte = i;
        p[i].last_byte = n - i;
    }
}

int sum_struct(test_struct<CACHE_LINE_SIZE> * p, int n, std::vector<int>& index_array) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        int index = index_array[i];
        sum += p[index].first_byte + p[index].last_byte;
    }

    return sum;
}

The test_struct struct is peculiar. Its size is the size of the cache line, which is 64 bytes on all our architectures. There are two important members that we use: first_byte and last_byte. Between them, there is padding to make the struct exactly 64 bytes in size. If the struct is aligned at 64 bytes, then first_byte and last_byte will be in the same cache line. For any other alignment, first_byte and last_byte will be in different cache lines.
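
These layout properties can be verified at compile time. Here is a hypothetical sanity check, not part of the original code:

#include <cstddef>  // offsetof

static_assert(sizeof(test_struct<64>) == 64,
              "the struct must occupy exactly one 64-byte cache line");
static_assert(offsetof(test_struct<64>, last_byte) == 63,
              "last_byte must be the very last byte of the struct");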

The test_struct values are stored in a vector (array). There are two functions we test. The first one, initialize_struct, goes through the array of structs sequentially and initializes all the fields. The second one, sum_struct, performs random accesses into the array of structs using the indices from index_array. We will see that the memory access pattern (sequential vs random) is very important for performance.

For the testing we use two arrays. The first one is aligned at 64 bytes. The second one is misaligned from 64 bytes by one byte. Again, we test for various array sizes while keeping the number of iterations constant: 256M iterations. Here are the runtimes for the first function initialize_struct that traverses the array sequentially:

The server and the desktop system don’t care if the data is aligned or unaligned; the runtimes are the same. For the embedded system, there is a critical range of array sizes where the runtimes differ, between 32K elements (2MB) and 2M elements (128MB).

Here are the runtimes for the second function sum_struct which traverses the array randomly:

In all three architectures, we observe the same behavior. Once the dataset doesn’t fit the data caches, the aligned version becomes faster.

The question is: why is the impact higher for random accesses than for sequential accesses? With sequential accesses, even when a piece of data is split between two cache lines, there is some cache line reuse: although p[i].first_byte and p[i].last_byte don’t share the same cache line, p[i].last_byte and p[i+1].first_byte do. This is not the case for random memory accesses.

Unaligned accesses also increase the amount of data that the system needs to transfer from the memory to the CPU. Of course, this only applies to random accesses and the sum_struct function. For initialize_struct, the amount of data transferred is the same whether the array is aligned or not.

Here is the data volume for the desktop and the embedded system:

Once the dataset doesn’t fit the data caches, the total data volume is almost two times higher for the unaligned version compared to the aligned version.

Cache line size needs to be taken into account when designing data structures for efficient random memory access, such as trees, hash maps or linked lists. Ideally, the size of a single element of the data structure should be a multiple of the cache line size and the element should be properly aligned. This ensures that a single element is never split across two cache lines, which improves speed and decreases the total data volume exchanged between the memory and the CPU.
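
For example, a node type designed with this in mind could look like the following sketch (a hypothetical node, not taken from the measurements above):

// Hypothetical tree node: alignas(64) rounds the size up to one cache line and
// guarantees cache-line alignment, so a node never straddles two cache lines.
struct alignas(64) TreeNode {
    long key;
    long value;
    TreeNode* left;
    TreeNode* right;
};
static_assert(sizeof(TreeNode) == 64, "expect exactly one cache line per node");
static_assert(alignof(TreeNode) == 64, "expect cache-line alignment");

Note that heap allocations respect such over-alignment only since C++17 (operator new with std::align_val_t); older code needs an explicitly aligned allocator.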


Hiding Memory Latency

The RAM memory, as well as each individual cache in the memory hierarchy, has two important properties:

  • Memory throughput, which measures the amount of data that the CPU can get from the memory/LLC/L2/L1 cache per unit of time. Typically, it is measured in MB/s or GB/s.
  • Memory latency, which measures the time that passes between the CPU issuing a memory access request and the memory/LLC/L2/L1 cache responding with the data. Typically, it is measured in nanoseconds or cycles.

The story is not this simple, however. Because of its out-of-order nature, the CPU can keep several pending memory accesses (loads or stores). And the memory can process several interleaved load requests. The CPU can therefore issue several load/store requests in parallel. The effect of this is “hiding” memory latency. The latency might be long, but since the CPU is issuing many requests in parallel, the effect of memory latency will be hidden.

To explain it more visually, let’s use an analogy with a transport company. Imagine you have a transport company with only one truck. You want to transport your goods to a city that is three hours away; the latency is therefore three hours. Since the truck also needs to drive back, you can deliver only one truckload every six hours. Now imagine your company has six trucks: you can dispatch a new truck every hour and deliver six truckloads in the same six hours. Although the latency is still three hours, having six trucks in flight effectively hides it.

Back to our story. With regard to memory latency, there is a software limitation on how much latency the CPU can hide. If a piece of code is essentially following a chain of pointers, that is, it dereferences the current pointer to get the address of the next one, there are very few loads/stores that the CPU can issue in parallel. It needs to wait for the current load to finish in order to issue the next load. We say such code has low instruction level parallelism; we dealt with this topic in this post.

Following a chain of pointers is not that uncommon. E.g. traversing a linked list, a binary tree or a hash map with collisions in the bucket results in following a chain of pointers. One of the ways to increase the available instruction level parallelism is by interleaving additional work, e.g. by following two chains of pointers or following one chain of pointers but doing more work for each pointer in the chain.

To illustrate this effect, we take a kernel that traverses several linked lists at once. The sizes of the linked lists vary between 1K elements (16kB) and 64M elements (1GB), and the number of traversed linked lists varies between 1 and 11. Here is the kernel’s source code:

double calculate_min_sum(std::vector<elem_t*> heads) {
    std::vector<elem_t*> currents = heads;
    std::vector<double> sums(heads.size(), 0.0);
    int list_cnt = heads.size();

    while (currents[0] != nullptr) {
        for (int i = 0; i < list_cnt; i++) {
            sums[i] += i * currents[i]->value;
            currents[i] = currents[i]->next;
        }
    }

    double min = sums[0];
    for (int i = 0; i < list_cnt; i++) {
        min = std::min(sums[i], min);
    }

    return min;
}
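
The elem_t type is not shown here; a plausible definition, consistent with the quoted element size of 16 bytes (1K elements = 16kB, 64M elements = 1GB), would be:

// Hypothetical list element: 8 bytes of payload plus an 8-byte next pointer.
struct elem_t {
    double value;
    elem_t* next;
};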

We use the time needed to traverse a single linked list as a baseline. So, if there were no instruction level parallelism available, the relative runtime would be 1 for one linked list, 2 for two linked lists, etc. However, with out-of-order execution, we will see smaller runtime ratios.

Here are the relative times for the desktop system depending on the number of linked lists traversed:

For most linked list sizes, the slowdown is not proportional to the number of linked lists. The reason is the one already explained: memory latency hiding. When traversing 11 linked lists in parallel, the smallest slowdown is 3.64 (list size 16K elements, or 256kB) and the largest slowdown is 12.2 (list size 256K elements, or 4MB). The size of the L3 cache is 6MB, so one such linked list completely fits in the L3 cache, but two don’t. This probably explains the largest slowdown; misses in the L3 cache are the most costly.

Here is the graph for the server system:

A similar effect is observed with the server system, which is also good at hiding latency. In fact, compared to the desktop system, where the largest slowdown was 12.2 for 11 parallel linked lists, here the largest slowdown is only 4.9, for linked lists with 512K elements (8MB in size). This CPU has 22MB of cache, so the largest slowdown again corresponds to exhausting the L3 cache.

Lastly, here is the same diagram for the embedded system:

The graph, however, paints a different picture for the embedded system. The worst slowdown is 20.4, for the kernel with 10 linked lists of 32K elements (512kB) each. This system, although it has an out-of-order CPU, is nowhere near as good at hiding memory latency as the other two systems. The limitation lies in both the CPU and the memory subsystem: the CPU can hold fewer pending loads/stores, and the memory subsystem offers only limited parallelism.

Of all the experiments so far, this one shows the largest differences between the three systems. Whereas the server system has a lot of hardware helping it hide memory latency, the situation is worse with the desktop system and much worse with the embedded system1.

TLB cache and large pages

TLB caches are small caches that speed up translation from virtual to physical addresses. There is one entry in the TLB cache for each memory page, which is most often 4kB in size. The capacity of this cache is quite limited, so if the dataset is larger than the memory the TLB can cover, this will result in a performance penalty.

To give an illustration, on the desktop system under test, the size of this cache is 64 entries for the L1 TLB (4-way associative) and 1536 entries for the L2 TLB (6-way associative). So, the L1 TLB can hold translation information for a block of memory 256kB in size, and the L2 TLB for a block of memory 6MB in size. This is quite small.

If the code is doing random memory accesses in a block of data larger than 256kB, it will experience L1 TLB cache misses. If the block of data is larger than 6MB, the program will experience L2 TLB cache misses.

Problems with TLB cache misses really depend on the memory access pattern and the data set size. Sequential and small-stride access patterns rarely cause problems with TLB cache misses. On the other hand, random memory accesses on large datasets are very prone to TLB cache misses and performance losses.

Most modern CPUs support large memory pages, also known as huge memory pages. Instead of a memory page which is 4kB in size, there are larger pages, e.g. 2MB or 1GB in size. Using larger pages decreases the pressure on the TLB caches. To illustrate this, we use the same kernel from the memory access pattern section of the previous post:

int calculate_sum(int* data, int* indexes, int len) {
    int sum = 0;
    for (int i = 0; i < len; i++) {
        sum += data[indexes[i]];
    }
    return sum;
}

The values in the indexes array are chosen so as to create sequential, strided or random memory accesses. For the testing we use an array of 64M ints. We run the kernel with standard pages (4kB) and large pages (2MB). Here are the results for the desktop system:

The graph looks very interesting. The difference in speed is very obvious for random memory accesses (a speedup of 2.8x) and doesn’t exist at all for the other access types.

Unfortunately, we couldn’t run the experiment on the other two systems, because large pages require a special system configuration, but we don’t expect the results to differ for this kernel.
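
For reference, here is a minimal sketch of one way to get a 2MB-page-backed buffer on Linux, using mmap with MAP_HUGETLB. This assumes huge pages have been reserved by the administrator (e.g. via /proc/sys/vm/nr_hugepages) and is not necessarily the mechanism used in our measurements:

#include <sys/mman.h>
#include <cstddef>
#include <cstdio>

// Allocate `size` bytes backed by huge pages; fall back to regular pages on failure.
void* allocate_huge(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap(MAP_HUGETLB) failed, falling back to regular pages");
        p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p;
}

Alternatively, transparent huge pages can be requested for an ordinary anonymous mapping with madvise(p, size, MADV_HUGEPAGE).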


Cache Conflicts

Cache conflicts happen when two data blocks map to the same place in the cache. Each data block has a specific cache line where it can be stored, determined by a group of bits in the block’s memory address. If that cache line is already occupied, the previously stored data needs to be evicted from the data cache back to memory.

Cache conflicts typically appear when the stride of the memory access pattern is a power of two: 256, 512, 1024, etc. They often happen when processing matrices whose dimensions are a power of two, but they can also happen when processing classes whose size is a power of two, or when processing two or more arrays whose start addresses differ by a power of two.
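
To make this concrete, here is a small sketch of the address-to-cache mapping, assuming a hypothetical 32kB, 8-way set-associative L1 cache with 64-byte lines (the exact geometry varies per CPU):

#include <cstdint>
#include <cstddef>

// Hypothetical L1 geometry: 32kB / (8 ways * 64-byte lines) = 64 sets.
constexpr std::size_t kLineSize = 64;
constexpr std::size_t kNumSets  = 64;

// The set a given address maps to is picked by the address bits just above
// the line offset.
std::size_t cache_set(std::uintptr_t addr) {
    return (addr / kLineSize) % kNumSets;
}

// Addresses that differ by a multiple of kLineSize * kNumSets (here 4096 bytes)
// map to the same set. A stride of 4096 bytes therefore hammers a single set,
// and after 8 accesses (the associativity) lines start being evicted.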

To illustrate the performance penalty of cache conflicts, we use the matrix multiplication example:

for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        c[i][j] = 0;
        for (int k = 0; k < n; k++) {
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }
    }
}

The algorithm is quite simple: a triply nested loop that performs sequential accesses over the array a[i][k] and strided accesses over the array b[k][j]. The algorithmic complexity of this loop nest is O(n³), so we expect the runtime to grow as n grows.

Here are the runtimes for the desktop system, for values of n between 768 and 1280.

Although the general trend is upwards, as one might expect, we see a large spike in runtime for the case when the matrix size is 1024. This is the case that corresponds to a situation with a large amount of cache conflicts.

Here is the same graph for the server system:

There are large spikes for the matrix sizes 768 (3 × 256), 1024 (4 × 256) and 1280 (5 × 256), but there are also smaller spikes for the matrix sizes 896 (7 × 128) and 1152 (9 × 128). So, as you can see, code can experience varying amounts of cache conflicts, which incur a smaller or larger performance penalty.

Here is the same graph for the embedded system:

The diagram is similar to the previous ones and doesn’t require further explanation.

The bad thing about cache conflicts is that, to my knowledge, there is no simple, readily available tool that detects them.
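
A common mitigation, not part of the measurements above, is to avoid power-of-two leading dimensions, for example by padding each matrix row with a few unused elements so that consecutive rows no longer map to the same cache sets. A minimal sketch, assuming a flat row-major storage instead of the 2D arrays above:

#include <cstddef>
#include <vector>

// Sketch: store an n x n matrix with a padded row length (leading dimension),
// e.g. n + 8 instead of n, to break the power-of-two stride between rows.
struct PaddedMatrix {
    int n;
    int stride;                 // leading dimension, deliberately not a power of two
    std::vector<double> data;

    explicit PaddedMatrix(int n_)
        : n(n_), stride(n_ + 8),
          data(static_cast<std::size_t>(n_) * (n_ + 8)) {}

    double& at(int i, int j) {
        return data[static_cast<std::size_t>(i) * stride + j];
    }
};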


Vectorization

When the CPU is running vectorized code, instead of performing one operation per instruction, it performs N operations per instruction (where N is typically 2, 4, 8, 16 or 32). Vectorization is used to speed up computations; ideally, the speedup should be N. With vectorization, however, the memory subsystem often becomes the limiting factor: if it cannot deliver data fast enough, the speedup will be smaller than one would expect.

To measure the effect of the memory subsystem on vectorization and software performance, we devise the following experiment. We use the two kernels from the previous post to measure the vectorization speedups, comparing the scalar and the vectorized version of each kernel. Here are the two kernels:

void sum(double* a, double* b, double* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

void div(double* a, double* b, double* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] / b[i];
    }
}

The kernels are very simple. Each of them performs two loads, one arithmetic operation and one store per iteration. We repeat the experiment for various array sizes and repeat counts. In total, we have 256M iterations, which means that if our array has 1M elements, we will repeat the experiment 256 times, but if it has 128M elements, we will repeat it only twice.
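
To give an idea of what the vectorized version looks like, here is a rough AVX2 sketch of the SUM kernel (assuming an x86-64 CPU with AVX2; the actual vectorized code is produced by the compiler and may differ):

#include <immintrin.h>

// A rough illustration of the compiler-vectorized SUM loop on an AVX2 machine
// (4 doubles per 256-bit register). Not the exact code used in the measurements.
void sum_vectorized(const double* a, const double* b, double* out, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);              // vector load of a[i..i+3]
        __m256d vb = _mm256_loadu_pd(b + i);              // vector load of b[i..i+3]
        _mm256_storeu_pd(out + i, _mm256_add_pd(va, vb)); // vector add and store
    }
    for (; i < n; i++) {                                  // scalar remainder
        out[i] = a[i] + b[i];
    }
}

Note that the vectorized loop still performs two loads and one store per four additions, so its demands on the memory subsystem per element are unchanged; only the arithmetic gets cheaper.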

Here is the runtime ratio between the scalar and vector version of the same kernel for the desktop system. In this case, the vectorization width is 4, which means that ideally the runtime ratio should be 4:

Vectorization pays off as long as the data fits in the data caches. Beyond that point, it doesn’t pay off at all for the SUM kernel, and pays off very little for the DIV kernel. This is because both loops have a low arithmetic intensity2, so most of the time is spent waiting for data.

The same graph for the server system:

This graph is very similar to the previous one and doesn’t require additional explanation.

And finally, the same graph for the embedded system:

Although this might seem like an error, I double-checked the results and they reproduce consistently. The SUM kernel has some vectorization benefit only for really small array sizes. The DIV kernel doesn’t care: vectorized or not, the results are always the same.

Bottom line: in the presence of memory subsystem limitations, vectorization is more or less useless. The embedded system shows no advantage for these simple kernels. The other systems show speed improvements with vectorization, but the improvement is limited and depends on the data set size.

Branch Prediction

A branch predictor is a part of the CPU that predicts the outcome of a branch before its condition is evaluated. This allows the CPU to speed up execution because it breaks the dependency chain: the CPU doesn’t need to wait for the data, it can continue executing instructions. If the prediction is correct, the CPU has saved time because it has already done useful work. If not, the CPU needs to discard the speculatively executed instructions and switch to the other execution path. There is a certain performance penalty when a branch is mispredicted, and this price depends on the CPU.

Sometimes, avoiding branches can lead to speed improvements. But, other times, it can lead to speed degradation. Why? If the CPU cannot evaluate the condition of a branch because the condition operands are not available, then the CPU can use branch prediction to move the work forward. In the presence of a high data cache miss rate, the price of not doing anything while waiting for the data is much higher than the price of doing something, and then reverting if it turns out to be useless.

In cases with low data cache miss rates, branchless versions are typically faster. But in cases with high data cache miss rates, branchfull versions are faster. To illustrate this, we use the example of binary search: we perform 256M lookups in sorted arrays of various sizes, using two algorithms, a branchless one and a branchfull one (sketched below).
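
The exact benchmark code is not shown here; hedged sketches of what a branchfull and a branchless binary search typically look like are given below:

// "Branchfull" binary search: the classic version with a data-dependent branch.
int binary_search_branchfull(const int* arr, int n, int key) {
    int low = 0, high = n - 1;
    while (low <= high) {
        int mid = low + (high - low) / 2;
        if (arr[mid] == key) return mid;
        if (arr[mid] < key) low = mid + 1;   // taken or not depending on the data
        else                high = mid - 1;
    }
    return -1;
}

// "Branchless" binary search: the comparison is turned into a select that the
// compiler typically lowers to a conditional move, so there is no branch to predict.
int binary_search_branchless(const int* arr, int n, int key) {
    const int* base = arr;
    int len = n;
    while (len > 1) {
        int half = len / 2;
        base = (base[half - 1] < key) ? base + half : base;  // cmov instead of a branch
        len -= half;
    }
    return (len == 1 && *base == key) ? static_cast<int>(base - arr) : -1;
}

The branchfull version can issue the next load speculatively, based on the predicted branch direction, while the branchless version must wait for the current load before it can compute the next address; this is exactly why the branchless version loses when data cache misses dominate.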

Here is the runtime ratio between the branchfull and the branchless version for the desktop system:

Values greater than 1 mean that the branchless version is faster. Values smaller than 1 mean that the branchfull version is faster.

In the above graph, the branchless version is faster while the binary search array fits in the L1 or L2 data caches. Once it doesn’t, the branchfull version becomes faster, for the reasons already explained.

Here is the same graph for the server system:

A similar observation can be made for the server system.

The same graph for the embedded system:

The embedded system paints a different picture. While the data mostly fits in the L1 cache, the branchless version is faster. When the data doesn’t fit in the L1 cache but fits in the L2 cache, the branchfull version becomes faster. But as soon as the data doesn’t fit in the data caches anymore, the branchless version becomes faster again.

I don’t have an exact explanation for why this happens, but if I had to guess, I would say that once the dataset doesn’t fit the data caches anymore, there is a lot of stress on the RAM. Branch prediction wastes memory bandwidth because some speculative loads will never be used. In contrast, the branchless version only brings in data that is actually useful. The decreased pressure on the RAM results in higher speed for the embedded system.

Let’s measure the total memory data volume for the embedded system3 for branchless and branchfull binary search. Here is the graph:

The branchless version brings in roughly half as much data from the memory to the CPU as the branchfull version. We can safely assume the same behavior would be observed in the other two systems.

Branch prediction can hide the memory latency, i.e. it can keep the CPU working while the data is still being fetched. But this will work only if the memory bus is not saturated. If this happens, then the performance will suffer, as we have just seen.

Final Words

The lessons learned in this post:

  • It is a good idea to put commonly used data inside the same cache line. If the data we are accessing is split between two cache lines, this has a negative effect on software performance.
  • In code with long dependency chains on memory loads (think pointer-chasing code, like traversing linked lists and trees), interleaving additional work can help hide the memory latency.
  • If a code is performing random memory accesses on a large data structure, it will likely suffer from TLB cache misses. Enabling large memory pages can help mitigate the issue.
  • A program accessing data with a power-of-two stride can suffer from conflict cache misses, a situation where useful data is evicted from the cache too early.
  • Vectorization increases the pressure on the memory subsystem, which limits the available speedup. Therefore, only arithmetically intensive code fully benefits from vectorization.
  • Branch prediction increases the pressure on the memory subsystem, but on a single-core system it results in a performance improvement in the presence of many data cache misses. However, the price to pay is an increase in the total data volume transferred between the CPU and the memory.

In the following, last post of the series, we will investigate the behavior of the memory subsystem in a multithreaded environment. Multithreading should in principle make software faster because the work is divided among several CPU cores, but in reality the memory subsystem is one of its limiting factors.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

  1. An important note: with each new linked list, the data set size increases. Larger datasets don’t fit the data caches as well, which results in a performance loss. If the dataset size didn’t increase with the additional work, the results would be better.
  2. Arithmetic intensity is the ratio of arithmetic operations to memory operations.
  3. We didn’t have the corresponding counters to measure the total memory data volume for the other two systems.

Comments

  1. I know there’s going to be a 3rd part of this series, but for proper vectorization it’s best to use the restrict/__restrict keyword (https://en.cppreference.com/w/c/language/restrict) as often as possible on memory pointers, or code may be less optimal. Otherwise writing to out[i] in “out[i] = a[i] + b[i];” might potentially affect the values read at a[i+1] and/or b[i+1] and the compiler should take this case into account as well – unless “restrict” is used.

    sum and div without restrict: https://godbolt.org/z/5Thhf3YPa
    sum and div with restrict: https://godbolt.org/z/aPGP8jqE5

    See also https://stackoverflow.com/questions/146159/is-fortran-easier-to-optimize-than-c-for-heavy-calculations/146186#146186

    1. Generally speaking you are right. Because of pointer aliasing, code with restrict will generally be easier to optimize and yield better assembly. But in this particular case, because there are only three pointers, the compiler managed to do so-called “runtime pointer aliasing analysis”. So it did this and ran the vectorized version of the code; otherwise the results wouldn’t have made sense. But this analysis is only possible if the pointer count is low, e.g. 3 or 4.

      CLANG has an opt-viewer.py tool you can use to check for compiler optimizations. Then it makes sense to apply the `__restrict` keyword, as you’ve said.
