The memory subsystem from the viewpoint of software: how the memory subsystem affects software performance 1/3

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

We talked about the memory subsystem in one way or another many times (e.g. Instruction-level parallelism in practice: speeding up memory-bound programs with low ILP or Memory consumption, dataset size and performance: how does it all relate?). And not without reason: memory performance is a limiting factor in many types of computation.

In this post we are going to experiment with the memory subsystem from the software viewpoint. We will write small programs (kernels) that illustrate different aspects of the memory subsystem, to explore how its parts behave under various loads.

The idea of the post is to give you an overview of how different types of software interact with the memory subsystem and how you can take advantage of this knowledge. We intentionally omit many details of the memory subsystem, because they often differ between vendors or architectures. Instead, we focus on effects that are common to all architectures and most visible from the software viewpoint. This is the first post in a series of three. In the second post, we will talk about cache conflicts, cache line size, hiding memory latency, the TLB cache, vectorization and branch prediction. And in the third post, we will talk about multithreading and the memory subsystem.

A short introduction to the memory hierarchy

CPUs are fast and memory is slow. That’s why CPUs have several levels of cache memory: level 1, level 2 and level 3 (also called last-level) caches, which speed up access to commonly used data. Level 1 data caches are the fastest but the smallest, followed by L2 and L3. Here is a snapshot of the memory hierarchy of my laptop, taken with the Hardware Locality tool lstopo:

Hardware Topology Tool for my CPU

The system displayed has 32 kB of per-core L1 cache, 256 kB of per-core L2 cache and 6 MB of L3 cache common to all cores.

A piece of data that has been recently accessed will be in the cache. After some time of non-use, it gets evicted, i.e. moved to slower parts of the memory subsystem. For this reason, repeated accesses to a small hash map (e.g. 8 kB) will be faster than repeated accesses to a large hash map (e.g. 8 MB): 8 kB of data fits completely in the L1 cache, whereas 8 MB of data fits only partially in the slowest cache, L3.

An important part of the memory hierarchy is the data prefetcher: circuitry in the CPU that prefetches data before it is needed. This works, however, only if the memory access pattern is predictable, which makes iterating through a vector much faster than iterating through a linked list.
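As a minimal illustration (this snippet is ours, not one of the kernels measured in this post), summing the same numbers through a std::vector and through a std::list exercises the prefetcher very differently:

#include <list>
#include <numeric>
#include <vector>

// Sequential, prefetcher-friendly: the elements are contiguous in memory.
double sum_vector(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0);
}

// Pointer chasing: the address of the next node is known only after the
// current node has been loaded, so the prefetcher has little to work with.
double sum_list(const std::list<double>& l) {
    return std::accumulate(l.begin(), l.end(), 0.0);
}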

Another important aspect of the memory subsystem is the cache line. A cache line is typically 64 bytes in size, and data is brought from memory to the data caches in blocks of 64 bytes. This means that if your code touches a certain byte, the next few or previous few bytes will very likely already be in the cache and access to them will be fast.

And lastly, there is the TLB cache. Programs on modern systems don’t work with physical addresses but with virtual addresses, and a small TLB cache speeds up the translation from virtual to physical addresses. A typical memory page is 4 kB in size and there is one translation entry per page. If a program accesses many different pages in a short period of time, translation entries will be missing from the TLB cache and will need to be fetched from memory. This is a very expensive memory operation, and it does happen with random memory accesses on very large datasets: a hash map that is 256 MB in size spans 65,536 pages of 4 kB, far more than a typical TLB can hold.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

The experiments

Here are a few experiments that demonstrate the effects of the memory subsystem on software performance. For these experiments we use three test systems: a Raspberry Pi 4 (embedded), an Intel Core i5 (desktop) and an Intel Xeon Gold (server). We use three different systems for two reasons: first, to illustrate the differences in their memory subsystems, and second, to illustrate the complex interplay of the CPU and memory and its effect on performance.

Here are the specifications of the systems:

  • Desktop system: Intel Core i5-10210U CPU, one socket in one NUMA domain, 4 cores, 2 threads per core. Each core has 32kB of L1i cache, 32kB of L1d cache and 256kB of L2 cache. There is a total of 6MB of L3 cache shared by all the cores.
  • Embedded system: Broadcom BCM2711, ARM Cortex-A72, one socket in one NUMA domain, 4 cores, 1 thread per core. Each core has 16kB L1i cache and 16kB L1d cache. There is a total of 1MB of L2 cache shared by all the cores.
  • Server system: Intel Xeon Gold 5218 CPU, two sockets in two NUMA domains, 16 cores per socket, 2 threads per core. Each core has 32kB of L1i cache, 32kB of L1d cache and 1MB of L2 cache. There is a total of 22MB of L3 cache common to all the cores in the socket.

Performance and levels of cache – sequential access

In the first experiment, we have two small kernels, one doing additions and one doing divisions. Both iterate sequentially through arrays. Here they are:

void sum(double* a, double* b, double* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

void div(double* a, double* b, double* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] / b[i];
    }
}

The kernels are very simple: each iteration performs two loads, one arithmetic operation and one store. We repeat the experiment for various array sizes and repeat counts. In total, we perform 256M iterations, which means that if our array has 1M elements, we repeat the experiment 256 times, but if it has 128M elements, we repeat it only twice.
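The measurement harness is not shown in the post; a hypothetical driver that keeps the total work constant could look like the sketch below (the constant and function names are ours):

#include <cstddef>

void sum(double* a, double* b, double* out, int n); // the kernel shown above

constexpr std::size_t TOTAL_ITERATIONS = 256 * 1024 * 1024; // 256M

// Hypothetical driver: the smaller the array, the more times we repeat the kernel.
void run_sum_experiment(double* a, double* b, double* out, int n) {
    const std::size_t repeats = TOTAL_ITERATIONS / n;
    for (std::size_t r = 0; r < repeats; r++) {
        sum(a, b, out, n);
    }
}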

Here are the runtimes depending on the array size for the desktop system. The total number of accessed elements is always the same, 256M.

In modern CPUs, addition is relatively cheap compared to division, and this shows as long as the data is coming from the L1, L2 or L3 cache. When the dataset doesn’t fit any of the caches and needs to be fetched from memory, both kernels run at about the same speed: about 3 to 3.5 times slower than when the data is served from the data caches.

For smaller arrays, the bottleneck is the CPU and therefore the SUM kernel is faster. For larger arrays, the bottleneck is the memory and both kernels run at about the same speed.

Here are the same graphs for the server system:

And the same graph for the embedded system:

For the server system, the graph looks similar to the desktop system’s: the SUM kernel is faster for small datasets, but the two even out as soon as the dataset no longer fits the L1 cache. For the embedded system, the graph looks a bit different. The SUM kernel behaves as on the other two systems, being faster when data is served from the cache and slowing down once data starts being served from main memory. The runtime of the DIV kernel, however, is mostly independent of the dataset size: division is very slow on this chip, so the bottleneck is always the CPU, regardless of the dataset size.


Memory access pattern, levels of cache and performance

In the previous experiment, we were concerned only with sequential memory accesses. Sequential memory accesses are very common in image processing, video processing, matrix manipulation, etc. Apart from sequential memory accesses, there are two other important types of memory access:

  • Strided memory accesses: whereas a sequential memory access iterates over neighboring memory cells from left to right (or right to left), strided memory accesses skip a fixed number of elements, called the stride. For example, we can iterate over an array but touch only every other element; in this case the stride is two. Another example is iterating over an array of class instances: if the class is X and we touch only the data member X.y, from the hardware viewpoint we are iterating over the y members with a stride of sizeof(X).
  • Random memory accesses: a random memory access happens either when we dereference a pointer (through the * or -> operators), or when we access elements of an array through an index that is not directly related to the loop variable (e.g. a[index[i]] is a random access to a, while index[i] is a sequential access to index if the loop increments i with i++). Both patterns are illustrated in the short sketch after this list.
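Here is a short sketch of the two patterns (the struct and function names are our own illustration, not code from the measurements):

#include <cstddef>

struct Particle {
    double x, y, z; // payload we don't touch below
    double mass;    // the only member we read
};

// Strided access: only `mass` is read, but from the hardware viewpoint we walk
// through memory with a stride of sizeof(Particle) bytes.
double total_mass(const Particle* particles, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        sum += particles[i].mass;
    }
    return sum;
}

// Random access: the location read from `data` depends on the value loaded
// from `index`, so the hardware cannot predict the next address.
double gather(const double* data, const int* index, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++) {
        sum += data[index[i]];
    }
    return sum;
}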

These are the main differences between sequential, strided and random memory accesses:

  • Sequential memory accesses are the most efficient, because they use all the data in each fetched cache line. In addition, the data prefetcher can prefetch the data because the access pattern is predictable.
  • Strided memory accesses come next in efficiency. The cache line utilization is lower, and the bigger the stride, the lower the utilization. But the access pattern is still predictable, so the code still profits from the data prefetcher.
  • Random memory accesses are the least efficient, because prefetching is not possible and the cache line utilization is typically low.

Having covered the memory access patterns, let’s move on to experimenting. We want to measure how dataset size, memory access pattern, data caches and runtime are connected. To do this, we use the following kernel:

int calculate_sum(int* data, int* index, int len) {
    int sum = 0;

    for (int i = 0; i < len; i++) {
        sum += data[index[i]];
    }

    return sum;
}

Judging from the code alone, the access to data through data[index[i]] is random. But for the purpose of this experiment, we fill the index array so that it produces sequential, strided or random accesses.
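The post doesn’t show how the index array is filled; a sketch along the following lines would produce the three patterns (the shuffle-based random fill is our assumption):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Fills `index` so that data[index[i]] becomes a sequential (stride == 1),
// strided or random access. Assumes len is a multiple of stride.
std::vector<int> make_index(int len, int stride, bool random_order) {
    std::vector<int> index(len);
    if (random_order) {
        // Random: a random permutation of 0 .. len-1.
        std::iota(index.begin(), index.end(), 0);
        std::shuffle(index.begin(), index.end(), std::mt19937{42});
    } else {
        // Sequential or strided: one pass per starting offset 0 .. stride-1.
        int i = 0;
        for (int offset = 0; offset < stride; offset++) {
            for (int pos = offset; pos < len; pos += stride) {
                index[i++] = pos;
            }
        }
    }
    return index;
}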

We measure runtimes, but also memory throughput (in MB/s) and total memory data volume (in MB). First, the runtimes:

A few interesting observations:

  • Performance is very much related to the access pattern, the best being sequential, then small strides, then large strides, followed by the random memory access pattern.
  • With a sequential memory access pattern, all three systems behave similarly. The desktop system is a bit faster than the server system. The embedded system is the slowest, which is to be expected given its smaller and cheaper CPU.
  • As the access pattern worsens, so does the performance.
  • With random accesses, the desktop system and the embedded system are equally slow. The reason is that the bottleneck is no longer the CPU but exclusively the memory.
  • The server system is much more resistant to the worsening of the access pattern, probably because its memory subsystem is much faster and has a higher throughput than those of the other two systems.

Let’s measure memory throughput and data volume. Unfortunately, the data is only available for the desktop system, but this is enough to illustrate a few facts. First, memory throughput:

The general trend is that memory throughput is highest for sequential and small-stride accesses, and then begins to fall quickly. Notice that the maximum throughput is reached not with sequential accesses but with small-stride accesses: with sequential accesses the kernel is still not completely memory bound, so there is some unused throughput left.

We know that our program uses exactly 512 MB of data: there are two arrays of integers, each with 64M elements, which adds up to exactly 512 MB. From the viewpoint of the CPU, however, the total data volume transferred depends a lot on the memory access pattern, because of the cache line organization.

With sequential access, the transferred volume is as expected (0.54 GB). But as the stride grows, so does the data volume; the difference between the minimum and the maximum is 17x. The main reason is the cache line nature of memory transfers: data is brought from memory to the data caches in blocks of 64 bytes1. Let’s calculate the cache line utilization, i.e. the number of bytes consumed vs the number of bytes fetched. Here are the numbers:

Access Type    Cache Line Utilization
Sequential     100% (64/64)
Stride 2       50% (32/64)
Stride 4       25% (16/64)
Stride 8       12.5% (8/64)
Stride 16      6.25% (4/64)
Stride 32      6.25% (4/64)

Cache Line Utilization for an Array of Integers

For larger strides, many cache lines have to be brought from memory to the CPU caches several times, which increases the data volume. If cache line utilization alone were responsible for the increase, we would expect a maximum volume of 256 MB + 256 MB x 16 = 4352 MB. We measure higher data volumes than that, which implies that additional data volume comes from other places as well.2
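The bound above can be turned into a small model (our own approximation: it assumes 4-byte integers, 64-byte cache lines, and that every cache line of the data array is evicted before it is needed again):

#include <algorithm>
#include <cstdio>

int main() {
    const double array_mb = 256.0;             // 64M integers x 4 bytes, per array
    const int strides[] = {1, 2, 4, 8, 16, 32};
    for (int s : strides) {
        // With stride s, each fetched 64-byte line supplies max(4, 64/s) useful
        // bytes, so the data array is transferred roughly min(s, 16) times.
        const int refetch_factor = std::min(s, 16);
        const double expected_mb = array_mb                   // index array, sequential
                                 + array_mb * refetch_factor; // data array
        std::printf("stride %2d: ~%.0f MB expected\n", s, expected_mb);
    }
    return 0;
}

For stride 32 this model predicts the 4352 MB figure mentioned above; the measured volume is higher still.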


Loads vs stores performance

What is more expensive, loading data from memory or storing it back? There are two competing arguments. The first is that loads are more expensive, because no operation that depends on the loaded value can continue until the data arrives (see the post about instruction-level parallelism). The second is that stores are more expensive, because a store in principle requires two memory transfers (loading the cache line, modifying part of it and then writing the modified cache line back).

To measure this, we use two small kernels that load from the memory and store to it. We measure the performance of loading and storing for both sequential memory accesses and strided memory accesses with stride 2. Here is the source code of the memory load (read) kernel:

template <int stride = 1>
double test_read(double* in, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i+=stride) {
        sum += in[i];
    }
    return sum;
}

This kernel reads every stride-th element of the array in. We accumulate the values into the variable sum because we don’t want the compiler to optimize the reads away.
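How exactly the result is consumed in the original harness is not shown; one common option is to write it to a volatile sink, for example:

// Hypothetical usage: assigning to a volatile forces the compiler to keep the reads.
volatile double sink;

void run_read_tests(double* in, int n) {
    sink = test_read<1>(in, n); // sequential
    sink = test_read<2>(in, n); // stride 2
}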

The second kernel is the store (write) kernel. Here is the source code:

template <int stride = 1>
void test_write(double* in, int n, double v) {
    for (int i = 0; i < n; i+=stride) {
        in[i] = v;
    }
}

For stride 1, this code behaves like the memset function: it writes the same value to every element of the array.

To test the memory subsystem, we repeat the test on arrays between 8K elements and 256M elements. For each array size, we repeat the experiment so that in total we get 1G memory accesses for the sequential kernels and 512M accesses for the stride 2 kernels.

We are interested in the runtime ratio between the corresponding load and store kernels. If the value is above 1, this means the writes are faster. If the value is below 1, this means that the reads are faster. Here are the numbers for all three systems:

When the array is small (i.e. the data is mostly served from the caches), stores are faster by a large margin. This is to be expected: a load stalls the CPU until the data arrives, whereas stores are never instruction dependencies, so the CPU can delay them. Loads become faster once the array grows beyond 8M elements on the embedded and server systems, but not on the desktop system.

The absolute results are very different between the three platforms, but the general trend is similar: stores are faster when going through the cache, but once the workload doesn’t fit the data caches, they are roughly the same speed.

Here is the same graph, but this time with the kernels running with stride 2.

We see different results here. Again, stores are faster than loads, up to a certain dataset size. When the dataset doesn’t fit the data caches, stores become slower than loads. This is especially true for the embedded system.

Why is this happening? As already said, loads are almost always part of instruction dependency chains, because the loaded data is used later. In contrast, stores are never part of dependency chains, so in principle stores should be faster.

However, a store consists of reading the cache line from memory into the data cache, modifying it, and then writing the modified cache line back to memory. So, when a piece of data is not in the data cache, storing it is more expensive than loading it. This penalty, however, applies only to non-sequential accesses (strided and others).

With sequential accesses, the CPU doesn’t need to fetch the line from memory at all. If the whole cache line is going to be overwritten, several stores can be merged into one large memory write that modifies the whole cache line, a process called write combining. This skips one memory transaction and makes stores faster.

As long as stores are served from the caches, they are almost always faster than the corresponding loads. Once they start being served from main memory, they can be marginally faster or slower, depending on the architecture. Stores fare worse with strided accesses than with sequential accesses (a ratio of 2.05 for sequential vs 1.21 for stride 2 on the desktop system, 0.86 vs 0.8 on the server system and 0.86 vs 0.26 on the embedded system).

The bottom line is that loading is more expensive than storing, so having a good memory access pattern for loading is more important than having a good memory access pattern for storing3. But the details really depend on the architecture and are likely to change with newer architectures.

Arithmetic Intensity

Another interesting effect of the interplay between the memory subsystem and the CPU is the relationship between arithmetic intensity and performance. By arithmetic intensity, we mean the ratio between the number of arithmetic operations and the number of bytes loaded from or stored to memory in a single iteration of the loop. Here is the example kernel we are going to use for testing:

template<int p>
void pow_of_2_array(double* in, double* out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = pow_of_2<p>(in[i]);
    }
}

Notice the call to the function pow_of_2<p>(...). This function is written using C++ templates, and the number of multiplications it performs is p. To test, we use values of p from 1 to 28 in steps of 3: 1, 4, 7, …, 25, 28.
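The implementation of pow_of_2 is not listed in the post; a minimal version consistent with the description (exactly p multiplications per call) might look like this:

// Hypothetical implementation: p successive squarings, i.e. exactly p dependent
// multiplications per call, computing x raised to the power 2^p.
template <int p>
double pow_of_2(double x) {
    double result = x;
    for (int i = 0; i < p; i++) {
        result = result * result;
    }
    return result;
}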

Since the number of loads and stores doesn’t change, increasing p also means increasing the arithmetic intensity of the loop. We want to see how the runtime changes as we increase the number of multiplications.
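Counting only the memory traffic visible in the source (one 8-byte load and one 8-byte store per iteration), the arithmetic intensity of this kernel is roughly:

arithmetic intensity ≈ p multiplications / 16 bytes

so doubling p doubles the arithmetic intensity while the memory traffic stays the same.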

We are also interested in the dataset size, so we vary the array size from 4K elements up to 256M elements. In total, the kernel performs 1G iterations, corresponding to 16 GB of data loads, 8 GB of data stores, and from 1 to 28 multiplications per iteration, depending on the value of p.

Here are the runtimes for the desktop system:

As expected, when the number of multiplications is low, the runtime doesn’t depend on the multiplication count, because loading and storing data is the bottleneck. However, at some point (e.g. 4 multiplications for array size 64K, 10 multiplications for array size 1M and 13 multiplications for array size 256M) the runtime starts to increase. This is the point where the loop stops being memory bound and becomes computationally bound. The smaller the dataset, the fewer multiplications are needed for the loop to become computationally bound.

We observe a similar pattern in the server system:

And a similar pattern in the embedded system:

On the embedded system, due to the limitations of the chip, the loop becomes computationally bound very quickly, because the cost of a multiplication is much higher than the cost of memory loads and stores.

The relationship between arithmetic intensity and performance has been formalized in the roofline model. The roofline model tells you the memory and computational limits of the target architecture, and also whether your hot loop is hitting the computational or the memory limit. More about the roofline model can be found here, here and here.
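In its simplest form, the roofline model bounds the attainable performance of a loop as:

attainable performance = min(peak compute throughput, arithmetic intensity × peak memory bandwidth)

Loops with low arithmetic intensity end up limited by the memory bandwidth term; loops with high arithmetic intensity are limited by the compute term, which matches what we observed above.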

Final Words

Although the exact details of architectures differ, most of our observations apply to all three investigated architectures:

  • Working with smaller datasets is faster than working with larger datasets.
  • A predictable memory access pattern is faster than an unpredictable memory access pattern.
  • The sequential memory access pattern is faster than the strided access pattern. The bigger the stride, the bigger the difference.
  • Memory stores are typically faster than memory loads, but this can depend on the dataset size and the underlying architecture.
  • Arithmetically intense code puts less pressure on the memory subsystem: if the code does a lot of computation between loading and storing data, the bottleneck will not be the memory subsystem but the CPU.

This is the first post in the series where we experiment with the memory subsystem. In the second post, we will talk about cache conflicts, cache line size, hiding memory latency, TLB cache, vectorization and branch prediction. And in the third post, we will talk about multithreading and the memory subsystem. Stay tuned!

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

  1. Typical cache line sizes are 64 and 128 bytes, but other sizes are possible in specialized systems (32 bytes in some embedded systems or 256 bytes in some high-performance systems).
  2. It is very difficult to say where the additional data volume comes from, because of the complexity of the memory subsystem and the inaccuracy of the measurement system.
  3. E.g. with loop tiling, where you can have a good memory access pattern for either loads or stores, but not for both.
