Multithreading and the Memory Subsystem

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

This is the 14th post in the memory optimization series. Feel free to check out the other posts on the same topic.

In this post we investigate how the memory subsystem behaves in an environment where several threads compete for memory subsystem resources. We also investigate techniques to improve the performance of multithreaded programs – programs that split the workload onto several CPU cores so that they finish faster.

A Multicore System

A typical multicore system will have one CPU socket (a physical CPU you can hold in your hand). Inside the socket there are several CPU cores, and inside each core there are one or two hardware threads. From a software viewpoint, each hardware thread is a logical “processor” that a software thread can run on. In this post, when we talk about a thread, we mostly mean a software thread bound to a specific hardware thread.

Each CPU core has its own L1 and possibly L2 cache. If a core has two hardware threads, they share the L1 and L2 caches. The last-level cache (typically L3 in desktop and server systems and L2 in embedded systems) is shared by all CPU cores in the socket. Here is the output of the lstopo utility for the desktop system we use for testing.

The image above shows a system with 1 physical socket containing 4 cores. Each core has two hardware threads. L1 and L2 caches are per-core and L3 cache is common to all cores.

When talking about performance: as long as the cores access only data in their own caches, they do not slow one another down. But once two or more cores start accessing a common resource, either the last-level cache or the memory, the performance of the programs running on those cores goes down.

When it comes to the performance of software running on a multicore system, most of the recommendations from the previous posts that aim to decrease the dataset size and use the caches more optimally apply here as well. The first step in optimizing a multithreaded program is making sure that it uses memory resources efficiently even when running on a single thread.

A Multisocket System

A multisocket system has more than one socket. As already said, each socket consists of several CPU cores, and each socket has its own last-level cache.

From the software point of view, a multisocket system behaves similarly to a multicore system. The difference is that in a multisocket system the point of contention is memory, not the last-level cache. As long as the cores in the two sockets are not accessing data in memory, they don’t influence one another when it comes to speed. But as soon as both sockets start fetching data from memory, they contend for memory bandwidth and the performance of the cores in both sockets starts to drop.

NUMA Systems

Many server systems are NUMA systems. NUMA stands for non-uniform memory access, and it describes a system where some memory accesses are more expensive than others. The system is divided into two or more NUMA domains, and each domain consists of several CPU cores and the memory attached to them.

If a CPU core is accessing memory in the same NUMA domain, we talk about a local memory access. If a CPU core is accessing memory in another NUMA domain, we talk about a remote memory access. In NUMA systems, local memory accesses are cheaper (more bandwidth and lower latency) than remote memory accesses. Here is a graphical representation of a NUMA system with two NUMA domains.

In this system, each NUMA node consists of 1 socket, and each socket consists of 4 CPU cores.

For software development on NUMA systems, it is important to understand the way the operating system allocates memory. It works like this: initially, when a program allocates a block of memory (using malloc, new, mmap, etc.), the operating system reserves virtual memory for it, but it does not allocate any physical pages to back it up. Only when the program touches a page of that block for the first time does the operating system allocate a physical memory page. The page will be allocated in the NUMA domain of the thread that first touched it.

For good performance in NUMA systems, it is important to control the thread that initializes memory pages!
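
To make the first-touch behavior concrete, here is a minimal sketch (our own illustration, with made-up names and sizes, not code from this post): malloc only reserves virtual memory, and the physical page behind each element is allocated in the NUMA domain of the thread that writes to it first.

#include <cstddef>
#include <cstdlib>
#include <thread>
#include <vector>

// Sketch of the first-touch policy: each thread touches its own part of the
// buffer, so the physical pages behind that part are allocated in that
// thread's NUMA domain.
int main() {
    const std::size_t size = 1u << 24;   // 16 M doubles, 128 MB
    double* buf = static_cast<double*>(std::malloc(size * sizeof(double)));  // virtual memory only

    const std::size_t num_threads = 4;
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < num_threads; t++) {
        threads.emplace_back([=] {
            const std::size_t chunk = size / num_threads;
            for (std::size_t i = t * chunk; i < (t + 1) * chunk; i++) {
                buf[i] = 0.0;   // first touch: physical page allocated near this thread
            }
        });
    }
    for (auto& th : threads) th.join();
    std::free(buf);
}

Note that for reliable placement the threads would also have to be pinned to cores in the right NUMA domains; we omit the pinning here to keep the sketch short.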

Techniques

In this section we present techniques used to improve the performance of multithreaded programs on multicore and NUMA systems.

Divide the Data Set among CPU Cores

When using multithreading to speed up a loop, we divide the loop workload into several chunks that can be processed in parallel, and then distribute each chunk to a dedicated thread. Since the work done by each thread is smaller than the original work, if all threads run in parallel, the total execution time should be shorter. But for optimal memory performance, the way we divide the work among the threads matters. To illustrate this, consider a simple example:

We have a sorted array of 8 MB (named sorted) and we need to perform 128 M binary searches in this array. We want to parallelize the search using 8 CPU cores. For simplicity, each CPU core runs one thread. Also, each CPU core has its own L1 and L2 caches, and the L3 cache is shared among all the cores.

The straightforward way to parallelize binary search is to just divide the 128 M lookup values into 8 parts, and forward each part to a thread for processing. Each thread will perform 16 M lookups on the whole sorted array.
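
One simple way to express this straightforward division is an OpenMP loop with static scheduling. This is a sketch of ours, not the code of the benchmark described later (which uses a dedicated scheduling thread instead):

#include <algorithm>
#include <cstddef>
#include <vector>

// Sketch of the "no division" approach: the lookup values are split into equal
// chunks among the threads, but every thread searches the entire sorted array.
std::size_t parallel_lookup_count(const std::vector<int>& sorted,
                                  const std::vector<int>& lookups) {
    std::size_t hits = 0;
    #pragma omp parallel for schedule(static) reduction(+:hits)
    for (std::size_t i = 0; i < lookups.size(); i++) {
        hits += std::binary_search(sorted.begin(), sorted.end(), lookups[i]);
    }
    return hits;
}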

The problem with this approach is that each thread accesses the whole sorted array, which is 8 MB in size. Each CPU core has its own L1 and L2 caches, but they are smaller than the 8 MB array, so the CPU cores contend for the shared L3 cache.

A different approach to dividing the workload between the CPU cores would be this: each thread gets a dedicated part of the sorted array it will access. In our case, each thread accesses only 1 MB of the sorted array. This improves performance, since 1 MB fits in the L2 cache and the data will be served from there instead of from the L3 cache. Since each CPU core has its own L2 cache, there is less contention for memory resources.

But, to implement this, we need an additional scheduling thread that reads values from the lookup array and assigns them to threads. If a lookup value is between sorted[0] and sorted[1M - 1] it gets assigned to thread 0. If the value is between sorted[1M] and sorted[2M - 1] it gets assigned to thread 1, etc.
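
As an illustration, here is a minimal sketch of how the scheduling thread could pick the worker thread for a lookup value. The function and variable names are ours, not taken from the benchmark source:

#include <cstddef>
#include <vector>

// Each of num_threads worker threads owns a contiguous chunk of the sorted
// array. The scheduling thread uses this function to decide which worker a
// lookup value should be sent to.
std::size_t owner_thread(const std::vector<int>& sorted, int value,
                         std::size_t num_threads) {
    const std::size_t chunk = sorted.size() / num_threads;
    for (std::size_t t = 0; t + 1 < num_threads; t++) {
        if (value <= sorted[(t + 1) * chunk - 1]) {  // value falls into thread t's range
            return t;
        }
    }
    return num_threads - 1;  // everything beyond the last boundary goes to the last thread
}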

A similar approach can be used in other algorithms that go over the same data many times:

  • Lookups in a binary tree, where each thread works on a dedicated part of the binary tree.
  • Lookups in a hash map. In this case, the scheduling thread needs to calculate the index position in the hash array, and then the work is distributed to threads depending on the index value (see the sketch after this list).
  • Search in an unsorted array. Each thread searches one part of the unsorted array instead of having each thread searching the whole array.
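
For the hash map case, a sketch of the same idea could look like this (again with our own names; it assumes the hash array has at least as many slots as there are threads):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// The scheduling thread hashes the key to find its slot in the hash array and
// then forwards the lookup to the worker thread that owns that part of the array.
std::size_t owner_thread_hash(const std::string& key, std::size_t table_size,
                              std::size_t num_threads) {
    const std::size_t index = std::hash<std::string>{}(key) % table_size;
    const std::size_t slots_per_thread = table_size / num_threads;
    return std::min(index / slots_per_thread, num_threads - 1);
}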

This approach is, however, not without its drawbacks, and in practice it can be difficult to implement. The first problem is the scheduling thread used to distribute data to the worker threads; the simpler approach does not require one.

The second, more difficult problem is related to load balancing. If the lookup data is not random and some values are more probable than others, this can lead to an unbalanced system where some threads are overworked and others starve. In that case we would need to implement some kind of load balancing scheme to ensure each thread gets enough work.

This technique applies both to multicore and multisocket systems. For a multisocket system, this approach results in more data being read from the last-level cache and fewer visits to the main memory.


NUMA Optimizations

NUMA optimizations only apply to NUMA systems, which we described at the beginning of this post. To make sure data is allocated in the same NUMA domain from which it will later be accessed, each thread should initialize the buffer, or the part of the buffer, that it will later mostly access.1

In the case of binary search example from the previous section, each thread should initialize the part of the sorted array where it will be performing most searches.

If you use the OpenMP API to write a multithreaded program, you can use static scheduling for both the initialization loop and the processing loop. If the loop trip count and the number of threads are the same for both loops, static scheduling ensures that each iteration is executed by the same software thread, running on the same hardware thread in the same NUMA domain, in both loops.

Here is an example that illustrates how to use the OpenMP schedule(static) clause to initialize the data in its proper place:

// Initialization loop
#pragma omp parallel for default(none) shared(src, v, size) schedule(static) num_threads(num_threads)
for (size_t i = 0; i < size; i++) {
    v[i] = src[i];
}

// Processing loop
#pragma omp parallel for default(none) shared(v, size) reduction(+:sum) schedule(static) num_threads(num_threads)
for (size_t i = 0; i < size; i++) {
    sum += v[i];
}

This kind of initialization ensures that the threads in the processing loop mostly access data in their own NUMA domain.
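
If you want to check where your threads actually run, a small diagnostic like the one below can help. This is our own sketch and assumes libnuma is installed (compile with -fopenmp and link with -lnuma); it is not part of the example above.

#include <cstdio>
#include <numa.h>    // libnuma
#include <omp.h>
#include <sched.h>

// Prints the CPU and NUMA node every OpenMP thread runs on, so you can verify
// that the threads of the initialization and processing loops stay where you
// expect them.
int main() {
    if (numa_available() < 0) {
        std::puts("NUMA is not supported on this system");
        return 0;
    }
    #pragma omp parallel
    {
        const int cpu = sched_getcpu();
        const int node = numa_node_of_cpu(cpu);
        #pragma omp critical
        std::printf("thread %d runs on CPU %d (NUMA node %d)\n",
                    omp_get_thread_num(), cpu, node);
    }
    return 0;
}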

If for some reason it is not possible to initialize your data in such a way that each thread initializes the block of memory it will later access, the next best thing is to initialize it using a pool of threads instead of a single thread. This ensures that the data is distributed across all NUMA nodes. With single thread initialization, all the data will end up on a single NUMA node, and this node will become a bottleneck when many threads from different NUMA domains access it. You can use OpenMP to initialize the data using a thread pool (note that the scheduling is not static):

// Initialization loop: dynamic scheduling spreads the first touch, and with it
// the physical pages, across all the threads and therefore across the NUMA domains
#pragma omp parallel for default(none) shared(src, v, size) schedule(dynamic)
for (size_t i = 0; i < size; i++) {
    v[i] = src[i];
}

Experiments

In this section, we try out the techniques proposed in the earlier sections. All the experiments were executed on an Intel Xeon Gold 5218 system with two sockets in two NUMA domains, 16 cores per socket and 2 threads per core. Each core has 32 kB of L1i cache, 32 kB of L1d cache and 1 MB of L2 cache. There is a total of 22 MB of L3 cache shared by all the cores in a socket.

Dividing the Data Set

For this experiment, we are using two versions of binary search, similar to what we described earlier. The first version is the No Division version, where each thread accesses the whole sorted array. The second version is the Division version, where each thread performs binary search only on one part of the sorted array. The source code is available here.

There are no differences in the binary search implementation between the division and no division versions. The only difference is the implementation of the scheduling thread: for the no division version, the scheduling thread assigns lookup values to the threads in round-robin fashion, so that all threads are equally loaded. For the division version, the scheduling thread assigns each lookup value to exactly one specific thread, the one in charge of the range of values the lookup value belongs to.

The scheduling thread creates a packet of 128 values before sending it for lookup. We did this to decrease the overhead of synchronization between threads. The lookup values were randomly generated in the range of the sorted array, so all the threads are equally busy.

We used 8 lookup threads and one scheduling thread. We measured the average time needed to process a packet across all the threads over a run of 5 seconds. Here are the measurement results:

Sorted Array Size | No Division      | With Division
16 kB             | 7872.4 (41.5)    | 6753.8 (40.7)
64 kB             | 9510.8 (40.8)    | 8035.6 (39.1)
256 kB            | 11203.4 (29.6)   | 9399.3 (50.4)
1 MB              | 14447.6 (135.6)  | 11276.8 (40.4)
4 MB              | 18631.4 (83.9)   | 14152.7 (234.1)
16 MB             | 23114.3 (65.3)   | 17790.2 (160.0)
64 MB             | 47358.7 (1102.6) | 32146.3 (561.5)
256 MB            | 81758.6 (5032.3) | 61973.7 (2629.7)

Mean time to process a 128-value lookup packet in ns (the number in parentheses is the standard deviation). 20 repetitions in total.

As you can see, the division version is consistently faster than the no division version, but the difference in speed is not huge. The smallest difference is 1.16x (for the 16 kB array) and the largest is 1.47x (for the 64 MB array).

NUMA Experiment

To measure the effect of NUMA on system performance, we use the following kernel:

#pragma omp parallel for shared(v, size) reduction(+:sum) schedule(static) num_threads(num_threads)
for (size_t i = 0; i < size; i++) {
    sum += v[i];
}

The kernel is very simple: it just adds all the values of v into sum. The size of the array v is 8 GB.

We use an OpenMP pragma (the first line of the kernel) to parallelize this loop. The most important part of the pragma is schedule(static). Static scheduling means that the loop is split into smaller ranges and each range is assigned to a thread statically and predictably. If we have another loop with the same iteration space (start value, increment and length) and static scheduling, that loop will be split into the same ranges and assigned to the same threads running on the same CPU cores.
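
As a rough illustration (the OpenMP specification leaves the runtime some freedom here, so this is not the exact algorithm), with schedule(static) and no chunk size, thread t out of num_threads gets one contiguous block of iterations, and the same block in every loop with the same iteration space:

#include <algorithm>
#include <cstddef>

// Rough illustration of schedule(static) without a chunk size: thread t gets
// the half-open iteration range [begin, end).
void static_block(std::size_t size, std::size_t num_threads, std::size_t t,
                  std::size_t& begin, std::size_t& end) {
    const std::size_t block = (size + num_threads - 1) / num_threads;
    begin = std::min(size, t * block);
    end = std::min(size, begin + block);
}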

We use three types of initialization loops:

  • Single thread initialization, where all the data is initialized from a single thread. We expect this to be the worst case, because all the data will be on a single NUMA node, which will be the bottleneck.
  • Random thread initialization, where we initialize the data in a parallel region, but the data is randomly placed across NUMA domains. Some of the memory accesses will be local, others will be remote. The memory bottleneck will not be in a single NUMA node.
  • Static thread initialization, where we initialize the data from a parallel region, and the assignment of data to threads (and therefore to NUMA domains) is predictable and the same as in the processing loop. With this initialization there are no accesses to remote memory, so we can expect the best performance.

We test using 16 threads in 2 configurations: (1) all the threads belong to NUMA domain 0 and (2) the first half of the threads belong to NUMA domain 0 and the second half to NUMA domain 1.

The testing results look like this (10 repetitions, we present mean values, the standard deviation is in parentheses):

Initialization Type          | 8 threads in NUMA 0 + 8 threads in NUMA 1 | 16 threads in NUMA 0
Single Thread Initialization | 0.371 s (0.003)                           | 0.225 s (0.001)
Random Thread Initialization | 0.206 s (0.003)                           | 0.223 s (0.002)
Static Thread Initialization | 0.140 s (0.001)                           | 0.221 s (0.001)

When all 16 threads are in the same NUMA domain, the initialization type doesn’t play any role with regard to performance. But when 8 threads are in the first NUMA domain and the other 8 are in the second NUMA domain, we see performance differences. Single thread initialization is the worst, followed by random and then static initialization, for the reasons already explained.


  1. Please note that a block of memory that is allocated in one place can have its various parts in different NUMA domains.
