Making your program run faster in a multithreaded environment

Master software performance in just 16 hours!
Join our Software Optimization for the Memory Subsystem Workshop taking place from May 18th to May 21st. Click here to express interest or register.

Multicore systems are everywhere nowadays, but there aren’t many programs that really utilize the full available processor power. Two reasons seem to stand out: multithreaded programming is difficult, and speed improvements are not always worth the effort.

In the multithreaded world, a completely new type of bugs related to race conditions comes into play, something that doesn’t exist in a regular day-to-day serial development. And these bugs are difficult to reproduce and difficult to debug. Also, most modern languages don’t have multithreading as something built-in, but more like an add-on, which requires time to master but is not useful outside multithreading programming.

Central concepts of multithreading: threads and mutexes

When it comes to development on multicore and multiprocessor systems¹, two concepts are crucial: threads and mutexes. You can imagine threads as small programs that run side by side, but with addition that threads can communicate with each other using global variables, shared class instances, etc, in the same manner two functions communicate.

Two threads can run on the same processor, or on different processors. In case they run on the same processor, the operating system will run the first thread for a short time (called time slice), and then the second thread, etc. Since these slices are really short, the user will see them as running together in parallel.

Race condition

When two threads try to access the same part of memory, a problem known as race condition can arise. A consistent state of an object typically requires modifying a few memory locations in a certain order. Imagine a thread trying to inform another thread that it finished a certain task. It writes a result of the operation to memory location A and a completion flag to memory location B.

Imagine what would happen if thread 1 would write to memory location A, but then got interrupted, and then thread 2 would write to memory location A and then to memory location B. In this case part of the computation would get lost. Or imagine a situation where thread 1 wrote to location A then location B, but thread 2 saw the modification of location B without the modification to location A². Thread 2 would read the wrong value.

Race conditions and mutexes

The event when the state of an object becomes inconsistent because of multithreading synchronization issues is called a race condition and can lead to difficult to debug bugs. To prevent them, programs use something called synchronization primitives. The most famous one is mutex (short for mutual exclusion): when a thread 1 wants to modify the resource which it doesn’t own exclusively, it calls the lock on the associated mutex. When thread 2 tries to lock the same mutex, the operating system will suspend it until thread 1 unlocks the mutex. Every change of the resource state while the mutex is locked will appear as an atomic operation for every other thread in the system.

This is the simplest mode of avoiding race conditions, but it comes with a price. In case of contention (when several threads are trying to lock the same mutex simultaneously), many threads will get suspended until the common mutex is unlocked. A system that is supposed to be faster when running on several processors suddenly becomes slower than when running on a single processor.

In our measurements, a single-threaded program took 0.5 ms to sum up an array of 100 million numbers. But a very naive implementation where four threads were accessing the same mutex-protected memory location to keep the intermediate result faired off much worse: 25 seconds. This is not the way to do multithreading!

Solutions for contention

Call critical section less often

The first remedy is to call the code in the critical section less often or to make it take less time. Imagine you wrote a memory allocator with function malloc that returns a single chunk of memory. If the program needs 2000 chunks, it will need to call malloc 2000 times. A simple extension to the API that returns an array of 2000 chunks will help mitigate the issue.

Move the complicated logic to a separate thread

If the code in the critical section takes too long to execute, you should consider moving it to a separate thread. Imagine you have a component that is responsible for the management of TV channels on a TV set. There are several API functions add_channel, remove_channel, find_channel etc. that are protected with a single mutex. Now imagine, there is a large contention on that mutex which is blocking many threads that are trying to use the API. The solution is simple: the logic of the component should be moved to a separate thread. All the API functions now become simple wrappers that send requests that the component thread which does the actual processing. The API functions become short and the contention is gone.

Create a resource for each thread and avoid resource sharing

And lastly, are you sure you really need that several threads share a single resource. Again, in the example of a memory allocator in a multithreaded environment: many allocators implement per-thread cache of available memory in order to avoid synchronization. Another example: if several threads want to log data to the same log file, you will need a mutex. But an alternative would be to have each thread log into a separate log file, in which case no mutex is needed.

Spinlocks – alternative to mutex when the critical section is short

The cost of mutexes is related to two things: the first is that thread suspension and unsuspension consume a lot of processor cycles and are done periodically; not as soon as mutex gets unlocked. Even when the mutex is unlocked very soon after the thread suspension, it will take time before the suspended tread gets resumed.

If the critical section is really short and there is a large contention, you might want to try spinlocks. Spinlock is a bit different implementation of mutex. In the case that the spinlock is already locked, the thread trying to get a lock will not get suspended. Instead, it will continue to run, querying the spinlock until the spinlock gets unlocked. With a short critical section, this should happen very quickly and this thus avoids the expensive thread suspension.

Spinlocks are not available in the standard C++ library, but boost offers them. Notice, however, that your program is actually consuming CPU cycles while waiting on a spinlock and this can have a negative effect on other running threads. This is especially a problem if two threads are running on the same processor. Can you guess why?

Mutexes cost even when there is no contention

In spite of making your critical section very short, it can happen that the time spent in the critical section is larger than you would expect, even though there is zero contention. The reason is that mutex and spinlock take time because they employ special hardware mechanisms.³ In our measurements, summing up an array of 100 million elements with no contention (everything was done in one thread) took 0.6 seconds without any synchronization and 3.7 seconds when the access to the memory location containing intermediate result was protected with a mutex.

Atomic variables instead of mutexes

There are a few ways you might work around this issue. If inside the critical section you are modifying a single instance of a simple data type (e.g. increasing a counter, changing the value of a flag, etc.), you could try using atomic variables. As their name suggests, modification done to the atomic variable is atomic and all other threads will see the change as atomic.

Hardware atomic support is provided in C++ for all basic types, and for derived types it is emulated in software. An example of an atomic variable:

std::atomic<int> access_count_atomic;

void increase_access_count() {
    access_count_atomic++;
}

If you are programming in C, support for atomics is provided in C11 standard. If you are using an older standard, every compiler has some kind of extension to support atomics.

We wrote a small program that sums up 100 million elements. We created two versions: one with no contention (everything happens in one thread), the other with a huge amount of contention (every thread was accessing the same memory location to store the intermediate results). Here are the numbers:

	Mutex	Spinlock	Atomic variable	No locks
No contention (serial addition)	3.7 s	2 s	1.6 s	0.6 s
Large contention (4 threads)	24 s	11 s	8 s	–

Comparison between different types of synchronization primitives

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us

You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Lock-free data structures

Standard C++ containers (arrays, queues, stacks etc.) are not thread-safe. If you want thread safety, you will need to explicitly make them thread-safe using mutexes. However, in the case of a large contention, the mutexes will cause your threads to get suspended often. The solution are lock-free data structures. Lock-free data structures don’t use mutexes to guarantee thread safety. Instead, they use atomic variables in a clever way to achieve the same.

Lock-free data structures are very complicated to get right. Luckily, boost provides implementation of three lock-free data structures: queue, stack and ringbuffer.

Types of multithreaded programs

There are basically two types of multithreaded programs. In the first type, the threads do unrelated jobs and they are generally tied to components. In an example of software for a TV set, one thread processes events from the remote controller, the other thread is in charge of drawing the UI on the screen. The system might or might not run faster on the multithreaded processor, but that is not the ultimate goal. The goal is to separate logic to different threads to make sure each component gets its share of the CPU time.

In the second type of multithreading, the goal is to make the program run faster by distributing work to several threads. Imagine a program that has to find the sum of all elements in a huge array. The program can spawn several threads on each processor. Each thread would be assigned its own part of the array on which to perform the calculation.

The first rule of thumb for fast multithreading: to achieve maximum speed, the threads should communicate or share common resources as little as possible. From our previous example, the threads that are summing up parts of the array may hold the intermediate value in a global counter protected with mutex, in a global counter which is an atomic variable, or they can have a local copy of the current sum (in which case, after the summing is done, all the local copies needs to be merged into the final result). We did our research on a program with four threads. The mutex version took 24 s, the atomic version took 8.4 s and the version with local copies took 0.13 s. The serial version took 0.6 ms. Only the version with local copies makes sense to use and actually makes your program faster.

False sharing

And for the end, one step lower into the hardware. Even though two threads might not share the memory location logically, they can share it physically. If two threads are accessing two neighboring memory locations, there is a large chance that two memory locations are in the same cache line.⁴ This is called false sharing and when this happens, the performance of a multithreaded program might be low even though no logical sharing takes place. An example of false sharing:

// Create threads array
std::vector<std::thread> threads;
threads.reserve(num_threads);
// Thread i keeps its intermediate result in result_arr[i]. Please notice
// we are using volatile to force the compiler to write to array on
// every access in the thread body
volatile int result_arr[64];

// Thread body. The thread is summing up elements from start to stop
auto f = [&](int thread_id, int start, int stop) {
	result_arr[thread_id] = 0;
	for (int i = start; i < stop; i++) {
		result_arr[thread_id] += v[i] / rand_num;
	}
};

// Starting threads
for (int i = 0; i < num_threads; i++) {
	 threads.emplace_back(f, i, i * count_per_threads,
			      (i + 1) * count_per_threads);
}

// Summing up the result of each thread
int res = 0;
for (int i = 0; i < num_threads; i++) {
	threads[i].join();
	res += result_arr[i];
}
std::cout << "Result = " << res << std::endl;

Thread i keeps its intermediate result in variable result_arr[i]. Each thread has its own place in result_arr, but those variables are neighbors in memory and are shared on the hardware level. In our measurements, summing up 100 million elements with false sharing took 200 ms, in contrast to 120 ms for the version without false sharing.

The good thing about false sharing is that it happens rarely. Compilers usually optimize away series of writes to the same memory location. However, in case of complicated memory patterns or pointer aliasing, the compiler might not be able to remove writes to memory and voilà false sharing.

If you doubt your code is experiencing false sharing, there are tools to help you detect that. These are Intel’s VTune Profiler and open-source perf c2c.

Final Words

I hope this article gave you a good introduction about the challenges of multithreaded programming. Indeed, it is not too difficult to write a program that actually runs slower in a multithreaded environment. But, when written carefully, those programs will run faster and they will scale in the future as more cores become available.

Featured image by: vector PNG Designed By pikepicture from Pngtree.com

Further in text I will use multiprocessor to refer to both multiprocessor and multicore systems, because when we are talking [↩]
This is actually possible on modern out-of-order execution hardware [↩]
Special hardware mechanisms are employed to maintain memory coherency between different processors. [↩]
Typically, cache lines are 64 bytes wide, which means that 8 doubles, or 16 floats, or 16 integers can share the same cache line [↩]

2 comments / Add your comment below

Tan Do says:
July 31, 2021 at 5:34 am
Thanks for your great article. Can you do a benchmark with Thread Affinity ? I use the same example with Thread Affinity but don’t see the latency reduction. In some case, the Thread Affinity is slower than context switch
1. Ivica Bogosavljević says:
  July 31, 2021 at 8:38 am
  I am not sure what you mean, but all the code is available here, so feel free to experiment on your own. You could post the results of your experiments here 🙂