CPU Dispatching: Make your code both portable and fast

Let’s say you’ve established that a particular function is performance critical. Your laptop has a recent CPU, so you compile your program with -march=native¹ and see great speed improvements. But a binary compiled like this is only guaranteed to work on processors that support the same instruction set extensions as yours. The compiler has optimized the binary for the local processor, so you cannot distribute it like this.

Now there are several ways to fix this. For example, you can send users the source code and let them compile it themselves. Or you can give up on the optimizations and simply deliver the slow binary. But there is a more elegant way to solve this. Welcome to the world of CPU dispatching.

CPU Dispatching

CPU dispatching is a technique where your binary detects which features the CPU has and, based on that, decides which version of the code to execute. For CPU dispatching to work, you need to provide several flavors of your critical function. The CPU dispatcher then picks the right one for the CPU at runtime. CPU dispatching can be implemented manually; here is one way to do it:

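typedef int (*func_pointer)(int *arr, int len);

// Forward declaration, so the pointer can be initialized with the dispatcher.
int my_function_dispatch(int *arr, int len);

func_pointer my_function = &my_function_dispatch;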

int my_function_cpu_fast(int *arr, int len) {
   // Implementation for CPU 1
   ...
}

int my_function_cpu_generic(int *arr, int len) {
    // Implementation for CPU 2
    ....
}

int my_function_dispatch(int *arr, int len) {
    int cpu_type = detect_cpu();
    if (cpu_type > 15) {
        my_function = &my_function_cpu_fast;
        return my_function_cpu_fast(arr, len);
    } else {
        my_function = &my_function_cpu_generic;
        return my_function_cpu_generic(arr, len);
    }
}

int main() {
    ...
   int res = my_function(arr, len);
}

In order to enable dispatching, we call our function through a function pointer called my_function, which is initialized with my_function_dispatch. The first time the function is called, my_function_dispatch detects the CPU and updates the my_function pointer accordingly. From that point on, calls go directly to the version selected by the dispatcher.

This approach has good and bad sides. The good side is that it is very flexible: you can have many implementations for different kinds of processors, and you can even switch between them at runtime. But there are several drawbacks:

Somebody can intentionally or unintentionally change the function to which the pointer points.
In case of concurrent access to the pointer we need to introduce synchronization in the dispatcher.
We are not getting any help from the compiler. The compiler doesn’t know that my_function_cpu_fast will only run on CPUs that support feature X, so it compiles its body with the same baseline instruction set as everything else.
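One possible way to soften the first two drawbacks in C++ is to resolve the implementation once, in a const function-local static, whose initialization is thread-safe since C++11. The sketch below is only an illustration under assumptions: the names sum, sum_generic, sum_avx2 and pick_sum are hypothetical, and the CPU check uses the GCC/Clang builtin __builtin_cpu_supports instead of the detect_cpu function from the example above.

```cpp
#include <cassert>

using sum_fn = int (*)(const int* arr, int len);

// Hypothetical generic implementation.
static int sum_generic(const int* arr, int len) {
    int total = 0;
    for (int i = 0; i < len; i++) total += arr[i];
    return total;
}

// Hypothetical variant; the target attribute lets the compiler use AVX2
// instructions inside this one function only.
__attribute__((target("avx2")))
static int sum_avx2(const int* arr, int len) {
    int total = 0;
    for (int i = 0; i < len; i++) total += arr[i];
    return total;
}

// Chooses an implementation from the features of the running CPU.
static sum_fn pick_sum() {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_generic;
}

int sum(const int* arr, int len) {
    assert(len >= 0);
    // Initialized exactly once, even if several threads make the first call
    // at the same time. It is const, so nothing can repoint it afterwards.
    static const sum_fn impl = pick_sum();
    return impl(arr, len);
}
```

The price of this design is a guard check on every call and the loss of the ability to switch implementations at runtime.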

Compiler leveraged CPU dispatching

GCC and Clang can help with CPU dispatching. In GCC this feature is called Function Multiversioning. Different CPUs have different features, for example sse4, avx or avx512f. To use them, you define several functions with the same name, parameters and return type, and mark each one with a different target. The target specifies which feature the processor must have for that version to be selected. Here is an example:

__attribute__((__target__ ("avx512f")))
void add_vectors(float* res, float* in1, float* in2) {
    ...
}

__attribute__((__target__ ("default")))
void add_vectors(float* res, float* in1, float* in2) {
    ...
}

We defined two functions. One is marked with target("avx512f"), the other with target("default"). If the binary runs on a CPU with avx512f support, that version is called; otherwise the call falls back to the version with the default target.

The nice thing about this approach is that the compiler actually compiles the function marked with target("avx512f") using the instructions and registers available on AVX512F CPUs. Also, you don’t need to write a dispatch function; the compiler and dynamic linker take care of everything. Please note that the function with target("default") is mandatory: it is the fallback in case the CPU doesn’t support any of the advanced features.
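To make the idea concrete, here is a minimal sketch of Function Multiversioning with complete function bodies and a call site. It assumes C++ compiled with GCC or Clang on x86-64 Linux; the len parameter and the helper add_three are additions made for illustration and are not part of the example above.

```cpp
#include <cassert>

// Version selected on CPUs with AVX512F. The compiler is free to use
// 512-bit registers when it vectorizes this loop.
__attribute__((target("avx512f")))
void add_vectors(float* res, const float* in1, const float* in2, int len) {
    assert(len >= 0);
    for (int i = 0; i < len; i++) res[i] = in1[i] + in2[i];
}

// Mandatory fallback for every other CPU.
__attribute__((target("default")))
void add_vectors(float* res, const float* in1, const float* in2, int len) {
    assert(len >= 0);
    for (int i = 0; i < len; i++) res[i] = in1[i] + in2[i];
}

// Hypothetical caller. The call looks like an ordinary function call;
// which version runs is decided by the compiler-generated resolver.
void add_three(float* res, const float* x, const float* y, const float* z, int len) {
    add_vectors(res, x, y, len);
    add_vectors(res, res, z, len);
}
```

Both versions have the same source here on purpose: the difference is in the instructions the compiler may emit for each of them.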

Manual CPU dispatching with compiler’s help

The above approach works nicely when you want to dispatch on features the compiler knows about, such as supported instruction sets. But in certain cases you want to dispatch your functions based on some other parameter². This is possible too, but you need to write the dispatch function manually.

__attribute__((ifunc ("my_function_dispatch")))
void my_function(float* a, float* b, float* res, int len);

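typedef void my_function_t(float*, float*, float*, int);

// The candidate implementations, defined elsewhere with the same signature.
void my_function_optimized(float* a, float* b, float* res, int len);
void my_function_default(float* a, float* b, float* res, int len);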

extern "C" {
    my_function_t* my_function_dispatch() {
        int dispatch_criteria = get_dispatch_criteria();
        if (dispatch_criteria > 10) {
            return &my_function_optimized;
        } else {
            return &my_function_default;
        }
    }
}

In the above example, we declare a function my_function and mark it with __attribute__((ifunc("my_function_dispatch"))). When a function is declared this way, the function named inside ifunc (in our case my_function_dispatch) is called by the dynamic loader to decide which implementation my_function refers to. This typically happens while the program is being loaded, before main runs.

When you write your dispatch function, you need to pay attention to a few details. The dispatch function runs very early, so it must not assume that the rest of the program is initialized. It doesn’t receive any parameters, and it returns a pointer to the function that should be used from that point on. The function pointer returned by my_function_dispatch needs to match the signature of my_function.
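The following sketch puts these rules together in one translation unit that should build with GCC or Clang on x86-64 Linux. All names (scale, scale_generic, scale_avx2, resolve_scale) and the FORCE_GENERIC macro are hypothetical. The macro stands in for "some other parameter": it lets you force the generic version at build time, for example during testing.

```cpp
extern "C" {

typedef void scale_t(float* data, int len, float factor);

// Hypothetical generic implementation.
void scale_generic(float* data, int len, float factor) {
    for (int i = 0; i < len; i++) data[i] *= factor;
}

// Hypothetical variant that the compiler may vectorize with AVX2.
__attribute__((target("avx2")))
void scale_avx2(float* data, int len, float factor) {
    for (int i = 0; i < len; i++) data[i] *= factor;
}

// Resolver: takes no parameters and returns a pointer whose type matches
// scale(). It runs very early, so it only uses compiler builtins.
scale_t* resolve_scale() {
    __builtin_cpu_init();
#ifdef FORCE_GENERIC
    return scale_generic;
#else
    return __builtin_cpu_supports("avx2") ? scale_avx2 : scale_generic;
#endif
}

// Calls to scale() go to whatever resolve_scale() returned.
void scale(float* data, int len, float factor)
    __attribute__((ifunc("resolve_scale")));

}  // extern "C"
```

Defining the resolver before the ifunc declaration and keeping everything inside extern "C" avoids having to spell a mangled name in the attribute.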

In practice…

A short introduction to SIMD

In order to test CPU dispatching properly, we need to introduce the CPU features we will be using for testing. The most common features available on desktop CPUs are instructions for vector processing (SIMD), so let’s cover them first.

SIMD stands for Single Instruction Multiple Data. It means that a CPU with SIMD has additional instructions that can process several pieces of data at once. These instructions are also called vector instructions, and the process of making your code use them is called vectorization. All desktop CPUs since 2005 have some kind of SIMD.

SSE³ is a SIMD extension on x86 and x86-64 that introduces 128-bit registers and instructions that work on these registers. The original SSE handles single-precision floats; its successor SSE2 adds doubles and integers. With both, the processor can process 2 doubles or 4 floats or 4 integers or 16 booleans in one instruction.

AVX⁴ is a newer SIMD extension on x86 and x86-64 that introduces 256-bit registers and instructions that work on these registers. This means that the processor can process 4 doubles or 8 floats or 8 integers or 32 booleans in one instruction. AVX512⁵ is an even newer SIMD extension with 512-bit registers. At the time of writing, AVX512 is supported only by Intel processors.
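The element counts above follow directly from the register width. A small sketch of that arithmetic, with a hypothetical helper named lanes:

```cpp
// Number of elements of a given size that fit into one SIMD register.
constexpr int lanes(int register_bits, int element_bytes) {
    return register_bits / (8 * element_bytes);
}

static_assert(lanes(128, sizeof(float)) == 4, "SSE: 4 floats per register");
static_assert(lanes(128, sizeof(double)) == 2, "SSE2: 2 doubles per register");
static_assert(lanes(256, sizeof(float)) == 8, "AVX: 8 floats per register");
static_assert(lanes(256, sizeof(double)) == 4, "AVX: 4 doubles per register");
static_assert(lanes(512, sizeof(float)) == 16, "AVX512: 16 floats per register");
```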

These extensions are very useful for processing large amounts of data, e.g. audio and video processing, scientific computing etc. Unfortunately, a compiler running with default flags never produces AVX or AVX512 code. Even when the AVX or AVX512 extensions are enabled through compiler switches, there is no guarantee the compiler will actually use them.

The experiment

We’ve tried the compiler leveraged CPU dispatching and it mostly worked as described. You can find the examples in our GitHub repository. In the directory 2020-06-cpudispatching run make cpudispatching.

For the test we decided to implement a function add which takes arrays a and b as arguments, sums them up element by element and puts the result into res. Here is the declaration of the dispatchable function add and its dispatcher add_dispatch.

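// Function type shared by add and all of its implementations.
typedef void add_t(float*, float*, float*, int);

__attribute__((ifunc ("add_dispatch")))
void add(float* __restrict__  a, float* __restrict__  b, float* __restrict__  res, int len);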

extern "C" {
    add_t* add_dispatch() {
        __builtin_cpu_init ();
        if (__builtin_cpu_supports ("avx")) {
            return add_avx_manual;
        } else if (__builtin_cpu_supports ("sse2")) {
            return add_sse_manual;
        } else {
            return add_default;
        }
    }
}

GCC and Clang provide a handy builtin called __builtin_cpu_supports that can be used to check whether the CPU supports a given SIMD extension. We came across a problem with naming the dispatcher function for manual dispatching. Namely, both compilers complained that they could not find the name provided inside ifunc("add_dispatch"). A quick search on the internet suggests that it has something to do with name mangling⁶. Surrounding the function with extern "C" solves this issue by disabling name mangling for the dispatcher function.
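As a quick way to see what __builtin_cpu_supports reports on a given machine, a helper along the following lines can be used. The function name supported_simd_features is hypothetical, and the list of features checked is just a sample.

```cpp
#include <string>
#include <vector>

// Returns the names of the sampled SIMD extensions that the running CPU
// supports, ordered from oldest to newest.
std::vector<std::string> supported_simd_features() {
    std::vector<std::string> found;
    __builtin_cpu_init();
    // The builtin only accepts string literals, so every feature is
    // checked explicitly instead of in a loop over names.
    if (__builtin_cpu_supports("sse2"))    found.push_back("sse2");
    if (__builtin_cpu_supports("sse4.2"))  found.push_back("sse4.2");
    if (__builtin_cpu_supports("avx"))     found.push_back("avx");
    if (__builtin_cpu_supports("avx2"))    found.push_back("avx2");
    if (__builtin_cpu_supports("avx512f")) found.push_back("avx512f");
    return found;
}
```

Printing the returned names at program start is a cheap sanity check before trusting that the dispatcher picked the version you expected.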

We used an older AMD A8-4500M CPU that supports SSE4 and AVX. We tried both automatic dispatching by the compiler and manual dispatching. Nothing fancy here, everything worked straight out of the box.

This is the function we used for testing:

__attribute__((__target__ ("default")))
void add_default(float* __restrict__ a, float* __restrict__ b, float* __restrict__ res, int len) {
    float* __restrict__ aa = (float*) __builtin_assume_aligned(a, 128);
    float* __restrict__ ba = (float*) __builtin_assume_aligned(b, 128);
    float* __restrict__ resa = (float*) __builtin_assume_aligned(res, 128);

    for (int i = 0; i < len; i++) {
        resa[i] = aa[i] + ba[i];
    }
}

The function takes two float arrays a and b, sums them up and puts the result into array res. Note a few unusual things about the above code. There is a __restrict__ keyword that tells the compiler the pointers don’t alias each other (this is a magical keyword that enables compiler optimizations, I will write about it in the future). Also notice the __builtin_assume_aligned builtin. This is a hint to the compiler that those pointers are 128-byte aligned.
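The alignment hint is a promise, not a request: the caller has to make sure the arrays really are aligned. One way to do that, assuming C++17 and a C library that provides std::aligned_alloc, is sketched below. The helper names alloc_aligned_floats and is_aligned are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates room for `count` floats, aligned to `alignment` bytes
// (a power of two). Release the memory with std::free.
float* alloc_aligned_floats(std::size_t count, std::size_t alignment = 128) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // std::aligned_alloc expects the size to be a multiple of the alignment,
    // so the byte count is rounded up.
    std::size_t bytes = count * sizeof(float);
    std::size_t rounded = (bytes + alignment - 1) / alignment * alignment;
    return static_cast<float*>(std::aligned_alloc(alignment, rounded));
}

// Checks whether a pointer is a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Arrays obtained this way satisfy the 128-byte assumption, and therefore also the 16-byte and 32-byte alignment that aligned SSE and AVX loads require.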

These two hints should allow the compiler to figure out that it can safely use SSE4 and AVX instructions. Unfortunately this didn’t help. The disassembly of function add for the versions with targets sse4 and avx did show that the compiler generated some instructions with XMM registers (SSE extension) and YMM registers (AVX extension), but there were no differences in speed for any of them. Disappointing!

So we manually implemented AVX and SSE versions, using the C/C++ SSE and AVX intrinsic functions available through the headers emmintrin.h and immintrin.h⁷. Here are the two implementations:

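__attribute__((__target__ ("sse2")))
void add_sse_manual(float* __restrict__ a, float* __restrict__ b, float* __restrict__ res, int len) {
    __m128 aa, bb, rr;
    // Process 4 floats per iteration; n is len rounded down to a multiple of 4.
    int n = len - len % 4;
    for (int i = 0; i < n; i += 4) {
        aa = _mm_load_ps(a + i);   // aligned load, needs 16-byte alignment
        bb = _mm_load_ps(b + i);
        rr = _mm_add_ps(aa, bb);
        _mm_store_ps(res + i, rr);
    }
    // Scalar loop for the remaining len % 4 elements.
    for (int i = n; i < len; i++) {
        res[i] = a[i] + b[i];
    }
}


__attribute__((__target__ ("avx")))
void add_avx_manual(float* __restrict__ a, float* __restrict__ b, float* __restrict__ res, int len) {
    __m256 aa, bb, rr;
    // Process 8 floats per iteration; n is len rounded down to a multiple of 8.
    int n = len - len % 8;
    for (int i = 0; i < n; i += 8) {
        aa = _mm256_load_ps(a + i);   // aligned load, needs 32-byte alignment
        bb = _mm256_load_ps(b + i);
        rr = _mm256_add_ps(aa, bb);
        _mm256_store_ps(res + i, rr);
    }
    // Scalar loop for the remaining len % 8 elements.
    for (int i = n; i < len; i++) {
        res[i] = a[i] + b[i];
    }
}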

The implementation might look complicated, but it actually isn’t. __m256 is a type that holds 8 floats. We use _mm256_load_ps to load eight floats from memory into a __m256 variable, we add them together using _mm256_add_ps and store the result back to memory using _mm256_store_ps.
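The aligned load and store intrinsics used above require 16-byte (SSE) or 32-byte (AVX) aligned addresses. When that cannot be guaranteed, the unaligned variants can be used instead. Below is a minimal sketch of the SSE case; the function name add_sse_any_length is hypothetical, and the same pattern applies to AVX with _mm256_loadu_ps and _mm256_storeu_ps.

```cpp
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps

// Adds two float arrays of any length and any alignment.
void add_sse_any_length(const float* a, const float* b, float* res, int len) {
    int i = 0;
    // Main loop: 4 floats per iteration, using unaligned loads and stores.
    for (; i + 4 <= len; i += 4) {
        __m128 aa = _mm_loadu_ps(a + i);
        __m128 bb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(res + i, _mm_add_ps(aa, bb));
    }
    // Tail loop: the remaining 0 to 3 elements are handled one by one.
    for (; i < len; i++) {
        res[i] = a[i] + b[i];
    }
}
```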

The automatic dispatcher correctly picked the fastest function in our testing environment (AVX). The manual dispatcher that uses __builtin_cpu_supports did the same. Here are the numbers for doing 100 calls to the function add with arrays of 100 MB in size:

Implementation	Default	SSE	AVX
Time (ms)	220	70	40

Runtimes for three different implementations of function add
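For readers who want to take similar measurements, a timing loop could look roughly like the sketch below. This is not the harness from the repository; time_calls_ms and add_scalar are hypothetical names. Note that std::vector gives no 128-byte alignment guarantee, so only functions that do not rely on alignment should be passed in.

```cpp
#include <chrono>
#include <vector>

using add_fn = void (*)(float* a, float* b, float* res, int len);

// Hypothetical stand-in for the function under test.
void add_scalar(float* a, float* b, float* res, int len) {
    for (int i = 0; i < len; i++) res[i] = a[i] + b[i];
}

// Calls fn `repetitions` times on arrays of `len` floats and returns the
// total wall-clock time in milliseconds.
double time_calls_ms(add_fn fn, int len, int repetitions) {
    std::vector<float> a(len, 1.0f), b(len, 2.0f), res(len, 0.0f);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; r++) {
        fn(a.data(), b.data(), res.data(), len);
    }
    auto stop = std::chrono::steady_clock::now();
    // Read one result through a volatile so the work cannot be discarded.
    volatile float sink = len > 0 ? res[len - 1] : 0.0f;
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Running the measurement several times and comparing the runs helps to separate real differences from noise.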

We were expecting some improvements, but we didn’t expect this. These are great results, so you can expect vectorization to be part of our future investigations.

Final Words

CPU dispatching is a great way to make the most of the CPU’s resources without sacrificing portability. Paired with vectorization, it lets your programs run at full speed on every CPU they are deployed to!

Further Reading

Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms, chapter 13.6 “CPU dispatching at load time in Linux”

The Intel Intrinsics Guide

Featured image courtesy of: https://www.pcgamer.com/intel-core-i9-10900k-review-performance-benchmarks/

¹ The compiler switch -march=native tells the compiler to compile the code for the CPU the compiler is running on.
² For example, during testing you want to be able to select which version of the function to dispatch, since the same CPU can support several features at once.
³ The first processors with the SSE extension became available in 1999.
⁴ The first processors with the AVX extension became available in 2011.
⁵ The first processors with the AVX512 extension became available in 2016.
⁶ Name mangling is a technique C++ compilers use to generate names for the linker. Since there can be several functions with the same name, the compiler mangles the name so that it encodes additional information such as the parameter types.
⁷ Check out the Intel Intrinsics Guide for more information.