We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
The previous post was about decreasing the number of memory accesses that your code was explicitly making. In this post we talk about a different type of memory access: memory accesses that the compiler inserted for you, without your knowledge, either to make your program correct (pointer aliasing) or possible at all (register spilling).
The good thing about this type of memory access is that it often has very good locality, meaning it hits the fastest levels of cache. The bad thing is that it increases the instruction count, which results in slower programs and wasted memory subsystem bandwidth. But, oftentimes, with a bit of skill, it is possible to “nudge” the compiler to do things correctly.
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
Unnecessary Loads and Stores due to Possible Pointer Aliasing
If you are a user of C or C++, you are allowed, through the use of pointers, to have two pointers pointing to the same memory location. Unfortunately, this makes things much more complex for an optimizing compiler. Take the following example:
```cpp
void inc_if_greater(double* arr, int n, double& val) {
    for (int i = 0; i < n; i++) {
        if (arr[i] > val) {
            arr[i] += 1.0;
        }
    }
}
```
When the compiler emits assembly with full optimizations, one could reasonably expect it to load the value of val into a register once before the loop on line 2 and then reuse it in the loop. But this doesn’t happen. What actually happens is that the value of val is reloaded from memory into a register in each iteration of the loop. Why?
The reason is simple. Imagine we call our function as inc_if_greater(arr, 10, arr[5]). This is perfectly legal. When i hits the value 5, the value of the parameter val suddenly changes. So, in order to emit correct assembly, the compiler needs to reload the value of val from memory in every iteration of the loop.
Calling inc_if_greater(arr, 10, arr[5]) like this would mostly be considered shady developer practice. But the compiler doesn’t know that, and it needs to make sure that the function works correctly for all possible input parameters.
When the compiler has to emit less optimal code to ensure that the assembly remains correct under possible pointer aliasing, but we as developers can guarantee that no actual aliasing takes place, this kind of deoptimization can be fixed.
More info about pointer aliasing here.
Debugging Slowdowns due to Possible Pointer Aliasing
The best way to debug pointer aliasing issues is to investigate the compiler optimization report. CLANG’s compiler optimization report contains the places where the compiler had to perform additional loads because of pointer aliasing.
Interpreting compiler reports is an art in itself, so please refer to the compiler optimization report post for more info on how to find the places where the compiler emits additional loads because of possible pointer aliasing.
Another way is to look at the emitted assembly and check whether the variable under investigation is kept in a register.
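For example, with CLANG one might dump the optimized assembly for inspection (the file name inc_if_greater.cpp is hypothetical):

```shell
# Emit optimized assembly to stdout for inspection
clang++ -O2 -S -o - inc_if_greater.cpp
```

If the loop body contains a load of val in every iteration, rather than a single load hoisted before the loop, the variable is not being kept in a register.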
Fixing Issues with Possible Pointer Aliasing
Fixing issues with possible pointer aliasing is relatively easy. If you know for sure that two pointers don’t alias, you can:
- If a pointer to an array never aliases a pointer to a single value, copy the single value to a local variable. The compiler can easily establish that a local variable is not aliased. (In the example we gave earlier, copying double& val to a local double val_priv would fix the aliasing problem.)
- If two pointers to arrays never alias, mark the pointers with __restrict__ (C++) or restrict (C) to inform the compiler of that.
- Use vectorization pragmas (e.g. #pragma omp simd) to force vectorization of a loop; this has the side effect of ignoring potential pointer aliasing.
Bear in mind, however, that if two pointers actually do alias and you performed one of the above-mentioned interventions, the result can be incorrect. This is especially true for the restrict keyword.
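As an illustration of the first fix, here is a sketch of the earlier inc_if_greater example with the referenced value copied into a local variable. Note that this is only valid under the guarantee that val never aliases arr; if a caller passes arr[5] as val, the behavior changes.

```cpp
// Local copy of the referenced value: the compiler can prove that
// val_priv is not aliased by arr, so it can keep it in a register
// instead of reloading val from memory in every iteration.
void inc_if_greater(double* arr, int n, double& val) {
    const double val_priv = val;  // safe only if val never aliases arr
    for (int i = 0; i < n; i++) {
        if (arr[i] > val_priv) {
            arr[i] += 1.0;
        }
    }
}
```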
Unnecessary Loads and Stores due to Register Spills
Register spills happen when the compiler doesn’t have enough hardware registers to hold all the values it needs at some point in your program. It will then spill values from registers to the stack to make space for newly needed data1. And when those values are needed again, it will reload them from the stack into registers.
Some register spilling is expected in most code. But the bad kind of register spilling happens inside hot loops: a value gets spilled and reloaded, and this keeps happening in every iteration of the loop. Spilling is not useful work; in an ideal case it should not happen at all. If that’s not possible, then it should happen to the values that are needed the least.
Code that typically suffers from register spilling issues consists of large functions and loops with large bodies. The problem is made worse if a loop contains hot parts, which execute a lot, and cold parts, which execute rarely. Spilling in the cold parts would be fine, but spilling in the hot parts leads to performance degradation. Most of the time the compiler doesn’t have runtime information, so it can guess wrong and optimize the cold path at the expense of the hot path.
How Does the Compiler Determine the Number of Needed Registers?
To determine whether a register is needed for a variable, the compiler performs live range analysis. At each point in a program, the compiler knows which values2 are in use (live) and which values are not needed anymore (dead). A value becomes live when it is initialized, under the condition that someone uses it later. A value dies when it is read for the last time.
When a value dies (it is not needed anymore), the register that was used to hold it can be reused for something else. If the compiler determines that there are more live values than available hardware registers, it will need to spill some of the live values.
To illustrate live range analysis, consider the following example.3
```cpp
int foo(int c) {        // live { c },    dead { a, b }
    int a = 0, b;       // live { a, c }, dead { b }
    do {                // live { a, c }, dead { b }
        b = a + 1;      // live { b, c }, dead { a }
        c = c + b;      // live { b, c }, dead { a }
        a = b * 2;      // live { a, c }, dead { b }
    } while (a < 0);    // live { c },    dead { a, b }
    return c;           // live { c },    dead { a, b }
}
```
On the left is the source code; on the right, in the comments, is the liveness status of each value after the statement on the left executes. Here we see that at any point in time only two values are live. Therefore, the compiler needs only two registers to avoid register spilling completely. Here is one possible allocation of registers:
```cpp
int foo(reg0) {              // c in reg0
    reg1 = 0;                // a in reg1
    do {
        reg1 = reg1 + 1;     // b in reg1
        reg0 = reg0 + reg1;  // c in reg0, b in reg1
        reg1 = reg1 * 2;     // a in reg1
    } while (reg1 < 0);      // a in reg1
    return reg0;             // c in reg0
}
```
It is interesting to observe, on line 4, the handover of reg1 from value a to value b: as input, the register holds a, and as output, it holds b.
The biggest challenge for the compiler is picking which value to spill. Ideally, the compiler should pick the value that is used least often; this results in the smallest loss of performance. For this reason, the compiler has a cost model that determines the price of spilling each value. Using the cost model, the compiler picks the cheapest value to spill.
Debugging Codes with Register Spills
Debugging codes with register spills isn’t too difficult. There are a few ways:
- The compiler optimization report will tell you, for each loop and function, how many spills and reloads there are. Unfortunately, it doesn’t tell you which values were spilled.
- Tools like Intel Advisor can detect register spills at runtime.
- Investigate the program’s memory access pattern using memory observation tools (e.g. Valgrind’s lackey tool). A single memory address that gets accessed far too many times is a potential site of register spilling and reloading.
- Investigate the assembly output of the compiler.
Fixing the Codes with Register Spills
Developers have no explicit control over register spills in code written in higher-level languages. The only way to have explicit control over register spilling is to code directly in assembly4. However, there are ways to implicitly influence register spilling by modifying your high-level code.
IMPORTANT NOTE 1: Modifying source code to change how the compiler allocates registers is controversial. These changes are not necessarily portable, i.e. what works on one compiler doesn’t necessarily work on another. They can also break with compiler upgrades. And many times they recommend things the compiler already does, resulting in no speedup. Use at your own discretion.
IMPORTANT NOTE 2: I don’t have personal experience with these techniques, and as far as I know, there aren’t any official guidelines or best practices on how to optimize register spilling. I have collected here some “anecdotal evidence”, from forums, blog posts and twitter threads. If you know more about this topic, feel free to comment or contact me directly, in order to update this post.
If we want to fix register spilling in our hot code, there are two types of solutions:
- Decreasing the number of spills or avoiding it completely. This means using techniques to decrease the number of live values in the part of the code where this number is highest.
- Changing which value gets spilled. In this case, the number of spills remains the same, but what variable gets spilled changes.
Decreasing the Number of Spills
As already said, these techniques aim to reduce the number of needed registers at parts of the code where this number is highest.
Shortening the Value Lifetime
A value lifetime looks like this:
- A value is initialized. After initialization, the value can be used by subsequent expressions in the program. A typical example of initialization is a call to a constructor.
- The first access is performed. The initialized value is used for the first time.
- Other accesses are performed.
- The last access is performed. The value is used for the last time; after this, the value is dead. In principle, the compiler can release the register used by the value and reuse it elsewhere.
- The value is destroyed. With class types, destruction happens when the destructor is called. Although the value is dead after the last access, the compiler may become aware of this only at this point.
Here is an example code to illustrate this:
```cpp
void foo() {
    object_t o = construct_object();
    ...
    if (condition) {
        o.my_method(x); // Last usage of variable o
        ...
    }
    ...
}
```
Object o is initialized on line 2. The first, and simultaneously the last, access is on line 5. The object is destroyed when control exits the function foo.
To decrease register spills, we want to make sure that initialization is as close to the first access as possible. Also, we want to make sure that destruction is as close to the last access as possible. In principle, the compiler has a certain freedom to move initialization and destruction statements. But this is not guaranteed, and the compiler might not perform the move, spilling registers instead. For example, in the presence of function calls, the compiler has limited information about which movements are legal and which aren’t.
In our example, we move initialization (constructor) and destruction (destructor) as close to the usage as possible:
```cpp
void foo() {
    ...
    if (condition) {
        {
            object_t o = construct_object();
            o.my_method(x);
        }
        ...
    }
    ...
}
```
We introduced a pair of braces {}
between line 4 and 7, and also we moved the construction to line 5. The initialization now happens on line 5, access on line 6, and destruction on line 7. By manually ensuring that initialization is close to the first usage and destruction is close to the last usage, you help the compiler decrease the amount of register spills. Note, that compilers can often do this things automatically, but when debugging register spills, writing calls like this might be necessary!
Keyword register
The keyword register was used as a hint to the compiler to keep a certain variable in a register. For example, declaring register int a = 0; hinted to the compiler to store the variable a in a register.
This keyword was deprecated in C++11, and in most compilers it doesn’t hint anything anymore. We mention it here for completeness.
Reductions
Generally, reductions are any operations that take more than one value and produce a single value. So, for example, all binary operators (+, -, *, &, |) are reductions. The ternary operator (?:) is also a reduction. And you can consider a function call that takes more than one parameter and returns one result a reduction as well.
Reductions are often composed together to make larger expressions. For example, the expression x = a + b + (c * d) consists of two + reductions and one * reduction. The compiler might first calculate a + b, then calculate c * d, then add them together. But, since + is associative, it can change the order of calculation: for example, it can first calculate c * d, then add b to this result, then add a.
Reductions, and especially composed reductions, are often places where many registers are needed. To decrease the number of needed registers, the compiler can reorder those reductions that are associative, i.e. those for which a + (b + c) = (a + b) + c. Note, however, that the compiler might pick a less optimal order which forces it to spill some registers. In that case, changing the order of reductions manually can help the compiler avoid register spilling.
Consider the following example5:
```java
for (int i = 0; i < longRounds; i++) {
    char c0 = s.charAt(off);
    char c1 = s.charAt(off + 1);
    char c2 = s.charAt(off + 2);
    char c3 = s.charAt(off + 3);
    char c4 = s.charAt(off + 4);
    char c5 = s.charAt(off + 5);
    char c6 = s.charAt(off + 6);
    char c7 = s.charAt(off + 7);
    if (c7 > 127 || c6 > 127 || c5 > 127 || c4 > 127 ||
        c3 > 127 || c2 > 127 || c1 > 127 || c0 > 127) {
        return i << 3;
    }
    ...
}
```
Lines 10–11 are a huge reduction, and the author was experiencing register spilling because of it. Splitting the large statement into two smaller statements helped alleviate the spills. The modified code looks like this:
```java
for (int i = 0; i < longRounds; i++) {
    long batch1 = ((long) s.charAt(off)) << 48
                | ((long) s.charAt(off + 2)) << 32
                | s.charAt(off + 4) << 16
                | s.charAt(off + 6);
    long batch2 = ((long) s.charAt(off + 1)) << 48
                | ((long) s.charAt(off + 3)) << 32
                | s.charAt(off + 5) << 16
                | s.charAt(off + 7);
    if (((batch1 | batch2) & 0xff80ff80ff80ff80L) != 0) {
        return i << 3;
    }
    ...
}
```
The original expression consisted of eight conditions joined with ||. The modified version breaks it into two smaller expressions. This decreases the pressure on the register allocator, something the compiler was unable to do on its own.
Rematerialization
There are two compiler optimizations that sit at opposite ends of the optimization space. One is common subexpression elimination, which looks for identical expressions in the code and, instead of evaluating them multiple times, evaluates them only once. For example, if an expression v + 1.0 is used several times in the program, the compiler could evaluate it once, store the result in a register and reuse it when needed. This approach increases register pressure.
The opposite of this is rematerialization. With rematerialization, the compiler takes a value that is used two or more times and replaces its uses with the expression that created it.
In a world with an unlimited supply of registers, we wouldn’t need rematerialization, since accessing registers has zero overhead. But when the number of registers is limited, and spilling and reloading registers incurs a cost, rematerializing values can be faster than spilling and reloading them.
To illustrate, consider the following example:
```cpp
double foo(double v) {
    double v_1 = v + 1.0;
    double v_2 = v + 2.0;
    double v_3 = v + 3.0;
    double v_4 = v + 4.0;
    double sum = v + v_1 + v_2 + v_3 + v_4;
    double mul = v * v_1 * v_2 * v_3 * v_4;
    return mul / sum;
}
```
The program uses the values v_1, v_2, v_3 and v_4 two times each, so a programmer might consider storing them in variables in order to help the compiler avoid the price of recalculating them.
If there is no register spilling, i.e. there are enough registers for all temporary values, then this approach works. But what happens if some values need to be spilled and reloaded? The price of a memory access is typically three cycles, whereas the price of calculating such a value on the fly is one cycle. Spilling and reloading therefore costs more than recalculating the value from the variable v when needed. We could rewrite the code like this:
```cpp
double foo(double v) {
    double sum = v + (v + 1.0) + (v + 2.0) + (v + 3.0) + (v + 4.0);
    double mul = v * (v + 1.0) * (v + 2.0) * (v + 3.0) * (v + 4.0);
    return mul / sum;
}
```
The compiler is free to move between the two versions, but there is a catch. If you write your program with explicit rematerialization, you increase the chances that the compiler chooses that version. If you write your program with the common subexpressions factored out, you increase the chances of the compiler picking that version. Because the compiler can move between these two representations, changing the C/C++ code does not guarantee which one the generated assembly will contain.
Changing Which Value Gets Spilled
In code that has a hot/cold split, i.e. a part that executes a lot (the hot path) and a part that does rare work such as error processing (the cold path), we can optimize register spilling by avoiding spills for values on the hot path and allowing spills for values on the cold path.
The main problem for the compiler is figuring out what is hot and what is cold. Although it has some heuristics (e.g. if (!pointer) return ERROR; will be recognized as cold code), it lacks runtime information, so it can get the hot/cold split wrong and start spilling registers on the hot path. All the techniques we present here “nudge” the compiler into figuring out what is hot and what is cold.
Profile Guided Optimizations
Many compilers support profile-guided optimizations: compilers can use runtime information collected earlier to generate a more optimized binary. Among other things, this includes optimizing register allocation on hot/cold path.
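With CLANG, a typical profile-guided build looks roughly like this (the file and binary names are hypothetical):

```shell
# 1. Build an instrumented binary
clang++ -O2 -fprofile-generate replace.cpp -o replace_instrumented

# 2. Run it on a representative workload; this writes default.profraw
./replace_instrumented

# 3. Convert the raw profile into a form the compiler can consume
llvm-profdata merge -output=default.profdata default.profraw

# 4. Rebuild using the collected profile
clang++ -O2 -fprofile-use=default.profdata replace.cpp -o replace_optimized
```

The quality of the result depends on how representative the training workload is: if the profiled run exercises different paths than production, the compiler may again optimize the wrong paths.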
Likely/Unlikely Branches
Another way to inform the compiler about hot/cold split is through likely/unlikely annotations for branches. Here is an example:
```cpp
void process(object* o) {
    if (!o) {
        [[likely]] return;
    }
    ...
}
```
The [[likely]] and [[unlikely]] attributes were introduced in C++20. If you don’t use this C++ version, the alternative is __builtin_expect. These attributes let the compiler know about the hot/cold split. But, as already said, the compiler has its own heuristics for branch probabilities, so these attributes only make sense for counter-intuitive code, like the example presented above.
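For pre-C++20 code, here is a sketch using __builtin_expect (a GCC/Clang builtin); the function and its workload are hypothetical:

```cpp
// __builtin_expect(expr, expected) tells the compiler which value expr
// is expected to take, influencing block layout and, indirectly, which
// path the optimizer treats as hot.
#define LIKELY(x)   __builtin_expect(!!(x), 1)
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

// Hypothetical counter-intuitive case: in this workload a null input is
// common, so we mark the early return as likely, overriding the
// compiler's usual heuristic that early error returns are cold.
int count_positive(const int* data, int n) {
    if (LIKELY(data == nullptr)) {
        return 0;
    }
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (data[i] > 0) count++;
    }
    return count;
}
```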
Tail Calls
One of the ways to perform hot/cold split is by separating your code into hot and cold functions. Of course, this works only if the function body is not inlined. Also, function calls come with an overhead, which can be significant for small functions.
So, one of the ways to split the code into smaller functions without introducing the overhead of full function calls is through tail calls. Tail calls happen when a function calls another function as the last thing it does. Here is an example:
```cpp
bool func(double x) {
    double b = func_double(x); // Not a tail call
    if (b > 0.0) {
        return func_bool(b);   // Tail call
    } else {
        return !func_bool(b);  // Not a tail call
    }
}
```
In the function func, we see three different calls. The first call, to func_double (line 2), is not a tail call, because its result is used afterwards. The second call, to func_bool (line 4), is a tail call. The third call, to func_bool (line 6), is not a tail call, because its result still needs to be inverted in func; the inversion is the last thing the function does.
Tail calls are cheap because, when optimizations are turned on, the compiler simply performs a jump to the destination function, so the call is a single instruction. There is no need to save registers to the stack and restore them6. The return is cheaper as well, because there is only one return (in the callee) instead of two.
This approach to fixing register allocation issues through tail calls has been extensively described in the blog post “Parsing Protobuf at 2+GB/s: How I Learned To Love Tail Calls in C”. The main idea is to replace all calls with tail calls. This makes the hot/cold split very obvious to the compiler. It also removes the register spilling needed to make function calls. The authors use the [[clang::musttail]] function attribute to emit a compilation error if the compiler does not perform a tail call where expected.
If a tail call is impossible, an alternative is to use __attribute__((preserve_most)) on cold functions. This moves the price of the function call to the callee, which must save to the stack all the registers it modifies.
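To make the pattern concrete, here is a hypothetical sketch of a hot/cold split where the cold error path is a separate function reached through a tail call. With CLANG, [[clang::musttail]] could be added before the return to enforce the tail call; we omit it here for portability.

```cpp
#include <cstdio>

// Cold function: called rarely, so it may spill whatever it likes.
static int handle_error(const char* msg) {
    std::fprintf(stderr, "parse error: %s\n", msg);
    return -1;
}

// Hot function: the error path is a tail call, so on the fast path the
// hot body does not need to preserve registers across a full call.
static int parse_digit(char c) {
    if (c >= '0' && c <= '9') {
        return c - '0';                   // hot path
    }
    return handle_error("not a digit");   // tail call into the cold path
}
```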
Experiments
In this section, we experiment only with pointer aliasing; we did not run any experiments with register allocation.
For testing, we ran on an Intel(R) Core(TM) i5-10210U CPU (desktop system) with Linux; the compiler is CLANG 15. The tests were executed five times and the mean values reported. All the code is available in our repository.
Pointer Aliasing
For pointer aliasing, we use a really simple string implementation:
```cpp
struct my_string {
    char* ptr;
    size_t size;
};
```
The implementation doesn’t use a null-terminated string; instead, it has a pointer to a block of memory containing the string, and the string’s size. We are interested in the implementation of the replace method, which replaces all occurrences of the character find with the character replaceWith. Here is the simplest implementation of replace:
```cpp
void replace(char find, char replaceWith) {
    for (int i = 0; i < size; i++) {
        if (ptr[i] == find) {
            ptr[i] = replaceWith;
        }
    }
}
```
Here is the compiler optimization report for the above code.

If we investigate the compiler optimization report as we described here, we notice that the compiler was unable to move the loads of the variables size and ptr outside of the loop. The char* pointer in ptr can, according to C/C++ pointer aliasing rules, alias anything. So, it could alias the variable size (ptr = &size) or it could alias itself (ptr = &ptr).
To help the compiler, we copy the variables size and ptr to temporary local variables. This eliminates both sources of aliasing. Here is the code with the modifications:
```cpp
void __attribute__((noinline)) replace2(char find, char replaceWith) {
    size_t size_priv = size;
    char* ptr_priv = ptr;
    for (int i = 0; i < size_priv; i++) {
        if (ptr_priv[i] == find) {
            ptr_priv[i] = replaceWith;
        }
    }
}
```
The variables size_priv and ptr_priv contain the private copies. If we again investigate the compiler optimization report, this is what we get:

After this simple modification, the compiler eliminated all redundant memory accesses and even vectorized the loop.
We tested the above code on a string which is 256 MB in size. Here are the results of both versions:
|  | Original Version | Avoiding Pointer Aliasing |
|---|---|---|
| Runtime | 0.129 s | 0.117 s |
| Instructions | 1613 M | 991 M |
| Cycles Per Instruction (CPI) | 0.257 | 0.347 |
So, the version that avoids pointer aliasing has a slightly better runtime (0.117 s vs 0.129 s). However, it executes many fewer instructions. This is because the compiler managed to vectorize the optimized loop, but the vectorization benefits were small because the loop’s bottleneck is fetching data from memory, not the number of executed instructions. We also see that the hardware efficiency (CPI) is much worse for the vectorized version.
Conclusion
Compilers are not all-powerful, and in many cases they are known to produce suboptimal code. In the context of memory accesses, they can produce unneeded memory accesses, either to ensure strict C/C++ conformance (pointer aliasing), or because there are not enough hardware registers to hold all the data the program needs (register spilling).
Solving pointer aliasing should be easy: the compiler optimization report should give you enough clues about where pointer aliasing happens, and the tips we gave here will help you fix it.
As far as register spilling is concerned, the only 100% sure way to avoid it is to write in assembly. However, assembly is a no-go zone for many developers, for various reasons (it is difficult, it takes time to write, etc.). Alternatively, you can try the tips on decreasing register spilling we gave here to avoid register spilling or at least mitigate its overhead. Whether they work will depend on many circumstances. If you actually profit from the tips given here, please share your experience with us so we can update this post. The information would help us gain a clearer picture of what is possible and what isn’t.
- Alternatively, the compiler can spill the content of general purpose registers to vector registers. These spills are cheaper. [↩]
- Values can be variables, but also results of temporary expressions. [↩]
- Source: “Data Flow Analysis” by Suman Jana, Adopted From U Penn CIS 570: Modern Programming Language Implementation (Autumn 2006) [↩]
- LuaJIT, a JIT compiler for the Lua language, was written in assembly. The author claims that writing an efficient interpreter was not possible in C, since the emitted assembly would always contain register spills. [↩]
- Courtesy of Francesco Nigro, original version, modified version [↩]
- When calling a function, the compiler needs to push a few registers to the stack, because the callee function has the right to change them without saving them. These registers are called caller-saved registers. They are one of the main reasons why function calls are expensive. [↩]