Hiding Memory Latency With In-Order CPU Cores OR How Compilers Optimize Your Code

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

Modern out-of-order CPU cores are very good at hiding instruction (and memory) latencies: if an instruction cannot execute because its input operands are unavailable, the core will move on to execute instructions that follow it. If the code has enough instruction level parallelism (ILP), the out-of-order core will find instructions to execute to make itself busy while the unavailable instruction is waiting for operands.

We covered the cases when a core doesn’t have enough ILP and how to increase it here and here. In this post, we will investigate in-order cores. An in-order core cannot skip over instructions that are stuck; instead, it has to wait for them to complete before moving on to the instructions that follow. This makes in-order cores much more susceptible to stalls caused by low ILP, even when the dependency chain is very short. For in-order cores, it is crucial to avoid the situation where the input operands of instruction X are outputs produced by the preceding instruction X - 1.

In-order cores are common in low-end to mid-range embedded systems. ARM’s best-selling core, the Cortex-A53, is an in-order core. In-order cores are simpler and consume less power. Nowadays, a mobile phone CPU typically consists of a smaller number of performance cores, which are out-of-order, and a larger number of efficiency cores, which are in-order. For example, the Google Tensor G2 (used in the Google Pixel 7 phone) has two Cortex-X1 out-of-order cores and four Cortex-A55 in-order cores.

Techniques

The following techniques are primarily relevant when programming with compiler intrinsics or assembly. When programming in a higher-level language, such as C or C++, we don’t have control over how individual instructions are scheduled. Nevertheless, the compiler will try to break short dependency chains using similar techniques.

Loop Unrolling and Interleaving (UAI)

Consider the following loop and its pseudoassembly equivalent:

// Original code
for (int i = 0; i < n; i++) {
   c[i] = a[i] + b[i];
}

// Pseudoassembly
for (int i = 0; i < n; i++) {
   register a_v = load(a + i);
   register b_v = load(b + i);
   register c_v = a_v + b_v;
   store(c + i, c_v);
}

The code loads two values from memory into registers a_v and b_v, adds them together, and stores the result to memory location c + i.

In this code there is a short dependency chain: the addition depends on the two loads, and the store depends on the addition. On an in-order CPU, the addition can only run when both loads have finished, and the store can only run when the addition has finished.

A technique called loop unrolling and interleaving (UAI) can be used to break the dependency chain. The technique consists of two steps. In the first step, the loop is unrolled by some factor, for example 2. Here is the code after unrolling:

for (int i = 0; i < n; i+=2) {
  register a_v_0 = load(a + i);
  register b_v_0 = load(b + i);
  register c_v_0 = a_v_0 + b_v_0;
  store(c + i, c_v_0);
  register a_v_1 = load(a + i + 1);
  register b_v_1 = load(b + i + 1);
  register c_v_1 = a_v_1 + b_v_1;
  store(c + i + 1, c_v_1);
}

With unrolling, we repeated the body of the loop two times. In the second step, the instructions from the two iterations are interleaved. Here is a possible interleaving:

for (int i = 0; i < n; i+=2) {
  register a_v_0 = load(a + i);
  register b_v_0 = load(b + i);
  register a_v_1 = load(a + i + 1);
  register b_v_1 = load(b + i + 1);
  register c_v_0 = a_v_0 + b_v_0;
  register c_v_1 = a_v_1 + b_v_1;
  store(c + i, c_v_0);
  store(c + i + 1, c_v_1);
}

With interleaving, we grouped the load phases of both iterations in the first part of the loop body, the addition phases in the second part, and the store phases in the third part. After this intervention, no instruction X directly depends on the instruction X - 1 immediately before it. For example, c_v_0 = a_v_0 + b_v_0 depends on load(a + i) and load(b + i), which come several instructions earlier, and store(c + i + 1, c_v_1) depends on c_v_1 = a_v_1 + b_v_1, which comes two instructions earlier.

Unrolling and interleaving is a simple and powerful technique used to break dependency chains and speed up execution on in-order CPUs. The technique is especially useful for dual-issue in-order CPUs, like the Cortex-A53. A dual-issue CPU can issue two instructions in a single cycle, but only if there is no direct dependency between the two instructions.
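To make this concrete, here is a minimal sketch of what UAI can look like when written with ARM NEON compiler intrinsics. The function name add_uai and its assumptions (float arrays, n divisible by 8) are ours for illustration; this is not the article’s benchmark code:

#include <arm_neon.h>

// UAI sketch: two 4-float iterations are unrolled and their
// load/add/store phases are interleaved. Assumes n % 8 == 0.
void add_uai(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 8) {
        // Load phase: loads from both unrolled iterations grouped together.
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        // Addition phase: neither add depends on the instruction right before it.
        float32x4_t c0 = vaddq_f32(a0, b0);
        float32x4_t c1 = vaddq_f32(a1, b1);
        // Store phase.
        vst1q_f32(c + i, c0);
        vst1q_f32(c + i + 4, c1);
    }
}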

Unrolling and interleaving comes with drawbacks:

  • It increases register pressure. A CPU has a limited number of registers, and interleaved code requires more registers to hold intermediate results than non-interleaved code. If the compiler runs out of available registers, it will start spilling them to memory, which can degrade performance. This is especially a problem when programming with compiler intrinsics, because register spilling happens silently.
  • It doesn’t use the CPU ports uniformly. A CPU has several execution ports, for example, one port for memory accesses and another for arithmetic. If a port is oversubscribed, the CPU pipeline stalls until the port is freed. Interleaving aggravates this problem, because neighboring interleaved instructions often compete for the same CPU port.


Loop Pipelining

Another technique used to break direct dependency chains on in-order CPUs is loop pipelining. The idea is simple: split the work inside the loop into phases (here, three phases), then in each trip through the loop execute phase 1 of iteration X, phase 2 of iteration X - 1, and phase 3 of iteration X - 2. To illustrate, we use the same code as in the previous section:

// Original code
for (int i = 0; i < n; i++) {
   c[i] = a[i] + b[i];
}

// Pseudoassembly
for (int i = 0; i < n; i++) {
   register a_v = load(a + i);
   register b_v = load(b + i);
   register c_v = a_v + b_v;
   store(c + i, c_v);
}

Let’s split the instructions in the pseudoassembly loop into three phases: the load phase (the two loads), the addition phase (the addition), and the store phase (the store). With loop pipelining, each trip through the loop executes the load phase of iteration i + 2, the addition phase of iteration i + 1, and the store phase of iteration i. Here is the resulting code:

// Prologue: load phases of iterations 0 and 1, addition phase of iteration 0
register a_v = load(a + 0);
register b_v = load(b + 0);
register c_v = a_v + b_v;

a_v = load(a + 1);
b_v = load(b + 1);

// Steady state: store iteration i, add iteration i + 1, load iteration i + 2
for (int i = 0; i < n - 2; i++) {
   store(c + i, c_v);
   c_v = a_v + b_v;
   a_v = load(a + i + 2);
   b_v = load(b + i + 2);
}

// Epilogue: drain the last two iterations
store(c + n - 2, c_v);
c_v = a_v + b_v;
store(c + n - 1, c_v);

The idea is simple, but the devil is in the details. The loop must execute at least two iterations for the code to work correctly. Inside the steady-state loop, store(c + i, c_v) stores the result of iteration i, c_v = a_v + b_v adds together the data from iteration i + 1, and the loads a_v = load(a + i + 2) and b_v = load(b + i + 2) fetch the data for iteration i + 2.
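Below is a similar sketch of loop pipelining with NEON intrinsics. Again, the function name and the assumptions (n is a multiple of 4 and n >= 8) are ours for illustration: the prologue fills the pipeline, the steady-state loop keeps three phases of three different iterations in flight, and the epilogue drains it:

#include <arm_neon.h>

// Loop pipelining sketch, 4 floats per iteration.
// Assumes n % 4 == 0 and n >= 8.
void add_pipelined(const float *a, const float *b, float *c, int n) {
    // Prologue: loads for iterations 0 and 1, addition for iteration 0.
    float32x4_t a_v = vld1q_f32(a);
    float32x4_t b_v = vld1q_f32(b);
    float32x4_t c_v = vaddq_f32(a_v, b_v);
    a_v = vld1q_f32(a + 4);
    b_v = vld1q_f32(b + 4);
    // Steady state: store elements [i, i+4), add [i+4, i+8), load [i+8, i+12).
    int i = 0;
    for (; i < n - 8; i += 4) {
        vst1q_f32(c + i, c_v);
        c_v = vaddq_f32(a_v, b_v);
        a_v = vld1q_f32(a + i + 8);
        b_v = vld1q_f32(b + i + 8);
    }
    // Epilogue: drain the last two iterations.
    vst1q_f32(c + i, c_v);
    c_v = vaddq_f32(a_v, b_v);
    vst1q_f32(c + i + 4, c_v);
}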

Loop pipelining doesn’t suffer from the problems of UAI, such as increased register pressure or uneven use of execution ports, but it is more difficult to program.

Experiments

For the experiments, we compared the performance of UAI and loop pipelining, coded using inline assembly and vector instructions. The examples are the same as the ones presented in the techniques section. We ran the code on two cores: ARM Cortex-A53 (in-order) and ARM Cortex-A72 (out-of-order). We tested various array sizes, in order to understand how latency hiding depends on which data cache we are hitting. The source code is available here. Here are the numbers for small array sizes (44 kB and 236 kB):

Transformation                     Instructions   ARM Cortex A53     ARM Cortex A72
Original loop                      323 M          0.67 s             0.147 s
Unrolling (4x)                     219 M          0.67 s             0.145 s
Unrolling and interleaving (4x)    219 M          0.55 s (+21.8%)    0.145 s
Pipelining                         322 M          0.52 s (+28.8%)    0.148 s

Array size 44 kB

Transformation                     Instructions   ARM Cortex A53     ARM Cortex A72
Original loop                      433 M          1.2 s              0.202 s
Unrolling (4x)                     293 M          1.2 s              0.200 s
Unrolling and interleaving (4x)    293 M          1.05 s (+14.2%)    0.202 s
Pipelining                         433 M          1.01 s (+18.8%)    0.200 s

Array size 236 kB

For small arrays that fit in the L2 data cache, the in-order core sees a speedup when either UAI or pipelining is used. In this experiment, the speedup is 14.2%-21.8% for UAI and 18.8%-28.8% for pipelining. The smaller array (44 kB) benefits more than the larger one (236 kB). The performance of the out-of-order core barely changes regardless of the technique used.

Transformation                     ARM Cortex A53    ARM Cortex A72
Original loop                      2.46 s            0.649 s
Unrolling (4x)                     2.46 s            0.649 s
Unrolling and interleaving (4x)    2.38 s (+3.3%)    0.651 s
Pipelining                         2.39 s (+2.9%)    0.648 s

Array size 1004 kB

Transformation                     ARM Cortex A53    ARM Cortex A72
Original loop                      2.54 s            0.672 s
Unrolling (4x)                     2.54 s            0.673 s
Unrolling and interleaving (4x)    2.41 s (+5.3%)    0.673 s
Pipelining                         2.42 s (+4.9%)    0.673 s

Array size 4076 kB

For larger array sizes, the techniques have a small but still noticeable effect on performance (a speedup of 3% to 5%). With larger arrays, the data is served from memory rather than from the data caches, and neither pipelining nor UAI can hide the much higher latency of fetching data from memory.


Do these techniques work with out-of-order cores?

As the measurements show, UAI and pipelining do not help performance on the out-of-order core at all. When the out-of-order core doesn’t have enough work in the current iteration, it simply moves on to execute instructions from the next iteration of the loop.

But this holds only for loops with small bodies. For a loop with a large body and a long dependency chain, the core’s internal buffers eventually fill up with partially executed instructions, and from that point on it behaves the same way as an in-order core. A compiler can of course perform loop pipelining or UAI, but with large loop bodies it becomes more difficult to perform such optimizations while preserving correctness. So it might pay off to do such optimizations by hand, even in C/C++. If you have any experience, feel free to leave a comment.
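As a hedged illustration of what source-level UAI can look like in plain C, consider a sum reduction, where the dependency chain through the accumulator is loop-carried. The function sum_uai and its assumption (n is even) are ours, not taken from the article’s benchmarks:

// Source-level UAI: the single accumulator `sum` would make every
// addition depend on the previous one, so we split the reduction
// into two independent chains. Assumes n is even.
float sum_uai(const float *a, int n) {
    float sum0 = 0.0f, sum1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        sum0 += a[i];     // chain 0
        sum1 += a[i + 1]; // chain 1, independent of chain 0
    }
    return sum0 + sum1;
}

Note that for floating-point data this changes the order of additions, which is why compilers generally won’t perform this transformation on their own unless allowed to reassociate (e.g. with -ffast-math).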
