Performance Debugging with llvm-mca: Simulating the CPU!

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

Some time ago I had a performance problem that wasn’t easy to explain just by looking at the code: the version I expected to be faster was actually slower. Since the problem is simple yet illustrative, I am using it as a showcase of how to debug performance issues using llvm-mca.

According to its documentation, llvm-mca is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code on a specific CPU. In other words, you feed it some instructions and it simulates how the CPU executes those instructions.

The problem we are debugging is a very simple convolution kernel. The plain-old C version of this kernel looks like this:

for (size_t i = 0; i < (in_size - kernel_size); i++) {
    out[i] = 0.0;
    for (size_t k = 0; k < kernel_size; k++) {
        out[i] += in[i + k] * kernel[k];
    }
}

We want to vectorize this loop for ARM NEON using outer loop vectorization. What this means is that the outer loop runs four instances of the inner loop in parallel, something like this:

for (size_t i = 0; i < (in_size - kernel_size); i+=4) {
    out[i]   = 0.0;
    out[i+1] = 0.0;
    out[i+2] = 0.0;
    out[i+3] = 0.0;

    for (size_t k = 0; k < kernel_size; k++) {
        out[i]   += in[i+k]   * kernel[k];
        out[i+1] += in[i+1+k] * kernel[k];
        out[i+2] += in[i+2+k] * kernel[k];
        out[i+3] += in[i+3+k] * kernel[k];
    }
}

After this, each group of four repeated statements (out[i+x]…) is replaced with a single vector intrinsic. The vectorized code looks like this:

for (size_t i = 0; i < (in_size - kernel_size); i+=4) {
    // Original four statements out[i+x] = 0.0; are now one.
    float32x4_t out_v = vdupq_n_f32(0.0f);

    for (size_t k = 0; k < kernel_size; k++) {
        // Load 4 consecutive inputs starting at address in + i + k;
        // originally these were four scalar accesses in[i+k+x]
        float32x4_t in_v = vld1q_f32(in + i + k);
        // Load kernel[k] into all four lanes
        // of the vector register
        float32x4_t kernel_v = vld1q_dup_f32(kernel + k);
        // Parallel four multiplications and additions
        out_v = vaddq_f32(out_v, vmulq_f32(in_v, kernel_v));
    }
    vst1q_f32(out + i, out_v);
}

For simplicity, let’s assume kernel_size is five. If this is the case, we don’t have to reload kernel_v over and over in the loop – we can load the kernel once outside of it. Instead of the combination of vaddq_f32 and vmulq_f32, we will use vmlaq_laneq_f32 – fused multiply-add. And since kernel_size is 5, we can completely unroll the inner loop. The resulting code looks like this:

float32x4_t const kernel_0 = vld1q_f32(kernel);
float32x4_t const kernel_1 = vld1q_f32(kernel + 4);
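// Note: with kernel_size == 5 this second load reads kernel[4..7],
// although only lane 0 (kernel[4]) is used below, so the kernel
// buffer must be padded to at least 8 floats.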

for (size_t i = 0; i < (in_size - 5); i+=4) {
    float const * in_i = in + i;
    float32x4_t in_0 = vld1q_f32(in_i);
    float32x4_t in_1 = vld1q_f32(in_i + 1);
    float32x4_t in_2 = vld1q_f32(in_i + 2);
    float32x4_t in_3 = vld1q_f32(in_i + 3);
    float32x4_t in_4 = vld1q_f32(in_i + 4);

    float32x4_t out_v = vmulq_laneq_f32(in_0, kernel_0, 0);

    out_v = vmlaq_laneq_f32(out_v, in_1, kernel_0, 1);
    out_v = vmlaq_laneq_f32(out_v, in_2, kernel_0, 2);
    out_v = vmlaq_laneq_f32(out_v, in_3, kernel_0, 3);
    out_v = vmlaq_laneq_f32(out_v, in_4, kernel_1, 0);

    vst1q_f32(out + i, out_v);
}

This is the final version. But the obvious inefficiency is the five repeated vld1q_f32 loads, which load almost the same data over and over. For example, the value at location in[i + 4] is loaded four times.


To improve this, we will use the vextq_f32 intrinsic. This intrinsic is essentially a register concatenation instruction: it treats its two vector operands as one eight-lane sequence and extracts four consecutive lanes starting at a given lane of the first operand.
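Since the original diagram is not reproduced here, the following minimal, self-contained sketch illustrates the semantics; the array values are made up purely for illustration:

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    const float a_vals[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    const float b_vals[4] = {4.0f, 5.0f, 6.0f, 7.0f};
    float32x4_t a = vld1q_f32(a_vals);
    float32x4_t b = vld1q_f32(b_vals);

    // vextq_f32(a, b, n) concatenates a and b and extracts four
    // consecutive lanes starting at lane n of a.
    float32x4_t r = vextq_f32(a, b, 1); // -> {1, 2, 3, 4}

    float out[4];
    vst1q_f32(out, r);
    printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
    return 0;
}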

So, instead of performing five loads, we can perform two essential loads and generate the remaining three input vectors by concatenating the two loaded ones using vextq_f32.

// Two essential loads
float32x4_t in_0 = vld1q_f32(in_i);
float32x4_t in_4 = vld1q_f32(in_i + 4);

// The other input vectors are recreated
// from these two
float32x4_t in_1 = vextq_f32(in_0, in_4, 1);
float32x4_t in_2 = vextq_f32(in_0, in_4, 2);
float32x4_t in_3 = vextq_f32(in_0, in_4, 3);

The idea behind this approach is, of course, optimization: on the Cortex-A72, the latency of vld1q_f32 is 5 cycles, while the latency of vextq_f32 is 3 cycles. In theory, the new version should be faster.

So, I implemented the optimization and ran it. The results didn’t look good: the runtime of the original 5-load version (abbreviated 5L) was 0.194 s, and the runtime of the improved 2-load-3-ext version (abbreviated 2L3E) was 0.245 s. A slowdown.

Investigation with llvm-mca

Although I had some hunches about what the issue could be, I wanted to confirm them using llvm-mca. The tool accepts only assembly files, so to make things easier, I rewrote my two loops using inline assembly (source here and here).

Llvm-mca simulates code execution in a loop, running the loop several times. Since it works with assembly, I used the compiler’s -S switch to generate the assembly and then copied the innermost loop into an .s file. Here are the two versions, side by side for easier comparison:

### 5-LOAD-VERSION                 ### 2-LOAD-3-EXT-VERSION
add     x4, x1, x2                 add     x4, x1, x2
ldr q0, [x4]                       ldp q0, q1, [x4] // Load pair instruction
ldr q1, [x4, #4]                   ext v11.16b, v0.16b, v1.16b, #4
ldr q2, [x4, #8]                   ext v12.16b, v0.16b, v1.16b, #8
ldr q3, [x4, #12]                  ext v13.16b, v0.16b, v1.16b, #12
ldr q4, [x4, #16]
fmul v15.4s, v0.4s, v7.s[0]        fmul v4.4s, v0.4s, v7.s[0]
fmla v15.4s, v1.4s, v7.s[1]        fmla v4.4s, v11.4s, v7.s[1]
fmla v15.4s, v2.4s, v7.s[2]        fmla v4.4s, v12.4s, v7.s[2]
fmla v15.4s, v3.4s, v7.s[3]        fmla v4.4s, v13.4s, v7.s[3]
fmla v15.4s, v4.4s, v6.s[0]        fmla v4.4s, v1.4s, v6.s[0]

str     q15, [x0, x2]              str     q4, [x0, x2]
add     x2, x2, 16                 add     x2, x2, 16
cmp     x2, x3                     cmp     x2, x3

The 2L3E version has one instruction fewer, since it uses ldp for the double load; other than that, they look mostly the same.

To run them with llvm-mca, we use a command like llvm-mca -mtriple=aarch64 -mcpu=cortex-a72 outerloop-neon-ext.s. So, let’s run both versions and show the results side by side (left column: 5L version, right column: 2L3E version):
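The original post shows the two reports as screenshots. Reconstructed from the numbers quoted below, the summary part has this shape (the uOps Per Cycle and IPC values here are computed from the quoted numbers, not copied from the original report):

### 5-LOAD-VERSION          ### 2-LOAD-3-EXT-VERSION
Iterations:        100      Iterations:        100
Instructions:      1400     Instructions:      1300
Total Cycles:      629      Total Cycles:      681
Total uOps:        1700     Total uOps:        1800

Dispatch Width:    3        Dispatch Width:    3
uOps Per Cycle:    2.70     uOps Per Cycle:    2.64
IPC:               2.23     IPC:               1.91
Block RThroughput: 5.7      Block RThroughput: 6.5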

The first part of the report is a summary. The simulation runs for 100 iterations in both cases – the sequence of instructions in the .s file is repeated 100 times. The 5L version has more instructions (1400 vs 1300) but fewer uOps (1700 vs 1800) and uses fewer cycles (629 vs 681). All this was to a certain extent expected, since we know the 5L version is faster even though it has more instructions.

The Dispatch Width is a property of the architecture; in this case it is 3 uOps per cycle. Another metric, uOps Per Cycle, tells us how many micro-operations per cycle the program was executing. The 5L version is marginally better.

There are two other metrics: IPC (instructions per cycle) and Block RThroughput (block reciprocal throughput). In the presence of uOps Per Cycle, the IPC metric is not that important. Block RThroughput is, however, interesting – it tells us how often, in cycles, the CPU can initiate a new iteration of the block. The 5L version was producing a block every 5.7 cycles, and the 2L3E version every 6.5 cycles.


Instruction Info Table

Llvm-mca also produces information about instruction latencies and throughputs. Here is the table, side by side:

The meanings of the columns:

  • [1] is the number of uOps per instruction.
  • [2] is the latency of the instruction (the number of cycles the instruction needs to execute).
  • [3] is the reciprocal throughput, i.e. the average number of cycles between two instructions of the same type when there are no dependencies between them.

The 2L3E version wins this round, because both its latency and throughput numbers are better than the 5L version’s. Or so it seems. Unfortunately, resource consumption and instruction dependencies are completely missing from this table, so its usefulness in analyzing what’s going on is quite limited. But there’s no need to feel down!

Resource Consumption View

In this view, we can see what CPU resources (better known as execution ports) are consumed by the assembly sequence. The Resources table lists all the execution ports, e.g. [0] – A57UnitB. For those unfamiliar with CPU execution ports: the CPU has several of them, but not all uOps can execute on all execution ports. In the above example, load instructions (ldur, ldr, ldp) execute only on execution port [2] A57UnitL, and the fused multiply-accumulate (fmla) can execute on two execution ports, [5] A57UnitW and [6] A57UnitX.
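For reference, since the screenshot is not reproduced here, this is how llvm-mca lists the execution ports of the Cortex-A57 scheduling model (which LLVM also uses for the Cortex-A72):

Resources:
[0]   - A57UnitB
[1.0] - A57UnitI
[1.1] - A57UnitI
[2]   - A57UnitL
[3]   - A57UnitM
[4]   - A57UnitS
[5]   - A57UnitW
[6]   - A57UnitX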

The small table titled Resource pressure per iteration tells us how many cycles each execution port was busy during one iteration of the assembly sequence. As a reminder, the 5L version was producing one iteration every 5.7 cycles, and the 2L3E version one every 6.5 cycles.

The 5L version has a much more balanced use of resources. Its execution units [5] and [6] were busy 5 cycles in each iteration. The 2L3E version uses execution ports [5] and [6] for 6.5 cycles in each iteration – this suggests that the reason the 2L3E version is slower is contention on execution ports [5] and [6]. The 5L version managed to move part of the work to port [2], thereby freeing up execution units [5] and [6].

Timeline Graph

The timeline graph is a very interesting feature of llvm-mca: for each instruction, it tells you what stages the instruction went through while it was executing, and when these stages started and stopped. The timeline graph is produced by specifying the --timeline option to llvm-mca.

On the diagram, the letters represent the following:

  • D : Instruction dispatched – the CPU has scheduled the instruction for execution
  • = : Instruction already dispatched, waiting to be executed – the reason why it is waiting is either instruction dependencies (the value needed as an input for this instruction is not yet available) or execution port congestion (the CPU execution port needed to execute the instruction is not yet available)
  • e : Instruction executing
  • E : Instruction executed
  • R : Instruction retired
  • – : Instruction executed, waiting to be retired.

This is by far the most interesting view if you want to know what the CPU is actually doing. The columns are cycles, and as we can see, the CPU takes in three instructions each cycle – the letter D appears three times in each column. After that, an instruction either begins executing immediately, which is marked with a letter e coming directly after D (e.g. instructions [0,0] and [0,1]), or it is delayed for some cycles before it starts executing. The delay is marked with =.

The reason for the delay is not marked in the graph, which is a shame, but we can speculate: either an execution port is unavailable or one of the inputs is not ready.

If we look at the 5L version, we see that one load instruction is issued each cycle. We know from the previous table that the RThroughput for a load is 1 cycle, which means the CPU is issuing loads without stalls. But there is a problem with the 2L3E version. The load is issued immediately ([0,1]), but the three ext instructions must wait for the load to complete before they can be issued ([0,2], [0,3] and [0,4]). So the time between the start of data fetching (the first load instruction) and the end of data fetching (the last load instruction for 5L, or the last ext instruction for 2L3E) is longer for the 2L3E version.

This again doesn’t necessarily have to be a problem, because the CPU can execute instructions from other iterations.

Apart from this, there is one problem with instruction dependencies that is present in both the 5L and the 2L3E versions of the loop. Can you spot it and propose a solution? Does the solution work, and if not, why?


Bottleneck Analysis

The last part of the analysis is the bottleneck analysis. Although by this point we have a solid idea of what the problem could be, the bottleneck analysis will definitely confirm or refute it. Bottleneck analysis is performed with the --bottleneck-analysis command line flag. First, here is the output for the 5L version:

No resource or data dependency bottlenecks discovered.

Not that interesting. Now the same report for the 2L3E version:
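The original shows this report as a screenshot; reconstructed from the numbers quoted below, its header looks roughly like this (the per-port breakdown and the critical sequence listing are omitted):

Cycles with backend pressure increase [ 38.62% ]
Throughput Bottlenecks:
  Resource Pressure       [ 36.56% ]
  Data Dependencies:      [ 37.59% ]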

This time llvm-mca was more generous with information. It detected that in 38.62% of the total cycles there was an increase in pressure on the backend: execution port pressure was present in 36.56% of the cycles, and data dependency pressure in 37.59% of them. The documentation is sparse on the exact meaning of these numbers, but a reasonable guess is that in some cycles there was pressure both from busy execution ports and from instruction dependencies.

The graph shows the critical sequence, which, according to the documentation, is the most expensive sequence of instructions according to the simulation. As you can see from the annotations, the instructions on the critical path mostly put pressure on the CPU’s execution units. There are other dependencies as well, for example between the sequence of fmla instructions, but these apparently are not on the critical path.

Conclusion

So, the 5L version is faster for two reasons: (1) it uses the CPU’s execution units in a more balanced way, and (2) the five load instructions can execute independently, since there are no data dependencies between them. What seemed like a good idea on paper didn’t work in reality.

On llvm-mca: this is the best tool we have, at least as far as I know, for simulating the execution of a small instruction sequence and finding problematic spots. It detects only backend problems – it won’t detect problems in instruction fetching, branch prediction, etc. Also, load instructions are simulated with the smallest possible latency – which was fine in our case (3 cycles), but in the case of random memory accesses the real latencies can be much, much bigger.

The authors admit that the simulation is not perfect, but considering the number of CPUs simulated, this is to be expected. Nevertheless, the tool belongs in every performance engineer’s toolbox, and I would highly recommend that anyone working with vectorization, compiler intrinsics or assembly get familiar with it.

