When an instruction depends on the previous instruction, which depends on the previous instruction… : long instruction dependency chains and performance

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

This post has a second part, in which the same problem is solved differently. Read more.

In this post we investigate long dependency chains: when an instruction depends on the previous instruction, which depends on the previous instruction… We want to see how long dependency chains lower CPU performance, and to measure how interleaving two dependency chains (by interleaving two independent operations) reflects on software performance.

Operations with long dependency chains are a bottleneck for modern out-of-order CPUs. We wrote about them in much more detail in an earlier post about instruction level parallelism and memory bound programs. In that post we explored dependency chains in memory loads; here we explore dependency chains in arithmetic operations.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

The test environment

To start, let’s take a simple kernel that calculates the cosine function (a fast polynomial approximation). Here is the source code:

// Requires <cmath> (std::floor, std::abs, M_PI)
double cosine(double x) {
    constexpr double tp = 1. / (2. * M_PI);
    x = x * tp;
    x = x - (double(.25) + std::floor(x + double(.25)));
    x = x * (double(16.) * (std::abs(x) - double(.5)));
    x = x + (double(.225) * x * (std::abs(x) - double(1.)));
    return x;
}

This is a short snippet of code, but from top to bottom there is a long dependency chain. You can see it more clearly if you rewrite each statement in a very simple form with one operation only:

Dependency chain for cosine operation

We marked the dependencies between statements with arrows going backwards. As you can see, each statement in the function depends on the previous statement. This effectively serializes the code: there is only one way to execute this function, from top to bottom, and thus such code has very low instruction level parallelism.
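The one-operation-per-statement decomposition can also be sketched directly in code. This is our own breakdown (the temporary names `t1`…`t13` are ours, and the exact statement count may differ slightly from the figure), but it makes the serial chain visible: each temporary is computed from the one before it.

```cpp
#include <cmath>

// Cosine approximation rewritten with (roughly) one operation per
// statement. Each temporary depends on earlier ones, so the whole
// function forms one serial dependency chain.
double cosine_chain(double x) {
    const double tp = 1.0 / (2.0 * M_PI);
    double t1  = x * tp;             // depends on x
    double t2  = t1 + 0.25;          // depends on t1
    double t3  = std::floor(t2);     // depends on t2
    double t4  = 0.25 + t3;          // depends on t3
    double t5  = t1 - t4;            // depends on t1, t4
    double t6  = std::abs(t5);       // depends on t5
    double t7  = t6 - 0.5;           // depends on t6
    double t8  = 16.0 * t7;          // depends on t7
    double t9  = t5 * t8;            // depends on t5, t8
    double t10 = std::abs(t9);       // depends on t9
    double t11 = t10 - 1.0;          // depends on t10
    double t12 = 0.225 * t9;         // depends on t9
    double t13 = t12 * t11;          // depends on t11, t12
    return t9 + t13;                 // depends on t9, t13
}
```

No statement can start before its inputs are ready, which is exactly why the CPU cannot find independent work inside a single call.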

When executing such code, modern CPUs “suffer”. Although they can in theory execute four instructions in a single cycle, in this particular example they can execute only one: apart from the instruction currently being executed, there are no other independent instructions available.

On the good side, this dependency chain is relatively short. Top to bottom it is 13 statements, which translates to roughly twenty instructions. Take for example the following loop:

for (int i = 0; i < 1024; i++){
    out[i] = cosine(v[i]);
}

This loop calculates 1024 independent cosines. Although the cosine itself contains a dependency chain, there are no dependencies between iterations. The task of the CPU is to execute 1024 independent dependency chains.

Modern CPUs can keep hundreds of instructions “on standby” – stuck because of a dependency chain. So, when stuck in the current iteration, the CPU can move on to the following loop iteration and start executing its cosine. CPUs actually do exactly this.

But what happens when the dependency chain is really long? Consider this loop:

for (int i = 0; i < 1024; i++){
    s = cosine(s);
}

To calculate the value of cosine(s), we need the value of s calculated in the previous iteration: a recursive cosine! The dependency chain in this loop is very long: 1024 cosines, which is roughly 20,000 instructions. A modern CPU can keep some instructions on standby, but certainly not 20,000!

Interleaving two dependency chains

It is obvious we cannot break the dependency chain inside the cosine function, because it is inherent to the algorithm. What we can do instead is interleave two dependency chains: our cosine function accepts two doubles and returns two results. Here is the source code:

std::pair<double, double> cosine(std::pair<double, double> x) {
    constexpr double tp = 1. / (2. * M_PI);

    double x1 = x.first * tp;
    double x2 = x.second * tp;

    x1 = x1 - (double(.25) + std::floor(x1 + double(.25)));
    x2 = x2 - (double(.25) + std::floor(x2 + double(.25)));

    x1 = x1 * (double(16.) * (std::abs(x1) - double(.5)));
    x2 = x2 * (double(16.) * (std::abs(x2) - double(.5)));

    x1 = x1 + (double(.225) * x1 * (std::abs(x1) - double(1.)));
    x2 = x2 + (double(.225) * x2 * (std::abs(x2) - double(1.)));
    return {x1, x2};
}

Interleaving is simple: where we originally had one statement, now we have two. E.g. originally we had x = x * tp; in the interleaved version we have x1 = x.first * tp and x2 = x.second * tp. Where the simple version had one statement using or modifying x, in the interleaved version the first statement uses and modifies x1, and the second statement uses and modifies x2. This creates two independent dependency chains.

When it comes to solving the problem of recursive cosine(cosine(cosine(...(x)))) for a single number, interleaving doesn’t help us. But if we have to solve the recursive cosine for two or more numbers, then we can use the interleaved version.


The test

To test, we use the following kernel:

for (int i = 0; i < size; ++i) {
    double s = v[i];
    for (int j = 0; j < recursion_length; ++j) {
        s = cosine(s);
    }

    sum += s;
}

We calculate the recursive cosine for size elements stored in the array v (outer loop). The length of the recursion is recursion_length (inner loop). The total number of cosine operations, size * recursion_length, is kept constant at 60M.

The corresponding interleaved version of the above loop looks like this:

for (int i = 0; i < size; i+=2) {
    std::pair<double, double> s = { v[i], v[i+1] };
    for (int j = 0; j < recursion_length; ++j) {
        s = cosine(s);
    }

    sum += s.first + s.second;
}

For simplicity, we are displaying the scalar versions of the simple and interleaved cosine. For the test itself, however, we manually vectorized the two versions, for AVX2 (Intel) and NEON (ARM). The simple version of cosine therefore processes 4 doubles on Intel and 2 doubles on ARM, and the interleaved version processes 8 doubles on Intel and 4 doubles on ARM. The source code is available here. (A note: interleaving is very reminiscent of the outer loop vectorization we wrote about earlier.)
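The actual AVX2/NEON kernels live in the linked repository; as a portable stand-in, the idea can be sketched with a fixed-width lane loop that a compiler may auto-vectorize. This is our own illustration, not the measured benchmark code: with `W = 4` it mirrors the 4-wide “simple” kernel, and calling it on two independent arrays per step mirrors the 8-wide interleaved kernel.

```cpp
#include <array>
#include <cmath>

// Process W doubles at once. The W lanes are independent, so the
// compiler is free to map the inner loop to SIMD instructions
// (e.g. 4 doubles per AVX2 register, 2 per NEON register).
template <int W>
std::array<double, W> cosine_w(std::array<double, W> x) {
    const double tp = 1.0 / (2.0 * M_PI);
    for (int lane = 0; lane < W; ++lane) {
        double v = x[lane] * tp;
        v = v - (0.25 + std::floor(v + 0.25));
        v = v * (16.0 * (std::abs(v) - 0.5));
        v = v + (0.225 * v * (std::abs(v) - 1.0));
        x[lane] = v;
    }
    return x;
}
```

Note that widening the vector does not shorten the dependency chain: each lane still carries its own serial chain, which is why the interleaved (wider) variant pays off only when independent inputs are available.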

The results

For testing we use two systems: a desktop system based on an Intel(R) Core(TM) i5-10210U and a Raspberry Pi based embedded system (ARM Cortex-A72). We repeat each experiment ten times and report the mean numbers here.

The graph depicts the ratio between the runtime of the simple cosine and the interleaved cosine, depending on the recursion length.

As you can see, when the dependency chain is short, interleaving has little or no effect on performance. But as the dependency chain becomes longer, the interleaved version becomes faster, asymptotically approaching a twofold speedup.

This effect is more pronounced on the embedded system than on the desktop system. The embedded system can keep fewer instructions on standby, which in turn leads to saturation sooner.

With regards to instruction count, the interleaved version executed fewer instructions than the simple version. The ratio is 1.14 on the desktop system and 1.27 on the embedded system for recursion length 1, but as the recursion length grows, these ratios asymptotically approach 1 on both systems.

An important measure of hardware efficiency is the CPI (cycles per instruction) metric; the smaller the number, the better. Here is the graph of CPI depending on recursion length:

For small recursion lengths, it doesn’t matter whether you use the simple or the interleaved cosine. For large recursion lengths, hardware efficiency is almost two times better with the interleaved version.

Conclusion

Long computational dependency chains (as opposed to long memory dependency chains) do not come up often in software development (they mostly appear in high-performance computing and security domains). When they do come up, most CPUs are forced to execute code sequentially, utilizing only a part of their full capabilities.

Interleaving additional dependency chains helps ease the burden without significantly changing the algorithm itself. Actually, this is what the CPU does in hardware when the dependency chain is short: it overlaps instructions belonging to different dependency chains and to different loop iterations. But when the dependency chain is too long, doing the interleaving in software can really help performance.

We didn’t investigate it here, but it is possible to interleave more than two dependency chains.
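As a sketch of how that might look (our own illustration, not code from the measured benchmark), the interleaving factor can be made a compile-time parameter so that N independent recursive chains advance in lockstep:

```cpp
#include <array>
#include <cmath>

// Advance N independent recursive-cosine chains in lockstep.
// Each chain is serial across iterations of j, but the N chains
// inside one iteration are independent, giving the out-of-order
// core N-way work to overlap.
template <int N>
std::array<double, N> recursive_cosine(std::array<double, N> s, int length) {
    const double tp = 1.0 / (2.0 * M_PI);
    for (int j = 0; j < length; ++j) {
        for (int n = 0; n < N; ++n) {   // N interleaved chains
            double x = s[n] * tp;
            x = x - (0.25 + std::floor(x + 0.25));
            x = x * (16.0 * (std::abs(x) - 0.5));
            x = x + (0.225 * x * (std::abs(x) - 1.0));
            s[n] = x;
        }
    }
    return s;
}
```

The useful value of N presumably tops out once the independent work already saturates the CPU’s execution resources, and larger N also increases register pressure, so the optimal factor would have to be measured.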

