Speeding Up Convergence Loops. Or, on Vectorization and Precision Control

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

Convergence loops are while or do-while loops that are very common in scientific computing. A calculation is performed (or rather, refined) until it reaches a certain precision. Take the example of calculating π/4 using the Leibniz formula:

π/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...

If we were to code this to calculate the value of π/4, the code would look something like this (the source code is available here; we use pseudocode in this post for readability):

double calculate_pi(double convergence_diff) {
    // Leibniz series: 1 - 1/3 + 1/5 - 1/7 ...
    double sum_old = 0.0;
    double sum = 0.0;

    double i1 = 1.0; // denominator of the next positive term
    double i2 = 3.0; // denominator of the next negative term

    do {
        double x1 = 1.0 / i1;
        double x2 = -1.0 / i2;

        i1 += 4.0;
        i2 += 4.0;

        sum_old = sum;
        sum += x1 + x2;

    } while (std::abs(sum_old - sum) > convergence_diff);

    return sum;
}

The input to the function is the convergence difference, which we use to determine whether our result is precise enough. If the difference between the old sum (calculated in the previous iteration) and the new sum (calculated in this iteration) is no larger than the convergence difference, the calculation stops.
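For example (a usage sketch of our own, not from the original source), a caller that wants π itself would scale the result by four:

double pi_over_4 = calculate_pi(1.0e-10); // stop once consecutive sums differ by 1e-10 or less
double pi = 4.0 * pi_over_4;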

Vectorization

They say that when all you have is a hammer, every problem looks like a nail. And if you are a performance engineer, every performance problem looks like a vectorization problem. So, let’s nail this problem with our vectorization hammer. In this case, however, there are a few challenges to successful vectorization.

How to vectorize?

The original loop calculates the terms 1.0/i1 and -1.0/i2, accumulates them into sum, and then checks whether the difference between this sum and the previous sum is still larger than the convergence difference. But for vectorization we need to work with more than two terms at a time. For simplicity, let’s say we are working with four terms simultaneously. Our vectorized loop would look something like this:

double calculate_pi_vector(double convergence_diff) {
    vec4<double> sum = { 0.0, 0.0, 0.0, 0.0 };

    vec4<double> i1 = { 1.0, 5.0, 9.0, 13.0 };
    vec4<double> i2 = { 3.0, 7.0, 11.0, 15.0 };

    double sum_v, sum_old_v;

    do {
        vec4<double> x1 = { 1.0, 1.0, 1.0, 1.0 } / i1;
        vec4<double> x2 = { -1.0, -1.0, -1.0, -1.0 } / i2;

        i1 = i1 + { 16.0, 16.0, 16.0, 16.0 };
        i2 = i2 + { 16.0, 16.0, 16.0, 16.0 };

        vec4<double> x = x1 + x2;  // four new pairs of terms
        vec4<double> x_old = x;
        x_old[3] = 0.0;            // drop the newest pair from the "old" sum

        vec4<double> sum_old = sum + x_old;
        sum = sum + x;

        sum_v = sum[0] + sum[1] + sum[2] + sum[3];                     // reduce
        sum_old_v = sum_old[0] + sum_old[1] + sum_old[2] + sum_old[3]; // reduce

    } while (std::abs(sum_old_v - sum_v) > convergence_diff);
    return sum_v;
}

First we calculate two vectors of four terms, x1 and x2, then accumulate them into a vector of four sums. In this way, each lane of sum holds part of the result, e.g.

sum[0] = 1/1 - 1/3 + 1/17 - 1/19 + 1/33 - 1/35 ...
sum[1] = 1/5 - 1/7 + 1/21 - 1/23 + 1/37 - 1/39 ...
sum[2] = 1/9 - 1/11 + 1/25 - 1/27 + 1/41 - 1/43 ...
sum[3] = 1/13 - 1/15 + 1/29 - 1/31 + 1/45 - 1/47 ...

The total sum is calculated in the reduce step, as sum_v = sum[0] + sum[1] + sum[2] + sum[3]. We don’t calculate sum_old independently of sum; instead, sum_old is derived from sum by clearing lane 3 of x before accumulating.
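The vec4<double> type used in these listings is pseudocode. As an aside, here is a minimal sketch of such a type (our own illustration; real code would use intrinsics or a SIMD library such as std::experimental::simd):

template <typename T>
struct vec4 {
    T lane[4];
    T& operator[](int i) { return lane[i]; }
    friend vec4 operator+(vec4 a, vec4 b) {
        return { a.lane[0] + b.lane[0], a.lane[1] + b.lane[1],
                 a.lane[2] + b.lane[2], a.lane[3] + b.lane[3] };
    }
    friend vec4 operator/(vec4 a, vec4 b) {
        return { a.lane[0] / b.lane[0], a.lane[1] / b.lane[1],
                 a.lane[2] / b.lane[2], a.lane[3] / b.lane[3] };
    }
};

With such a type, vector literals like { 1.0, 1.0, 1.0, 1.0 } in the loops above would be spelled vec4<double>{ 1.0, 1.0, 1.0, 1.0 } in real C++.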

There are a few things to notice about this vectorization. First, this loop is not functionally equivalent to the original scalar loop. Whereas the original loop checked the difference after every negative term (e.g. after -1/3, -1/7, -1/11, etc.), the vectorized code checks differences less often (after -1/15, -1/31, -1/47, etc.). So, if the scalar loop would end with the term -1/19, the vector loop would end with the term -1/31 or later.

Doing a bit more work in the vectorized version is not a problem in itself. But a problem can arise if the convergence criterion is met at the term -1/19, say, and then broken again at the term -1/31. In that case, the scalar version would converge, but the vectorized version would continue calculating. Imagine the scalar loop reaches convergence after -1/65535, but the vector version only after -1/524287: this would negate the benefits of vectorization! We need to be aware that the scalar and vector versions might not reach convergence at the same term.

The Expensive Reduction Step

This vectorization looks fine, but a keen eye will spot a problem with the reduction step. Either (1) there is no instruction that can add up the lanes of a vector (SSE, AVX), so the reduction has to be emulated, or (2) the instruction exists but is too slow and eats into the gains of vectorization (NEON, AVX512).
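To illustrate (our own sketch, assuming AVX intrinsics; the original post does not show this), emulating a horizontal sum of a 4-lane double vector takes a chain of shuffles and adds, and the loop above would pay this cost on every iteration:

#include <immintrin.h>

// Horizontal sum of a 4-lane double vector, emulated with AVX intrinsics.
static double hsum_pd(__m256d v) {
    __m128d lo = _mm256_castpd256_pd128(v);       // lanes 0 and 1
    __m128d hi = _mm256_extractf128_pd(v, 1);     // lanes 2 and 3
    lo = _mm_add_pd(lo, hi);                      // { v0 + v2, v1 + v3 }
    __m128d high64 = _mm_unpackhi_pd(lo, lo);     // { v1 + v3, v1 + v3 }
    return _mm_cvtsd_f64(_mm_add_sd(lo, high64)); // (v0 + v2) + (v1 + v3)
}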

The problem can be solved easily if we pay a little attention to what this code does. A bit of symbolic math helps simplify the reduction. We are interested in sum_v - sum_old_v:

  • sum_v - sum_old_v = reduce(sum) - reduce(sum_old)
  • reduce(sum) - reduce(sum_old) = (reduce(sum_prev) + reduce(x)) - (reduce(sum_prev) + reduce(x_old)) = reduce(x) - reduce(x_old), where sum_prev is the value of sum from the previous iteration
  • reduce(x) - reduce(x_old) = x[0] + x[1] + x[2] + x[3] - (x[0] + x[1] + x[2]) = x[3]

So, the only value we actually need to compute the difference is lane 3 of vector x. Knowing this, we can skip the reduce step inside the loop completely, and only reduce once after the loop finishes! Here is the updated code:

double calculate_pi_vector(double convergence_diff) {
    vec4<double> sum = { 0.0, 0.0, 0.0, 0.0 };

    vec4<double> i1 = { 1.0, 5.0, 9.0, 13.0 };
    vec4<double> i2 = { 3.0, 7.0, 11.0, 15.0 };

    vec4<double> x; // declared outside the loop so the condition can read x[3]

    do {
        vec4<double> x1 = { 1.0, 1.0, 1.0, 1.0 } / i1;
        vec4<double> x2 = { -1.0, -1.0, -1.0, -1.0 } / i2;

        i1 = i1 + { 16.0, 16.0, 16.0, 16.0 };
        i2 = i2 + { 16.0, 16.0, 16.0, 16.0 };

        x = x1 + x2;   // four new pairs of terms
        sum = sum + x;

    } while (std::abs(x[3]) > convergence_diff);

    return sum[0] + sum[1] + sum[2] + sum[3];
}

Now, if you think about it a little more: instead of comparing only x[3], we can compare all four differences in x against the convergence difference using a single vector comparison. This makes our vectorized loop more similar to the scalar loop, which compares sums after every two terms, and allows us to detect convergence earlier.

double calculate_pi_vector(double convergence_diff) {
    vec4<double> sum = { 0.0, 0.0, 0.0, 0.0 };

    vec4<double> i1 = { 1.0, 5.0, 9.0, 13.0 };
    vec4<double> i2 = { 3.0, 7.0, 11.0, 15.0 };

    vec4<double> convergence_diff_vec = { convergence_diff, convergence_diff, convergence_diff, convergence_diff };
    vec4<double> x; // declared outside the loop so the condition can read it

    do {
        vec4<double> x1 = { 1.0, 1.0, 1.0, 1.0 } / i1;
        vec4<double> x2 = { -1.0, -1.0, -1.0, -1.0 } / i2;

        i1 = i1 + { 16.0, 16.0, 16.0, 16.0 };
        i2 = i2 + { 16.0, 16.0, 16.0, 16.0 };

        x = x1 + x2;
        sum = sum + x;

    } while (!ANY_LANE_FALSE(x > convergence_diff_vec));

    return sum[0] + sum[1] + sum[2] + sum[3];
}

Note the loop condition, specifically !ANY_LANE_FALSE(x > convergence_diff_vec), which compares the four differences in vector x against the convergence difference. The loop keeps running while all four comparisons hold; as soon as any comparison yields false, the loop ends and the result is returned.

This code detects convergence after any of the terms and doesn’t suffer from the problem of skipping terms. However, in this particular case we know that x[3] will be the smallest among x[0] … x[3], so checking only x[3] is enough.
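On AVX, such an all-lanes check could be implemented roughly like this (a sketch of our own, not from the original source):

#include <immintrin.h>

// True while all four differences are still above the threshold,
// i.e. while the loop should keep running.
static bool all_lanes_above(__m256d x, __m256d threshold) {
    __m256d cmp = _mm256_cmp_pd(x, threshold, _CMP_GT_OQ); // lane-wise x > threshold
    return _mm256_movemask_pd(cmp) == 0xF;                 // all four sign bits set
}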

This code looks nice and would pass code review. I wanted to compare the vector and scalar versions for performance, so I set the convergence difference for the vector version to 0, same as for the scalar version. But when I ran it to measure the speed improvement, it …

Spectacularly Failed

The vectorized loop became endless. It never converges. It is infinitely slower than the scalar loop.

Life coaches tell us we need to get out of our comfort zones in order to progress. Well, this is the place: things don’t work and I have no idea why. Time for a new adventure!

OK, that was a bit of an exaggeration. I only had to do some investigating to see what was going on. I increased the convergence difference for the vector version to see if the loop would converge, and that is exactly what happened! With a larger convergence difference, the loop does converge.

But is this what we want? If we increase the convergence difference, wouldn’t that make our result less precise? We can measure the precision of the result, since we know the exact value of π/4. So, I ran the vectorized loop with various values of the convergence difference and measured the error from π/4. These are the results (10 executions):

Version | Convergence difference | Error from π/4 | Last term before convergence | Runtime
Scalar  | 0                      | 2.25×10⁻⁹      | 1/189 million                | 0.094 s
Vector  | 0                      | n/a            | n/a (never converges)        | infinite
Vector  | 10⁻¹⁸                  | 9.61×10⁻¹⁰     | 1/1414 million               | 0.174 s
Vector  | 10⁻¹⁷                  | 1.4×10⁻⁹       | 1/447 million                | 0.055 s
Vector  | 4×10⁻¹⁷                | 2.24×10⁻⁹      | 1/223 million                | 0.028 s

For a similar error in π/4, the vector loop runs in 0.028 s, whereas the scalar loop runs in 0.094 s, a speedup of about 3.35×. Notice that the scalar version converges in fewer terms: the last term in the scalar version before convergence is 1/189 million, whereas the vector version’s last term is 1/223 million.

Also, the vector version needs a much larger convergence difference than the scalar version to reach a similar error in π/4.

Where do these differences in result between vector and scalar loops come from?

The answer is floating-point precision loss. Floating-point numbers have limited precision, so some parts of the result get lost. But that’s not the whole story: absolute precision is highest for numbers close to zero. The further a number is from zero, the larger the gap between adjacent representable values.

In the original scalar version, we were subtracting two sums, both of which were, in the final stage of the computation, close to 0.785 (π/4). The absolute precision around this value is limited: at some point, adding very small terms no longer changes the value at all. That’s why the loop was able to converge even with a convergence difference of 0.

The vector version, on the other hand, compares only the difference x[3], which is close to 0. Numbers close to zero can be represented with much higher absolute precision, so the difference never becomes exactly 0; the loop didn’t converge with a convergence difference of 0, and we had to increase it.
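A minimal illustration of the difference (our own example, not from the original source):

#include <cstdio>

int main() {
    double big  = 0.785;    // roughly pi/4, where the scalar sums live
    double tiny = 1.0e-18;  // a typical late term of the series

    // Near 0.785, the tiny term is absorbed by rounding, so the scalar
    // difference between consecutive sums eventually becomes exactly 0:
    printf("%d\n", big + tiny == big);  // prints 1

    // Near zero, the same tiny value is representable just fine, so the
    // vector difference x[3] never becomes 0:
    printf("%d\n", tiny > 0.0);         // prints 1
}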

Final Words

We’ve seen how to vectorize a convergence loop, but also what problems come with vectorization. Floating-point addition is not associative, and with vectorization we change the order of operations. This can lead to faster or slower convergence for vectorized loops, and the convergence difference may need to differ between the vectorized and scalar versions to get similar precision in the result.
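A two-line demonstration of that non-associativity (our own example):

#include <cstdio>

int main() {
    double a = 1.0e16, b = -1.0e16, c = 1.0;
    printf("%g\n", (a + b) + c); // 1: a and b cancel first, c survives
    printf("%g\n", a + (b + c)); // 0: c is absorbed into b before the cancellation
}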

One additional thing I tried was to improve the precision of the calculation, hoping this would speed up convergence. I rewrote the code to decrease the precision loss due to floating-point rounding by introducing two sum variables: a big sum and a little sum. We initialize the little sum to 0, accumulate the terms into it, and every 256 iterations append its value to the big sum and reset it. The values in the little sum stay closer to zero, so we expect less floating-point precision loss. The code is available here.
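A minimal sketch of this idea applied to the scalar loop (our reconstruction from the description above, not the author’s exact code; note that the convergence check here compares the raw term difference):

#include <cmath>

double calculate_pi_two_sums(double convergence_diff) {
    double sum_big = 0.0;    // accumulates flushed partial sums, drifts toward pi/4
    double sum_little = 0.0; // stays near zero, where absolute precision is highest

    double i1 = 1.0;
    double i2 = 3.0;
    int iteration = 0;
    double diff;

    do {
        diff = 1.0 / i1 - 1.0 / i2;
        i1 += 4.0;
        i2 += 4.0;

        sum_little += diff;
        if (++iteration % 256 == 0) { // flush the near-zero accumulator
            sum_big += sum_little;
            sum_little = 0.0;
        }
    } while (std::abs(diff) > convergence_diff);

    return sum_big + sum_little;
}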

However, this improved precision only slightly. The original error was 2.241×10⁻⁹, and with the new code it is just slightly better, at 2.236×10⁻⁹. So essentially it converges marginally faster and probably won’t be any faster for practical purposes. I conclude that this experiment failed, but I would like to see it tried on other codes as well.

If you have reached this point, thank you very much for your attention, and I hope this was a fun read. Please share this content, as it helps us reach our target audience; otherwise, at some point there will be no work for us and no point in producing more content. Also, if you or someone you know needs performance consulting, whether debugging, development, code reviews, or training, please pass the word around. Thanks a lot! Ivica
