Why is quicksort faster than heapsort? And how to make them faster?

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

Heapsort and quicksort are two of the major sorting algorithms. Both have the same algorithmic complexity, O(n log n), yet quicksort is a few times faster than heapsort in practice. What is the reason for that? Does it execute fewer instructions, or is it better at using the hardware resources? Or something else?

In this post, we will take a deep dive into these two algorithms to investigate where this difference in speed comes from.

A few notes about the algorithms

Quicksort

Quicksort works like this: it picks a pivot and then partitions the elements of the array around it. Elements smaller than the pivot are placed in the left part of the array, and elements larger than the pivot are placed in the right part. This is called array partitioning. After partitioning, the pivot is at its final place in the sorted array. The same procedure is then applied recursively to the part of the array left of the pivot and to the part right of the pivot.
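
For illustration, here is a minimal sketch of this scheme, using Lomuto-style partitioning with the last element as the pivot (the benchmarked implementation is linked later in the post; details may differ):

#include <utility>
#include <vector>

// Lomuto-style partition: afterwards the pivot sits at its final
// position in the sorted array, and that position is returned.
static int partition(std::vector<float>& vector, int low, int high) {
    float pivot = vector[high]; // last element serves as the pivot
    int i = low - 1;            // boundary of the "smaller" part
    for (int j = low; j < high; j++) {
        if (vector[j] <= pivot) {
            i++;
            std::swap(vector[i], vector[j]);
        }
    }
    std::swap(vector[i + 1], vector[high]); // move pivot into place
    return i + 1;
}

static void quicksort(std::vector<float>& vector, int low, int high) {
    if (low < high) {
        int p = partition(vector, low, high);
        quicksort(vector, low, p - 1);  // sort left of the pivot
        quicksort(vector, p + 1, high); // sort right of the pivot
    }
}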

Ideally, the pivot should split the array into two equal parts. In the worst case, if the pivots are consistently picked such that one side of the array is much larger than the other (for example, an already sorted array with the last element as the pivot), the runtime of the algorithm degrades toward O(n²). Luckily, this happens rarely.


Heapsort

Heapsort is a bit more complicated. It works by creating a heap, a special type of binary tree in which each parent node is always bigger (or smaller) than its children. The algorithm first converts the array into a heap. Then it repeatedly takes the element at the root of the heap (the largest element, for a max-heap) and puts it at the end of the array. Since the root of the heap has been removed, a new element needs to be moved to the root, which means rebuilding at least part of the heap.
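
A sketch of the overall procedure looks like this (assuming a heapify helper that sifts the element at index i down within the first n elements; the heapify_k function shown later in this post is one such implementation):

// Heapsort driver: build a max-heap, then repeatedly move the root
// (the current maximum) to the end of the shrinking unsorted range.
static void heapsort(std::vector<float>& vec) {
    int n = static_cast<int>(vec.size());
    // Convert the array into a heap, sifting down every non-leaf node.
    for (int i = n / 2 - 1; i >= 0; i--)
        heapify(vec, n, i);
    // Repeatedly extract the maximum.
    for (int i = n - 1; i > 0; i--) {
        std::swap(vec[0], vec[i]); // the root goes to its final place
        heapify(vec, i, 0);        // rebuild the heap, now one smaller
    }
}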

There is a catch with heapsort, however. The heap is not allocated as an external binary tree; instead, the whole operation is done inside the array itself, because a binary tree can be represented using an array. In this representation, if a node is located at index i, its left child is at position 2 * i + 1 and its right child is at position 2 * i + 2.
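
As a small worked example of this mapping (the helper names here are ours, not from the benchmarked code):

// Implicit binary tree inside the array: no pointers needed.
inline int left_child(int i)  { return 2 * i + 1; }
inline int right_child(int i) { return 2 * i + 2; }
inline int parent(int i)      { return (i - 1) / 2; }

// Example: the array {9, 7, 8, 3, 5} is a valid max-heap.
// The root 9 (index 0) has children 7 (index 1) and 8 (index 2);
// node 7 (index 1) has children 3 (index 3) and 5 (index 4).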

Heapsort’s runtime is always O(n log n), but it is generally slower than quicksort. In practice, it is used in combination with quicksort: if quicksort picks a bad pivot several times in a row, the sort switches to heapsort. This combined algorithm is called introsort.
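
A rough sketch of that switching logic is shown below. The depth limit of 2 * log2(n) mirrors common library implementations; the partition function from the earlier sketch and a range version of heapsort are assumed:

#include <cmath>
#include <vector>

// Introsort skeleton: quicksort until the recursion depth budget is
// exhausted, then fall back to heapsort, which guarantees O(n log n).
static void introsort_impl(std::vector<float>& v, int low, int high,
                           int depth_limit) {
    if (low >= high) return;
    if (depth_limit == 0) {
        heapsort_range(v, low, high); // assumed: heapsort on [low, high]
        return;
    }
    int p = partition(v, low, high);
    introsort_impl(v, low, p - 1, depth_limit - 1);
    introsort_impl(v, p + 1, high, depth_limit - 1);
}

static void introsort(std::vector<float>& v) {
    if (v.size() < 2) return;
    int depth_limit = 2 * static_cast<int>(std::log2(v.size()));
    introsort_impl(v, 0, static_cast<int>(v.size()) - 1, depth_limit);
}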

The test

To test the performance of the sorting algorithms, we used this code. This is a straightforward C++ implementation one can find in algorithm textbooks. We are sorting an array of 32M floats. The tests are executed on an Intel(R) Core(TM) i5-10210U CPU, a commodity CPU, with turbo boost disabled. For the compiler, we used CLANG 12.

To limit the measurement to the algorithm itself (as opposed to the whole program, including the random number generation), we used LIKWID and its marker API.
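
The instrumentation looks roughly like this (a sketch; the region name is our choice):

#include <likwid-marker.h> // in older LIKWID versions: <likwid.h>
#include <vector>

void measure_sort(std::vector<float>& data) {
    LIKWID_MARKER_INIT;
    // ... input generation happens outside the measured region ...
    LIKWID_MARKER_START("sort");
    quicksort(data, 0, static_cast<int>(data.size()) - 1);
    LIKWID_MARKER_STOP("sort");
    LIKWID_MARKER_CLOSE;
}

// Build with -DLIKWID_PERFMON and -llikwid, then run under
// likwid-perfctr -m -g <group> ./sort_benchmark
// so that only the marked region is measured.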

Runtimes, instruction counts and cycle count

The first thing to do when comparing two algorithms that do the same thing is to measure the runtime, the instruction count and the cycle count. Here are the numbers for the two algorithms:

                     Quicksort       Heapsort        Relative difference
Runtime              7.03 seconds    20.99 seconds   2.99
Instructions         8.94 billion    40.92 billion   4.58
Cycles               11.2 billion    33.39 billion   2.98
IPC                  0.8             1.22
Basic stats for quicksort and heapsort

At first glance, things are clear: heapsort is about three times slower than quicksort, and the reason is simple: quicksort executes almost 4.6 times fewer instructions than heapsort.

The IPC (instructions per cycle) metric shows how many instructions are executed per cycle. A bigger number is better, because it means the CPU is working more efficiently. Neither algorithm is particularly efficient here, although heapsort fares better. The reason? Sorting requires a large number of comparisons, and branching on modern CPUs can be expensive.

We could end our experiment here. Heapsort is almost three times slower because it executes more than four times as many instructions as quicksort. Hardware utilization for both sorting algorithms is similar (and low). If we wanted to make either algorithm faster, we would have two strategies: decrease the instruction count (by somehow making the program do less work) or utilize the underlying hardware better.

Better algorithms

Using better algorithms is a very common way of speeding up processing. For quicksort, a cleverer selection of the pivot can yield a better runtime. For heapsort, a variant called bottom-up heapsort has been shown to improve performance. These topics have been extensively covered by many authors, and we will not talk about them in this post.

What we are interested in is this: is it possible to speed up either of these two algorithms by using the hardware better?

Why is the IPC low?

To understand why the IPC is low, we will use LIKWID’s top-down cycle allocation counter group (its name is TDA). Here are the numbers for both quicksort and heapsort:

                     Quicksort   Heapsort
Front-end stalls     9.84%       3.97%
Speculation stalls   60.10%      44.54%
Retired              18.68%      28.32%
Backend stalls       11.29%      23.24%
Top-down cycle allocation

Some slots go unused because of inefficiencies in instruction fetching and decoding: these are front-end stalls. Some slots go unused because the CPU wrongly guessed the outcome of a branch or a jump: these are speculation stalls. And some slots get lost because the processing units in the CPU were unavailable: these are backend stalls.

The number in the Retired category represents the percentage of slots where an instruction retired, i.e. finished execution and did some useful work. Here we see that the retired percentage is low for both algorithms, which explains why the IPC is low. The next question is how we can make it better: how can we increase the number in the Retired column and decrease it in the other columns?


Improving hardware efficiency

The highest number of lost cycles happens in the Speculation category for both algorithms, so it makes sense to start off our investigation there.

Here are the hot loops for both quicksort and heapsort; we will need this to understand what kind of changes are possible:

// Quicksort hot loop
for (int j = low; j < high; j++) {
    if (vector[j] <= pivot) {
        i++;
        std::swap(vector[i], vector[j]);
    }
}

// Heapsort hot loop
for (int j = start; j < end; j++) {
    if (vec[j] > vec[largest]) {
        largest = j;
    }
}

The unpredictable condition for quicksort is if (vector[j] <= pivot), and the unpredictable condition for heapsort is if (vec[j] > vec[largest]). These conditions are causing the stalls.

Looking at the assembly output, we see that for both algorithms the compiler generated jump instructions, which explains the high number of lost slots in the Speculation category. To remedy this, we rewrote both loops in a branchless fashion using SSE intrinsics.
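
The idea, shown here as a scalar sketch for the quicksort loop (the linked versions use SSE intrinsics, but the principle is the same): perform the swap unconditionally, and turn the comparison result into an integer that advances the partition boundary, so there is no data-dependent jump left to mispredict. Whether the compiler really emits branch-free code should be verified in the assembly output.

// Branchless variant of the partition loop: the swap is performed
// unconditionally, and the comparison result (0 or 1) decides whether
// the partition boundary advances. Note that i starts at low, so it
// plays the role of i + 1 from the original loop; the final pivot
// swap becomes std::swap(vector[i], vector[high]).
int i = low;
for (int j = low; j < high; j++) {
    float tmp = vector[j];
    int smaller = tmp <= pivot; // 0 or 1, no data-dependent jump
    vector[j] = vector[i];
    vector[i] = tmp;
    i += smaller;
}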

The source code of the branchless loops is available here and here. Let’s measure the runtimes, instruction counts and hardware efficiency after the intervention:

Branchless: runtimes, instruction counts, cycle count and hardware efficiency

Here are the numbers for quicksort:

                     Quicksort original   Quicksort branchless
Runtime              7.03 s               4.18 s
Instructions         8.94 billion         15.1 billion
Cycles               11.2 billion         6.6 billion
IPC                  0.8                  2.28
Total wasted slots   36.32 billion        9.17 billion
Front-end stalls     9.84%                21.94%
Speculation stalls   60.10%               5.47%
Retiring             18.68%               65.35%
Backend stalls       11.29%               7.25%
Quicksort original vs branchless

The branchless version is 1.68 times faster even though it executes 1.69 times more instructions! The reason is, of course, hardware efficiency, which can be clearly seen by comparing the two versions: in the original version only 18.68% of slots retired, compared with 65.35% in the branchless version.

Here are the numbers for heapsort:

                     Heapsort original   Heapsort branchless
Runtime              20.99 s             39.98 s
Instructions         40.92 billion       36.38 billion
Cycles               33.39 billion       63.68 billion
IPC                  1.22                0.57
Total wasted slots   95.86 billion       217.66 billion
Front-end stalls     3.97%               2.41%
Speculation stalls   44.54%              1.73%
Retiring             28.23%              14.55%
Backend stalls       23.24%              81.3%
Heapsort original vs heapsort branchless

What happened? The same intervention, but the numbers look completely different. The slowdown is almost 2x, and hardware efficiency (IPC) also fell by a factor of two. The retiring percentage fell, and the backend stalls percentage exploded. What went wrong?

Why did heapsort get slower?

Well, it has something to do with the backend, but we don’t know what. The Backend category can be divided into two large groups: slowdowns due to the memory subsystem (cache misses, memory latency and bandwidth) and slowdowns due to a lack of computational resources in the CPU core. LIKWID unfortunately doesn’t let us see which one it is, so it’s time to pull out the big guns: Intel’s VTune profiler.

We took a microarchitecture snapshot for the hot function for both the original and branchless versions of heapsort:

Heapsort original microarchitectural analysis
Heapsort branchless microarchitectural analysis

With the original version, the situation was clear: speculation was the main source of inefficiency. But after we got rid of it, the source of inefficiency became the memory subsystem, more specifically memory latency and memory bandwidth. How? Why?

Where do the problems with memory latency and memory bandwidth happen?

Here, again, are the hot loops of quicksort and heapsort:

// Quicksort hot loop
for (int j = low; j < high; j++) {
    if (vector[j] <= pivot) {
        i++;
        std::swap(vector[i], vector[j]);
    }
}

// Heapsort hot loop
for (int j = start; j < end; j++) {
    if (vec[j] > vec[largest]) {
        largest = j;
    }
}

Both loops look very similar, sweeping the array from left to right. The quicksort loop actually does a bit more work, but nothing substantially different.

The main difference between them, which is not visible here, is the trip count[1]. The hot loop in quicksort will typically have a large trip count[2]. The heapsort loop, on the other hand, will have 0, 1 or 2 iterations. After finishing this short loop, the algorithm jumps to a (from the hardware’s viewpoint) completely random memory location and repeats the process.

Processing two floats and then switching to another two floats somewhere else is what’s causing the slowdown. The new values that need processing are typically not in the data caches, and they need to be brought from memory into the caches. Quicksort doesn’t have that problem: once its loop starts running, the hardware quickly picks up the memory access pattern and the data is prefetched before it is even needed.
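
To make the access pattern concrete, here is a hypothetical snippet that prints the indices one sift-down can touch when it follows the leftmost path of a 32M-element heap:

#include <cstdio>

int main() {
    // Each level of the heap doubles the jump: 0, 1, 3, 7, 15, ...
    // After the first few levels, every access lands on a cache line
    // far from the previous one, a pattern the prefetcher cannot follow.
    for (long i = 0; i < 32L * 1024 * 1024; i = 2 * i + 1)
        std::printf("%ld ", i);
    std::printf("\n");
    return 0;
}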


The tale of good speculations and bad speculations

The next question is, why didn’t the problem with memory come up in the original version of heapsort? To understand it, we need to look at a few more lines of heapsort’s source code:

template <typename T>
static void heapify_k(std::vector<T>& vec, int n, int i) {
    int largest = i;

    int start = 2 * i + 1;
    int end = std::min(start + 2, n);

    for (int j = start; j < end; j++) {
        if (vec[j] > vec[largest]) {
            largest = j;
        }
    }

    if (largest != i) {
        std::swap(vec[i], vec[largest]);

        heapify_k(vec, n, largest);
    }
}

This code finds the largest element among the root (vec[i]), the left child (vec[2*i+1]) and the right child (vec[2*i+2]) in the for loop. If the root is the largest (largest == i), then nothing needs to be done. Otherwise, the root and the largest child switch places, and the procedure is repeated for the subtree where the root was swapped in.

A modern speculative CPU executes instructions out of order. It will continue executing instructions as long as it can: if the operands for an instruction are available, the CPU will execute it even if the instructions before it haven’t finished. If there is a branch, the CPU will speculate (guess) and decide what to do with the branch before the values needed to evaluate its condition become available.

In the condition if (vec[j] > vec[largest]), this is exactly what happened: the values needed to evaluate the condition were unavailable, but the CPU continued running the code nonetheless. It guessed the outcome of the branch somehow. When the loop was done, the CPU had picked some value for the variable largest, and it moved on to the next recursive call of the heapify_k function[3].

At some later point, the values needed to evaluate the condition if (vec[j] > vec[largest]) arrive. If the speculation was correct, that’s awesome: the CPU can simply continue with what it was doing. If not, the CPU has to throw away the speculated work and start over. There is only a 33% chance of picking the correct value for the variable largest, but from a performance perspective, it is better to do anything potentially useful than to sit idle.

There is one more thing: let’s say the CPU guessed that largest was X when in fact it was X + 1. Even then, the data needed for running the loop for X + 1 is probably in the data cache, since data for X and X + 1 are neighbors in memory. Speculating on X, even when wrong, brought in the data needed for X + 1.

Speculation actually helped by bringing in the data from the main memory to the data cache earlier! The speculation category in the microarchitecture analysis was high, but it was useful because it made the program run faster. This is the tale of good speculations!

In the case of quicksort, there were no data cache misses and the data was readily available for the CPU to consume. The CPU was nevertheless speculating, and half of the time it guessed wrong, so it had to throw away the speculated work. That’s why we saw such a large performance increase when we switched to the branchless version. This is the tale of bad speculations!


Conclusions

  • Heapsort is slower than quicksort because it executes more than 4x more instructions.
  • Even though heapsort is more hardware-efficient, it is nevertheless slower. Efficiency matters, but instruction count matters as well.
  • Quicksort got a really nice speedup from going branchless. Speculations were the limiting factor for quicksort and rewriting it branchlessly helped a lot.
  • Heapsort got a large slowdown from going branchless. Speculations were hiding the problems with the memory usage and actually helped the algorithm’s performance.
  • In microarchitectural analysis, speculations can hide the issues with memory bandwidth and memory latency.
  • Efforts to speed up heapsort will need to revolve around changing the memory layout with the goal of decreasing data cache misses.
  1. The trip count is the same as the number of iterations of the loop.
  2. If the size of the array is n, the trip count will be n, then n/2, then n/4, and so on.
  3. The recursion gets optimized away by tail call optimization, but that is another story.
