2-minute read: the Magic Touch of Parallel Algorithms

Master software performance in just 16 hours!
Join our Software Optimization for the Memory Subsystem Workshop taking place from May 18th to May 21st. Click here to express interest or register.

In the previous article we made our program run faster by replacing std::endl with '\n' when writing to a file. Our flamegraph for the same program, after this modification, looks like this:

As you can see, what takes most of the time in our program now is the function sort_lines. Let’s have a look at the source code of this function:

 void sort_lines(std::vector<std::string>& lines) {
     std::sort(lines.begin(), lines.end());
 }

This is a simple call to std::sort, so we could naively conclude there is nothing possible to optimize. However, this is not true.

Almost all modern CPUs have more than one core nowadays but the standard C++ library is somewhat stuck in the past. Sorting can be done on more than one CPU simultaneously, and there exist parallel implementations of std::sort that can take advantage of more than one core.

Parallel algorithms in the standard library

If you are C++17, the std::sort algorithm takes an additional parameter called ExecutionPolicy. Developers can use it to specify what flavor of the algorithm they prefer. Here is our modified implementation:

void sort_lines(std::vector<std::string>& lines) {
    std::sort(std::execution::par_unseq, lines.begin(), lines.end());
}

With std::execution::par_unseq we are instructing the standard C++ library to use the parallel version of std::sort algorithm. Many algorithms of the standard C++ library accept this parameter, e.g. std::sort, std::find, std::find_if, std::count, std::for_each, std::transform, std::transform_reduce and many other.

If you are using an older C++ version, GCC’s standard C++ library offers Parallel Mode. Parallel mode is not enabled by default, and you can enable it in two ways:

Recompile your program with -D_GLIBCXX_PARALLEL. This will instruct the standard library to use parallel algorithms whenever available or
Use the algorithms from __gnu_parallel namespace, in our case __gnu_parallel::sort.

Our current implementation looks like this:

void sort_lines(std::vector<std::string>& lines) {
#if __cplusplus >= 201703L
    std::sort(std::execution::par_unseq, lines.begin(), lines.end());
#else
    __gnu_parallel::sort(lines.begin(), lines.end());
#endif
}

We are choosing the std::sort implementation depending on the version of C++ version.

The flamegraph (with GCC’s parallel mode libstdc++) looks like this:

The runtime of sort_lines function went down from 440 ms to 150 ms. The whole program’s runtime went down from 650 ms to 390 ms with a very simple change!

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us

You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

1 comment / Add your comment below

Parallel algorithms in the standard library

1 comment / Add your comment below

Leave a Reply Cancel reply