| UPCOMING WORKSHOPS Online: • AVX Vectorization Workshop: Nov 24th – 25th, 9 AM – 5 PM CET, info [at] johnnysswlab [dot] com (Register, Register for AVX Workshop, Hi! I would like to register for AVX Vectorization Workshop taking place on Nov 24th and 25th, 9 AM – 5 PM CET ), More info… • NEON Vectorization Workshop: TBD, info [at] johnnysswlab [dot] com (Express interest, Expressing Interest for NEON Workshop, Hi! I am interested in NEON Vectorization Workshop. Please inform me about the dates once this information is known.) , More info… |
In the previous article we made our program run faster by replacing std::endl with '\n' when writing to a file. Our flamegraph for the same program, after this modification, looks like this:

As you can see, what takes most of the time in our program now is the function sort_lines. Let’s have a look at the source code of this function:
void sort_lines(std::vector<std::string>& lines) {
std::sort(lines.begin(), lines.end());
}This is a simple call to std::sort, so we could naively conclude there is nothing possible to optimize. However, this is not true.
Almost all modern CPUs have more than one core nowadays but the standard C++ library is somewhat stuck in the past. Sorting can be done on more than one CPU simultaneously, and there exist parallel implementations of std::sort that can take advantage of more than one core.
Parallel algorithms in the standard library
If you are C++17, the std::sort algorithm takes an additional parameter called ExecutionPolicy. Developers can use it to specify what flavor of the algorithm they prefer. Here is our modified implementation:
void sort_lines(std::vector<std::string>& lines) {
std::sort(std::execution::par_unseq, lines.begin(), lines.end());
}With std::execution::par_unseq we are instructing the standard C++ library to use the parallel version of std::sort algorithm. Many algorithms of the standard C++ library accept this parameter, e.g. std::sort, std::find, std::find_if, std::count, std::for_each, std::transform, std::transform_reduce and many other.
If you are using an older C++ version, GCC’s standard C++ library offers Parallel Mode. Parallel mode is not enabled by default, and you can enable it in two ways:
- Recompile your program with
-D_GLIBCXX_PARALLEL. This will instruct the standard library to use parallel algorithms whenever available or - Use the algorithms from
__gnu_parallelnamespace, in our case__gnu_parallel::sort.
Our current implementation looks like this:
void sort_lines(std::vector<std::string>& lines) {
#if __cplusplus >= 201703L
std::sort(std::execution::par_unseq, lines.begin(), lines.end());
#else
__gnu_parallel::sort(lines.begin(), lines.end());
#endif
}
We are choosing the std::sort implementation depending on the version of C++ version.
The flamegraph (with GCC’s parallel mode libstdc++) looks like this:

The runtime of sort_lines function went down from 440 ms to 150 ms. The whole program’s runtime went down from 650 ms to 390 ms with a very simple change!
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.


nice… thanks