We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.
In the previous article we made our program run faster by replacing std::endl
with '\n'
when writing to a file. Our flamegraph for the same program, after this modification, looks like this:
As you can see, what takes most of the time in our program now is the function sort_lines
. Let’s have a look at the source code of this function:
void sort_lines(std::vector<std::string>& lines) { std::sort(lines.begin(), lines.end()); }
This is a simple call to std::sort
, so we could naively conclude there is nothing possible to optimize. However, this is not true.
Almost all modern CPUs have more than one core nowadays but the standard C++ library is somewhat stuck in the past. Sorting can be done on more than one CPU simultaneously, and there exist parallel implementations of std::sort
that can take advantage of more than one core.
Parallel algorithms in the standard library
If you are C++17, the std::sort
algorithm takes an additional parameter called ExecutionPolicy
. Developers can use it to specify what flavor of the algorithm they prefer. Here is our modified implementation:
void sort_lines(std::vector<std::string>& lines) { std::sort(std::execution::par_unseq, lines.begin(), lines.end()); }
With std::execution::par_unseq
we are instructing the standard C++ library to use the parallel version of std::sort
algorithm. Many algorithms of the standard C++ library accept this parameter, e.g. std::sort
, std::find
, std::find_if
, std::count
, std::for_each
, std::transform
, std::transform_reduce
and many other.
If you are using an older C++ version, GCC’s standard C++ library offers Parallel Mode. Parallel mode is not enabled by default, and you can enable it in two ways:
- Recompile your program with
-D_GLIBCXX_PARALLEL
. This will instruct the standard library to use parallel algorithms whenever available or - Use the algorithms from
__gnu_parallel
namespace, in our case__gnu_parallel::sort
.
Our current implementation looks like this:
void sort_lines(std::vector<std::string>& lines) { #if __cplusplus >= 201703L std::sort(std::execution::par_unseq, lines.begin(), lines.end()); #else __gnu_parallel::sort(lines.begin(), lines.end()); #endif }
We are choosing the std::sort
implementation depending on the version of C++ version.
The flamegraph (with GCC’s parallel mode libstdc++) looks like this:
The runtime of sort_lines
function went down from 440 ms to 150 ms. The whole program’s runtime went down from 650 ms to 390 ms with a very simple change!
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
nice… thanks