We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
When it comes to software performance, there seems to be no limit to how much one can complicate things in order to squeeze out improvements. Here we present a short example of optimizations gone wild: horrible code with clean performance. An homage to Clean Code, Horrible Performance.
Clean Code
Let’s say we have a class called `my_string` that looks like this:
```cpp
class my_string {
    char* ptr;
    size_t size;
};
```
The class implements a string, but not the null-terminated kind. The pointer to the character block is kept in `ptr`, and the length of the string is stored in `size`.
Let’s say we want to implement a `find_substring(substring)` method, which tries to find the first occurrence of a substring in the string. Here is the naive and straightforward implementation of such a method. (Please note that this is not the most efficient implementation of `find_substring`; many other algorithms, e.g. Boyer-Moore, Boyer-Moore-Horspool, Knuth-Morris-Pratt or Rabin-Karp, are faster. The example is used to illustrate a missed compiler optimization.)
```cpp
std::pair<bool, size_t> find_substring(const my_string& substr) {
    if (substr.size > size) {
        return { false, 0 };
    }

    size_t s = size - substr.size;
    char* ptr_str = ptr;
    char* ptr_substr = substr.ptr;
    size_t size_substr = substr.size;

    for (size_t i = 0; i <= s; ++i) {
        bool found = true;
        for (size_t j = 0; j < size_substr; ++j) {
            if (ptr_str[i + j] != ptr_substr[j]) {
                found = false;
                break;
            }
        }
        if (found) {
            return { true, i };
        }
    }

    return { false, 0 };
}
```
The hot part is the pair of nested loops at the end of the method. The loop over `i` iterates over the string, and the loop over `j` tries to match the substring against the string at position `i`. This is a very simple and straightforward solution.
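For readers who want to try the method out, here is a self-contained sketch of the class and its naive search. The constructor is an assumption added for illustration (the article only shows the data members):

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

// Self-contained sketch of my_string with the naive find_substring.
// The constructor is hypothetical; the article only shows the data members.
class my_string {
public:
    explicit my_string(const char* s)
        : ptr(const_cast<char*>(s)),   // sketch only: we never write through ptr
          size(std::strlen(s)) {}

    std::pair<bool, std::size_t> find_substring(const my_string& substr) const {
        if (substr.size > size) {
            return { false, 0 };
        }
        std::size_t last_start = size - substr.size;
        for (std::size_t i = 0; i <= last_start; ++i) {
            bool found = true;
            for (std::size_t j = 0; j < substr.size; ++j) {
                if (ptr[i + j] != substr.ptr[j]) {
                    found = false;
                    break;
                }
            }
            if (found) {
                return { true, i };
            }
        }
        return { false, 0 };
    }

private:
    char* ptr;
    std::size_t size;
};
```

For example, `my_string("hello world").find_substring(my_string("world"))` returns `{true, 6}`.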
The Problem
The problem with this approach is the inner loop, which iterates over the substring. It reads the same substring data from the data cache into registers over and over, even though the inner loop over `j` will most likely break very soon after it starts.
From the performance point of view, it would be very convenient if the data belonging to the substring were kept in CPU registers rather than in the data caches. If the size of the substring were known at compile time, and small, the compiler could perform an optimization: place the whole substring in registers and never touch memory again.

In our case, the compiler doesn’t know the size of the substring. In theory, it could place part of the substring, e.g. the beginning, in registers and grab the rest from memory only when needed. This would be very beneficial, since the part of the substring that is accessed most often is its beginning. But compilers never do this automatically.
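To illustrate what such an optimization can look like when it is possible, here is a hypothetical sketch: with the pattern length available as a compile-time constant, the compiler is free to unroll the comparison and keep the pattern bytes in registers:

```cpp
#include <cstddef>

// Hypothetical illustration: when the pattern length N is a compile-time
// constant, the compiler can fully unroll this loop and keep all pattern
// bytes in registers for the duration of the search.
template <std::size_t N>
bool match_at(const char* str, const char (&pattern)[N]) {
    for (std::size_t j = 0; j + 1 < N; ++j) {  // N - 1 skips the trailing '\0'
        if (str[j] != pattern[j]) {
            return false;
        }
    }
    return true;
}
```

A call like `match_at(text + i, "abc")` typically compiles down to a few compares against immediates, with no loads from the pattern.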
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
Horrible Code
Well, what the compiler can’t do, we can do for it. Let’s rewrite the original code to copy the beginning of the substring into a small, fixed-size array. Here is the code:
```cpp
std::pair<bool, size_t> find_substring2(const my_string& substr) {
    if (substr.size > size) {
        return { false, 0 };
    }

    static constexpr size_t substring_buffer_max_size = 8;
    char substring_buffer[substring_buffer_max_size];

    // Fill the statically allocated substring buffer
    size_t substring_buffer_size =
        std::min(substring_buffer_max_size, substr.size);
    for (size_t i = 0; i < substring_buffer_size; ++i) {
        substring_buffer[i] = substr.ptr[i];
    }

    size_t s = size - substr.size;
    char* ptr_str = ptr;
    char* ptr_substr = substr.ptr;
    size_t size_substr = substr.size;

    for (size_t i = 0; i <= s; ++i) {
        bool found = true;
        size_t j;
        // First compare against the buffered beginning of the substring
        for (j = 0; j < substring_buffer_size; ++j) {
            if (ptr_str[i + j] != substring_buffer[j]) {
                found = false;
                break;
            }
        }
        // Only if the beginning matches, compare the rest from memory
        if (found) {
            for (; j < size_substr; ++j) {
                if (ptr_str[i + j] != ptr_substr[j]) {
                    found = false;
                    break;
                }
            }
        }
        if (found) {
            return { true, i };
        }
    }

    return { false, 0 };
}
```
This code is much more complex than the original. We introduce a new fixed-size array `substring_buffer` and fill it with data from the beginning of the substring. In the hot loop, we first check whether we have a match against `substring_buffer`. If not, we break out of the inner loop and move on to the next value of `i`. Otherwise, we check whether the rest of the substring matches as well.
The compiler actually takes this into account when generating code. Here is the assembly output:
The first four characters are kept in the registers `sil`, `dil`, `r8b` and `r9b`. The remaining characters of the substring are reloaded from memory, but this is not a problem for us, because the comparison fails most of the time on the first character of the substring. In fact, this assembly suggests it would make sense to decrease the size of `substring_buffer` to 4 characters.
It is quite clear why we call this code horrible. A developer unfamiliar with compiler optimizations would look at this code with disgust. But is it worth it? Let’s find out.
Benchmarking
We measure the performance of this solution with both GCC 11 and Clang 15. As input we use a string that is 256 MB in size, and we look for a substring that is not present; this makes sure that the whole string is processed. Here are the runtimes:
| Compiler | Original | With substring caching |
|---|---|---|
| GCC 11 | Runtime: 0.368 s, Instr: 2966 M, CPI: 0.473 | Runtime: 0.255 s, Instr: 2426 M, CPI: 0.357 |
| Clang 15 | Runtime: 0.272 s, Instr: 2964 M, CPI: 0.35 | Runtime: 0.169 s, Instr: 2155 M, CPI: 0.301 |
With both compilers the version with substring caching is faster, executes fewer instructions and retires them more efficiently (lower CPI).
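The measurement setup described above can be sketched as follows. This is not the article’s exact harness; the string contents, the needle and the timing code are assumptions for illustration:

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <utility>

// Naive search equivalent to find_substring from the article.
std::size_t naive_find(const std::string& s, const std::string& sub) {
    if (sub.size() > s.size()) return std::string::npos;
    for (std::size_t i = 0; i + sub.size() <= s.size(); ++i) {
        std::size_t j = 0;
        while (j < sub.size() && s[i + j] == sub[j]) ++j;
        if (j == sub.size()) return i;
    }
    return std::string::npos;
}

// Builds a haystack of `bytes` 'a's and searches for a needle that cannot
// occur, so the whole string is scanned. Returns {elapsed ms, position}.
std::pair<double, std::size_t> run_benchmark(std::size_t bytes) {
    std::string haystack(bytes, 'a');
    std::string needle = "ab";  // never matches: the haystack has no 'b'
    auto start = std::chrono::steady_clock::now();
    std::size_t pos = naive_find(haystack, needle);
    auto stop = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    return { ms, pos };
}
```

Calling `run_benchmark(256 * 1024 * 1024)` reproduces the 256 MB input size from the table above.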
Bottom Line
Is this code horrible? Yes, it is.
Is it fast? As demonstrated, it is fast.
Is this performance portable? The speed improvement is portable between GCC and Clang. It would be possible for a compiler to make this optimization without our intervention, but my gut feeling is that compilers that do this are rare.
Is it worth it? This is debatable and, to a certain extent, more a matter of preference than of fact. Personally, I think you should never write such code as a first implementation: it is messy, and its intention is not clear to an average developer. But if there is a performance bottleneck that you need to address, then this solution is acceptable.
I immediately implemented this in my own string class; for me implementation details are unimportant, it’s all API and performance. Ta!
I’m curious, have you benchmarked against memchr + memcmp, which is how my string class implements search? memchr the first symbol of the substring, check the remaining length, and if the substring fits, memcmp the next bytes of the substring against the next bytes of the string.
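For reference, a sketch of the approach this comment describes might look like the following; the function name and signature are invented for this illustration:

```cpp
#include <cstddef>
#include <cstring>
#include <utility>

// Sketch of the memchr + memcmp search described in the comment above:
// memchr locates candidates for the first byte, memcmp verifies the rest.
std::pair<bool, std::size_t> find_memchr(const char* str, std::size_t size,
                                         const char* sub, std::size_t sub_size) {
    if (sub_size == 0) return { true, 0 };
    if (sub_size > size) return { false, 0 };
    const char* p = str;
    const char* last = str + (size - sub_size);  // last valid start position
    while (p <= last) {
        const char* c = static_cast<const char*>(
            std::memchr(p, sub[0], static_cast<std::size_t>(last - p) + 1));
        if (c == nullptr) return { false, 0 };
        if (std::memcmp(c + 1, sub + 1, sub_size - 1) == 0) {
            return { true, static_cast<std::size_t>(c - str) };
        }
        p = c + 1;
    }
    return { false, 0 };
}
```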
Awesome article! Very good read. I think that when deciding to implement an optimization like this, it’s very important to state the reasoning in surrounding comments. Perhaps even together with an equivalent readable representation.
Can’t you just do this with `strstr()`? I would be interested in seeing a benchmark against an implementation of this using `strstr()`.
No, sorry 🙁
Hey Ivica, I ran your code against GPT-4. Hope it’s okay.
https://mlajtos.mu/posts/no-code-clean-performance
of course it is 🙂
It is cool, but GPT-4 implemented a version of the Boyer-Moore algorithm (the similarity is clear when you look at the “bad character jump table”). So I wouldn’t say it’s a fair comparison.
I tried to reproduce your results on my machine (MacBook Pro with Apple Silicon M1 Max, Apple Clang 13) and I couldn’t. I used the same setup (256 MB long string, substring that does not appear in it) and I am measuring 140 ms for both find_substring and find_substring2, with no measurable performance difference between the two.
Moreover, if I run the same test with the same string and substring using std::string::find, it runs in 70 ms. So the libc++ version of substring find outperforms both of your versions by a factor of 2.
I tried substrings of length 4, 8, and 12 with the same results.
Hi Timur!
Your finding is very interesting. First, did you use the exact same code as in the repository? If the compiler manages to inline those functions and figure out that you are using fixed-size strings, it will do this optimization for you. But in the general case, when the string size is not a compile-time constant, it will revert to the version of the code as in the post. Paste a godbolt link and let’s analyze the produced assembly. Or maybe the M1 is doing some magic of its own that we don’t see on Intel.
With regards to std::string::find being faster, this is not surprising. There are the Boyer-Moore, Boyer-Moore-Horspool, Knuth-Morris-Pratt and Rabin-Karp algorithms, which are faster. The idea of the post wasn’t to find the fastest string search algorithm, but to show a compiler missed optimization that can be worked around manually.
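For reference, since C++17 the standard library provides some of these algorithms directly, via std::search combined with a searcher object; a minimal sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Substring search using the C++17 Boyer-Moore-Horspool searcher.
// Returns the position of the first match, or std::string::npos.
std::size_t find_bmh(const std::string& haystack, const std::string& needle) {
    auto it = std::search(
        haystack.begin(), haystack.end(),
        std::boyer_moore_horspool_searcher(needle.begin(), needle.end()));
    return it == haystack.end()
               ? std::string::npos
               : static_cast<std::size_t>(it - haystack.begin());
}
```

`std::boyer_moore_searcher` can be substituted for the Horspool variant with no other changes.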
Hi Ivica!
No, I did not use the exact same code as in the repository, because what repository? I couldn’t find a link to one in the article.
You are right, I did not make sure that the functions are not inlined and the string size is invisible to the compiler, so perhaps better to run those tests again. Could you please point me to the relevant repository? Thanks!
The repository: https://github.com/ibogosavljevic/johnysswlab/blob/master/2023-03-decreasememoryaccesses-compiler/aliasing.cpp
Thanks for your impressive article.
Your story starts from clean code and then moves to fast but horrible code. What if you start from horrible and slow? Is clean code the only way to the promised land?
I prefer clean code because it’s simpler to read and maintain. With optimizations you often end up with horrible code. If you have any examples of code that is horrible and slow, it would be an interesting investigation to do.