We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.
Early in their career every C/C++ developer has had an eureka moment: the discovery of optimization options in their compiler. A discovery that there is GCC compiler offers -O0 optimization lever for regular debugging and -O3 for fast release code will be one of the moment I will never forget. I just needed to compile my program with -O3 and everything would run faster without any effort on my side.
Many years have passed since and I believe everyone compiles their code in the similar way: we compile our .c or .cpp files using a compiler, with a certain optimization option (-O0 or O3) to get object files (.o). Then linker takes all our object files and merges them into one big executable or a library.
In this story, linker is the last piece of the puzzle, and compared to what the compiler is doing, its job is much simpler. But doing things this way we are missing out on some important optimization opportunities. So, ladies and gentlemen, allow me to explain.
Motivation
As I said earlier, compiler makes many optimizations on the code. Allow me to present two of them: inlining and loop merging. Have a look at the following code:
int find_min(const int* array, const int len) { int min = a[0]; for (int i = 1; i < len; i++) { if (a[i] < min) { min = a[i]; } } return min; } int find_max(const int* array, const int len) { int max = a[0]; for (int i = 1; i < len; i++) { if (a[i] > max) { max = a[i]; } } return max; } void main() { int* array, len, min, max; initialize_array(array, &len); min = find_min(array, len); max = find_max(array, len); ... }
This is a very simple code that initializes an array and then finds its minimum and maximum. Please pay attention to functions find_min
and find_max
and notice they are very similar. When we compile optimized version of our program, the compiler might inline functions find_min
and find_max
. Inlining simply means to remove the calls to those functions and instead copy their bodies in the place of the call. After having done the inlining, the compiler sees that both find_min
and find_max
loop over the same range and there are no data dependencies between them so it decides to do loop merging. It merges loops from find_min
and find_max
into a single loop. This decreases amount of work the CPU needs to do and also increases data cache utilization. So the optimized version might look like this:
void main() { int* array, len, min, max; initialize_array(array, &len); min = a[0]; max = a[0]; // compiler inlined find_min and find_max for (int i = 0; i < len; i++) { // compiler merged two loops int one if (a[i] < min) { min = a[i]; } if (a[i] > max) { max = a[i]; } } ... }
In the optimized version calls to functions find_min
and find_max
have completely disappeared. This saves some calling time. Furthermore, the compiler merged loops from find_min
and find_max
and now we are looping just once.
These two optimizations alone will benefit the performance. But everything we talked about here can happen if main
, find_min
and find_max
are in the same compilation unit (in C and C++ compilation unit is a file since compilers work file by file).
If find_min
and find_max
are defined in find.c and main
is defined in main.c, the compiler will not be able to perform the optimization. While the compiler is compiling main.c it is aware of find_min
and find_max
only through their signature. It knows these two functions exist, but it doesn’t know how do they look internally or their addresses. It will leave placeholders to call these two functions, and later the linker will resolve calls to those two functions with actual addresses. If the compiler and linker work in this way, they miss out on important optimization potential.
You probably have guessed: why doesn’t the linker do some optimizations? Linker is aware of multiple compilation units and certainly could do them. Well, ladies and gentlemen, allow me to introduce:
Link Time Optimizations
Introduction
Link time optimizations (abbr. LTO) are just that, optimizations done during linking. Linker is aware of all the compilation units and can optimize more compared to what conventional compiler can do. The most important optimizations that linker does are inlining and code locality improvements. As already said, inlining just means copying body of a function in the place of the call instead of calling the function directly. Linker inlines functions from other compilation units if that makes sense. When the function is inlined, the linker has an opportunity to do other optimizations, similar to what compilers do when they compile a singe file.
Regular compilers try to figure out which functions execute often and which functions execute rarely. Depending on that they move the functions that execute often close to one another in memory in order to improve code locality. Additionally, if a function A()
calls function B()
it makes sense to put them close to one another in memory for the same reason. Better locality means less page faults1 and faster code. Linkers are much better aware how the functions from different compilation units call one another and therefore can put them in memory so that they optimize for better code locality.
The result of LTO is a binary which is typically a few percent faster than the original and a few percent smaller. This is not much but for performance sensitive programs can be a big deal.
There is a downside to LTO however. Compiling and linking with LTO is much slower and uses more memory, and this doesn’t scale well. For large projects, e.g Chrome browser or Firefox it requires a dedicated machine with a lot of memory to link. The reason is quite simple, it’s because linker now has to keep in memory all compilation units.
Also, total compilation time with LTO enabled is longer and I think it is one of the reasons why we haven’t seen its wide adoption (developers are not patient people). It is very simple to enable LTO, and later we will enable it on a some projects and measure the results.
Under the hood
You can skip this and move to next chapter as this is not important to work with LTO.
So how does LTO work? Before LTO, linkers were in change only of linking. Linkers would take all the generated object files, for each file they would resolve calls to functions in other object files, create a table of exported functions and finally output the executable to disk.
Now with LTO the linker takes a lot of responsibilities compilers used to have, namely the optimization. Common object files contain binary code, but it is impossible to do LTO on the binary code. So the compiler needs to be run with LTO enabled in order to produce an object file that can be used for LTO. When compiler is started in LTO mode, it writes in special .lto sections the intermediate representation of the code. Intermediate representation is the thing on which the compilers do the optimizations; now when the optimization step is moved to linking phase, this information is necessary for the linker as well.
The line between compilers and linkers gets blurry. I wouldn’t be surprised if one day it disappears and we are left with only one tool.
How to enable link time optimizations?
LTO has been supported on compilers for a long time; GCC 4.7 which is eight years old supports it and I assume that all the other popular compilers support LTO for the same time.
To enable LTO, follow these simple steps:
- Add option
-flto
to the invocation of compiler. - Add option
-flto
to the invocation of the linker. Additionally, you need to add all options from the compiler invocations to the invocation of the linker. So if you called your compiler with “-march=i486 -O3 -fno-stack-protector
“, you will need to pass the same options to the linker.
Now you compile your program as regular. Unless you are using a very old version of the compiler, you shouldn’t expect any problems here.
Experiments
Now let’s get our hands dirty. I already made some tests on ProjectX2, the commercial project I am working on, so I am going to recount my experiences with it. Second thing we are going to investigate is the impact of LTO on ffmpeg open source software used for media files manipulation.
ProjectX
ProjectX is a medium sized project written in C++. There is no special team which is in charge of tools or performance3 and this corresponds to what we typically see in most medium sized commercial projects. It is compiled using GCC 4.9.4 and without LTO. There is one component of the system which is performance critical and I measured how the performance of that component behaves with respect to LTO. I picked a single test since testing is slow and repeated it 10 times. Here are the results:
Phase | Without LTO | With LTO | Difference |
Test runtime | 151.7s | 138.7s | +9.2% |
Binary size – libA – libB – libC – libD | 362kB 1698kB 31.6MB 1204kB | 292kB 1394kB 23.8MB 1144kB | +24% +21.8% +32.7% +5.2% |
Compile time for a single .cpp file | 20.2s | 214.2s | 10.6 x slower |
Linking time for libC (18 .o files) | 1.6s | 63.9s | 39.9x slower |
Linking memory consumption for libC | 119MB | 709MB | 5.95x more |
As the above table shows, the test runtime has decreased by 9.2%, and binaries sizes have decreased by 20% on average.
But the resources needed for compilation have increased dramatically. Compiling a .cpp file is 10 times slower, linking is 40 times slower and memory consumption needed for linking is almost 6 times higher. Please note that GCC 4.9 is really old and LTO has been correctly implemented just in that version. Let’s try a next experiment, maybe we’ll have more luck there.
ffmpeg
ffmpeg is a famous library for processing audio and video. We used ffmpeg 4.2.3 available from the web site and compiled it with GCC 8 and CLANG 9. The configure script of ffmpeg has an option to enable LTO, so that’s just what we did. ffmpeg uses some hand written assembler, we disabled this in order to measure the impact of LTO on optimizing C code. The configure line we used is:
./configure --enable-lto --disable-inline-asm --disable-x86asm
Enabling LTO on GCC went easily. On CLANG however we needed to tweak the configure command line to force it to use LLD linker4. We also wanted to try out ThinLTO, a feature of CLANG that promises the same performance gains as regular LTO and compiles in time comparable to LTO.
To test the performance we asked ffmpeg to convert a mpegts container with h264 video and aac audio to matroska container with mpeg4 video and ac3 audio. The input file was 708MB large and the output was 349MB. Here are the results of our measurements:
Phase | GCC 8 no-LTO | GCC 8 LTO | CLANG 9 no-LTO | CLANG 9 LTO |
Conversion time | 1198s | 1188s | 884s | 891s |
ffmpeg size | 17.1MB | 17.7MB | 18.7MB | 20.5MB |
Compilation time (make -j4) | 626s | 1369s | 646s | 1112s |
Compiling ffmpeg.cc | 5.5s | 1.1s | 4.9s | 2.6s |
Linking ffmpeg time | 13.1s | 1132s | 7.7s | 368s |
Linking ffmpeg memory | 287MB | 381MB | 203MB | 988MB |
I was expecting something else, but facts are stubborn things. First thing we see from the above table, is that CLANG generates a much faster binary compared to GCC for both non-LTO and LTO. The version of CLANG was one year younger than GCC, but the difference is still remarkable. GCC generates smaller binaries, and LTO binaries are bigger than non-LTO binaries for both compilers.
Compilation takes double the time with LTO enabled for both compilers. Linking time is huge with LTO for both compilers, on GCC it’s 86 times slower, on CLANG 48 times slower.
This experiment has shown that, for ffmpeg at least, enabling LTO doesn’t bring expected speed ups.
Final Words
Original assumption was that the turning on LTO will be followed with an improvements in speed of a few percent, decrease in binary size of a few percent, and spending more time and needing more memory for compilation and linking.
On ProjectX, with a several year old toolchain that did happen. ffmpeg tells completely different story. What I assume to have happened is that ffmpeg has already been highly optimized for speed; the developers moved all the small functions into headers so they can be easily inlined by the compiler. Also ffmpeg is written in C which is much easier to keep low-overhead than it is the case with C++. to Most of the optimization potential has been used already and linker can’t do much more.
I think that for most projects that haven’t been aggressively optimized for speed, it makes sense to give LTO a try. I would expect to see the promises made by LTO related to speed and binary size fulfilled most of the time. But at the end, the ultimate indication if LTO should be enabled or not are performance numbers on your specific project.
Like what you are reading? Follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
Need help with software performance? Contact us!
Further Read
Optimizing real-world applications with GCC Link Time Optimization
LLVM Blog: ThinLTO: Scalable and Incremental LTO
Honza Hubička’s Blog: GCC 9: Link-time and inter-procedural optimization improvements
- Page faults happen when your program calls another function or tries to access another part of memory, but this part of memory hasn’t been loaded from the disk yet. In that case the operating system needs to load the missing part from the disk which results in program slowdown [↩]
- ProjectX is a made up name, since I am not allowed to disclose the details of the project [↩]
- Large software project, for example Chromium or Firefox have teams that are only in charge of tweaking system performance. [↩]
- LLD linker is a linker provided by LLVM; LLVM also provides CLANG. This linker understands CLANG intermediate representation out of the box and can link LTO enabled object files [↩]
I’ve seen link-time optimizations provide significant speed ups to code (with GCC at least) but it’s not universal. It depends on what the code is like, what it does and how it does it. I’ve not yet seen code run slower as a result of LTO, so there’s that.
Are there any drawbacks? Can this optimization make something wrong and how we can be aware of that?
In principle, these optimizations should not make code faster. Of course, there can be some bug in the compiler/linker that causes a performance regression, but in that case you should file a bug.
I wonder is there also a performance gap to GCC when compiling ProjectX with Clang?