We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
All engineers at some point in their career have had to deal with software performance, i.e. making their programs run faster. Some aspects of performance are taught at universities, like algorithmic performance, but there is much more to it when it comes to actually making your program or your system run faster. Here I introduce some of the key aspects of software performance engineering.
To performance or not to performance
The first question is when you should start to think about performance. When the first line of code is written? Or at the end, when the product is already working but has performance problems?
The answer is not straightforward and depends on a few things. The first is what kind of software you are developing and how much data the program will have to process. If the program is a Word-to-PDF converter, even the largest files will have a few thousand pages, and most will have a few hundred or fewer. You should not expect performance problems in this area (although they sometimes happen). Following good development practices is the way to go, focusing your efforts on code readability, maintainability and portability.
But if you expect that your program will work on large data sets, or there is a latency requirement that the program must respond within a certain time frame, or the program may run on very slow computers, then performance should be considered from day one.
For example, a video game cannot ship if its frame rate is low. That’s why many game developers use a different programming paradigm, called data-oriented design, to achieve good performance. Programming in this paradigm is quite different from programming in the object-oriented paradigm, but when the result is a four-times-faster game, it pays off.
When performance is expected, you will need to consider a few things in advance: the general architecture of the application, to avoid unnecessary operations or operations with large latencies; quality algorithms and data structures; and coding guidelines aimed at avoiding performance problems. Performance should be on your mind while writing technical specifications and doing code reviews. Not every line is critical, but some types of performance problems can be very difficult or impossible to solve if they are not thought about in advance. (For example, when a web application sends too many small requests to the server, network latency becomes a real performance killer, and fixing this may require a large rewrite.)
Finding the bottleneck
When there is a performance problem, developers use tools called profilers to find where the problem occurs. The output of a profiler is a report that tells you in which functions or source code lines your program is spending the most time. The pursuit starts from here: these are the places where improving the code will most likely bring speed improvements.
The profiler will point you to code spending too much time, but the lines it spits out should be taken with a grain of salt. In a simple, single-threaded application, the function or the loop taking the most time is the obvious bottleneck. In multi-threaded or multi-process applications, this is not necessarily the case: the function that appears to be the bottleneck might actually be waiting for some other operation to complete, so the real bottleneck needs to be looked for elsewhere. There are special profilers for multi-threaded and multi-process applications; one that comes to mind is Coz.
Types of performance issues
When the bottleneck is found, it is not always clear what can be done about it. A function can be slow for many reasons:
- architectural (e.g. calling the function unnecessarily many times),
- a bad algorithm (e.g. linear search instead of binary search),
- inefficient use of operating system resources (e.g. locking and unlocking a mutex in a loop starves other threads),
- too heavy use of the system memory allocator (e.g. memory fragmentation),
- inefficient use of the standard library (e.g. not reserving enough space in a hash map causes expensive rehashing),
- inefficient use of the programming language (e.g. passing large classes by value instead of by reference in C++),
- inefficient use of the memory subsystem (e.g. too many pointer dereferences, a.k.a. pointer chasing),
- inefficient use of the CPU units (e.g. the hot loop not using the vector engine of the CPU),
- a misconfigured compiler (e.g. disabled inlining), etc.
Some of the types mentioned here are easier to discover than others. E.g. calling a function unnecessarily many times will be spotted by most engineers on the team, but suboptimal use of the memory subsystem will typically be spotted only by those who are deeply familiar with software performance.
Like what you are reading? Follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.
Need help with software performance? Contact us!
Peak performance
Sometimes the hot loop is already running at peak performance: it is using the hardware optimally and doing only the essential operations. Checking whether the hot loop is running at peak performance is another way of finding out whether there is room for improvement.
In scientific computing, the roofline model is used to measure how efficiently a loop uses the hardware resources. This information can be useful, but it has limits: very often, optimal hardware efficiency is impossible, and then the question arises of what the realistic peak actually is.
Another useful set of tools in the quest for peak performance are profilers based on hardware performance counters (the most famous being Intel’s VTune and pmu-tools). These tools help you understand what kind of hardware bottleneck a loop has (computation, memory accesses, branching, etc.), so you can direct your efforts in the right direction.
But beware! Hardware efficiency isn’t everything: linear search uses hardware resources much more efficiently than binary search, yet binary search is faster because it does less work!
Application performance vs system performance
An important aspect of performance engineering is a distinction between application performance and system performance.
When talking about application performance, we mean the performance of a program or a set of programs running in isolation (no other programs running in parallel). In this scenario, we typically observe the performance of the program using a profiler, and we fix the issue by modifying the program’s source code. The articles on this site mostly deal with application performance.
In contrast, when talking about system performance, we mean the performance of the whole system: all the different processes running together on specific hardware. A process might run fine on an unloaded system, but problems often appear when it runs together with other processes. In this case, the problems mostly appear because one of the hardware resources is depleted: CPU, memory bandwidth, disk bandwidth or network bandwidth. For example, if the computer runs out of physical memory, processes begin to swap memory to the disk, which shows up as a sharp decline in system performance.
System performance as a discipline is very important in the server world, where many different processes execute on the same hardware. The tools used to debug system performance are completely different from the tools used to debug application performance: various observability tools that measure CPU usage, anomalies in CPU execution, I/O subsystem usage, memory usage, etc. And the fixes are different as well: changing the configuration of the system, removing processes, or even adding a better cooler to the CPU can fix the problem.
One of the main differences between application and system performance problems is that application performance problems reproduce consistently, whereas system performance problems reproduce only under the right circumstances. System performance problems are generally more difficult to debug, but easier to fix.
Latency and throughput
When we say that “something is slow”, we can mean two things, depending on the context. We can either mean that our program doesn’t respond to its inputs in a timely manner, or that the rate at which our program processes data is unsatisfactory.
Taken naively, these two sentences might sound interchangeable, but at a deeper level they are different. Let’s illustrate with an example: an audio processing system must output processed audio within 20 ms of its arrival at the input. Here we are interested in optimizing latency: we want the system to respond to the input either within a time agreed in advance, or as soon as possible.
A second example would be a high-performance system training a neural network. Here, the response time is not critical; the training can take several hours or even days. We are interested in the processing rate: the amount of data processed per unit of time should be as large as possible. We want the task to finish as soon as possible, but responding within a given time is not our priority; it’s the raw speed that we are trying to increase.
In this case, we are optimizing for throughput: we are trying to process as much data as possible per unit of time.
Latency-sensitive systems are common in real-time systems (e.g. automotive or aviation systems), high-frequency trading (where the system has to react to data coming from the market as soon as possible) and video games. Throughput-sensitive systems are common elsewhere: video rendering, neural-network training, etc.
Latency-sensitive systems build on top of throughput optimization: after achieving maximum throughput, the whole system, including the hardware, operating system, network stack, etc., is additionally tuned for latency.
One more thing before closing this topic: latency and throughput go together up to a point, but then they part ways; you can’t have both. If you configure your system for low latency, some of your jobs will get preempted, and this decreases system throughput. Multithreading as a technique is also potentially problematic: parallelism introduces latency because of thread synchronization.
Most operating systems nowadays are configured for high throughput. Those who want to build systems optimized for low latency need a special configuration of the operating system (e.g. Low Latency Performance Tuning for Red Hat Enterprise Linux 7, Configuring and tuning HPE ProLiant Servers for low-latency applications).
If you are interested in latency related performance topics, I highly recommend Mark Dawson’s blog.
Final thoughts
Performance doesn’t exist in a vacuum. There are other considerations when designing software systems: maintainability, portability, readability, scalability, reliability, security, time to market, etc. Some of them go hand in hand with performance, others don’t. Each software project has its own specific needs, and performance is one part of the puzzle. Sometimes it is a very important part, other times not so much. So each software team will need to decide how much effort it is willing to spend on performance.
Very informative post, thanks a lot for sharing this!
In my experience, there are few profilers that get the point across. Here is yet another way I tried to explain it to my programming team:
On the subject of performance, along with what you’re doing, listen to an old-timer (me). 🙂
IF THERE’S SOMETHING YOU CAN FIX, and IF IT’S WORTH FIXING (i.e. IT ACCOUNTS FOR MORE THAN A FEW % OF TIME), then
IF you just randomly halt it
THE CHANCE YOU WILL CATCH IT IN THE ACT IS JUST THAT PERCENT.
So, if you randomly halt it 10-20 times, it will have shown you exactly what you need to fix.
AND, if you think there’s only one thing to fix, think again. Repeat the whole process until you can’t any more. The speedups multiply.
… that’s this process: https://www.youtube.com/watch?v=xPg3sRpdW1U
The reason PML is as fast as it is, is that this was done repeatedly. For example, at one stage the random halts showed this stack a good percentage (around 60%) of the time:
Layer
… … layers above
13 ESTBLUPS
14 FCNETA
15 LLOneSubject
16 Interp
17 Advan
18 dverk
19 Deriv
20 exp( x ) or log( x )
… … layers inside library function …
SO, which layer is responsible for that 60% of time? Answer: THEY ALL ARE. But the question is, WHAT CAN YOU DO ABOUT IT?
In every layer, it spends all that time calling the layer below. So, unless you can do fewer calls at any layer, there’s nothing you can do about it.
EXCEPT for one. Layer 20. Often the argument x being sent to exp() or log() has not changed. That suggests something – memoization.
If the argument to a function has not changed, then you don’t need to call it again. You can just return the value it had last time.
So I did this. I put in functions exp_cached() and log_cached() that keep their prior argument and value between calls.
That saved about 50% – DOUBLE THE SPEED.
Note: You can’t expect the compiler’s optimizer to figure this out. YOU have to do it.
Sorry if this explanation is lengthy. I was a professor of CS, and I can say with some authority that in general CS professors have never had to work much with serious software, so they don’t know this technique. And since they don’t know it, they don’t tell their students about it, who go on to write (but not seriously use) the profilers of the world. Of course, they are experts on big-O, which is basic knowledge, but is not enough.