Make your programs run faster: avoid expensive instructions

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

This is the fourth post in the series related to low-level optimizations, the others being:

  1. Make your programs run faster by better using the data cache
  2. Make your programs run faster: avoid function calls
  3. How branches influence the performance of your code and what can you do about it?

Common to all CPUs, both old and new, high-end and budget, is that certain commonly used instructions take a long time to execute. We call these instructions expensive. Typically these are arithmetic instructions such as multiplication, division and modulo: multiplication typically takes around 5 cycles, while division and modulo typically take 30–50 cycles. So, a good idea is to avoid them or to minimize their number.

Which instructions are expensive and which are cheap depends a lot on the hardware. For example, floating-point instructions are not expensive on modern CPUs with hardware floating-point units, but they can be very expensive on embedded chips with a simple floating-point unit or no floating-point unit at all. Smaller, embedded processors typically suffer more from expensive instructions since they have fewer hardware resources.

Use bitwise operators instead of multiplication, division and modulo

I think this is mostly common knowledge, but multiplication, division and modulo (MDM further in the text) by a number which is a power of two can be done using shifts and bitwise AND. Here are the bitwise equivalents of the MDM operations:

x * 2^n      is the same as    x << n
x / 2^n      is the same as    x >> n
x mod 2^n    is the same as    x & (2^n - 1)
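
Note that these identities hold exactly for unsigned integers; for negative signed values, shifting and division round differently. A small self-contained check:

#include <cassert>
#include <cstdint>

int main() {
    // For unsigned integers, the identities hold exactly.
    uint32_t x = 100;
    assert(x * 8 == x << 3);        // multiplication by 2^3
    assert(x / 8 == x >> 3);        // division by 2^3
    assert(x % 8 == (x & (8 - 1))); // modulo by 2^3

    // For negative signed values they do not: division rounds toward
    // zero, while arithmetic right shift rounds toward negative infinity.
    int32_t y = -7;
    assert(y / 2 == -3);
    assert((y >> 1) == -4);
    return 0;
}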

Now, if these constants are known at compile time, the compiler will emit the bitwise instructions by itself. So, if you write:

template <typename T, int num>
T divide(T val) {
    return val / num;
}

int y = divide<int, 8>(x);

The compiler will not emit a division instruction; instead, it will emit a shift-right instruction (with a small fix-up, since val is signed and shifting alone rounds negative values the wrong way). You don’t need to write unreadable code for speed; the compiler will take care of everything.

There are, however, places where the compiler emits multiplication, division and modulo instructions that are not obvious at first sight. For example, to calculate the address of the i-th element of an array, the following formula is used:

my_class a[1000];

addr(a[i]) = addr(a) + i * sizeof(my_class)

If you are randomly accessing the array [1], then every access involves a multiplication. Depending on the hardware, this multiplication can be expensive. If sizeof(my_class) is a power of two, the compiler can optimize the multiplication into a left shift. If you want to make the size of the class a power of two, you can manually pad it with additional bytes, like this [2]:

class my_class {
    ...
    char padding_bytes[12]; // pad the class to a power-of-two size
};

// The size of the class needs to be a power of two; this verifies
// the assumption at compile time.
static_assert(sizeof(my_class) == 64);

One way to convert MDM operations into bitwise operations is to limit the input size to a power of two. Take for example a hash map that holds integers. Typically, the hash function used to calculate the entry in the hash map is entry = x % hash_map_size. If hash_map_size is a power of two, calculating entry becomes much cheaper: entry = x & (hash_map_size - 1). So, the implementation would look like this:

class hash_map {
    public:
        hash_map(int log_size) {
            m_log_size = log_size;
            m_size = 1 << log_size; // the size is always a power of two
            m_array = new int[m_size];
            m_count = 0;
        }

        bool find(int value) {
            // Cheap replacement for value % m_size, valid because
            // m_size is a power of two.
            int entry = value & (m_size - 1);
            return find_open_addressing(entry, value);
        }

    private:
        bool find_open_addressing(int entry, int value);

        int m_log_size;
        int m_size;
        int* m_array;
        int m_count;
};
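
For completeness, here is a minimal sketch of find_open_addressing, assuming linear probing and a sentinel value EMPTY for unoccupied slots (both are illustrative choices, not something the implementation above prescribes):

bool hash_map::find_open_addressing(int entry, int value) {
    static constexpr int EMPTY = -1; // assumed sentinel for empty slots
    for (int i = 0; i < m_size; i++) {
        int slot = (entry + i) & (m_size - 1); // wrap around without modulo
        if (m_array[slot] == value) return true;  // found it
        if (m_array[slot] == EMPTY) return false; // empty slot: not present
    }
    return false; // the whole table probed, value not present
}

Note how the wrap-around in the probing loop reuses the same power-of-two trick instead of a modulo.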

To test the speed difference, we implemented an open-addressing hash map whose size is always a power of two. We used working sets of several sizes and measured the lookup time. The results are in the table below:

Architecture / Working set size         Modulo implementation   Shifting implementation
MIPS32R2 JZ4780, working set 64 kB      8.014 s                 7.031 s
MIPS32R2 JZ4780, working set 32 MB      80.039 s                78.513 s
Intel i5-10210U, working set 64 kB      4.701 s                 2.133 s
Intel i5-10210U, working set 32 MB      8.736 s                 5.202 s
Intel i5-10210U, working set 256 MB     13.338 s                10.915 s

Modulo vs arithmetic shift performance on various processors and working set sizes

For a small working set (64 kB, which fits mostly in the L1 cache), the speed difference is noticeable, especially on the modern i5 CPU. As the working set grows, the difference becomes less prominent; of course, this has to do with the larger number of data cache misses for larger working sets.


Avoid unnecessary MDM operations

Sometimes, with a few clever tricks, you can avoid MDM operations altogether. Take for example the implementation of a ring buffer. Algorithms textbooks usually present the following formula to calculate the next position in a ring buffer:

next_place = (current_place + 1) % buffer_size;

For every insertion, this formula performs a modulo operation. The alternative is both more readable and faster:

next_place = current_place + 1;
if (next_place == buffer_size) {
    next_place = 0;
}

By using the rules of arithmetic, you can decrease the number of expensive operations. For example:

(1) a / c + b / c = (a + b) / c
(2) a / b + c / d = (a * d + b * c) / (b * d)

(3) log (a) + log (b) = log (a * b)
(4) log (a) - log (b) = log (a / b)

In the above examples, we use algebraic identities to decrease the number of expensive operations. In the case of (1), instead of two divisions we have only one. But beware! Whereas in mathematics a / c + b / c = (a + b) / c always holds, this is not the case for integers and floats, due to overflows and rounding errors. So use it at your own risk.
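
A small illustration of both pitfalls (the values are made up for the example):

#include <cstdint>
#include <iostream>

int main() {
    // Integer division truncates each term separately, so the two
    // forms can give different results:
    std::cout << 3 / 2 + 3 / 2 << "\n"; // prints 2
    std::cout << (3 + 3) / 2 << "\n";   // prints 3

    // The single-division form can also overflow:
    int32_t a = 2000000000, b = 2000000000;
    std::cout << a / 2 + b / 2 << "\n"; // fine: 2000000000
    // (a + b) / 2 would overflow a signed 32-bit integer,
    // which is undefined behavior.
    return 0;
}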

To verify this, we implemented a fast average algorithm that avoids divisions as much as possible. The idea is the following: an average is typically computed according to one of the following formulas:

(1) AVG = (a[0] + a[1] + ... + a[n - 1]) / n
(2) AVG = a[0] / n + a[1] / n + ... + a[n - 1] / n 

From the mathematical point of view, (1) and (2) are the same, but due to rounding errors we prefer version (2). However, version (2) performs n divisions. To decrease the number of divisions, we do partial summing before dividing: we sum elements of the array until the partial sum reaches a certain threshold; when the threshold is reached, we divide the partial sum by the number of elements of the array and add the quotient to the result.
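
Here is a sketch of the idea, assuming 32-bit inputs and a made-up threshold chosen so the running sum cannot overflow (the exact benchmarked code may differ):

#include <cstddef>
#include <cstdint>

double average(const int32_t* a, size_t n) {
    const int64_t THRESHOLD = 1LL << 40; // illustrative cut-off
    int64_t partial_sum = 0;
    double result = 0.0;
    for (size_t i = 0; i < n; i++) {
        partial_sum += a[i];
        if (partial_sum > THRESHOLD || partial_sum < -THRESHOLD) {
            // One division per chunk instead of one per element.
            result += static_cast<double>(partial_sum) / n;
            partial_sum = 0;
        }
    }
    result += static_cast<double>(partial_sum) / n; // the leftover
    return result;
}

Here are the numbers for the two versions of the algorithm: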

Configuration      Simple average   Optimal average
Intel i5-10210U    310 ms           251 ms
MIPS32R2 JZ4780    2695 ms          2169 ms

Simple vs optimal average on two different architectures

As you can see, the performance difference is around 20%, and it is roughly the same on both architectures.

The next trick concerns the implementation of the modulo operation itself. An optimized modulo looks like this:

int optimized_modulo(int a, int b) {
    return (a < b) ? a : (a % b);
}

In the above code, we perform the modulo operation only when it is actually needed. The added branch should be very cheap most of the time, so this is an idea worth investigating. Some compilers actually implement the modulo operator like this.

Use optimized versions of MDM operations provided by third-party libraries

There are libraries that do MDM operations better than the compiler under some circumstances. One such library is libdivide. According to its documentation, libdivide’s scalar code is up to 16 times faster for powers of 2 and up to 10 times faster for non-powers of 2, compared to naive hardware division, and the optimized vector code is a further 2-3 times faster. The performance you will see depends on many factors; the documentation gives a partial list of them. Of course, the best way to determine how libdivide affects your performance is to try it and measure!
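
A minimal sketch of how libdivide is typically used, based on the examples in its documentation (check the library for the exact API): the divider is constructed once, and the division inside the loop is replaced by a cheaper multiply-and-shift sequence.

#include <cstddef>
#include <cstdint>
#include "libdivide.h"

// Divide every element by the same runtime divisor. The expensive
// setup happens once, outside the loop.
void divide_all(int64_t* data, size_t n, int64_t divisor) {
    libdivide::divider<int64_t> fast_d(divisor);
    for (size_t i = 0; i < n; i++) {
        data[i] = data[i] / fast_d; // overloaded operator/, no DIV instruction
    }
}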

Use fixed-point arithmetic on CPUs with slow or no FPU

At some point in time, floating-point arithmetic became the standard way of doing arithmetic on non-integer numbers in CPUs. But floating point is not the only option: there is also fixed-point arithmetic. Fixed-point arithmetic is an excellent choice on CPUs without a floating-point unit, where the alternative is floating-point emulation, which is slower.

Fixed-point arithmetic uses the CPU’s integer resources, i.e. as far as the CPU is concerned, it executes integer instructions to process the numbers. The difference is that, logically, the value of a fixed-point number is scaled by an implicit factor. For example, the value 18.1 can be represented as 1810; in this case the scaling factor is 1/100. In practice, scaling factors are mostly powers of two for easier computation.

Addition and subtraction in fixed-point arithmetic are the same as with integers. Continuing the previous example, 18.1 + 2.0 in fixed-point arithmetic looks like 1810 + 200 = 2010. Other operations are a bit more involved, but not too much. Try figuring out how to do multiplication and division in fixed-point arithmetic by yourself!
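
If you want to check your answers, here is one possible sketch in the Q16.16 format (16 integer bits, 16 fractional bits; the helper names are made up for illustration):

#include <cstdint>

// Q16.16 fixed point: the implicit scaling factor is 2^16 = 65536.
typedef int32_t fix16;

fix16 from_double(double d) { return (fix16)(d * 65536.0); }
double to_double(fix16 f)   { return f / 65536.0; }

// Addition and subtraction are plain integer operations.
fix16 fix_add(fix16 a, fix16 b) { return a + b; }

// Multiplication: the product of two scaled values carries the
// factor 2^32, so shift right by 16 to return to 2^16. The 64-bit
// intermediate avoids overflow.
fix16 fix_mul(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a * b) >> 16);
}

// Division: scale the dividend up by 2^16 before dividing, again
// using a 64-bit intermediate.
fix16 fix_div(fix16 a, fix16 b) {
    return (fix16)(((int64_t)a << 16) / b);
}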

For fixed-point arithmetic, there is a library called libfixmath that implements many common operations. It is the way to go if you need fixed-point arithmetic in your project.

Software for small embedded chips is often written to use fixed-point arithmetic in order to save money on hardware. A partial list of software with fixed-point support can be found on Wikipedia.

Compile your program with FPU emulation

If your CPU doesn’t have an FPU but you need floating point, you can recompile your program with the -msoft-float option (on GCC). In that case, the compiler will not emit floating-point instructions; instead, floating-point operations will be emulated in software. This is still faster than trapping to the Linux kernel and emulating them there.
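
For example, with a GCC cross-toolchain for MIPS, the invocation might look like this (the toolchain name is illustrative and depends on your setup):

mips-linux-gnu-gcc -O2 -msoft-float -o average average.c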

I wanted to test the performance of compiler soft-float emulation vs kernel emulation, but unfortunately I didn’t have the necessary hardware. However, I did manage to test fixed-point arithmetic vs soft-float on MIPS, and here are the results:

             Hardware FP   Software FP   Fixed-point
Runtime (s)  2.8 s         25 s          10.1 s

Time needed to calculate the average of an array of numbers on the MIPS32R2 architecture

No further comment is necessary.

Final Words

First of all, it is important to optimize your data cache utilization before doing any other optimizations, including these. Otherwise, your program will run slowly, and optimizations related to expensive instructions will have little effect.

As you have seen, removing or decreasing the number of divisions and modulo operations (and also logarithms, square roots, powers, etc.) can often have a visible impact on performance. Fixed-point arithmetic is a viable option for CPUs without a floating-point unit. And in the worst case, compiling your program with soft-float support is a way to avoid the penalties of trapping to the kernel.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.

  1. If you are accessing the array sequentially, the compiler will perform an optimization (strength reduction) that removes the multiplication.
  2. Note: padding a class to avoid multiplication might not be a good idea as far as the data cache hit rate is concerned. When you pad a class, you are adding bytes that will not be used but which will nevertheless be loaded into the data cache. This can decrease the data cache hit rate and the speed of your program. On the other hand, when the class size is a power of two, you increase the likelihood that a single element of the class exactly fills one or more whole cache lines, which actually increases speed. So you need to measure.
