vectorization Archives - Johnny's Software Lab

Floating-Point Error Handling in C++: What Actually Works

January 31, 2026 (updated February 18, 2026) | Ivica Bogosavljević | C++ Performance, Help the Compiler, Performance

Floating-point errors are unavoidable, but how you detect and handle them can make the difference between clean, high-performance C++ code and a debugging nightmare. In this article, we explore practical techniques for handling NaNs, infinities, and other FP errors, from manual checks to sticky bits and hardware traps, and show which approaches actually work without sabotaging performance.
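
To make the "sticky bits" approach concrete, here is a minimal sketch using the standard `<cfenv>` interface. It is not taken from the article: the `SumResult` struct and the `sum_of_sqrts` function are hypothetical names chosen for illustration, and the sketch assumes the compiler honors the floating-point environment (for example, it is not built with `-ffast-math`).

```cpp
#include <cfenv>
#include <cmath>
#include <vector>

// Hypothetical result type: the computed value plus an error indicator.
struct SumResult {
    double value;   // sum of the square roots
    bool invalid;   // true if any operation raised FE_INVALID
};

// Hypothetical helper: sums sqrt(x) over the input without checking each
// element. The FE_INVALID flag is "sticky": once an invalid operation
// (such as sqrt of a negative number) sets it, it stays set until cleared,
// so a single check after the loop is enough.
SumResult sum_of_sqrts(const std::vector<double>& data) {
    std::feclearexcept(FE_ALL_EXCEPT);  // start from a clean flag state

    double sum = 0.0;
    for (double x : data) {
        sum += std::sqrt(x);
    }

    const bool invalid = std::fetestexcept(FE_INVALID) != 0;
    return {sum, invalid};
}
```

Calling `sum_of_sqrts` on non-negative inputs should leave `invalid` false, while a negative input should set it and turn the sum into NaN. The hot loop itself contains no branches on error conditions, which is the appeal of this style of checking.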

The messy reality of SIMD (vector) functions

July 4, 2025 (updated September 7, 2025) | Ivica Bogosavljević | Performance, Toolchain and Performance, Vectorization

We’ve discussed SIMD and vectorization extensively on this blog, and it was only a matter of time before SIMD (or vector) functions came up. In this post, we explore what SIMD functions are, when they are useful, and how to declare and use them effectively. A SIMD function is a function that processes more than…

Performance Debugging with llvm-mca: Simulating the CPU!

January 31, 2025 | Ivica Bogosavljević | Performance, Performance Analysis Tools | 3 Replies

We debug our performance problem by simulating it with llvm-mca!

Speeding Up Convergence Loops. Or, on Vectorization and Precision Control

August 31, 2024 | Ivica Bogosavljević | Low Level Performance, Performance, Vectorization

In this post we investigate methods to speed up convergence loops – while loops that slowly converge to the correct result.

On Avoiding Register Spills in Vectorized Code with Many Constants

January 31, 2024 (updated February 7, 2024) | Ivica Bogosavljević | Low Level Performance, Performance, Vectorization | 3 Replies

How to avoid register spilling in vectorized code with many constants?

What is faster: vec.emplace_back(x) or vec[x] ?

October 24, 2022 | Ivica Bogosavljević | C++ Performance, Performance | 5 Replies

When we need to fill a std::vector with values and the size of the vector is known in advance, there are two possibilities: using emplace_back() or using operator[]. With emplace_back(), we should first reserve the necessary amount of space with reserve(). This avoids unnecessary reallocations as the vector grows and benefits performance. Alternatively, if we…
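
As a rough illustration of the two options named above, the following sketch fills a vector of known size both ways. The function names `fill_with_emplace_back` and `fill_with_index`, and the choice of storing `2 * i`, are assumptions made for this example only; the sketch shows the two patterns and makes no claim about which one is faster.

```cpp
#include <cstddef>
#include <vector>

// Option 1 (hypothetical helper): reserve capacity up front, then append.
// reserve() allocates once, so emplace_back() never has to reallocate.
std::vector<int> fill_with_emplace_back(std::size_t n) {
    std::vector<int> v;
    v.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        v.emplace_back(static_cast<int>(i) * 2);
    }
    return v;
}

// Option 2 (hypothetical helper): construct the vector with n
// value-initialized elements, then overwrite each one via operator[].
std::vector<int> fill_with_index(std::size_t n) {
    std::vector<int> v(n);
    for (std::size_t i = 0; i < n; ++i) {
        v[i] = static_cast<int>(i) * 2;
    }
    return v;
}
```

Both functions produce the same contents. The difference lies in the work done per element: the first updates the vector's size on every append, while the second pays for initializing all elements before overwriting them.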

When an instruction depends on the previous instruction depends on the previous instructions… : long instruction dependency chains and performance

September 24, 2022 (updated February 29, 2024) | Ivica Bogosavljević | Computational Performance, Low Level Performance, Performance

This post has a second part, in which the same problem is solved differently. In this post we investigate long dependency chains: when an instruction depends on the previous instruction, which depends on the previous instruction… We want to see how long dependency chains lower CPU performance, and we want to measure the effect of interleaving…
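
A common textbook illustration of a long dependency chain is a floating-point reduction, sketched below. This example is not taken from the post: the function names are hypothetical, and splitting the sum across two accumulators is just one possible way to interleave independent chains.

```cpp
#include <cstddef>
#include <vector>

// One long dependency chain: every addition needs the result of the
// previous one, so the additions cannot overlap in the CPU pipeline.
double sum_single_chain(const std::vector<double>& a) {
    double s = 0.0;
    for (double x : a) {
        s += x;
    }
    return s;
}

// Two interleaved chains: s0 and s1 do not depend on each other, so the
// CPU can work on both additions at the same time.
double sum_two_chains(const std::vector<double>& a) {
    double s0 = 0.0;
    double s1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < a.size(); i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < a.size()) {
        s0 += a[i];  // leftover element when the size is odd
    }
    return s0 + s1;
}
```

Because floating-point addition is not associative, the two functions can return slightly different results for general inputs. That is also why compilers usually do not perform this transformation on their own without flags that relax floating-point semantics.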

The memory subsystem from the viewpoint of software: how memory subsystem affects software performance 2/3

August 17, 2022 (updated February 3, 2023) | Ivica Bogosavljević | Low Level Performance, Memory Subsystem Performance, Performance | 2 Replies

We continue the investigation from the previous post, trying to measure how the memory subsystem affects software performance. We write small programs (kernels) to quantify the effects of cache line, memory latency, TLB cache, cache conflicts, vectorization and branch prediction.

Memory consumption, dataset size and performance: how does it all relate?

May 22, 2022 (updated May 31, 2025) | Ivica Bogosavljević | Low Level Performance, Memory Subsystem Performance, Performance | 2 Replies

We investigate how memory consumption, dataset size and software performance correlate…

Vectorization, dependencies and outer loop vectorization: if you can’t beat them, join them

March 13, 2022 (updated August 14, 2022) | Ivica Bogosavljević | Computational Performance, Low Level Performance | 4 Replies

As I already mentioned in earlier posts, vectorization is the holy grail of software optimizations: if your hot loop is efficiently vectorized, it is pretty much running at the fastest possible speed. So, it is definitely a goal worth pursuing, under two assumptions: (1) that your code has a hardware-friendly memory access pattern and (2) that…
