Sticky bits, traps, compiler flags, and the optimizer trade-offs people rarely talk about.
In this post we are going to investigate how to efficiently handle floating-point errors in software.
The first part gives an introduction to the different ways of detecting floating-point errors – from simply inspecting the result (checking for infinities or NaNs), through relying on the sticky exception flags in the floating-point status register, to enabling hardware traps.
In the second part, we apply these techniques to a small example to see how they actually behave and what kind of performance overhead they introduce.
If you are an experienced engineer, you might want to skip the parts that are already familiar. For everyone else, I’ll try to explain everything you need to understand the context of floating-point errors.
The Beautiful Theory
What are floating-point errors?
Floating-point formats do not represent only finite numbers. They also have representations for ±infinity and NaN (not a number).
For example:
// Produce Infinity
float a0 = 25.0f / 0.0f;  // Results in +Infinity
float a1 = a0 + 4.0f;     // Infinity propagated to a1
float inf = std::numeric_limits<float>::infinity();

// Produce NaN
float b0 = std::sqrt(-1.0f); // Results in NaN
float b1 = b0 * 1.5f;        // NaN propagated to b1
float b2 = 0.0f * inf;       // 0 * Infinity results in NaN
In this article we will treat infinities and NaNs as “true errors”, because the result is no longer a usable finite value and will typically poison the rest of the computation.
Apart from these, there are also exceptions related to rounding and precision loss. The full list of IEEE-754 floating-point exceptions (from <fenv.h>/<cfenv>) is:
- FE_DIVBYZERO – Raised when a non-zero finite number is divided by zero, producing ±infinity.
- FE_INEXACT – Raised when a result cannot be represented exactly and had to be rounded. (This exception happens very often.)
- FE_INVALID – Raised when an operation has no mathematically defined result (e.g. 0/0, ∞−∞, sqrt(−1)).
- FE_OVERFLOW – Raised when a finite result is too large to be represented and overflows to ±infinity.
- FE_UNDERFLOW – Raised when a non-zero result is too small to be represented normally and is rounded to a subnormal or zero.
By default, these exceptions do not stop program execution. They merely set sticky flags in the floating-point status register.
How are floating-point errors signaled to your program?
In some cases you want to be notified when a floating-point error occurs. For example, you might want to know when an operation produces a NaN, overflows to infinity, or triggers some other exception. There are a few ways to achieve this. We list them here and compare them later.
Software Signalling – Check the Result
One straightforward approach is to check the result directly.
C++ offers std::isnan and std::isinf, which you can use to test whether the result of an operation is NaN or infinity. For example:
uint64_t total_nans = 0;
for (size_t i = 0; i < n; i++) {
double r = std::sqrt(in[i]);
out[i] = r;
total_nans += std::isnan(r);
}

The disadvantage of this approach is that every result is explicitly tested. A performance-conscious developer immediately sees the potential issue: additional work is executed in every iteration of the loop. Even if the check itself is cheap, it is still extra logic in the hot path.
Also note that this approach only detects NaNs or infinities – it does not detect exceptions such as inexact or underflow directly.
Hardware Signalling – Sticky Bits
Some errors cannot be detected by simply observing the result. For example, the result of 1.0 / 3.0 cannot be represented in finite precision – it must be rounded. In this case precision is lost, and the floating-point unit signals this by setting the FE_INEXACT flag.
When an exception occurs, the floating-point hardware sets a corresponding bit in the CPU’s status register¹. This bit remains set until we explicitly clear it – the flag is sticky.
We will omit the hardware details here, but the standard way to test whether an exception occurred is to use fetestexcept and feclearexcept, for example:
// Clear all exception flags
feclearexcept(FE_ALL_EXCEPT);
// Floating-point work here
// Test whether an exception occurred
if (fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW)) {
// Handle error
}

The typical workflow is simple: clear the exception flags before starting the computation; once the computation is finished, test whether any relevant exception was raised and handle it accordingly.
Note that observing floating-point exceptions may require strict floating-point semantics; otherwise the compiler is allowed to optimize under the assumption that exceptions are not observed. We talk about this later.
Hardware Signalling – Traps
Another way to signal floating-point errors is to use hardware traps (similar in spirit to faults like dereferencing an invalid memory address).
By default, floating-point exceptions are masked. When masked, the hardware does not stop execution; it simply produces NaNs or infinities and sets the corresponding sticky flag.
If you unmask a specific floating-point exception, the CPU will generate a synchronous trap when that exception occurs. The operating system then converts this into a SIGFPE signal delivered to the process. The default behavior is to terminate the program, but this can be changed by installing a custom signal handler.
On GNU/Linux systems you can enable a specific exception like this:
#include <cfenv>

// Enable trap on invalid operations
feenableexcept(FE_INVALID);
(Note: feenableexcept is a GNU extension and not part of the C++ standard.)
In the example above, when an invalid floating-point operation occurs (typically producing a NaN), the system sends SIGFPE to the process. Without a custom handler, the program terminates.
If you expect some exceptions and want to react programmatically, you can install a signal handler:
void fpe_handler(int sig, siginfo_t *info, void *ucontext) {
// your SIGFPE handler
}
struct sigaction act{};
struct sigaction oldact{};
act.sa_sigaction = fpe_handler;
act.sa_flags = SA_SIGINFO;
sigemptyset(&act.sa_mask);
if (sigaction(SIGFPE, &act, &oldact) != 0) {
    std::cout << "Error installing signal handler\n";
}

Once the handler is installed, you must decide what to do:
- Log diagnostic information and terminate.
- Attempt recovery (advanced and platform-specific).
- Stop the current computation but keep the program alive.
Needless to say, from a performance perspective, hardware traps with signal delivery are orders of magnitude slower than manual checks. They should only be used when the probability of an exception is extremely low.
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.
The Reality
All of the above is theory. If you read the previous section carefully, you should now have some intuition about what is possible and what it might cost.
But as we’ll soon see, there are many small details that can prevent you from using these techniques efficiently — or at all.
All source code used in this article is available in our repository.
The Baseline
To establish the baseline (no explicit FP error handling), we use a very simple loop:
size_t count_sqrt_nans_baseline(double* out, double * in, size_t n) {
for (size_t i = 0; i < n; i++) {
out[i] = std::sqrt(in[i]);
}
// The baseline version doesn't count anything
return 0;
}

The input array contains mostly positive values, but a few values are negative, which cause the square root operation to return NaN.
In the baseline case, floating-point exceptions may still occur (e.g. FE_INVALID), but we do not observe or handle them.
We are interested in:
- The performance of the baseline case.
- What happens when we add NaN handling – how does the performance of the loop change?
- If we change compiler flags, does the baseline performance degrade?
To ensure optimal performance, the baseline case is compiled with -fno-math-errno, which allows the compiler to vectorize sqrt.
We also wrote a manually vectorized version of the same loop using AVX intrinsics. In some cases, automatic optimizations may be inhibited by error handling, but with intrinsics we still have full control over vectorization.
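The repository version uses AVX; as a portable illustration of the same idea, here is a hypothetical SSE2 sketch of ours (SSE2 is baseline on x86-64, so it compiles without extra target flags). The NaN count exploits the fact that NaN is the only value that compares unordered with itself:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cmath>
#include <cstddef>

size_t sqrt_count_nans_sse2(double* out, const double* in, size_t n) {
    size_t total_nans = 0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d v = _mm_loadu_pd(in + i);
        __m128d r = _mm_sqrt_pd(v);           // sqrt of a negative lane yields NaN
        _mm_storeu_pd(out + i, r);
        // Unordered compare of r with itself flags exactly the NaN lanes
        int mask = _mm_movemask_pd(_mm_cmpunord_pd(r, r));
        total_nans += (mask & 1) + ((mask >> 1) & 1);
    }
    for (; i < n; i++) {                      // scalar tail for odd n
        double r = std::sqrt(in[i]);
        out[i] = r;
        total_nans += std::isnan(r);
    }
    return total_nans;
}
```

The AVX version in the repository follows the same structure, just four doubles per iteration instead of two.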
For the baseline measurements, we compile the code with just -O3 -fno-math-errno. This gives the compiler freedom to fully optimize the loop, including inlining and vectorizing sqrt, without worrying about setting errno for math errors. No strict floating-point or trapping flags are enabled, so NaNs and infinities propagate silently and exception flags may be set but are not observed. This setup provides a clean, high-performance baseline against which we can compare the cost of various error-detection techniques later.
Manual Error Handling
The simplest and most portable way to handle NaNs is just to check whether a NaN was produced. In standard C++, you can do this using std::isnan:
size_t calculate_square_roots_scalar1_count_nan(double* out, double * in, size_t n) {
uint64_t total_nans = 0;
for (size_t i = 0; i < n; i++) {
double r = std::sqrt(in[i]);
out[i] = r;
total_nans += std::isnan(r);
}
return total_nans;
}

This version works on Windows, Linux, and macOS. The downside is that each element is explicitly checked for NaN. How does this affect loop performance? GCC 15.2 produced the following results:
| Version | Runtime | Instructions | Cycles per Instruction |
|---|---|---|---|
| Baseline | 0.045 s | 128 M | 1.78 |
| Baseline Vectorized | 0.045 s | 128 M | 1.76 |
| Manual EH | 0.046 s | 179 M | 1.32 |
| Manual EH Vectorized | 0.047 s | 179 M | 1.32 |
For GCC, the compiler vectorizes both the baseline and manual error-handling versions. Despite a significant increase in instruction count in the manual EH version, the runtime barely changes. Why?
Most mainstream x86 CPUs have only one execution port capable of performing square roots and divisions – the operations most likely to produce NaNs. All other instructions related to error handling – comparisons and additions – execute on different ports. Adding them does not congest the critical path, so runtime increases very little.
For Clang, runtimes look similar, but the instruction count and cycles per instruction differ more dramatically:
| Version | Runtime | Instructions | Cycles per Instruction |
|---|---|---|---|
| Baseline | 0.046 s | 67 M | 3.39 |
| Baseline Vectorized | 0.044 s | 67 M | 3.40 |
| Manual EH | 0.044 s | 147 M | 1.53 |
| Manual EH Vectorized | 0.045 s | 167 M | 1.14 |
Again, we see the same story: although instruction counts vary significantly, the runtime impact is minimal. The reason is port contention — the sqrt/div port is the bottleneck, while additional instructions for NaN handling execute on other ports without affecting the critical path.
Verdict
Manual error checking looks promising in terms of performance. The instructions that actually generate NaNs and infinities (divisions and sqrt) execute on a single CPU port, leaving plenty of CPU resources available for error checking.
One could imagine that in loops using other CPU ports heavily – a mix of divisions, additions, multiplications, etc. – the slowdown might be more noticeable. But error checking typically adds just a couple of instructions per iteration. It does not reduce instruction-level parallelism, so the impact on performance should remain small.
The bigger problem is the compiler. Even simple error checks can easily break automatic vectorization. This didn’t happen in our earlier example because our checks were extremely basic. But consider a loop like this:
size_t j = 0;
for (size_t i = 0; i < n; i++) {
    double r = std::sqrt(in[i]);
    out[i] = r;
    if (std::isnan(r)) {
        fail[j] = i;
        j++;
    }
}

Here, the loop-carried dependency on j would prevent the compiler from vectorizing automatically. Of course, this loop can be vectorized manually – we cover exactly this topic in our two-day vectorization workshop. If you’re interested in learning more, check out our workshop here.
Sticky Bits
The second approach is to check the floating-point exception sticky bits. We don’t want to test the sticky bits in every iteration – that would obviously be inefficient. Instead, we check them every X iterations and, if an exception is detected, we reprocess the section of the dataset where the error occurred. Here’s an example implementation:
size_t count_sqrt_nans_sticky(double* out, double * in, size_t n) {
uint64_t total_nans = 0;
static constexpr size_t SECTION_SIZE = 32;
feclearexcept(FE_INVALID);
for (size_t ii = 0; ii < n; ii+=SECTION_SIZE) {
size_t i_end = std::min(ii + SECTION_SIZE, n);
for (size_t i = ii; i < i_end; i++) {
double r = std::sqrt(in[i]);
out[i] = r;
}
if (fetestexcept(FE_INVALID)) {
for (size_t i = ii; i < i_end; i++) {
if (in[i] < 0) {
total_nans++;
}
}
feclearexcept(FE_INVALID);
}
}
return total_nans;
}

Instead of processing the dataset element by element, we process it in sections – in this example, 32 doubles per section. After computing each section, we use fetestexcept(FE_INVALID) to check if an exception occurred. If so, we iterate over the section again to pinpoint which elements caused the exception. Finally, we clear the exception flags using feclearexcept(FE_INVALID) so they can be set again in the next section.
Here are the GCC measurements:
| Version | Runtime | Instructions | Cycles per Instruction |
|---|---|---|---|
| Baseline | 0.045 s | 128 M | 1.78 |
| Manual EH | 0.046 s | 179 M | 1.32 |
| Sticky Bits | 0.049 s | 166 M | 1.5 |
| Sticky Bits Vectorized | 0.048 s | 144 M | 1.71 |
Although the sticky bit version executes fewer instructions than manual error handling, it doesn’t make the code faster. In fact, it is slightly slower, and the slowdown is consistent. The reason is that this approach reduces available instruction-level parallelism: the inner loop that computes sqrt is short, and the second loop (to locate the errors) doesn’t use the processor’s division unit at all.
The Clang measurements show a more dramatic effect:
| Version | Runtime | Instructions | Cycles per Instruction |
|---|---|---|---|
| Baseline | 0.046 s | 67 M | 3.39 |
| Manual EH | 0.044 s | 147 M | 1.53 |
| Sticky Bits | 0.186 s | 390 M | 2.35 |
| Sticky Bits Vectorized | 0.049 s | 138 M | 1.8 |
On Clang, the compiler fails to vectorize the sticky bit version, resulting in a huge increase in instruction count. The manually vectorized version avoids this problem entirely.
We’ve covered the numbers, but we haven’t yet discussed implementation challenges. This approach has several issues with both implementation complexity and portability, which we’ll address next.
The Compiler Can Optimize Away Instructions Producing NaNs
For the sticky bit approach to work, the compiler must behave in a very specific way:
- Do not assume NaNs or infinities never occur.
- Do not optimize away any instruction that can generate NaN or infinity.
- Do not move floating-point instructions outside the boundaries set by feclearexcept and fetestexcept. Within these boundaries, instruction reordering is fine, but moving instructions outside can break sticky-bit detection.
The problem is that there are no compiler flags that enforce exactly this. If you pick the wrong combination, your code may either run slowly or fail to set the sticky flags correctly.
- For GCC and Clang, you must avoid optimizations that ignore NaNs or infinities. The biggest culprit is -ffast-math, which enables -ffinite-math-only. This, in turn, sets -fno-honor-infinities and -fno-honor-nans, which can prevent sticky bits from being set. Luckily, the default options, even with -O3, do not enable these aggressive flags.
- You might also consider -ftrapping-math, because optimizations that remove traps can also prevent sticky bits from being updated. For example, 0/0 could be optimized away to produce NaN, but if you want to check the sticky bits, the compiler must not eliminate it. Be aware that enabling trapping math is expensive: it prevents reordering of floating-point instructions, disables some vectorization, and limits other optimizations.
- On MSVC, there aren’t fine-grained flags for this. The default /fp:precise ensures NaNs and infinities are produced. But to fully control optimizations affecting sticky bits, you must use /fp:strict.
An important note: The C and C++ standards provide #pragma STDC FENV_ACCESS ON as a way to inform the compiler that the floating-point environment is observable and must be respected. In principle, this should prevent the compiler from reordering or eliminating floating-point operations that modify exception flags between calls to feclearexcept and fetestexcept. In practice, however, support for this pragma is inconsistent across major compilers, and even when recognized, it may significantly restrict optimizations such as instruction reordering and vectorization. As a result, relying on FENV_ACCESS does not reliably solve the portability or performance concerns of the sticky-bit approach.
Verdict
Sticky bits remain a promising approach. Their main advantage is that you don’t need to check every iteration of the loop. They can interfere with some compiler optimizations and may require careful choice of flags, but overall, this technique is low-cost and applicable to real-world code.
Hardware Traps: The sigsetjmp / siglongjmp Approach
Another idea is to rely on FP traps. If FP errors are rare, then handling them via traps should, in theory, be very efficient. The normal execution path would remain fast, and only exceptional cases would pay the cost.
We can install a signal handler that triggers when a floating-point trap occurs. When the handler runs, it must notify the body of the loop that a floating-point error happened. In C and C++, the conventional mechanism for this kind of non-local control transfer is sigsetjmp / siglongjmp.
A simplified version looks like this:
static sigjmp_buf fpe_env;
void fpe_handler(int sig, siginfo_t *info, void *ucontext)
{
(void)sig; (void)info; (void)ucontext;
// Clear sticky FP flags, otherwise we would immediately retrap
feclearexcept(FE_ALL_EXCEPT);
feenableexcept(FE_INVALID);
siglongjmp(fpe_env, 1);
}
#pragma STDC FENV_ACCESS ON
size_t calculate_square_roots_vec_count_nan_longjmp3(double* out, double* in, size_t n) {
uint64_t total_nans = 0;
// Install handler and enable FP exceptions
...
for (size_t i = 0; i < n; i++) {
if (sigsetjmp(fpe_env, 1) == 0) {
// Normal execution path
out[i] = std::sqrt(in[i]);
} else {
// Control arrives here after an FP exception
total_nans++;
}
}
// Restore original handler and FP state
...
}
#pragma STDC FENV_ACCESS OFF
Inside the hot loop, notice the call to sigsetjmp(fpe_env, 1). This function can be reached in two different ways:
- Normal control flow – it behaves like a regular function call and returns 0.
- Exceptional control flow – the FP handler calls siglongjmp, which does not return. Instead, execution resumes at the corresponding sigsetjmp, which now returns a nonzero value.
This mechanism effectively turns a hardware trap into a conditional branch inside the loop.
Unfortunately, inserting sigsetjmp into the hot loop completely destroys vectorization and introduces substantial overhead. The measurements for GCC make this clear:
| Version | Runtime | Instructions | Cycles per Instruction |
|---|---|---|---|
| Baseline | 0.045 s | 128 M | 1.78 |
| Manual EH | 0.046 s | 179 M | 1.32 |
| Sticky Bits | 0.049 s | 166 M | 1.5 |
| sigsetjmp/siglongjmp | 6.56 s | 2202 M | 1.97 |
The runtime explodes, and the instruction count increases dramatically. Even if we estimate performance purely from instruction count, the expected runtime would be roughly 0.7 – 0.8 seconds. The remaining overhead comes from the operating system: delivering the signal, switching to kernel mode, handling the interrupt, and returning back to user space. None of that work appears in the retired instruction count of the process, but it dominates the total runtime.
The other argument against this technique relates to compiler optimizations.
The Order of Instructions Matters
Many compiler optimizations rely on rearranging instructions. A classic example is loop unrolling combined with interleaving to improve instruction-level parallelism.
Consider a simple loop:
for (size_t i = 0; i < n; i++) {
out[i] = sqrt(in[i]) / in[i];
}An optimizing compiler might transform it into something like:
for (size_t i = 0; i < n; i += 4) {
    double r0 = sqrt(in[i]);
    double r1 = sqrt(in[i+1]);
    double r2 = sqrt(in[i+2]);
    double r3 = sqrt(in[i+3]);
    out[i]   = r0 / in[i];
    out[i+1] = r1 / in[i+1];
    out[i+2] = r2 / in[i+2];
    out[i+3] = r3 / in[i+3];
}

This version exposes more instruction-level parallelism and is easier to vectorize. The sqrt operations can overlap, and the divisions can be scheduled more efficiently.
However, this transformation changes the temporal order of operations.
Suppose in[i+2] == -1. The sqrt(in[i+2]) will trigger an FP trap (if traps are enabled). In the original scalar loop, iterations i and i+1 would have fully completed – including storing their results – before the exception occurs in iteration i+2.
In the unrolled version, the compiler may issue all four sqrt instructions before performing any of the divisions or stores. If the exception is raised during r2 = std::sqrt(in[i+2]), then the stores for i and i+1 may not have executed yet. From the perspective of precise exception handling, observable program behavior has changed.
This is why strict floating-point modes exist. On MSVC, /fp:strict enforces strict ordering and exception semantics. On Clang, -ffp-model=strict serves a similar purpose. GCC does not provide a single equivalent switch, but a combination such as:
-fno-fast-math -frounding-math -ftrapping-math -ffp-contract=off
moves toward strict IEEE semantics.
The problem is that these flags severely restrict optimization. They limit instruction reordering, reduce vectorization opportunities, and often eliminate many of the performance gains modern compilers can provide. Once precise exception ordering becomes observable, many of the compiler’s most powerful optimizations are no longer legal.
Verdict
The FP trap mechanism combined with sigsetjmp / siglongjmp is not suitable for this use case. Placing sigsetjmp inside the hot loop destroys vectorization and dramatically increases overhead. The performance penalty is not marginal — it is catastrophic.
This approach only makes sense when the goal is to abort the entire computation as soon as an error occurs. For example:
if (sigsetjmp(fpe_env, 1) == 0) {
for (size_t i = 0; i < n; i ++) {
out[i] = std::sqrt(in[i]);
}
} else {
std::cout << "Error detected, stopping...";
}

Here, the trap acts as a global fail-fast mechanism rather than a per-element error detector. In that scenario, the cost may be acceptable because the normal path remains clean and the exceptional path terminates immediately.
There are also serious semantic concerns. Unlike C++ exceptions, siglongjmp does not unwind the stack and does not invoke destructors. If objects with non-trivial lifetimes are in scope, this can easily lead to resource leaks or partially destroyed program state unless the code is written with extreme care.
Finally, portability is limited. sigsetjmp / siglongjmp and POSIX signal handling are not available on Windows, making this approach unsuitable for cross-platform code.
Throwing C++ Exceptions from an FP Trap
An obvious alternative to sigsetjmp / siglongjmp is to throw a C++ exception directly from the FP trap handler. Since C++ exceptions are designed to have near-zero overhead on the normal path, this might appear to solve the performance problem.
Conceptually, it would look like this:
void fpe_handler(int sig, siginfo_t *info, void *ucontext)
{
(void)sig; (void)info; (void)ucontext;
// Clear sticky FP flags, otherwise we would immediately retrap
feclearexcept(FE_ALL_EXCEPT);
feenableexcept(FE_INVALID);
throw fp_exception();
}

And inside the loop:
for (size_t i = 0; i < n; i++) {
try {
out[i] = std::sqrt(in[i]);
} catch (const fp_exception&) {
total_nans++;
}
}

This removes the explicit sigsetjmp call from the hot loop. In theory, the compiler would generate clean code for the normal execution path, with exception-handling machinery only activated when needed.
Unfortunately, reality is far more complicated.
First, throwing C++ exceptions from a signal handler is not reliably supported. There are reports online of people making it work with GCC or Clang, typically using flags such as:
-fnon-call-exceptions -fasynchronous-unwind-tables
The first allows exceptions to be thrown from instructions other than explicit throw statements. The second enables unwinding from arbitrary instruction boundaries. However, in practice, getting this to work correctly and reliably is difficult, and we were not able to make it function in a robust way.
Even if it could be made to work, the fundamental optimization problem remains. Because a hardware FP trap can occur at any floating-point instruction, the compiler must assume that every such instruction may transfer control to a catch block. This forces extremely conservative code generation:
- Instruction reordering is heavily restricted
- Vectorization is effectively disabled
- Many optimizations that rely on speculative execution become illegal
In other words, even though C++ exceptions are “zero cost” in the common case, asynchronous hardware-triggered exceptions are not. The compiler must generate code as if every FP instruction were a potential control-flow boundary.
Since we were unable to make this approach work reliably, we do not provide performance numbers. However, given the required compiler flags and the implied restrictions on optimization, it is highly unlikely to outperform the simpler alternatives.
Conclusion
The original motivation for this exploration was Java. In Java, dereferencing a null reference raises a NullPointerException. If you inspect the generated assembly, you will not see an explicit null check. The code simply performs the dereference, and if the pointer is null, the hardware fault is intercepted by a signal handler installed by the JVM, which then throws the appropriate exception.
In theory, we might try to replicate this pattern in C or C++.
Instead of writing:
status_t do_something(my_pointer_t * p) {
if (p == nullptr) {
return status_t::BADARG;
}
int arg0 = p->arg0;
...
}

we might attempt something like:
// Install signal handler that throws nullexception
status_t do_something(my_pointer_t * p) {
    int arg0;
    try {
        arg0 = p->arg0;
    } catch (const nullexception& e) {
        return status_t::BADARG;
    }
    ...
}

In such a design, there would be no explicit null check. We would rely on a hardware fault to trigger an exception.
Unfortunately, in C and C++, this approach is neither reliable nor portable. While there are scattered reports of making it work with specific compiler flags and configurations, our experiments were unable to reproduce a robust solution.
More importantly, the fundamental issue is architectural. Hardware exceptions are at odds with modern compiler optimization. Once you allow an exception to be raised by any instruction, the compiler must treat many instructions as potential control-flow boundaries. Reordering becomes dangerous. Vectorization becomes illegal. Speculative execution becomes constrained. The optimizer loses freedom – and performance suffers.
If you try to keep optimizations enabled while relying on hardware exceptions, you quickly enter undefined-behavior territory. That is not a place where high-performance systems code should live.
The practical conclusion is simple: avoid using hardware traps as a control-flow mechanism in C and C++. They appear elegant, but they are fragile, non-portable, and hostile to optimization.
On modern hardware, explicit error checks are extremely cheap – often effectively free after optimization. They compose well with the optimizer. They are predictable. They survive compiler upgrades and flag changes.
The small theoretical win of removing a branch is not worth the complexity, performance instability, and portability risks introduced by traps, signals, and special compiler modes.
In C and C++, the boring solution – explicit checks – is the one that works the best for most people.
- On x86-64 this is the MXCSR register for SSE/AVX instructions (and the x87 status word for legacy x87 instructions). On AArch64, the equivalent register is FPSR (Floating-Point Status Register). [↩]