Deep Dive in Java vs C++ Performance

Master software performance in just 16 hours!
Choose between the AVX/NEON Vectorization Workshop and the Software Optimization for the Memory Subsystem Workshop. Express your interest via the contact page.

For most of my career I lived in the world of C and C++, and I honestly believed that these languages are the pinnacle of software performance. But two months ago I started working at Azul, the maker of a low-latency Java compiler, and I had an opportunity to take a deep dive into Java performance. It turns out I had some serious misconceptions about it! In this article I am going to explore Java’s performance and compare it with C++, hoping that this will clear up some misunderstandings for you as it did for me.

Warmup

Java is a JIT-compiled language, meaning it doesn’t compile everything in advance. The Java compiler javac converts .java files to .class files containing bytecode. But this transformation is not full native compilation like in C++; rather, it’s a very simple operation that converts Java code to bytecode with minimal optimizations, with the goal of producing portable code that can run on any JVM.

The real optimizations happen at program runtime, and only on the hot methods. Initially, all the code is run through the interpreter. Interpretation is relatively cheap to start up, but it performs poorly compared to optimized native code. While the interpreted code is running, the Java VM collects profiling data, and “hot” methods get compiled by a lower-tier JIT compiler. This compiler doesn’t do expensive optimizations, because compilation time matters here. Only those methods that are really hot get compiled with the high-tier JIT compiler!1
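You can watch this tiering happen yourself with HotSpot’s standard -XX:+PrintCompilation flag. Below is a minimal sketch (the class and method names are my own) of a program with one obviously hot method; run it with `java -XX:+PrintCompilation Hot.java` and the log will show the method being compiled, first by a low tier and then by the high tier:

```java
// Sketch: a method that becomes "hot" and is eventually JIT-compiled.
// Run with `java -XX:+PrintCompilation Hot.java` to watch compilation
// tiers in the log (flag names are standard HotSpot options).
public class Hot {
    // A hot method: called thousands of times, so the JVM promotes it
    // from the interpreter to JIT-compiled code.
    static long sumOfSquares(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += (long) i * i;
        return s;
    }

    public static void main(String[] args) {
        long total = 0;
        // Enough calls for HotSpot's invocation counters to trigger
        // compilation (default thresholds are in the low thousands).
        for (int i = 0; i < 20_000; i++) total += sumOfSquares(100);
        System.out.println(total); // 6567000000
    }
}
```

The program itself is trivial; the interesting part is the compilation log on stderr, which differs between runs and JVMs.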

It is important to note that compilation happens on separate thread(s) and CPU core(s) while your Java program is running. Therefore, the system the program is running on needs to have CPU cores available for this – which is usually the case on server and desktop systems, but not on embedded systems.2

Because of this design, JIT-ting has two drawbacks:

  • To get to peak performance, the program needs to be running for some time. Otherwise, the VM doesn’t get a chance to profile and compile hot methods. Short-running programs may never reach the same performance as long-running ones.
  • Compilation is happening while the Java program is running; this background compilation consumes CPU and memory resources. This makes less resources available for the main work – a concern especially in resource-constrained environments (few cores, strict power/CPU limits).

JVM / compiler developers have introduced a few techniques to mitigate these problems. For example, OpenJDK VM3 allows you to precompile your code with Ahead-Of-Time compilation (AOT), which reduces or eliminates the need for JIT warmup. Azul’s compiler offers warmup cache and compile stashing, two features that allow your Java program to immediately run optimized code from the previous run.

Verdict

C++ wins this round, but only by a small margin! Although the out-of-the-box experience with short-lived workloads is typically better with C++ than with Java, a modern, properly configured JVM can certainly compete with C++ even for short-lived workloads.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us

You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

 

Memory Management and Garbage Collection

One important thing that separates C++ from Java is memory management. In C++, memory is released at predefined points in the program (RAII, delete, destructors). In contrast, Java uses the mechanism of garbage collection, which automatically frees memory when objects become unreachable.

How does memory allocation work?

For memory allocation there are two paths: fast and slow. Most allocations happen on the fast path, and it is very fast:

  • Java Virtual Machine (JVM) assigns each thread a Thread-Local Allocation Buffer (TLAB).
  • Each allocation is done simply by bumping a pointer inside the TLAB. Conceptually:
    old_top = tlab.top
    new_top = old_top + alloc_size
    if (new_top <= tlab.end) {
        tlab.top = new_top
        return old_top
    }
  • Only when the thread-local allocation buffer is exhausted is the slow path executed.

This bump-pointer allocation style is incredibly cheap. C++ could use per-thread arenas to achieve something similar (and some allocators do), but this is not the default, because C++ allocators need to support reuse of earlier released memory (Java’s don’t).
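Purely as an illustration, the fast path described above can be sketched in plain Java (the TlabSketch class is hypothetical; a real TLAB lives inside the JVM and hands out actual memory addresses, not offsets):

```java
// Conceptual sketch of TLAB-style bump-pointer allocation,
// modelled on the pseudocode above. Not a real allocator:
// offsets stand in for addresses.
public class TlabSketch {
    static final int TLAB_SIZE = 1024;
    int top = 0;                  // current bump pointer (offset into the buffer)
    final int end = TLAB_SIZE;    // end of this thread's buffer

    // Fast path: bump the pointer. Returns -1 to signal the slow path.
    int allocate(int size) {
        int oldTop = top;
        int newTop = oldTop + size;
        if (newTop <= end) {      // fits in the TLAB: fast path
            top = newTop;
            return oldTop;        // "address" of the new object
        }
        return -1;                // TLAB exhausted: a real JVM takes the slow path
    }

    public static void main(String[] args) {
        TlabSketch tlab = new TlabSketch();
        System.out.println(tlab.allocate(16));  // 0
        System.out.println(tlab.allocate(32));  // 16
    }
}
```

Note that there is no corresponding free operation: the whole buffer is reclaimed by the GC, which is exactly why the fast path can be this simple.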

Another important point: in Java, besides freeing unused memory, the garbage collector also compacts memory. This significantly reduces the problem of memory fragmentation. In contrast, a C++ heap can get very fragmented in a long-running system that constantly allocates and frees memory.

Large object allocations (above the TLAB size or in the “humongous” category) go through a slower path, but are still comparable in performance to C++’s large-block allocation.

How does garbage collection work?

In Java you don’t explicitly release memory; the garbage collector does that. But the GC does more than just free memory – in many implementations it reorganizes memory for smaller memory consumption and better performance.

The GC typically activates either periodically, or when the JVM needs more memory and detects there is memory to reclaim. The process of garbage collection consists of two phases:

The mark phase

Garbage collection usually starts with a mark phase:

  • Start from root references (globals, static fields, thread stacks, registers)
  • Follow all pointers recursively, marking all objects as alive
  • Any object not reached is safe to garbage collect
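The three steps above amount to a graph traversal, which can be sketched in a few lines of Java (the Obj class and the object graph are made up for illustration; a real GC walks JVM-internal data structures, not user objects):

```java
import java.util.*;

// Toy mark phase: start from the roots, follow all references,
// and mark every reachable object as alive.
public class MarkSketch {
    static class Obj {
        final String name;
        final List<Obj> refs = new ArrayList<>();  // outgoing references
        boolean marked = false;
        Obj(String name) { this.name = name; }
    }

    // Iterative traversal from the roots, marking live objects.
    static void mark(List<Obj> roots) {
        Deque<Obj> stack = new ArrayDeque<>(roots);
        while (!stack.isEmpty()) {
            Obj o = stack.pop();
            if (o.marked) continue;   // already visited
            o.marked = true;
            stack.addAll(o.refs);     // follow outgoing references
        }
    }

    public static void main(String[] args) {
        Obj a = new Obj("a"), b = new Obj("b"), dead = new Obj("dead");
        a.refs.add(b);                // a -> b; nothing references "dead"
        mark(List.of(a));             // "a" is the only root
        System.out.println(a.marked + " " + b.marked + " " + dead.marked);
        // true true false — "dead" is unmarked, so it is safe to collect
    }
}
```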

The sweep / compaction phase

After marking, the GC must reclaim memory. Depending on the implementation:

  • Non-moving / non-compacting implementations (rare in modern JVMs) only free dead objects
  • Compacting collectors move live objects together to eliminate fragmentation

Moving objects requires updating references. Most garbage collectors are Stop-the-World (STW), i.e. they pause all the threads to ensure no thread observes a partially moved object. Azul’s C4 is one of the few garbage collectors that can collect garbage concurrently4.


Why is compaction important?

When the GC moves objects, it places related objects close together in memory. This improves spatial locality, and Johnny’s fans know how important that is for software performance. This is something C++ doesn’t provide automatically – you have to manually implement arenas, custom allocators or object pools for the same behavior.

Verdict

The absolute advantage of this approach is that it essentially eliminates the problems caused by memory fragmentation:

  • slow allocation time due to heap fragmentation
  • “out-of-memory” errors despite free memory available
  • loss of spatial locality for long-lived objects

This is a huge advantage over C++ for long-running server applications that manipulate many small objects. It is also one of the reasons why Java is common in high-frequency trading systems: the key data structure, the order book, is often a tree, and trees are prone to memory fragmentation.

Applications that are not memory intensive – those that allocate a few large blocks of memory, or allocate everything upfront – generally don’t benefit from a garbage collector and often suffer from it. GC and compaction consume CPU cycles and memory bandwidth, which are scarce on embedded systems.

Execution Speed

The next important question about the quality of a Java program is its speed of execution. Undoubtedly, the performance of a Java program depends directly on the way it was compiled. Interpreted code is logically the slowest, followed by low-tier JIT-ted code and then high-tier JIT-ted code. Only if your code is compiled by the high-tier JIT can you expect it to rival a similar C++ program compiled with high optimization flags (e.g. -O2 or -O3).

If we compare C++ code compiled at a high optimization level with Java code compiled by the high-tier JIT compiler, similar code will result in assembly of similar quality!

However, there are a few important differences.

JVM Overhead

Java threads need to be occasionally stopped for garbage collection. This is typically achieved as follows: the VM and the Java threads synchronize using one or more dedicated memory pages. There can be one page for all the threads, or each thread can have its own page; the mechanism is the same. While Java threads are running, these pages are readable. When the JVM wishes to stop one or more threads, it removes read permissions from the relevant memory page(s). When a thread next touches its page, a page fault occurs. The JVM has installed a page-fault handler that notifies the VM that the thread has reached a safepoint and is effectively stopped.

This check costs only a single memory access, but it must be done periodically. In a tight loop with only a few instructions, the effect of this safepoint check can be measurable.

Another thing specific to concurrent JVMs (and Azul’s VM is such a JVM): on every reference load, the generated code needs to check whether the referenced object has been moved, and fix the reference if necessary. This also adds a small overhead that can be felt in tight loops.

Runtime Information Makes Java Optimizations More Precise

Java collects profiling data during program execution, and this information is fed to the high-tier JIT compiler during compilation. The Java compiler can generate more precise and more optimized code than a static C++ compiler – because it knows the actual runtime behavior. Several examples:

  • Devirtualization
    If a loop accesses pointers to a base class, but at runtime all instances are of the same derived class, Java can devirtualize the method call. It replaces a virtual call with a direct call after a short type check, making the call much cheaper.
  • Dead code elimination with deoptimization points
    The compiler can remove code that is never executed under current runtime conditions. Instead of assuming the removed code is unreachable forever, the compiler inserts deoptimization points. If assumptions later turn out wrong (e.g. new types appear), the JVM can revert to the interpreter or recompile the method.
  • Constant folding of global variables
    The compiler doesn’t need to load global constants initialized at program startup – it knows their values from profiling or class initialization data and can emit instructions using immediate values directly.
  • Loop specialization (unswitching based on runtime constants)
    In C++, loop unswitching must be conservative to avoid code size blow-up. In Java, the compiler knows actual unchangeable (invariant) values at runtime and can emit specialized loop instances only for the observed values, avoiding unnecessary branches or redundant loops.

Deoptimization Checks

Real-world code can of course break some assumption that the JIT compiler has made, which renders some optimizations invalid. To handle this, the JIT compiler inserts deoptimization checks.

  • If a deoptimization check fails, the JVM reverts to less optimized code that doesn’t rely on the failed assumption, and resumes collecting profiling data.
  • This can be seen as a temporary drop in performance.
  • On subsequent recompilations, the compiler can re-emit optimized code, omitting the optimizations that caused deoptimization.

This mechanism allows Java to perform aggressive, speculative optimizations while maintaining correctness.

Pointer Aliasing

In Java, if two references are different they are guaranteed not to alias one another.

  • So, a C-style construct like int* a = malloc(…); int* b = a + 1; – two pointers into overlapping memory – is impossible in Java, because Java doesn’t allow pointer arithmetic.
  • Pointer aliasing can break many optimizations, notably vectorization
  • Because Java simplifies pointer-aliasing analysis, the compiler can optimize more aggressively compared to C++.

In C++, pointer aliasing can be controlled using the restrict qualifier (a C feature available in C++ only as a compiler extension, e.g. __restrict) and careful coding, but Java achieves this out of the box, without programmer intervention.


Vector of Objects

Java does not have a “vector of objects” in the same sense as C++ (contiguous memory storage of objects); it only has arrays of object references and arrays of primitive types.

  • From a hardware efficiency standpoint, a true vector of objects is the best.
  • However, because Java’s garbage collector compacts and relocates objects, logically neighboring objects will very often end up neighbors in memory as well.
  • As a result, access patterns over arrays of objects in Java have similar spatial locality and performance to a vector of objects in C++.
  • Everything we said here applies to compacting collectors; non-compacting collectors don’t have these advantages.
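To make the distinction concrete, here is a small sketch (the Point class is my own) contrasting Java’s two kinds of arrays. Only the primitive array stores its values contiguously; the object array stores references, and it is the allocator and the compacting GC that tend to keep the referenced objects next to each other:

```java
// Contrast: a primitive array is contiguous storage of values
// (like a C++ vector of values), while an object array holds
// references whose targets the GC may relocate.
public class ArrayLayout {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    public static void main(String[] args) {
        int[] xs = new int[4];       // contiguous ints: best locality
        Point[] ps = new Point[4];   // contiguous references only
        for (int i = 0; i < 4; i++) {
            xs[i] = i;
            ps[i] = new Point(i, i); // allocated back-to-back in the TLAB,
                                     // so initially neighbors in memory too
        }
        long sum = 0;
        for (Point p : ps) sum += p.x; // one extra indirection per element
        System.out.println(sum + xs[3]); // 9
    }
}
```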

Java Can Emit More Efficient Instructions

Since the Java compiler knows the system it is running on, it can inspect which instructions are available and emit the most efficient ones.

A C++ library compiled with AVX2 will only use AVX2 even if AVX512 is available, unless you ship multiple binaries or use manual runtime dispatching. In contrast, the Java JIT compiler can detect the presence of AVX512 and compile code specifically for that architecture. This behavior comes out of the box, no setup required!

Memory Layout Optimizations – Not There Yet

Java’s object model is both its biggest limitation and its biggest untapped source of optimizations.

A Java class only stores references to its members. This means the JVM and the garbage collector are, in principle, free to place the actual objects anywhere in memory. Nothing in the Java language says that the fields a, b and c inside class X { A a; B b; C c; } must sit together, or even near each other. The JVM just stores three references – the real objects can be arranged however the runtime wants.

Now imagine the JVM had profiling data telling it that your hot loop only touches a and b, but not c. In theory, a compacting GC could place X, A and B tightly together for maximum cache efficiency, and push C somewhere else where it won’t pollute the cache. That’s essentially the automatic hot/cold splitting that we covered for C++ here, but this time powered by runtime data. Doing this automatically in C++ would be close to impossible.

Unfortunately, no JVM today performs this kind of memory-layout optimization. But Java is one of the few languages that is both fast enough and has a garbage collector – a combination where such a feature makes sense and could be implemented.

Verdict

When both languages are running fully optimized code, C++ and Java are surprisingly close to each other. Sometimes C++ wins, sometimes Java wins – it all depends on the workload. But don’t forget that this comparison applies only to hot, JIT-optimized Java code. In real applications, Java spends a noticeable amount of time running unoptimized code, so the everyday user experience easily gives the impression that “Java is slower”, even though the optimized steady-state performance tells a different story.

Latency

Latency is defined as the time between the moment a task is submitted to the program for processing and the moment the result is delivered. For a Java program in steady state, running mostly optimized JIT code, you can get latency comparable to C++. But in practice, Java latency is more variable:

  • Java runs a mix of optimized and unoptimized code. C++ always executes fully optimized machine code. Java, on the other hand, may execute interpreted code, low-tier JIT-ted code, or high-tier JIT-ted code, depending on the warm-up state. In some corner cases (cold paths, rarely used GUI actions, background callbacks), the JVM will still be running interpreted or low-tier code – which produces higher latency and more jitter. Anyone who has used a large Java GUI app knows the feeling: click an option you haven’t used before, and a first-time delay appears because the JVM must execute cold code and possibly compile it.
  • Garbage collection introduces latency variation. When the GC runs, latency spikes. Many JVMs still introduce “stop-the-world” pauses to update references or compact memory. This problem, however, has been partially solved, as there are garbage collectors that collect memory in the background and stop the program only minimally!

Verdict

C++ latency is generally more predictable: it runs fully optimized code from the start and uses fewer runtime resources. Java can, in specific scenarios, achieve better latency – for example, when C++ applications suffer from severe memory fragmentation, or when Java’s compacting GC provides more cache-friendly memory layout. And the fact that Java is used in high-frequency trading proves that it can deliver extremely low latency when tuned correctly.

But overall, C++ wins this round for predictability and consistency of latency.

Memory and CPU Overhead

This is the place where Java is weakest. The JVM has its own bytecode instruction set, and the JIT compiler must profile, optimize, and compile code at runtime. This consumes CPU cycles and power, and the runtime infrastructure (profiling data, metadata, code cache, GC structures) can take a non-negligible amount of memory.

The process of garbage collection also consumes memory bandwidth, especially during the marking and compaction phases. This can delay other programs on the same system, because memory bandwidth is a shared resource; the other programs must wait longer for data to arrive from memory.

Verdict

It is clear that C++ dominates here. Java is much heavier on system resources, and for this reason it is rarely used on low-end embedded systems with tight CPU and memory budgets. (Java is used on higher-end embedded Linux systems like Android, smart TVs, etc., but not on microcontroller-class devices.)

Final Words

To summarize:

| Area | C++ | Java |
|---|---|---|
| Startup Time | Instant | Needs warm-up |
| Memory Usage | Low | High (JVM + metadata + JIT overhead) |
| Latency | Predictable, no GC pauses | Less predictable; depends heavily on the JVM implementation5 |
| Optimization Availability | Very good | Excellent, thanks to runtime profiling, speculative optimizations, and deoptimization |
| Memory Layout | Object layout is fixed and never moved automatically | GC compacts and defragments memory, preserving locality in long-running programs |
| Platform Instructions | Fixed at compile time | Detects and uses the best available instructions at runtime |
| Embedded Suitability | Excellent | Usually unsuitable for low-end embedded devices (high-end embedded/Android is fine) |
| Server Suitability | Excellent | Excellent |
| Possible Worst-Case Performance | Rare; typically due to complex code causing register spills or failed optimizations | Same as C++, plus potential issues from running cold code or hitting optimization/deoptimization cycles |

  1. The exact compilation tiers and behavior depend on the JVM implementation and configuration
  2. Note that many JVM providers offer compilation in the cloud, so the price of compilation doesn’t need to be paid on the system that is running the Java program
  3. OpenJDK is the open-source reference implementation of the JVM
  4. Or mostly concurrently, depending on the definition. Azul’s C4 needs to pause all the threads, but the pause is very short and doesn’t depend on the heap size
  5. Concurrent GC implementations achieve far better latency than standard implementations
