Latency-Sensitive Applications and the Memory Subsystem: Keeping the Data in the Cache

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.

This is the 17th post in the series about memory subsystem optimizations. You can also explore other posts in the same series here.

Quick Introduction to Latency Sensitive Applications

Most performance techniques we have covered so far were aimed at increasing throughput – the amount of data the system can process in a unit of time. By this measure, a system that processes 5 GB/s of data is faster than a system that processes 3 GB/s. For most applications whose performance we want to improve, increasing data throughput is exactly what we are after.

But there is a smaller group of applications that are latency sensitive – for them, the most important consideration is how much time passes between a request being made to the system and the system’s response. In these systems throughput still matters, but latency matters more.

There are generally two types of latency sensitive applications:

  • Real-time applications, where the response to an input must arrive within a specified time frame. Many automotive, industrial automation, and medical systems are like this.
  • Low-latency applications, where the response to an input must come as fast as possible. Prime examples are high-frequency trading systems and audio/video communication systems.

When it comes to memory performance in latency-sensitive environments, two things are important:

  • The relevant data should be kept in the fastest possible level of cache. This ensures fast access to it when it is needed.
  • No unnecessary hardware or software mechanisms related to memory management should kick in during the access to the data. Things like swapping, TLB cache misses or physical memory allocation introduce additional latency and slow down the response.

In today’s post we talk about mechanisms used to keep the relevant data in the fastest level of the data cache. The next post will cover the hardware and software memory-management mechanisms that slow down access to data.


Techniques

Software Cache Warming

The main problem of latency critical systems is that our critical data gets evicted from the cache because it is not accessed often enough. Consider the following code, which receives a packet, parses it, and sends a response after performing a lookup in a hash map:

std::unordered_map<int32_t, order> my_orders;

...

packet_t* p;
while(!exit) {
    p = get_packet();

    // If packet arrived
    if (p) {
        // Check if the identifier is known to us
        auto it = my_orders.find(p->id);
        if (it != my_orders.end()) {
            send_answer(p->origin, it->second);
        }
    }
}

The above while loop is a busy loop. When a packet arrives, it looks up its id in the hash map my_orders and sends the answer. The response time is latency critical, so there are no sleeps in the loop.

If packets don’t come too often, this loop doesn’t do anything useful most of the time. In this case the data of our hash map my_orders will probably get evicted from the cache, since it is not used.

The solution to this problem is to perform bogus lookups in the hash map instead of idle looping. This makes sure the data is not evicted from the data cache, so requests can be served faster. Here is an example of how to keep the data cache warm:

std::unordered_map<int32_t, order> my_orders;

...

packet_t* p;
int64_t total_random_found = 0;

while(!exit) {
    // If the packet header is available in the 
    // packet buffer, we stop cache warming because
    // we want to respond as fast as possible.
    if (packet_header_arrived()) {
        p = get_packet();

        // If packet arrived
        if (p) {
            // Check if the identifier is known to us
            auto it = my_orders.find(p->id);
            if (it != my_orders.end()) {
                send_answer(p->origin, it->second);
            }
        }
    } else {
        // Cache warming instead of idle looping
        auto random_id = get_random_id();
        auto it = my_orders.find(random_id);
        // We need to do something with the result, otherwise
        // the compiler may optimize the access away
        total_random_found += (it != my_orders.end());
    }
}

std::cout << "Total random found " << total_random_found << "\n";

In the above example, instead of idle looping, the code will look up random identifiers in the hash map to prevent data eviction.

You need to be careful with this kind of cache warming, because the identifiers you receive are not necessarily uniformly distributed. Imagine that packets come in bursts where most packets have the same identifier. In this case, cache warming as presented in the above code snippet would bring useless data into the cache, thereby evicting useful data. Therefore, instead of keeping the cache warm using random values, you would keep it warm using values that arrived in the recent past, as sketched below.
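
For illustration, here is one way this could look: keep a small ring buffer of identifiers that arrived recently and cycle through it in the idle branch instead of calling get_random_id(). The recent_ids structure, its name and the buffer size are our own illustrative choices, not part of the original example.

#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// A minimal sketch of warming with recently seen identifiers instead of
// random ones.
struct recent_ids {
    std::array<int32_t, 64> ids{};  // the last 64 identifiers that arrived
    std::size_t seen = 0;           // how many identifiers were recorded
    std::size_t next = 0;           // which entry to warm with next

    // Call this after serving a real packet, e.g. remember(p->id).
    void remember(int32_t id) { ids[seen++ % ids.size()] = id; }

    bool empty() const { return seen == 0; }

    // Returns the next identifier to warm with, cycling through the buffer.
    // Call only when !empty().
    int32_t pick() {
        std::size_t limit = std::min(seen, ids.size());
        return ids[next++ % limit];
    }
};

In the busy loop, the idle branch would then perform my_orders.find(recent.pick()) (when the buffer is not empty) instead of my_orders.find(get_random_id()).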


Hardware Cache Warming

Apart from software cache warming, where we access our data without a real need just to prevent it from being evicted from the cache, some CPUs offer hardware mechanisms that achieve the same goal. With hardware mechanisms, the application only needs to use an API to let the CPU know which part of memory is important, and the hardware takes care of the rest.

On Intel, the technology is called Cache Pseudo-Locking and it is implemented as part of Intel Resource Director Technology (RDT). Setting it up is not simple: the user first configures a portion of the cache that will be reserved for holding critical data; this portion is then exposed as a character device that can be mapped into the application using the mmap system call. More info about it is available here.
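
To give a rough idea of how an application consumes such a region on Linux, here is a sketch. It assumes a pseudo-locked resctrl group named newlock has already been set up by an administrator (e.g. by writing pseudo-locksetup to its mode file and a cache bitmask to its schemata file); the device name and size below are examples, so consult the kernel’s resctrl documentation for the exact procedure.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>

int main() {
    // The size must fit into the cache portion reserved for the
    // pseudo-locked region (an assumption for this sketch).
    const std::size_t size = 256 * 1024;

    // The pseudo-locked region appears as a character device named after
    // the resctrl group that was configured for it.
    int fd = open("/dev/pseudo_lock/newlock", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Latency-critical data placed into this mapping is backed by the
    // reserved cache lines and should not be evicted by other memory traffic.
    // ... place critical data structures here ...

    munmap(mem, size);
    close(fd);
    return 0;
}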

AMD has a similar technology called L3 Cache Range Reservation. It allows the user to reserve a physical address range that will be kept in the L3 cache. Unfortunately, AMD’s user manual only mentions this feature and doesn’t give examples of how to use it from user space.

Experiments

All experiments were executed on an Intel(R) Core(TM) i5-10210U. We disabled turbo boost to get more consistent measurements.

Software Cache Warming

For the experiment, we use an ordinary std::unordered_set<int> hash map whose access time is very important to us and which we want to minimize. We create an artificial scenario where this hash map is accessed rarely – once every million cycles. The source code is available here.

We perform the experiment with three different setups:

  • Regular hash map – the program accesses the hash map only when requested; apart from that, it busy waits.
  • Reload zero – when the program is not serving a critical access, it continuously looks up the same element with value 0.
  • Reload random – when the program is not serving a critical access, it looks up random values in the hash map (the three busy-loop variants are sketched below).
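
The busy-loop body differs between the three setups roughly as follows. This is a simplified sketch in the spirit of the earlier example; the enum and function names are ours, not taken from the linked benchmark code.

#include <cstdint>
#include <cstdlib>
#include <unordered_set>

enum class warming { regular, reload_zero, reload_random };

// One iteration of the busy loop when no critical request is pending.
void idle_iteration(const std::unordered_set<int>& data, warming mode,
                    int64_t& sink) {
    switch (mode) {
    case warming::regular:
        break;  // plain busy wait: the hash map is only touched on a request
    case warming::reload_zero:
        sink += data.count(0);           // always look up the same element
        break;
    case warming::reload_random:
        sink += data.count(std::rand()); // look up a random element
        break;
    }
}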

We perform the experiment on hash maps of various sizes. Here we report the average access time and standard deviation over 10,000 accesses to the hash map. We measure the latency on x86-64 using the rdtsc instruction, a low-cost instruction that reads the CPU’s time-stamp counter.
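
For illustration, measuring a single lookup could look roughly like this. This is a simplified sketch, not the exact harness from the linked source; fencing with _mm_lfence() is one common way to keep the timestamps from being reordered around the measured code.

#include <x86intrin.h>
#include <cstdint>
#include <unordered_set>

// Returns the latency of one lookup in cycles, as reported by rdtsc.
std::uint64_t measure_lookup_cycles(const std::unordered_set<int>& data,
                                    int key) {
    _mm_lfence();
    std::uint64_t start = __rdtsc();
    _mm_lfence();

    // The volatile result prevents the compiler from optimizing the lookup away.
    volatile bool found = (data.find(key) != data.end());
    (void)found;

    _mm_lfence();
    std::uint64_t end = __rdtsc();
    _mm_lfence();
    return end - start;
}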

We measure latency, i.e. the number of cycles between the moment the request is issued and the moment the result is returned. The hash map holds integers; we vary the number of entries. Here are the average latency and, in parentheses, the latency standard deviation, both in cycles, depending on the hash map size:

Number of Entries    Regular          Reload zero      Reload random
1 K                  226.1 (219.0)    213.3 (205.1)    132.5 (67.3)
4 K                  324.7 (296.3)    350.7 (331.3)    140.1 (95.4)
16 K                 396.8 (341.1)    389.1 (354.5)    208.7 (134.5)
64 K                 425.5 (376.1)    416.0 (360.6)    232.1 (152.6)
256 K                514.2 (451.5)    473.3 (480.6)    338.8 (317.6)
1 M                  599.8 (550.2)    615.1 (573.6)    466.3 (429.8)
4 M                  702.1 (647.0)    619.7 (649.2)    531.3 (508.3)
16 M                 756.7 (677.6)    668.8 (707.4)    543.2 (499.9)
64 M                 769.1 (702.3)    735.9 (734.2)    641.0 (774.4)

From the above table we can draw several conclusions:

  • Both the average latency and the latency deviation are smaller for the reload random version than for the other two versions. This means the system will respond more quickly on average, and its worst-case response time will also be better.
  • The advantage of the reload random version is biggest for a small to medium number of entries in the hash map. Very large hash maps don’t profit much from reloading random elements in the busy loop.

Although the table might suggest that the latency follows a normal distribution, this is not the case. The minimal latency for all accesses is around 40 cycles, and the maximum latency varies greatly, from 2,000 cycles to 100,000 cycles. The smaller maximums are probably related to data cache misses, and the larger ones to cases where the program was interrupted by the operating system or something similar. This highlights the importance of setting up the system properly, including pinning the program to a core (which we didn’t do) and increasing the priority of the program (which we also didn’t do). Nevertheless, the largest latencies probably didn’t happen too often, otherwise the averages would be much larger.


Summary

Latency-sensitive applications are applications where the latency between request and response is critical, and throughput (measured as GB of data processed per second) is only secondary. For such applications, it is important that the critical data is kept in the fastest level of cache. To achieve this, we described two techniques:

  • Software data warming, where we access the critical data when the program is idle to prevent it from being evicted from the data cache.
  • Hardware data warming, where we configure the hardware not to evict critical data from the data cache.

Software data warming is easier to implement and doesn’t require special permissions or hardware features, but it consumes more power and memory bandwidth. On the other end of the spectrum, hardware data warming uses fewer resources, but it is more complicated to set up, requires special access to the system, and cannot keep critical data in the fastest level of the data cache (it works only for the L2 or L3 cache).

