We at Johnny's Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
This is the 18th post in the series about memory subsystem optimizations. You can also explore other posts in the same series here.
For people who are interested in low-latency systems, we would wholeheartedly recommend JabPerf’s blog which covers these topics in much greater depth.
Introduction
We already wrote about what latency-sensitive applications are and how to keep data in cache in the previous post. If the topic of latency-sensitive applications is unfamiliar to you, you should read the previous post before moving to this one.
In this post we talk about the memory management mechanisms that increase memory access latency, and we explore techniques to avoid them in latency-sensitive systems. The mechanisms we cover in this post are swapping, page faults, page table walks, and TLB shootdowns. We introduce each of them and explore the techniques you can use to ensure these mechanisms are rarely or never activated.
Techniques
Page Faults
When your program requests memory from the operating system, the operating system immediately reserves enough pages of virtual memory to accommodate the request, but no physical memory is actually allocated. Instead, the operating system marks the virtual pages as non-allocated.
On the first access to an address inside a non-allocated virtual page, the CPU generates a page fault and transfers control to the operating system. The operating system allocates a physical page for the non-allocated virtual page, and control is returned to the original program.
Allocating a physical page for a given virtual page is an expensive process, and it introduces unwanted latency to memory accesses. Latency-critical applications use prefaulting, a collection of techniques used to ensure that all virtual pages an application accesses are backed by physical memory.
Prefaulting
System Calls
On Linux, memory is most of the time reserved using the mmap system call. However, the memory block returned by this call is rarely prefaulted. If you want the operating system to return prefaulted memory, you need to specify the MAP_POPULATE flag when calling mmap. This flag tells the operating system to immediately allocate physical pages for the block of virtual memory it returns. The first access to such a block will not cause page faults and physical memory allocation. Example:
int32_t* array2 = (int32_t*) mmap(0, array_memory_size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_POPULATE, -1, 0);
This is important for people who are writing memory allocators, either system allocators or custom allocators that override the default memory allocator for e.g. std::vector. You can use mmap to allocate large amounts of memory directly from the operating system, bypassing the system allocator.
Allocating Memory in C and C++
In programs written in C or C++, most memory allocation calls eventually resolve to malloc. There is no guarantee that malloc will return prefaulted memory; therefore, on a latency-critical system, it is necessary to ensure all the memory is prefaulted so that the first access to it is fast.
There are several ways to do this:
- The system allocator offers calloc instead of malloc. A call to calloc will zero-initialize the block of memory. Note, however, that calloc doesn't necessarily prefault the memory, at least not on Linux, because the allocator can satisfy the request internally using mmap with the MAP_ANONYMOUS flag, which returns zeroed pages without prefaulting them. Thanks David Blackman for the update.
- You can fill any block of memory with zeros yourself. The block can be allocated using malloc or new, or it can be allocated in global memory or on the stack.
- If a class has a non-trivial constructor, the call to the constructor will prefault the object's memory page.
C-style Arrays and std::array
There are three types of arrays we talk about in this section: a C-style array allocated in stack or global memory, a C-style array allocated on the heap, and std::array. Here are the definitions of these arrays:
// Stack or global memory allocated C-style array
int32_t a[1024];
// std::array
std::array<int32_t, 1024> b;
// Heap allocated C-style array
int32_t* c = (int32_t*) malloc(1024 * sizeof(int32_t));
// Heap allocated C-style array using new
int32_t* d = new int32_t[1024];
The recommendations below apply to arrays of simple types, since arrays of complex types typically involve calling the constructor, which initializes (and therefore prefaults) all the memory.
For arrays allocated in global memory, the first access to them will almost certainly result in page faults. For stack-allocated arrays, the same will happen unless the stack memory has already been used earlier. And for heap-based arrays, page faults will happen unless the system allocator returns a recycled memory block.
For all three types, the fix is fairly simple. For stack-allocated, global-memory-allocated, and heap-allocated arrays using new, we can easily tell the compiler to zero-initialize these arrays by adding {} after the array definition. For heap-allocated arrays using malloc, we need to add a memset after the call to malloc. Here is an example of how to prefault all the arrays:
// Stack or global memory allocated C-style array
int32_t a[1024]{};
// std::array
std::array<int32_t, 1024> b{};
// Heap allocated C-style array
int32_t* c = (int32_t*) malloc(1024 * sizeof(int32_t));
memset(c, 0, 1024 * sizeof(int32_t));
// Heap allocated C-style array using new
int32_t* d = new int32_t[1024]{};
Prefaulting std::vector
If your code is using std::vector, then it is not possible to have a vector that contains uninitialized data, even for simple types. If a std::vector contains N elements, it is guaranteed that all N elements in the vector have been initialized, thus avoiding page faults on the first access to it.
The main problem with prefaulting and std::vector happens with the reserve method. In many applications, the developer knows how many elements a vector is going to need, so they use the reserve method to avoid expensive data reallocations. But the reserve method doesn't necessarily prefault the memory; pushing new values to such a vector later will result in page faults.
To fix this, you have two options. One is to avoid reserve completely and instead create a vector of the desired size when the vector is constructed. The other is to override std::vector's allocator and prefault the block of memory before passing it to std::vector.
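Here is a minimal sketch of both options. The prefault_allocator below is an illustrative name, not a standard type; it simply zeroes every block before the container sees it, which forces the physical pages to be allocated:

#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <new>
#include <vector>

// Illustrative allocator that prefaults memory by writing zeros to every
// block before handing it to the container.
template <typename T>
struct prefault_allocator {
    using value_type = T;

    prefault_allocator() = default;
    template <typename U>
    prefault_allocator(const prefault_allocator<U>&) {}

    T* allocate(std::size_t n) {
        void* p = std::malloc(n * sizeof(T));
        if (p == nullptr) throw std::bad_alloc();
        std::memset(p, 0, n * sizeof(T)); // touch every page: forces physical allocation
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U>
bool operator==(const prefault_allocator<T>&, const prefault_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const prefault_allocator<T>&, const prefault_allocator<U>&) { return false; }

void vector_example() {
    // Option 1: construct with the final size; all elements are value-initialized,
    // so the backing memory is prefaulted.
    std::vector<int32_t> v1(1000000);

    // Option 2: keep using reserve, but back the vector with the prefaulting allocator.
    std::vector<int32_t, prefault_allocator<int32_t>> v2;
    v2.reserve(1000000); // the reserved block is already prefaulted
}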
Prefaulting hash maps (std::unordered_set and std::unordered_map)
Internally, a hash map consists of one large block of memory for the hash map itself. Apart from this, and depending on the hash map type, a hash map can allocate additional memory: open-addressing hash maps do not allocate additional memory, while separate-chaining hash maps allocate additional smaller memory chunks for collision resolution.
Under the assumption that you roughly know the size of the hash map in advance, you will call its reserve method to reserve space for the hash map and avoid expensive rehashing. But, similar to std::vector, the reserve method will allocate memory without prefaulting it. Again, similar to std::vector, using a custom allocator that prefaults the memory before returning it to the data structure will help.
In the case of separate-chaining hash maps, there is no guarantee that the hash map will not allocate additional memory when collisions occur, so you will need to allocate and prefault that additional memory as well. Open-addressing hash maps don't have this issue.
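Assuming the illustrative prefault_allocator from the previous sketch, plugging it into std::unordered_map is only a matter of the allocator template parameter; both the bucket array and the collision-resolution nodes are then allocated through it:

#include <cstdint>
#include <unordered_map>

// Reuses the illustrative prefault_allocator from the std::vector sketch above.
using prefaulted_map = std::unordered_map<
    int32_t, int32_t,
    std::hash<int32_t>, std::equal_to<int32_t>,
    prefault_allocator<std::pair<const int32_t, int32_t>>>;

void hash_map_example() {
    prefaulted_map m;
    m.reserve(1000000); // the bucket array returned by the allocator is prefaulted
}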
Prefaulting trees (std::map and std::set)
Trees consist of many small memory chunks of the same size. There is no reserve method for trees, which means that it is not possible to preallocate and prefault their storage through the container's interface.
As with the earlier data structures, prefaulting doesn't happen automatically. Again, if you want to prefault the memory blocks, the simplest way is with a custom memory allocator.
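One way to do this, sketched below under the assumption that you can bound the total amount of node memory in advance, is an arena-style allocator: it grabs a single prefaulted block with mmap and MAP_POPULATE up front and hands out node-sized chunks from it. The arena_allocator name and the fixed arena size are illustrative, and freed nodes are not reused in this sketch:

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>
#include <new>
#include <map>

// Illustrative arena allocator: one prefaulted mmap-ed block per value type,
// node-sized chunks handed out by bumping an offset.
template <typename T>
struct arena_allocator {
    using value_type = T;

    static constexpr std::size_t arena_size = 64 * 1024 * 1024; // illustrative upper bound
    static inline char* base = nullptr;
    static inline std::size_t offset = 0;

    arena_allocator() = default;
    template <typename U>
    arena_allocator(const arena_allocator<U>&) {}

    T* allocate(std::size_t n) {
        if (base == nullptr) {
            void* p = mmap(nullptr, arena_size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
            if (p == MAP_FAILED) throw std::bad_alloc();
            base = static_cast<char*>(p);
        }
        // Keep allocations properly aligned for T.
        offset = (offset + alignof(T) - 1) & ~(alignof(T) - 1);
        std::size_t bytes = n * sizeof(T);
        if (offset + bytes > arena_size) throw std::bad_alloc();
        T* result = reinterpret_cast<T*>(base + offset);
        offset += bytes;
        return result;
    }

    void deallocate(T*, std::size_t) noexcept {
        // Freed nodes are not reused in this simple sketch.
    }
};

template <typename T, typename U>
bool operator==(const arena_allocator<T>&, const arena_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const arena_allocator<T>&, const arena_allocator<U>&) { return false; }

// Usage: all tree nodes come from the prefaulted arena.
using prefaulted_tree = std::map<int32_t, int32_t, std::less<int32_t>,
                                 arena_allocator<std::pair<const int32_t, int32_t>>>;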
Large Memory Pages
Standard memory pages are 4 kB in size on most systems. So, if a program allocates 2 MB of memory, this will be serviced through 512 memory pages. The alternative is large pages: e.g. 2 MB of memory will be serviced through only one 2 MB memory page.
The advantages of using larger pages are (1) fewer page faults when accessing data for the first time, since there are fewer memory pages in total for the same data size, and (2) better access times for random memory accesses, because the TLB cache (the cache that translates virtual memory addresses to physical memory addresses) can cover larger blocks of memory without misses.
Unfortunately, on most common systems large memory pages are not enabled by default. To enable them, you must either have root privileges or you need to ask the system owner to enable them for you. You can find more information about large memory pages, including how to enable them and what to expect performance-wise in our post about large memory pages and TLB cache.
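As an illustration (assuming a Linux system where the administrator has reserved huge pages, e.g. through vm.nr_hugepages), explicit 2 MB pages can be requested with the MAP_HUGETLB flag; if that fails, a regular mapping can at least be hinted towards transparent huge pages with madvise(MADV_HUGEPAGE). This is a sketch, not a complete allocator:

#include <sys/mman.h>
#include <cstddef>
#include <cstring>

// Tries to allocate 'size' bytes backed by explicit 2 MB huge pages; falls
// back to regular pages with a transparent-huge-page hint. 'size' should be
// a multiple of 2 MB for the MAP_HUGETLB path.
void* allocate_prefaulted_huge(std::size_t size) {
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_POPULATE, -1, 0);
    if (p != MAP_FAILED) return p;

    // Fallback: regular mapping. The MADV_HUGEPAGE hint must be given before
    // the memory is touched, so we prefault manually afterwards instead of
    // using MAP_POPULATE.
    p = mmap(nullptr, size, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return nullptr;
    madvise(p, size, MADV_HUGEPAGE);
    std::memset(p, 0, size); // touch every page to prefault
    return p;
}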
Disable 5-Level Page Walks
When a TLB cache miss happens, i.e. when the CPU cannot find a mapping of a virtual address to physical address in the fast TLB cache, it needs to find the corresponding mapping in the memory. Since the virtual memory space is very large, and most of it is unused, the virtual to physical memory mappings are organized hierarchically, and the CPU needs to do a hierarchical lookup for the needed virtual page to physical page mapping.
The depth of the hierarchy depends on the size of the virtual address space. On 64-bit systems, it is possible to address 256 TB of virtual memory with 48-bit addresses and 4-level page tables. But a few years ago, Intel decided to increase the size of the virtual address space to 128 PB, which requires 57-bit addresses and 5-level page tables.
A 5-level page table takes longer to traverse than a 4-level page table. Therefore, in latency-sensitive systems, 5-level page tables should be disabled unless you really need the extra memory. On Linux you can disable them by passing the no5lvl option on the kernel boot command line.
You can read more about 5-level page tables in JabPerf’s blog about the topic.
TLB Shootdowns
TLB caches hold information about virtual to physical memory mapping. This includes: (1) the physical memory page corresponding to a virtual memory page, (2) whether the page is allocated, (3) permissions for the page (read/execute/write).
Every time some of this information about a page changes (e.g. a virtual memory page becomes valid because the operating system has allocated a physical page for it, or a virtual page becomes executable), this information needs to be propagated to all TLB caches in a multicore system.
This propagation is mostly performed in software, using interrupts, and it introduces unpredictable latency.
In practical terms, TLB shootdowns in latency-sensitive systems are prevented by allocating all the needed memory upfront and prefaulting it. This will ensure that the TLB cache doesn’t need to be updated and will prevent expensive TLB shootdowns.
Mark Dawson from JabPerf has talked extensively about this topic, including how TLB shootdowns are handled in software, what their impact on performance is, and how to check whether TLB shootdowns happen in your program. More information here.
Swapping
The swapping mechanism kicks in when there is not enough physical memory available. The operating system moves data that hasn't been used recently from physical memory to disk in a process called swapping. Access to data that has been moved to disk will be slow, because it results in a page fault and the operating system reloading the data from disk back to memory.
Although many low-latency and real-time systems disable swapping completely, on systems that have swapping enabled you can disable swapping for a particular process or a particular part of memory. On Linux, this is achieved using the mlock and mlockall system calls.
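For example (a minimal sketch; note that locking memory may require raising the RLIMIT_MEMLOCK limit or the CAP_IPC_LOCK capability), a latency-sensitive process can lock all of its current and future memory into RAM with mlockall:

#include <sys/mman.h>
#include <cstdio>

int main() {
    // Lock everything mapped now (MCL_CURRENT) and everything mapped in the
    // future (MCL_FUTURE) into physical memory so it is never swapped out.
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        std::perror("mlockall");
        return 1;
    }
    // ... latency-sensitive work ...
    return 0;
}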
Experiments
In this experiment, we measure the impact of doing computation on an array which is prefaulted vs an array which is not prefaulted. The source code is available here.
We ran the test on an Intel(R) Core(TM) i5-10210U with turbo boost disabled. The test fills an array of 2 GB with random data, and we measure how much time it takes to fill the array. We repeat the experiment 10 times and report the average and the standard deviation:
Array Type | Runtime: average (std. dev.)
--- | ---
Without Prefaulting | 15.060 (0.022)
With Prefaulting | 13.811 (0.015)
The speed-up is measurable, but not huge. Since our test program accesses memory sequentially, it is quite possible that Linux's page clustering mechanism was activated, which speeds up the allocation of physical pages.
Final Words
Memory management mechanisms like swapping, page table walks, page faults and others increase memory access latency and should be avoided on latency-sensitive systems. Latency-insensitive systems, however, don't need these optimizations; in many cases they can decrease system throughput because they decrease the memory available for other processes.
Hi. Lots of great tips here!
One comment regarding “System allocator offers calloc instead of malloc. Call to calloc will zero initialize the block of memory. Later, when this block is accessed, no page faults will occur.”
Can you check this? On RHEL 8, I don’t see the RSS of a process that calls calloc increase until *after* the process starts writing to the calloced pages.
Yes, you are right. In my environment as well, calloc and malloc don't differ in the amount of time. My guess is that memory allocated with calloc is allocated internally using mmap with the MAP_ANONYMOUS flag, which guarantees initialization to zero. My allocator uses this approach; I guess the situation is the same with yours. I am going to update the post, thanks.
Hi Ivica – Great post, as always! I think you neglected to insert the specs of the system the experiment was run on prior to posting.
Fixed it, thanks 🙂