We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way a concern in your software project, feel free to contact us.
This is the 13th post in the memory optimization series. Feel free to check out the other posts on the same topic.
Introduction to TLB Cache
On practically all modern systems, except for very low-end embedded systems, memory is not physically addressed. Instead, virtual addresses are used. To maintain the virtual-to-physical memory mapping, the operating system maintains an in-memory page table.
Every instruction that reads from or writes to memory needs to translate a virtual address to a physical one. We already know that memory accesses can be slow. To speed up the translation of virtual to physical addresses, each CPU core has a special cache called the translation lookaside buffer (TLB). It is a very fast cache memory that stores mappings of virtual addresses to physical addresses. When a mapping is already in the TLB, translation is very quick. But when it is not, an expensive process called a page walk is needed to find the missing mapping.
In real-life programs, most code doesn’t exhibit TLB cache misses. As we’ve shown earlier, code that exhibits a lot of TLB data cache misses is code that performs a lot of random memory accesses on large data structures: trees, hash maps and random array lookups (e.g. binary search). So, before trying to speed up your code using the tips from this post, make sure your code actually performs random memory accesses.
In this post we talk about TLB data cache misses, which happen when the program is accessing data. There are also TLB instruction cache misses, which happen when the code jumps around many different memory locations while executing instructions. You can read more about TLB instruction cache misses in the excellent post by Denis Bakhvalov.
Techniques
TLB cache misses happen with random data accesses. Since random data accesses are also the culprit for regular data cache misses, every technique that decreases regular data cache misses will also decrease TLB cache misses. For example, decreasing the size of the data set will decrease the number of TLB cache misses. Techniques that eliminate random memory accesses, like loop interchange, also decrease TLB cache misses, as in the sketch below.
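To illustrate, here is a minimal sketch of loop interchange; the function names and the row-major double** layout are made up for illustration. In the first version, the inner loop walks down a column, so consecutive accesses are far apart in memory and, for large matrices, land on a different page almost every iteration. Swapping the loops makes the accesses sequential:

```cpp
#include <cstddef>

// Before interchange: the inner loop walks down a column, so each
// access is a whole row apart in memory, and a large matrix touches
// a different page on almost every iteration.
double sum_by_columns(double** matrix, std::size_t n) {
    double sum = 0.0;
    for (std::size_t j = 0; j < n; j++)
        for (std::size_t i = 0; i < n; i++)
            sum += matrix[i][j];
    return sum;
}

// After interchange: the inner loop walks along a row, accesses are
// sequential, and both data cache and TLB misses drop.
double sum_by_rows(double** matrix, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; i++)
        for (std::size_t j = 0; j < n; j++)
            sum += matrix[i][j];
    return sum;
}
```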
Huge Pages
But apart from that, there is one more thing we can do. Normally, the page size on x86 and ARM systems is 4 kB. If the TLB has 64 entries, this means we can randomly access data within a block of 256 kB (64 * 4 kB) without any TLB cache misses. If the block of data is larger than this, we will experience some TLB cache misses, and their number will grow as the size of the block grows.
Both Intel and ARM offer large data pages or huge data pages. Instead of a page size of 4 kB, the page size can be 2 MB, 4 MB or 1 GB on Intel, and 64 kB, 1 MB or 16 MB on ARM. So, for example, with a page size of 2 MB and 64 TLB entries, we can randomly access a block of up to 128 MB (64 * 2 MB) without any TLB data cache misses. Compared to the original block of 256 kB, this is a huge improvement.
There is one additional advantage of using huge pages over regular pages. When memory is first returned by the operating system, it is not backed by any physical memory (this is done to save space, because applications often allocate memory that they don’t use). Only when the program accesses a page belonging to that block of memory for the first time will the operating system allocate a physical page and create a virtual-to-physical mapping. If the block of memory is large, there will be one such call into the operating system (a page fault) for each page in the block. With huge pages, the number of page faults is much lower: touching a 1 GB block for the first time costs 262,144 faults with 4 kB pages, but only 512 with 2 MB pages.
Setting up Huge Pages
Large pages and huge pages don’t come out of the box; they need to be set up on the target system. Typically, this is done through configuration files, and a certain amount of memory is dedicated to large pages (e.g. 2 GB, as 1024 pages of 2 MB each).
To set up huge pages on newer Ubuntu versions and x86-64 platforms, you need to add the line

vm.nr_hugepages=1024

to the file /etc/sysctl.conf and then restart your system. A total of 1024 2 MB pages will be reserved for you. Bear in mind that these pages will not be available to other applications unless those applications explicitly request huge pages.
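To verify that the reservation actually succeeded (the kernel can reserve fewer pages than requested if physical memory is fragmented), something like the following should work; sysctl -p also lets you reload the file without a full restart:

```
sudo sysctl -p                # reload /etc/sysctl.conf
grep HugePages /proc/meminfo  # HugePages_Total should now show 1024
```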
Using Huge Pages
There are several ways to use huge pages. At the operating system level, huge pages are allocated using the mmap system call with the MAP_HUGETLB flag:
char * p = reinterpret_cast<char*>(mmap(0, mem_size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0));
This approach allows you to allocate huge pages only for those blocks of memory that would actually benefit from them (blocks with random memory accesses). This is certainly the best way, because it uses huge pages frugally, but it requires you to understand the memory access pattern of your hot loop, because huge pages only help with random memory accesses.
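For completeness, here is a slightly fuller sketch of the same approach (our own example, not a canonical recipe). It checks for failure and falls back to regular pages, because mmap with MAP_HUGETLB fails when no huge pages are reserved; mem_size is assumed to be a multiple of the 2 MB huge page size:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Try to allocate mem_size bytes backed by huge pages; fall back to
// regular 4 kB pages if no huge pages are available. The memory is
// later released with munmap(p, mem_size), not free().
char* allocate_huge(std::size_t mem_size) {
    void* p = mmap(nullptr, mem_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        // Huge page pool empty or not configured: retry with regular pages.
        p = mmap(nullptr, mem_size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? nullptr : static_cast<char*>(p);
}
```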
A coarser way to use huge pages is libhugetlbfs. It gives you per-process control over which programs get huge pages. The library is available in the Linux repositories. If you link your program against it, the program will automatically use huge pages when they are available. Alternatively, you can use LD_PRELOAD to preload the library and use huge pages without recompiling the program. More info is available in the libhugetlbfs documentation.
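For example, if we read the libhugetlbfs documentation correctly, backing the heap with huge pages without recompiling looks like this (the HUGETLB_MORECORE variable tells the library to serve malloc allocations from huge pages):

```
LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./my_program
```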
Another option is to use a system allocator that supports huge pages. Google’s TCMalloc is one such allocator. On Ubuntu, it is available in the repositories. To use it, you don’t need to recompile your program: you can use LD_PRELOAD to enable the allocator and huge pages like this:
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4 TCMALLOC_MEMFS_MALLOC_PATH=/dev/hugepages/tcmalloc ./my_program
And lastly, the simplest way to use huge pages, though the one offering the least control, is transparent huge pages (on Linux). This is a feature of the operating system, and the kernel needs to be compiled with it. Essentially, the operating system allocates huge pages to some of the processes in the system without those processes needing to do anything. This feature is often used in Linux gaming to improve game performance, but it has found little use outside of it, since there is no control over which processes get huge pages.
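One nuance: if you can modify the program, and the kernel exposes transparent huge pages in madvise mode (check /sys/kernel/mm/transparent_hugepage/enabled), the madvise system call does give per-range control. A minimal sketch:

```cpp
#include <sys/mman.h>
#include <cstddef>

// Hint to the kernel that this range should be backed by transparent
// huge pages. It is only a hint: it requires THP to be enabled in
// "always" or "madvise" mode, and the kernel may still decline.
void request_thp(void* addr, std::size_t length) {
    madvise(addr, length, MADV_HUGEPAGE);
}
```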
Experiments
In this section, we want to measure the performance impact of huge pages. For the experiment we used the open addressing hash map example from the previous post. We measured the performance of hash map lookups when the program was using either small pages or huge pages. Both small pages and huge pages were allocated using mmap. We ran the test on an Intel(R) Core(TM) i5-10210U system. The source code is available here.
Here is the comparison between the small pages version and the huge pages version for lookups in open addressing hash maps of various sizes:
| Hash map size | Small pages: runtime | Small pages: L1 DTLB load misses (miss rate) | Huge pages: runtime | Huge pages: L1 DTLB load misses (miss rate) |
|---|---|---|---|---|
| 8 kB | 0.261 s | 96 k (0.0001) | 0.262 s | 96 k (0.0001) |
| 64 kB | 0.318 s | 99 k (0.0001) | 0.292 s | 96 k (0.0001) |
| 512 kB | 0.573 s | 7,483 k (0.0076) | 0.503 s | 101 k (0.0001) |
| 4 MB | 0.710 s | 13,277 k (0.0148) | 0.585 s | 94 k (0.0001) |
| 32 MB | 0.437 s | 12,573 k (0.026) | 0.355 s | 99 k (0.0002) |
The story is the same as always. For small hash maps, there is no difference whether we are using small or huge pages. For larger hash maps, the speedup is between 1.14x and 1.23x.
Final Words
We presented here the only technique that specifically targets TLB cache misses: huge pages. The technique is not complicated to set up, and it results in roughly a 15% to 25% speedup on workloads with random memory accesses. It makes the most sense on dedicated server systems and embedded systems, where we have full control over the environment.
Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn, Twitter or Mastodon and get notified as soon as new content becomes available.