Growing Buffers to Avoid Copying Data

UPCOMING VECTORIZATION WORKSHOPS
AVX Vectorization Workshop: 4 half days, May 26th to May 29th,
11 AM – 3 PM (US East Coast)
8 AM – 12 PM (US West Coast)
5 PM – 9 PM CET (Europe)
NEON Vectorization Workshop: TBD, send e-mail to info@johnnysswlab.com to express interest

More info…

Copying data can be expensive in some cases, especially since it it doesn’t change the data, it’s just moves it. Therefore we, engineers interested in performance, want to avoid copying data as much as possible.

We already talked about avoiding data copying in C++ earlier. In that post, we talked about what mechanism C++ has to offer when it comes to avoiding or minimizing copying. In this post, we focus more on what the operating system can give us to avoid data copying.

A small introduction

Among memory allocation functions in C library, one function is particular: realloc. This function allows us to grow or shrink a buffer allocated with malloc or calloc: if realloc can do it in place, it will avoid copying data. If not, then it will allocate a new buffer, copy the data there, and free the old buffer.

If you are using C, this is all that you need. In C++, things get complicated. C++ doesn’t have a dedicated realloc function. It only offers new and delete for the user. Containers use std::allocator to allocate memory, but again, no buffer growth is available there.

If you need your buffers growing, you can opt for custom container implementation with realloc. But, there is a problem of non trivially copyable types: some types cannot be copied using realloc, instead they need to be properly constructed and destroyed. Therefore, for some types, realloc is not even an option. Instead, a manual construct new-destroy old sequence is needed for the things to work correctly.

API for enlarging buffers

If we want to avoid copying in C++, we would need a function called resize_buffer with a following signature:

bool resize_buffer(void* ptr, size_t new_size)

This function tries to resize the buffer in-place. If successful, it returns true and no copying is needed. If unsuccessful, it returns false and we need to allocate another buffer, move our data there and release this buffer using free.

This function doesn’t exist; nevertheless, we will try to build it in this post!

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

How to build resize_buffer?

If you are working with buffers that are smaller than memory page size (typical 4 kB), and you want to implement resize_buffer for all buffer sizes, this can come with a huge memory overhead. In essence, you will need to reserve a whole memory page (e.g. 4 kB) to allow your buffer to grow from 1 byte to 4 kB.

Growing larger buffers is easier. A common 64-bit system with virtual memory support (which is essentially all desktop and server CPUs and some more powerful embedded CPUs) have a memory address space that is 2^48 bits, or 256x4GB of data. If we want to allocate a growable buffer, we need to pick an address somewhere in this space at which we will allocate our buffer. Let’s say, we picked address X and the size of our buffer is 1 MB.

Let’s assume now that we want to grow our buffer from 1 MB to 4 MB. To achieve this, the following needs to be true:

  • The virtual memory from address X + 1MB to X + 3 MB needs to be unused – nobody else should have allocated this virtual memory space earlier.
  • The system should have additional 3 MB of physical memory to back up the virtual memory

Then, we can either ask the OS to expand the original virtual memory block from X + 1 MB to X + 3 MB. Alternatively, we can ask the OS to allocate a new memory block of 3 MB at address X + 1 MB.

Growing buffer on 32 bit system is the same, except that instead of 256×4 GB we have 4 GB of virtual memory space at our disposal. Growing memory is possible here as well, but there is a much larger chance that the virtual address space we want to grow to is used by someone else.

On Linux

On Linux, memory is allocated using mmap system call. This call returns a block of memory which is a multiple of page size (the size in bytes, but the actually allocated memory will be rounded up to a multiple of page size, so e.g. 20 byte allocation will actually be a 4 kB allocation). Here is a an example on how to allocate memory using mmap:

void * res = mmap(base_addr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Important parameter is base_addr, which suggests to mmap the address we want to use – if nullptr, the OS will pick an address for us; otherwise the OS will try to allocate the specified address (if rounded to page size).

Unfortunately, in our testing environment, we couldn’t let Linux pick the base address for us: if we do this, the next call to mremap will fail. The reason for the failure is that mmap tries to be conservative – and it reserves memory after your block so your block cannot grow.

Therefore, we have to provide a base address for mmap for the allocated block to be growable.

On Windows

The builtin support for growing buffers on Windows, similar to mremap on Linux, is missing. There is no equivalent of mremap on Windows. Luckily, we can work around this.

The equivalent of Linux mmap is VirtualAlloc. Function mremap doesn’t exist, but we can allocate consecutive virtual memory blocks and treat them as one continuous block for the purposes of user program. So, to allocate memory, we would use:

ptr0 = VirtualAlloc(base_ptr, size0, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

and then append additional calls to VirtualAlloc to make it grow:

ptr1 = VirtualAlloc(ptr0 + size0, increment1, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
ptr2 = VirtualAlloc(ptr1 + increment1, increment2, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

When releasing memory like this, we must not forget to call VirtualFree for each pointers ptr0, ptr1 ... This implies we need to maintain the list of all pointers and dispose them properly when they are not needed anymore (by the way, you can use the same approach on Linux if you don’t want to use mremap).

System allocator support for resizing memory buffers?

As already said, standard library does not provide an API that would allow you to resize buffers. The only allocator that does so is jemalloc – a drop in replacement for system malloc and free. It provides a function with this signature:

size_t allocated_size = xallocx(old_ptr, new_size, 0, 0);

The function, if succeeds, will grow (or shrink your buffer). The function will succeed if new_size <= allocated_size. The function sounds something that would be nice to have in C standard library.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Experiment

I wanted to get a feeling of how much time avoiding copies save. For this, I wrote a simple jsl::vector implementation that is similar to std::vector. When we push new data into the vector, the vector will grow if there isn’t enough place for new data. The difference from std::vector is that jsl::vector first tries to resize the underlying buffer to avoid copying, and only if that fails, it allocates a new buffer. The source code is available here.

We implemented four types of allocations:

  • Simple: no regrowth. Works on Linux and Windows.
  • Resize: regrows by allocating another neighboring block of memory, if possible. Works on Linux and Windows.
  • Posix: regrows using mremap. Works only on Linux.
  • Jemalloc: regrows using xallocx call from jemalloc. Works on Linux and Windows, but we tested only on Linux.

We tested writing 256M values into jsl::vector<float> using push_back methods, without reserve. The Linux and Windows systems are different and therefore not directly comparable. Here are the runtimes (g++, 10 repeats, average time) on the Linux system:

ConfigurationRuntime
std::vector0.998 s
Simple0.899 s
Resize0.495 s
Posix0.522 s
Jemalloc0.547 s

Avoiding copies definitely pays off in terms of number. Both Resize and Posix configurations were able to completely avoid copying and grow their bytes from 0 to 256 M. Jemalloc on the other hand, wasn’t able to grow buffers and some copying were needed: small buffers (less than 32 kB had to be copied), and sporadically, resizing larger buffers also failed: out of 15 consecutive attempts at growth (from size X to size 2X), 4 failed.

Same measurements, on Windows with MSVC compiler:

ConfigurationRuntime
std::vector0.64 s
Simple0.54 s
ResizeAverage: 0.415 s, but there were two numbers where the results were clustering:
0.46 s (6 measurements)
0.34 s (4 measurements)

Again, we see here similar behavior. Resizing buffer instead of copying helps, but not as much as on Linux. We notice that in Resize we actually have two numbers. The reason for this is that on Windows Resize doesn’t manage to grow buffers in place – sometime VirtualAlloc cannot allocate the memory address we need for our buffer to grow so we need to copy it. If this happens while the buffer is relatively small – we get the short runtime. If this happens while the buffer is large – we get the long runtime.

Problems with buffer growing?

I intentionally painted a picture without going into too much detail. Can this go wrong? Absolutely. Here are some possible problems which I either experienced while writing this post, or that can happen if you try to use this approach for many buffers, based on my experience:

  • Different behavior on different systems: you might test this approach and it works on your development system, but when you try to run it on a production system, it fails. In the worst case, a system can experience virtual memory shortages in spite of having enough physical memory.
  • Silent failure: if this approach fails, it will be silent. Your performance improving efforts will be spent in vain.
  • Virtual space fragmentation: to allow buffer growth, your buffers will be far away from one another. This leads to virtual address space fragmentation and lower performance, for two reasons: (1) the OS needs more time to allocate and free memory and (2) there will be a higher number of data cache misses and TLB cache misses, which can result in overall slowness of the system without an obvious reason.
  • mmap quirks: we need to provide the base address for mmap for mremap to succeed later. This introduces the problem of memory management in our program: now our program has to decide how it wants to arrange buffers in memory, which not trivial. In this experiment we picked a random address for mmap, but definitely this is not a good solution for many buffers that wish to grow.
  • VirtualAlloc quirks: Windows’ allocation function is not without its own problems. Although VirtualAlloc allocates memory in page sizes, the buffers can only starts at offset which a multiple of SYSTEM_INFO::dwAllocationGranularity, which on Windows 11 was 64 kB. Therefore, we need to allocate our memory in blocks of 64 kB in order to be able to grow them, which further leads to more memory waste.
  • Doesn’t scale well: imagine having thousands of allocations that can potentially grow. It can be very challenging to maintain to potential of buffer growth in the presence of many allocations.
  • Unstandardized: this is something that works differently on different systems and great care needs to be taken if you want this to be portable to different systems.

Conclusion

Buffer growth without copying is definitely possible, and worth it in the case where you have data of undetermined size where you want to avoid copying. But, although it might seem simple in the first run you need to be careful with implementing and testing if you want to avoid some very serious problems. But, what is life without challenges?!

In the next post we will investigate the next alternative to copying: moving the same data to a new address. But, instead of copying, we will just map physical pages with data to a new virtual address and avoid copying!

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Leave a Reply

Your email address will not be published. Required fields are marked *