Growing Buffers to Avoid Copying Data

Master software performance in just 16 hours!
Join our Software Optimization for the Memory Subsystem Workshop taking place from May 18th to May 21st. Click here to express interest or register.

Copying data can be expensive in some cases, especially since it it doesn’t change the data, it’s just moves it. Therefore we, engineers interested in performance, want to avoid copying data as much as possible.

We already talked about avoiding data copying in C++ earlier. In that post, we talked about what mechanism C++ has to offer when it comes to avoiding or minimizing copying. In this post, we focus more on what the operating system can give us to avoid data copying.

A small introduction

Among memory allocation functions in C library, one function is particular: realloc. This function allows us to grow or shrink a buffer allocated with malloc or calloc: if realloc can do it in place, it will avoid copying data. If not, then it will allocate a new buffer, copy the data there, and free the old buffer.

If you are using C, this is all that you need. In C++, things get complicated. C++ doesn’t have a dedicated realloc function. It only offers new and delete for the user. Containers use std::allocator to allocate memory, but again, no buffer growth is available there.

If you need your buffers growing, you can opt for custom container implementation with realloc. But, there is a problem of non trivially copyable types: some types cannot be copied using realloc, instead they need to be properly constructed and destroyed. Therefore, for some types, realloc is not even an option. Instead, a manual construct new-destroy old sequence is needed for the things to work correctly.

API for enlarging buffers

If we want to avoid copying in C++, we would need a function called resize_buffer with a following signature:

bool resize_buffer(void* ptr, size_t new_size)

This function tries to resize the buffer in-place. If successful, it returns true and no copying is needed. If unsuccessful, it returns false and we need to allocate another buffer, move our data there and release this buffer using free.

This function doesn’t exist; nevertheless, we will try to build it in this post!

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us

You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

How to build `resize_buffer`?

If you are working with buffers that are smaller than memory page size (typical 4 kB), and you want to implement resize_buffer for all buffer sizes, this can come with a huge memory overhead. In essence, you will need to reserve a whole memory page (e.g. 4 kB) to allow your buffer to grow from 1 byte to 4 kB.

Growing larger buffers is easier. A common 64-bit system with virtual memory support (which is essentially all desktop and server CPUs and some more powerful embedded CPUs) have a memory address space that is 2^48 bits, or 256x4GB of data. If we want to allocate a growable buffer, we need to pick an address somewhere in this space at which we will allocate our buffer. Let’s say, we picked address X and the size of our buffer is 1 MB.

Let’s assume now that we want to grow our buffer from 1 MB to 4 MB. To achieve this, the following needs to be true:

The virtual memory from address X + 1MB to X + 3 MB needs to be unused – nobody else should have allocated this virtual memory space earlier.
The system should have additional 3 MB of physical memory to back up the virtual memory

Then, we can either ask the OS to expand the original virtual memory block from X + 1 MB to X + 3 MB. Alternatively, we can ask the OS to allocate a new memory block of 3 MB at address X + 1 MB.

Growing buffer on 32 bit system is the same, except that instead of 256×4 GB we have 4 GB of virtual memory space at our disposal. Growing memory is possible here as well, but there is a much larger chance that the virtual address space we want to grow to is used by someone else.

On Linux

On Linux, memory is allocated using mmap system call. This call returns a block of memory which is a multiple of page size (the size in bytes, but the actually allocated memory will be rounded up to a multiple of page size, so e.g. 20 byte allocation will actually be a 4 kB allocation). Here is a an example on how to allocate memory using mmap:

void * res = mmap(base_addr, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

Important parameter is base_addr, which suggests to mmap the address we want to use – if nullptr, the OS will pick an address for us; otherwise the OS will try to allocate the specified address (if rounded to page size).

Unfortunately, in our testing environment, we couldn’t let Linux pick the base address for us: if we do this, the next call to mremap will fail. The reason for the failure is that mmap tries to be conservative – and it reserves memory after your block so your block cannot grow.

Therefore, we have to provide a base address for mmap for the allocated block to be growable.

On Windows

The builtin support for growing buffers on Windows, similar to mremap on Linux, is missing. There is no equivalent of mremap on Windows. Luckily, we can work around this.

The equivalent of Linux mmap is VirtualAlloc. Function mremap doesn’t exist, but we can allocate consecutive virtual memory blocks and treat them as one continuous block for the purposes of user program. So, to allocate memory, we would use:

ptr0 = VirtualAlloc(base_ptr, size0, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

and then append additional calls to VirtualAlloc to make it grow:

ptr1 = VirtualAlloc(ptr0 + size0, increment1, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
ptr2 = VirtualAlloc(ptr1 + increment1, increment2, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

When releasing memory like this, we must not forget to call VirtualFree for each pointers ptr0, ptr1 ... This implies we need to maintain the list of all pointers and dispose them properly when they are not needed anymore (by the way, you can use the same approach on Linux if you don’t want to use mremap).

System allocator support for resizing memory buffers?

As already said, standard library does not provide an API that would allow you to resize buffers. The only allocator that does so is jemalloc – a drop in replacement for system malloc and free. It provides a function with this signature:

size_t allocated_size = xallocx(old_ptr, new_size, 0, 0);

The function, if succeeds, will grow (or shrink your buffer). The function will succeed if new_size <= allocated_size. The function sounds something that would be nice to have in C standard library.

Experiment

I wanted to get a feeling of how much time avoiding copies save. For this, I wrote a simple jsl::vector implementation that is similar to std::vector. When we push new data into the vector, the vector will grow if there isn’t enough place for new data. The difference from std::vector is that jsl::vector first tries to resize the underlying buffer to avoid copying, and only if that fails, it allocates a new buffer. The source code is available here.

We implemented four types of allocations:

Simple: no regrowth. Works on Linux and Windows.
Resize: regrows by allocating another neighboring block of memory, if possible. Works on Linux and Windows.
Posix: regrows using mremap. Works only on Linux.
Jemalloc: regrows using xallocx call from jemalloc. Works on Linux and Windows, but we tested only on Linux.

We tested writing 256M values into jsl::vector<float> using push_back methods, without reserve. The Linux and Windows systems are different and therefore not directly comparable. Here are the runtimes (g++, 10 repeats, average time) on the Linux system:

Configuration	Runtime
`std::vector`	0.998 s
Simple	0.899 s
Resize	0.495 s
Posix	0.522 s
Jemalloc	0.547 s

Avoiding copies definitely pays off in terms of number. Both Resize and Posix configurations were able to completely avoid copying and grow their bytes from 0 to 256 M. Jemalloc on the other hand, wasn’t able to grow buffers and some copying were needed: small buffers (less than 32 kB had to be copied), and sporadically, resizing larger buffers also failed: out of 15 consecutive attempts at growth (from size X to size 2X), 4 failed.

Same measurements, on Windows with MSVC compiler:

Configuration	Runtime
`std::vector`	0.64 s
Simple	0.54 s
Resize	Average: 0.415 s, but there were two numbers where the results were clustering: 0.46 s (6 measurements) 0.34 s (4 measurements)

Again, we see here similar behavior. Resizing buffer instead of copying helps, but not as much as on Linux. We notice that in Resize we actually have two numbers. The reason for this is that on Windows Resize doesn’t manage to grow buffers in place – sometime VirtualAlloc cannot allocate the memory address we need for our buffer to grow so we need to copy it. If this happens while the buffer is relatively small – we get the short runtime. If this happens while the buffer is large – we get the long runtime.

Problems with buffer growing?

I intentionally painted a picture without going into too much detail. Can this go wrong? Absolutely. Here are some possible problems which I either experienced while writing this post, or that can happen if you try to use this approach for many buffers, based on my experience:

Different behavior on different systems: you might test this approach and it works on your development system, but when you try to run it on a production system, it fails. In the worst case, a system can experience virtual memory shortages in spite of having enough physical memory.
Silent failure: if this approach fails, it will be silent. Your performance improving efforts will be spent in vain.
Virtual space fragmentation: to allow buffer growth, your buffers will be far away from one another. This leads to virtual address space fragmentation and lower performance, for two reasons: (1) the OS needs more time to allocate and free memory and (2) there will be a higher number of data cache misses and TLB cache misses, which can result in overall slowness of the system without an obvious reason.
mmap quirks: we need to provide the base address for mmap for mremap to succeed later. This introduces the problem of memory management in our program: now our program has to decide how it wants to arrange buffers in memory, which not trivial. In this experiment we picked a random address for mmap, but definitely this is not a good solution for many buffers that wish to grow.
VirtualAlloc quirks: Windows’ allocation function is not without its own problems. Although VirtualAlloc allocates memory in page sizes, the buffers can only starts at offset which a multiple of SYSTEM_INFO::dwAllocationGranularity, which on Windows 11 was 64 kB. Therefore, we need to allocate our memory in blocks of 64 kB in order to be able to grow them, which further leads to more memory waste.
Doesn’t scale well: imagine having thousands of allocations that can potentially grow. It can be very challenging to maintain to potential of buffer growth in the presence of many allocations.
Unstandardized: this is something that works differently on different systems and great care needs to be taken if you want this to be portable to different systems.

Conclusion

Buffer growth without copying is definitely possible, and worth it in the case where you have data of undetermined size where you want to avoid copying. But, although it might seem simple in the first run you need to be careful with implementing and testing if you want to avoid some very serious problems. But, what is life without challenges?!

In the next post we will investigate the next alternative to copying: moving the same data to a new address. But, instead of copying, we will just map physical pages with data to a new virtual address and avoid copying!

Very instructive! I find myself wondering, however, if certain issues you mentioned that arise in production systems could be avoided by exploring other methods of managing the buffers.

For example, in certain situations, it may be that it would be possible to reduce the amount of actual memory needed by proactively exploiting redundancy in the data being cached. In other words, rather than actively allocate greater amounts ahead of time, you could allocate less and take a small performance hit for insertion operations by looking for repeated values and simply encoding each unique value’s position and the amount of times the value occurs. Of course, that instantly makes optimization at a byte level basically useless, as the method would likely require more than a byte to encode the locations of the unique values. To respect the performance-focused developer, you would also likely need to disclose to the user the hit they are taking with this method too for insertion times and offer the methods you already presented as alternatives for workloads that will often modify the contents of the buffer.

Maybe something that could also be explored is either exploiting eBPF (on Windows and Linux :)) or kernel modules (which I’m not a huge fan of compared to eBPF because of their inherent risks, but there are some things that eBPF simply can’t do yet that kernel modules can) to effectively “close off” regions of memory at kernel startup and make it so that they can be directly controlled by userspace libraries. That could offer a compelling way to avoid the overheads inherent with the system calls you would have to rely on to allocate and free memory.

Anyways, just wanted to leave my thoughts! Think this blog/company is super neat, really appreciate the writing!

1 comment / Add your comment below

Luke Hofstetter says:
June 2, 2025 at 8:05 pm
Very instructive! I find myself wondering, however, if certain issues you mentioned that arise in production systems could be avoided by exploring other methods of managing the buffers.
For example, in certain situations, it may be that it would be possible to reduce the amount of actual memory needed by proactively exploiting redundancy in the data being cached. In other words, rather than actively allocate greater amounts ahead of time, you could allocate less and take a small performance hit for insertion operations by looking for repeated values and simply encoding each unique value’s position and the amount of times the value occurs. Of course, that instantly makes optimization at a byte level basically useless, as the method would likely require more than a byte to encode the locations of the unique values. To respect the performance-focused developer, you would also likely need to disclose to the user the hit they are taking with this method too for insertion times and offer the methods you already presented as alternatives for workloads that will often modify the contents of the buffer.
Maybe something that could also be explored is either exploiting eBPF (on Windows and Linux :)) or kernel modules (which I’m not a huge fan of compared to eBPF because of their inherent risks, but there are some things that eBPF simply can’t do yet that kernel modules can) to effectively “close off” regions of memory at kernel startup and make it so that they can be directly controlled by userspace libraries. That could offer a compelling way to avoid the overheads inherent with the system calls you would have to rely on to allocate and free memory.
Anyways, just wanted to leave my thoughts! Think this blog/company is super neat, really appreciate the writing!