Excessive copying in C++ and your program's speed

Master software performance in just 16 hours!
Join our Software Optimization for the Memory Subsystem Workshop taking place from May 18th to May 21st. Click here to express interest or register.

C++ was originally created for object-oriented programming. The idea is to have objects and have objects manage themselves. This allows developers to take a specification from the non-technical world and code it straightforwardly into objects and behaviors. But this level of abstraction comes with a performance price.

Compared to C, C++ has a reputation for being slow. But precisely the level of abstraction C++ provides is what is making it slow. In C, there is no simple way to pass an array to a function by value. In C++, std::vector will let you do exactly that. In C, you cannot have arithmetic operators on structs. In C++, you can overload arithmetic operators and do full arithmetic on classes, but the penalty will be the generation of temporary objects. There are no constructors and destructors in C, whereas in C++ they are often called without any explicit command from the developer. And so on.

C developers facing C++ for the first time will pick up the language quickly. Still, being unaware of how things work internally, they will use the language in an inefficient manner. The compiler cannot optimize away slow code and you are stuck with a program that is slow but doesn’t necessarily have to be. Here we will present a few most common techniques to remove unnecessary copying and temporary objects.

Table Of Contents

Tips to avoid excessive copying in C++

Before moving on to tips, we implemented a class bit_field to illustrate the examples and do the measurements. Class bit_field is a wrapper that allows you to do the logical operation &, | and ^ on an array of integers. The size of the array can be determined at runtime and can grow or shrink as needed. Runtime array allocation and resize are expensive operations, and if not used properly, can generate a lot of copying.

The full source code of bit_field class is available in our github repository.

Everyday coding

Inside functions, pass bigger objects by reference, not by value

The first rule of the book is to pass the objects by reference, not by value. So a function with this signature is bad:

int calculate_sum(std::vector<int> v);

And function with this signature is good:

int calculate_sum(const std::vector<int>& v);

In the above example, when we are passing std::vector<int> v by value, C++ creates a temporary object which is a copy of the input argument, and it uses this copy inside the function. This object will get destroyed when the function returns. Creating a temporary std::vector can be very expensive if the vector is big. Classes that internally allocate memory, work with files etc. are generally expensive to copy.

The solution is to pass arguments to a function by reference (or const reference, or pointer).

There are however, types that you want to pass by value. Simple types: int, float, double, char and bool. Passing them by value actually allows some compiler optimizations. You should also pass by value pointer types shared_ptr and weak_ptr. You can also pass by value small flat classes¹ that represent a single value used for processing, e.g. complex_number.

Prefer a copy constructor to an empty constructor followed by an assignment

Developers coming from C background sometimes make an error of declaring the variable in one line and assigning to it in another line, like this:

bit_field c;
c = c2;

In the above example, the first line constructs an object of type bit_field. The second line calls the operator= on the object. We can construct the object with the proper constructor and avoid the call to assignment operator like this:

// Syntax 1
bit_field c(c2);
// Syntax 2
bit_field c = c2;

Both above commands to the same thing and avoid the call to operator=.

In our measurements, on an array of 10 million bit_fields, the version with a copy constructor took 305 ms, compared to 320 ms which took the version with default constructor followed by the assignment operator.

Reuse old variables instead of constructing new ones

Constructor and assignment operator do cost something, but typically the cost of the assignment is smaller than the cost of the construction. In our bit_field class example, the constructor will always allocate memory for the underlying array, whereas the assignment operator will allocate only if it doesn’t have enough space already. So it is a good idea to reuse old variables instead of constructing new ones. An example:

bit_field calculate_sum_ctor(size_t count) {
    bit_field result({0});

    for (size_t i = 0; i < count; i++) {
        bit_field tmp({i});
        result ^= tmp;
    }

    return result;
}

bit_field calculate_sum_assign(size_t count) {
    bit_field result({0});
    bit_field tmp({0});

    for (size_t i = 0; i < count; i++) {
        tmp = {i};
        result ^= tmp;
    }

    return result;
}

In the above example, in function calculate_sum_ctor we create a new bit_field object in each iteration of the class. In function calculate_sum_assign we reuse the same variable declared outside of the loop.

The constructor allocates memory every time; the assignment allocates memory only if needed. In our measurements, the numbers are clear: assignment version took 31 ms to execute, compared to the constructor version which took 306 ms.

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us

You can also subscribe to our mailing list (link top right of this page) or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Prefer compound assignment operators to arithmetic and logical operators

Compound assignment operators are +=, -=, *=, /= etc. The good thing about compound assignment operators is that they modify the value of the object without creating a temporary object. So for instance:

bit_field crunch_bitfield_regular(bit_field a, bit_field b) {
    bit_field result(a);
    for (int i = 0; i < arr_len; i++) {
        result = result & a;
        result = result | b;
    }
    return result;
}

bit_field crunch_bitfield_compound(bit_field a, bit_field b) {
    bit_field result(a);
    for (int i = 0; i < arr_len; i++) {
        result &= a;
        result |= b;
    }
    return result;
}

In the above example, the function crunch_bitfield_regular generates a temporary object to store the value of result & a, and then uses operator= to assign this value to the variable result. The function crunch_bitfield_compound doesn’t have the problem with temporaries because compound assignment doesn’t generate temporaries.

The difference in runtimes goes like this: regular version took 849 ms to execute compared to only 96 ms for the compound version.

Another more elaborate result:

complex_number r = (a1 + a2 + a3) / b;

The above code is equivalent to the fallowing code:

complex_number tmp1(a1 + a2);
complex_number tmp2(tmp1 + a3);
complex_number r(tmp2 / b);

As you can see, we have two intermediate temporary objects. Let’s replace them with compound assignment operators:

complex_number r(a1 + a2);
r += a3;
r /= b;

The above code doesn’t generate any temporary objects, but it is less readable. Still, this optimization might be worth considering for the code on the hot path.

Class design

Inside a constructor, initialize the member variables in the constructor initialize list

All the member variables in the class will be initialized to the default value at the moment the first line in the body of the constructor is reached. If you want them initialized to a custom value, you should perform the initialization in the constructor initializer list. The constructor initializer list is located between the constructor signature and the body of the constructor (in the example below, the constructor initializer list is between lines 4 and 7).

class bit_field {
   public:
    bit_field(size_t size)
        : m_size_bits(size),
          m_size(calculate_size(size)),
          m_capacity(calculate_capacity(size)),
          m_value(new itype[m_capacity]) {}

    bit_field(size_t size) {
        m_size_bits = size;
        m_size = calculate_size(size);
        m_capacity = calculate_capacity(size);
        m_value = std::unique_ptr<itype[]>(new itype[m_capacity]);
    }
...
};

The constructor declaration on line 3 is the right way to go. The constructor on line 9 is slower because it assigns to member variables that were already constructed. Every “initialization” you make there is actually a call to the assignment operator on an already constructed object.

We did several measurements for the constructor with and without the constructor initializer. The constructors with the constructor initializer is slightly faster, in one measurement it was 910 ms vs 1033 ms, in another it was 2304 ms vs 3813 ms.

Define move constructor and move assignment operator

To illustrate this, let’s have a look at our class bit_field . Here are the relevant parts:

class bit_field {
   private:
    size_t m_size_bits;
    size_t m_size;
    size_t m_capacity;
    std::unique_ptr<itype[]> m_value;

   public:
    bit_field(const bit_field& other)
        : m_size_bits(other.m_size_bits),
          m_size(other.m_size),
          m_capacity(other.m_capacity),
          m_value(new itype[m_capacity]) {
        std::memcpy(m_value.get(), other.m_value.get(), m_size * sizeof(itype));
    }
}

The class allocates memory dynamically and stores it in the unique pointer m_value. Because of this, creating a copy of bit_field is an expensive operation since it involves a memory allocation and memory copying.

But, now imagine the following situation. You are copying from a temporary instance of bit_field which will get destroyed immediately afterward. Here is the situation that this might happen.

bit_field c;
c = a ^ b:

In the above example, the result of the operation a ^ b will be assigned to the variable c and destroyed immediately after that.

The idea of move constructor and move assignment operator is the following. If the input parameter to the copy constructor is a temporary variable, the compiler can call the move constructor instead. The difference between the two is that the move constructor can “steal” resources from the temporary variable instead of allocating its own. The same principle applies to the move assignment operator.

So, in our previous example, the compiler can call the move assignment operator on the temporary a ^ b instead of calling the copy assignment operator. Here is an implementation of the move constructor:

class bit_field {
    ...
    bit_field(bit_field&& other) noexcept
        : m_size_bits(other.m_size_bits),
          m_size(other.m_size),
          m_capacity(other.m_capacity),
          m_value(std::move(other.m_value)) {}
};

The first thing to notice about the move constructor is that the input parameter has && instead of &. In the body of the constructor, notice that there is no memory allocation. Member variable m_value is of type unique_ptr. If you are unfamiliar, the expression m_value(std::move(other.m_value)) will “steal” the pointer from other.m_value and assign it to m_value. After this, the value of other.m_value will be nullptr.

If you are implementing a move constructor and move assignment operator, you must implement the destructor properly. For instance, you will need to set the value of pointers you stole memory from to nullptr, and you need to handle the possibility of nullptr correctly in the destructor. Otherwise, you might end up with deallocating the memory twice or deallocating nullptr.

And the last thing: you should mark move constructor and move assignment operator with noexcept keyword (and of course not throw exceptions inside them). Otherwise, the compiler might pick the copy constructor instead of the move constructor inside STL containers.

So, when does the compiler generate calls to move constructor instead of move assignment operator? Here are the examples:

bit_field c;

// Move assignment operator called implicitly because the value of a ^ b is a temporary value
c = a ^ b;

// Move assignment called explicitly with std::move
bit_field d = std::move(c);
// Don't use variable c after

On line 4, the compiler calls the move assignment operator because the value of a ^ b is a temporary value. On line 7, the programmer calls the move assignment explicitly using std::move. After the move, the variable c should not be used.

In our measurements, calling to the std::vector<bit_field>.push_back(b) took 1104 ms, and calling std::vector<bit_field>.push_back(std::move(b)) took 931 ms on an array of 10 million elements.

Move semantics is a complicated topic and we only scratched the surface. But it is important to note that for performance reasons you should implement both the move constructor and move assignment operator, because “stealing” resources saves cycles, memory and time.

Mark constructors as explicit

C++ can use constructors with one parameter as a conversion constructor. Consider the following example:

class bit_field {
    bit_field(std::initializer_list l) { ... }
};

bit_field c({0xf});

c &= {0x0};

if (c == {0x0}) {
    ....
}

We initialize the variable bit_field c and then we perform the operation and with {0x0}. If the class doesn’t have an assignment operator with std::initializer_list as a parameter, or it doesn’t have an equality operator with std::initializer_list as a parameter, the compiler will silently do the following:

Create a temporary bit_field by calling the constructor with std::initializer_list parameter.
Call the appropriate function that takes bit_field as an argument, in our case void operator^=(const bit_field&) or bool operator==(const bit_field&).

Although this is fine for a large spectrum of applications, for performance-critical code you will want to disable implicit conversion to avoid temporary objects. For this, you need to mark the constructor as explicit.

class bit_field {
    explicit bit_field(std::initializer_list l) { ... }
};

If you do it like this, the compiler will issue an error when trying to compile the code from the previous section. Now you can:

Define custom operators for int type: void operator+=(const bit_field&) or bool operator==(const bit_field&).
Explicitly create a temporary object:

bit_field c({0xf});

c &= bit_field({0x0});

if (c == bit_field({0x0})) {
    ....
}

A short note on making the copy constructor explicit

Making a copy constructor explicit can make your code faster, because it can prevent certain bad practices slow down your programs. But when you do like this, some implicit conversions will need to become explicit, and some things become impossible altogether.

Consider the following function.

bit_field and(bit_field a, bit_field b) {
    bit_field result(a);
    result &= b;
    return result;
}

// This won't compile because the copy constructor is explicit
bit_field result = and(a, b);

// This compiles because we called the copy constructor explicitly
bit_field result = and(bit_field(a), bit_field(b));

Calling the function by passing arguments by value is still possible, albeit more involved and the programmer will probably want to pass the parameters by references instead of by values to simplify the calls to functions.

Please note that you can return by value from a function even when the copy constructor is explicit, because the compiler avoids it using an optimization named return value optimization (we will talk about it later on).

The following rewrite completely dispenses with passing arguments by values and passes everything by reference.

void and(const bit_field& a, const bit_field& b, bit_field& out_result) {
    out_result = a;
    out_result &= b;
}

In my opinion, all constructors with one parameter, including the copy constructor, should be marked with explicit for performance-critical classes. The compiler will then force the programmers to write more performant code.

Avoid defining methods that return class instance by value

A method that returns a class instance by value can be used as an input to another method that accepts a value or a reference. This can lead to the generation of temporary copies and program slow-downs. Here is an example:

bit_field bit_field::append(const bit_field& b) { ... }

void accumulate(const std::vector<bit_field>& b_vec) { 
    bit_field result({0});

    for (int i = 0; i < b_vec.size(); i++) {
       result = result.append(b_vec[i]));
    }
}

In the above example, method append is defined to return the class bit_field by value. In function accumulate, we create a new temporary object by calling result.append and assign it to the variable result. In each iteration of the loop the constructor and destructors are called for the temporary object.

This approach is not good in C++. C++ cannot optimize away temporary objects, so a better idea is to not even have methods that return a class instance by value. Instead, the object should modify itself.

void bit_field::append(const bit_field& b) { ... }

void accumulate(const std::vector<bit_field>& b_vec) { 
    bit_field result({0});

    for (int i = 0; i < b_vec.size(); i++) {
       result.append(b_vec[i]));
    }
}

Typically C++ developers rarely create explicitly methods that return objects by value. However, built-in operators operator+, operator-, operator& etc. return class instance by value. In case performance is important, consider deleting those methods. The users can use the compound assignment operators (+=, -=, &=) instead.

Take advantage of copy elision

Imagine a scenario when your method is returning a new instance of a class. When this happens, you have two choices: you can return it as a pointer, or you can return it as a value. Here is such a function that returns a value:

complex_number operator/(const complex_number& lhs, const complex_number& rhs) {
    float common_denominator = rhs.m_real * rhs.m_real - rhs.m_img + rhs.m_img;
    float real_part = (lhs.m_real * rhs.m_real - lhs.m_img * rhs.m_img) / common_denominator;
    float img_part = (lhs.m_real * rhs.m_img + lhs.m_img * rhs.m_real) / common_denominator;
    return complex_number(real_part, img_part);
}

complex_number c = a / b;

Inside function operator/ we create a stack variable to hold the result. On line complex_number c = a / b, the compiler can call function operator/, create a temporary copy to hold value a / b on the stack and then construct the variable c as a copy of the stack-allocated result.

This involves one unnecessary copy, and C++ standard allows that the constructor is called only once instead of two times. The name of this optimization is return value optimization (RVO). Compilers will try to remove the creation of a temporary copy as much as possible, and construct directly into the destination. In order to help them out, you will need to make sure that:

Your function has ideally one return statement, at the end of the function and it returns a nameless temporary object that is part of the return (e.g. return complex_number(real_part, img_part);)
If not possible, you can have one named temporary object defined anywhere in the function, which you return at the end of the function.
If you must have several return statements in your function, you always return the same named temporary object. You don’t return several different objects of the same type.
You throw all your exceptions before constructing the temporary object you will be returning.

If you are not sure if RVO optimization is turned on for your function, we suggest that you read your compilers info on the topic or inspect the produced assembly.

Containers

Prefer pre-increment operator on the iterators

Check out the typical implementation of the post-increment operator for any class²:

    // Canonical form of postincrement:
    T T::operator++(int)
    {
      T old( *this ); // remember our original value
      ++*this;        // always implement postincrement
                      //  in terms of preincrement
      return old;     // return our original value
    }

The post-increment operator creates a temporary copy of the current value in old, calls the pre-increment operator on the object and returns old. Compilers might or might not be able to optimize away the unused variable old. If you use the pre-increment operator, the compiler doesn’t run into such problems. So, when using the iterators, prefer the pre-increment version:

int sum = 0;
for (auto it = v.begin(); it != v.end(); ++it /* Don't use it++ */) {
    sum += *v;
}

In measurements with our bit_field example there were no differences between the pre-increment and post-increment versions. As already said, the compilers can optimize away unused variables, and this has happened in our case. However, other sources note the performance difference.

This tip works also in the general case. Prefer the pre-increment operator to post-increment operator for all classes and in all cases.

Reserve space in containers

Following containers std::string, std::vector or std::unorder_set or std::unordered_map internally use an array to hold data, and when you are putting new values into them they can grow if the underlying array isn’t big enough.

If you can estimate the size of the input (for example, you can expect that std::vector will hold one million elements), you should use the reserve method to avoid unnecessary regrowth of the container. Like this:

// Vector example
std::vector v;
v.reserve(1024*1024); // Reserving place for 1 million elements
for (int i = 0; i < 1024*1024; i++) {
    v.push_back(calculate_value(i));
}

// String example
std::string s;
s.reserve(240);

for (char c = 'a'; c <= 'z'; c++) {
    for (int i = 0; i < 10; i++) {
        s.append(c);
    }
}

We did a measurement on our bit_field class where we implemented the method reserve. Here is the example code:

constexpr size_t arr_len_append = 10000;

bit_field append_optimized(bit_field a, bit_field b,  bool reserve) {
    bit_field result({0});

    if (reserve) {
        result.reserve((a.get_size() + b.get_size()) * arr_len_append + result.get_size());
    }

    for (int i = 0; i < arr_len_append; i++) {
        result.append(a);
        result.append(b);
    }

    return result;
}

If we pass true for the reserve argument, the program will reserve enough memory so that the whole result can fit into it. Needless to say that the function took 525 ms to execute when reserve was false and 1 ms to execute when reserve was true. I would expect similar results with std::string or std::vector on a loop of this size.

Use references in for range loops

For range loops are the new C++ feature that allows for more readable code. Here is a typical example of a for range loop:

std::vector<bit_field > vec;

bit_field sum({0});
for (auto c: vec) {
    sum ^= c;
}

The auto keyword hides it, but in every iteration of a loop, a variable c is made by calling a copy constructor((If you make copy constructor explicit, the above example doesn’t compile.)). This probably wasn’t your intention, and you should create the variable as a reference to avoid unnecessary copying. So instead of auto you should use auto& (or const auto&). So, here is the fix for the above example:

std::vector<bit_field > vec;

bit_field sum({0});
for (const auto& c: vec) {
    sum ^= c;
}

In our measurement, processing array with 10 million elements took 404 ms on the value for loop, compared to 72 ms on the reference for loop.

Construct directly into the containers

When you want to fill the container class with the instance of your classes, you will call a push_back (vector), insert (map or set) or operator[] (for map). The problem with these functions is that internally they create a new object by copying from the one you provided.

Your intention most of the time will be to keep the object in the container, not to keep it in the container and have another copy outside of the container. To avoid the creation of temporary objects, use the methods that have emplace as a prefix, e.g. std::vector.emplace_back(), std::map.emplace() etc. Emplace methods receive the same arguments as the constructor of your class, but the object is created directly in the container.

Here is an example on how to do it:

std::vector<bit_field> numbers;
numbers.reserve(arr_len);

for (int i = 0; i < arr_len; i++) {
    numbers.emplace_back({i});
}

In our measurements, the above code tool 1143 ms to execute with call to push_back, 942 ms if we used push_back with std::move to force the move constructor and 926 ms if we used emplace_back.

Clang-tidy

There is a tool called clang-tidy, a part of the llvm suite, that allows you to enforce some of the rules we just mentioned.

You run the tool for each source code file and you provide the compilation flags in a similar way you would do for the regular compilation. There are plenty of rules to enforce, and they are placed into groups, for instance: portability, readability, performance or modernize. You can pick from the command line which rules to enforce. The full list of rules is available in clang-tidy documentation,

As far as performance is involved, the tool will check the following:

Rule name	Description
performance-for-range-copy	Warn if the program is accessing an element by value in a for range loop
performance-inefficient-vector-operation	Warn if the program is adding elements to a vector without reserving place in the vector beforehand
performance-noexcept-move-constructor	Warn if the move constructor is defined without `noexcept` keyword
performance-unnecessary-value-param	Warn if the argument of the function is passed by value when it could be passed by reference instead

Warnings related to performance detected by clang-tidy and covered in this article

This is not a complete list, there are also other warnings related to performance not covered in this post.

To be maximally useful, the tool should be made part of the continuous integration checks.

Final Thoughts

All of the above optimizations apply to classes. However, not all classes are created equal. There are classes that are cheap to copy, and there are classes that are expensive to copy. The compiler will try hard to optimize away empty constructors and destructor, unused variable assignment, etc. This is not always possible however, but playing by the above rules can help them a lot.

None of the above tips should hurt performance, but bear in mind that performance gains can depend very much both on the class itself and on the compiler.

Further Read

C++ Optimization Strategies and Techniques, Pete Isensee

Flat classes are simple where all the data is within the instance of the class. They don’t hold pointers, don’t allocate resources (dynamic memory, threads, files etc.) [↩]
Taken from: http://www.gotw.ca/gotw/055.htm [↩]

Excessive copying in C++ and your program’s speed