Decreasing the Number of Memory Accesses 1/2

We at Johnny’s Software Lab LLC are experts in performance. If performance is in any way concern in your software project, feel free to contact us.

When we are trying to speed up a memory-bound loop, there are several different paths. We could try decreasing the dataset size. We could try increasing available instruction level parallelism. We could try modifying the way we access data. Some of these techniques are very advanced. But sometimes we should start with the basics.

One of the ways to improve on memory boundness of a certain piece of code is the old-fashioned way: decrease the total number of memory accesses (loads or stores). Once a piece of data is in the register, using it is very cheap, to the point of being free (due to CPU’s ability to execute up to 4 instructions in a single cycle and their out-of-order nature). So all techniques that try to lower the total number of loads and stores should result in speedups.

Techniques

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Loop Fusion

If we have two consecutive loops that operate on the same dataset, fusing them into one loop would decrease the number of memory loads and stores and consequently improve performance. This transformation is called loop fusion or loop jamming. Consider the example of finding a minimum and maximum in an array. One of the ways to do it is using two separate loops:

double min = a[0];

for (int i = 1; i < n; i++) {
    if (a[i] < min) { min = a[i] };
}

double max = a[0];

for (int i = 1; i < n; i++) {
    if (a[i] > max) { max = a[i] };
}

Both loops touch process the same dataset. We could merge the two loops into one loop that finds both minimum and maximum. This cuts the number of data loads by two.

double min = a[0];
double max = a[0];

for (int i = 1; i < n; i++) {
    if (a[i] < min) { min = a[i] };
    if (a[i] > max) { max = a[i] };
}

We measure the effect of loop fusion in the experiments section.

A few notes about loop fusion:

  • Loop fusion is also a compiler optimization technique, so in theory the compiler can do it automatically. But this doesn’t happen often and when it does happen, this optimization has a tendency to break easily.
  • With regard to loop fusion, there are cases where loop fusion would fail to deliver speed improvements. If one or both loops are vectorized before the fusion, but the fused loop is not vectorized, then this transformation can result in a slowdown, not a speedup.
  • Loop fusion is a close relative of loop sectioning. The main difference is that loop fusion reuses the data while it is in registers, whereas loop sectioning reuses the data while it is in fast data caches. Therefore, a loop fusion is more memory efficient that loop sectioning, but fusing two loops is more complex than sectioning two loops. Also, preserving vectorization is easier with loop sectioning.

C++ Ranges

For folks using C++ STL algorithms, it is important to be aware of C++ ranges. The original STL library contains many algorithms, but they are not composable. Composability means that the result of one algorithm is fed into another algorithm. Consider the example1:

struct user_t {
   int id;
   std::string name;
   int age;
};

std::vector<int> get_ids_adult_users(std::vector<user_t>& users) {
    std::vector<user_t> adult_users;
    std::copy_if(std::cbegin(users), std::cend(users), std::back_inserter(adult_users), [](auto const& user) { return user.age >= 18; });
    std::vector<int> ids;
    std::transform(std::cbegin(adult_users), std::cend(adult_users), std::back_inserter(ids), [](auto const& user){ return user.id; });
    return ids;
}

The function get_ids_adult_users returns the vector containing the ids of all the users who are adults, i.e. whose age is 18 or more. To achieve this using STL, we use two algorithms: std::copy_if which filters out the minor users to create the list of adult users and std::transform to extract only ids from the vector of user_t class.

This approach forces the code to iterate the two collections instead of one: the first collection is the original collection of users, and the second collection is the temporary collection holding adult users. To avoid this, C++ developers have STL ranges at their disposal. Here is the same example rewritten using ranges:

std::vector<int> get_ids_adult_users(std::vector<user_t>& users) {
    auto ids = users | std::views::filter([](auto const& user){ return user.age >= 18; })
                     | std::views::transform([](auto const& user){ return user.id; });
    return {ids.begin(), ids.end()}
}

This code, apart from being cleaner, also performs fewer memory loads and memory stores. The filter adapter performs the filtering, and the result of filtering is directly fed into the transform adapter, element by element. This avoids running through the vector two times, and it is equivalent to loop fusion.

Kernel Fusion

Kernel Fusion is just loop fusion brought to a much higher level. Consider the following: imagine you have N image processing algorithms making an image processing pipeline. The output of algorithm X is the input of algorithm X+1. One of the ways to implement the pipeline is to have them run one after another, from zero to N-1. Each algorithm must finish before the next one starts.

With this kind of setup, processing an image will often be memory inefficient. The whole image is loaded from the slower parts of the memory subsystem to CPU registers, processed, then the output is written back to the memory subsystem. And this is repeated for each algorithm in the pipeline: load input, do some modifications, store output.

In this example, each algorithm is a kernel. And by kernel fusion, we mean that two algorithms are fused. An algorithm X generates a single pixel, and feeds it directly to the algorithm X + 1, then to the algorithm X + 2, etc. The benefit of such an approach is that all relevant data never leaves the CPU, which avoids unnecessary data movements.

However, there are two problems with this approach:

  • Writing such a processing pipeline is not easy, and this task needs to be planned from day one. In fact, there is a special programming language, Halide, that is designed for writing fast and portable image and tensor processing codes.
  • The types of algorithms that would benefit from this approach must be memory bound, i.e. light in computation. Algorithms that are computationally intensive would benefit little or not at all, because computational bottleneck will hide the memory latency.

If you happen to know more about this topic, please leave a comment or send me an e-mail, I would like to know more as well (and also keep this post updated).

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Loop Fusion inside a Loop Nest

Loop fusion inside a loop nest (also called unroll-and-jam) is a more advanced type of loop fusion, applicable to loop nests. The technique is simple to apply to many sorting algorithms, but not only them. Consider the selection sort algorithm. Here is the source code:

for (int i = 0; i < n; i++) {
    double min = a[i];
    int index_min = i;
    for (int j = i + 1; j < n; j++) {
        if (a[j] < min) {
            min = a[j];
            index_min = j;
        }
    }

    std::swap(a[i], a[index_min]);
}

The algorithm is very simple: in the i-th iteration, it scans the elements from i to n to find the smallest value, and puts the value in place a[i].

For loop fusion inside a loop nest, we need to unroll the outer loop two or more times to make the fusion potential explicit. Here is the selection sort algorithm, where the outer loop has been unrolled two times:

for (int i = 0; i < n; i+=2) {
    min = a[i];
    index_min = i;
    for (int j = i + 1; j < n; j++) {
        if (a[j] < min) {
            min = a[j];
            index_min = j;
        }
    }

    std::swap(a[i], a[index_min]);

    min = a[i + 1];
    index_min = i + 1;
    for (int j = i + 2; j < n; j++) {
        if (a[j] < min) {
            min = a[j];
            index_min = j;
        }
    }

    std::swap(a[i + 1], a[index_min]);
}

The first and second inner loops are iterating over almost identical datasets. There are a few statements preventing a simple loop fusion, but they can be moved away. With some tricks, we fused together the two inner loops. Here is the result:

for (int i = 0; i < n; i+=2) {
    min_1 = a[i];
    min_2 = a[i + 1];
    index_min_1 = i;
    index_min_2 = i + 1;

    // min1 must be smaller than min2
    // Swap two values if not true
    if (min2 < min1) {
        std::swap(min1, min2);
        std::swap(index_min_1, index_min_2);
    }

    // Look-up two smallest values in array.
    // The smaller is kept in min1, the larger
    // in min2.
    for (int j = i + 2; j < n; j++) {
        if (a[j] < min_2) {
            if (a[j] < min_1) {
                min_2 = min_1;
                index_min_2 = index_min_1;
                min_1 = a[j];
                index_min_1 = j;
            } else {
                min_2 = a[j];
                index_min_2 = j;
            }
        }
    }

    std::swap(a[i], a[index_min_1]);
    std::swap(a[i + 1], a[index_min_2]);
}

The loop fusion in this case is not trivial, but it is possible. The inner loop is looking up the two smallest values in the array to put them into the beginning of the section currently being processed. The total number of memory accesses is about two times lower compared to the simple selection sort algorithm.

Note: This optimization closely resembles outer loop vectorization, where the outer loop is running several instances of the inner loop in parallel.

Decreasing the Number of Data Passes by Doing More Work

As we have seen in previous examples, loop fusion allows the elimination of some memory accesses by fusing two neighboring loops running over the same data. But this is not the only way. Many ideas that decrease the number of data passes will result in fewer memory accesses and better performance.

Consider the simple selection sort algorithm from the previous section. The original algorithm was looking for the minimum in the remaining array. To decrease the number of total memory accesses, we could scan for both minimum and maximum. We would then put the minimum element at the beginning of the remaining array and the maximum element at its end. The algorithm looks like this:

for (int i = 0, j = n - 1; i < j; i++, j--) {
    double min = a[i];
    int index_min = i;
    double max = a[j];
    int index_max = j;
    for (int k = i; k < j; k++) {
        if (a[k] < min) {
            min = a[k];
            index_min = k;
        }
        if (a[k] > max) {
            max = a[k];
            index_max = k;
        }
    }

    std::swap(a[i], a[index_min]);

    if (a[index_min] != max) {
        std::swap(a[j], a[index_max];
    } else {
        std::swap(a[j], a[index_min]);
    }
}

This version has fewer iterations, and therefore fewer memory loads, but inside each iteration it does twice as much work. From the algorithmic point of view, it performs roughly the same number of operations as the first version. But, performance-wise it is more efficient. In the experiments section we will see how much efficient.

Another important algorithm with a similar reduction in memory accesses is a variant of Quicksort called Multi-Pivot Quicksort. Before explaining MPQ, let’s give a quick reminder about Quicksort. The Quicksort algorithm consists of two steps. The first step is array partitioning: picking a pivot and then partitioning the array into a left part that is smaller than the pivot and a right part that is larger. The second step is the recursive call: the partitioning is performed recursively on the left and right part of the input array, until the size of the array becomes 1.

Recursive array partitioning in Quicksort2

With Multi-Pivot Quicksort, the partitioning step is performed by picking two pivots and partitioning the array into three parts. Then partitioning is recursively performed on the left, middle and right part of the array. If an array has N elements, with plain Quicksort we expect an average number of memory accesses for each element of the array to be O(log2 N). With Multi-Pivot Quicksort, the average number of memory accesses would be O(log3 N).

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

Storing Small Arrays in Registers

If the compiler knows a size of an array at compile time, and this size is small, it can store the array in registers. But, if the size of array is not known at compile time, then the compiler cannot perform this optimization. You could, however, manually perform this optimization in your code. More info on post Horrible Code, Clean Performance.

Experiments

All the experiments were executed on Intel Core i5-10210U CPU, 4 cores, 2 threads per core. Each core has 32kB of L1i cache, 32kB of L1d cache and 256kB of L2 cache. There is a total of 6MB of L3 cache shared by all the cores. The source code used for all experiments is available here.

Loop Fusion

The first experiment is related to loop fusion. We measure the runtime of two separate loops and compare it with the runtime of the fused loop. The examples we use for testing are loops that calculate the min and max, first in two separate loops, then merged.

Here are the runtimes (five repetitions, average values):

Array SizeOriginalFused
32 kBRuntime: 0.159 s
Instr: 671 M
CPI: 0.793
Runtime: 0.068 s
Instr: 402 M
CPI: 0.665
256 kBRuntime: 0.136 s
Instr: 671 M
CPI: 0.800
Runtime: 0.068 s
Instr: 402 M
CPI: 0.667
2 MBRuntime: 0.136 s
Instr: 671 M
CPI: 0.801
Runtime: 0.068 s
Instr: 402 M
CPI: 0.667
16 MBRuntime: 0.171 s
Instr: 671
CPI: 0.855
Runtime: 0.085 s
Instr: 402 M
CPI: 0.739
128 MBRuntime: 0.175 s
Instr: 671 M
CPI: 0.855
Runtime: 0.086 s
Instr: 402 M
CPI: 0.742

The table shows, that, on average, the fused version is about two times faster than the original version. The fused version also executes less instruction and is more hardware efficient (Cycle per Instruction metric is better). Fewer instructions are executed because (1) there is only one loop instead of two, so this means fewer iterator increases, iterator comparisons, jumps and (2) removed one redundant load, since the piece of data is already in a register.

Selection Sort

We are going to experiment with selection sort, as described in the section about decreasing the number of data passes. To measure the effect we will compare the version of the selection sort where we are finding minimum only vs the version where we are finding both minimum and maximum. The first version scans the remaining part of the array to find the minimum and put it at the beginning. The second version scans to find both the minimum and maximum, and places them at the beginning and at the end of the array respectively. We expect the second version to be faster because it needs to perform two times fewer scans.

Here are the numbers (five runs, average numbers):

Array SizeOnly MinMin and Max
8 kB
16384 repeats
Runtime: 8.74 s
Instr: 60.3 B
CPI: 0.57
Runtime: 4.44 s
Instr: 43.2 B
CPI: 0.405
32 kB
1024 repeats
Runtime: 8.72 s
Instr: 60.1 B
CPI: 0.573
Runtime: 4.36 s
Instr: 43.0 B
CPI: 0.401
128 kB
64 repeats
Runtime: 8.69 s
Instr: 60.1 B
CPI: 0.572
Runtime: 4.37 s
Instr: 43.0 B
CPI: 0.402
512 kB
4 repeats
Runtime: 8.69 s
Instr: 60.1 B
CPI: 0.572
Runtime: 4.39 s
Instr: 42.9 B
CPI: 0.405

The Min and Max version is both faster and more hardware efficient in all cases. It also executes fewer instructions, because both the inner and the outer loops are shorter, so they perform fewer memory accesses.

Conclusion

Loop fusion is a simple and powerful technique to decrease the total number of memory accesses in the program. Although we described here the simplest version, loop fusion is possible even if datasets overlap partially.

In general, any idea that would result in a decrease of memory accesses has the potential to speed up your code. If you have any ideas that are not mentioned in this post, feel free to leave a comment so we can update this post.

In the next post we will talk about another way to decrease the total number of memory accesses. These memory accesses are “unwanted”, in the sense that the compiler has created them without your intention: memory accesses related to pointer aliasing and memory accesses related to register spilling. Until soon!

Do you need to discuss a performance problem in your project? Or maybe you want a vectorization training for yourself or your team? Contact us
Or follow us on LinkedIn , Twitter or Mastodon and get notified as soon as new content becomes available.

  1. Taken from here []
  2. Source []

6 comments / Add your comment below

  1. For your section on “Kernel Fusion” you probably want to look into strip-mining (which is somewhat similar to tiling). Instead of processing full size images at every stage in your pipeline you instead process a strip of N rows x width of image. N is typically chosen such that any strip buffers that you use fit within L1/L2 cache. You then iterate through the pipeline operating on N rows of data at a time, re-using your strip buffer(s) as needed. The only full size buffers are the initial source image(s) and the final output image(s). As well as improving cache locality this also keeps the functions for each pipeline operation independent of each other, making changes much less brittle than with explicitly fused kernels.

    1. Loop strip mining (I call it loop sectioning, but its the same thing) was mentioned in the post about changing data access.. Both techniques are very similar: they try to reuse the data while it is in fast part of the memory subsystem before it has to be evicted to a slower part. The key difference is that with loop fusioning approach, the data never leaves registers, whereas in loop sectioning approach, the data never leaves L1 cache. So loop fusioning approach are faster, but more difficult to implement.

  2. Nice summary article.
    The code you show for stl::ranges does not compile for three reasons:
    1. It’s `views` not `view`
    2. Ranges are in the namespace stl (or use the `stl::views` shorthand)
    3. The return type is a range, not a vector. With C++23 we could use `std::views::to()`.

    Here is a C++20 version that compiles:
    “`cpp
    std::vector get_ranges(const std::vector& users) {
    auto ids = users | std::views::filter([](auto const& user){ return user.age >= 18; })
    | std::views::transform([](auto const& user){ return user.id; });
    return {ids.begin(), ids.end()};
    }
    “`
    Also, if you mention this stuff, please provide more data and explain what happens here.
    If we throw this into [QuickBench](https://quick-bench.com/q/-DOx0M6p0X-l34e55F8gaFFPc_M), we see that it beats your algorithm example by a mile, which is nice. However, ranges are still lagging behind the naive implementation.

    1. Thank you for note regarding ranges in C++. I am updating the example so it compiles. I am not very familiar with ranges, so I cannot comment on syntax or quality. Also, there have been many talks about ranges, so I don’t think mentioning them more than this would bring something for a common reader.

      It would be interesting to analyze why naive approach is faster.

  3. It seems like Richard C. Waters (1989a, 1989b) called a very similar algorithm “pipelining”. His focus was on what he called pre-order traversal, so no arbitrary permutations, e.g. sorting and reversing. Permutations require buffering all of the data or destructively modifying an array.

    Personally, I already built a compiler for using Waters’ pipeliner for faster loops in JavaScript, see link below.

    It seems like pairing it with Fortran 90 could provide easier vectorization, because Fortran seems to ignore pointer aliasing. This “only” leaves the problem of the Fortran C FFI.

    Let us have a chat. This topic interests me as well. Got a 30x speed up in a simulation study in Rlang with better cache pipelining. Difference was a couple hours vs several weeks of run-time. It is a very interesting topic indeed.

    Kind regards,
    Andrew

    Waters(1989a):
    https://dspace.mit.edu/handle/1721.1/6035

    Waters(1989b):
    https://dspace.mit.edu/handle/1721.1/6031

    JS pipeliner:
    https://dapperdrake.neocities.org/faster-loops-javascript

    1. Hi Andrew! Thanks for the feedback. I don’t know what you want to chat about, but feel free to send me a message using the contact form and I will try to answer any questions. We could also meet for a live session if you want to discuss something.

Leave a Reply

Your email address will not be published. Required fields are marked *