A story of a very large loop with a long instruction dependency chain.

In our experiments with memory access patterns, we have seen that good data locality is key to good software performance. Accessing memory sequentially and splitting the data set into small pieces that are processed individually both improve data locality and software speed. In this post, we will present a few techniques to improve the…
We use a matrix multiplication example to investigate loop interchange and loop tiling, two techniques that can speed up programs that work with matrices.
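To make the two techniques concrete, here is a minimal sketch in C, assuming square row-major matrices of doubles. The function names and the `TILE` value are hypothetical and chosen only for illustration; they are not taken from the post itself.

```c
#include <stddef.h>
#include <string.h>

#define TILE 32 /* hypothetical tile size; the best value depends on the cache */

/* Baseline i-j-k order: the inner loop reads b with stride n (column-wise). */
void matmul_ijk(size_t n, const double *a, const double *b, double *c) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Loop interchange (i-k-j): the inner loop now walks b and c sequentially. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c) {
    memset(c, 0, n * n * sizeof *c);
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
}

/* Loop tiling: process TILE x TILE blocks so the working set stays small. */
void matmul_tiled(size_t n, const double *a, const double *b, double *c) {
    memset(c, 0, n * n * sizeof *c);
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp block ends so n need not be a multiple of TILE. */
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
            }
}
```

All three functions compute the same product; only the order in which memory is touched differs. Whether the interchanged or tiled version is actually faster, and by how much, depends on the matrix size, the compiler, and the cache hierarchy, so it is worth measuring on the target machine.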