Ever since taking UCLA’s computer architecture class (CS33) I was curious why loop tiling worked. Loop tiling is a tactic used to speed up matrix multiplication, apparently running matrix multiplication in a weird order somehow speeds up the multiplication because cache? It was a bit odd as it was the first and only time I really found a need for cache conscious programming. I implemented it wherever needed for the class but I never fully understood it and it’s been bugging me ever since, so here we are. Here I try to explain it to freshman me in CS33, and hope its helpful for some.
Let’s first see how classic matrix multiplication works and why it is slow. To keep things simple, for now we will have each row be one cache line and have the matrices be square.
This shows the classic matrix multiplication algorithm, calculating one output cell at a time and moving on. The blue bars between certain cells are cache line ends/starts (somewhat hard to see since each cache line is a full row here), each section between bars represents one cache line. For now, cache line size is equal to the length of a row, so it may be somewhat hard to see.
Turns out our ability to process data on a CPU has far outrun our ability to read memory. If we had to wait on every memory access our CPUs would be many many times slower these days. As a solution CPU manufacturers came up with memory caching. Whenever accessing memory, we grab an entire block of bytes which contains the memory address we are trying to access, say 64 bytes, and bring them into the CPUs cache. This increases speed based on two assumptions:
Temporal locality: Memory addresses most recently accessed are most likely to be accessed again.
Spatial locality: Memory addresses near a recently accessed one are most likely to be accessed again.
And this should make sense, within a program, the same variable will likely be referenced multiple times and many programming languages try to keep memory addresses for variables recent near each other on the stack. Arrays are similarly stored with their elements consecutively in memory.
Usually cache has a Least Recently Used (LRU) policy on which cache is bumped to make space from new cache. Oldest accessed cache lines are the ones that are removed, to satisfy temporal locality best.
So what makes matrix multiplication this way so slow?
In most high performance languages, 2D arrays are stored as an array of 1D arrays, in other words every row is consecutive in memory and every cell is too. Multiple accesses within rows likely are in the same cache block, accesses across rows are likely not.
So in our second matrix here, we are doing accesses by column. Burning through our cache, getting a new access twice. Our cache does not have enough space to store one cache line for the height of the matrix, so instead on every loop it has to pull from cache again. And herein lies the slowness. The primary cost is in the second matrix.
First thought that comes to mind is “can we at least have each cache line on the second matrix accessed twice in succession?”
And of course we can.
It does work a lot better. Because we zig zag down two rows at a time, we are able to halve the number of times we descend down the second matrix. Each descent costs a number of cache misses, halving these speeds us up by a lot.
A natural extension of this is of course simply to go for more than 2. In fact why not go for the full row?
We are getting very optimized at this point. As we traverse a full row rather than just two from the above, we only need a row lookup on the second matrix for every row in the first. As such we have more than decimated our cache misses.
But can we get greedier.
Our cache holds 8 cache lines. At one time, we need to hold one row in the first matrix, one row in the middle, and one row in the result. We could with minimal cost do two rows at a time for the first matrix, it would only cost us having one more row in cache for the first matrix and one for the result. That way those neighboring rows can share cache lookups from part 1, while we are still using a total of 2 + 2 + 1 = 5 cache lines as a time.
And as we have 3 additional cache lines left, we can repeat the same trick once more, 3 rows in the first matrix, 3 rows in the result, one in the second matrix gives us 3 + 3 + 1 = 7, still less than the 8 our cache holds. If we try to go to 4 rows, we end up with 9 cache lines needed, more than our 8 sized cache.
And this does turn out to be the most best case for this matrix, nothing is faster. Of course cache lines don’t entirely line up with the start of a matrix in real life, but this means in other cases cases with 1 row less tend to work better in my simulations, imperfect alignment of cache will sometimes require that first matrix row read need 2 cache slots rather than 1 to stay.
So where is the tiling? This just seems to use multiple rows.
The big problem here separating this from a realistic example is that an entire row here is a cache line. In reality L1 cache lines are surprisingly small, about 32 bytes each, or 16 to 8 elements using ints or longs. When multiplying matrices of 100 elements across, we run into more significant issues.
One key thing to consider is that when we have matrices this big, we can consider splitting them up. Consider a square matrix, split it up into four matrices.
If we multiply two such matrices together, each cell in the result merely the sum of matrix multiplication cell operations as shown above.
So if we have a high cache hit rate for each of the two matrix multiplication cell operations, we have a high cache hit rate overall.
Suppose a case where we have cache lines split down the middle of the matrix as in the above. It turns out the following is the optimal algorithm for that.
Of course we can similarly generalize such logic into splitting matrices 9, 16, etc ways.
So generally it seems the optimal width of the tiling is the length of the cache on the array. The height of the tile meanwhile is generally dictated by the amount of cache we actually have available, but also by the cache boundaries of the first matrix as shown here. Loop tiling of course is irrelevant if you have enough cache to keep all the matrices in memory.
Although this assumes matrices are often nice and cache borders are conveniently placed, which is not always true. For instance the following is the optimal algorithm for this 14 by 14 matrix multiplication.
However simple tiling with square tiles of cache line size still give pretty good results.
In conclusion, we do see a general motivation for tiling here, even when the optimal tile size problem is a lot more difficult to argue. But in this case perfect is the enemy of good, we can likely get 90% of the improvements we need simply tiling by cache size.
The code for this is available open source on my github page here. Feel free to email me at maksymovi@maksym.pro for any comments, suggestions, feedback, or anything of interest. The code is a little crude so feel free to let me know if you find any problems with it.
Click anywhere to start