May 2020

Data Cache and Programming

When listening to talks on optimizing code to run fast (usually low-level stuff), you might hear the term "cache miss" come up. It seemed important, so I decided to dig into it a bit. I wanted to find out three things.

  1. Is it important?
  2. What is it?
  3. Is it important even in the frontend?

The answer to the first and last question turns out to be yes. Let's see why.

An example

We start by looking at a very simple example: we want to sum up the numbers in a matrix. There are fundamentally two approaches: by row or by column. We will use this example throughout the post, so stick with it.

Here is a simple 3×3 matrix. To make life easier, we will keep width and height the same.

matrix = [
    [ 1, 2, 3 ],
    [ 4, 5, 6 ],
    [ 7, 8, 9 ],
]

By row means we start walking along the first row, adding each number to a total: 1... 2... 3... 4... By column means taking the first item in the first row, then the first item in the next row, and so on: 1... 4... 7... 2...

In code they look almost identical.

function by_row(matrix, n)
    sum = 0
    for i in 0..n
        for j in 0..n
            sum += matrix[i][j]
    return sum

and

function by_col(matrix, n)
    sum = 0
    for i in 0..n
        for j in 0..n
            sum += matrix[j][i]
    return sum

The only difference is whether we access the matrix with matrix[i][j] or matrix[j][i].

Here is a question: which is faster? And another: by how much? To find out, I benchmarked both of them in Rust, using the excellent Criterion. The code for the row-first function looks similar to the pseudocode above.

fn by_row(matrix: &Vec<Vec<u32>>, n: usize) -> u32 {
    let mut total = 0;

    for i in 0..n {
        for j in 0..n {
            total += matrix[i][j];
        }
    }

    total
}

The by_col version looks like you'd expect, with only i and j reversed.
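
For reference, a minimal Criterion harness for the two functions might look like the sketch below. This is my own setup, not necessarily the exact benchmark that produced the numbers in this post; it assumes by_row and by_col as defined above, criterion as a dev-dependency, and arbitrary matrix contents.

use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn bench_sums(c: &mut Criterion) {
    let n = 1000;
    // Build an n×n matrix; the actual values don't matter here.
    let matrix: Vec<Vec<u32>> = (0..n)
        .map(|i| (0..n).map(|j| (i * n + j) as u32).collect())
        .collect();

    // black_box keeps the compiler from optimizing the calls away.
    c.bench_function("by_row", |b| b.iter(|| by_row(black_box(&matrix), n)));
    c.bench_function("by_col", |b| b.iter(|| by_col(black_box(&matrix), n)));
}

criterion_group!(benches, bench_sums);
criterion_main!(benches);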

The benchmark was run on my ThinkPad X1 Carbon. The results:

[Line chart: running time of by_row and by_col as the matrix size grows]

The size on the x-axis refers to the number of rows and columns. So a size of 2000 means a 2000×2000 matrix.

Summing a 4000×4000 matrix takes an order of magnitude more time if you go column by column rather than row by row. Without doing any further benchmarking, I also feel fairly confident saying that it isn't getting any better as the size of the matrix increases.

I was not taught any of this at university. I don't think many students are.

But a 10x difference in performance from swapping the order of indices seems important to me. It can be either a big win or a big loss, depending on where you are coming from.

That was the answer to the first question: yes, it is important.

What is going on?

Let's try to understand what is happening. The basic concepts are actually fairly simple (and of course there are asterisks everywhere that I will conveniently ignore). We'll start by looking at how the matrix is laid out in memory. Since all the arrays have a known size, they are just laid out linearly. If we use just five bits to represent each number, the matrix above (1 to 9) will hold the values 00001 to 01001.

matrix = [
    [ 00001, 00010, 00011 ],
    [ 00100, 00101, 00110 ],
    [ 00111, 01000, 01001 ],
]

And in memory, it's just laid out one after the other.

000010001000011001000010100110001110100001001
╰─┬─╯╰─┬─╯╰─┬─╯╰─┬─╯╰─┬─╯╰─┬─╯╰─┬─╯╰─┬─╯╰─┬─╯
  1    2    3    4    5    6    7    8    9

Going row by row means going left to right, and going column by column means jumping around a lot.
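
To make the strides explicit, here is the same pair of sums written against a single flat, row-major buffer. This is a sketch with names of my own, not the benchmarked code (which uses nested vectors), but the access pattern is the same.

// Sum a flat, row-major n×n matrix row by row.
fn by_row_flat(data: &[u32], n: usize) -> u32 {
    let mut total = 0;
    for i in 0..n {
        for j in 0..n {
            // Consecutive iterations touch consecutive addresses.
            total += data[i * n + j];
        }
    }
    total
}

// Sum the same buffer column by column.
fn by_col_flat(data: &[u32], n: usize) -> u32 {
    let mut total = 0;
    for i in 0..n {
        for j in 0..n {
            // Consecutive iterations jump n entries ahead each time.
            total += data[j * n + i];
        }
    }
    total
}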

The reason this matters is the data cache. Quoting Wikipedia:

“When trying to read from or write to a location in main memory, the processor checks whether the data from that location is already in the cache. If so, the processor will read from or write to the cache instead of the much slower main memory.”

The “cache” mentioned here is the data cache of the CPU. It is used specifically to decrease the time it takes to load new data. Reading from this cache is many times faster than reading from main memory, and not finding the data you need in there is what is called a “cache miss.” How many times faster is it? Well, this is where some of those asterisks come in. There are several caches, called L1, L2, and L3, but between 10 and 100 times faster seems to be a decent enough estimate.

But this is not enough to understand what is happening, because in the example we are not reusing any of the data. Once we've read an entry in the matrix, we never return to it. So how are we getting more cache misses when going column by column rather than row by row? Shouldn't everything be a cache miss?

The key is that data is inserted into the cache in blocks of a fixed size, called cache lines. Reading a single entry therefore pulls its neighbors into the cache along with it, so when we walk the array linearly, most of the following reads have already been paid for. On top of that, the hardware tries to predict what data you will need next, and that prediction works out much better when the data is accessed linearly by the algorithm. Jumping back and forth in the array makes it much harder for the computer to figure out what you are going to want next. Going from one end to the other just works out much better.
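
Some back-of-the-envelope arithmetic shows how lopsided this gets. The numbers below assume a 64-byte cache line, which is typical for modern hardware but is an assumption, not something I measured on this machine.

fn main() {
    let line_size = 64; // bytes per cache line (assumed)
    let entry_size = std::mem::size_of::<u32>(); // 4 bytes
    // One cache line holds 16 entries, so going by row,
    // one miss is followed by 15 cheap hits.
    println!("entries per line: {}", line_size / entry_size);

    // Going by column on a 4000-wide matrix, consecutive accesses
    // are 16,000 bytes apart: every single read lands on a new line.
    let n = 4000;
    println!("column stride: {} bytes", n * entry_size);
}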

This is as far as my knowledge goes on this topic. There are holes still, but I am not done reading about it either. However, even with this little knowledge, it is possible to start formulating theories about what fast code looks like and what it doesn't, which can form the basis for benchmarking. What I find interesting, and I believe I will return to this in a future post, is that it also matches my mental model of programming as a set of transformations on data, rather than different actors (or objects) having their own will (and state).

What about JavaScript?

I do a lot of my coding in JavaScript (and languages that compile to it), so I was curious about how this super dynamic, quirky language would do. JavaScript does not have arrays the same way other languages have arrays. For instance, this is perfectly legal:

const a = []
a.push(1)
a.push("fish")
a[42] = class Foo { }

Magical things happen when JavaScript is run, though, so I wouldn't take anything for granted. I wrote a jsperf test (on a 4000×4000 matrix) to try it out. Feel free to run it for yourself, but here are my results.

Results of Matrix Sum (Ops/sec)

            By row            By column
Firefox     36.99 (±22.13%)   2.00 (±1.45%)
Chrome      49.28 (±5.13%)    6.68 (±1.38%)

This is operations per second, so higher is better, and again we see about an order of magnitude difference. My best guess is that the engines realize that only numbers (maybe even only integers) are being inserted into the arrays and construct the underlying structure accordingly.

This answers the third question: yes, this is absolutely also important in a high-level language like JavaScript, if fast code is important.

Conclusion

What all of this means, at least to me, is that writing fast code is not easy. Building up an intuition about what is happening after the language is compiled is critical to getting this right. So is benchmarking.


Note

The code used for benchmarking is not idiomatic Rust. This function is safer, faster, and smaller:

fn sum(matrix: &[Vec<u32>]) -> u32 {
    // Taking a slice rather than &Vec is the conventional signature;
    // &matrix still works at the call site through deref coercion.
    matrix.iter().map(|row| row.iter().sum::<u32>()).sum()
}

Why it's faster I can't explain (yet), but I wanted there to be as little difference between the two benchmarked functions as possible, and this version makes that harder.