Lost in the Cache: The Hidden Forces Slowing Down Your Code


A Tale of Traversals

Let’s take a trip back in time to one of your very first data structure tutorials. You’re introduced to the humble 1D array. You solve a few cool problems, and before long, you’re feeling confident navigating its straightforward structure. But just as you’re getting comfortable, along comes the 2D array, adding an extra dimension, literally and figuratively, to your problem-solving adventures (I hope this memory is old enough that you’re not the only one feeling nostalgic here).

You quickly grasp the concept of a 2D array and start tackling problems with it. One such problem asks you to count the occurrences of a specific target number in an n×n 2D array of integers. Sounds simple, right?

count := 0
for i := 0; i < n; i++ {
    for j := 0; j < n; j++ {
        if nums[i][j] == target {
            count++
        }
    }
}

You write this solution and feel good about it. But then, your friend working on their machine chooses a slightly different approach. Instead of going row by row (row traversal), they prefer going column by column (column traversal):

count := 0
for i := 0; i < n; i++ {
    for j := 0; j < n; j++ {
        if nums[j][i] == target {
            count++
        }
    }
}

You glance at their code and think, “It’s the same thing, just rearranged. The output should still be the same, right?”

Well, actually… Not really the same.

The final value of count will be the same, but will the two snippets take the same amount of time to execute? Intuitively, we might think so; after all, both traverse the same number of cells. However, that alone is not enough to guarantee identical execution times. When we compare the execution of the row traversal snippet against the column traversal snippet, the difference becomes evident.

Strangely enough, the row traversal is significantly faster than the column traversal. And as the size of the 2D array increases, the performance gap between the two becomes increasingly noticeable.
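
If you want to reproduce the gap yourself, a rough benchmark sketch along the following lines will do. The matrix size, the target value, and the countRows/countCols helper names are assumptions for illustration, not part of the snippets above; save it as a *_test.go file and run `go test -bench=.`.

// traversal_bench_test.go: a rough sketch for timing both traversals.
package main

import "testing"

const n = 4096 // matrix dimension; an assumption for illustration

// nums is an n×n matrix of zeros; its contents don't matter for timing.
var nums = func() [][]int {
    m := make([][]int, n)
    for i := range m {
        m[i] = make([]int, n)
    }
    return m
}()

// countRows mirrors the first snippet: row by row.
func countRows(target int) int {
    count := 0
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            if nums[i][j] == target {
                count++
            }
        }
    }
    return count
}

// countCols mirrors the second snippet: column by column.
func countCols(target int) int {
    count := 0
    for i := 0; i < n; i++ {
        for j := 0; j < n; j++ {
            if nums[j][i] == target {
                count++
            }
        }
    }
    return count
}

// sink keeps the compiler from discarding the results.
var sink int

func BenchmarkRowTraversal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = countRows(42)
    }
}

func BenchmarkColTraversal(b *testing.B) {
    for i := 0; i < b.N; i++ {
        sink = countCols(42)
    }
}

On most machines, the column version reports a noticeably higher ns/op, and the gap widens as n grows.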

Why is this happening? We’ll circle back to this mystery shortly. For now, let’s look at one more example for the sake of our younger selves, since we’re already breaking some of the basics they were taught.

Parallelism to the Rescue?

To make things even more interesting, you decide to scale the problem. Instead of having one thread do all the work, you divide the matrix into P parts, assigning each part to a thread to process. The plan? Leverage parallelism for faster performance.

Your code might look something like this:

  1. Split the matrix into parts.

  2. Assign each part to a worker thread.

  3. Use a shared array to store the counts calculated by each thread.

  4. Wait for all threads to finish, then sum up the counts.

// matrix is the n×n grid flattened into a single slice: element (i, j) lives at index i*n + j
result := make([]int32, numWorkers)
var wg sync.WaitGroup

// Calculate chunk size (number of rows) for each worker
chunkSize := n/numWorkers + 1

// Launch workers
for p := 0; p < numWorkers; p++ {
    wg.Add(1)
    go func(workerID int) {
        defer wg.Done()

        // Calculate this worker's portion of the matrix
        myStart := workerID * chunkSize
        myEnd := int(math.Min(float64(myStart+chunkSize), float64(n)))

        // Count target numbers in this worker's portion
        for i := myStart; i < myEnd; i++ {
            for j := 0; j < n; j++ {
                if matrix[i*n+j] == target {
                    result[workerID]++
                }
            }
        }
    }(p)
}

// Wait for all workers to finish
wg.Wait()

// Sum up the results
total := 0
for p := 0; p < numWorkers; p++ {
    total += int(result[p])
}
Increasing the number of threads should mean better performance: the same work split across more workers should finish sooner. But it did not.

As the graph below shows, scaling does not really improve performance; in some cases, adding threads actually made things worse.

What if our junior selves, whom we have just disappointed, decided to add a touch of their own? Instead of updating the count directly in the shared array, they used a local variable to accumulate the count within each thread and wrote it to the shared result only after the loop.

// Same code as before ...

var localCounter int32

// Count target numbers in this worker's portion
for i := myStart; i < myEnd; i++ {
    for j := 0; j < n; j++ {
        if matrix[i*n+j] == target {
            localCounter++
        }
    }
}

// Update the shared result array only after the loop
result[workerID] = localCounter

// ... same rest of the code

From a purely technical perspective, this change only adds a tiny bit of extra space in the form of a seemingly unnecessary variable. Still, let’s monitor its performance and see how it behaves.

This approach actually improved performance. Even more interesting, as the number of threads increased, the performance continued to scale effectively, unlike our first parallel version, which struggled to deliver similar results.


Enter the CPU Cache: The Invisible Bottleneck

At the heart of this mystery lies the CPU cache. Modern CPUs are incredibly fast, capable of executing billions of instructions per second. But there’s a catch: the CPU needs data to process those instructions, and accessing memory is slow compared to the CPU’s speed. While the CPU operates in nanoseconds, accessing main memory can take hundreds of cycles. During this time, the CPU sits idle, waiting for data like a sprinter frozen at the starting line because the baton hasn’t arrived yet.

To bridge this gap, CPUs use a small but fast memory called cache. Cache is much closer to the CPU and stores frequently used data, reducing the time needed to fetch it. What makes the cache so powerful? It’s all about locality:

  • Temporal Locality: If you’ve accessed a piece of data recently, chances are you’ll need it again soon.

  • Spatial Locality: If you’ve accessed a piece of data, nearby memory locations are likely to be accessed next.

Think of it as reading a book. If you’re on page 42, chances are you’ll soon read page 43 (spatial locality) or revisit something on page 42 to cross-check a detail (temporal locality).
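
To put the analogy in code, here is a tiny, contrived Go sketch; the function and slice are purely illustrative, not something from the examples above.

// A contrived sketch purely to label the two kinds of locality.
func sumAll(data []int) int {
    sum := 0
    for i := 0; i < len(data); i++ {
        // Spatial locality: data[i+1] almost certainly sits in the same
        // cache line as data[i], so the next iteration is a cheap hit.
        sum += data[i]
        // Temporal locality: sum and i are touched on every iteration,
        // so they stay hot (in cache, or more likely in registers).
    }
    return sum
}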

Once More to Our Traversals Story

Let’s revisit the row vs. column traversal example. In row-major order, elements in the same row are stored consecutively in memory. When you traverse row by row, the program accesses memory sequentially, leveraging spatial locality.

However, column traversal jumps between rows for every iteration, resulting in poor spatial locality and frequent cache misses. Every cache miss forces the CPU to fetch the required data from slower memory, significantly increasing runtime.

This is why row traversal outperforms column traversal, despite processing the same number of elements.
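
To make the layout concrete: if an n×n matrix is flattened into a single slice in row-major order, as the threaded example above does, element (i, j) ends up at index i*n + j. A minimal sketch, assuming that flattened representation:

// Row-major layout: element (i, j) of an n×n matrix stored in a flat slice
// sits at index i*n + j.
func at(matrix []int32, n, i, j int) int32 {
    return matrix[i*n+j]
}

// Row traversal (i fixed, j increasing) walks consecutive indices: stride 1,
// so consecutive reads usually hit the same cache line.
// Column traversal (j fixed, i increasing) jumps n elements per step: stride n,
// so almost every read lands in a different cache line.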

The Hidden Enemy of Multithreading

When we think of multithreading, we expect better performance — more threads should mean faster execution. However, as we saw in our example, adding more threads didn’t always help. In fact, in some cases, performance actually got worse. Why? The culprit in our case is something called false sharing, a sneaky performance killer that occurs due to the way modern CPUs handle cache memory.

What is False Sharing?

To understand false sharing, we need to look at how CPUs store and share data across multiple cores.

Modern processors don’t fetch individual variables from RAM; instead, they work with cache lines: small chunks of memory that store multiple variables at once. When a CPU core accesses a variable, it loads the entire cache line containing that variable into its local cache. If multiple threads modify different variables that happen to be in the same cache line, they unintentionally interfere with each other.

Even though each thread may only be working with its own data, the CPU treats the cache line as a shared resource. Every time a thread modifies a variable in that cache line, the entire line must be invalidated and reloaded, forcing unnecessary memory traffic between cores. This is known as false sharing, and it can significantly degrade performance.

Back to Our Threading Story

Now, let’s revisit our earlier story where we tried to speed up counting in a 2D array by dividing the work among multiple threads. Initially, we stored each thread’s count in a shared array. However, performance didn’t scale well: adding more threads didn’t help, and sometimes even made things worse.

This was a classic case of false sharing. Each thread was updating its own index in the shared array, but those values were packed into the same cache line. As a result, every update caused cache invalidations, leading to constant back-and-forth synchronization between CPU cores.
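
The arithmetic makes this plausible. Cache lines on most mainstream CPUs are 64 bytes (an assumption here; Go itself makes no such guarantee), so with, say, eight workers, the whole result slice of int32 counters occupies far less than a single line:

// Assuming 64-byte cache lines (typical for x86-64 and many ARM cores).
// With 8 workers, the counters span 8 * 4 = 32 bytes, so the whole slice
// fits inside one cache line: every increment by one core invalidates
// that line in every other core's cache.
result := make([]int32, 8)
fmt.Println(len(result) * int(unsafe.Sizeof(result[0]))) // prints 32; needs "fmt" and "unsafe" imports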

When we switched to using a local variable instead, where each thread accumulated its count privately before writing it to the shared array, performance improved. (Adding slack, i.e. padding, between the counters is another fix, and might be an interesting topic for another blog post 👀.) Why did the local variable help? Because it lived entirely within each thread’s private cache, avoiding unnecessary cache invalidations. Only at the very end did each thread touch the shared array, significantly reducing the frequency of those invalidations.
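
For completeness, here is what that slack/padding idea might look like: give each worker its own padded slot so no two counters can ever share a cache line. This is a sketch under the same 64-byte-line assumption, not the approach used above.

// paddedCounter reserves a full 64-byte slot per worker (64 is an assumption
// about the cache-line size), so two workers' counters are always at least
// 64 bytes apart and can never land in the same cache line.
type paddedCounter struct {
    count int32
    _     [60]byte // padding: 4 + 60 = 64 bytes per slot
}

results := make([]paddedCounter, numWorkers)

// Each worker then updates only results[workerID].count, and the final sum
// reads results[p].count, exactly as before.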

Final Thoughts

Understanding the CPU cache isn’t just for low-level programmers; it’s essential for anyone who cares about performance. Whether you’re working with loops, optimizing multithreading, or simply writing efficient algorithms, being aware of how data moves through the CPU can make a huge difference.

Why Should You Care About CPU Cache?

  1. Raw CPU Speed Isn’t Enough
    Modern processors are incredibly fast, but they’re only as efficient as the data they can access. If your code constantly causes cache misses, the CPU ends up waiting on memory instead of executing instructions.

  2. More Threads ≠ More Performance
    Simply adding more threads doesn’t always speed things up. If threads interfere with each other through false sharing or inefficient memory access, performance may plateau or even degrade.

  3. Efficient Memory Access = Faster Code
    By writing cache-friendly code — such as iterating over arrays in a way that aligns with how memory is loaded into cache — we can take full advantage of CPU speed without unnecessary slowdowns.

  4. Scalability Matters
    The gap between CPU speed and memory latency continues to grow. As hardware evolves, software that efficiently utilizes caches will scale much better than code that ignores it.

So next time you’re debugging slow code, don’t just look at your logic; think about how your data is accessed and stored. Because sometimes the biggest bottleneck isn’t in your code: it’s the effort lost in the cache.

This blog post was inspired by Scott Meyers’ talk CPU Caches and Why You Care from the code::dive 2014 conference. Special thanks to Mohamed Moataz El Zein for introducing me to this lecture. If you’re curious to dive deeper, I highly recommend watching the full talk!