Chapter 27

The Data Locality Pattern

Modern CPUs can execute instructions far faster than they can load data from RAM. The bottleneck in performance-critical code is almost always the memory system, not the arithmetic. The Data Locality pattern is about organising data so the CPU finds what it needs already in cache — transforming cache misses into cache hits, and slow loops into fast ones.

The Memory Hierarchy

Your computer has several layers of memory, each faster and smaller than the one below it:

LevelTypical SizeTypical Latency
L1 Cache32–64 KB per core~4 cycles
L2 Cache256 KB–1 MB per core~12 cycles
L3 Cache4–32 MB shared~40 cycles
RAM8–64 GB~200 cycles

When the CPU needs a value that isn't in any cache level, it goes to RAM. That's a 200-cycle stall — the CPU is doing nothing for 200 cycles waiting for data. A tight inner loop that causes cache misses on every iteration can run 50× slower than the same loop with warm cache.

Cache Lines

Memory is not fetched byte-by-byte. It's fetched in chunks called cache lines — typically 64 bytes on modern x86 hardware. When you read one byte, the CPU actually loads the surrounding 64 bytes into cache. If the next read is within those 64 bytes, it's free. If it's outside, that's another cache miss and another 200-cycle stall.

The practical implication: data that is accessed together should be stored together. If you always read field A and field B at the same time, put them in the same struct so they land on the same cache line. If you access them separately in different loops, separating them doesn't hurt and might help.

Spatial Locality: std::vector vs std::list

This is why std::vector<T> is almost always faster to iterate than std::list<T>, even though a linked list has O(1) insertion anywhere.

A std::vector<T> stores its elements in one contiguous block of memory. Iterating them is a linear walk — each element is adjacent to the last, so the CPU prefetcher can stay ahead. Cache lines are fully utilised.

A std::list<T> stores each node as a separate heap allocation. Iterating them is a pointer chase — each node is at a random location in memory. Every node traversal is likely a cache miss. The list's flexibility costs you in every iteration you do.

For game objects that are iterated every frame, this difference is measurable. Use std::vector unless you have a specific reason not to, and treat std::list as a specialist tool.

Vector of Objects vs Vector of Pointers

Even with a vector, you can accidentally reintroduce pointer chasing:

// Fast: objects are contiguous in memory
std::vector<Particle> particles;

// Slower: pointers are contiguous, but the objects are scattered
std::vector<Particle*> particlePointers;

// Even slower: unique_ptrs are contiguous, but their payloads are scattered
std::vector<std::unique_ptr<Particle>> ownedParticles;

A vector<Particle> stores all the particle data in one block. Iterating it walks that block linearly — excellent cache behaviour.

A vector<Particle*> stores the pointers in one block, but each pointer points to a different heap allocation. Dereferencing the pointers during iteration accesses scattered memory — cache misses on every element in the worst case.

The rule: if you own the objects and iterate them frequently, store them by value in a vector, not by pointer. Use vectors of pointers (or unique_ptr) when you need polymorphism, shared ownership, or objects that must not be moved.

Array of Structures vs Structure of Arrays

Consider updating the physics of 10,000 particles. Each particle has position (x, y), velocity (vx, vy), color (r, g, b, a), and a lifetime (float).

Array of Structures (AoS): one struct per particle, in one vector.

struct Particle
{
    float x, y;
    float vx, vy;
    float r, g, b, a;
    float lifetime;
};

std::vector<Particle> particles; // 10,000 of these

The physics update only cares about position and velocity. But each Particle is 36 bytes. When you iterate and update x += vx * dt, you're loading 36 bytes per particle even though you only need 16 of them. 20 bytes of colour and lifetime data rides along in every cache line, wasting bandwidth.

Structure of Arrays (SoA): separate arrays per field.

struct ParticleSystem
{
    std::vector<float> x, y;
    std::vector<float> vx, vy;
    std::vector<float> r, g, b, a;
    std::vector<float> lifetime;
    int count = 0;
};

Now the physics update iterates only the x, y, vx, vy arrays. Those four arrays fit in cache together and nothing else contaminates the cache lines. The colour update only touches r, g, b, a. Each loop is perfectly cache-efficient for its data.

The trade-off: AoS is more natural to write and read ("a particle has x, y, vx, vy"). SoA is faster to iterate when you process fields in separate passes. For 100 particles, the difference is invisible. For 100,000, it can be a 4–8× speedup.

Hot/Cold Splitting

A more surgical form of SoA: identify which fields of a struct are accessed in the hot path (every frame) and which are cold (rarely accessed), then split them.

// Hot: accessed every frame in the physics loop
struct EnemyPhysics
{
    float x, y;
    float vx, vy;
    float radius;
};

// Cold: accessed only when an enemy is clicked, on death, or in the UI
struct EnemyData
{
    std::string name;
    int         xpValue;
    std::string deathSound;
    // ...
};

std::vector<EnemyPhysics> enemyPhysics; // hot array — always in cache
std::vector<EnemyData>   enemyData;    // cold array — rarely loaded

The physics loop only touches enemyPhysics. All the cold data (name, xp, sounds) stays out of the cache entirely during the hot path. This is hot/cold splitting, and it's one of the highest-return optimisations in game-engine work.

The Tension with Polymorphism

The Component pattern from the previous chapter uses virtual dispatch and unique_ptrs — both of which scatter data across the heap and replace direct array iteration with pointer chasing. That's a real cost.

The tension is genuine: the OOP patterns that make code flexible and maintainable are in direct opposition to the data layout patterns that make code fast. Professional game developers navigate this tension by:

  1. Measuring first. Don't optimise before you know it matters. At 100 entities, it doesn't matter. At 100,000, it might.
  2. Choosing the right layer for each concern. High-level game logic (menus, dialogue, rare events) can afford polymorphism. Core simulation loops (physics, rendering, AI path following) are where cache layout matters most.
  3. Using ECS for hot paths. An Entity Component System explicitly uses SoA layout for the components that get processed in tight inner loops, while still providing a flexible "attach any component to any entity" model at the architectural level.

When to Apply These Patterns

The honest answer is: profile first, then optimise. Premature micro-optimisation is the source of much unmaintainable code. The rule of thumb for game development:

Practical Example — Particle System

The particle fountain from Chapter 13 used a vector<Particle> — AoS. At 500 particles that's fine. If you wanted 50,000 particles, the SoA layout would give a significant speedup. Here's what the update loop looks like in SoA form:

void ParticleSystem::update(float dt)
{
    for (int i = 0; i < count; ++i)
    {
        x[i] += vx[i] * dt;
        y[i] += vy[i] * dt;
        vy[i] += GRAVITY * dt;
        lifetime[i] -= dt;
    }

    // Remove dead particles by swap-and-pop
    for (int i = count - 1; i >= 0; --i)
    {
        if (lifetime[i] <= 0.0f)
        {
            // Swap with last
            x[i]        = x[count-1];
            y[i]        = y[count-1];
            vx[i]       = vx[count-1];
            vy[i]       = vy[count-1];
            lifetime[i] = lifetime[count-1];
            r[i] = r[count-1];  g[i] = g[count-1];
            b[i] = b[count-1];  a[i] = a[count-1];
            --count;
        }
    }
}

The physics loop touches only four arrays (x, y, vx, vy, lifetime). The colour arrays (r, g, b, a) are only touched by the render pass and the dead-particle removal. The CPU can keep the physics data warm throughout the physics loop.

Summary