Skip the ECS

What My Engine Does Instead

May 27, 2025

When building Cultist Astronaut, I didn’t go with a full ECS (Entity Component System). Instead, I opted for a simple megastruct paired with targeted optimizations. This approach keeps the logic straightforward while still hitting high performance, without the overhead of a complex ECS framework.

In this article, I will walk through the core ideas: megastructs, hot-data packing, specialized systems, and data layout strategies. I’ll also share some benchmarks and code snippets to show how these ideas work in practice.

The Megastruct

People spend an enormous amount of time trying to figure out how to organize their entities. Some are obsessed with hyper-optimizing everything. Others clog up their codebase by chasing OOP principles: interconnected hierarchies, heavy encapsulation, dynamic dispatch until it’s impossible to figure out what the code actually does.

Both sides make the same mistake:

They’re solving made-up problems instead of real ones.

There are much simpler ways to structure entities that aren’t insane and, more importantly, that let you focus on solving the problems that actually make your game fun.

One approach I use is the megastruct. The idea is simple: just shove every variable you need for an entity into one big Entity struct.
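As a rough sketch of the idea, a megastruct might look something like the following. This is not the actual Cultist Astronaut struct: every field, type, and flag name below is hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 2D math types; the engine's real definitions are not shown.
struct Vector2 { float x, y; };
struct AABB    { Vector2 min, max; };

// Bit flags switch behaviors on and off per entity (names are illustrative).
enum EntityFlags : uint32_t {
    EntityFlag_Active = 1u << 0,
    EntityFlag_Enemy  = 1u << 1,
    EntityFlag_Door   = 1u << 2,
    EntityFlag_Pickup = 1u << 3,
};

// One struct holds every field any kind of entity might need.
// Most entities leave most of these fields at zero.
struct Entity {
    uint32_t flags;
    Vector2  position;
    Vector2  velocity;
    AABB     aabb;
    float    health;         // enemies and the player
    float    attackCooldown; // enemies only
    float    doorOpenAmount; // doors only
    int      pickupValue;    // pickups only
};

// Behavior is selected by testing flags, not by type or inheritance.
inline bool ContainsFlag(const Entity* entity, uint32_t flag) {
    assert(entity != nullptr);
    return (entity->flags & flag) != 0;
}
```

A door and an enemy are then the same type with different flags set, so turning one kind of entity into another at runtime is just a matter of flipping bits.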

I first saw this in Handmade Hero, and it was later reinforced in a great conversation between Ryan Fleury and Randy.

At first, I was skeptical. It felt like it would bloat memory usage, especially if I had thousands of entities and each one used only a small subset of the fields. But once you run the numbers, you realize it’s not actually a problem.

Take memory: one of the 90+ textures I use in Cultist Astronaut is 16 MB (16,777,364 bytes). Let’s say my Entity struct is a beefy 2 KB (2,048 bytes). That means I could store 8,192 entities in the same amount of memory as a single texture. And that’s just one texture—most games load dozens.

Then there’s the CPU cache. I only simulate one room (or block) at a time, and each room averages about 200 entities. At 2 KB each, that is roughly 400 KB of entity data, which fits comfortably in the last-level cache of a modern CPU. On my machine, I get a consistent 3 ms per frame for entity simulation, and this code could still be optimized!
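The arithmetic behind the memory estimate, and the size of one room's working set, can be checked at compile time. This is a small sketch that uses only the figures quoted above. The constant names are made up for illustration, and the texture is rounded to an even 16 MiB (the byte count quoted above is 148 bytes larger, which does not change the result).

```cpp
#include <cstddef>

// Figures from the text: a 16 MB texture, a 2 KB Entity, about 200 entities per room.
constexpr std::size_t kTextureBytes    = 16u * 1024u * 1024u; // 16,777,216
constexpr std::size_t kEntityBytes     = 2048;
constexpr std::size_t kEntitiesPerRoom = 200;

// How many entities fit in the memory of one texture.
constexpr std::size_t kEntitiesPerTexture = kTextureBytes / kEntityBytes;

// How much entity data one room touches per frame.
constexpr std::size_t kRoomWorkingSet = kEntitiesPerRoom * kEntityBytes;

static_assert(kEntitiesPerTexture == 8192, "one texture's worth of memory holds 8,192 entities");
static_assert(kRoomWorkingSet == 400 * 1024, "a 200-entity room is a 400 KB working set");
```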

One big downside of megastructs is that they make it easier to accidentally misuse variables that belong to other systems. Specialized entities can help reduce these kinds of bugs. This is a topic I plan to cover in more detail in a future article.

If performance becomes a concern due to memory layout or entity count, we can optimize specific systems or subsets of entities. That’s where techniques like hot-data packing or moving to specialized systems come in, which I’ll cover in this article.

You can also check out Dylan Falconer’s article on Megastructs. He provides an excellent starting point.

Hot Data Packing

Alongside megastructs, I use hot-data packing to extract tight subsets of data for specific systems—like physics—where performance and cache efficiency really matter.

Hot-data packing is the practice of pulling only the fields you need into a compact, cache-friendly structure—so your inner loops can stay fast.

As gameplay grows in complexity, the Entity megastruct inevitably grows too. It becomes large and sparse, carrying dozens of fields, many of which are unused for most entities.

Even if your total entity count is modest, some systems—like collision detection—scale poorly. For example, many collision checks are quadratic in nature. That’s where the megastruct starts to break down.

The Buoyancy Case

Let’s look at a real-world example from my game Cultist Astronaut, where I hit a major performance issue while simulating lots of entities with buoyant properties.

In my game, buoyancy is toggled with a flag (CollisionFlag_Buoyant) inside the Entity struct. This allows any entity to become buoyant at runtime. Some entities, like props, are buoyant by default. Others, like enemies or the player, become buoyant only after dying. I also use entities to define buoyancy zones, which are regions that apply upward force to buoyant objects.

If we naively loop through all entities to find buoyant ones and then compare them against all zones, we might write something like this:

EntityIterator it;
ForActiveEntities(it) {
    EntityInstance* entity = it.entity;
    
    if (!ContainsFlag(entity, CollisionFlag_Buoyant)) {
        continue;
    }
    
    EntityIterator otherIt;
    ForActiveEntities(otherIt) {
        EntityInstance* other = otherIt.entity;
        if (!ContainsFlag(other, CollisionFlag_BuoyancyZone)) {
            continue;
        }
        
        // Perform collision check here
    }
}

Now imagine the following:

1,000 total entities
200 are flagged as buoyant
10 are flagged as buoyancy zones

This means:

1,000 outer loop flag checks
200,000 inner loop flag checks
All just to perform 2,000 actual collision checks
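Those counts are easy to verify by instrumenting the naive loop. Below is a minimal sketch that replaces the engine's entity array with a plain vector of flag words. `CheckCounts`, `CountNaiveWork`, and `MakeScenario` are hypothetical helpers, not part of the game's code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum : uint32_t {
    CollisionFlag_Buoyant      = 1u << 0,
    CollisionFlag_BuoyancyZone = 1u << 1,
};

struct CheckCounts {
    long outerFlagChecks = 0;
    long innerFlagChecks = 0;
    long collisionChecks = 0;
};

// Walks the same nested loop as the naive version and tallies the work done.
CheckCounts CountNaiveWork(const std::vector<uint32_t>& flags) {
    CheckCounts counts;
    for (uint32_t entityFlags : flags) {
        counts.outerFlagChecks++;
        if (!(entityFlags & CollisionFlag_Buoyant)) continue;

        for (uint32_t otherFlags : flags) {
            counts.innerFlagChecks++;
            if (!(otherFlags & CollisionFlag_BuoyancyZone)) continue;

            counts.collisionChecks++; // the only work we actually care about
        }
    }
    return counts;
}

// Builds a flag array with the given number of buoyant entities and zones.
std::vector<uint32_t> MakeScenario(int total, int buoyant, int zones) {
    assert(buoyant + zones <= total);
    std::vector<uint32_t> flags(total, 0);
    for (int i = 0; i < buoyant; i++) flags[i] = CollisionFlag_Buoyant;
    for (int i = 0; i < zones; i++)   flags[buoyant + i] = CollisionFlag_BuoyancyZone;
    return flags;
}
```

Calling `CountNaiveWork(MakeScenario(1000, 200, 10))` reproduces the numbers above: 1,000 outer checks and 200,000 inner checks for 2,000 collision checks.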

This is bad. We're doing way too much work just to compute a small set of meaningful interactions. The buoyancy-relevant data is also scattered in a large, sparse structure, so we’re polluting the cache while we’re at it.

Applying Hot-Packing

The heaviest part of this function should be the collision check and response. To make this as fast as possible, I pack only the necessary data into tighter structures and iterate over dedicated arrays for buoyant entities and zones. I call this approach hot-data packing.

The idea is simple: iterate through the full entity array, extract only what you need, and store that in specialized arrays.

Here’s the struct I use for buoyant entities:

// Total size: 40 bytes
struct BuoyantEntity {
    Vector2 position;    // 8 bytes
    Vector2 velocity;    // 8 bytes
    AABB aabb;           // 16 bytes
    EntityHandle handle; // 8 bytes
};

You're in full control of the layout here. I’m using an AoS (Array of Structs) layout for simplicity, but you could just as easily use SoA depending on your performance needs. We will discuss this later.

During the update, we populate the packed array:

buoyantEntities.length = 0;

EntityIterator it;
ForActiveEntities(it) {
    EntityInstance* entity = it.entity;
    
    if (!ContainsFlag(entity, CollisionFlag_Buoyant)) {
        continue;
    }
    
    BuoyantEntity* data = Insert(&buoyantEntities);
    data->position = entity->position;
    data->velocity = entity->velocity;
    data->aabb = entity->aabb;
    data->handle = entity->handle;
}

Note: The arrays are reset every frame (length = 0) because buoyancy is dynamic. Entities can die, respawn, or change state at any time. This "immediate mode" approach avoids lifetime complexity.

We do the same thing for buoyancy zones.
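A sketch of what the zone pass could look like is below. The article does not show the zone struct, so the fields of `BuoyantZone`, the `buoyancyStrength` field, and the layout of `EntityHandle` are all assumptions. `std::vector` stands in for the engine's own array and `Insert` helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vector2      { float x, y; };
struct AABB         { Vector2 min, max; };
struct EntityHandle { uint32_t index; uint32_t generation; }; // 8 bytes, layout assumed

enum : uint32_t {
    CollisionFlag_Buoyant      = 1u << 0,
    CollisionFlag_BuoyancyZone = 1u << 1,
};

// Stand-in for the megastruct; only the fields this pass reads.
struct Entity {
    uint32_t     flags;
    AABB         aabb;
    float        buoyancyStrength; // hypothetical: upward force the zone applies
    EntityHandle handle;
};

// Hypothetical packed zone: just what the inner loop needs.
struct BuoyantZone {
    AABB         aabb;
    float        strength;
    EntityHandle handle;
};

// Rebuilds the packed zone array from scratch, once per frame.
void PackBuoyantZones(const std::vector<Entity>& entities,
                      std::vector<BuoyantZone>* zones) {
    assert(zones != nullptr);
    zones->clear(); // resets the length but keeps the allocation

    for (const Entity& entity : entities) {
        if (!(entity.flags & CollisionFlag_BuoyancyZone)) continue;
        zones->push_back({entity.aabb, entity.buoyancyStrength, entity.handle});
    }
}
```

In practice both passes could share a single walk over the entity array, checking each flag in turn, so the full array is only touched once per frame.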

At first glance, this may seem like more work—you’re adding extra loops and copying data. But in reality, this is a cheap pass:

1,000 iterations
200 inserts
8 KB of memory written (200 × 40 bytes)

Once packed, the buoyancy logic becomes:

for (int i = 0; i < buoyantEntities.length; i++) {
    BuoyantEntity* entity = &buoyantEntities[i];
    
    for (int j = 0; j < buoyantZones.length; j++) {
        BuoyantZone* zone = &buoyantZones[j];
        
        // Perform collision check / response here
    }
}

Now we’re doing exactly 2,000 iterations.
No extra branching, no flag checks, and the data is tight in memory.
This dramatically improves performance—and it scales well.
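To make the placeholder in that loop concrete, here is one way the check and response might look. It is a minimal sketch under stated assumptions: the AABBs are in world space, +y is up, and a zone simply adds a constant upward acceleration. The `strength` field and the `Overlaps` and `ApplyBuoyancy` functions are hypothetical, and the real game's response is surely more involved.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vector2      { float x, y; };
struct AABB         { Vector2 min, max; };
struct EntityHandle { uint32_t index; uint32_t generation; }; // layout assumed

struct BuoyantEntity {
    Vector2      position;
    Vector2      velocity;
    AABB         aabb;
    EntityHandle handle;
};

struct BuoyantZone {
    AABB  aabb;
    float strength; // hypothetical upward acceleration
};

// True when two axis-aligned boxes overlap on both axes.
inline bool Overlaps(const AABB& a, const AABB& b) {
    return a.min.x < b.max.x && a.max.x > b.min.x &&
           a.min.y < b.max.y && a.max.y > b.min.y;
}

// Applies every overlapping zone to every packed entity.
// Returns the number of overlapping pairs found.
int ApplyBuoyancy(std::vector<BuoyantEntity>& entities,
                  const std::vector<BuoyantZone>& zones, float dt) {
    assert(dt >= 0.0f);
    int overlaps = 0;
    for (BuoyantEntity& entity : entities) {
        for (const BuoyantZone& zone : zones) {
            if (!Overlaps(entity.aabb, zone.aabb)) continue;
            entity.velocity.y += zone.strength * dt; // assumes +y is up
            overlaps++;
        }
    }
    return overlaps;
}
```

Note that this only changes the packed copy. The result still has to be written back to the full entity through its handle.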


What about Writing Back to Entities?

You might have noticed: these packed arrays are transient. They don’t persist across frames. So how do we update the original entities?

That’s what the EntityHandle is for. It lets us refer back to the full entity only when necessary (e.g., during a collision response).

This indirection does add a tiny cost, but:

It only happens when something meaningful occurs
The main simulation loop stays tight, branchless, and cache-friendly

In my experience, this hasn’t been a bottleneck. The real cost is in the collision math and memory access—not the occasional handle lookup.
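The article does not describe how EntityHandle is implemented, only that it is 8 bytes. One common scheme is a slot index plus a generation counter, which lets a stale handle be detected after an entity dies and its slot is reused. The sketch below assumes that scheme. `MakeHandle`, `Resolve`, and `WriteBackVelocity` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Vector2 { float x, y; };

// Assumed layout: slot index plus generation counter (8 bytes total).
struct EntityHandle { uint32_t index; uint32_t generation; };

// Stand-in for the megastruct; only the fields this sketch needs.
struct Entity {
    bool     active;
    uint32_t generation; // bumped whenever the slot is reused
    Vector2  velocity;
};

// Creates a handle to the entity currently living in a slot.
EntityHandle MakeHandle(const std::vector<Entity>& entities, uint32_t index) {
    assert(index < entities.size());
    return EntityHandle{index, entities[index].generation};
}

// Returns the live entity a handle refers to, or nullptr if it is stale.
Entity* Resolve(std::vector<Entity>& entities, EntityHandle handle) {
    if (handle.index >= entities.size()) return nullptr;
    Entity* entity = &entities[handle.index];
    if (!entity->active || entity->generation != handle.generation) return nullptr;
    return entity;
}

// Copies a result from the packed array back onto the full entity.
bool WriteBackVelocity(std::vector<Entity>& entities, EntityHandle handle,
                       Vector2 velocity) {
    Entity* entity = Resolve(entities, handle);
    if (entity == nullptr) return false; // entity died or slot was reused
    entity->velocity = velocity;
    return true;
}
```

Because the packed arrays are rebuilt every frame, a handle rarely outlives its entity, but the generation check makes the failure case harmless instead of silently writing to the wrong entity.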

When Hot Packing Isn’t Worth It

Hot packing isn’t always the right answer. In systems where entities are uniform and use the same logic, like particles or bullets, it’s often better to use a dedicated array and skip packing altogether.

In the case of buoyancy, I prefer to keep it as a flag on the base entity. This gives me maximum flexibility: anything in the world can become buoyant, from props to corpses to the player.

In this case, the indirection cost is worth it.

The key point is this: it’s all about use case and profiling.

Every game has different requirements. Don’t follow dogma; follow your data.
Hot packing is just one tool among many. Use it where it matters.

When to Ditch Entities

I ran into a problem like this in Cultist Astronaut when enemies started dropping “souls” on death. Souls interact with world collision, gravitate towards the player, and are absorbed as spendable points. Since they have a position, velocity, sprite, and collision data (even buoyancy), it made sense at first to treat them like regular entities: just flag them as souls and handle them accordingly.

As the game grew in complexity, performance in soul-heavy areas tanked. In one extreme case, the player could die and drop 4,000 souls into a room. If your entity loops aren’t tightly filtered, especially in collision, it’s easy to end up comparing souls against souls, triggering 16 million checks for no good reason.

This is a perfect use case for not using full entities. Souls don’t need all the overhead: configurability, branching behavior, the potential to turn into something else. They simply:

Collide with the world
Move toward the player
Disappear (and turn into points) on touch

That’s it.

So, treat them like what they are: a specialized particle system. Keep them simple. Keep them separate. There's no reason for them to pollute the entity array or bloat innocent-looking loops that have no business interacting with souls.
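A dedicated soul system can be very small. The sketch below covers only the homing and pickup behavior and leaves world collision out. The `Soul` layout, the `UpdateSouls` signature, and the tuning parameters are assumptions for illustration, not the game's actual code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vector2 { float x, y; };

// A soul carries only what its three behaviors need.
struct Soul {
    Vector2 position;
    Vector2 velocity;
};

// Pulls every soul toward the player and removes the ones that touch it.
// Returns how many souls were collected this frame.
int UpdateSouls(std::vector<Soul>& souls, Vector2 player,
                float pullSpeed, float pickupRadius, float dt) {
    assert(pickupRadius >= 0.0f); // also guarantees dist > 0 below
    int collected = 0;

    for (std::size_t i = 0; i < souls.size();) {
        Soul& soul = souls[i];
        float dx = player.x - soul.position.x;
        float dy = player.y - soul.position.y;
        float dist = std::sqrt(dx * dx + dy * dy);

        if (dist <= pickupRadius) {
            souls[i] = souls.back(); // swap-remove: order does not matter
            souls.pop_back();
            collected++;
            continue; // re-check whatever was moved into slot i
        }

        soul.velocity = {dx / dist * pullSpeed, dy / dist * pullSpeed};
        soul.position.x += soul.velocity.x * dt;
        soul.position.y += soul.velocity.y * dt;
        i++;
    }
    return collected;
}
```

With this shape, 4,000 souls cost 4,000 iterations per frame, and no other system ever has to skip over them.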

Since souls share the same buoyancy behavior as entities, it makes sense to refactor those functions to operate on more generic data. You can also use lightweight templating or macros to substitute types and generate specialized versions of the function without duplicating logic.
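As one illustration of the templating option, a single function template can serve both packed entities and souls, as long as each type exposes the fields the function touches. This is a sketch, not the game's code: the struct layouts, the single-zone signature, and the assumption that +y is up are all mine.

```cpp
#include <cassert>
#include <vector>

struct Vector2 { float x, y; };
struct AABB    { Vector2 min, max; };

inline bool Overlaps(const AABB& a, const AABB& b) {
    return a.min.x < b.max.x && a.max.x > b.min.x &&
           a.min.y < b.max.y && a.max.y > b.min.y;
}

// Two unrelated types that both happen to have `aabb` and `velocity`.
struct BuoyantEntity { Vector2 velocity; AABB aabb; };
struct Soul          { Vector2 position; Vector2 velocity; AABB aabb; };

// Works for any Body with an `aabb` and a `velocity` member.
// The compiler generates one specialized copy per type, with no dynamic dispatch.
template <typename Body>
int ApplyBuoyancy(std::vector<Body>& bodies, const AABB& zone,
                  float strength, float dt) {
    assert(dt >= 0.0f);
    int affected = 0;
    for (Body& body : bodies) {
        if (!Overlaps(body.aabb, zone)) continue;
        body.velocity.y += strength * dt; // assumes +y is up
        affected++;
    }
    return affected;
}
```

The same logic now runs over `std::vector<BuoyantEntity>` and `std::vector<Soul>` without either type knowing about the other.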

SoA vs. AoS

People often bring up memory layout when discussing ECS or entity design, and for good reason. Structure of Arrays (SoA) and Array of Structures (AoS) can have a major impact on performance depending on how you're accessing data.

While Cultist Astronaut uses a megastruct layout that resembles an AoS (Array of Structures) model—one large struct per entity stored in an array—I wanted to test whether a true SoA (Structure of Arrays) layout actually outperforms it in real-world scenarios.

What’s the Difference?

AoS (Array of Structures): An array of full structs. Easy to work with, but large structs hurt cache usage when only one field is needed.
SoA (Structure of Arrays): Separate arrays for each field. Much faster when accessing a single field across many elements.

AoS Example

struct Entity {
    float position;
    float velocity;
    float health;
    char padding[2036]; // simulate large entity size (2 KB total)
};

SoA Example

struct EntitySoA {
    int count;
    float* position;
    float* velocity;
    float* health;
};

Benchmark Setup

All benchmarks were compiled with MSVC 2019:

cl /MD /O2 /GL /Ob2 /Oy

System specs:

CPU: Intel Core i7-10875H (8 cores, 16 threads @ 2.3GHz base, Turbo ~5.1GHz)
RAM: 64 GB DDR4
OS: Windows 11 Home, Build 26100
Machine: Razer Blade Pro 17 (Early 2020)

Each test:

Loops through 10 million elements and adds a float to an accumulator
For AoS, the struct varies from 4 to 2048 bytes, but only one float is accessed
Runs 100 times, reports the average
Flushes the CPU cache between runs to avoid skew
Uses QueryPerformanceCounter for high-resolution timing
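For readers who want to reproduce something similar, here is a portable sketch of such a harness. It is not the code behind the numbers below: it swaps QueryPerformanceCounter for `std::chrono::steady_clock`, and it "flushes" the cache by streaming through a scratch buffer, which is an assumption since the article does not say how the flush is done. All names are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// AoS element padded to a chosen total size; only `health` is ever read.
template <std::size_t Size>
struct PaddedEntity {
    static_assert(Size > sizeof(float), "use a plain float array for the 4-byte case");
    float health;
    char  padding[Size - sizeof(float)];
};

// Touches every byte of a scratch buffer to push benchmark data out of cache.
// The buffer should be larger than the last-level cache to be effective.
inline unsigned long FlushCache(std::vector<unsigned char>& scratch) {
    unsigned long sum = 0;
    for (unsigned char& byte : scratch) { byte++; sum += byte; }
    return sum;
}

// Sums one field across the whole array (the work being measured).
template <typename T>
float SumHealth(const std::vector<T>& entities) {
    float sum = 0.0f;
    for (const T& entity : entities) sum += entity.health;
    return sum;
}

// Runs `work` several times, flushing before each run, and averages the time.
template <typename Fn>
double AverageMilliseconds(int runs, std::vector<unsigned char>& scratch, Fn&& work) {
    assert(runs > 0);
    using Clock = std::chrono::steady_clock;
    double totalMs = 0.0;
    for (int run = 0; run < runs; run++) {
        FlushCache(scratch);
        auto start = Clock::now();
        work();
        auto end = Clock::now();
        totalMs += std::chrono::duration<double, std::milli>(end - start).count();
    }
    return totalMs / runs;
}
```

When calling it, store the result of `SumHealth` in a `volatile` variable inside the lambda passed as `work`, so the optimizer cannot discard the loop being timed.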

Test 1: 10 million Entities (1 Add per Iteration)

Benchmark Description

Each run sums a single float (health) from each element. AoS structs vary in size, but only the float is accessed.

float sum = 0;
for (int i = 0; i < count; i++) {
    sum += entities[i].health; // AoS variant
    // or, in the SoA variant:
    // sum += soa.health[i];
}

Results

| Layout             | Time    |
|--------------------|---------|
| SoA                |  9.3 ms |
| AoS (4 bytes)      |  9.3 ms |
| AoS (64 bytes)     | 26.9 ms |
| AoS (1024 bytes)   | 69.3 ms |
| AoS (2048 bytes)   | 72.7 ms |

AoS (4 bytes) performs identically to SoA—no surprise, since the layout is basically the same.

Test 2: 10 million Entities (4 Adds per Iteration)

Here we unroll the loop to perform four additions per iteration:

float sum = 0;
// Assumes count is a multiple of 4
for (int i = 0; i < count; i += 4) {
    sum += soa.health[i + 0]
         + soa.health[i + 1]
         + soa.health[i + 2]
         + soa.health[i + 3];
}

Results

| Layout             | Time    |
|--------------------|---------|
| SoA                |  3.1 ms |
| AoS (64 bytes)     | 27.5 ms |
| AoS (1024 bytes)   | 70.0 ms |
| AoS (2048 bytes)   | 81.8 ms |

Unrolling improves performance on SoA by reducing loop overhead and taking advantage of the fact that health values are stored contiguously in memory. Summing four values before touching the accumulator also likely shortens the serial dependency chain on sum, so the CPU can overlap more of the additions. In contrast, AoS layouts interleave health values with unrelated data in large structs, breaking memory locality. As struct size increases, each access pulls in more unused data, causing more cache misses and lower memory throughput. Unrolling alone can’t solve those problems.

In some cases, unrolling can allow us to see opportunities to apply SIMD instructions, but we will save that for another day.

Test 3: 1,000 Entities (Real-World Case)

Most games don’t process 10 million entities per frame. Here’s what happens when we scale down to a more reasonable number: 1,000 entities per frame.

| Layout             | Time      |
|--------------------|-----------|
| SoA                | 0.0008 ms |
| AoS (64 bytes)     | 0.0013 ms |
| AoS (1024 bytes)   | 0.0045 ms |
| AoS (2048 bytes)   | 0.0058 ms |

At this scale, everything is fast. Even a 2 KB AoS struct is completely reasonable for typical game simulation.

SoA vs. AoS Summary

You can go very far with plain structs and contiguous arrays.

Use SoA when you're processing one variable across many elements.
Use AoS when you're working with multiple fields, especially in branching gameplay logic.
Don't optimize blindly—profile with realistic data.

It's easy to assume SoA is always faster, but in real-world gameplay code, the benefits may be negligible—or even negative.

The right memory layout depends entirely on how you access your data. Keep it simple and benchmark your assumptions.

Conclusion

You don’t need a full ECS framework to build a performant, scalable game. In Cultist Astronaut, I’ve found that a few simple strategies go a long way:

A megastruct keeps development fast and flexible early on.
Hot-data packing helps optimize performance where it actually matters.
Migrating to a specialized system, like a particle system, can offload work from the main entity system.
And understanding memory layout, like SoA vs. AoS, helps you write more efficient code.

The point isn’t to avoid abstraction at all costs, or to chase micro-optimizations. The point is to build systems that are clear, predictable, and profile-driven. You can always optimize further, but if your architecture starts out simple, you’ll have room to evolve without rewriting the engine from scratch.

Roger Schoellgen
