An Origin Story

A turbulent flight over the Pacific, a letter to a three-year-old, and the eight-year road to building a database that shouldn't exist.

Setting: 2018, on an airplane, somewhere over the Pacific Ocean.

I woke up with a start, disoriented and slightly nauseous. The plane was in the middle of a long flight, and the turbulence had woken me from a deep sleep. I rubbed my eyes and looked out the window, trying to orient myself amidst the endless expanse of the clouds and ocean below. The plane bucked again, and I clutched at the armrest, still disoriented. I took a deep breath. Asked myself, “Are we going to crash?”

The airplane was completely fine. I was fine. Everything was fine. But, what if it wasn’t?

That thought would eat at me for the rest of the 34-hour flight. How would my wife know my passwords to pay the bills? There were so many things I wanted to tell my wife, my three-year-old son. I’d never be able to congratulate him on graduating from school, on becoming a dad, or anything. I’d be gone. Just gone. A memory.

I suppose that is true for all of us.

One day it is our turn to make space for the young. But I could do better.

Solving People-Problems with Technology

I’m a software engineer. More specifically, I solve people-problems with technology. It’s what I do.

And this. This is a perfect people-problem: email my son in 25 years.

Simple. Easy.

Right?

In a week, I had the first version running on my laptop. If I didn’t open my laptop for more than a few days, an email would be sent to my family to check on me. After two weeks, everything would be sent to my wife. Passwords, documents, access to everything.
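The escalation rule is small enough to sketch. This is an illustration rather than the original code: the name `next_action`, the `Action` labels, and the three-day threshold (the post only says "a few days") are assumptions.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    NONE = "nothing to do"
    CHECK_IN = "email the family to check on me"
    RELEASE = "send everything to my wife"


# Assumption: "a few days" is taken to mean three days.
CHECK_IN_AFTER = timedelta(days=3)
# Two weeks comes from the post.
RELEASE_AFTER = timedelta(weeks=2)


def next_action(last_seen: datetime, now: datetime) -> Action:
    """Decide what to do based on how long the laptop has been silent."""
    silence = now - last_seen
    if silence >= RELEASE_AFTER:
        return Action.RELEASE
    if silence > CHECK_IN_AFTER:
        return Action.CHECK_IN
    return Action.NONE
```

A scheduled job would call `next_action` with the timestamp of the last time the laptop was opened and act on the result.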

That solved part of the problem. But I still couldn’t reliably send my son an email in 25 years:

  1. My son didn’t have an email address.
  2. Even if he did, would he still have access to that email address?
  3. Would email still be around in 25 years?
  4. Would the internet, as I knew it, still exist?

I needed something to store the letter in. Something that would never die, always be available – even when parts of it are offline. It needed to outlive me, with zero maintenance.

A Solution Does Not Exist

I would spend a large part of the next year hunting for a database solution that would work for me:

  1. Always available
  2. Strongly consistent
  3. Distributed

Did you know? It didn’t exist (foreshadowing: yet). In fact, the literature seemed to suggest that such a database couldn’t exist.

It was 2019. I knew that if I wanted to have this, I’d have to build it myself. I just didn’t know where to start.

Building The Universe

Also, in 2019, I was building a pretty neat game. The idea was simple: you build a space probe that can accelerate at insane speeds, going through a wormhole out in the asteroid belt to explore the universe and map out a constellation of wormholes.

One of the interesting bits is that due to the speeds of these probes, I was also modelling relativity on the probes and masking time dilation effects with latency. It was pretty cool, and I was able to model our entire solar system and beyond.

This got me entrenched in the realm of relativistic effects. It was fun and hard, trying to design gameplay around constraints that exist in the real world but don’t “really” exist in the game.

I was using a mix of Go and PHP to build the game, and around the start of COVID-19, I discovered FrankenPHP and immediately started contributing to it.

Kévin Said Something

Fast-forward to 2024, I was sitting in a keynote at a conference, as a FrankenPHP contributor. Kévin (the creator of FrankenPHP) said something on stage, and it was like a lightbulb went off in my head. I knew how to make it. I can’t remember what he said though. Which is a shame.

I spent most of the next year working on a new kind of consensus algorithm, based on Paxos. I called it AtlasDb – and yeah, apparently there were already a couple of projects with that name.

I kept at it until spring of 2025. I kept running into a problem building a distributed database: how the hell do you keep the cache coherent?

See: databases can be distributed. They could always be distributed one way or another since replication is a thing. Caches? Not distributed.

Sure, you can shard caches. But a world-spanning cache? That’s a problem. Some might say, it’s an impossible one.

As engineers, we either forgo the cache in this case, or we accept that it is just going to be janky sometimes; held together with duct tape and glue code.

It was while working on the cache for AtlasDb’s engine that I realized I had a solution. I was building a distributed cache engine already.

Defying CAP?

I had a prototype by December 2025 – and it worked. It also appeared to violate CAP. The CAP theorem states that a distributed system cannot simultaneously provide consistency, availability, and partition tolerance. My solution allowed for all three – but first, I needed a name. And this, this really is the hardest thing in Computer Science – naming things. I decided on Swytch. How it works … is pretty cool actually.

Everything that happens in Swytch is called an “effect”. If you’ve ever worked with Event Sourcing, you’d recognize them as “events” to some degree. However, what makes an effect different from an event:

  1. every effect carries what was observed
  2. every effect carries semantics, not (always) data
  3. every effect is composable with the effects that came before it

In a lot of ways, these are a different kind of CRDT than you’re used to – ordering matters, but the ordering is encoded in the structure. They compose into what I called a “causal log”: a log of everything that has ever happened. Structurally, they form a Causal Directed Acyclic Graph (DAG, pronounced like the first part of “dagg–er”).
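To make the idea of a causal log concrete, here is a minimal sketch. It is not Swytch's actual data model: the `Effect` fields (`node`, `key`, `op`, `arg`, `observed`), the SHA-256 content id, and the helper names are all assumptions chosen for illustration. Each effect records an operation rather than a value, plus the ids of the effects it had seen, and those links are what form the DAG.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Effect:
    node: str             # which node produced the effect (assumed field)
    key: str
    op: str               # semantics ("incr"), not the resulting value
    arg: int
    observed: tuple = ()  # ids of the effects this one had seen

    @property
    def id(self) -> str:
        body = json.dumps(
            [self.node, self.key, self.op, self.arg, sorted(self.observed)]
        )
        return hashlib.sha256(body.encode()).hexdigest()


def ancestors(log: dict, effect_id: str) -> set:
    """Collect every effect id reachable by following `observed` links."""
    seen = set()
    stack = list(log[effect_id].observed)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(log[current].observed)
    return seen


def concurrent(log: dict, a: str, b: str) -> bool:
    """Two effects are concurrent when neither is in the other's history."""
    return a != b and a not in ancestors(log, b) and b not in ancestors(log, a)


# Two nodes increment the same counter after seeing the same root effect.
root = Effect("tokyo", "views", "incr", 1)
left = Effect("tokyo", "views", "incr", 1, (root.id,))
right = Effect("berlin", "views", "incr", 1, (root.id,))
log = {e.id: e for e in (root, left, right)}
```

In this example `left` and `right` both descend from `root` but not from each other, so `concurrent` reports them as a fork in the history.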

In database speak, the DAG captures causal order, and once forks are serialized it gives you what is called a total order. This is a very important property to have, for reasons we will get into later.

But how does this allow you to have Consistency, Availability, and Partition Tolerance?

This is where the DAG becomes quite useful, actually. See, we know that in order to provide consistency, we need to have the DAG on the local node. The technical term for it is causal closure.

In order to derive correctness from causal closure, we need to be able to see all forks in the DAG. We need the ability to serialize those forks and decide which fork is the correct path in history.
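Both checks can be sketched over a toy log. This is an assumption-laden illustration, not the engine's implementation: the log is a plain dictionary mapping an effect id to the ids it observed, and the function names `missing_history` and `forks` are made up.

```python
def missing_history(log: dict, effect_id: str) -> set:
    """Return ids in the history of `effect_id` that are absent locally.

    An empty result means the local node has causal closure for that effect.
    """
    missing, seen, stack = set(), set(), [effect_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current not in log:
            missing.add(current)
            continue
        stack.extend(log[current])
    return missing


def forks(log: dict) -> dict:
    """Map each effect id to its children when it has more than one."""
    children = {}
    for effect_id, parents in log.items():
        for parent in parents:
            children.setdefault(parent, []).append(effect_id)
    return {p: sorted(c) for p, c in children.items() if len(c) > 1}


# "b" and "c" both observed "a" (a fork); "e" observed "d", which never arrived.
local_log = {"a": [], "b": ["a"], "c": ["a"], "e": ["d"]}
```

Here the node can decide about `b` because its whole history is present, it must serialize the fork at `a`, and it cannot yet decide about `e`.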

And in distributed systems, you can have a bunch of machines (called nodes) spread all over the world. These nodes can lose contact with each other or crash. You can’t tell the difference. But in relativistic terms, the nodes never go away. Instead, the distance between them approaches infinity: you’re moving away from each other faster than the speed of light (or rather, the only way to communicate would be to go backwards in time).

So, in Swytch, the time we’re willing to wait for a node’s response could approach infinity as well. This allows nodes to be consistent within their light cone, always available, and partition tolerant.

Note: “within their light cone” – what is a light cone? In physics, a light cone is your causal history and your causal future. Things outside your light cone are things that can’t affect you in any way until the light cones overlap.

A good analogy would be to think of an event that happens on the other side of the world. Initially, it is outside your light cone. It isn’t until news of the event reaches you, in time and space, that it can affect you. From that point on, the event’s light cone and your own personal light cone overlap.

A similar thing happens with Swytch. If a far away node goes away, it exits the local light cone. That node can no longer affect the light cone of the local node. That means it can’t contribute to causal closure or contribute forks. It’s no longer part of the cluster.

From the other perspective, if a node suddenly loses contact with the outside world, it is still consistent with itself. It still knows about every decision made before that moment. Traditional systems would refuse to operate … but we don’t have to.

The interesting bit happens when the two light cones reconverge. We have several choices – and we want to expose this per key:

  1. We can merge everything deterministically. This makes sense for some keys. For example, a page view counter.
  2. We can ask the user to merge. This makes sense for some keys. For example, a lottery winner. If two people win on different sides of a partition, only the business can decide what to do about it, not math.
  3. We can refuse to allow the merge at all. This is what we call “holographic divergence”: both databases are internally consistent but can never be merged.
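The three choices above can be sketched as a per-key policy table. Everything here is hypothetical: the `MergePolicy` names, the exception classes, the `reconverge` function, and the key names (including `audit_log`) are assumptions, and the deterministic branch only handles counters, where each side contributes the increments it made during the partition.

```python
from enum import Enum


class MergePolicy(Enum):
    MERGE = "merge deterministically"
    ASK = "ask the user"
    REFUSE = "refuse to merge"


class NeedsDecision(Exception):
    """Both sides changed the key; only the business can pick."""


class HolographicDivergence(Exception):
    """The two histories stay internally consistent but separate."""


# Hypothetical per-key configuration; the key names are made up.
POLICIES = {
    "page_views": MergePolicy.MERGE,
    "lottery_winner": MergePolicy.ASK,
    "audit_log": MergePolicy.REFUSE,
}


def reconverge(key, left, right):
    """Apply the key's policy to the two sides of a healed partition."""
    policy = POLICIES[key]
    if policy is MergePolicy.MERGE:
        # Counter-only sketch: add the increments made on each side.
        return left + right
    if policy is MergePolicy.ASK:
        raise NeedsDecision(f"{key}: choose between {left!r} and {right!r}")
    raise HolographicDivergence(key)
```

Raising an exception for the last two policies is just one way to surface the outcome to the caller; a real engine would more likely return a structured result.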

We also have a choice before even getting into this situation. We can do the above, or we can refuse to operate – sacrificing availability. The latter is more conservative and it is what people expect, so it’s the default.

The point is though … we don’t actually have to.

Actually, CAP doesn’t apply

This uncovered that Swytch was operating within three parameters. There were only three things I could actually control:

  1. Causal closure (C): how complete the history is to make a decision.
  2. Fork Resolution (O): how to serialize and make a decision about two concurrent effects.
  3. Timeliness (R): how long I’m willing to wait for an effect to arrive.

And actually, there is a fourth thing: reading the DAG, what is the actual value I return to a user (if any).

In mathematical terms, for Swytch, this would look like F(C[explicit], O[key], R[∞]). I go into detail in the paper on arXiv.

The paper explains what we’re calling Light Cone Consistency (or LCC for short). It’s a summation of all consistency levels and explains how they all work together.

Combining these parameters, we find three problems in distributed systems, all aptly named:

  • CxR => CAP
  • OxR => FLP
  • CxO => AFC

The reason we never hit CAP, even with a partition and strong consistency, is that the engine itself doesn’t constrain R. It can never hit CAP. Instead, we lose causal closure or fork resolution. We hit AFC. We lose the ability to serialize our data across the whole (original) cluster.

And when that happens: we end up with holographic divergence, but we still have consistency and availability on both sides of the partition.

We can tune F() as mentioned above to limit availability or merge, as we see fit.

Swytch is something truly amazing and, as far as I know, one of a kind. It’s not just a cache. It’s not just a database. It’s something new.

The Inevitable Always Comes

So, what about that letter? That letter lives in Swytch. It’s one of the clusters on the community stats. Even if the internet falls down completely … as long as the power is on somewhere, the letter will survive.

Special Thanks

The post above sounds like I did everything by myself, but nobody does anything by themselves. There were some special people along the way:

Yanir: Always challenging me while working on an internal database at Automattic. Without that experience, I’d have never been able to build Swytch.

Melvin: My sounding board during the early years of working on Swytch, back when it was called AtlasDb and I hadn’t solved all the hard problems yet.

Jessie and Sophia: These are special people who helped with branding and the logo you see today. I highly recommend them.

Lorenzo: For pointing out the novel parts.

Mark and András: For being a sounding board on the theory, and giving me a shoulder to stand on.

Kaben: For believing in me when I messaged him at some obscene hour about a crazy idea.

Stefan: Giving me the kick in the butt that I needed to finish Swytch.

And last, but surely not least, my wife and son, without whom none of this would have ever happened.

What’s next?

This blog post marks the end of one journey and the beginning of the next. There’s a roadmap, things to do, customers to sell to, and so much work ahead. As much as I want to say that I know what is coming … I literally have no idea.
