Frequency Thresholds and Sharded Eviction: Building CloxCache

tl;dr I tried to write a tiny cache for AtlasDB. Accidentally discovered that protecting items with freq > k from eviction gives near-optimal hit rates. Then discovered that the design scales almost linearly with CPU cores — unlike LRU, TinyLFU, or ClockPro. The whole algorithm is ~500 lines of Go.

Repo: github.com/bottledcode/cloxcache

A few months ago, I needed a cache for AtlasDB (a global distributed database) and I didn’t feel like evaluating libraries. So, I decided to build one myself. It’s just a cache, right? How hard could it be?

It started simple: “Hey ChatGPT, what are some really simple caching algorithms?” And thus began my journey of learning about CLOCK. It is, indeed, pretty straightforward: you have a circular buffer and a “clock hand” that goes around the circle, and each time the hand sweeps past a value in the cache, you “decay” it until it gets evicted. The funny part is that I implemented it wrong, because ChatGPT sucks at remembering details and I didn’t look it up properly. The real CLOCK algorithm uses a single reference bit to “decay”, but I used an 8-bit counter counting down from 15 to 0. When a GET had a hit, I bumped the counter up, and when the hand swept by, it decremented. This turns out to be quite important later… Ironically, if I hadn’t implemented CLOCK incorrectly, I probably never would have discovered the thing that makes CloxCache work.

I realised pretty quickly, though, that it wouldn’t scale. This is going to be in a database, after all. So, I reworked it until I had a sharded CLOCK algorithm. Each shard has its own hand, its own counters, and state. Then the hash of the key determines which shard it goes to.

It was fast, and it worked well enough, to be honest. But earlier this week, I opened a PR to replace otter in frankenphp, and some people ran some benchmarks. It, uhh, wasn’t better.

So, over the last few nights, I poured my brain into the problem. It’s fast—as in throughput—but the hit rate is absolutely terrible. See, CLOCK is pretty good, but as I started benchmarking against otter and just regular LFU and LRU … I realised there’s a lot of room for improvement.

It was around this time, while dumping benchmarks into ChatGPT to summarise for me, that I learned about Bélády and his MIN algorithm, and especially how to use it to determine the optimal hit rate for a given cache size. So, I implemented that and ran some benchmarks. My algorithm sucked.

That’s OK, I had Bélády’s algorithm to help me. So, I wrote a sharded version of the algorithm to prove that sharding the cache wouldn’t affect the hit rate. This, in fact, proved that a sharded cache is within 0.1% of a cache with global knowledge. This meant I could, theoretically … maybe, get my algorithm to be fully independent shards. At the time, there was still some shared state for stats. So, I made a way to disable stats or move them to each shard as needed.

From there, I decided to peel apart my algorithm and the Bélády algorithm to show me where I was choosing the wrong victim to evict.

I stared at the data, completely dumbfounded. I broke the evictions down into frequency buckets (0–15) and calculated the percentage of the time my choice agreed with Bélády’s.

For 0-freq items, it was nearly a perfect 50/50 coin toss whether my algorithm would be right or wrong. My algorithm can’t see the future (unlike Bélády’s algorithm), and the only signal I had was frequency. What was even more interesting was that when evicting items with freq > 2, I was wrong nearly 80% of the time—these items were likely to be accessed again.

My algorithm can’t see the future, but as I learned long ago, if you can’t make a “right” decision, and you’re guaranteed to make a “wrong” decision, the best decision you can make is to not make any decision.

So, I rewrote the eviction algorithm as something basic:

if freq <= 2 {
  // Low frequency: candidate for eviction
} else {
  // Protected: only evict as last resort
}

And it started beating otter under some very specific conditions, getting within 90–99% of the theoretical “perfect cache” that Bélády promises us must exist. The conditions:

  • the “working set” of the cache needs to fit in the cache
  • the data must have temporal locality; scan-heavy workloads fail terribly

I started trying to come up with a formal theory of why this even worked.

Under Zipf(α), item popularity is proportional to 1/i^α, so popular items accumulate frequency quickly, and the probability of reaccess increases monotonically with frequency. And, at least for the data I was using at the time, the “cross-over” frequency where working-set membership becomes obvious is around freq = 2~4.

So, something like this?

E[P_r(x) | freq(x) = f] is non-decreasing in f, where P_r(x) is the probability that item x is accessed again.

Empirically:

Trace       freq=1 P_r   freq≥7 P_r
Twitter     18.8%        43–68%
OLTP        22.0%        58–82%
Zipf-1.01   40.5%        76–87%

This surprisingly holds across almost all traces I was able to get my hands on. Hence, freq > 2 is probably in the working set and freq <= 2 is probably noise. I called this magic number k.

That, right there, will allow you to beat or match LRU with almost no complexity. The entire library (not including tests and benchmarks) is less than 500 lines of Go code.


I got my hands on more traces, and my theory started to break down. The ideal k varied by workload. So, why not make it adapt?

The key insight was tracking the graduation rate: what fraction of items survive long enough to cross the protection threshold before being evicted.

A low graduation rate means items are being evicted before they have a chance to prove their worth—the cache is too small for the current protection threshold. So we raise k to be more selective, only protecting the hottest items. A high graduation rate means the cache is comfortable and k can stay low.

rate := graduated / totalEvictions
if rate < 0.05 {
    k++  // Under pressure: be more selective
} else if rate > 0.20 {
    k--  // Comfortable: protect more items
}

This adaptation happens per-shard with no global coordination, maintaining lock-free read performance.

It actually wasn’t hard (for once) and it worked well. Almost too well.

  • It matches or slightly beats LRU
  • It trails otter by ~1–3% on some workloads
  • Under high cache pressure, it raises k to be more selective (only protect very hot items)
  • Under low cache pressure, k settles at 2 (standard protection for items accessed 3+ times)
  • Gets within 90–99% of Bélády’s optimal when the working set fits in the cache.

But it gets even better! The real win is in the concurrency:

Here’s a benchmark on a read-heavy workload (90% reads, 10% writes), 1M ops:

Goroutines   CloxCache    SimpleLRU   Otter
1            13.5M ops    27.7M ops   9.6M ops
4            48.9M ops    13.0M ops   18.7M ops
16           92.2M ops    8.6M ops    20.1M ops
64           87.3M ops    6.9M ops    15.3M ops
256          121.9M ops   6.6M ops    11.2M ops

  • CloxCache has no global eviction structure
  • CloxCache has no global admission filter
  • CloxCache shards share no state

As far as I can tell from days of looking for it, CloxCache is the only eviction algorithm that is concurrency-first by design. Not concurrency-friendly. Not concurrency-tolerant.

Concurrency first.

At 256 goroutines, CloxCache delivers 18x the throughput of a mutex-protected LRU and 11x the throughput of Otter.

So…

If you want:

  1. High hit rate across real-world workloads
  2. An eviction algorithm that’s shorter than some README files
  3. No global locks
  4. Multicore scaling that beats every algorithm I tested
  5. Predictable performance

Then CloxCache might actually be the simplest practical solution.

Is it the best cache ever invented? No.

Is it the simplest one that scales across cores? Probably, most likely, an emphatic yes.


Give it a go in your project: https://github.com/bottledcode/cloxcache
