Linux Kernel: A Single Point of Failure


The last 24 hours have been an absolute disaster of downtime for me. I’m talking about the servers-being-down kind of downtime, not the Netflix-and-chill kind of downtime. The end result is that I discovered a new (to me) single point of failure.

This blog is one of several things that run on a small (but beefy) Kubernetes cluster. Since I’m cheap, I self-host everything; here’s the lay of the land:

  • Garage runs distributed around the world, outside of the cluster; backups and container images are stored here.
  • Ubuntu 22 running on bare metal via Hetzner.
  • K3s is my distribution of choice here.
  • MetalLB provides VIPs for the cluster.
  • Loft provides me with a good administration environment over Kubernetes.
  • Cilium provides me with a nice fast network inside the cluster (via native routing).
  • Longhorn provides me with endless fun and replicates storage.
  • Selenium, GitHub self-hosted runners, Minecraft, IRC, Loki, Netdata, this blog, etc. are also on here.

This means that when something breaks (and there is a lot that can break!), it’s all up to me to fix it. DevOps is my hobby; I don’t get paid to do this stuff in real life. But I digress.

Every Friday afternoon, I run upgrades. I have a little playbook I go through to make sure everything gets updated every so often. This week, one of those routine Friday steps went horribly wrong.

I have a free subscription to Ubuntu Pro, which live-patches the kernel (among other things), so normally, when I log in, it lets me know if a live patch has been applied and an upgrade is needed. This time, I got no such message. In fact, all I got was a linux-firmware update and a held-back mdadm update. So I applied those, updated a few Helm charts, and everything looked fine.
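
For the curious, the Friday playbook is nothing fancy. Assuming the standard Ubuntu Pro and canonical-livepatch clients (which is what I have), it boils down to roughly this, plus bumping whatever Helm charts are due:

    # Is livepatch happy, and is a reboot pending?
    pro status
    canonical-livepatch status

    # See what apt wants to do, then let it do it
    sudo apt update
    apt list --upgradable
    sudo apt upgrade

    # Then refresh Helm repos before bumping charts
    helm repo update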

Then I rebooted.

ALL.

HELL.

Broke loose. I performed a rolling reboot and didn’t even notice an issue until an entire availability zone went down. I almost missed it entirely (everything was still running in the other two zones); I only caught it because I never got the Slack message saying a database had recovered. I was about to reboot the final zone when I realized Slack was suspiciously quiet.

Into kubectl hell I delved, deep into containers and networking, to identify the issue (a rough sketch of the commands involved follows the list) …

  1. First, it was apparent that Longhorn volumes weren’t attached.
  2. After digging into those error messages, I realized something was up with DNS.
  3. DNS was totally and utterly broken, and the error messages coming out of it were cryptic at best. I started to suspect that networking itself was being wonky.
  4. In the networking pods, I saw error messages about validating webhooks being unreachable (because DNS was down). I deleted those webhooks.
  5. Finally, the networking pods started spewing out “BPF program is too large.”
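
In case it helps anyone retrace this, the poking around looked roughly like the commands below. The webhook name is a placeholder, and I’m assuming Cilium lives in kube-system with its usual k8s-app=cilium label (adjust for your setup):

    # 1. Longhorn volumes stuck detached / faulted
    kubectl -n longhorn-system get volumes.longhorn.io

    # 2 & 3. Is DNS actually resolving inside the cluster?
    kubectl run dnstest --rm -it --restart=Never --image=busybox:1.36 \
      -- nslookup kubernetes.default.svc.cluster.local

    # 4. Find and delete the validating webhooks that are unreachable while DNS is down
    kubectl get validatingwebhookconfigurations
    kubectl delete validatingwebhookconfiguration <unreachable-webhook-name>

    # 5. The smoking gun in the CNI agent logs
    kubectl -n kube-system logs -l k8s-app=cilium --tail=200 | grep -i 'too large'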

But wait… what? Literally nothing changed on the host except the firmware update. Surely Linus wouldn’t let something like this into the kernel? Oh wait, this is Ubuntu.

Eventually, I stumbled upon this issue, which led me to believe there might be a Cilium update for it. Sure enough, there was. Thank God that fixed my issue (or at least started the long road to recovery and making sure non-resilient services came back OK).
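
The fix itself, assuming Cilium was installed via its Helm chart into kube-system (the version below is a placeholder, not the actual release that fixed it), was roughly a chart bump and a rollout watch:

    # See what chart versions are available
    helm repo add cilium https://helm.cilium.io/   # no-op if already added
    helm repo update
    helm search repo cilium/cilium --versions | head

    # Upgrade in place, keeping the existing values
    helm upgrade cilium cilium/cilium -n kube-system \
      --version <fixed-version> --reuse-values

    # Watch the agents roll and come back healthy
    kubectl -n kube-system rollout status daemonset/cilium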

Anyway, it was at that moment that I realized having a homogeneous cluster is probably a bad idea. For the future:

  1. Mix up Linux distros and kernel versions (preferably all on the same major version, but with different update cadences).
  2. Run host upgrades in a canary fashion: upgrade a single host, verify things come back OK (particularly networking), and only then move on. Automate this (a sketch follows below).
  3. I’ve been fairly trusting of OS updates, but I’ll never trust one again.
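
Point 2 is the one I’ve started sketching first. Something along these lines, with the node name and the checks as placeholders for whatever your cluster actually cares about:

    NODE=worker-1   # the single canary host (placeholder name)

    # Take it out of rotation and drain workloads off it
    kubectl cordon "$NODE"
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --timeout=10m

    # Apply the OS upgrade and reboot (SSH here, but config management works too)
    ssh "$NODE" 'sudo apt update && sudo apt upgrade -y && sudo reboot'

    # Wait for it to rejoin, then verify the things that bit me: node health and DNS
    kubectl wait --for=condition=Ready "node/$NODE" --timeout=15m
    kubectl run dnscheck --rm -it --restart=Never --image=busybox:1.36 \
      -- nslookup kubernetes.default
    kubectl uncordon "$NODE"

    # Only if all of that looks good do the remaining hosts get the same treatment.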

So who knew that the Linux kernel could be a failure point? I’d like to say that I knew this intellectually, but I’ve read Linus’s posts; he doesn’t let crappy code into the kernel. In this case, though, it was a linux-firmware update, which probably shifted some bits around. I have no idea how BPF works, or why this update would suddenly push the BPF programs past the kernel’s size limits. I plan to learn a little more about it, at least so I can understand a bit of ‘the why.’
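
As a starting point for ‘the why’: the programs Cilium loads (and how big they end up) are at least visible from the host. This assumes bpftool is installed (on Ubuntu it ships with the linux-tools packages):

    # List every loaded BPF program with its translated and JITed sizes
    sudo bpftool prog show

    # Cilium's own view of what it has loaded and whether it's healthy
    kubectl -n kube-system exec ds/cilium -- cilium status --verbose | head -n 40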

I can say that I’ve now personally experienced this particular issue; hopefully, if I see it again, I’ll recognize it and fix it faster. This was an all-day affair, chasing one wild goose after another before getting to the bottom of it.

Until next time,

Rob

