I swear, for the past few months my Kubernetes cluster has had a major issue at least once a month. Once, a simple update took out my entire network infrastructure, and it was down for nearly a whole weekend… This time… this time it was down for nearly a week: the worst one yet.
So, what happened? Heh, this one is pretty funny, in retrospect.
I use an amazing and free cloud tool to monitor my servers. I run my cluster on bare metal, in a data center. It’s a ton of fun, and this is my hobby. I host a variety of random tools and stuff that are decently popular. There was one Very Important™️ metric the monitoring software didn’t track: free file descriptors.
Interestingly, the monitoring tool went into an infinite loop where it would open its database repeatedly without ever closing it. Luckily, this was limited to a single node in the cluster.
As the node ran out of free file descriptors, exciting things started happening. For starters, new connections couldn’t be opened, but existing ones still worked. Eventually, the monitoring tool would crash, freeing its descriptors and letting some new connections through (for anything still retrying, like etcd and Longhorn).
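This failure mode is easy to reproduce in miniature. The sketch below (Linux, Python; the 64-descriptor limit is an arbitrary choice for the demo) lowers the process’s own fd limit, leaks descriptors until `open` fails, and shows that descriptors opened beforehand keep working fine:

```python
import errno
import os
import resource
import tempfile

# Lower this process's descriptor limit so the demo exhausts it quickly.
resource.setrlimit(resource.RLIMIT_NOFILE, (64, 64))

existing = tempfile.TemporaryFile()      # opened *before* exhaustion
existing.write(b"still works")

held = []
try:
    while True:
        held.append(os.open(os.devnull, os.O_RDONLY))  # leak descriptors
except OSError as e:
    # EMFILE ("Too many open files"): new opens now fail...
    assert e.errno == errno.EMFILE

# ...but descriptors opened earlier keep working normally.
existing.seek(0)
data = existing.read()
print(data)                              # b'still works'

for fd in held:                          # release the leaked descriptors
    os.close(fd)
```

Nothing about the already-open connections looks broken, which is exactly why the cluster limped along instead of failing loudly.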
However, anything storing data on the node would eventually become hopelessly corrupted. It turns out most software doesn’t expect to run out of file descriptors: it handles the error as though the network failed or the file didn’t exist, and tries to create a new file (truncating the old one).
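A hypothetical sketch of that anti-pattern (not any particular library’s code) next to a safer variant. The dangerous part is catching the broad `OSError`, which swallows descriptor exhaustion along with “file not found”:

```python
def naive_open_for_update(path):
    """Anti-pattern: treat *any* failure to open as "the file doesn't
    exist" and recreate it from scratch."""
    try:
        return open(path, "r+b")
    except OSError:
        # BUG: errno.EMFILE ("too many open files") lands here too.
        # If descriptors free up a moment later -- say, because the
        # leaking process crashed -- this open succeeds in "w+b" mode
        # and truncates the existing data to zero bytes.
        return open(path, "w+b")

def careful_open_for_update(path):
    """Only recreate the file when the error really is "not found";
    let descriptor exhaustion propagate so the caller can back off."""
    try:
        return open(path, "r+b")
    except FileNotFoundError:
        return open(path, "x+b")  # 'x' refuses to clobber an existing file
    # OSError with errno.EMFILE / errno.ENFILE is deliberately not caught.
```

In the naive version, the data survives only as long as every open keeps failing; the moment a descriptor frees up, the “recovery” path destroys the file it was trying to save.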
All monitoring was reporting intermittent issues, but none lasted long enough to trigger alarms. It took five days before the corruption started revealing itself… By then, things had started going down completely, in a cascading failure that took out the entire cluster. I’m still not sure what exactly triggered the cascade, but I’m glad it happened eventually.
Thankfully, there were point-in-time backups… or so I thought. After identifying the issue and shutting down the monitoring software, I realized the corruption had spread to the S3-compatible stores as well. Luckily, an off-site mirror wasn’t corrupted, thanks to a ‘bug.’
It took nearly a week of working several hours a day, continuously fixing things, before even this blog came back up. My takeaways:
- Monitor the number of free file descriptors (and choose a better monitoring tool)
- Don’t colocate backups with the thing they are backing up (kinda obvious)
- Somehow create an alarm for an increase in intermittent issues.
- Spread the news that software should “properly” handle file descriptor starvation (this post).
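For the first takeaway, a minimal Linux-only sketch of what to watch (in a real deployment you’d scrape these counters with an exporter rather than a script; the 80% threshold is an arbitrary example):

```python
import os
import resource

def process_fd_usage():
    """Open descriptors for this process vs. its soft limit.
    Linux-specific: counts entries under /proc/self/fd."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    used = len(os.listdir("/proc/self/fd"))
    return used, soft

def system_fd_usage():
    """System-wide view from /proc/sys/fs/file-nr:
    (allocated, allocated-but-unused, kernel maximum)."""
    with open("/proc/sys/fs/file-nr") as f:
        allocated, _unused, maximum = map(int, f.read().split())
    return allocated, maximum

used, limit = process_fd_usage()
print(f"process: {used}/{limit} file descriptors in use")

# Page someone well before the ceiling, not at it.
if used > 0.8 * limit:
    print("ALERT: file descriptors nearly exhausted")
```

The per-process view would have caught the leaking monitoring tool directly; the system-wide view catches whichever process leaks next.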
So, if you want to silently and slowly kill a Kubernetes cluster: just open the same file repeatedly until you run out of file descriptors, then crash after a few seconds.
Note: issues were opened in all affected software and libraries.