After a year of getting deep into Docker, and just about every orchestration service from Docker Cloud, to Kubernetes, to Mesos, to Rancher, to Triton, to OpenShift, to ECS, and on and on … there’s one thing I’ve learned: they are all good for one or two things, but none of them is the full package yet.
So none of that should be enough to make me leave Docker behind, right? I mean, after all, in my last blog post I talked about building my own orchestration service. Well, let me link to some of the bugs I’ve run into over the last year:
- https://github.com/docker/docker/issues/14203
- Actually, fuck that.
Let me just tell you about them…
Let’s recall the first one, the one I actually linked: our build server would clean up images, and things broke … the fix was literally to put a random string in the environment variables. OK, so production now hinges on some magical string? Got it.
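For anyone who hasn’t seen this trick, the usual version of it is cache busting, and it looks roughly like this (an illustrative sketch only; I’m assuming the standard variant of the hack, and the CACHE_BUST variable name and image tag are made up):

```bash
# Rewrite an ENV line with a random value before each build, so Docker's
# layer cache can't reuse stale layers from that line onward.
sed -i "s/^ENV CACHE_BUST=.*/ENV CACHE_BUST=$(date +%s)-$RANDOM/" Dockerfile
docker build -t myapp .
```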
Kernel panics if a container stops under just the right conditions. It’s actually a bug in the kernel that no one seems to know how to fix… If you are using a container to handle your networking (or you just have a busy container), you will probably run into this one. The only thing you can do is hard reset the machine.
Mounts getting stuck. You can end up with mount points that can’t be unmounted or deleted from the host (and therefore from any container) … this one also requires a hard reset to fix. I’ve run into it mounting NFS from inside a container. It is a lot of fun.
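For the record, here’s the escalation I’d go through before giving up (the path is made up); in my case even the lazy unmount left the mount in limbo, and a hard reset was still the only way out:

```bash
# See what, if anything, is still holding the mount.
fuser -vm /mnt/stuck-nfs
# Try a normal unmount, then a forced one (mainly useful for NFS),
# then a lazy detach as a last resort.
umount /mnt/stuck-nfs \
  || umount -f /mnt/stuck-nfs \
  || umount -l /mnt/stuck-nfs
```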
Accidentally saturating a host. This one is easier to do than you’d think … you either slog through the thrashing and wait for the load to die back down, or you bring up more infrastructure and wait for it … wait for it … keep waiting …
Unable to stop a container, because it doesn’t exist. Yeah, Docker will show you that it exists and that it is running. ps on the host will even show its processes running away and sucking up your resources. But Docker won’t stop the container, because it swears up and down it doesn’t exist. The only fix I know of is a reboot … hopefully you won’t kernel panic on the way…
Unable to stop a container, period. Sometimes they just won’t stop … so you reboot, and take down everything else you’re running on that node. Having fun yet?
Removing a bunch of containers at once sometimes fails. Yep, true story, you might have to do them one at a time.
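So instead of handing docker rm a list of IDs, you end up writing something like this (a sketch; the status filter is only there so you don’t nuke running containers):

```bash
# Remove exited containers one at a time, since batch removal
# sometimes dies partway through the list.
for id in $(docker ps -aq --filter status=exited); do
  docker rm "$id" || echo "failed to remove $id" >&2
done
```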
Running out of file descriptors. This has happened a few times, generally while I’m trying to fix one of the many other things that can go wrong… Reboot.
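If you want to confirm it’s the daemon before pulling the trigger, something like this works as root (assuming a Docker of this era, where the daemon shows up as “docker daemon” in the process list; newer releases run dockerd instead):

```bash
# Count the daemon's open file descriptors and compare against its limit.
pid=$(pgrep -f 'docker daemon' | head -n1)
ls /proc/$pid/fd | wc -l
grep 'Max open files' /proc/$pid/limits
```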
I might be making this seem a bit more drastic than it really is. I’ve only run into these problems several times. But that’s several times more than I should ever have had to, and not something I care to run into in a production situation.
With bare metal or VMs you don’t really run into these problems; you run into well-defined software problems, caused by misconfiguration or errors in your code. With Docker, you run into the kind of problems that used to be caused mostly by failing hardware. In fact, using Docker is like standing beside an open rack and plugging random network cards, keyboards, displays and printers into the PCI slots until it panics.
You are playing with software that touches parts of the kernel usually reserved for drivers, written in a language that barely has a debugger. It is akin to the days when we wrote CGI scripts and Rasmus Lerdorf wrote PHP. That’s the state of Docker right now, in my opinion.
Right now Docker is this really cool toy that is transforming how we do business and write applications. It’s forcing people to write stateless applications to take advantage of it, and there’s nothing wrong with that.
But I’m done. I’ll write my stateless apps and keep my golden images. Hell, there are some good use cases for Docker that I’ll keep in my back pocket while I wait for the next generation. Right now there’s still quite a bit in flux in Docker infrastructure as they decouple a lot of things … which will hopefully make it better.
But for now, I’m done.