January 22, 2015

The case against Docker

Trying to use Docker in production for several weeks finally ended in the decision to put Docker aside for the moment.

Over the last weeks I tried to use Docker in production for the following use cases:

  • putting eXist-db into a container for the Onkopedia relaunch in order to simplify the eXist-db installation and administration (prod, dev, staging)
  • putting Plone 4.3 and eXist-db into a container for having an easy way to manage a demo instance of XML Director
  • using eXist-db and BaseX in containers in order to test the XML Director backend against different XML database backends in an automated way (see the sketch below)
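For illustration, such a test run boils down to starting both database containers, pointing the test suite at them and throwing them away afterwards. The image names, ports and the bin/test runner below are only placeholders for our actual setup:

# start both XML databases as throw-away containers (image names are placeholders)
docker run -d --name existdb -p 8080:8080 local/existdb
docker run -d --name basex -p 8984:8984 local/basex

# run the backend tests against both, then remove the containers
EXISTDB_URL=http://localhost:8080/exist BASEX_URL=http://localhost:8984 bin/test
docker rm -f existdb basex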

This blog post summarizes my former blog posts (link, link) about Docker.

Docker is not very developer friendly

For production we installed a decent VM running CentOS 7 with a recent kernel version (3.10) that is supported by Docker. I recreated the related Docker images on the deployment box from scratch. The bad experience discussed in my former blog posts remained. A typical build under Docker was 5-10 times slower than executing the same scripts and code directly in a shell on the same machine. Pushing the three images - each about 1.3 GB in size - took more than two hours. One layer of an image got pushed at a decent speed close to our bandwidth limit, while the next layer crawled over the net at 500 KB per second... completely unpredictable push behavior. The same behavior was reproducible on a different host in a different data center. Pulling the images on a different host showed the same downstream issues with the Docker registry - completely useless and time consuming.
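For what it is worth, the numbers above came from nothing more sophisticated than wrapping the build and push in time; the image tag here is a placeholder:

# measure a full image build and the push to the registry (tag is a placeholder)
time docker build -t xmldirector/existdb .
time docker push xmldirector/existdb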

But anyway... running the Docker images on the host caused the next surprise. Starting eXist-db, executing a small Plone script for the site setup and finally starting the Plone instance took about ten minutes (under one minute without Docker). The complete virtual machine became very unresponsive during that time, with a CPU load going through the roof up to 10, although nothing was running on the VM except this one Docker container.

But anyway... I proceeded to the next Docker container and tried to run BaseX and eXist-db. There was a mistake in one of the Dockerfiles and I had to re-run the build for eXist-db. This build suddenly failed while running apt-get inside the build... network issues. I checked the logs and discovered problems with iptables. Not being a network guru, I filed a bug report on the Docker repo on GitHub. It turned out that the DOCKER chain in the iptables configuration had been lost and therefore the complete network functionality of the Docker build failed. Nobody could tell me where and why this happened. The only manual change done earlier was adding port 80 to the list of public ports - perhaps something happened there. The only way around the problem is to restart Docker. But restarting Docker also means that your containers go away and need to be restarted - a major pain in the ass... why is Docker so stupid and monolithic that containers cannot keep running? This is bad application design.
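For the record, this is roughly how the broken network showed up and how it had to be worked around; the container names are placeholders:

# the DOCKER chain had silently disappeared from the NAT table
iptables -t nat -L -n | grep DOCKER

# the only "fix": restart the daemon and bring the containers back up by hand
systemctl restart docker
docker start existdb plone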

Containers are for software, not for data

Docker containers are consumables. They should be used to scale applications, for firing up more app servers with the same setup, etc. However, the Docker guys and fan boys want to put data into containers and speak of "data containers". What the fuck? Data belongs on the filesystem, not into a container that can neither be cloned easily nor backed up incrementally in a reasonable way. Containers are for software, not for data.
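Instead of a data container, the data can simply live on the host filesystem and be mounted into the container, where it can be rsync'ed and backed up incrementally like any other directory. Host path, container path and image name below are placeholders:

# keep the eXist-db data on the host and mount it into the container
docker run -d --name existdb \
    -v /srv/existdb/data:/exist/webapp/WEB-INF/data \
    -p 8080:8080 local/existdb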

Docker made me inefficient, Docker blocks my daily work

The slowness of Docker is a big pain. Build and deployment procedures are not predictable. Even with only three or four images in use I end up with something like thirty containers and images on the system (docker ps -a, docker images). There is not even a procedure for cleaning up this mess except fiddling around with something like

docker rm $(docker ps -a -q)
docker rmi $(docker images -q)

Docker needed around 7-10 seconds per image/container removal. The overall cleanup operation took several minutes. Oh well, stopping the Docker daemon and removing /var/lib/docker manually is much faster in practice.
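The brute-force cleanup looks like this - it wipes every image and container on the host, so only do it when nothing is worth keeping:

# nuke everything Docker knows about - much faster than docker rm/rmi
systemctl stop docker
rm -rf /var/lib/docker
systemctl start docker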

Conclusions

Containers are great and provide the right abstraction layers for running applications in production.

The theory and ideas behind Docker are great, but its architecture and implementation are a mess. Docker is completely unusable in production. It is unreliable, it is unpredictable, it is flaky. The idea of working with filesystem layers is great, but in reality it sucks (pushing and pulling 30-40 layers takes a lot of time - at least with the current implementation). The idea of the Dockerfile is great, but in reality it sucks (you cannot re-run the build from a certain step without fiddling inside the Dockerfile). Especially with Plone buildouts it takes a long time to re-run a dockerized buildout without any chance of using buildout caches in some way.

Other options? CoreOS came out with its Rocket approach some weeks ago... too new to consider for production at this time. Rocket looks promising and well thought out (compared to Docker) but is far away from being ready for prime time. Vagrant is a nice way of deploying to virtual machines, however this is not the level of granularity we are all waiting for. NixOS with the Nix package manager? The Nix package manager looks nice and powerful and I have heard only good things about Nix, but I am not sure how it solves the issue of isolated environments and how it plays together with containers - especially since NixOS is a black box for me right now and I need to look deeper into its functionality and features. For now: back to old-style deployments.

A big sigh.....