Should You Run Your Database in Docker?
Someone mentioned not running databases in containers, but I don’t understand why it would be bad…
What problems can be caused by running PostgreSQL in a container? Should you be worried even if you don’t have a lot of traffic? Is it a general thing, and if so, why are so many apps setting up databases in their docker-compose files?
Let’s dig into this question, see what factors should influence your decision and whether you could do better or should start worrying.
Should you run databases in Docker?
If you’re doing so in your development environment, there’s nothing to be concerned about.
You don’t have important data to lose. In case anything goes wrong, you simply recreate your environment from scratch. (You can get your dev env up in a single command, right?)
Let’s look at a few upsides of using containers in this setting:
- There’s less clutter on your development machine
- You can work side by side on multiple projects that depend on slightly different database versions
- You can create a development environment on any OS in a reliable fashion
- Everything is “documented” through automation and reproducible
Personally, my favourite way to develop right now is to define the backing services of each project in a docker-compose.yaml file and bind them to local ports of the host. This way, I can run a dev server locally, using the complete power of my tools, and have it interact with databases which live in containers.
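A minimal sketch of such a file, assuming PostgreSQL is the backing service. The image tag, credentials and database name are placeholders chosen for illustration, not values from a real project:

```yaml
# docker-compose.yaml (development only)
services:
  db:
    image: postgres:16            # pin the major version this project expects
    environment:
      POSTGRES_USER: dev          # placeholder credentials, fine for dev only
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: app_dev        # hypothetical database name
    ports:
      - "127.0.0.1:5432:5432"     # publish on the host's loopback interface only
```

With this in place, `docker compose up -d` starts the database and a dev server running on the host can connect to `localhost:5432`. A second project can map a different host port, for example `127.0.0.1:5433:5432`, to run another database version side by side.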
But What About Dev/Prod Parity?
That should be taken care of by your deployment pipeline. Your important tests run against a staging/qa environment, and any critical errors are caught there before you deploy to production.
Even if there should be a discrepancy which went unnoticed in your dev environment, it should be detected and easy to fix if good processes are in place.
What About My Simple Live App?
If you’re working on a small project, and are deploying to a single machine, it’s completely okay to run your database in a Docker container.
Be sure to mount a volume to make the data persistent, and have backup processes in place. Restore a backup every once in a while to make sure your backups are actually usable.
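As a rough sketch of the volume part, assuming the official PostgreSQL image and a named volume called `pgdata` (a name made up for this example):

```yaml
# docker-compose.yaml (small single-machine deployment)
services:
  db:
    image: postgres:16
    restart: unless-stopped                     # come back up after a reboot or crash
    environment:
      POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}   # taken from the shell or an .env file
    volumes:
      - pgdata:/var/lib/postgresql/data         # data directory used by this image version

volumes:
  pgdata: {}                                    # named volume, outlives the container
```

A named volume survives removing and recreating the container, but `docker compose down -v` deletes it, and it offers no protection against a dead disk or a bad migration. It is persistence, not a backup, so regular dumps (for example with `pg_dump`) stored on another machine are still needed.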
A lot of people are running their small projects using Docker containers, or using docker-compose.yaml files to bring up their complete stack. It’s convenient, and perfectly fine for small projects handling non-crucial data. In the worst case - you restore a backup and are back in the game.
If you have a production system which people depend on, please take this recommendation with a grain of salt, and take a look further below.
Pitfall: Scheduling Across Multiple Machines
Here’s a pretty bad anti-pattern, which can cause you a lot of trouble, even if you’re just working on a small project.
You should not run stateful applications in orchestration tools which are built for stateless apps.
It’s all about assumptions. Such tools are designed to orchestrate Docker containers which house stateless applications. Such applications don’t mind being terminated at any time, any number of them can run at the same time without communicating with each other, and nobody will really notice if a new container takes over on a different machine.
This is obviously not true at all for databases! They need their data to be available, and being interrupted can cause all kinds of havoc.
If you’re using Kubernetes and your database runs in a ReplicaSet or ReplicationController, that’s a serious problem. You’d need to make use of StatefulSets and PersistentVolumeClaims. This means your cluster should provide a way to create PersistentVolumes which can be attached on whichever node the database pod ends up on.
But even then, you can’t just put any stateful app into a StatefulSet and be done with it. You have to tune it and make sure that the application actually respects the assumptions the tool makes.
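To give an idea of what this looks like, here is a bare-bones single-instance sketch. All names (`postgres`, `postgres-credentials`, `data`) and the storage size are hypothetical, and it assumes the cluster has a default StorageClass that can provision volumes and a headless Service named `postgres`:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres              # assumes a headless Service with this name exists
  replicas: 1                        # a single instance, no replication or failover
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      terminationGracePeriodSeconds: 60   # give the database time to shut down cleanly
      containers:
        - name: postgres
          image: postgres:16
          ports:
            - containerPort: 5432
          env:
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials   # hypothetical Secret
                  key: password
            - name: PGDATA                     # use a subdirectory of the mounted volume
              value: /var/lib/postgresql/data/pgdata
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:              # one PersistentVolumeClaim per pod, kept across restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi            # placeholder size
```

This only covers the storage side: the pod gets a stable identity and the same volume again after a reschedule. It does nothing about replication, failover, backups or upgrades, which is exactly the tuning work mentioned above.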
Okay, but what about production? Is it a good idea to run your important databases in Docker containers?
In general, I’d say don’t use Docker for production databases. Don’t put everything in Docker containers just because.
The upsides of using containers listed for development environments don’t really apply if you are building a long-term stable environment which should be easy to maintain and reliable.
Databases are critical services. They take effort to operate, and more to do so reliably. If you really really need your data to stick around and be safe no matter what, you don’t want unnecessary risks.
Running databases with valuable data in Docker has been known to cause trouble.
Docker WILL crash. Docker WILL destroy everything it touches.
There’s a history of weird bugs causing data corruption, and there are many gotchas when operating a crucial service and performing maintenance tasks. Do you really trust all the details of Docker volumes and filesystem modules to be consistent and failsafe?
Database services provided by your cloud provider are the best way to go for production databases (and that also means staging, for the sake of prod/staging parity). Use RDS or Aurora if you’re on AWS, Azure Database for PostgreSQL on Azure, or an equivalent hosted database service from your cloud provider. This will simplify a lot of management tasks such as updating minor versions, handling regular backups and even scaling up.
What If Managed Database Services Are Out Of The Question?
So it’s super important data, which is too sensitive to go to a managed service?
Is there a really well-maintained internal Kubernetes cluster? Is the service known to be in a reliable state when running on k8s? Are there actual important business reasons to use Kubernetes for everything? Maybe. But I’d still be very hesitant.
In all other cases, I’d skip Docker, and go with a dedicated machine for each part of the DB cluster instead, so there is as little complexity as possible.
Docker is great for running databases in a development environment! You can even use it for databases of small, non-critical projects which run on a single server. Just make sure to have regular backups (as you should in any case), and you’ll be fine.
For the love of all that’s good - don’t run dockerized databases in container orchestration/scheduling tools without making sure that they can handle stateful apps. They probably can’t, and you will see a lot of weird issues. Even if they can, you usually need to tune the app to be compatible with the tool in question.
Should you use Docker for production databases? No. Simply because there are better options, like the database services managed by your cloud provider. If you really have to self-host such services in a reliable fashion, you’re in for a lot of work and learning. Set up dedicated machines and skip Docker.