We have all been there: the code works perfectly in the local dev setup, but as soon as it hits a staging or production instance, things start to go south. The reasons are plentiful; among them are differences in infrastructure or networking configuration, or a lack of production-like data in the dev environment. Catching the problems on a staging instance or during CI is obviously better than hitting production, but fixing the issues and getting feedback from the pipeline can be tedious and involve lots of waiting. And if the code does manage to land in production, prepare for some exciting hours.
As today’s deployment patterns have become more and more complicated, replicating them locally has become increasingly difficult, widening the gap between dev and production even further.
In part I of this mini-series we will take a high-level look at some of the problems and potential solutions around working with different environments for different stages of the delivery pipeline. In part II we will get hands-on to achieve production-like dev environments and manage them in a concise way with Garden.
Deployment Environments for different stages of the Software Development Lifecycle (SDLC)
The evolution of deployment environments is a logical one. First and foremost there is production, where customers or users access the application. Downtime of production can be extremely costly and the industry is highly invested in minimizing disturbances to the production environment. So the code should only hit production once it has been tested thoroughly and the infrastructure of the production environment should be resilient and able to tolerate failures.
This is where staging comes in for most companies. A staging environment is a pre-production environment that mirrors production: it contains all the different services, and changes are rolled out and used internally there before a release, so new code can be tested fully. Fairly often, however, this environment is a singleton, which means teams or individuals have to wait their turn to use it.
The CI environment is an ephemeral environment that is in most cases created on certain events, such as pull requests on a version control platform like GitHub or GitLab. Because all the services that make up the application can be deployed and torn down on demand for a single branch, this environment can run not only unit tests but also end-to-end tests.
Preview environments can be used to showcase changes to team members and other teams like QA or product managers. They are basically a snapshot of the application at a specific commit or release tag. As the name suggests, they allow a preview of the application as it would be with a specific change or new feature.
Development environments are where most of the coding actually happens and where developers spend most of their time. These have traditionally been deployed to the local laptop and are often partial environments, i.e. subsets of the full system: only the apps, databases, etc. needed for what is currently being worked on.
With containerization on the rise, these have often moved to local Docker Compose setups, which make it possible to deploy several services at the same time and open up the possibility of running end-to-end tests.
Local development environments are usually configured and run differently than in other environments—realism is sacrificed to achieve fast iteration and enable features like live code reloading. This loss of realism isn’t without its problems, though.
The following section highlights a few problems with this classical approach.
Why dev and prod environments shouldn’t be so far apart 💔
So while developers spend most of their time in a local dev environment, it is the one farthest away from production. This can lead to a couple of problems, among them:
Changes haven’t been tested end-to-end, because the whole stack making up the application can’t be deployed to a laptop.
The configuration for the production services differs from the configuration for the local development environment. Imagine a Docker Compose setup in dev and a Kubernetes cluster in production: communication between containers requires very different configuration due to the different DNS and networking setups. This means that at some point someone needs to translate the local changes into manifests or infrastructure-as-code that also work remotely. If it isn’t the developers doing that, however, context about the expected outcome of configuration changes can be lost, and SRE or DevOps folks rarely use the local dev setup in return. One could argue that this creates a barrier between teams and a friction point in the software delivery lifecycle.
The dev environment hides scaling problems because it never sees real data or traffic. This can be a really frustrating one. Think of a backend service running some complicated database queries: everything is fine against a freshly spawned local database loaded with a small, manageable data set. Once those changes hit real data, however, they might severely stress the system, and if the new query is triggered not by one local user but by thousands of users at the same time, it can cause performance degradation or even outages.
The CI environment as an intermediate step can cause long wait times. Waiting 30 minutes for a complex CI run only to catch a small bug simply isn’t fun. Speaking only for myself here, but I have a lot of trouble getting any other work done while trying to wrap up a commit and waiting for the pipeline to finish. Running end-to-end tests in a production-like dev environment significantly speeds up the time to release a feature: mistakes can be caught early in the process and test suites can be run in isolation.
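To make the configuration-drift point above concrete, here is a minimal sketch of how the same service-to-service connection might be addressed in each world (all names, images, and environment variables are illustrative):

```yaml
# docker-compose.yml (dev): containers on the default Compose network
# resolve each other by service name, so "db" is simply the hostname.
services:
  api:
    image: example/api:dev
    environment:
      DATABASE_URL: postgres://db:5432/app   # hypothetical env var
  db:
    image: postgres:15
```

```yaml
# Kubernetes (prod): a Service object provides the DNS name, reachable
# inside the cluster as db.app-prod.svc.cluster.local.
apiVersion: v1
kind: Service
metadata:
  name: db
  namespace: app-prod
spec:
  selector:
    app: db
  ports:
    - port: 5432
```

The production deployment would then point at `db.app-prod.svc.cluster.local:5432` instead of plain `db`; translating between these two addressing schemes is exactly the kind of drift described above.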
Dev environments should strive towards their role model: production ✨
So how can we bring dev environments closer to production? Let’s look at each point individually.
Deploy the stack to a remote environment. This could be a development cloud account or Kubernetes cluster. The key to avoiding a horrendous cloud bill is ephemeral environments and auto-scaling or re-provisioning of resources. A lot of developments in the platform engineering space are pointing towards this way of working.
Developers work with production manifests to deploy. Engineers working on infrastructure provide manifests and cloud environments that developers can use and extend. These manifests can be wrapped in helpful tooling like Garden that enables a smooth developer experience and helps manage the complexity.
Persistent data can be loaded via S3 buckets or database snapshots. In the cloud it can be stored, duplicated, mounted, or restored much more easily than on a local machine.
All tests, including end-to-end tests, can be run in the dev environment. In my opinion, this one is a good marker of success when trying to make dev environments more production-like. It dramatically reduces the time developers spend waiting for CI pipelines, since any test failures can be reproduced and fixed without going through CI.
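For the data-loading point above, a snapshot round trip can be as simple as the following sketch (bucket, database, and file names are hypothetical, and production dumps should of course be sanitized first):

```
# Taken once, e.g. by a scheduled job with production access:
pg_dump --format=custom --no-owner appdb > appdb.dump
aws s3 cp appdb.dump s3://example-db-snapshots/latest.dump

# Run by a developer to hydrate an ephemeral dev database:
aws s3 cp s3://example-db-snapshots/latest.dump ./appdb.dump
pg_restore --clean --no-owner --dbname=appdb_dev ./appdb.dump
```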
In part I of this series, we looked at how production-like development environments not only increase reliability by avoiding configuration drift and giving early access to end-to-end tests, but also increase overall developer productivity and happiness.
In part II we will look at how to achieve production-like dev environments and stay on top of managing multiple environments in a consistent and clear way with Garden.
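As a small teaser for part II: once a project is configured for Garden, deploying the whole stack to the remote dev environment and running its tests boils down to two commands.

```
garden deploy   # builds and deploys all services to the dev environment
garden test     # runs the tests defined in the project against it
```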