Increasing numbers of services in a cluster can quickly lead to versioning dependency hell. Learning how to manage those environments properly will save users from problems when testing.
In the last blog post, we discussed Kubernetes Environments Management Basics.
Having many software development teams deploy multiple services to a cluster can quickly lead to dependency problems between different deployed versions.
To start, here is a simplified two-service example. If service A depends on service B, and service B introduces a breaking change in the part of its API that service A uses, then service A stops working. This isn’t a problem when there are only two teams: the first team will probably wait until the second team fixes its service. But what happens with twenty-six teams instead of two? They’ll likely end up with a shared environment that is constantly broken.
In this post, we’ll look at what happens as the number of services in a Kubernetes cluster grows, and we’ll use the concept of canary deployment to explain one way to manage it.
Canary deployment methodology is used to reduce the risk created when releasing a new version of the software. With this method of deployment, new incremental releases are pushed out to a subset of the infrastructure, and the software is tested before it’s rolled out to the rest of the infrastructure.
This way, only a small group gets exposed to the new version of the software while it’s being tested, and this group can be an early indicator of potential problems. If a canary deployment fails, it affects only a small part of the environment, allowing teams to find the root cause of the problem and fix it with minimal impact.
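To make the idea concrete, here is a minimal Python sketch of the weighted-routing mechanism behind canary releases. The function and parameter names are illustrative, not any particular tool’s API; real canary routing is usually done by a load balancer or service mesh rather than in application code.

```python
import random

def route_request(stable_handler, canary_handler, canary_weight=0.05):
    """Route a single request: a small fraction (canary_weight) of
    traffic goes to the new version, the rest to the stable one."""
    if random.random() < canary_weight:
        return canary_handler()
    return stable_handler()

# Roughly 5% of these requests would hit the canary while it is being tested.
responses = [route_request(lambda: "v1", lambda: "v2") for _ in range(1000)]
```

If the canary’s error rate or latency degrades, `canary_weight` is dropped back to zero; if it holds up, the weight is increased until the new version serves all traffic.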
What happens if, instead, every team gets its own copy of the shared environment?
With this solution, every team has its own private development environment, and conflicts within development disappear. The downside is that they’ve only pushed the problem down the line: any issues caused by dependencies on what other teams deploy now surface in staging instead.
Let’s consider another way to handle a Shared Development Environment.
In this example, service team C sets up a shared development environment with services from teams A, B, and D.
They pull services from staging to a shared development environment so that they can develop against a proper version. They know that if the integration works, they can push to staging without any problems because they've already tested it with everything that is in staging.
With a service mesh, requests can be routed between arbitrary versions of services.
With it, traffic can be routed through service A in the shared development environment to the actual staging environment. This allows a single copy of the code to run against a clean staging environment, so users get the advantage of everything being stable and well-structured.
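A sketch of what such a mesh routing rule decides per request, assuming a hypothetical override header (`x-env-override`) of the kind service meshes commonly match on; the header name and route table are illustrative, not a real mesh configuration:

```python
def pick_upstream(service, headers, routes, default_env="staging"):
    """Resolve which copy of `service` a request should reach.
    A request carrying the override header is sent to that environment's
    copy; everything else falls through to the default environment."""
    env = headers.get("x-env-override", default_env)
    # Fall back to the default environment if no copy exists in `env`.
    return routes[service].get(env, routes[service][default_env])

# Illustrative route table: service A exists in both environments.
routes = {"A": {"staging": "a.staging.svc", "shared-dev": "a.shared-dev.svc"}}
```

This is how one team can run its own copy of service A in shared development while every other dependency is served from staging.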
In principle, this works. But problems arise at large scale (e.g., 20 to 50 teams pushing 100 microservices): teams will sometimes have conflicts.
In this scenario, we have 26 teams all pushing out their different services. But, to simplify the discussion, we’ll focus on 3 of them.
These services are dependent on each other. In this example, the team working on service A is based in Berlin. They test version three of their service, push it to staging, everything works, and they go home.
The team developing service B is based in London. Their setup works against version three of service A and version four of service Z. But they decide to leave early rather than deploy it.
The team for service Z is based in New York. They’ve tested against version three of service A and version one of service B. Everything works, so they push their new version of service Z to staging and go home.
The next morning, the London team comes back and deploys its service. When they tested the previous night, everything was fine, but something changed overnight: the New York team updated service Z, so the London team’s tests are no longer valid. They push anyway, and the service breaks.
The above example might seem like an edge case. But take a more mundane one: one service has a testing process that takes 40 minutes, another one that takes 2 minutes. If both are deployed at the same time, the one with the 2-minute window wins. The one with the 40-minute window might pass its tests, but by the time it deploys, everything fails. This happens in reality, and probably more often than people give it credit for.
To avoid such conflicts when moving from shared development to staging, you can use a testing environment.
In our example, we start with having the current state of staging in our testing environment, where an automated pipeline picks up deployments from shared development sequentially.
In the example above, we talked about service A, then service Z, then service B. If you queue up these deployments and run the tests after each one, stopping when a test fails, you can identify exactly which deployment caused the issue. The responsible team gets a message that the deployment failed, knows they have an incompatibility, and can fix it.
At the same time, the incompatible change can be rejected: instead of updating to the new version of that service, the testing environment continues with the previous version, which has already been tested and is running stably in staging. This way, the teams make sure that no incompatible version of any service reaches staging.
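The pipeline logic described above can be sketched in a few lines of Python. This is a simulation of the idea, not a real CI system; the function names and the environment-as-dict representation are assumptions for illustration:

```python
def promote_sequentially(staging, queued_deployments, integration_test):
    """Apply queued deployments one at a time on top of a copy of staging.
    After each deployment the integration test runs; a failing deployment
    is rejected and the version already stable in staging is kept."""
    candidate = dict(staging)
    rejected = []
    for service, version in queued_deployments:
        previous = candidate.get(service)
        candidate[service] = version
        if not integration_test(candidate):
            # Reject the change and fall back to the previously tested version.
            if previous is None:
                del candidate[service]
            else:
                candidate[service] = previous
            rejected.append((service, version))
    return candidate, rejected
```

With staging at `{A: 3, Z: 4, B: 1}`, the queue A→4, Z→5, B→2 from the story above, and a test that fails only for the combination of Z 5 with B 2, the pipeline accepts A and Z and rejects B, so service B’s team is notified while staging stays stable.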
The result is a much more stable staging environment.
Another important development environment is feature development. This environment works similarly to the testing environment, but without the automation.
In this case, engineers don't deploy directly to the shared development environment. They apply changes from their local development setup to the feature development environment, and once they've tested the changes there, they deploy to the shared development environment. You can think of this as a kind of reverse canary deployment, where you pull the upstream environment into the downstream environment.
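The "reverse canary" composition is simple enough to sketch: a feature environment is a copy of the upstream environment with only the developer's own services swapped out. The function name and dict representation are illustrative assumptions:

```python
def make_feature_env(upstream_env, local_changes):
    """A feature environment: a copy of the upstream (shared development)
    environment with only the developer's own services overridden."""
    env = dict(upstream_env)
    env.update(local_changes)
    return env

# A developer testing a work-in-progress build of service B against
# the versions everyone else is running in shared development.
feature_env = make_feature_env({"A": "v3", "B": "v1"}, {"B": "dev-build"})
```

Because everything except the overridden services matches upstream, a change that passes tests here is very likely to survive promotion to the shared development environment.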
There are many strategies for managing environments for Kubernetes-based apps. An interesting approach is to use a Service Mesh to enable a “reverse canary” setup. This can be done by automating an ephemeral testing environment as a gateway to moving to the next environment. With the Humanitec platform, we’ve built a solution to set up and manage such ephemeral environments in a very easy way. Explore our Internal Developer Platform on your own time. Start a free trial.