Paolo Garri is the Director of Technology at the TV broadcaster Sport1. His team of 350 developers manages several 24/7 broadcasting streams globally across TV, digital, and mobile apps. His ops team therefore has to go the extra mile to ensure all of the application developers have what they need to work productively. Equally important is that all of Sport1’s systems are up and running, stable and secure. Paolo’s team, like most, started off running everything on-premises. They then moved to their first cloud and, soon after, made the decision to go multi-cloud.
In these kinds of scenarios, you can usually measure legacy by the number of scripts. There is a script for everything and even worse, there is a different kind of script for everything. It’s literally a zoo. There is the bash-bonobo, the manifest-monkey, the yaml-Yak, the Terraform-Tiger. There is the Python-python and the perl-penguin just to name a few. In one case, Paolo hired an external agency and it took them three costly weeks just to understand the languages and pipelines of scripts. That’s just a lot of money going down the drain for nothing.
Paolo asked his DevOps team a simple question: Can you tell me exactly how many scripts we have and what they are doing? Alarmingly, no one could account for all the scripts, which numbered close to 400 at that time. So he decided he needed to understand how this could have happened and, more importantly, how to prevent it in the future. This zoo of scripts is very dangerous. It means your team is constantly trying to fix scripts someone else wrote. It also doesn’t allow for self-service and massively drives up your overhead costs.
How did this zoo develop in the first place?
Paolo’s case isn’t unique. When I recently chatted with Jan Löffler, who built Zalando’s Internal Developer Platform, it was the same story: “We just realized hiring more Ops wouldn’t do the job, we had to fundamentally understand the problem and change.” “Pff, what amateurs,” you think! Don’t get too cocky, my friend. GitHub, arguably one of the best teams in the world, had the exact same issue, as Jason, their CTO, recently told me in a conversation: “The world runs on bash-scripts, you don’t want Wall Street to notice that.”
I would argue that this problem is a function of DevOps size, number of applications, services and tools. And I would say this gets more serious for every team that has been building apps for more than 5 years and has more than 30 application developers - and in particular with those thinking about growing their headcount.
In trying to understand the reason why the zoo gets out of hand, I’ve looked at hundreds of setups in detail.
I’ve identified a set of key reasons:
1. You’re stuck in the “you build it you run it” philosophy
That is probably the most classic example, especially if we’re moving into things like Kubernetes. It’s a common misconception, especially in younger teams, that there shouldn’t be specialization. I find it highly unlikely that you will get every React developer to master Helm charts. Not because they couldn’t, but because they optimize for their career chances and their on-the-job performance. For them, it pays off more to sharpen their TypeScript skills.
2. You think everyone should have all the freedom of choice
Allowing everyone to write already unstructured script formats in their preferred language is the founding principle of the script zoo. It’s like running an app with many microservices that are written in 10 different languages. That’s technically perfectly possible but still adds to the complexity of maintenance and further development. Same applies to scripts. Variety in language is a really, really bad idea.
3. You neglect standardization within the scripts and yaml
Unstructured, unstandardized scripts freak me out. Take the example of Helm charts: one of their fundamental design flaws is that they offer so many roads to Rome. I’d say it’s perfectly possible to reach the exact same cluster state with 1000+ variations in the way you write the yaml or script. This makes backward compatibility impossible, and it also makes it really hard to read and understand what the hell is going on. As the brilliant Joel once asserted: “It’s harder to read code than to write it.” So timeless and true, and it also applies to scripting and yaml.
4. You don’t regularly invest in cleaning up
This requires discipline: you have to continuously and rigorously go back into the weeds and clean up. It’s a little bit like the workshop of a decent craftsman, which is well structured, continuously optimized and regularly revised.
So what to do? GitHub, Zalando and Sport1 all tackled the problem by building Internal Developer Platforms (IDPs). These tools have a huge impact on streamlining internal workflows and operations. As Jason from GitHub puts it: “We needed to be more rigorous and we needed more bespoke paths to get code from developers’ machines to production.” You might want to consider an IDP as well, and I’m going to explain the technical architecture below, but first there are two core things in your culture you should fix.
Fix your culture first
There are some clear steps you can take depending on the severity of the problem. I propose you start with fixing your culture first. This will already lead to a number of uncomfortable conversations:
- Explain the value of specialization
Make sure your team really understands why there are certain people doing A and others doing B. Also explain that this doesn’t imply anyone is incapable just because he or she isn’t dealing with everything. It does not exclude self-serving.
- Explain standardization:
Your team isn’t stupid, they just don’t have your high-level view. They will understand why you need to introduce the standards when they experience the benefits (such as improved state of mind and being on call for things you wrote out in the wild).
Once you’re done with that, go ahead and radically standardize:
- Limit the scope of scripts and standardize how you create YAML files.
- Use standard integrations between systems wherever possible.
But don’t forget to fix your deployment architecture too
The single most important thing you should do if you want to fix this mess for good is:
Define strict baseline templates onto which you let teams apply changes, and let them apply these changes preferably through a standard interface (CLI, API or UI).
Let’s look at this using the example of a process for an application running on Kubernetes, which I took from another piece I wrote about the evolution of your K8s practices (but it really applies to any other scripted process as well). Let’s define three different roles (a single human could perform one or all of these roles, depending on how you manage things):
- Infrastructure Operator: applies changes to the underlying infrastructure
- Application Operator: applies changes to the application configuration
- Application Developer: develops the application
Really immature teams, prior to optimizing their delivery process, have a setup that tends to look like this:
As you can see, changes to the infrastructure happen without leaving any real trace, and the same applies to changes to the application configuration.
Use versioned manifests
The next maturity step is the use of versioned manifests for the app configs (but still completely unstructured and unstandardized) and also version changes to the infrastructure. This is already a great step towards better maintainability, but still doesn’t tame the zoo of scripts and yaml. In fact, with the versioning element things can get even more out of hand.
From a graphical point of view this would look like this:
This solution is generally more sustainable, but it still doesn’t solve the optimization problem. You still have this zoo of scripts, and it can get even worse because you are now versioning every variation.
The ultimate solution for your zoo of scripts
The actual solution is to define baseline templates by application or service type. Your DevOps/Ops teams define these baseline templates. The goal here is to have the smallest number of baseline templates that is sufficient to handle the complexity of your applications. Next, you have to log changes to these scripts. The best way to do this is to allow changes only through a central API (or CLI if you want) and to “create” the resulting script from these changes, which makes the record somewhat immutable. If several scripts are involved in getting your application and the infrastructure into the right configuration state, make sure that these changes are logged and applied to all included scripts. This state of standardization requires strict parameterization of environment variables so that you can inject them at run-time, which will likely require some lightweight refactoring of your existing code.
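To make the idea of a baseline template plus controlled changes more concrete, here is a minimal sketch in Python. Everything in it is hypothetical and only for illustration: the `BASELINE` dictionary, the `ALLOWED_KEYS` set, the `render` function and the registry URL are assumptions, not part of any real platform. A real setup would work on actual Helm charts or Kubernetes manifests behind an API.

```python
from copy import deepcopy

# Hypothetical baseline template owned by the ops team (illustrative values).
BASELINE = {
    "kind": "Deployment",
    "replicas": 2,
    "image": "registry.example.com/app:latest",
    "env": {"LOG_LEVEL": "info"},
}

# Hypothetical allow-list: the only keys application developers may change.
ALLOWED_KEYS = {"replicas", "image", "env"}


def render(baseline, overrides):
    """Return a new manifest built from the baseline plus validated overrides."""
    unknown = set(overrides) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"overrides not allowed: {sorted(unknown)}")
    manifest = deepcopy(baseline)  # never mutate the baseline itself
    for key, value in overrides.items():
        if key == "env":
            manifest["env"].update(value)  # merge env vars instead of replacing them
        else:
            manifest[key] = value
    return manifest


if __name__ == "__main__":
    # A developer changes the replica count and injects one environment variable.
    print(render(BASELINE, {"replicas": 3, "env": {"DB_HOST": "db.internal"}}))
```

The design choice worth noting is that the resulting manifest is always generated, never edited by hand, so every running configuration can be traced back to one baseline and one set of overrides.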
In a graphical representation this would look like this for an application running on K8s consuming some out-of-cluster resources:
As you can see, there are only two roles left. Infra Operators (your DevOps team) set the baseline templates and assign the external resources that can be consumed. Application Developers have a certain range of values they can change (env variables, database, environment type). These changes are applied to the baseline template set by the ops team. The API then takes the resulting charts and puts all the resources into the required state. These changes to the baseline chart are saved in a log at deployment time, making the config reversible at any point in time.
And all of a sudden the clouds clear. There is no script creep anymore because everything is derived from the baseline charts in a logged and auditable manner. Developers are self-serving without being able to screw anything up, and ops can concentrate on the stuff that matters: nailing the SLA.
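The claim that logging changes at deployment time makes the config reversible can be sketched as an append-only log that is replayed on top of the baseline. This is a simplified illustration under stated assumptions: the `DeploymentLog` class and its `deploy` and `state_at` methods are hypothetical names, the configuration is a flat dictionary, and the log lives in memory rather than in durable storage.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class DeploymentLog:
    """Hypothetical append-only record of the overrides applied to one baseline."""

    baseline: dict
    entries: list = field(default_factory=list)  # one override dict per deployment

    def deploy(self, overrides):
        """Record the overrides and return the configuration that results."""
        self.entries.append(deepcopy(overrides))
        return self.state_at(len(self.entries))

    def state_at(self, version):
        """Rebuild the config as it was after `version` deployments (0 = baseline)."""
        if not 0 <= version <= len(self.entries):
            raise IndexError(f"no such version: {version}")
        config = deepcopy(self.baseline)
        for overrides in self.entries[:version]:
            config.update(overrides)  # flat keys only in this sketch
        return config


if __name__ == "__main__":
    log = DeploymentLog({"replicas": 2, "env_type": "staging"})
    log.deploy({"replicas": 3})
    log.deploy({"env_type": "production"})
    print(log.state_at(1))  # the configuration after the first deployment only
```

Because nothing is ever overwritten, rolling back is just a matter of rebuilding an earlier version and deploying it again.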
Build your Internal Developer Platform with Humanitec
So, how do you build such an API? Well, you might just want to tune into how Paolo solved this at Sport1 or Jan at Zalando. They built Internal Developer Platforms to solve these problems. You can do this too, with Humanitec. Good luck on this journey, my friend!