I look at hundreds of DevOps setups a year. Lately, everybody is talking about GitOps. I noticed that some teams were increasingly frustrated with this approach, while others were immensely positive. But google “GitOps” and everyone is positive. I started to wonder why.
It became clear to me that teams had a positive initial experience with the approach, yet frustration grew down the line. Specifically, it grew with the:
- Time teams had used the GitOps approach
- Number of apps or services teams had introduced to GitOps
I started to realize that if I mapped several cases on a coordinate system with two axes (satisfaction level with GitOps and number of YAML files), they would approximate a descending graph. In other words, one could map satisfaction with GitOps as a function of the number of YAML files.
Can we explain this function theoretically?
What are the reasons for this behaviour change? Why does the number of YAML files have an impact on the satisfaction level with the GitOps approach? To explain this, I came up with the following thought-experiment.
You’re tasked with building a factory line. The inputs are pre-processed pieces of steel. You need to build an assembly line that takes those pieces and presses them into different screws. Those screws will go into extremely sensitive medical equipment. If this equipment fails, the consequences are severe. You need end-to-end auditability of which screw came from which piece of steel.
So far, so good. Not a simple exercise but doable if you can control the input, right? If you get the exact same piece of metal in the exact same quality every time, it’s straightforward to ensure that the output is consistent.
Let’s assume the assembly line is perfect and is never the cause of failure. This assumption leaves the input as the only reason a screw can fail. Let’s say we only have one screw. This case can be characterized with the following graphic:
If the steel delivered by supplier 1 works with a probability of 90%, the overall probability that the system (the medical equipment) functions as expected equals the probability that screw 1 works. This gives us the following likelihood of failure:
P(system fails) = 1 - P(steel from supplier 1 works) = 1 - 0.9 = 0.1
Solution: In the single-supplier example, the failure probability equals 10%.
Unfortunately, life is more complicated. You now have three steel suppliers. Again, for the sake of simplicity, we assume that steel from each supplier is used for exactly one screw in the machine. Our machine now consists of three screws that can each fail. In a graphical representation, it looks something like this:
To keep the formula easy, we will assume the steel from all three suppliers works with the exact same probability of 90%.
P1=0.9, P2=0.9, and P3=0.9
The combined probability that the medical equipment functions as expected can be calculated by multiplying all probabilities:
P(system works) = P1*P2*P3
Since all three probabilities are equal, we can simplify to:
P(system works) = P^n
Which means the likelihood that the system does not work equals:
P(system fails) = 1 - P^n
Solution: Putting in the actual numbers, this leaves us with a 27.1% likelihood that this system fails at some point:
P(system fails) = 1 - 0.9^3 = 0.271 or 27.1%
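If you want to check the arithmetic yourself, here is a minimal Python sketch of the formula. The function name `failure_probability` is my own, and it assumes, like the thought experiment, that every input works with the same probability and that failures are independent.

```python
def failure_probability(p_works: float, n_inputs: int) -> float:
    """P(system fails) = 1 - P^n for n independent inputs that each work with probability P."""
    return 1 - p_works ** n_inputs


# One supplier versus three suppliers, each delivering working steel 90% of the time.
for n in (1, 3):
    print(f"{n} supplier(s): {failure_probability(0.9, n):.1%} chance of failure")
```

This prints 10.0% for one supplier and 27.1% for three, matching the numbers above.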
Reality is even more screwed (haha). You will have several screws made from steel from different suppliers. We will refrain from complicating this further. You get the idea now.
What does this tell us? With an increasing number of suppliers (each with their own likelihood of supplying failing pieces), your overall probability of failure rises. Remember, the likelihood of failure with a single supplier and a single screw was 10%, whereas with three suppliers and three screws it equals 27.1%.
It also tells us that if you start simple, you can go for quite some time without experiencing any failures. You can actually calculate statistically how long you can go on before you experience the first failure.
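One way to make that calculation concrete is sketched below. It is only an illustration under my simplified model: every input works independently with the same probability, and the 50% threshold and both function names are assumptions I picked for the example.

```python
import math


def inputs_until_failure_likely(p_works: float, threshold: float = 0.5) -> int:
    """Smallest n for which P(system fails) = 1 - P^n reaches the threshold."""
    return math.ceil(math.log(1 - threshold) / math.log(p_works))


def expected_inputs_until_first_failure(p_works: float) -> float:
    """Mean of a geometric distribution with per-input failure probability 1 - P."""
    return 1 / (1 - p_works)


print(inputs_until_failure_likely(0.9))          # inputs until failure is more likely than not
print(expected_inputs_until_first_failure(0.9))  # average position of the first failing input
```

With 90% reliable inputs, failure becomes more likely than not at 7 inputs, and on average the first failing input is about the tenth one.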
In general, with a growing number of processes, failure will hit. Once failure hits, it also gets increasingly hard to ensure your audit trail. You’ll soon find yourself trying to understand what piece of steel - by which supplier - was used to produce the specific screw that then failed.
Why am I telling you this nonsense about screws, you ask?
Because it’s a good way to understand the fundamental problem with GitOps. The problem is not the toolchain (although it is mostly immature at the moment) but the nature of its input. Let’s translate the above example into our software world!
- Your assembly line is your GitOps toolchain and workflow.
- Your suppliers are individual developers.
- Steel is now a set of configuration files written in YAML.
- Your screws are microservices.
- The machine you’re building is an application.
If we have one service from one supplier, the world looks like this:
Now, let’s add microservices from different developers:
Your likelihood of failure rises with the same formula. It depends on the number of YAML files from individual developers that configure an individual microservice and the number of suppliers.
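In practice, not every developer writes YAML with the same reliability, so the product no longer collapses to P^n. Here is a small sketch of that more general case. The developer names and per-file probabilities are invented for illustration, and the files are still assumed to fail independently of each other.

```python
import math

# Hypothetical probability that each YAML file is correct, grouped by the developer who wrote it.
yaml_reliability = {
    "dev_a": [0.99, 0.99],
    "dev_b": [0.95],
    "dev_c": [0.98, 0.97, 0.99],
}


def system_works_probability(reliability_by_developer: dict) -> float:
    """Multiply the reliabilities of all files, assuming failures are independent."""
    all_files = [p for files in reliability_by_developer.values() for p in files]
    return math.prod(all_files)


p_works = system_works_probability(yaml_reliability)
print(f"P(system works) = {p_works:.3f}, P(system fails) = {1 - p_works:.3f}")
```

Every additional file multiplies in another factor below 1, so the probability that everything works can only shrink as files and developers are added.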
You will likely go without any trouble for some time and some volume of YAML. But the time will come when it fails. And when it fails, auditing where the system failed becomes a nightmare with GitOps: finding what introduced the problem, and why, is a real challenge.
Using a version control system to browse unstructured scripts and trace an error is a horrible endeavor, and it gets more horrible as the total number of YAML files rises.
Let’s adapt our original screws graphic. We can also swap the Y-axis for the probability that the system does not experience a failure:
This matches well with how the Director of Platform of a huge global delivery company has recently described his experience with the GitOps approach:
“GitOps is like magic if it works, and like magic if it doesn’t.”
In short, as you do not control the quality and consistency of the input, and the system is hard to audit, throwing an assembly line (or automation framework) on top is a terrible idea. It will not fail immediately, but it will eventually. When it does, it’s bad. Things get worse with time and with the number of services in the system.
So what to do now?
You have exactly one way of solving this problem: You focus on the input.
Your problem lies with the unstructured nature of the “YAML” format. If you can solve this, you can use the GitOps approach (or anything else), and it will work. But if you decide not to eradicate the root cause, you’ll run into GitOps’ scalability and auditability problems.
Ádám Sándor from Container Solutions actually wrote a great piece on this. So how do you fix the input? How is configuration as code possible (with all its advantages) while offsetting the downsides (the unstructured format and the lack of control)?
The key is to remove or minimize human error when creating those YAML files, which is the main cause of an increased probability of failure. The fewer “suppliers” you have, the lower your likelihood of failure.
So how do you actually do that? You cannot erase it entirely, because it’s unrealistic to believe that a.) a single set of manifests can rule them all and b.) the Ops team can write all manifests themselves. The solution is to work with “baseline configurations” and let developers apply changes to those baseline configurations in a standardized way. Developers can create different baseline configuration profiles.
In those profiles, they set specific standards such as “the default CPU allocation on a workload always equals 0.25.” Developers can make changes against these default templates via a centralized UI or CLI. At deployment time, a tool (an Internal Developer Platform, for instance) can apply the changes to the baseline configurations and create the manifests.
As you can see, we are now “creating” new manifests from a set of clearly auditable rules for every single deployment. This is an immutable record of the state of application configurations at deployment time. Whether you use GitOps to then execute those configurations or any other method doesn’t really matter. As we are now tightly controlling the input, the assembly line isn’t as important.
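To illustrate the idea, here is a minimal sketch of how a tool could render a manifest from a baseline plus a developer's overrides. Everything in it is hypothetical: the `BASELINE` profile, the field names, and the `render_manifest` function are not the API of any real platform. Only the 0.25 CPU default comes from the example above.

```python
import copy
import hashlib
import json

# Hypothetical baseline profile; 0.25 CPU is the default mentioned in the text.
BASELINE = {"resources": {"cpu": 0.25, "memory_mb": 256}, "replicas": 1}


def deep_merge(base: dict, overrides: dict) -> dict:
    """Return a new dict with overrides applied on top of base, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_manifest(service: str, overrides: dict) -> dict:
    """Build the manifest for one deployment plus a content hash that identifies it."""
    manifest = {"service": service, **deep_merge(BASELINE, overrides)}
    serialized = json.dumps(manifest, sort_keys=True)
    digest = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
    return {"manifest": manifest, "overrides": overrides, "sha256": digest}


# A developer only states what differs from the baseline.
record = render_manifest("checkout", {"resources": {"cpu": 0.5}})
print(json.dumps(record["manifest"], sort_keys=True))
```

The developer's input shrinks to a small, reviewable set of overrides, and storing the rendered manifest together with its hash for each deployment gives you something concrete to audit or roll back to.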
This strategy provides the Ops team with more control over the YAML files and prevents any “surprise” changes by developers while still giving them the needed freedom to configure their applications.
In other words, developers aren’t restricted, as they can still apply all changes if necessary, while you eradicate input variability. And because one new set of manifests is created per deployment, every change is traceable, auditable, and simple to roll back to the previous version.
Do you have a different experience?
Look at our little formula. If you are a small team, or you work with a limited number of services, your likelihood of regularly hitting problems is low, maybe very low.
If this makes it feasible for you, I’m not saying you should change anything. But if you are running 5+ apps with 10+ services each in production, you should seriously consider whether getting into GitOps is a good idea.
In my experience, it isn’t.