
Your Helm Zoo will kill you

This article is controversial. It aggressively questions helm-charts and current dev workflow designs, and I’m well aware that not everyone will like this. Let me be clear before we dive in: this is an enterprise view. It’s a view that is relevant to team sizes of 20 developers onwards. If you’re a smaller dev shop that builds a few apps, this doesn’t apply to you and you should just keep things as is. But for those of you that are working at scale or that are about to scale: watch out. Your helm-chart zoo will kill you. Maybe not tomorrow but almost definitely next year.

Working change by change with kubectl

At first they created the kubectl-Kangaroo, and everyone could do everything the way they wanted. However, the challenge with just using kubectl is that you are working change by change. That's fast, but it makes it impossible to track what has actually changed in your cluster. One super clever person went ahead, managed everything in Kubernetes manifests, and versioned them in Git. Dope my friend, dope.

A kubectl-Kangaroo that looked so sweet when you first came across it punches your DevOps team in the face.

The Helm-Hedgehogs crawl your way 

But then the number of changes grew, and someone realized that most of the manifest isn't really changing. "Cool," said the community, "let's create the Helm-Chart-Hedgehogs." Now everyone thought, "Yeah, I can write only the changing bits of the manifest and take the rest from the template." That made writing charts faster, you could still version them in Git, and you could still execute changes against the K8s API.

But then the team and the apps grew. Versions piled up, and applications were added. At least this time there was a clear trace of what had happened to Kubernetes resources. And then the prickliness started.

Hedgehog babies look cute, innocent and a lot like their mother. That’s misleading. They will develop their own very unique nasty character as they grow up.

Legacy is like the shaggy fur

It's lightweight at first. Half an hour to browse through an unstructured chart to understand what got changed where. Then yet another onboarding session so people understand how to handle charts. A couple of hours per DevOps engineer to roll back or find a specific version. Numerous hours just to align the changes to the infrastructure and the resources outside the cluster (databases etc.) with the ones in the cluster. But legacy is like a shaggy fur that is difficult to untangle.

Legacy is like a dirty wool-pig. Better don’t touch it!

A task that takes 30 minutes today might take 30 hours tomorrow. As you roll out more apps and more services, and as colleagues leave with precious documentation in their heads, hours become weeks. The itching becomes painful. Really, really painful. At first you will increase operational overhead until your DevOps colleagues get frustrated with being the help desk for overwhelmed app developers. Then you will try some potential solutions. Adding the Kustomize-Kingfisher sounds great. But you will discover that it's yet another patch for the disease and not a cure.

A Kustomize-Kingfisher in action. You can literally see the performance and ease. But wait half a year and the majestic feeling will cease.

At this point, folks usually start rearranging the way they store Helm charts and versions in Git. They'll have long arguments about whether it's smarter to store them separately or alongside the application. Then someone attends yet another meetup, and you call it GitOps! Yessss, GitOps. That will do the job for another quarter, until you find that it has fundamental flaws as well. I could spend ages talking about those, but this great article about what they are speaks for itself.

Chameleon Garden
GitOps-Chameleons look like magic. Every time you deploy them, something changes. When it works: magic. When it doesn't: also magic. Only this time it takes you hours of sweating while the business team screams at you, because your service is down and you cannot find the correct version to roll back to.

Helm has fundamental design flaws 

Long story short, I can tell you where the problem lies. Helm charts and the entire ecosystem have a fundamental design flaw: they are designed only for setups where a comparably small number of experts works closely together, writing both the applications and the charts. Everyone owns their charts, which they know inside out. They develop a very specific style of writing them and know exactly what to change when necessary. The thing is: that's not the mainstream use case, especially not in the enterprise.

In the enterprise you will have an ever-changing group of people working on several applications across teams. You will have many more onboarding situations, you will have division of labor, and you will always have the application developer who just wants to focus on React and TypeScript, not on Helm charts. In these setups, the approach is a recipe for disaster: one that comes in the form of a slow and uncomfortable disease that will eventually halt delivery almost completely. I talked to the CTO of a 500-person dev team that is completely devastated because they are hardly delivering features at all.

Let’s zoom in on the details 

Design flaw #1: Too many ways to get to Rome

Scripts, and Helm charts in particular, allow too many ways to get to Rome. If you want to update a cluster, there are just too many ways an individual contributor can write the syntax (I would argue 1000+). That makes it really hard to standardize, and really hard to maintain, update and read.

Solution for design flaw #1: 

Treat manifests as a protocol and standardize the way you apply and track changes to the baseline configurations of this protocol through an API. Never let anyone except the DevOps maintainers change the template directly, but ONLY through the API. Keep the number of different templates to an absolute minimum.

Let the API also apply the manifest to the in-cluster resources through kubectl.
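To make this less abstract, here is a minimal sketch of what such a "manifest API" could look like. Everything here is hypothetical and for illustration only: the baseline template, the parameter whitelist, and the function name are assumptions, not the article's actual implementation. The point is that contributors never touch the template; they can only pass a small, fixed set of parameters, so every change is standardized and trackable.

```python
# Hypothetical sketch: one baseline template, changed only through
# a whitelisted set of parameters. Anything else is rejected.
from string import Template

BASELINE = Template("""\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: $name
spec:
  replicas: $replicas
  template:
    spec:
      containers:
      - name: $name
        image: $image
""")

ALLOWED_PARAMS = {"name", "image", "replicas"}

def render_manifest(**params):
    """Render the baseline template; refuse any non-whitelisted key."""
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"not part of the protocol: {sorted(unknown)}")
    return BASELINE.substitute(params)

manifest = render_manifest(name="web", image="web:1.4.2", replicas=3)
# The API would then pipe `manifest` into `kubectl apply -f -` and
# record the parameters themselves as the audit trail.
```

Because only the parameters vary, diffing two deployments means diffing two tiny parameter sets instead of two hand-written charts.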

Design flaw #2: No proper way to manage out-of-cluster resources

The scripts we detailed above deal with the in-cluster stuff. But what about all the out-of-cluster resources? Which database is your app supposed to consume, and in what state? What DNS, in what setup? You might have that scripted in IaC, but how do you make the connection between in-cluster and out-of-cluster resources? This is a graphical representation of the problem:

But what’s the cure? Fortunately the cure is surprisingly simple:

Solution for design flaw #2

The cure for design flaw #2 is slightly harder to implement than #1. The key is to have a central record referencing the correct state of the in-cluster resources as well as the out-of-cluster resources. We've built this for ourselves and call it a deployment set. It combines the information about the state of everything the application needs to run at deployment time. It tells the Kubernetes API what to set up, and it calls the correct scripts to reinstate the desired state of the resources (internally we call these scripts resource drivers).
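A deployment set could be sketched as a single, immutable record. To be clear, the field names and driver names below are my illustrative assumptions; the article does not specify the actual schema. What matters is that in-cluster and out-of-cluster state are pinned together in one place.

```python
# Hypothetical sketch of a "deployment set": one record that pins both
# in-cluster and out-of-cluster state at deployment time.
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentSet:
    app: str
    image: str        # in-cluster: what the baseline chart deploys
    cluster: str      # in-cluster: where it runs
    database: str     # out-of-cluster: which managed DB, in what state
    dns: str          # out-of-cluster: which DNS record
    drivers: tuple = ("postgres", "route53")  # resource drivers to call

prod = DeploymentSet(
    app="checkout",
    image="checkout:2.7.0",
    cluster="prod-eu-1",
    database="postgres://prod-db/checkout",
    dns="checkout.example.com",
)
# Rolling back now means re-applying an older DeploymentSet record:
# everything the app needs is referenced in one place.
```

The record is frozen on purpose: a deployment set is a snapshot you re-apply, not an object you mutate.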

Don’t these solutions sound awfully abstract?

Let's look at a graphical representation. It would look like this:

As you can see, there are now two "operators". DevOps teams set the baseline charts (for the K8s resources, to cure disease #1) and configure the drivers (necessary for all out-of-cluster resources). Application developers use a simple GUI, CLI or API to specify deployment sets ("I want this image with this database, in this cluster, with this DNS setting, etc."). The API combines all the information and executes the state.
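The execution flow described above can be sketched in a few lines. Again, this is a hedged illustration under assumed names: the driver registry, the `execute` function, and the final kubectl step are stand-ins for whatever real drivers and apply mechanism a team would wire in.

```python
# Hypothetical sketch of the execution flow: resolve a deployment set
# into driver calls (out-of-cluster) plus a manifest apply (in-cluster).

DRIVERS = {
    "database": lambda spec: f"ensure DB {spec}",
    "dns":      lambda spec: f"ensure DNS {spec}",
}

def execute(deployment_set):
    """Reinstate the desired state described by a deployment set."""
    actions = []
    # 1. Out-of-cluster resources go through their resource drivers,
    #    configured by the DevOps team.
    for kind, spec in deployment_set["resources"].items():
        actions.append(DRIVERS[kind](spec))
    # 2. In-cluster resources: the baseline chart is rendered with the
    #    set's parameters and handed to the Kubernetes API.
    actions.append(f"kubectl apply deployment {deployment_set['image']}")
    return actions

plan = execute({
    "image": "web:1.4.2",
    "resources": {"database": "pg-prod/web", "dns": "web.example.com"},
})
```

The application developer only writes the deployment set at the bottom; everything above it is owned and standardized by DevOps.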

What you will get using this approach:

  • Clear and easy tracking of all changes at scale, for all resource types.
  • Standardized and comparable charts that take a fraction of the time to update, maintain, and roll back.
  • Your developers don't need to learn kubectl, Helm or any other tool; they only specify the minimum necessary.
  • Replication of environments and developer self-service works like a charm.
  • Tracking and analytics on top of the API at scale.

And the best thing: You will only need a fraction of the total overhead previously required.

A problem becomes mainstream

Kubernetes is comparably new to the market, so it will be a while until this problem goes mainstream. But we already get many requests from teams who see it coming on the horizon, much faster than they anticipated. It will hurt. A lot. But there is a cure if you follow the design patterns outlined above. Good luck on this journey!
