
Why GitHub built their own Internal Developer Platform

What is an Internal Developer Platform (IDP)?

We believe that every developer needs three ways to view their work and their progress.

  • Their editor or IDE where they write code.
  • Version control where they collaborate and merge their work.
  • Their Internal Developer Platform (IDP) where they deliver their work to users.

Many tools exist to help developers deliver their work to users, so why go to the lengths required to create your own, or use an IDP such as Humanitec?

In the next series of posts from our dev roundtable interviews, we speak with tech leaders at major global companies who have created IDPs to discover what problems they were trying to solve, how they did it, and what advice they have for others following the same process.

In this article we summarize a conversation we had with Jason Warner, CTO of GitHub and previously VP of Engineering at Heroku. In both roles he focussed on creating a platform that allowed developer teams to self-serve new features and Ops teams to focus on creating reliable services. In its modern incarnation, GitHub would not be possible without an IDP.

Why did you build an Internal Developer Platform at GitHub?

If you look at GitHub as an organization, it has an almost magical flair to it. It feels strange to think that it is confronted with the same issues and infrastructure evolution as everybody else. But it started off as a Ruby-based monolith, and the teams followed a “you build it, you run it” policy. Jason says of the policy today: “If you run a setup where Dev and Ops talk to each other you don’t have a great setup.” Even GitHub didn’t start off with an IDP, but as they added to and changed their setup, it became increasingly obvious that they needed one.


What were the key reasons your setup became more complex and increasingly painful?

GitHub was originally built on bash scripts that didn’t properly reflect a modern rapidly growing and dynamic company. Or as Jason puts it:

“You cannot scale an efficient engineering organization on bash-scripts, you’ll die.”

Over time, the GitHub setup became more complex as they split their applications up into microservices, increased team size, and adopted a multi-datacenter approach. At this point the development workflows became unmanageable with their existing setup. You could look at it from a social, operational, development, or really any other view, but in the end the team had to become more rigorous and streamlined in their processes to deal efficiently with all the bash scripts, integration points, application configurations, and infrastructure setups.



Scalability is often the key driver. Innovative engineering organizations have to keep up their speed as they grow. “If you’re slow you die” is something of a mantra to Jason. In a high-velocity engineering organization you have to move fast without breaking things. That becomes increasingly difficult if you run on unstructured scripts and insufficient workflows.

A good indication that you need an IDP is when the pressure on Ops piles up and they start to struggle with the workload. Another is when you have to keep adding people to the team and realize that you would need 4-5x as many people to keep up.

It was a different experience for Devs and Ops teams

As seen in almost every example of teams building IDPs, the experience differs between development and Ops teams. Before the IDP, developers had more freedom in the way they set up and developed their applications. That might sound great in theory, but it meant that there weren’t any standards, and it became increasingly hard to track who did what and where. It also meant that there were hardly any self-service deployments, because every service, database, or resource had to be set up in one specific way. GitHub ended up in a situation where Ops teams provided developers with resources for services that the Devs had configured in whatever way they wanted. It also meant that Ops teams had to set up and maintain services whose configurations contained problems that were passed from one team to another. As Jason puts it:

“What we had allowed us to go off-path too often. That’s dangerous.” 

For Jason as the CTO the case was crystal clear: he didn’t want to trade speed for safety. He wanted to run an organization that was super fast, secure, and lean at the same time. An IDP was the only way to achieve this.

Team frustration was the business case

Whether to build an IDP was never the question; the difficulty was building it. It started at a grassroots level, because at some point the Ops teams were so fed up that they had to streamline and standardize practices internally just to keep their heads above water and scale. For Jason the number one driver was speed:

“If you’re slow, you die. If you care about not dying this is one of those investments you make.”

 

Back then GitHub wasn’t yet part of Microsoft, and Jason had CTO/CEO-like authority, which allowed him to simply decide to go all in on this. Had he needed to make a business case, he would have based it strictly on headcount: how much overhead you would have to add to serve application developers at scale so that they don’t slow down. The buy-or-build decision was easy: there was nothing else out there they could use.

How GitHub’s IDP was actually built

The team decided to base their IDP, called Moda, on Kubernetes as the orchestrator. Moda abstracts away everything related to K8s so that application developers have zero touchpoints with it. The IDP is ChatOps-driven (which makes GitHub one of the last supporters of this approach, and Jason isn’t sure he’d go down that route again). There is an internal catalog that manages services and is hooked together with service-level objectives (SLOs), and the entire platform is bespoke.

There is a site reliability engineering (SRE) group that manages the catalog services, and another team for all the integration points, especially between the IDP and the underlying technology and databases. Another team focuses exclusively on managing packages for different languages.

Specialized team members are focussed on managing certain elements of the containers themselves. The idea is that as long as an application developer creates an app or service that fits inside these containers, they get certain guarantees, such as monitoring, logging, alerting, and auditing, and all catalog integrations are taken care of.
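The "fit inside the container, get the guarantees" idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the manifest fields, the validation rules, and the function names are hypothetical and do not describe GitHub's actual platform. Only the list of guarantees is taken from the text above.

```python
from dataclasses import dataclass

# Guarantees named in the article; the platform attaches them, not the developer.
PLATFORM_GUARANTEES = ("monitoring", "logging", "alerting", "auditing", "catalog")


@dataclass
class ServiceManifest:
    """Hypothetical manifest an application developer would write."""
    name: str
    image: str
    port: int
    owner_team: str


@dataclass
class DeployableService:
    """A manifest plus the concerns the platform adds on the developer's behalf."""
    manifest: ServiceManifest
    guarantees: tuple = PLATFORM_GUARANTEES


def validate(manifest):
    """Return the reasons a manifest does not fit the standard container."""
    problems = []
    if not manifest.name:
        problems.append("name is required")
    # Simplistic illustrative check: treat any ':' as a pinned tag.
    if ":" not in manifest.image:
        problems.append("image must be pinned to a tag")
    if not (1 <= manifest.port <= 65535):
        problems.append("port must be between 1 and 65535")
    if not manifest.owner_team:
        problems.append("owner_team is required for the service catalog")
    return problems


def admit(manifest):
    """Accept a conforming manifest and attach the platform guarantees."""
    problems = validate(manifest)
    if problems:
        raise ValueError("; ".join(problems))
    return DeployableService(manifest)
```

In this sketch the developer only describes the service; the cross-cutting concerns are attached in one place, which is where specialized team members would maintain them.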

How many people built the platform? 

The majority of the work was not developing the platform, but maintaining it. Development started with maybe 4-5 people, quickly grew to 12 people, and at this point there are 40 FTEs exclusively focussed on the IDP. Jason mentioned that in any other organization this number would probably trend towards 100 FTEs, because “We understand these concepts better than anybody on the planet. In the end we invented 90% of all concepts we use today.”

What was the impact of rolling out the IDP? 

The core change was the workflow itself. Development now really felt like using Heroku. Need to spin up a new environment with a new database to test a feature-branch? That’s a simple command in Slack and the IDP takes care of the rest in the background.
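As a rough illustration of what such a chat command might involve, here is a minimal Python sketch that turns a chat message into an environment request and a list of background steps. The command syntax (`/env create <branch> [--with-db]`), the function names, and the steps are all hypothetical; the article does not describe GitHub's actual ChatOps commands.

```python
import shlex
from dataclasses import dataclass

USAGE = "expected: /env create <branch> [--with-db]"


@dataclass(frozen=True)
class EnvironmentRequest:
    """What the platform needs to know to build a test environment."""
    branch: str
    with_database: bool
    requested_by: str


def parse_command(text, user):
    """Parse a hypothetical '/env create <branch> [--with-db]' chat message."""
    tokens = shlex.split(text)
    if tokens[:2] != ["/env", "create"]:
        raise ValueError(USAGE)
    args = tokens[2:]
    flags = [a for a in args if a.startswith("--")]
    positional = [a for a in args if not a.startswith("--")]
    unknown = [f for f in flags if f != "--with-db"]
    if unknown or len(positional) != 1:
        raise ValueError(USAGE)
    return EnvironmentRequest(positional[0], "--with-db" in flags, user)


def plan(request):
    """List the steps the platform would run in the background (illustrative)."""
    steps = [f"create namespace for {request.branch}"]
    if request.with_database:
        steps.append(f"provision database for {request.branch}")
    steps.append(f"deploy branch {request.branch}")
    steps.append(f"notify {request.requested_by}")
    return steps
```

The developer types one line; everything in `plan` happens without them touching the orchestrator, which is the Heroku-like experience described above.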



The key change for the organization was that things were now scalable. No one in the app team has to think about DDoS prevention; anyone who cares deeply about a subject can have it represented in a container manifest. Compliance people aren’t spread out throughout the organization, but instead focus on the settings in the IDP. As long as a concern is represented in the IDP, it’s covered and you don’t need to worry about it. As Jason puts it:

“It’s really magical if you have it and I cannot understand how it’s possible to actually ship fast if you don’t have this.”

The impact was easy to measure. Teams were able to ship faster with a smaller headcount in Ops. They reduced the degrees of freedom each application developer had, and standardization drove efficiency. Developers became entirely self-serving, and there is zero unnecessary communication between teams. Keep in mind that “a good setup is one where Dev and Ops don’t need to talk to each other at all”.

How teams dealt with the change

Every change is hard and people usually don’t like it. GitHub was at the point where people started thinking they were getting too big, losing the company's grassroots character and soul.



App developers in particular, who wanted choice and freedom, didn’t like feeling constrained. What they hadn’t realized was that in the “you build it, you run it” world, they are on call when things go wrong, which is a high price for “total freedom”. It means you are responsible for whatever happens (which was especially true in the monolith days). At some point they did understand that. No one likes to trade speed against safety.

From an Ops perspective there wasn’t any push back. They were already so overwhelmed, they just wanted something to help them keep up. Or as Jason puts it:

“If you are ten feet under water you hope to get a snorkel that is 10 feet long.”

Lessons learned

First and foremost, he wouldn’t use ChatOps again. At Heroku they were completely obsessed with an amazing command line experience, and he replicated that at GitHub. He also questions Kubernetes. It’s one of those things that is too bespoke and too generalized at the same time. But then again, it’s the market standard now.

Interestingly enough, he would build the entire platform with an eye on productizing it for external use from the get-go. That wouldn’t be possible anymore; it’s already much too specific to GitHub. In his opinion, tools that get generalized and productized too late don’t really work, for example Spinnaker but also Kubernetes.

When teams should build an Internal Developer Platform

Jason has a clear opinion here. “For most of your life you should use a combination of GitHub and Heroku and just don’t deal with this concern.” The last thing anyone should do is let a setup float free. Speed is what keeps you alive, and you should put all your focus on development, with ops and management laser-focussed on whatever creates business value. Only if you really outgrow Heroku should you look around. You shouldn’t build an Internal Developer Platform yourself either. That is something for multi-trillion-dollar technology companies, and there are very few in the world to which this actually applies.

 


“If companies are able to dedicate less attention to internal ops their chance of survival increases.” 

As a rule of thumb, you should start investing in this once your setup outgrows the “several monoliths, a few databases and one data-center” world, because afterwards nothing will be the same:

“Going from monolith to microservice, going multi cloud. These things are like getting your first kid, nothing will be the same. You have to change the way you work.”

At this point you have to evaluate how you can keep the exact same speed while keeping ops overhead to a minimum and the setup scalable and secure. You need to start thinking about how to serve developers the way you serve customers. If you do this, you have a much better chance of surviving. As Jason puts it:

“Internal Developer Platforms are a trend because people realize we need to serve internal developers as much as we serve external customers.”

What’s next for GitHub

GitHub was founded around adding collaboration to Git. When Jason asks for feedback, 90% of it is focussed on things that improve the Git experience, such as pull request enhancements and similar features. That will remain the core area of focus, and they will keep making it better.


The second thing that people think about when they think of GitHub is GitHub Actions. They think of Actions as “CI”, but that’s really wrong. Actions is supposed to be a general compute platform and an end-to-end workflow. If you think about release management, CI, and so on sitting on top, and blur this with security analytics, insights, and code manipulation below the surface, you probably have a good idea of where GitHub will go.
