
What is an Internal Developer Platform (IDP)
We believe that every developer needs three ways to view their work and their progress.
- Their editor or IDE where they write code.
- Version control where they collaborate and merge their work.
- Their Internal Developer Platform (IDP) where they deliver their work to users.
Many tools exist to help developers deliver their work to users, so why go to the lengths required to create your own, or use an IDP such as Humanitec?
In the next series of posts from our dev roundtable interviews, we speak with tech leaders at major global companies who have created IDPs to discover what problems they were trying to solve, how they did it, and what advice they have for others following the same process.
In this article we summarize a conversation we had with Jason Warner, CTO of GitHub and previously VP of Engineering at Heroku. In both roles he focussed on creating a platform that allowed developer teams to self-serve new features and Ops teams to focus on creating reliable services. In its modern incarnation, GitHub would not be possible without an IDP.
Why did you build an Internal Developer Platform at GitHub?
If you look at GitHub as an organization it has this almost magical flair to it. It feels strange to think that they are confronted with the same issues and infrastructure evolution as everybody else. But it started off as a Ruby-based monolith. The teams followed a “you build it, you run it” policy. Jason says of the policy today:
“If you run a setup where Dev and Ops talk to each other you don’t have a great setup.”
Even GitHub didn’t start off with an IDP, but as they added to and changed their setup, it became increasingly obvious that they needed one.
What were the key reasons your setup became more complex and increasingly painful?
GitHub was originally built on bash scripts, which didn’t suit a modern, rapidly growing, dynamic company. Or as Jason puts it:
“You cannot scale an efficient engineering organization on bash-scripts, you’ll die.”
Over time, the GitHub setup became more complex as they split their applications into microservices, increased team size, and adopted a multi-datacenter approach. At this point the development workflows became unmanageable with their existing setup. You could look at it from a social, operational, development, or really any other view, but in the end the team had to become more rigorous and streamlined in their processes to deal efficiently with all the bash scripts, integration points, application configurations, and infrastructure setups.
Scalability is often the key driver. Innovative engineering organizations have to keep up the pace. “If you’re slow you die” is something of a mantra to Jason. In an engineering organization with high velocity you have to move fast without breaking things. That becomes increasingly difficult if you run on unstructured scripts with insufficient workflows.
A good indication that you need an IDP is when the pressure on Ops piles up and the team starts to struggle with the workload: you have to keep adding people, and you realize that you would need four to five times as many people to keep up.
It was a different experience for Devs and Ops teams
As seen in almost every example of teams building IDPs, the experience differs between development and Ops teams. Before the IDP, developers had more freedom in the way they set up and developed their applications. That might sound great in theory, but it meant that there weren’t any standards, and it became increasingly hard to track who did what and where. It also meant that there were hardly any self-service deployments, because every service, database, or resource had to be set up in its own specific way. GitHub ended up in a situation where Ops teams served developers resources for services that the Devs then configured however they wanted. It also meant that Ops teams had to set up and maintain services whose configurations contained problems that were passed from one team to another. As Jason puts it:
“What we had allowed us to go off-path too often. That’s dangerous.”
For Jason as the CTO the case was crystal clear: he didn’t want to trade speed for safety. He wanted to run an organization that was super fast, secure, and lean at the same time. An IDP was the only way to achieve this.
Team frustration was the business case
Whether to build an IDP wasn’t the question; the difficulty was building it. It started at a grassroots level because the Ops teams at some point were so fed up that they had to streamline and standardize practices internally to somehow keep their heads above water and scale. For Jason the number one driver was speed:
“If you’re slow, you die. If you care about not dying this is one of those investments you make.”
Back then GitHub wasn’t part of Microsoft yet, and Jason had CTO/CEO-like authority, which allowed him to just decide to go all in on this. Had he needed to make a business case, he would have made it strictly based on headcount: how much overhead you would have to add to serve application developers so they don’t slow down, if you want to keep the speed of your organization at scale. The buy-or-build decision was easy: there was nothing else out there they could use.
How GitHub’s IDP was actually built
The team decided to base their IDP, called Moda, on Kubernetes as the orchestrator. Moda abstracts away everything related to K8s so that application developers have zero touchpoints with it. The IDP is ChatOps-driven (which makes GitHub one of the last supporters of this approach, and Jason isn’t sure he’d go down that route again). There is an internal catalog that manages services and is hooked together with service-level objectives (SLOs), and the entire platform is bespoke.
There is a site reliability engineering (SRE) group that manages the catalog services, and another team for all the integration points, especially between the IDP and the underlying technology and databases. Another team focuses exclusively on managing packages for different languages.
Specialized team members are focussed on managing certain elements of the containers themselves. The idea is that as long as an application developer creates an app or service that fits inside these containers, they get certain guarantees such as monitoring, logging, alerting, and auditing, and all catalog integrations are taken care of.
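To make the "fit inside the container, get the guarantees" idea concrete, here is a minimal sketch in Python. It is not GitHub's actual manifest format or code: the names `ServiceSpec`, `PlatformManifest`, and `wrap_with_guarantees` are hypothetical, and the only details taken from the interview are the four guarantees and the catalog integration.

```python
from dataclasses import dataclass, field

# Hypothetical illustration only; GitHub's real platform is not public in this form.
# The four guarantees are the ones named in the interview.
PLATFORM_GUARANTEES = ("monitoring", "logging", "alerting", "auditing")


@dataclass
class ServiceSpec:
    """The small amount of information an application developer declares."""
    name: str
    image: str
    owner_team: str


@dataclass
class PlatformManifest:
    """What the platform deploys: the developer's spec plus platform-owned parts."""
    spec: ServiceSpec
    features: dict = field(default_factory=dict)
    catalog_entry: dict = field(default_factory=dict)


def wrap_with_guarantees(spec: ServiceSpec) -> PlatformManifest:
    """Attach every standard guarantee and build the catalog entry for a service."""
    features = {name: True for name in PLATFORM_GUARANTEES}
    catalog_entry = {"service": spec.name, "owner": spec.owner_team}
    return PlatformManifest(spec=spec, features=features, catalog_entry=catalog_entry)


if __name__ == "__main__":
    manifest = wrap_with_guarantees(
        ServiceSpec(name="billing-api", image="registry.example/billing-api:1.0", owner_team="payments")
    )
    print(sorted(manifest.features))
    print(manifest.catalog_entry)
```

The design point is that the developer never lists the guarantees; the platform adds them uniformly, so the teams that own each concern can change it in one place.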
How many people built the platform?
The majority of the work was not developing the platform, but maintaining it. Development started with maybe 4-5 people, quickly grew to 12 people, and at this point there are 40 FTEs exclusively focussed on the IDP. Jason mentioned that this number would probably be trending towards 100 FTEs in any other organization, as “We understand these concepts better than anybody on the planet. In the end we invented 90% of all concepts we use today”.
What was the impact of rolling out the IDP?
The core change was the workflow itself. Development now really felt like using Heroku. Need to spin up a new environment with a new database to test a feature branch? That’s a simple command in Slack, and the IDP takes care of the rest in the background.
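As a rough illustration of that ChatOps flow, the sketch below parses a chat message into a structured environment request that a platform could then act on. The command syntax (`.env create <branch> --db <engine>`) and the names `EnvironmentRequest` and `parse_chat_command` are assumptions made up for this example; the interview does not describe GitHub's real commands.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class EnvironmentRequest:
    """A structured request the platform could act on (hypothetical shape)."""
    branch: str
    database: Optional[str] = None


def parse_chat_command(message: str) -> EnvironmentRequest:
    """Parse a made-up command such as '.env create my-branch --db postgres'."""
    tokens = shlex.split(message)
    if len(tokens) < 3 or tokens[:2] != [".env", "create"]:
        raise ValueError(f"unrecognized command: {message!r}")
    branch = tokens[2]
    database = None
    rest = tokens[3:]
    if rest:
        # The only option this sketch understands is '--db <engine>'.
        if len(rest) != 2 or rest[0] != "--db":
            raise ValueError(f"unrecognized options: {rest!r}")
        database = rest[1]
    return EnvironmentRequest(branch=branch, database=database)


if __name__ == "__main__":
    print(parse_chat_command(".env create feature-login --db postgres"))
```

Everything after parsing (provisioning the environment, creating the database, reporting back to the channel) is the part the IDP handles in the background and is deliberately left out here.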
The key change for the organization was that things were scalable now. No single person in the app team has to think about DDoS prevention; anyone who cares deeply about a subject can have it represented in a container manifest. Compliance people aren’t spread out throughout the organization, but instead focus on the settings in the IDP. As long as a concern is represented in the IDP, it’s there and you don’t need to worry. As Jason puts it:
“It’s really magical if you have it and I cannot understand how it’s possible to actually ship fast if you don’t have this.”
The impact was easy to measure. Teams were able to ship faster with a smaller headcount in Ops. They reduced the degrees of freedom every single application developer had, and standardization drove efficiency. Developers became entirely self-sufficient, and there is zero unnecessary communication between teams. Keep in mind that “a good setup is one where Dev and Ops don’t need to talk to each other at all”.
How teams dealt with the change
Every change is hard and people usually don’t like it. GitHub was at the point where people started thinking the company was getting too big and losing its grassroots character and soul.
App developers who wanted choice and freedom especially didn’t like feeling constrained. What they hadn’t realized was that in the “you build it, you run it” world, they are on call when things go wrong, which is a high price for “total freedom”. It means you are responsible for whatever happens (which was especially true for the monolithic situation). At some point they did understand that. No one likes to trade speed against safety.
From an Ops perspective there wasn’t any pushback. They were already so overwhelmed that they just wanted something to help them keep up. Or as Jason puts it:
“If you are ten feet under water you hope to get a snorkel that is 10 feet long.”
Lessons learned
First and foremost, he wouldn’t use ChatOps again. At Heroku they were completely obsessed with an amazing command line experience and he replicated that at GitHub. He also questions Kubernetes. It’s one of those things that is too bespoke and too generalized at the same time. But then again, it’s the market standard now.
Interestingly enough, he would build the entire platform with an eye on productizing it for external use from the get-go. That isn’t possible anymore; the platform is already much too specific to GitHub. In his opinion, tools that get generalized and productized too late don’t really work, for example Spinnaker but also Kubernetes.
When teams should build an Internal Developer Platform
Jason has a clear opinion here. “For most of your life you should use a combination of GitHub and Heroku and just don’t deal with this concern.” The last thing anyone should do is let a setup float free. Speed is what keeps you alive, and you should put all your focus on development, with ops and management laser focussed on whatever creates business value. Only if you really outgrow Heroku should you look around. You shouldn’t build an Internal Developer Platform yourself either. That is something that multi-trillion-dollar technology companies should do, and very few companies in the world actually qualify.
“If companies are able to dedicate less attention to internal ops their chance of survival increases.”
As a rule of thumb, you should start investing in this once your setup exceeds that “several monoliths, a few databases and one data-center” world, because afterwards nothing will be the same:
“Going from monolith to microservice, going multi cloud. These things are like getting your first kid, nothing will be the same. You have to change the way you work.”
At this point you have to evaluate how you can keep the exact same speed while keeping ops overhead to a minimum and the setup scalable and secure. You need to start thinking about how to serve developers like you serve customers. If you do this you have a much better chance of surviving. As Jason puts it:
“Internal Developer Platforms are a trend because people realize we need to serve internal developers as much as we serve external customers.”
What’s next for GitHub
GitHub was founded around adding collaboration to Git. When Jason asks for feedback, 90% of it is focussed on things that improve the Git experience, such as pull request enhancements and similar features. That will remain the core area of focus and they will make that better.
The second thing that people think about when they think of GitHub is GitHub Actions. They think of Actions as “CI” but that’s really wrong. Actions is supposed to be a general compute platform. It’s supposed to be an end-to-end workflow. If you think about release management, CI, etc. on top, and you blend this with security analytics, insights, and code manipulation below the surface, you probably have a good idea where GitHub will go.