
Why Zalando built their Internal Developer Platform

Developer Experience Roundtable #2

We believe that every developer needs three ways to view their work and their progress.

  • Their editor or IDE where they write code.
  • Version control where they collaborate and merge their work.
  • Their internal developer platform (IDP) where they deliver their work to users.

Many tools exist to help developers deliver their work to users, so why go to the lengths required to create your own, or use an IDP-specific platform such as Humanitec?

In the next series of posts from our dev roundtable interviews, we speak with tech leaders at major global companies who have created IDPs to discover what problems they were trying to solve, how they did it, and what advice they have for others following the same process.

In this article we summarize a conversation we had with Jan Löffler, who built the IDP at Zalando (a multi-billion € e-commerce platform in Europe), previously built the hosting platform at 1&1 internet, and is now doing the same at Plesk. This interview focuses mostly on Zalando.

The problem

Back in 2014/15, Zalando was growing rapidly, both as a business and in the number of developer teams. Whenever a new team formed or gained members, some onboarding processes were automated, but not all, and keeping up with the scale of the company was a challenge. Hiring more people for the operations team wasn’t a solution, but creating better automation could be. Newer hires also increasingly wanted to work in other languages (traditionally Zalando used Java and Python), which called for a more flexible tech stack. Finally, each team often chose its own solutions (sometimes completely custom) for building and testing code changes, which brings flexibility but is inefficient for cross-team collaboration, pricing, and audit trails.

The Approach(es)

Cultural change

One of the first steps Zalando needed to take was cultural, not technical. Up until this point, Zalando had largely been a follower of well-established practices and needed to become a leader instead. This meant changing hiring and onboarding processes and development workflows, and building a new company culture.

Technical solutions

With cultural changes in progress, Zalando started creating their own centralized cloud-based solution for testing and building changes. This began with a process of evaluation, assessment, and benchmarking, which frequently led to dead ends and rabbit holes. Docker had its first mature release during this process, which suddenly opened up a whole range of options the team had not yet considered.

Leverage and Spread Company Knowledge

Zalando has a lot of developers split into specialist teams for areas such as security, databases, CI/CD, and identity management. These teams, plus special interest “guilds” formed around topics that affected multiple teams, such as API guidelines, worked together to look at how other large companies handled their application architectures. After this research, they wrote up an analysis of what had and hadn’t worked for those companies, and over time they also open-sourced their experiences so others could learn from them.

Team buy-in

While every team knew that the existing system couldn’t stay, there was always going to be some resistance to change and a need to convince teams that the change was worth it. Jan and his team decided to find champions on each team and bring them into the platform team to help research and build it, which helped give the project credibility.

They also found that shifting people’s thinking to a more developer-centric approach helped gain buy-in. For example, Zalando traditionally thought of most of their infrastructure in terms of “servers”, which doesn’t really mean much to developers. Switching to thinking of their tech stack in terms of “applications” helped everyone from developers to managers understand what they were working on.

Challenges

Compliance

One of the bigger challenges Zalando faced during this process was their IPO, which brought stringent compliance requirements for tracking code ownership and changes from idea to deployment to scaling. In fast-moving companies using containers and continuous integration, this is challenging, and quite different from how many (internal or external) cloud providers generally treat container-based deployments.

Build for Resilience and Relevant Metrics

In Jan’s first week with Zalando, three data centers experienced downtime, costing the company a large sum in euros, far more than any significant amount of developer time would cost. When he later suggested following chaos engineering techniques to help mitigate similar issues, most teams panicked and generally felt they weren’t resilient enough for such testing.

Moving a team towards feeling confident that their applications are resilient takes time and a shift away from vulnerable monoliths to autoscaling microservices that you gradually introduce into production as you test them with a combination of unit and integration tests. Following this process helps you collect business-driven metrics sooner and understand the impact of changes. It also influences how you build an IDP, because you know what you want to measure.

Setting the Balance

Any large (tech-based) business needs to balance the happiness of various internal and external stakeholders and customers. Defining the values, autonomy, and level of trust you are willing to give to team members is an important step in setting this balance. For example, from a security perspective you would rather deploy live as little as possible to reduce risk. Developers would like to deploy as quickly as possible, but this can lead to downtime or errors, and to the unhappy customers that business-minded teams are trying to avoid.

Define Safety Nets

With their balance defined, Zalando built a safety net into their IDP that enabled developers to move fast and autonomously on their application areas without affecting the work of others. Modelled on Google’s “no trusted network” approach, this safety net extends to communication and access between applications. No service can communicate with another without authorization, encryption, and proper consideration for secrets management.

Zalando built an identity management (IM) service that worked across data centers and vendors to meet these requirements and handle up to 200,000 requests per second. It was one of the more challenging parts of their plan, but essential for creating this safety net.
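To make the “no trusted network” idea above more concrete, here is a minimal sketch of a deny-by-default check between applications. It is not Zalando’s implementation: the `ServiceRegistry` class, the application names, and the shared secrets are all hypothetical, and the sketch covers authorization only. Encryption in transit (for example TLS) and real secrets management are assumed to be handled elsewhere.

```python
import hashlib
import hmac
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """Hypothetical in-memory stand-in for an identity management service."""

    secrets: dict = field(default_factory=dict)  # application name -> shared secret (bytes)
    allowed: set = field(default_factory=set)    # (caller, target) pairs explicitly permitted

    def register(self, app, secret):
        """Record an application and the secret it uses to prove its identity."""
        self.secrets[app] = secret

    def allow(self, caller, target):
        """Explicitly permit `caller` to talk to `target`."""
        self.allowed.add((caller, target))

    def sign(self, caller, target):
        """Create a token for a call from `caller` to `target` using the caller's secret."""
        message = f"{caller}->{target}".encode()
        return hmac.new(self.secrets[caller], message, hashlib.sha256).hexdigest()

    def authorize(self, caller, target, token):
        """Deny by default: unknown callers, unlisted pairs, and bad tokens all fail."""
        if caller not in self.secrets or (caller, target) not in self.allowed:
            return False
        expected = self.sign(caller, target)
        # Constant-time comparison avoids leaking information through timing.
        return hmac.compare_digest(expected, token)


# Illustrative usage with made-up application names.
registry = ServiceRegistry()
registry.register("checkout", b"checkout-secret")
registry.register("catalog", b"catalog-secret")
registry.allow("checkout", "payments")

token = registry.sign("checkout", "payments")
print(registry.authorize("checkout", "payments", token))  # permitted pair, valid token
print(registry.authorize("catalog", "payments", token))   # pair is not on the allow list
```

The design choice worth noting is that nothing is reachable unless someone has explicitly allowed it, which is what lets teams move independently without being able to affect applications they have no business calling.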

Lessons Learned

Never Underestimate Cultural Changes

Jan referenced the famous Henry Ford quote:

“If I had asked people what they wanted, they would have said faster horses.”

When he first asked Zalando developers what they needed to improve their developer experience, they said “faster and more stable hardware.” Nobody asked for a different paradigm, as many didn’t realize it was possible. He received similar answers when asking about prioritization of tasks and requirements, with most requesting that their own problem be solved first.

He had to dig further to find out what developers’ underlying problems (often more than one), preferred solutions, and ideal long-term vision were. The answers to these questions vary depending on what stage your company is in, from initial growth to a large public company. But without asking these questions in the first place, you are unlikely to build a solution that people will use or find useful.

