GitOps and developer self-service are key trends in the DevOps industry. But how can companies operating in highly regulated industries benefit from these new approaches? We sat down to discuss the topic with Steve Wade, a leading expert in implementing self-service and GitOps setups for top-performing fintech companies. While Steve spoke mainly from a fintech perspective, his insight applies to teams in every industry.
In this article, we will sum up the key takeaways from our discussion. In case you missed it, you can also find the full recording here.
Five common problems with internal platforms
Platform teams need to manage two types of requirements: their own requirements and the list of developer requests. Typically, developer requests end up at the bottom of the list because platform engineers believe their own needs are more important. So the list of developer requests keeps growing, and it’s almost impossible to consolidate them. Since they never get implemented, overall frustration grows, and nobody wins.
ClickOps is the reverse of GitOps. Git should drive everything, but with ClickOps, people drive development with button clicks. They spin up instances of various systems in the development environment, and it ends up looking nothing like production. This leads to problems with each deployment to the production environment.
Slow release cycles
Releases are far apart, and the gap between deployments can be about four to five weeks. In many companies, this is understood and accepted, but fast iterations are how businesses remain ahead of their competitors in today’s digital world. Organizations need to let developers iterate quickly and deliver new features and functionality to their customers as fast as possible.
Changes are risky
The ClickOps mentality and the slow release cycle lead the company to perceive change as too risky. Every change becomes risky, whether it’s a new feature or a change to the underlying platform. Developers should not spin up infrastructure haphazardly because it can impact many of their services or features. Applications don’t exist in a silo anymore.
Production changes are risky
Traditionally, organizations deployed large monolithic applications. They didn’t need to worry about integration testing, as the entire application would be deployed in one go. But with microservices, this is no longer true. One microservice version may introduce an incompatibility with a previous version that is still used by other applications. Now, every time a release goes to production, one microservice change can bring down the entire platform.
These five issues can be extremely detrimental to team culture. According to Steve, when problems start popping up, they foster a “blame culture.”
“We just blame the next person or the other product team, or the lowest common denominator is always just ‘blame the platform.’ Every good engineer or developer just blames the platform.”
He then points out that without confidence in a platform, it’s hard to build confidence between the people maintaining and using the platform. If these people can’t find harmony on the platform, the different teams continue to butt heads, which leads to a massive tug-of-war of blame.
Steve presents the solution as building the perfect platform for each unique team in six steps.
Building the perfect platform in 6 steps
Step 1: A strong mission statement
A strong mission statement provides a North Star for the direction the platform and the teams should take. The mission statement should focus on the challenges that affect the platform and the teams. Steve provided this mission statement as an excellent example:
“To provide an easily extensible, config-driven, ephemeral platform with clear ownership and most importantly, one that is reliable and consistent.”
In one sentence, we know that this platform:
- Is easily extensible: There will always be multiple iterations, and the platform teams will be adding different components over time. It should be easy to introduce new iterations and components.
- Is config-driven: There are no ad hoc deployments, and every deployment should come from a known configuration. This is the reverse of ClickOps.
- Is ephemeral: If nodes run for a long time, developers and engineers are scared of removing them because they’re unsure what files and keys reside there. So they leave the nodes in place, which is the equivalent of putting a server under someone’s desk. If the node stops, nobody understands the impact. Platform teams should not expect nodes to run indefinitely.
- Has clear ownership: When ownership and boundaries are set correctly, one team can work on the platform without worrying about clashing with another group. Each team can iterate consistently and at its own speed.
- Is reliable and consistent: If the platform is not consistent and reliable, it is difficult to evaluate how changes will impact the overall organization. Consistency makes the entire process easier and safer for everyone. For example, adding infrastructure components to the dev environment should be identical to doing it in the production environment.
Step 2: Build the base platform
Internal platforms should use the same components in development as in production. They provide a common playground where everyone can deploy their workloads and their applications.
All the configuration modules (typically Terraform) should be stored in a dedicated repository. These configurations provide a consistent way for any engineer to spin up infrastructure. For example, engineers can deploy secure S3 buckets even if they aren’t Amazon-certified. They can create, configure, and use them without needing to know what happens under the hood. If different teams require different configurations, each team can put the configuration they need in a dedicated repository. Each repository can have different owners.
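As a rough illustration of what such a shared module can hide from its users, here is a minimal Python sketch that renders a Terraform JSON-syntax file (a .tf.json file) for a private, encrypted S3 bucket. The function name secure_bucket_config, the label and bucket names, and the choice of KMS encryption are assumptions made for this example, not something Steve described. The resource types assume version 4 or later of the Terraform AWS provider.

```python
import json


def secure_bucket_config(label: str, bucket_name: str) -> dict:
    """Build a Terraform JSON-syntax document for a locked-down S3 bucket.

    `label` is the Terraform resource name (hypothetical), `bucket_name`
    is the real bucket name. Callers never touch the security settings.
    """
    # Terraform interpolation that points at the bucket declared below.
    bucket_ref = f"${{aws_s3_bucket.{label}.id}}"
    return {
        "resource": {
            "aws_s3_bucket": {label: {"bucket": bucket_name}},
            # Block every form of public access by default.
            "aws_s3_bucket_public_access_block": {
                label: {
                    "bucket": bucket_ref,
                    "block_public_acls": True,
                    "block_public_policy": True,
                    "ignore_public_acls": True,
                    "restrict_public_buckets": True,
                }
            },
            # Encrypt objects at rest (KMS is an assumption for this sketch).
            "aws_s3_bucket_server_side_encryption_configuration": {
                label: {
                    "bucket": bucket_ref,
                    "rule": [
                        {
                            "apply_server_side_encryption_by_default": [
                                {"sse_algorithm": "aws:kms"}
                            ]
                        }
                    ],
                }
            },
        }
    }


if __name__ == "__main__":
    # Write the output to a file such as bucket.tf.json in the module repo.
    print(json.dumps(secure_bucket_config("payments", "dev-payments-data"), indent=2))
```

An engineer only supplies a label and a bucket name. The secure defaults live in one reviewed place, which is the point of keeping modules in a dedicated repository.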
Step 3: Make it developer-friendly
A top-tier platform leverages the way developers work. Typically that means using pull requests from their IDE. So, developers should use pull requests for any type of change. This allows them to drive everything with GitHub instead of creating infrastructure with a click, making the platform consistent across environments.
Step 4: Keep the compliance posture high
In fintech, audit and compliance are essential. During an audit, organizations must show how they make changes and ensure that the changes meet compliance regulations (HIPAA, GDPR, ISO, etc.). Teams can generate a Terraform plan, save it as JSON, and evaluate it with a compliance tool (like Regula or Fugue). Once the setup is correct, the policy checks determine whether the changes pass or fail compliance requirements.
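To make the idea of a pass/fail policy check concrete, here is a small hand-rolled Python sketch. It is not how Regula or Fugue work internally and does not use their APIs. It only assumes a plan exported with `terraform show -json`, whose output contains a `resource_changes` list. The policy itself (every S3 public access block must enable all four flags) and the sample data are assumptions chosen for illustration.

```python
REQUIRED_FLAGS = (
    "block_public_acls",
    "block_public_policy",
    "ignore_public_acls",
    "restrict_public_buckets",
)


def failing_resources(plan: dict) -> list[str]:
    """Return addresses of public access blocks that leave a flag disabled."""
    failures = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_s3_bucket_public_access_block":
            continue
        # "after" is null when the resource is being deleted: nothing to check.
        after = (rc.get("change") or {}).get("after")
        if not after:
            continue
        if not all(after.get(flag) is True for flag in REQUIRED_FLAGS):
            failures.append(rc["address"])
    return failures


# Hypothetical, trimmed-down plan. A real one would come from
# json.load() on the output of `terraform show -json <planfile>`.
sample_plan = {
    "resource_changes": [
        {
            "address": "aws_s3_bucket_public_access_block.payments",
            "type": "aws_s3_bucket_public_access_block",
            "change": {"actions": ["create"],
                       "after": {flag: True for flag in REQUIRED_FLAGS}},
        },
        {
            "address": "aws_s3_bucket_public_access_block.logs",
            "type": "aws_s3_bucket_public_access_block",
            "change": {"actions": ["create"],
                       "after": {"block_public_acls": True,
                                 "block_public_policy": False}},
        },
    ]
}

if __name__ == "__main__":
    failures = failing_resources(sample_plan)
    print("FAIL" if failures else "PASS", failures)
```

Running a check like this on every pull request produces a record of which changes were evaluated and why they passed or failed, which is the kind of evidence an audit asks for.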
Step 5: Configure continuous integration
The goal of the platform is not to mandate how to execute the work. Teams can create workflows using the tool of their choosing. All the code and configurations are stored in GitHub, and each application image gets tagged with the appropriate prefix (e.g., dev-, int-, or prd-). This makes the whole process consistent and allows teams to create cookie-cutter pipelines unique to their needs. Teams can build and reuse the same channel for deployment in any given environment.
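The tagging convention can be enforced with very little code. The Python sketch below assumes a tag format of environment prefix plus short commit SHA. Only the prefixes (dev-, int-, prd-) come from the discussion; the SHA suffix and the helper names image_tag and environment_of are hypothetical.

```python
import re

ENVIRONMENTS = ("dev", "int", "prd")
# Prefix, a dash, then 7 to 40 lowercase hex characters (assumed suffix format).
TAG_PATTERN = re.compile(r"^(dev|int|prd)-[0-9a-f]{7,40}$")


def image_tag(env: str, commit_sha: str) -> str:
    """Build an image tag such as 'dev-3f2a9c1' from an environment and a SHA."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    tag = f"{env}-{commit_sha[:7].lower()}"
    if not TAG_PATTERN.match(tag):
        raise ValueError(f"not a usable commit SHA: {commit_sha!r}")
    return tag


def environment_of(tag: str) -> str:
    """Recover the target environment from a tag, rejecting malformed tags."""
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"tag does not follow the convention: {tag!r}")
    return match.group(1)
```

A pipeline in any CI tool can call the first helper when it builds an image and the second when it deploys one, so the same logic serves every environment.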
Step 6: Put it all together
By bringing everything together in a single successful command, the platform becomes reliable. The pull requests deploy all the required components under a single directory. Using the same environment prefix throughout all the repositories gives teams the ability to stand up a cluster quickly and easily. When a team wants to create a new environment, they agree on the character prefix for that new environment and run the same command to stand up a new instance.
To truly enable self-service, platform engineers need to get out of the way of developers and let developers focus on the customer end product. Putting self-service in place requires a lot of time, so platform teams need to keep people informed of the progress. Whenever they meet with upper management, team leaders can demonstrate how their work makes things easier for developers, product teams, and the organization as a whole.
To improve platforms, product teams can provide guidance by answering the following questions:
- What’s the best thing about the platform?
- What’s the worst thing about the platform?
- If there’s one thing you could change about the platform, what would it be?
The answers can become the top-level epics for the following month. This ensures that platform engineers are working on stories that keep developers happy.
Platform teams should always make sure to have a break-glass scenario if, for some reason, the cloud provider that holds Git configurations has a problem that prevents deployments within a reasonable time.
Using this self-service approach, Steve reports that some of his clients experienced a 50% increase in production deployments, approximately twenty minutes to recover from a complete cluster outage, and developers who are 75% less focused on operations. Instead, they’re focused on delivering value to the customers, bringing in more customers, and keeping them happy.
Steve paraphrases a quote from Kelsey Hightower to explain the work he does.
“Once you find a pattern that works, your next goal should be helping others to do the same.”
Because if internal platform teams can make the developer experience better, everybody wins.