Forto is a digital freight company that aims to make shipping products as easy as sending emails. They provide a number of services to simplify supply chain shipping and recently successfully shipped 20 million masks from China to Germany for the German government.
When Forto began, they started with Heroku and later switched to Kubernetes as their team and needs changed. Switching from one provider or architecture to another is a path many teams find themselves following. This post discusses when Forto realized they needed to switch, how they planned the switch, and their recommendations for others doing the same. It is based on a webinar with Masashi Beheim, Chief of Staff to the SVP Engineering at Forto, who previously worked for Simon Pierro, the iPad magician you may have seen on YouTube or TV.
Forto currently has an event-driven, microservices-based architecture using MongoDB. They use Node with TypeScript, reactive TypeScript, and Kubernetes on GCP. They also use AWS SNS and SQS, so the stack is already split across two cloud providers.
Initially, they had Heroku running on AWS in a multi-cloud setup, which was unusual but worked.
Why use Heroku?
Masashi explains the team's initial motivation to move to Heroku as part of a wider migration:
“The move to Heroku was initially part of a bigger transition from a monolith to an event-driven microservices-based architecture. At the time Forto was just three engineers who were working on an application monolith. Logistics was a new field to everyone involved, and the team lacked the domain knowledge to properly design the different components, even within a monolith.”
This meant the monolith was tightly coupled and deployment times were long. The team was inexperienced with Docker and relied on incredibly complicated, error-prone scripts, with no ability to roll back and no certificate management. Most problematic was that deployment was in the hands of only one or two engineers.
But the application worked well enough to fulfill initial user and market research to prove the business model worked and gave Forto the opportunity to invest in improving their tech stack.
In 2017 Forto grew from 3 to 12 engineers split into teams, and the question was, how can they ensure that engineers can effectively and efficiently release new features? Autonomy and ownership were not really clear with the monolith and they wanted to make sure that the company organization followed the application architecture. Teams wanted to prevent stepping on each other's toes when changing application components or features.
And that's how Forto ended up with Heroku. The priority was removing the cognitive load from development teams, not picking the latest and greatest custom setup that relied on specific knowledge to maintain. In addition to Heroku, Forto also introduced CircleCI for deployments for easier rollbacks in the case of problem builds.
When and why did Forto switch from Heroku?
In the beginning, teams were relieved. They were more autonomous and could ship code independently of each other. But after about two years, ironically, that autonomy became the biggest issue. Teams talked to each other less when they created new services, and created so many disconnected services that the application as a whole became more of a “distributed monolith”.
The decision to use Heroku was driven by engineering productivity rather than infrastructure cost, because engineers cost more than infrastructure. The engineers' time is the most valuable resource, and Forto optimized for that. However, as the company doubled its engineering team, the cost of people grew even faster than the cost of infrastructure.
The challenges of increasing application complexity with Heroku
Because of the distributed monolith, Forto built a command-line tool that made it quicker to bring up multiple applications for testing, jokingly called the “distributed monolith creator”. With it, you could define, for example, three services plus the two endpoints and queues they used to communicate with each other. The tool helped test the tightly coupled components but was a compromise: it reinforced the anti-pattern and made it hard to turn back from the path the team was going down.
Configuration was another pain point, as there wasn’t a proper distinction between configuration and secrets. It was an error-prone process full of copying and pasting environment variables around.
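To make the distinction concrete, here is a minimal TypeScript sketch of keeping configuration and secrets in separate types with separate loaders. It is an illustration only, not Forto's code: the variable names (SERVICE_NAME, PORT, MONGO_URI) and all function names are assumptions.

```typescript
// Hypothetical names throughout; not Forto's actual configuration code.

// Non-sensitive settings: safe to commit and to print in logs.
interface AppConfig {
  serviceName: string;
  port: number;
}

// Sensitive values: loaded separately and never logged.
interface AppSecrets {
  mongoUri: string;
}

type Env = Record<string, string | undefined>;

// Read a required variable, failing fast with a clear message.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function loadConfig(env: Env): AppConfig {
  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port <= 0) {
    throw new Error(`PORT must be a positive integer, got: ${env["PORT"]}`);
  }
  return { serviceName: requireVar(env, "SERVICE_NAME"), port };
}

function loadSecrets(env: Env): AppSecrets {
  return { mongoUri: requireVar(env, "MONGO_URI") };
}

// Only the config object is ever serialized for logging.
function describeForLogs(config: AppConfig): string {
  return JSON.stringify(config);
}
```

Because the two loaders return different types, a secret cannot be passed to describeForLogs by accident, and a missing value fails at startup instead of surfacing later as a copy-and-paste mistake.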
When did Forto decide on migrating and decide where to migrate to?
From an architectural point of view, Forto knew they wanted to stay with event-driven microservices. They wanted to head towards serverless architectures and, because it was fairly well established, chose to switch to Kubernetes. Ideally, this new infrastructure would help Forto continue to grow, ship lots of code quickly, and stay flexible.
Planning the migration to Kubernetes
Masashi Beheim and Forto believed there were 3 “chapters” to the migration. Looking at it this way reflects the reality of many migration processes, where what happened before influences the next steps, and really you can’t plan the entire process precisely upfront.
Nobody really had lots of Kubernetes experience, but one of the team leads took this topic on and drove it forward. They already had some prior interest and were managing most of the infrastructure before.
The first step was to Dockerize everything and introduce containers, because until then they had been pushing from Git directly to Heroku. Fortunately, Forto had a homogeneous stack consisting of MongoDB, Node, etc., which made it easier to migrate services and write Dockerfiles.
Teams migrated services incrementally, from those with the least business risk and low load onwards. Forto learned that using more standard tools and patterns reduced the need for documentation to explain processes to new hires, and made resolving problems easier.
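As a rough illustration of ordering a migration this way, the TypeScript sketch below sorts services by business risk and then by load. The 1 to 5 risk scale, the service names, and the request rates are all invented for the example and do not come from Forto.

```typescript
// Hypothetical model: the risk scale and request rates are invented for illustration.
interface ServiceProfile {
  name: string;
  businessRisk: number;      // 1 = low impact if it breaks, 5 = critical
  requestsPerMinute: number; // rough load indicator
}

// Order services so the least risky, least loaded ones migrate first.
// Ties fall through to the next criterion; the name keeps the order deterministic.
function planMigrationOrder(services: readonly ServiceProfile[]): ServiceProfile[] {
  return [...services].sort(
    (a, b) =>
      a.businessRisk - b.businessRisk ||
      a.requestsPerMinute - b.requestsPerMinute ||
      a.name.localeCompare(b.name)
  );
}

const exampleServices: ServiceProfile[] = [
  { name: "booking", businessRisk: 5, requestsPerMinute: 900 },
  { name: "notifications", businessRisk: 2, requestsPerMinute: 300 },
  { name: "internal-reports", businessRisk: 1, requestsPerMinute: 20 },
  { name: "tracking", businessRisk: 2, requestsPerMinute: 50 },
];

const migrationOrder = planMigrationOrder(exampleServices).map((s) => s.name);
console.log(migrationOrder.join(" -> "));
// prints "internal-reports -> tracking -> notifications -> booking"
```

Starting with services like the hypothetical internal-reports lets a team exercise the new pipeline where a mistake is cheap before touching anything critical.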
The importance of an SRE team
During migration, Forto gave a lot of thought to configuration management and secrets. Initially, they put everything into one repository, but that turned out to be a single point of failure and a bottleneck. The deployment pipeline was constantly calling the repository and suddenly broke. There were regular merge conflicts in the repository from automated calls to it.
These problems led Forto to hire their first Site Reliability Engineer (SRE), who spent 50% of their time trying to fix the pipelines. Eventually, Forto was able to retire their older custom tooling, and the migrated platform stabilized to the point that, even during a period with no one directly responsible for infrastructure, there were no major issues.
Realizing how valuable SREs were to keeping operations running, Forto started to build an SRE team last year. Finding and hiring experienced SREs is a hard and expensive process, but it is easier if you use industry-standard practices and tools that most candidates are likely to know and have used already. Now with 5 staff dedicated to the role, the team has taken things to the next level. They standardized the deployment pipeline with Helm and added external DNS, Mozilla SOPS for secrets management, and other tools. The team dogfoods everything before using it in production, resulting in a setup everyone is satisfied with overall.
Key learnings from the migration
“Our key learning was that before any migration you need to understand the business domain as much as possible.”
It’s important to learn how to relate infrastructure and architecture choices to company goals. Hiring an architect is crucial here, as they can help you understand how to define the ideal boundaries of services.
Most migration processes take longer than expected and existing tools can help you reduce the time you need to invest in retooling and relearning. These could be open source tools, where developers need to understand the basics, as they're writing the manifests and the repositories, but that's typically enough, and there are many sources of knowledge available. In the background, everything is managed by the SRE team, who handle deployments, keeping pods running, etc.
Another option is a commercial tool such as Humanitec, which provides an Internal Developer Platform and is ideal for companies without an SRE team.
Looking ahead, Forto first wants to run multiple clusters across multiple regions, as they conduct a lot of business in China in addition to Europe. They also want to be able to deploy all the services behind Forto in one single shot when adding a new region or recovering from downtime.
Forto wants to have audited and automated escalations for engineers to deal with incidents. This means that if an incident happens, an engineer can get elevated access to production for a period of time and this is audited so there's a record for the future.
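The idea of time-boxed, audited access can be sketched in TypeScript as follows. This is a simplified in-memory model under stated assumptions, not a description of Forto's tooling: the class name, the maximum duration, and the audit log shape are all hypothetical, and a real system would persist the log and integrate with the cluster's access control.

```typescript
// Hypothetical sketch; names and the in-memory audit log are assumptions.
interface AccessGrant {
  engineer: string;
  incidentId: string;
  grantedAt: number; // epoch milliseconds
  expiresAt: number; // epoch milliseconds
}

interface AuditEntry {
  at: number;
  event: "granted" | "denied";
  engineer: string;
  incidentId: string;
}

class EscalationService {
  private readonly grants: AccessGrant[] = [];
  private readonly maxDurationMs: number;
  readonly auditLog: AuditEntry[] = [];

  constructor(maxDurationMs: number) {
    this.maxDurationMs = maxDurationMs;
  }

  // Grant time-boxed production access tied to an incident.
  // Every request is recorded, whether it is granted or denied.
  requestAccess(engineer: string, incidentId: string, durationMs: number, now: number): AccessGrant | null {
    if (durationMs <= 0 || durationMs > this.maxDurationMs) {
      this.auditLog.push({ at: now, event: "denied", engineer, incidentId });
      return null;
    }
    const grant: AccessGrant = { engineer, incidentId, grantedAt: now, expiresAt: now + durationMs };
    this.grants.push(grant);
    this.auditLog.push({ at: now, event: "granted", engineer, incidentId });
    return grant;
  }

  // Access is valid only while an unexpired grant exists for the engineer.
  hasAccess(engineer: string, now: number): boolean {
    return this.grants.some((g) => g.engineer === engineer && now >= g.grantedAt && now < g.expiresAt);
  }
}
```

Passing the current time in as a parameter, rather than reading the clock inside the class, keeps the expiry logic easy to test.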
Finally, Forto aims to become more passwordless by removing passwords for SQS, SNS, S3, and Google Storage in favor of pod identity and workload identity.