We're in a time where our meetups, conferences, and hackathons have moved virtual, and while we may not get the same experience, we still have the opportunity for learning and networking. I recently attended the DevOps Enterprise Summit, a mighty three-day behemoth of presentations, interviews, and discussion. Presenters shared real-world problems and how DevOps provides a framework to solve problems, change company culture, and ultimately drive customer satisfaction and financial benefits for all concerned.
Sure, there were plenty of deep dives into the theoretical foundations of DevOps, and the linguistics and semantics of DevOps were heavily discussed as speakers spent time defining theories and frameworks. But if, like me, you're the kind of person who likes to focus on meaningful outcomes and actually see how things work (or don't work), an enterprise conference is an excellent opportunity to hear people talk about their own experiences in their own companies, sharing case studies and more.
Transform a high-context environment into a low-context environment.
Thomas Limoncelli is SRE Manager at Stack Overflow. He gave an interesting talk about the differences between high-context and low-context cultures and the value of documentation. He shared a story from his first week at Stack Overflow:
"I still remember my training as I asked how to create a virtual machine. We use a product called VMware, and my mentor walked me through the process. It involved five very complicated steps; it wasn't written down, just verbally passed on from one system into the other. I asked how anyone could memorize this. The response was 'Well, we just kind of expected that anyone who would get through our interview process would just know how to do this kind of stuff.' I remember thinking, how could I be expected to know all of that?"
The day involved a call from the boss. Thomas and his mentor had made a mistake in their work, demonstrating that even an experienced staff member was unable to memorize everything.
What is a high context culture?
A high context culture is one where:
- Communication is informal, less documented, and involves collective history.
- People have to read between the lines to understand what's going on.
- More assumed knowledge.
- It relies on long term traditions and practices such as family gatherings where people know how to do things and what to expect.
What is a low context culture?
In a low context culture:
- Communication is explicit; you are told the rules, knowledge tends to be codified, public, external, and accessible.
- There are more interpersonal connections of shorter duration.
- Knowledge is often more transferrable.
- Examples of a low context culture include airports and sports with established rules.
The need for low context DevOps
According to Thomas, the DevOps environment should strive to be low context: "you should spend more time working and less time frustrated with roadblocks and information gaps."
Three ways to reduce the required context of your DevOps environment:
Carefully constructed defaults - The defaults should be the way you expect most people to work. Typically, new employees don't have the software, access, and permissions they need to do their job, and they can't fix that themselves. If you change projects all the time, you might face this problem regularly, and in a dynamic company, moving between projects is to be expected. Employee-friendly defaults therefore keep employees happy and workplaces functional.
Make right easy - Thomas notes that most websites run on OpenSSL. "But settings become stale, and it effectively requires a Ph.D. to use." Comparatively, LibreSSL makes the default 'timelessly correct.'
Stack Overflow embodies this sentiment through tools and infrastructure that help provide a low context environment, for example:
- Ticket system
- Bug tracking system
- Monitoring/observability
- CI/CD pipeline system
- Container/artifact repository
- Documentation repository
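As an illustration of the "make right easy" idea, here is a minimal Python sketch (not from the talk) that leans on a library default instead of hand-tuned TLS settings. It uses the standard library's `ssl.create_default_context()`; the helper name `make_client_context` is a hypothetical wrapper introduced only for this example.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Return a TLS client context built from the library's own defaults.

    The caller does not pick protocol versions or verification flags;
    the default context already verifies certificates and hostnames.
    """
    return ssl.create_default_context()


context = make_client_context()

# Both checks are on without any extra configuration from the caller.
print(context.verify_mode == ssl.CERT_REQUIRED)
print(context.check_hostname)
```

The point of the sketch is that the easiest call is also the safe one, so a newcomer does not need deep TLS knowledge to get a sensible configuration.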
Ubiquitous documentation - Documenting as you work means you'll have documentation when you need it, like when you're fixing an error during 3 am pager duty. Documentation is easy to reach when a deep link/URL is included:
- In error messages
- In CI/CD control panel restrictions
- In alert messages
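One way to put deep links into error and alert messages might look like the Python sketch below. It is an illustration under stated assumptions, not Stack Overflow's implementation: the base URL `https://docs.example.com/runbooks`, the `DocumentedError` class, and the `format_alert` helper are all hypothetical names.

```python
# Hypothetical base URL of an internal documentation site.
DOCS_BASE_URL = "https://docs.example.com/runbooks"


class DocumentedError(Exception):
    """An error whose message always carries a deep link to its runbook."""

    def __init__(self, code: str, summary: str) -> None:
        self.code = code
        self.summary = summary
        # The deep link is derived from the error code, so every error
        # of this kind points at the same documentation page.
        self.doc_url = f"{DOCS_BASE_URL}/{code}"
        super().__init__(f"{summary} (runbook: {self.doc_url})")


def format_alert(service: str, error: DocumentedError) -> str:
    """Build an alert line that includes the error's documentation link."""
    return f"[ALERT] {service}: {error}"


err = DocumentedError("disk-full", "Disk usage above 95%")
print(format_alert("db-01", err))
```

Because the link is built into the exception itself, whoever is on call sees where the documentation lives without having to search for it.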
Thomas advises against the mindset of "my code is the documentation" and suggests instead that people need to "record tech debt or it won't be fixed."
For those that historically hate documentation, he suggests the incentive, "the better you document, the more relaxed you can be later. It also means someone else can do my work, and I can go onto more interesting projects."
He also suggests templates that do much of the work, and recommends that teams include documentation updates in work estimates: "don't think of documentation as something extra but part of the project itself." He notes that there's no need to reinvent the wheel: find where engineers already write, such as email, chatrooms, IM, and Stack Overflow, and repurpose that material into your own documentation.
Resources:
- The Unicorn Project by Gene Kim.
- Stack Overflow for Teams - gives new employees the power to fix things and build a good reputation. It works across team silos.
Demystifying DevOps and SRE
Daniel Maher works in Developer Relations at Datadog. He took a look at SRE including common terminology, practical examples and the relationship between site reliability engineers (SRE) and DevOps.
He describes DevOps as "a professional and cultural movement that focuses on openness, sharing, and mutual respect. It seeks to improve the quality of life for its adherents and practitioners, for their company, customers, and those participating." However, improving the quality of life involves availability and reliability, which is where SREs come in. "How can we ensure that the systems that we have in place will be there when people need them?" In other words, "DevOps is an idea, SRE is a practice."
Daniel describes the book Site Reliability Engineering: How Google Runs Production Systems as "The big lizard in the room. It's just one interpretation - albeit hugely influential - and how Google did something in 2016 is not necessarily how you should do something today." Instead, Daniel spoke about the importance of finding out what is needed and works best for your own organization.
Teams and organizational structure
Pertinent to SRE and DevOps, Daniel suggests we can organize people in product teams, squads, and guilds:
- Product teams are one of the ways that DevOps scales to large enterprise companies. But they are only one part of an organization.
- Assembling a squad to focus on a particular product or problem is a great way to use a product management structure across teams. Examples of squad work at Datadog include recruiting, building coding tests, and hackathons. Squads are typically short term and clearly defined, with a beginning and an end.
- A guild owns and shepherds an important part of an organization such as organizational culture, standards around automation, and traditionally involves lots of different stakeholders.
Teams of SREs can undertake a range of tasks, including reviewing code, writing incident reports, and facilitating post-mortems. They may focus on a dedicated portfolio or product team, and individual SREs may or may not rotate in and out of projects/sprints.
Tips on finding and growing SRE talent
According to Daniel, SREs have strong personalities with specific attributes, which might include: a wide range of technical interests, patience for staring at code, an enjoyment of problem-solving, and interest in mentoring/teaching. There is no formal SRE qualification, so the desire for self-learning is super important.
He stresses that "great SRE talent can come from anywhere", especially as it is not limited to particular qualifications.
Practical suggestions and pitfalls of SRE
The configuration of an SRE depends on the specifics of your organization. It may require some testing and tinkering. "The number one thing to avoid is dogma - don't look at how another org has implemented things as the only way to do it." Instead, attend conferences and meetups, read blog posts, and talk to others to see how it could work.
"No one can sell you DevOps, it's a journey and a process with no end, and you should embrace that."
Team Topologies in Action
Since the book Team Topologies was published in 2019, organizations around the world have started to adopt Team Topologies principles and practices like Stream-aligned teams, modern platforms, well-defined team interactions, and team cognitive load as a key driver for fast software delivery and operations. Authors Manuel Pais and Matthew Skelton took us through some recent case examples.
4 team types
- Stream-aligned team - sometimes called a product team, although the stream of work it is aligned to is not necessarily a product. These are among the core team types that deliver value to customers or users.
- Enabling team - Typically, teams of experts in a specific area will collaborate with stream-aligned teams to help them gain the capabilities they are missing.
- Complicated sub-system team - responsible for sub-systems that need deep skills, high-level expertise, and an understanding of niche technology.
- Platform team - platform teams provide services that make the lives of stream-aligned teams easier. They might provide infrastructure on top of which the stream-aligned teams run their products, or build a self-service system that the stream-aligned teams use to access build and test environments on demand.
Team interaction models
According to Team Topologies, there are only three ways in which a team should interact:
- Collaboration: working together for a defined period to discover new things (APIs, practices, technologies, etc.)
- X-as-a-Service: one team provides and one team consumes something "as a Service"
- Facilitation: one team helps and mentors another team
Case study: Gjensidige
Gjensidige Insurance is a leading Nordic insurance company with 4000 employees and businesses in the Nordic and Baltic countries. It uses the four fundamental team types to clarify team responsibilities and interactions and is moving towards several "thinnest viable platforms" with Stream-aligned teams as internal customers.
Positive outcomes:
- 40% annual growth in digital sales over the last 5 years
- More than 100% growth in digital customer service, shifting transactions from call centers to online
- Claims handling is heavily digitized – more than 80% of claims are now filed online, of which up to 40% are handled automatically.
Case study: PureGym
PureGym is Britain's largest gym chain - the first to gain over 1 million members. As PureGym expanded, so did the need for software to enable their members to book and manage gym sessions. Since 2019, PureGym has re-aligned its teams and team interactions based on Team Topologies patterns, helping to scale the engineering teams and improve flow.
Matthew explains that as the team rapidly expanded, the company experienced a range of problems that led to the realignment, such as pain points in inter-team communication, monolithic software, and a single code repository. They worked to reshape teams (over several different configurations), breaking up tasks and responsibilities.
Their success was made possible through continuous collaboration and facilitation across teams and tasks. The results included:
- More business responsive
- Balanced ownership of services
- Improved team morale
- Better long term architecture
Resources
The authors of Team Topologies are currently working on a free workbook for remote teams, and also have plenty of resources available.
DevOps Enterprise Summit was a great opportunity to explore how businesses across multiple industries are actively using DevOps to facilitate internal transformation and improve customer service and other business outcomes. Hearing speakers not simply spruiking a product but also talking openly about the challenges and failures associated with business transformation, and how they were able to overcome them, is a great way to learn lessons and strategies that help you apply DevOps in your own workplace.