From Glitches to Grins: The Support Superhero Squad’s Epic Journey to SLO Success!

Overcoming Hurdles: The Support Superhero Squad’s Quest for SLO Triumph

By Tulio Cruz and Berta Amat, Adevinta

This article delves into the critical significance of Service Level Objectives (SLOs) within Adevinta’s operational framework. As organisations increasingly rely on seamless digital services, SLOs emerge as a pivotal component, ensuring optimal performance and reliability. Berta Amat, Engineering Manager in Adevinta’s Data Catalogue team, along with Tulio Cruz, Tech Principal at Thoughtworks, working closely with Adevinta’s Spanish marketplace — Infojobs, offer insights into why SLOs hold substantial relevance for Adevinta’s operational efficiency and technical advancement.

Welcome! Have you ever wondered about the buzz surrounding SLOs, error budgets, and SLIs? It’s not just another temporary tech trend, and we’re here to shed some light on why they matter — and why they are here to stay.

The concepts of Site Reliability Engineering (SRE) and Service Level Objectives (SLOs) have their roots in Google’s engineering teams back in the early 2000s. Those teams faced the monumental task of ensuring the reliability of their large-scale, distributed systems (Google Site Reliability Engineering Team, n.d.).

Now that we’ve set the scene, feel free to get comfortable, perhaps with a cup of tea, as we share this story with you…

Supervillain
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

Once Upon a Time: When Heroes Meet Chaos

In the fast-paced world of tech, ACME, a company building user-facing services, embarked on a tumultuous journey. At the crux of their evolution, they transitioned to a distributed system while riding the waves of exponential user growth.

The teams were diligently crafting fresh features to both retain and entice new users. Each team expedited the implementation of these features. However, the rapid pace of development became a problem, as some of these changes often led to broken functionalities or misbehaving areas of the system that affected services, creating a disconcerting user experience.

ACME encountered a significant hurdle: the organisation had no consensus on the definitions of availability and reliability. Each team determined the availability and reliability of its services autonomously, so there was no consistency between services and no prior deliberation or agreement at either the team or company level.

This situation led to negative consequences.

The sheer volume of metrics being collected became overwhelming, with many alerts proving irrelevant or non-actionable. Swamped by the constant barrage of alerts, team members began to ignore them, resulting in chaos when real issues arose. Sometimes team members had to wake up in the middle of the night to assess whether an alert was actionable. If it was, the stressful search for the underlying problem began…

Team morale plummeted, and relationships between team members and other teams deteriorated. Trust levels hit an all-time low.

Internal conflicts within the organisation had a significant impact on the platform’s users. They experienced a noticeable decline in the quality of the platform’s functionalities, with increased downtime and latency, ultimately resulting in user dissatisfaction.

Unfortunately, the Customer Satisfaction team struggled to identify the root causes of the issues being reported. Consequently, the platform witnessed a decline in its active user base.

Following thorough deliberations, ACME’s team members resolved to address the predicament through the lean inception of the ‘SLO initiative’. ACME initially aimed for 100% reliability. However, they soon realised the complexity and cost of this approach outweighed its benefits for users. It became evident that better alternatives existed.

After exhausting various strategies on their own, including creating a multitude of monitors and alarms, they came to the realisation that external assistance was needed. That’s when they discovered ‘The Support Superhero Squad’. This dedicated team swoops in to aid their clients whenever challenges arise. With each member possessing unique strengths and abilities, they stand ready to tackle any situation that comes their way.

Let’s meet the Superheroes: Exploring The Support Superhero Squad

Metric Maven (SLI — The Service Level Indicator)

Metric maven
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

Meet Metric Maven, the embodiment of dynamism armed with an unparalleled ability to intricately measure every dimension of service performance. This persona remarkably parallels the criticality of Service Level Indicators (SLIs) within the landscape of SLO methodology. Much like guiding beacons, SLIs play an illuminating role, shedding light on the effectiveness and dependability of a service. These indicators meticulously assess pivotal metrics such as response times, uptime percentages, and error rates, thereby providing invaluable insights indispensable for refining and optimising performance.
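To make Metric Maven’s work a little more concrete, here is a minimal sketch of how two SLIs might be computed from raw request data. The `Request` record, its fields, and the 300 ms threshold are hypothetical names and values chosen for illustration; they are not taken from ACME’s or Adevinta’s systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request; the fields are hypothetical, for illustration only."""
    status_code: int
    latency_ms: float


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        return 1.0  # same assumption as above
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(503, 900.0),
    Request(200, 450.0),
]
print(availability_sli(sample))    # 3 of the 4 requests succeeded
print(latency_sli(sample, 300.0))  # 2 of the 4 requests were fast enough
```

Both indicators are expressed as a ratio of good events to all events, which makes them easy to compare against a percentage target later on.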

Sentinel Speedster (SLO — The Service Level Objective)

Sentinel Speedster
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

In the bustling metropolis of ACME, where the pursuit of excellence is akin to a finely tuned symphony, the Sentinel Speedster emerges as the living, breathing embodiment of Service Level Objectives (SLOs). Picture this swift and impeccably precise superhero as the guiding North Star, illuminating the path towards unrivaled service excellence that your company aspires to achieve. SLOs, much like the conductor of an orchestra, set the pace and precision at which your services should harmonise, ensuring customers revel in nothing short of a truly extraordinary experience. With the Sentinel Speedster leading the charge, ACME is not merely raising the bar — it’s crafting a magnum opus of exceptional service!

Error Buster (The Error Budget)

Error Buster
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

As the self-styled guardian of the error budget, Error Buster stands steadfast, ready to shield your services from unforeseen challenges. Much like this vigilant defender, the error budget provides a safety net. It allows for a margin of imperfection, ensuring that your services can operate within defined parameters while still meeting the prescribed SLO.

Alarm Oracle (the Alarms)

Alarm Oracle
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

In the ever-evolving landscape of SLOs, Alarm Oracle emerges as a seer, sensing trouble before it manifests. This astute hero operates much like alarms in the world of SLOs. Picture them as vigilant sentinels, alerting your team at the first sign of deviation in service performance. These early warnings enable timely intervention, preempting potential issues before they escalate.

Collaboration Crusader (Cross-Team Collaboration)

Collaboration Crusader
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

In the tapestry of SLO methodology, Collaboration Crusader takes centre stage, emphasising the significance of cross-team collaboration. Like a unifying force, Collaboration Crusader rallies teams together. SLOs thrive on seamless cooperation between diverse departments — support, engineering, product teams, all working in harmonious synchrony. This collective effort ensures that everyone marches in unison towards achieving the defined SLO.

The Support Superhero Squad in action

Initially, the Collaboration Crusader took the stage, employing their exceptional interpersonal skills to unite key stakeholders and capture the attention of all involved teams. Together, they worked to define what availability meant in the context of their service and to determine the necessary level of reliability.

Following this, Metric Maven, armed with their extraordinary ability to measure a wide range of metrics, contributed to defining key performance indicators. These included metrics like request latency, batch throughput, and failures per request, all essential for achieving the previously established goals.

Subsequently, with their analytical expertise and holistic approach, Sentinel Speedster guided ACME’s stakeholders in defining SLOs based on the previously established definitions. This involved agreeing on a service’s minimum reliability before users would notice any issues. They also discussed the margin for errors and the room for introducing new functionalities without hindering the overall process, thereby defining the acceptable risk of the system.

The Sentinel Speedster also shared an important concept during these discussions. ACME initially aimed for 100% availability, but the Sentinel Speedster highlighted the drawbacks of this approach, including its high cost and technical complexity. Ultimately, they worked together to find a more suitable solution for the agreed objective.

Next in line was Error Buster, armed with rational, pragmatic, and confident capabilities. Working in tandem with other superheroes, Error Buster tackled the task of determining the error budget — essentially, the permissible level of unavailability tolerated for applications before intervention became necessary, based on the defined SLOs.
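The arithmetic behind an error budget is simple enough to sketch. The snippet below assumes a 30-day window and a few example SLO targets; these numbers, like the function names, are illustrative assumptions rather than values ACME actually used.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Share of requests (or time) that may fail while still meeting the SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of full unavailability the budget permits over the window."""
    window_minutes = window_days * 24 * 60
    return error_budget_fraction(slo_target) * window_minutes


for target in (0.99, 0.999, 1.0):
    minutes = allowed_downtime_minutes(target, window_days=30)
    print(f"SLO {target:.3%}: {minutes:.1f} minutes of budget in 30 days")
```

A 99.9% target leaves roughly 43 minutes of budget in 30 days, while a 100% target leaves none at all, which is one way to see why ACME’s original goal left no room for releasing anything risky.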

Error Buster collaborated with various stakeholders, and the Product team played a significant role in these discussions. The discussions primarily focused on determining the error budget, taking into consideration the importance of the previously established acceptable risk and SLOs in this context. For example, if an ACME team wanted to deliver many risky features, they had to configure a looser SLO compared to a team that wanted to release fewer risky features.

Furthermore, Error Buster, in tandem with the Collaboration Crusader, communicated to ACME the potential hurdles teams might encounter when surpassing the error budget. The paramount repercussion is that, should the error budget be exceeded, releases have to be postponed until the system’s reliability has been bolstered. It is therefore imperative to bear in mind that teams inclined to take more substantial risks may need looser SLOs to avoid exhausting the budget quickly.

The predefined metrics and monitoring help the company track how close it is to exceeding the error budget. They warn the teams when SLIs are failing often enough that the budget may be exhausted before the end of the window.
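One common way to express this proximity is a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is an assumption about how such tracking could look; the function names are hypothetical, and the projection assumes traffic and failures are spread evenly over the window.

```python
def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget is being spent exactly at the pace that
    would use it up at the end of the window; higher means faster.
    """
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0.0:
        raise ValueError("a 100% SLO leaves no error budget to burn")
    if total == 0:
        return 0.0
    return (failed / total) / allowed_error_rate


def budget_spent(rate: float, elapsed_days: float, window_days: float) -> float:
    """Approximate share of the budget spent so far, assuming a steady pace."""
    return rate * elapsed_days / window_days


rate = burn_rate(slo_target=0.999, total=200_000, failed=500)
spent = budget_spent(rate, elapsed_days=10, window_days=30)
print(f"burn rate: {rate:.2f}, budget spent so far: {spent:.0%}")
```

In this example the service fails two and a half times as often as the SLO allows, so about five sixths of the budget is gone a third of the way through the window.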

However, a pressing question arises: can we afford to overlook an exceeded error budget? The heroes proposed the idea of establishing an exception system that permits teams to release under exceptional circumstances a limited number of times per year. These two heroes helped ACME establish a contingency plan should this error budget be exceeded.
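The exception system can be pictured as a small release gate. This is only a sketch of the idea: the `ReleaseGate` class, its decision labels, and the limit of two exceptions per year are hypothetical, since the story does not say how many exceptions ACME allowed.

```python
from dataclasses import dataclass


@dataclass
class ReleaseGate:
    """Decides whether a team may release, given its error budget usage."""
    max_exceptions_per_year: int  # hypothetical policy value
    exceptions_used: int = 0

    def decide(self, budget_spent: float, is_exceptional: bool = False) -> str:
        """budget_spent is a fraction: 1.0 or more means the budget is exhausted."""
        if budget_spent < 1.0:
            return "release"
        if is_exceptional and self.exceptions_used < self.max_exceptions_per_year:
            self.exceptions_used += 1  # each exception is counted against the limit
            return "release-with-exception"
        return "postpone"


gate = ReleaseGate(max_exceptions_per_year=2)
print(gate.decide(0.4))                       # budget left: normal release
print(gate.decide(1.2))                       # budget exceeded: postpone
print(gate.decide(1.2, is_exceptional=True))  # budget exceeded, exception granted
```

Counting exceptions keeps the escape hatch visible: once the yearly allowance is used up, an exceeded budget always means postponing the release.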

With all our superheroes operating at peak efficiency, having meticulously defined the SLOs, SLIs, and error budgets, it was Alarm Oracle’s turn to step in. Armed with their unique ability to anticipate trouble before it arises, Alarm Oracle set up a comprehensive alarm system, ensuring that every critical aspect was monitored vigilantly, all while respecting the error budget parameters defined by Error Buster.

The Alarm Oracle concentrated on creating metric dashboards and alarms in collaboration with stakeholders, with a specific focus on actionable alerts to prevent unnecessary noise or alert fatigue.
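The story does not describe how Alarm Oracle kept alerts actionable. One technique often used for this purpose, shown here purely as an assumption, is to page only when the error budget is burning fast over both a long and a short lookback window. The window lengths and the 14.4 threshold are illustrative: with a 30-day window, burning at 14.4 times the allowed rate for one hour spends about 2% of the budget.

```python
def should_page(long_window_burn: float, short_window_burn: float,
                threshold: float = 14.4) -> bool:
    """Page only when both lookback windows burn budget faster than threshold.

    The long window (for example 1 hour) shows the problem is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return long_window_burn > threshold and short_window_burn > threshold


print(should_page(long_window_burn=20.0, short_window_burn=18.0))  # ongoing incident
print(should_page(long_window_burn=20.0, short_window_burn=1.0))   # already recovered
print(should_page(long_window_burn=2.0, short_window_burn=30.0))   # brief blip
```

Requiring both conditions is what cuts the noise: a short spike or an incident that has already recovered does not wake anyone up.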

Conclusion

Ultimately, the combined efforts of these extraordinary heroes, each embodying a crucial facet of SLO methodology, paved the way for ACME’s success. The Collaboration Crusader fostered unity, Sentinel Speedster provided clarity, Metric Maven ensured precision, Error Buster fortified resilience, and Alarm Oracle stood vigilant. Together, they defined SLOs, SLIs, and error budgets, fostering a culture of reliability and innovation within ACME.

ACME team members quickly witnessed the profound impact of these changes and the effectiveness of SLOs. The availability and reliability of the services they provided improved remarkably. When alerts were triggered, they were all actionable, enabling the teams to readily pinpoint and address issues. All these changes led to a boost in team morale. These transformations didn’t go unnoticed by the Customer Support team either; they observed a reduction in user issues, resulting in an upsurge in user satisfaction and an increase in the user base.

With all these efforts and the SLO methodology applied, ACME could see that they had greater control of their entire ecosystem. They could pay attention to user responses, always knowing what to expect from their applications. And because they could focus much more on improving their business, they saw revenues increase in a more predictable and consistent way.

As the journey continues, the lessons learned from ‘The Support Superhero Squad’ will continue guiding ACME toward a future where excellence is not just a goal but a way of life. It’s a testament to the power of SLOs in transforming not just services but entire organisations into beacons of reliability in the ever-evolving landscape of technology.

Zero to Hero, Error to Zero! Support Squad, Let’s Roll!

squad
Credits: Malu. “Portfolio.” Malu.dev, https://malu.dev.

References

Google Site Reliability Engineering Team. (n.d.). “Service Level Objectives.” In the SRE Book. Retrieved from https://sre.google/sre-book/service-level-objectives (Accessed on October 25, 2023).
