At Adevinta, the mission of the Cloud Governance team is ‘Empowering teams to optimise our interaction with providers in terms of cost savings, data-driven decisions and smart provisioning of cloud systems’. Additionally, although we have teams providing a common platform and supporting other teams, we endorse the ‘You build it, you run it (and you pay for it)’ concept, so teams are not only owners of their resources, but also accountable for them.
In this article, we explain the tools and automations we provide so that teams can work autonomously across more than a thousand cloud accounts, while we maintain the link between cloud resources, teams and their budgets. We also explain the challenges that shaped the current governance framework. Details about Cost Savings and (Green) FinOps are out of the scope of this article, as our colleague Ferran Grau has already provided insightful talks on the subject.
Keeping cloud account ownership up to date
Knowing who manages or operates an account is key to making data-driven decisions, whether about costs or about operational matters (e.g., identifying who we should contact when AWS notifies us about an incident in an account). However, keeping this information up to date is a challenge given the agile nature of projects in Adevinta.
A little bit of history
Before 2020, we used to work with tables in a Confluence page where we put the contact details for every AWS account. Unfortunately, this isn’t a scalable solution when you have hundreds of accounts. Because of that, we started coding a solution to track this kind of information: the Catalogue API.
More than a catalogue API
The Catalogue API started as a simple REST API backed by just 400 lines of Python and a database that listed the AWS accounts, linking each one to a finance cost center and a team name. Now the API has more than 7k lines of code and can do much more, including:
- List cloud accounts and link them with their budget, managers of the account, security contacts and security risk assessment (ROLFP framework)
- Create accounts within AWS, GCP, Datadog, Grafana Cloud or PagerDuty
- Manage access to these accounts using Single Sign-On
- List CIDR usage within our cloud providers and manage IP address assignments within the RFC1918 address space (IPAM solution)
The goal of this tool is to provide a homogeneous set of primitives for operating on cloud accounts, abstracting away the differences between cloud providers. This allows us to build on top of the API while keeping its code as simple as possible. We will dive deeper into some of the details later.
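To make the idea of a homogeneous interface more concrete, here is a minimal sketch of how provider-agnostic primitives could look. This is not the actual Catalogue API code: the names `AccountProvider`, `InMemoryProvider`, `Catalogue` and `CloudAccount` are hypothetical, and the in-memory adapter stands in for real calls to a provider's SDK.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CloudAccount:
    provider: str     # e.g. "aws" or "gcp"
    account_id: str   # provider-native identifier
    name: str
    cost_center: str


class AccountProvider(ABC):
    """Hypothetical set of primitives every provider adapter implements."""

    name: str

    @abstractmethod
    def create_account(self, name: str, cost_center: str) -> CloudAccount:
        ...

    @abstractmethod
    def list_accounts(self) -> List[CloudAccount]:
        ...


class InMemoryProvider(AccountProvider):
    """Stand-in adapter; a real one would call the provider's SDK instead."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._accounts: List[CloudAccount] = []

    def create_account(self, name: str, cost_center: str) -> CloudAccount:
        account_id = f"{self.name}-{len(self._accounts) + 1}"
        account = CloudAccount(self.name, account_id, name, cost_center)
        self._accounts.append(account)
        return account

    def list_accounts(self) -> List[CloudAccount]:
        return list(self._accounts)


class Catalogue:
    """Callers use the same primitives regardless of the provider."""

    def __init__(self, providers: List[AccountProvider]) -> None:
        self._providers: Dict[str, AccountProvider] = {p.name: p for p in providers}

    def create_account(self, provider: str, name: str, cost_center: str) -> CloudAccount:
        return self._providers[provider].create_account(name, cost_center)

    def list_accounts(self) -> List[CloudAccount]:
        return [a for p in self._providers.values() for a in p.list_accounts()]
```

With this shape, adding a new provider means writing one more adapter, while everything built on top of `Catalogue` stays unchanged.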
When mapping account ownership is not enough
As a very dynamic company, we are used to changes in teams and projects in Adevinta. Early in the development of the Catalogue API, we realised that we need to keep up to date not only the link between the accounts and their teams, but also the teams’ own data, because providing reliable cost information is one of our core competencies.
These changes created new challenges:
- Because we were copying the data into our database, every team change required updating multiple records
- We were using an internal API that listed teams within Adevinta, but:
  - It was not being properly maintained, and its data was neither curated nor updated
  - It listed teams in a flat manner, so we were not tracking the Adevinta organisation tree, which is useful for grouping costs and for tracking them within some cloud features (e.g. AWS Organizations and GCP folders)
We considered using the HR service portal API to track the Adevinta structure, but we realised that the HR hierarchy does not always mirror the functional structure, so we discarded it.
At that time, several teams were also interested in a replacement for the existing internal API. Given the situation, we took the lead, wrote an RFC, and developed the Organisations API.
The Organisations API fulfills these requirements:
- Tracks the organisation tree
- Every organisation is a node that can be moved within the tree
- Every node has a permanent ID, which lets our internal tools reference it independently of its name, cost center or parent node
- It links a team, department or division to a cost center
We take care of maintaining this information, so the data we provide to Adevinta is as correct as possible.
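The requirements above can be illustrated with a small sketch of a tree whose nodes keep a permanent ID while their name, cost center and parent can change. The class and field names here (`Organisation`, `OrganisationTree`, `org_id`) are assumptions made for illustration and do not reflect the real Organisations API.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Organisation:
    org_id: str                      # permanent: never changes once assigned
    name: str                        # may be renamed
    cost_center: str                 # may be reassigned
    parent_id: Optional[str] = None  # None marks the root of the tree


class OrganisationTree:
    def __init__(self) -> None:
        self._nodes: Dict[str, Organisation] = {}

    def add(self, org: Organisation) -> None:
        if org.parent_id is not None and org.parent_id not in self._nodes:
            raise KeyError(f"unknown parent {org.parent_id}")
        self._nodes[org.org_id] = org

    def get(self, org_id: str) -> Organisation:
        return self._nodes[org_id]

    def move(self, org_id: str, new_parent_id: Optional[str]) -> None:
        """Re-parent a node, refusing moves that would create a cycle."""
        cursor = new_parent_id
        while cursor is not None:
            if cursor == org_id:
                raise ValueError("cannot move a node under itself or its descendants")
            cursor = self._nodes[cursor].parent_id
        self._nodes[org_id].parent_id = new_parent_id

    def path(self, org_id: str) -> List[str]:
        """Names from the root down to the given node."""
        names: List[str] = []
        cursor: Optional[str] = org_id
        while cursor is not None:
            node = self._nodes[cursor]
            names.append(node.name)
            cursor = node.parent_id
        return names[::-1]
```

Because other tools would store only `org_id`, a rename or a move in the tree does not require touching the records that point at the organisation.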
Governance at scale with a 2-pizza team
The global Cloud Governance team started with only three people in mid-2019, including the engineering manager. In mid-2020, the number grew to five. We had to wait until 2024 to surpass eight members, reaching a total of twelve organised into two squads: FinOps and GovOps.
During this time, the number of AWS accounts we managed grew from fewer than 150 to more than 700, alongside up to 500 GCP projects. At the time of writing, we have exactly 732 AWS accounts and 425 GCP projects.
In order to avoid being a blocker for a company of such scale and agility, we need to automate as much as possible. Here is how we did it.
The entry point for managing cloud accounts
Since the beginning of the Catalogue API, we have provided a UI that lists the API information. As the capabilities of the API grew and other APIs appeared, we added more features to the UI, which became more than a mere display of information: the Cloud Management portal.
The Cloud Management portal is the entry point for our internal users to access the capabilities of the Cloud Governance APIs, and it provides a mechanism for making requests that the relevant manager or owner can approve or reject. This is key for us, as it allows teams to self-manage their resources without requiring the Cloud Governance team to handle every single request related to cloud accounts.
The following sections detail the most significant wins we achieved by integrating automations into Cloud Management.
Growing without losing insight
Since the team’s kick-off, one thing has been constant: creating new cloud accounts. Because of that, we added an account creation endpoint to the Catalogue API in early 2021 and automated the workflow using Cloud Management. Thanks to this, our colleagues can request the creation of an account associated with a budget, and the budget owner of the given team approves or rejects the request, which keeps the ownership information up to date.
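The approval flow described here can be sketched as a small state machine in which only the owner of the requested budget may decide on a request. Everything in this example is hypothetical: the `BUDGET_OWNERS` lookup, the field names and the email addresses are placeholders, not the real Cloud Management data model.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class AccountRequest:
    requester: str
    account_name: str
    budget_id: str
    status: Status = Status.PENDING


# Hypothetical mapping from budget to its owner; in practice this would
# come from the budget and organisation data.
BUDGET_OWNERS = {"budget-42": "owner@example.com"}


def decide(request: AccountRequest, approver: str, approve: bool) -> AccountRequest:
    """Record a decision, accepting it only from the budget owner."""
    if request.status is not Status.PENDING:
        raise ValueError("request has already been decided")
    if BUDGET_OWNERS.get(request.budget_id) != approver:
        raise PermissionError(f"{approver} does not own {request.budget_id}")
    request.status = Status.APPROVED if approve else Status.REJECTED
    return request
```

Tying the decision to the budget owner is what keeps ownership data fresh: an account cannot exist without someone accountable having approved it.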
Over time, we also added the option to deploy a base VPC with a non-colliding IP range, so that teams can interconnect their networks without address conflicts. This is powered by our IPAM functionality, which in turn relies on scanning all the CIDRs in use within our cloud accounts to avoid conflicts.
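As an illustration of how a non-colliding range could be picked, the following sketch uses Python's standard `ipaddress` module to find the first free block of a given size inside an RFC1918 pool. The function name `next_free_cidr` and the first-fit strategy are assumptions for this example; the real IPAM logic may differ.

```python
import ipaddress
from typing import Iterable


def next_free_cidr(pool: str, used: Iterable[str], prefix_len: int) -> ipaddress.IPv4Network:
    """Return the first subnet of size /prefix_len in the pool that does not
    overlap any CIDR already in use."""
    pool_net = ipaddress.ip_network(pool)
    used_nets = [ipaddress.ip_network(cidr) for cidr in used]
    for candidate in pool_net.subnets(new_prefix=prefix_len):
        if not any(candidate.overlaps(existing) for existing in used_nets):
            return candidate
    raise RuntimeError(f"no free /{prefix_len} left in {pool}")


# CIDRs that a scan of the cloud accounts would have reported (made-up values).
in_use = ["10.0.0.0/16", "10.1.0.0/24"]
print(next_free_cidr("10.0.0.0/8", in_use, 16))  # prints 10.2.0.0/16
```

Note that `10.1.0.0/16` is skipped even though only a `/24` inside it is taken, because any overlap would cause routing conflicts once the networks are interconnected.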
Delegating account access management
Like other companies, we use an Identity Provider (IdP) service to log in to several services using single sign-on. However, before late 2019, AWS was not able to connect to external providers. This left us exploring custom solutions that were difficult to scale, so access was still managed using IAM users in the AWS accounts. When AWS announced explicit support for several third-party IdP services, we were so eager to adopt it that we set up a Selenium script to bulk-onboard our accounts, as the AWS SSO API was not published until September 2020.
Nevertheless, managing access at scale is also a challenge. AWS required us to set up the permissions and link them to users or groups in a single organisation account, which involved considerable toil. Once the SSO API was published, we made the Catalogue API the glue between AWS SSO, IdP user groups and the Permission Sets, while Cloud Management lets users request access to AWS accounts so that teams can self-manage their access.
When we started to include other cloud providers in our scope, we kept the ability to manage access through our tooling, extending it to GCP, Grafana Cloud, Datadog and PagerDuty.
Managing a Cartesian product of AWS permissions and teams
At the beginning, we just set up a few generic Permission Sets (i.e. admin, power user, read only). However, that was not enough to cover all use cases, so we set up a repository that created them as Infrastructure as Code in the SSO management AWS account. This way, teams were able to create their own Permission Sets or customise existing ones by sending us Pull Requests.
Unfortunately, the number of different Permission Sets has a soft quota of 500. Even though the limit can be increased, it was difficult to effectively manage the growing number of Permission Sets that teams were requesting or updating. Then, after Attribute Based Access Control (ABAC) was introduced in SSO, we created a specific Permission Set that passes some attributes sent by our IdP when assuming other roles, which use those attributes to conditionally allow or deny access.
Although using ABAC for jumping to other roles gives us very fine-grained control over the permissions the user gets, it has a few issues:
- It requires switching roles after authenticating, which might be a burden for less tech-savvy users
- The attributes sent can only be used in the trust policy of the role the user is jumping to
- We cannot identify a job role, because the attributes sent are the same for everyone within a team, so we cannot effectively manage subgroups of people and have to rely on the user’s email to identify job roles
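To show what using attributes in a trust policy can look like, here is a hedged sketch that builds the trust policy of a target role as a Python dictionary. It assumes the IdP attribute arrives as a session tag and is matched with the `aws:PrincipalTag/<key>` condition key; the tag key `team`, its value and the role ARN below are placeholders, not our real configuration.

```python
import json


def build_trust_policy(source_role_arn: str, tag_key: str, tag_value: str) -> dict:
    """Trust policy that lets the source role assume the target role only
    when the caller's session carries the expected attribute."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": source_role_arn},
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {f"aws:PrincipalTag/{tag_key}": tag_value}
                },
            }
        ],
    }


# Placeholder ARN and attribute values, for illustration only.
policy = build_trust_policy(
    "arn:aws:iam::111122223333:role/example-sso-jump-role",
    tag_key="team",
    tag_value="payments",
)
print(json.dumps(policy, indent=2))
```

Since every member of a team carries the same attribute value, a condition like this one can distinguish teams but not job roles inside a team, which is the limitation described in the last bullet.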
Lastly, after AWS added Customer managed policies, we crafted a set of generic empty roles (with some explicit denials to prevent roles intended as read-only from ending up with more permissions than expected) to cover different business job roles (auditor, finance, developer, etc.), with some well-known policy names that teams can instantiate in their own AWS accounts.
Using Customer managed policies allows teams to completely manage their permissions within their own AWS accounts. Alternatively, they can use the generic roles as a way to group users and allow them to jump to other roles. They can also use ABAC, as attributes are still sent. In the event that the dozen generic roles provided are not enough to cover all their use cases, they can fall back on the role-assuming Permission Set.
Delivering global notifications to the proper audience
Another particular aspect we needed to cover in the AWS case was to properly route important emails and notifications from the provider.
Although AWS shows you notifications in the web console, most of our engineers don’t access it regularly. Most of their work happens outside the console, and not all the accounts a team owns are accessed with the same frequency. In this situation, notification emails from AWS help, but they are delivered to the root user’s email, which is managed by us, so we needed to monitor and manually forward them.
In order to automate this process, we developed the Notifications API, a service that listens to AWS notifications using AWS Health Aware and routes them to the account users. We do so by:
- Exposing an API that Cloud Management consumes to display notifications
- Generating a daily and weekly digest email that is sent to the account users
Over time, we enhanced this API and now we also use it to send customised notifications regarding internal Adevinta processes that we manage, like budgeting period reminders or cost alerts.
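The digest step can be sketched as grouping notifications by recipient, based on who is registered for each account. This is a simplified illustration: the `Notification` type, the `contacts` mapping and all values are hypothetical, and the real service also handles delivery and scheduling.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Notification:
    account_id: str
    title: str


def build_digests(
    notifications: Iterable[Notification],
    contacts: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Group notification lines per recipient, using the account contacts."""
    digests: Dict[str, List[str]] = defaultdict(list)
    for notification in notifications:
        # Accounts without registered contacts are skipped in this sketch.
        for email in contacts.get(notification.account_id, []):
            digests[email].append(f"[{notification.account_id}] {notification.title}")
    return dict(digests)
```

Each recipient then gets a single email containing only the lines for the accounts they use, instead of every provider email landing in one shared mailbox.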
Final words
As we saw, having proper (meta)data is key to governing the cloud within Adevinta and to better handling cost allocation and savings. It also lets us direct recommendations, notifications and issues to the right people.
The providers’ governance tools have evolved significantly over these years, as have ours. Along the way, we had to develop internal tools to cover features that were missing at the time, but this became a strength when we added more cloud providers to the mix, as it allowed us to adapt the tools to our functional company structure.
Finally, we hope this article gives you insight into the challenges of governing a multicloud environment at this scale. We also hope it helps you select the best tool for your use cases, and that our proposed solutions inspire you to overcome your limits.