How we joined forces to enable GPU serving for our AI solutions

By Joaquin Cabezas, Machine Learning Engineer and Daniel Ricart, DevOps Engineer

Turn on the GPU again…

And when everything works just fine, someone goes into the meeting room and says: “would it be possible to have a GPU for serving deep learning models directly in Kubernetes? Asking for a friend…”

A quiet and calm room in the Miyagi prefecture, Japan. Source: Marisa from Unsplash.

The light bulb in the image above is switched on. You know that pressing the power switch will turn it off, even if you don’t really understand how it works. In the same way, I know I can put any of my services online because I can trust that someone else will keep the lights on.

But this superficial understanding also creates awkward moments: someone will ask for a feature they see as “simple”, “easy” or even “trivial”. Imagine someone asks you to “simply” change the location of the wall switch; this may require you to cut a hole in the wall to re-route the electrical wires.

Now imagine applying the same thinking to artificial intelligence computing. Instead of a light bulb, you’re asked to “simply” adjust GPU (Graphics Processing Unit) availability for Deep Learning inference. Within Adevinta, two technical teams (Cognition and the Common Platform team) are already thinking about very different ways to fulfil this “simple” request. Cognition focuses on making AI solutions available to all of Adevinta, while the Common Platform team (CPR) provides Adevinta’s multi-tenant Kubernetes distribution in the form of a Platform-as-a-Service, where Software Engineers across our marketplaces can deploy and manage their microservices.

And now, as in Kurosawa’s classic movie Rashomon, let each part tell its own story.

Cognition’s tale

It’s the beginning of 2023, and someone is asking to change the light bulb. My name is Joaquin Cabezas and I am a Machine Learning (ML) Engineer within the Cognition team, focused on Deep Learning for images and text.

For any image going through Adevinta’s marketplaces, there is a Deep Learning model operating on it. It can be a text detector, a car classifier, a background remover, etc. As these models are better served using GPUs, we usually rely on the GPUs available through Amazon Web Services (AWS) SageMaker.

This time was different: we wanted to build a project where we could benefit from a bit more control and tighter coupling between parts of the service. This project is Cognition’s next generation of image embeddings.

(Quick tech note: an image embedding is the representation of an image in a way that is convenient for lots of tasks like classification or computing similarity between any pair of images. It is a critical component of many ML platforms).
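To make the note above concrete, here is a minimal sketch of comparing images through their embeddings. The four-dimensional vectors are made up for illustration (real embeddings usually have hundreds of dimensions), and cosine similarity is just one common choice assumed here, not necessarily the measure Cognition uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example.
red_car = [0.9, 0.1, 0.0, 0.3]
blue_car = [0.8, 0.2, 0.1, 0.3]
sofa = [0.0, 0.9, 0.8, 0.1]

# The two cars should be closer to each other than a car is to a sofa.
print(cosine_similarity(red_car, blue_car) > cosine_similarity(red_car, sofa))
```

The same function works for any pair of images once they have been turned into vectors, which is what makes embeddings convenient for similarity search.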

So, very excited about the project, our team went and asked for GPUs in our cluster for the production environment. As we already have GPUs in our training platform, this should be a no-brainer, right?

Graphics Processing Unit, or GPU, is a general term for a type of computing hardware that has taken the world by storm because of the AI revolution. In this context, it is a way to serve our ML models faster and cheaper, so it is a sound business case to present to upper management.

But GPUs alone are not enough to improve the service. If we are moving our serving endpoints there are also a few more things to consider. We already know this is not going to be a simple feature request…

It’s a long way to Tipperary

It is common to start with a list of requirements, but on this occasion it would be better described as “a wishlist”. There was no rush and nothing was broken, so we took the opportunity to outline our (long) wishlist:

Use GPUs for serving
Connect our services using gRPC
Faster autoscaling than the one available on SageMaker
Custom metrics for triggering autoscaling (not just CPU, memory…)
Full observability in our Grafana dashboards
Choose different instance types of GPU
Continue using our Golden Path with FIAAS, an abstraction to reduce cognitive load when writing Kubernetes manifests.

Wait! Why overload our colleagues with so many items? Well, the list grew organically while discussing the details and defining priorities together. After all, if only one team prioritises the list before discussing it together, it’s not a joint decision.

gRPC, a non-negotiable feature?

So, as we already said, we are looking to boost the performance of the service. We already know that gRPC is helping us increase throughput and we didn’t want to lose that benefit, so this was a top priority. And, as in many good stories, with the first clear objective comes the first clear letdown.

It turns out that our Kubernetes ingress controller (a reverse proxy) only works with HTTP/1.1, while our beloved gRPC runs on HTTP/2. Before giving up, we continued experimenting with port-forwarding on a single pod. But at a certain point we needed to connect to the service from outside the cluster (to use the load balancer).

We partnered with the CPR team to explore alternatives, and we learned that they were already aware of this limitation, which they had encountered as part of their work on the SCHIP product. After they shared some workable solutions, we decided to move the consumer (a Java application) into the cluster. This would enable us to call the service (through the load balancer) just by using the name of the deployment as the hostname. And for testing, we could deploy our Locust workers with Kustomize in the same cluster, so… problem solved!

Autoscaling for the masses

One of the critical properties of our serving endpoints is elasticity: the ability to acquire resources when needed (and release them as soon as they are no longer required). A simple rule that scales out when CPU usage stays above a threshold for a specified period of time will probably suffice. But if we also consider other conditions, we can improve the overall performance and save costs.
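The simple rule described above can be sketched in a few lines. This is only an illustration under our own assumptions: the class name, the 70% threshold and the three-sample window are hypothetical, and a real autoscaler would read these samples from a metrics backend rather than from a list.

```python
from collections import deque

class SustainedThresholdRule:
    """Fires only when every one of the last `window` samples exceeds `threshold`."""

    def __init__(self, threshold, window):
        self.threshold = threshold
        self.samples = deque(maxlen=window)  # keeps only the most recent samples

    def observe(self, value):
        """Record a new sample and report whether a scale-out should be triggered."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(s > self.threshold for s in self.samples)

# Hypothetical CPU readings (percent), one per sampling interval.
rule = SustainedThresholdRule(threshold=70.0, window=3)
readings = [50, 80, 85, 60, 75, 90, 95]
decisions = [rule.observe(r) for r in readings]
print(decisions)
```

Only the last reading triggers a scale-out, because it is the first time three consecutive samples sit above the threshold. Short spikes are ignored, which is the point of requiring a sustained period.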

A simple ramp load test to trigger autoscaling. Source: Adevinta

“If we could create the auto scaling triggers just like we create Grafana alerts, that would be awesome”, someone commented in our Slack channel, without much hope. We were used to the AWS SageMaker autoscaling policies, which are limited in comparison with EC2, and we thought that it would be very hard to build the functionality we wanted.

But we were mistaken! The Common Platform team had covered this in an experimental way, using KEDA, and there was a slot for us in the next “Loko Friday” (similar to Google’s 20% side-projects). While it is not yet a self-serve feature (we don’t have the permissions yet), we were able to experiment with a few rules, and the feeling was positive.

This was also the litmus test confirming that we already had full observability using our usual Prometheus and Grafana configuration. Everything we needed was there. In particular, for the metrics coming directly from TensorFlow Serving, it was really important to have the processed batch size distribution along with the rest of the metrics. This is a key indicator of how we are using GPU memory resources.

Playing with different traffic patterns to watch how the processed batch size is impacted. Source: Adevinta

It’s not all a bed of roses

From the list of priorities we presented, not all were fulfilled:

We still don’t have the gRPC ingress, so we are constrained to a single cluster.
We were accustomed to selecting different instance types per application, because serving a ResNet-101 image classification service is not the same as serving an LLM like Llama 2 for text generation. Deciding on a single instance type for all of them is a hard compromise between direct serving costs and engineering effort costs.
Also, giving up our cozy FIAAS for an approach using “raw” Kubernetes manifests was a hard tradeoff. A first direct effect was that I had to take the CKA course from the Linux Foundation, which was helpful. A less appealing consequence would have been modifying our production pipeline, but luckily we didn’t have to do that.

We tend to emphasise victories when writing articles like this one, but it is important to note that we experienced a combination of success and failure. As one of our Adevinta Key Behaviours says: “Win and lose together”.

Common Platform team’s perspective

The Common Platform team (CPR) is responsible for Adevinta’s SCHIP product. Our feature backlog included the ability to offer specialised hardware in our clusters, which could be ARM instances or instances with GPUs. Once we had explored how GPU instances worked and developed an initial working approach, we reached out to our friends in the Cognition team. They had the workloads and the expertise in using GPUs that we lacked.

I am Daniel Ricart, an engineer in the Common Platform team at Adevinta. As part of the team’s experimentation time, which we call “Loko Fridays”, I worked on providing GPU instances in our clusters. And, as introduced in this post, I am currently working on offering new ways to horizontally scale the workloads running in the cluster.

As part of our feature set, we offer our users some pre-made dashboards to check the health of running applications. Each user can create their own dashboards to fit their use case, but we offer some defaults that highlight the most basic signals of any pod, ingress or deployment. Once we added the GPU-powered instances to the fleet, we extended the dashboards to show some metrics about GPU usage.

Another default feature of SCHIP is scaling based on CPU and memory usage. With this feature, any user can make sure their applications respond automatically to increased demand or reduce the number of idle resources when not in use.

By leveraging the newly added GPU-powered instances, the extended application dashboards and GPU metrics, and the existing basic autoscaling features, Cognition (and any other ML team) could start their tests. And they jumped into it.

We offered the same experience they already knew, to deploy CPU-based applications with a new resource:

Using this resource, Cognition can deploy GPU-based applications in SCHIP using the interfaces they are familiar with. But there was a previously unknown caveat: they could only scale a GPU-based application based on its CPU load. There is a correlation between the usage of both resources, but it is not strong enough to be a reliable scaling signal.

Scaling

At Adevinta, we allocate one day every two weeks to run technical explorations. In CPR this day is called “Loko Friday” and it has fostered several initiatives that are now part of the portfolio. If an idea that comes up is approved by the rest of the team, and it adds value to the users, it is usually further developed and added to our regular portfolio as Alpha stage. Later it will graduate as a supported feature. As part of this allocated time, I started exploring a long-forgotten user request: how feasible would it be to offer horizontal scaling based on any metric instead of just allowing the usual CPU and memory metrics?

I had some breakthroughs and, by leveraging components already available in the cluster for internal use, I came up with a very neat solution in a couple of “Loko Fridays”. The proposed solution still required polishing and it was primarily aimed at advanced users of SCHIP, because it lacked integration with FIAAS. But my tests suggested that we should be able to try the solution with real workloads.

And while that was happening from Friday to Friday, as part of our conversations with Cognition engineers, they sent us their wishlist for the next steps of serving GPU-based applications.

Everything started from a Slack conversation on 11 July 2023. Source: Adevinta

Here we had the second collaboration with the Cognition Team. In CPR, we were ready to test and validate our technical exploration with some users, and this was the perfect opportunity. Also, Cognition wanted to explore this new feature we were testing.

On the following Friday I shared with Joaquin that I was working on this feature, and guided him through the required setup on his side. That same day we generated some satisfactory results and a short list of improvements to the feature.

Number of pods serving a TensorFlow application. Source: Adevinta

What’s the magic behind it?

In SCHIP we use keda.sh (Kubernetes Event-driven Autoscaler) to scale ingress based on the load we are serving at any given time. It periodically runs a Prometheus query to fetch the average number of requests and performs the calculations required to maintain a pool of ingress pods. To set up a KEDA autoscaler, you provision a ScaledObject in your namespace with some specific settings about the scaling behaviour and the data source. KEDA then controls a HorizontalPodAutoscaler object to add or remove application pods according to the values of your ScaledObject.
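As a rough illustration of the kind of calculation involved, the sketch below derives a replica count from a metric value. It is a simplification under our own assumptions: the metric is treated as a total to be shared evenly among pods, and the tolerance band and stabilisation windows that a real HorizontalPodAutoscaler applies are ignored. The function name and the numbers are hypothetical.

```python
import math

def desired_replicas(metric_total, target_per_pod, min_replicas, max_replicas):
    """Replicas needed so that each pod handles at most `target_per_pod` of the metric."""
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    wanted = math.ceil(metric_total / target_per_pod)
    # Clamp to the configured pool size.
    return max(min_replicas, min(max_replicas, wanted))

# Hypothetical case: 450 requests/s in total, each pod should take about 100.
print(desired_replicas(450, 100, min_replicas=1, max_replicas=10))  # prints 5
```

With no traffic the pool shrinks to the minimum, and with a very large value it stops at the maximum, so the two bounds act as a safety net around the query result.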

During my “Loko Friday” work, I crafted a Helm chart that encapsulates a ScaledObject.keda.sh object with all the settings that fit SCHIP. This is then connected to the appropriate Prometheus instance, the one that hosts the tenant metrics. There are some customisation options, such as the Prometheus query to run, the minimum and maximum size of the pool, and a few other parameters. Most of the parameters of the object follow sane defaults we have already tested in other cluster components.
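To give an idea of what such an object contains, here is a minimal sketch that builds a ScaledObject with a Prometheus trigger as a Python dictionary and prints it as JSON. The field names follow the public KEDA ScaledObject format; everything else (the helper function, the names, the Prometheus address, the query and the numbers) is hypothetical and is not the actual SCHIP chart.

```python
import json

def build_scaled_object(name, namespace, deployment, query, threshold,
                        min_replicas, max_replicas, prometheus_url):
    """Return a KEDA ScaledObject manifest that scales `deployment` on a Prometheus query."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas cannot exceed max_replicas")
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "scaleTargetRef": {"name": deployment},  # the Deployment to scale
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "triggers": [{
                "type": "prometheus",
                # KEDA trigger metadata values are strings.
                "metadata": {
                    "serverAddress": prometheus_url,
                    "query": query,
                    "threshold": str(threshold),
                },
            }],
        },
    }

# All values below are placeholders invented for this example.
manifest = build_scaled_object(
    name="embeddings-scaler",
    namespace="cognition-example",
    deployment="embeddings",
    query="sum(rate(example_requests_total[1m]))",
    threshold=100,
    min_replicas=1,
    max_replicas=10,
    prometheus_url="http://prometheus.example.svc:9090",
)
print(json.dumps(manifest, indent=2))
```

Wrapping the object in a chart, as described above, means that users only fill in the few values that differ per application while the remaining settings stay consistent across tenants.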

Wow!

So far we have fulfilled four-and-a-half items of the Cognition wishlist.

✅ Use GPUs for serving
⚠️ Connect our services using gRPC
✅ Faster autoscaling than the one available on SageMaker (a new instance in less than 3 minutes)
✅ Custom metrics for triggering the autoscaling (not just CPU, memory etc.)
✅ Full observability in our Grafana dashboards
❌ Choose different instance types of GPU (not possible for now)
⁉️ Continue using our Golden Path with FIAAS for Kubernetes

Why “and a half”? Well, deploying workloads using GPU instances is already part of FIAAS for Kubernetes, as an extension we added previously. Internally, FIAAS will add the required resource blocks to make sure that the pods are scheduled on a node with GPU support.
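For readers unfamiliar with what such a resource block looks like in plain Kubernetes, here is a minimal sketch. It assumes the cluster exposes GPUs through the NVIDIA device plugin, whose extended resource name is nvidia.com/gpu; the helper function, container name and image are hypothetical, and this is not the actual FIAAS implementation.

```python
import copy

# Assumption: GPUs are advertised by the NVIDIA device plugin under this name.
GPU_RESOURCE = "nvidia.com/gpu"

def with_gpu_limit(container, gpus=1):
    """Return a copy of a container spec that asks the scheduler for `gpus` GPUs."""
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    patched = copy.deepcopy(container)  # leave the caller's spec untouched
    limits = patched.setdefault("resources", {}).setdefault("limits", {})
    limits[GPU_RESOURCE] = gpus
    return patched

# Hypothetical container spec, as it would appear inside a pod template.
container = {
    "name": "model-server",
    "image": "example.invalid/model-server:latest",
    "resources": {"limits": {"cpu": "2", "memory": "8Gi"}},
}
gpu_container = with_gpu_limit(container)
print(gpu_container["resources"]["limits"])
```

Because the GPU appears as a limit on the container, the scheduler will only place the pod on a node that advertises that resource.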

The ability to scale based on custom metrics is still an alpha feature that is not yet exposed to FIAAS in any way.

A few final words

It was funny that in both teams we thought we were asking for help, when really, it was a hot topic that happened to appear for us all at the same time. This just shows how valuable it is to share notes and observations beyond your team.

It also highlights how allocating some time for innovation (“Loko Fridays”, Discovery Days etc.) fosters collaboration, and drives spontaneous innovation that adds value to all users.

We have described how the need to use a feature and the willingness to provide it are a powerful combination, one worth leveraging on the rare occasions when both appear at the same time. This pushes teams to work together in the best way possible. We hope this article (which is really about collaboration between two technical teams) will inspire you to find these “matches” in your organisation and join forces during “long ways” (to Tipperary…), not just high-intensity meetups.