
Unpacking the Complexities of Kubernetes Upgrades: Beyond the One-Click Update

  • DevOps
If you think upgrading a Kubernetes version requires just a click on a button, you should definitely read this article.

The Beginning

Our platform, SCHIP, is a multi-tenant, multi-region Kubernetes distribution for Adevinta developers. Behind the scenes, SCHIP comprises multiple Kubernetes clusters, which in the past were provisioned using kube-aws, a now-deprecated Kubernetes cluster provisioning tool. Before being retired, this tool did the job very well, providing a lot of flexibility for maintainers of Kubernetes clusters like us. However, once its deprecation was announced, kube-aws support stopped at Kubernetes version 1.15.

This gave us no choice but to search for a new provisioning tool, as we were stuck on Kubernetes 1.15 for a while after the deprecation of kube-aws. We could not keep running our platform on an out-of-date Kubernetes version forever.

By not following the Kubernetes version stream, we experienced several downsides, including:

  • Missing out on all the shiny new features that Kubernetes offers
  • Lack of updates on vulnerabilities
  • Inability to update versions of open source tools as they evolve with the Kubernetes landscape
  • Inability to adopt new tools whose minimum compatible version is far beyond 1.15

We tried and compared several options internally; I would like to share that process in a separate blog post, as we learned a lot. But long story short: we chose AWS EKS.

One of the factors is that our infrastructure runs mainly on AWS, but the most crucial reason is that we want to focus on the upper part of the Kubernetes stack to provide value to our platform users (we call them ‘customers’) and off-load the lower part of the stack to AWS.


This allows us to worry less about maintaining components like etcd, the API server and the network backbone, which add a huge amount of cognitive load to the team.

Of course, despite all of the benefits mentioned, everything has its pros and cons. EKS comes with an expiration date, roughly 14 months after a version first becomes available.

With kube-aws, we could keep a version for as long as we liked, but that also meant we had fewer incentives to upgrade.

Moving to EKS forces us to rethink how we perceive Kubernetes version upgrades.

How did we do the upgrade in the past?

So far, the only way we had performed Kubernetes version upgrades for our users was via a strategy called “blue-green”.

Our blue-green upgrade strategy

Basically, we first deployed a cluster running the newer Kubernetes version. Then we worked with our customers to deploy their workloads to the new cluster, slowly migrating traffic from the old cluster to the new one. You can read more in this article, where I explained the technical details behind how this traffic migration is done.

This approach has proven very useful as it allows our team to verify that the services under migration work well before switching production traffic completely. If things go wrong, we can always switch back to the original cluster, halting the migration.

However, there is no one-size-fits-all solution for our software architecture. The blue-green approach also has its downsides.

It requires significantly more effort

Even though it is the safest way to migrate, blue-green migration requires lots of effort to plan, coordinate, and execute.

We need to look at our current clusters and which customers run on each of them to analyse data such as the size of their workloads, how much traffic they are receiving, and how critical they are, to create a migration plan.

After we formulate a plan, we need to coordinate a switch-over date with each of our customers. The hardest part is the execution itself: migrating each of our customers synchronously takes tremendous effort.

As our user base has grown a lot in the past few years, it has become impossible to continue with this process because the toil of performing the migration consumes all of our engineers’ time.

EKS release schedule

Using EKS, the cluster’s lifetime is ~13 months

When we jumped onto the EKS train, we realised immediately the hard truth that we are chased constantly by the End of Life of each version. Each version of EKS lasts around 13 months from its first release — but we don’t actually have that much time.

Our team needs time to work on provisioning the new cluster after each EKS release. This further shortens the time that any version is available in SCHIP, which means more frequent upgrades to perform and more effort for our engineers.

The work we need to do is mainly making sure that our core components are compatible and able to run seamlessly in the new version. This normally takes at least a month, as we have to ensure that everything works and nothing breaks.

Also, we need a grace period to test the new EKS version by migrating less critical workloads before we open it widely to all customers. Even though we do our best to make sure everything is compatible, there have been cases where customers lost some of their metrics after we moved live traffic to the new cluster. This happened because metric names changed when kube-state-metrics was bumped to a compatible version; because we were not using those metrics extensively ourselves, we missed them during the verification stage.

To avoid repeating this issue during future upgrades, we created alerts to notify us when our officially supported metrics went missing. It is because of situations like these that we want to have some grace period before opening a new cluster to all users.
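As a sketch of what such an alert can look like (assuming a prometheus-operator setup; the metric name and durations here are illustrative, not our actual rules), a rule based on PromQL's `absent()` does the job:

```shell
# Illustrative only: alert when an officially supported metric disappears.
# Assumes prometheus-operator; metric name and timings are placeholders.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: supported-metrics-missing
spec:
  groups:
    - name: supported-metrics
      rules:
        - alert: SupportedMetricMissing
          expr: absent(kube_deployment_status_replicas_available)
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: An officially supported metric has gone missing
EOF
```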

Once this testing period is accounted for, we have less than 10 months to use any version. But in reality, timings are much tighter: you cannot start an upgrade on the last day of a version's lifetime, so the upgrade must be completed at least one month before the final deadline. That leaves us ~9 months per version, which is obviously not practical with our current strategy.

Some might argue that the date specified by AWS EKS is not really the final date, and we are aware of that. We know a cluster might stay unchanged even after the end of support has been reached. However, AWS neither guarantees this nor tells you when your cluster will be forced into an automatic upgrade.

Our business, and our customers, cannot run on the hope that AWS will give us extra time to upgrade our clusters. In a previous job, I worked in a team where we ignored the urgency of upgrading our EKS clusters, and one day our clusters were forced to upgrade. By the time we realised, it was a big mess: the development teams noticed that all of their new deployments were failing, impacting business uptime. It was a nightmare.

We don’t want to fix the whole company’s workload in one afternoon. So, we decided to develop a relatively conservative plan.

The scale problems

As our clusters grow organically through our customers' usage, our components also grow with them.

This means the number of EC2 instances in each cluster is scaled to suit usage. The size of our core components' pods is also adjusted to match current usage inside the clusters: Prometheus is sized to the amount of metrics generated by the workloads, and the number of ingress controller pods matches the current traffic.

Scale mismatch between cluster in-use and newly provisioned cluster

After migration, all of these will eventually be adjusted organically as we have already deployed multiple auto-scaling mechanisms to take care of them. However, in our experience, to ensure the smoothest migration with the least noise for our customers, it still requires some pre-scaling and hand-holding to make it possible.
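As an illustration of such pre-scaling (the namespace and resource names are made up), temporarily raising the autoscaler's floor before the migration gives the new cluster a head start, and the HPA converges back afterwards:

```shell
# Hypothetical pre-scaling step: raise the HPA floor for the ingress
# controller in the new cluster before shifting traffic onto it.
kubectl -n ingress patch hpa ingress-controller \
  --type merge -p '{"spec": {"minReplicas": 10}}'
```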

Apart from this, other components, such as certificates, have limitations that require preparation on our side to avoid problems when we execute the migrations.

We are still working towards making this process as automated as possible, but it still shows that the blue-green strategy is not as seamless as it appears.

Stateful and singleton workloads

The biggest and probably most complex problem posed by the blue-green upgrade strategy is how to migrate stateful and singleton workloads.

For stateful applications such as databases, which normally have persistent volumes attached, migrating with a blue-green strategy while expecting minimal impact is almost impossible.

We need to take care of migrating the data separately and consider whether shifting traffic from one cluster to another would cause more or less downtime for the database application.

Singleton workloads cannot have more than one active pod at any point in time; these include applications that consume events and send out notifications. They also create a challenge in that we cannot pre-provision them in another cluster while the previous one is still active.

The SCHIP platform previously only supported stateless applications. We have recently re-engineered it to support several types of stateful applications for our customers, which provided a key driver for us to look for a different strategy to perform Kubernetes version upgrades.

For all of these reasons, we need to do things differently.

In-place upgrades

This isn’t magic, and I believe everyone knows it exists: an in-place upgrade is the native capability provided by EKS to upgrade the version of your running cluster.

It works by rolling out the latest Kubernetes version to the API server nodes in EKS. Basically, EKS takes care of the control-plane version upgrade for you, so you only need to upgrade your node groups independently afterwards.

It sounds fairly simple, right? Just a click of a button or one command:

eksctl upgrade cluster --name my-cluster --version 1.27 --approve
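The control plane is only half of the job, though; each managed node group then has to be upgraded separately, for example (cluster and node group names are placeholders):

```shell
# Upgrade a managed node group to match the new control-plane version
eksctl upgrade nodegroup \
  --cluster my-cluster \
  --name my-nodegroup \
  --kubernetes-version 1.27
```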

Well, it’s not as simple and magical as it may appear. There are many aspects to consider when you choose to use the in-place upgrade.

Two major questions sprang to mind when we first analysed the in-place upgrades:

  1. How do we ensure that our customer workloads will not break after the cluster has been upgraded?
  2. How can we minimise the impacts on our customer side as we need to upgrade all of the worker nodes to the new versions?

We decided to split the problems and work on the second concern first.

Minimise the impacts of the cluster rebuilds

We anticipated that we would need to rebuild every single node of the cluster to upgrade the Kubernetes version of the worker nodes. So we started to analyse how to make this process as transparent as possible for our customers.

In addition, minimising the impact of the rebuild will benefit us greatly during normal scheduled maintenance, which requires all of the worker nodes to be refreshed regardless of the upgrade. The cluster rebuild has so far been perceived as a disruptive process, and we have tried to avoid it unless absolutely necessary.

It is somewhat difficult to fully understand the impact on customer experience while the cluster is being rebuilt. We decided to look at the indicators that we are most familiar with and independently fine-tune them.

These indicators became our SLOs (Service Level Objectives).

At SCHIP, we are now providing different SLOs to our customers, including:

  • Kubernetes API Availability / Read, Write latency
  • Kubernetes Ingress Availability / Latency
  • Internal / External DNS latency
  • Logging Availability / Latency
  • IAM Availability / Latency

We started with the most important step, which was collecting baseline data to answer another two questions:

  • What does the impact on the SLOs look like at the moment?
  • How long does it take to rebuild the whole fleet?

Executing the rebuild on production allows us to collect the baseline data of several aspects of the rebuild including:

Time to rebuild

Time to rebuild is very important, as we need to find the right balance between the impact on the SLOs and how long it takes to finish the rebuild. We have explored several options so that we can choose the rebuild behaviour depending on how we want to trade speed against impact.

Non-serialised node groups

Non-serialised node groups

Basically, with this approach the upgrade is performed on all of our node pools at once. In each node pool, the number of instances is doubled, the old instances are marked SchedulingDisabled, and then they are slowly terminated.
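Per node, this is conceptually the same as the standard cordon-and-drain sequence (the node name below is a placeholder):

```shell
# Mark the old node unschedulable (it shows up as SchedulingDisabled),
# then evict its pods so they reschedule onto the new instances.
kubectl cordon ip-10-0-1-23.eu-west-1.compute.internal
kubectl drain ip-10-0-1-23.eu-west-1.compute.internal \
  --ignore-daemonsets --delete-emptydir-data
```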

The number of instances is doubled and slowly converges back to the baseline

This strategy’s strength is time, as it only took us ~35 minutes to refresh 12 nodes (~3 minutes per node). However, we observed quite a strong effect on our SLOs.

For example, our logging integration SLOs dropped below our threshold.

Logging SLOs dropped while the rebuild is being performed

A big dip was also observed on Kubernetes API read latency SLI for our customers.

Big dip in Kubernetes API read latency SLI

We observed most of the SLIs being affected and dropping below the acceptable threshold.

Serialised node groups

This approach is a gentler version of the non-serialised node groups approach, where only one node group is upgraded at a time.

The mechanism is exactly the same: the number of instances is doubled for one node pool at a time, and the old instances are isolated and slowly terminated.

Nodes are being updated one node pool at a time
Nodes are rolled out in a serialised manner

Obviously, this approach takes more time than the previous example. The data shows it takes ~5 minutes per node.

Even though it is gentler, this does not mean we didn’t see any effect on the SLOs. They were still impacted, but to a lesser degree.

Kubernetes read latency SLIs are still impacted but with less disruption

One node at a time

The safest and lowest-impact way to perform the rebuild is probably to do it node by node. Even though AWS does not provide this capability out of the box, we could have found a way to achieve it. However, we decided not to go down this route: the strategies above taught us the average time per node, and at that rate this option would be impractical for clusters of our size.

Karpenter managed nodes

We also did some experiments using Karpenter. Karpenter comes with a lot of useful features, such as multiple-instance-type node pools. However, we found the concept of deprovisioning in Karpenter a bit too complicated for us to invest in right now. We would need to evaluate the impact of changing our whole approach to node management to achieve the rebuild using Karpenter, and we don’t find the cost-benefit worth pursuing at this point in time.

As we run more than 30 production Kubernetes clusters in SCHIP, we cannot afford to spend many hours on each cluster; it would take a full week to complete the rollout across all of them.

From the data collected, we decided to go with the serialised node groups option, as it seems to strike the best balance between impact and time. It’s not the fastest compared to the non-serialised strategy, but it’s still acceptable when we calculate the time per node for the biggest cluster we have (~200 nodes), which took around 6–8 hours during the real test.

We prefer to minimise impact on our customer’s side, even if it means sacrificing a bit more time to complete the process because we can also parallelise multiple clusters at the same time.

Impact on the SLOs

In the process of collecting baseline information for the cluster rebuild, we learned about the time it takes to rebuild, and which SLOs are impacted the most.

With this information, we can prioritise mitigating the impact on the SLOs based on importance and level of impact.

The example of the data we collected about the effect on SLOs

Once prioritisation was done, we analysed each of the effects to discover the root causes and how to mitigate them.

There were several causes, and the resolutions varied from easy fixes (a single pull request) to a complete redesign of some components.

Take, for example, the effect on ingress availability and latency. We found that the `minReplicas` setting of our ingress controller was too low. There was also insufficient configuration of the pod disruption budget, which is the Kubernetes way of defining how many pods should remain available during maintenance or disruption.

Raising `minReplicas` and configuring a sufficient pod disruption budget showed a great performance improvement during the rebuild.
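As an illustration (the name, labels and threshold are made up, not our production values), a pod disruption budget for an ingress controller can look like this:

```shell
kubectl apply -f - <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ingress-controller
spec:
  minAvailable: 80%        # keep most replicas up while nodes are drained
  selector:
    matchLabels:
      app: ingress-controller
EOF
```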

This applies to many of the SLOs we observed, including Kubernetes API read/write availability, as these are also served through specific ingress controller pods.

However, there were more complicated cases, such as the gap in metrics that we observed during the rebuild. This happened because we run kube-state-metrics as a single pod; while that pod was being rescheduled to a new node, we lost some minutes of metrics during the transition.

Fixing this required more time, as we needed to analyse how to run kube-state-metrics in horizontally sharded mode and how that impacts the overall architecture.
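For reference, kube-state-metrics supports horizontal sharding natively through command-line flags; a minimal sketch of a two-shard setup, where each shard exports a disjoint subset of the metrics (automated sharding on StatefulSets is also available via its `--pod` and `--pod-namespace` flags):

```shell
# Shard 0 of 2; a second instance runs with --shard=1.
kube-state-metrics --shard=0 --total-shards=2
```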

Once we finished the analysis and mitigated most of the effects, we gathered a final round of information by performing the rebuild on the production fleet to collect production data.

The improvement is obvious after all the mitigation has been applied

The results showed that the analysis and mitigation efforts have significantly improved our rebuild process.

Notably, after several attempts at rebuilding the fleet, we did not receive any complaints from our customers, in contrast to similar migrations we had conducted in the past. At the end of the experiment, we agreed that we were confident enough to rebuild the whole fleet during upgrades.

So now we have answered the question "How can we minimise the impacts on our customer side as we need to upgrade all of the worker nodes to the new versions?". Next, we had to address the other consideration: "How do we ensure that our customer workloads will not break after the cluster has been upgraded?"

Navigating the world of deprecated APIs is like stepping through a constantly evolving maze. While we had a strategy in place for the cluster rebuild, addressing deprecations required a slightly different approach. As we move forward in our journey, we’ll delve into how we tackled these changes head-on, ensuring our systems not only remained functional but also adhered to the best practices of the ever-evolving Kubernetes ecosystem.

If the above question piques your curiosity, join us for the next chapter of our upgrade story in Part II.
