Unpacking the Complexities of Kubernetes Upgrades: Beyond the One-Click Update Part II

In the first blog post, we went over the context of why we need a Kubernetes upgrade and what benefits it would bring to us. Then we covered the initial stages, as our primary concern was the fluidity of the cluster rebuild during the upgrade. Ensuring stability and minimal disruption is key, especially when dealing with intricate systems like Kubernetes. We’ve explored how we ensured a seamless transition and laid the foundation for a more resilient infrastructure. But as with any technological advancement, new challenges emerge alongside the solutions. And in Kubernetes, one such challenge is the deprecation of APIs.

In this blog post we will answer the question:

How do we ensure that our customer workloads will not break after the cluster has been upgraded?

After ensuring the cluster rebuild is safe, we are back to the original concern of ensuring that our customers’ workloads will not break following an upgrade.

This topic is mainly about Kubernetes API deprecations, as each release deprecates or removes a different set of APIs. If we want to perform in-place upgrades, we need to be 100% sure that none of our customers’ workloads are using APIs that will be removed in the next version.

Kubernetes 1.25 — Each new Kubernetes version has its own sets of removed and deprecated APIs

To ensure that there will be no objects that will be affected after the upgrade, we need visibility.

We envision something similar to what GKE’s insights provide:

An example of the insight dashboard captured from https://matduggan.com/gke-google-kubernetes-engine-review/

There are two types of information we need to have to be able to provide this kind of dashboard.

1. The objects that will be removed in the next Kubernetes version

This is the most important information that we need from all of our clusters: How many objects in each of them will break in the next version of Kubernetes? Having this visibility will allow us to discuss upcoming issues with our customers who own them, send them notifications, or help them migrate those objects.

Once we are sure that no objects in the cluster use APIs that will be removed, we will have the confidence to upgrade that cluster. Unfortunately, these details are not available out of the box from the Kubernetes API. Several tools in the open-source community can provide this information, such as kube-no-trouble (kubent), a command-line tool that you can point at your current Kubernetes clusters to generate the missing information in an on-screen report:

kubent can show you which objects will be deprecated in the future versions

In the end, we selected FairwindsOps’ Pluto, which is very similar to kubent but has a more up-to-date dataset for the latest Kubernetes versions.

Some additional work is required to convert this information into a convenient format, which for us means Prometheus metrics. As Pluto is also a command-line tool, we needed to develop a way to convert its output into a Prometheus metric.

We ended up with a metric that is called

schip_deprecated_objects

The metric contains labels such as the API group/version, the resource name, the resource namespace, and the versions in which the resource is deprecated and removed. With this, we can filter for the affected objects in each cluster and each namespace.

2. The usage of deprecated API calls

Even though we know which objects are affected, they are not the only thing that could break when we upgrade the version. We also need to verify that no client is using or targeting a specific API version that will be removed.

This is what GKE shows in the dashboard you saw above. These clients could be a Kubernetes operator or could be code that interacts directly with the Kubernetes API in the cluster.

If we do not take this into consideration, these clients could break after the version upgrade if they are not implemented in a forward-compatible way. Luckily, this information is available out of the box from the Kubernetes API.

As described in this KEP, Kubernetes adds the annotation “k8s.io/deprecated”: “true” to the audit events of requests that are made to deprecated APIs. As we already export our Kubernetes audit logs to Grafana Cloud’s Loki, we already have this information ready to be used.
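As an illustration of what can be extracted from those audit logs, the sketch below counts requests to deprecated APIs per client. It assumes the audit events are available as JSON lines and reads the standard audit event fields annotations, userAgent, and objectRef; the deprecated_api_calls helper is hypothetical, and in our setup the equivalent filtering happens in queries against Loki rather than in a script.

```python
import json
from collections import Counter

DEPRECATED_ANNOTATION = "k8s.io/deprecated"


def deprecated_api_calls(audit_lines):
    """Count requests to deprecated APIs per (user agent, API, resource).

    Each input line is expected to be one JSON-encoded audit event.
    """
    counts = Counter()
    for line in audit_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between events
        event = json.loads(line)
        annotations = event.get("annotations") or {}
        if annotations.get(DEPRECATED_ANNOTATION) != "true":
            continue  # request did not target a deprecated API
        ref = event.get("objectRef") or {}
        group = ref.get("apiGroup", "")
        version = ref.get("apiVersion", "")
        # Core-group resources have no apiGroup, so only the version is shown.
        api = f"{group}/{version}" if group else version
        key = (event.get("userAgent", "unknown"), api, ref.get("resource", ""))
        counts[key] += 1
    return counts
```

Grouping by user agent is a design choice for this sketch: it points at the operator or program that needs updating, which is usually the first question when a deprecated call shows up.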

After we have all the information we need, we are able to draft a dashboard that would answer a really simple question, “Can we perform an in-place upgrade of this cluster?”

The dashboard for a specific cluster and version that shows 0 objects and 0 API calls to removed APIs means we can go ahead and upgrade that cluster.
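The go/no-go rule behind the dashboard can be sketched as a small function. This is only an illustration of the decision, under the assumption that each affected object and each observed client call carries the version in which its API is removed; parse_minor and can_upgrade_in_place are hypothetical names, not part of our tooling.

```python
def parse_minor(version: str) -> tuple:
    """Parse 'v1.25.0' or '1.25' into a comparable (major, minor) tuple."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def can_upgrade_in_place(target, object_removals, call_removals):
    """Return True when nothing in the cluster depends on an API that is
    removed at or before the target version.

    object_removals / call_removals are lists of 'removed in' versions,
    one entry per affected object or per observed client call.
    """
    cutoff = parse_minor(target)
    blocking = [
        version
        for version in list(object_removals) + list(call_removals)
        if version and parse_minor(version) <= cutoff
    ]
    return not blocking
```

For example, can_upgrade_in_place("1.24", ["v1.25.0"], []) allows the upgrade because the removal only takes effect one version later, while the same object blocks an upgrade to 1.25.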

The dashboard should look like this for us to be able to upgrade

The dashboard shows zero objects will break

However, if the dashboard looks like the one below, we need to take care of these objects and clients before we can upgrade the cluster. This could mean fixing our own components, or it could mean reaching out to our customers if the affected objects belong to them.

This dashboard shows which objects and clients in which namespace will break after the upgrade

With the information provided by the dashboard, we are able to analyze, develop, and execute an effective upgrade plan.

Moreover, we can also share the dashboard with our customers, as Grafana dashboards are one of the services SCHIP provides. They can use it to gain visibility of their migration process and to identify the objects that must be migrated to the newer version.

The upgrade

We invested a lot of time and effort in making sure the rollout has minimal impact, and that no objects or API calls will break after the upgrade. However, we were still not super confident, basically because we had never done an in-place upgrade before. After reading the very interesting blog post about Reddit’s broken Kubernetes upgrade story, I had some fear deep down too.

As sane people would do, we started testing our process with a lower-risk cluster that does not host many customers.

The first time we execute an in-place upgrade

Everything went very well, without any unexpected errors or failures. This gave us a bit more confidence in the upgrade process, but we knew we would not be fully confident until we had done it on a more critical cluster.

We rolled this out slowly in waves, going from lower-risk to higher-risk clusters. Each time we finished a wave, we gained more confidence. Finally, we were able to finish the upgrades for all of the targeted clusters without causing any incidents.
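The wave-based rollout can be sketched as follows. This is a simplified model, not our actual automation: the numeric risk scores, the cluster names in the usage note, and the upgrade callback are hypothetical, and in reality each wave also included a period of observation before moving on.

```python
from itertools import groupby


def plan_waves(clusters):
    """Group cluster names into waves ordered from lowest to highest risk.

    `clusters` maps a cluster name to a numeric risk score (hypothetical).
    """
    ordered = sorted(clusters.items(), key=lambda kv: (kv[1], kv[0]))
    return [
        [name for name, _ in group]
        for _, group in groupby(ordered, key=lambda kv: kv[1])
    ]


def run_waves(waves, upgrade):
    """Upgrade wave by wave; stop before the next wave if any upgrade fails.

    `upgrade` is a caller-supplied function returning True on success.
    Returns (upgraded, failed) lists of cluster names.
    """
    upgraded, failed = [], []
    for wave in waves:
        for name in wave:
            (upgraded if upgrade(name) else failed).append(name)
        if failed:
            break  # never promote a failure to higher-risk clusters
    return upgraded, failed
```

With plan_waves({"dev": 1, "staging": 2, "prod-a": 3, "prod-b": 3}), the two highest-risk clusters end up together in the last wave and are only touched once every earlier wave has succeeded.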

What do we win from achieving in-place upgrades?

We managed to save tons of time for our engineers that would have normally been spent on coordinating the upgrades. We used to devote an entire quarter (three months!) to upgrading all of our customers to a newer version. Now this time can instead be spent improving other parts of the system and providing more features to our customers.

We also gain more confidence and become more comfortable performing the full cluster rebuild as we minimise its impact.

Our customers benefit from not having to expend effort during the migration process as it becomes transparent for them — the version upgrade is now not so different from normal cluster maintenance. Customers also have more visibility of deprecated objects so they can take care of them earlier whenever they have the bandwidth to do so.

Is in-place upgrade a silver bullet after all?

In-place upgrades have proven to be very useful, and are now the de facto way of upgrading our clusters. However, we will continue to invest in improving the blue-green upgrade mechanism, because that approach allows us to rebalance our clusters when they become too big. Oversized Kubernetes clusters create a lot of problems in terms of resource management, networking, performance and more.

Rebalancing the clusters reduces our risk of hitting several AWS or Kubernetes limits. Furthermore, when planning an in-place upgrade, we can also use the blue-green upgrade for a subset of customers that we think carry more risk than others.

We actually did this during the first upgrade by moving some of the critical customers early with a blue-green upgrade before performing an in-place upgrade for the rest of the customers. This makes us a lot more confident as the risk is significantly lower.

Future work

Even with a successful implementation of the first version of the in-place upgrades, we acknowledge that there is room for improvement. What we did was just a step towards making the Kubernetes version upgrade seamless for us and our customers.

These are the few areas we have identified for future enhancements:

Compatibility complexity

Each version has its own complexity. Handling version upgrades where many APIs have been deprecated could be very problematic.

For example, Kubernetes version 1.25, where PodSecurityPolicy is removed from the API, caused us a lot of headaches in ensuring that the upgrade from version 1.24 is as seamless as possible. We need to make sure our automation can handle this kind of removal in a non-disruptive way, as it breaks some of our Helm charts that contain PodSecurityPolicy definitions.

We have learnt that the best way is to stop using deprecated objects as soon as possible and migrate them early, whenever the replacement is available. This prevents us from having to deal with an API removal, which will usually cause, if not downtime, a lot of complexity.

Rebalance the number of clusters

Despite having the ability to perform in-place upgrades on every cluster we have, we still need to keep provisioning clusters with new Kubernetes versions once they are available. This creates a problem: as no cluster is ever retired, the number of Kubernetes clusters we operate will keep growing without bound. We are still working on creating a policy and a strategy to keep the number of Kubernetes clusters we run in production as efficient and cost-effective as possible.

This emphasises the need to keep improving our blue-green upgrade mechanism to be as automated and transparent as possible, allowing us to rebalance our customers’ workloads among clusters easily.

Generalise the release process

Right now the upgrade process is still somewhat muddy, as we are running behind schedule for the current release and are still trying to catch up. Even though the investment we made gives us all the data and visibility in real time, the actual upgrade schedule and upgrade plan are still put together manually.

Once we finish all the migration we planned, we will have more time to stop and do some retrospective on the process. This will allow us to generalise the process in terms of:

When should we start the upgrade for a given version?
How and when should we contact our customers?
What is the most effective communication channel?

This is just an initial list of questions that we need to answer, and more will come up along the way. But as with everything in life, we need to prioritise and focus on the most important things first, and revisit what’s left when we can do so.

Cloud limitation

We learned about cloud limitations the hard way when one of our upgrades failed. We need to remember the fact that an upgrade is also basically a full cluster rebuild, which means we need to refresh every instance in the cluster.

But when using cloud computing, infinite capacity means that there are no limitations, right? Sadly, this is not true. Even cloud providers have limitations, and in our case we have encountered several events where AWS ran out of certain EC2 resources for some instance types. This happened to us in the Frankfurt region (eu-central-1).

To avoid running out of resources, we need to make sure that our infrastructure is flexible enough to handle the starvation of the instance types. It’s also a good idea to prepare capacity reservations ahead if you plan to perform a cluster rebuild at scale.
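One way to picture that flexibility is a fallback across equivalent instance types, sketched below. This is an illustration only: the InsufficientCapacity exception, the launch callback, and the instance type names in the usage note are assumptions, and in practice this logic would normally live in the node group or autoscaling configuration rather than in a script.

```python
class InsufficientCapacity(Exception):
    """Raised by the hypothetical launcher when a type has no capacity."""


def launch_with_fallback(instance_types, launch):
    """Try each instance type in preference order until one launches.

    `launch` is a caller-supplied function that raises InsufficientCapacity
    when the provider cannot supply that type. Returns the type that worked.
    """
    exhausted = []
    for instance_type in instance_types:
        try:
            launch(instance_type)
            return instance_type
        except InsufficientCapacity:
            exhausted.append(instance_type)  # remember it and try the next one
    raise InsufficientCapacity(f"no capacity for any of: {', '.join(exhausted)}")
```

Calling launch_with_fallback(["m5.xlarge", "m5a.xlarge"], launch) would fall through to the second type when the first is starved, and only fails once every listed type has been exhausted.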

Rollback behaviour

This is somewhat specific to our tooling where we use CloudFormation and CDK to provision our infrastructure. Our design keeps all of the EKS-related components under a single CloudFormation stack including the EKS (control plane) and the node pools.

If you remember from the beginning, an EKS upgrade is an irreversible process: once you upgrade the control plane, you cannot roll back. Ever.

This is a problem for us because, in one incident, the EKS control plane upgraded successfully but we then ran into problems with the node pool upgrades. The cause could be anything from running the wrong script on our side to things we have less control over, like instance-type starvation.

The real problem, however, came from the way CloudFormation works. When a stack update does not finish successfully, CloudFormation rolls back all the changes, and as we have everything under the same stack, the rollback also tries to revert the EKS version, which cannot be done. This left us with a CloudFormation stack stuck in ROLLBACK_FAILED, which required manual intervention to resolve.

From this specific event, we learned that we should not couple the upgrades of multiple components in a single CloudFormation stack to ensure that we can roll back or roll forward safely without being blocked by this limitation.
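The decoupled flow we are aiming for can be sketched as two independent steps. This is a conceptual model rather than CloudFormation or CDK code: the deploy callback, the stack names, and the returned status strings are hypothetical.

```python
def upgrade_decoupled(deploy, target_version):
    """Upgrade the control plane and node pools as two separate steps.

    `deploy(stack, version)` is a hypothetical function that returns True
    on success and rolls back only the named stack on failure.
    """
    if not deploy("control-plane", target_version):
        return "control-plane-failed"
    # The control plane is now on the new version and stays there; a
    # node pool failure no longer tries to revert it.
    if not deploy("node-pools", target_version):
        return "node-pools-failed-retry-forward"
    return "done"
```

Because each step owns its own rollback, a failed node pool update can be retried or rolled forward on its own while the already-upgraded control plane is left untouched.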

Final notes

Lastly, I want to give credit to my team, CPR, for putting in the great work to make this happen, especially Christian Polanco, Sebastian Caldarola, and Oscar Alejandro Ferrer, who led this workstream from idea to implementation.