In the previous episode of this series, I shared the story about the stream of incidents related to certificates that we faced and how we improved our system by learning from the data.
In this episode, I’m telling you the story of the night of 24 February, when an urgent page woke me up while I was on-call. I found myself in the middle of what could have been the most devastating incident our team had ever faced.
Extensibility
To understand why it happened, I need to provide some context. The most important keyword here is “Extensibility.” Our platform, SCHIP, is a multi-tenant Kubernetes platform with over 30 active clusters hosting customers with different application needs.
As a platform team, we need to maintain a certain level of homogeneity among these clusters. In other words, the interface where our customers can interact with the clusters is somewhat restricted, mainly at the namespace level. Failing to keep them homogeneous would exponentially increase the maintenance effort, making it unsustainable for our team.
SCHIP originally served only one use case, “Stateless Microservices.” However, as we onboarded more customers, the use cases expanded beyond simple stateless applications.
There are teams that we call platform teams, which provide services to other customers on top of our clusters. They generally need to run more than just pods in a namespace.
We’re figuring out a sustainable way to allow these platform teams to do things like “I need this CRD to do X” without interfering with other customers or increasing support toil for our team. In the meantime, we came up with the idea of implementing a mechanism called Extensibility.
It uses the GitOps approach combined with ArgoCD’s “ApplicationSet” feature.
This is how it works:
1. Extensibility repository
A Git repository is created for each platform team to define extra privileges, applications, and other extra objects.
This allows our team to validate the requested extra privileges via pull requests opened by the platform teams and have everything defined as code.
2. ApplicationSet with cluster mapping
ArgoCD’s ApplicationSet is used to define the “Cluster/Repository” mapping. We define which repositories will be installed into which clusters/set of clusters using a matrix generator with a combination of cluster labels and GitHub repository labels.
There are more details about the cluster generator and how we accomplish it, but I’d prefer to discuss them in a possible future blog post.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: {platform-team}-extensibility
spec:
  generators:
    - matrix:
        generators:
          - clusters:
              selector:
                matchExpressions:
                  - key: account
                    operator: In
                    values:
                      - a
                      - b
          - scmProvider:
              cloneProtocol: https
              github:
                organization: <redacted>
                api: <redacted>
                allBranches: true
                tokenRef:
                  secretName: <redacted>
                  key: <redacted>
              filters:
                - labelMatch: ^extensions$
                  branchMatch: "(master|main)"
                  pathsExist: [deploy]
  template:
    metadata:
      name: "{{name}}-{{repository}}"
    spec:
      project: "extensibility"
      source:
        repoURL: "{{url}}"
        targetRevision: "{{branch}}"
        path: "deploy"
      destination:
        server: '{{server}}'
      syncPolicy:
        automated:
          selfHeal: true
          prune: true
3. ArgoCD for extensibility in each cluster
The ApplicationSets are defined in a single cluster, which we call the tooling cluster, where the ApplicationSet controller runs. We also install an ArgoCD application controller into each cluster, responsible exclusively for managing the objects defined in the repository. This allows us to limit which objects this ArgoCD can manage and gives the platform teams a separate RBAC to access this ArgoCD for their visibility.
The incident
Now that we understand what Extensibility is, I can tell you what happened during this incident.
In the middle of the night of 24 February, I got a page from our alerting system for an “ArgoCD out-of-sync” alert. At first, I didn’t think this was a critical alert as there were problems like ArgoCD performance that could make the sync time longer than our alert threshold.
Looking closer, I found that the alert came from our “tooling” cluster and that there were applications in the Deleting state.
Ok, some apps were being deleted, no big deal, but when I looked at the application’s name, I got goosebumps.
The application being deleted was the mother of all applications, managing all the ApplicationSets for all the platform teams.
What does this mean? Worst case scenario, this could wipe out every privilege and object used by the platform teams. Just thinking about it, my hands started to shake!
Now fully awake, I declared an incident to start evaluating the impact and my team came to help.
The impact
The ApplicationSet generates Applications inside the cluster it’s running in, and these manage the applications in the destination clusters. For example, inside the tooling cluster, an Application named {cluster-a}-extensibility manages the Applications in the destination cluster, {cluster-a}.
When the issue happened, we checked the tooling cluster and saw the generated applications for platform teams were being removed, but not entirely.
We caught this in the middle of the removal, so we decided to put the fire out as soon as possible. We scaled down the ArgoCD Application Controller, as it’s the component responsible for propagating the deletion to the destination clusters.
If ArgoCD was not running, the deletion would stop propagating.
kubectl get applications.argoproj.io -o json | jq -r '.items[] | select(.metadata.deletionTimestamp != null) | .metadata.name'
Using this query, we identified all the Applications that were pending deletion, but the better question was: what happened to the ones that were already deleted?
The extensibility application inside the destination cluster responsible for creating all the resources, including privileges and CRDs, was already deleted. This scared me. However, somehow, the objects managed by the extensibility applications remained.
Why didn’t it delete everything?
We could have faced a more devastating situation, but luckily, we were saved by this condition. The objects created by the extensibility applications did not have the finalizers set.
According to ArgoCD’s documentation:
To have ArgoCD cascade the deletions to all the resources, the finalizers should be set on the resources (including ArgoCD’s sub-applications).
Without finalizers, the managed objects were just raw Kubernetes manifests, including NetworkPolicies, ClusterRoles, and ArgoCD Applications. They weren’t removed but were left dangling without their parent ArgoCD Applications.
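The role of the finalizer can be illustrated with a small check. This is a minimal Python sketch under assumptions, not ArgoCD’s implementation: it takes Application manifests as plain dictionaries and reports whether deleting each one would cascade to its resources, judging only by the presence of the `resources-finalizer.argocd.argoproj.io` finalizer. The function name `cascades_on_delete` and the sample manifests are hypothetical.

```python
# Finalizer that asks ArgoCD to delete an Application's resources with it.
CASCADE_FINALIZER = "resources-finalizer.argocd.argoproj.io"


def cascades_on_delete(application: dict) -> bool:
    """Return True if the manifest carries ArgoCD's cascade finalizer."""
    finalizers = application.get("metadata", {}).get("finalizers") or []
    # Variants such as ".../background" share the same prefix.
    return any(f.startswith(CASCADE_FINALIZER) for f in finalizers)


# Hypothetical manifests, for illustration only.
with_finalizer = {
    "kind": "Application",
    "metadata": {"name": "parent", "finalizers": [CASCADE_FINALIZER]},
}
without_finalizer = {
    "kind": "Application",
    "metadata": {"name": "child"},
}

for app in (with_finalizer, without_finalizer):
    print(app["metadata"]["name"], cascades_on_delete(app))
```

In our case the objects looked like the second manifest, which is why they survived the deletion of their parents.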
The root cause
Digging into the ApplicationSet controller’s logs from the incident, we found something interesting.
A GitHub Enterprise maintenance on the morning of Sunday 24 February (the same day as the incident) caused several errors to be returned to the ApplicationSet controller.
There was a variety of errors during the maintenance, with status codes 500, 502, and 503, but the most interesting one we discovered was:
“error listing repos: error listing repositories for schip: invalid character ‘\u003c’ looking for the beginning of value”
It was followed by a log line, “generated 0 applications”, which indicated that the error caused the ApplicationSet controller to think it did not have any repositories to generate Applications from.
This is a problem because we are using a matrix generator:
- matrix:
    generators:
      - clusters:
          selector:
            matchExpressions:
              - key: account
                operator: In
                values:
                  - a
                  - b
      - scmProvider:
          cloneProtocol: https
          github:
            organization: <redacted>
            api: <redacted>
            allBranches: true
            tokenRef:
              secretName: <redacted>
              key: <redacted>
          filters:
            - labelMatch: ^extensions$
              branchMatch: "(master|main)"
              pathsExist: [deploy]
If the scmProvider generator finds no repositories matching the filter, the matrix has nothing to combine with the clusters, so it generates zero Applications for this ApplicationSet.
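To see why an empty repository list is so destructive, here is a minimal Python sketch that models the matrix as a Cartesian product of clusters and repositories. It is not the ApplicationSet controller’s code, only an illustration of the combination logic; the cluster and repository names and the helper `generate_app_names` are hypothetical.

```python
from itertools import product


def generate_app_names(clusters: list[str], repositories: list[str]) -> set[str]:
    """Combine every cluster with every repository, as a matrix would.

    Names follow the "{{name}}-{{repository}}" template of the ApplicationSet.
    """
    return {f"{cluster}-{repo}" for cluster, repo in product(clusters, repositories)}


# Hypothetical cluster and repository names.
clusters = ["cluster-a", "cluster-b"]

before = generate_app_names(clusters, ["team-x-extensions", "team-y-extensions"])
# During the maintenance, the repository listing effectively came back empty.
after = generate_app_names(clusters, [])

# Applications that existed before but are no longer generated
# become candidates for deletion.
to_delete = before - after

print(len(before), len(after), len(to_delete))
```

This prints `4 0 4`: with zero repositories on one side of the matrix, every previously generated Application falls out of the desired set, no matter how many clusters match.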
We also found a similar issue reported in https://github.com/argoproj/argo-cd/issues/14318.
It’s still inconclusive, but we believe the problem lies somewhere along the path where the ApplicationSet controller uses go-github to list the repositories. If, during maintenance, GitHub Enterprise returns not an error but a 200 status code with something like HTML content, it’s possible that the ApplicationSet controller misinterprets the response and generates 0 Applications. The ‘\u003c’ in the error above is the escaped form of the “<” character, which would be consistent with a body starting with an HTML tag.
How did we recover?
- We scaled down ApplicationSet and Application Controller to ensure there would be no deletion events triggered.
- Since there were Applications already marked for deletion, we removed the Applications and ApplicationSets in the tooling cluster by removing their finalizers. Then, when we brought the Application Controller back, it no longer tried to delete these objects.
- Since the Application Controller was scaled down during this clean-up, there was no impact.
- We scaled up the ApplicationSet and Application controller and triggered the generation of all ApplicationSets/Applications.
- Let them sync and we were back to normal.
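The clean-up of Applications stuck in deletion could be scripted along these lines. This is only a sketch under assumptions, since the exact commands are not part of this post: it reuses the `deletionTimestamp` filter from the jq query above and builds `kubectl patch` commands with a JSON merge patch that clears `metadata.finalizers`. The helper names and the sample data are hypothetical, and namespace flags are left out.

```python
import json
import shlex


def pending_deletion(app_list: dict) -> list[str]:
    """Names of Applications that already carry a deletionTimestamp.

    Mirrors the jq filter: select(.metadata.deletionTimestamp != null).
    """
    return [
        item["metadata"]["name"]
        for item in app_list.get("items", [])
        if item["metadata"].get("deletionTimestamp") is not None
    ]


def finalizer_removal_command(name: str) -> str:
    """Build a kubectl command that clears the finalizers of one Application."""
    patch = json.dumps({"metadata": {"finalizers": None}})
    return (
        "kubectl patch applications.argoproj.io "
        f"{shlex.quote(name)} --type merge -p {shlex.quote(patch)}"
    )


# Hypothetical output of `kubectl get applications.argoproj.io -o json`;
# the timestamp is a placeholder value.
sample = {
    "items": [
        {"metadata": {"name": "cluster-a-extensibility",
                      "deletionTimestamp": "2000-01-01T00:00:00Z"}},
        {"metadata": {"name": "cluster-b-extensibility"}},
    ]
}

for name in pending_deletion(sample):
    # Print rather than execute, so each command can be reviewed first.
    print(finalizer_removal_command(name))
```

Printing the commands instead of running them keeps a human in the loop, which matters when the controllers are scaled down and every step is being double-checked.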
The prevention
We cannot guarantee this won’t happen again since we don’t control the GitHub Enterprise maintenance and the bug is still open.
The safety net we applied here is to ensure that all of our ApplicationSets have this parameter set:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: {platform-team}-extensibility
spec:
  syncPolicy:
    preserveResourcesOnDeletion: true
Based on the official documentation:
With this configuration, even if the ApplicationSet is removed again, the resources deployed by its generated Applications won’t be deleted. So even without the finalizer, there won’t be any deletions in case the issue happens again.
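Making sure every ApplicationSet carries this setting is easy to automate. The following Python sketch shows one possible audit, assuming the manifests have already been parsed into dictionaries (for example from the Git repository that defines them). The function names and sample manifests are hypothetical.

```python
def preserves_resources(applicationset: dict) -> bool:
    """True only if spec.syncPolicy.preserveResourcesOnDeletion is exactly true."""
    sync_policy = applicationset.get("spec", {}).get("syncPolicy") or {}
    return sync_policy.get("preserveResourcesOnDeletion") is True


def missing_safety_net(applicationsets: list[dict]) -> list[str]:
    """Names of ApplicationSets that do not set the flag."""
    return [
        appset["metadata"]["name"]
        for appset in applicationsets
        if not preserves_resources(appset)
    ]


# Hypothetical manifests, already parsed into dictionaries.
manifests = [
    {"metadata": {"name": "team-x-extensibility"},
     "spec": {"syncPolicy": {"preserveResourcesOnDeletion": True}}},
    {"metadata": {"name": "team-y-extensibility"}, "spec": {}},
]

print(missing_safety_net(manifests))
```

Here the output is `['team-y-extensibility']`. A check like this could run in CI on the pull requests that add or change ApplicationSets.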
In addition, we’re looking at contributing to argoproj to implement a sanity check for the repository listing response to avoid this particular problem.
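To give an idea of what such a sanity check could look like, here is a minimal sketch. It is written in Python for brevity, it is not the ApplicationSet controller’s code nor a proposed patch, and the names `parse_repo_listing` and `InvalidListingError` are hypothetical. The point is only that a response which cannot be trusted should raise an error instead of being read as “no repositories”.

```python
import json


class InvalidListingError(Exception):
    """The repository listing response cannot be trusted."""


def parse_repo_listing(status: int, content_type: str, body: str) -> list[dict]:
    """Return the listed repositories, or raise instead of returning an empty list."""
    if status != 200:
        raise InvalidListingError(f"unexpected status {status}")
    if "json" not in content_type.lower():
        raise InvalidListingError(f"unexpected content type {content_type!r}")
    try:
        repos = json.loads(body)
    except json.JSONDecodeError as err:
        # An HTML maintenance page starts with "<", which is not valid JSON.
        raise InvalidListingError(f"body is not JSON: {err}") from err
    if not isinstance(repos, list):
        raise InvalidListingError("expected a JSON array of repositories")
    return repos


# A healthy response yields repositories.
print(parse_repo_listing(200, "application/json", '[{"name": "team-x-extensions"}]'))

# A maintenance page served with status 200 is rejected,
# not read as "no repositories".
try:
    parse_repo_listing(200, "text/html", "<html>Maintenance</html>")
except InvalidListingError as err:
    print("refusing to generate:", err)
```

With a guard of this kind, a failed listing would leave the previously generated Applications untouched until the next successful listing.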
Final note
I hope this blog post was fun to read and even better if it reminded any of you who have a similar setup to put some guardrails in before this wakes you up in the middle of the night like me.