
Why did we transition from Gatekeeper to Kyverno for Kubernetes Policy Management?

Navigating Challenges: Considering the transition from Gatekeeper to Kyverno in Kubernetes Policy Management

If you have been following my blogs, you will have seen a few articles about Gatekeeper as a tool for Kubernetes policy management. I have been using Gatekeeper for Kubernetes since around 2021 and have published a few posts about my experiences.

Our Runtime team at Adevinta has been using Gatekeeper as its main Kubernetes policy enforcement tool. It does this job well, especially in our multi-tenant Kubernetes environment, where policies are key to enforcing security and tenancy boundaries.

We have implemented crucial policies for our scenarios, for example (the first of these is sketched after the list):

  • Enforce certain allowed ingress hostname formats
  • Reject prohibited annotations used by our integrations
  • Forbid deletion of important resources such as namespaces
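
As an illustration of how such a rule is written in Gatekeeper, the sketch below shows roughly what the ingress hostname policy could look like. It is a minimal example, not our production policy: the template name, the allowedSuffixes parameter and the suffix value are placeholders.

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sallowedingresshosts
spec:
  crd:
    spec:
      names:
        kind: K8sAllowedIngressHosts
      validation:
        openAPIV3Schema:
          type: object
          properties:
            allowedSuffixes:          # hostname suffixes we accept
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sallowedingresshosts

        # Reject any Ingress rule whose host does not end with an allowed suffix
        violation[{"msg": msg}] {
          host := input.review.object.spec.rules[_].host
          not allowed_host(host)
          msg := sprintf("ingress host %v does not match an allowed format", [host])
        }

        allowed_host(host) {
          suffix := input.parameters.allowedSuffixes[_]
          endswith(host, suffix)
        }
---
# A Constraint instance binds the template to Ingress objects with concrete parameters
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedIngressHosts
metadata:
  name: allowed-ingress-hosts
spec:
  match:
    kinds:
      - apiGroups: ["networking.k8s.io"]
        kinds: ["Ingress"]
  parameters:
    allowedSuffixes:
      - ".example.com"
YAML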

We also have some more complex cases, such as rejecting a node assignment, which is explained in the post below. Our testing shows that Gatekeeper can get the job done:

Advanced Gatekeeper policies — rejecting a node assignment

Gatekeeper also provides a good testing framework, Gator, which lets you validate your policies declaratively.

If Gatekeeper is so great, why replace it?

What triggered our re-evaluation?

Lately, our team has encountered more and more scenarios where we need to inspect resources submitted to the cluster and modify them, either to:

  • Add more metadata to the resource
  • Apply configurations such as nodeSelector, affinity, and many more

This capability is known as a mutating webhook (MutatingWebhook): we intercept incoming resources and modify their specifications before they are persisted to the Kubernetes datastore (etcd).
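
Under the hood this relies on the standard Kubernetes admission mechanism: a MutatingWebhookConfiguration tells the API server which requests to send to a webhook service, and the service can return a JSON patch that is applied before the object is stored. The stripped-down example below only illustrates that wiring; the service name, namespace and path are placeholders, and the caBundle is omitted.

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: example-pod-mutator
webhooks:
  - name: pods.mutate.example.com
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore            # do not block admission if the webhook is down
    clientConfig:
      service:
        name: pod-mutator            # the Service backing the webhook
        namespace: webhook-system
        path: /mutate
      # caBundle: <base64-encoded CA> omitted for brevity
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
YAML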

As Gatekeeper was already our webhook implementation, the first thing we explored was its MutatingWebhook capability (learn more here).

Our first use case was fairly simple:

If a pod has annotation X, add toleration Y to the spec

Gatekeeper’s Mutation CRDs allow us to perform certain mutations, including the following (a minimal AssignMetadata example is sketched after the list):

  • AssignMetadata — defines changes to the metadata section of a resource
  • Assign — any change outside the metadata section
  • ModifySet — adds or removes entries from a list, such as the arguments to a container
  • AssignImage — defines changes to the components of an image string
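
For instance, the simplest of these, AssignMetadata, can only add a label or an annotation. A hypothetical example (not one of our policies) would look like this:

apiVersion: mutations.gatekeeper.sh/v1alpha1
kind: AssignMetadata
metadata:
  name: add-owner-label
spec:
  match:
    scope: Namespaced
    kinds:
      - apiGroups: ["*"]
        kinds: ["Pod"]
  location: "metadata.labels.owner"   # AssignMetadata can only target labels and annotations
  parameters:
    assign:
      value: "runtime-team"
YAML

Because AssignMetadata is limited to the metadata section, it cannot add a toleration to spec.tolerations.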

The Assign mutation sounds like it could be used for our sample use case.

After extensive testing, we were forced to admit that this feature could not handle even this basic use case, let alone more advanced scenarios.

This is what the policy would look like using Gatekeeper’s Assign to perform the mutation:

apiVersion: mutations.gatekeeper.sh/v1alpha1
kind: Assign
metadata:
 name: add-gpu-toleration
spec:
 applyTo:
 - groups: [""]
   kinds: ["Pod"]
   versions: ["v1"]
 match:
   scope: Namespaced
   kinds:
     - apiGroups: [""]
       kinds: ["Pod"]
   namespaces: ["your-namespace"]  # Optional: specify if you want to limit to certain namespaces
 location: "spec.tolerations"
 parameters:
   assignIf:
     in: ["true"]
     path: "spec.nodeSelector.node\\.x\\.io/gpu"
   assign:
     value:
       - key: "node.x.io/gpu"
         operator: "Equal"
         value: "true"
         effect: "NoSchedule"
YAML

Unfortunately, the Assign feature does not allow a mutation to depend on a field outside the location it modifies (the observed field). Accessing data from other parts of the spec is not possible either, let alone accessing external contextual data, which requires an even more complex setup: https://open-policy-agent.github.io/gatekeeper/website/docs/externaldata#external-data-for-gatekeeper-mutating-webhook

The solution

This functional gap forced us to find an alternative tool, and Kyverno looked like a good fit for what we wanted to achieve.

The policy we wanted could be implemented very easily in Kyverno:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
 name: enforce-gpu-toleration
spec:
 validationFailureAction: Enforce
 rules:
   - name: enforce-gpu-toleration
     match:
       any:
         - resources:
             kinds:
               - Pod
     preconditions:
       all:
       - key: "{{ request.object.spec.nodeSelector.\"node.x.io/gpu\" || '' }}"
         operator: Equals
         value: "true"
     mutate:
       patchesJson6902: |-
         - path: "/spec/tolerations/-"
           op: add
           value:
             key: "node.x.io/gpu"
             operator: "Exists"
             effect: "NoSchedule"
YAML

Similar to Gatekeeper’s Gator, Kyverno lets us test policies with the Kyverno CLI’s test command (https://kyverno.io/docs/kyverno-cli/reference/kyverno_test/). We can define test suites with mock objects like this:

apiVersion: cli.kyverno.io/v1alpha1
kind: Test
metadata:
 name: test-gpu-toleration
policies:
- ../policy.yaml
resources:
- resource.yaml
results:
- kind: Pod
  patchedResource: patchedResource1.yaml
  policy: enforce-gpu-toleration
  resources:
  - pod-1
  result: pass
  rule: enforce-gpu-toleration
- kind: Pod
  patchedResource: patchedResource2.yaml
  policy: enforce-gpu-toleration
  resources:
  - pod-2
  result: skip
  rule: enforce-gpu-toleration
- kind: Pod
  patchedResource: patchedResource3.yaml
  policy: enforce-gpu-toleration
  resources:
  - pod-3
  result: skip
  rule: enforce-gpu-toleration
YAML
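
The resource.yaml file holds the mock objects referenced by name in the results above. Our real fixtures are larger, but as a sketch, pod-1 could simply be a Pod that carries the GPU nodeSelector (so the rule applies and the mutated output should match patchedResource1.yaml), while pod-2 and pod-3 would omit it (so the rule is skipped):

apiVersion: v1
kind: Pod
metadata:
  name: pod-1
spec:
  nodeSelector:
    node.x.io/gpu: "true"   # matches the policy precondition
  containers:
    - name: app
      image: registry.example.com/app:latest   # placeholder image
YAML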

The objects can then be tested with a single CLI command:

kyverno-cli test /kyvernopolicies --detailed-results
Bash

We can improve the accuracy of our testing further using the Chainsaw framework (https://github.com/kyverno/chainsaw), which offers end-to-end Kyverno policy testing against local Kubernetes clusters such as kind.
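
A minimal Chainsaw test for this policy could look roughly like the one below; the file names are placeholders. Chainsaw applies the resources to a real cluster (for example one provisioned with kind) where Kyverno is installed, and then asserts on the mutated result:

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: gpu-toleration
spec:
  steps:
    - try:
        - apply:
            file: gpu-pod.yaml          # Pod carrying the GPU nodeSelector
        - assert:
            file: gpu-pod-assert.yaml   # expected Pod, including the injected toleration
YAML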

What to look out for?

So far, the experience of using Kyverno has been quite smooth. However, there are a few considerations that you should keep in mind, such as:

  • Latency: Kyverno runs as a Kubernetes admission webhook, and every rule you add has an impact on webhook latency. It’s strongly recommended to monitor the webhook latency metrics, such as the following (a sample alerting rule is sketched after this list):

apiserver_admission_webhook_admission_duration_seconds_bucket

apiserver_admission_webhook_fail_open_count

apiserver_admission_webhook_request_total

Poor Kyverno performance can add latency and block updates to important objects such as Pods and Deployments. It’s also highly recommended to monitor the resource consumption of Kyverno’s pods to make sure they are not OOMKilled or CPU throttled.

  • Enforcement mode: For each policy, you can choose whether it is Enforced or Audited. In audit mode, Kyverno does not reject requests; instead, it records the results in report resources inside the cluster (AdmissionReport and PolicyReport objects). These reports can pile up if they are not cleaned up properly, degrading etcd to the point where it serves stale reads.
    Here are the upstream issues referencing this problem
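
For the latency point above, an alerting rule along the following lines can warn when admission calls start slowing down. It is only a sketch: the p99 quantile, the 1-second threshold and the label values are our own choices, not recommendations from Kyverno.

groups:
  - name: admission-webhook-latency
    rules:
      - alert: AdmissionWebhookLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (name, le) (
              rate(apiserver_admission_webhook_admission_duration_seconds_bucket[5m])
            )
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 admission latency for webhook {{ $labels.name }} is above 1s"
YAML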

We decided to move to Kyverno

Following our tests, we made an internal decision to freeze the usage of Gatekeeper for Kubernetes policy management and fully transition to Kyverno.

Our decision was based on the following considerations:

  • Ease of Policy Writing and Maintenance: Kyverno uses YAML to define policies, which is simpler and more familiar to our team than Gatekeeper’s Rego language. This change is expected to enhance productivity and encourage broader team participation in policy creation and management.
  • Feature Completeness: Kyverno supports both ValidatingWebhook and MutatingWebhook use cases out of the box, unlike Gatekeeper, whose MutatingWebhook capabilities are limited. Kyverno also has feature parity with Gatekeeper on validation, so we can gradually migrate our existing policies to Kyverno.
  • Resource Efficiency: Our observations indicated that Kyverno is more resource-efficient than Gatekeeper, particularly as the number and complexity of the APIs we need to enforce grow. Gatekeeper has to sync all referenced cluster state into its in-memory inventory, which led to performance bottlenecks during our recent policy updates.

We are still gaining experience with Kyverno, and I will write about it again as we acquire more insights.
