Scaling policy enforcement: Lessons from OPA's memory hurdles and Kyverno migration

Running multi-tenant Kubernetes clusters requires robust governance and policy enforcement to ensure security, compliance, and consistent resource usage for all tenants. 

Open Policy Agent (OPA) was the backbone of policy enforcement on SCHIP, our Kubernetes platform, for a long time, even as we began migrating to Kyverno.

As we deployed more clusters and onboarded more tenants, it became nearly impossible to write policies without factoring in the state of other objects in the cluster. OPA offers a powerful feature to sync external data, allowing policies to leverage additional context from various Kubernetes resources. However, turning on this feature also introduces extra resource overhead, particularly in terms of memory consumption.

In this article, I’d like to share key considerations when enabling OPA’s data sync capabilities, how they can impact memory usage, and why you need to balance the benefits against the resource costs. While the goal here is not to discourage you from using advanced OPA features, it’s crucial to be aware of their implications.

Finally, I’ll also share lessons learned from our transition to Kyverno, including how we prioritised the migration of rules based on their resource impact. This article should help you make more informed decisions about your policy management and potential migration paths.

A simple OPA policy

A simple OPA policy can work solely with the context provided by the object being validated. The example policy below checks that each rule in the Ingress object under review defines a host field.

package policy.ingress_without_host

violation[{"msg": msg}] {
  ingress := input.review.object.spec.rules[_]
  not ingress.host
  msg := "Invalid ingress. Please add a host to the ingress"
}
Rego

However, real-world policies often require access to other objects in the cluster. For example, verifying that a label is unique across all pods and namespaces is impossible unless the policy has visibility of all these resources.

Using OPA Sync for complex policies

To address this challenge, Open Policy Agent (OPA) allows Kubernetes objects to be synchronised into its in-memory cache so that ConstraintTemplates can access them. Gatekeeper provides a way to enable this via its sync configuration, as shown below:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
spec:
  sync:
    syncOnly:
      - group: ""
        version: "v1"
        kind: "Namespace"
      - group: ""
        version: "v1"
        kind: "Pod"
YAML
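Newer Gatekeeper releases also offer a SyncSet resource as a more granular alternative to the single cluster-wide Config. A minimal sketch, assuming the syncset.gatekeeper.sh/v1alpha1 API (the resource name is illustrative):

apiVersion: syncset.gatekeeper.sh/v1alpha1
kind: SyncSet
metadata:
  name: sync-namespaces-and-pods  # illustrative name
spec:
  gvks:
    - group: ""
      version: "v1"
      kind: "Namespace"
    - group: ""
      version: "v1"
      kind: "Pod"
YAML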

Whichever mechanism you use, once syncing is enabled, policies can reference the replicated data through data.inventory. For example, the following Rego expression looks up the RoleBindings synced for a namespace ns:

data.inventory.namespace[ns][_]["RoleBinding"]
Rego
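To make the consumption side concrete, the sketch below is modelled on the k8suniqueingresshost template from the Gatekeeper policy library rather than one of our own rules. It rejects an Ingress whose host is already claimed by another Ingress found in the synced inventory, and it assumes Ingress objects (networking.k8s.io/v1) are included in the sync configuration:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8suniqueingresshost
spec:
  crd:
    spec:
      names:
        kind: K8sUniqueIngressHost
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8suniqueingresshost

        # The synced object is "identical" when it is the very Ingress under review
        identical(obj, review) {
          obj.metadata.namespace == review.object.metadata.namespace
          obj.metadata.name == review.object.metadata.name
        }

        violation[{"msg": msg}] {
          host := input.review.object.spec.rules[_].host
          # Iterate over every synced Ingress in every namespace
          other := data.inventory.namespace[_][_]["Ingress"][_]
          not identical(other, input.review)
          other.spec.rules[_].host == host
          msg := sprintf("Ingress host <%v> conflicts with an existing Ingress", [host])
        }
YAML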

This feature is highly useful, and it worked well for us until we encountered performance issues.

Performance Challenges with High Pod Fluctuations

Recently, we noticed that OPA started experiencing performance degradation and sporadic OOMKills across multiple clusters. Although our VerticalPodAutoscaler (VPA) helped adjust resource allocation dynamically, the alerts became noisy and disruptive.
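For context, a VerticalPodAutoscaler that automatically adjusts the Gatekeeper controller’s resource requests might look like the sketch below; the Deployment name, namespace and memory bounds are illustrative rather than our exact settings.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: gatekeeper-controller
  namespace: gatekeeper-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: gatekeeper-controller-manager  # illustrative target
  updatePolicy:
    updateMode: Auto  # VPA evicts pods and recreates them with updated requests
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 512Mi
        maxAllowed:
          memory: 8Gi
YAML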

Investigating the Memory Usage Spikes

Upon further analysis, we found that clusters with high pod fluctuation — where the number of pods rapidly increases and decreases — were the primary culprits. This is common in:

  • Development clusters with frequent deployments, especially full cluster deployments
  • Clusters running a large number of CronJobs

Memory Usage Correlation

We observed significant memory usage fluctuations in Gatekeeper-controller pods, often by several gigabytes. The root cause? We were syncing pod data into Gatekeeper’s inventory, causing excessive memory consumption when pod counts surged.

Memory usage of the gatekeeper-controller pods fluctuated by several gigabytes

As you can see from the graph, memory usage fluctuates by several gigabytes as the number of pods changes.

The number of pods doubled in a short time during a full cluster deployment

In our case, this was because we were syncing Pod information into the inventory:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
spec:
  sync:
    syncOnly:
      - group: ""
        version: "v1"
        kind: "Namespace"
      - group: "networking.k8s.io"
        version: "v1"
        kind: "Ingress"
      - group: ""
        version: "v1"
        kind: "Pod"
      - group: "rbac.authorization.k8s.io"
        version: "v1"
        kind: "RoleBinding"
YAML

Since we were already migrating to Kyverno, as mentioned in our previous article Why Did We Transition from Gatekeeper to Kyverno for Kubernetes Policy Management?, we decided to prioritise migration based on impact.

Optimising Migration for Resource Efficiency

Given our constraints, we decided to prioritise migrating policies that use data.inventory, particularly those dealing with high-volume objects like pods.

Migration Results

After migrating policies that use data.inventory of pods to Kyverno, we managed to remove pods from Gatekeeper’s sync configuration. This reduced memory usage from 8GB to 2.7GB — with just a single configuration change in one cluster.

This is significant, considering that we operate 30+ clusters, each running 3+ controller pods, making this optimisation highly impactful.

We removed Pod from the sync configuration
Memory usage dropped from 8 GB to 2.7 GB
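Concretely, the change amounted to dropping the Pod entry from the sync configuration shown earlier, leaving roughly:

apiVersion: config.gatekeeper.sh/v1alpha1
kind: Config
metadata:
  name: config
spec:
  sync:
    syncOnly:
      - group: ""
        version: "v1"
        kind: "Namespace"
      - group: "networking.k8s.io"
        version: "v1"
        kind: "Ingress"
      - group: "rbac.authorization.k8s.io"
        version: "v1"
        kind: "RoleBinding"
YAML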

Impact on Kyverno

Kyverno does not rely on a pre-synced inventory; instead, it fetches data on demand through Kubernetes API calls. Here’s an example of how Kyverno retrieves pod information:

context:
  - name: pods-access
    apiCall:
      urlPath: "/api/v1/namespaces/{{request.object.metadata.name}}/pods"
      jmesPath: "items[? !starts_with(metadata.name, 'system')].metadata.name"
YAML
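To show where such a context block lives, here is a minimal sketch of a complete ClusterPolicy built around an apiCall; the policy name, the rule and the limit of 100 Pods are purely illustrative and not one of our migrated rules:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: limit-pods-per-namespace  # hypothetical policy
spec:
  validationFailureAction: Audit
  background: false  # request.* variables are only available at admission time
  rules:
    - name: deny-when-namespace-is-full
      match:
        any:
          - resources:
              kinds:
                - Pod
      context:
        # Count the Pods already present in the target namespace via a live API call
        - name: podcount
          apiCall:
            urlPath: "/api/v1/namespaces/{{request.namespace}}/pods"
            jmesPath: "items | length(@)"
      validate:
        message: "Namespace {{request.namespace}} already has {{podcount}} Pods."
        deny:
          conditions:
            any:
              - key: "{{ podcount }}"
                operator: GreaterThanOrEquals
                value: 100
YAML

Note that a rule like this issues one API call per Pod admission, which is exactly the trade-off discussed below.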

Memory vs. API Load Trade-Off

This approach reduces memory consumption but increases the load on the Kubernetes API server. To mitigate this, consider:

1. Monitoring API Server Load — Ensure that the additional requests don’t overload the API server.

2. Using specific pre-conditions — Reduce unnecessary evaluations:

preconditions:
  - key: "{{ request.operation }}"
    operator: Equals
    value: "DELETE"
YAML

Final Thoughts

This article does not aim to criticise Gatekeeper’s inventory sync feature — which remains extremely useful. However, it’s important to exercise caution when enabling inventory sync, as excessive memory usage could impact Gatekeeper’s stability — potentially preventing new deployments and pod creations.

This blog describes a strategy for transitioning from OPA to Kyverno by prioritising policies that rely on high-impact inventory rules — such as those referencing high-volume objects like pods.

By applying this approach, we achieved a substantial reduction in memory usage, improved stability and reduced operational noise, without compromising policy enforcement.
