
How To Reduce Your Grafana Cloud Costs With Adaptive Metrics

  • DevOps
4 min read
Discover the Grafana Cloud feature that helps you identify and optimise your metrics costs

Are you a Grafana Cloud user like us? Have you ever found yourself in a difficult situation regarding cost optimisation?

In this blog post, I share my experience with the Adaptive Metrics feature from Grafana and how it helped our team at Adevinta reduce our Grafana Cloud costs.

Adaptive Metrics

What is Adaptive Metrics?

Adaptive Metrics is a cardinality optimisation feature of Grafana Cloud. It’s part of the cost management hub and allows you to discover unused metrics in your Grafana Cloud stack and safely eliminate them using aggregations.

The key features I want to highlight here are:

Discover

This is, in my opinion, the most useful cost optimisation feature: Adaptive Metrics analyses how the metrics in your stack are actually used, based on:

  • Dashboards
  • Alerts
  • Queries

Discover provides “recommended aggregation rules” that combine all usage from the sources above, together with the number of series each rule is expected to save. This lets you estimate how many series you would eliminate by applying the rules.

The rules panel shows all the recommended aggregation rules for unused metrics

An example recommendation (visible when expanded) to count, sum and drop an unused label, which would reduce the unused series by 88%
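
To make this concrete, recommendations can be exported from the UI as JSON. A hypothetical recommendation for a counter metric might look like the sketch below (the metric name and labels are made up for illustration; the fields follow the Adaptive Metrics rule format):

```json
[
  {
    "metric": "http_requests_total",
    "drop_labels": ["pod", "instance"],
    "aggregations": ["sum:counter"]
  }
]
```

Read it as: no dashboard, alert or query distinguishes this metric by pod or instance, so those labels can be dropped and the remaining series summed, which is where the large series savings come from.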

Eliminate safely

Analysing the actual usage allows us to aggregate the metrics “safely”, including:

  • Unused metrics (nothing queries them at all)
  • Partially used metrics (only a subset of the series is used, either through aggregations or via only some labels)

Technically, this feature works by aggregating your metrics before the series are stored (an approach similar to a MutatingWebhook in Kubernetes, which modifies objects before they are persisted). For example, ten series of a request counter that differ only by pod collapse into a single summed series once the pod label is dropped.

After applying the recommendations, Grafana Cloud shows how many time series you have reduced. This figure can then be used to estimate the effect on your usage bill.

A panel in the overview dashboard shows the number of time series reduced by the currently applied rules

Our experience

We noticed an increase in our Grafana Cloud bill over recent months and decided to explore the Adaptive Metrics feature.

By focusing on the top recommended rules suggested by Adaptive Metrics, we reduced our ingested metrics significantly, by more than 33%.

(Note: we scoped this to only the top recommended rules because we federate a lot of metrics.)

Ingested metrics reduced by more than 33%

At our scale, this reduction in metric ingestion has had a very positive impact on our cloud costs, and our bills have been reduced considerably.

Our bills were reduced by more than 20% after applying the recommendations

How did we apply the rules?

We were careful to follow Infrastructure as Code best practices, and Grafana provides official Terraform modules to persist the rules as code.

For our deployment, we simply copied each recommendation from the UI and declared it in the Terraform code:

The Terraform module allows us to define the rules as code
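
As an illustration, here is a minimal sketch of what this can look like with the grafana-adaptive-metrics Terraform provider (the metric name and labels are hypothetical; check the provider documentation for the exact resource schema):

```hcl
terraform {
  required_providers {
    grafana-adaptive-metrics = {
      source = "grafana/grafana-adaptive-metrics"
    }
  }
}

# One resource per aggregation rule, copied from the UI recommendation.
resource "grafana-adaptive-metrics_rule" "http_requests_total" {
  metric       = "http_requests_total" # hypothetical metric
  drop_labels  = ["pod", "instance"]   # labels no query relies on
  aggregations = ["sum:counter"]       # keep only the summed series
}
```

Because the rules live in version control, every change goes through review, and rolling back a rule is an ordinary revert.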

Recommendations

Lastly, I want to share some insights from our experiences:

Everything should be in the code

Make sure to persist your dashboards, your alerts and any queries that your services perform in code.

This helps ensure that all usage is recorded and reflected in the analysis. You can also search the code for the impact of any aggregation you are considering before applying it.
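
For example, dashboards can be kept in code with the official Grafana Terraform provider, so every query they contain is searchable in the repository (a minimal sketch; the file path is a placeholder):

```hcl
# Assumes the official grafana provider is already configured.
resource "grafana_dashboard" "service_overview" {
  # The dashboard JSON, and therefore every query it contains,
  # lives in version control and shows up in code search.
  config_json = file("${path.module}/dashboards/service-overview.json")
}
```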

Consider performing aggregation locally

As the results above show, the Adaptive Metrics feature is very useful, but for me the killer feature is the usage analysis.

Applying the rules in Grafana Cloud is useful in itself, but if you don’t need those metrics, why send them over only to be aggregated on arrival? Using the analysis results, we can apply the same rules on our own metrics server, such as Prometheus, at either scrape or federation time.

Doing so creates two big benefits:

1. It reduces the volume of metrics stored in your local Prometheus, which results in a smaller database, a smaller memory footprint and a more stable instance.

2. It reduces AWS network costs, because we no longer send unused metrics through the NAT gateway and over the internet.

Applying Adaptive Metrics rules locally requires discretion, because in some cases you still need the high cardinality locally. Where that is the case, apply the rules only at federation time, just before sending the metrics over. Where you don’t need the metrics at all, you can apply the rules at scrape time, as sketched below.
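
A minimal sketch of the scrape-time variant, assuming plain Prometheus (the job and metric names are hypothetical): metrics that the usage analysis showed are never queried are dropped with metric_relabel_configs before they are ever stored or federated.

```yaml
scrape_configs:
  - job_name: my-service # hypothetical job
    static_configs:
      - targets: ["my-service:8080"]
    metric_relabel_configs:
      # Drop metrics the usage analysis showed are never queried,
      # before they are stored locally or federated to Grafana Cloud.
      - source_labels: [__name__]
        regex: "http_request_size_bytes(_bucket|_sum|_count)?"
        action: drop
```

For the federation-time variant, the usual pattern is to federate only aggregated recording rules instead of the raw series.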

Apply the rules with discretion

Think before applying a rule: once metrics are aggregated, you can no longer access the raw data, and you never know whether a metric that is unused today will become useful tomorrow. Apply the aggregations carefully and leave room for metrics that might still matter; where possible, capture their usage in dashboards or alerts so that it shows up in the Adaptive Metrics analysis.

Conclusion

Adaptive Metrics is a powerful tool that can significantly reduce your Grafana Cloud costs. By carefully analysing the recommendations and applying the rules with discretion, you can optimise your metrics ingestion and shrink your cloud bills.

Remember to persist your dashboards, alerts and queries in code to ensure that all usage is recorded and reflected in the analysis, and consider performing aggregation locally to reduce your costs further.

With careful planning and execution, you can leverage Adaptive Metrics to take control of your Grafana Cloud costs and optimise your observability stack.
