Trial by Fire: Tales from the SRE Frontlines — Ep1: Challenge the certificates

In the age of seamless digital experiences, where businesses operate around the clock and users expect unparalleled service, a group of SREs is navigating the unpredictable waters of system failures, bugs, and unexpected challenges. I recently published a tale of one such incident, “It’s Not Always DNS… Unless It Is”. The response was overwhelming, signaling a clear appetite for more stories like this.

Seeing this enthusiasm, and recognising that our journey has been filled with numerous learning moments, I was inspired to curate and share more of these stories. Thus, the series “Trial by Fire: Tales from the SRE Frontlines” was born. Ranging from minor glitches to significant system meltdowns like the one I previously shared, each tale is a testament to the challenges, ingenuity and resilience of Site Reliability Engineers.

I hope you’ll find these stories as enlightening and engaging as many have found our previous tales. Enjoy the series!

In this first episode, I am going to tell you the story of how we “challenged” the status quo of our managed certificate offering before being bitten by an unavoidable situation.

Our on-call lives were in danger

This story arises from our team’s sense of survival. Every week our on-call rotation was flooded with alerts, big and small, and a handful of incidents. We were having a kind of breakdown, and in a retrospective session, we decided to approach this problem analytically. We would analyse what annoyed us the most and find a way to fix those issues.

A statement to come together to improve overall on-callers’ lives

During the analysis exercise, certificates were ranked among the top annoyances.

Manually collected records of issues and incidents involving certificates

As we gathered data about how annoying the certificates were to us, we found many issues and incidents involving certificates in the previous three months. Every occurrence generated toil for the team, distracting us from working on value-added features for our users.

This pointed directly to the need to improve our certificate capabilities.

What does our platform provide to our customers in terms of certificates?

To recap quickly: at Adevinta, our team operates an internal runtime platform as a service called SCHIP. It is built atop Kubernetes but provides more than just a plain Kubernetes API.

SCHIP provides our customers with features like integrated logging, curated metrics, managed ingress and DNS. Among all these features, it also provides managed certificates for our customers’ ingresses.

From our customers’ point of view, it worked in a fairly simple way. They would annotate their ingresses like so:

metadata:
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt"

Then they would add this section to their ingresses’ definition:

spec:
  tls:
  - hosts:
    - <your-app-name>.<your-namespace>.schip.io
    secretName: <your-app-name>-tls

SCHIP would then automatically manage the certificates for any ingress with a domain name in the .schip.io zone.

In the background, we use the standard open-source tool cert-manager to manage the certificate’s lifecycle.

After making these changes, cert-manager will take care of:

Requesting a certificate for the customer’s ingress using the ACME DNS-01 challenge
Creating the new SSL certificate as a Kubernetes secret
Notifying the ingress controller about the new certificate

As you might have already seen in the configuration, we were providing this integration in a best-effort manner using Let’s Encrypt, a free, automated and open certificate authority (CA), run for the public’s benefit. It is a service provided by the Internet Security Research Group (ISRG).

Customers also had the option to bring their own certificates or delegate DNS challenges to their own DNS zones for a custom domain name. However, I will skip those details for the sake of this blog post.

Everything was working well, but there were issues with the setup.

Let’s Encrypt was providing a best-effort service

We stated this fact transparently for our customers

Let’s Encrypt was great but it came with the following service limitations:

50 certificates per registered domain per week
300 orders per account per 3 hours
5 duplicate certificates per week (renewal)

At our initial scale, this service served its purpose. Combined with the automation provided by the cert-manager operator, it could generate digital certificates for our customers’ services without many issues.

However, as we grew, it became more problematic. Even though our solution was stated as best-effort, our customers relied on it more than it was designed to support. Several support requests and incidents, raised when certificates turned invalid in the production environment, proved this. Our team acknowledged this fact and decided to provide a set of SLIs around certificate capabilities, in line with our observation of these problems.

We provide SLIs based on the certificate availability of our managed domains and external domains

The data

We wanted to find out how much of the Let’s Encrypt quota we were consuming, so we queried our cert-manager metrics.

In production, we were issuing roughly 20 challenges every 7 days that reached the valid status (some of these requests may have been renewals, but this should give us an estimate). We were also using the integration ourselves: every development cluster we deployed for internal testing consumed nine new domains. This means that if we created five clusters in a week, we would use 45 of the 50 weekly certificates ourselves, almost the entire quota.
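The arithmetic behind that estimate can be sketched in a few lines of Python. The three figures come from this post; the simplifying assumption (mine, for illustration) is that every production challenge and every dev-cluster domain counts against the same weekly per-domain limit.

```python
# Figures quoted in this post. Assumption: all of them count against
# the same "certificates per registered domain per week" limit.
WEEKLY_LIMIT = 50           # Let's Encrypt certificates per registered domain per week
PROD_PER_WEEK = 20          # rough number of valid production challenges per week
CERTS_PER_DEV_CLUSTER = 9   # new domains consumed by each development cluster


def weekly_usage(dev_clusters: int) -> int:
    """Estimated certificates requested in one week."""
    return PROD_PER_WEEK + dev_clusters * CERTS_PER_DEV_CLUSTER


def max_dev_clusters() -> int:
    """Largest number of new dev clusters that still fits within the weekly limit."""
    return (WEEKLY_LIMIT - PROD_PER_WEEK) // CERTS_PER_DEV_CLUSTER


for clusters in range(6):
    used = weekly_usage(clusters)
    state = "over the limit" if used > WEEKLY_LIMIT else "ok"
    print(f"{clusters} dev clusters: {used}/{WEEKLY_LIMIT} certificates ({state})")
```

Under these assumptions, the fourth new dev cluster in a week is already enough to push the total past the limit.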

Why did certificates turn invalid?

When an issue was reported, there was no doubt that the immediate cause was an expired certificate. Given the numbers above, the most likely reason it had not been renewed was the Let’s Encrypt rate limit.

The data confirmed that the rate-limit was the issue

We looked at the rate-limit errors reported by cert-manager and they matched the issues our customers faced. Certificate renewal challenges could not be completed because the limit was saturated, in part by the number of development clusters we had created in that period.

Basically, once the Let’s Encrypt limit was hit:

Any certificates that expired would fail to renew
Any onboarding application would not be issued a certificate
New dev clusters would also not be issued a certificate

The problem was the dev clusters, so why not just fix that?

Of course, changing the domain used in the dev clusters would relieve the pressure on the quota, but that was a super short-term solution. Living with this limitation, we were destined for doom in terms of scaling. At that time, we were in the process of onboarding new platform teams joining the company, potentially doubling the size of our current workloads. We knew that if we did not act on this foreseeable issue we would have a high risk of not being able to cope with the upcoming onboarding.

The solution

The root of all evil was the limit. If there were no limits, we would not have a problem, right? That’s why we decided to look for a new ACME certificate provider that offered a higher quota or, ideally, an unlimited one.

We looked at several ACME providers and came across this article on Medium, where the author faced similar problems and suggested an alternative that matched what we were looking for.

ZeroSSL

ZeroSSL is another ACME provider offering a digital certificate service that also integrates well with cert-manager.

With a paid plan, we would be able to create an unlimited number of 90-day certificates — which was the main service we were looking for.

So, we started to do some PoC testing with ZeroSSL and it worked exactly as we had hoped.

For the proof of concept, the steps required for us were:

Create the new ClusterIssuer with the credentials created from the ZeroSSL account
Change the ACME server URL for the ClusterIssuer to https://acme.zerossl.com/v2/DV90
Reference this new ClusterIssuer in the ingress object as:

cert-manager.io/cluster-issuer: zerossl

And that was it: the new certificate was issued by ZeroSSL, and cert-manager hooked in nicely, successfully generating a certificate.

The implementation and the rollout plan

Implementation was simple, as shown above. The challenge lay in the rollout plan, as our customers’ ingresses pointed directly to the LetsEncrypt ClusterIssuer, and some used delegated Issuers. We needed to ensure a smooth transition for current customers and understand what impact the transition might have.

Change applications’ existing certificates from Let’s Encrypt to ZeroSSL

After we edited the ClusterIssuer reference in the ingress, the certificate’s Ready condition changed to False while cert-manager tried to issue the certificate using the new issuer.

status:
  conditions:
  - lastTransitionTime: "2022-11-04T15:57:32Z"
    message: 'Fields on existing CertificateRequest resource not up to date: [spec.issuerRef]'
    reason: RequestChanged
    status: "False"
    type: Ready
  - lastTransitionTime: "2022-11-04T15:57:32Z"
    message: Issuing certificate as Secret was previously issued by ClusterIssuer.cert-manager.io/letsencrypt
    reason: IncorrectIssuer
    status: "True"
    type: Issuing

The new challenge would then normally reach the valid state:

 k get challenge
NAME                                          STATE   DOMAIN                                                AGE
ing-metrics-tls-84s42-1120992888-2070757738   valid   zerossl1.ingress-metrics.schip.dev.mpi-internal.com   114s

The certificate then becomes valid within a few minutes. If cert-manager fails to issue the new certificate, the service is not affected, because the secret that holds the original certificate is not touched.

The existing application renews the certificates from Let’s Encrypt to ZeroSSL

This is the same situation: changing the issuer triggers an immediate reissue, so the certificate is effectively renewed as soon as we change the issuer.

In the basic case, this looked fairly simple. We decided to publish the new ClusterIssuer under the name “managed” instead of “zerossl”. Having learned the interface lesson this time, we wanted to be able to change providers again in the future in a way that is transparent for our customers.

We updated our documentation and default settings so new customers could start with ZeroSSL right away. This allowed us to maintain the LetsEncrypt interface for current customers while we figured out how to transition them smoothly to the new provider. In fact, most of our effort went into making that transition smooth. There were so many details that I cannot cover them all here.

The result

Ultimately, our offering became more reliable. We no longer had incidents caused by expired certificates, and the list of expiring certificates was mostly empty. Previously, we had been alerted frequently about certificates close to expiring because renewal attempts kept failing against the Let’s Encrypt rate limit.

Can we all live happily now?

Well, you could guess at this point that if it had worked perfectly, I would not be asking you this trick question, right?

Don’t get me wrong, it was actually working well. We didn’t worry about it for a very long time, and we almost forgot that we used to fight with certificates weekly.

However, out of nowhere, some months later, one customer reported:

“Failed to create Order: 429 : 429 Too Many Requests”

What? We had paid for unlimited certificate quotas. This can’t be happening. There was something wrong here.

Actually, there is nothing in this world that does not have limitations, and that applies to ZeroSSL as well.

It turns out that they also have a hard limit on the number of requests sent to their server.

Please note there is a maximum of 200 requests per 5 minutes. This is a security measure we have set in the ZeroSSL firewall in order to ensure our services working stable. Without this limit, we had massive abuse from very few users which causes the service to have bad performance for all others who are using the service adequately.

Please make sure you write the scripts in a way to not make hundreds or thousands of requests at the same time.

There are also certain endpoints that return a 429 in case they are heavily overused.

At our scale, we were running 30 Kubernetes clusters with thousands of services with certificates, and all of our cert-managers shared the same ZeroSSL account. It’s no surprise that we exhausted this request limit.

Our first mitigation was to develop a Kubernetes operator that reconciles Order objects and, if they fail because of the rate limit, deletes them to trigger issuance again.
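The decision at the heart of such an operator can be sketched as a pure function. This is only an illustration in Python, not our operator's code: it assumes each Order is available as a plain dictionary, that a failed Order reports the state "errored", and that its reason text contains the HTTP status code, as in the customer's error message above. The namespaces and Order names are hypothetical.

```python
from typing import Iterable

RATE_LIMIT_MARKER = "429"  # assumption: the failure reason contains the HTTP status code


def is_rate_limited(order: dict) -> bool:
    """True when an Order failed and its reason mentions the rate limit."""
    status = order.get("status") or {}
    failed = status.get("state") == "errored"  # assumed state of a failed Order
    return failed and RATE_LIMIT_MARKER in (status.get("reason") or "")


def orders_to_delete(orders: Iterable[dict]) -> list[str]:
    """Return 'namespace/name' keys of Orders to delete so that issuance starts again."""
    return [
        f"{o['metadata']['namespace']}/{o['metadata']['name']}"
        for o in orders
        if is_rate_limited(o)
    ]


# Hypothetical Orders, shaped like trimmed-down Kubernetes objects.
sample_orders = [
    {"metadata": {"namespace": "team-a", "name": "app-tls-1"},
     "status": {"state": "errored",
                "reason": "Failed to create Order: 429 : 429 Too Many Requests"}},
    {"metadata": {"namespace": "team-b", "name": "api-tls-7"},
     "status": {"state": "valid"}},
    {"metadata": {"namespace": "team-c", "name": "web-tls-3"},
     "status": {"state": "errored", "reason": "some unrelated failure"}},
]

print(orders_to_delete(sample_orders))
```

A real operator would run this check inside its reconcile loop and call the Kubernetes API to delete the selected Orders. Keeping the selection logic free of API calls makes it easy to test on its own.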

This worked for a good while, but eventually, we were hit again with a 429 rate limit issue. However, this time, it was something different, and more serious.

The following month, another customer reported that their certificates were not being renewed or issued. When we investigated, we found a serious problem:

Most of the certificates in one of the clusters were not ready

We tried to trigger the renewal process manually, but it did nothing. This was when we found that the limit was hitting us not only at the certificate level, but at the issuer level too.

The error occurred in the upstream cert-manager code, at the API call that registers the ACME account. That code path does not retry when the API returns a 4XX error, and a 429 is one of them. This is the error message we got when the API call failed:

Status:
  Acme:
    Uri: https://acme.zerossl.com/account/xxxxxxxx
  Conditions:
    Last Transition Time: 2023-05-22T09:53:18Z
    Message: Failed to register ACME account: 429 : <html>
<head><title>429 Too Many Requests</title></head>
<body>
<center><h1>429 Too Many Requests</h1></center>
<hr><center>nginx</center>
</body>
</html>

cert-manager/issuers "msg"="failed to register an ACME account" "error"="429 : <html>\r\n<head><title>429 Too Many Requests</title></head>\r\n<body>\r\n<center><h1>429 Too Many Requests</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n"

Other people reported the same problem in this GitHub issue.

For us, it was worse because our customers also shared the same account for their custom certificates, and some customers had more than ten issuers. All of these contributed to the registration API limit.

Our engineers (Javier and Christian) opened an issue in the upstream cert-manager project to discuss this use case and proposed a change that adds retry logic to the ACME registration. After discussing it with the ZeroSSL team, we decided this was the most sensible approach. We could not wait for the upstream discussion to finish, because the incident was still active and our customer still didn’t have their certificates renewed.

We decided to fork the cert-manager project, apply the retry logic proposed in the upstream issue, and deploy the forked version. The result was satisfying. It might not be the cleanest approach, but retrying with jitter eventually lets the ACME registration complete successfully against ZeroSSL.
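To illustrate the idea of a jittered retry, here is a minimal Python sketch. It is not the patch applied to the fork (cert-manager is written in Go), and the exception type, function names and default delays are hypothetical. It uses capped exponential backoff with "full jitter", which is one common way to spread retries out when many clients share a rate limit.

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Stands in for an HTTP 429 response from the ACME server (hypothetical type)."""


def register_with_retry(
    register: Callable[[], str],
    max_attempts: int = 6,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> str:
    """Call register() until it succeeds, backing off with jitter on rate limits."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return register()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last rate-limit error
            # Full jitter: wait a random time up to the capped exponential backoff.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng.uniform(0, cap))
    raise AssertionError("unreachable")


# Usage with a fake registration call that is rate limited twice, then succeeds.
calls = {"n": 0}


def flaky_register() -> str:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise RateLimited("429 Too Many Requests")
    return "registered"


delays: list[float] = []  # record the waits instead of actually sleeping
result = register_with_retry(flaky_register, sleep=delays.append, rng=random.Random(0))
print(result, "after", calls["n"], "attempts")
```

The random wait matters more than the exact backoff curve: when many issuers retry at the same fixed interval, they hit the limit together again, whereas jitter spreads them over time.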

Apart from this, we also partitioned the ZeroSSL account proportionally to reduce the risk and spread the load on the request limits. The result of this improvement was impressive and we have not had any incidents regarding certificates since then.
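This post does not go into how the partitioning was done. One possible reading of "proportionally" is to balance clusters across accounts by the number of certificates they hold. The Python sketch below shows that idea with a greedy assignment; the cluster names, certificate counts and account names are made up for illustration.

```python
import heapq


def partition_clusters(cert_counts: dict[str, int], accounts: list[str]) -> dict[str, list[str]]:
    """Greedy balancing: give each cluster, largest first, to the least-loaded account."""
    heap = [(0, account) for account in accounts]  # (certificates assigned so far, account)
    heapq.heapify(heap)
    assignment: dict[str, list[str]] = {account: [] for account in accounts}
    # Sort by descending certificate count, then by name so the result is deterministic.
    for cluster, count in sorted(cert_counts.items(), key=lambda kv: (-kv[1], kv[0])):
        load, account = heapq.heappop(heap)
        assignment[account].append(cluster)
        heapq.heappush(heap, (load + count, account))
    return assignment


# Hypothetical clusters and certificate counts.
cert_counts = {"prod-a": 900, "prod-b": 700, "dev-a": 300, "dev-b": 200, "dev-c": 100}
assignment = partition_clusters(cert_counts, ["account-1", "account-2"])
for account, clusters in assignment.items():
    total = sum(cert_counts[c] for c in clusters)
    print(account, clusters, total)
```

With separate accounts, a burst of requests from one group of clusters only counts against the request limit of the account that group uses.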

What’s next?

Even though we don’t have the issue with certificates anymore, it doesn’t mean that we’re totally safe.

Imagine if the ZeroSSL server goes down — we would not be able to issue or renew any certificates for our customers, or at least not in a way that is transparent to them.

We found a promising solution: abstract the issuer behind a proxy layer that has a main provider and a fail-over provider.

We could do that with an external issuer that implements this logic. It would watch for custom resources that reference its own issuer, then replicate the request to the underlying issuers, failing over where appropriate.
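The fail-over part of that idea can be sketched independently of Kubernetes. In the Python sketch below, each provider is reduced to a function that takes a hostname and returns a certificate. The names and the error type are hypothetical, and a real external issuer would work with certificate requests rather than bare hostnames.

```python
from typing import Callable, Sequence


class IssuanceError(Exception):
    """Raised when a provider cannot issue a certificate (hypothetical type)."""


# Simplified issuer interface: hostname in, certificate (as text) out.
Issuer = Callable[[str], str]


def issue_with_failover(hostname: str, issuers: Sequence[tuple[str, Issuer]]) -> tuple[str, str]:
    """Try each issuer in order; return (issuer name, certificate) from the first success."""
    errors = []
    for name, issue in issuers:
        try:
            return name, issue(hostname)
        except IssuanceError as exc:
            errors.append(f"{name}: {exc}")  # remember why this provider failed
    raise IssuanceError("all issuers failed: " + "; ".join(errors))


# Usage: the main provider is unavailable, so the fail-over provider is used.
def main_provider(hostname: str) -> str:
    raise IssuanceError("429 Too Many Requests")


def failover_provider(hostname: str) -> str:
    return f"certificate for {hostname}"


used, cert = issue_with_failover(
    "app.example.com",
    [("main", main_provider), ("fail-over", failover_provider)],
)
print(used, cert)
```

Because customers would only reference the proxy issuer, the order and choice of underlying providers could change without any change on their side.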

Wrap up

We started by trying to improve our on-call health, as we were suffering from many incidents and pages. We identified certificates as the main contributor to the pages and incidents we needed to manage.

We discovered that the limitations of Let’s Encrypt made our certificate solution unstable, so we needed to find an alternative.

In the end, the solution was fairly simple to find and implement, but making the transition seamless and transparent for our users required a lot of effort. Abstracting the migration away from the users and making it seamless is, however, what makes a platform team valuable.

I hope that this story helps you understand the importance of a platform team, especially in terms of abstracting toil away from the platform users. Also, nothing in this world is free of limitations, so it’s important to make our systems resilient enough to cope when we eventually hit them.

If you find this useful, feel free to comment or share your experience with us!