There is a classic Sysadmin explanation given when there’s something odd happening with the network, “it’s always DNS.”
Most of the time, it’s actually something else though, for example, a layer-8 issue (human error). But in this blog post, I want to tell you about how we spent weeks jumping down a different rabbit hole until we found that this time the problem really was caused by DNS.
Background
Before rushing into the story, I first want to explain a little bit about what we do.
At Adevinta, our team operates an internal Platform as a Service: The Common Platform, where Software Engineers from across our marketplaces can build, deploy and manage their Microservices.
These Microservices run atop SCHIP, a runtime built on top of Kubernetes.
SCHIP runs a multi-tenant Kubernetes environment where we have more than 20 tenants in many clusters. In any given cluster, multiple deployments of several tenants are operating at the same time.
The journey begins
On a calm and sunny Friday afternoon, we were contacted by one of our customers who reported that they had noticed a spike of HTTP error (5xx) in some of their services running in SCHIP.
We immediately declared an incident and started to take a look at the problem.
At first glance, our team suspected that there might be some problems with our ingress controller which is responsible for accepting traffic for the whole cluster. We noticed that during that time, the number of controller’s instances varied between 2–5, and it stayed at two instances while we were investigating.
Initially, we thought that the downscaled events of the ingress controller were due to the fluctuation of requests per second, causing some of the requests to be dropped.
We also noted that there was a small number of CPU throttling in our ingress controllers. Based on these observations and assumptions, we decided to increase the minimum number of instances to match the peak, reduce the risk of ingress downscaling, and give more CPU resources to the controllers.
The situation seemed to improve (it hadn’t), so we decided to leave it until the following Monday to validate our assumption.
The first wave
As soon as Monday started, we got another report confirming that the issue still persisted, and the quality of service of our customers was still significantly affected.
We validated that the number of ingress controller instances did not fluctuate anymore, and the CPU metric shows no throttling, so the problem has to be elsewhere.
While investigating, the issues still happened periodically throughout the day, but each drop in performance didn’t last long enough to be relevant.
It’s definitely a networking issue
During one of the occurrences, we noticed a correlated error that led us to the next theory: it must be a network issue.
Why did we think that? It’s because we could link a degradation with one of our internal alerts which was “A fluent-bit agent stopped processing logs.” The fluent-bit agent is responsible for tailing container logs and sending them to the logs forwarder. When we received the alert, it meant that a fluent-bit agent residing in one of the nodes had a networking issue and could not perform the usual forwarding. This caused its buffer to overflow and prevented the agent from processing more logs.
The service that reported the 5xx error happened to co-locate in the same node where the misbehaving fluent-bit resided.
For your information, we use kubenurse, a little tool to monitor all network connections within a Kubernetes cluster and expose useful metrics in Prometheus metrics. For example, the ‘kubenurse_errors_total’ metric which we regularly use to identify a node that doesn’t have an updated network configuration.
We had been annoyed by this fluent-bit alert for some time and believed that the issue was the node misbehaving (easy right?). So, our course of action had always been to drain the problematic node and restart the networking layer. Since we could see some nodes really did have a delay in syncing with an updated picture of all nodes in the clusters (we were using calico as our networking layer), this made sense.
Once again, the situation seemed to improve (it hadn’t).
We continued to receive more reports after the mitigation, but not all of them correlated to the networking error metrics we were monitoring.
This is serious, we need to find the root cause
As the issue became ‘noisier,’ it started to affect the team’s velocity to deliver and our on-call health was severely impacted. As a team, we decided to open a dedicated incident, specifically to investigate this issue properly, and assigned 2–3 engineers to dig into the problem.
We first came up with a dashboard that combined all of the information we thought could be related to the issue and would possibly help us identify the root cause.
For example:
- Network errors (pod in the node cannot reach other nodes)
- Pod/Node movements (we noticed that the error happened every hour and it could be related to a huge amount of CronJobs pods being created)
- All 5xx requests happening in the cluster
- API servers usage metrics
- DNS Hit/Miss and DNS Latency SLIs
The dashboard turned out to be more confusing because each combination of the indicators displayed made sense until they didn’t.
Finally, we decided to take a look at the affected application itself to better understand the root cause of the problem. This changed the game a little bit. We had previously assumed that all applications behave the same way, directing our attention to some specific components without checking what the application itself did.
The first thing we discovered was that the QoS drop came from just a few applications, and mostly from a single application, not fleet-wide. This application was basically a proxy server that sat in front of many Microservices making internal calls using a Kubernetes service name resolution.
Kubernetes DNS for Services and Pods
For example, it calls a service foo using just http://foo/api directly.
This discovery, along with the correlation that whenever we got a report for reduced service performance, usually, we also got paged by the fluent-bit issue mentioned above. This hinted at something we had not previously considered.
Our fluent-bit agent also calls the logs forwarder inside the cluster using this URL log-router-ingress-forward as a hostname.
Yes, at this point we were quite sure that we were on to something — it had to be DNS.
When we took a look at our SLIs (Service Level Indicators) for DNS Latency for the affected cluster, we found it had been fairly low compared to all other clusters.
We had been ignoring this degradation for some time because we did not see any issues and there were no wider impacts in other clusters before this incident. Even though the fluent-bit paged error had been annoying us, it was not happening all the time. Usually draining the node mitigated the noise, so we believed that it was a random underlying node issue.
This problem taught us a great lesson about the benefit of having solid SLIs for your service and how useful it is when you can truly trust them.
Let’s confirm that it’s DNS
Let’s take a look first at our DNS architecture:
In our setup, we have `CoreDNS` deployments running as a high-level cluster DNS resolution. This is complemented with a DNS Masq agent running in every node as a Local DNS Cache to reduce requests hitting `CoreDNS.`
DNS Latency SLI
Our KPI for measuring the healthiness of DNS in the cluster is the DNS Latency SLI. It is calculated as the percentage of internal requests resolved in less than 16ms and the external request resolved in less than 64ms.
Internal requests are those whose top-level domain is a `cluster.local`, hence resolved by the Cluster-level DNS (CoreDNS). External requests are resolved by the upstream DNS provider (in our case Amazon).
We saw that the problem came mainly from internal DNS requests, which have a 16ms threshold. We could also see that the time of the service degradation was aligned with the increasing errors in DNS misses metric.
The issue(s)
There were several issues causing this degradation which I will walk you through.
The concurrent queries limitation
As we realised that this was related to DNS, we started with the logs — and the first issue was quite obvious:
I0920 12:39:44.923951 6 nanny.go:146] dnsmasq[25]: Maximum number of concurrent DNS queries reached (max: 150)
I0920 12:39:55.097844 6 nanny.go:146] dnsmasq[25]: Maximum number of concurrent DNS queries reached (max: 150)
DNS Masq by default only allows 150 concurrent DNS queries per node. We were exceeding that limit given the number of requests happening in the affected cluster, so we bumped the setting up from 150 to 1500 as recommended by Google.
After we rolled out the change, the DNS Latency SLI improved instantly.
However, we could not celebrate for long, because after a week, we could see a declining trend. It was like we had just stuck a band-aid on a much deeper problem.
Missing caches
After going deeper into debugging DNS Masq, we found that only external DNS queries were being cached. This means that every cluster.local query was still being forwarded to the Cluster DNS (CoreDNS). After investigating further down this route we found that some changes in recent CoreDNS versions prevent DNS Masq from caching answers from CoreDNS.
(credit to: https://qasim-sarfraz.medium.com/dns-caching-gone-wrong-a329dc00452e)
I will skip the details here for the sake of storytelling but, you can read the linked blog if you’re interested.
TLDR;
The Recursion Available (RA) flag (set by the server), as well as Recursion Desired (RD) (set by the client) flags, are bits set on the DNS query and its corresponding response by the server.
When RA is disabled, DNS Masq explicitly refuses to cache it.
By forcing the RA bit to always be present we could fix the issue and make DNS Masq cache those local queries.
. {
header {
set ra aa
clear rd
}
}
We once again saw an improvement to the SLI (on 10/11 in the screenshot)
And there was a huge improvement in DNSMasq cache hit/miss metrics.
Ok, surely after we had been through so many possible solutions, the problem must be solved, right? Not quite.
After going through weeks of debugging this issue, constantly checking the SLIs daily became something of an addiction. After another week, we sensed something was off.
The main villain, the n-dots
To gather real data, we ran DNStop selectively in some problematic nodes that showed us something suspicious.
There were “garbage” domains (unresolvable addresses) flooding the DNS requests from the node. This was the result of the famous ndots issue.
The ndots issue and its implication on DNS performance in Kubernetes is well-known. Any domain with less than 5 dots in the domain name triggers several local searches defined in the configuration before it reaches out.
As a consequence of this issue, we detected hundreds of requests per second trying to resolve hosts in the following domains:
schip.io.svc.cluster.local
cluster.local.cluster.local
cluster.local.svc.cluster.local
cluster.local.system-namespace.svc.cluster.local
grafana.net.platform-services.cluster.local
grafana.net.platform-services.svc.cluster.local
kube-system.platform-services.svc.cluster.local
svc.cluster.local.system-namespace.svc.cluster.local
Even with proper caching, some of our clusters were big enough to produce hundreds of these requests per second. Especially the cluster that we had been investigating.
Now that we discovered the real problem, we thought of several actions that we could take to improve the situation.
Discard garbage domains
The best thing to do is to catch those invalid searches and return an NXDOMAIN response. The best place to do that is as close to the POD as possible — at the Node Local Cache (DNSMasq).
We introduced a configuration that would intercept these garbage domains and return an NXDOMAIN response to the client. These requests will not be forwarded to our CoreDNS instances, saving them from being flooded.
We introduced a configuration in our local DNS to drop invalid domains
Use FQDN as a hostname
To avoid ndots issues, the most straightforward solution is to have at least five dots in our hostname. Fluent-bit is one of the biggest abusers of the DNS requests. We changed the hostname of the receiving log router, that it needs to establish a connection with every time it needs to send logs , to use a FQDN hostname instead of a service name.
As it now has five dots in the domain, it doesn’t trigger local search anymore.
We also reached out to our customers that were affected by this update to change their internal resolution to FQDN, as their affected application also generated a lot of local searches.
Did it work?
After we identified the problem and applied all of the enhancements, let’s take a look at the impact of those actions
This chart shows the performance improvement from the time we started implementing cache fixes until we discarded the garbage domains. You can clearly see the drastic improvement to the SLI on the far right after we applied the patch.
Also, requests reaching the Cluster DNS (CoreDNS) dropped from more than 4000 requests/second to less than 1000 requests/second.
And our SLO was back to its green state without further degradation.
Why did it happen to only one cluster?
Going through the journey, some of you might wonder why the issue only affected one of our customers given that the configuration was not cluster-specific, but was applied to all clusters in the same way.
The fact is, this problem did affect everyone. We found 2–3 more big clusters whose DNS Latency SLI was affected. However, the cluster we have described here had the biggest amount of DNS queries in the fleet due to its pod density, and the behaviour of the applications that were running on it. In Particular, we discovered that there were a number of CronJobs scheduled to run every X hours, making the DNS health worse when they were scheduled. Which is why the issues were mostly reported to be happening at midnight.
Timeline
Taking a look at the whole month of debugging this, we can see clearly what happened.
We saw an improvement after we killed the nodes with the proxy application. When it started to degrade again, we fixed the DNS cache. Then performance dropped again, until we discarded those garbage DNS requests.
Lesson learned
Before we wrap up, let’s talk about what it taught us as a team.
The importance of SLO
This issue changed my perspective on SLO (Service Level Objectives). I used to think that it was a good indicator to look at, but never used it as a safeguard. When an SLO degraded before, I never prioritised understanding why the performance had dropped.
However, you need to make sure that you have defined, solid and practical SLIs for your SLO. Otherwise, you could end up jumping straight into a rabbit hole trying to improve things when you don’t even know if it’s broken.
Organise the investigation
One of the reasons that this issue remained for a month, was that our approach to the problem was scattered. There were multiple efforts by several groups of our on-call engineers, but they weren’t organised properly.
We didn’t have a single place where we gathered all of the information from our observations. Instead, everyone who looked at the problem had to start an investigation from scratch.
We also didn’t list and prioritise all of our assumptions in a single place. There were times when we worked on the same assumptions or even worked on already discarded assumptions.
There is a very important lesson I learned from my beloved Technical Lead — it’s super important to discard one hypothesis before starting a new one. Focus on your primary hypothesis and try your best to prove it wrong until you come up with another idea to find the root cause.
Gather the data, dig deep
Building on the above principle, when you work on a hypothesis, you need to dig as deep as possible and find all of the data you can to support the hypothesis. Sometimes you may not have all the data available from your existing metric system, so don’t be afraid to get creative and gather more information.
For example, in this case, running DNStop gave us very valuable information that led us to the correct conclusion, and we didn’t have that information available in our metrics system.
Benefit of the economies of scale
As this incident impacted everyone, the improvements to the DNS architecture also benefited all of the other customers running on our platform — even though they were not aware of it. Making one change to our multi-tenancy environment improved performance for everyone, increasing the overall benefit of our efforts.
Moreover, we could also apply the learnings from this problem to our other platforms with a different technology stack.
I hope everyone enjoys reading this blog, and remember, sometimes it’s really DNS.