Understanding and Checking/Analyzing Your DockerHub Rate Limit

We’ve been hitting docker rate limiting pretty hard lately in our EKS clusters. Here are some interesting things we learned:

  • The anonymous request rate limit for DockerHub is 100 requests per IP address per six hours.
  • If you are in a private IP space and have internet gateways, you are probably being rate limited on the IPs of the gateways.
  • So, if you have 600 servers going through 6 gateways, your budget is 6 × 100 = 600 requests per window, not 600 × 100 = 60,000 (obviously this is a massive difference).
  • In kubernetes, you should specify an explicit image tag (it is not mandatory) and set imagePullPolicy: IfNotPresent so that nodes reuse cached images and pull less frequently. If you omit the tag or use :latest, the default pull policy is Always.
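
The gateway arithmetic above can be sketched in a few lines. This is only an illustration: the function name anonymous_pull_budget and its parameters are made up for this post, and the default of 100 is the anonymous limit quoted above.

```python
def anonymous_pull_budget(egress_ips: int, limit_per_ip: int = 100) -> int:
    """Total anonymous requests available per rate-limit window.

    The limit is counted per source IP, so the budget scales with the
    number of public egress IPs (the gateways), not with the number of
    servers sitting behind them.
    """
    if egress_ips < 0 or limit_per_ip < 0:
        raise ValueError("counts must be non-negative")
    return egress_ips * limit_per_ip


servers, gateways = 600, 6
shared = anonymous_pull_budget(gateways)  # what you actually get
naive = anonymous_pull_budget(servers)    # what you might have assumed
print(f"budget through {gateways} gateways: {shared} per window")
print(f"budget if all {servers} servers had their own IP: {naive}")
print(f"average per server per window: {shared / servers:.1f}")
```

In the 600-server example that works out to a single request per server per window, which is why caching images on the nodes matters so much.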

If you need to check where your servers stand against the rate limit, see https://www.docker.com/blog/checking-your-current-docker-pull-rate-limits-and-status/.

For anonymous requests, basically just run:

TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)

curl -s --head -H "Authorization: Bearer $TOKEN" https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i '^ratelimit'

And you will get output like this, showing the rate limit (100) and how many you have left (100 for me as I haven’t pulled recently). The w=21600 is the window in seconds, i.e. six hours:

RateLimit-Limit: 100;w=21600
RateLimit-Remaining: 100;w=21600
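
If you want to feed these headers into a script or an alert rather than read them by eye, here is a minimal parsing sketch. It assumes the "count;w=seconds" shape shown above; the names RateLimit, parse_ratelimit and parse_headers are ours, not part of any Docker tooling.

```python
import re
from typing import NamedTuple


class RateLimit(NamedTuple):
    count: int           # requests allowed (or remaining)
    window_seconds: int  # length of the window the count applies to


def parse_ratelimit(value: str) -> RateLimit:
    """Parse a value such as '100;w=21600' into (count, window_seconds)."""
    match = re.fullmatch(r"\s*(\d+)\s*;\s*w=(\d+)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised rate-limit value: {value!r}")
    return RateLimit(int(match.group(1)), int(match.group(2)))


def parse_headers(lines):
    """Collect RateLimit-* headers from raw 'Name: value' lines.

    Header names are lower-cased because servers are free to change the case.
    """
    found = {}
    for line in lines:
        name, sep, value = line.partition(":")
        key = name.strip().lower()
        if sep and key.startswith("ratelimit-"):
            found[key] = parse_ratelimit(value)
    return found


# The two lines from the sample output above.
sample = ["RateLimit-Limit: 100;w=21600", "RateLimit-Remaining: 100;w=21600"]
limits = parse_headers(sample)
limit = limits["ratelimit-limit"]
remaining = limits["ratelimit-remaining"]
hours = limit.window_seconds // 3600
print(f"{remaining.count}/{limit.count} left per {hours}h window")
```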

Kubernetes PLEG Issues / Lots of Ephemeral Pods / Airflow

What is the Use Case?

We’ve been hosting a service for over a year now that basically deploys Apache Airflow over kubernetes in a SaaS model. Each internal client/user gets one instance of their own, including its own dedicated scheduler, web server, and general namespace to run task pods. Teams can run hundreds or thousands of parallel tasks each on their instance, all scheduled on the central cluster as an individual pod per task.

We use EKS v1.16 on AWS. One interesting problem we have run into is that Airflow can create a ton (tens of thousands) of short-lived/ephemeral pods, and they often have very low resource requirements.

This can mean that a node with low CPU/memory usage may have hundreds or thousands of pods scheduled on it back-to-back as they keep creating/running/being cleaned at a rapid pace (which is very cool).

So, What is the Problem?

It turns out that, while CPU and memory can be very low on some nodes, the sheer act of creating/managing/destroying so many pods can cause issues in its own right. We use the prometheus operator in our Kubernetes clusters, and it starts alerting us with KubeletPlegDurationHigh – “The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of 10 seconds on node <node-id>.”

What is the PLEG?

You can review this article to understand the Pod Lifecycle Event Generator (PLEG) more: https://developers.redhat.com/blog/2019/11/13/pod-lifecycle-event-generator-understanding-the-pleg-is-not-healthy-issue-in-kubernetes/. It is very helpful. I’ve extracted the useful bits here:

The PLEG module in kubelet (Kubernetes) adjusts the container runtime state with each matched pod-level event and keeps the pod cache up to date by applying changes.

[Figure: kubelet process diagram, with the PLEG path marked by a dotted red line. The original image is here: Kubelet: Pod Lifecycle Event Generator (PLEG).]

Monitoring the Issue

Assuming you have the prometheus operator installed and have the relevant metrics/alerts, here is a chart that lets you view the PLEG activity well in graphical form. This helps you understand if your solutions are helping much.

You don’t need the kubernetes_cluster label in the query unless you’ve also added it as an external label across multiple Prometheus instances (we query this from Thanos, which aggregates several of them).

Here’s one of the queries in text form with that label removed so you can copy and paste it more easily:

quantile(.95, kubelet_pleg_relist_latency_microseconds) / 1000000

Mitigating the Issue

There are numerous things you can do to help mitigate this issue:

  1. Add more nodes to the cluster / increase the minimum on the auto scaler range. More nodes = more distribution of pods = fewer PLEG issues, as they occur on a per-node basis.
  2. Monitor and find the threshold/count of pods where issues happen, then adjust the kubelet settings so a node can’t have that many pods (the kubelet’s --max-pods setting). Generally we only see PLEG issues when we pass 45 pods on a node *and* have lots of ephemeral pods. This will change based on instance type and workload, but you should be able to spot a trend and set the maximum accordingly. This is a good solution, as an explicit pod limit will make the cluster autoscaler (CA) scale up new nodes properly.
  3. Distribute pods better around the cluster. Kubernetes, when running lots of ephemeral pods, tends to hot spot a bit and put more of these short-lived pods on a few nodes. You can use things like https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ in newer kubernetes versions to reduce hot-spotting and mitigate PLEG issues (and other issues like docker rate limiting). This really just helps you use your existing servers more optimally.
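
For option 2 you first need to know how many pods each node is carrying. A rough sketch, assuming you have saved the output of kubectl get pods --all-namespaces -o json and parsed it into a dict; the function names are ours, and the default threshold of 45 is just the number we observed, not a general rule.

```python
import json
from collections import Counter


def pods_per_node(pod_list: dict) -> Counter:
    """Count pods per node from a parsed `kubectl get pods -o json` list."""
    counts = Counter()
    for pod in pod_list.get("items", []):
        node = pod.get("spec", {}).get("nodeName")
        if node:  # pods that are not scheduled yet have no nodeName
            counts[node] += 1
    return counts


def crowded_nodes(pod_list: dict, threshold: int = 45) -> dict:
    """Return {node: pod_count} for nodes carrying more than `threshold` pods."""
    return {n: c for n, c in pods_per_node(pod_list).items() if c > threshold}


# Tiny hand-made sample standing in for real kubectl output.
sample = json.loads("""
{"items": [
  {"spec": {"nodeName": "node-a"}},
  {"spec": {"nodeName": "node-a"}},
  {"spec": {"nodeName": "node-b"}},
  {"spec": {}}
]}
""")
print(pods_per_node(sample))
print(crowded_nodes(sample, threshold=1))
```

Running this on a schedule and comparing it with the PLEG latency chart is one way to find the pod count at which a given instance type starts to struggle.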

I’m sure there are ‘better’ ways to fix this, but we haven’t found them yet. I’ll circle back and update this if and when we find them.

EKS Kubernetes Auto Scaling / Ingress / Remote Disconnect Errors

Problem Overview

We use AWS EKS (v1.16) kubernetes for our auto scaling Presto deployments, and we front them with an nginx ingress leveraging a network load balancer.

We found that, once we started auto scaling, we started getting remote disconnect errors from clients fairly frequently. This was pretty hard to explain because we had actually gone to great lengths to make sure Presto itself was gracefully terminating in a way that would not damage live queries.

Where is the Issue?

The root cause of this issue is that:

  1. We use ingress.
  2. Ingress uses a cloud load balancer.
  3. The cloud load balancer talks to the nginx ingress controller as a NodePort service.
  4. This means the LB will route traffic through any random node in the cluster.
  5. So, we gracefully terminate presto, but the NodePort service on the node that is scaling down may still be used for routing traffic to another node (e.g. the coordinator in this case).

It turns out that there really is no good way to fix this in EKS at this point in time. We originally hit this bug: https://github.com/kubernetes/autoscaler/issues/1907, and when we tried the workaround of using externalTrafficPolicy = Local, we hit this other bug: https://github.com/kubernetes/cloud-provider-aws/issues/87.

Other solutions are being developed now and will allow you to exclude certain nodes from the LB config using labels/etc, but they are not ready yet.

What is a Workaround?

Unfortunately, we did not solve this purely using the NGINX ingress. We found that we had to schedule the ingress services on some non-auto-scaling core nodes, and then we added them to the load balancer specifically (actually, to a separate LB we created and manage with terraform). This way, ingress always comes into nodes that do not auto scale, and those nodes route to the other services in a reliable way using the CNI black magic. It’s not a feel-good solution, but it remains stable during auto scaling of the rest of the cluster, so it works until a real k8s/AWS solution is developed.

Kubernetes – Get terminationGracePeriodSeconds and Other Values Missing From Describe Pod/Deployment

When checking what is running in kubernetes, people generally do something like this:

kubectl get deploy -n <namespace>
kubectl get pods -n <namespace>

And to describe extended parameters on a deployment or pod:

kubectl describe deploy -n <namespace> <deployment-name>
kubectl describe pod -n <namespace> <pod-name>

Interestingly, these more verbose describe commands are still missing a lot of information. It turns out that the only way to get *all* of the information is to go back to the get command and to tell it to output everything to YAML or a similar format:

kubectl get deploy -n <namespace> -o yaml
kubectl get pods -n <namespace> -o yaml

These commands will yield far more configuration options than the describe commands. Things like terminationGracePeriodSeconds will be readily available here.
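
The YAML dump gets long quickly, so it can help to pull out just the field you care about. A small sketch, assuming the JSON variant of the same command (-o json) has been parsed into a dict; grace_periods is a name we made up, and the two deployments in the sample are invented.

```python
import json


def grace_periods(deploy_list: dict) -> dict:
    """Map deployment name -> terminationGracePeriodSeconds of its pod template."""
    result = {}
    for item in deploy_list.get("items", []):
        name = item["metadata"]["name"]
        pod_spec = item.get("spec", {}).get("template", {}).get("spec", {})
        # None means the field was absent from the dump.
        result[name] = pod_spec.get("terminationGracePeriodSeconds")
    return result


# Hand-made sample shaped like `kubectl get deploy -n <namespace> -o json`.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "web"},
   "spec": {"template": {"spec": {"terminationGracePeriodSeconds": 30}}}},
  {"metadata": {"name": "worker"},
   "spec": {"template": {"spec": {"terminationGracePeriodSeconds": 300}}}}
]}
""")
for name, seconds in grace_periods(sample).items():
    print(f"{name}: {seconds}s")
```

To use it against a real cluster, replace the sample with json.load(sys.stdin) and pipe the kubectl output into the script.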

Presto – Get and List the Connectors on All Nodes in Cluster

Some problems in presto are the result of having connector definitions only on a subset of nodes in the cluster. For example, a recent error on the presto-sql forum during insert into a hive table was:

java.lang.IllegalArgumentException: No page sink provider for catalog 'hive'
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:216)
at io.prestosql.split.PageSinkManager.providerFor(PageSinkManager.java:67)
at io.prestosql.split.PageSinkManager.createPageSink(PageSinkManager.java:61)
at io.prestosql.operator.TableWriterOperator$TableWriterOperatorFactory.createPageSink(TableWriterOperator.java:114)
at io.prestosql.operator.TableWriterOperator$TableWriterOperatorFactory.createOperator(TableWriterOperator.java:105)
at io.prestosql.operator.DriverFactory.createDriver(DriverFactory.java:114)
at io.prestosql.execution.SqlTaskExecution$DriverSplitRunnerFactory.createDriver(SqlTaskExecution.java:941)
at io.prestosql.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:1069)
at io.prestosql.execution.executor.PrioritizedSplitRunner.process(PrioritizedSplitRunner.java:163)
at io.prestosql.execution.executor.TaskExecutor$TaskRunner.run(TaskExecutor.java:484)

If you have a decent-sized cluster, it is very painful to go to each node and check its catalogs. This problem can be even worse if an old node rejoins the cluster after maintenance or something like that.

In any case, you can use the following URL on presto (/v1/service/presto) to list all nodes and their registered connectors in one shot. This will help you track down the problem fast :). You can even be lazy and parse the JSON in chrome dev tools/etc so you don’t have to eyeball all the nodes.

https://nonprod.presto.your-company.com/v1/service/presto
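
If you would rather not eyeball the JSON at all, here is a minimal sketch that reports which nodes are missing connectors. It assumes the response has a services list whose entries carry nodeId and a comma-separated properties.connectorIds, and it treats the union of connectors seen on any node as the expected set. The function names and the two-worker sample are ours.

```python
import json


def connectors_by_node(service_doc: dict) -> dict:
    """Map nodeId -> set of connector ids from a /v1/service/presto document."""
    result = {}
    for svc in service_doc.get("services", []):
        ids = svc.get("properties", {}).get("connectorIds", "")
        result[svc["nodeId"]] = {c for c in ids.split(",") if c}
    return result


def missing_connectors(service_doc: dict) -> dict:
    """Return {nodeId: sorted connectors the node lacks} for incomplete nodes."""
    per_node = connectors_by_node(service_doc)
    # Expected set = every connector registered on at least one node.
    everything = set().union(*per_node.values())
    return {
        node: sorted(everything - have)
        for node, have in per_node.items()
        if everything - have
    }


# Hand-made sample: worker-2 never registered the hive-dl connector.
sample = json.loads("""
{"environment": "nonprod",
 "services": [
   {"nodeId": "worker-1", "properties": {"connectorIds": "hive-dl,system"}},
   {"nodeId": "worker-2", "properties": {"connectorIds": "system"}}
 ]}
""")
print(missing_connectors(sample))
```

An empty result means every node reports the same connectors. Anything else points you straight at the nodes to fix.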

Example Output

{
  "environment": "nonprod",
  "services": [
    {
      "id": "a35ae2a7-fa95-43c9-b893-180449a48c5a",
      "nodeId": "blue-presto-worker-865b8db58-g92wn",
      "type": "presto",
      "pool": "general",
      "location": "/blue-presto-worker-865b8db58-g92wn",
      "properties": {
        "node_version": "331-n-2.6.1",
        "coordinator": "false",
        "https": "https://10-234-232-180.nonprod-presto.pod.cluster.local:8443",
        "https-external": "https://10-234-232-180.nonprod-presto.pod.cluster.local:8443",
        "connectorIds": "hive-dl,system,cr-meta,ar-meta,dc-meta"
      }
    },
    {
      "id": "b8dd0f39-00b0-4c78-b0c0-ff8e753419d8",
      "nodeId": "blue-presto-worker-865b8db58-d2nsz",
      "type": "presto",
      "pool": "general",
      "location": "/blue-presto-worker-865b8db58-d2nsz",
      "properties": {
        "node_version": "331-n-2.6.1",
        "coordinator": "false",
        "https": "https://10-234-234-106.nonprod-presto.pod.cluster.local:8443",
        "https-external": "https://10-234-234-106.nonprod-presto.pod.cluster.local:8443",
        "connectorIds": "hive-dl,system,cr-meta,ar-meta,dc-meta"
      }
    },
    ...