We use AWS EKS (Kubernetes v1.16) for our auto scaling Presto deployments, and we front it with an nginx ingress behind a network load balancer.
Once we turned on auto scaling, clients started getting remote disconnect errors fairly frequently. This was pretty hard to explain because we had actually gone to great lengths to make sure Presto itself was gracefully terminating in a way that would not damage live queries.
Where is the Issue?
The root cause of this issue is that:
- We use ingress.
- Ingress uses a cloud load balancer.
- The cloud load balancer talks to the nginx ingress controller as a NodePort service.
- This means the LB will route traffic through any random node in the cluster.
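- So, we gracefully terminate Presto, but the LB may still be routing traffic through the NodePort service on the node that is scaling down, even when that traffic is bound for a pod on another node (e.g. the coordinator in this case). When that node goes away, the connections passing through it are dropped.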
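To make the chain above concrete, here is a toy simulation of it. It is only a sketch: the node names are made up, the LB is assumed to pick an entry node uniformly at random, and a request that enters through a terminating node is simply counted as dropped. Real load balancer and kube-proxy behavior is more involved than this.

```python
import random


def route_request(lb_targets, terminating, rng):
    """Model one request: the LB picks an entry node, whose NodePort forwards it on.

    Returns True if the request survives, False if the entry node is being
    terminated (modeled here as a dropped connection).
    """
    entry_node = rng.choice(sorted(lb_targets))
    return entry_node not in terminating


def failure_rate(lb_targets, terminating, requests=10_000, seed=0):
    """Fraction of simulated requests that enter through a terminating node."""
    rng = random.Random(seed)
    failures = sum(
        not route_request(lb_targets, terminating, rng) for _ in range(requests)
    )
    return failures / requests


# Hypothetical cluster: every node is registered with the LB.
all_nodes = {"core-1", "core-2", "worker-1", "worker-2", "worker-3"}
scaling_down = {"worker-3"}  # the auto scaler is removing this node

# Roughly one request in five enters via the dying node in this model,
# even though the pod it is bound for lives elsewhere.
print(failure_rate(all_nodes, scaling_down))
```

The point of the model is that the failure has nothing to do with where Presto runs. It only depends on which nodes the LB is allowed to enter through.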
It turns out that there is no good way to fix this in EKS right now. We originally hit this bug: https://github.com/kubernetes/autoscaler/issues/1907, and when we tried the workaround of setting externalTrafficPolicy = Local on the ingress service, we hit this other bug: https://github.com/kubernetes/cloud-provider-aws/issues/87.
Other solutions are in development that will let you exclude certain nodes from the LB config using labels and similar mechanisms, but they are not ready yet.
What is a Workaround?
Unfortunately, we could not solve this using the NGINX ingress alone. We found that we had to schedule the ingress controller pods on some non-auto-scaling core nodes, and then add those nodes to the load balancer explicitly (actually, to a separate LB we created and manage with Terraform). This way, ingress traffic always enters through nodes that do not auto scale, and those nodes route to the other services in a reliable way using the CNI black magic. It’s not a feel-good solution, but it remains stable during auto scaling of the rest of the cluster, so it works until a real k8s/AWS solution is developed.
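One way to do the scheduling half of this is a nodeSelector on the ingress controller's pod template. The snippet below is a minimal sketch that builds such a patch as JSON. The label `node-group: core` is a hypothetical name picked for illustration, not the label we actually used, and your core nodes would need to carry whatever label you choose.

```python
import json

# Hypothetical label for illustration; apply your own label to the core nodes.
CORE_NODE_LABEL = {"node-group": "core"}


def ingress_pinning_patch(node_label):
    """Build a patch that restricts a Deployment's pods to nodes with this label."""
    return {
        "spec": {
            "template": {
                "spec": {
                    # nodeSelector is a standard pod spec field: pods are only
                    # scheduled onto nodes carrying every listed label.
                    "nodeSelector": dict(node_label)
                }
            }
        }
    }


patch = ingress_pinning_patch(CORE_NODE_LABEL)
print(json.dumps(patch, indent=2))
```

The printed JSON could be passed to `kubectl patch deployment` with `--patch` for whatever Deployment runs your ingress controller. This only covers pod placement. Registering just the core nodes with the separate LB is still done outside Kubernetes, in our case with Terraform.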