We use AWS EKS (v1.16) Kubernetes for our auto-scaling Presto deployments, and we front it with an NGINX ingress leveraging a network load balancer.
We found that, once we started auto scaling, clients fairly frequently got remote-disconnect errors. This was pretty hard to explain because we had gone to great lengths to make sure Presto itself terminated gracefully in a way that would not damage live queries.
Where is the Issue?
The root cause of this issue is that:
- We use ingress.
- Ingress uses a cloud load balancer.
- The cloud load balancer talks to the NGINX ingress controller as a NodePort service.
- This means the LB will route traffic through any random node in the cluster.
- So, we gracefully terminate Presto, but the NodePort service on the node that is scaling down may still be routing traffic to another node (e.g. to the coordinator in this case).
It turns out that there really is no good way to fix this in EKS at this point in time. We originally hit this bug: https://github.com/kubernetes/autoscaler/issues/1907, and when we tried the workaround of using externalTrafficPolicy = Local, we hit this other bug: https://github.com/kubernetes/cloud-provider-aws/issues/87.
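For reference, that externalTrafficPolicy workaround is a one-line change to the ingress controller's Service. Here is a minimal sketch; the ingress-nginx namespace and service names are assumptions that may not match your install:
# Patch the ingress controller Service so nodes only accept LB traffic for ingress pods they actually host.
kubectl -n ingress-nginx patch svc ingress-nginx-controller \
--type merge \
-p '{"spec":{"externalTrafficPolicy":"Local"}}'
With it set to Local, nodes without an ingress controller pod stop forwarding load balancer traffic, which is what removes the extra hop through a draining node (when it works, which it did not for us due to the NLB bug linked above).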
Other solutions are being developed now and will allow you to exclude certain nodes from the LB config using labels/etc, but they are not ready yet.
What is a Workaround?
Unfortunately, we did not solve this purely using the NGINX ingress. We found that we had to schedule the ingress controller pods on some non-auto-scaling core nodes, and then we added those nodes to the load balancer specifically (actually, to a separate LB we created and manage with Terraform). This way, ingress traffic always enters through nodes that do not auto scale, and those nodes route to the other services reliably via the CNI black magic. It’s not a feel-good solution, but it remains stable during auto scaling of the rest of the cluster, so it works until a real k8s/AWS solution is developed.
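To give a rough idea of what the scheduling piece looks like, here is a minimal sketch of pinning the controller to the core nodes. The label key, node name, namespace, and deployment name are all assumptions; adjust for your install:
# Label the non-auto-scaling "core" nodes (node name is illustrative).
kubectl label nodes ip-10-0-1-23.ec2.internal node-role=core
# Pin the NGINX ingress controller pods to those nodes via a nodeSelector.
kubectl -n ingress-nginx patch deployment ingress-nginx-controller \
--type merge \
-p '{"spec":{"template":{"spec":{"nodeSelector":{"node-role":"core"}}}}}'
The separate NLB we manage in Terraform then targets only those core nodes, so scale-in events elsewhere in the cluster never remove a target from the load balancer.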
Due to increased query sizes on our Presto clusters (causing aggregation failures), I’m in the middle of evaluating a move from 16-core, 64GB-RAM general-purpose EC2 machines (m4.4xlarge) to 64-core, 256GB-RAM general-purpose machines (a 4x increase in cores and RAM).
Here are the m4 and m5 models with 16-core/64GB and 64-core/256GB specs. Below, we’ll see how they compare to each other and which is the best option.
| Model | vCPU | Memory (GiB) | Instance Storage (GB) | On-Demand Price |
| --- | --- | --- | --- | --- |
| m4.4xlarge | 16 | 64 | EBS only | $0.80 per Hour |
| m4.16xlarge | 64 | 256 | EBS only | $3.20 per Hour |
| m5.4xlarge | 16 | 64 | EBS only | $0.768 per Hour |
| m5.16xlarge | 64 | 256 | EBS only | $3.072 per Hour |
AWS uses the term EC2 Compute Unit (ECU) to describe the CPU resources of each instance size, where one ECU provides the equivalent CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor.
There are a few good things to notice here:
- For m4.4xlarge to m4.16xlarge, we are getting 4x the resources for exactly 4x the cost ($0.80 x 4 = $3.20). The one exception is that we get less than 4x the ECUs (so, technically, less than 4x the processing power). So, cost roughly scales linearly with resources within an instance family.
- Pretty much the same situation holds for the m5 models; going from 4xlarge to 16xlarge is exactly a 4x increase in cost and resources, except for ECUs, which again come in at a little less than 4x.
- The m5 models have more ECUs than their m4 counterparts and they also cost less, so they are a better deal both performance-wise and cost-wise.
So, we’ll go with the m5.16xlarge instances, which cost $3.072 an hour; that comes out to roughly $2,211 for a 30-day month.
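If you want to sanity-check that monthly figure yourself (assuming a 30-day month of continuous on-demand usage, with no reserved-instance or savings-plan discount), the arithmetic is straightforward:
# $3.072/hour x 24 hours x 30 days per instance
echo "3.072 * 24 * 30" | bc   # prints 2211.840, i.e. roughly $2,211/month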
This is just a quick note for anyone facing this issue.
A few of us lost about a day debugging what we originally thought was a Terraform issue. While we were creating an auto scaling group (ASG), we were getting “Invalid details specified: You are not authorized to use launch template…”.
It turned out that the same error was presented in the AWS console when we tried to create the ASG there.
After some substantial debugging, it turned out that Terraform had happily created a launch template referencing an AMI (Amazon Machine Image) that did not exist. We had used the AMI ID from our non-prod account in our prod account, but AMIs are specific to an account and region, each with its own unique ID – so the ID we referenced simply did not exist in prod.
It took us a while to get to this point in our debugging because, frankly, we were astounded that the error message was so misleading. We spent a very long time trying to figure out everything that could trigger a permissions error on the template itself, not realizing that a missing resource referenced within the template would make the whole template present that error.
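One habit that would have caught this much sooner: verify the AMI is actually visible in the target account and region before referencing it in a launch template. A minimal check (the AMI ID and region here are placeholders, not our real ones):
# Prints the image details if this account can see the AMI in this region;
# fails with an InvalidAMIID.NotFound error if it does not exist there.
aws ec2 describe-images --image-ids ami-0123456789abcdef0 --region us-east-1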
In AWS, you can generally extend the root (or any other) volume of your EC2 instances without downtime. The steps vary slightly by OS, file system type, etc., though.
On a fairly default-configured AWS instance running the main marketplace CentOS 7 image, I had to run the following commands.
- Find the volume on the “Volumes” page under the EC2 service in the AWS console and modify it (increase its size).
- Wait for it to get into the “Optimizing” state (visible in the volume listing).
- Run: sudo file -s /dev/xvd*
- If you’re in my situation, this will output a couple of lines like this:
- /dev/xvda: x86 boot sector; partition 1: ID=0x83, active, starthead 32, startsector 2048, 134215647 sectors, code offset 0x63
- /dev/xvda1: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)
- The important part is the XFS; that is the file system type.
- Run: lsblk
- Again, in my situation the output looked like this:
- NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
- xvda 202:0 0 64G 0 disk
- └─xvda1 202:1 0 64G 0 part /
- This basically says that the data is in one partition under xvda. Note: mine said 32G to start; I increased it to 64G and am just going back through the process to document it.
- Run: sudo growpart /dev/xvda 1
- This grows partition #1 of /dev/xvda to take up the remaining space.
- Run: sudo xfs_growfs -d /
- This grows the XFS file system mounted at / to take up the available space in the partition.
- After this, you can just do a “df -h” to see the increased partition size.
Note: your volume may take hours to get out of the “Optimizing” state, but it can still be used immediately.
You can view the raw AWS instructions here in case any of this doesn’t line up for you when you go to modify your instance: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-modify-volume.html.
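For convenience, here are the same steps gathered into one runnable sequence for the XFS-on-/dev/xvda layout described above (the volume ID in the optional first step is a placeholder):
# Optional: grow the EBS volume from the CLI instead of the console (placeholder volume ID).
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 64
# Then, on the instance, once the new size is visible:
sudo file -s /dev/xvd*      # confirm the file system type (XFS in my case)
lsblk                       # confirm which partition holds the data
sudo growpart /dev/xvda 1   # grow partition 1 of /dev/xvda into the new space
sudo xfs_growfs -d /        # grow the XFS file system mounted at / to fill the partition
df -h                       # verify the new size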
I was very surprised to see how incredibly hard it is to determine an AMI ID in AWS for use with Packer.
I generally use CentOS 7 marketplace images for my servers; e.g. CentOS 7 (x86_64) – with Updates HVM. There is nowhere in the AWS UI or the linked CentOS product page to actually find what the AMI ID is in a given region (and it does change per region).
I came across this Stack Overflow post, which was a life-saver though. Basically, for us-east-1 as an example, you can run this command using the AWS CLI (yeah, you actually have to use the CLI – that’s how wrong this is).
aws ec2 describe-images \
--owners aws-marketplace \
--filters Name=product-code,Values=aw0evgkw8e5c1q413zgy5pjce \
"Name=name,Values=CentOS Linux 7*" \
--query 'Images[*].[CreationDate,Name,ImageId]' \
--region us-east-1 \
--output table \
| sort -r
And you get output like this:
| 2019-01-30T23:40:58.000Z| CentOS Linux 7 x86_64 HVM EBS ENA 1901_01-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-05713873c6794f575.4 | ami-02eac2c0129f6376b |
| 2018-06-13T15:53:24.000Z| CentOS Linux 7 x86_64 HVM EBS ENA 1805_01-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-77ec9308.4 | ami-9887c6e7 |
| 2018-05-17T08:59:21.000Z| CentOS Linux 7 x86_64 HVM EBS ENA 1804_2-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-55a2322a.4 | ami-d5bf2caa |
| 2018-04-04T00:06:30.000Z| CentOS Linux 7 x86_64 HVM EBS ENA 1803_01-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-8274d6ff.4 | ami-b81dbfc5 |
| 2017-12-05T14:46:53.000Z| CentOS Linux 7 x86_64 HVM EBS 1708_11.01-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-95096eef.4 | ami-02e98f78 |
The top entry will be the newest and is probably the one you want (it was in my case).
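If you only want the newest AMI ID by itself (e.g. to feed straight into a Packer variable or a script), a small variation of the same command works; this assumes the same CentOS 7 product code as above:
aws ec2 describe-images \
--owners aws-marketplace \
--filters Name=product-code,Values=aw0evgkw8e5c1q413zgy5pjce \
--query 'sort_by(Images, &CreationDate)[-1].ImageId' \
--region us-east-1 \
--output text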
I hope that saves you some precious googling time; it took me a while to find this since AWS’s less-than-admirable documentation on the subject shows up first.