Ansible – Refer to Host in Group by Index

Occasionally it is very useful to refer to a host in a group by an index.  For example, if you are setting up Apache or HAProxy, you may need to push a configuration file out to each host that can redirect to all other hosts.

It is actually quite easy to refer to the hosts in a group by index, but it's not necessarily easy to google, unfortunately.  Here is the syntax for the first 3 hosts in a group:

{{groups['coordinators'][0]}}
{{groups['coordinators'][1]}}
{{groups['coordinators'][2]}}
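
For example, here is a minimal sketch of how this might look in a Jinja2 template that points a proxy at each host in the group (the group name "coordinators" and port 8080 are assumptions for illustration):

# templates/proxy.cfg.j2 - referring to hosts by index
server host1 {{ groups['coordinators'][0] }}:8080 check
server host2 {{ groups['coordinators'][1] }}:8080 check
server host3 {{ groups['coordinators'][2] }}:8080 check

If you don't need a fixed position per host, a loop handles a group of any size:

{% for host in groups['coordinators'] %}
server {{ host }} {{ host }}:8080 check
{% endfor %}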

Presto Coordinator High Availability (HA)

Quick Recap – What is Presto?

Presto is an open-source distributed SQL query engine made by Facebook that runs as its own cluster.  It can point at an existing Hive metastore and run queries against the Hive tables in HDFS/etc using its own resources.  It is much faster than Hive as it does everything in memory rather than via map-reduce.  It can connect to numerous data sources aside from Hive as well (though I have only used it with Hive over Azure's ADLS personally).

No High Availability

We started using Presto in an enterprise use case, and I was astounded to find out that it doesn't have any high availability (HA) built into it.  Presto as a product is wonderful – it is fast, easy to set up, provides pretty solid query diagnostics, handles massive queries in a very stable manner, etc.  So, the complete lack of an HA solution seems very strange given the strength of the product.

The critical component in Presto is the Coordinator, and it is a single point of failure.  It is the brains of the operation; it parses queries, breaks them into tasks, controls where work gets scheduled, etc.  Users only talk to the Coordinator node.

Coordinators vs Workers

Despite the importance of a coordinator, the only real differences between a coordinator and the rest of the worker nodes at a configuration level are that:

  1. Coordinators specify that they are a coordinator.
  2. Coordinators can run an embedded discovery server – all nodes in the cluster report to this discovery server (including the coordinator itself).  This discovery server can actually be run separately from the coordinator as well if desired; I think the provision of an embedded one is relatively new.  The discovery server is how the coordinator knows the full set of nodes it is managing.
  3. Coordinators can choose whether or not they themselves are used to process queries (as opposed to just managing them).

Again, coordinators take client connections (e.g. JDBC, ODBC, etc), and they take queries from those connections, parse them, validate them, break them into tasks, and schedule them across the pool of available workers.

Workers just report to the discovery server and handle the tasks they are allocated.
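
To make those differences concrete, here is a sketch of the relevant config.properties entries (the port, host name, and values are assumptions for illustration, but the property names are Presto's standard ones):

# Coordinator config.properties
coordinator=true
node-scheduler.include-coordinator=false   # don't process queries on the coordinator (point 3)
discovery-server.enabled=true              # run the embedded discovery server (point 2)
discovery.uri=http://localhost:8321        # where this node reports
http-server.http.port=8321

# Worker config.properties
coordinator=false
discovery.uri=http://coordinator-host:8321
http-server.http.port=8321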

HA Options

There is surprisingly little to find online about making Presto HA.  The only two solutions that I’ve seen are:

  • Run multiple clusters behind a load balancer.
  • Run multiple coordinators and some form of proxy service to ensure only one is ever active at a time.

Both of these have challenges and/or drawbacks.

If you are running multiple clusters, you probably want them to be active/active so you don't only use half of your nodes at a time.  Handling this properly requires that your proxy service issue a redirect to the target cluster's coordinator so that the client (e.g. a JDBC connection) can re-send the request and talk to that coordinator directly.  This will work, but you have still capped your maximum query size: by splitting your nodes into 2 or more clusters, they cannot all co-operate on very large queries.

Running multiple coordinators for HA is preferable as you get to combine all of your nodes into a single, large cluster that can attack large queries.  It is not trivial to do though: if two coordinators operate at once, they can degrade and even deadlock the cluster.  We'll dig into how to run with multiple coordinators now.

Using Multiple Coordinators

If you want to set up Presto using multiple coordinators, here is the general approach:

  1. Set up 2 or 3 nodes as coordinators.
  2. Tell them to run their own discovery servers in their config.
  3. Tell them to point at localhost for their own discovery server – this is quite important.
  4. Tell them not to do work (this keeps things more stable, but unfortunately it means your cluster has less power; you'll probably have far more workers than coordinators though, so it shouldn't be an issue).
  5. Install HAProxy on the coordinator nodes.  Register all the coordinator nodes in order and make all but the first one a "backup".  So, for example, run HAProxy on port 8385 and run Presto on port 8321.  All traffic will go to node #1 unless it's down, in which case it will go to node #2, and so on (see the config sketch after this list).
  6. Set up a load balancer in front of the coordinator nodes pointing at the HA proxy port and make sure traffic can get through.
  7. Set up all worker nodes to target the load balancer for the discovery server.  So, all workers target the load balancer, which reaches any coordinator's HAProxy, all of which forward to the primary coordinator.  The primary coordinator therefore always has all workers reporting to it, courtesy of HAProxy.
  8. As each coordinator itself only reports to its localhost discovery server, coordinators will not end up talking to each other’s discovery servers and will not interfere with each other.  Only one coordinator will ever have workers registered with it at a time.
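
Here is a minimal sketch of the HAProxy configuration from step 5 (the host names are assumptions; the ports match the example above, and /v1/info is Presto's REST info endpoint, which makes a reasonable health check):

# /etc/haproxy/haproxy.cfg (relevant section only)
frontend presto_in
    bind *:8385
    mode http
    default_backend presto_coordinators

backend presto_coordinators
    mode http
    option httpchk GET /v1/info
    server coord1 coordinator1:8321 check
    server coord2 coordinator2:8321 check backup
    server coord3 coordinator3:8321 check backup

With this in place on every coordinator node, traffic always lands on coord1 while it is healthy, then falls back to coord2, and so on.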

Let Coordinators Use the Load Balancer?

If you let coordinators use the load balancer, then they will all end up at the primary coordinator's discovery server.  Now… I have seen people online saying that they ran all nodes as coordinators (e.g. in the linked Google Groups conversations below), in which case this must be happening.

When I tried it though, I clearly got this warning from all the coordinators (and probably the workers too, but I didn’t check).  It comes out once a second.

2018-12-29T01:38:01.479Z WARN http-worker-176 com.facebook.presto.execution.SqlTaskManager Switching coordinator affinity from awe4s to 9mdsu
2018-12-29T01:38:01.806Z WARN http-worker-175 com.facebook.presto.execution.SqlTaskManager Switching coordinator affinity from 9mdsu to awe4s

Someone in a GitHub issue (I have lost the link) stated that this means the memory management may get muddled up, which sounds scary.  I did provide a link to the warning in the code below, which somewhat verifies this.

So, maybe it is a disaster, or maybe it’s harmless – but in any case, I didn’t want warning messages coming at me once a second that looked this bad.  So, I opted to only have each coordinator talk to its own discovery server, which makes them 100% idle (not processing anything) unless they are the current primary coordinator.  This waste is unfortunate, but as we’ll have far more workers than coordinators, it’s not the end of the world.

Drawbacks

This will keep your cluster running in the event that a coordinator fails.  Any queries active at the time a coordinator fails will fail though – we can't do anything about that unless Presto starts supporting HA internally.  Also, the fail-over period is very much tied to your HAProxy configuration and your load balancer health checks (mine takes around 30 seconds using an Azure load balancer and HAProxy; I'll be looking to reduce that).
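
On the HAProxy side, that fail-over window is governed by the health-check parameters on each server line; a sketch (these values are assumptions to tune, not recommendations):

# Check every 2s; mark down after 2 failed checks, back up after 2 good ones.
server coord1 coordinator1:8321 check inter 2s fall 2 rise 2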

Useful Links


Azure + Terraform + Linux Custom Script Extension (Scale Set or VM)

Overview

Whether you are creating a virtual machine or a scale set in Azure, you can specify a “Custom Script Extension” to tailor the VM after creation.

Terraform Syntax

I’m not going to go into detail on how to do the entire scale set or VM, but here is the full extension block that should go inside either one of them.

resource "azurerm_virtual_machine_scale_set" "some-name" {
  # ... normal scale set config ...

  extension {
    name                 = "your-extension-name"
    publisher            = "Microsoft.Azure.Extensions"
    type                 = "CustomScript"
    type_handler_version = "2.0"

    settings = <<SETTINGS
    {
    "fileUris": ["https://some-blob-storage.blob.core.windows.net/my-scripts/run_config.sh"],
    "commandToExecute": "bash run_config.sh"
    }
SETTINGS
  }
}

Things to notice include:

  1. The extension settings have to be valid JSON (e.g. no new-lines in strings, proper quoting).
  2. This can get frustrating, so it helps to use a "heredoc" style block to write the JSON (to help avoid quote escaping, etc). https://stackoverflow.com/a/2500451/857994
  3. Assuming you have a non-trivial use case, it is very beneficial to maintain your script(s) outside of your VM image.  After all… you don't want to go make a new VM image every time you find a typo in your script.  This is what fileUris does; it lets you refer to a script in Azure storage or in any reachable web location.
  4. You can easily create new Azure storage, create a blob container, and upload a file marked as public so that you can refer to it without authentication.  Don't put anything sensitive in it in that case though; if you must, use a storage key instead.  I prefer to make it public but then pass any "secret" properties to it from the command-to-execute; that way all variables are managed by Terraform at execution time (see the sketch after this list).
  5. The command-to-execute can call the scripts downloaded from the fileUris.  When the extension runs on your VM or scale set VM(s) after deployment, the scripts are downloaded to /var/lib/waagent/custom-script/download/1/script-name.sh and then run with the command-to-execute.  This location serves as the working directory.
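
Here is a sketch of what point 4 can look like – passing a Terraform-managed value to the script on the command line rather than baking it into the (public) script; var.db_password is an assumption for illustration, and the script would read it as $1:

    settings = <<SETTINGS
    {
    "fileUris": ["https://some-blob-storage.blob.core.windows.net/my-scripts/run_config.sh"],
    "commandToExecute": "bash run_config.sh '${var.db_password}'"
    }
SETTINGS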

Debugging Failures

Sometimes things can go wrong when running custom scripts – even things outside your control.  For example, on CentOS 7.5, I keep getting roughly 40% of my VMs stuck on "creating", and they clearly haven't run the scripts.

In this case, you can look at the following log file to get more information:

/var/log/azure/custom-script/handler.log
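
If you have SSH access to the VM, you can also watch it live while the extension runs:

sudo tail -f /var/log/azure/custom-script/handler.log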

Azure – Linux VM Image Creation – PowerShell – With Service Principal/Account

Overview

I was working on creating generalized VM images for use with scale sets and auto-scaling and I found it rather painful to get the complete set of examples for:

  1. De-provision user/etc from VM.
  2. Use Azure PowerShell with a service principal.
  3. Generalize the VM and create an image.

So, here’s a short mostly-code post on how to do that.

Specific Steps

Fair warning… as far as I know, you can’t use the VM after doing this… but you can create a new copy of it from the image, so that doesn’t matter much.

Before getting to PowerShell, run this in your VM to de-provision the most recently set up user account (e.g. I'll install everything on user "john", created with the Azure VM).  This will remove that user.

sudo waagent -deprovision+user

Now, just run the below commands after setting your own values for the 5 variables up top.  This will log in to the Resource Manager with the credentials you provide in the pop-up, and then it will stop and generalize the VM, then create an image from it and store that image in the same resource group as the VM.

$vmName = "YOUR_VM_NAME"
$rgName = "YOUR_RG_NAME"
$location = "YOUR_REGION"
$imageName = "YOUR_IMAGE_NAME"
$tenant = "YOUR_TENANT_ID"

$c = Get-Credential # Input your service principal client-id/secret.
Connect-AzureRmAccount -Credential $c -ServicePrincipal -Tenant $tenant

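# Stop the VM, mark it as generalized, then capture it as an image.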
Stop-AzureRmVM -ResourceGroupName $rgName -Name $vmName -Force
Set-AzureRmVm -ResourceGroupName $rgName -Name $vmName -Generalized
$vm = Get-AzureRmVM -Name $vmName -ResourceGroupName $rgName
$image = New-AzureRmImageConfig -Location $location -SourceVirtualMachineId $vm.Id
New-AzureRmImage -Image $image -ImageName $imageName -ResourceGroupName $rgName

Configuration Trouble?

  • If you’re not sure what a service account / principal is or how to create one, the process is quite involved and I highly recommend following one of the many Microsoft-provided tutorials.
  • You can find your tenant ID by clicking the directory + subscription button at the top of the portal OR by hovering over your name/info at the top right corner.
  • The region strings can be tricky, so check the Microsoft docs if you're not sure.  A US East 2 example is "EastUS2".

What’s Next?

Your VM image can now be found in that resource group – go to the portal and see.  You can go into the image in the portal and create a new VM from it, or you can use it to boot up a scale set, etc.

CentOS 7 and RHEL 7 – Increasing Open File Descriptors & Process Limits (AND SystemD / SystemCTL!)

What’s the Problem?

When deploying on RHEL7 or Centos7, it is fairly common to see a warning like the following one (which I just got while installing Presto from Facebook):

WARNING: Current OS file descriptor limit is 4096. Presto recommends at least 8192.

There are a variety of these issues… but the basic problem is that your OS sets limits on various resources, and sometimes we need to raise those limits depending on what we're running (especially when we're running large apps on large servers).

The ulimits being referred to here always end up being extra hard to edit, as you have to change them in multiple places, and most blogs/posts don't cover all of those places for some reason (having suffered through it multiple times now, I know that).

How Do We View the Limits?

In this warning, we see that the "OS file descriptor" limit is currently 4096.  So, let's look at the current settings with the "ulimit -a" command:

$> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257564
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

We can see here that "open files" – the file descriptor limit the warning is complaining about – is 1024, and that "max user processes" is 4096.  (Interestingly, the warning reported 4096; a process can run with different limits than your shell shows, which is exactly what the SystemD section at the end of this post deals with.)

So, let's increase both of those.

Increasing the Limits

Edit /etc/sysctl.conf and add:

fs.file-max = 65536
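
That raises the system-wide cap on open file handles.  To apply it without rebooting, reload the sysctl settings:

sudo sysctl -p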

Edit /etc/security/limits.conf and add:

* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535

For some reason, the nproc limit is also defined in a separate file located roughly at this path (the number in the file name can vary) – so please edit /etc/security/limits.d/20-nproc.conf and make its contents the following:

* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
root soft nproc unlimited

That last file is the one that most places miss.

Verifying the New Limits

Here's the last tricky part… if you run "ulimit -a" again now, it won't really look any better.  Log back in to your shell/server and then run it, and you'll see the settings are now updated (yay!):

$> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257564
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 65535
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

But What About SystemD and SystemCTL?

I felt victorious at this point, but alas, when I ran Presto and HAProxy, they both spat out warnings and/or errors again for the same reason.  What is this!?

It turns out I was running both under SystemD, and SystemD has its own way of managing these things.  So, in that case, the final step is to go to your unit file in /etc/systemd/system/your-app.service and add the following inside the [Service] section (the "…" just implies there may be content above or below it; just add those two properties to the existing section).

[Service]
...
LimitNPROC=65535
LimitNOFILE=65535
...

After adding that, you should run "sudo systemctl daemon-reload" and "sudo systemctl restart your-app" to apply the settings.
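
To confirm the service really picked up the new limits, you can inspect the running process itself (a quick sketch – substitute your own service name, and use the PID the first command prints):

# Get the main PID of the service.
sudo systemctl show presto -p MainPID
# Then check the limits the live process is actually using (12345 = the PID from above).
cat /proc/12345/limits | grep -iE 'open files|processes'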

And finally, everything is right with the world!

Docker + Windows 10 – Volume Mount Shows No Files // Firewall

I have now wasted roughly an hour on this on two separate occasions.  Basically, my Docker volume mount would stop showing files.

I dug through endless GitHub pages and error reports, tried making the Docker NAT private, and everything… but the problem ended up being that I went home from work and was using my VPN!

So, before spending too much time on the complicated solutions you find online; just start by disabling your VPN if you have one running and see if that helps first.

CentOS 7 / RHEL 7 Services with SystemD + Systemctl For Dummies – Presto Example

History – SystemV & Init.d

Historically in CentOS and RHEL, you would use System V init to run a service.  Basically, an application (e.g. Spring Boot) would provide an init.d script, and you would either place it in /etc/init.d or place a symbolic link there pointing to your script.

The scripts would have functions for start/stop/restart/status, and they would follow some general conventions.  Then you could use "chkconfig" to turn the services on so they would start with the system when it rebooted.

SystemD and SystemCTL

Things have moved on a bit, and now you can use SystemD instead.  It is a very nice alternative.  Basically, you put a "unit" file at /etc/systemd/system/your-app.service.  This unit file has basic information on what type of application you are trying to run and how it works.  You can specify the working directory, etc. as well.

Here is an example UNIT file for Facebook’s Presto application.  We would place this at /etc/systemd/system/presto.service.

[Unit]
Description=Presto
After=syslog.target network.target

[Service]
User=your-user-here
Type=forking
ExecStart=/opt/presto/current/bin/launcher start
ExecStop=/opt/presto/current/bin/launcher stop
WorkingDirectory=/opt/presto/current/bin/
Restart=always

[Install]
WantedBy=multi-user.target

Here are the important things to note about this:

  1. You specify the user the service will run as – it should have access to the actual program location.
  2. Type can be “forking” or “simple”.  Forking implies that you have specific start and stop commands to manage the service (i.e. it kind of manages itself).  Simple implies that you’re just running something like a bash script or a Java JAR that runs forever (so SystemD will just make sure to start it with the command you give and restart it if it fails).
  3. Restart=always will make sure that, as long as you started it in the first place, it comes back whenever it dies.  Try it; just kill -9 your application and it will come back.
  4. The install section is critical if you want the application to start up when the computer reboots.  You cannot enable it to start on boot without this.

Useful Commands

  • sudo systemctl status presto (or your app name) -> current status.
  • sudo systemctl stop presto
  • sudo systemctl start presto
  • sudo systemctl restart presto
  • sudo systemctl enable presto -> enable for starting on reboot of server.
  • sudo systemctl disable presto -> don’t start on reboot of server.
  • sudo systemctl is-enabled presto; echo $? -> show if it is currently enabled for start-on-boot.
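
And one more that pairs well with the above when a service won't start – tailing its logs (a standard SystemD facility, not specific to Presto):

sudo journalctl -u presto -f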