Azure LB Dropping Traffic Mysteriously – HAProxy / NGINX / Apache / etc.

Failure Overview

I lost a good portion of last week fighting dropped traffic and intermittent connection issues behind a basic-tier Azure load balancer.  The project it was serving had been up and running for 6 months without configuration changes and had not been restarted in 100 days.  Restarting it did not help, so clearly something had changed in the environment.  The problem also started appearing in multiple deployments in different Azure subscriptions, implying that it was not isolated to one server or one deployment.

Solution

After doing a crazy amount of tests and eventually escalating to Azure support, who reviewed the problem for over 12 hours, Azure support pointed out this:

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#types

“Do not translate or proxy a health probe through the instance that receives the health probe to another instance in your VNet as this configuration can lead to cascading failures in your scenario. Consider the following scenario: a set of third-party appliances is deployed in the backend pool of a Load Balancer resource to provide scale and redundancy for the appliances and the health probe is configured to probe a port that the third-party appliance proxies or translates to other virtual machines behind the appliance. If you probe the same port you are using to translate or proxy requests to the other virtual machines behind the appliance, any probe response from a single virtual machine behind the appliance will mark the appliance itself dead. This configuration can lead to a cascading failure of the entire application scenario as a result of a single backend instance behind the appliance. The trigger can be an intermittent probe failure that will cause Load Balancer to mark down the original destination (the appliance instance) and in turn can disable your entire application scenario. Probe the health of the appliance itself instead.”

I was using a load balancer over a scale set, and the load balancer pointed at HAProxy, which was set up to route traffic to the “primary” server.  So, I wanted Azure’s load balancer to consider every server up as long as it could route to the “primary” server, even if other things on that particular server were down.

But having the health probe check HAProxy meant that the health probe was routed to the “primary” server and triggered this error.

This seems like an Azure quirk to me… but they have it documented.  Once I switched the health probe to target something not routed by HAProxy, the LB stabilized and everything was ok.
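
As a rough sketch of that change with the Azure CLI, an existing probe can be re-pointed at a port that HAProxy does not forward.  The resource names and port 8404 below are hypothetical placeholders, and this assumes something local to each VM (not a proxied backend) answers on that port.

```shell
# Re-point an existing load balancer health probe at a port that is
# answered locally on each VM instead of being proxied by HAProxy.
# $1 = resource group, $2 = load balancer, $3 = probe name, $4 = port
# (all hypothetical values supplied by the caller).
retarget_probe() {
  az network lb probe update \
    --resource-group "$1" \
    --lb-name "$2" \
    --name "$3" \
    --protocol Tcp \
    --port "$4"
}

# Example usage:
# retarget_probe "your-resource-group" "your-lb" "your-probe" 8404
```

A plain TCP probe is used here only because it is the simplest option; an HTTP probe against a locally served path would work on the same principle.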

Azure VM Unresponsive, Can’t SSH

My VM Was Non-Responsive

Today I had an Azure virtual machine go down very unexpectedly.

I received error reports from users and tried to go to the related service endpoint myself… and sure enough, it didn’t come up.  Then, I tried to ssh onto the VM and I couldn’t.

I hopped into the Azure portal, went to the VM, and things actually looked alright… it wasn’t stopped, or de-allocated, or anything.

Why?

After multiple minutes of digging around the Azure portal for more information, suddenly the “Activity Log” popped up with a new entry.   This was relatively disconcerting as the issue had been reported over half an hour ago and I had been on the portal for multiple minutes.

The activity log said I had a “health event” which was “updated”.  Upon expanding it, I could see more events that had been “in progress”.  When you click the “in progress” event, you can get JSON for it and look into the details.  In my case, the bottom of the details said this:

    "properties": {
        "title": "We're sorry, your virtual machine isn't available because an unexpected failure on the host server",
        "details": null,
        "currentHealthStatus": "Unavailable",
        "previousHealthStatus": "Unknown",
        "type": "Downtime",
        "cause": "PlatformInitiated"
    }

So, the physical host which was running my VM in Azure died. Azure automatically noticed this and moved the VM to a new physical host, though much more slowly than I would have liked.

The VM came up after a few more minutes and all was right with the world. So… the moral of the story is that if your VM is unresponsive, it may be because the host died, and you may have to wait quite a while to see information on that in the activity log. But it does apparently resolve itself, which is nice.
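
If you would rather poll for these events from a terminal than wait on the portal, something like the following sketch may help.  The two-hour window is arbitrary, and the filter on category.value plus the selected field names are my assumptions about the shape of the event JSON, so adjust them to what your own events contain.

```shell
# List recent Resource Health entries from the activity log for one
# resource group. $1 = resource group name (placeholder).
# The JMESPath field names are assumptions about the event JSON shape.
show_health_events() {
  az monitor activity-log list \
    --resource-group "$1" \
    --offset 2h \
    --query "[?category.value=='ResourceHealth'].{time:eventTimestamp, status:status.value, title:properties.title}" \
    --output table
}

# Example usage:
# show_health_events "your-resource-group"
```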

Azure: Tagging All Resources in a Resource Group With its Tags

Recently, I had to go back and correctly tag a whole bunch of items in a new resource group, none of which had been given tags.

This kind of task can be daunting in the Azure portal… you have to click each resource, click the tags tab, and then type each key/value, for each tag, and save.  So… tagging 50 resources with 5 tags each ends up being 50 * 2 * 5 + 50 = 550 clicks at minimum, plus all the typing!  Clearly, this is a task better suited for the CLI.

Using the Azure CLI

Microsoft actually has a very full-featured tutorial on this subject right here.  The more advanced code they provide will find every resource you have in every group and give each resource the tags from its group.  It will even optionally retain existing tags for resources that are already tagged.

I wanted something a little simpler with the login included so that I can quickly copy it in to fix a resource group here and there without worrying about affecting all the other resource groups.  So, here is the code. It also counts the items so you can see progress as it can take some time.

Note, I wanted to forcibly replace all the tags on the resources with the RG tags as some of them were incorrect. You can get code to merge with existing tags from the link noted above if you prefer.

tenant="your-tenant-id"
subscription="your-subscription-name"
rg="your-resource-group-name"

# Log in to Azure - it will give you a message and code to log in via
# a web browser on any device
az login --tenant "${tenant}"

# Select the subscription explicitly so later commands target it.
az account set --subscription "${subscription}"

# Show subscriptions just to show that we're on the correct one.
echo "Listing subscriptions:"
az account list --output table

# Get the tags from the resource group as "key=value" entries, one per
# array element, so that values containing spaces stay intact.
# (Needs bash 4+ for mapfile, and jq.)
mapfile -t tags < <(az group show -n "$rg" --query tags --output json \
  | jq -r 'to_entries[] | "\(.key)=\(.value)"')

# Get all resources in the target resource group, and loop through
# them applying the tags from the resource group. Count them to show
# progress as this can take time.
i=0
mapfile -t resids < <(az resource list -g "$rg" --query "[].id" --output tsv)
for resid in "${resids[@]}"
do
az resource tag --tags "${tags[@]}" --ids "$resid"
i=$((i + 1))
echo "$i"
done

Also note that you can find the total number of resources you are targeting in advance with this command so the counter is more practical :).

az resource list --resource-group "your-rg-name" --query "[].name" | jq length
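
If you would rather merge the group's tags into whatever a resource already has instead of replacing them, a minimal sketch using `az tag update` could look like this.  The function name and the example values are hypothetical; I have not run this against the resource group described above.

```shell
# Merge the given tags into a resource's existing tags instead of
# replacing them. $1 = full resource id, remaining args = key=value tags.
merge_tags() {
  resid="$1"
  shift
  az tag update --resource-id "$resid" --operation Merge --tags "$@"
}

# Example usage:
# merge_tags "/subscriptions/.../your-resource-id" "Owner=you" "Environment=dev"
```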

MRemoteNG – SSH – Connect to Azure VM

What is MRemoteNG?

MRemoteNG is a nice Windows tool for managing multiple SSH sessions (and session configurations) in one window – so you can log onto 10 servers and hop around trivially.  It is built on top of PuTTY.

How Do You Use It With Azure VMs?

  • When you create a VM in Azure, you give it a public key (assuming you didn’t use password authentication, which you should generally avoid).
  • You can generate a key pair with PuTTYGen if you don’t have one (but then I’m assuming that you do have one if you already created the VM).
  • Take the private key corresponding to that public key and save it into a file (it may already be in an “id_rsa” file in your .ssh directory in your user directory; e.g. C:\users\your-name\.ssh\id_rsa).
  • Open PuTTYgen (it should come with MRemoteNG or PuTTY; otherwise you can get it yourself).
  • Load the private key file.
  • Click “Save private key” with Type = RSA selected (2048 bits is fine).  It will save as a “PPK” file.
  • Save it to your .ssh folder for consistency, or anywhere else – it really doesn’t matter much.
  • Open MRemoteNG -> Tools -> Options -> Advanced -> Launch Putty -> Expand “SSH” -> Click Auth (Don’t expand) -> Put your PPK file path in “Private key file for authentication”.
  • Click Session in PuTTY, give the session a name in the “Saved Sessions” text box, and then click Save.  It should appear in the box below that.
  • Now you have a saved session that can use this private key via a PPK file.
  • Close PuTTY, make a new connection in MRemoteNG and select “Putty Session” = the new session you saved.  It should be listed as an option.
  • Celebrate!

Azure PaaS Postgres 10 Database Create + Connect CentOS PSQL or DBeaver

Today I started using the “Azure Database for PostgreSQL” PaaS service offering.  It went pretty smoothly, but connecting took a little more effort than I expected (all for good reasons!).

Creating the PostgreSQL Service

You can find the creation screen in the Azure portal by pressing (+), clicking Databases, and scrolling down.

As with most things in Azure, creating the service through the portal was pretty trivial.  You basically just provide the name, region, resource group, subscription, select the size you want, specify a user + password, and you’re done!  It takes around a minute to complete with a smallish database size.

[Screenshot: postgres-create]

Connecting to the Database

We’re going to connect with DBeaver (it’s like SQuirreL and DbVisualizer if you haven’t heard of it).  Then we will also connect with the “psql” command line utility from Linux.  This should be pretty quick – but there are two wrenches in the works:

  1. SSL is enabled.
  2. Azure has blocked all inbound IPs by default – nothing can connect in.

Connecting with DBeaver

  • Go to your Postgres instance in the portal and view the “Overview” screen.
  • Open DBeaver, create a new Postgres connection.
  • Copy the server name from the portal into the host section of DBeaver.
  • Copy the Server Admin Login name from the portal into the user name section of DBeaver.
  • Type in your password for that Admin user.
  • Set the database to “postgres” in DBeaver.
  • You can leave the port as the default 5432.
  • Now, go to driver properties on the left of DBeaver and set:
    • ssl to true
    • sslmode to require

This is shown here:

[Screenshot: dbeaver-postgres]

At this point, you’ve got all the connection details in DBeaver set up properly; but you still can’t connect.  You’ll have to go into the Azure portal, click “Connection Security”, and then create a firewall rule that allows your IP in.  You can also, alternatively, add in a pre-defined subnet you have for yourself, your company, etc.  At that point, everything on that subnet will be able to connect properly.

After this, you should be able to “Test Connection” successfully.
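
The same firewall rule can also be created from the CLI.  This is a sketch only: it assumes the single-server flavor of the service (the one that uses user@host logins), and the rule name and IP address are made-up placeholders.

```shell
# Allow a single client IP through the server-level firewall.
# $1 = resource group, $2 = short server name (not the full host name),
# $3 = rule name, $4 = the IP address to allow (all placeholders).
allow_ip() {
  az postgres server firewall-rule create \
    --resource-group "$1" \
    --server-name "$2" \
    --name "$3" \
    --start-ip-address "$4" \
    --end-ip-address "$4"
}

# Example usage:
# allow_ip "your-resource-group" "yourhost" "allow-my-workstation" "203.0.113.10"
```

Passing a different end address than the start address would open a whole range rather than one IP.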

Connecting with PSQL from CentOS 7

Assuming you opened up the firewall or subnet as noted at the end of the previous example with DBeaver, you can then just:

Install the PSQL client library:

And connect with the psql utility:

  • psql "sslmode=require host=yourhost.postgres.database.azure.com dbname=postgres user=youruser@yourhost"
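
The part that is easy to get wrong is the user name, which has to carry the short server name as a suffix.  A small helper (the function name is my own, purely for illustration) makes the pattern explicit:

```shell
# Build the connection string used above from its two variable parts.
# $1 = short server name, $2 = admin login without the @server suffix.
azure_pg_conninfo() {
  echo "sslmode=require host=$1.postgres.database.azure.com dbname=postgres user=$2@$1"
}

# Example usage (psql will prompt for the password):
# psql "$(azure_pg_conninfo yourhost youruser)"
```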

Azure + Packer – Create Image With Only Access to Resource Group (Not Subscription)

What Was the Problem?

I recently had to create a VM image for an Azure scale set using Packer.  Overall, the experience was great… but getting off the ground took me about an hour.  This was because most tutorials/examples assume you have contributor access to the whole subscription, but I wanted to do it with a service principal that just had access to a specific resource group.

Working Configuration

Basically, you just need the right combination (or lack thereof) of fields.

The tricky ones to get right were the combination of build_resource_group_name, managed_image_resource_group_name, and managed_image_name while leaving out location.

There was a GitHub issue thread on this (https://github.com/hashicorp/packer/issues/5873) that went on for a very long time before someone finally worked out that you had to leave out location when you wanted to do this without subscription-level contributor access.

Here is a reference config file that works if you populate your details:

{
  "builders": [
    {
      "type": "azure-arm",
      "client_id": "your-client-id",
      "client_secret": "your-client-secret",
      "tenant_id": "your-tenant-id",
      "subscription_id": "your-subscription",
      "build_resource_group_name": "your-existing-rg",
      "managed_image_resource_group_name": "your-existing-rg",
      "managed_image_name": "your-result-output-image-name",
      "os_type": "Linux",
      "image_publisher": "OpenLogic",
      "image_offer": "CentOS",
      "image_sku": "7.5",
      "azure_tags": {
        "ApplicationName": "Some Sample App"
      },
      "vm_size": "Standard_D2s_v3"
    }
  ],
  "provisioners": [
    {
      "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
      "inline": [
        "yum -y install haproxy-1.5.18-8.el7",
        "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync"
      ],
      "inline_shebang": "/bin/sh -x",
      "type": "shell"
    }
  ]
}
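
For completeness, here is a sketch of the two steps around that file: creating a service principal whose role is scoped to just the resource group, and then running the build.  The function names, the display name, and the file name are hypothetical, and this may not match how your own service principal was provisioned.

```shell
# Create a service principal whose Contributor role is scoped to a single
# resource group rather than the whole subscription.
# $1 = subscription id, $2 = resource group, $3 = display name (placeholders).
create_rg_scoped_sp() {
  az ad sp create-for-rbac \
    --name "$3" \
    --role Contributor \
    --scopes "/subscriptions/$1/resourceGroups/$2"
}

# Validate the template first, then build the image only if it is valid.
# $1 = path to the Packer JSON template.
build_image() {
  packer validate "$1" && packer build "$1"
}

# Example usage:
# create_rg_scoped_sp "your-subscription-id" "your-existing-rg" "packer-sp"
# build_image "image.json"
```

The output of the first command includes the client id, client secret, and tenant id that go into the config file.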