Azure LB Dropping Traffic Mysteriously – HAProxy / NGINX / Apache / etc.

Failure Overview

I lost a good portion of last week fighting dropped traffic and intermittent connection issues on a basic-tier Azure load balancer.  The project it served had been up and running for 6 months without configuration changes and had not been restarted in 100 days.  Restarting it did not help, so clearly something had changed in the environment.  The problem also started appearing in multiple deployments across different Azure subscriptions, implying it was not an isolated issue tied to a particular server.

Solution

After running a crazy number of tests and eventually escalating to Azure support, who reviewed the problem for over 12 hours, they pointed me at this:

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#types

“Do not translate or proxy a health probe through the instance that receives the health probe to another instance in your VNet as this configuration can lead to cascading failures in your scenario. Consider the following scenario: a set of third-party appliances is deployed in the backend pool of a Load Balancer resource to provide scale and redundancy for the appliances and the health probe is configured to probe a port that the third-party appliance proxies or translates to other virtual machines behind the appliance. If you probe the same port you are using to translate or proxy requests to the other virtual machines behind the appliance, any probe response from a single virtual machine behind the appliance will mark the appliance itself dead. This configuration can lead to a cascading failure of the entire application scenario as a result of a single backend instance behind the appliance. The trigger can be an intermittent probe failure that will cause Load Balancer to mark down the original destination (the appliance instance) and in turn can disable your entire application scenario. Probe the health of the appliance itself instead.”

I was using a load balancer over a scale set, and the load balancer pointed at HAProxy, which was designed to route traffic to the “primary” server.  So, I wanted Azure’s load balancer to consider every server up as long as it could route to the “primary” server, even if other things on that particular server were down.

But having the health probe check HAProxy meant that the probe itself was proxied to the “primary” server, which is exactly the scenario described above.

This seems like an Azure quirk to me… but they do have it documented.  Once I switched the health probe to target something not routed by HAProxy, the LB stabilized and everything was OK.
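In case it saves someone a step: the actual change is small.  Below is a rough sketch of repointing the probe with the Azure CLI – the resource group, load balancer, and probe names are placeholders, and port 8404 is just an example of a port the instance answers directly (say, a local health endpoint or HAProxy’s own stats/monitor listener) rather than one HAProxy proxies onward.

az network lb probe update --resource-group YourRgName \
--lb-name YourLbName \
--name YourProbeName \
--protocol Tcp \
--port 8404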


Azure VM Unresponsive, Can’t SSH

My VM Was Non-Responsive

Today I had an Azure virtual machine go down very unexpectedly.

I received error reports from users and tried to hit the related service endpoint myself… and sure enough, it didn’t come up.  Then I tried to SSH into the VM and couldn’t.

I hopped into the Azure portal, went to the VM, and things actually looked alright… it wasn’t stopped, or deallocated, or anything.

Why?

After several minutes of digging around the Azure portal for more information, the “Activity Log” suddenly showed a new entry.  This was fairly disconcerting, as the issue had been reported over half an hour earlier and I had already been on the portal for several minutes.

The activity log said I had a “health event” which was “updated”.  Expanding it revealed more events that had been “in progress”.  Clicking an “in progress” event gives you its JSON so you can dig into the details.  In my case, the bottom of the details said this:

    "properties": {
        "title": "We're sorry, your virtual machine isn't available because an unexpected failure on the host server",
        "details": null,
        "currentHealthStatus": "Unavailable",
        "previousHealthStatus": "Unknown",
        "type": "Downtime",
        "cause": "PlatformInitiated"
    }
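(As an aside: if the portal is slow to surface these, the same resource health entries can also be pulled from the activity log with the Azure CLI.  This is just a sketch – the resource group and one-hour window are placeholders, and the JMESPath filter assumes the entries carry the “ResourceHealth” category.)

az monitor activity-log list --resource-group YourRgName \
--offset 1h \
--query "[?category.value=='ResourceHealth']"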

So, the physical host running my VM in Azure had died.  Azure automatically noticed this and moved the VM to a new physical host, though much more slowly than I would have liked.

The VM came back up after a few more minutes and all was right with the world.  So… the moral of the story is that if your VM is unresponsive, it may be because the host died, and you may have to wait quite a while before the activity log shows any information about it.  But it apparently does auto-resolve, which is nice.

Azure CLI Get Scale Set Private IP Addresses

Getting Scale Set Private IPs is Hard

I have found that it is impressively difficult to get the private IP addresses of Azure scale set instances in almost every tool.

For example, if you create a scale set in Terraform, even Terraform will not give you the addresses or a way to look them up to act on them in later steps.  Similarly, you cannot easily list the addresses in Ansible.

You can, however, build dynamic inventories in Ansible based on scripts.  So, in order to make an Ansible playbook dynamically target the nodes in a recently created scale set, I decided to use a dynamic inventory built on the Azure CLI (there is a sketch of such a script after the CLI output below).

Azure CLI Command

Here is an Azure CLI command (version 2.0.58) which directly lists the private IP addresses of scale set nodes.  I hope it helps you as it has helped me.  It took a while to piece together from the docs, but it’s pretty simple now that it’s done.

az vmss nic list --resource-group YourRgName \
--vmss-name YourVmssName \
--query "[].ipConfigurations[].privateIpAddress"

The output will look similar to this, though I have swapped the real IP addresses for fake ones as an example.

[
"123.123.123.123",
"123.123.123.124"
]
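And since the goal was an Ansible dynamic inventory, here is a rough sketch of a script wrapping that command.  The “vmss” group name, the jq dependency, and the resource group / scale set names are all assumptions – adjust them for your environment.

#!/usr/bin/env bash
# Ansible dynamic inventory: emits the scale set's private IPs as hosts
# in a "vmss" group when invoked with --list.
set -euo pipefail

if [ "${1:-}" = "--list" ]; then
  ips=$(az vmss nic list --resource-group YourRgName \
    --vmss-name YourVmssName \
    --query "[].ipConfigurations[].privateIpAddress" -o json)
  jq -n --argjson ips "$ips" '{vmss: {hosts: $ips}, _meta: {hostvars: {}}}'
else
  # Ansible also calls the script with --host <name>; no per-host vars here.
  echo '{}'
fi

Save it as something like vmss_inventory.sh (the name is just illustrative), make it executable, and point Ansible at it with -i, e.g. ansible-playbook -i ./vmss_inventory.sh site.yml.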

Azure Scale Set vs Availability Set

Why Was I Worried?

I have been habitually using scale sets for all of my needs, as long as my requirements were simply multiple copies of a VM image running safely.  Then I started to worry about the difference between a scale set and an availability set… were my scale set VMs not safe?

TL;DR: I actually read the Azure and Azure CLI documentation and made a simple but cool command below that put my mind at ease for scale sets, so feel free to skip to that if you like.

Research

There is a good Stack Overflow question right here which I added an answer to just now.  It has quite a few good answers about availability sets vs. scale sets, including some info about a scale set having 5 fault domains by default.  So, I recommend starting there if you’re interested in digging in.

A good summary of what I found is that:

  • Availability sets will, by default, spread your resources across fault domains to ensure that an outage of one (due to a power or network issue, etc.) does not affect another.
  • Availability sets also allow mixing of resources; e.g. 2 VMs with different configurations.
  • Scale sets only allow an identical image to be deployed, and they provide the ability to scale it out linearly.
  • Scale sets implicitly have one “placement group”.  If you want to go over 100 VMs, you have to remove that restriction (see the sketch after this list).
  • A placement group has 5 fault domains and is similar to (or maybe the same as) an availability set.
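For the placement group restriction above, this is roughly how you can inspect and relax it with the Azure CLI.  Treat it as a sketch – the names are placeholders, and I have not verified the behaviour of flipping the flag on an existing scale set across every CLI version.

az vmss show --resource-group YourRgName --name YourVmssName \
--query singlePlacementGroup

The first command shows whether the scale set is pinned to a single placement group; the second allows it to span multiple placement groups, which is what lets it grow past 100 VMs.

az vmss update --resource-group YourRgName --name YourVmssName \
--set singlePlacementGroup=false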

Validation

As I’m responsible for highly available infrastructure, I wasn’t keen on just accepting this at face value.  So, I fiddled around with the Azure CLI for scale sets and made this simple command, which shows that my 10-instance scale set is indeed spread across multiple fault domains – I hope you find it useful too.

az vmss get-instance-view --subscription "your-subscription-id" \
--resource-group "your-rg" --name "your-scale-set-name" \
--instance-id "*" | grep platformFaultDomain

    "platformFaultDomain": 0,
    "platformFaultDomain": 1,
    "platformFaultDomain": 2,
    "platformFaultDomain": 4,
    "platformFaultDomain": 0,
    "platformFaultDomain": 1,
    "platformFaultDomain": 3,
    "platformFaultDomain": 4,
    "platformFaultDomain": 2,
    "platformFaultDomain": 3

Here are some additional good resources: