Azure LB Dropping Traffic Mysteriously – HAProxy / NGINX / Apache / etc.

Failure Overview

I lost a good portion of last week fighting dropped traffic / intermittent connection issues on a Basic-tier Azure load balancer.  The deployment it fronted had been up and running for six months without configuration changes and had not been restarted in 100 days.  Restarting it did not help, so clearly something had changed in the environment.  The problem also started happening in multiple deployments in different Azure subscriptions, implying it was not isolated to a single server or subscription.

Solution

After running a crazy number of tests and eventually escalating to Azure support, who reviewed the problem for over 12 hours, they pointed me at this:

https://docs.microsoft.com/en-us/azure/load-balancer/load-balancer-custom-probe-overview#types

“Do not translate or proxy a health probe through the instance that receives the health probe to another instance in your VNet as this configuration can lead to cascading failures in your scenario. Consider the following scenario: a set of third-party appliances is deployed in the backend pool of a Load Balancer resource to provide scale and redundancy for the appliances and the health probe is configured to probe a port that the third-party appliance proxies or translates to other virtual machines behind the appliance. If you probe the same port you are using to translate or proxy requests to the other virtual machines behind the appliance, any probe response from a single virtual machine behind the appliance will mark the appliance itself dead. This configuration can lead to a cascading failure of the entire application scenario as a result of a single backend instance behind the appliance. The trigger can be an intermittent probe failure that will cause Load Balancer to mark down the original destination (the appliance instance) and in turn can disable your entire application scenario. Probe the health of the appliance itself instead.”

I was using a load balancer over a scale set, with the load balancer pointed at HAProxy, which was designed to route traffic to the “primary” server.  So, I wanted Azure’s load balancer to consider every server up as long as it could route to the “primary” server, even if other things on that specific server were down.

But having the health probe check HAProxy meant that the probe itself was proxied to the “primary” server, which triggered exactly the failure mode described above.

This seems like an Azure quirk to me… but they do have it documented.  Once I switched the health probe to target something not routed by HAProxy, the LB stabilized and everything was fine.
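
For illustration, here is a minimal sketch of what the fixed HAProxy config can look like, assuming a simple HTTP setup.  The port numbers, frontend/backend names, and server address are hypothetical, not my actual config.  The idea is to give HAProxy a dedicated probe frontend that it answers itself via monitor-uri, so the probe reflects the health of the HAProxy instance and is never proxied to a backend:

# Dedicated health-probe frontend: point the Azure health probe at port 8888.
# HAProxy answers 200 OK locally, so nothing is proxied to the "primary" server.
frontend azure_probe
    bind *:8888
    mode http
    monitor-uri /healthz

# Normal traffic still flows through to the backend as before.
frontend main
    bind *:80
    mode http
    default_backend primary

backend primary
    server primary01 10.0.0.4:80 check    # hypothetical "primary" server

With this, the probe only fails when the HAProxy instance itself is down, which is exactly what the Azure documentation tells you to probe.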


HAProxy + CentOS 7 (or RHEL 7) – Won’t Bind to Any Ports – systemd!

What are the Symptoms?

This has bitten me badly twice now. I was deploying CentOS 7.5 servers and trying to run HAProxy on them through systemd (I’m not sure if it is an issue otherwise).

Basically, no matter which port I used, I got this message:

Starting frontend main: cannot bind socket [0.0.0.0:80]

Note that because I was too lazy to set up separate logging in the HAProxy config, I found this message in /var/log/messages along with the other system messages.
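
As an aside, if you would rather have HAProxy log to its own file, a common CentOS 7 setup looks roughly like this (the local2 facility and the file path are conventional choices, not requirements):

# In /etc/haproxy/haproxy.cfg, global section: log to the local syslog daemon
global
    log 127.0.0.1 local2

# In /etc/rsyslog.d/haproxy.conf: listen for local UDP syslog, route local2 to its own file
$ModLoad imudp
$UDPServerRun 514
$UDPServerAddress 127.0.0.1
local2.* /var/log/haproxy.log

Then restart both daemons: sudo systemctl restart rsyslog haproxy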

Of course, seeing this, your first thought is “something else is already listening on that port!”… but nope.  The permissions are set up properly too, etc.
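
If you want to rule out a port conflict quickly rather than take my word for it, one way is:

# List any process listening on TCP port 80; empty output means the port is free
sudo ss -ltnp 'sport = :80'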

What is the Problem?

The problem here is actually SELinux.  I haven’t dug into exactly why, but when HAProxy runs under systemd, SELinux will deny it access to all ports unless you go out of your way to allow it.
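
You can usually confirm that SELinux is the culprit by checking the audit log for denials and looking at the relevant boolean (the audit tools are installed by default on CentOS 7):

# Show recent SELinux AVC denials mentioning haproxy
sudo ausearch -m avc -ts recent | grep -i haproxy

# Check the boolean that gates this; it is off by default
getsebool haproxy_connect_any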

How Do We Fix It?

Thankfully the fix is very simple: just set this SELinux boolean as a root/sudo user:

sudo setsebool -P haproxy_connect_any 1

…and voilà! If you restart HAProxy, it will bind fine.  I spent a lot of time on this before I found decent documentation and forum references.  I hope this helps you fix it faster!  I also found a Stack Overflow question eventually… but the accepted/good answer is about ten answers down, so I missed it the first few times.
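
For completeness, the -P flag above makes the boolean persistent across reboots, and you can confirm everything is happy with:

sudo systemctl restart haproxy
systemctl status haproxy    # should now report active (running)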