Azure CLI Get Scale Set Private IP Addresses

Getting Scale Set Private IPs is Hard

I have found that it is impressively difficult to get the private IP addresses of Azure scale set instances in almost every tool.

For example, if you go and create a scale set in Terraform, even Terraform will not provide you the addresses or a way to look them up to act upon them in future steps.  Similarly, you cannot easily list the addresses in Ansible.

You can make dynamic inventories in Ansible based on scripts though.  So, in order to make an ansible playbook target the nodes in a recently created scale set dynamically, I decided to use a dynamic inventory created by the Azure CLI.

Azure CLI Command

Here is an azure CLI command (version 2.0.58) which directly lists the IP addresses of scale set nodes.  I hope it helps you as it has helped me.  It took a while to build it out from the docs but its pretty simple now that it’s done.

az vmss nic list --resource-group YourRgName \
--vmss-name YourVmssName \
--query "[].ipConfigurations[].privateIpAddress"

The output will look similar to this, though I just changed the IP addresses to fake ones here an an example.

[
"123.123.123.123",
"123.123.123.124"
]

Azure Scale Set vs Availability Set

Why Was I Worried?

I have been habitually using scale sets for all of my needs as long as my requirements only involved needing multiple copies of a VM image running safely. Then I started to worry about the difference between a scale set and an availability set… were my scale set VMs not safe?

TLDR; I actually read Azure and Azure CLI documentation and made a simple but cool command below that put my mind at ease for scale sets, so feel free to skip to that if you like.

Research

There is a good stack overflow right here which I added to just now.  It has quite a few good answers about availability sets vs scale sets, including some info about a scale set by default having 5 fault domains.  So, I recommend starting there if you’re interested in digging in.

A good summary of what I found is that:

  • Availability sets by default will spread your resources over fault domains to ensure that outage of one due to a power or network issue, etc does not affect another.
  • Availability sets also allow mixing of resources; e.g. 2 VMs with different configuration.
  • Scale sets only allow you to have an identical image deployed and they provide the ability to scale it out linearly.
  • Scale sets implicitly have one “placement group”.  If you want to go over 100 VMs, you have to remove that restriction.
  • A placement group has 5 fault domains and is similar (or maybe the same as) an availability set.

Validation

As I’m responsible for highly available infrastructure, I wasn’t keen on just accepting this.  So, I fiddled around with the Azure CLI for scale sets and made this simple command which indeed shows my 10 instance scale set is indeed spread across multiple fault domains – I hope you find it useful too.

az vmss get-instance-view --subscription "your-subscription-id" \ 
--resource-group "your-rg" --name "your-scale-set-name" \
--instance-id "*" | grep platformFaultDomain

    "platformFaultDomain": 0,
    "platformFaultDomain": 1,
    "platformFaultDomain": 2,
    "platformFaultDomain": 4,
    "platformFaultDomain": 0,
    "platformFaultDomain": 1,
    "platformFaultDomain": 3,
    "platformFaultDomain": 4,
    "platformFaultDomain": 2,
    "platformFaultDomain": 3

Here are some additional good resources:

 

Azure Custom Script Extension – Text File Busy – Centos7.5 – VM Stuck on Creating

What’s Wrong

I’ve been building a scale set on Azure and have repeatedly observed around 40% of my VMs getting stuck on “Creating” in the azure portal.  The scale set uses a custom script VM extension and runs on the Centos 7.5 OS.

Debugging

After looking around online a lot, I came across numerous Git Hub issues against the custom script extension or the Azure Linux agent.  They are for varying OS’s, but they often involve the VM getting stuck in creating.  For example, here is one vs Ubuntu:

If you go to this file “/var/log/azure/custom-script/handler.log”, you can see details about what the custom script extension is doing.  Also note that “/var/log/waagent.log” can be useful as well.

$> vi /var/log/azure/custom-script/handler.log
+ /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension install
/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-shim: line 77: /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension: Text file busy

In my case, it failed with “Text file busy”. for some reason. Again, there are numerous Git Hub entries for this – but no solutions:

Somewhere else online I saw reports that the Agent was failing while downloading files.  Note that if your plugin download works, you should see the script and more info in this location -> /var/lib/waagent/custom-script/download/1/script-name.sh (in my case, it is not there).

My custom script extension takes a script out of Azure Blob storage… so I’m going to try to bundle that script into the image and just issue the run command from the custom script extension to see if that makes it go away.

Result – Failure

Taking the script out of blob storage and putting it into the VM itself, and just calling it with the custom script extension’s command-to-execute mitigated this issue.  This is unfortunate as internalizing the script means every tweak requires a new image… but at least the scale set can work properly now and be stable :).

Avoiding downloading files made the issue less likely to occur… but it did come back.  It is just rarer.

I tried downgrading the Azure Linux Agent (waagent) to a version noted in one of those Git Hub issues.  It did not help.  I also tried reverting to Centos 7.3 which didn’t help.  I can’t find any way to make this work reliably.

Workaround

My workaround will be:

  • Take all customizations I was doing with the agent.
  • Move them into a packer build (from Hashicorp).
  • Packer will build the image I need for each environment, fully configured and working.
  • This way, I just run the image and don’t worry about modifying its config with the custom script extension.

This is painful and frustrating, so I will also raise the bug with Microsoft while doing the workaround.

 

Azure + Terraform + Linux Custom Script Extension (Scale Set or VM)

Overview

Whether you are creating a virtual machine or a scale set in Azure, you can specify a “Custom Script Extension” to tailor the VM after creation.

Terraform Syntax

I’m not going to go into detail on how to do the entire scale set or VM, but here is the full extension block that should go inside either one of them.

resource "azurerm_virtual_machine_scale_set" "some-name" {
  # ... normal scale set config ...

  extension {
    name                 = "your-extension-name"
    publisher            = "Microsoft.Azure.Extensions"
    type                 = "CustomScript"
    type_handler_version = "2.0"

    settings = <<SETTINGS
    {
    "fileUris": ["https://some-blob-storage.blob.core.windows.net/my-scripts/run_config.sh"],
    "commandToExecute": "bash run_config.sh"
    }
SETTINGS
  }
}

Things to notice include:

  1. The extension settings have to be valid JSON (e.g. no new-lines in strings, proper quoting).
  2. This can get frustrating, so it helps to use a bash “heredoc” style block to write it the JSON (to help avoid quote escaping, etc). https://stackoverflow.com/a/2500451/857994
  3. Assuming you have a non-trivial use case, it is very beneficial to maintain your script(s) outside of your VM image.  After all… you don’t want to go make a new VM image every time you find a typo in your script.  This is what fileUris does; it lets you refer to a script in azure storage or in any reachable web location.
  4. You can easily create new Azure storage, create a blob container, and upload a file and mark it as public so that you can refer to it without authentication.  Don’t put anything sensitive in it in this case though; if you do, use a storage key instead.  I prefer to make it public but then pass any “secret” properties to it from the command-to-execute, that way all variables are managed by Terraform at execution time.
  5. The command-to-execute can call the scripts downloaded form the fileURIs.  When the extension is run on your VM or scale set VM(s) after deployment, the scripts are uploaded to /var/lib/waagent/custom-script/download/1/script-name.sh and then run with the command-to-execute.  This location serves as the working directory.

Debugging Failures

Sometimes things can go wrong when running custom scripts; even things outside your control.  For example, on Centos7.5, I keep getting 40% of my VMs or so stuck on “creating” and they clearly haven’t run the scripts.

In this case, you can look at the following log file to get more information:

/var/log/azure/custom-script/handler.log