What’s Wrong
I’ve been building a scale set on Azure and have repeatedly observed around 40% of my VMs getting stuck on “Creating” in the Azure portal. The scale set uses a custom script VM extension and runs CentOS 7.5.
Debugging
After a lot of searching online, I came across numerous GitHub issues filed against the custom script extension or the Azure Linux agent. They cover various operating systems, but they often involve the VM getting stuck in “Creating”. For example, here is one for Ubuntu:
The file “/var/log/azure/custom-script/handler.log” shows details about what the custom script extension is doing. “/var/log/waagent.log” can be useful as well.
$> vi /var/log/azure/custom-script/handler.log
+ /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension install
/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-shim: line 77: /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension: Text file busy
In my case, it failed with “Text file busy” for some reason. Again, there are numerous GitHub issues about this, but no solutions:
Somewhere else online I saw reports that the agent was failing while downloading files. If the extension’s download works, you should see the script and more information under /var/lib/waagent/custom-script/download/1/script-name.sh (in my case, it is not there).
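The two checks above (looking for “Text file busy” in the handler log and checking whether anything landed in the download directory) can be sketched as a small script. This is only a rough sketch: the function names are hypothetical, and the paths are the ones mentioned in this post, which may differ on other extension versions.

```shell
# Sketch only: the paths come from this post, the function names are made up.
HANDLER_LOG="${HANDLER_LOG:-/var/log/azure/custom-script/handler.log}"
DOWNLOAD_DIR="${DOWNLOAD_DIR:-/var/lib/waagent/custom-script/download}"

# Print how many lines of the given log contain "Text file busy".
# Prints 0 when the log is missing or unreadable.
count_text_file_busy() {
    local log="$1"
    if [ -r "$log" ]; then
        # grep -c exits non-zero when the count is 0, so ignore its status.
        grep -c 'Text file busy' "$log" || true
    else
        echo 0
    fi
}

# Succeed if the download directory contains at least one regular file.
script_was_downloaded() {
    local dir="$1"
    [ -d "$dir" ] && [ -n "$(find "$dir" -type f 2>/dev/null | head -n 1)" ]
}

# Print a short summary for the current VM.
report() {
    echo "Text file busy errors: $(count_text_file_busy "$HANDLER_LOG")"
    if script_was_downloaded "$DOWNLOAD_DIR"; then
        echo "Download directory contains files"
    else
        echo "No downloaded script found under $DOWNLOAD_DIR"
    fi
}
```

Calling `report` as root on a stuck instance should show both symptoms at a glance. It does not explain why the binary is busy; it only confirms which failure you are looking at.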
My custom script extension takes a script out of Azure Blob storage… so I’m going to try to bundle that script into the image and just issue the run command from the custom script extension to see if that makes it go away.
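As a rough illustration of that idea, the extension settings would carry only a command and no “fileUris” entry, so there is nothing for the extension to download. The sketch below is an assumption about how one might generate those settings: the function name and the baked-in script path are hypothetical, and the resource names in the commented `az` command are placeholders.

```shell
# Hypothetical location of the script once it is baked into the image.
BAKED_SCRIPT="${BAKED_SCRIPT:-/opt/bootstrap/script-name.sh}"

# Emit Custom Script extension settings that only run a local script.
# There is deliberately no "fileUris" key, so nothing is downloaded.
# Assumes the path contains no characters that need JSON escaping.
local_only_settings() {
    printf '{"commandToExecute": "sh %s"}\n' "$1"
}

# Example of applying the settings (not run here; names are placeholders):
# az vmss extension set \
#   --resource-group my-rg --vmss-name my-scale-set \
#   --publisher Microsoft.Azure.Extensions --name CustomScript --version 2.0 \
#   --settings "$(local_only_settings "$BAKED_SCRIPT")"
```

The trade-off is the one described in this post: the script now lives in the image, so changing it means building a new image.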
Result – Failure
Taking the script out of blob storage, putting it into the VM image itself, and just calling it with the custom script extension’s commandToExecute setting mitigated this issue. This is unfortunate, as internalizing the script means every tweak requires a new image… but at least the scale set can work properly now and be stable :).
Avoiding downloading files made the issue less likely to occur… but it did come back. It is just rarer.
I tried downgrading the Azure Linux Agent (waagent) to a version noted in one of those GitHub issues. It did not help. I also tried reverting to CentOS 7.3, which didn’t help either. I can’t find any way to make this work reliably.
Workaround
My workaround will be:
- Take all customizations I was doing with the agent.
- Move them into a Packer build (from HashiCorp).
- Packer will build the image I need for each environment, fully configured and working.
- This way, I just run the image and don’t worry about modifying its config with the custom script extension.
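The steps above could look roughly like the sketch below, which writes one Packer JSON template per environment. Treat everything specific here as an assumption rather than my actual build: the function name, image and resource group names, region, VM size and provisioning script are placeholders, and `use_azure_cli_auth` assumes a Packer Azure builder version that supports Azure CLI authentication.

```shell
# Write a Packer template for one environment (sketch, placeholder values).
# Usage: write_packer_template <environment> <output-file>
write_packer_template() {
    local env_name="$1"
    local out_file="$2"
    cat > "$out_file" <<EOF
{
  "builders": [
    {
      "type": "azure-arm",
      "use_azure_cli_auth": true,
      "managed_image_resource_group_name": "my-images-rg",
      "managed_image_name": "scale-set-${env_name}",
      "os_type": "Linux",
      "image_publisher": "OpenLogic",
      "image_offer": "CentOS",
      "image_sku": "7.5",
      "location": "East US",
      "vm_size": "Standard_DS2_v2"
    }
  ],
  "provisioners": [
    {
      "type": "shell",
      "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
      "script": "script-name.sh"
    },
    {
      "type": "shell",
      "execute_command": "chmod +x {{ .Path }}; {{ .Vars }} sudo -E sh '{{ .Path }}'",
      "inline": [
        "/usr/sbin/waagent -force -deprovision+user && export HISTSIZE=0 && sync"
      ]
    }
  ]
}
EOF
}
```

Something like `write_packer_template staging packer-staging.json` followed by `packer build packer-staging.json` would then produce the image for that environment. The first provisioner runs the customizations that used to go through the custom script extension, and the second deprovisions the VM so it can be captured as an image.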
This is painful and frustrating, so I will also raise the bug with Microsoft while doing the workaround.