Azure Custom Script Extension – Text File Busy – Centos7.5 – VM Stuck on Creating

What’s Wrong

I’ve been building a scale set on Azure and have repeatedly observed around 40% of my VMs getting stuck on “Creating” in the azure portal.  The scale set uses a custom script VM extension and runs on the Centos 7.5 OS.

Debugging

After looking around online a lot, I came across numerous Git Hub issues against the custom script extension or the Azure Linux agent.  They are for varying OS’s, but they often involve the VM getting stuck in creating.  For example, here is one vs Ubuntu:

If you go to this file “/var/log/azure/custom-script/handler.log”, you can see details about what the custom script extension is doing.  Also note that “/var/log/waagent.log” can be useful as well.

$> vi /var/log/azure/custom-script/handler.log
+ /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension install
/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-shim: line 77: /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension: Text file busy

In my case, it failed with “Text file busy”. for some reason. Again, there are numerous Git Hub entries for this – but no solutions:

Somewhere else online I saw reports that the Agent was failing while downloading files.  Note that if your plugin download works, you should see the script and more info in this location -> /var/lib/waagent/custom-script/download/1/script-name.sh (in my case, it is not there).

My custom script extension takes a script out of Azure Blob storage… so I’m going to try to bundle that script into the image and just issue the run command from the custom script extension to see if that makes it go away.

Result – Failure

Taking the script out of blob storage and putting it into the VM itself, and just calling it with the custom script extension’s command-to-execute mitigated this issue.  This is unfortunate as internalizing the script means every tweak requires a new image… but at least the scale set can work properly now and be stable :).

Avoiding downloading files made the issue less likely to occur… but it did come back.  It is just rarer.

I tried downgrading the Azure Linux Agent (waagent) to a version noted in one of those Git Hub issues.  It did not help.  I also tried reverting to Centos 7.3 which didn’t help.  I can’t find any way to make this work reliably.

Workaround

My workaround will be:

  • Take all customizations I was doing with the agent.
  • Move them into a packer build (from Hashicorp).
  • Packer will build the image I need for each environment, fully configured and working.
  • This way, I just run the image and don’t worry about modifying its config with the custom script extension.

This is painful and frustrating, so I will also raise the bug with Microsoft while doing the workaround.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s