Using SystemD Timers (Like Cron) – Centos

The more I use it, the more I like using SystemD to manage services.  It is sometimes verbose, but in general, it is very easy to use and very powerful.  It can restart failed services, start services automatically, handle logging and log rotation for simple services, and much more.

SystemD Instead of Cron

Recently, I had to take a SystemD service and make it run on a regular timer.  So, of course, my first thought was using cron… but I’m not a particularly big fan of cron for reasons I won’t go into here.

I found out that SystemD can do this as well!

Creating a SystemD Timer

Let’s say you have a (very) simple service defined like this in SystemD (in /etc/systemd/system/your-service.service):

[Unit]
Description=Do something useful.

[Service]
Type=simple
ExecStart=/opt/your-service/do-something-useful.sh
User=someuser
Group=someuser

Then you can create a timer for it by creating another SystemD file in the same location but ending with “.timer” instead of “.service”. It can handle basically any interval, can start at reboot, and can even time tasks based on the reboot time.

[Unit]
Description=Run service once daily.

[Timer]
OnCalendar=*-*-* 00:30:00
Unit=your-service.service

[Install]
WantedBy=timers.target
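
The boot-relative scheduling mentioned above could look roughly like the sketch below. It writes the timer into a scratch directory instead of /etc/systemd/system so it is safe to try, and the 5 minute and 1 hour values are arbitrary assumptions for illustration, not something from my setup.

```shell
#!/bin/sh
set -eu

# Write to a scratch directory so nothing under /etc is touched; copy the
# file into /etc/systemd/system/ yourself once it looks right.
out_dir="$(mktemp -d)"
timer_file="$out_dir/your-service.timer"

cat > "$timer_file" <<'EOF'
[Unit]
Description=Run service after boot, then hourly.

[Timer]
# First run 5 minutes after boot (interval chosen only for illustration).
OnBootSec=5min
# Then repeat 1 hour after the service was last activated.
OnUnitActiveSec=1h
Unit=your-service.service

[Install]
WantedBy=timers.target
EOF

cat "$timer_file"
```

OnBootSec and OnUnitActiveSec are monotonic timers, so they count from an event (boot, last activation) rather than from the wall clock the way OnCalendar does.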

Once the timer file is made, you can enable it like so:

sudo systemctl daemon-reload
sudo systemctl enable your-service.timer
sudo systemctl start your-service.timer

After this, you can verify the timer is working and check its next (and later on, previous) execution time with this command:

$> systemctl list-timers --all
NEXT                         LEFT          LAST                         PASSED  UNIT                         ACTIVATES
Tue 2019-03-05 20:00:01 UTC  14min left    Mon 2019-03-04 20:00:01 UTC  23h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Wed 2019-03-06 00:30:00 UTC  4h 44min left n/a                          n/a     your-service.timer           your-service.service
n/a                          n/a           n/a                          n/a     systemd-readahead-done.timer systemd-readahead-done.service
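
Once the timer has fired, it helps to look at the timer's own state and at what the service logged. Here is a small sketch of a helper for that; timer_report is a hypothetical name I made up, and it assumes the timer and service share a base name like the example above.

```shell
#!/bin/sh
# timer_report NAME: show the timer's state and the last few log lines of
# the service it activates. NAME is the unit name without a suffix.
timer_report() {
    # status exits non-zero for inactive units, so do not treat that as fatal.
    systemctl status "$1.timer" --no-pager || true
    journalctl -u "$1.service" -n 20 --no-pager
}

# Usage (on a host running systemd):
#   timer_report your-service
```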

HAProxy + Centos 7 (or RHEL 7) – Won’t bind to any ports – SystemD!

What are the Symptoms?

This has bitten me badly twice now. I was deploying Centos 7.5 servers and trying to run HAProxy on them through SystemD (I’m not sure if it is an issue otherwise).

Basically, no matter what port I use I get this message:

Starting frontend main: cannot bind socket [0.0.0.0:80]

Note that as I was too lazy to set up separate logging for the HAProxy config, I found this message in /var/log/messages with the other system messages.

Of course, seeing this your first thought is “he’s running another process on that port!”… but nope.  Also, the permissions are set up properly, etc.

What is the Problem?

The problem here is actually SELinux.  I haven’t quite dug into why, but when running under SystemD, SELinux will deny access to all ports for HAProxy unless you go out of your way to allow it to access them.

How Do We Fix It?

Thankfully, the fix is very simple.  Just set this SELinux boolean as a root/sudo user:

sudo setsebool -P haproxy_connect_any 1

…and voilà! If you restart HAProxy, it will bind fine.  I spent a lot of time on this before I found decent documentation and forum references.  I hope this helps you fix it faster!  I also eventually found a Stack Overflow question… but the accepted/good answer is like 10 down, so I missed it the first pile of times.
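
If you want to confirm that SELinux really is the culprit before (or after) flipping the boolean, something like this sketch gathers the relevant facts. haproxy_selinux_report is a hypothetical name, and I am assuming the usual audit and policy tools are installed (on Centos 7, semanage comes from the policycoreutils-python package).

```shell
#!/bin/sh
# haproxy_selinux_report: print the SELinux facts relevant to the bind failure.
# Run as root; ausearch and semanage need it.
haproxy_selinux_report() {
    # Enforcing, Permissive, or Disabled.
    getenforce
    # Current value of the boolean set in the fix above.
    getsebool haproxy_connect_any
    # Recent AVC denials logged for the haproxy command, if any.
    ausearch -m avc -c haproxy -ts recent || true
    # Ports currently labelled for HTTP-style services.
    semanage port -l | grep -w http_port_t || true
}

# Usage:
#   haproxy_selinux_report
```

If the boolean shows as off and ausearch lists name_bind denials for haproxy, you are almost certainly looking at the same problem I had.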

Azure Custom Script Extension – Text File Busy – Centos7.5 – VM Stuck on Creating

What’s Wrong

I’ve been building a scale set on Azure and have repeatedly observed around 40% of my VMs getting stuck on “Creating” in the Azure portal.  The scale set uses a custom script VM extension and runs on the Centos 7.5 OS.

Debugging

After looking around online a lot, I came across numerous GitHub issues against the custom script extension or the Azure Linux agent.  They are for varying OSes, but they often involve the VM getting stuck in creating.  For example, one of them is filed against Ubuntu.

If you go to this file “/var/log/azure/custom-script/handler.log”, you can see details about what the custom script extension is doing.  Also note that “/var/log/waagent.log” can be useful as well.

$> vi /var/log/azure/custom-script/handler.log
+ /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension install
/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-shim: line 77: /var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/bin/custom-script-extension: Text file busy

In my case, it failed with “Text file busy” for some reason. Again, there are numerous GitHub entries for this, but no solutions.

Somewhere else online I saw reports that the Agent was failing while downloading files.  Note that if your plugin download works, you should see the script and more info in this location -> /var/lib/waagent/custom-script/download/1/script-name.sh (in my case, it is not there).
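
When you have a lot of stuck VMs to check, paging through each log in vi gets old fast. Here is a sketch of a tiny helper that just pulls out the matching lines; find_busy_errors is a hypothetical name, and the paths in the usage line are the two logs mentioned above.

```shell
#!/bin/sh
# find_busy_errors FILE...: print every "Text file busy" line, prefixed with
# file name and line number, from the logs that exist and are readable.
find_busy_errors() {
    for log in "$@"; do
        [ -r "$log" ] || continue
        grep -Hn 'Text file busy' "$log" || true
    done
}

# Usage on an affected VM:
#   find_busy_errors /var/log/azure/custom-script/handler.log /var/log/waagent.log
```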

My custom script extension takes a script out of Azure Blob storage… so I’m going to try to bundle that script into the image and just issue the run command from the custom script extension to see if that makes it go away.

Result – Failure

At first, taking the script out of blob storage, putting it into the VM image itself, and just calling it with the custom script extension’s command-to-execute seemed to mitigate this issue.  This is unfortunate, as internalizing the script means every tweak requires a new image… but it looked like the scale set could at least work properly and be stable :).

Avoiding downloading files made the issue less likely to occur… but it did come back.  It is just rarer.

I tried downgrading the Azure Linux Agent (waagent) to a version noted in one of those GitHub issues.  It did not help.  I also tried reverting to Centos 7.3, which didn’t help either.  I can’t find any way to make this work reliably.

Workaround

My workaround will be:

  • Take all customizations I was doing with the agent.
  • Move them into a Packer build (Packer is from HashiCorp).
  • Packer will build the image I need for each environment, fully configured and working.
  • This way, I just run the image and don’t worry about modifying its config with the custom script extension.

This is painful and frustrating, so I will also raise the bug with Microsoft while doing the workaround.

Running Terraform on Centos7/RHEL7 With Docker

Install Docker

Here is a lean version of the Docker site content that I tested on Centos 7.5.  It yum installs some pre-requisites, adds the stable Docker Community Edition repository to yum, and then installs and starts Docker.

sudo yum install -y yum-utils \
    device-mapper-persistent-data \
    lvm2
sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce
sudo systemctl start docker

Now Docker is started – but only the root user can really use it.  So, let’s create the docker group and add our current user to it.  That way we can use docker with our current user and avoid having to use sudo on every command.

These instructions come from here: https://docs.docker.com/install/linux/linux-postinstall/#manage-docker-as-a-non-root-user.

sudo groupadd docker
sudo usermod -aG docker $USER

After this, please re-log in (e.g. exit out of SSH and jump back into your server) so that your group memberships apply.

Now Docker is running and we can use it as ourselves.
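
If you are not sure whether your new session actually picked up the group, a quick check along these lines can save some head scratching. in_group is a hypothetical helper name; it deliberately calls id without a user name so that it reports the groups of the current session rather than what is stored in /etc/group.

```shell
#!/bin/sh
# in_group GROUP: succeed if the current session is a member of GROUP.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

if in_group docker; then
    echo "docker group active for this session"
else
    echo "not in docker group yet; log out and back in"
fi

# End-to-end check once the group is active:
#   docker run hello-world
```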

Get Terraform Working in Docker

We will run Terraform as a single command inside of a Docker image.  So, let’s start by getting the latest Terraform image from HashiCorp:

docker pull hashicorp/terraform:light

Create a directory for your Terraform work and give ownership to your user. Also create a sub-directory to act as the Docker volume in which we will put your Terraform plans.

sudo mkdir /opt/terraform && sudo chown $USER:$USER /opt/terraform
cd /opt/terraform
mkdir tf-vol

Now let’s create a file at /opt/terraform/tf-vol/plan.tf with a sample Terraform plan (just a debug one).

output "test" {
  value = "Hello World!"
}

After this, we can run Terraform and tell Docker to use that tf-vol directory as a volume. Terraform will use it as the working directory, will find our plan, and will display “Hello World!”.

$ docker run -i -t -v /opt/terraform/tf-vol:/tf-vol/ -w /tf-vol/ hashicorp/terraform:light apply

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

Outputs:

test = Hello World!

So, we now have Docker installed, and Terraform running with it using an external volume to store our plans.
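
Typing that whole docker run line every time gets tedious, so here is a sketch of a small wrapper function you could drop into your shell profile. The names tf, TF_DIR and TF_IMAGE are hypothetical ones I picked for this example; the defaults simply mirror the paths and image tag used above.

```shell
#!/bin/sh
# TF_DIR is the host directory holding the plans; TF_IMAGE is the image to run.
TF_DIR="${TF_DIR:-/opt/terraform/tf-vol}"
TF_IMAGE="${TF_IMAGE:-hashicorp/terraform:light}"

# tf ARGS...: run a Terraform subcommand in a throwaway container, with the
# plan directory mounted as the working directory.
tf() {
    docker run --rm -i -t -v "$TF_DIR:/tf-vol/" -w /tf-vol/ "$TF_IMAGE" "$@"
}

# Usage:
#   tf apply
```

The --rm flag is my addition; it removes the stopped container after each run so they do not pile up.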

Centos7 and RHEL7 Increasing Open File Descriptors & Process Limits (AND SystemD / SystemCTL!)

What’s the Problem?

When deploying on RHEL7 or Centos7, it is fairly common to see a warning like the following one (which I just got while installing Presto from Facebook):

WARNING: Current OS file descriptor limit is 4096. Presto recommends at least 8192.

There are a variety of these issues… but the basic problem is that your OS has set limits for things and sometimes we need to raise those limits depending on what we’re running (especially when we’re running large apps on large servers).

The ulimit being referred to here always ends up being extra hard to edit as you have to do it in multiple places and most blogs/posts don’t cover them all for some reason (having suffered through it multiple times now, I know that).

How Do We View the Limits?

In this warning, we see that the “OS File Descriptor” limit is 4096 currently.  So, let’s look at the current settings with the “ulimit -a” command:

$> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257564
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 4096
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

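We can see here that “open files” is 1024; that is the file descriptor limit the warning is about.  (“ulimit -a” shows soft limits, so the 4096 in the warning is most likely the hard limit.)  We can also see that “max user processes” is 4096.

So, let’s increase both of those (only the open files one is relevant to the warning though).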
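
Every limit has a soft and a hard value, and plain “ulimit -a” only prints the soft ones. A minimal sketch for looking at both values of the open files limit in your current shell:

```shell
#!/bin/sh
# Soft limits are what a process gets by default; hard limits are the ceiling
# an unprivileged process may raise its own soft limit to.
soft_nofile="$(ulimit -S -n)"
hard_nofile="$(ulimit -H -n)"

echo "open files: soft=$soft_nofile hard=$hard_nofile"
```

In bash, the same -S and -H flags work with -u for the max user processes limit.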

Increasing the Limits

Edit /etc/sysctl.conf and add:

fs.file-max = 65536
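
One caution before adding that line: fs.file-max is a system-wide ceiling, and on many machines the default is already higher than 65536, in which case setting it would lower the limit. Here is a sketch of a quick check; file_max_advice is a hypothetical helper name.

```shell
#!/bin/sh
# file_max_advice CURRENT WANTED: say whether setting fs.file-max to WANTED
# would raise or lower the value the kernel is using right now.
file_max_advice() {
    if [ "$1" -ge "$2" ]; then
        echo "keep: fs.file-max is already $1"
    else
        echo "raise: fs.file-max is $1, below $2"
    fi
}

if [ -r /proc/sys/fs/file-max ]; then
    file_max_advice "$(cat /proc/sys/fs/file-max)" 65536
fi

# After editing /etc/sysctl.conf, load it without a reboot (needs root):
#   sudo sysctl -p
```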

Edit /etc/security/limits.conf  and add:

* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535

The nproc limit is also set in a separate file under /etc/security/limits.d/, which is read after limits.conf and overrides it (the number in the file name can vary).  So please edit /etc/security/limits.d/20-nproc.conf and make its contents the following:

* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
root soft nproc unlimited

That last one is the one that most places miss.
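
Since the file name can vary, it is worth searching for every place a limit is set rather than guessing. A sketch of a helper for that follows; find_limit_lines is a hypothetical name.

```shell
#!/bin/sh
# find_limit_lines ITEM PATH...: list every non-comment line that sets ITEM
# (for example nproc or nofile) in the given files or directories.
find_limit_lines() {
    item="$1"
    shift
    grep -rnE "^[^#]*[[:space:]]${item}[[:space:]]" "$@" 2>/dev/null || true
}

# Usage:
#   find_limit_lines nproc /etc/security/limits.conf /etc/security/limits.d
```

Each match is printed as file:line:content, so it is easy to spot which file is still carrying the old value.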

Verifying the New Limits

Here’s the last tricky part… if you run “ulimit -a” again now, it won’t really look much better.  So, re-log-in to your shell/server and then run it, and you’ll see the settings are now updated (yay!).

$> ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257564
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 65535
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

But What About SystemD and SystemCTL?

I felt victorious at this point, but alas, when I ran presto and haproxy they both spit out warnings and/or errors again for the same reason.  What is this!?

It turns out I was running both in SystemD, and SystemD has its own way of managing these things.  So, in that case, the final step is to go to your unit file in /etc/systemd/system/your-app.service and add the following inside the [Service] section (the … just implies there may be content above or below it, just add those two properties in the existing section).

[Service]
...
LimitNPROC=65535
LimitNOFILE=65535
...

After adding that you should do a “sudo systemctl daemon-reload” and “sudo systemctl restart your-app” to apply the settings.
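
To confirm the service really got the new limits, you can ask the kernel what it is enforcing for the running process. A minimal sketch, where show_limits is a hypothetical helper name and your-app is the same placeholder as above:

```shell
#!/bin/sh
# show_limits PID: print the open-file and process limits the kernel is
# enforcing for a running process, read from /proc.
show_limits() {
    grep -E '^Max (open files|processes)' "/proc/$1/limits"
}

# Demonstrate on the current shell.
show_limits "$$"

# Usage for a SystemD service:
#   show_limits "$(systemctl show -p MainPID your-app | cut -d= -f2)"
```

The columns printed are the soft limit, the hard limit, and the unit.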

And finally, everything is right with the world!