Continuing on my journey: setting up Apache Airflow directly on Windows was a disaster for various reasons.
Setting it up in the WSL (Windows Subsystem for Linux) copy of Ubuntu worked great. But unfortunately, you can't properly run services and the like there, and I'd like to run it in a state reasonably similar to how we'll eventually deploy it.
So, my fallback plan is Docker on Windows, which is working great (no surprise there). It was also much less painful to set up in the end than the other options. I'm also switching from Ubuntu to CentOS (the non-enterprise rebuild of RHEL), because I found out that Airflow provides systemd service files tested on Red Hat based systems here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html.
Assuming you have Docker for Windows set up properly, just do the following to set up Airflow in a new CentOS container.
Get and Run CentOS With Python 3.6 in Docker
docker pull centos/python-36-centos7
docker container run --name airflow-centos -it centos/python-36-centos7:latest /bin/bash
Install Airflow with Pip
pip install --upgrade pip
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow
Set up Airflow
First, install vim. Yes, this is Docker, so the images are stripped down to contain only the essentials; you have to install anything else yourself. I think I had to connect to the container as root to do this, using this command:
docker exec -it -u root airflow-centos /bin/bash
Then you can install it with yum just fine. I'm not 100% sure root was actually needed, so feel free to try it as the normal user first.
yum install vim
I jumped back to the normal user after that (by re-running the command above without -u root).
Then set up Airflow’s home directory and database.
- Set the Airflow home directory (permanently for the user): vi ~/.bashrc and add this to the bottom of the file:
- export AIRFLOW_HOME=~/airflow
- Then re-source the file so the variable takes effect immediately.
- Initialize the Airflow database (we kept the defaults, so it will use a local SQLite DB). The exact commands are just below this list.
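If you prefer to do the whole thing from the command line, it boils down to something like this (a minimal sketch; in Airflow 1.10 the metadata database is initialized with airflow initdb):
# same as editing ~/.bashrc in vi
echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc
# re-source so the variable is available in the current shell
source ~/.bashrc
# creates the default SQLite metadata DB under $AIRFLOW_HOME
airflow initdb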
Then verify the install worked by checking its version:
root@03bae42c5cdb:/# airflow version
[2018-11-07 20:26:44,372] {__init__.py:51} INFO - Using executor SequentialExecutor
____________ _____________
____ |__( )_________ __/__ /________ __
____ /| |_ /__ ___/_ /_ __ /_ __ \_ | /| / /
___ ___ | / _ / _ __/ _ / / /_/ /_ |/ |/ /
_/_/ |_/_/ /_/ /_/ /_/ \____/____/|__/
v1.10.0
Run Airflow Services
The actual Airflow hello world page here: https://airflow.apache.org/start.html just says to run Airflow like this:
- airflow webserver -p 8080
- airflow scheduler
You probably want to run these in the background and tell the logs to go to a file, etc.
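For example, something along these lines pushes both to the background and sends their output to log files (the log locations here are just my own choice, nothing official):
# run the webserver and scheduler in the background, logging to files
nohup airflow webserver -p 8080 > $AIRFLOW_HOME/webserver.log 2>&1 &
nohup airflow scheduler > $AIRFLOW_HOME/scheduler.log 2>&1 &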
It's more professional to just run it as a service (the tested service files target CentOS/RHEL, which is why I switched from Ubuntu to CentOS). But it turns out that running it as a service in Docker is tricky.
Even if you get everything set up properly, Docker by default disables some features for security reasons, which makes systemctl not work (so you can't start the service). It sounds like getting this working requires a fair amount of rework (read here: https://serverfault.com/questions/824975/failed-to-get-d-bus-connection-operation-not-permitted).
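From what I've read (and this is an untested sketch, not something I have working here), the rework people describe usually means launching a privileged container with systemd itself as PID 1 and only then installing Python/Airflow inside it, roughly like this with the plain centos:7 image:
# start a privileged container running systemd as PID 1
docker run -d --privileged --name airflow-systemd \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  centos:7 /usr/sbin/init
# then get a shell in it and install Python/Airflow as before
docker exec -it airflow-systemd /bin/bash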
Also, I realize my idea may have been flawed in the first place (running it as a service in a container). Containers are really intended to hold microservices, so it would probably make more sense to launch the web server and the scheduler as their own containers and let them communicate with each other (I'm still figuring this out). This thread nudged me toward realizing that: https://forums.docker.com/t/systemctl-status-is-not-working-in-my-docker-container/9075.
It says:
Normally when you run a container you aren’t running an init system. systemctl is a process that communicates with systemd over dbus. If you aren’t running dbus or systemd, I would expect systemctl to fail.
What is the pid1 of your docker container? It should reflect the entrypoint and command that were used to launch the container.
For example, if I do the following, my pid1 would be bash:
$ docker run --rm -it centos:7 bash
[root@180c9f6866f1 /]# ps faux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.7 0.1 11756 2856 ? Ss 03:01 0:00 bash
root 15 0.0 0.1 47424 3300 ? R+ 03:02 0:00 ps faux
Since only bash and ps faux are running in the container, there would be nothing for systemctl to communicate with.
So, the steps below would probably work if you had set the container up right in the first place (as a privileged container), but it isn't working for me for now. Feel free to stop reading here and use Airflow; it just won't be running as a service.
I might come back and update this post and/or make a future one on how to run Airflow in multiple containers. I'm also aware that there is an awesome image here that gets everything off the ground instantly, but I was really trying to get it working myself to understand it better: https://hub.docker.com/r/puckel/docker-airflow/.
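For the record, if I remember its README correctly, that image gets a basic SequentialExecutor/SQLite setup running with a one-liner along these lines, and it also ships docker-compose files that split the webserver and scheduler into separate containers the way I described above:
# quick-start run of the puckel image (webserver + SQLite, per its README)
docker run -d -p 8080:8080 puckel/docker-airflow webserver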
Service Setup (Not Complete Yet)
I found information on the Airflow website here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html stating:
Airflow can integrate with systemd based systems. This makes watching your daemons easy as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Redhat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non Redhat based system) you probably need to adjust the unit files. Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define here, for example, AIRFLOW_HOME or AIRFLOW_CONFIG.
I didn't see these files in my installation, so I found the scripts on GitHub for the 1.10 version that we are running (based on the airflow version output earlier):
https://github.com/apache/incubator-airflow/tree/v1-10-stable/scripts/systemd
Based on this, I:
- Switched back to a root shell (yes, you do have to do it):
- docker exec -it -u root airflow-centos /bin/bash
- Changed to the directory noted in the quoted documentation above.
- cd /usr/lib/systemd/system
- Pulled the service files for the two tasks noted in the simpler Airflow getting-started page (we can deal with the rest later as we learn more). Note that you have to use the link you get by clicking “Raw” on GitHub for this to work; there is a sketch of this after the list.
- Also, based on the quoted documentation above, you have to set up a config file at /etc/sysconfig/airflow, and a sample is provided (again, see the sketch after the list).
- It also recommends defining the Airflow home here, so let's make it the same as the value we put in our “.bashrc” file (that way the service knows about it).
- So, vi /etc/sysconfig/airflow and add this at the end:
- export AIRFLOW_HOME=~/airflow
- You also need to use the root shell to create the airflow user that the unit files expect (a useradd command is included in the sketch below).
- After this, I tried to start the service with systemctl and hit the error referred to in this Server Fault post (https://serverfault.com/questions/824975/failed-to-get-d-bus-connection-operation-not-permitted), which is when I gave up on running it as a service in Docker for the time being. For now I'm just running it with the simple commands from the Airflow getting-started document noted earlier.
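For reference, here is roughly what those steps look like gathered in one place. This is a sketch of what I attempted rather than a working recipe (it still dies at the D-Bus error unless the container is set up as described in that Server Fault answer), and the raw URLs and file names are my assumption based on the v1-10-stable scripts/systemd directory linked above:
# connect as root
docker exec -it -u root airflow-centos /bin/bash
# pull the webserver and scheduler unit files (the "Raw" GitHub links)
cd /usr/lib/systemd/system
curl -O https://raw.githubusercontent.com/apache/incubator-airflow/v1-10-stable/scripts/systemd/airflow-webserver.service
curl -O https://raw.githubusercontent.com/apache/incubator-airflow/v1-10-stable/scripts/systemd/airflow-scheduler.service
# pull the sample environment file the units read from
curl -o /etc/sysconfig/airflow https://raw.githubusercontent.com/apache/incubator-airflow/v1-10-stable/scripts/systemd/airflow
# add the Airflow home (systemd EnvironmentFile lines aren't run through a shell,
# so ~ and "export" may not behave the way they do in .bashrc)
echo 'AIRFLOW_HOME=/home/airflow/airflow' >> /etc/sysconfig/airflow
# create the user the unit files assume
useradd airflow
# this is the part that fails in an unprivileged container
systemctl daemon-reload
systemctl start airflow-webserver airflow-scheduler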