Install Airflow on Windows + Docker + CentOS

Continuing on my journey: setting up Apache Airflow directly on Windows was a disaster for various reasons.

Setting it up in the WSL (Windows Subsystem for Linux) copy of Ubuntu worked great.  But unfortunately, you can’t properly run services in that, and I’d like to run Airflow in a state reasonably similar to how we’ll eventually deploy it.

So, my fallback plan is Docker on Windows, which is working great (no surprise there).  It was also much less painful to set up in the end than the other options.  I’m also switching from Ubuntu to CentOS (the non-enterprise version of RHEL), as I found out that Airflow ships systemd service files that have been tested on RedHat-based systems: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html.

Assuming you have docker for Windows set up properly, just do the following to set up Airflow in a new CentOS container.

Get and Run CentOS With Python 3.6 in Docker

docker pull centos/python-36-centos7
docker container run --name airflow-centos -it centos/python-36-centos7:latest /bin/bash
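If you exit the container’s shell later, you can restart and re-attach to it with:

docker start -ai airflow-centos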

Install Airflow with Pip

pip install --upgrade pip
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow

Set up Airflow

First, install vim.  Yes… this is Docker, so the images are hyper-stripped down to contain only the essentials; you have to install anything else yourself.  I think I had to connect to the container as root to do this, using this command:

docker exec -it -u root airflow-centos /bin/bash

Then you can install it with yum just fine.  I’m not 100% sure root was needed, so feel free to try it as the normal user first.

yum install vim

I jumped back into the normal user after that (by removing the -u root from the command above).

Then set up Airflow’s home directory and database.

  • Set the Airflow home directory (permanently for the user).
    • vi ~/.bashrc and add this to the bottom of the file.
      • export AIRFLOW_HOME=~/airflow
    • Then re-source the file so you can use it immediately:
      • source ~/.bashrc
  • Initialize the Airflow database (we just did defaults, so it will use a local SQLite DB).
    • airflow initdb
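Put together, those steps look like this:

echo 'export AIRFLOW_HOME=~/airflow' >> ~/.bashrc
source ~/.bashrc
airflow initdb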

Then verify the install worked by checking its version:

root@03bae42c5cdb:/# airflow version
[2018-11-07 20:26:44,372] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
v1.10.0

Run Airflow Services

The official Airflow hello-world page (https://airflow.apache.org/start.html) just says to run Airflow like this:

  • airflow webserver -p 8080
  • airflow scheduler

You probably want to run these in the background and tell the logs to go to a file, etc.
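For example, a minimal way to do that with plain shell redirection (the log locations are just my choice):

airflow webserver -p 8080 > $AIRFLOW_HOME/webserver.log 2>&1 &
airflow scheduler > $AIRFLOW_HOME/scheduler.log 2>&1 &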

It’s more professional to run it as a service on CentOS/RHEL (which is why I switched from Ubuntu to CentOS).  But it turns out that running it as a service inside Docker is tricky.

Even if you get everything set up properly, Docker by default restricts certain capabilities for security, and that keeps systemctl from working (so you can’t start the service).  It sounds like getting around this requires a whole rework (read here): https://serverfault.com/questions/824975/failed-to-get-d-bus-connection-operation-not-permitted.

Also, I realize my idea may have been flawed in the first place (running it as a service in a container).  Containers are really intended to hold micro-services.  So, it would probably make more sense to launch the web server and the scheduler as their own containers and let them communicate with each other (I’m still figuring this out).  This thread nudged me into realizing that: https://forums.docker.com/t/systemctl-status-is-not-working-in-my-docker-container/9075.

It says:

Normally when you run a container you aren’t running an init system. systemctl is a process that communicates with systemd over dbus. If you aren’t running dbus or systemd, I would expect systemctl to fail.

What is the pid1 of your docker container? It should reflect the entrypoint and command that were used to launch the container.

For example, if I do the following, my pid1 would be bash:

$ docker run --rm -it centos:7 bash
[root@180c9f6866f1 /]# ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.7  0.1  11756  2856 ?        Ss   03:01   0:00 bash
root        15  0.0  0.1  47424  3300 ?        R+   03:02   0:00 ps faux

Since only bash and ps faux are running in the container, there would be nothing for systemctl to communicate with.


So, the below steps probably get it working if you set the container up right in the first place (as a privileged container), but it isn’t working for me for now.  Feel free to stop reading here and use Airflow; it just won’t be running as a service.

I might come back and update this post and/or make a future one on how to run Airflow in multiple containers.  I’m also aware that there is an awesome image here that gets everything off the ground instantly, but I was really trying to get it working myself to understand it better: https://hub.docker.com/r/puckel/docker-airflow/.
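For what it’s worth, I’d expect the multi-container approach to look roughly like this.  This is only a sketch: my-airflow-image is a hypothetical name for an image with Airflow pre-installed, and the two services would also need to share a metadata database (rather than separate local SQLite files), which I haven’t worked out yet.

docker container run --name airflow-webserver -d my-airflow-image airflow webserver -p 8080
docker container run --name airflow-scheduler -d my-airflow-image airflow scheduler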

---- service setup (not complete yet)

I found information on the Airflow website here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html stating:

Airflow can integrate with systemd based systems. This makes watching your daemons easy as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Redhat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non Redhat based system) you probably need to adjust the unit files.

Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define here, for example, AIRFLOW_HOME or AIRFLOW_CONFIG.

I didn’t see much in the installation, so I found the scripts on GitHub for the 1.10 version that we are running (based on the version output earlier):

https://github.com/apache/incubator-airflow/tree/v1-10-stable/scripts/systemd

Based on this, I:

 

Python List Comprehension

List comprehensions in python are a short-hand way to create lists from loops and filter conditions in a single expression.

For example, let’s generate all the even numbers from 0 through 14 in one line (remember that range(15) stops just before 15):

[x for x in range(15) if x % 2 == 0]
#Output: [0, 2, 4, 6, 8, 10, 12, 14]

Note that range() returns a sequence. You could just as easily have given it a literal list like [0, 1, 2, 3, …] or a list from earlier in your code.
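For example, here’s the same even-number filter run over a literal list instead of range():

[x for x in [3, 7, 8, 12] if x % 2 == 0]
#Output: [8, 12]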

You can have multiple for clauses. Each one can run over a different source list. You’ll end up with the Cartesian product though. For example:

[(x,y) for x in range(3) for y in range(2)]
#Output: [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

In these results you’ll notice x only went up to 2 and y only went up to 1 (range excludes its endpoint), so we end up with 3 × 2 = 6 results. If x and y ranged over large numbers, the result would be very big. We can have any number of for clauses, so the result list can get large fast.

Also notice that we didn’t need any if clauses for filtering there. Like for clauses, you can have any number of if clauses. So, this is also valid:

[x for x in range(15) if x % 2 == 0 and x not in [2,4,6]]
#Output: [0, 8, 10, 12, 14]

In this case, a single if filters down to even numbers and also drops some predefined values using the not in operator (which is very cool in itself for a language).
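Strictly speaking, that example uses one if combined with and.  The same filter written as two separate if clauses (which is what “any number of ifs” means) looks like this:

[x for x in range(15) if x % 2 == 0 if x not in [2, 4, 6]]
#Output: [0, 8, 10, 12, 14]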

Python Loop Index Variable Scope

While crash-studying python for a new job, I found out that this code is actually not an error!

for i in [1, 2, 3]:
    pass  # Do nothing.
print(i)

It blew my mind that this code actually prints 3. For some crazy reason, python keeps loop index variables around after the loop exits; they are not scoped to the loop.

I found this in the python documentation, but it is described much better in this blog post: https://eli.thegreenplace.net/2015/the-scope-of-index-variables-in-pythons-for-loops/.

I highly recommend reading that link, as it has lots of good info (thanks to Eli Bendersky).  But in case you’re lazy, here’s a historical anecdote quoted from it that I particularly liked:

“Why this is so

I actually asked Guido van Rossum about this behavior and he was gracious enough to reply with some historical background (thanks Guido!). The motivation is keeping Python’s simple approach to names and scopes without resorting to hacks (such as deleting all the values defined in the loop after it’s done – think about the complications with exceptions, etc.) or more complex scoping rules.

In Python, the scoping rules are fairly simple and elegant: a block is either a module, a function body or a class body. Within a function body, names are visible from the point of their definition to the end of the block (including nested blocks such as nested functions). That’s for local names, of course; global names (and other nonlocal names) have slightly different rules, but that’s not pertinent to our discussion.

The important point here is: the innermost possible scope is a function body. Not a for loop body. Not a with block body. Python does not have nested lexical scopes below the level of a function, unlike some other languages (C and its progeny, for example).

So if you just go about implementing Python, this behavior is what you’ll likely end up with. Here’s another enlightening snippet:

for i in range(4):
    d = i * 2
print(d)

Would it surprise you to find out that d is visible and accessible after the for loop is finished? No, this is just the way Python works. So why would the index variable be treated any differently?

By the way, the index variables of list comprehensions are also leaked to the enclosing scope. Or, to be precise, were leaked, before Python 3 came along.”

And for those like me who didn’t know, Guido van Rossum is the author of the Python programming language.

Oh, and by the way, you can avoid this variable leak with a lambda (or, in Python 3, a list comprehension), according to the python documentation here: https://docs.python.org/3.6/tutorial/datastructures.html.

For example:

squares = list(map(lambda x: x**2, range(10)))
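The lambda’s x exists only inside the lambda, so nothing leaks into the enclosing scope (assuming you haven’t defined x elsewhere):

squares = list(map(lambda x: x**2, range(10)))
print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(x)        # NameError: name 'x' is not defined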

Python Dependency Management and Virtual Environments (vs Maven or NPM)

Historically, I’ve mostly used Python for automating small tasks that would otherwise have been bash scripts. So, terms like the following were alien to me, and I didn’t really know how to manage dependencies properly in Python:

  • pip
  • freezing
  • virtual environment

The main languages I’ve used in recent memory are Java and JavaScript.  They both have a dependency manager, so I expected Python to have one.  In Java, people generally use Maven.  In JavaScript, they generally use NPM (or Yarn).  Either way, you make a file, note down the modules you require and their versions (if you’re smart), and then run “mvn install” or “npm install” to go get all the stuff you need.

Maven is also a build system, so it’s more like NPM + webpack in JavaScript; but nonetheless, they work similarly from a dependency management perspective.

Moving on to Python, I’ve learned the following:

Python’s version of Maven or NPM, and why it’s different:

  • PIP is python’s version of NPM or Maven.
  • However, it installs things globally for the python version in use, not on a per-project basis.
  • So, if I had 2 projects with conflicting dependencies, I could have issues because… well… everything is global.
  • In a lot of cases, people install python modules as they need them, just adding “pip install” lines to their release notes or running it ad hoc when they’re working on a server and need a new library.
  • This is clearly not a “production-ready” solution though.

Working on a per-project level:

  • Virtual environments are a bolt-on that lets you run python in properly isolated, per-project environments.
  • You can install the module for working with virtual environments globally by running “pip install virtualenv”.
  • After this, for each project, you can create your own virtual environment with “virtualenv <env-name>”.  You can also specify the target python version to use if you have multiple installed.
  • That command creates a folder named after the environment, with several sub-folders, one of which is bin.  You activate the environment by sourcing or running the “activate” bash or bat script in that bin folder (Linux or Windows, respectively); see the consolidated sketch after this list.
  • Once the environment is activated, your shell prompt will change to show you’re within it.  Now if you run “pip list”, you’ll notice that you only have 3 basic packages; you are shielded from all of your global system ones.
  • You can run pip installs and python code here until your project works great (but only while you’re in the virtual environment).
  • Note that you should not necessarily keep your python code in your virtual environment.  This is probably similar to how you should not keep your Java code inside your maven directory or your JavaScript code inside your NPM directory.  I haven’t had experience either way with this, but I’ve seen it generally recommended in nearly all documentation I’ve come across.

Freezing your dependencies:

  • When you’re happy with it, you can run “pip freeze -l > requirements.txt” to generate a file that locks down your dependencies (the -l means local packages only, not global ones – and you should run it from inside your virtual environment).  An example of the generated file follows this list.
  • Then you can install these in other places (e.g. on a prod server with automation) by doing “pip install -r requirements.txt”.  This makes it quite similar to installing a JavaScript application with npm install (which would get dependencies from the package.json file).
  • Again, if you were running multiple projects on the server, you might want to do this in a virtual environment to keep things isolated/clean.
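For reference, the generated requirements.txt is just a plain list of pinned versions.  The package names and versions below are made up for illustration:

requests==2.20.1
six==1.11.0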

I probably still have a lot to learn, and I’m sure this gets more complex; I’ve used enough languages to know that it takes time to fully learn these things.  But I feel more comfortable with the idea of python in production now that I can see how to isolate projects and install specific dependencies from a target file.

Python Multiple Assignment

The documentation website for Python uses this Fibonacci example – https://docs.python.org/3.6/tutorial/introduction.html:

>>> a, b = 0, 1
>>> while b < 1000:
...     print(b, end=',')
...     a, b = b, a+b
...
1,1,2,3,5,8,13,21,34,55,89,144,233,377,610,987,

I’m quite an experienced programmer, and I found this difficult to digest even though it’s rather simple. The multiple assignment lines threw me off a bit.

In Java and similar languages, you would do this:

int x = 5, y = 7, z = 2;

to assign multiple values; each variable gets its value right next to its name.  In python this is not the case.

In python, you list all the target variables first, and then you list all the values. So, a clearer example would be:

x, y, z = 5, 7, 2

This would provide the same assignments as the Java example does. This seems quirky but maybe I’m just too used to C-style languages :).

So, in the initial python example, ‘a’ starts as 0 and ‘b’ starts as 1; on every iteration, ‘a’ takes the old value of ‘b’ and ‘b’ takes the old ‘a’ + ‘b’, which is exactly the Fibonacci update.
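The key detail is that the entire right-hand side is evaluated before any assignment happens, which is also why Python’s classic one-line variable swap works:

x, y = 5, 7
x, y = y, x     # the right side (7, 5) is built first, then unpacked
print(x, y)     # 7 5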