Airflow Task and DAG State/Status Enumerations

I’ve been working on making a kind of “remote DAG watcher DAG” lately.  Basically, I want to render all the tasks in a DAG from another Airflow instance on my own Airflow.  This way, I can watch a set of tasks across multiple Airflow instances (some of which I may not control).

As part of this, I am having to use the (fairly bad) APIs; we use both the experimental API and the REST API plugin here (https://github.com/teamclairvoyant/airflow-rest-api-plugin).
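
For example, here is a minimal sketch of the kind of poll I run against the remote Airflow’s experimental API (the host, DAG ID, and task ID below are made-up placeholders, your instance may require authentication, and this is the 1.10.x experimental task-instance endpoint as I understand it):

import requests

# Base URL of the remote Airflow webserver being watched (hypothetical host).
REMOTE_AIRFLOW = "http://remote-airflow.example.com:8080"

def get_task_state(dag_id, execution_date, task_id):
    """Fetch a single task instance's state via the experimental REST API."""
    url = "{}/api/experimental/dags/{}/dag_runs/{}/tasks/{}".format(
        REMOTE_AIRFLOW, dag_id, execution_date, task_id)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    # The JSON body includes a "state" field such as "running" or "success".
    return response.json().get("state")

print(get_task_state("example_dag", "2020-04-01T00:41:15.926862+00:00", "example_task"))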

Anyway, the APIs don’t document task and DAG states, so I have frequently been looking them up in the code.  Here’s a reference for convenience:

https://github.com/apache/airflow/blob/1.10.5/airflow/utils/state.py

# -*- coding: utf-8 -*-
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
#
from __future__ import unicode_literals

from builtins import object


class State(object):
    """
    Static class with task instance states constants and color method to
    avoid hardcoding.
    """

    # scheduler
    NONE = None
    REMOVED = "removed"
    SCHEDULED = "scheduled"

    # set by the executor (t.b.d.)
    # LAUNCHED = "launched"

    # set by a task
    QUEUED = "queued"
    RUNNING = "running"
    SUCCESS = "success"
    SHUTDOWN = "shutdown"  # External request to shut down
    FAILED = "failed"
    UP_FOR_RETRY = "up_for_retry"
    UP_FOR_RESCHEDULE = "up_for_reschedule"
    UPSTREAM_FAILED = "upstream_failed"
    SKIPPED = "skipped"

    task_states = (
        SUCCESS,
        RUNNING,
        FAILED,
        UPSTREAM_FAILED,
        SKIPPED,
        UP_FOR_RETRY,
        UP_FOR_RESCHEDULE,
        QUEUED,
        NONE,
        SCHEDULED,
    )

    dag_states = (
        SUCCESS,
        RUNNING,
        FAILED,
    )

    state_color = {
        QUEUED: 'gray',
        RUNNING: 'lime',
        SUCCESS: 'green',
        SHUTDOWN: 'blue',
        FAILED: 'red',
        UP_FOR_RETRY: 'gold',
        UP_FOR_RESCHEDULE: 'turquoise',
        UPSTREAM_FAILED: 'orange',
        SKIPPED: 'pink',
        REMOVED: 'lightgrey',
        SCHEDULED: 'tan',
        NONE: 'lightblue',
    }

    @classmethod
    def color(cls, state):
        return cls.state_color.get(state, 'white')

    @classmethod
    def color_fg(cls, state):
        color = cls.color(state)
        if color in ['green', 'red']:
            return 'white'
        return 'black'

    @classmethod
    def finished(cls):
        """
        A list of states indicating that a task started and completed a
        run attempt. Note that the attempt could have resulted in failure or
        have been interrupted; in any case, it is no longer running.
        """
        return [
            cls.SUCCESS,
            cls.FAILED,
            cls.SKIPPED,
        ]

    @classmethod
    def unfinished(cls):
        """
        A list of states indicating that a task either has not completed
        a run or has not even started.
        """
        return [
            cls.NONE,
            cls.SCHEDULED,
            cls.QUEUED,
            cls.RUNNING,
            cls.SHUTDOWN,
            cls.UP_FOR_RETRY,
            cls.UP_FOR_RESCHEDULE
        ]
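
As a quick illustration of how these constants come in handy in the watcher (a minimal sketch; remote_state stands in for whatever “state” string the remote API returned):

from airflow.utils.state import State

remote_state = "up_for_retry"  # e.g. the "state" field from the remote API

# State.finished() covers success, failed, and skipped; everything else is still pending/running.
if remote_state in State.finished():
    print("Task attempt is done:", remote_state)
else:
    print("Task is still in flight; UI color would be:", State.color(remote_state))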

Create a Date in Airflow’s Execution_Date Format (ISO with Time Zone)

If you are using Apache Airflow and you need to create a date to compare against the Airflow execution date, here is a simple, clean way of doing it.

Airflow’s dates are in ISO 8601 format with a time zone offset.  From their docs here (https://airflow.apache.org/docs/stable/timezone.html):

Support for time zones is enabled by default. Airflow stores datetime information in UTC internally and in the database. It allows you to run your DAGs with time zone dependent schedules. At the moment Airflow does not convert them to the end user’s time zone in the user interface. There it will always be displayed in UTC.

The execution_date will always be in UTC, so this piece of code will always produce the current time in the same format as Airflow’s execution_date:

from datetime import datetime, timezone
datetime.now(timezone.utc).isoformat('T')

Also, you should note that these dates appear in this format:

2020-04-01T00:41:15.926862+00:00

This is great because it means you can compare them as strings, since the fields are ordered most-significant-first and the hours are 24-hour based.
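
For example, a plain string comparison is enough to tell whether an execution_date is in the past (a small sketch with a made-up timestamp):

from datetime import datetime, timezone

execution_date = "2020-04-01T00:41:15.926862+00:00"  # as reported by Airflow
now = datetime.now(timezone.utc).isoformat('T')

# Both strings are ISO 8601 in UTC with the most significant fields first,
# so lexicographic order matches chronological order.
if execution_date < now:
    print("That run started in the past.")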

Install Airflow on Windows + Docker + CentOS

Continuing on my journey: setting up Apache Airflow on Windows directly was a disaster for various reasons.

Setting it up in the WSL (Windows Subsystem for Linux) copy of Ubuntu worked great.  But unfortunately, you can’t run services properly in it, and I’d like to run Airflow in a state reasonably similar to how we’ll eventually deploy it.

So, my fallback plan is Docker on Windows, which is working great (no surprise there).  It was also much less painful to set up in the end than the other options.  I’m also switching from Ubuntu to CentOS (the non-enterprise build of RHEL) because Airflow provides systemd service files tested on Red Hat based systems: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html.

Assuming you have Docker for Windows set up properly, just do the following to set up Airflow in a new CentOS container.

Get and Run CentOS With Python 3.6 in Docker

docker pull centos/python-36-centos7
docker container run --name airflow-centos -it centos/python-36-centos7:latest /bin/bash

Install Airflow with Pip

pip install --upgrade pip
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow

Set up Airflow

First, install vim.  Yes… this is Docker, so the images are hyper-stripped down to contain only the essentials; you have to install anything else you need.  I think I had to connect to the container as root to do this, using this command:

docker exec -it -u root airflow-centos /bin/bash

Then you can install it with yum just fine. I’m not 100% sure root was needed, so feel free to try it as the normal user first.

yum install vim

I jumped back in as the normal user after that (by removing the -u root from the command above).

Then set up Airflow’s home directory and database.

  • Set the Airflow home directory (permanently for the user).
    • vi ~/.bashrc and add this to the bottom of the file.
      • export AIRFLOW_HOME=~/airflow
    • Then re-source the file so the variable takes effect immediately:
      • source ~/.bashrc
  • Initialize the Airflow database (we just did defaults, so it will use a local SQLite DB).
    • airflow initdb

Then verify the install worked by checking its version:

root@03bae42c5cdb:/# airflow version
[2018-11-07 20:26:44,372] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
v1.10.0

Run Airflow Services

The actual Airflow hello world page here: https://airflow.apache.org/start.html just says to run Airflow like this:

  • airflow webserver -p 8080
  • airflow scheduler

You probably want to run these in the background and tell the logs to go to a file, etc.

It’s more professional to run it as a service (on CentOS/RHEL, which is why I switched to CentOS from Ubuntu).  But it turns out that running it as a service in Docker is tricky.

Even if you get everything set up properly, Docker by default disables some features for security, which makes systemctl not work (so you can’t start the service).  It sounds like getting this working would be a whole rework (read here: https://serverfault.com/questions/824975/failed-to-get-d-bus-connection-operation-not-permitted).

Also, I realize my idea may have been flawed in the first place (running it as a service in a container).  Containers are really intended to hold microservices.  So, it would probably make more sense to launch the web server and the scheduler as their own containers and let them communicate with each other (I’m still figuring this out).  This thread nudged me toward realizing that: https://forums.docker.com/t/systemctl-status-is-not-working-in-my-docker-container/9075.

It says:

Normally when you run a container you aren’t running an init system. systemctl is a process that communicates with systemd over dbus. If you aren’t running dbus or systemd, I would expect systemctl to fail.

What is the pid1 of your docker container? It should reflect the entrypoint and command that were used to launch the container.

For example, if I do the following, my pid1 would be bash:

$ docker run --rm -it centos:7 bash
[root@180c9f6866f1 /]# ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.7  0.1  11756  2856 ?        Ss   03:01   0:00 bash
root        15  0.0  0.1  47424  3300 ?        R+   03:02   0:00 ps faux

Since only bash and ps faux are running in the container, there would be nothing for systemctl to communicate with.


So, the steps below probably get it working if you set the container up right in the first place (as a privileged container), but it isn’t working for me for now.  So feel free to stop reading here and use Airflow; it just won’t be running as a service.

I might come back and update this post and/or make a future one on how to run Airflow in multiple containers.  I’m also aware that there is an awesome image here that gets everything off the ground instantly, but I was really trying to get it working myself to understand it better: https://hub.docker.com/r/puckel/docker-airflow/.

---- Service setup (not complete yet) ----

I found information on the Airflow website here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html stating:

Airflow can integrate with systemd based systems. This makes watching your daemons easy as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Redhat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non Redhat based system) you probably need to adjust the unit files.

Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define here, for example, AIRFLOW_HOME or AIRFLOW_CONFIG.

I didn’t see these in my installation, so I found the scripts on GitHub for the 1.10 version that we are running (based on our earlier version output):

https://github.com/apache/incubator-airflow/tree/v1-10-stable/scripts/systemd

Based on this, I:

 

Apache Airflow Windows 10 Install (Ubuntu)

After my failed attempt at installing Airflow into Python on Windows the normal way, I heard that it is better to run it in the Ubuntu subsystem available in the Windows 10 Store.  So, I’m switching to this route.

You can find and install “Ubuntu” in the Windows 10 Store, and it will give you a full-fledged Ubuntu Linux shell.  Here’s what the installation looks like:

[Screenshot: Ubuntu installation in the Windows 10 Store]

It installs quite quickly, then you just press “Launch”.  The shell opens, and in my case, I was presented with this:

Installing, this may take a few minutes…
WslRegisterDistribution failed with error: 0x8007019e
The Windows Subsystem for Linux optional component is not enabled. Please enable it and try again.
See https://aka.ms/wslinstall for details.
Press any key to continue…

Go to your start menu and type “features” and click “Turn Windows features on or off”, then check the “Windows Subsystem for Linux” box and press “OK”.

It will install some things and take a few minutes.  For me, it took about 2 minutes on “Searching for required files” even though I’m on a very fast corporate internet connection.  So, don’t be discouraged if that happens.

Unfortunately, you’ll have to reboot once this finishes!  Such is Windows :(.

After the reboot, open the “Ubuntu” shell from your Windows button search.  It will take a minute to install and will ask you to create a username and password (note that “admin” will not work as a username, so don’t bother trying it).

Installing, this may take a few minutes…
Please create a default UNIX user account. The username does not need to match your Windows username.
For more information visit: https://aka.ms/wslusers
Enter new UNIX username:
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Installation successful!
To run a command as administrator (user “root”), use “sudo <command>”.
See “man sudo_root” for details.

If you check, you’ll see that Python is already installed.  It is version 3.6.5 for me, which is good because, as a previous post where I tried to install Airflow on Windows showed, Airflow is not (yet) compatible with Python 3.7: Python 3.7 made “async” a reserved keyword, which breaks the pip install.

$ python3 --version
Python 3.6.5

Now, we should just have to install Airflow.  But we need pip first, and installing pip the way the terminal recommends (when you try to use it as is) doesn’t work.  So, I found this: https://askubuntu.com/questions/672808/sudo-apt-get-install-python-pip-is-failing which recommends:

sudo apt-get install software-properties-common
sudo apt-add-repository universe
sudo apt-get update

After you run those commands, you can run the last one:

sudo apt-get install python-pip

This last command is actually the one the Ubuntu terminal recommended if you blindly tried to use pip in the first place, but it wouldn’t have worked without the other three commands first.  This took around 5 minutes to install for me, and it will require you to answer “y” for yes once to kick it off.

After this, we can FINALLY install Airflow properly.  This is a pretty big victory if you realize that I started on my other blog post trying to make it work in Windows first, and that was a rabbit hole in itself!

export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow

If you’re wondering why that first export line is there, just skip it and read the terminal error message which recommends it.  I ran into the same thing in the pure Windows install which failed in the other blog post.

This installation took around 3 minutes for me.  The Airflow documentation (https://airflow.apache.org/installation.html) recommends initializing its database (SQLite by default) when you’re done, as other things won’t work without it.

Surprisingly, I found I had to open a new terminal before I could use the airflow command.  I’m not sure if this is a quirk of running it on Windows, or if I should have just re-sourced my profile; I didn’t play around with it.

In any case, initialize the DB and then check the version, and hopefully you’re as happy as I am to be done with that.
 

airflow initdb

hujo8003@USLJ96YRQ2:~$ airflow version
[2018-11-06 11:36:38,930] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
v1.10.0

 

 

(Failed) Apache Airflow Windows-10 Install

NOTE: I fought this installation a lot and fixed numerous issues, but in the end I got stuck on a C++ compilation failure in the Airflow install via pip.  So, I’m switching to installing it in an Ubuntu shell available from the Windows 10 Store in a new post, since I’ll be running Airflow in Linux in production anyway.  There is probably some helpful stuff here, but it’s not a full solution, and I recommend against installing Airflow this way given my experiences.

I’m relatively new to Python and have only really used it for simplistic scripts in the past; but now I’ll be using it for a new job along with Apache Airflow (which is very cool).

Anyway, I just had a terrible time installing Airflow… so I thought I’d document the issues I hit and the fixes I found on Windows 10:

  1. Install python 3.6.7 from here: https://www.python.org/downloads/release/python-367/
    • (Do not use Python 3.7; as of 2018-11-06, “pip install apache-airflow” will install apache-airflow-1.10.0, and the installer will try to use the “async” keyword, which is now a reserved word in Python 3.7, so it will fail.)
  2. Make sure Python and its Scripts directory are in your path (Python’s installer may or may not do this).  If you open a new command line after the Python install and “python --version” doesn’t show 3.6.7, you need to add them yourself.
    • Note that the scripts directory is where pip is; this is your package installer for adding modules to Python.
  3. Upgrade pip with:  python -m pip install --upgrade pip
  4. The installation command for Airflow is “pip install apache-airflow”.  But in my case, this failed a few more times due to other dependencies/issues.  So, I had to do the following before this worked:
    • Set this environment variable: “set SLUGIFY_USES_TEXT_UNIDECODE=yes”
    • Install the Microsoft Visual C++ 14 build tools (this is time-consuming) and upgrade setuptools via pip.
      • pip install --upgrade setuptools
      • Install the “Build Tools for Visual Studio 2017” from: https://www.visualstudio.com/downloads/#build-tools-for-visual-studio-2017
        • Once the installer’s interface opens, install “Visual C++ build tools – Build Windows desktop applications using the Microsoft C++ toolset, ATL, or MFC”. I also checked the following boxes on the right:
          • Windows 10 SDK (10.0.17134.0)
          • Visual C++ tools for CMake
          • Testing tools core features – Build Tools
          • VC++ 2015.3 v14.00 (v140) toolset for desktop
          • Windows Universal CRT SDK
          • Windows 8.1 SDK
          • Installation is 2.5GB! Woah.
  5. Open a new command line so it picks up everything, and then run:
    • set SLUGIFY_USES_TEXT_UNIDECODE=yes
    • pip install apache-airflow
    • IT STILL DIDN’T WORK! The error is a complex SYSTEM_THREADING-related one, and the online docs for it don’t seem to have a resolution.
      There’s probably a way to fix this, but at this point I’m going to switch to installing it in an Ubuntu shell subsystem from the Windows 10 Store in a new blog post; I’ve wasted enough time on this, given I’ll be running Airflow in Linux in production anyway.