What is Jupyter Hub?

First Things First… What is Jupyter?

Lately, I’ve been moving into the Python world, where I quickly encountered Jupyter notebooks.  They seem like a fairly dominant technology that lets you write and run Python code block-by-block and render the results inline.  You can also render data as charts, build user-interface widgets, and do most anything else.

What is the Problem With Jupyter?

But Jupyter really just runs on a single machine.  What about when you want to share this environment to, say, teach a class, or work with a team of data scientists?

So… We Have Jupyter Hub!

Jupyter Hub is a multi-user version of Jupyter… so it fixes our problems! Here I’ll paraphrase content and use images from a wonderful video I watched on YouTube – you can watch it at the bottom of this post if you like.

Basically, Jupyter Hub just provides a higher-level service on top of the standard Jupyter notebooks.  It contains:

  1. A proxy server to route requests.
  2. A “hub” which handles authentication, user details, and spawning new notebooks.  Authentication is flexible and can most likely tie into your corporate authentication system.
  3. Any number of spawned Jupyter processes to run notebooks for the given users.  A variety of spawning techniques exist (e.g. spawning to Docker).

You can see this architecture below.

[Diagram: Jupyter Hub architecture (proxy, hub, and spawned notebook servers)]
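To make those pieces concrete, here is a minimal jupyterhub_config.py sketch of my own (not from the video); the authenticator shown is just the default, and the Docker spawner requires the separate dockerspawner package:

# jupyterhub_config.py -- a minimal sketch, not from the video.
c = get_config()  # Provided by JupyterHub when it loads this file.

# The hub handles authentication; PAM (local system accounts) is the
# default, and other authenticators can tie into corporate systems.
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# Spawn each user's notebook server in its own Docker container.
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'

# The proxy listens here and routes each request to the hub or to the
# right user's spawned notebook server.
c.JupyterHub.bind_url = 'http://0.0.0.0:8000'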

So, if you need multi-user Jupyter, I suggest you look into installing and trying Jupyter Hub, and I highly recommend the video below as a starting point!

Jupyter Auto-Run Cells on Load

Why Do We Need This?

If you are making a Jupyter notebook that heavily uses widgets and conceals the code used to make them, you’ll quickly run into an issue. Another person coming to this notebook would basically just see this message for all of your widgets:

“A Jupyter widget could not be displayed because the widget state could not be found. This could happen if the kernel storing the widget is no longer available, or if the widget state was not saved in the notebook. You may be able to create the widget by running the appropriate cells.”

You can simulate this for yourself by pressing the “restart the kernel (with dialog)” button and then force-refreshing your browser (Ctrl + Shift + R in Chrome).

How Do We Do It?

I came across this Stack Overflow post, which gives a good solution (especially if you are already hiding code in other areas to make things look neater, like I noted in this blog).

Just paste this in its own cell at the top of your notebook:

%%html
<script>
    // AUTORUN ALL CELLS ON NOTEBOOK-LOAD!
    require(
        ['base/js/namespace', 'jquery'], 
        function(jupyter, $) {
            $(jupyter.events).on("kernel_ready.Kernel", function () {
                console.log("Auto-running all cells-below...");
                jupyter.actions.call('jupyter-notebook:run-all-cells-below');
                jupyter.actions.call('jupyter-notebook:save-notebook');
            });
        }
    );
</script>

 
Then all your cells will run on load and all of your widgets will show up nice and neat the first time around.

Read-only / Protected Jupyter Notebooks

Jupyter notebooks are fantastic, but they’re really geared toward developers.  I had to lock one down so it could be used by non-developers (without them damaging it).  It took quite a lot of googling!

I figure that a lot of people must need this.  If I were a university instructor, I’d like to send students to a server, let them play, but prevent them from breaking my example.

Locking Things Down

Here are the different things I did to mitigate damage:

  1. You can actually make the entire notebook read-only by setting the file permissions on the command line or in the file properties (works for both Windows and Linux).  Jupyter will detect this (after you reload the page) and show that you can’t save anything.
  2. You can make individual cells un-deletable and un-editable (so users can’t mess up the top cells that the cells lower down depend on):
    • Run a cell.
    • Click “Edit Metadata” in its banner.
    • Add:
      • "deletable": false,
      • "editable": false,
  3. You can actually hide the code for a range of cells, as in the snippet below, which hides the first four (very useful if you’re using IPython UI widgets and just want to show the widgets and not how they were made) – disclaimer – I got this off Stack Overflow but am having trouble finding the post to reference it currently:
from IPython.display import HTML
HTML('''
<script>
    code_show = true;
    function code_toggle_and_hide_move_buttons() {
        if (code_show) {
            $('div.input').slice(0, 4).hide();
            $('#move_up_down').hide();
        } else {
            $('div.input').slice(0, 4).show();
            $('#move_up_down').show();
        }
        code_show = !code_show;
    }
    $(document).ready(code_toggle_and_hide_move_buttons);
</script>
<a href="javascript:code_toggle_and_hide_move_buttons()">Toggle code display</a>
''')

Further Recommendations

I would also suggest making sure you disable the terminal in your Jupyter config file, and that you set a known location for your notebooks to be loaded from so that you can add the read-only attribute ahead of time (see the sketch below).
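For example, the relevant lines in jupyter_notebook_config.py might look like this (a minimal sketch for the classic notebook server; the notebook directory path is just an illustrative example):

# jupyter_notebook_config.py -- minimal sketch; the path below is illustrative.
c = get_config()

# Disable the in-browser terminal so users can't get a shell on the server.
c.NotebookApp.terminals_enabled = False

# Serve notebooks from a known directory so you can set the read-only
# attribute on the files there ahead of time.
c.NotebookApp.notebook_dir = '/srv/notebooks'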

Also, you can disable various hotkeys in the UI, and you can use a CSS selector (similar to the one in my “hide code” example above) to hide the move-up/move-down cell buttons, to help prevent errors creeping in that way.

 

Python PIP Install Local Module While Developing

I’m definitely still in the early stages of learning module/package building and deployment for python.  So, take this with a grain of salt…

But I ran into a case where I wanted to develop/manipulate a package locally in PyCharm while I was actively using it in another project I was developing (actually, in a Jupyter notebook).  It turns out there’s a pretty cool way to do this.

Module Preparation

The first thing I had to do was prepare the package so that it was deployable using the standard python distribution style.

In my case, I just made a directory for my package (lower-case name, underscore separators).  Inside the directory, I created my module code, an __init__.py, and a setup.py describing the package.

Here’s an example.  Ignore everything that I didn’t mention; all of that is auto-generated by PyCharm and not relevant.  In fact, it probably would have been better to create a sub-directory in this project for the package; but I just considered the top level directory the package directory for now.

[Screenshot: example package layout in PyCharm]
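For reference, a minimal setup.py along these lines might look like the sketch below; the name and version match the pip output further down, but everything else is illustrative:

# setup.py -- minimal sketch for making the package pip-installable.
from setuptools import setup, find_packages

setup(
    name='postgres-query-runner',  # Distribution name (illustrative).
    version='1.1.0',
    packages=find_packages(),  # Automatically discover package directories.
)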

Module Installation

Once you have your module set up like this, you can jump into your command line (assuming you have pip installed) and run this command, tailored to your package directory location:

λ pip install -e C:\dev\python\jupyter_audit_query_tools
Obtaining file:///C:/dev/python/jupyter_audit_query_tools
Installing collected packages: PostgresQueryRunner
Running setup.py develop for PostgresQueryRunner
Successfully installed PostgresQueryRunner

You’ll also be able to see the package mapped to that directory when you list the packages in PIP:

λ pip list | grep postgres
postgres-query-runner 1.1.0 c:\dev\python\jupyter_audit_query_tools

Module Usage

After this, you should be able to import and use the package/modules in your interpreter or notebook.  You can change the code in the package, and it will update in the places you’re using it, assuming you re-import the package.  In Jupyter, this means clicking the restart-kernel/re-run button (or reloading the module, as shown below).
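Alternatively, if you would rather not restart the kernel, Python’s importlib can reload an already-imported module in place; here is a sketch (the import name is hypothetical):

import importlib
import postgres_query_runner  # Hypothetical import name for the example package.

# Re-execute the module's code so local edits take effect without
# restarting the interpreter or kernel.
importlib.reload(postgres_query_runner)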

Install Airflow on Windows + Docker + CentOS

Continuing on my journey: setting up Apache Airflow directly on Windows was a disaster, for various reasons.

Setting it up in the WSL (Windows Subsystem for Linux) copy of Ubuntu worked great.  But unfortunately, you can’t properly run services in that environment, and I’d like to run Airflow in a state reasonably similar to how we’ll eventually deploy it.

So, my fallback plan is Docker on Windows, which is working great (no surprise there).  It was also much less painful to set up in the end than the other options.  I’m also switching from Ubuntu to CentOS (the non-enterprise version of RHEL), as I found out that Airflow has systemd service files tested on Red Hat based systems here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html.

Assuming you have Docker for Windows set up properly, just do the following to set up Airflow in a new CentOS container.

Get and Run CentOS With Python 3.6 in Docker

docker pull centos/python-36-centos7
docker container run --name airflow-centos -it centos/python-36-centos7:latest /bin/bash

Install Airflow with Pip

pip install --upgrade pip
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow

Set up Airflow

First, install Vim.  Yes… this is Docker, so the images are hyper-stripped-down to contain only the essentials; you have to install anything else yourself.  I think I had to connect to the container as root to do this, using this command:

docker exec -it -u root airflow-centos /bin/bash

Then you can install it with yum just fine.  I’m not 100% sure root was needed, so feel free to try it as the normal user first.

yum install vim

I jumped back into the normal user after that (by removing the -u root from the command above).

Then set up Airflow’s home directory and database.

  • Set the Airflow home directory (permanently for the user).
    • vi ~/.bashrc and add this to the bottom of the file.
      • export AIRFLOW_HOME=~/airflow
    • Then re-source the file so you can use it immediately:
      • source ~/.bashrc
  • Initialize the Airflow database (we just did defaults, so it will use a local SQLite DB).
    • airflow initdb

Then verify the install worked by checking its version:

root@03bae42c5cdb:/# airflow version
[2018-11-07 20:26:44,372] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
   v1.10.0

Run Airflow Services

The actual Airflow hello world page here: https://airflow.apache.org/start.html just says to run Airflow like this:

  • airflow webserver -p 8080
  • airflow scheduler

You probably want to run these in the background, send the logs to a file, and so on.

It’s more professional to just run it as a service (on CentOS/RHEL, which is why I switched to CentOS from Ubuntu).  But it turns out that running it as a service in Docker is tricky.

Even if you get everything set up properly, Docker by default enables/disables some features for security that make systemctl not work (so you can’t start the service).  It sounds like getting around this requires a whole rework: https://serverfault.com/questions/824975/failed-to-get-d-bus-connection-operation-not-permitted.

Also, I realize my idea may have been flawed in the first place (running it as a service in a container).  Containers are really intended to hold micro-services, so it would probably make more sense to launch the web server and the scheduler as their own containers and let them communicate with each other (I’m still figuring this out).  This thread nudged me into realizing that: https://forums.docker.com/t/systemctl-status-is-not-working-in-my-docker-container/9075.

It says:

Normally when you run a container you aren’t running an init system. systemctl is a process that communicates with systemd over dbus. If you aren’t running dbus or systemd, I would expect systemctl to fail.

What is the pid1 of your docker container? It should reflect the entrypoint and command that were used to launch the container.

For example, if I do the following, my pid1 would be bash:

$ docker run --rm -it centos:7 bash
[root@180c9f6866f1 /]# ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.7  0.1  11756  2856 ?        Ss   03:01   0:00 bash
root        15  0.0  0.1  47424  3300 ?        R+   03:02   0:00 ps faux

Since only bash and ps faux are running in the container, there would be nothing for systemctl to communicate with.


So, the below steps probably get it working if you set the container up right in the first place (as a privileged container), but it isn’t working for me for now.  So feel free to stop reading here and use Airflow, but it won’t be running as a service.

I might come back and update this post and/or make future one on how to run airflow in multiple containers.  I’m also aware that there is an awesome image here that gets everything off the ground instantly; but I was really trying to get it working myself to understand it better: https://hub.docker.com/r/puckel/docker-airflow/.

Service Setup (Not Complete Yet)

I found information on the Airflow website here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html stating:

Airflow can integrate with systemd based systems. This makes watching your daemons easy as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Redhat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non Redhat based system) you probably need to adjust the unit files.

Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define here, for example, AIRFLOW_HOME or AIRFLOW_CONFIG.

I didn’t see much in the installation itself, so I found the scripts on GitHub for the 1.10 version that we are running (based on the version output earlier):

https://github.com/apache/incubator-airflow/tree/v1-10-stable/scripts/systemd

Based on this, I started adapting those unit files; but as noted above, I haven’t gotten the service working inside Docker yet.

 

Python List Comprehension

List comprehensions in Python are a short-hand way to create lists from other sequences, with optional transformation and filtering logic.

For example, let’s generate all the even numbers from 0 to 15 in one line:

[x for x in range(15) if x % 2 == 0]
#Output: [0, 2, 4, 6, 8, 10, 12, 14]

Note that range() returns a sequence.  You could just as easily have given it a literal list like [0, 1, 2, 3, …] or a list from earlier in your code.
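If comprehensions are new to you, it may help to see the plain-loop equivalent of the example above:

# The comprehension above is just short-hand for this loop:
evens = []
for x in range(15):
    if x % 2 == 0:
        evens.append(x)
#Output: evens == [0, 2, 4, 6, 8, 10, 12, 14]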

You can have multiple for clauses. Each one can run over a different source list. You’ll end up with the Cartesian product though. For example:

[(x,y) for x in range(3) for y in range(2)]
#Output: [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

In these results, you’ll notice that x runs up to (but not including) 3 and y up to (but not including) 2, so we end up with 3 × 2 = 6 results.  If x and y ranged over large numbers, the result would be very big, and since we can have any number of for clauses, the result list can get large fast.
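A later for clause can also iterate over the variable bound by an earlier one, which is handy for flattening nested lists; here is a quick sketch:

# Flatten a nested list: the second for runs over each row from the first.
matrix = [[1, 2], [3, 4], [5, 6]]
flat = [n for row in matrix for n in row]
#Output: [1, 2, 3, 4, 5, 6]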

Also notice that we didn’t need any ifs for filtering in those examples.  Like fors, ifs can appear any number of times.  So, this is also valid:

[x for x in range(15) if x % 2 == 0 if x not in [2, 4, 6]]
#Output: [0, 8, 10, 12, 14]

In this case, we filtered down to the even numbers in the first if, and then filtered out some predefined numbers in the second using the not in operator (which is very cool in itself for a language to have).

Python Loop Index Variable Scope

While crash-studying Python for a new job, I found out that this code is actually not an error!

for i in [1, 2, 3]:
    pass  # Do nothing.
print(i)

It blew my mind that this code actually prints 3.  For some crazy reason, Python keeps the index variable around after the loop exits; it is not scoped to the loop.
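One related wrinkle (my own observation, not from the linked post): the name is only bound if the loop body actually runs, so looping over an empty iterable leaves it undefined:

for i in []:
    pass  # The body never executes, so i is never bound.
print(i)  # NameError: name 'i' is not defined (assuming no earlier i).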

I found this in the Python documentation, but it is described much better in this blog post: https://eli.thegreenplace.net/2015/the-scope-of-index-variables-in-pythons-for-loops/.

I heavily recommend reading that link as it has lots of good info (thanks to Eli Bendersky).  But in case you’re lazy, here’s a historical anecdote quoted from it that I particularly liked:

“Why this is so

I actually asked Guido van Rossum about this behavior and he was gracious enough to reply with some historical background (thanks Guido!). The motivation is keeping Python’s simple approach to names and scopes without resorting to hacks (such as deleting all the values defined in the loop after it’s done – think about the complications with exceptions, etc.) or more complex scoping rules.

In Python, the scoping rules are fairly simple and elegant: a block is either a module, a function body or a class body. Within a function body, names are visible from the point of their definition to the end of the block (including nested blocks such as nested functions). That’s for local names, of course; global names (and other nonlocal names) have slightly different rules, but that’s not pertinent to our discussion.

The important point here is: the innermost possible scope is a function body. Not a for loop body. Not a with block body. Python does not have nested lexical scopes below the level of a function, unlike some other languages (C and its progeny, for example).

So if you just go about implementing Python, this behavior is what you’ll likely end up with. Here’s another enlightening snippet:

for i in range(4):
    d = i * 2
print(d)

Would it surprise you to find out that d is visible and accessible after the for loop is finished? No, this is just the way Python works. So why would the index variable be treated any differently?

By the way, the index variables of list comprehensions are also leaked to the enclosing scope. Or, to be precise, were leaked, before Python 3 came along.”

And for those like me who didn’t know, Guido van Rossum is the author of the Python programming language.

Oh, and by the way, you can avoid this variable leak by using map with a lambda, according to the Python documentation here: https://docs.python.org/3.6/tutorial/datastructures.html.

For example:

squares = list(map(lambda x: x**2, range(10)))
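And as the quote above notes, in Python 3 a plain list comprehension no longer leaks its variable either, so this is equally side-effect-free:

squares = [x**2 for x in range(10)]
# In Python 3, this x is scoped to the comprehension itself; referencing
# x afterwards raises a NameError (unless it was defined elsewhere).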