Jupyter Auto-Run Cells on Load

Why Do We Need This?

If you are making a Jupyter notebook that heavily uses widgets and conceals the code used to make them, you’ll quickly run into an issue. Another person coming to this notebook would basically just see this message for all of your widgets:

“A Jupyter widget could not be displayed because the widget state could not be found. This could happen if the kernel storing the widget is no longer available, or if the widget state was not saved in the notebook. You may be able to create the widget by running the appropriate cells.”

You can simulate this for yourself by pressing the “restart the kernel (with dialog)” button and then force-refreshing your browser (Ctrl + Shift + R in Chrome).

How Do We Do It?

I came across this Stack Overflow post, which gives a good solution (especially if you are already hiding code in other areas to make things look neater, like I noted in this blog).

Just paste this in its own cell at the top of your notebook:

%%html
<script>
    // AUTORUN ALL CELLS ON NOTEBOOK-LOAD!
    require(
        ['base/js/namespace', 'jquery'], 
        function(jupyter, $) {
            $(jupyter.events).on("kernel_ready.Kernel", function () {
                console.log("Auto-running all cells-below...");
                jupyter.actions.call('jupyter-notebook:run-all-cells-below');
                jupyter.actions.call('jupyter-notebook:save-notebook');
            });
        }
    );
</script>

 
Then all your cells will run on load and all of your widgets will show up nice and neat the first time around.

Read-only / Protected Jupyter Notebooks

Jupyter notebooks are fantastic, but they’re really geared toward developers.  I had to lock one down so it could be used by non-developers (without them damaging it).  It took quite a lot of googling!

I figure that a lot of people must need this.  If I were a university instructor, I’d like to send students to a server, let them play, but prevent them from breaking my example.

Locking Things Down

Here are the different things I did to mitigate damage:

  1. You can make the entire notebook read-only by setting the file permissions on the command line or in the file’s properties (this works on both Windows and Linux).  Jupyter will detect this (after you reload the page) and show that you can’t save anything.
  2. You can make individual cells un-deletable and un-editable, so users can’t mess up the top cells that the lower cells they’re working with depend on (a full metadata example is shown after the code block below):
    • Run a cell.
    • Click “Edit Metadata” in its banner.
    • Add:
      • “deletable”: false,
      • “editable”: false
  3. You can also hide the code for a range of cells, like this code which hides the first four (very useful if you’re using IPython UI widgets and just want to show the widgets, not how they were made).  Disclaimer: I got this off Stack Overflow but am having trouble finding the post to reference it:
from IPython.display import HTML
HTML('''
<script>
    // Hide the code (and the move-up/move-down buttons) for the first four cells when
    // the page loads; calling the function again toggles them back.
    code_show = true;
    function code_toggle_and_hide_move_buttons() {
        if (code_show) {
            $('div.input').slice(0,4).hide();
            $('#move_up_down').hide();
        }
        else {
            $('div.input').slice(0,4).show();
            $('#move_up_down').show();
        }
        code_show = !code_show;
    }
    $(document).ready(code_toggle_and_hide_move_buttons);
</script>
<a href="javascript:code_toggle_and_hide_move_buttons()">Toggle the code for the first four cells</a>
''')
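For reference, the “Edit Metadata” dialog from step 2 just contains the cell’s metadata as JSON, so after adding the two keys it should include something along these lines (any other keys already in there can stay as they are):

{
    "deletable": false,
    "editable": false
}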

Further Recommendations

I would also suggest disabling the terminal in your Jupyter config file, and setting a known location for your notebooks to be loaded from so that you can mark those files read-only.

Also, you can disable various hotkeys in the UI, and you can use a selector similar to the one in my “hide code” example above to hide the move-up/move-down cell buttons, to help prevent errors creeping in that way.
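Here’s a rough sketch of the hotkey part, assuming the classic Jupyter Notebook front end (Jupyter.keyboard_manager doesn’t exist in JupyterLab, and the exact shortcut names can vary by version, so treat these as examples):

from IPython.display import HTML
HTML('''
<script>
    // Assumes the classic notebook UI; remove command-mode shortcuts that let users
    // delete or cut cells.
    Jupyter.keyboard_manager.command_shortcuts.remove_shortcut('d,d');  // delete cell
    Jupyter.keyboard_manager.command_shortcuts.remove_shortcut('x');    // cut cell
</script>
''')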

 

Python PIP Install Local Module While Developing

I’m definitely still in the early stages of learning module/package building and deployment for python.  So, take this with a grain of salt…

But I ran into a case where I wanted to develop/manipulate a package locally in PyCharm while I was actively using it in another project I was developing (actually, in a Jupyter notebook).  It turns out there’s a pretty cool way to do this.

Module Preparation

The first thing I had to do was prepare the package so that it was deployable using the standard python distribution style.

In my case, I just made a directory for my package (lower-case name, underscore separators).  Inside the directory, I created the files needed for a standard distribution (most importantly, a setup.py, which you can see being used by the install output below).

Here’s an example.  Ignore everything that I didn’t mention; all of that is auto-generated by PyCharm and not relevant.  In fact, it probably would have been better to create a sub-directory in this project for the package; but I just considered the top level directory the package directory for now.

[Screenshot: package layout in the PyCharm project]
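For what it’s worth, a minimal setup.py along these lines is enough for the editable install below; the name and version match the pip output further down, but the rest is just a sketch of my layout, so adjust it to yours:

from setuptools import setup, find_packages

# Minimal sketch of a setup.py for an editable ("develop") install.
# The name/version match the "pip list" output below; adjust packages to your layout.
setup(
    name='postgres-query-runner',
    version='1.1.0',
    packages=find_packages(),
)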

Module Installation

Once you have your module set up like this, you can jump into your command line (assuming you have pip installed) and run this command, with the path tailored to your package directory location:

λ pip install -e C:\dev\python\jupyter_audit_query_tools
Obtaining file:///C:/dev/python/jupyter_audit_query_tools
Installing collected packages: PostgresQueryRunner
Running setup.py develop for PostgresQueryRunner
Successfully installed PostgresQueryRunner

You’ll also be able to see the package mapped to that directory when you list the packages in PIP:

λ pip list | grep postgres
postgres-query-runner 1.1.0 c:\dev\python\jupyter_audit_query_tools

Module Usage

After this, you should be able to import and use the package / modules in your interpreter or notebook.  You can change the code in the package and it will update everywhere you’re using it, as long as you re-import the package.  In Jupyter, that means clicking the restart-kernel-and-re-run button.
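As a quick sanity check, something like this should work in a notebook cell (the module name here is hypothetical; substitute whatever your package actually contains):

# Hypothetical import; use one of your package's real modules.
import jupyter_audit_query_tools

# For an editable install, __file__ should point back at the development directory.
print(jupyter_audit_query_tools.__file__)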

Install Airflow on Windows + Docker + CentOS

Continuing on my journey: setting up Apache Airflow directly on Windows was a disaster for various reasons.

Setting it up in the WSL (Windows Subsystem for Linux) copy of Ubuntu worked great.  But unfortunately, you can’t properly run services and the like in it, and I’d like to run it in a state reasonably similar to how we’ll eventually deploy it.

So, my fallback plan is Docker on Windows, which is working great (no surprise there).  It was also much less painful to set up in the end than the other options.  I’m also switching from Ubuntu to CentOS (the community version of RHEL), since Airflow provides systemd service files that have been tested on Red Hat based systems: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html.

Assuming you have docker for Windows set up properly, just do the following to set up Airflow in a new CentOS container.

Get and Run CentOS With Python 3.6 in Docker

docker pull centos/python-36-centos7
docker container run --name airflow-centos -it centos/python-36-centos7:latest /bin/bash

Install Airflow with Pip

pip install --upgrade pip
export SLUGIFY_USES_TEXT_UNIDECODE=yes
pip install apache-airflow

Set up Airflow

First, install vim.  Yes… this is Docker, so the images are hyper-stripped down to contain only the essentials; you have to install anything else you need yourself.  I think I had to connect to the container as root to do this, using this command:

docker exec -it -u root airflow-centos /bin/bash

Then you can install it with yum just fine.  I’m not 100% sure root was actually needed, so feel free to try it as the normal user first.

yum install vim

I jumped back into the normal user after that (by removing the -u root from the command above).

Then set up Airflow’s home directory and database.

  • Set the Airflow home directory (permanently for the user).
    • vi ~/.bashrc and add this to the bottom of the file.
      • export AIRFLOW_HOME=~/airflow
    • Then re-source the file so the variable is available immediately:
      • source ~/.bashrc
  • Initialize the Airflow database (we just used the defaults, so it will use a local SQLite DB).
    • airflow initdb

Then verify the install worked by checking its version:

root@03bae42c5cdb:/# airflow version
[2018-11-07 20:26:44,372] {__init__.py:51} INFO - Using executor SequentialExecutor
  ____________       _____________
 ____    |__( )_________  __/__  /________      __
____  /| |_  /__  ___/_  /_ __  /_  __ \_ | /| / /
___  ___ |  / _  /   _  __/ _  / / /_/ /_ |/ |/ /
 _/_/  |_/_/  /_/    /_/    /_/  \____/____/|__/
v1.10.0

Run Airflow Services

The official Airflow quick start page (https://airflow.apache.org/start.html) just says to run Airflow like this:

  • airflow webserver -p 8080
  • airflow scheduler

You probably want to run these in the background and tell the logs to go to a file, etc.

It’s more professional to run it as a service on CentOS/RHEL (which is why I switched to CentOS from Ubuntu in the first place).  But it turns out that running it as a service inside Docker is tricky.

Even if you get everything set up properly, Docker by default restricts some capabilities for security reasons, and that keeps systemctl from working (so you can’t start the service).  It sounds like getting around this takes a fair amount of rework (read here): https://serverfault.com/questions/824975/failed-to-get-d-bus-connection-operation-not-permitted.

Also, I realize my idea may have been flawed in the first place (running it as a service in a container).  Containers are really intended to hold micro-services.  So, it would probably make more sense to launch the web server and the scheduler as their own containers and let them communicate with each other (I’m still figuring this out).  This thread nudged me toward realizing that: https://forums.docker.com/t/systemctl-status-is-not-working-in-my-docker-container/9075.

It says:

Normally when you run a container you aren’t running an init system. systemctl is a process that communicates with systemd over dbus. If you aren’t running dbus or systemd, I would expect systemctl to fail.

What is the pid1 of your docker container? It should reflect the entrypoint and command that were used to launch the container.

For example, if I do the following, my pid1 would be bash:

$ docker run --rm -it centos:7 bash
[root@180c9f6866f1 /]# ps faux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.7  0.1  11756  2856 ?        Ss   03:01   0:00 bash
root        15  0.0  0.1  47424  3300 ?        R+   03:02   0:00 ps faux

Since only bash and ps faux are running in the container, there would be nothing for systemctl to communicate with.


So, the steps below would probably work if you set the container up correctly in the first place (as a privileged container), but it isn’t working for me for now.  Feel free to stop reading here and use Airflow; it just won’t be running as a service.

I might come back and update this post and/or make a future one on how to run Airflow in multiple containers.  I’m also aware that there is an awesome image that gets everything off the ground instantly, but I was really trying to get it working myself to understand it better: https://hub.docker.com/r/puckel/docker-airflow/.

Service Setup (not complete yet)

I found information on the Airflow website here: https://airflow.readthedocs.io/en/stable/howto/run-with-systemd.html stating:

Airflow can integrate with systemd based systems. This makes watching your daemons easy as systemd can take care of restarting a daemon on failure. In the scripts/systemd directory you can find unit files that have been tested on Redhat based systems. You can copy those to /usr/lib/systemd/system. It is assumed that Airflow will run under airflow:airflow. If not (or if you are running on a non Redhat based system) you probably need to adjust the unit files.

Environment configuration is picked up from /etc/sysconfig/airflow. An example file is supplied. Make sure to specify the SCHEDULER_RUNS variable in this file when you run the scheduler. You can also define here, for example, AIRFLOW_HOME or AIRFLOW_CONFIG.

I didn’t see these scripts in my installation, so I found them on GitHub for the 1.10 version we are running (based on the version output earlier):

https://github.com/apache/incubator-airflow/tree/v1-10-stable/scripts/systemd

Based on this, I started working through those unit files, but as noted above, I haven’t completed the service setup yet.

 

Python List Comprehension

List comprehensions in Python are a shorthand way to build lists using loops and filtering logic.

For example, let’s generate all the even numbers below 15 in one line:

[x for x in range(15) if x % 2 == 0]
#Output: [0, 2, 4, 6, 8, 10, 12, 14]

Note that range() returns a sequence.  You could just as easily have passed a literal list like [0, 1, 2, 3, …] or a list from earlier in your code.
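For instance, the same filter works over a literal list:

[x for x in [0, 1, 2, 3, 4, 5] if x % 2 == 0]
#Output: [0, 2, 4]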

You can have multiple for clauses. Each one can run over a different source list. You’ll end up with the Cartesian product though. For example:

[(x,y) for x in range(3) for y in range(2)]
#Output: [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]

In these results, you’ll notice x runs from 0 up to (but not including) 3 and y runs from 0 up to (but not including) 2, so we end up with 3 × 2 = 6 results.  If x and y ranged over large numbers, the result would be very big.  We can have any number of for clauses, so the result list can get large fast.
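To make the evaluation order concrete, the two-for comprehension above behaves just like this nested loop:

pairs = []
for x in range(3):        # the first for clause is the outer loop
    for y in range(2):    # the second for clause runs once for each x
        pairs.append((x, y))
#pairs == [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]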

Also notice that we didn’t need any if clauses for filtering here; they’re optional.  Like for clauses, you can have any number of them.  So, this is also valid:

[x for x in range(15) if x % 2 == 0 and x not in [2,4,6]]
#Output: [0, 8, 10, 12, 14]

In this case, we filtered down to even numbers with the first condition and then excluded some predefined numbers using the not in operator (which is a very nice operator for a language to have built in).