Docker + Windows “Error starting userland proxy”

Docker Start Error

I ran into a new Docker issue today.  Basically, I restarted my PC, and when I tried to bring up a container with a Postgres instance I use for testing, I received this confusing error:

Error response from daemon: driver failed programming external connectivity on endpoint postgres (15b348b1f5bf8d2bfd17c1c41b340d1c66f63ace7cab39ea69aeca3f69ed7442): Error starting userland proxy: mkdir /port/tcp:0.0.0.0:5432:tcp:172.17.0.2:5432: input/output error
Error: failed to start containers: postgres

What Does it Mean?

It turns out this is a big headache which is still unresolved, and which has one of the longer GitHub issue threads I’ve ever seen right here.

Here’s a summary of it:

  • Windows 10 has a “Fast Startup” mode, and Docker doesn’t play well with it (or vice versa).
  • So, after a restart, you may find that you see this issue.
  • Theoretically, restarting the Docker daemon fixes this (which is a little annoying but fine).  You should be able to do that in Services.
  • That personally didn’t help me on the first try.  So, I went and disabled Fast Startup mode (which is also annoying) as follows (there’s a scripted version sketched after this list):
    • Go to Start and type “Power and Sleep”, then click it when it pops up.
    • Click “Additional power settings” on the right.
    • Click “Choose what the power buttons do”.
    • Click “Change settings that are currently unavailable” and provide admin credentials if you can’t already toggle the “Turn on fast startup (recommended)” checkbox.
    • Turn off that checkbox.
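If you’d rather script that last part, here’s a minimal sketch using Python’s winreg module.  Fair warning: I haven’t verified this on every Windows build, and it assumes the commonly documented HiberbootEnabled registry value is what controls fast startup; it also needs an elevated (administrator) prompt.

import winreg

# ASSUMPTION: fast startup is controlled by the HiberbootEnabled value
# (0 = off, 1 = on) at this commonly documented location.  Run this from
# an elevated (administrator) prompt, and verify on your Windows build.
key_path = r"SYSTEM\CurrentControlSet\Control\Session Manager\Power"

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0,
                    winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "HiberbootEnabled", 0, winreg.REG_DWORD, 0)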

Note that once you reboot, you have to wait a bit for Docker to come up (it can take a few minutes).  For example, the first 4 or 5 times I ran “docker version”, the daemon showed as down even though I could see the service running.  But a minute later it was up and working fine.
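If you’d rather script the wait than re-run the command by hand, a tiny polling loop does the trick.  This is just a convenience sketch; nothing here is Docker-specific beyond shelling out to “docker version” and checking the exit code:

import subprocess
import time

# Poll "docker version" until the daemon responds.  A non-zero exit
# code means the client could not reach the daemon yet.
for attempt in range(30):
    result = subprocess.run(["docker", "version"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print("Docker daemon is up!")
        break
    print(f"Daemon not ready yet (attempt {attempt + 1}); waiting...")
    time.sleep(10)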

Database Star Schemas and Snowflake Schemas

Schema Confusion

A lot of people work with databases very regularly (even high-end ones), but get thrown by terms like star schema and snowflake schema because they’ve never had formal training in, or exposure to, data warehousing technologies.

These same people will often be perfectly comfortable with indexing, query optimization, foreign keys, concepts of de-normalization and normal forms, etc.

I personally started working with the actual “Snowflake” database recently (https://www.snowflake.com/about/) and had to review what a snowflake schema was when I started looking at it.

Useful Articles

I pretty quickly found an interesting article comparing star schemas and snowflake schemas, and backtracked from it to precursor articles digging into the star and snowflake schemas respectively (worth reading if you want the original content).  I’m just going to paraphrase them below to give people a quick overview and/or refresher.

Star Schema

A star schema just means that your main “fact” table has a primary key made out of multiple columns, each of which is a foreign key to a “dimension” table.  Then you have one or more “fact” columns (the measures, like a sales amount) in addition to the primary key.

The dimension tables will hold all the relevant attributes you may want to aggregate and/or query the main table on.  For example, you might have a table for the date which breaks out the year, month, day, and day-of-week so they can be used directly.  You may then have another dimension table for the geographical region with columns for the continent, country, and city, for example, so you can aggregate on those.

Each dimension table is de-normalized (NOT normalized), though.  So, if you have “New York City” as the city in a million rows, you are literally repeating that name a million times.  This makes queries easy to write but carries a penalty in data storage (which can be bad if you’re, say, in the cloud and paying more for storage over time).
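To make that concrete, here is a minimal sketch of a star schema using SQLite from Python.  The table and column names (dim_date, dim_region, fact_sales, and so on) are made up purely for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")

# De-normalized dimension tables: every attribute lives directly on the
# dimension row, so values like the country name repeat across rows.
conn.executescript("""
CREATE TABLE dim_date (
    date_id INTEGER PRIMARY KEY,
    year INTEGER, month INTEGER, day INTEGER, day_of_week TEXT
);

CREATE TABLE dim_region (
    region_id INTEGER PRIMARY KEY,
    continent TEXT, country TEXT, city TEXT
);

-- The main "fact" table: a composite primary key of foreign keys into
-- the dimensions, plus the fact column (measure) we aggregate.
CREATE TABLE fact_sales (
    date_id   INTEGER REFERENCES dim_date(date_id),
    region_id INTEGER REFERENCES dim_region(region_id),
    sales_amount REAL,
    PRIMARY KEY (date_id, region_id)
);
""")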

Snowflake Schema

Plain and simple: a snowflake schema is a star schema where the dimension tables are normalized.  This means that, for example, the geographical region dimension table itself would actually be turned into 4 tables (kind of its own star schema).  You would have one table for the continent, one for the country, one for the city, and one main table with the combination of the 3 as a primary key.

This makes queries more complex and possibly a little slower, but it means we have complete normalization and are not wasting any data storage.  Also, if, say, a city changed its name, we would have exactly one database cell to update, whereas in a star schema we would have to update potentially millions of rows with copies of that name.
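Continuing the hypothetical example above, snowflaking the region dimension splits it into one small table per level plus a main combination table, like so:

import sqlite3

conn = sqlite3.connect(":memory:")

# The same (hypothetical) region dimension, snowflaked: one table per
# level, plus a main table keyed by the combination, so a city name like
# "New York City" is stored in exactly one row.
conn.executescript("""
CREATE TABLE dim_continent (continent_id INTEGER PRIMARY KEY, continent_name TEXT);
CREATE TABLE dim_country   (country_id   INTEGER PRIMARY KEY, country_name   TEXT);
CREATE TABLE dim_city      (city_id      INTEGER PRIMARY KEY, city_name      TEXT);

CREATE TABLE dim_region (
    continent_id INTEGER REFERENCES dim_continent(continent_id),
    country_id   INTEGER REFERENCES dim_country(country_id),
    city_id      INTEGER REFERENCES dim_city(city_id),
    PRIMARY KEY (continent_id, country_id, city_id)
);
""")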

Why the Names?

If you think of a “Star Schema”, picture a main table with, say, 5 extra dimension tables around it like the 5 points of a star.  Makes sense, right?

Now, for a snowflake, picture each point being 5 tables by itself… so each point is its own star.  This starts to branch out like a snowflake.  Just think of fractals if you don’t believe me :).


Jupyter Contributor Extensions – Auto Run + Hide Code

My Problem

I’ve been messing around with Jupyter quite a bit, trying to make a nice notebook for people who are not necessarily coders.  So, it would be nice to give them a mostly graphical notebook with widgets and just let them play with the results at the end.

Unfortunately, you cannot auto-run things in Jupyter notebooks properly, and the hacks are brittle.  You also cannot hide code easily, etc.

The Solution?

Thankfully, while these features are not built into Jupyter for some reason, it turns out there are a ton of contributor extensions for Jupyter!  For example, if you need to reliably auto-run cells on start-up and hide their code, you can install the init_cell plugin and the hide_input plugin.

The installation is actually very easy and can be done in a few bash commands, as shown below, and there are a ton of other plugins around that you can use as well.  You can even manage the plugins from within the UI after you set them up.

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --system
jupyter nbextension enable init_cell/main
jupyter nbextension enable hide_input/main

To use these, just read the links (or the documentation inside the UI at Edit > NB Extensions Config).  You’ll find that there is just a cell metadata attribute you can add for each cell you want to affect.  You can enable and disable the plugins in the UI as you like too.
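For example (this is from memory, so double-check it against each extension’s docs), a cell that should auto-run on load with its code hidden just needs cell metadata entries roughly like this:

{
    "init_cell": true,
    "hide_input": true
}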

Jupyter/Hub – Export Data to User (Not Local) PC

While building a mostly widget-based notebook for some other people, I came across a situation where I needed to allow them to export data from a pandas data frame to CSV.  This seemed trivial, but it actually was not.

What’s the Problem!?

I was building a notebook that was intended to run on Jupyter Hub… so, not on the same PC as the person using it.  That meant that when I just saved the file, it ended up on the notebook server, where the user could not access it.

Solutions?

My first thought was to have the notebook servers automatically set up a file server and just save the files there.  Then the notebook could give users a URL to the file via the file server.  I’m sure this would work, but it requires extra components and would need some clean-up of old files now and then.

While searching online, I found this solution, which is much more elegant (though it will take a little extra memory).

It base64-encodes the content and provides a link to it that way (the link itself actually contains all the data).  You can find the original article on Medium by clicking here.  It has some other options as well.  I changed the display line, added an extra import, and altered some CSV generation arguments; aside from that, it is theirs.


from IPython.display import HTML, display
import base64

def create_download_link(df, title="Download CSV file", filename="data.csv"):
    # Render the data frame as CSV, then base64-encode it so the whole
    # payload can be embedded in a data: URI.
    csv = df.to_csv(index=False, line_terminator='\r\n')
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    # Build an anchor tag whose href carries the encoded CSV itself.
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload, title=title, filename=filename)
    return HTML(html)

display(create_download_link(your_data_frame))

I hope it helps you!

What is Jupyter Hub?

First Things First… What is Jupyter?

Lately, I’ve been moving into the Python world, where I quickly encountered Jupyter notebooks.  They seem like a pretty dominant technology that lets you script Python block-by-block and render the results.  You can also render data into charts, manage user-interface widgets, and do most anything else.

What is the Problem With Jupyter?

But Jupyter really just runs on a single machine.  What about when you want to share this environment to, say, teach a class, or work with a team of data scientists?

So… We Have Jupyter Hub!

Jupyter Hub is a multi-user version of Jupyter… so it fixes our problems! Here I’ll paraphrase content and use images from a wonderful video I watched on YouTube – you can watch it at the bottom of this post if you like.

Basically, Jupyter Hub just provides a higher-level service on top of the standard Jupyter notebooks.  It contains:

  1. A proxy server to route requests.
  2. A “hub” which handles authentication, user details, and spawning new notebooks.  Authentication is flexible and can most likely tie into your corporate authentication system.
  3. Any number of spawned Jupyter processes to run notebooks for the given users.  A variety of spawning techniques exist (e.g. spawning to Docker).

You can see this architecture below.

[Diagram: Jupyter Hub architecture showing the proxy, the hub, and the spawned single-user notebook servers]
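If you want a feel for how those pieces get configured, here is a minimal, illustrative jupyterhub_config.py sketch.  Note that I’m assuming the optional dockerspawner package for the spawner, and the values shown are placeholders rather than a hardened setup:

# jupyterhub_config.py -- a minimal, illustrative configuration.
c = get_config()  # provided by JupyterHub when it loads this file

# Where the proxy listens for user requests.
c.JupyterHub.ip = '0.0.0.0'
c.JupyterHub.port = 8000

# Authentication: PAM (local system users) is the default; swap in your
# corporate authenticator class here if you have one.
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# Spawning: run each user's notebook server in a Docker container
# (assumes "pip install dockerspawner" and a pulled notebook image).
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
c.DockerSpawner.image = 'jupyter/base-notebook'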

So, if you need multi-user Jupyter, I suggest you look into installing and trying Jupyter Hub, and I highly recommend the video below as a starting point!

Escape HTML in WordPress

This sounds a little silly, but even as a developer I briefly got confused while trying to render (or rather, not render) HTML in WordPress.

I typically dump code into WordPress using their markdown syntax, which works great for almost everything.  But if you need to actually put HTML in a code block, that fails because it will literally get rendered into the page!

The solution is easy, though.  Just google “HTML Entity Encoder” or something similar and you’ll get to a site like this one: https://mothereff.in/html-entities, where you can enter your HTML and have it encoded so that it will display properly.

In case that doesn’t make sense to you, let’s use a div opening tag as an example.  It would change to &lt;div&gt;, where the “lt” is less-than (“<”) and the “gt” is greater-than (“>”).  Since it’s encoded, it will be displayed properly, but the div tag will not be interpreted as part of the page.
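If you’d rather not paste your HTML into a random website, Python’s standard library does the same encoding:

import html

# Encode HTML special characters (&, <, >, quotes) into entities so the
# markup displays as text instead of being rendered.
print(html.escape("<div>"))  # prints: &lt;div&gt;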


Jupyter Auto-Run Cells on Load

Why Do We Need This?

If you are making a Jupyter notebook that heavily uses widgets and conceals the code used to make them, you’ll quickly run into an issue. Another person coming to this notebook would basically just see this message for all of your widgets:

“A Jupyter widget could not be displayed because the widget state could not be found. This could happen if the kernel storing the widget is no longer available, or if the widget state was not saved in the notebook. You may be able to create the widget by running the appropriate cells.”

You can simulate this for yourself by pressing the “restart the kernel (with dialog)” button and then force-refreshing your browser (Ctrl + Shift + R in Chrome).

How Do We Do It?

I came across this Stack Overflow post, which gives a good solution (especially if you are already hiding code in other areas to make things look neater, like I noted earlier in this blog).

Just paste this in its own cell at the top of your notebook:

%%html
<script>
    // AUTORUN ALL CELLS ON NOTEBOOK-LOAD!
    require(
        ['base/js/namespace', 'jquery'], 
        function(jupyter, $) {
            $(jupyter.events).on("kernel_ready.Kernel", function () {
                console.log("Auto-running all cells-below...");
                jupyter.actions.call('jupyter-notebook:run-all-cells-below');
                jupyter.actions.call('jupyter-notebook:save-notebook');
            });
        }
    );
</script>

Then all your cells will run on load and all of your widgets will show up nice and neat the first time around.