Connecting to Hive from Python

I was using Hive via Presto for a project, but then I ran into an issue: Presto does not support Hive views.  So, to be kind to the user, I wanted to present the view definition so they could see how to query the underlying tables.

Unfortunately, you can’t get view definitions from Presto either! So, I had to query Hive directly from a Python project.

Two Options

There are two options that I found for achieving this, and surprisingly, neither one was great.  You would think this would be easy, right!?

  1. Use PyHive – This is the standard connector you would expect to find, except it does not install and/or work on Windows.  So, if you develop on Windows and deploy to Linux, it is painful.  It also needs some extra system packages to work, and those can be painful to track down.
  2. Use JayDeBeApi – This uses the Java JAR to connect to Hive, which means it needs Java installed on your machine.  DO NOT USE THIS – I quickly ran into a critical bug that happens on both Windows and Linux: if you open one connection, do work, and close it, you cannot open another connection.  There is a GitHub issue for it, and the reporter had to resort to putting the code in another script and calling it as a sub-process for each command, which is ridiculous.

So, as I’m deploying on Linux (even though I develop on Windows), PyHive wins.

More on PyHive

So, to install PyHive, you would do the following (but it probably won’t work yet, at least not on CentOS 7 where I tried it).

pip install pyhive[hive]

Additional Dependencies

In order to get “pyhive[hive]” to install on a server (I tested with CentOS 7), you have to ensure some other dependencies are available as well.

I was working with Python 3.6 in a virtual environment, and the following worked properly:

sudo yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
pip install pyhive[hive]

Windows Development

Note that if you do the install without the extra [hive], you will not get all the dependencies.  The reason they’re broken out is that the package technically supports both Hive and Presto, so you get to pick which dependencies you need.

This is a mixed blessing: you can install the package on Windows without the extra [hive] and develop against it, but if you try to execute the Hive code it will fail.  To run it on Linux, you need the full set of dependencies.

I recommend guarding the pyhive import and any related code in your project with if os.name != "nt": in order to ensure you can run through on Windows without getting errors.  Hopefully your project is like mine, where this is a side case and you can test plenty without the final calls.
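
As a rough sketch of what that guard looks like (the get_connection helper and the error message are just illustrative):

import os

# Only import pyhive where it can actually work (i.e. not on Windows).
if os.name != "nt":
    from pyhive import hive


def get_connection(host, port=10000):
    if os.name == "nt":
        raise RuntimeError("Hive connections only work on our Linux deployments.")
    return hive.Connection(host=host, port=port)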

Query Code

The following is a short example of how to run a query with PyHive, assuming you have it all set up properly as we talked about above.  Note that the host name and the database/table names in the query are just placeholders for your own.

import logging

import pandas as pd
from pyhive import hive

logger = logging.getLogger(__name__)

conn = None
cursor = None

try:
    # Placeholder names - substitute your own database and table.
    query = "describe extended `my_database`.`my_table`"
    conn = hive.Connection(host="host-name", port=10000)

    cursor = conn.cursor()
    cursor.execute(query)
    query_results = cursor.fetchall()
    column_names = [part[0] for part in cursor.description]
    df = pd.DataFrame(query_results, columns=column_names)

except Exception:
    # logger.exception records the message plus the stack trace at ERROR level.
    logger.exception("Error while pulling view details.")
    raise

finally:

    if cursor is not None:
        cursor.close()
    if conn is not None:
        conn.close()

Logging in Python 3 (Like Java Log4J/Logback)

What is Proper Logging?

Having a proper logger is essential for any production application.  In the Java world, almost every framework automatically pulls in Logback or Log4J, and libraries tend to use SLF4J in order to be logger-agnostic while still wiring up to those loggers.  So, I set out to see how to do similar logging in Python.

While it can get fancier, I think the following things are essential when setting up a logger, so they were what I was looking for:

  1. It should be externally configured from a file that your operations team can change.
  2. It should write to a file automatically, not just console.
  3. It should roll the file it writes to at a regular size (date or time rolling on top of that can be beneficial too; but the size restriction ensures you won’t fill up your disk with a ton of logs and break your applications).
  4. It should keep a history of a few previous rolled files to aid debugging.
  5. It should use a format that specifies both the time of the logs and the class that logged them.

On top of these, obviously we must be able to log at different levels and filter out which logs go to the file easily.  This way, when we have issues, operations can jack up the logging level and figure out what is going wrong as needed.

How Do We Do it in Python 3?

It turns out that Python actually has a strong logging library built into its core distribution.  The only extra library I had to add to use it was PyYAML, and even that could have been avoided (Python supports JSON out of the box, and that could be used instead, but the community seems to prefer YAML configuration).

In the place where your app starts up, write the following code.  Note that you have to install the PyYAML module yourself.  Also, this expects “logging.yaml” to be in the same directory as the startup code (change that if you like, though).  We’ll show the “logging.yaml” content below.

import logging
import logging.config
import yaml

# Initialize the logger once as the application starts up.
with open("logging.yaml", "rt") as f:
    config = yaml.safe_load(f.read())
logging.config.dictConfig(config)

# Get an instance of the logger and use it to write a log!
# Note: Do this AFTER the config is loaded above or it won't use the config.
logger = logging.getLogger(__name__)
logger.info("Configured the logger!")

Then, when you want to use the logger in other modules, simply do this:

import logging

logger = logging.getLogger(__name__)
logger.info("Using the logger from another module.")

Of course, you only have to import logging and get the logger instance once at the top of each module, not every time you write a log.

This code uses “logging.yaml” which contains the following settings. Note that:

  • It defines a formatter with the time, module name, level name, and the logging message.
  • It defines a rotating file handler which writes to operations-audit-query-service.log, rolling the file at 10MB and keeping 5 historic copies.  The handler is set up to use our simple format from above.
  • The “root” logger writes to our handler and allows only INFO-and-above messages through.
  • The handler is set to DEBUG, so if the root logger is increased to DEBUG during an investigation, it will let the messages through to the log file.

Here is the “logging.yaml” example file:

---
version: 1
disable_existing_loggers: False

# Define format of output logs (named 'simple').
formatters:
    simple:
        format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

handlers:

    # Create rotating file handler using 'simple' format.
    file_handler:
        class: logging.handlers.RotatingFileHandler
        level: DEBUG
        formatter: simple
        filename: operations-audit-query-service.log
        maxBytes: 10485760 # 10MB
        backupCount: 5
        encoding: utf8

root:

    level: INFO
    handlers: [file_handler]
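
With this configuration loaded, a quick sanity check of the level filtering might look like the following (assuming the dictConfig call from earlier has already run):

logger = logging.getLogger(__name__)
logger.debug("Filtered out: the root logger only allows INFO and above.")
logger.info("This one passes the root logger and lands in the log file.")

If operations later raise the root level to DEBUG in the YAML, the first message starts appearing too, since the handler itself already allows DEBUG through.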

References

The code and YAML for this was adapted from this very good blog which I recommend reading: https://fangpenlin.com/posts/2012/08/26/good-logging-practice-in-python/.

Python Dictionary Comprehension For Multi Level Caching

What’s the Use Case?

I was coding a multi-level cache in Python and came across dictionary comprehensions.  It turns out they are very useful for this! So, it’s a nice example to teach the feature.

Let’s say our cache is like a database schema layout:

  1. database_1
    1. table_1
      1. column_1 – type 1
      2. column_2 – type 2
    2. table_2
      1. column_1 – type 1
      2. column_2 – type 2
  2. database_2
    1. table_1
      1. column_1 – type 1

What’s a Dictionary Comprehension?

A dictionary comprehension basically lets you create a dictionary from an expression.  So, you can essentially say “for each value in this list, create a key in my dictionary whose value is X, if some condition is true”.

A couple things to note:

  • You can technically provide any number of input iterables.  So, if you want to form a dictionary from multiple sources, it can work; but I’m not going to get into that here – google elsewhere!
  • You can provide any number of “if” clauses to prune the results down.  You could achieve the same thing with one combined condition, but using one clause per condition is neater to write (there is a sketch of this after the quick example below).

A quick example:

>>> some_list = [1,3,8,12]
>>> {key: key * key for key in some_list if key * key % 2 == 0}
{8: 64, 12: 144}

So, here we can see that we took in a list of 4 numbers and made a dictionary out of each number and its square, but only kept the entries where the square was even.

The first part, “key: key * key”, is really just a key : value pair.  The key is on the left, and the value (the key * key) could be anything you want on the right.  You can call “key” anything you like in the “for key” section.  The “in some_list” part is the source collection where our input comes from – again, you can have multiple of these.  Finally, the “if key * key % 2 == 0” is a filter condition which, again, you can have multiple of.
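
For instance, here is the same comprehension written with two separate “if” clauses; chaining them behaves like joining the conditions with “and” (the key > 2 condition is just for illustration):

>>> {key: key * key for key in some_list if key > 2 if key * key % 2 == 0}
{8: 64, 12: 144}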

Why is it Useful For Multi-Level Caching?

In our database example, we must first query the list of databases, then query the list of tables for any database we care about, then query the list of columns for any table we care about.

We don’t want to cache things we don’t need.

So, first off, it would be nice to prime the cache with just the database names and empty table values like so. If the cache is already populated, we just return its top level keys which are the database names:

if cache is None:
    cache = {database: None for database in get_database()}
return list(cache.keys())

Now, what about when the user goes to list the tables in a database?

if cache[database_name] is None:
    cache[database_name] = {table: None for table in get_tables(database_name)}
return list(cache[database_name].keys())

Finally, what about when the user goes to list the columns in a table?

if cache[database_name][table_name] is None:
    cache[database_name][table_name] = get_columns(database_name, table_name)
return cache[database_name][table_name]

So, we can see here that it was trivial to use dictionary comprehensions to turn a list into a dictionary with empty (None) values as a utility while building the multi-level cache out – which is very cool.

This might not have been the best way to build a cache – but it seems pretty simple and efficient to me.  Admittedly, building classes around things is usually a better approach though :).
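
For completeness, here is a rough sketch of what a class-based version might look like, using the same hypothetical get_database/get_tables/get_columns functions from above (like the snippets above, it assumes each level is listed before drilling into the next):

class SchemaCache:
    # Lazily caches database names, their tables, and each table's columns.

    def __init__(self):
        self._cache = None

    def list_databases(self):
        if self._cache is None:
            self._cache = {database: None for database in get_database()}
        return list(self._cache.keys())

    def list_tables(self, database_name):
        if self._cache[database_name] is None:
            self._cache[database_name] = {table: None for table in get_tables(database_name)}
        return list(self._cache[database_name].keys())

    def list_columns(self, database_name, table_name):
        if self._cache[database_name][table_name] is None:
            self._cache[database_name][table_name] = get_columns(database_name, table_name)
        return self._cache[database_name][table_name]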
 

Simple Flask Web Service With Debugging

I needed to make a quick Python web service and quickly found that, of the two most common frameworks, Flask is intended to be minimalist and extensible while Django is intended to be large out of the box and somewhat opinionated (it has ORM tools, etc.).

Being that I wanted to get something running fast and I basically just needed a REST API to return a couple results sets from a database, I chose to go with Flask.

Simple Example

The Flask documentation itself has a wonderful example here: http://flask.pocoo.org/docs/0.12/quickstart/#a-minimal-application which I followed along with… but here is the CliffsNotes version:

First, set up a new virtual environment (PyCharm is a good IDE and offers this as an option when creating a new project).

Create a new file at the top-level of your project and enter this code:

from flask import Flask, request

app = Flask(__name__)

@app.route('/query/run', methods=['GET'])
def run_query():
    query = request.args.get('query')
    limit = request.args.get('limit')
    return f"You ran {query} with {limit} records."


if __name__ == "__main__":
    app.run(debug=True)

Run the application with PyCharm, or just with your python interpreter.

At this point, you should have a running flask application, and if you go to:

localhost:5000/query/run?query="select * from abc"&limit=5

in your browser, you should see the response properly.  You’ve also turned on the interactive debugger (which can be a security issue, so only do this locally).  That means that if you make changes to your file, Flask will reload the app and use them immediately.  So, if you change “records” to “rows” in the code, save, and refresh the browser, you’ll see the change.

Note that this is just available locally at the moment, as we’re focused on developing the application, not on deploying it yet.  So, if you want to show it to someone else from your desktop, you will have to tell Flask to expose the service on all interfaces (0.0.0.0).
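
That is just a one-line change to the run call (again, be careful doing this with the debugger enabled; only use it on a network you trust):

if __name__ == "__main__":
    # Bind to all interfaces so other machines can reach the service.
    app.run(host="0.0.0.0", debug=True)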

No Private Variables / Methods in Python (Jupyter & iPython)!?

Having coded for a long time and in a relatively large number of languages, I was a little panicky to realize that Python doesn’t have private variables or methods.

Some Context

When I came across this fact, I was trying to write an iPython notebook for use by others.  You really can’t secure things in iPython, as users have access to execute arbitrary code, but I thought that you could at least make things relatively hard to break by storing them in a module and loading that.  But even this doesn’t work, because nothing is private in Python – a user could just interrogate my classes and get right to the connection-information variable (even with little to no knowledge of programming).

In Java, you can technically get around private variables by cracking open classes with reflection… but non-technical people wouldn’t know that and most programmers wouldn’t bother.  The entry bar in Python is a lot lower unfortunately.

What Does Python Do Instead?

This Stack Overflow post says the following, which helps shed some light on the situation:

It’s cultural. In Python, you don’t write to other classes’ instance or class variables. In Java, nothing prevents you from doing the same if you really want to – after all, you can always edit the source of the class itself to achieve the same effect. Python drops that pretence of security and encourages programmers to be responsible. In practice, this works very nicely.

If you want to emulate private variables for some reason, you can always use the __ prefix from PEP 8. Python mangles the names of variables like __foo so that they’re not easily visible to code outside the class that contains them (although you can get around it if you’re determined enough, just like you can get around Java’s protections if you work at it).

By the same convention, the _ prefix means stay away even if you’re not technically prevented from doing so. You don’t play around with another class’s variables that look like __foo or _bar.

So… basically, Python’s style guide, PEP 8, suggests using “_xyz” for internal identifiers and “__xyz” (with 2 underscores) for private identifiers.  Double-underscore identifiers get name-mangled outside their class (e.g. __foo on MyClass is actually stored as _MyClass__foo), so people won’t stumble into them by accident… but they’re not really private.  People can still probe around, find them, and use them if they are determined.
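
A quick illustration of the mangling, using a throwaway class:

class Secrets:
    def __init__(self):
        self._internal = "stay away, by convention"
        self.__hidden = "name-mangled, but not actually private"

s = Secrets()
print(s._internal)         # works fine; the single underscore is only a convention
print(s._Secrets__hidden)  # the mangled name still gets you in
# print(s.__hidden)        # this, however, raises AttributeError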

Again, in Java you could go use reflection to crack open a private class… so while I’m a little annoyed at this in Python, it is true that it’s not really terribly different from a real security standpoint.

Final Thoughts

So… it seems that if you want to keep real secrets (like database connection details) safe, you have to put them in a separate application on a separate server, behind an API.  That way the user (especially in an iPython context) is completely decoupled from the information as a whole.
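
As a rough sketch of that pattern, the notebook would call a small web service instead of holding credentials itself.  The endpoint, the run_query stub, and the connection details below are all hypothetical:

from flask import Flask, jsonify

app = Flask(__name__)

# The credentials live only on this server; the notebook never sees them.
DB_CONNECTION_INFO = {"host": "db-host", "user": "service-account"}

def run_query(connection_info, view_name):
    # Hypothetical stand-in; in real life this would query the database.
    return {"view": view_name, "definition": "select * from some_table"}

@app.route("/view-definitions/<view_name>", methods=["GET"])
def get_view_definition(view_name):
    return jsonify(run_query(DB_CONNECTION_INFO, view_name))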

Jupyter Contributor Extensions – Auto Run + Hide Code

My Problem

I’ve been messing around with Jupyter quite a bit, trying to make a nice notebook for people who are not necessarily coders.  So, it would be nice to give them a mostly graphical notebook with widgets and just let them play with the results at the end.

Unfortunately, you cannot properly auto-run things in Jupyter notebooks out of the box, and the hacks for it are brittle.  You also cannot easily hide code, etc.

The Solution?

Thankfully, while these features are not built into Jupyter for some reason, it turns out there are a ton of contributor extensions for it!  For example, if you need to reliably auto-run cells on start-up and hide their code, you can install the init_cell and hide_input plugins.

The installation is actually very easy and can be done in a few bash commands as shown below, and there are a ton of other plugins around that you can use as well.  You can even manage the plugins from within the UI after you set them up.

pip install jupyter_contrib_nbextensions
jupyter contrib nbextension install --system
jupyter nbextension enable init_cell/main
jupyter nbextension enable hide_input/main

To use these, just read each extension’s documentation (or the documentation inside the UI at Edit > NB Extensions Config).  You’ll find that there is just a cell metadata attribute you can add for each cell you want to affect.  You can enable and disable the plugins in the UI as you like too.
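
For what it’s worth, the metadata is just a boolean flag per cell – something along these lines (editable via View > Cell Toolbar > Edit Metadata), though do check each extension’s documentation for the exact attribute names:

{
    "init_cell": true,
    "hide_input": true
}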

Jupyter/Hub – Export Data to User (Not Local) PC

While building a mostly widget-based notebook for some other people, I came across a situation where I needed to let them export data from a pandas data frame to CSV.  This seemed trivial, but it actually was not.

What’s the Problem!?

I was building a notebook that was intended to run on JupyterHub… so, not on the same PC as the person using it.  So, when I just saved the file, it ended up on the notebook server where the user could not access it.

Solutions?

My first thought was to have the notebook servers automatically set up a file server and just save the files there.  Then the notebook could give users the URL to the file via the file server.  I’m sure this would work, but it requires extra components and would require some clean-up of old files now and then.

While searching online, I found this solution which is much more elegant (though it will take a little extra memory). 

It base64-encodes the content and provides a link to it that way (the link itself actually contains all the data).  The original Medium article has some other options as well.  I changed the display line and added an extra import and some altered CSV-generation arguments; aside from that, it is theirs.


from IPython.display import HTML, display
import base64

def create_download_link(df, title="Download CSV file", filename="data.csv"):
    csv = df.to_csv(index=False, line_terminator='\r\n')
    b64 = base64.b64encode(csv.encode())
    payload = b64.decode()
    html = '<a download="{filename}" href="data:text/csv;base64,{payload}" target="_blank">{title}</a>'
    html = html.format(payload=payload,title=title,filename=filename)
    return HTML(html)

# your_data_frame is a placeholder for whatever DataFrame you want to export.
display(create_download_link(your_data_frame))

I hope it helps you!