Hive Metastore DB (HMS DB) – Get All Tables/Columns including Partition Keys

How Hive Stores Schemas

I recently had to sync the schemas from a Hive database to a normal MySQL database for reasons I won’t bother going into.  The exercise required me to get all columns from all tables in each Hive database, though, and I found that this was not amazingly straightforward.

The Hive metastore DB is a normal MySQL (or similar) database with the metastore schema in it.  That schema describes Hive’s databases, tables, and columns as rows in its own tables, so the RDBMS information schema tells you nothing about your Hive tables.  To get Hive table details, you have to interrogate the TBLS table; to get columns, you interrogate COLUMNS_V2; and to get the databases themselves, you look to the DBS table.
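
For example, a minimal lookup of the tables in one database looks like this (a sketch assuming a MySQL-backed metastore with the default table names; 'my_db' is a placeholder):

select d.NAME as db_name, t.TBL_NAME as table_name
from DBS d
join TBLS t on t.DB_ID = d.DB_ID
where d.NAME = 'my_db'
order by t.TBL_NAME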

Missing Columns

Other posts I’ve seen would leave you here, but I found that when I joined these three tables together (plus some others you have to route through), I was still missing some columns in certain tables.  It turns out that partition columns are implicit in Hive.  In the underlying file system of Hive data (like HDFS), a partition column is represented literally by naming each directory with the partition value; the data files themselves contain no column holding that value.

What this means is that partition columns don’t show up in these normal metadata tables.  You have to look to a separate partition keys table (PARTITION_KEYS) to find them with a separate query.
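
To make this concrete, here is a sketch (the table name, column names, and path are invented for illustration): a Hive table partitioned by dt keeps the dt value only in directory names like .../warehouse/sales/dt=2020-01-01/, and the metastore records dt in PARTITION_KEYS rather than COLUMNS_V2.

-- HiveQL: dt never appears inside the data files, only in the directory names
create table sales (id bigint, amount double)
partitioned by (dt string);

-- Metastore DB: the partition column shows up in PARTITION_KEYS, not COLUMNS_V2
select t.TBL_NAME, k.PKEY_NAME, k.PKEY_TYPE, k.INTEGER_IDX
from TBLS t
join PARTITION_KEYS k on k.TBL_ID = t.TBL_ID
where t.TBL_NAME = 'sales';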

The Working Query

The query below finds all columns of either kind and sorts them in the order they’ll appear when you select from a table in Hive/Presto/etc. (:dbName is a bind parameter; substitute the database name you care about).  I hope it helps you!

select db_name, table_name, column_name from (
    -- regular (non-partition) columns
    select d.NAME as db_name, t.TBL_NAME as table_name, c.COLUMN_NAME as column_name, c.INTEGER_IDX as idx
    from DBS d
    join TBLS t on t.DB_ID = d.DB_ID
    join SDS s on t.SD_ID = s.SD_ID
    join COLUMNS_V2 c on c.CD_ID = s.CD_ID
    where d.NAME = :dbName
    union
    -- partition key columns, offset so they sort after the regular columns
    select d.NAME as db_name, t.TBL_NAME as table_name, k.PKEY_NAME as column_name, 50000 + k.INTEGER_IDX as idx
    from DBS d
    join TBLS t on t.DB_ID = d.DB_ID
    join PARTITION_KEYS k on t.TBL_ID = k.TBL_ID
    where d.NAME = :dbName
) x
order by db_name, table_name, idx

Does Spring JdbcTemplate Close Connections? … Not Always.

Common Advice – Correct?

Decent developers usually know that they have to try/catch/finally to ensure they clean up connections, file handles, and any number of other things.  But then, for Java, you hear: “just use JdbcTemplate!  It does all this boilerplate for you!”

Uncommon Scenario

Normally, when you’re writing an average app, you want lots of queries to be able to run in parallel, efficiently, using the same user and password.  In that case, you can easily just use a connection pool and “not worry about it”.  A Spring JdbcTemplate will grab connections from your data source, and a pooling data source hands them out and reclaims them appropriately.  You don’t have to worry about whether they are opened, closed, or whatever.

I ran into a scenario today where that was not true, though.  I have an app where each user connects to each back-end data source using their own personal account, which is managed by the application itself.  So each user needs his or her own connection, and pooling would not make much sense unless each user had to do parallel operations (which they don’t).

What Happens to the Connections?

So, here’s the fun part.  I had, for the longest time, assumed that JdbcTemplates would clean up connections in addition to result sets.  In fact, you’ll see this claimed online a lot.  But be careful!  This does not appear to be the case, or if it is, it is at least data-source dependent… and that actually makes sense if you think about their purpose.

Here is how I verified this.  I created a JdbcTemplate based on a new data source each time (which is needed, as the user and password change).

private NamedParameterJdbcTemplate getJdbcTemplate(String email, String password) {
    // SimpleDriverDataSource does no pooling; every getConnection() call
    // opens a brand-new physical connection.
    SimpleDriverDataSource ds = new SimpleDriverDataSource();
    ds.setDriverClass(HiveDriver.class);
    ds.setUrl(url); // url is a field on the enclosing class
    ds.setUsername(email);
    ds.setPassword(password);
    return new NamedParameterJdbcTemplate(ds);
}

Then I used the template for a number of queries in a normal manner (like this):

getJdbcTemplate(email, password)
    .queryForList("describe extended `mytable`.`mytable`",
        new MapSqlParameterSource());

Then I took a heap dump of the process with this command (run it from the JDK’s bin folder, whether that is under Program Files on Windows or your Linux install location, adjusting as needed):

jmap.exe -F -dump:format=b,file=C:\temp\dump.bin your-pid

You can get the PID easily by looking at your running process from JVisualVM (which is also in the bin directory).

Once the dump is complete, load the file into JVisualVM (you need to pick the third file-type option in the open dialog to make it go in; I believe its pattern is a catch-all wildcard).

Finally, go to the Classes tab, go to the very bottom of the screen, and search for the class of interest (in my case, HiveConnection).  I could see as many instances as queries I had run, since each query made a new connection from a new data source.  They are definitely not being cleaned up.

This surprised me because, even though creating a new template/data source each time is not normal, I expected them to clean up the connections when they were garbage collected or as part of normal operations.  After thinking about it more, I realize operations in my case would not be “normal”, but the lack of cleanup when the objects go out of scope is still definitely a surprise to me.

 

Azure PaaS Postgres 10 Database Create + Connect with CentOS psql or DBeaver

Today I started using the “Azure Database for PostgreSQL” PaaS service offering.  It went pretty smoothly, but connecting took a little more effort than I expected (all for good reasons!).

Creating the PostgreSQL Service

You can find the creation screen in the Azure portal by pressing (+), clicking Databases, and scrolling down.

As with most things in Azure, creating the service through the portal was pretty trivial.  You basically just provide the name, region, resource group, subscription, select the size you want, specify a user + password, and you’re done!  It takes around a minute to complete with a smallish database size.

[Screenshot: postgres-create]

Connecting to the Database

We’re going to connect with DBeaver (it’s like SQuirreL and DbVisualizer, if you haven’t heard of it).  Then we will also connect with the “psql” command-line utility from Linux.  This should be pretty quick, but there are two wrenches in the works:

  1. SSL is enabled.
  2. Azure blocks all inbound IPs by default; nothing can connect in.

Connecting with DBeaver

  • Go to your Postgres instance in the portal and view the “Overview” screen.
  • Open DBeaver, create a new Postgres connection.
  • Copy the server name from the portal into the host section of DBeaver.
  • Copy the Server Admin Login name from the portal into the user name section of DBeaver.
  • Type in your password for that Admin user.
  • Set the database to postgres in DBeaver.
  • You can leave the port as the default 5432.
  • Now, go to driver properties on the left of DBeaver and set:
    • ssl to true
    • sslmode to require

This is shown here:

[Screenshot: dbeaver-postgres]

At this point, you’ve got all the connection details in DBeaver set up properly, but you still can’t connect.  You’ll have to go into the Azure portal, click “Connection Security”, and create a firewall rule that allows your IP in.  Alternatively, you can add a pre-defined subnet you have for yourself, your company, etc., at which point everything on that subnet will be able to connect properly.

After this, you should be able to “Test Connection” successfully.

Connecting with psql from CentOS 7

Assuming you opened up the firewall or subnet as noted at the end of the previous example with DBeaver, you can then just:

Install the PostgreSQL client package (which provides psql):
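
On CentOS 7 the client comes from the stock postgresql package (assuming the default yum repositories):

  • sudo yum install postgresql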

And connect with the psql utility:

  • psql "sslmode=require host=yourhost.postgres.database.azure.com dbname=postgres user=youruser@yourhost"

Database Star Schemas and Snowflake Schemas

Schema Confusion

A lot of people work with databases very regularly (even high-end ones) but get thrown by terms like star schema, snowflake schema, etc., due to a lack of formal training in, or exposure to, data warehousing technologies.

These same people will often be perfectly comfortable with indexing, query optimization, foreign keys, concepts of de-normalization and normal forms, etc.

I personally started working with the actual “Snowflake” database recently (https://www.snowflake.com/about/) and had to review what a snowflake schema was when I started looking at it.

Useful Articles

I found an interesting article on star schemas vs. snowflake schemas pretty quickly, and back-tracked it to precursor articles digging into star and snowflake schemas respectively.  I’m just going to paraphrase them below to give people a quick overview and/or refresher.

Star Schema

A star schema just means that your main (“fact”) table has a primary key made up of multiple columns, each of which is a foreign key to a “dimension” table.  Then you have one or more measure (“fact”) columns in addition to that primary key.

The dimension tables hold all the relevant attributes you may want to aggregate and/or query the main table on.  For example, you might have a dimension table for the date which breaks out the year, month, day, and day-of-week so they can be used directly.  You may then have another dimension table for the geographical region with columns for the continent, country, and city, for example, so you can aggregate on those.

Each dimension table is NOT normalized, though.  So, if you have “New York City” as the city for a million dimension rows, you are literally repeating that value a million times.  This makes queries easy to write but has a penalty in terms of data storage (which can be bad if you’re, say, in the cloud and paying more for storage over time).
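
As a rough sketch (all table and column names here are invented for illustration), a star schema for sales data might look like this, with the date and location attributes flattened into single, de-normalized dimension tables:

-- dimension tables: flat and de-normalized (attribute values repeat across rows)
create table dim_date (
    date_id     int primary key,
    year        int,
    month       int,
    day         int,
    day_of_week varchar(10)
);

create table dim_location (
    location_id int primary key,
    continent   varchar(50),
    country     varchar(50),
    city        varchar(50)    -- 'New York City' repeated on every NYC row
);

-- fact table: foreign keys to the dimensions plus the measure ("fact") columns
create table fact_sales (
    date_id      int,
    location_id  int,
    sales_amount decimal(12,2),
    primary key (date_id, location_id),
    foreign key (date_id) references dim_date (date_id),
    foreign key (location_id) references dim_location (location_id)
);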

Snowflake Schema

Plain and simple: a snowflake schema is a star schema where the dimension tables are normalized.  This means that, for example, the geographical region dimension table itself would actually be turned into four tables (kind of its own star schema).  You would have one table for the continent, one for the country, one for the city, and one main table with the combination of the three as its primary key.

This makes queries more complex and possibly a little slower, but it means we have complete normalization and are not wasting any data storage.  Also, if, say, a city changed its name, we would have exactly one database cell to update, whereas in a star schema we would have to update potentially millions of rows with copies of that name.
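
Continuing the same invented example, the snowflake version breaks the location dimension apart exactly as described above: one small table per level, plus a combining table whose rows the fact table references:

-- each level of the location hierarchy becomes its own table
create table dim_continent (continent_id int primary key, continent varchar(50));
create table dim_country   (country_id   int primary key, country   varchar(50));
create table dim_city      (city_id      int primary key, city      varchar(50)); -- 'New York City' stored once

-- the combining table that the fact table points at
create table dim_location (
    location_id  int primary key,
    continent_id int,
    country_id   int,
    city_id      int,
    foreign key (continent_id) references dim_continent (continent_id),
    foreign key (country_id)   references dim_country (country_id),
    foreign key (city_id)      references dim_city (city_id)
);

-- fact_sales stays the same; queries now just join out one extra level.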

Why the Names?

If you think of a “Star Schema”, picture a main table with, say, 5 extra dimension tables around it like the 5 points of a star.  Makes sense, right?

Now, for a snowflake, picture each point being 5 tables by itself… so each point is its own star.  This starts to branch out like a snowflake.  Just think of fractals if you don’t believe me :).