Hive 3.0 Standalone Metastore – Why?
Hive version 3.0 allows you to download a standalone metastore. This is cool because it does not require you to deploy Hadoop and/or run the rest of Hive’s fairly large deployment. This makes a lot of sense because many tools that use Hive for schema management do not actually care about Hive’s query engine.
For example, Presto is a clustered query engine in its own right; it has no interest in using Hadoop/MapReduce to execute a query on Hive data; it just wants to view and manage Hive’s metadata through its Thrift metastore interface. Similarly, Apache Spark loves to work with Hive, but it actually goes directly to the underlying data for performance reasons and processes it with its own engine. So, it also does not need Hive’s query engine.
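To make the "it just wants the Thrift interface" point concrete, here is a minimal sketch of what pointing Presto at a metastore looks like. It renders the contents of a Presto catalog file (etc/catalog/hive.properties) using the connector.name and hive.metastore.uri properties. The host name is made up for illustration, and 9083 is assumed as the port because it is the metastore's usual default; adjust both for your setup.

```python
def presto_hive_catalog(metastore_host: str, metastore_port: int = 9083) -> str:
    """Render the contents of a Presto etc/catalog/hive.properties file.

    Only the metastore's Thrift endpoint is referenced here; nothing in
    this file points at a Hive server or a Hive query engine.
    """
    lines = [
        # Connector name used by Presto's Hive connector for Hadoop 2.x.
        "connector.name=hive-hadoop2",
        # Thrift URI of the (standalone) metastore service.
        f"hive.metastore.uri=thrift://{metastore_host}:{metastore_port}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical host name, for illustration only.
print(presto_hive_catalog("metastore.example.internal"))
```

Notice that the metastore URI is the only Hive-related address in the file, which is why a standalone metastore is appealing here.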
Can/Should We Use It?
Unfortunately, Presto currently only supports Hive 2.X. From its own documentation: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”
If you read online though, you will find that it does seem to work… but with limited features. If you look at this thread on the presto-users group for example: https://groups.google.com/forum/#!topic/presto-users/iAeEecsnS9I, you will see:
“We have tested Presto 0.203e with Hive 3.0 Metastore, and it works fine. We tested it by running TPC-DS queries, and Presto completed all 99 queries.”
But lower down, you will see:
“However, Presto is not able to read Hive managed (transactional tables) in Hive 3.x…”
“Yes, this is a known limitation.”
Unfortunately, transactional ACID v2 tables are the default for managed tables in Hive 3.x. So, basically, managed tables in Hive 3.x cannot be read from Presto, even though external tables will work. So, it might be okay to use it if you only do external tables… but in our case we let people use Spark however they like and they likely create many managed tables. So, this rules out using Hive 3.0 with the standalone metastore for us.
Next, I’m going to see if Hive 2.0 can be run without the Hive server and Hadoop.
Side Note – SchemaTool
I would just like to make a side note that while I did manage to run the Hive Standalone Metastore without installing Hadoop, I did have to install (but not run) Hadoop in order to use the schematool provided with Hive for creating the Hive RDBMS schema. This is due to library dependencies.
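As a rough sketch of that step, the snippet below builds the schematool invocation and the environment it needs. The -dbType and -initSchema flags are schematool's own; the install paths are hypothetical, and exposing the unpacked Hadoop through HADOOP_HOME is my assumption about how the script finds the Hadoop libraries, so check it against your distribution.

```python
import os
from typing import Dict, List, Tuple


def schematool_command(
    metastore_home: str,
    hadoop_home: str,
    db_type: str = "mysql",
) -> Tuple[List[str], Dict[str, str]]:
    """Build the schematool command line and environment (does not run it).

    hadoop_home only needs to point at an unpacked Hadoop install; no
    Hadoop services have to be running for the schema to be created.
    """
    cmd = [
        os.path.join(metastore_home, "bin", "schematool"),
        "-dbType", db_type,   # e.g. mysql, postgres, derby
        "-initSchema",        # create the metastore schema in the RDBMS
    ]
    # Assumption: the script locates Hadoop's jars through HADOOP_HOME.
    env = dict(os.environ, HADOOP_HOME=hadoop_home)
    return cmd, env


# Hypothetical install locations, for illustration only.
cmd, env = schematool_command("/opt/hive-metastore", "/opt/hadoop")
print(" ".join(cmd))
```

To actually run it, you could pass the result to subprocess.run(cmd, env=env, check=True) once the database connection settings are in the metastore's configuration.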
There is also a “create on first run” config option you can use instead, but the Hive documentation does not recommend using it in production, so just keep that in mind.
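For reference, here is a small sketch that renders that kind of "create on first run" configuration as a Hadoop-style site XML fragment. To the best of my knowledge the relevant properties are datanucleus.schema.autoCreateAll together with hive.metastore.schema.verification set to false, but treat the exact pairing (and which site file your metastore reads it from) as an assumption to verify against the Hive docs for your version.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Assumed property pairing for schema auto-creation; fine for experiments,
# not recommended for production.
AUTO_CREATE_SETTINGS = {
    "datanucleus.schema.autoCreateAll": "true",
    "hive.metastore.schema.verification": "false",
}


def render_site_xml(settings: Dict[str, str]) -> str:
    """Render settings as a <configuration> block of <property> entries."""
    root = ET.Element("configuration")
    for name, value in settings.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print(render_site_xml(AUTO_CREATE_SETTINGS))
```

The schematool route is the safer default; this is mainly handy for throwaway test environments.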