Hive – Run With Local Map Reduce

Use Case

I was working to deploy hive for a new system.  I’ve used hive a fair bit but had never deployed it myself.  So, I went through some online instructions and ended up installing hadoop, configuring it, starting YARN (which I’ve also used in the past), and then installing hive to run against it.

I was intending to run against AWS S3 rather than HDFS, so I realized I didn’t need the distributed file system.  Then I thought harder and realized that it would be nice to run without YARN as well.  A colleague pointed out that in his deployment, hive did the work locally and he ran nothing aside from hiveserver2 and the hive metastore.  He was running multiple instances of hiveserver2 and the metastore; they wouldn’t cooperate on any map-reduce tasks, but as he didn’t really want people using that execution engine anyway, that was just fine (and it is for me too!).

I didn’t realize this was an option… so I googled around and found this link: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode (scroll to the heading Hive, Map-Reduce and Local-Mode).

Local Map Reduce

It turns out that the property mapred.job.tracker controls whether hive executes map reduce locally or not.  It supposedly defaults to “local”, meaning that if you don’t override it, hive will actually execute in local mode by default.

Given the varying state of hive documentation maintenance, this is a little hard to rely on though.  This page does say it:

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration

But if you look in your hive-default.xml.template file, you will not find this value.  Also, if you run the JDBC command “set” against hive, it will list all its configuration values, and you won’t see it there either!

I pulled down the hive source code though, and in the POM file, you can find an XML block which clearly sets a system property mapred.job.tracker to “local” (I’m kind of surprised it was in the POM).

So, the system property defaults to “local”.  I couldn’t find any further references to it beyond that point, but it is apparently used when hive interacts with hadoop; so I suppose hadoop picks up the property and uses it in a way that isn’t obvious from the hive side.

So… you’ll run locally by default as long as you don’t add extra configuration to avoid it (which I did initially when following some other tutorial).
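
If you want to be explicit rather than relying on the default (or to undo an override picked up from a tutorial), you can set the property yourself.  This is just a minimal sketch, in the same hive-site.xml style used later in this post; the property name comes from the wiki page linked above:

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>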

HiveServer2 Not Starting JDBC Interface on Port 10000, No Errors

Very short post – my HiveServer2 process was running without errors after deployment, but it wasn’t really serving anything.

Connecting via JDBC yielded errors saying the connection was refused.  Checking on the server showed that port 10000 was not open:

sudo netstat -nlp | grep 10000

I enabled debug logs with the extra command line parameter:

--hiveconf hive.root.logger=DEBUG,console

And it still didn’t show much, except something about creating the scratch directories (but not an error).

After a while, I figured out that the scratch directories were configured to be created in a new directory at the root of the file system which didn’t exist yet, and the user running hive did not have permission to create it.

So, I created the scratch directory and gave ownership to the hive user, and then everything came up and worked great on the next hiveserver2 service restart.
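
For reference, the fix looked roughly like this.  The path and user here are placeholders; use whatever hive.exec.scratchdir points to in your hive-site.xml and whatever account runs hiveserver2:

sudo mkdir -p /hive-scratch
sudo chown hive:hive /hive-scratch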

Hive Server 2 – Required field ‘serverProtocolVersion’ is unset!

Issue Context and Error

I have been working to install hive server 2 in order to work with Presto, among other things.  I wanted to ensure I had Hive’s JDBC interface (on port 10000) working well, as I need it to let users easily submit partition repair queries (msck repair table) and similar things.  Unfortunately, when I went to connect over JDBC, I got this error (a small part of a huge stack trace):

Required field 'serverProtocolVersion' is unset!

The Solution

I think if you carefully read the full stack trace, you’ll see something about user impersonation… but I missed it.  I actually figured it out by increasing the logging level when running hive server.  You can do that like this:

./hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=DEBUG,console

Once I did this, I clearly saw this error:

2019-06-06T13:53:13,183  WARN [HiveServer2-Handler-Pool: Thread-36] thrift.ThriftCLIService: Error opening session:
org.apache.hive.service.cli.HiveSQLException: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: centos is not allowed to impersonate centos

Googling this quickly led me to this Stack Overflow answer: https://stackoverflow.com/a/50753233/857994.  The proposed solution there is to add this entry to your hive-site.xml:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value> 
</property>

After that, everything works great :).
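
As a quick sanity check after restarting hiveserver2, you can hit the JDBC interface from beeline (assuming beeline is on your path; substitute your own host and user for the ones shown here):

beeline -u jdbc:hive2://localhost:10000 -n centos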

Hive 3 Standalone Metastore + Presto

Hive 3.0 Standalone Metastore – Why?

Hive version 3.0 allows you to download a standalone metastore.  This is cool because it does not require you to deploy hadoop and/or run the rest of Hive’s fairly large deployment.  This makes a lot of sense because many tools that use hive for schema management do not actually care about Hive’s query engine.

For example, Presto is a clustered query engine in its own right; it has no interest in using hadoop/map-reduce to execute a query on hive data; it just wants to view and manage hive’s metadata through its thrift metastore interface.  Similarly, Apache Spark loves to work with hive, but it actually goes directly to the underlying database for performance reasons and works against that.  So, it also does not need hive’s query engine.

Can/Should We Use It?

Unfortunately, Presto currently only supports Hive 2.x.  From its own documentation: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

If you read around online though, you will find that it does seem to work… but with limited features.  If you look at this Google Groups thread for example: https://groups.google.com/forum/#!topic/presto-users/iAeEecsnS9I, you will see:

“We have tested Presto 0.203e with Hive 3.0 Metastore, and it works fine. We tested it by running TPC-DS queries, and Presto completed all 99 queries.”

But lower down, you will see:

“However, Presto is not able to read Hive managed (transactional tables) in Hive 3.x…”

“Yes, this is a known limitation.”

Unfortunately, transactional ACID v2 tables are the default for managed tables in Hive 3.x.  So, basically no managed tables will work, even though external tables will.  It might be okay to use it if you only create external tables… but in our case we let people use Spark however they like, and they likely create many managed tables.  That rules out using Hive 3.0 with the standalone metastore for us.
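
For illustration, this is roughly what sticking to external tables looks like; the table name, columns, and S3 location here are made up, but a table created this way stays readable from Presto, whereas a plain CREATE TABLE in Hive 3 becomes a managed ACID table by default:

CREATE EXTERNAL TABLE events (
  id BIGINT,
  payload STRING
)
STORED AS PARQUET
LOCATION 's3a://my-bucket/events/';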

I’m going to see if Hive 2.0 can be run without the hive server and hadoop next.

Side Note – SchemaTool

I would just like to make a side note that while I did manage to run the Hive standalone metastore without installing hadoop, I did have to install (but not run) hadoop in order to use the schematool provided with hive for creating the hive RDBMS schema.  This is due to library dependencies.
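
For reference, the schematool invocation is along these lines (this assumes a MySQL-backed metastore; adjust -dbType for your database):

$HIVE_HOME/bin/schematool -dbType mysql -initSchema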

There is also a “create the schema on first use” config you can enable instead of this, but it isn’t recommended for production; so just keep that in mind.
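
If you do want to try that route, the setting is (as far as I can tell) the datanucleus auto-create option; verify the exact property name against your Hive version:

<property>
  <name>datanucleus.schema.autoCreateAll</name>
  <value>true</value>
</property>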

Connecting to Hive from Python

I was using Hive via Presto for a project, but then I ran into an issue: Presto does not support Hive views.  So, to be kind to the user, I wanted to present the view definition so they could see how to query the underlying tables.

Unfortunately, you can’t get view definitions from Presto either!  So, I had to query hive directly from a Python project.

Two Options

There are two options that I found for achieving this, and surprisingly neither one was great.  You would think this would be easy, right!?

  1. Use PyHive – This is the standard connector you would expect to find, except it does not install and/or work on Windows.  So, if you develop on Windows and deploy to Linux, it is painful.  Also, you need some other system packages for it to work, which can be painful to track down.
  2. Use JayDeBeApi – This uses a Java JDBC driver JAR to connect to Hive, which means it needs Java installed on your machine.  DO NOT USE THIS – I quickly ran into a critical bug that happens on both Windows and Linux: if you open one connection, do work, and close it, you cannot open another connection.  There is a GitHub issue for it, and the reporter had to resort to putting the work in another script and calling it as a sub-process for each command, which is ridiculous.

So, as I’m deploying on Linux (even though I develop on Windows), PyHive wins.

More on PyHive

So, to install PyHive, you would do the following (but it probably won’t work yet, at least not on CentOS 7 where I tried it).

pip install pyhive[hive]

Additional Dependencies

In order to get “pyhive[hive]” to install on a server (I tested with CentOS 7), you have to ensure some other dependencies are available as well.

I was working from Python 3.6 in a virtual environment, and the following worked properly:

sudo yum install gcc-c++ python-devel.x86_64 cyrus-sasl-devel.x86_64
pip install pyhive[hive]

Windows Development

Note that if you do the install without the extra [hive], you will not get all the dependencies.  The reason they’re broken out is that PyHive technically supports both Hive and Presto, so you get to pick which dependencies you need.

This is a mixed blessing; you can install the package on Windows and develop without the extra [hive] but if you try to execute the code it will fail.  To run it on Linux you need the full set of dependencies.

I recommend guarding the pyhive import and any related code in your project with if os.name != "nt": (as shown below) in order to ensure you can run through on Windows without getting errors.  Hopefully your project is like mine, where this is a side case and I can test plenty without making the final calls.
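
A minimal sketch of that guard (purely illustrative; the structure is the point, not the names):

import os

if os.name != "nt":
    # PyHive's hive dependencies only install cleanly on Linux, so skip the import on Windows.
    from pyhive import hive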

Query Code

The following is a short example of how to do a query from PyHive assuming you have it all set up properly as we talked about above.

from pyhive import hive
import pandas as pd
import logging

logger = logging.getLogger(__name__)

conn = None
cursor = None

try:
    # Fill in your own `schema`.`table` below; the names were left out in the original post.
    query = "describe extended ``.``"
    conn = hive.Connection(host="host-name", port=10000)

    cursor = conn.cursor()
    cursor.execute(query)
    query_results = cursor.fetchall()

    # Build a DataFrame using the column names reported by the cursor.
    column_names = [part[0] for part in cursor.description]
    df = pd.DataFrame(query_results, columns=column_names)

except Exception:
    logger.exception("Error while pulling view details.")
    raise

finally:
    if cursor is not None:
        cursor.close()
    if conn is not None:
        conn.close()