Hive + Presto + Ranger Version Hell

My Use Case

I was trying to test out Apache Ranger in order to give Presto column-level security over Hive data. Presto itself doesn't seem to support Ranger yet, though some GitHub entries suggest it will soon. Ranger can integrate with Hive, though, so that when Presto queries Hive, the security should still apply (apparently).

Conflicting Versions

I started off by deploying a version of Hive I've worked with before: 2.3.5, the latest 2.x version (I avoided 3.x). After that, I deployed Presto 0.220, also the latest version.

This was all working great, so I moved on to Ranger.  This is when I found out that the Ranger docs specifically say that it only works with Hive version 1.2.0:

Apache Ranger version 0.5.x is compatible with only the component versions mentioned below

HIVE 1.2.0 https://hive.apache.org/downloads.html

That came from this link: https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+0.5.0+Installation.

Alternative Options

I have a fairly stringent need for the security Ranger provides. So, I was willing to use a 1.x version of Hive, depending on what the feature loss was. After all, quite a few big providers seem to use 1.x.

Unfortunately, the next thing I noticed was that Presto says: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

That is coming from its latest documentation: https://prestodb.github.io/docs/current/connector/hive.html.

I’m not particularly excited to start digging through old versions of Presto as well.

Next Steps

I’m going to try to stick with Hive 2.x for now and a modern version of Presto.  So, my options are:

  1. Research Ranger more and see if it can actually work with Hive 2.x. Various vendors seem to use Ranger and Hive/Presto together, so I'm curious to see how. Maybe the documentation on Ranger is just out of date (I know, I'm being hopeful).
  2. Look at Ranger alternatives like Apache Sentry and see if they support Hive 2.x. Apparently Ranger is beating out Sentry in features, usage, and future support… so I'm not excited about using Sentry. But if it works, I can always migrate back to Ranger once its support grows for either Hive or Presto.

Update

I started digging through JIRA and the mailing lists and found that work appears to have been done on Ranger as early as 2017 to support Hive 2.3.2. Here's the link: https://issues.apache.org/jira/browse/RANGER-1927.

So, I'm going to give installing Ranger a shot against Hive 2.3.5 and see if it works. If not, I'll try with 2.3.2 and/or seek community help. Hopefully I'll come back and update this afterward with some good news :).

Hive – Run With Local Map Reduce

Use Case

I was working to deploy Hive for a new system. I've used Hive a fair bit but have never personally deployed it myself. So, I went through some online instructions and ended up installing Hadoop, configuring it, starting YARN (which I've also used in the past), and then installing Hive to run against it.

I was intending to run against AWS S3 rather than HDFS, so I realized I didn't need the DFS. Then I thought harder and realized that it would be nice to run without YARN as well. A colleague pointed out that in his deployment, Hive did the work locally and he ran nothing aside from hiveserver2 and the Hive metastore. He was running multiple instances of hiveserver2 and the metastore, and they wouldn't work together for any map-reduce tasks, but since he didn't really want people using that execution engine anyway, that was just fine (and it is for me too!).

I didn’t realize this was an option… so I googled around and found this link: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode (scroll to the heading Hive, Map-Reduce and Local-Mode).

Local Map Reduce

It turns out that the property mapred.job.tracker controls whether or not Hive executes map-reduce locally. It supposedly defaults to "local", meaning that if you don't override it, Hive will actually execute in local mode by default.

Given the varying degrees of Hive documentation maintenance, this is a little hard to rely on, though. This page says it:

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration

But if you look in your hive-default.xml.template file, you will not find this value. Also, if you run the JDBC command "set" against Hive, it will list all of its configuration values, and you won't see it there either!

I pulled down the Hive source code, though, and in the POM file you can find an XML block which clearly sets a system property mapred.job.tracker to "local" (I'm kind of surprised it was in the POM).

So, the system property defaults to "local". You can't find any references to it past this point, but it is apparently used when Hive interacts with Hadoop; so I suppose Hadoop picks up the property and uses it in a way that isn't obvious here.

So… you’ll run locally by default as long as you don’t add extra configuration to avoid it (which I did initially when following some other tutorial).
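If you'd rather not rely on an undocumented default, I believe you can pin the property explicitly yourself. This is only a rough sketch, assuming hive-site.xml (or mapred-site.xml) is where you keep this kind of Hadoop-level setting; I haven't verified it across versions:

<!-- Hypothetical snippet: explicitly force local map-reduce execution
     instead of relying on the "local" default described above. -->
<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>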

HiveServer2 Not Starting JDBC Interface on Port 10000, No Errors

Very short post – my HiveServer2 process was running without errors after deployment, but it wasn't really functional.

Connecting via JDBC yielded errors saying the connection was being refused. Checking the server showed that the port was not open, using:

sudo netstat -nlp | grep 10000

I enabled debug logs with the extra command line parameter:

--hiveconf hive.root.logger=DEBUG,console
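In full, the launch command looked something like this:

./hive --service hiveserver2 --hiveconf hive.root.logger=DEBUG,console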

And it still didn’t show much, except something about creating the scratch directories (but not an error).

After a while, I figured out that the scratch directories were set to be created at the root of the file system in a new directory which didn't exist yet, and the user running Hive did not have permission to create it.

So, I created the scratch directory and gave ownership to the hive user, and then everything came up and worked great on the next hiveserver2 service restart.
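As a rough sketch, the fix amounted to something like the following (the directory path and the hive user here are placeholders; use whatever your hive.exec.scratchdir setting actually points at):

# Hypothetical example: create the configured scratch directory and hand it
# to the user that runs HiveServer2 (path and user are placeholders).
sudo mkdir -p /hive-scratch
sudo chown hive:hive /hive-scratch
# Then restart the hiveserver2 service however you normally run it.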

S3 Eventual Consistency

Consistent Distributed File Systems

Historically, I’ve used standard HDFS, MapR’s version of HDFS (MapR-FS), and ADLS (Azure’s data lake service).  All of these behave very much like you would expect a local file system to.  If you write files and another process lists files, it will immediately see them and be able to use them without issue.

Amazon S3 File System Issues

I was surprised when I started learning about Amazon S3 after using all of these prior file systems. I understand that S3 is an object store… similar to Azure Blob Storage. I also understand that it is AWS's main data lake solution, though.

Maybe it’s just because I’m new and am missing something… but there doesn’t seem to be any AWS version of ADLS.

The S3 storage service is eventually consistent. This means that if you run Spark or similar tools against it, they will likely produce incorrect results or fail. This is because multiple tasks write files in parallel and then list them, and they won't necessarily get a fully up-to-date view of the storage. So, they may write 10 files, list them, and only see 5 of them, etc.

I came across a very good article describing this in detail here: https://www.opendoor.com/w/blog/why-s3guard-with-s3-as-a-filesystem-spark.

The TLDR is that you have to use a consistency layer between your big data frameworks and S3 to ensure they function well. You can confirm this by reading the short Hadoop documentation page here -> https://wiki.apache.org/hadoop/AmazonS3.

Note that the first article recommends S3Guard, which is backed by DynamoDB, but there may be other options (e.g. EMR will have a way of dealing with this).
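For illustration, enabling S3Guard mostly comes down to pointing the S3A connector at a DynamoDB metadata store in core-site.xml. This is only a rough sketch (the table name and region are placeholders, and the exact properties can vary by Hadoop version):

<!-- Rough S3Guard sketch for core-site.xml; table and region values are placeholders. -->
<property>
  <name>fs.s3a.metadatastore.impl</name>
  <value>org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.table</name>
  <value>my-s3guard-table</value>
</property>
<property>
  <name>fs.s3a.s3guard.ddb.region</name>
  <value>us-east-1</value>
</property>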

Determine Compatibility of hadoop-aws and aws-java-sdk-bundle JARs

When you're integrating Hadoop and other big-data frameworks with AWS S3, you will quickly run into a situation where you need to include the hadoop-aws and aws-java-sdk-bundle JARs in your classpath.

Unfortunately, these JARs are versioned separately and it is hard to figure out which combinations are compatible. The hadoop-aws JAR has to match your Hadoop version exactly, so that one is easy.

Determining the Right Version

  1. Check your Hadoop version.
  2. Get the hadoop-aws JAR with the same exact version.
  3. Go to the Maven repository page for that version of the hadoop-aws JAR and look at its compile dependencies to find the matching SDK version (see the sketch below). E.g. at https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.9 you can see the SDK dependency is com.amazonaws » aws-java-sdk-bundle 1.11.199.
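As a concrete sketch, the whole process from the shell might look like the following. The 2.9.2 and 1.11.199 version numbers are just an example pairing; substitute whatever your own Hadoop version and the repository page actually show:

# 1. Check the local Hadoop version (say it reports 2.9.2).
hadoop version

# 2. Pull the hadoop-aws JAR with the exact same version from Maven Central.
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.9.2/hadoop-aws-2.9.2.jar

# 3. Pull the aws-java-sdk-bundle version listed as that JAR's compile dependency
#    (1.11.199 in this example; yours may differ).
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.199/aws-java-sdk-bundle-1.11.199.jar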

Databricks / Spark – Generate Parquet Sample Data

I frequently find myself needing to generate Parquet data for sample tests… e.g. when setting up a new Hive instance, or testing Apache Drill, Presto, etc. I always end up writing basically the same code in a slightly different way because I never save it. So here it is!

This code makes a 5,000-row data frame with 3 columns: 2 integers and one string. It then gives the columns sensible names and saves the result as a Parquet data set split into 5 part files due to the coalesce(5).

I have it set up to write to ADLS in Azure, but you can change the path so it works with your HDFS or whatever.

NOTE: It will overwrite the previous results in the destination, so (1) don’t write over data you want to keep, and (2) if you need to tune this, just keep re-running it :).

//Create a mutable list buffer of (Int, Int, String) tuples based on a loop.
import scala.collection.mutable.ListBuffer
val lb = ListBuffer[(Int, Int, String)]()
for (i <- 1 to 5000) {
  lb += ((i, i*i, "Number is " + i + "."))
}

//Convert it to a data frame with named columns.
import org.apache.spark.sql.SaveMode
import spark.implicits._
val df = lb.toDF("value", "square", "description")

//Write it as Parquet, coalesced to 5 part files, overwriting any previous output.
df.coalesce(5).write.mode(SaveMode.Overwrite).parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
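To sanity-check the output, I usually just read it straight back; a quick sketch using the same placeholder ADLS path as above:

//Read the generated Parquet data back and spot-check it.
val check = spark.read.parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
check.printSchema()
check.show(5)
println(s"Row count: ${check.count()}")   //Should print 5000.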

Hive Server 2 – Required field ‘serverProtocolVersion’ is unset!

Issue Context and Error

I have been working to install HiveServer2 in order to work with Presto, among other things. I wanted to ensure I had Hive's JDBC interface (on port 10000) working well, as I need it to let users easily submit partition repair queries (msck repair table) and similar things. Unfortunately, when I went to connect over JDBC, I got this error (a small part of a huge stack trace):

Required field 'serverProtocolVersion' is unset!

The Solution

I think if you carefully read the full stack trace, you'll see something about user impersonation… but I missed it. I actually figured it out by increasing the logging level when running HiveServer2. You can do that like this:

./hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=DEBUG,console

Once I did this, I clearly saw this error:

2019-06-06T13:53:13,183  WARN [HiveServer2-Handler-Pool: Thread-36] thrift.ThriftCLIService: Error opening session:
org.apache.hive.service.cli.HiveSQLException: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: centos is not allowed to impersonate centos

Googling this quickly led me to this Stack Overflow answer: https://stackoverflow.com/a/50753233/857994. The proposed solution there is to add this entry to your hive-site.xml:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value> 
</property>

After that, everything works great :).