Building Apache Ranger

I was not particularly thrilled to see that I had to build Ranger myself to get the various binaries needed for it.

Anyway, the first thing I did was download a “release”.  There is surprisingly little information on what a “release” is or how to use it.  But, given that all of the installation documentation asks for artifacts named like “ranger-%version-number%-admin.tar.gz” and I didn’t see any .gz files, I assumed it was more like a bundled source code release that had to be built.

Note: I’m referring to the documentation here: https://ranger.apache.org/quick_start_guide.html and the release I used is here: http://mirrors.sonic.net/apache/ranger/1.2.0/apache-ranger-1.2.0.tar.gz.

Docker Build Script

My initial thought was to do the build using the convenient-sounding “build_ranger_using_docker.sh” script which is in the root directory.  So, I installed docker quickly, did a docker login, and ran it.  It failed! (on CentOS 7, for the record).

The script tried to download a version of Maven which no longer exists on the Maven download site.  If you switch to a slightly newer version, the script breaks anyway because the layout of the Maven release artifacts changed a little too.  So, I ended up on 3.3.9, which required changes to multiple lines.
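If you hit the same problem, the change is basically to point the Maven download at the Apache archive (which keeps every release) and pin a version the rest of the script can cope with.  Something along these lines inside the Dockerfile heredoc; treat it as a sketch, since the exact lines, paths, and variable names in your copy of the script may differ:

# Illustrative only: the /tmp download path and /opt install location are placeholders.
RUN wget https://archive.apache.org/dist/maven/maven-3/3.3.9/binaries/apache-maven-3.3.9-bin.tar.gz -O /tmp/apache-maven-3.3.9-bin.tar.gz
RUN tar xzf /tmp/apache-maven-3.3.9-bin.tar.gz -C /opt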

After that, it went through to the end and failed on the last step with “gosu: not found”.  There had been some scary red text higher up about “no ultimately trusted keys found” related to installing gosu.

I tried various ways of fixing this and they all failed (on CentOS 7.x)… but to be honest, I didn’t invest much time in reading up on gosu or why the various proposed solutions were failing.

Build with Maven

Giving up on Docker, I tried building directly with Maven.  That failed about halfway through on both CentOS and my Windows 10 box with similar Python errors, despite one machine being on Python 2 and the other on Python 3.  So, building straight from source wasn’t great either.
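For reference, the standard Ranger build invocation (and the one the docker script defaults to) is:

# Plain Maven build from the source root; this is what kept dying with Python errors for me.
mvn -DskipTests=true clean compile package install assembly:assembly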

Success

I decided to go back to the docker build.  This time, I removed some of the Maven validations and used a newer version of Maven (which I’m fairly confident doesn’t matter much).  I also removed the gosu install and usage from the final build commands.

This finally worked.  Note that my copy is hacky and doesn’t bother using the “builder” account to do the build.  But it worked at least and built the artifacts, so I’m happy enough for my own purposes.  If it were a long-running web app or something, I’d go work out the bugs in the docker container/gosu/etc., but that’s not required for a build utility.

After this, you see a nice listing of tar.gz files in the ./target folder like so:

antrun ranger-1.2.0-kafka-plugin.zip ranger-1.2.0-sqoop-plugin.zip
archive-tmp ranger-1.2.0-kms.tar.gz ranger-1.2.0-src.tar.gz
maven-shared-archive-resources ranger-1.2.0-kms.zip ranger-1.2.0-src.zip
ranger-1.2.0-admin.tar.gz ranger-1.2.0-knox-plugin.tar.gz ranger-1.2.0-storm-plugin.tar.gz
ranger-1.2.0-admin.zip ranger-1.2.0-knox-plugin.zip ranger-1.2.0-storm-plugin.zip
ranger-1.2.0-atlas-plugin.tar.gz ranger-1.2.0-kylin-plugin.tar.gz ranger-1.2.0-tagsync.tar.gz
ranger-1.2.0-atlas-plugin.zip ranger-1.2.0-kylin-plugin.zip ranger-1.2.0-tagsync.zip
ranger-1.2.0-hbase-plugin.tar.gz ranger-1.2.0-migration-util.tar.gz ranger-1.2.0-usersync.tar.gz
ranger-1.2.0-hbase-plugin.zip ranger-1.2.0-migration-util.zip ranger-1.2.0-usersync.zip
ranger-1.2.0-hdfs-plugin.tar.gz ranger-1.2.0-ranger-tools.tar.gz ranger-1.2.0-yarn-plugin.tar.gz
ranger-1.2.0-hdfs-plugin.zip ranger-1.2.0-ranger-tools.zip ranger-1.2.0-yarn-plugin.zip
ranger-1.2.0-hive-plugin.tar.gz ranger-1.2.0-solr-plugin.tar.gz rat.txt
ranger-1.2.0-hive-plugin.zip ranger-1.2.0-solr-plugin.zip version
ranger-1.2.0-kafka-plugin.tar.gz ranger-1.2.0-sqoop-plugin.tar.gz

Here is my final build script (the Dockerfile is embedded in it as a heredoc).  Note that you should read up on gosu etc. before going this route, and I take no responsibility for any security issues; you should use the official script if you can :).

#!/bin/bash

default_command="mvn -DskipTests=true clean compile package install assembly:assembly"
build_image=0
if [ "$1" = "-build_image" ]; then
    build_image=1
    shift
fi

params=$*
if [ $# -eq 0 ]; then
    params=$default_command
fi

image_name="ranger_dev"
remote_home=
container_name="--name ranger_build"

if [ ! -d security-admin ]; then
    echo "ERROR: Run the script from root folder of source. e.g. $HOME/git/ranger"
    exit 1
fi

images=`docker images | cut -f 1 -d " "`
[[ $images =~ $image_name ]] && found_image=1 || build_image=1

if [ $build_image -eq 1 ]; then
    echo "Creating image $image_name ..."
    docker rmi -f $image_name

    # The Dockerfile is fed to docker build as a heredoc.  Adjust the base image and
    # package list to taste; the Ranger build needs a JDK, maven, python, and git.
    docker build -t $image_name - <<Dockerfile
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y openjdk-8-jdk maven python git wget

RUN mkdir -p /scripts /tools

# Hacky entrypoint: everything runs as root (no "builder" account, no gosu).  It just
# links the mounted /.m2 into the home directory and then execs whatever was passed in.
RUN echo '#!/bin/bash' > /scripts/mvn.sh
RUN echo 'set -x; if [ "\$1" = "mvn" ]; then bash -c '"'"'ln -sf /.m2 \$HOME'"'"'; fi; exec "\$@"' >> /scripts/mvn.sh

RUN chmod -R 777 /scripts
RUN chmod -R 777 /tools

ENTRYPOINT ["/scripts/mvn.sh"]
Dockerfile

fi

src_folder=`pwd`

LOCAL_M2="$HOME/.m2"
mkdir -p $LOCAL_M2
set -x
docker run --rm -v "${src_folder}:/ranger" -w "/ranger" -v "${LOCAL_M2}:${remote_home}/.m2" $container_name $image_name $params
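To use it, drop the script into the root folder of the extracted source and run it from there (it checks for the security-admin folder to make sure you did).  For example:

cd ~/apache-ranger-1.2.0                      # or wherever you extracted the release
chmod +x build_ranger_using_docker.sh
./build_ranger_using_docker.sh -build_image   # -build_image forces the ranger_dev image to be (re)built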

 

Install Docker CE on Linux CentOS 7.x

This is just a short post paraphrasing the very good (and verbose!) instructions on the Docker site here: https://docs.docker.com/install/linux/docker-ce/centos/.

Basically, to install Docker CE on a fresh CentOS 7.x server, you have to:

  • Install yum-utils (which provides the YUM config manager).
  • Install device-mapper-persistent-data and lvm2 (needed by the devicemapper storage driver).
  • Use the YUM config manager to add the stable Docker YUM repository.
  • Install Docker CE.
  • Start Docker.
  • Test that it worked.

This script does all of that and basically just saves you from skimming through the linked page repeatedly to find the few commands you need.

# Prerequisites: yum-config-manager plus the devicemapper storage driver packages.
sudo yum install -y yum-utils \
  device-mapper-persistent-data \
  lvm2
# Add the stable Docker CE repository.
sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo
# Install and start Docker CE.
sudo yum install -y docker-ce docker-ce-cli containerd.io
sudo systemctl start docker
# Verify the install with the hello-world image.
sudo docker run hello-world

Assuming it works, you should see “Hello from Docker!” among various other output on your screen.
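If you’re going to use Docker regularly on the box, the same Docker docs also cover a couple of optional post-install steps, roughly:

sudo systemctl enable docker        # start Docker automatically at boot
sudo usermod -aG docker $USER       # allow running docker without sudo (log out and back in afterward)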

Hive + Presto + Ranger Version Hell

My Use Case

I was trying to test out Apache Ranger in order to give Presto column-level security over Hive data.  Presto itself doesn’t seem to support Ranger yet, though some GitHub entries suggest it will soon.  Ranger can integrate with Hive though, so that when Presto queries Hive, the security can (apparently) still be enforced.

Conflicting Versions

I started off by deploying a version of Hive I’ve worked with before: 2.3.5, the latest 2.x version (I avoided 3.x).  After that, I deployed Presto 0.220, also the latest version.

This was all working great, so I moved on to Ranger.  This is when I found out that the Ranger docs specifically say that it only works with Hive version 1.2.0:

Apache Ranger version 0.5.x is compatible with only the component versions mentioned below

HIVE 1.2.0 https://hive.apache.org/downloads.html

That came from this link: https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+0.5.0+Installation.

Alternative Options

I have a fairly stringent need for the security Ranger provides.  So, I was willing to use a 1.x version of hive, depending on what the feature loss was.  After all, quite a few big providers seem to use 1.x.

Unfortunately, the next thing I noticed was that Presto says: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

That is coming from its latest documentation: https://prestodb.github.io/docs/current/connector/hive.html.

I’m not particularly excited to start digging through old versions of Presto as well.

Next Steps

I’m going to try to stick with Hive 2.x for now and a modern version of Presto.  So, my options are:

  1. Research Ranger more and see if it can actually work with Hive 2.x.  Various vendors seem to use Ranger and Hive/Presto together; so I’m curious to see how.  Maybe the documentation on Ranger is just out of date (I know, being hopeful).
  2. Look at Ranger alternatives like Apache Sentry and see if they support Hive 2.x.  Apparently Ranger is beating out Sentry in features, usage, and future support… so I’m not excited about using Sentry.  But if it works, I can always migrate back to Ranger once its support grows for either Hive or Presto.

Update

I started digging through JIRA and the mailing lists and found that Ranger appears to have had work done on it as early as 2017 to support Hive 2.3.2.  Here’s the link: https://issues.apache.org/jira/browse/RANGER-1927.

So, I’m going to give installing ranger a shot on 2.3.5 and see if it works.  If not, I’ll try with 2.3.2 and/or seek community help.  Hopefully I’ll come back and update this afterward with some good news :).

Hive – Run With Local Map Reduce

Use Case

I was working to deploy hive for a new system.  I’ve used hive a fair bit but have not personally deployed it myself.  So, I went through some online instructions and ended up installing hadoop, configuring it, starting YARN (which I’ve also used in the past), and then installing hive to run against it.

I was intending to run against AWS S3 and not HDFS, so I realized I didn’t need the DFS.  Then I thought harder and realized that it would be nice to run without YARN as well.  A colleague pointed out that in his deployment, Hive did the work locally and he ran nothing aside from HiveServer2 and the Hive metastore.  He was running multiple instances of HiveServer2 and the metastore, and they wouldn’t work together for any map-reduce tasks, but as he didn’t really want people using that execution engine anyway, that was just fine (and it is for me too!).

I didn’t realize this was an option… so I googled around and found this link: https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-Hive,Map-ReduceandLocal-Mode (scroll to the heading Hive, Map-Reduce and Local-Mode).

Local Map Reduce

It turns out that the property mapred.job.tracker controls whether Hive executes map reduce locally or not.  It supposedly defaults to “local”, meaning that if you don’t override it, Hive will actually execute in local mode by default.

Given how unevenly the Hive documentation is maintained, this is a little hard to rely on though.  This page says it:

https://cwiki.apache.org/confluence/display/Hive/AdminManual+Configuration

But if you look in your hive-default.xml.template file, you will not find this value.  Also, if you run the JDBC command “set” against hive, it will list all its configuration values, and you won’t see it there either!

I pulled down the Hive source code though, and in the POM file you can find an XML block which clearly sets the system property mapred.job.tracker to “local” (I’m kind of surprised it was in the POM).

So, the system property is defaulted to “local”.  You won’t find any references to it past that point, but it is apparently used when Hive interacts with Hadoop; so I suppose Hadoop picks up the property and uses it in a way that isn’t obvious here.

So… you’ll run locally by default as long as you don’t add extra configuration that overrides it (which I had done initially when following some other tutorial).
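If you’d rather not rely on a default that’s this hard to track down, you can always make it explicit.  A minimal sketch, assuming you start HiveServer2 from the command line and pass overrides with --hiveconf as usual:

# Explicitly pin local execution rather than relying on the default.
hiveserver2 --hiveconf mapred.job.tracker=local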

HiveServer2 Not Starting JDBC Interface on Port 10000, No Errors

Very short post: my HiveServer2 process was running without errors after deployment, but it wasn’t really working.

Connecting via JDBC yielded errors saying the connection was refused.  Checking on the server showed that port 10000 was not open:

sudo netstat -nlp | grep 10000

I enabled debug logs with the extra command line parameter:

--hiveconf hive.root.logger=DEBUG,console
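So the full startup command looked something like this:

# Run HiveServer2 in the foreground with debug logging sent to the console.
hiveserver2 --hiveconf hive.root.logger=DEBUG,console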

And it still didn’t show much, except something about creating the scratch directories (but not an error).

After a while, I figured out that the scratch directories were configured to be created in a new directory at the root of the file system, which didn’t exist yet, and the user running Hive did not have permission to create it.

So, I created the scratch directory and gave ownership to the hive user, and then everything came up and worked great on the next hiveserver2 service restart.
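For anyone hitting the same thing, the fix boils down to something like the following; the real path comes from hive.exec.scratchdir / hive.exec.local.scratchdir in your configuration, and I’m assuming the service runs as a user named hive:

sudo mkdir -p /path/to/scratchdir             # substitute the directory from your config
sudo chown -R hive:hive /path/to/scratchdir   # give the hive user ownership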