Building Apache Ranger

I was not particularly thrilled to see that I had to build Ranger myself to get the various binaries needed for it.

Anyway, the first thing I did was download a “release”.  There is surprisingly little information on what a “release” is or how to use it.  But, given that all of the installation documentation asks for artifacts named like “ranger-%version-number%-admin.tar.gz” and I didn’t see any .tar.gz files, I assumed it was really a bundled source-code release that had to be built.

Note: I’m referring to the documentation here: https://ranger.apache.org/quick_start_guide.html, and the release I used is here: http://mirrors.sonic.net/apache/ranger/1.2.0/apache-ranger-1.2.0.tar.gz.

Docker Build Script

My initial thought was to do the build using the convenient-sounding “build_ranger_using_docker.sh” script in the root directory.  So, I quickly installed Docker, did a docker login, and ran it.  It failed!  (On CentOS 7, for the record.)

It tried to download a version of Maven which no longer exists on the Maven download site.  If you switch to a slightly newer version, the script breaks anyway because the newer release artifacts are laid out a little differently.  So, I reverted to Maven 3.3.9, which required changes to multiple lines.
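
For reference, this is the kind of change I mean; a sketch of pointing the script’s Maven download at a version that is still published on the Apache archive (the variable name and exact wget line here are illustrative, not necessarily the script’s own):

# Illustrative only: fetch Maven 3.3.9 from the Apache archive rather than a
# mirror path that no longer exists.
MAVEN_VERSION=3.3.9
wget https://archive.apache.org/dist/maven/maven-3/${MAVEN_VERSION}/binaries/apache-maven-${MAVEN_VERSION}-bin.tar.gz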

After that, it ran through to the end and failed on the last step with “gosu: not found”.  There had been some scary red text higher up about “no ultimately trusted keys found” related to installing gosu.

I tried various ways of fixing this and they all failed (on CentOS 7.x)… but to be honest, I didn’t invest much of my own time in reading up on gosu or why the various proposed solutions were failing.

Build with Maven

Giving up on that, I tried building directly with Maven; this failed about halfway through on both CentOS and my Windows 10 box with similar Python errors, despite one machine being on Python 2 and the other on Python 3.  So, building straight from source wasn’t great either.
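
For reference, the straight-from-source build I was attempting is the same set of goals the Docker script runs by default (see default_command in the script further down):

# Run from the root of the extracted apache-ranger-1.2.0 source tree.
mvn -DskipTests=true clean compile package install assembly:assembly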

Success

I decided to go back to the Docker build.  This time, I removed some of the Maven validations and used a newer version of Maven (which I’m fairly confident doesn’t matter much), but I also removed the gosu install and its usage from the final build commands.

This finally worked.  Note that my copy is hacky and doesn’t bother using the “builder” account to do the build, but it worked and built the artifacts, so I’m happy enough for my own purposes.  If this were a long-running web app or something, I’d go work out the bugs in the Docker container/gosu/etc., but that isn’t required for a build utility.

After this, you see a nice listing of .tar.gz and .zip artifacts in the ./target folder, like so:

antrun ranger-1.2.0-kafka-plugin.zip ranger-1.2.0-sqoop-plugin.zip
archive-tmp ranger-1.2.0-kms.tar.gz ranger-1.2.0-src.tar.gz
maven-shared-archive-resources ranger-1.2.0-kms.zip ranger-1.2.0-src.zip
ranger-1.2.0-admin.tar.gz ranger-1.2.0-knox-plugin.tar.gz ranger-1.2.0-storm-plugin.tar.gz
ranger-1.2.0-admin.zip ranger-1.2.0-knox-plugin.zip ranger-1.2.0-storm-plugin.zip
ranger-1.2.0-atlas-plugin.tar.gz ranger-1.2.0-kylin-plugin.tar.gz ranger-1.2.0-tagsync.tar.gz
ranger-1.2.0-atlas-plugin.zip ranger-1.2.0-kylin-plugin.zip ranger-1.2.0-tagsync.zip
ranger-1.2.0-hbase-plugin.tar.gz ranger-1.2.0-migration-util.tar.gz ranger-1.2.0-usersync.tar.gz
ranger-1.2.0-hbase-plugin.zip ranger-1.2.0-migration-util.zip ranger-1.2.0-usersync.zip
ranger-1.2.0-hdfs-plugin.tar.gz ranger-1.2.0-ranger-tools.tar.gz ranger-1.2.0-yarn-plugin.tar.gz
ranger-1.2.0-hdfs-plugin.zip ranger-1.2.0-ranger-tools.zip ranger-1.2.0-yarn-plugin.zip
ranger-1.2.0-hive-plugin.tar.gz ranger-1.2.0-solr-plugin.tar.gz rat.txt
ranger-1.2.0-hive-plugin.zip ranger-1.2.0-solr-plugin.zip version
ranger-1.2.0-kafka-plugin.tar.gz ranger-1.2.0-sqoop-plugin.tar.gz
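
From here, installation generally follows the quick start guide linked above: unpack the component you want, edit its install.properties, and run its setup script.  For example, staging the admin artifact might look like this (the destination path is arbitrary):

sudo mkdir -p /opt/ranger
sudo tar xzf target/ranger-1.2.0-admin.tar.gz -C /opt/ranger/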

Here is my final build script (the Dockerfile is embedded in it as a heredoc).  Note that you should read up on gosu etc. before using it, and I take no responsibility for any security issues; you should use the official script if you can. :)

default_command="mvn -DskipTests=true clean compile package install assembly:assembly"
build_image=0
if [ "$1" = "-build_image" ]; then
build_image=1
shift
fi

params=$*
if [ $# -eq 0 ]; then
params=$default_command
fi

image_name="ranger_dev"
remote_home=
container_name="--name ranger_build"

if [ ! -d security-admin ]; then
    echo "ERROR: Run the script from root folder of source. e.g. $HOME/git/ranger"
    exit 1
fi

images=`docker images | cut -f 1 -d " "`
[[ $images =~ $image_name ]] && found_image=1 || build_image=1

if [ $build_image -eq 1 ]; then
    echo "Creating image $image_name ..."
    docker rmi -f $image_name

    docker build -t $image_name - <<Dockerfile
# (base image, JDK/Maven installation, and the creation of /scripts/mvn.sh omitted)
RUN echo 'set -x; if [ "\$1" = "mvn" ]; then usermod -u \$(stat -c "%u" pom.xml) bash -c '"'"'ln -sf /.m2 \$HOME'"'"'; exec "\$@"; fi; exec "\$@" ' >> /scripts/mvn.sh

RUN chmod -R 777 /scripts
RUN chmod -R 777 /tools

ENTRYPOINT ["/scripts/mvn.sh"]
Dockerfile
fi

src_folder=`pwd`

LOCAL_M2="$HOME/.m2"
mkdir -p $LOCAL_M2
set -x
docker run --rm -v "${src_folder}:/ranger" -w "/ranger" -v "${LOCAL_M2}:${remote_home}/.m2" $container_name $image_name $params
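
Usage of a hacked copy like this is the same as the original script: run it from the root of the extracted source tree (the script checks for the security-admin folder to enforce this), for example:

cd apache-ranger-1.2.0
./build_ranger_using_docker.sh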

 

Hive 3 Standalone Metastore + Presto

Hive 3.0 Standalone Metastore – Why?

Hive version 3.0 allows you to download a standalone metastore.  This is cool because it does not require you to deploy Hadoop and/or run the rest of Hive’s fairly large deployment.  This makes a lot of sense because many tools that use Hive for schema management do not actually care about Hive’s query engine.

For example, Presto is a clustered query engine in its own right; it has no interest in using Hadoop/MapReduce to execute a query on Hive data; it just wants to view and manage Hive’s metadata through its thrift metastore interface.  Similarly, Apache Spark loves to work with Hive, but it actually goes directly to the underlying database for performance reasons and works against that.  So, it also does not need Hive’s query engine.
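
For context, here is roughly what that looks like from Presto’s side; a Hive catalog only needs to be pointed at the metastore’s thrift endpoint (the host and port below are placeholders):

# etc/catalog/hive.properties on the Presto nodes
connector.name=hive-hadoop2
hive.metastore.uri=thrift://your-metastore-host:9083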

Can/Should We Use It?

Unfortunately, Presto currently only supports Hive 2.x.  From its own documentation: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

If you read around online, though, you will find that it does seem to work… but with limited features.  If you look at this Google Groups thread, for example: https://groups.google.com/forum/#!topic/presto-users/iAeEecsnS9I, you will see:

“We have tested Presto 0.203e with Hive 3.0 Metastore, and it works fine. We tested it by running TPC-DS queries, and Presto completed all 99 queries.”

But lower down, you will see:

However, Presto is not able to read Hive managed (transactional) tables in Hive 3.x…

Yes, this is a known limitation.

Unfortunately, transactional ACID v2 tables are the default for managed tables in Hive 3.x.  So, basically, all managed tables will not work in Hive 3.x, even though external tables will.  It might be okay to use it if you only create external tables… but in our case we let people use Spark however they like, and they likely create many managed tables.  That rules out using Hive 3.0 with the standalone metastore for us.

Next, I’m going to see whether Hive 2.x’s metastore can be run without the Hive server and Hadoop.

Side Note – SchemaTool

I would just like to make a side note that, while I did manage to run the Hive standalone metastore without installing Hadoop, I did have to install (but not run) Hadoop in order to use the schematool provided with Hive for creating the Hive RDBMS schema.  This is due to library dependencies.

There is also a “create on first run” config you can use instead of this, but it is not recommended for production; just keep that in mind.
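
For reference, the schematool step I mean is roughly the following.  This is a sketch; it assumes a MySQL-backed metastore whose connection details are already configured in hive-site.xml:

# Creates the metastore schema in the backing RDBMS up front instead of
# relying on "create on first run".
$HIVE_HOME/bin/schematool -dbType mysql -initSchema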

Useful Links

SCP/SSH With Different Private Key

If you need to use SSH or SCP with a different private key file, just specify it with -i.  For example, to copy logs from a remote server using a specific private key file and user, do the following:

scp -i C:\Users\[your-user]\.ssh\pk_file [user]@[ip-addr]:/path/logs/* .

The -i flag works regardless of OS; the example above copies from a remote Linux server to a Windows machine, assuming you store your private keys in your user’s .ssh directory.
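
The same flag applies to a plain SSH session, for example:

ssh -i ~/.ssh/pk_file [user]@[ip-addr]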

Java – Regular/Scheduled Task, One Run at a Time

This will be a very short post, and I’m mostly writing it just so it sticks in my head better.

Common Use Cases

There are many times when you may find that you need to regularly run a task in Java.  Here are a few common examples:

  • You have a cache you need to refresh every X minutes to power a dashboard or something similar.
  • You need to prune old files from a file system once an hour.
  • You need to regularly update stats counters for monitoring.

Coding Options

There are a lot of ways to do this, but the recommended approach is to use a scheduled executor.  Now… this part is easy to remember, but what is sometimes hard to remember is that you have two options when scheduling a task.  I often find myself picking the wrong one as it pops up in IntelliSense and I forget there are two options.

  1. Run the task every X seconds/minutes/etc no matter what.
  2. Run the task every X seconds/minutes/etc *after* the previous task completed.

These two things can be very different.  If you have a task that only takes a couple of seconds, it probably doesn’t matter much.  But if you have a task that takes 2 minutes and you schedule it at a 1-minute rate, then with option 1 the executor keeps falling behind and (on a single-threaded scheduler) ends up running the task back-to-back with no breathing room, whereas with option 2 you’ll run one copy at a time with a full minute of buffer in between each run.

For both options, you can create the scheduled executor service the same way:

ScheduledExecutorService se = Executors.newSingleThreadScheduledExecutor();

But for option #1 (run every interval regardless of previous tasks), you would use this function:

se.scheduleAtFixedRate(this::refreshCache, 10, 120, TimeUnit.SECONDS);

And for option #2 (start counting the interval after the previous run completes), you would use this function:

se.scheduleWithFixedDelay(this::refreshCache, 10, 120, TimeUnit.SECONDS);
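
Putting it together, here is a minimal self-contained sketch of the one-run-at-a-time variant.  The refreshCache body and the 10/120-second timings are just placeholders:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CacheRefresher {

    private final ScheduledExecutorService se =
            Executors.newSingleThreadScheduledExecutor();

    // Placeholder for whatever work needs to run regularly.
    private void refreshCache() {
        System.out.println("Refreshing cache at " + System.currentTimeMillis());
    }

    public void start() {
        // Option #2: wait 120 seconds *after* each run completes before
        // starting the next one (with an initial delay of 10 seconds).
        se.scheduleWithFixedDelay(this::refreshCache, 10, 120, TimeUnit.SECONDS);
    }

    public void stop() {
        se.shutdown();
    }
}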

 

Extending an LVM Volume (e.g. /opt) in CentOS 7

What Does LVM Mean?

Taking a description from The Geek Diary:

The Logical Volume Manager (LVM) introduces an extra layer between the physical disks and the file system allowing file systems to:

  • Be resized and moved easily and online without requiring a system-wide outage.
  • Use discontinuous space on disk.
  • Have meaningful names to volumes, rather than the usual cryptic device names.
  • Span multiple physical disks.

Extending an LVM Volume (e.g. /opt):

Run “vgs” to display information on the available volume groups. This will tell you if you have “free” space that you can allocate to one of the existing logical volumes. In our case, we have 30 GB free.

$> vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  rootvg   1   7   0 wz--n- <63.00g <30.00g

Run “lvs” to display the logical volumes on your system and their sizes. Find the one you want to extend.

$> lvs
  LV     VG     Attr       LSize
  homelv rootvg -wi-ao----  1.00g
  optlv  rootvg -wi-ao----  2.00g
  rootlv rootvg -wi-ao----  8.00g
  swaplv rootvg -wi-ao----  2.00g
  tmplv  rootvg -wi-ao----  2.00g
  usrlv  rootvg -wi-ao---- 10.00g
  varlv  rootvg -wi-ao----  8.00g

Extend the logical volume using “lvextend”. In our case, I’m moving /opt from 2g to 5g.

$> lvextend -L 5g rootvg/optlv
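
As an aside, if you would rather give the volume all of the remaining free space instead of a fixed size, lvextend can also work with extents, e.g.:

$> lvextend -l +100%FREE rootvg/optlv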

Display the logical volumes again if you like; “lvs” will now show optlv at 5.00g, but “df” will still report the old size because the file system itself has not been grown yet.

Use “df -hT” to show what kind of file system the volume you resized is using.  This determines which command you need to run next.

$> df -hT
Filesystem                Type      ...
/dev/mapper/rootvg-rootlv ext4      ...
devtmpfs                  devtmpfs  ...
tmpfs                     tmpfs     ...
tmpfs                     tmpfs     ...
tmpfs                     tmpfs     ...
/dev/mapper/rootvg-usrlv  ext4      ...
/dev/sda1                 ext4      ...
/dev/mapper/rootvg-optlv  ext4      ...
/dev/mapper/rootvg-tmplv  ext4      ...
/dev/mapper/rootvg-varlv  ext4      ...
/dev/mapper/rootvg-homelv ext4      ...
/dev/sdb1                 ext4      ...
tmpfs                     tmpfs     ...

If it is ext4, you can use the following command to grow the file system into the newly extended volume.  If it is not, you will have to find the appropriate command for your file system.

$> resize2fs /dev/mapper/rootvg-optlv
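
If the volume had instead been formatted as XFS (the default on many CentOS 7 installs), the equivalent step would be to grow it by mount point, e.g.:

$> xfs_growfs /opt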

Now you should see the extended volume size in “lvs” or “df -h”; and you’re done!

$> df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-rootlv  7.8G   76M  7.3G   2% /
devtmpfs                   3.9G     0  3.9G   0% /dev
tmpfs                      3.9G  4.0K  3.9G   1% /dev/shm
tmpfs                      3.9G  130M  3.8G   4% /run
tmpfs                      3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/mapper/rootvg-usrlv   9.8G  2.6G  6.7G  29% /usr
/dev/sda1                  976M  119M  790M  14% /boot
/dev/mapper/rootvg-optlv   4.9G  1.9G  2.9G  40% /opt
/dev/mapper/rootvg-tmplv   2.0G   11M  1.8G   1% /tmp
/dev/mapper/rootvg-varlv   7.8G  3.2G  4.2G  44% /var
/dev/mapper/rootvg-homelv  976M   49M  861M   6% /home
/dev/sdb1                   16G   45M   15G   1% /mnt/resource
tmpfs                      797M     0  797M   0% /run/user/1000

Azure CLI Get Scale Set Private IP Addresses

Getting Scale Set Private IPs is Hard

I have found that it is impressively difficult to get the private IP addresses of Azure scale set instances in almost every tool.

For example, if you go and create a scale set in Terraform, even Terraform will not give you the addresses or a way to look them up so you can act on them in future steps.  Similarly, you cannot easily list the addresses in Ansible.

You can make dynamic inventories in Ansible based on scripts, though.  So, in order to make an Ansible playbook dynamically target the nodes in a recently created scale set, I decided to use an inventory generated by the Azure CLI.

Azure CLI Command

Here is an Azure CLI command (version 2.0.58) which directly lists the private IP addresses of scale set nodes.  I hope it helps you as it has helped me.  It took a while to piece together from the docs, but it’s pretty simple now that it’s done.

az vmss nic list --resource-group YourRgName \
--vmss-name YourVmssName \
--query "[].ipConfigurations[].privateIpAddress"

The output will look similar to this, though I have changed the IP addresses to fake ones as an example.

[
  "123.123.123.123",
  "123.123.123.124"
]
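
To turn that into something an Ansible playbook can target, one option is a tiny wrapper that writes the addresses into an inventory file.  This is just a sketch; the group name and file name are arbitrary:

# Build a simple static inventory file from the scale set's private IPs.
echo "[vmss_nodes]" > scaleset_inventory.ini
az vmss nic list --resource-group YourRgName \
  --vmss-name YourVmssName \
  --query "[].ipConfigurations[].privateIpAddress" -o tsv >> scaleset_inventory.ini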

Using SystemD Timers (Like Cron) – Centos

The more I use it, the more I like using SystemD to manage services.  It is sometimes verbose, but in general it is very easy to use and very powerful.  It can restart failed services, start services automatically, handle logging and log rolling for simple services, and much more.

SystemD Instead of Cron

Recently, I had to take a SystemD service and make it run on a regular timer.  So, of course, my first thought was to use cron… but I’m not a particularly big fan of cron, for reasons I won’t go into here.

I found out that SystemD can do this as well!

Creating a SystemD Timer

Let’s say you have a (very) simple service defined like this in SystemD (in /etc/systemd/system/your-service.service):

[Unit]
Description=Do something useful.

[Service]
Type=simple
ExecStart=/opt/your-service/do-something-useful.sh
User=someuser
Group=someuser

Then you can create a timer for it by adding another SystemD file in the same location, ending with “.timer” instead of “.service”.  Timers can handle basically any interval, can start at boot, and can even schedule runs relative to boot time (see the aside after the example below).

[Unit]
Description=Run service once daily.

[Timer]
OnCalendar=*-*-* 00:30:00
Unit=your-service.service

[Install]
WantedBy=multi-user.target
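
As an aside, if you wanted the boot-relative behavior mentioned above instead of a calendar schedule, the [Timer] section can use monotonic timers.  This is just a sketch; the intervals are placeholders:

[Timer]
# Fire 15 minutes after boot, then one hour after each time the unit was last activated.
OnBootSec=15min
OnUnitActiveSec=1h
Unit=your-service.service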

Once the timer file is made, you can enable it like so:

sudo systemctl daemon-reload
sudo systemctl enable your-service.timer
sudo systemctl start your-service.timer

After this, you can verify the timer is working and check its next (and later on, previous) execution time with this command:

$> systemctl list-timers --all
NEXT                         LEFT          LAST                         PASSED  UNIT                         ACTIVATES
Tue 2019-03-05 20:00:01 UTC  14min left    Mon 2019-03-04 20:00:01 UTC  23h ago systemd-tmpfiles-clean.timer systemd-tmpfiles-clean.service
Wed 2019-03-06 00:30:00 UTC  4h 44min left n/a                          n/a     your-service.timer     your-service.service
n/a                          n/a           n/a                          n/a     systemd-readahead-done.timer systemd-readahead-done.service