Presto + Hive View Storage/Implementation

I’ve been learning about how presto handles views lately.  This is because we are heavily reliant on presto and we recently ran into multiple use cases where our hive metastore had views which wouldn’t work within presto.

What are Presto Views Exactly?

Presto has its own view implementation which is distinct from a hive’s view implementation.  Presto will not use a hive view, and if you try to  query one, you will get a clear error immediately.

A presto view is based on a Presto SQL query string.  A hive view is based on a a hive query string.  A hive query string is written in HQL (Hive Query  Language), and presto simply does not know that SQL dialect.

How Are Views Stored?

Presto actually stores its views in the exact same location as hive does.  The hive metastore database has a TBLS table which holds every hive table and view.  Views have two columns populated that tables ignore – view_original_text and view_expanded_text.  Hive views will have plain SQL in the view_original_text column whereas presto views will have some encoded representation prefixed with “/* Pesto View…”.   If presto queries a view and does not find it’s “/* Pesto View” prefix, it will consider it a hive view and say that it is not supported.

Making Presto Handle Hive Views

I’ve been doing work for some time to try to make presto-sql support hive views.  I’m using the pull request noted in this issue https://github.com/prestodb/presto/issues/7338 as a template.  It is fairly old though and was made against presto-db rather than presto-sql, so the exercise has turned out to be non-trivial.

I’m still chugging along and will post more when done.  But one thing to note is that this PR does not really make presto support hive views.  It actually allows presto to attempt to run hive views as they are.  Many hive views will be valid presto SQL – e.g. where you’re just selecting from a table with some basic joins and/or where clause filters.

So, this PR basically prevents presto from outright failing when it sees a view that does not start with “/* Presto View”.  It then helps it read the hive query text, possibly lightly modify it, and attempt to run it as if the same had been done for a Presto query.

I plan on doing a number of corrections to the SQL as well; e.g. replacing double quotes, converting back-ticks, replacing obvious function names like NVL with COALESCE, etc.  Eventually I may try to fix more by parsing the hive text with ANTLR or something similar to make as many hive views run by default as possible.  But it will never be a complete solution.  A complete solution would be very hard as it would require a full understanding and conversion of hive language to presto language (which is probably not even possible given some of their differences).

Hive Metastore DB (HMS DB) – Get All Tables/Columns including Partition Keys

How Hive Stores Schemas

I recently had to sync the schemas from a hive database to a normal MySQL database for reasons I won’t bother going into.  The exercise required me to get all columns from all tables in hive for each DB though, and I found that this was not amazingly straight-forward.

The hive metastore DB is a normal MySQL/etc database with a hive schema in it.  The hive schema holds the hive tables though.  So, the information schema is irrelevant to hive; to get the hive table details, you have to interrogate the TBLS table, for example.  To get columns, you need to interrogate COLUMNS_V2, and to get the databases themselves, you look toward the DBS table.

Missing Columns

Other posts I’ve seen would leave you here – but  I found when I joined these three tables together (plus some others you have to route through), I was still missing some columns in certain tables for some reason.  It turns out that partition columns are implicit in hive.  So, in a file system of hive data (like HDFS), a partition column in a table is literally represented by just having the directory named with the partition value; there are no columns with the value in the data.

What this means is that partition columns don’t show up in these normal tables.  You have to look to a separate partition keys table to find them with a separate query.

The Working Query

The query below finds all columns of any kind and sorts them in the order they’ll appear when you select from a table in hive/presto/etc.  I hope it helps you!

select db_name, table_name, column_name from (
SELECT d.NAME as db_name, t.TBL_NAME as table_name, c.COLUMN_NAME as column_name, c.INTEGER_IDX as idx
FROM DBS d
JOIN TBLS t on t.DB_ID = d.DB_ID
JOIN SDS s on t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c on c.CD_ID = s.CD_ID
WHERE d.NAME = :dbName
union
SELECT d.NAME as db_name, t.TBL_NAME as table_name, k.PKEY_NAME as column_name, 50000 + k.INTEGER_IDX
FROM DBS d
JOIN TBLS t on t.DB_ID = d.DB_ID
join PARTITION_KEYS k on t.TBL_ID = k.TBL_ID
where d.NAME = :dbName
) x
order by db_name, table_name, idx

Creating Java Key Stores (JKS), Setting Validity Periods, and Analyzing JKS for Expiry Dates

What is a JKS File?

Taken from: https://en.wikipedia.org/wiki/Java_KeyStore

A Java KeyStore (JKS) is a repository of security certificates – either authorization certificates or public key certificates – plus corresponding private keys, used for instance in SSL encryption.

Basically, applications like Presto use JKS files to enable them to do Transport Layer Security (TLS).

How Do You Create One?

As an example, Presto, a common big-data query tool, uses JKS for both secure internal and external communication.  At this link https://prestosql.io/docs/current/security/internal-communication.html they show you how to create one.

Here is an excerpt.  Note that you must make sure you replace *.example.com with the domain you will host the service using the JKS file on or it will not work.  Certificates are domain specific.

keytool -genkeypair -alias example.com -keyalg RSA \
    -keystore keystore.jks
Enter keystore password:
Re-enter new password:
What is your first and last name?
  [Unknown]:  *.example.com
What is the name of your organizational unit?
  [Unknown]:
What is the name of your organization?
  [Unknown]:
What is the name of your City or Locality?
  [Unknown]:
What is the name of your State or Province?
  [Unknown]:
What is the two-letter country code for this unit?
  [Unknown]:
Is CN=*.example.com, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
  [no]:  yes

Enter key password for <presto>
        (RETURN if same as keystore password):

Pit Fall! – 90 Day Expiry Date – Change It?

Now… here’s a fun pitfall.  Certificates are made with an expiry date and have to be reissued periodically (for security reasons).  The default expiry date for JKS is 90 days.

https://docs.oracle.com/javase/tutorial/security/toolsign/step3.html

This certificate will be valid for 90 days, the default validity period if you don’t specify a –validity option.

This is fine on a big managed service with lots of attention.  But if you’re just TLS securing an internal app not many people see, you will probably forget to rotate it or neglect to set up appropriate automation.  Then, when it expires, things will break.

Now… for security reasons you generally shouldn’t set too high a value for certificate expiry time.  But for example purposes, here is how you would set it to 10 years.

keytool -genkeypair -alias example.com -keyalg RSA \
    -keystore keystore.jks -validity 3650

Determine Expiry Date

If you have a JKS file that you are using for your application and you are not sure when it expires, here’s a command that you can use:

keytool -list -v -keystore keystore.jk

This will output different things based on where your key store came from.  E.g. you will probably see more interesting output from a real SSL cert than you will from a self-signed one created like we did above.

In any case, you will clearly see a line in the large output from this command that says something like this:

Valid from: Fri Jul 12 01:27:21 UTC 2019 until: Thu Oct 10 01:27:21 UTC 2019

Note that in this case, it is just valid for 3 months.  Be careful when looking at this output because you may find multiple expiry dates in the output for different components of the JKS file.  You need to make sure you read the right one.  Though, chances are that the one on your domain will be the one that expires earliest anyway.

 

Building Presto Admin

Presto Admin – Is it Worth Using?

I generally deploy Presto clusters using packer and terraform.  Packer builds an image for me with the target presto distribution, some utility scripts, ha proxy (to get multiple coordinators acting in HA), etc.

I kept noticing this presto-admin project though: https://github.com/prestosql/presto-admin.  It allows you to quickly/easily deploy clusters from a central node, and it will handle the coordinator, workers, catalogs, and everything.  That sounds pretty cool.

Advance disclaimer – after I built this, I decided not to use it.  This was because it seems to just deploy a single coordinator and worker set.  For an HA setup, I need multiple coordinators, a load balancer, and workers/users pointing at the load balancer.  So, it’s just not the right fit for me.

Presto Admin – Build

In any case, I did go through the motions of building this – because I could not find a source release.  Fortunately, it’s pretty easy on Centos 7.x (basically RHEL 7.x):

# Download and unzip.
wget https://github.com/prestosql/presto-admin/archive/2.7.tar.gz
tar -xvzf 2.7.tar.gz

# Install pip/etc.
sudo yum install epel-release
sudo yum install python-pip
sudo yum install python-wheel

# Run make file and build web installer.
make dist

After this, just go into the dist folder and find prestoadmin-2.7-online.tar.gz.

I hope this saves you some time; I wasted around 20 minutes trying to find the dist online for download (which I never did).

Presto Custom Password Authentication Plugin (Internal)

Presto Authentication (Out of the Box)

Out of the box, presto will not make you authenticate to run any queries.  For example, you can just connect with JDBC from Java or DBeaver/etc and run whatever queries you want with any user name and no password.

When you want to enable a password, it has a few options out of the box:

So, unfortunately, if you just want to create some users/passwords and hand them out, or get users from an existing system or database, there really isn’t a way to do it here.

Custom Password Authenticator

Most things in Presto are implemented as plugins, including many of its own out-of-the-box features.  This is true for the LDAP authenticator above.  It actually is a “Password Authenticator” plugin implementation.

So, if we copy the LDAP plugin and modify it, we can actually make our own password plugin that lets us use any user-name and password/etc we want!  Note that we’ll just make another “internal” plugin which means we have to recompile presto.  I’ll try to make this an external plugin in another article.

Let’s Do It!

Note: we will be modifying presto code, and presto code only builds on Linux.  I use windows, so I do my work on an Ubuntu desktop in the cloud; but you can do whatever you like.  If you have a Linux desktop, it should build very easily out of Intellij or externally with Maven.

  1. Clone the presto source code.
  2. Open it in Intellij or your editor of choice.
  3. Find the “presto-password-authenticators” sub-project and navigate to com.facebook.presto.password.
  4. Copy LdapAuthenticator.java to MyAuthenticator.java.
  5. Copy LdapConfig to MyAuthenticatorConfig.java.
  6. Copy LdapAuthenticatorFactory to MyAuthenticatorFactory.java.
  7. For MyAutnenticatorConfig, make a password and a user private variable, both strings.  Then make getters and setters for them similar to the LDAP URL ones; though you can take out the patterns/etc.  You can remove everything else; our user name and password will be our only config.
  8. For MyAuthenticatorFactory, change getName() to return “my-authenticator”.  Also edit it so that it uses your new config class and so that it returns your new authenticator in config.
  9. For MyAuthenticator, just make the private Principal authenticate(String user, String password) {} method throw AccessDeniedException if the user and password don’t match the ones from your config.  You can get the ones from your config in the constructor.
  10. Add MyAuthenticator to the PasswordAuthenticationPlugin.java file under the LdapAuthenticatorFactory instance.

This is all the code changes you need for your new authenticator (assuming it works).  But how do we use/test it?

  • The code runs from the presto-main project.
  • Go there and edit config.properties; add these properties:
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto-keystore/keystore.jks
http-server.https.keystore.key=somePassword
  • Then add a password-authenticator.properties in the same directory with these properties:
password-authenticator.name=my-authenticator
user=myuser
password=otherPassword
  • Note that the password-authenticator.name is equal to the string you used in the factory class in step #8.
  • The first config file tells it to use a password plugin, the second one tells it which one to use based on that name and what extra properties to give to it based on the config (in our case, just a user and a password).

The Hard Part

Okay, so all of this was pretty easy to follow.  Unfortunately, if you go try to run this, it won’t work (Note that you can run the presto in IntelliJ by following these instructions).

Anyway, your plugin won’t work because password plugins don’t do anything unless you have HTTPS enabled on the coordinator.   This is because presto developers don’t want you sending clear-text passwords; so the password plugin type just makes it flat out not work!

We put properties in our first config file for HTTPS already.  Now you need to follow the JKS instructions here: https://prestosql.io/docs/current/security/tls.html#server-java-keystore to generate a key store and put it at the path from our config file.

Note that if you’re using AWS/etc, the “first and last name” should be the DNS name of your server.  If you put the IP, it won’t work.

Once you’ve got that set up, you can start presto and connect via JDBC from the local host by filling out the parameters noted here (https://prestosql.io/docs/current/installation/jdbc.html).

SSL Use HTTPS for connections
SSLKeyStorePath The location of the Java KeyStore file that contains the certificate and private key to use for authentication.
SSLKeyStorePassword The password for the KeyStore.

SSL = true, key store path is the one from the config file earlier, and the password is also the one from the config file.

Now, if you connect via JDBC and use the user name and password from your config file, everything should work!  Note that you probably want to get a real certificate in the future or else you’ll have to copy yours key store to each computer/node you use so JDBC works on each (which isn’t great).

UPDATE – 2019/08/04 – YOU should make clients use a trust store with the private key removed for numerous reasons.  See this newer post for how to take this JKS file and modify it for client use properly: https://coding-stream-of-consciousness.com/2019/08/04/presto-internal-tls-password-login-removing-private-key-from-jks-file/.

Presto Doesn’t Work with Apache Ranger (Yet)

Google Group Discovery

After a fairly long fight at building ranger and getting it ready to install, I came across this google group item randomly which made me sad:

https://groups.google.com/forum/m/#!topic/presto-users/gp5tRn9J7kk

It has the following question:

I have setup Presto, Hive, Hue and also setup Ranger for controlling column level access to LDAP users.
Able to see the restrictions getting applied on Hive queries by LDAP users, but however these restrictions are not getting applied on Presto queries.
I understand Presto also uses the same Hive Metastore and Can someone help me why the restricted column access are obeyed in Hive and not Presto when logged in as LDAP user?
And this response:

I am afraid Presto is not integrated with Apache Ranger today. Instead Presto only obeys table-level permissions defined in Hive Metastore.

It’s definitely a roadmap item, we have heard similar requests for integration with Apache Sentry. No specific target date for either at this point.

The Verdict

So, unfortunately, it looks like even if I do finish installing Ranger, I will not be able to get the column level security I’m looking for in Presto.  So, I’m going to move on to analyzing other non-Ranger options.  I’ll also had somewhat ruled out Sentry even before reading this due to a stack-overflow post I read: https://stackoverflow.com/a/56247090/857994 which states:

Just quick update with Cloudera+Hortonworks merge last year. These companies have decided to standardize on Ranger. CDH5 and CDH6 will still use Sentry until CDH product line retires in ~2-3 years. Ranger will be used for Cloudera+Hortonworks’ combined “Unity” platform.

Cloudera were saying to us that Ranger is a more “mature” product. Since Unity hasn’t released yet (as of May 2019), something may come up in the future, but that’s the current direction. If you’re a former Cloudera customer / or CDH user, you would still have to use Apache Sentry. There is a significant overlap between Sentry and Ranger, but if you start fresh, definitely look at Ranger.

I had also already seen numerous other things online agreeing with this and saying that Sentry is weak and Ranger is far more advanced; so this is not surprising.

Eventual Implementation

I found this page https://cwiki.apache.org/confluence/display/RANGER/Presto+Plugin which tells you how to use a ranger-presto plugin.  It was literally made and last edited on May 19th 2019 and refers to version 1.2 of Ranger (the current release).

As I’m writing this on June 9th and 1.2 was released in September 2018 (based on its release note creation date at this site https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+1.2.0+-+Release+Notes), this is clearly not released yet.

I double checked on git hub and sure enough, this was just committed 20 days ago.

I wrote one of the committers to get their view on this problem and potential release schedules/etc just for future reference.

Other Options

Apparently Starburst, a Presto vendor company that works on top of various clouds (Azure and AWS), has integrated Sentry and Ranger into their Presto distribution.  You can see that here: https://www.starburstdata.com/technical-blog/presto-security-apache-ranger/.

AWS is also working on Cloud Formation (still in Preview) which supports column level authorization with its Athena (Presto) engine.

 

Hive + Presto + Ranger Version Hell

My Use Case

I was trying to test out Apache Ranger in order to give Presto column-level security over hive data.  Presto itself doesn’t seem to support Ranger yet, though some github entries suggest it will soon.  Ranger can integrate with hive though so that when presto queries hive, the security can work fine (apparently).

Conflicting Versions

I started off by deploying a version of Hive I’ve worked with before; 2.3.5, the latest 2.x version (I avoided 3.x).  After that, I deployed Presto .220, also the latest version.

This was all working great, so I moved on to Ranger.  This is when I found out that the Ranger docs specifically say that it only works with Hive version 1.2.0:

Apache Ranger version 0.5.x is compatible with only the component versions mentioned below

HIVE 1.2.0 https://hive.apache.org/downloads.html

That came from this link: https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+0.5.0+Installation.

Alternative Options

I have a fairly stringent need for the security Ranger provides.  So, I was willing to use a 1.x version of hive, depending on what the feature loss was.  After all, quite a few big providers seem to use 1.x.

Unfortunately, the next thing I noticed was that Presto says: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

That is coming from its latest documentation: https://prestodb.github.io/docs/current/connector/hive.html.

I’m not particularly excited to start digging through old versions of Presto as well.

Next Steps

I’m going to try to stick with Hive 2.x for now and a modern version of Presto.  So, my options are:

  1. Research Ranger more and see if it can actually work with Hive 2.x.  Various vendors seem to use Ranger and Hive/Presto together; so I’m curious to see how.  Maybe the documentation on Ranger is just out of date (I know, being hopeful).
  2. Look at Ranger alternatives like Apache Sentry and see if they support Hive 2.x.  Apparently Ranger is beating out Sentry in features, usage, and future support… so I’m not excited about using Sentry.  But if it works, I can always migrate back to Ranger once its support grows for either Hive or Presto.

Update

I starting digging in from JIRA and mailing lists and found that Ranger appears to have had work done on it as early as 2017 for supporting hive 2.3.2.  Here’s a link.  https://issues.apache.org/jira/browse/RANGER-1927.

So, I’m going to give installing ranger a shot on 2.3.5 and see if it works.  If not, I’ll try with 2.3.2 and/or seek community help.  Hopefully I’ll come back and update this afterward with some good news :).