Presto + Hive View Storage/Implementation

I’ve been learning about how presto handles views lately.  This is because we are heavily reliant on presto and we recently ran into multiple use cases where our hive metastore had views which wouldn’t work within presto.

What are Presto Views Exactly?

Presto has its own view implementation, which is distinct from Hive’s view implementation.  Presto will not use a Hive view, and if you try to query one, you will get a clear error immediately.

A Presto view is based on a Presto SQL query string.  A Hive view is based on a Hive query string.  A Hive query string is written in HQL (Hive Query Language), and Presto simply does not know that SQL dialect.

How Are Views Stored?

Presto actually stores its views in the exact same location as Hive does.  The Hive metastore database has a TBLS table which holds every Hive table and view.  Views have two columns populated that tables ignore – view_original_text and view_expanded_text.  Hive views will have plain SQL in the view_original_text column, whereas Presto views will have some encoded representation prefixed with “/* Presto View…”.  If Presto queries a view and does not find its “/* Presto View” prefix, it will consider it a Hive view and say that it is not supported.
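
To make the prefix check concrete, here is a minimal sketch of the decision described above.  It is not Presto's actual code; the class and method names are hypothetical, and the prefix constant is simply the string quoted in this post.

```java
final class ViewTextClassifier {
    // Prefix quoted in the post; treated here as an assumption about the stored format.
    static final String PRESTO_VIEW_PREFIX = "/* Presto View";

    /** Returns true when the stored view text looks like an encoded Presto view. */
    static boolean isPrestoView(String viewOriginalText) {
        return viewOriginalText != null
                && viewOriginalText.startsWith(PRESTO_VIEW_PREFIX);
    }
}
```

Anything that fails this check (for example, plain SQL such as "SELECT id FROM orders") would be treated as a Hive view.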

Making Presto Handle Hive Views

I’ve been doing work for some time to try to make presto-sql support hive views.  I’m using the pull request noted in this issue https://github.com/prestodb/presto/issues/7338 as a template.  It is fairly old though and was made against presto-db rather than presto-sql, so the exercise has turned out to be non-trivial.

I’m still chugging along and will post more when done.  But one thing to note is that this PR does not really make presto support hive views.  It actually allows presto to attempt to run hive views as they are.  Many hive views will be valid presto SQL – e.g. where you’re just selecting from a table with some basic joins and/or where clause filters.

So, this PR basically prevents presto from outright failing when it sees a view that does not start with “/* Presto View”.  It then helps it read the hive query text, possibly lightly modify it, and attempt to run it as if the same had been done for a Presto query.

I plan on doing a number of corrections to the SQL as well; e.g. replacing double quotes, converting back-ticks, replacing obvious function names like NVL with COALESCE, etc.  Eventually I may try to fix more by parsing the hive text with ANTLR or something similar to make as many hive views run by default as possible.  But it will never be a complete solution.  A complete solution would be very hard as it would require a full understanding and conversion of hive language to presto language (which is probably not even possible given some of their differences).
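
As a rough illustration of the kind of light corrections I mean, here is a naive sketch.  The class name is hypothetical, and it makes a big simplifying assumption: the view text contains no quote or back-tick characters inside string literals.  A real solution would need a proper parser.

```java
import java.util.regex.Pattern;

final class HiveViewSqlRewriter {
    // Matches the NVL function name (any case) followed by an opening parenthesis.
    private static final Pattern NVL_CALL = Pattern.compile("(?i)\\bnvl\\s*\\(");

    /** Applies a few purely textual Hive-to-Presto corrections. */
    static String toPrestoSql(String hiveSql) {
        // Hive allows double-quoted string literals; Presto expects single quotes.
        String sql = hiveSql.replace('"', '\'');
        // Hive quotes identifiers with back-ticks; Presto uses double quotes.
        sql = sql.replace('`', '"');
        // A two-argument NVL(a, b) can be written as COALESCE(a, b).
        return NVL_CALL.matcher(sql).replaceAll("COALESCE(");
    }
}
```

The order matters: double quotes are converted first, so the double quotes produced by the back-tick step are not turned into single quotes afterwards.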

Hive Metastore DB (HMS DB) – Get All Tables/Columns including Partition Keys

How Hive Stores Schemas

I recently had to sync the schemas from a Hive database to a normal MySQL database for reasons I won’t bother going into.  The exercise required me to get all columns from all tables in Hive for each DB though, and I found that this was not amazingly straightforward.

The Hive metastore DB is a normal MySQL/etc database with a hive schema in it.  Hive’s own databases, tables, and columns are stored as rows inside that schema’s tables.  So, the information schema is irrelevant to Hive; to get the Hive table details, you have to interrogate the TBLS table, for example.  To get columns, you need to interrogate COLUMNS_V2, and to get the databases themselves, you look toward the DBS table.

Missing Columns

Other posts I’ve seen would leave you here, but I found when I joined these three tables together (plus some others you have to route through), I was still missing some columns in certain tables for some reason.  It turns out that partition columns are implicit in Hive.  So, in a file system of Hive data (like HDFS), a partition column in a table is literally represented by just having the directory named with the partition value; there are no columns with the value in the data.

What this means is that partition columns don’t show up in these normal tables.  You have to look to a separate partition keys table to find them with a separate query.
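
To show what “implicit” means here, this small sketch recovers partition column values from a data file path, assuming the usual key=value directory naming.  The path, the class name, and the method name are made up for illustration, and escaped characters in partition values are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PartitionPath {
    /** Collects key=value directory segments from a path, in path order. */
    static Map<String, String> partitionValues(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            // Only segments shaped like key=value are treated as partition directories.
            if (eq > 0) {
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }
}
```

For a hypothetical file at /warehouse/sales/dt=2019-07-12/country=us/part-00000, the dt and country values exist only in the directory names, which is why they never appear in COLUMNS_V2.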

The Working Query

The query below finds all columns of any kind and sorts them in the order they’ll appear when you select from a table in hive/presto/etc.  I hope it helps you!

select db_name, table_name, column_name from (
SELECT d.NAME as db_name, t.TBL_NAME as table_name, c.COLUMN_NAME as column_name, c.INTEGER_IDX as idx
FROM DBS d
JOIN TBLS t on t.DB_ID = d.DB_ID
JOIN SDS s on t.SD_ID = s.SD_ID
JOIN COLUMNS_V2 c on c.CD_ID = s.CD_ID
WHERE d.NAME = :dbName
union
SELECT d.NAME as db_name, t.TBL_NAME as table_name, k.PKEY_NAME as column_name, 50000 + k.INTEGER_IDX
FROM DBS d
JOIN TBLS t on t.DB_ID = d.DB_ID
join PARTITION_KEYS k on t.TBL_ID = k.TBL_ID
where d.NAME = :dbName
) x
order by db_name, table_name, idx

Presto – Internal TLS + Password Login; Removing Private Key from JKS File

Overview

For various reasons, you may have to secure a Presto cluster with TLS, both internally and externally.  This is pretty straightforward following the Presto documentation, until you want to also combine that with an LDAP or custom password login mechanism.  Once you have internal TLS, external TLS, and LDAP, you have to play with some extra settings and manipulate your JKS files to get things done.

Internal TLS Settings

For secure internal communication, you should refer to the presto documentation right here: https://prestosql.io/docs/current/security/internal-communication.html.  It will walk you through various configuration settings that enable HTTPS, disable HTTP, and set key stores for TLS.

Part of the instructions have you generate a JKS file (Java Key Store) with a command like this:

keytool -genkeypair -alias example.com -keyalg RSA -keystore keystore.jks
Enter keystore password:
Re-enter new password:
What is your first and last name?
  [Unknown]:  *.example.com (Your site name should go here).

This will get your internal TLS working fine.

Adding External TLS

It would be quite pointless to secure the inside of a cluster if you didn’t secure the connections to the clients using it.  So, you’ve actually set all of the external TLS properties already when you were doing the internal security.  E.g. notice that the properties listed in the LDAP login plugin (which requires external SSL) here: https://prestosql.io/docs/current/security/ldap.html are already referenced in the doc we referred to for internal TLS here https://prestosql.io/docs/current/security/internal-communication.html.

Initially, I figured that I could configure a different JKS file for internal and external communication, but it turns out that this does not work, so don’t try it.  There is some information on that right here.  You need to use the same JKS file in all keystore configurations on the Presto servers.  So, don’t bother trying to tune the properties you already set while doing internal TLS; just keep them.

Given that internal and external communication need the same keystore, a naive second try may be to give clients the same JKS file that you use for internal TLS… but that’s a bad idea for two reasons:

  1. You’re giving away your private key and compromising security.
  2. If you go on to add password-login by LDAP or a custom password authenticator, the private key certificate will bypass it if the clients have it.

So, what you really need to do to allow clients to use TLS safely is use the same JKS file for all the server-side properties, but give clients a copy of that JKS file with the private key removed for use with JDBC/etc.

You can remove the private key from the JKS you made with the internal TLS instructions like this:

keytool -export -alias example.com -file sample.der -keystore keystore.jks
openssl x509 -inform der -in sample.der -out sample.crt
keytool -importcert -file sample.crt -keystore .keystore

The generated .keystore file can be used in JDBC or other connections by referring to it with the SSLTrustStorePath and SSLTrustStorePassword properties.  As it doesn’t have the private key, it will work for SSL, but it will not work as a login mechanism.  So, if you set up password login, clients will have to use it (which is what you want).  You can find JDBC documentation here: https://prestosql.io/docs/current/installation/jdbc.html.
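
Here is a small sketch of how a Java client might assemble those settings.  The property names (SSL, SSLTrustStorePath, SSLTrustStorePassword, user, password) come from the JDBC documentation linked above; the class name, host, port, and catalog are placeholders I made up.

```java
import java.util.Properties;

final class PrestoClientTls {
    /** Builds a Presto JDBC URL for the given coordinator and catalog. */
    static String jdbcUrl(String host, int port, String catalog) {
        return "jdbc:presto://" + host + ":" + port + "/" + catalog;
    }

    /** Builds connection properties that use a trust store (no private key). */
    static Properties connectionProperties(String user, String password,
                                           String trustStorePath, String trustStorePassword) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("SSL", "true");
        props.setProperty("SSLTrustStorePath", trustStorePath);
        props.setProperty("SSLTrustStorePassword", trustStorePassword);
        return props;
    }
}
```

With the Presto JDBC driver on the classpath, you would then pass both values to java.sql.DriverManager.getConnection(url, props).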

Password Logins

You can do user-name and password login with LDAP out of the box using the documentation I linked earlier.  Alternatively, you can use the custom password plugin documentation I wrote a month ago here: https://coding-stream-of-consciousness.com/2019/06/18/presto-custom-password-authentication-plugin-internal/ to do your own.

In either case, while combining internal TLS and password login, you will have to modify this property:

http-server.authentication.type=PASSWORD

to say this:

http-server.authentication.type=CERTIFICATE,PASSWORD

You need this because you have to set the PASSWORD type to make password logins work… but that forces all traffic to require a password.  Internal nodes doing TLS will start asking each other for passwords and failing since they can’t do that.  So, you add CERTIFICATE to allow them to authenticate to each other using their JKS files.

This is why you had to strip the private key out of the file you gave to the clients.  If they had it and used it as a key store, they could have authenticated to the coordinator with the JKS file instead of a user name/password.  But just having the trust store with the public keys allows SSL to work while not allowing it to be used as the CERTIFICATE login mechanism.

I hope this helps you get it working! I spent longer on this than I would like to admit :).

Note: There is some good related conversation here: https://groups.google.com/forum/#!topic/presto-users/R_byjHcIS8A and here: https://groups.google.com/forum/#!topic/presto-users/TYdvs5kGYE8.  These are the Google Groups threads that helped me get this working.

 

 

Creating Java Key Stores (JKS), Setting Validity Periods, and Analyzing JKS for Expiry Dates

What is a JKS File?

Taken from: https://en.wikipedia.org/wiki/Java_KeyStore

A Java KeyStore (JKS) is a repository of security certificates – either authorization certificates or public key certificates – plus corresponding private keys, used for instance in SSL encryption.

Basically, applications like Presto use JKS files to enable them to do Transport Layer Security (TLS).

How Do You Create One?

As an example, Presto, a common big-data query tool, uses JKS for both secure internal and external communication.  At this link https://prestosql.io/docs/current/security/internal-communication.html they show you how to create one.

Here is an excerpt.  Note that you must replace *.example.com with the domain that will host the service using the JKS file, or it will not work.  Certificates are domain specific.

keytool -genkeypair -alias example.com -keyalg RSA \
    -keystore keystore.jks
Enter keystore password:
Re-enter new password:
What is your first and last name?
  [Unknown]:  *.example.com
What is the name of your organizational unit?
  [Unknown]:
What is the name of your organization?
  [Unknown]:
What is the name of your City or Locality?
  [Unknown]:
What is the name of your State or Province?
  [Unknown]:
What is the two-letter country code for this unit?
  [Unknown]:
Is CN=*.example.com, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
  [no]:  yes

Enter key password for <presto>
        (RETURN if same as keystore password):

Pitfall! – 90 Day Expiry Date – Change It?

Now… here’s a fun pitfall.  Certificates are made with an expiry date and have to be reissued periodically (for security reasons).  The default validity period for a certificate generated by keytool is 90 days.

https://docs.oracle.com/javase/tutorial/security/toolsign/step3.html

This certificate will be valid for 90 days, the default validity period if you don’t specify a -validity option.

This is fine on a big managed service with lots of attention.  But if you’re just TLS securing an internal app not many people see, you will probably forget to rotate it or neglect to set up appropriate automation.  Then, when it expires, things will break.

Now… for security reasons you generally shouldn’t set too high a value for certificate expiry time.  But for example purposes, here is how you would set it to 10 years.

keytool -genkeypair -alias example.com -keyalg RSA \
    -keystore keystore.jks -validity 3650

Determine Expiry Date

If you have a JKS file that you are using for your application and you are not sure when it expires, here’s a command that you can use:

keytool -list -v -keystore keystore.jks

This will output different things based on where your key store came from.  E.g. you will probably see more interesting output from a real SSL cert than you will from a self-signed one created like we did above.

In any case, you will clearly see a line in the large output from this command that says something like this:

Valid from: Fri Jul 12 01:27:21 UTC 2019 until: Thu Oct 10 01:27:21 UTC 2019

Note that in this case, it is just valid for 3 months.  Be careful when looking at this output because you may find multiple expiry dates in the output for different components of the JKS file.  You need to make sure you read the right one.  Though, chances are that the one on your domain will be the one that expires earliest anyway.
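
If you would rather check expiry from code (for example, in a monitoring job), the standard java.security.KeyStore API can read the same dates.  This is a sketch under a few assumptions: the store is of type JKS, the class and method names are my own, and only X.509 certificates are reported.

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.security.KeyStore;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

final class KeystoreExpiry {
    /** Whole days from 'now' until 'notAfter' (negative once expired). */
    static long daysBetween(Instant now, Instant notAfter) {
        return Duration.between(now, notAfter).toDays();
    }

    /** Maps each alias in the key store to its certificate's expiry instant. */
    static Map<String, Instant> expiryByAlias(String path, char[] password) throws Exception {
        KeyStore store = KeyStore.getInstance("JKS");
        try (InputStream in = new FileInputStream(path)) {
            store.load(in, password);
        }
        Map<String, Instant> result = new LinkedHashMap<>();
        for (String alias : Collections.list(store.aliases())) {
            Certificate cert = store.getCertificate(alias);
            if (cert instanceof X509Certificate) {
                result.put(alias, ((X509Certificate) cert).getNotAfter().toInstant());
            }
        }
        return result;
    }
}
```

Feeding the two dates from the sample output above into daysBetween gives 90, which matches the default validity period.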

 

Building Presto Admin

Presto Admin – Is it Worth Using?

I generally deploy Presto clusters using Packer and Terraform.  Packer builds an image for me with the target Presto distribution, some utility scripts, HAProxy (to get multiple coordinators acting in HA), etc.

I kept noticing this presto-admin project though: https://github.com/prestosql/presto-admin.  It allows you to quickly/easily deploy clusters from a central node, and it will handle the coordinator, workers, catalogs, and everything.  That sounds pretty cool.

Advance disclaimer – after I built this, I decided not to use it.  This was because it seems to just deploy a single coordinator and worker set.  For an HA setup, I need multiple coordinators, a load balancer, and workers/users pointing at the load balancer.  So, it’s just not the right fit for me.

Presto Admin – Build

In any case, I did go through the motions of building this, because I could not find a pre-built release to download.  Fortunately, it’s pretty easy on CentOS 7.x (basically RHEL 7.x):

# Download and unzip.
wget https://github.com/prestosql/presto-admin/archive/2.7.tar.gz
tar -xvzf 2.7.tar.gz

# Move into the extracted source directory (name assumed from the archive).
cd presto-admin-2.7

# Install pip/etc.
sudo yum install epel-release
sudo yum install python-pip
sudo yum install python-wheel

# Run make file and build web installer.
make dist

After this, just go into the dist folder and find prestoadmin-2.7-online.tar.gz.

I hope this saves you some time; I wasted around 20 minutes trying to find the dist online for download (which I never did).

Presto Custom Password Authentication Plugin (Internal)

Presto Authentication (Out of the Box)

Out of the box, presto will not make you authenticate to run any queries.  For example, you can just connect with JDBC from Java or DBeaver/etc and run whatever queries you want with any user name and no password.

When you want to enable a password, it has a few options out of the box, including LDAP.

So, unfortunately, if you just want to create some users/passwords and hand them out, or get users from an existing system or database, there really isn’t a way to do it here.

Custom Password Authenticator

Most things in Presto are implemented as plugins, including many of its own out-of-the-box features.  This is true for the LDAP authenticator above.  It actually is a “Password Authenticator” plugin implementation.

So, if we copy the LDAP plugin and modify it, we can actually make our own password plugin that lets us use any user-name and password/etc we want!  Note that we’ll just make another “internal” plugin which means we have to recompile presto.  I’ll try to make this an external plugin in another article.

Let’s Do It!

Note: we will be modifying Presto code, and Presto code only builds on Linux.  I use Windows, so I do my work on an Ubuntu desktop in the cloud; but you can do whatever you like.  If you have a Linux desktop, it should build very easily out of IntelliJ or externally with Maven.

  1. Clone the presto source code.
  2. Open it in IntelliJ or your editor of choice.
  3. Find the “presto-password-authenticators” sub-project and navigate to com.facebook.presto.password.
  4. Copy LdapAuthenticator.java to MyAuthenticator.java.
  5. Copy LdapConfig to MyAuthenticatorConfig.java.
  6. Copy LdapAuthenticatorFactory to MyAuthenticatorFactory.java.
  7. For MyAuthenticatorConfig, make a password and a user private variable, both strings.  Then make getters and setters for them similar to the LDAP URL ones; though you can take out the patterns/etc.  You can remove everything else; our user name and password will be our only config.
  8. For MyAuthenticatorFactory, change getName() to return “my-authenticator”.  Also edit it so that it uses your new config class and so that it returns your new authenticator in config.
  9. For MyAuthenticator, just make the private Principal authenticate(String user, String password) {} method throw AccessDeniedException if the user and password don’t match the ones from your config.  You can get the ones from your config in the constructor.
  10. Add MyAuthenticatorFactory to the PasswordAuthenticationPlugin.java file under the LdapAuthenticatorFactory instance.
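
The heart of step 9 is just a comparison against the configured values.  Here is a dependency-free sketch of that comparison; it is not the plugin itself, the class name is hypothetical, and the use of MessageDigest.isEqual is my own choice rather than anything the LDAP plugin does.  In the real MyAuthenticator, a false result is where you would throw AccessDeniedException.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

final class StaticCredentialCheck {
    private final byte[] expectedUser;
    private final byte[] expectedPassword;

    /** Takes the user and password that would come from the config class. */
    StaticCredentialCheck(String user, String password) {
        this.expectedUser = user.getBytes(StandardCharsets.UTF_8);
        this.expectedPassword = password.getBytes(StandardCharsets.UTF_8);
    }

    /** Returns true only when both the user and the password match the config. */
    boolean matches(String user, String password) {
        if (user == null || password == null) {
            return false;
        }
        boolean userOk = MessageDigest.isEqual(
                expectedUser, user.getBytes(StandardCharsets.UTF_8));
        boolean passwordOk = MessageDigest.isEqual(
                expectedPassword, password.getBytes(StandardCharsets.UTF_8));
        // Non-short-circuit '&' so both comparisons always run.
        return userOk & passwordOk;
    }
}
```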

This is all the code changes you need for your new authenticator (assuming it works).  But how do we use/test it?

  • The code runs from the presto-main project.
  • Go there and edit config.properties; add these properties:
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto-keystore/keystore.jks
http-server.https.keystore.key=somePassword
  • Then add a password-authenticator.properties in the same directory with these properties:
password-authenticator.name=my-authenticator
user=myuser
password=otherPassword
  • Note that the password-authenticator.name is equal to the string you used in the factory class in step #8.
  • The first config file tells it to use a password plugin, the second one tells it which one to use based on that name and what extra properties to give to it based on the config (in our case, just a user and a password).

The Hard Part

Okay, so all of this was pretty easy to follow.  Unfortunately, if you go try to run this, it won’t work (note that you can run Presto in IntelliJ by following these instructions).

Anyway, your plugin won’t work because password plugins don’t do anything unless you have HTTPS enabled on the coordinator.   This is because presto developers don’t want you sending clear-text passwords; so the password plugin type just makes it flat out not work!

We put properties in our first config file for HTTPS already.  Now you need to follow the JKS instructions here: https://prestosql.io/docs/current/security/tls.html#server-java-keystore to generate a key store and put it at the path from our config file.

Note that if you’re using AWS/etc, the “first and last name” should be the DNS name of your server.  If you put the IP, it won’t work.

Once you’ve got that set up, you can start presto and connect via JDBC from the local host by filling out the parameters noted here (https://prestosql.io/docs/current/installation/jdbc.html).

SSL Use HTTPS for connections
SSLKeyStorePath The location of the Java KeyStore file that contains the certificate and private key to use for authentication.
SSLKeyStorePassword The password for the KeyStore.

SSL = true, key store path is the one from the config file earlier, and the password is also the one from the config file.

Now, if you connect via JDBC and use the user name and password from your config file, everything should work!  Note that you probably want to get a real certificate in the future or else you’ll have to copy your key store to each computer/node you use so JDBC works on each (which isn’t great).

UPDATE – 2019/08/04 – You should make clients use a trust store with the private key removed for numerous reasons.  See this newer post for how to take this JKS file and modify it for client use properly: https://coding-stream-of-consciousness.com/2019/08/04/presto-internal-tls-password-login-removing-private-key-from-jks-file/.

Custom HiveAuthorizationProvider Example to Block Column Access

Hive has a very good set of hooks which you can use to customize all kinds of things.  It also has other “pluggable” areas which are basically hooks, but that aren’t called as such.

Here is a great article to get you started on Hive Hooks -> http://dharmeshkakadia.github.io/hive-hook/.

Creating a HiveAuthorizationProvider

In this case we aren’t implementing a hook specifically, but we’re doing the same exact flow to create our own HiveAuthorizationProvider.  We’ll do a very simple example to just block access to any column named “description” (as a silly example).

package com.john.humphreys.hive;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.metastore.api.Database;
import org.apache.hadoop.hive.ql.metadata.AuthorizationException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.metadata.Partition;
import org.apache.hadoop.hive.ql.metadata.Table;
import org.apache.hadoop.hive.ql.security.authorization.HiveAuthorizationProviderBase;
import org.apache.hadoop.hive.ql.security.authorization.Privilege;

import java.util.List;

public class MyHiveAuthorizationProvider
        extends HiveAuthorizationProviderBase {
    @Override
    public void init(Configuration conf) throws HiveException {

    }

    @Override
    public void authorize(Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException {

    }

    @Override
    public void authorize(Database db, Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException {

    }

    @Override
    public void authorize(Table table, Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException {

    }

    @Override
    public void authorize(Partition part, Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException {

    }

    @Override
    public void authorize(Table table, Partition part, List<String> columns, Privilege[] readRequiredPriv, Privilege[] writeRequiredPriv) throws HiveException, AuthorizationException {
        if (columns.contains("description")) {
            throw new AuthorizationException("Not allowed to select description column!");
        }
    }
}

The only dependency required by this in maven is:

<dependency>
  <groupId>org.apache.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>2.3.5</version>
</dependency>

You literally don’t need any special build plugins or anything. If you build a project with just this (Java 1.8), and you take the JAR file, and you put it in your hive/lib folder, then you’re almost ready.

The last step is to modify your hive-site.xml and to add these 2 properties:

<property>
  <name>hive.security.authorization.enabled</name>
  <value>true</value>
</property>

<property>
  <name>hive.security.authorization.manager</name>
  <value>com.john.humphreys.hive.MyHiveAuthorizationProvider</value>
</property>

After that, restart your hiveserver2, and when you try to select the “description” column from any table with it, it will get rejected.

Example In Practice

If I have a table called sample_data and I have a description column in it, and I run this query:

select * from (
    select * from (
        select description from sample_data
        ) x
    ) y;

I get this result:

Query execution failed

Reason:
SQL Error [403] [42000]: Error while compiling statement: Not allowed to select description column!

So, we can see it worked properly.
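
One possible hardening of the column check, as a sketch only: it assumes the column list handed to authorize() could be null or contain mixed-case names, which I have not verified against Hive’s behavior.  The helper class and method names are hypothetical.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class BlockedColumns {
    /** Returns true if any requested column is in the lower-case blocked set. */
    static boolean containsBlocked(List<String> columns, Set<String> blockedLowerCase) {
        if (columns == null) {
            return false; // nothing requested at column level
        }
        for (String column : columns) {
            if (column != null
                    && blockedLowerCase.contains(column.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }
}
```

The provider’s authorize() method could call this helper instead of columns.contains("description") and throw the same AuthorizationException when it returns true.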

Limitations

Unfortunately, while this guards Hive, it surprisingly doesn’t guard Presto when it accesses data via the Hive metastore.  So, as I need to guard Hive and Presto, I need to understand why and see if there is some other option.