Add a New UDF SQL Function to Presto – Support unix_timestamp From Hive

Some Context

We deploy presto in-house and manage a fork of the code base so that we can introduce company-specific features for auditing, security, or integrations.

One of the main features we provided is support for hive views (which I now see is coming into presto-sql version 330, but that’s not really relevant here).

https://prestosql.io/docs/current/release/release-330.html

Add experimental support for executing basic Hive views. To enable this feature, the hive.views-execution.enabled configuration property must be set to true. (#2715)

A recent feature request for our implementation was to add support for the unix_timestamp function so that this expression worked:

"unix_timestamp"(a.creation_datetime, 'yyyy-MM-dd')

The Code

There is some good information on writing functions here: https://prestosql.github.io/docs.prestosql.io/current/develop/functions.html.  But I mostly just referenced other date/time functions in the DateTimeFunctions class within the presto code base (which also happens to be where I added the new unix_timestamp function implementation shown below).

    @ScalarFunction("unix_timestamp")
    @LiteralParameters({"x", "y"})
    @SqlType(StandardTypes.DOUBLE)
    public static double unixTimestamp(@SqlType("varchar(x)") Slice datetime, @SqlType("varchar(y)") Slice formatString)
    {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat(formatString.toStringUtf8());
        simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parse = simpleDateFormat.parse(datetime.toStringUtf8());
            return parse.toInstant().toEpochMilli() / 1000d;
        }
        catch (ParseException e) {
            throw new PrestoException(INVALID_FUNCTION_ARGUMENT, "'" + datetime.toStringUtf8() +
                    "' cannot be parsed by format string '" + formatString.toStringUtf8() + "'.");
        }
    }

This function takes a varchar date (a literal or a column) and a varchar format string, parses the date as UTC, and returns the unix time (seconds since 1970-01-01). This should follow the hive behavior for this overload (hopefully!).
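If you want to try the parsing logic without building Presto, here is a minimal standalone sketch of the same idea. The class name UnixTimestampSketch is made up for illustration, it works on plain Strings instead of Slices, and it throws IllegalArgumentException where the real function throws PrestoException.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.TimeZone;

// Hypothetical standalone class; not part of the Presto code base.
final class UnixTimestampSketch
{
    private UnixTimestampSketch() {}

    // Same parsing steps as the UDF above, but on plain Strings so it runs outside Presto.
    static double unixTimestamp(String datetime, String formatString)
    {
        // SimpleDateFormat is not thread-safe, so a new instance is created per call.
        SimpleDateFormat format = new SimpleDateFormat(formatString);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            // getTime() is epoch milliseconds; divide to get (fractional) epoch seconds.
            return format.parse(datetime).getTime() / 1000d;
        }
        catch (ParseException e) {
            throw new IllegalArgumentException(
                    "'" + datetime + "' cannot be parsed by format string '" + formatString + "'", e);
        }
    }

    public static void main(String[] args)
    {
        // Midnight UTC on 2020-01-01 is 1577836800 seconds after the epoch.
        System.out.println(unixTimestamp("2020-01-01", "yyyy-MM-dd"));
    }
}
```

Because the time zone is pinned to UTC, the result does not depend on the machine the code runs on.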

Now, you can call the function (for this overload) in presto-sql just like you can in hive. We still have to do more testing and we need to provide more overloads, but this is a working implementation, so I hope it helps you do the same.

Presto + Hive View Storage/Implementation

I’ve been learning about how presto handles views lately.  This is because we are heavily reliant on presto and we recently ran into multiple use cases where our hive metastore had views which wouldn’t work within presto.

What are Presto Views Exactly?

Presto has its own view implementation which is distinct from hive’s view implementation.  Presto will not use a hive view, and if you try to query one, you will get a clear error immediately.

A presto view is based on a Presto SQL query string.  A hive view is based on a hive query string.  A hive query string is written in HQL (Hive Query Language), and presto simply does not know that SQL dialect.

How Are Views Stored?

Presto actually stores its views in the exact same location as hive does.  The hive metastore database has a TBLS table which holds every hive table and view.  Views have two columns populated that tables ignore – view_original_text and view_expanded_text.  Hive views will have plain SQL in the view_original_text column whereas presto views will have an encoded representation prefixed with “/* Presto View…”.  If presto queries a view and does not find its “/* Presto View” prefix, it will consider it a hive view and say that it is not supported.
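To make the check concrete, here is a minimal sketch of telling the two kinds of view apart from the stored text. The class and method names are hypothetical, and the prefix constant is just the marker described above; the exact string presto writes may carry more characters, so treat it as an assumption.

```java
// Hypothetical helper; not the actual presto implementation.
final class ViewKindSketch
{
    // Marker as described in the text; assumed, not copied from the presto source.
    static final String PRESTO_VIEW_PREFIX = "/* Presto View";

    private ViewKindSketch() {}

    // viewOriginalText is the value of the view_original_text column for a view.
    static boolean isPrestoView(String viewOriginalText)
    {
        return viewOriginalText != null && viewOriginalText.startsWith(PRESTO_VIEW_PREFIX);
    }

    public static void main(String[] args)
    {
        // Placeholder payload; a real presto view stores an encoded definition here.
        System.out.println(isPrestoView("/* Presto View: <encoded definition> */"));
        // A hive view stores plain SQL, so the prefix is missing.
        System.out.println(isPrestoView("SELECT id, name FROM customers"));
    }
}
```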

Making Presto Handle Hive Views

I’ve been doing work for some time to try to make presto-sql support hive views.  I’m using the pull request noted in this issue https://github.com/prestodb/presto/issues/7338 as a template.  It is fairly old though and was made against presto-db rather than presto-sql, so the exercise has turned out to be non-trivial.

I’m still chugging along and will post more when done.  But one thing to note is that this PR does not really make presto support hive views.  It actually allows presto to attempt to run hive views as they are.  Many hive views will be valid presto SQL – e.g. where you’re just selecting from a table with some basic joins and/or where clause filters.

So, this PR basically prevents presto from outright failing when it sees a view that does not start with “/* Presto View”.  It then reads the hive query text, possibly modifies it lightly, and attempts to run it as if it were a Presto query.

I plan on doing a number of corrections to the SQL as well; e.g. replacing double quotes, converting back-ticks, replacing obvious function names like NVL with COALESCE, etc.  Eventually I may try to fix more by parsing the hive text with ANTLR or something similar to make as many hive views run by default as possible.  But it will never be a complete solution.  A complete solution would be very hard as it would require a full understanding and conversion of hive language to presto language (which is probably not even possible given some of their differences).
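As a rough idea of what such corrections could look like, here is a deliberately naive sketch. It is not what the PR does; the class and method names are made up, and plain character and regex replacement will mangle anything tricky (for example, a double-quoted string that contains a single quote or a back-tick).

```java
import java.util.regex.Pattern;

// Hypothetical, naive rewriter; a real solution would need a proper parser.
final class HiveSqlRewriteSketch
{
    // Matches NVL followed by an opening parenthesis, in any letter case.
    private static final Pattern NVL_CALL = Pattern.compile("\\bNVL\\s*\\(", Pattern.CASE_INSENSITIVE);

    private HiveSqlRewriteSketch() {}

    static String toPrestoSql(String hiveSql)
    {
        // 1. Hive accepts "text" as a string literal; presto expects 'text'.
        String sql = hiveSql.replace('"', '\'');
        // 2. Hive quotes identifiers with back-ticks; presto uses double quotes.
        sql = sql.replace('`', '"');
        // 3. NVL(a, b) behaves like COALESCE(a, b).
        return NVL_CALL.matcher(sql).replaceAll("COALESCE(");
    }

    public static void main(String[] args)
    {
        // prints: SELECT COALESCE("name", 'n/a') FROM "customers"
        System.out.println(toPrestoSql("SELECT NVL(`name`, \"n/a\") FROM `customers`"));
    }
}
```

The order of the first two steps matters: double quotes have to be turned into single quotes before back-ticks are turned into double quotes, otherwise the new identifier quotes would be rewritten again.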

Hive Metastore DB (HMS DB) – Get All Tables/Columns including Partition Keys

How Hive Stores Schemas

I recently had to sync the schemas from a hive database to a normal MySQL database for reasons I won’t bother going into.  The exercise required me to get all columns from all tables in hive for each DB though, and I found that this was not amazingly straight-forward.

The hive metastore DB is a normal MySQL/etc database with a hive schema in it.  That schema holds metadata about the hive tables rather than the tables themselves, so the database’s own information schema tells you nothing about hive.  To get the hive table details, you have to interrogate the TBLS table.  To get columns, you need to interrogate COLUMNS_V2, and to get the databases themselves, you look toward the DBS table.

Missing Columns

Other posts I’ve seen would leave you here, but I found when I joined these three tables together (plus some others you have to route through), I was still missing some columns in certain tables.  It turns out that partition columns are implicit in hive.  So, in a file system of hive data (like HDFS), a partition column in a table is literally represented by just having the directory named with the partition value; the value is not stored as a column in the data files.

What this means is that partition columns don’t show up in these normal tables.  You have to look to a separate partition keys table to find them with a separate query.

The Working Query

The query below finds all columns of any kind and sorts them in the order they’ll appear when you select from a table in hive/presto/etc.  I hope it helps you!

SELECT db_name, table_name, column_name FROM (
    -- Regular columns, in table-definition order.
    SELECT d.NAME AS db_name, t.TBL_NAME AS table_name, c.COLUMN_NAME AS column_name, c.INTEGER_IDX AS idx
    FROM DBS d
    JOIN TBLS t ON t.DB_ID = d.DB_ID
    JOIN SDS s ON t.SD_ID = s.SD_ID
    JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID
    WHERE d.NAME = :dbName
    UNION
    -- Partition keys live in their own table; the 50000 offset sorts them after the regular columns.
    SELECT d.NAME AS db_name, t.TBL_NAME AS table_name, k.PKEY_NAME AS column_name, 50000 + k.INTEGER_IDX AS idx
    FROM DBS d
    JOIN TBLS t ON t.DB_ID = d.DB_ID
    JOIN PARTITION_KEYS k ON t.TBL_ID = k.TBL_ID
    WHERE d.NAME = :dbName
) x
ORDER BY db_name, table_name, idx
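The only trick in the query is the 50000 offset, which pushes partition keys behind the regular columns when sorting. Here is a small in-memory sketch of that ordering, with no database involved. The class, the Row type, and the sample column names are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of the sort order produced by the query.
final class ColumnOrderSketch
{
    // Same constant as in the query; assumed to exceed any table's regular column count.
    static final int PARTITION_KEY_OFFSET = 50000;

    // One row of the UNION result: a column name plus the idx used for sorting.
    static final class Row
    {
        final String columnName;
        final int idx;

        Row(String columnName, int idx)
        {
            this.columnName = columnName;
            this.idx = idx;
        }

        int idx()
        {
            return idx;
        }
    }

    private ColumnOrderSketch() {}

    static Row regularColumn(String name, int integerIdx)
    {
        return new Row(name, integerIdx);
    }

    static Row partitionKey(String name, int integerIdx)
    {
        return new Row(name, PARTITION_KEY_OFFSET + integerIdx);
    }

    // Sorts a copy of the rows by idx and returns just the column names.
    static List<String> orderedNames(List<Row> rows)
    {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt(Row::idx));
        List<String> names = new ArrayList<>();
        for (Row row : sorted) {
            names.add(row.columnName);
        }
        return names;
    }

    public static void main(String[] args)
    {
        // Rows arrive in arbitrary order, as they might from a UNION.
        List<Row> rows = Arrays.asList(
                partitionKey("dt", 0),
                regularColumn("amount", 1),
                regularColumn("id", 0));
        // prints [id, amount, dt]
        System.out.println(orderedNames(rows));
    }
}
```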

Creating Java Key Stores (JKS), Setting Validity Periods, and Analyzing JKS for Expiry Dates

What is a JKS File?

Taken from: https://en.wikipedia.org/wiki/Java_KeyStore

A Java KeyStore (JKS) is a repository of security certificates – either authorization certificates or public key certificates – plus corresponding private keys, used for instance in SSL encryption.

Basically, applications like Presto use JKS files to enable them to do Transport Layer Security (TLS).

How Do You Create One?

As an example, Presto, a common big-data query tool, uses JKS for both secure internal and external communication.  At this link https://prestosql.io/docs/current/security/internal-communication.html they show you how to create one.

Here is an excerpt.  Note that you must replace *.example.com with the domain that will host the service using the JKS file, or it will not work.  Certificates are domain specific.

keytool -genkeypair -alias example.com -keyalg RSA \
    -keystore keystore.jks
Enter keystore password:
Re-enter new password:
What is your first and last name?
  [Unknown]:  *.example.com
What is the name of your organizational unit?
  [Unknown]:
What is the name of your organization?
  [Unknown]:
What is the name of your City or Locality?
  [Unknown]:
What is the name of your State or Province?
  [Unknown]:
What is the two-letter country code for this unit?
  [Unknown]:
Is CN=*.example.com, OU=Unknown, O=Unknown, L=Unknown, ST=Unknown, C=Unknown correct?
  [no]:  yes

Enter key password for <example.com>
        (RETURN if same as keystore password):

Pitfall! – 90 Day Expiry Date – Change It?

Now… here’s a fun pitfall.  Certificates are made with an expiry date and have to be reissued periodically (for security reasons).  The default validity period for a certificate generated with keytool is 90 days.

https://docs.oracle.com/javase/tutorial/security/toolsign/step3.html

This certificate will be valid for 90 days, the default validity period if you don’t specify a -validity option.

This is fine on a big managed service with lots of attention.  But if you’re just TLS securing an internal app not many people see, you will probably forget to rotate it or neglect to set up appropriate automation.  Then, when it expires, things will break.

Now… for security reasons you generally shouldn’t set too high a value for certificate expiry time.  But for example purposes, here is how you would set it to 10 years.

keytool -genkeypair -alias example.com -keyalg RSA \
    -keystore keystore.jks -validity 3650

Determine Expiry Date

If you have a JKS file that you are using for your application and you are not sure when it expires, here’s a command that you can use:

keytool -list -v -keystore keystore.jks

This will output different things based on where your key store came from.  E.g. you will probably see more interesting output from a real SSL cert than you will from a self-signed one created like we did above.

In any case, you will clearly see a line in the large output from this command that says something like this:

Valid from: Fri Jul 12 01:27:21 UTC 2019 until: Thu Oct 10 01:27:21 UTC 2019

Note that in this case, the certificate is valid for only 3 months (the 90-day default).  Be careful when looking at this output because you may find multiple expiry dates in the output for different components of the JKS file.  You need to make sure you read the right one.  Though, chances are that the one on your domain will be the one that expires earliest anyway.
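If you would rather check expiry from code (say, in a small monitoring job), the standard java.security.KeyStore API can read the same dates. This is only a sketch: the class and method names are made up, it assumes the file is a JKS-type store, and it takes the file path and password as command-line arguments.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.KeyStore;
import java.security.KeyStoreException;
import java.security.cert.Certificate;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for listing certificate expiry dates in a key store.
final class KeystoreExpirySketch
{
    private KeystoreExpirySketch() {}

    // Maps each alias to the "until" (notAfter) instant of its certificate.
    static Map<String, Instant> expiryByAlias(KeyStore keyStore)
    {
        Map<String, Instant> result = new LinkedHashMap<>();
        try {
            for (String alias : Collections.list(keyStore.aliases())) {
                Certificate certificate = keyStore.getCertificate(alias);
                if (certificate instanceof X509Certificate) {
                    result.put(alias, ((X509Certificate) certificate).getNotAfter().toInstant());
                }
            }
        }
        catch (KeyStoreException e) {
            throw new IllegalStateException("key store has not been loaded", e);
        }
        return result;
    }

    // Whole days from now until expiry; negative once the certificate has expired.
    static long daysUntil(Instant expiry, Instant now)
    {
        return Duration.between(now, expiry).toDays();
    }

    public static void main(String[] args) throws Exception
    {
        if (args.length < 2) {
            System.out.println("usage: KeystoreExpirySketch <keystore.jks> <password>");
            return;
        }
        KeyStore keyStore = KeyStore.getInstance("JKS");
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            keyStore.load(in, args[1].toCharArray());
        }
        Instant now = Instant.now();
        expiryByAlias(keyStore).forEach((alias, expiry) ->
                System.out.println(alias + " expires " + expiry + " (" + daysUntil(expiry, now) + " days left)"));
    }
}
```

For the dates in the sample output above, daysUntil gives 90, which matches the keytool default.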

 

Building Presto Admin

Presto Admin – Is it Worth Using?

I generally deploy Presto clusters using packer and terraform.  Packer builds an image for me with the target presto distribution, some utility scripts, HAProxy (to get multiple coordinators acting in HA), etc.

I kept noticing this presto-admin project though: https://github.com/prestosql/presto-admin.  It allows you to quickly/easily deploy clusters from a central node, and it will handle the coordinator, workers, catalogs, and everything.  That sounds pretty cool.

Advance disclaimer – after I built this, I decided not to use it.  This was because it seems to just deploy a single coordinator and worker set.  For an HA setup, I need multiple coordinators, a load balancer, and workers/users pointing at the load balancer.  So, it’s just not the right fit for me.

Presto Admin – Build

In any case, I did go through the motions of building this, because I could not find a prebuilt release to download.  Fortunately, it’s pretty easy on CentOS 7.x (basically RHEL 7.x):

# Download and unzip, then move into the extracted source folder.
wget https://github.com/prestosql/presto-admin/archive/2.7.tar.gz
tar -xvzf 2.7.tar.gz
cd presto-admin-2.7

# Install pip/etc.
sudo yum install epel-release
sudo yum install python-pip
sudo yum install python-wheel

# Run make file and build web installer.
make dist

After this, just go into the dist folder and find prestoadmin-2.7-online.tar.gz.

I hope this saves you some time; I wasted around 20 minutes trying to find the dist online for download (which I never did).