Presto / Hive find parquet files touched/referenced by a query/predicate.

We had a use case where we needed to find out which parquet files were touched by a query/predicate.  This was so that we could rewrite certain files in a special way to remove specific records.   In this case, presto was not mastering the data itself.

We found this awesome post -> https://stackoverflow.com/a/44011639/857994 on stack overflow which shows this pseudo-column:

select "$path" from table
This correctly shows you the parquet file a row came from, which is awesome!  I also found this  MR  which shows work has been merged to add $file_size  and $file_modified_time properties which is even cooler.
So, newer versions of presto-sql have even more power here.

 

Is Presto Slow When Returning Millions of Records / Big Result Set?

What Was Our Problem?

We were having issues with people reporting that Presto was slow when they were exporting hundreds of millions of records from much larger tables.  The queries were simple where clause filters selecting a few fields from some hundred-billion record tables.

Was It Actually Slow?

No! At least, not when parallelized well and tested properly.  I wrote a Java app to

  • Split a query into N =100 parts by using where clauses with a modulus on an integer column.
  • Query presto in parallel with 30 threads, going through the 100 queries.
  • Output results to standard out.

In an external bash script, I also grepped the results just to show some statistics I outputted.

This was slow!

Why Was it Slow?

First of all, let’s talk about the presto cluster setup:

  • 1 coordinator.
  • 15 workers.
  • All m5.8xlarge = 128GB RAM / 32 processor cores.

This is pretty decent.  So, what could our bottlenecks  be?

  1. Reading from s3.
  2. Processing results in workers.
  3. Slow coordinator due to having 15 workers talk through it.
  4. Slow consumer (our client running the queries).

To rule out 1/2/3 respectively I:

  1. Did a count(*) query which would force a scan over all relevant s3 data.  It came back pretty fast (in 15 seconds or so).
  2. Added more workers.  Having more workers  had minimal effect on the final timings, so we’re not worker bound.
  3. Switched the coordinator to a very large, compute-optimized node type.  This had minimal effect on the timings as well.

So, the problem appears to be with the client!

Why Was the Client Slow?

Our client really wasn’t doing a lot.  It was running 30 parallel queries and outputting the results, which were being grepped.  It was a similarly sized node to our presto coordinator, and it had plenty of CPU, RAM, decent network and disks (EBS).

It turned out though that once we stopped doing the grep and once we stopped writing the results to stdout, and we just held counters/statistics on the results we read, it went from ~25 minutes to ~2 minutes.

If we had run this in Spark or some other engine with good parallel behavior, we would have seen the workload distribute better over more nodes with sufficient ability to parallel process their portions of the records.  But, since we were running on a single node, with all results, the threads/CPU and other resoruces we were using capped out and could not go any faster.

Note: we did not see the client server as having high utilization, but some threads were at 100%.  So, the client app likely had a bottleneck we could avoid if we improved it.

Summary

So… next time you think presto can’t handle returning large numbers of results from the coordinator, take some time to evaluate your testing methodology.  Presto isn’t designed to route hundreds of millions of results, but it does it quite well in our experience.

 

Presto Custom Password Authentication Plugin (Internal)

Presto Authentication (Out of the Box)

Out of the box, presto will not make you authenticate to run any queries.  For example, you can just connect with JDBC from Java or DBeaver/etc and run whatever queries you want with any user name and no password.

When you want to enable a password, it has a few options out of the box:

So, unfortunately, if you just want to create some users/passwords and hand them out, or get users from an existing system or database, there really isn’t a way to do it here.

Custom Password Authenticator

Most things in Presto are implemented as plugins, including many of its own out-of-the-box features.  This is true for the LDAP authenticator above.  It actually is a “Password Authenticator” plugin implementation.

So, if we copy the LDAP plugin and modify it, we can actually make our own password plugin that lets us use any user-name and password/etc we want!  Note that we’ll just make another “internal” plugin which means we have to recompile presto.  I’ll try to make this an external plugin in another article.

Let’s Do It!

Note: we will be modifying presto code, and presto code only builds on Linux.  I use windows, so I do my work on an Ubuntu desktop in the cloud; but you can do whatever you like.  If you have a Linux desktop, it should build very easily out of Intellij or externally with Maven.

  1. Clone the presto source code.
  2. Open it in Intellij or your editor of choice.
  3. Find the “presto-password-authenticators” sub-project and navigate to com.facebook.presto.password.
  4. Copy LdapAuthenticator.java to MyAuthenticator.java.
  5. Copy LdapConfig to MyAuthenticatorConfig.java.
  6. Copy LdapAuthenticatorFactory to MyAuthenticatorFactory.java.
  7. For MyAutnenticatorConfig, make a password and a user private variable, both strings.  Then make getters and setters for them similar to the LDAP URL ones; though you can take out the patterns/etc.  You can remove everything else; our user name and password will be our only config.
  8. For MyAuthenticatorFactory, change getName() to return “my-authenticator”.  Also edit it so that it uses your new config class and so that it returns your new authenticator in config.
  9. For MyAuthenticator, just make the private Principal authenticate(String user, String password) {} method throw AccessDeniedException if the user and password don’t match the ones from your config.  You can get the ones from your config in the constructor.
  10. Add MyAuthenticator to the PasswordAuthenticationPlugin.java file under the LdapAuthenticatorFactory instance.

This is all the code changes you need for your new authenticator (assuming it works).  But how do we use/test it?

  • The code runs from the presto-main project.
  • Go there and edit config.properties; add these properties:
http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto-keystore/keystore.jks
http-server.https.keystore.key=somePassword
  • Then add a password-authenticator.properties in the same directory with these properties:
password-authenticator.name=my-authenticator
user=myuser
password=otherPassword
  • Note that the password-authenticator.name is equal to the string you used in the factory class in step #8.
  • The first config file tells it to use a password plugin, the second one tells it which one to use based on that name and what extra properties to give to it based on the config (in our case, just a user and a password).

The Hard Part

Okay, so all of this was pretty easy to follow.  Unfortunately, if you go try to run this, it won’t work (Note that you can run the presto in IntelliJ by following these instructions).

Anyway, your plugin won’t work because password plugins don’t do anything unless you have HTTPS enabled on the coordinator.   This is because presto developers don’t want you sending clear-text passwords; so the password plugin type just makes it flat out not work!

We put properties in our first config file for HTTPS already.  Now you need to follow the JKS instructions here: https://prestosql.io/docs/current/security/tls.html#server-java-keystore to generate a key store and put it at the path from our config file.

Note that if you’re using AWS/etc, the “first and last name” should be the DNS name of your server.  If you put the IP, it won’t work.

Once you’ve got that set up, you can start presto and connect via JDBC from the local host by filling out the parameters noted here (https://prestosql.io/docs/current/installation/jdbc.html).

SSL Use HTTPS for connections
SSLKeyStorePath The location of the Java KeyStore file that contains the certificate and private key to use for authentication.
SSLKeyStorePassword The password for the KeyStore.

SSL = true, key store path is the one from the config file earlier, and the password is also the one from the config file.

Now, if you connect via JDBC and use the user name and password from your config file, everything should work!  Note that you probably want to get a real certificate in the future or else you’ll have to copy yours key store to each computer/node you use so JDBC works on each (which isn’t great).

UPDATE – 2019/08/04 – YOU should make clients use a trust store with the private key removed for numerous reasons.  See this newer post for how to take this JKS file and modify it for client use properly: https://coding-stream-of-consciousness.com/2019/08/04/presto-internal-tls-password-login-removing-private-key-from-jks-file/.

Hive + Presto + Ranger Version Hell

My Use Case

I was trying to test out Apache Ranger in order to give Presto column-level security over hive data.  Presto itself doesn’t seem to support Ranger yet, though some github entries suggest it will soon.  Ranger can integrate with hive though so that when presto queries hive, the security can work fine (apparently).

Conflicting Versions

I started off by deploying a version of Hive I’ve worked with before; 2.3.5, the latest 2.x version (I avoided 3.x).  After that, I deployed Presto .220, also the latest version.

This was all working great, so I moved on to Ranger.  This is when I found out that the Ranger docs specifically say that it only works with Hive version 1.2.0:

Apache Ranger version 0.5.x is compatible with only the component versions mentioned below

HIVE 1.2.0 https://hive.apache.org/downloads.html

That came from this link: https://cwiki.apache.org/confluence/display/RANGER/Apache+Ranger+0.5.0+Installation.

Alternative Options

I have a fairly stringent need for the security Ranger provides.  So, I was willing to use a 1.x version of hive, depending on what the feature loss was.  After all, quite a few big providers seem to use 1.x.

Unfortunately, the next thing I noticed was that Presto says: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

That is coming from its latest documentation: https://prestodb.github.io/docs/current/connector/hive.html.

I’m not particularly excited to start digging through old versions of Presto as well.

Next Steps

I’m going to try to stick with Hive 2.x for now and a modern version of Presto.  So, my options are:

  1. Research Ranger more and see if it can actually work with Hive 2.x.  Various vendors seem to use Ranger and Hive/Presto together; so I’m curious to see how.  Maybe the documentation on Ranger is just out of date (I know, being hopeful).
  2. Look at Ranger alternatives like Apache Sentry and see if they support Hive 2.x.  Apparently Ranger is beating out Sentry in features, usage, and future support… so I’m not excited about using Sentry.  But if it works, I can always migrate back to Ranger once its support grows for either Hive or Presto.

Update

I starting digging in from JIRA and mailing lists and found that Ranger appears to have had work done on it as early as 2017 for supporting hive 2.3.2.  Here’s a link.  https://issues.apache.org/jira/browse/RANGER-1927.

So, I’m going to give installing ranger a shot on 2.3.5 and see if it works.  If not, I’ll try with 2.3.2 and/or seek community help.  Hopefully I’ll come back and update this afterward with some good news :).

HiveServer2 Not Starting JDBC Interface 10000, No Errors.

Very short post – my HiveServer2 process was running without errors after deployment, but it wasn’t really running.

Connecting via JDBC yielded errors saying the connection was being refused.  Analyzing the server showed that the port was not open using:

sudo netstat -nlp | grep 10000

I enabled debug logs with the extra command line parameter:

--hiveconf hive.root.logger=DEBUG,console

And it still didn’t show much, except something about creating the scratch directories (but not an error).

After a while, I figured out that the scratch directories were set to be created at the root of the file system in a new directory which didn’t exist yet. The user running hive did not have these permissions.

So, I created the scratch directory and gave ownership to the hive user, and then everything came up and worked great on the next hiveserver2 service restart.