predicate.

Posted on May 13, 2020 by John Humphreys

We had a use case where we needed to find out which parquet files were touched by a query/predicate. This was so that we could rewrite certain files in a special way to remove specific records. In this case, presto was not mastering the data itself.

We found this awesome post on Stack Overflow (https://stackoverflow.com/a/44011639/857994), which shows this pseudo-column:

select "$path" from table

This correctly shows you the parquet file a row came from, which is awesome! I also found this MR which shows work has been merged to add $file_size and $file_modified_time properties which is even cooler.

So, newer versions of presto-sql have even more power here.
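For the original use case (finding which files a predicate touches), the pseudo-column can be combined with DISTINCT and a WHERE clause. The following is a minimal sketch over plain JDBC; the class name, method names, and the example table and predicate are all hypothetical, and it assumes you already have an open connection to Presto.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Presto: lists the files behind the rows a predicate matches.
final class TouchedFiles {
    private TouchedFiles() {}

    // Builds the query text. The table and predicate are concatenated as-is,
    // so only pass trusted strings here.
    static String distinctPathsQuery(String table, String predicate) {
        return "SELECT DISTINCT \"$path\" FROM " + table + " WHERE " + predicate;
    }

    // Runs the query on an already-open JDBC connection and collects one path per file.
    static List<String> touchedFiles(Connection connection, String table, String predicate)
            throws SQLException {
        List<String> paths = new ArrayList<>();
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(distinctPathsQuery(table, predicate))) {
            while (rows.next()) {
                paths.add(rows.getString(1));
            }
        }
        return paths;
    }
}
```

Calling TouchedFiles.touchedFiles(connection, "hive.web.events", "user_id = 42") would then return the distinct files holding matching rows, which is the list you would go on to rewrite.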

Presto Custom Password Authentication Plugin (Internal)

Posted on June 18, 2019 by John Humphreys

Presto Authentication (Out of the Box)

Out of the box, presto will not make you authenticate to run any queries. For example, you can just connect with JDBC from Java or DBeaver/etc and run whatever queries you want with any user name and no password.

When you want to enable a password, it has a few options out of the box:

LDAP (secured)
Kerberos
JWT Tokens (I don’t see this online much but it’s in their code).

So, unfortunately, if you just want to create some users/passwords and hand them out, or get users from an existing system or database, there really isn’t a way to do it here.

Custom Password Authenticator

Most things in Presto are implemented as plugins, including many of its own out-of-the-box features. This is true for the LDAP authenticator above. It actually is a “Password Authenticator” plugin implementation.

So, if we copy the LDAP plugin and modify it, we can actually make our own password plugin that lets us use any user-name and password/etc we want! Note that we’ll just make another “internal” plugin which means we have to recompile presto. I’ll try to make this an external plugin in another article.

Let’s Do It!

Note: we will be modifying presto code, and presto code only builds on Linux. I use Windows, so I do my work on an Ubuntu desktop in the cloud; but you can do whatever you like. If you have a Linux desktop, it should build very easily out of IntelliJ or externally with Maven.

1. Clone the presto source code.
2. Open it in IntelliJ or your editor of choice.
3. Find the “presto-password-authenticators” sub-project and navigate to com.facebook.presto.password.
4. Copy LdapAuthenticator.java to MyAuthenticator.java.
5. Copy LdapConfig.java to MyAuthenticatorConfig.java.
6. Copy LdapAuthenticatorFactory.java to MyAuthenticatorFactory.java.
7. For MyAuthenticatorConfig, make a password and a user private variable, both strings. Then make getters and setters for them similar to the LDAP URL ones, though you can take out the patterns/etc. You can remove everything else; our user name and password will be our only config.
8. For MyAuthenticatorFactory, change getName() to return “my-authenticator”. Also edit it so that it uses your new config class and so that it builds and returns your new authenticator from that config.
9. For MyAuthenticator, just make the private Principal authenticate(String user, String password) {} method throw AccessDeniedException if the user and password don’t match the ones from your config. You can get the ones from your config in the constructor.
10. Add a MyAuthenticatorFactory instance to the PasswordAuthenticationPlugin.java file under the LdapAuthenticatorFactory instance.
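To make the authenticate step more concrete, here is a rough, self-contained sketch of the comparison logic. It is not the real plugin: SketchAccessDeniedException and SketchPrincipal are stand-ins I made up so the example compiles without the Presto jars, and in the actual MyAuthenticator you would throw Presto's AccessDeniedException and take the user and password from MyAuthenticatorConfig instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.Principal;

// Stand-in for Presto's AccessDeniedException so the sketch compiles on its own.
final class SketchAccessDeniedException extends RuntimeException {
    SketchAccessDeniedException(String message) {
        super(message);
    }
}

// Minimal Principal that only carries the authenticated user name.
final class SketchPrincipal implements Principal {
    private final String name;

    SketchPrincipal(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }
}

// Hypothetical shape of the authenticator: one configured user and password.
final class MyAuthenticatorSketch {
    private final byte[] expectedUser;
    private final byte[] expectedPassword;

    // In the real plugin these two values would come from the config class.
    MyAuthenticatorSketch(String user, String password) {
        this.expectedUser = user.getBytes(StandardCharsets.UTF_8);
        this.expectedPassword = password.getBytes(StandardCharsets.UTF_8);
    }

    Principal authenticate(String user, String password) {
        if (user == null || password == null) {
            throw new SketchAccessDeniedException("Invalid credentials");
        }
        // MessageDigest.isEqual avoids stopping at the first mismatching byte.
        boolean userMatches =
                MessageDigest.isEqual(expectedUser, user.getBytes(StandardCharsets.UTF_8));
        boolean passwordMatches =
                MessageDigest.isEqual(expectedPassword, password.getBytes(StandardCharsets.UTF_8));
        if (!(userMatches && passwordMatches)) {
            throw new SketchAccessDeniedException("Invalid credentials");
        }
        return new SketchPrincipal(user);
    }
}
```

The same message is thrown for a bad user and a bad password so that a caller cannot tell which of the two was wrong.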

These are all the code changes you need for your new authenticator (assuming it works). But how do we use/test it?

The code runs from the presto-main project.
Go there and edit config.properties; add these properties:

http-server.authentication.type=PASSWORD
http-server.https.enabled=true
http-server.https.port=8443
http-server.https.keystore.path=/etc/presto-keystore/keystore.jks
http-server.https.keystore.key=somePassword

Then add a password-authenticator.properties in the same directory with these properties:

password-authenticator.name=my-authenticator
user=myuser
password=otherPassword

Note that the password-authenticator.name is equal to the string you used in the factory class in step #8.
The first config file tells it to use a password plugin, the second one tells it which one to use based on that name and what extra properties to give to it based on the config (in our case, just a user and a password).
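As a rough illustration of that split, the sketch below parses the second file's contents, pulls out the name used to pick the factory, and treats everything else as the plugin's own config. This mirrors my understanding of what the server does rather than its actual code; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper showing how password-authenticator.properties is split up.
final class AuthenticatorPropertiesSketch {
    static final String NAME_KEY = "password-authenticator.name";

    private AuthenticatorPropertiesSketch() {}

    // Parses the text of a .properties file.
    static Properties parse(String fileContents) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // The name must match what the factory's getName() returns.
    static String authenticatorName(Properties properties) {
        String name = properties.getProperty(NAME_KEY);
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException(NAME_KEY + " is missing");
        }
        return name;
    }

    // Everything except the name is handed to the chosen plugin as its config.
    static Properties pluginConfig(Properties properties) {
        Properties remaining = new Properties();
        remaining.putAll(properties);
        remaining.remove(NAME_KEY);
        return remaining;
    }
}
```

With the file shown above, authenticatorName gives my-authenticator and pluginConfig is left with just user and password.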

The Hard Part

Okay, so all of this was pretty easy to follow. Unfortunately, if you go try to run this, it won’t work (Note that you can run Presto in IntelliJ by following these instructions).

Anyway, your plugin won’t work because password plugins don’t do anything unless you have HTTPS enabled on the coordinator. This is because presto developers don’t want you sending clear-text passwords; so the password plugin type just makes it flat out not work!

We put properties in our first config file for HTTPS already. Now you need to follow the JKS instructions here: https://prestosql.io/docs/current/security/tls.html#server-java-keystore to generate a key store and put it at the path from our config file.

Note that if you’re using AWS/etc, the “first and last name” should be the DNS name of your server. If you put the IP, it won’t work.

Once you’ve got that set up, you can start presto and connect via JDBC from the local host by filling out the parameters noted here (https://prestosql.io/docs/current/installation/jdbc.html).

`SSL`	Use HTTPS for connections
`SSLKeyStorePath`	The location of the Java KeyStore file that contains the certificate and private key to use for authentication.
`SSLKeyStorePassword`	The password for the KeyStore.

SSL = true, the key store path is the one from the config file earlier (http-server.https.keystore.path), and the key store password is the one from the same file (http-server.https.keystore.key).

Now, if you connect via JDBC and use the user name and password from your config file, everything should work! Note that you probably want to get a real certificate in the future or else you’ll have to copy your key store to each computer/node you use so JDBC works on each (which isn’t great).
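Put together, the JDBC side might look something like the sketch below. The class and method names are mine, and it assumes the Presto JDBC driver is on the class path; the three SSL parameter names are the ones listed above, and user and password are the usual JDBC connection properties.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

// Hypothetical helper for connecting to a password-protected coordinator over HTTPS.
final class PrestoJdbcSketch {
    private PrestoJdbcSketch() {}

    // The host should be the DNS name the certificate was generated for.
    static String jdbcUrl(String host, int port) {
        return "jdbc:presto://" + host + ":" + port;
    }

    static Properties connectionProperties(
            String user, String password, String keyStorePath, String keyStorePassword) {
        Properties properties = new Properties();
        properties.setProperty("user", user);           // from password-authenticator.properties
        properties.setProperty("password", password);   // from password-authenticator.properties
        properties.setProperty("SSL", "true");
        properties.setProperty("SSLKeyStorePath", keyStorePath);
        properties.setProperty("SSLKeyStorePassword", keyStorePassword);
        return properties;
    }

    // Opens the connection; this only works with the Presto JDBC driver on the class path.
    static Connection connect(String host, int port, Properties properties) throws SQLException {
        return DriverManager.getConnection(jdbcUrl(host, port), properties);
    }
}
```

For the config in this post that would be connectionProperties("myuser", "otherPassword", "/etc/presto-keystore/keystore.jks", "somePassword") and port 8443, with the host being whatever DNS name you used for the certificate.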

UPDATE – 2019/08/04 – You should make clients use a trust store with the private key removed for numerous reasons. See this newer post for how to take this JKS file and modify it for client use properly: https://coding-stream-of-consciousness.com/2019/08/04/presto-internal-tls-password-login-removing-private-key-from-jks-file/.

HiveServer2 Not Starting JDBC Interface 10000, No Errors.

Posted on June 7, 2019 by John Humphreys

Very short post – my HiveServer2 process was running without errors after deployment, but it wasn’t really running.

Connecting via JDBC yielded errors saying the connection was being refused. Analyzing the server showed that the port was not open using:

sudo netstat -nlp | grep 10000

I enabled debug logs with the extra command line parameter:

--hiveconf hive.root.logger=DEBUG,console

And it still didn’t show much, except something about creating the scratch directories (but not an error).

After a while, I figured out that the scratch directories were set to be created at the root of the file system in a new directory which didn’t exist yet. The user running hive did not have permission to create a directory there.

So, I created the scratch directory and gave ownership to the hive user, and then everything came up and worked great on the next hiveserver2 service restart.

S3 Eventual Consistency

Posted on June 7, 2019 by John Humphreys

Consistent Distributed File Systems

Historically, I’ve used standard HDFS, MapR’s version of HDFS (MapR-FS), and ADLS (Azure’s data lake service). All of these behave very much like you would expect a local file system to. If you write files and another process lists files, it will immediately see them and be able to use them without issue.

Amazon s3 File System Issues

I was surprised when I started learning about Amazon s3 after using all of these prior file systems. I understand that s3 is an object store… similar to Azure Blob Storage. I also understand that it is the main data lake solution though.

Maybe it’s just because I’m new and am missing something… but there doesn’t seem to be any AWS version of ADLS.

The s3 storage service is eventually consistent. This means that if you run Spark, or similar tools on it, they will likely produce improper results or fail. This is because multiple tasks will write files in parallel and list them and they won’t necessarily get the fully up to date view of the storage. So, they may write 10 files, list them, and see 5 files, etc.
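To picture the "write 10, list 5" situation, here is a toy model of a store whose listings lag behind its writes. It is purely illustrative: the class, the tick-based clock, and the fixed delay are all made up for this sketch and say nothing about how s3 is actually implemented.

```java
import java.util.ArrayList;
import java.util.List;

// Toy object store: a key only shows up in list() after a fixed number of ticks.
final class LaggingListingStore {
    private final int listingDelay;                       // ticks before a write is listed
    private final List<String> keys = new ArrayList<>();
    private final List<Integer> writtenAt = new ArrayList<>();
    private int clock = 0;

    LaggingListingStore(int listingDelay) {
        this.listingDelay = listingDelay;
    }

    // The write itself succeeds immediately.
    void put(String key) {
        keys.add(key);
        writtenAt.add(clock);
    }

    // Advances the toy clock by one step.
    void tick() {
        clock++;
    }

    // Returns only the keys that are old enough to have become visible.
    List<String> list() {
        List<String> visible = new ArrayList<>();
        for (int i = 0; i < keys.size(); i++) {
            if (clock - writtenAt.get(i) >= listingDelay) {
                visible.add(keys.get(i));
            }
        }
        return visible;
    }
}
```

With a delay of 6, writing ten keys with one tick after each and then listing right away shows only the first five; a job that trusts that listing would silently skip half of its own output.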

I came across a very good article describing this in detail here: https://www.opendoor.com/w/blog/why-s3guard-with-s3-as-a-filesystem-spark.

The TLDR is that you have to use a consistency layer between your big data frameworks and s3 to ensure they function well. You can confirm this by reading the short hadoop documentation site here -> https://wiki.apache.org/hadoop/AmazonS3.

Note that the first article recommends S3Guard which works based on DynamoDB, but there may be other options (e.g. EMR will have a way of dealing with this).

Determine Compatibility of hadoop-aws and aws-java-sdk-bundle JARs

Posted on June 7, 2019 by John Humphreys

When you’re integrating hadoop and other big-data frameworks into AWS s3, you will quickly run into a situation where you need to include the hadoop-aws and aws-java-sdk-bundle JARs into your class path.

Unfortunately, these JARs are separately versioned and it is hard to figure out compatibility. The hadoop-aws JAR has to match your hadoop version exactly, so that one is fine.

Determining the Right Version

Check your hadoop version.
Get the hadoop-aws.jar with the same exact version.
Go to the maven central page for the correct version of the hadoop-aws.jar and look at its compile dependencies. E.g. at https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.9 you can see the SDK dependency is com.amazonaws » aws-java-sdk-bundle 1.11.199.

Coding Stream of Consciousness

by John Humphreys – Random code from my life.

Category Archives: big-data
