S3 Eventual Consistency

Consistent Distributed File Systems

Historically, I’ve used standard HDFS, MapR’s version of HDFS (MapR-FS), and ADLS (Azure’s data lake service).  All of these behave very much like you would expect a local file system to.  If you write files and another process lists files, it will immediately see them and be able to use them without issue.

Amazon s3 File System Issues

I was surprised when I started learning about Amazon s3 after using all of these prior file systems.  I understand that s3 is an object store… similar to Azure Blob Storage.  I also understand that it is the main data lake solution though.

Maybe it’s just because I’m new and am missing something… but there doesn’t seem to be any AWS version of ADLS.

The s3 storage service is eventually consistent.   This means that if you run Spark, or similar tools on it, they will likely produce improper results or fail.  This is because multiple tasks will write files in parallel and list them and they won’t necessarily get the fully up to date view of the storage.  So, they may write 10 files, list them, and see 5 files, etc.

I came across a very good article describing this in detail here: https://www.opendoor.com/w/blog/why-s3guard-with-s3-as-a-filesystem-spark.

The TLDR is that you have to use a consistency layer between your big data frameworks and s3 to ensure they function well.  You can confirm this by reading the short hadoop documentation site here -> https://wiki.apache.org/hadoop/AmazonS3.

Note that the first article recommends S3Guard which works based on DynamoDB, but there may be other options (e.g. EMR will have a way of dealing with this).

Determine Compatibility of hadoop-aws and aws-java-sdk-bundle JARs

When you’re integrating hadoop and other big-data frameworks into AWS s3, you will quickly run into a situation where you need to include the hadoop-aws and aws-java-sdk-bundle JARs into your class path.

Unfortunately, these JARs are separately versioned and it is hard to figure out compatibility.  The hadoop-aws JAR has to match your hadoop version exactly, so that one is fine.

Determining the Right Version

  1. Check your hadoop version.
  2. Get the hadoop-aws.jar with the same exact version.
  3. Go to the maven central page for the correct version of the hadoop-aws.jar and look at its compile dependencies.  E.g. at https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.9 you can see the SDK dependency is com.amazonaws » aws-java-sdk-bundle 1.11.199.

Hive Server 2 – Required field ‘serverProtocolVersion’ is unset!

Issue Context and Error

I have been working to install hive server 2 in order to work with Presto, among other things.  I wanted to ensure I had Hive’s JDBC interface (to port 10000) working well as I need it to enable users to easily submit partition repair queries (msck repair table) and similar things.  Unfortunately, when I went to connect over JDBC, I got this error (a small part of a huge stack trace):

Required field 'serverProtocolVersion' is unset!

The Solution

I think if you carefully read the full stack-trace, you’ll see something about user impersonation… but missed it. I actually figured it out by increasing the logging level when running hive server. You can do that like this:

./hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=DEBUG,console

Once I did this, I clearly saw this error:

2019-06-06T13:53:13,183  WARN [HiveServer2-Handler-Pool: Thread-36] thrift.ThriftCLIService: Error opening session:
org.apache.hive.service.cli.HiveSQLException: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: centos is not allowed to impersonate centos

Googling this quickly helped me to find this stack overflow: https://stackoverflow.com/a/50753233/857994. The proposed solution there is to add this entry to your hive-site.xml:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value> 
</property>

After that, everything works great :).

Hive 3 Standalone Metastore + Presto

Hive 3.0 Standalone Metastore – Why?

Hive version 3.0 allows you to download a standalone metastore.  This is cool because it does not require you to deploy hadoop and/or run the rest of Hive’s fairly large deployment.  This makes a lot of sense because many tools that use hive for schema management do not actually care about Hive’s query engine.

For example, Presto is a clustered query engine in its own right; it has no interest in using hadoop/map-reduce to execute a query on hive data; it just wants to view and manage hive’s metadata through its thrift metastore interface.  Similarly, Apache Spark loves to work with hive, but it actually goes directly to the underlying database for performance reasons and works against that.  So, it also does not need hive’s query engine.

Can/Should We Use It?

Unfortunately, Presto only currently supports Hive 2.X.  From it’s own documentation: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

If you read online though, you will find that it does seem to work… but with limited features.  If you look at this git entry for example: https://groups.google.com/forum/#!topic/presto-users/iAeEecsnS9I, you will see:

“We have tested Presto 0.203e with Hive 3.0 Metastore, and it works fine. We tested it by running TPC-DS queries, and Presto completed all 99 queries.”

But lower down, you will see:

However, Presto is not able to read Hive managed (transactional tables) in Hive 3.x…

Yes, this is a known limitation.

Unfortunately, transactional ACID v2 tables are the default for Hive 3.x.  So, basically all managed tables will not work in Hive 3.x even though external tables will work.  So, it might be okay to use it if you only do external tables… but in our case we let people use Spark however they like and they likely create many managed tables.  So, this rules out using Hive 3.0 with the standalone metastore for us.

I’m going to see if Hive 2.0 can be run without the hive server and hadoop next.

Site Note – SchemaTool

I would just like to make a side-note that while I did manage to run the Hive Standalone Metastore without installing hadoop, I did have to install (but not run) hadoop in order to use the schematool provided with hive for creating the hive RDMBS schema.  This is due to library dependencies.

There is a “create on first run” config you can do instead of this as well but they don’t recommend using it in production; so just keep that in mind.

Useful Links

Presto Coordinator High Availability (HA)

Quick Recap – What is Presto?

Presto is an open-source distributed SQL query engine made by Facebook that runs as its own cluster.  It is able to refer to an existing hive metastore and run queries on the hive tables in HDFS/etc itself using its own resources.  It is much faster as it does everything in-memory rather than by using map-reduce.  It can connect to numerous data sources aside from hive as well (though I have only used it with Hive over Azure’s ADLS personally).

No High Availability

We started using Presto in an enterprise use case, and I was astounded to find out that it doesn’t have any high-availability (HA) built into it.  Presto as a product is wonderful – it is fast, easy to set up, provides pretty solid query diagnostics, handles massive queries in a very stable manner, etc.  So, the complete lack of a HA solution seems very strange given the strength of the product.

The critical component in Presto is the Coordinator, and it is a single point of failure.  It is the brains of the operation; it parses queries, breaks them into tasks, controls where work gets scheduled, etc.  Users only talk to the Coordinator node.

Coordinators vs Workers

Despite the importance of a coordinator, the only real differences between a coordinator and  the rest of the worker nodes at a configuration level are that:

  1. Coordinators specify that they are a coordinator.
  2. Coordinators can run an embedded discovery server – all nodes in the cluster report to this discovery server (including the coordinator itself).  This discovery server can actually be run separately from the coordinator as well if desired; I think the provision of an embedded one is relatively new.  The discovery server is how the coordinator knows the full set of nodes it is managing.
  3. Coordinators can choose whether or not they themselves are used to process queries (as opposed to just managing them).

Again, coordinators take client connections (e.g. JDBC, ODBC, etc), and they take queries from those connections, parse them, validate them, break them into tasks, and schedule them across the pool of available workers.

Workers just report to the discovery server and handle the tasks they are allocated.

HA Options

There is surprisingly little to find online about making Presto HA.  The only two solutions that I’ve seen are:

  • Run multiple clusters behind a load balancer.
  • Run multiple coordinators and some form of proxy service to ensure only one is ever active at a time.

Both of these have challenges and/or drawbacks.

If you are running multiple clusters, you probably want them to be active/active so you don’t only use half of your nodes at a time.  Handling this properly requires that your proxy service issue a redirect to the target cluster’s coordinator so that the client (e.g. JDBC connection) can re-send the request to that and talk to it directly.  This will work, but you’ve still limited the maximum query size you can do as you split your nodes into 2 or more clusters, so they cannot all co-operate on very large queries.

Running multiple coordinators for HA is preferable as you get to combine all of your nodes into a single, large cluster that can attack large queries.  It is not trivial to do though as if two coordinators operate at once, they can degrade and even deadlock the cluster.  We’ll dig into how to run with multiple coordinators now.

Using Multiple Coordinators

If you want to set up Presto using multiple coordinators, here is the general approach:

  1. Set up 2 or 3 nodes as coordinators.
  2. Tell them to run their own discovery servers in their config.
  3. Tell them to point at localhost for their own discovery server – this is quite important.
  4. Tell them not to do work (It will keep things more stable, but unfortunately, that means that your cluster has less power.  You’ll probably have far more workers than coordinators though, so it shouldn’t be an issue).
  5. Install HA proxy on the coordinator nodes.  Have all the coordinator nodes registered in order and make all but the first one a “backup”.  So, for example, run HA Proxy port 8385 and run Presto on port 8321.  All traffic will go to node #1 unless its down, in which case it will go to node #2, and so on.
  6. Set up a load balancer in front of the coordinator nodes pointing at the HA proxy port and make sure traffic can get through.
  7. Set up all worker nodes to target the load balancer for the discovery server.  So, all workers target the load balancer, which goes to any coordinator, all of which redirect to the primary one.  The primary coordinator always has all workers reaching it courtesy of HA proxy.
  8. As each coordinator itself only reports to its localhost discovery server, coordinators will not end up talking to each other’s discovery servers and will not interfere with each other.  Only one coordinator will ever have workers registered with it at a time.

Let Coordinators Use the Load Balancer?

If you let coordinators use the load balancer, then they will all end up at the primary coordinator’s discovery server.  Now… I have seen people online saying that they ran all nodes as coordinators (e.g. in the linked Gooogle Group conversations below) in which case this must be happening.

When I tried it though, I clearly got this warning from all the coordinators (and probably the workers too, but I didn’t check).  It comes out once a second.

2018-12-29T01:38:01.479Z WARN http-worker-176 com.facebook.presto.execution.SqlTaskManager Switching coordinator affinity from awe4s to 9mdsu
2018-12-29T01:38:01.806Z WARN http-worker-175 com.facebook.presto.execution.SqlTaskManager Switching coordinator affinity from 9mdsu to awe4s

Someone in a Git Hub issue I forgot the link to stated that this means that the memory management may get muddled up, which sounds scary.  I did provide a link to the warning in code below which somewhat verifies this.

So, maybe it is a disaster, or maybe it’s harmless – but in any case, I didn’t want warning messages coming at me once a second that looked this bad.  So, I opted to only have each coordinator talk to its own discovery server, which makes them 100% idle (not processing anything) unless they are the current primary coordinator.  This waste is unfortunate, but as we’ll have far more workers than coordinators, it’s not the end of the world.

Drawbacks

This will keep your cluster running in the event that a coordinator fails.  Any active queries at the time a coordinator fails will fail though – we can’t do anything about that unless Presto starts supporting HA internally.  Also, the fail-over period will be very much tied to your HA proxy configuration and your load balancer health checks (mine takes around 30 seconds using an Azure load balancer and HA proxy, I’ll be looking to reduce that).

Useful Links