S3 Eventual Consistency

Consistent Distributed File Systems

Historically, I’ve used standard HDFS, MapR’s version of HDFS (MapR-FS), and ADLS (Azure’s data lake service).  All of these behave very much like you would expect a local file system to: if one process writes files and another process lists the directory, the second process immediately sees the new files and can use them without issue.

Amazon S3 File System Issues

I was surprised when I started learning about Amazon S3 after using all of those prior file systems.  I understand that S3 is an object store… similar to Azure Blob Storage.  But I also understand that it is AWS’s main data lake solution.

Maybe it’s just because I’m new and am missing something… but there doesn’t seem to be any AWS version of ADLS.

The S3 storage service is eventually consistent.   This means that if you run Spark or similar tools directly against it, they can produce incorrect results or fail outright.  Multiple tasks write files in parallel and then list the directory, and the listing won’t necessarily reflect the fully up-to-date view of the storage.  So a job may write 10 files, list the directory, and see only 5 of them, etc.
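To make that pattern concrete, here is a minimal Spark sketch of the write-then-list pattern that eventual consistency can bite (the s3a bucket and path are just placeholders, and the under-count is a “may happen”, not a guaranteed repro):

// Hypothetical example: write a small data set, then immediately read the same path back.
// Each write task creates part files under the prefix, and the subsequent read has to
// list that prefix.  With an eventually consistent listing, the read may see only some
// of the part files and silently return fewer rows than were written.
val df = spark.range(1000).toDF("id")

df.write
  .mode("overwrite")
  .parquet("s3a://some-bucket/consistency-demo/")   // placeholder bucket/path

val readBack = spark.read.parquet("s3a://some-bucket/consistency-demo/")
println(readBack.count())   // may under-count if the listing is stale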

I came across a very good article describing this in detail here: https://www.opendoor.com/w/blog/why-s3guard-with-s3-as-a-filesystem-spark.

The TL;DR is that you have to put a consistency layer between your big-data frameworks and S3 to ensure they function correctly.  You can confirm this by reading the short Hadoop wiki page here -> https://wiki.apache.org/hadoop/AmazonS3.

Note that the first article recommends S3Guard, which is backed by DynamoDB, but there may be other options (e.g. EMR will have a way of dealing with this).
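For reference, if you do go the S3Guard route, the wiring is just a few Hadoop configuration properties on the S3A connector.  Here is a minimal Scala sketch; the table name and region are placeholders, and I’d double-check the property names against your Hadoop version’s S3Guard documentation:

// Hypothetical sketch: point the S3A connector at a DynamoDB-backed metadata store (S3Guard).
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.metadatastore.impl",
  "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
hadoopConf.set("fs.s3a.s3guard.ddb.table", "my-s3guard-table")   // placeholder table name
hadoopConf.set("fs.s3a.s3guard.ddb.region", "us-east-1")         // placeholder region
hadoopConf.set("fs.s3a.s3guard.ddb.table.create", "true")        // create the table if missing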

Determine Compatibility of hadoop-aws and aws-java-sdk-bundle JARs

When you’re integrating Hadoop and other big-data frameworks with AWS S3, you will quickly run into situations where you need to include the hadoop-aws and aws-java-sdk-bundle JARs on your classpath.

Unfortunately, these JARs are versioned separately and it is hard to figure out which combinations are compatible.  The hadoop-aws JAR has to match your Hadoop version exactly, so that one is easy.

Determining the Right Version

  1. Check your Hadoop version.
  2. Get the hadoop-aws JAR with the exact same version.
  3. Go to the Maven Central page for that version of the hadoop-aws JAR and look at its compile dependencies.  E.g. at https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.9 you can see the SDK dependency is com.amazonaws » aws-java-sdk-bundle 1.11.199.  That aws-java-sdk-bundle version is the one to put on your classpath (see the sbt sketch below for a concrete example).
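As a concrete example, the matching pair ends up looking something like this in an sbt build.  This is just a sketch assuming Hadoop 2.9.0, with the SDK version taken from the Maven page referenced above; substitute whatever versions your own lookup produces:

// Hypothetical sbt snippet: hadoop-aws pinned to your exact Hadoop version, plus the
// aws-java-sdk-bundle version listed as its compile dependency on Maven Central.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"          % "2.9.0",     // must match your Hadoop version exactly
  "com.amazonaws"     % "aws-java-sdk-bundle" % "1.11.199"   // taken from hadoop-aws's dependency list
)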

Databricks / Spark – Generate Parquet Sample Data

I frequently find myself needing to generate Parquet data for sample tests… e.g. when setting up a new Hive instance, or testing Apache Drill, Presto, etc. I always end up rewriting basically the same code in a slightly different way because I never save it. So here it is!

This code builds a 5,000-row data frame with 3 columns: two integers and one string. It then gives the columns meaningful names and writes the result to a Parquet destination made up of 5 part files, thanks to the coalesce(5).

I have it set up to write to ADLS in Azure, but you can change the path so it works with HDFS or whatever storage you’re using.

NOTE: It will overwrite the previous results in the destination, so (1) don’t write over data you want to keep, and (2) if you need to tune this, just keep re-running it :).

//Imports needed for the data frame conversion and the save mode.
import scala.collection.mutable.ListBuffer
import org.apache.spark.sql.SaveMode
import spark.implicits._

//Build a mutable list buffer of (Int, Int, String) tuples in a loop.
val lb = ListBuffer[(Int, Int, String)]()
for (i <- 1 to 5000) {
  lb += ((i, i * i, "Number is " + i + "."))
}

//Convert it to a data frame with meaningful column names.
val df = lb.toDF("value", "square", "description")

//Coalesce to 5 partitions and overwrite whatever is already at the destination path.
df.coalesce(5)
  .write
  .mode(SaveMode.Overwrite)
  .parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
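If you want to sanity-check the output, a quick read-back is enough.  This is just a sketch using the same placeholder path as above:

//Read the sample data back and confirm the schema and row count look right.
val check = spark.read.parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
check.printSchema()
println(check.count())   // should print 5000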

Hive Server 2 – Required field ‘serverProtocolVersion’ is unset!

Issue Context and Error

I have been working on installing Hive Server 2 in order to work with Presto, among other things.  I wanted to ensure Hive’s JDBC interface (on port 10000) was working well, as I need it so users can easily submit partition repair queries (MSCK REPAIR TABLE) and similar things.  Unfortunately, when I went to connect over JDBC, I got this error (a small part of a huge stack trace):

Required field 'serverProtocolVersion' is unset!

The Solution

I think if you carefully read the full stack trace, you’ll see something about user impersonation… but I missed it. I actually figured it out by increasing the logging level when running the Hive server. You can do that like this:

./hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=DEBUG,console

Once I did this, I clearly saw this error:

2019-06-06T13:53:13,183  WARN [HiveServer2-Handler-Pool: Thread-36] thrift.ThriftCLIService: Error opening session:
org.apache.hive.service.cli.HiveSQLException: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: centos is not allowed to impersonate centos

Googling this quickly led me to this Stack Overflow answer: https://stackoverflow.com/a/50753233/857994. The proposed solution there is to add this entry to your hive-site.xml:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value> 
</property>

After that, everything works great :).
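If you want a quick way to confirm the fix from code rather than a SQL client, a small JDBC smoke test is enough.  This is just a sketch: it assumes the Hive JDBC driver (org.apache.hive:hive-jdbc) is on the classpath and that HiveServer2 is listening on localhost:10000 with no authentication; the “centos” user is the one from the log above.

import java.sql.DriverManager

//Register the Hive JDBC driver (usually auto-registered, but harmless to be explicit).
Class.forName("org.apache.hive.jdbc.HiveDriver")

//Hypothetical smoke test: open a session and run a trivial query against HiveServer2.
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "centos", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SHOW DATABASES")
while (rs.next()) {
  println(rs.getString(1))
}
rs.close()
stmt.close()
conn.close()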

PowerShell – Disable Windows 10 Service on PC Unlock

My Issue

Recently I was working on a computer with a policy that started a (clearly) malfunctioning program on start-up, which essentially kept my CPU pinned at 100%.

I didn’t have the ability to change the policy, but I did have the ability to stop the service.  Unfortunately, the service regularly got switched back to enabled and started again.

The Fix

In cases like this on Windows 10, you can search for “Task Scheduler” in your start menu.  Then create a new task (not a basic task) that runs with highest privileges.  You can tell it to trigger on PC unlock and have it start a program.

For the program, give the path to PowerShell (e.g. C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe).  Then give it the arguments “-windowstyle hidden C:\dev\disable-service.ps1”.  Honestly, the hidden window style isn’t working for me, but since it only runs on unlock it’s not a big deal; I’ll figure that out later.

The disable-service.ps1 script should have content like this:

# Prevent the service from starting automatically again, then stop the running instance.
Set-Service "Your Service Name" -StartupType Disabled
Stop-Service "Your Service Name"

You can get the full service name from the Services utility in Windows. Note that the real service name needs to be used, and it may not match what is displayed in the Services utility or Task Manager. But if you click on the service in the Services utility, you can find the real name easily.