Databricks / Spark – Generate Parquet Sample Data

I frequently find myself needing to generate Parquet data for sample tests… e.g. when setting up a new Hive instance, or testing Apache Drill, Presto, etc. I always end up writing basically the same code in a different way because I never save it. So here it is!

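This code makes a 5000-row data frame with 3 columns: 2 integers and one string. It then names the columns and saves the data frame as Parquet. The coalesce(5) caps the output at 5 partitions, so the destination ends up with up to 5 part files (coalesce can reduce the partition count but never increase it).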

I have it set up to write to ADLS in Azure but you can change the path so it works with your HDFS or whatever.

NOTE: It will overwrite the previous results in the destination, so (1) don’t write over data you want to keep, and (2) if you need to tune this, just keep re-running it :).

//Create a mutable list buffer based on a loop.
import scala.collection.mutable.ListBuffer
val lb = ListBuffer[(Int, Int, String)]()
for (i <- 1 to 5000) {
  lb += ((i, i*i, "Number is " + i + "."))
}

//Convert it to a data frame.
import org.apache.spark.sql.SaveMode
import spark.implicits._
val df = lb.toDF("value", "square", "description")

//Write it out as parquet, replacing anything already at the destination.
df.coalesce(5).write.mode(SaveMode.Overwrite).parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
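
If your notebook is Python rather than Scala, the same rows can be built with a list comprehension. This is only a sketch: `sample_rows` is a name made up for illustration, and the PySpark lines in the trailing comment assume an active `spark` session and a destination path of your own.

```python
def sample_rows(n=5000):
    """Build (value, square, description) tuples, mirroring the Scala loop."""
    return [(i, i * i, "Number is " + str(i) + ".") for i in range(1, n + 1)]

rows = sample_rows()
print(len(rows), rows[0], rows[-1])

# Assumption: in a PySpark notebook with an active `spark` session, the rows
# could then be written the same way as the Scala version:
# df = spark.createDataFrame(rows, ["value", "square", "description"])
# df.coalesce(5).write.mode("overwrite").parquet("<your destination path>")
```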

Hive Server 2 – Required field ‘serverProtocolVersion’ is unset!

Issue Context and Error

I have been working to install HiveServer2 in order to work with Presto, among other things. I wanted to make sure Hive’s JDBC interface (on port 10000) was working well, since I need it so users can easily submit partition repair queries (msck repair table) and similar things. Unfortunately, when I went to connect over JDBC, I got this error (a small part of a huge stack trace):

Required field 'serverProtocolVersion' is unset!

The Solution

I think if you carefully read the full stack trace, you’ll see something about user impersonation… but I missed it. I actually figured it out by increasing the logging level when running HiveServer2. You can do that like this:

./hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10000 --hiveconf hive.root.logger=DEBUG,console

Once I did this, I clearly saw this error:

2019-06-06T13:53:13,183  WARN [HiveServer2-Handler-Pool: Thread-36] thrift.ThriftCLIService: Error opening session:
org.apache.hive.service.cli.HiveSQLException: Failed to open new session: java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.authorize.AuthorizationException): User: centos is not allowed to impersonate centos
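
If the DEBUG output is too noisy to read by eye, a few lines of Python can pull the impersonation message out of a captured log. This is a minimal sketch: the function name `find_impersonation_errors` and the regular expression are illustrative, not part of Hive, and the sample text is just the log excerpt shown above.

```python
import re

# Matches the Hadoop message "User: X is not allowed to impersonate Y".
IMPERSONATION = re.compile(r"User: (\S+) is not allowed to impersonate (\S+)")

def find_impersonation_errors(log_text):
    """Return (proxy_user, target_user) pairs found in the log text."""
    return IMPERSONATION.findall(log_text)

sample = (
    "2019-06-06T13:53:13,183  WARN [HiveServer2-Handler-Pool: Thread-36] "
    "thrift.ThriftCLIService: Error opening session:\n"
    "org.apache.hive.service.cli.HiveSQLException: Failed to open new session: "
    "java.lang.RuntimeException: org.apache.hadoop.ipc.RemoteException("
    "org.apache.hadoop.security.authorize.AuthorizationException): "
    "User: centos is not allowed to impersonate centos\n"
)
print(find_impersonation_errors(sample))
```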

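Googling that error quickly led me to this Stack Overflow answer: https://stackoverflow.com/a/50753233/857994. The proposed solution there is to turn off impersonation, so that queries run as the user running the HiveServer2 process, by adding this entry to your hive-site.xml: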

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value> 
</property>

After that, everything works great :).
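
If you would rather script the change than edit hive-site.xml by hand, something like the following could work. It is a sketch under assumptions: `set_hive_property` is a made-up helper, it expects the usual `<configuration>` root element, and the standard-library parser drops comments and the XML header, so back up the real file before writing anything over it.

```python
import xml.etree.ElementTree as ET

def set_hive_property(xml_text, name, value):
    """Return hive-site.xml text with the named <property> set to value."""
    root = ET.fromstring(xml_text)  # expects a <configuration> root element
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already exists: update its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not found: append a new <property> block.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")

original = "<configuration></configuration>"
updated = set_hive_property(original, "hive.server2.enable.doAs", "false")
print(updated)
```

Running it a second time on its own output updates the existing entry rather than adding a duplicate.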

PowerShell – Disable Windows 10 Service on PC Unlock

My Issue

Recently I was working on a computer with a policy that started a (clearly) malfunctioning program on start-up, which was taking 100% of my CPU more or less permanently.

I didn’t have the ability to change the policy, but I did have the ability to stop the service. Unfortunately, the policy regularly brings it back: the service gets set to manual start-up again and is started.

The Fix

In cases like this, on Windows 10, you can search “Task Scheduler” on your start menu.  Then create a new task (not a basic task) that runs with highest privileges.  You can tell it to trigger on PC unlock, and have it start a program.

For the program, you can give your path to PowerShell (e.g. C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe). Then you can give the parameters “-windowstyle hidden C:\dev\disable-service.ps1”. Honestly, the hidden window style isn’t working, but since it just runs on unlock it’s not a big deal; I’ll figure that out later.

The disable-service.ps1 script should have content like this:

# Stop the policy from starting the service again, then stop the running instance.
Set-Service -Name "Your Service Name" -StartupType Disabled
Stop-Service -Name "Your Service Name"

You can get the service name from the services utility in Windows. Note that the real service name needs to be used, and that is not necessarily the display name you see in the services list or in task manager. But if you open the service’s properties in the services utility, the real name is shown in the “Service name” field.

Presto Hive External Table TEXTFILE Limitations

Background

I was hoping to use Hive 2.x with just the Hive metastore and not the Hive server or Hadoop (map-reduce). Part of this plan was to be able to create tables within Presto, Facebook’s distributed query engine, which can operate over Hive in addition to many other things.

Initial Test Failure – CSV Delimited File

I’m still trying to decide if this is viable or not,  but my first test was not so great.

I created a new database and an external table pointing to a file in my AWS S3 bucket. The file was a simple comma-delimited (CSV) file.

Here was the “SQL” I used in Presto:

create schema testdb;

CREATE TABLE testdb.sample_data (
  x varchar(30), y varchar(30), sum varchar(30)
)
WITH (
  format = 'TEXTFILE',
  external_location = 's3a://uat-hive-warehouse/sample_data/'
);

use testdb;

select * from sample_data;

This all ran great, but unfortunately, my results looked like this:

x     | y    | sum
--------------------
1,2,3 | null | null
4,5,6 | null | null
...

The Problem

So, it turns out that Presto over hive, when targeting text files, does not support arbitrarily delimited files. In my case, I was pointing to a simple CSV file. So, it literally read all of the data into the first column and gave up.

It actually does treat the file as delimited, but it only supports the \001 delimiter (ASCII control code 1), which is unprintable.

You can still use Parquet, ORC, or whatever you want, but this makes CSV more or less useless from Presto’s perspective. You can create the table in Hive first and it will probably work, but Presto cannot be the creator in this case :(.
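
If you are stuck with TEXTFILE and need Presto to create the table, one possible workaround is to re-encode the file with the \001 delimiter before uploading it. A minimal sketch, assuming no field contains a newline or the \001 character itself; `csv_to_ctrl_a` is an illustrative name, and the result would still need to be uploaded to the table’s external location.

```python
import csv
import io

FIELD_SEP = "\x01"  # ASCII control code 1 (Ctrl-A), the delimiter described above

def csv_to_ctrl_a(csv_text):
    """Re-encode comma-delimited text as \\x01-delimited lines."""
    out_lines = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Refuse rows that the simple line/field layout could not represent.
        if any(FIELD_SEP in field or "\n" in field for field in row):
            raise ValueError("field contains the delimiter or a newline")
        out_lines.append(FIELD_SEP.join(row))
    return "".join(line + "\n" for line in out_lines)

converted = csv_to_ctrl_a("1,2,3\n4,5,6\n")
print(repr(converted))
```

Using the csv module rather than a plain split on commas means quoted fields such as "a,b" survive as a single column.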

Hive 3 Standalone Metastore + Presto

Hive 3.0 Standalone Metastore – Why?

Hive version 3.0 allows you to download a standalone metastore.  This is cool because it does not require you to deploy hadoop and/or run the rest of Hive’s fairly large deployment.  This makes a lot of sense because many tools that use hive for schema management do not actually care about Hive’s query engine.

For example, Presto is a clustered query engine in its own right; it has no interest in using Hadoop/map-reduce to execute a query on Hive data; it just wants to view and manage Hive’s metadata through its thrift metastore interface. Similarly, Apache Spark loves to work with Hive, but it only reads the table metadata from the metastore and then works directly against the underlying data files with its own engine. So, it also does not need Hive’s query engine.

Can/Should We Use It?

Unfortunately, Presto currently only supports Hive 2.x. From its own documentation: “The Hive connector supports Apache Hadoop 2.x and derivative distributions including Cloudera CDH 5 and Hortonworks Data Platform (HDP).”

If you read online though, you will find that it does seem to work… but with limited features. If you look at this presto-users thread for example: https://groups.google.com/forum/#!topic/presto-users/iAeEecsnS9I, you will see:

“We have tested Presto 0.203e with Hive 3.0 Metastore, and it works fine. We tested it by running TPC-DS queries, and Presto completed all 99 queries.”

But lower down, you will see:

“However, Presto is not able to read Hive managed (transactional tables) in Hive 3.x…”

“Yes, this is a known limitation.”

Unfortunately, transactional ACID v2 tables are the default for Hive 3.x.  So, basically all managed tables will not work in Hive 3.x even though external tables will work.  So, it might be okay to use it if you only do external tables… but in our case we let people use Spark however they like and they likely create many managed tables.  So, this rules out using Hive 3.0 with the standalone metastore for us.

I’m going to see if Hive 2.0 can be run without the hive server and hadoop next.

Side Note – SchemaTool

I would just like to add a side note: while I did manage to run the Hive Standalone Metastore without installing Hadoop, I did have to install (but not run) Hadoop in order to use the schematool provided with Hive to create the Hive RDBMS schema. This is due to library dependencies.

There is a “create on first run” config you can do instead of this as well but they don’t recommend using it in production; so just keep that in mind.

Useful Links