Background
I was hoping to use Hive 2.x with just the Hive metastore, without the Hive server or Hadoop (MapReduce). Part of this plan was to be able to create tables from within Presto, Facebook's distributed query engine, which can operate over Hive in addition to many other data sources.
Initial Test Failure – CSV Delimited File
I’m still trying to decide if this is viable or not, but my first test was not so great.
I created a new database and an external table pointing to a file in my AWS S3 bucket. The file was a simple comma-delimited CSV file.
Here was the “SQL” I used in Presto:
create schema testdb;

CREATE TABLE testdb.sample_data (
  x varchar(30),
  y varchar(30),
  sum varchar(30)
)
WITH (
  format = 'TEXTFILE',
  external_location = 's3a://uat-hive-warehouse/sample_data/'
);

use testdb;

select * from sample_data;
This all ran great, but unfortunately, my results looked like this:
 x     | y    | sum
-------+------+------
 1,2,3 | null | null
 4,5,6 | null | null
 ...
The Problem
So, it turns out that Presto over Hive, when targeting text files, does not support arbitrarily delimited files. In my case, I was pointing at a simple CSV file, so it read each entire row into the first column and gave up.
It does view the file as delimited, but it only supports the \001 delimiter (ASCII control code 1), which is unprintable.
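To illustrate (this is just a sketch of what the underlying data would have to look like, not something I actually tested): if the file on S3 had been written with the \001 control character between fields, shown below as ^A, the exact same CREATE TABLE above would have split each row into three columns:

1^A2^A3
4^A5^A6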
You can still use Parquet, ORC, or whatever you want, but this makes CSV all but useless from Presto's perspective. You can create the table in Hive first, with the delimiter declared there, and it probably will work; Presto just cannot be the creator in this case :(.
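For completeness, here is a sketch of that Hive-first workaround. I have not verified it end to end against this exact bucket; it assumes you can reach the same metastore through the Hive CLI or beeline, and it mirrors the table from the Presto example above. Unlike Presto's CREATE TABLE, Hive DDL lets you declare the field delimiter for a text table:

-- Run in Hive (CLI or beeline), not Presto.
CREATE DATABASE IF NOT EXISTS testdb;

-- Declare the comma delimiter explicitly; Presto offers no equivalent option here.
CREATE EXTERNAL TABLE testdb.sample_data (
  x STRING,
  y STRING,
  `sum` STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3a://uat-hive-warehouse/sample_data/';

Because the delimiter is recorded in the metastore as part of the table's SerDe properties, Presto should then pick it up when you run select * from testdb.sample_data.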