Validate/Check Parquet File Schema From PC/Laptop

Checking a Parquet Schema

Generally, when I have had to check the schema of a parquet file in the past, I have checked it within Apache Spark, or by using https://github.com/apache/parquet-mr/tree/master/parquet-tools.

Today, I had to check a parquet file schema and I came across this nifty python utility though: https://github.com/chhantyal/parquet-cli.  I think it’s just a wrapper around pyarrow, but it is slick and works easily.

You can pip install it trivially and then use it to view the data and schema of a parquet file with ease.  Here’s an example of installing it and checking a schema:

$ pip install parquet-cli
...

$ parq part-00000-679c332c.c000.snappy.parquet --schema

# Schema
ID: BYTE_ARRAY String
OrderID: BYTE_ARRAY String
SaleID: BYTE_ARRAY String
OrderDate: BYTE_ARRAY String
Pack: BYTE_ARRAY String
Qnty: BYTE_ARRAY String
Ratio: BYTE_ARRAY String
Name: BYTE_ARRAY String
Org: BYTE_ARRAY String
Category: BYTE_ARRAY String
Type: BYTE_ARRAY String
Percentage: BYTE_ARRAY String

Needless to say, this is much easier than dealing with Spark or parquet tools for schema validation or checks of not-too-huge parquet file data.

Add a New UDF SQL Function to Presto – Support unix_timestamp From Hive

Some Context

We deploy presto in-house and manage a fork of the code base so that we can introduce company-specific features for auditing, security, or integrations.

One of the main features we provided is support for hive views (which I now see is coming into presto-sql version 330 (but that’s not really relevant here).

https://prestosql.io/docs/current/release/release-330.html

Add experimental support for executing basic Hive views. To enable this feature, the hive.views-execution.enabled configuration property must be set to true. (#2715)

A recent feature request for our implementation was to add support for the unix_timestamp function so that this expression worked:

"unix_timestamp"(a.creation_datetime, 'yyyy-MM-dd')

The Code

There is some good information on making functions here: https://prestosql.github.io/docs.prestosql.io/current/develop/functions.html.  But I mostly just referenced other DateTime functions in the DateTimeFunctions class within the presto code base (which also happens to be where I added the new unix_timestamp function implementation shown below.

    @ScalarFunction("unix_timestamp")
    @LiteralParameters({"x", "y"})
    @SqlType(StandardTypes.DOUBLE)
    public static double unixTimestamp(@SqlType("varchar(x)") Slice datetime, @SqlType("varchar(y)") Slice formatString)
    {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat(formatString.toStringUtf8());
        simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parse = simpleDateFormat.parse(datetime.toStringUtf8());
            return parse.toInstant().toEpochMilli() / 1000d;
        }
        catch (ParseException e) {
            throw new PrestoException(INVALID_FUNCTION_ARGUMENT, "'" + datetime.toStringUtf8() +
                    " cannot be parsed by format string " + formatString.toStringUtf8() + ".");
        }
    }

This function takes a varchar date or column, and a varchar format string, and it parses out the date in UTC to the unix time (seconds since 1970). This should follow the hive behavior for this overload (hopefully!).

Now, you can call the function (for this overload) in presto-sql just like you can in hive. We still have to do more testing and we need to provide more overloads, but this is a working implementation, so I hope it helps you do the same.

Launch Spring-Boot JAR From Different Main Class

I found this very useful stack overflow:

https://stackoverflow.com/a/36552613/857994

It shows you how to start a spring-boot JAR from a different main class.  It’s a quirky solution, but it worked great.  Here’s a slightly more obvious/cleaner copy of the command (mostly for my own future reference):

java -cp presto-ws-3.2.2.jar -Dloader.main=com.company.PrestoQueryRunner org.springframework.boot.loader.PropertiesLauncher

AWS EC2 Varying Generation and Linear Cost Example

Due to increased query sizes on our presto clusters (causing aggregation failures), I’m in the middle of evaluating moving from 16 core 64GB RAM general purpose EC2 machines (m4.4xlarge) to 64 core 256GB RAM general purpose machines  (a 4x increase in power/RAM).

Here is the list of m4 and m5 models for 16-core/64GB and 64-core/256GB specs.  Below, we’ll see how they compare to each other and what the best option is.

Type vCPU ECU Memory (GiB) Instance Storage (GB) Linux/UNIX Usage
m4.4xlarge 16 53.5 64 GiB EBS Only $0.80 per Hour
m4.16xlarge 64 188 256 GiB EBS Only $3.20 per Hour
m5.4xlarge 16 70 64 GiB EBS Only $0.768 per Hour
m5.16xlarge 64 256 256 GiB EBS Only $3.072 per Hour

For reference:

EC2 uses the EC2 Compute Unit (ECU) term to describe CPU resources for each instance size where one ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

There are a few good things to notice here:

  • For m4.4xlarge to m4.16xlare, we are getting 4x the resources for exactly  4x the cost  ($.80 x 4 = $3.20).  The one exception is we get less than 4x the ECU  units (so technically less than 4x the processing power).  So, compute roughly scales linearly within a model it seems.
  • Pretty much the exact same situation holds true for the m5 models; going from xlarge to 16xlarge is exactly a 4x increase in cost and resources except for ECUs which are a little less than 4x.
  • The m5 models have more ECUs than their m4 counterparts and they  also cost less, so they are a better deal both  performance and cost wise.

So, we’ll go with  m5.16xlarge instances which cost $3.072 an hour.  This comes out to $2,211 a month.

Kubernetes Ingress Service 60 Second Timeout

Kubernetes has multiple levels of timeouts for calls, including at the load balancer, inbound to the ingress itself, and at the individual ingress resources.

Assuming you have the first two configured and are still hitting a timeout on your  app, this is the annotation you need to add to your service’s ingress resource to boost its timeout:

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"

This will change it up to 1 hour from 1 minute.