Add a New UDF SQL Function to Presto – Support unix_timestamp From Hive

Posted on February 29, 2020 by John Humphreys

Some Context

We deploy presto in-house and manage a fork of the code base so that we can introduce company-specific features for auditing, security, or integrations.

One of the main features we provided is support for hive views (which I now see is coming into presto-sql version 330 (but that’s not really relevant here).

https://prestosql.io/docs/current/release/release-330.html

Add experimental support for executing basic Hive views. To enable this feature, the hive.views-execution.enabled configuration property must be set to true. (#2715)

A recent feature request for our implementation was to add support for the unix_timestamp function so that this expression worked:

"unix_timestamp"(a.creation_datetime, 'yyyy-MM-dd')

The Code

There is some good information on making functions here: https://prestosql.github.io/docs.prestosql.io/current/develop/functions.html. But I mostly just referenced other DateTime functions in the DateTimeFunctions class within the presto code base (which also happens to be where I added the new unix_timestamp function implementation shown below.

    @ScalarFunction("unix_timestamp")
    @LiteralParameters({"x", "y"})
    @SqlType(StandardTypes.DOUBLE)
    public static double unixTimestamp(@SqlType("varchar(x)") Slice datetime, @SqlType("varchar(y)") Slice formatString)
    {
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat(formatString.toStringUtf8());
        simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parse = simpleDateFormat.parse(datetime.toStringUtf8());
            return parse.toInstant().toEpochMilli() / 1000d;
        }
        catch (ParseException e) {
            throw new PrestoException(INVALID_FUNCTION_ARGUMENT, "'" + datetime.toStringUtf8() +
                    " cannot be parsed by format string " + formatString.toStringUtf8() + ".");
        }
    }

This function takes a varchar date or column, and a varchar format string, and it parses out the date in UTC to the unix time (seconds since 1970). This should follow the hive behavior for this overload (hopefully!).

Now, you can call the function (for this overload) in presto-sql just like you can in hive. We still have to do more testing and we need to provide more overloads, but this is a working implementation, so I hope it helps you do the same.

Launch Spring-Boot JAR From Different Main Class

Posted on February 17, 2020 by John Humphreys

I found this very useful stack overflow:

https://stackoverflow.com/a/36552613/857994

It shows you how to start a spring-boot JAR from a different main class. It’s a quirky solution, but it worked great. Here’s a slightly more obvious/cleaner copy of the command (mostly for my own future reference):

java -cp presto-ws-3.2.2.jar -Dloader.main=com.company.PrestoQueryRunner org.springframework.boot.loader.PropertiesLauncher

AWS EC2 Varying Generation and Linear Cost Example

Posted on February 16, 2020 by John Humphreys

Due to increased query sizes on our presto clusters (causing aggregation failures), I’m in the middle of evaluating moving from 16 core 64GB RAM general purpose EC2 machines (m4.4xlarge) to 64 core 256GB RAM general purpose machines (a 4x increase in power/RAM).

Here is the list of m4 and m5 models for 16-core/64GB and 64-core/256GB specs. Below, we’ll see how they compare to each other and what the best option is.

Type	vCPU	ECU	Memory (GiB)	Instance Storage (GB)	Linux/UNIX Usage

m4.4xlarge

53.5

64 GiB

EBS Only

$0.80 per Hour

m4.16xlarge

188

256 GiB

EBS Only

$3.20 per Hour

m5.4xlarge

64 GiB

EBS Only

$0.768 per Hour

m5.16xlarge

256

256 GiB

EBS Only

$3.072 per Hour

For reference:

EC2 uses the EC2 Compute Unit (ECU) term to describe CPU resources for each instance size where one ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

There are a few good things to notice here:

For m4.4xlarge to m4.16xlare, we are getting 4x the resources for exactly 4x the cost ($.80 x 4 = $3.20). The one exception is we get less than 4x the ECU units (so technically less than 4x the processing power). So, compute roughly scales linearly within a model it seems.
Pretty much the exact same situation holds true for the m5 models; going from xlarge to 16xlarge is exactly a 4x increase in cost and resources except for ECUs which are a little less than 4x.
The m5 models have more ECUs than their m4 counterparts and they also cost less, so they are a better deal both performance and cost wise.

So, we’ll go with m5.16xlarge instances which cost $3.072 an hour. This comes out to $2,211 a month.

Kubernetes Ingress Service 60 Second Timeout

Posted on February 5, 2020 by John Humphreys

Kubernetes has multiple levels of timeouts for calls, including at the load balancer, inbound to the ingress itself, and at the individual ingress resources.

Assuming you have the first two configured and are still hitting a timeout on your app, this is the annotation you need to add to your service’s ingress resource to boost its timeout:

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"

This will change it up to 1 hour from 1 minute.

Coding Stream of Consciousness

by John Humphreys – Random code from my life.

Monthly Archives: February 2020

Add a New UDF SQL Function to Presto – Support unix_timestamp From Hive

Some Context

The Code

Launch Spring-Boot JAR From Different Main Class

AWS EC2 Varying Generation and Linear Cost Example

Kubernetes Ingress Service 60 Second Timeout