Java Regex Capture/Extract Multiple Values

Use Case

When you're parsing complex log lines or extracting data from messy strings, regular-expression capture groups are one of the most useful tools available.

This example is taken from work, where I had to parse and analyze logs from a job that loaded data into a database. A log sample looks like this:

/data/SXF_SX_4906_2019-04-13.01.43.24.143.log:2019-04-13 01:43:28,320 INFO com.x.dc.db.schemagen.batch.listener.JobResultListener [tx.id=IF-TX-ID-a23c195c-673a-47ab-ab0c-7b8591821169] [main] Inside sendEmailNotification method: subject is prod alert:DB copy job STARTED for the dataset:4906

The Code

The relevant part of the code is here:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final String capturePattern =
    "^/.*/SXF_SX_(\\d+)_(\\d{4}-\\d{2}-\\d{2}.\\d{2}.\\d{2}.\\d{2}.\\d{3})\\.log:(.*) INFO.*" +
    "copy job (.*) for the dataset:.*";

//Leaving out the rest of the class; this is just the regex-parsing portion.
//isValid, fullLogEntry, dataSetId, fileTimestamp, logTimestamp, and status are
//member variables of the class this constructor belongs to.
public DbLoadLog(String line) {

    isValid = true;

    Pattern r = Pattern.compile(capturePattern);
    Matcher m = r.matcher(line);

    //If you wanted to run over a multi-line-string/file, you could put
    //m.find() in a while loop and keep going; but I'm just analyzing specific lines.
    if (m.find()) {
        fullLogEntry = line;
        dataSetId = Integer.valueOf(m.group(1));
        fileTimestamp = m.group(2);
        logTimestamp = m.group(3);
        status = m.group(4);
    }
    else {
        isValid = false;
    }
}
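To make this concrete, here is a minimal, self-contained sketch of the same capture running against the sample line from above. The class name is mine, and I've compiled the pattern once as a constant (Pattern instances are thread-safe, and compiling is relatively expensive):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DbLoadLogDemo {
    // Same pattern as above, compiled once up front.
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^/.*/SXF_SX_(\\d+)_(\\d{4}-\\d{2}-\\d{2}.\\d{2}.\\d{2}.\\d{2}.\\d{3})\\.log:(.*) INFO.*" +
        "copy job (.*) for the dataset:.*");

    public static void main(String[] args) {
        String line = "/data/SXF_SX_4906_2019-04-13.01.43.24.143.log:"
            + "2019-04-13 01:43:28,320 INFO com.x.dc.db.schemagen.batch.listener.JobResultListener "
            + "[tx.id=IF-TX-ID-a23c195c-673a-47ab-ab0c-7b8591821169] [main] "
            + "Inside sendEmailNotification method: subject is prod alert:"
            + "DB copy job STARTED for the dataset:4906";

        Matcher m = LOG_PATTERN.matcher(line);
        if (m.find()) {
            System.out.println("dataSetId     = " + m.group(1)); // 4906
            System.out.println("fileTimestamp = " + m.group(2)); // 2019-04-13.01.43.24.143
            System.out.println("logTimestamp  = " + m.group(3)); // 2019-04-13 01:43:28,320
            System.out.println("status        = " + m.group(4)); // STARTED
        }
    }
}
```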

 

Java – Regular/Scheduled Task, One Run at a Time

This will be a very short post, and I’m mostly writing it just so it sticks in my head better.

Common Use Cases

There are many times when you may find that you need to regularly run a task in Java.  Here are a few common examples:

  • You have a cache you need to refresh every X minutes to power a dashboard or something similar.
  • You need to prune old files from a file system once an hour.
  • You need to regularly update stats counters for monitoring.

Coding Options

There are a lot of ways to do this, but the recommended approach is to use a scheduled executor.  Now… this part is easy to remember, but what is sometimes hard to remember is that you have two options when scheduling a task.  I often find myself picking the wrong one as it pops up in IntelliSense and I forget there are two options.

  1. Run the task every X seconds/minutes/etc no matter what.
  2. Run the task every X seconds/minutes/etc *after* the previous task completed.

These two things can be very different.  If you have a task that only takes a couple of seconds, it probably doesn't matter much.  But if the task takes 2 minutes and you schedule it every 1 minute, then with option 1 each new run starts as soon as the previous one finishes (a scheduled executor never runs overlapping executions of the same task; late runs just queue up), so the task runs back-to-back with no gap.  With option 2 you always get a full minute of buffer between the end of one run and the start of the next.

For both options, you can create the scheduled executor service the same way:

ScheduledExecutorService se = Executors.newSingleThreadScheduledExecutor();

But for option #1 (run every interval regardless of previous tasks), you would use this function:

se.scheduleAtFixedRate(this::refreshCache, 10, 120, TimeUnit.SECONDS);

And for option #2 (start counting after previous task completes), you would use this function.

se.scheduleWithFixedDelay(this::refreshCache, 10, 120, TimeUnit.SECONDS);
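To see the difference end to end, here is a small self-contained sketch (the class and method names are mine, not from any real code): a task that takes about 100 ms, scheduled every 50 ms at a fixed rate. The runs end up back-to-back; with scheduleWithFixedDelay there would instead be a 50 ms gap after each run.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SchedulerDemo {

    // Schedules a ~100 ms task every 50 ms at a fixed rate, waits windowMs,
    // then shuts down and returns how many runs started. Even at a fixed
    // rate, a ScheduledExecutorService never overlaps runs of the same task;
    // late runs simply start as soon as the previous one finishes.
    static int runFixedRateDemo(long windowMs) throws InterruptedException {
        ScheduledExecutorService se = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();
        se.scheduleAtFixedRate(() -> {
            runs.incrementAndGet();
            try {
                Thread.sleep(100); // the "slow" task body
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(windowMs);
        se.shutdownNow();
        se.awaitTermination(1, TimeUnit.SECONDS);
        return runs.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Swapping in scheduleWithFixedDelay would give noticeably fewer
        // runs in the same window (100 ms of work + 50 ms of delay each).
        System.out.println("fixed-rate runs in ~550 ms: " + runFixedRateDemo(550));
    }
}
```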

 

Does Spring JdbcTemplate Close Connections? … Not Always.

Common Advice – Correct?

Decent developers usually know that they have to try/catch/finally to ensure they clean up connections, file handles, and any number of other things.  But then, for Java, you hear “just use JdbcTemplate! It does all this boilerplate for you!”.

Uncommon Scenario

Normally, when you’re writing an average app, you want lots of queries to be able to run in parallel, efficiently, using the same user and password.  In this case, you can easily just use a connection pool and “not worry about it”.  Spring’s JdbcTemplate will grab connections from your data source, and the pool manages them appropriately.  You don’t have to worry about whether they are opened, closed, or whatever.

I ran into a scenario today where that was not true, though.  I have an app where each user connects to each back-end data source using their own personal account, which is managed by the application itself.  So, each user needs his or her own connection, and pooling would not make much sense unless each user had to do parallel operations (which they don’t).

What Happens to the Connections?

So, here’s the fun part.  I had, for the longest time, assumed that JdbcTemplate would clean up connections in addition to result sets.  In fact, you’ll see this claimed online a lot.  But be careful!  This does not appear to be the case, or if it is, it is at least data-source dependent… and that actually makes sense if you think about their purpose.

Here is how I verified this. I created a JdbcTemplate based on a new data source each time (which is needed, since the user and password change):

private NamedParameterJdbcTemplate getJdbcTemplate(String email, String password) {
    SimpleDriverDataSource ds = new SimpleDriverDataSource();
    ds.setDriverClass(HiveDriver.class);
    ds.setUrl(url);
    ds.setUsername(email);
    ds.setPassword(password);
    return new NamedParameterJdbcTemplate(ds);
}

Then I used the template for a number of queries in a normal manner (like this):

getJdbcTemplate(email, password)
    .queryForList("describe extended `mytable`.`mytable`",
        new MapSqlParameterSource());

Then I took a heap dump of the process with this command (run it from the JDK’s bin folder; on Windows that’s under Program Files, on Linux wherever the JDK is installed, with minor changes to the command):

jmap.exe -F -dump:format=b,file=C:\temp\dump.bin your-pid

You can get the PID easily by looking at your running process from JVisualVM (which is also in the bin directory).

Once the dump is complete, load the file into JVisualVM (you need to pick the third file-type option in the open dialog for the dump file to show up).

Finally, go to the classes tab, go to the very bottom of the screen, and search for the class of interest (in my case, HiveConnection). I could see as many instances as I had run queries, since each query made a new connection from a new data source. They are definitely not being cleaned up.

This surprised me because, even though creating a new template/data source each time is not normal, I expected the connections to be cleaned up when they were garbage collected or as part of normal operations.  After thinking about it more, I realize operations in my case would not be “normal”, but the lack of cleanup once everything is out of scope is still definitely a surprise to me.
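Since the template wasn't doing the cleanup for me here, the close has to be explicit. Here is a minimal runnable sketch of the try-with-resources pattern that guarantees it; the `UserConnection` class is a hypothetical stand-in for the real java.sql.Connection so the example runs without a database:

```java
import java.util.ArrayList;
import java.util.List;

public class PerUserQueryDemo {
    // Hypothetical stand-in for the per-user connection (the real code
    // would hold a java.sql.Connection from the per-user data source).
    static class UserConnection implements AutoCloseable {
        static final List<String> events = new ArrayList<>();
        UserConnection(String email) { events.add("open:" + email); }
        String query(String sql) { return "rows for: " + sql; }
        @Override public void close() { events.add("close"); }
    }

    // try-with-resources closes the connection even if query() throws,
    // which is exactly the cleanup the template was not doing for me.
    static String describeTable(String email, String sql) {
        try (UserConnection con = new UserConnection(email)) {
            return con.query(sql);
        }
    }

    public static void main(String[] args) {
        System.out.println(describeTable("user@example.com",
            "describe extended `mytable`.`mytable`"));
        System.out.println(UserConnection.events); // [open:user@example.com, close]
    }
}
```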

 

Extending an LVM Volume (e.g. /opt) in CentOS 7

What Does LVM Mean?

Taking a description from The Geek Diary:

The Logical Volume Manager (LVM) introduces an extra layer between the physical disks and the file system allowing file systems to:

  • Be resized and moved easily and online without requiring a system-wide outage.
  • Use discontinuous space on disk.
  • Have meaningful names to volumes, rather than the usual cryptic device names.
  • Span multiple physical disks.

Extending an LVM Volume (e.g. /opt):

Run “vgs” to display information on the available volume groups. This will tell you if you have “free” space that you can allocate to one of the existing logical volumes. In our case, we have 30 GB free.

$> vgs
  VG     #PV #LV #SN Attr   VSize   VFree
  rootvg   1   7   0 wz--n- <63.00g <30.00g

Run “lvs” to display the logical volumes on your system and their sizes. Find the one you want to extend.

$> lvs
  LV     VG     Attr       LSize
  homelv rootvg -wi-ao----  1.00g
  optlv  rootvg -wi-ao----  2.00g
  rootlv rootvg -wi-ao----  8.00g
  swaplv rootvg -wi-ao----  2.00g
  tmplv  rootvg -wi-ao----  2.00g
  usrlv  rootvg -wi-ao---- 10.00g
  varlv  rootvg -wi-ao----  8.00g

Extend the logical volume using “lvextend”. In our case, I’m moving /opt from 2g to 5g.

$> lvextend -L 5g rootvg/optlv

Display the logical volumes again if you like; “lvs” will now show 5.00g for optlv. But “df -h” will still report the old size, because the file system itself has not been resized yet.

Use df -hT to show what kind of file system the volume you resized uses. This determines which command you need next.

$> df -hT
Filesystem                Type      ...
/dev/mapper/rootvg-rootlv ext4      ...
devtmpfs                  devtmpfs  ...
tmpfs                     tmpfs     ...
tmpfs                     tmpfs     ...
tmpfs                     tmpfs     ...
/dev/mapper/rootvg-usrlv  ext4      ...
/dev/sda1                 ext4      ...
/dev/mapper/rootvg-optlv  ext4      ...
/dev/mapper/rootvg-tmplv  ext4      ...
/dev/mapper/rootvg-varlv  ext4      ...
/dev/mapper/rootvg-homelv ext4      ...
/dev/sdb1                 ext4      ...
tmpfs                     tmpfs     ...

If it is ext4, you can use the following command to grow the file system into the extended volume. If it is a different file system, you will have to find the appropriate command for it (for XFS, that is xfs_growfs).

$> resize2fs /dev/mapper/rootvg-optlv

Now “df -h” should show the extended size as well, and you’re done!

$> df -h
Filesystem                 Size  Used Avail Use% Mounted on
/dev/mapper/rootvg-rootlv  7.8G   76M  7.3G   2% /
devtmpfs                   3.9G     0  3.9G   0% /dev
tmpfs                      3.9G  4.0K  3.9G   1% /dev/shm
tmpfs                      3.9G  130M  3.8G   4% /run
tmpfs                      3.9G     0  3.9G   0% /sys/fs/cgroup
/dev/mapper/rootvg-usrlv   9.8G  2.6G  6.7G  29% /usr
/dev/sda1                  976M  119M  790M  14% /boot
/dev/mapper/rootvg-optlv   4.9G  1.9G  2.9G  40% /opt
/dev/mapper/rootvg-tmplv   2.0G   11M  1.8G   1% /tmp
/dev/mapper/rootvg-varlv   7.8G  3.2G  4.2G  44% /var
/dev/mapper/rootvg-homelv  976M   49M  861M   6% /home
/dev/sdb1                   16G   45M   15G   1% /mnt/resource
tmpfs                      797M     0  797M   0% /run/user/1000

Apache HTTPD Proxy Add CORS Headers to Use Remote API

I just suffered quite a lot while trying to use an API someone provided me to hack up a dashboard.  This was because it was missing the headers needed for CORS to work from a web app.

What is CORS?

Suffice it to say that CORS is a way to allow your web app to request resources from a server other than the one it was served from.

While we’re not doing Java here, the Spring documentation explains the same-origin policy, which stops JavaScript from requesting resources from a different origin than the page came from.  James Chamber’s blog explains the CORS concepts in terms of a guest list and makes them very comprehensible.

Make My Web-App Work!

Chances are, if you’re here, you’re currently writing a web-app (maybe Angular or React?), and you are trying to get data from a URL, and it is failing with the word CORS in your dev-tools console.

This is because the “Access-Control-Allow-Origin” header is not being added by the server.  So, your client app is not allowed to use it.  In terms of the blog referred to earlier, you’re not on the guest list.

If you’re in a company, the correct fix for this is to have the API add the header with the proper origins, or a * if you’re not overly worried about security.  But, often you don’t have time for that, or you’re not in control of the API.  In that case, you can use a proxy!

Assuming you are working on Linux and you’ve already installed Apache (HTTPD), it is pretty easy to fix this with a proxy.  Just add the following at the bottom of /etc/httpd/conf/httpd.conf, ensure there is a “Listen 80” in the file, and then restart Apache with “sudo systemctl restart httpd”.

<LocationMatch "/SomePath">
    ProxyPass http://:8080/SomePath
    Header add "Access-Control-Allow-Origin" "*"
</LocationMatch>


This will make sure that any request to this server matching /SomePath is proxied to http://:8080/SomePath and returns to the client with the extra header “Access-Control-Allow-Origin: *”.  That lets your web app talk to URLs under http://:8080/SomePath even though the API doesn’t send CORS headers itself.

Note that you can do / instead of SomePath to target the whole server, and of course, you can change the ports/etc.

I hope this helps you!

Oh, and one more thing… if you installed Apache and are having trouble getting it running in the first place, you may be running SELinux, in which case try “/usr/sbin/setsebool -P httpd_can_network_connect 1”. That nailed me before I went on to work on the rest of this.

Also, if you want to use a different port and it’s not working on CentOS etc., you may need to allow the port in SELinux: “semanage port -a -t http_port_t -p tcp 8222”.

 

Azure VM Unresponsive, Can’t SSH

My VM Was Non-Responsive

Today I had an Azure virtual machine go down very unexpectedly.

I received error reports from users and tried to go to the related service endpoint myself… and sure enough, it didn’t come up.  Then, I tried to ssh onto the VM and I couldn’t.

I hopped into the Azure portal, went to the VM, and things actually looked alright… it wasn’t stopped, or de-allocated, or anything.

Why?

After multiple minutes of digging around the Azure portal for more information, a new entry suddenly appeared in the “Activity Log”.  This was fairly disconcerting, as the issue had been reported over half an hour earlier and I had already been on the portal for several minutes.

The activity log said I had a “health event” which was “updated”.  Upon expanding it, I could see more events that had been “in progress”.  When you click the “in progress” event, you can get JSON for it and look into the details.  In my case, the bottom of the details said this:

    "properties": {
        "title": "We're sorry, your virtual machine isn't available because an unexpected failure on the host server",
        "details": null,
        "currentHealthStatus": "Unavailable",
        "previousHealthStatus": "Unknown",
        "type": "Downtime",
        "cause": "PlatformInitiated"
    }

So, the physical host running my VM in Azure died. Azure automatically noticed this and moved the VM to a new physical host, though much more slowly than I would have liked.

The VM came up after a few more minutes and all was right with the world. So… the moral of the story is that if your VM is unresponsive, it may be because the host died, and you may have to wait quite a while to see information about that in the activity log. But it does auto-resolve, apparently, which is nice.

Azure CLI Get Scale Set Private IP Addresses

Getting Scale Set Private IPs is Hard

I have found that it is impressively difficult to get the private IP addresses of Azure scale set instances in almost every tool.

For example, if you go and create a scale set in Terraform, even Terraform will not provide you the addresses or a way to look them up to act upon them in future steps.  Similarly, you cannot easily list the addresses in Ansible.

You can, however, build dynamic inventories in Ansible from scripts.  So, to make an Ansible playbook dynamically target the nodes in a recently created scale set, I decided to generate a dynamic inventory with the Azure CLI.

Azure CLI Command

Here is an Azure CLI command (version 2.0.58) that directly lists the IP addresses of scale-set nodes.  I hope it helps you as it has helped me.  It took a while to build it out from the docs, but it’s pretty simple now that it’s done.

az vmss nic list --resource-group YourRgName \
--vmss-name YourVmssName \
--query "[].ipConfigurations[].privateIpAddress"

The output will look similar to this, though I’ve changed the IP addresses to fake ones as an example.

[
  "123.123.123.123",
  "123.123.123.124"
]