Is Presto Slow When Returning Millions of Records / Big Result Set?

What Was Our Problem?

People were reporting that Presto was slow when exporting hundreds of millions of records from much larger tables.  The queries were simple where-clause filters selecting a few fields from some hundred-billion-record tables.

Was It Actually Slow?

No! At least, not when parallelized well and tested properly.  I wrote a Java app to:

  • Split a query into N = 100 parts by using where clauses with a modulus on an integer column.
  • Query Presto in parallel with 30 threads, working through the 100 queries.
  • Output results to standard out.

In an external bash script, I also grepped the results, just to pull out some statistics the app printed.
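Here is a minimal sketch of that approach (the JDBC URL, table, and column names are hypothetical, and the real app gathered more statistics):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelExport
{
    private static final int PARTS = 100;
    private static final int THREADS = 30;

    public static void main(String[] args) throws Exception
    {
        // 30 threads work through the 100 modulus-partitioned queries.
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int i = 0; i < PARTS; i++) {
            final int part = i;
            pool.submit(() -> runPart(part));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void runPart(int part)
    {
        // Each query covers 1/100th of the table via a modulus on an integer column.
        String sql = "SELECT col_a, col_b, col_c FROM big_table WHERE mod(int_col, " + PARTS + ") = " + part;
        try (Connection conn = DriverManager.getConnection("jdbc:presto://coordinator:8080/hive/default", "user", null);
                Statement stmt = conn.createStatement();
                ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                // Results go to stdout, which an external script greps for statistics.
                System.out.println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3));
            }
        }
        catch (SQLException e) {
            throw new RuntimeException(e);
        }
    }
}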

This was slow!

Why Was It Slow?

First of all, let’s talk about the Presto cluster setup:

  • 1 coordinator.
  • 15 workers.
  • All m5.8xlarge (128 GB RAM / 32 processor cores each).

This is pretty decent.  So, what could our bottlenecks be?

  1. Reading from S3.
  2. Processing results in workers.
  3. Slow coordinator due to having 15 workers talk through it.
  4. Slow consumer (our client running the queries).

To rule out 1, 2, and 3 respectively, I:

  1. Ran a count(*) query, which forces a scan over all the relevant S3 data (see the example query after this list).  It came back pretty fast (in 15 seconds or so).
  2. Added more workers.  Having more workers had minimal effect on the final timings, so we were not worker-bound.
  3. Switched the coordinator to a very large, compute-optimized node type.  This had minimal effect on the timings as well.
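For illustration, the check in item 1 was just a plain aggregation along these lines (table and filter names are hypothetical):

SELECT count(*)
FROM big_table
WHERE some_filter_column = 'some-value';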

So, the problem appears to be with the client!

Why Was the Client Slow?

Our client really wasn’t doing a lot.  It was running 30 parallel queries and outputting the results, which were being grepped.  It ran on a node similar in size to our Presto coordinator, with plenty of CPU and RAM and decent network and disks (EBS).

It turned out, though, that once we stopped doing the grep, stopped writing the results to stdout, and just held counters/statistics on the results we read, the run went from ~25 minutes to ~2 minutes.
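The fast version boiled down to a consumption loop along these lines (a sketch; the class and method names are illustrative):

import java.sql.ResultSet;
import java.sql.SQLException;

public final class FastConsumer
{
    // Drain a result set while only keeping cheap statistics; no stdout, no grep.
    public static long consume(ResultSet rs) throws SQLException
    {
        long rows = 0;
        while (rs.next()) {
            rows++;  // accumulate counters/statistics here instead of printing each row
        }
        return rows;
    }
}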

If we had run this in Spark or some other engine with good parallel behavior, we would have seen the workload distributed over more nodes, each with enough capacity to process its own portion of the records in parallel.  But since we were running on a single node that received all of the results, the threads/CPU and other resources we were using capped out and could not go any faster.

Note: we did not see high overall utilization on the client server, but some individual threads were at 100%.  So, the client app likely had a bottleneck we could avoid if we improved it.

Summary

So… next time you think Presto can’t handle returning large numbers of results from the coordinator, take some time to evaluate your testing methodology.  Routing hundreds of millions of results through the coordinator isn’t what Presto was designed for, but in our experience it does it quite well.


TMUX Easy Hotkeys + Lean Command Prompt

This is just a simple note of some TMUX and BASH profile configuration that I find helpful when developing.

Tmux lets you split screens arbitrarily.  So, you can divide one terminal up into, say, one full-width, half-screen pane up top and two quarter-screen panes on the bottom.  Generally, you split a screen into multiple panes horizontally or vertically by pressing ctrl + b and then a key (like % or ").  Similarly, you press ctrl + b and an arrow key to move between panes.

Two problems with this are:

  • ctrl + b is an awful hotkey; it’s hard to reach, and you’ll use it constantly.
  • Once you end up in quarter-size panes, even on big monitors, your default terminal prompt (the text that shows up before you type commands) becomes too large for comfortable use.

TMUX Config

Edit (or create) the file ~/.tmux.conf to customize tmux.  Then add these lines.  The first two switch the ctrl + b hotkey to ctrl + a (which is much more convenient for constant use).  The third lets a double ctrl + a pass through to the program in the pane, and the last starts window numbering at 1.

# remap prefix from 'C-b' to 'C-a'
unbind C-b
set-option -g prefix C-a

# Press ctrl + a twice to send a literal ctrl + a to the program in the pane.
bind-key C-a send-prefix

# Start window numbering at 1
set -g base-index 1
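After saving the file, you can apply the changes to any running tmux session with:

tmux source-file ~/.tmux.conf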

Bash Profile Prompt Config

Edit your ~/.bashrc file (your bash profile).  If a line setting PS1= already exists, you’ll want to comment it out or modify it; otherwise, just add this in:

PS1="\u:\W$ "

This changes your terminal prompt to show only your username and the current directory’s name.  So, for example, if my username is john and I’m working in a folder called /opt/code, it would just display john:code instead of a full path name with additional info (which takes up way too much space on quarter-size panes).
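To pick up the new prompt without opening a fresh terminal, re-source the file:

source ~/.bashrc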

I hope you find this useful, and thanks for reading!


Validate/Check Parquet File Schema From PC/Laptop

Checking a Parquet Schema

In the past, when I had to check the schema of a parquet file, I checked it within Apache Spark or by using https://github.com/apache/parquet-mr/tree/master/parquet-tools.

Today, though, I had to check a parquet file schema and came across this nifty Python utility: https://github.com/chhantyal/parquet-cli.  I think it’s just a wrapper around pyarrow, but it is slick and works easily.

You can pip install it trivially and then use it to view the data and schema of a parquet file with ease.  Here’s an example of installing it and checking a schema:

$ pip install parquet-cli
...

$ parq part-00000-679c332c.c000.snappy.parquet --schema

# Schema
ID: BYTE_ARRAY String
OrderID: BYTE_ARRAY String
SaleID: BYTE_ARRAY String
OrderDate: BYTE_ARRAY String
Pack: BYTE_ARRAY String
Qnty: BYTE_ARRAY String
Ratio: BYTE_ARRAY String
Name: BYTE_ARRAY String
Org: BYTE_ARRAY String
Category: BYTE_ARRAY String
Type: BYTE_ARRAY String
Percentage: BYTE_ARRAY String

Needless to say, this is much easier than dealing with Spark or parquet-tools for schema validation or checks on not-too-huge parquet files.
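If you also want to eyeball the data itself, the same tool can print the first few records; per the project’s README, the --head option does it:

$ parq part-00000-679c332c.c000.snappy.parquet --head 10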

Add a New UDF SQL Function to Presto – Support unix_timestamp From Hive

Some Context

We deploy Presto in-house and manage a fork of the code base so that we can introduce company-specific features for auditing, security, and integrations.

One of the main features we provide is support for Hive views, which I now see is coming to Presto SQL in version 330 (but that’s not really relevant here):

https://prestosql.io/docs/current/release/release-330.html

Add experimental support for executing basic Hive views. To enable this feature, the hive.views-execution.enabled configuration property must be set to true. (#2715)

A recent feature request for our implementation was to add support for the unix_timestamp function so that this expression worked:

"unix_timestamp"(a.creation_datetime, 'yyyy-MM-dd')

The Code

There is some good information on making functions here: https://prestosql.github.io/docs.prestosql.io/current/develop/functions.html.  But I mostly just referenced other date/time functions in the DateTimeFunctions class within the Presto code base (which also happens to be where I added the new unix_timestamp implementation shown below).

    @ScalarFunction("unix_timestamp")
    @LiteralParameters({"x", "y"})
    @SqlType(StandardTypes.DOUBLE)
    public static double unixTimestamp(@SqlType("varchar(x)") Slice datetime, @SqlType("varchar(y)") Slice formatString)
    {
        // Parse the input in UTC, matching Hive's behavior for this overload.
        SimpleDateFormat simpleDateFormat = new SimpleDateFormat(formatString.toStringUtf8());
        simpleDateFormat.setTimeZone(TimeZone.getTimeZone("UTC"));
        try {
            Date parsed = simpleDateFormat.parse(datetime.toStringUtf8());
            // Unix time is seconds since the epoch, so divide milliseconds by 1000.
            return parsed.toInstant().toEpochMilli() / 1000d;
        }
        catch (ParseException e) {
            throw new PrestoException(INVALID_FUNCTION_ARGUMENT, "'" + datetime.toStringUtf8() +
                    "' cannot be parsed by format string " + formatString.toStringUtf8() + ".");
        }
    }

This function takes a varchar date (a literal or a column) and a varchar format string, and it parses the date in UTC into unix time (seconds since 1970).  This should follow the Hive behavior for this overload (hopefully!).
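For example, a query like this (with illustrative values) should return the epoch seconds as a double:

SELECT unix_timestamp('2020-01-15', 'yyyy-MM-dd');
-- returns 1579046400.0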

Now you can call the function (for this overload) in Presto SQL just like you can in Hive.  We still have to do more testing and provide more overloads, but this is a working implementation, so I hope it helps you do the same.

Launch Spring-Boot JAR From Different Main Class

I found this very useful Stack Overflow answer:

https://stackoverflow.com/a/36552613/857994

It shows you how to start a spring-boot JAR from a different main class.  It’s a quirky solution, but it worked great: Spring Boot’s PropertiesLauncher reads the loader.main system property to decide which main class to run.  Here’s a slightly more obvious/cleaner copy of the command (mostly for my own future reference):

java -cp presto-ws-3.2.2.jar -Dloader.main=com.company.PrestoQueryRunner org.springframework.boot.loader.PropertiesLauncher

AWS EC2 Varying Generation and Linear Cost Example

Due to increased query sizes on our Presto clusters (causing aggregation failures), I’m in the middle of evaluating a move from 16-core/64GB-RAM general purpose EC2 machines (m4.4xlarge) to 64-core/256GB-RAM general purpose machines (a 4x increase in power/RAM).

Here is the list of m4 and m5 models at the 16-core/64GB and 64-core/256GB specs.  Below, we’ll see how they compare to each other and what the best option is.

Type          vCPU   ECU    Memory (GiB)   Instance Storage   Linux/UNIX Usage
m4.4xlarge    16     53.5   64             EBS Only           $0.80 per hour
m4.16xlarge   64     188    256            EBS Only           $3.20 per hour
m5.4xlarge    16     70     64             EBS Only           $0.768 per hour
m5.16xlarge   64     256    256            EBS Only           $3.072 per hour

For reference:

EC2 uses the EC2 Compute Unit (ECU) term to describe CPU resources for each instance size where one ECU provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

There are a few good things to notice here:

  • For m4.4xlarge to m4.16xlarge, we are getting 4x the resources for exactly 4x the cost ($0.80 x 4 = $3.20).  The one exception is that we get less than 4x the ECU units (so technically less than 4x the processing power).  So, compute roughly scales linearly within a model family.
  • Pretty much the same situation holds for the m5 models; going from 4xlarge to 16xlarge is exactly a 4x increase in cost and resources, except for ECUs, which again scale a little under 4x.
  • The m5 models have more ECUs than their m4 counterparts and also cost less, so they are a better deal both performance-wise and cost-wise.

So, we’ll go with m5.16xlarge instances, which cost $3.072 an hour.  Assuming a 30-day month, that comes out to $3.072 x 24 x 30 = $2,211.84 a month.

Kubernetes Ingress Service 60 Second Timeout

Kubernetes has multiple levels of timeouts for calls, including at the load balancer, inbound to the ingress itself, and at the individual ingress resources.

Assuming you have the first two configured and are still hitting a timeout on your app, this is the annotation you need to add to your service’s ingress resource to boost its timeout:

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"

This bumps the timeout up to 1 hour from the default of 1 minute.
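For context, here is a minimal sketch of an ingress resource carrying that annotation (the service name, host, and port are hypothetical):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    # Allow the backing service up to 1 hour to respond.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
spec:
  rules:
    - host: my-service.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 8080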