Is Presto Slow When Returning Millions of Records / Big Result Set?

What Was Our Problem?

We were having issues with people reporting that Presto was slow when they were exporting hundreds of millions of records from much larger tables.  The queries were simple where clause filters selecting a few fields from some hundred-billion record tables.

Was It Actually Slow?

No! At least, not when parallelized well and tested properly.  I wrote a Java app to

  • Split a query into N =100 parts by using where clauses with a modulus on an integer column.
  • Query presto in parallel with 30 threads, going through the 100 queries.
  • Output results to standard out.

In an external bash script, I also grepped the results just to show some statistics I outputted.

This was slow!

Why Was it Slow?

First of all, let’s talk about the presto cluster setup:

  • 1 coordinator.
  • 15 workers.
  • All m5.8xlarge = 128GB RAM / 32 processor cores.

This is pretty decent.  So, what could our bottlenecks  be?

  1. Reading from s3.
  2. Processing results in workers.
  3. Slow coordinator due to having 15 workers talk through it.
  4. Slow consumer (our client running the queries).

To rule out 1/2/3 respectively I:

  1. Did a count(*) query which would force a scan over all relevant s3 data.  It came back pretty fast (in 15 seconds or so).
  2. Added more workers.  Having more workers  had minimal effect on the final timings, so we’re not worker bound.
  3. Switched the coordinator to a very large, compute-optimized node type.  This had minimal effect on the timings as well.

So, the problem appears to be with the client!

Why Was the Client Slow?

Our client really wasn’t doing a lot.  It was running 30 parallel queries and outputting the results, which were being grepped.  It was a similarly sized node to our presto coordinator, and it had plenty of CPU, RAM, decent network and disks (EBS).

It turned out though that once we stopped doing the grep and once we stopped writing the results to stdout, and we just held counters/statistics on the results we read, it went from ~25 minutes to ~2 minutes.

If we had run this in Spark or some other engine with good parallel behavior, we would have seen the workload distribute better over more nodes with sufficient ability to parallel process their portions of the records.  But, since we were running on a single node, with all results, the threads/CPU and other resoruces we were using capped out and could not go any faster.

Note: we did not see the client server as having high utilization, but some threads were at 100%.  So, the client app likely had a bottleneck we could avoid if we improved it.

Summary

So… next time you think presto can’t handle returning large numbers of results from the coordinator, take some time to evaluate your testing methodology.  Presto isn’t designed to route hundreds of millions of results, but it does it quite well in our experience.

 

HiveServer2 Not Starting JDBC Interface 10000, No Errors.

Very short post – my HiveServer2 process was running without errors after deployment, but it wasn’t really running.

Connecting via JDBC yielded errors saying the connection was being refused.  Analyzing the server showed that the port was not open using:

sudo netstat -nlp | grep 10000

I enabled debug logs with the extra command line parameter:

--hiveconf hive.root.logger=DEBUG,console

And it still didn’t show much, except something about creating the scratch directories (but not an error).

After a while, I figured out that the scratch directories were set to be created at the root of the file system in a new directory which didn’t exist yet. The user running hive did not have these permissions.

So, I created the scratch directory and gave ownership to the hive user, and then everything came up and worked great on the next hiveserver2 service restart.