Presto / Hive find parquet files touched/referenced by a query/predicate.

We had a use case where we needed to find out which parquet files were touched by a query/predicate.  This was so that we could rewrite certain files in a special way to remove specific records.   In this case, presto was not mastering the data itself.

We found this awesome post -> https://stackoverflow.com/a/44011639/857994 on stack overflow which shows this pseudo-column:

select "$path" from table
This correctly shows you the parquet file a row came from, which is awesome!  I also found this  MR  which shows work has been merged to add $file_size  and $file_modified_time properties which is even cooler.
So, newer versions of presto-sql have even more power here.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s