Presto / Hive find parquet files touched/referenced by a query/predicate.

We had a use case where we needed to find out which parquet files were touched by a query/predicate.  This was so that we could rewrite certain files in a special way to remove specific records.   In this case, presto was not mastering the data itself.

We found this awesome post -> https://stackoverflow.com/a/44011639/857994 on stack overflow which shows this pseudo-column:

select "$path" from table
This correctly shows you the parquet file a row came from, which is awesome!  I also found this  MR  which shows work has been merged to add $file_size  and $file_modified_time properties which is even cooler.
So, newer versions of presto-sql have even more power here.

 

Leave a comment