Checking a Parquet Schema
When I have needed to check the schema of a Parquet file in the past, I have usually done it within Apache Spark or with parquet-tools: https://github.com/apache/parquet-mr/tree/master/parquet-tools.
Today I had to check a Parquet file's schema and came across a nifty Python utility instead: https://github.com/chhantyal/parquet-cli. I think it's just a wrapper around pyarrow, but it is slick and easy to use.
You can pip install it and then use it to view the data and schema of a Parquet file. Here's an example of installing it and checking a schema:
    $ pip install parquet-cli
    ...
    $ parq part-00000-679c332c.c000.snappy.parquet --schema

    # Schema
    ID: BYTE_ARRAY String
    OrderID: BYTE_ARRAY String
    SaleID: BYTE_ARRAY String
    OrderDate: BYTE_ARRAY String
    Pack: BYTE_ARRAY String
    Qnty: BYTE_ARRAY String
    Ratio: BYTE_ARRAY String
    Name: BYTE_ARRAY String
    Org: BYTE_ARRAY String
    Category: BYTE_ARRAY String
    Type: BYTE_ARRAY String
    Percentage: BYTE_ARRAY String
Needless to say, this is much easier than firing up Spark or parquet-tools when you only need to validate or inspect the schema of a Parquet file that is not too large.