Validate/Check Parquet File Schema From PC/Laptop

Checking a Parquet Schema

Generally, when I have had to check the schema of a parquet file in the past, I have checked it within Apache Spark, or by using https://github.com/apache/parquet-mr/tree/master/parquet-tools.

Today, I had to check a parquet file schema and I came across this nifty python utility though: https://github.com/chhantyal/parquet-cli.  I think it’s just a wrapper around pyarrow, but it is slick and works easily.

You can pip install it trivially and then use it to view the data and schema of a parquet file with ease.  Here’s an example of installing it and checking a schema:

$ pip install parquet-cli
...

$ parq part-00000-679c332c.c000.snappy.parquet --schema

# Schema
ID: BYTE_ARRAY String
OrderID: BYTE_ARRAY String
SaleID: BYTE_ARRAY String
OrderDate: BYTE_ARRAY String
Pack: BYTE_ARRAY String
Qnty: BYTE_ARRAY String
Ratio: BYTE_ARRAY String
Name: BYTE_ARRAY String
Org: BYTE_ARRAY String
Category: BYTE_ARRAY String
Type: BYTE_ARRAY String
Percentage: BYTE_ARRAY String

Needless to say, this is much easier than dealing with Spark or parquet tools for schema validation or checks of not-too-huge parquet file data.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s