Validate/Check Parquet File Schema From PC/Laptop

Checking a Parquet Schema

In the past, when I had to check the schema of a parquet file, I generally did it within Apache Spark or by using https://github.com/apache/parquet-mr/tree/master/parquet-tools.

Today, though, I had to check a parquet file schema and came across this nifty Python utility: https://github.com/chhantyal/parquet-cli.  I think it's just a wrapper around pyarrow, but it is slick and works easily.

You can pip install it trivially and then use it to view the data and schema of a parquet file with ease.  Here’s an example of installing it and checking a schema:

$ pip install parquet-cli
...

$ parq part-00000-679c332c.c000.snappy.parquet --schema

# Schema
ID: BYTE_ARRAY String
OrderID: BYTE_ARRAY String
SaleID: BYTE_ARRAY String
OrderDate: BYTE_ARRAY String
Pack: BYTE_ARRAY String
Qnty: BYTE_ARRAY String
Ratio: BYTE_ARRAY String
Name: BYTE_ARRAY String
Org: BYTE_ARRAY String
Category: BYTE_ARRAY String
Type: BYTE_ARRAY String
Percentage: BYTE_ARRAY String

Needless to say, this is much easier than spinning up Spark or parquet-tools for schema validation or quick checks on parquet files that are not too huge.
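
For comparison, here is roughly what the same schema check looks like in Spark (Scala), assuming you already have a running session named spark (e.g. spark-shell or a Databricks notebook) and the file is reachable from it:

//Read the parquet file and print its schema in Spark.
val df = spark.read.parquet("part-00000-679c332c.c000.snappy.parquet")
df.printSchema()

Even in the best case that means waiting on a JVM and a Spark session, which is exactly the overhead parquet-cli avoids.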

Databricks / Spark – Generate Parquet Sample Data

I frequently find myself needing to generate parquet data for sample tests, e.g. when setting up a new Hive instance or testing Apache Drill, Presto, etc. I always end up rewriting basically the same code in a slightly different way because I never save it. So here it is!

This code builds a 5000-row data frame with 3 columns, two integers and one string. It gives the columns descriptive names and writes them out as a parquet file split into 5 part files thanks to the coalesce.

I have it set up to write to ADLS in Azure, but you can change the path so it works with your HDFS or whatever.

NOTE: It will overwrite the previous results in the destination, so (1) don’t write over data you want to keep, and (2) if you need to tune this, just keep re-running it :).

//Create a mutable list buffer of (value, square, description) tuples in a loop.
import scala.collection.mutable.ListBuffer
val lb = ListBuffer[(Int, Int, String)]()
for (i <- 1 to 5000) {
  lb += ((i, i * i, "Number is " + i + "."))
}

//Convert it to a data frame with descriptive column names.
import spark.implicits._
val df = lb.toDF("value", "square", "description")

//Coalesce to 5 partitions and write as parquet, overwriting anything already at the destination.
import org.apache.spark.sql.SaveMode
df.coalesce(5).write.mode(SaveMode.Overwrite).parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
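
If you want to sanity-check the result, a quick read-back works; this is just a sketch against the same placeholder ADLS path used above:

//Read the sample data back and confirm the schema and row count.
val check = spark.read.parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")
check.printSchema()
println(check.count())  //expect 5000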