Coding Stream of Consciousness

by John Humphreys – Random code from my life.

Databricks / Spark – Generate Parquet Sample Data

Posted on June 6, 2019 by John Humphreys

I frequently need to generate Parquet data for sample tests, e.g. when setting up a new Hive instance or testing Apache Drill, Presto, etc. I always end up writing basically the same code in a slightly different way because I never save it. So here it is!

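This code makes a 5,000-row data frame with 3 columns: 2 integers and 1 string. It then names the columns and saves the data as a Parquet dataset (really a directory) with up to 5 part files because of the coalesce(5). Keep in mind that coalesce only reduces the partition count, so you get fewer than 5 files if the data frame starts out with fewer than 5 partitions.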

I have it set up to write to ADLS in Azure, but you can change the path so it works with HDFS or whatever storage you use.

NOTE: It will overwrite the previous results in the destination, so (1) don’t write over data you want to keep, and (2) if you need to tune this, just keep re-running it :).

//Build the rows in a mutable list buffer using a loop.
import scala.collection.mutable.ListBuffer
val lb = ListBuffer[(Int, Int, String)]()
for (i <- 1 to 5000) {
  lb += ((i, i * i, s"Number is $i."))
}

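//Convert it to a data frame and name the columns.
//toList makes an immutable copy, which also works on Scala 2.13 builds of Spark.
import spark.implicits._
val df = lb.toList.toDF("value", "square", "description")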

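//Write it out as Parquet, replacing anything already at the destination.
import org.apache.spark.sql.SaveMode
df.coalesce(5).write.mode(SaveMode.Overwrite).parquet("adl://some-adls-instance.azuredatalakestore.net/johntest/sample_data.parquet")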
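
If you want to sanity check the output before pointing Hive, Drill, or Presto at it, something like the following should work. This is only a rough sketch, not part of the original snippet: it assumes you run it locally, so the local master, the app name, and the temp-directory destination are all stand-ins I made up in place of the cluster and the ADLS path. It builds the same rows with a range instead of a list buffer, writes them, reads them back, and counts the part files.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumption: running outside a cluster. In a notebook or spark-shell a session
// already exists and getOrCreate() reuses it, so drop .master(...) there.
val spark = SparkSession.builder()
  .appName("parquet-sample-check") // hypothetical app name
  .master("local[*]")
  .getOrCreate()

// Same 5000 rows as above, built with a range instead of a mutable buffer.
val rows = (1 to 5000).map(i => (i, i * i, s"Number is $i."))
val df = spark.createDataFrame(rows).toDF("value", "square", "description")

// Hypothetical local destination standing in for the ADLS path.
val outPath = Files.createTempDirectory("parquet-sample")
  .resolve("sample_data.parquet").toString

df.coalesce(5).write.mode(SaveMode.Overwrite).parquet(outPath)

// Read the data back and collect a few facts about it.
val readBack = spark.read.parquet(outPath)
val rowCount = readBack.count()
val columnNames = readBack.columns.toList

// The destination is a directory; count only the Parquet part files in it
// (this skips _SUCCESS and the hidden .crc checksum files).
val partFiles = new File(outPath).listFiles()
  .count(f => f.getName.startsWith("part-") && f.getName.endsWith(".parquet"))

println(s"rows=$rowCount columns=$columnNames partFiles=$partFiles")
```

The part-file count is the interesting bit: it should never be more than 5, but it can be less on a machine with only a few cores, since coalesce never adds partitions. The file listing uses java.io.File, so it only works for a local path; for ADLS or HDFS you would need to list the directory through the storage layer instead.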
