PySpark Schema Generator - A simple tool to generate PySpark schema from JSON data
Hi Folks,
I built a small tool that solves a problem data engineers often run into when working with JSON data. As we know, JSON data is semi-structured, and we usually ingest it and then denormalize it into smaller tables for further processing.
In my case, I had to generate a PySpark schema from JSON in order to ingest the data, and the JSON structure changed often.
The JSON I was dealing with was very complex, so let me use a small example to show what the tool does and what problem it solves.
For example, say we have JSON coming from Kafka like the one below:
{
"name": "PREETish ranjan",
"dob": "2022-03-04T18:30:00.000Z",
"status": "active",
"isActive": true,
"id": 102,
"address": {
"city": "Bhubaneswar",
"PIN": 500016
},
"mobiles": ["8989898989", "5656565656"],
"id_cards": [1, 2, 3, 4, 5]
}
The output I need looks like this:
StructType([
StructField('name', StringType(), True),
StructField('dob', StringType(), True),
StructField('status', StringType(), True),
StructField('isActive', BooleanType(), True),
StructField('id', IntegerType(), True),
StructField('address', StructType([
    StructField('city', StringType(), True),
    StructField('PIN', IntegerType(), True)
]), True),
StructField('mobiles', ArrayType(StringType()), True),
StructField('id_cards', ArrayType(IntegerType()), True),
])
If the JSON gets bigger and more complex, it is quite difficult to write the schema by hand. My tool is a simple JavaScript tool that generates it for you. It has a few bugs and limitations, but it works.
Check out the GitHub repository: PySpark Schema Generator
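To give a feel for the idea, here is a minimal Python sketch of how a schema like the one above can be derived by walking the parsed JSON recursively. This is not the code of my JavaScript tool; the function name `spark_type` and the rules it applies (the first element decides the array type, empty arrays and nulls fall back to StringType, integers outside the 32-bit range become LongType) are assumptions made for illustration only.

```python
import json

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def spark_type(value):
    """Return the PySpark type expression (as source text) for a parsed JSON value."""
    # bool must be checked before int, because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "BooleanType()"
    if isinstance(value, int):
        # Assumption: use IntegerType when the value fits in 32 bits, else LongType
        return "IntegerType()" if INT32_MIN <= value <= INT32_MAX else "LongType()"
    if isinstance(value, float):
        return "DoubleType()"
    if isinstance(value, dict):
        # Nested objects become nested StructTypes; every field is marked nullable
        fields = ", ".join(
            f"StructField({name!r}, {spark_type(child)}, True)"
            for name, child in value.items()
        )
        return f"StructType([{fields}])"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first element decides the type
        element = spark_type(value[0]) if value else "StringType()"
        return f"ArrayType({element})"
    # Strings and nulls fall back to StringType
    return "StringType()"

sample = json.loads("""
{
  "name": "PREETish ranjan",
  "dob": "2022-03-04T18:30:00.000Z",
  "status": "active",
  "isActive": true,
  "id": 102,
  "address": {"city": "Bhubaneswar", "PIN": 500016},
  "mobiles": ["8989898989", "5656565656"],
  "id_cards": [1, 2, 3, 4, 5]
}
""")

schema_text = spark_type(sample)
print(schema_text)
```

The printed text is plain Python source, so it can be pasted into a PySpark job that imports the type classes from pyspark.sql.types. Because the sketch only looks at one sample record, fields that are missing or null in that record will not get the right type.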
Here is the tool:
Thanks for reading!!!