PySpark Schema Generator - A simple tool to generate PySpark schema from JSON data

Hi folks,

I built a small tool that solves a problem data engineers often run into when dealing with JSON data. As we know, JSON is semi-structured, and we usually ingest it and then flatten and split it into smaller tables for further processing.


In my case, I had to generate a PySpark schema from JSON to ingest the data, and the JSON structure changed often.

The JSON I was dealing with was very complex, but a small example is enough to show what problem the tool solves.

For example, say we have a JSON message coming from Kafka like the one below:

{
  "name": "PREETish ranjan",
  "dob": "2022-03-04T18:30:00.000Z",
  "status": "active",
  "isActive": true,
  "id": 102,
  "address": {
    "city": "Bhubaneswar",
    "PIN": 500016
  },
  "mobiles": ["8989898989", "5656565656"],
  "id_cards": [1, 2, 3, 4, 5]
}

The output I need looks like this:

StructType([
    StructField('name', StringType(), True),
    StructField('dob', StringType(), True),
    StructField('status', StringType(), True),
    StructField('isActive', BooleanType(), True),
    StructField('id', IntegerType(), True),
    StructField('address', StructType([
        StructField('city', StringType(), True),
        StructField('PIN', IntegerType(), True),
    ]), True),
    StructField('mobiles', ArrayType(StringType()), True),
    StructField('id_cards', ArrayType(IntegerType()), True),
    ])

If the JSON gets bigger and more complex, it is quite difficult to write the schema by hand for data like this. My tool is a simple JavaScript tool that generates it for you. It has a few bugs and limitations, but it works.
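To give a feel for the idea, here is a minimal Python sketch of how such a generator could work. This is not the tool's actual JavaScript source, just an illustration under a few assumptions of mine: the function names `spark_type` and `generate_schema` are made up for this example, every field is treated as nullable, an array takes its type from its first element, whole numbers outside the 32-bit range map to `LongType()`, and nulls or empty arrays fall back to `StringType()`.

```python
import json


def spark_type(value):
    """Return the PySpark type, as source text, for a parsed JSON value."""
    # bool must be checked before int: in Python, bool is a subclass of int.
    if isinstance(value, bool):
        return "BooleanType()"
    if isinstance(value, int):
        # Assumption: use LongType when the value does not fit in 32 bits.
        if -2**31 <= value < 2**31:
            return "IntegerType()"
        return "LongType()"
    if isinstance(value, float):
        return "DoubleType()"
    if isinstance(value, dict):
        # Nested objects become nested StructTypes, one StructField per key.
        fields = ", ".join(
            f"StructField({key!r}, {spark_type(item)}, True)"
            for key, item in value.items()
        )
        return f"StructType([{fields}])"
    if isinstance(value, list):
        # Assumption: arrays are homogeneous, so the first element decides.
        element = spark_type(value[0]) if value else "StringType()"
        return f"ArrayType({element})"
    # Strings and nulls fall back to StringType.
    return "StringType()"


def generate_schema(json_text):
    """Parse a JSON sample and return the schema as PySpark source text."""
    return spark_type(json.loads(json_text))


sample = """
{
  "name": "PREETish ranjan",
  "isActive": true,
  "id": 102,
  "address": {"city": "Bhubaneswar", "PIN": 500016},
  "mobiles": ["8989898989", "5656565656"],
  "id_cards": [1, 2, 3, 4, 5]
}
"""

schema_text = generate_schema(sample)
print(schema_text)
```

The printed text can be pasted into a PySpark job next to `from pyspark.sql.types import *`. Because the sketch only looks at one sample, a field that is null or missing in that sample will get the wrong type, so the result is worth a quick review.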

Check out the GitHub repository: PySpark Schema Generator

Thanks for reading!!!
