Extract ZIP files in Azure Data Lake Storage using Python

Hi there,

This example shows how to extract a ZIP file in Azure Data Lake Storage (ADLS) using Python.

Recently I needed to extract a ZIP file into ADLS storage from a Python script. The approach is the same if you are doing it from a Django or FastAPI service.

The approach is to read the ZIP file into memory, then read its entries one at a time and upload each one to ADLS.
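The in-memory part of this approach can be tried on its own, without any Azure dependency. Here is a minimal sketch using only the standard library. The names `iter_zip_members` and `build_sample_zip` are hypothetical helpers made up for illustration, and the sketch assumes the whole archive fits comfortably in memory.

```python
import io
import zipfile


def iter_zip_members(zip_bytes):
    """Yield (name, content_bytes) for every file entry in an in-memory ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            yield info.filename, archive.read(info)


def build_sample_zip():
    """Build a small ZIP in memory so the example is self-contained."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("a.txt", "hello")
        archive.writestr("folder/b.txt", "world")
    return buffer.getvalue()


if __name__ == "__main__":
    for name, content in iter_zip_members(build_sample_zip()):
        print(name, len(content), "bytes")
```

Each yielded pair is what would be handed to the upload step: the entry name becomes part of the target path and the bytes become the file contents.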

This is unlike Databricks, where the extraction can be done through mount points; extracting a ZIP from a Python script or service works a bit differently.

It's very simple and short.


To run this, I created an ADLS Gen2 storage account and a container named samples.


"""
Author: PREETish
Reach me at: https://www.pritishranjan.com
Queries: https://preetblogs.azurewebsites.net/aboutme
Github: PreetRanjan
"""

import io
import zipfile

from azure.storage.filedatalake import FileSystemClient

connection_string = "<your_connection_string>"
file_system_client = FileSystemClient.from_connection_string(connection_string, file_system_name="samples")

def upload_bytes_to_adls(file_system_client, file_path, file_contents):
    # Upload the given bytes to file_path, replacing any existing file
    file_client = file_system_client.get_file_client(file_path)
    file_client.upload_data(file_contents, overwrite=True)


def read_file_from_adls(file_system_client, file_path):
    # Download the whole file at file_path and return its bytes
    file_client = file_system_client.get_file_client(file_path)
    download = file_client.download_file()
    return download.readall()


def extract_zip_in_adls(zip_data, extract_dir):
    # Open the ZIP from memory and upload each file entry to ADLS in turn
    with io.BytesIO(zip_data) as zip_buffer:
        with zipfile.ZipFile(zip_buffer, "r") as zip_file:
            for member in zip_file.infolist():
                # Skip directory entries; they carry no data to upload
                if member.is_dir():
                    continue
                extract_path = extract_dir + member.filename
                print("Extract & Upload to:", extract_path)
                with zip_file.open(member) as file_in_zip:
                    upload_bytes_to_adls(file_system_client, extract_path, file_in_zip.read())
                print("Uploaded!!")


zip_file_path = "drivetime.zip"
print("Reading ZIP file:", zip_file_path)
zip_bytes = read_file_from_adls(file_system_client, zip_file_path)
print("Zip file read. Size:", len(zip_bytes), "bytes")

# Extract the ZIP and upload each file to ADLS
print("Running Extract and Upload...")
extract_zip_in_adls(zip_bytes, "Extract/")
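
One thing worth keeping in mind: entry names inside a ZIP come from whoever built the archive, so joining them to the target folder with plain string concatenation can produce surprising paths (for example names containing `..` or a leading `/`). Below is a minimal sketch of one way to check them before upload. `safe_adls_path` is a hypothetical helper, not part of the Azure SDK, and it assumes `extract_dir` is a non-empty folder name and that you want to reject any entry that would land outside it.

```python
import posixpath


def safe_adls_path(extract_dir, member_name):
    """Join a ZIP member name onto extract_dir, rejecting paths that escape it."""
    base = posixpath.normpath(extract_dir.strip("/"))
    # ZIP names normally use forward slashes; tolerate backslashes as well
    cleaned = member_name.replace("\\", "/").lstrip("/")
    candidate = posixpath.normpath(posixpath.join(base, cleaned))
    # After normalisation the result must still sit under the base folder
    if not candidate.startswith(base + "/"):
        raise ValueError(f"Unsafe ZIP entry name: {member_name!r}")
    return candidate
```

If you want this check, the result of `safe_adls_path(extract_dir, name)` could be used as the upload path in place of the concatenated string.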


You can check the code on my GitHub:


Thanks

