Introduction
Welcome to my new blog, where I share my experience with data and analytics engineering and related technologies. Today I'm going to show you a fourth approach to loading on-premises data into a Microsoft Fabric Lakehouse, alongside the standard ones provided:
- Dataflows Gen2
- Data Pipelines
- Jupyter Notebooks
I'll show you how to use Python to migrate a SQL Server database that is not exposed to the public internet. This is a great way to leverage the power of OneLake file explorer, ingesting and loading data into the Fabric data warehouse.
What are Parquet files and why are they useful?
Parquet files are a columnar storage format that allows for efficient compression and encoding of data. They are especially suited for analytical queries, as they can reduce the amount of data scanned and improve performance. Parquet files are also compatible with many data processing frameworks, such as Spark, Hive, and Presto.
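As a quick, self-contained illustration (the file name and sample data below are made up), here is how a pandas DataFrame can be written to and read back from Parquet, pulling back only the columns a query needs:

import pandas as pd

# Sample data written to Parquet with snappy compression via the pyarrow engine.
df = pd.DataFrame({"id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})
df.to_parquet("example.parquet", engine="pyarrow", compression="snappy")

# Thanks to the columnar layout, a reader can scan only the columns it needs.
amounts = pd.read_parquet("example.parquet", columns=["amount"])
print(amounts)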
Parquet files can be generated with modest resources (a reasonably powerful workstation can handle the workload). The script I wrote is a first, exploratory approach to this new Microsoft product. You can find the script here: https://github.com/Riccardocapelli1/my_blog/tree/main/python
How to load data into Microsoft Fabric using Parquet files?
The process defined in the scripts consists of three main steps:
- Extract data from the local SQL Server database into a pandas DataFrame.
- Convert the DataFrame to a Parquet file.
- Write the Parquet file to a local folder (for example, one synced by OneLake file explorer).
To load data into Microsoft Fabric using Parquet files from a local SQL Server database, we will use Python with the pyarrow, pandas, and sqlalchemy libraries. This method ensures that the data remains unexposed to the public internet, providing enhanced security. Additionally, we will leverage OneLake file explorer, making the process efficient and straightforward.
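As a minimal sketch of these three steps, the flow looks roughly like this. The connection string (here assuming pyodbc and a SQL Server ODBC driver are installed), the table, and the OneLake file explorer path are placeholders you would replace with your own:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sqlalchemy import create_engine

# Placeholder connection details: adjust server, database, driver, and credentials to your setup.
engine = create_engine(
    "mssql+pyodbc://user:password@MYSERVER/MyDatabase?driver=ODBC+Driver+17+for+SQL+Server"
)

# 1. Extract: read the table (or a query) into a pandas DataFrame.
df = pd.read_sql("SELECT col1, col2 FROM dbo.MyTable", engine)

# 2. Convert: turn the DataFrame into an Arrow table.
table = pa.Table.from_pandas(df)

# 3. Write: save the Parquet file into a folder synced by OneLake file explorer (path is illustrative).
pq.write_table(table, r"C:\Users\me\OneLake - Microsoft\MyWorkspace\MyLakehouse.Lakehouse\Files\MyTable.parquet")

This is the simplest possible version; for a very large table you would not want to hold everything in memory at once, and a chunked variant is sketched later in the post.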
Prerequisites
Before proceeding, ensure that you have the following prerequisites in place:
Python
installed on your system. You can download Python from the official website: python.org.
Required Python packages
pyarrow, pandas, sqlalchemy, and tqdm. Install them using pip:
pip install pyarrow
pip install pandas
pip install sqlalchemy
pip install tqdm
A SQL Server instance
hosting a database containing the data you want to load into Microsoft Fabric. At the time of writing, Microsoft Fabric is available in trial-only access.
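Before running anything heavier, it can be worth checking that Python can actually reach the instance. A small connectivity check (the connection string details here are placeholders, and this assumes a pyodbc-based connection with a SQL Server ODBC driver installed) might look like this:

from sqlalchemy import create_engine, text

# Hypothetical connection string; replace user, password, server, database, and driver with your own.
engine = create_engine(
    "mssql+pyodbc://user:password@MYSERVER/MyDatabase?driver=ODBC+Driver+17+for+SQL+Server"
)

# If this prints the server version, the sqlalchemy setup is working.
with engine.connect() as conn:
    print(conn.execute(text("SELECT @@VERSION")).scalar())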
The Python Script
Below is the Python script to load data into Microsoft Fabric using Parquet files:
import time
import pyarrow.parquet as pq
import pandas as pd
import pyarrow as pa
from sqlalchemy import create_engine
from creds import userdb, passworddb  # database credentials kept in a separate creds.py module

row_group_size = 5000  # number of rows per Parquet row group (and per extraction chunk)
table_name = "your_table_name_to_query"
columns_list = "your_column_list_to_query_from_your_db"
# Rest of the script...
Replace the placeholders with your specific details:
- your_table_name_to_query: the name of the table you want to query from your database.
- your_column_list_to_query_from_your_db: the list of columns you want to query from your database.
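The full logic lives in the repository linked above. Purely as an illustration of how such a script can continue (this is my own sketch, not the original code, and the server, database, driver, and output path are placeholders), a chunked export could look like this:

import time
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sqlalchemy import create_engine
from tqdm import tqdm
from creds import userdb, passworddb

row_group_size = 5000
table_name = "your_table_name_to_query"
columns_list = "your_column_list_to_query_from_your_db"

# Placeholder connection string: adjust server, database, and driver to your environment.
engine = create_engine(
    f"mssql+pyodbc://{userdb}:{passworddb}@MYSERVER/MyDatabase"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

query = f"SELECT {columns_list} FROM {table_name}"
output_path = r"C:\path\to\onelake\Files\output.parquet"  # e.g. a OneLake file explorer folder

writer = None
start = time.time()
# Stream the table in chunks so a large table never has to fit in memory,
# appending each chunk to the same Parquet file as a row group.
for chunk in tqdm(pd.read_sql(query, engine, chunksize=row_group_size)):
    batch = pa.Table.from_pandas(chunk)
    if writer is None:
        # Create the writer lazily so it picks up the schema of the first chunk.
        writer = pq.ParquetWriter(output_path, batch.schema)
    writer.write_table(batch)
if writer is not None:
    writer.close()
print(f"Export completed in {time.time() - start:.1f} seconds")

One caveat with this pattern: the schema is inferred from the first chunk, so columns whose type is ambiguous there (for example, all NULLs) may need an explicit pyarrow schema instead.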
Conclusion
In this guide, we introduced a new, exploratory approach to loading data into Microsoft Fabric using Parquet files from a local SQL Server database (I tried it with a 25 GB table and got very good performance). By following these steps and leveraging Python libraries, you can load data securely without exposing it to the public internet. The use of Parquet files and OneLake file explorer provides an efficient and robust solution for your data loading needs. Moreover, whatever computational cost comes with converting SQL Server tables to Parquet files is absorbed by your local environment, since the conversion happens before the data ever reaches Fabric.
This is the first post of a series, in which I will describe how to explore and customize the script according to your specific requirements, and how I would make it work and orchestrate it in a local/hybrid data architecture. Happy data loading!