The fourth way: Ingesting data in Fabric using Parquet files

Introduction Welcome to my new blog, where I share my experience with data and analytics engineering and its technologies. Today I'm going to show you a fourth approach to loading on-prem data into a Microsoft Fabric Lakehouse, beyond the standard ones provided: Dataflow Gen2, Data Pipelines, and Jupyter Notebooks. I show you how to use Python to migrate a SQL Server database that is not exposed to the public network. This is a great way to leverage OneLake's file-explorer access, ingesting and loading data into the Fabric data warehouse....

July 17, 2023 · 3 min · 604 words · Riccardo Capelli

Two approaches to generating Parquet files from on-prem databases

In this guide I will first share my experience trying several approaches to creating Parquet files from SQL Server tables, then we'll explore in detail how to efficiently extract and convert data using PyArrow. All my attempts, with code, are available here: https://github.com/Riccardocapelli1/my_blog/tree/main/python My experience It took me several attempts to find the approach that suited me best. Here are the ways I tried to generate Parquet files: without clustering the output; clustering results with the standard Apache Arrow library; clustering results with concurrent threads (using ThreadPoolExecutor() from concurrent.futures); clustering results with DuckDB. Without clustering the output It is effective with small tables (less than 2–3 GB in size)....

July 18, 2023 · 5 min · 993 words · Riccardo Capelli