This project aims to migrate an on-premises SQL database to the Azure Cloud, leveraging a range of Azure services and ETL concepts to ensure a smooth and efficient transition. The solution incorporates a multi-hop lakehouse architecture, emphasizing security, governance, and data quality.
I used the AdventureWorks sample databases from Microsoft. The solution is built with the following tools and services:
- OnPrem SQL Server with SQL Server Management Studio (SSMS)
- Azure Data Lake Storage Gen2
- Azure Data Factory (ADF)
- Databricks
- Azure Synapse Analytics
- Serverless SQL pool (Azure Synapse)
- Power BI
- Azure Key Vault
- Azure Resource Groups
The architecture for this project follows the bronze-silver-gold (multi-hop) lakehouse model:
- Bronze Layer: Raw data ingested from the on-prem database into Azure Data Lake Storage Gen2, kept exactly as it appears in the source.
- Silver Layer: Every column whose name contains "Date" is cast to a datetime (timestamp) type.
- Gold Layer: Column names are converted from CamelCase to snake_case (see the PySpark sketch after this list).
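A minimal PySpark sketch of these two transformations, assuming it runs in a Databricks notebook (where `spark` is predefined) and that the bronze data was landed as Parquet; the mount paths and table name are illustrative placeholders, not the project's exact values:

```python
import re
from pyspark.sql.functions import col
from pyspark.sql.types import TimestampType

# Illustrative paths -- the real project mounts its own containers.
bronze_path = "/mnt/bronze/SalesLT/Address"
silver_path = "/mnt/silver/SalesLT/Address"
gold_path = "/mnt/gold/SalesLT/Address"

# Silver: cast every column whose name contains "date" to a timestamp type.
df = spark.read.parquet(bronze_path)
for c in df.columns:
    if "date" in c.lower():
        df = df.withColumn(c, col(c).cast(TimestampType()))
df.write.format("delta").mode("overwrite").save(silver_path)

# Gold: rename columns from CamelCase to snake_case.
def to_snake_case(name: str) -> str:
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

df = spark.read.format("delta").load(silver_path)
for c in df.columns:
    df = df.withColumnRenamed(c, to_snake_case(c))
df.write.format("delta").mode("overwrite").save(gold_path)
```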
The main Data Factory pipeline consists of the following tasks:
- Lookup: Retrieves the names of all tables in the on-prem SQL database (an example of the underlying query is sketched after this list).
- ForEach: Selects everything from each table and copies the data into the bronze layer.
- Notebooks: Databricks notebooks transform the data and write the silver and gold layers back to Azure Data Lake.
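For illustration, a query of this shape is what the Lookup activity typically runs to enumerate user tables. The snippet below just runs it locally with pyodbc to show the result; the driver, server, and database names are placeholders rather than the project's actual configuration:

```python
import pyodbc  # assumes the Microsoft ODBC driver for SQL Server is installed

# Placeholder connection details -- substitute your own on-prem server and database.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=AdventureWorksLT;Trusted_Connection=yes;"
)

# The same query can be pasted into the ADF Lookup activity to list user tables.
query = """
SELECT s.name AS SchemaName, t.name AS TableName
FROM sys.tables AS t
JOIN sys.schemas AS s ON t.schema_id = s.schema_id;
"""

for schema_name, table_name in conn.cursor().execute(query).fetchall():
    print(f"{schema_name}.{table_name}")
```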
A second pipeline exposes the processed data as views and consists of the following main tasks:
- Get Metadata: Retrieves metadata from the Azure Data Lake container to list its child items (one per table).
- ForEach: Iterates through the listed items and executes a stored procedure that creates or alters a Serverless SQL view for each Delta table (a sketch of such a view definition follows this list).
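As a rough sketch of what that stored procedure might generate for each child item, the statement below builds a `CREATE OR ALTER VIEW` over a gold-layer Delta folder using `OPENROWSET`; the storage account, container layout, database name, and credentials are placeholders, not values taken from the project:

```python
import pyodbc

table_name = "address"  # in practice, supplied by the Get Metadata / ForEach activities

# Dynamic CREATE OR ALTER VIEW statement over a gold-layer Delta folder.
create_view_sql = f"""
CREATE OR ALTER VIEW dbo.{table_name} AS
SELECT *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/gold/SalesLT/{table_name}/',
    FORMAT = 'DELTA'
) AS result;
"""

# Execute against the Synapse serverless SQL endpoint (placeholder connection details).
with pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
    "DATABASE=gold_db;UID=<user>;PWD=<password>"
) as conn:
    conn.execute(create_view_sql)
```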
- Multi-Hop Architecture: Bronze-Silver-Gold layers for data processing and storage.
- Security and Governance: Managed through Azure Key Vault and secure access controls (a Databricks secret-scope example follows this list).
- Quality Checks: Incorporated to ensure data integrity and accuracy throughout the ETL process.
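For example, in a Databricks notebook the storage access key (or a service-principal secret) can be read from a Key Vault-backed secret scope instead of being hard-coded. The scope, secret, and storage-account names below are placeholders:

```python
# Runs inside a Databricks notebook, where dbutils and spark are predefined.
# "keyvault-scope" and "adls-access-key" are placeholder names for a
# Key Vault-backed secret scope and the secret stored in it.
access_key = dbutils.secrets.get(scope="keyvault-scope", key="adls-access-key")

# Use the secret to authenticate Spark against the storage account (placeholder name).
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    access_key,
)
```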
Azure Resources:
- Ensure all necessary Azure resources (Data Lake, Data Factory, Databricks, Synapse Analytics, etc.) are provisioned and configured.
Configuration Files:
- Update configuration files with your Azure environment details and credentials.
Deployment:
- Deploy the ETL pipeline in Azure Data Factory.
- Set up Databricks notebooks and jobs.
- Configure Serverless SQL views (using Synapse Analytics) and Power BI reports.
- Run ETL Pipeline: Trigger the pipeline in Azure Data Factory to start the data migration (it can also be triggered programmatically, as sketched after this list).
- Monitor: Use ADF monitoring tools to track the progress and troubleshoot any issues.
- Reporting: Access reports and dashboards in Power BI for data analysis.
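Besides the ADF UI, a pipeline run can be started from the Azure SDK for Python. This is a minimal sketch assuming the `azure-identity` and `azure-mgmt-datafactory` packages; the subscription, resource group, factory, and pipeline names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers -- replace with your own subscription and resources.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<data-factory-name>"
pipeline_name = "copy-all-tables"  # hypothetical pipeline name

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# Start a pipeline run and check its status.
run = adf_client.pipelines.create_run(resource_group, factory_name, pipeline_name, parameters={})
status = adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id)
print(status.status)  # e.g. InProgress, Succeeded, Failed
```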
- Data Encryption: All data is encrypted in transit and at rest.
- Access Controls: Access to resources is managed using Azure Role-Based Access Control (RBAC) and Key Vault secrets.
- Data Validation: Perform regular data validation checks, such as row-count comparisons between the source and the gold layer, to ensure data quality (a simple example is sketched after this list).
- Error Handling: Implement error handling and logging mechanisms to track and resolve issues.
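One simple validation, sketched below for a Databricks notebook, compares row counts between an on-prem source table (read over JDBC) and the corresponding gold-layer Delta table. The JDBC details, credentials, and paths are placeholders:

```python
# Placeholder JDBC details for the on-prem source; spark is the notebook's SparkSession.
jdbc_url = "jdbc:sqlserver://<on-prem-server>:1433;databaseName=AdventureWorksLT"
source_count = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "SalesLT.Address")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
    .count()
)

# Row count of the corresponding gold-layer Delta table (placeholder path).
gold_count = spark.read.format("delta").load("/mnt/gold/SalesLT/Address").count()

# Fail loudly if the counts diverge so the issue surfaces in pipeline monitoring.
assert source_count == gold_count, (
    f"Row count mismatch for SalesLT.Address: source={source_count}, gold={gold_count}"
)
```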
For any questions or inquiries, please contact me at [email protected]