Retail Data Pipeline
2024-12-1
Overview
This is an ELT pipeline: I extract the data with a Lambda function written in Python, load it into a staging table in the database, and transform it into a normalized form using PostgreSQL functions and triggers.
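To make the staging-then-transform idea concrete, here is a minimal sketch of a trigger that normalizes each staged row. It is not the project's actual schema: SQLite (from the Python standard library) stands in for PostgreSQL so the example runs anywhere, and every table and column name (`staging_sales`, `dim_store`, `dim_product`, `fact_sales`, `normalize_sale`) is a hypothetical placeholder.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staging_sales (
    store_name   TEXT,
    product_name TEXT,
    quantity     INTEGER,
    unit_price   REAL
);
CREATE TABLE dim_store (
    store_id   INTEGER PRIMARY KEY,
    store_name TEXT UNIQUE
);
CREATE TABLE dim_product (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT UNIQUE
);
CREATE TABLE fact_sales (
    store_id   INTEGER REFERENCES dim_store(store_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    quantity   INTEGER,
    unit_price REAL
);
-- Fires once per staged row: fill the dimensions first, then the fact table.
CREATE TRIGGER normalize_sale AFTER INSERT ON staging_sales
BEGIN
    INSERT OR IGNORE INTO dim_store (store_name) VALUES (NEW.store_name);
    INSERT OR IGNORE INTO dim_product (product_name) VALUES (NEW.product_name);
    INSERT INTO fact_sales (store_id, product_id, quantity, unit_price)
    SELECT s.store_id, p.product_id, NEW.quantity, NEW.unit_price
    FROM dim_store s, dim_product p
    WHERE s.store_name = NEW.store_name
      AND p.product_name = NEW.product_name;
END;
""")

# Loading raw rows into staging is the only step the caller performs.
rows = [
    ("Downtown", "Widget", 2, 9.99),
    ("Downtown", "Gadget", 1, 24.50),
    ("Airport", "Widget", 5, 9.99),
]
conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?, ?)", rows)
conn.commit()

store_count = conn.execute("SELECT COUNT(*) FROM dim_store").fetchone()[0]
product_count = conn.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
fact_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(store_count, product_count, fact_count)
```

In PostgreSQL the same shape would typically be a trigger function written in PL/pgSQL and attached with `CREATE TRIGGER ... AFTER INSERT ... FOR EACH ROW`, with `ON CONFLICT DO NOTHING` playing the role of `INSERT OR IGNORE`.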
Data Flow
- EC2: a temporary instance that runs a Python script to generate mock data and place it in an S3 bucket.
- S3: acts as the data lake.
- RDS: a PostgreSQL database ready for an OLTP workload, initialized and configured with a staging table, a set of related tables in a star schema, and a function that transforms/normalizes the data and is triggered by inserts into the staging table.
- Lambda: a serverless function that is invoked for every file uploaded to the S3 bucket, connects to the database, and inserts the contents of the file into the staging table.
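The Lambda step in the flow above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual function: the files are assumed to be CSV with a header line, and `make_handler`, `fetch_object`, and `insert_rows` are hypothetical names. The S3 read and the database insert are passed in as callables so the sketch runs without AWS or a database; in a real deployment they would likely wrap boto3's `get_object` and a PostgreSQL driver.

```python
import csv
import io
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def parse_csv_rows(body):
    """Parse CSV text (assumed to start with a header line) into tuples."""
    reader = csv.reader(io.StringIO(body))
    next(reader, None)  # skip the header line
    return [tuple(row) for row in reader if row]


def make_handler(fetch_object, insert_rows):
    """Build a Lambda-style handler from two injected callables.

    fetch_object(bucket, key) -> str   (hypothetical; reads the uploaded file)
    insert_rows(rows) -> None          (hypothetical; writes to the staging table)
    """
    def handler(event, context):
        inserted = 0
        for bucket, key in s3_objects_from_event(event):
            rows = parse_csv_rows(fetch_object(bucket, key))
            insert_rows(rows)
            inserted += len(rows)
        return {"inserted": inserted}
    return handler


# Local usage with in-memory stand-ins for S3 and the database.
fake_files = {
    ("mock-bucket", "sales/day 1.csv"): "store,product,qty\nDowntown,Widget,2\nAirport,Gadget,1\n",
}
staged = []
handler = make_handler(
    fetch_object=lambda bucket, key: fake_files[(bucket, key)],
    insert_rows=staged.extend,
)
event = {
    "Records": [
        {"s3": {"bucket": {"name": "mock-bucket"}, "object": {"key": "sales/day+1.csv"}}}
    ]
}
result = handler(event, None)
print(result, staged)
```

Injecting the two callables keeps the parsing logic testable on its own; the handler never needs to know about the normalized tables, because the insert into the staging table is what fires the transform.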