Retail Data Pipeline

2024-12-1

Overview

An ELT pipeline in which I Extract the data with a Lambda function written in Python, Load it into a staging table inside the database, and Transform it into a normalized form using PostgreSQL functions and triggers.
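The load-then-transform-by-trigger pattern can be sketched as follows. This is only an illustration, not the project's actual schema: it swaps PostgreSQL for an in-memory SQLite database so it runs anywhere, and every table, column, and trigger name (`staging_sales`, `dim_customer`, `dim_product`, `fact_sales`, `normalize_staging`) is an assumption made up for the example. In PostgreSQL the trigger body would live in a PL/pgSQL function instead.

```python
import sqlite3

# Hypothetical schema: all names below are illustrative, not taken from the project.
SCHEMA = """
CREATE TABLE staging_sales (
    order_id   INTEGER,
    customer   TEXT,
    product    TEXT,
    quantity   INTEGER,
    unit_price REAL
);
CREATE TABLE dim_customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT UNIQUE
);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT UNIQUE
);
CREATE TABLE fact_sales (
    order_id    INTEGER,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    quantity    INTEGER,
    unit_price  REAL
);
-- Fires once per staged row and spreads it across the related tables.
CREATE TRIGGER normalize_staging AFTER INSERT ON staging_sales
BEGIN
    INSERT OR IGNORE INTO dim_customer (name) VALUES (NEW.customer);
    INSERT OR IGNORE INTO dim_product (name) VALUES (NEW.product);
    INSERT INTO fact_sales (order_id, customer_id, product_id, quantity, unit_price)
    VALUES (
        NEW.order_id,
        (SELECT customer_id FROM dim_customer WHERE name = NEW.customer),
        (SELECT product_id FROM dim_product WHERE name = NEW.product),
        NEW.quantity,
        NEW.unit_price
    );
END;
"""


def build_demo_db():
    """Create an in-memory database with the staging table, star schema, and trigger."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def load_rows(conn, rows):
    """Insert raw rows into staging; the trigger performs the transform."""
    conn.executemany("INSERT INTO staging_sales VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
```

The loader only ever writes to the staging table; the normalized tables fill in as a side effect of each insert, which is the division of labor described above.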

Data Flow

  1. EC2: a temporary instance that runs a Python script to generate mock data and place it in an S3 bucket.
  2. S3: acts as the data lake.
  3. RDS: a PostgreSQL database ready for an OLTP workload, initialized and configured with a staging table, a set of related tables in a star schema, and a function that transforms/normalizes the data and is triggered by inserts into the staging table.
  4. Lambda: a serverless function that is invoked for every file uploaded to the S3 bucket, connects to the database, and inserts the contents of the file into the staging table.
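A minimal sketch of what the Lambda step could look like, under several assumptions that are not from the project itself: the uploaded files are CSV with a header row, the staging table and column names (`staging_sales`, `order_id`, and so on) are hypothetical, connection settings come from hypothetical environment variables (`DB_HOST`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`), and `psycopg2` is packaged with the function since it is not part of the default Lambda runtime.

```python
import csv
import io
import os
from urllib.parse import unquote_plus

# Hypothetical names, used only for illustration.
STAGING_TABLE = "staging_sales"
COLUMNS = ("order_id", "customer", "product", "quantity", "unit_price")


def s3_objects(event):
    """Yield (bucket, key) for each record in an S3 event notification."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def parse_rows(body):
    """Parse CSV text with a header row into tuples ordered like COLUMNS."""
    reader = csv.DictReader(io.StringIO(body))
    return [tuple(row[col] for col in COLUMNS) for row in reader]


def handler(event, context):
    # Imported here so the helpers above can be tried without AWS or a database.
    import boto3
    import psycopg2

    insert_sql = "INSERT INTO {} ({}) VALUES ({})".format(
        STAGING_TABLE, ", ".join(COLUMNS), ", ".join(["%s"] * len(COLUMNS))
    )
    s3 = boto3.client("s3")
    conn = psycopg2.connect(
        host=os.environ["DB_HOST"],
        dbname=os.environ["DB_NAME"],
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
    )
    try:
        # "with conn" commits on success and rolls back on error; it does not close.
        with conn, conn.cursor() as cur:
            for bucket, key in s3_objects(event):
                obj = s3.get_object(Bucket=bucket, Key=key)
                rows = parse_rows(obj["Body"].read().decode("utf-8"))
                if rows:
                    cur.executemany(insert_sql, rows)
    finally:
        conn.close()
```

The handler only writes to the staging table; normalization is left to the database trigger, so the function stays small and the transform logic lives in one place.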

GitHub