
Building a Serverless ETL Pipeline on AWS: From Raw Data to Interactive Dashboards
How I built a complete data analytics pipeline using AWS serverless services—processing CSV files, cleaning data, enabling SQL analytics, and serving insights through interactive dashboards—all for less than $2/month.
Peter Mangoro
In today's data-driven world, businesses need efficient ways to process, analyze, and visualize their data. As part of my cloud computing coursework, I set out to build a production-ready ETL (Extract, Transform, Load) pipeline that could handle e-commerce sales data from ingestion to visualization—all using serverless AWS services.
The result? A fully automated pipeline that processes CSV files, cleans and transforms data, stores it efficiently, and serves it through interactive dashboards—all while staying within AWS's free tier for most operations.
The Challenge
The goal was ambitious: create an end-to-end data pipeline that:
- Automatically processes incoming sales data
- Cleans and transforms raw CSV files
- Stores data in an optimized format for analytics
- Enables SQL queries on the processed data
- Provides real-time visualizations through web dashboards
- Scales automatically without managing servers
- Costs less than $2/month
Architecture Overview
The pipeline follows a modern serverless architecture pattern:
CSV Upload → S3 Raw → Lambda Trigger → Glue ETL → S3 Clean (Parquet)
→ Athena → QuickSight/Nuxt Dashboard → CloudFront (HTTPS)
Key Components
Data Ingestion: CSV files are uploaded to an S3 bucket, triggering an automated workflow.
ETL Processing: AWS Glue processes the data, cleaning inconsistencies, validating fields, and transforming it into a queryable format.
Storage: Cleaned data is stored in Parquet format (a columnar storage format) partitioned by date for efficient querying.
Analytics: Amazon Athena enables SQL queries directly on the Parquet files without loading data into a database.
Visualization: Two interfaces provide insights:
- A custom Nuxt 4 dashboard with interactive charts
- Amazon QuickSight dashboards for business intelligence
Delivery: CloudFront CDN serves the dashboard globally with HTTPS encryption.
The Journey: 8 Phases of Development
Phase 1: Infrastructure Setup
I started by creating the foundational AWS resources:
- 5 S3 buckets for different data layers (raw, clean, dashboard, scripts, results)
- IAM roles with least-privilege permissions
- Bucket policies and CORS configurations
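To make "least privilege" concrete, here is an illustrative IAM policy for the Glue job role: read-only on the raw bucket, write-only on the clean bucket. The bucket names are placeholders, not the project's actual resources.

```python
import json

# Illustrative least-privilege policy for the Glue job role.
# Bucket names are placeholders for this project's actual buckets.
GLUE_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadRawData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::sales-raw-bucket",
                "arn:aws:s3:::sales-raw-bucket/*",
            ],
        },
        {
            "Sid": "WriteCleanData",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::sales-clean-bucket/*"],
        },
    ],
}

print(json.dumps(GLUE_ROLE_POLICY, indent=2))
```

Scoping each statement to a single bucket (and direction) means a compromised or buggy job can neither read the clean data it shouldn't need nor overwrite the raw source of truth.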
This phase taught me the importance of proper infrastructure planning and security from the start.
Phase 2: Building the ETL Pipeline
The heart of the system is the ETL process. I built:
Lambda Functions:
- sales-data-generator: Generates sample CSV data with intentional quality issues for testing
- sales-glue-trigger: Automatically starts ETL jobs when new files arrive
AWS Glue ETL Job: A PySpark script that:
- Reads CSV files from the raw S3 bucket
- Removes records with missing critical fields
- Standardizes text fields (trimming, case normalization)
- Fixes missing prices and amounts
- Validates and filters invalid dates
- Extracts date components for partitioning
- Calculates derived fields (discount amounts, net amounts)
- Writes cleaned data as Parquet files partitioned by year/month/day
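The actual job runs these steps as PySpark transformations; the per-record logic can be sketched in plain Python for illustration. Column names such as `order_date` and `discount_pct` are assumptions, not the project's exact schema.

```python
from datetime import datetime

CRITICAL_FIELDS = ("order_id", "product", "order_date")  # assumed column names


def clean_record(row: dict):
    """Apply the pipeline's cleaning rules to one CSV row; None means drop it."""
    # 1. Drop records missing critical fields.
    if any(not row.get(f) for f in CRITICAL_FIELDS):
        return None

    # 2. Standardize text fields: trim whitespace, normalize case.
    row["region"] = row.get("region", "").strip().upper() or "UNKNOWN"
    row["status"] = row.get("status", "").strip().lower()
    row["customer_name"] = row.get("customer_name", "").strip() or "Unknown"

    # 3. Fix missing prices: recompute unit price from amount and quantity.
    if not row.get("unit_price") and not row.get("amount"):
        return None  # no price information is recoverable
    qty = int(row.get("quantity") or 1)
    if not row.get("unit_price"):
        row["unit_price"] = float(row["amount"]) / qty

    # 4. Validate dates and extract the partition columns.
    try:
        dt = datetime.strptime(row["order_date"], "%Y-%m-%d")
    except ValueError:
        return None
    row["year"], row["month"], row["day"] = dt.year, dt.month, dt.day

    # 5. Derived fields: discount and net amounts.
    amount = float(row.get("amount") or float(row["unit_price"]) * qty)
    discount = amount * float(row.get("discount_pct") or 0)
    row["discount_amount"] = round(discount, 2)
    row["net_amount"] = round(amount - discount, 2)
    return row
```

In the real job each rule maps to a DataFrame operation (`dropna`, `withColumn`, a date filter), which lets Spark apply them in parallel across the whole file.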
Key Learning: The ETL job uses Glue job bookmarks to process only new data incrementally, avoiding reprocessing of existing files.
Phase 3: Enabling SQL Analytics with Athena
I configured Amazon Athena to query the cleaned Parquet files:
- Created a Glue Data Catalog database (sales_analytics)
- Defined a table schema with partition projection (eliminating the need for manual partition management)
- Wrote 12+ analytics queries covering:
- Total revenue by month
- Top products by revenue
- Regional sales distribution
- Category performance
- Order status analysis
Performance Optimization: By partitioning data by date and using partition projection, Athena scans only the relevant partitions, reducing query time and cost.
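A table definition along these lines enables partition projection; the `projection.*` table properties are standard Athena settings, while the column list and bucket name here are simplified placeholders for the project's real schema.

```python
# Athena DDL for the cleaned table, using partition projection so new
# year/month/day partitions are resolved without MSCK REPAIR TABLE.
# Table, column, and bucket names are placeholders.
DDL = """
CREATE EXTERNAL TABLE sales_analytics.sales_clean (
    order_id   string,
    product    string,
    region     string,
    status     string,
    quantity   int,
    net_amount double
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://sales-clean-bucket/clean/sales/'
TBLPROPERTIES (
    'projection.enabled'     = 'true',
    'projection.year.type'   = 'integer',
    'projection.year.range'  = '2020,2030',
    'projection.month.type'  = 'integer',
    'projection.month.range' = '1,12',
    'projection.day.type'    = 'integer',
    'projection.day.range'   = '1,31',
    'storage.location.template' =
        's3://sales-clean-bucket/clean/sales/year=${year}/month=${month}/day=${day}'
)
"""

print(DDL)
```

With projection enabled, Athena computes the candidate partition locations from the declared ranges at query time instead of reading them from the catalog, so new daily partitions are queryable the moment the Parquet files land.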
Phase 4: Building the Frontend Dashboard
I developed a modern web dashboard using:
- Nuxt 4 with Vue 3 (static site generation)
- Nuxt UI (Tailwind CSS-based component library)
- Chart.js for data visualizations
The dashboard features:
- Real-time analytics cards showing key metrics
- Interactive charts (revenue trends, top products, regional sales, category performance)
- Dark theme support
- Auto-refresh every 5 minutes
- Responsive design for mobile and desktop
Components Built:
- StatsCard.vue: Key metrics display
- RevenueChart.vue: Revenue trends over time
- ProductChart.vue: Top products bar chart
- RegionChart.vue: Regional sales doughnut chart
- CategoryChart.vue: Category performance analysis
Phase 5: CloudFront Deployment
To make the dashboard production-ready, I:
- Built the Nuxt application as a static site
- Uploaded it to an S3 bucket configured for static website hosting
- Created a CloudFront distribution with:
- HTTPS-only access (HTTP redirects to HTTPS)
- Global CDN for low latency
- Custom error pages for SPA routing
- Compression enabled
- Optimized caching headers
Result: The dashboard is now accessible at https://d1du5nvx9lhzdt.cloudfront.net with global edge caching.
Phase 6: QuickSight Integration
I integrated Amazon QuickSight for advanced business intelligence:
- Connected QuickSight to the Athena data source
- Created datasets with SPICE (in-memory caching)
- Configured automatic daily data refresh
- Built interactive dashboards with filters and drill-downs
QuickSight provides a more powerful visualization tool for stakeholders who need deeper analytics capabilities.
Phase 7: API Gateway for Real-Time Data
To connect the frontend to live data, I created:
- API Gateway REST API endpoint (/analytics)
- Lambda function (analytics-api) that:
  - Executes parallel Athena queries
  - Aggregates revenue, product, region, and category data
  - Returns JSON responses for the frontend
  - Handles CORS headers
This enables the Nuxt dashboard to fetch real-time analytics data instead of using mock data.
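The response shape for an API Gateway proxy integration looks roughly like this; the payload keys are illustrative stand-ins, not the API's documented contract.

```python
import json


def make_response(payload: dict, status: int = 200) -> dict:
    """Shape a Lambda proxy-integration response with CORS headers."""
    return {
        "statusCode": status,
        "headers": {
            # In production, pin this to the CloudFront domain instead of "*".
            "Access-Control-Allow-Origin": "*",
            "Content-Type": "application/json",
        },
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    # The real analytics-api function runs several Athena queries in
    # parallel via boto3 and merges their results; this payload is a
    # stand-in for that aggregated shape.
    payload = {
        "revenueByMonth": [],
        "topProducts": [],
        "salesByRegion": [],
        "categoryPerformance": [],
    }
    return make_response(payload)
```

Returning the CORS header from the Lambda itself (not only from API Gateway's OPTIONS handling) matters because the browser checks it on the actual GET response, not just the preflight.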
Phase 8: Automation Scripts
I developed automation scripts for:
- QuickSight setup (data sources, datasets, refresh schedules)
- Complete pipeline testing
- Frontend deployment
- Resource cleanup
Technical Highlights
Data Transformation
The ETL process handles common data quality issues:
- Missing Values: Defaults missing customer names to "Unknown", calculates missing prices
- Invalid Dates: Filters out invalid dates and standardizes date formats
- Inconsistent Formatting: Normalizes text fields (uppercase regions, lowercase status)
- Data Validation: Removes records with missing critical fields
Storage Optimization
Parquet Format: Columnar storage reduces file size by ~75% compared to CSV while enabling efficient column-based queries.
Partitioning: Data is partitioned by year/month/day, allowing Athena to scan only relevant partitions:
clean/sales/year=2025/month=12/day=15/data.parquet
Result: Faster queries and lower costs (Athena charges per TB scanned).
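The Hive-style prefix shown above is just string assembly; a tiny helper mirroring that layout:

```python
from datetime import date


def partition_key(d: date, prefix: str = "clean/sales") -> str:
    """Build the Hive-style partition prefix Athena prunes on."""
    return f"{prefix}/year={d.year}/month={d.month}/day={d.day}"
```

A query filtered to `WHERE year = 2025 AND month = 12` then touches only the objects under the matching prefixes and skips everything else in the bucket.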
Serverless Benefits
- No Infrastructure Management: All services are fully managed
- Automatic Scaling: Handles traffic spikes without configuration
- Cost-Effective: Pay only for what you use
- High Availability: Built-in redundancy and failover
Cost Analysis
One of the project's goals was cost optimization. Here's the breakdown:
| Service | Monthly Cost | Notes |
|---|---|---|
| S3 Storage | $0 | Within free tier (5GB) |
| Lambda | $0 | Within free tier (1M requests) |
| Glue | ~$1-2 | ~$0.03 per job run |
| Athena | $0 | Within free tier (10GB scanned) |
| CloudFront | $0 | Within free tier (1TB transfer) |
| API Gateway | $0 | Within free tier (1M calls) |
| QuickSight | $0 | Standard edition (free) |
| Total | ~$1-2/month | Mostly Glue job runs |
Challenges and Solutions
Challenge 1: SPA Routing with CloudFront
Problem: CloudFront returned 404 errors for client-side routes.
Solution: Configured CloudFront error pages to redirect 404s to index.html, allowing Vue Router to handle routing.
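In the boto3/CloudFormation `DistributionConfig` shape, that fix is the `CustomErrorResponses` block, roughly as sketched here:

```python
# CustomErrorResponses fragment of a CloudFront DistributionConfig:
# rewrite the S3 origin's 403/404 errors to index.html with a 200
# status so Vue Router can resolve the route client-side.
CUSTOM_ERROR_RESPONSES = {
    "Quantity": 2,
    "Items": [
        {
            "ErrorCode": 404,
            "ResponsePagePath": "/index.html",
            "ResponseCode": "200",
            "ErrorCachingMinTTL": 0,
        },
        {
            # S3 returns 403 rather than 404 for missing keys when
            # public listing is blocked, so map it the same way.
            "ErrorCode": 403,
            "ResponsePagePath": "/index.html",
            "ResponseCode": "200",
            "ErrorCachingMinTTL": 0,
        },
    ],
}
```

Mapping 403 as well as 404 is the non-obvious part: with bucket listing disabled, S3 reports a missing key as "access denied", so handling only 404 still breaks deep links.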
Challenge 2: Partition Management
Problem: Traditional Athena tables require running MSCK REPAIR TABLE after adding partitions.
Solution: Used partition projection, which automatically discovers partitions based on a pattern, eliminating manual maintenance.
Challenge 3: CORS Configuration
Problem: Frontend couldn't access API Gateway due to CORS errors.
Solution: Configured CORS headers in both API Gateway and Lambda responses.
Challenge 4: Glue Job Performance
Problem: Initial Glue jobs were slow and expensive.
Solution: Optimized by:
- Using job bookmarks for incremental processing
- Partitioning output data
- Using appropriate worker types (G.1X for this workload)
Key Learnings
- Serverless Architecture: Building with serverless services eliminates infrastructure management overhead and enables automatic scaling.
- Data Format Matters: Parquet format provides significant storage and query performance benefits over CSV for analytics workloads.
- Partitioning Strategy: Proper partitioning is crucial for cost-effective analytics at scale.
- Event-Driven Design: Using S3 event notifications to trigger Lambda functions creates a truly automated pipeline.
- Security First: IAM roles with least-privilege access are essential for production systems.
- Cost Optimization: AWS free tiers make it possible to build and test production-like systems at minimal cost.
Results
✅ Complete ETL Pipeline: Automated data processing from ingestion to analytics
✅ 7 AWS Services Integrated: S3, Lambda, Glue, Athena, CloudFront, QuickSight, API Gateway
✅ Production-Ready Dashboard: HTTPS-enabled, globally distributed
✅ Cost-Effective: ~$1-2/month operational cost
✅ Scalable: Handles growth automatically
✅ Well-Documented: Comprehensive documentation for all phases
Live Demo
🌐 Dashboard: https://d1du5nvx9lhzdt.cloudfront.net
The dashboard is live and accessible via CloudFront CDN with HTTPS encryption.
Future Enhancements
Potential improvements for production use:
- Real-Time Processing: Integrate Amazon Kinesis for streaming data
- Machine Learning: Use SageMaker for sales predictions
- Alerting: SNS notifications for data quality issues
- Multi-Region: Cross-region replication for disaster recovery
- Data Lake: Expand to handle multiple data sources
- Advanced Analytics: Add time-series analysis and forecasting
Conclusion
This project demonstrated the power of AWS serverless services to build a complete data analytics pipeline. By leveraging managed services, I was able to focus on business logic rather than infrastructure management, resulting in a scalable, cost-effective solution.
The architecture pattern used here—S3 for storage, Lambda for event processing, Glue for ETL, Athena for analytics, and CloudFront for delivery—can be adapted for various data analytics use cases, from IoT sensor data to financial transactions.
Whether you're a data engineer, developer, or business analyst, understanding how these services work together opens up powerful possibilities for building modern data solutions.
Technical Stack
- Cloud: AWS (S3, Lambda, Glue, Athena, CloudFront, QuickSight, API Gateway)
- Frontend: Nuxt 4, Vue 3, Nuxt UI, Chart.js
- ETL: AWS Glue (PySpark)
- Analytics: Amazon Athena (SQL)
- Visualization: QuickSight, Chart.js
- Infrastructure: IAM, CloudWatch
Project Statistics
- Total Files Created: 50+
- AWS Resources: 15+
- Lines of Code: 2000+
- Documentation Pages: 10+
- Queries Created: 12+
- Components Built: 6+
- Phases Completed: 8/8
This project was completed as part of a cloud computing course in December 2025. All code and documentation are available in the project repository.