
Building a Serverless ETL Pipeline on AWS: From Raw Data to Interactive Dashboards
How I built a complete data analytics pipeline using AWS serverless services—processing CSV files, cleaning data, enabling SQL analytics, and serving insights through interactive dashboards—all for less than $2/month.
Peter Mangoro
In today's data-driven world, businesses need efficient ways to process, analyze, and visualize their data. As part of my cloud computing coursework, I set out to build a production-ready ETL (Extract, Transform, Load) pipeline that could handle e-commerce sales data from ingestion to visualization—all using serverless AWS services.
The result? A fully automated pipeline that processes CSV files, cleans and transforms data, stores it efficiently, and serves it through interactive dashboards—all while staying within AWS's free tier for most operations.
The Challenge
The goal was ambitious: create an end-to-end data pipeline that:
- Automatically processes incoming sales data
- Cleans and transforms raw CSV files
- Stores data in an optimized format for analytics
- Enables SQL queries on the processed data
- Provides real-time visualizations through web dashboards
- Scales automatically without managing servers
- Costs less than $2/month
Architecture Overview
The pipeline follows a modern serverless architecture pattern:
CSV Upload → S3 Raw → Lambda Trigger → Glue ETL → S3 Clean (Parquet)
→ Athena → QuickSight/Nuxt Dashboard → CloudFront (HTTPS)
Key Components
Data Ingestion: CSV files are uploaded to an S3 bucket, triggering an automated workflow.
ETL Processing: AWS Glue processes the data, cleaning inconsistencies, validating fields, and transforming it into a queryable format.
Storage: Cleaned data is stored in Parquet format (a columnar storage format) partitioned by date for efficient querying.
Analytics: Amazon Athena enables SQL queries directly on the Parquet files without loading data into a database.
Visualization: Two interfaces provide insights:
- A custom Nuxt 4 dashboard with interactive charts
- Amazon QuickSight dashboards for business intelligence
Delivery: CloudFront CDN serves the dashboard globally with HTTPS encryption.
The Journey: 8 Phases of Development
Phase 1: Infrastructure Setup
I started by creating the foundational AWS resources:
- 5 S3 buckets for different data layers (raw, clean, dashboard, scripts, results)
- IAM roles with least-privilege permissions
- Bucket policies and CORS configurations
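To make "least privilege" concrete, here is an illustrative IAM policy for the Glue job role: read-only on the raw bucket, write-only on the clean bucket. The bucket names are placeholders, not the project's actual resources.

```python
import json

# Illustrative least-privilege policy for the Glue job role.
# Bucket names are placeholders for this project's actual buckets.
GLUE_ROLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadRawData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::sales-raw-bucket",
                "arn:aws:s3:::sales-raw-bucket/*",
            ],
        },
        {
            "Sid": "WriteCleanData",
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::sales-clean-bucket/*"],
        },
    ],
}

print(json.dumps(GLUE_ROLE_POLICY, indent=2))
```

Scoping each statement to a single bucket (and direction) means a compromised or buggy job can neither read the clean data it shouldn't need nor overwrite the raw source of truth.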
This phase taught me the importance of proper infrastructure planning and security from the start.
Phase 2: Building the ETL Pipeline
The heart of the system is the ETL process. I built:
Lambda Functions:
- sales-data-generator: Generates sample CSV data with intentional quality issues for testing
- sales-glue-trigger: Automatically starts ETL jobs when new files arrive
AWS Glue ETL Job: A PySpark script that:
- Reads CSV files from the raw S3 bucket
- Removes records with missing critical fields
- Standardizes text fields (trimming, case normalization)
- Fixes missing prices and amounts
- Validates and filters invalid dates
- Extracts date components for partitioning
- Calculates derived fields (discount amounts, net amounts)
- Writes cleaned data as Parquet files partitioned by year/month/day
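The actual job runs these steps as PySpark transformations; the per-record logic can be sketched in plain Python for illustration. Column names such as `order_date` and `discount_pct` are assumptions, not the project's exact schema.

```python
from datetime import datetime

CRITICAL_FIELDS = ("order_id", "product", "order_date")  # assumed column names


def clean_record(row: dict):
    """Apply the pipeline's cleaning rules to one CSV row; None means drop it."""
    # 1. Drop records missing critical fields.
    if any(not row.get(f) for f in CRITICAL_FIELDS):
        return None

    # 2. Standardize text fields: trim whitespace, normalize case.
    row["region"] = row.get("region", "").strip().upper() or "UNKNOWN"
    row["status"] = row.get("status", "").strip().lower()
    row["customer_name"] = row.get("customer_name", "").strip() or "Unknown"

    # 3. Fix missing prices: recompute unit price from amount and quantity.
    if not row.get("unit_price") and not row.get("amount"):
        return None  # no price information is recoverable
    qty = int(row.get("quantity") or 1)
    if not row.get("unit_price"):
        row["unit_price"] = float(row["amount"]) / qty

    # 4. Validate dates and extract the partition columns.
    try:
        dt = datetime.strptime(row["order_date"], "%Y-%m-%d")
    except ValueError:
        return None
    row["year"], row["month"], row["day"] = dt.year, dt.month, dt.day

    # 5. Derived fields: discount and net amounts.
    amount = float(row.get("amount") or float(row["unit_price"]) * qty)
    discount = amount * float(row.get("discount_pct") or 0)
    row["discount_amount"] = round(discount, 2)
    row["net_amount"] = round(amount - discount, 2)
    return row
```

In the real job each rule maps to a DataFrame operation (`dropna`, `withColumn`, a date filter), which lets Spark apply them in parallel across the whole file.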
Key Learning: The ETL job uses Glue job bookmarks to process only new data incrementally, avoiding reprocessing of existing files.
Phase 3: Enabling SQL Analytics with Athena
I configured Amazon Athena to query the cleaned Parquet files:
- Created a Glue Data Catalog database (sales_analytics)
- Defined a table schema with partition projection (eliminating the need for manual partition management)
- Wrote 12+ analytics queries covering:
- Total revenue by month
- Top products by revenue
- Regional sales distribution
- Category performance
- Order status analysis
Performance Optimization: By partitioning data by date and using partition projection, Athena scans only the relevant partitions, reducing query time and cost.
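A table definition along these lines enables partition projection; the `projection.*` table properties are standard Athena settings, while the column list and bucket name here are simplified placeholders for the project's real schema.

```python
# Athena DDL for the cleaned table, using partition projection so new
# year/month/day partitions are resolved without MSCK REPAIR TABLE.
# Table, column, and bucket names are placeholders.
DDL = """
CREATE EXTERNAL TABLE sales_analytics.sales_clean (
    order_id   string,
    product    string,
    region     string,
    status     string,
    quantity   int,
    net_amount double
)
PARTITIONED BY (year int, month int, day int)
STORED AS PARQUET
LOCATION 's3://sales-clean-bucket/clean/sales/'
TBLPROPERTIES (
    'projection.enabled'     = 'true',
    'projection.year.type'   = 'integer',
    'projection.year.range'  = '2020,2030',
    'projection.month.type'  = 'integer',
    'projection.month.range' = '1,12',
    'projection.day.type'    = 'integer',
    'projection.day.range'   = '1,31',
    'storage.location.template' =
        's3://sales-clean-bucket/clean/sales/year=${year}/month=${month}/day=${day}'
)
"""

print(DDL)
```

With projection enabled, Athena computes the candidate partition locations from the declared ranges at query time instead of reading them from the catalog, so new daily partitions are queryable the moment the Parquet files land.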
Phase 4: Building the Frontend Dashboard
I developed a modern web dashboard using:
- Nuxt 4 with Vue 3 (static site generation)
- Nuxt UI (Tailwind CSS-based component library)
- Chart.js for data visualizations
The dashboard features:
- Real-time analytics cards showing key metrics
- Interactive charts (revenue trends, top products, regional sales, category performance)
- Dark theme support
- Auto-refresh every 5 minutes
- Responsive design for mobile and desktop
Components Built:
- StatsCard.vue: Key metrics display
- RevenueChart.vue: Revenue trends over time
- ProductChart.vue: Top products bar chart
- RegionChart.vue: Regional sales doughnut chart
- CategoryChart.vue: Category performance analysis
Phase 5: CloudFront Deployment
To make the dashboard production-ready, I:
- Built the Nuxt application as a static site
- Uploaded it to an S3 bucket configured for static website hosting
- Created a CloudFront distribution with:
- HTTPS-only access (HTTP redirects to HTTPS)
- Global CDN for low latency
- Custom error pages for SPA routing
- Compression enabled
- Optimized caching headers
Result: The dashboard is now accessible at https://d1du5nvx9lhzdt.cloudfront.net with global edge caching.
Phase 6: QuickSight Integration
I integrated Amazon QuickSight for advanced business intelligence:
- Connected QuickSight to the Athena data source
- Created datasets with SPICE (in-memory caching)
- Configured automatic daily data refresh
- Built interactive dashboards with filters and drill-downs
QuickSight provides a more powerful visualization tool for stakeholders who need deeper analytics capabilities.
Phase 7: API Gateway for Real-Time Data
To connect the frontend to live data, I created:
- API Gateway REST API endpoint (/analytics)
- Lambda function (analytics-api) that:
  - Executes parallel Athena queries
  - Aggregates revenue, product, region, and category data
  - Returns JSON responses for the frontend
  - Handles CORS headers
This enables the Nuxt dashboard to fetch real-time analytics data instead of using mock data.
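The response shape for an API Gateway proxy integration looks roughly like this; the payload keys are illustrative stand-ins, not the API's documented contract.

```python
import json


def make_response(payload: dict, status: int = 200) -> dict:
    """Shape a Lambda proxy-integration response with CORS headers."""
    return {
        "statusCode": status,
        "headers": {
            # In production, pin this to the CloudFront domain instead of "*".
            "Access-Control-Allow-Origin": "*",
            "Content-Type": "application/json",
        },
        "body": json.dumps(payload),
    }


def lambda_handler(event, context):
    # The real analytics-api function runs several Athena queries in
    # parallel via boto3 and merges their results; this payload is a
    # stand-in for that aggregated shape.
    payload = {
        "revenueByMonth": [],
        "topProducts": [],
        "salesByRegion": [],
        "categoryPerformance": [],
    }
    return make_response(payload)
```

Returning the CORS header from the Lambda itself (not only from API Gateway's OPTIONS handling) matters because the browser checks it on the actual GET response, not just the preflight.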
Phase 8: Automation Scripts
I developed automation scripts for:
- QuickSight setup (data sources, datasets, refresh schedules)
- Complete pipeline testing
- Frontend deployment
- Resource cleanup
Technical Highlights
Data Transformation
The ETL process handles common data quality issues:
- Missing Values: Defaults missing customer names to "Unknown", calculates missing prices
- Invalid Dates: Filters out invalid dates and standardizes date formats
- Inconsistent Formatting: Normalizes text fields (uppercase regions, lowercase status)
- Data Validation: Removes records with missing critical fields
Storage Optimization
Parquet Format: Columnar storage reduces file size by ~75% compared to CSV while enabling efficient column-based queries.
Partitioning: Data is partitioned by year/month/day, allowing Athena to scan only relevant partitions:
clean/sales/year=2025/month=12/day=15/data.parquet
Result: Faster queries and lower costs (Athena charges per TB scanned).
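The Hive-style prefix shown above is just string assembly; a tiny helper mirroring that layout:

```python
from datetime import date


def partition_key(d: date, prefix: str = "clean/sales") -> str:
    """Build the Hive-style partition prefix Athena prunes on."""
    return f"{prefix}/year={d.year}/month={d.month}/day={d.day}"
```

A query filtered to `WHERE year = 2025 AND month = 12` then touches only the objects under the matching prefixes and skips everything else in the bucket.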
Serverless Benefits
- No Infrastructure Management: All services are fully managed
- Automatic Scaling: Handles traffic spikes without configuration
- Cost-Effective: Pay only for what you use
- High Availability: Built-in redundancy and failover
Cost Analysis
One of the project's goals was cost optimization. Here's the breakdown:
| Service | Monthly Cost | Notes |
|---|---|---|
| S3 Storage | $0 | Within free tier (5GB) |
| Lambda | $0 | Within free tier (1M requests) |
| Glue | ~$1-2 | ~$0.03 per job run |
| Athena | $0 | Within free tier (10GB scanned) |
| CloudFront | $0 | Within free tier (1TB transfer) |
| API Gateway | $0 | Within free tier (1M calls) |
| QuickSight | $0 | Standard edition (free) |
| Total | ~$1-2/month | Mostly Glue job runs |
Challenges and Solutions
Challenge 1: SPA Routing with CloudFront
Problem: CloudFront returned 404 errors for client-side routes.
Solution: Configured CloudFront error pages to redirect 404s to index.html, allowing Vue Router to handle routing.
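In the boto3/CloudFormation `DistributionConfig` shape, that fix is the `CustomErrorResponses` block, roughly as sketched here:

```python
# CustomErrorResponses fragment of a CloudFront DistributionConfig:
# rewrite the S3 origin's 403/404 errors to index.html with a 200
# status so Vue Router can resolve the route client-side.
CUSTOM_ERROR_RESPONSES = {
    "Quantity": 2,
    "Items": [
        {
            "ErrorCode": 404,
            "ResponsePagePath": "/index.html",
            "ResponseCode": "200",
            "ErrorCachingMinTTL": 0,
        },
        {
            # S3 returns 403 rather than 404 for missing keys when
            # public listing is blocked, so map it the same way.
            "ErrorCode": 403,
            "ResponsePagePath": "/index.html",
            "ResponseCode": "200",
            "ErrorCachingMinTTL": 0,
        },
    ],
}
```

Mapping 403 as well as 404 is the non-obvious part: with bucket listing disabled, S3 reports a missing key as "access denied", so handling only 404 still breaks deep links.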
Challenge 2: Partition Management
Problem: Traditional Athena tables require running MSCK REPAIR TABLE after adding partitions.
Solution: Used partition projection, which automatically discovers partitions based on a pattern, eliminating manual maintenance.
Challenge 3: CORS Configuration
Problem: Frontend couldn't access API Gateway due to CORS errors.
Solution: Configured CORS headers in both API Gateway and Lambda responses.
Challenge 4: Glue Job Performance
Problem: Initial Glue jobs were slow and expensive.
Solution: Optimized by:
- Using job bookmarks for incremental processing
- Partitioning output data
- Using appropriate worker types (G.1X for this workload)
Key Learnings
- Serverless Architecture: Building with serverless services eliminates infrastructure management overhead and enables automatic scaling.
- Data Format Matters: Parquet format provides significant storage and query performance benefits over CSV for analytics workloads.
- Partitioning Strategy: Proper partitioning is crucial for cost-effective analytics at scale.
- Event-Driven Design: Using S3 event notifications to trigger Lambda functions creates a truly automated pipeline.
- Security First: IAM roles with least-privilege access are essential for production systems.
- Cost Optimization: AWS free tiers make it possible to build and test production-like systems at minimal cost.
Results
✅ Complete ETL Pipeline: Automated data processing from ingestion to analytics
✅ 7 AWS Services Integrated: S3, Lambda, Glue, Athena, CloudFront, QuickSight, API Gateway
✅ Production-Ready Dashboard: HTTPS-enabled, globally distributed
✅ Cost-Effective: ~$1-2/month operational cost
✅ Scalable: Handles growth automatically
✅ Well-Documented: Comprehensive documentation for all phases
Live Demo
🌐 Dashboard: https://d1du5nvx9lhzdt.cloudfront.net
The dashboard is live and accessible via CloudFront CDN with HTTPS encryption.
Future Enhancements
Potential improvements for production use:
- Real-Time Processing: Integrate Amazon Kinesis for streaming data
- Machine Learning: Use SageMaker for sales predictions
- Alerting: SNS notifications for data quality issues
- Multi-Region: Cross-region replication for disaster recovery
- Data Lake: Expand to handle multiple data sources
- Advanced Analytics: Add time-series analysis and forecasting
Conclusion
This project demonstrated the power of AWS serverless services to build a complete data analytics pipeline. By leveraging managed services, I was able to focus on business logic rather than infrastructure management, resulting in a scalable, cost-effective solution.
The architecture pattern used here—S3 for storage, Lambda for event processing, Glue for ETL, Athena for analytics, and CloudFront for delivery—can be adapted for various data analytics use cases, from IoT sensor data to financial transactions.
Whether you're a data engineer, developer, or business analyst, understanding how these services work together opens up powerful possibilities for building modern data solutions.
Technical Stack
- Cloud: AWS (S3, Lambda, Glue, Athena, CloudFront, QuickSight, API Gateway)
- Frontend: Nuxt 4, Vue 3, Nuxt UI, Chart.js
- ETL: AWS Glue (PySpark)
- Analytics: Amazon Athena (SQL)
- Visualization: QuickSight, Chart.js
- Infrastructure: IAM, CloudWatch
Project Statistics
- Total Files Created: 50+
- AWS Resources: 15+
- Lines of Code: 2000+
- Documentation Pages: 10+
- Queries Created: 12+
- Components Built: 6+
- Phases Completed: 8/8
This project was completed as part of a cloud computing course in December 2025. All code and documentation are available in the project repository.