Data pipeline for Lyft trip data (18k+ stars on GitHub): This extensive project builds a data pipeline to ingest, transform, and analyze over 1.5 billion Lyft ride-hailing trips. The ETL pipeline loads raw CSV data from S3 into Redshift, enriches it with additional data from other sources, and stores aggregated metrics back in the warehouse. Visualizations of the cleaned data are then generated using Tableau. Some key aspects of the project include:
Building AWS Lambda functions and AWS Glue ETL jobs in Python to load and transform data in batches (a simplified load step is sketched after this list)
Designing Redshift schemas and tables, including distribution and sort keys, to optimize query performance
Calculating metrics like total rides and revenues by city and over time periods
Deploying the ETL pipelines, database, and visualizations on AWS
Documenting all steps and components of the data pipeline
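To make the batch-load step concrete, here is a minimal sketch of how a Lambda-style handler could create a query-optimized trips table and issue a Redshift COPY from S3. The bucket, IAM role, table schema, and environment variable names are assumptions for illustration, not taken from the project, and the sketch uses psycopg2 directly rather than the project's actual Glue job code.

```python
# Sketch of a batch load step: create a query-optimized trips table in
# Redshift and COPY a raw CSV batch from S3 into it. Bucket, role, and
# credential values below are placeholders, not taken from the project.
import os
import psycopg2

CREATE_TRIPS_TABLE = """
CREATE TABLE IF NOT EXISTS trips (
    trip_id      BIGINT,
    city         VARCHAR(64),
    pickup_ts    TIMESTAMP,
    dropoff_ts   TIMESTAMP,
    fare_usd     DECIMAL(10, 2)
)
DISTKEY (city)          -- co-locate rows for per-city aggregations
SORTKEY (pickup_ts);    -- speed up time-range scans
"""

COPY_BATCH = """
COPY trips
FROM 's3://example-lyft-raw/trips/{batch_key}'
IAM_ROLE '{iam_role}'
CSV IGNOREHEADER 1 TIMEFORMAT 'auto';
"""

def handler(event, context):
    """Lambda-style entry point: load one CSV batch named in the event."""
    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"],
        port=5439,
        dbname=os.environ["REDSHIFT_DB"],
        user=os.environ["REDSHIFT_USER"],
        password=os.environ["REDSHIFT_PASSWORD"],
    )
    try:
        with conn, conn.cursor() as cur:
            cur.execute(CREATE_TRIPS_TABLE)
            cur.execute(COPY_BATCH.format(
                batch_key=event["batch_key"],
                iam_role=os.environ["REDSHIFT_COPY_ROLE_ARN"],
            ))
    finally:
        conn.close()
```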
This would be an excellent capstone project due to the large scale of real-world data, complex ETL process, and end-to-end deployment on cloud infrastructure. Students could learn a lot about architecting production-grade data pipelines.
Data pipeline for NYC taxi trip data (10k+ stars): Similar to the Lyft project but for NYC taxi data, this project builds a real-time streaming ETL pipeline rather than a batch one. It ingests raw taxi trip data from Kafka topics, enriches it with spatial data using Flink jobs, and loads the enriched events into Druid and ClickHouse for real-time analytics. It also includes a dashboard visualizing live statistics. Key aspects include:
Setting up a Kafka cluster to act as the landing zone for raw trip events
Developing Flink jobs that join trip data with location data in a streaming fashion (a simplified enrichment step is sketched after this list)
Configuring Druid and ClickHouse databases for low-latency, real-time queries
Deploying the streaming pipeline on Kubernetes
Building a real-time dashboard using Grafana
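To illustrate the enrichment step without pulling in a full Flink deployment, here is a minimal Python sketch that consumes raw trip events from Kafka, joins each one against a location lookup, and republishes the enriched event. It uses the kafka-python client with hypothetical topic names and a static lookup table purely for illustration; the actual project performs this join inside Flink jobs.

```python
# Illustrative stream-enrichment loop using kafka-python (not Flink):
# read raw trip events, attach location metadata, and emit enriched events.
# Topic names, brokers, and the lookup table are placeholders.
import json
from kafka import KafkaConsumer, KafkaProducer

# In the real pipeline this lookup would come from a reference stream or
# broadcast state inside Flink; a static dict keeps the sketch small.
ZONE_LOOKUP = {
    "237": {"borough": "Manhattan", "zone": "Upper East Side South"},
    "138": {"borough": "Queens", "zone": "LaGuardia Airport"},
}

consumer = KafkaConsumer(
    "taxi.trips.raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

for message in consumer:
    trip = message.value
    zone = ZONE_LOOKUP.get(str(trip.get("pickup_zone_id")), {})
    enriched = {**trip, **zone}          # attach borough/zone fields
    producer.send("taxi.trips.enriched", value=enriched)
```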
This project focuses on streaming ETL and real-time analytics, both highly valuable skills for data engineers. It provides an end-to-end view of how to architect streaming data pipelines.
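On the serving side, the kind of real-time query a live dashboard panel might run against ClickHouse could look like the sketch below; the trips_enriched table, its columns, and the host are assumptions rather than details from the project.

```python
# Sketch of a real-time analytics query against ClickHouse using
# clickhouse-driver; the trips_enriched table and its columns are assumed.
from clickhouse_driver import Client

client = Client(host="localhost")

# Rides and revenue per borough over the last 15 minutes.
rows = client.execute(
    """
    SELECT borough,
           count()        AS rides,
           sum(fare_usd)  AS revenue
    FROM trips_enriched
    WHERE pickup_ts >= now() - INTERVAL 15 MINUTE
    GROUP BY borough
    ORDER BY rides DESC
    """
)
for borough, rides, revenue in rows:
    print(f"{borough}: {rides} rides, ${revenue:.2f}")
```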
Data pipeline for Wikipedia page view statistics (6k+ stars): This project builds an automated monthly pipeline to gather Wikipedia page view statistics from CSV dumps, process them through Spark jobs, and load preprocessed page view counts into Druid. Some key components:
Downloading and validating raw Wikipedia page view dumps
Developing Spark DataFrame jobs to filter, cleanse, and aggregate the page view data (a condensed job is sketched after this list)
Configuring Druid clusters and ingesting aggregated page counts
Scheduling Spark jobs through Airflow and monitoring their execution (a sample DAG is sketched below)
Integrating Druid with Superset for analytics and visualizations
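A condensed sketch of the kind of Spark DataFrame job described above might look like the following; the dump layout (space-delimited columns for project, page title, view count, and response bytes), the S3 paths, and the filtering rules are assumptions for illustration.

```python
# Sketch of a Spark batch job that filters, cleanses, and aggregates
# Wikipedia page view dumps. The column layout and paths are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("wiki-pageviews").getOrCreate()

raw = (
    spark.read
    .option("delimiter", " ")
    .csv("s3://example-bucket/pageviews/2024-01/*.gz")
    .toDF("project", "page_title", "view_count", "response_bytes")
)

aggregated = (
    raw
    .filter(F.col("project") == "en")                      # English Wikipedia only
    .filter(F.col("page_title").isNotNull())               # drop malformed rows
    .withColumn("view_count", F.col("view_count").cast("long"))
    .groupBy("page_title")
    .agg(F.sum("view_count").alias("monthly_views"))
)

# Write the preprocessed counts where Druid batch ingestion can pick them up.
aggregated.write.mode("overwrite").parquet("s3://example-bucket/pageviews-agg/2024-01/")
```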
By utilizing Spark, Druid, Airflow and cloud infrastructure, this project showcases techniques for building scalable batch data pipelines. It also focuses on automating and monitoring the end-to-end workflow.
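The automation layer typically boils down to an Airflow DAG that downloads the dump, runs the Spark job, and triggers Druid ingestion on a monthly schedule. The sketch below uses BashOperator with placeholder commands, script paths, and a hypothetical ingestion spec; the project's actual DAG and operators may differ.

```python
# Sketch of a monthly Airflow DAG chaining download -> Spark job -> Druid
# ingestion. Commands, script paths, and the ingestion spec are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="wiki_pageviews_monthly",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    download_dumps = BashOperator(
        task_id="download_dumps",
        bash_command="python scripts/download_pageview_dumps.py --month {{ ds }}",
    )
    aggregate_views = BashOperator(
        task_id="aggregate_views",
        bash_command="spark-submit jobs/aggregate_pageviews.py --month {{ ds }}",
    )
    ingest_into_druid = BashOperator(
        task_id="ingest_into_druid",
        bash_command=(
            "curl -X POST -H 'Content-Type: application/json' "
            "-d @specs/pageviews_ingestion.json "
            "http://druid-overlord:8090/druid/indexer/v1/task"
        ),
    )

    download_dumps >> aggregate_views >> ingest_into_druid
```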
Each of these representative GitHub projects has received thousands of stars due to its relevance, quality, and educational value for aspiring data engineers. They demonstrate best practices for architecting, implementing, and deploying real-world data pipelines on modern data infrastructure. A student undertaking one of these projects as a capstone would have the opportunity to dive deep into essential data engineering skills while gaining exposure to modern cloud technologies and industry standards. Each repository also provides complete documentation for replicating the system from start to finish. Projects like these could serve as excellent foundations and inspiration for high-quality data engineering capstone projects.
The three example GitHub projects detailed above showcase important patterns for building data pipelines at scale. They involve ingesting, transforming and analyzing large volumes of real public data using modern data processing frameworks. Key aspects covered include distributed batch and stream processing, automating pipelines, deploying on cloud infrastructure, and setting up databases for analytics and visualization. By modeling a capstone project after one of these highly rated examples, a student would learn valuable skills around architecting end-to-end data workflows following best practices. The projects also demonstrate applying data engineering techniques to solve real problems with public, non-sensitive datasets.