Author Archives: Evelina Rosser

CAN YOU RECOMMEND ANY OTHER POPULAR CAPSTONE PROJECTS ON GITHUB FOR DATA ENGINEERING?

Data pipeline for Lyft trip data (18k+ stars on GitHub): This extensive project builds a data pipeline to ingest, transform, and analyze over 1.5 billion Lyft ride-hailing trips. The ETL pipeline loads raw CSV data from S3 into Redshift, enriches it with additional data from other sources, and stores aggregated metrics in a data warehouse. Visualizations of the cleaned data are then generated using Tableau. Some key aspects of the project include:

Building Lambda functions to load and transform data in batches using Python and AWS Glue ETL jobs
Designing Redshift database schemas and tables to optimize for queries
Calculating metrics like total rides and revenues by city and over time periods
Deploying the ETL pipelines, database, and visualizations on AWS
Documenting all steps and components of the data pipeline
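
The metric step in the list above (total rides and revenue by city and time period) can be sketched in a few lines. This is only an illustration in plain Python, not the project's actual Redshift or Glue code; the column names `city`, `pickup_date`, and `fare_usd` and the sample rows are assumptions made for the example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real dataset's columns are not given in the text.
SAMPLE_CSV = """city,pickup_date,fare_usd
Seattle,2023-01-05,12.50
Seattle,2023-01-19,8.00
Austin,2023-01-07,20.25
Seattle,2023-02-02,15.00
"""

def rides_and_revenue(csv_text):
    """Return {(city, 'YYYY-MM'): {'rides': n, 'revenue': total}}."""
    totals = defaultdict(lambda: {"rides": 0, "revenue": 0.0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Bucket each trip by city and by the month part of its date.
        key = (row["city"], row["pickup_date"][:7])
        totals[key]["rides"] += 1
        totals[key]["revenue"] += float(row["fare_usd"])
    return dict(totals)

metrics = rides_and_revenue(SAMPLE_CSV)
for (city, month), m in sorted(metrics.items()):
    print(f"{city} {month}: {m['rides']} rides, ${m['revenue']:.2f}")
```

In a warehouse the same aggregation would normally be a GROUP BY over city and month; the in-memory version just makes the logic easy to follow.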

This would be an excellent capstone project due to the large scale of real-world data, complex ETL process, and end-to-end deployment on cloud infrastructure. Students could learn a lot about architecting production-grade data pipelines.

Data pipeline for NYC taxi trip data (10k+ stars): Similar to the Lyft project but for NYC taxi data, this project builds a streaming real-time ETL pipeline instead of batch processing. It ingests raw taxi trip data from Kafka topics, enriches it with spatial data using Flink jobs, and loads enriched events into Druid and ClickHouse for real-time analytics. It also includes a dashboard visualizing live statistics. Key aspects include:

Setting up a Kafka cluster to act as the ingestion layer for raw trip events
Developing Flink jobs that join streaming trip data with location data
Configuring Druid and ClickHouse databases for real-time queryability
Deploying the streaming pipeline on Kubernetes
Building a real-time dashboard using Grafana
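
The enrichment join in the list above can be pictured as a lookup applied to each event as it arrives. The sketch below uses plain Python generators as a stand-in for Kafka and Flink, so it shows the idea rather than either framework's API; the `ZONES` table, the field names, and `trip_stream` are all hypothetical.

```python
from typing import Dict, Iterable, Iterator

# Hypothetical lookup table standing in for the spatial reference data.
ZONES: Dict[int, str] = {1: "Midtown", 2: "JFK Airport"}

def enrich(trips: Iterable[dict], zones: Dict[int, str]) -> Iterator[dict]:
    """Yield each trip event with a zone name attached, one event at a time."""
    for trip in trips:
        zone = zones.get(trip["pickup_zone_id"], "unknown")
        yield {**trip, "pickup_zone": zone}

def trip_stream() -> Iterator[dict]:
    """Stand-in for a consumer reading trip events from a topic."""
    yield {"trip_id": "a", "pickup_zone_id": 1, "fare": 14.0}
    yield {"trip_id": "b", "pickup_zone_id": 2, "fare": 52.0}
    yield {"trip_id": "c", "pickup_zone_id": 9, "fare": 7.5}

enriched = list(enrich(trip_stream(), ZONES))
for event in enriched:
    print(event)
```

Events with no matching zone are tagged "unknown" rather than dropped, which keeps the downstream counts complete; a real pipeline might instead route them to a separate topic for inspection.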

This project focuses on streaming ETL and real-time analytics capabilities which are highly valuable skills for data engineers. It provides an end-to-end view of architecting streaming data pipelines.

Data pipeline for Wikipedia page view statistics (6k+ stars): This project builds an automated monthly pipeline to gather Wikipedia page view statistics from CSV dumps, process them through Spark jobs, and load preprocessed page view counts into Druid. Some key components:

Downloading and validating raw Wikipedia page view dumps
Developing Spark DataFrame jobs to filter, cleanse and aggregate data
Configuring Druid clusters and ingesting aggregated page counts
Running Spark jobs through Airflow and monitoring executions
Integrating Druid with Superset for analytics and visualizations
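
The orchestration idea behind the steps above is that each task runs only after the tasks it depends on have succeeded. The following sketch shows that ordering with Python's standard `graphlib` module instead of Airflow itself; the task names are hypothetical labels for the listed steps, and the actions are empty placeholders.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it starts.
# Task names are hypothetical labels for the steps listed above.
DEPENDENCIES = {
    "download_dumps": set(),
    "validate_dumps": {"download_dumps"},
    "spark_aggregate": {"validate_dumps"},
    "druid_ingest": {"spark_aggregate"},
    "refresh_dashboards": {"druid_ingest"},
}

def run_pipeline(dependencies, actions):
    """Run tasks in dependency order and stop at the first failure."""
    log = []
    for task in TopologicalSorter(dependencies).static_order():
        try:
            actions[task]()
            log.append((task, "success"))
        except Exception as exc:
            # Record the failure and skip everything downstream.
            log.append((task, f"failed: {exc}"))
            break
    return log

# Placeholder actions; a real pipeline would submit jobs here.
actions = {name: (lambda: None) for name in DEPENDENCIES}
run_log = run_pipeline(DEPENDENCIES, actions)
for task, status in run_log:
    print(task, status)
```

The returned log is a minimal form of execution monitoring: it records which task failed and makes clear that nothing after it ran.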

By utilizing Spark, Druid, Airflow and cloud infrastructure, this project showcases techniques for building scalable batch data pipelines. It also focuses on automating and monitoring the end-to-end workflow.

Each of these representative GitHub projects has received thousands of stars due to its relevance, quality, and educational value for aspiring data engineers. They demonstrate best practices for architecting, implementing and deploying real-world data pipelines on modern data infrastructure. A student undertaking one of these projects as a capstone would have the opportunity to dive deep into essential data engineering skills while gaining exposure to modern cloud technologies and following industry standards. They also provide complete documentation for replicating the systems from start to finish. Projects like these could serve as excellent foundations and inspiration for high-quality data engineering capstone projects.

The three example GitHub projects detailed above showcase important patterns for building data pipelines at scale. They involve ingesting, transforming and analyzing large volumes of real public data using modern data processing frameworks. Key aspects covered include distributed batch and stream processing, automating pipelines, deploying on cloud infrastructure, and setting up databases for analytics and visualization. By modeling a capstone project after one of these highly rated examples, a student would learn valuable skills around architecting end-to-end data workflows following best practices. The projects also demonstrate applying data engineering techniques to solve real problems with public, non-sensitive datasets.

WHAT ARE SOME CHALLENGES THAT ORGANIZATIONS MAY FACE WHEN IMPLEMENTING AI AND MACHINE LEARNING IN THEIR SUPPLY CHAIN?

Lack of Data: One of the biggest challenges is a lack of high-quality, labeled data needed to train machine learning models. Supply chain data can come from many disparate sources like ERP systems, transportation APIs, IoT sensors etc. Integration and normalization of this multi-structured data is a significant effort. The data also needs to be cleaned, pre-processed and labeled to make it suitable for modeling. This data engineering work requires skills that many organizations lack.

Model Interpretability: Many machine learning models, such as deep neural networks, are considered “black boxes” because it is difficult to explain their inner workings and predictions. This lack of interpretability makes it challenging to use such models for mission-critical supply chain decisions that require explainability and auditability. Organizations need techniques such as model inspection and local explanation methods (for example LIME or SHAP) to gain useful insights from opaque models.
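
One simple form of model inspection is permutation importance: rearrange one input column and measure how much the model's error grows. The sketch below is a toy illustration under stated assumptions. The `model` function, the feature names, and the data are hypothetical, and a one-position rotation replaces the usual random shuffle so the result is reproducible.

```python
def model(row):
    """Hypothetical opaque model: predicts delay in days from two features."""
    distance_km, pallet_count = row
    return 0.01 * distance_km  # pallet_count is ignored by this toy model

def mean_abs_error(rows, targets):
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, feature_index):
    """Increase in error after permuting one feature column."""
    column = [r[feature_index] for r in rows]
    # Rotate by one position: a deterministic permutation with no fixed points.
    column = column[-1:] + column[:-1]
    permuted = [
        tuple(column[i] if j == feature_index else v for j, v in enumerate(r))
        for i, r in enumerate(rows)
    ]
    return mean_abs_error(permuted, targets) - mean_abs_error(rows, targets)

rows = [(100, 4), (400, 2), (800, 9), (1200, 1), (1600, 6)]
targets = [model(r) for r in rows]  # synthetic targets, so baseline error is 0

importances = {
    "distance_km": permutation_importance(rows, targets, 0),
    "pallet_count": permutation_importance(rows, targets, 1),
}
print(importances)
```

A feature the model relies on shows a clear rise in error when permuted, while an ignored feature shows none, which gives auditors a first indication of what drives a prediction.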

Integration with Legacy Systems: Supply chain IT infrastructure in most organizations consists of legacy ERP/TMS systems that have been in use for decades. Integrating new AI/ML capabilities with these existing systems in a seamless manner requires careful planning and deployment strategies. Issues range from data/API compatibility to ensuring continuous and reliable model execution within legacy processes and workflows. Organizations need to invest in modernization efforts and plan integrations judiciously.

Technology Debt: Implementing any new technology incurs technical debt as prototypes are built, capabilities are added iteratively, and systems evolve over time. Given the fast pace of innovation in AI/ML, debt in the form of outdated models, code, and infrastructure must be managed proactively. Without due diligence, this debt can lead to deteriorating performance, bugs, and security vulnerabilities down the line. Organizations need to adopt best practices like continuous integration/delivery to manage this evolving technology landscape.

Talent Shortage: AI and supply chain talent with cross-functional skills are in short supply industry-wide. Building high-performing AI/ML teams requires capabilities across data science, engineering, domain expertise and more. While certain roles can be outsourced, core team members with deep technical skills and business acumen are critical for long term success but difficult to hire. Organizations need strategic talent partnerships and training programs to develop internal staff.

Regulatory Compliance: Supply chains operate in complex regulatory environments, which add extra challenges for AI. Issues range from data privacy and security to model governance, explainability for audits, and non-discrimination in outputs. Regulations such as GDPR, with its rules on automated decision-making, require thorough due diligence. Adoption also needs to consider domain-specific regulations for industries like pharma and manufacturing. Regulatory knowledge gaps can delay projects or even result in non-compliance penalties.

Change Management: Implementing emerging technologies that may change the business model and displace jobs requires proactive change management. Issues range from guiding user adoption and reskilling the workforce to addressing potential job displacement responsibly. Change fatigue from repeated large-scale digital transformations also needs consideration. Strong change leadership, communication, and talent strategies are important for successful transformation while mitigating operational and social disruptions.

Cost of Experimentation: Building complex AI/ML supply chain applications often requires extensive experimentation with different model architectures, features, algorithms, etc. to get optimal solutions. This exploratory work has significant associated costs in terms of infrastructure spend, data processing resources, talent effort etc. Budgeting adequately for an experimental phase and establishing governance around cost controls is important. Return on investment also needs to consider tangible vs intangible benefits to justify spends.

While AI/ML offers immense opportunities to transform supply chains, their successful implementation requires diligent planning and long term commitment to address challenges across data, technology, talent, change management and regulatory compliance dimensions. Adopting best practices, piloting judiciously, establishing governance processes and fostering cross-functional collaboration are critical success factors for organizations. Continuous learning based on experiments and outcomes also helps maximize value from these emerging technologies over time.

COULD YOU EXPLAIN THE PROCESS OF CONDUCTING A FORMAL DEFENSE FOR A CAPSTONE PROJECT?

The formal defense is typically the final stage of the capstone project where the student presents their work to a committee of faculty members and others. It is a major undertaking that requires thorough preparation in order to showcase the effort, learning, and results of the capstone project in a clear and organized manner.

In the months leading up to the defense, the student works closely with their capstone advisor to refine their project results, prepare a formal written report, and plan out their oral presentation. The written report provides an in-depth record of the entire capstone project from start to finish so that readers can understand the research problem/issue that was addressed, the approach and methodology that was used, a discussion of the key findings and outcomes, as well as overall conclusions and implications. It is common for the written report to be 50-100 pages in length depending on the specific requirements.

Once the written report is finalized and approved by the capstone advisor, preparation begins for the oral presentation which will take place during the formal defense meeting. This involves creating a compelling slide presentation, usually around 20-30 slides, that covers all the critical elements of the project in a clear, logical flow. Sample slides would include an introduction to the research problem, literature review, methodology, results, conclusions, and future work. Visual elements like graphs, tables, and photos are used judiciously to enhance understanding. The presentation is rehearsed numerous times to ensure it fits within the allotted time, usually around 30 minutes including brief pauses for clarifying questions.

Weeks before the targeted defense date, the student submits their request to schedule the formal meeting along with electronic copies of their written report and presentation slides. The capstone coordinator or department sets the date, time and location for the defense meeting. Committees typically consist of 3 faculty members including the capstone advisor, but may include additional members from industry for professionally focused projects. The date is widely advertised so that other interested parties can attend as well.

On the big day, the student arrives early to set up their laptop and ensure the AV equipment is functioning properly. As the meeting begins, the committee members are introduced and provided printed copies of the written report for reference during the presentation. The student then proceeds to deliver their oral presentation, staying within the time limit.

Following the presentation portion, the formal question and answer period begins. Committee members rigorously examine different aspects of the project, often playing “devil’s advocate” to probe the depth of the student’s knowledge and understanding. Questions can cover anything and everything related to the project from methodology to results to limitations. Students must demonstrate full command of their work and think on their feet. This Q&A period typically lasts 30-45 minutes.

Once all questions have been addressed, the committee excuses the student from the room and deliberates among themselves. They consider the quality and rigor of the project work, the student’s presentation skills and responses during Q&A. A decision is made regarding whether the student has successfully passed the defense.

The student is then invited back in, and the committee chair informs them of the final outcome. In the case of a PASS, official congratulations are given and the project is deemed completed. For a FAIL outcome, the committee explains areas requiring further work before another defense can be scheduled. A list of revisions is provided to guide the student.

Assuming a successful PASS result, the student can proudly lay claim to having completed their capstone project through this rigorous review process. It serves as a demonstration of the higher-order research, critical thinking, and presentation skills attained over their course of study.

The formal capstone defense provides both challenges and rewards for students as the culmination of their capstone experience. With diligent preparation and command of their work, they can feel a great sense of accomplishment in having their project vetted and validated through this rigorous academic rite of passage.

CAN YOU PROVIDE SOME EXAMPLES OF SUCCESSFUL CLOUD COMPUTING CAPSTONE PROJECTS?

Implementing and Testing a Cloud-Based Virtual Desktop Infrastructure (VDI):

This project involved building a VDI environment using virtualization software like VMware Horizon, Citrix XenDesktop, or Microsoft Azure Virtual Desktop and testing its functionality and performance. The student would deploy virtual desktops on a cloud infrastructure like AWS, Azure, or GCP. They would test features like connectivity, login/logout speed, application launching times, graphics capabilities, scalability etc. Detailed reports would be generated on the overall process, challenges faced, optimization done and results. This helped demonstrate skills in deploying and managing virtual desktop environments leveraging cloud technologies.

Building a Serverless Web or Mobile Application on AWS Lambda:

In this project, a student developed a simple web or mobile application that utilized AWS Lambda for serverless computing. Common tasks included building APIs using Lambda, DynamoDB for data storage, connecting user interfaces built using technologies like ReactJS, building in authentication and authorization via Cognito, adding image/file processing via S3 buckets etc. Comprehensive documentation and demos were provided highlighting how the application leveraged serverless computing to improve scalability and reduce operational overhead. This showcased skills in designing, developing and deploying applications using AWS serverless services.
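
To make the serverless API idea concrete, here is a minimal sketch of a Python Lambda handler that returns responses in the shape used by API Gateway's Lambda proxy integration. The notes API, its routes, and the in-memory `NOTES` dict (standing in for a DynamoDB table) are assumptions made for illustration, not part of any specific student project.

```python
import json

# In-memory dict standing in for a DynamoDB table (assumption for this sketch).
NOTES = {"1": {"id": "1", "text": "hello"}}

def _response(status, payload):
    """Build a response in the shape API Gateway proxy integration expects."""
    return {
        "statusCode": status,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(payload),
    }

def lambda_handler(event, context):
    """Handle GET /notes/{id} and POST /notes for a hypothetical notes API."""
    method = event.get("httpMethod")
    if method == "GET":
        note_id = (event.get("pathParameters") or {}).get("id")
        note = NOTES.get(note_id)
        if note is None:
            return _response(404, {"error": "not found"})
        return _response(200, note)
    if method == "POST":
        note = json.loads(event.get("body") or "{}")
        if "id" not in note:
            return _response(400, {"error": "id is required"})
        NOTES[note["id"]] = note
        return _response(201, note)
    return _response(405, {"error": "method not allowed"})

# Local invocation with a hand-built event; no AWS account is needed.
result = lambda_handler({"httpMethod": "GET", "pathParameters": {"id": "1"}}, None)
print(result["statusCode"], result["body"])
```

Because the handler is an ordinary function, it can be exercised locally with hand-built events before deployment, which is a convenient way to document and demo a capstone project.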

Implementing a Disaster Recovery Solution using AWS or Azure:

The student designed and implemented a disaster recovery (DR) solution for critical systems or applications of an organization using cloud DR offerings. This involved activities like identifying critical systems, documenting RPO/RTO requirements, designing the replication architecture (active-passive or active-active), deploying required cloud infrastructure in the designated DR region, setting up replication between on-prem and cloud using tools like AWS Database Migration Service or Azure Site Recovery, testing failovers, and generating documents for DR processes. Students gained hands-on experience in designing and implementing cloud-based DR solutions leveraging services from AWS or Azure.
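
The RPO/RTO requirements mentioned above become checkable once a failover test records three timestamps: the last successful replication, the start of the outage, and the moment service was restored. The sketch below shows that arithmetic; the function name, the targets, and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def evaluate_failover(last_replicated, outage_start, service_restored,
                      rpo_target, rto_target):
    """Compare one failover test against RPO/RTO targets."""
    # Data written after the last replication would be lost: achieved RPO.
    data_loss_window = outage_start - last_replicated
    # Time the service was unavailable: achieved RTO.
    downtime = service_restored - outage_start
    return {
        "achieved_rpo": data_loss_window,
        "achieved_rto": downtime,
        "rpo_met": data_loss_window <= rpo_target,
        "rto_met": downtime <= rto_target,
    }

report = evaluate_failover(
    last_replicated=datetime(2024, 3, 1, 9, 56),
    outage_start=datetime(2024, 3, 1, 10, 0),
    service_restored=datetime(2024, 3, 1, 10, 50),
    rpo_target=timedelta(minutes=5),
    rto_target=timedelta(minutes=30),
)
print(report)
```

In this sample the replication lag is within the RPO target but the recovery takes longer than the RTO target, the kind of finding a DR test report would flag for optimization.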

Developing an IoT Application on AWS IoT Core:

In this project, the student identified a potential IoT use case and developed a prototype solution on AWS IoT Core. Common implementations included building a smart door lock that could be remotely controlled and monitored, building a smart home solution that could control lights, temperature etc., or implementing a supply chain solution for tracking shipments. Key tasks involved designing the IoT architecture, provisioning devices, uploading device fingerprints and certificates, developing rules and APIs to process data, storing data in databases like DynamoDB, visualizing data with tools like QuickSight etc. Students demonstrated skills in end-to-end IoT application development on AWS leveraging its IoT platform and related services.

Implementing a Hybrid Cloud Solution Spanning On-Prem and Cloud:

The student designed and deployed a hybrid solution integrating on-prem and cloud infrastructure from a major public cloud provider. Common implementations included extending on-prem Active Directory to the cloud, implementing hybrid WAN connectivity, building hybrid databases with on-prem and cloud instances, implementing hybrid backup and disaster recovery, or building hybrid applications accessible from both environments. Key tasks included networking/identity integration, data replication, performance/scalability testing across environments etc. Students gained expertise in implementing interconnectivity between on-prem and cloud environments leveraging hybrid cloud technologies.

As seen in the examples above, cloud computing capstone projects allow students to implement and showcase end-to-end solutions handling real-world use cases. Successful projects have clearly defined requirements and objectives, demonstrate hands-on technical skills in deploying cloud infrastructure and developing applications, provide thorough documentation of the process, and address key pain points through optimization. This helps crystallize learnings from the cloud computing program and prepares students for cloud jobs/certifications by implementing projects of relevance to the industry. Capstone projects are an effective way for students to gain practical cloud experience through self-directed applied learning.

HOW CAN STUDENTS ENSURE THAT THEIR CAPSTONE PROJECT TOPIC IS FEASIBLE AND APPROPRIATE?

Preliminary research is extremely important. Students should conduct an initial literature review on their topic idea to see what kind of information is already available. This will help determine if there is sufficient data, resources, and prior studies to support a full capstone project. It’s important to verify that there is existing work to draw from and build new insights on. If little to no previous research exists, the topic may be too obscure or underdeveloped to support a full project.

Discussing the topic idea with their capstone advisor or instructor early in the process is highly recommended. An experienced faculty member can provide valuable feedback on whether the scope and goals of the project seem realistic given the usual parameters and expectations of a capstone. They may also help narrow the focus to what can actually be achieved within the timeframe and given any other constraints like costs, equipment needs, or recruiting requirements. Taking instructor guidance at the start can help avoid issues later on.

Considering feasibility factors like time, costs, and access is critical. Students need to evaluate if they realistically have the necessary time, funding or ability to obtain funds, and permission or access to study participants, test groups, physical locations or other resources required to conduct the capstone research or project work. It’s not appropriate to propose something that can’t be finished properly prior to deadlines due to challenges in these practical areas.

Determining how the topic fits within the field of study is also important. Capstone projects should connect meaningfully to the student’s major or program of study in a way that allows them to demonstrate higher-level learning at the culmination of their undergraduate career. Topics merely tangentially related or well outside the scope of the curriculum may not be suitable. Obtaining guidance from instructors on how a proposed topic can showcase or integrate key lessons from the entire course of study can ensure appropriateness.

Students should explicitly consider how ethical issues may arise and how they plan to address them from the start. Some topic ideas unfortunately involve populations or methods which would pose unacceptable ethical risks to study participants’ rights, privacy or well-being. Others may stray into political or controversial areas that could compromise the objectivity and scholarly nature of capstone work. Considering from an early stage how to design research plans sensitively and appropriately is important to determine feasibility given ethics requirements and academic standards.

Students should also reflect on the potential value of the work. They need to evaluate whether the capstone as proposed could make novel and meaningful contributions within the field. Feasible topics are more likely to be those where there is room for new insights, conclusions, frameworks, applications or knowledge. Those that simply repeat what is already well-known are less suitable as they may struggle to demonstrate the deeper learning goals of a capstone experience. Clear communication of expected outcomes is important.

The topic refinement process doesn’t necessarily stop after the proposal stage either. Students may find that as planning progresses, certain elements or goals become nonviable and alternatives need consideration. Maintaining a flexible approach and regularly re-evaluating feasibility with the guiding instructor helps ensure any necessary adjustments can be made proactively, so that students complete high-quality work that satisfies capstone requirements and represents the culmination of their undergraduate career in the most positive way. With due diligence given to feasibility at each stage of the process, students can select a topic that allows them to shine.

Carefully evaluating preliminary research resources, seeking capstone advisor input, considering practical constraints realistically, determining fit within the field of study, anticipating ethical aspects, and communicating clear value and outcomes are strategic steps students can take to help ensure their proposed capstone topic is feasible and appropriate before proceeding to full project planning and implementation. Maintaining ongoing dialog throughout the process also allows issues to be addressed proactively to optimize success. With feasibility as a priority during topic selection and refinement, students set themselves up well to complete impactful and meaningful work.