Kaggle Datasets: Kaggle is one of the largest data science communities and open data repositories in the world. It contains thousands of public datasets that cover a wide range of domains including healthcare, business, biology, social sciences, and more. Some particularly useful or interesting Kaggle datasets for capstone projects include:
Titanic dataset: Passenger records from the Titanic voyage, used to build models that predict which passengers survived. This is a classic introductory binary classification problem.
House Prices – Advanced Regression Techniques: Residential sale records from Ames, Iowa, with roughly 80 explanatory features for predicting sale price. A good test of feature engineering and of regression methods such as random forests and gradient boosting.
Human Activity Recognition Using Smartphones: Smartphone accelerometer and gyroscope recordings labeled with physical activities such as walking, sitting, and standing. Good for exploring time series and sensor data.
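Porto Seguro’s Safe Driver Prediction: Anonymized policyholder features for predicting the probability that a driver will file an auto insurance claim in the following year, often tackled with gradient boosting tools like XGBoost. A direct application to the insurance industry.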
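Before fitting any model on a dataset like Titanic, it usually helps to establish a simple baseline to beat. The following is a minimal sketch of that idea using only the Python standard library. The column names Survived, Pclass, and Sex match Kaggle's Titanic train.csv, but the rows below are invented toy values, and the helper names survival_rate_by and majority_baseline_accuracy are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy rows in the column layout of Kaggle's Titanic train.csv.
# ASSUMPTION: these values are made up for illustration; they are
# NOT real passenger records.
SAMPLE_CSV = """Survived,Pclass,Sex
1,1,female
1,2,female
0,3,female
0,1,male
0,3,male
1,3,male
0,2,male
1,1,female
"""


def survival_rate_by(rows, column):
    """Return {group value: share of rows in that group with Survived == 1}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for row in rows:
        group = row[column]
        counts[group][0] += int(row["Survived"])
        counts[group][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}


def majority_baseline_accuracy(rows):
    """Accuracy obtained by always predicting the most common label."""
    labels = [int(row["Survived"]) for row in rows]
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
rates = survival_rate_by(rows, "Sex")
baseline = majority_baseline_accuracy(rows)
print("Survival rate by sex:", rates)
print("Majority-class baseline accuracy:", baseline)
```

With the real file, the same functions could be applied to rows read from train.csv instead of SAMPLE_CSV. Any model built afterwards should clearly beat the majority-class accuracy to be worth its added complexity.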
US Government Data Sources: US government agencies publish large volumes of public data that are generally well structured and come from official sources. Some excellent sources include:
Census Data: Demographic and economic data from the US Census Bureau, including the decennial census, the American Community Survey, and the economic census. Rich datasets for exploring US population trends over time.
Centers for Disease Control and Prevention (CDC) Data: Epidemiological data on diseases, risk factors, and health behaviors from the national public health agency. Relevant for healthcare and life sciences projects.
National Library of Medicine (NLM) Data: Biomedical literature and clinical trial records maintained by the National Library of Medicine, for example PubMed and ClinicalTrials.gov. Useful for projects involving healthcare, genetics, and biomedicine.
NASA Data: Earth science datasets on climate, weather, oceans, and natural hazards from NASA satellites and missions. Good for projects related to environmental monitoring, climate change, and natural disasters.
UK Data Service: Led by the University of Essex, this is a large collection of social and economic data, primarily from the UK but with some international coverage. Some commonly used datasets include:
British Social Attitudes Survey: Annual survey of public opinion on key social issues in the UK, running since 1983. Useful for trend analysis.
Understanding Society: Longitudinal survey following about 40,000 UK households since 2009, covering health, financial circumstances, employment, and education. Well suited to studying change over time.
Workplace Employment Relations Survey (WERS): Survey data on workplaces and their employees, including job satisfaction, with waves in 1998 and 2004. Valuable for organizational behavior analysis.
Financial and Economic Data:
FRED Economic Data (St. Louis Fed): Large collection of economic time series maintained by the Federal Reserve Bank of St. Louis, covering GDP, prices, employment, and banking, with many series spanning several decades. Well suited to macroeconomic analysis, nowcasting, and forecasting.
Quandl Stock and Market Data (now Nasdaq Data Link): Large database of stock price histories, market indices, and company fundamentals for equity analysis projects. Some datasets are free and others require a paid subscription.
IMF Data: Macroeconomic and financial data for about 190 countries, published by the International Monetary Fund. Appropriate for cross-country comparative analyses.
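Macroeconomic analyses of series like those listed above often start by converting levels into growth rates. The sketch below shows one way to compute a year-over-year percent change with the Python standard library. The "date,value" layout, the sample numbers, and the function names load_series and pct_change are all assumptions made for illustration; real downloads from these providers use their own column names and should be checked before parsing.

```python
import csv
import io

# Hypothetical quarterly series in a simple "date,value" layout.
# ASSUMPTION: the numbers are invented for illustration and are
# not real economic data from any provider.
SAMPLE_SERIES = """date,value
2020-01-01,100.0
2020-04-01,90.0
2020-07-01,98.0
2020-10-01,99.0
2021-01-01,102.0
2021-04-01,99.0
"""


def load_series(text):
    """Parse 'date,value' CSV text into a list of (date string, float) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["date"], float(row["value"])) for row in reader]


def pct_change(series, lag):
    """Percent change of each observation versus the one `lag` periods earlier."""
    changes = []
    for i in range(lag, len(series)):
        date, value = series[i]
        previous = series[i - lag][1]
        changes.append((date, (value - previous) / previous * 100.0))
    return changes


series = load_series(SAMPLE_SERIES)
# For quarterly data, a lag of 4 periods corresponds to one year.
yoy = pct_change(series, lag=4)
for date, change in yoy:
    print(f"{date}: {change:.1f}% year over year")
```

The lag parameter is the main design choice here: 4 suits quarterly data and 12 suits monthly data, while a lag of 1 gives period-over-period change instead.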
Other Subject-Specific Data Sources: Many specialized repositories exist for particular project domains, such as gene expression and biomedical data (Gene Expression Omnibus, Proteomics Identifications database), educational data (National Center for Education Statistics), and transportation data (US DOT databases). With a diligent search, relevant open datasets can usually be found for a capstone project in most domains.
These publicly available data sources cover a broad range of domains, and many of their datasets are well formatted and documented, which makes them a good fit for data science capstone projects. Choosing datasets from established sources also lends credibility to the analysis and its conclusions. By selecting a problem carefully and matching it to a suitable dataset, students can design rigorous final projects that demonstrate the skills and knowledge gained through their programs.