There are several trusted sources where you can find free and paid retail datasets to download and analyze. Some of the most commonly used sources include:

Kaggle: Kaggle is a very popular platform for data science competitions and projects where users can access thousands of public datasets for free. They have a wide selection of retail datasets ranging from transaction records to customer profiles. To access these datasets, you need to create a free Kaggle account. Then you can browse their retail category or use the search bar to find specific datasets. Most datasets can be downloaded directly from their page as CSV files.

Data.gov: As a government portal, Data.gov contains a large collection of datasets from different agencies that are all public domain. They have some interesting retail datasets primarily focused on things like census data, economic indicators, and consumer behavior analytics. To download from Data.gov, browse their catalog, search for relevant keywords like “retail sales” or categories like “economic” to find options. You can then click on individual datasets for metadata and download links.

Information Resources: This company curates retail datasets from various stores and chains then licenses them for use by businesses and researchers. Their datasets provide detailed point-of-sale transaction records, loyalty card purchase histories, and inventory/pricing files. Access requires registering for a free trial account on their site. Trial access is limited but lets you evaluate samples before paying licensing fees for full datasets.

Nielsen: As a leading market research firm, Nielsen has a wealth of consumer shopping behavior data captured via their Nielsen Homescan panel and store point-of-sale monitoring systems. Their retail datasets are only available for purchase through commercial licenses but provide very robust insights into categories like household item sales, store foot traffic patterns, and competitive brand/product analyses. Costs typically range from a few thousand to tens of thousands of dollars depending on the scale and the frequency of updates required.

Euromonitor: Similar to Nielsen, Euromonitor collects extensive market data on industries globally, including retail sectors in different countries. They have pre-built retail market size and forecast datasets covering the size of the clothing, grocery, and electronics retail industries over time by region. These detailed retail market reports and datasets need to be purchased but provide macro analyses of retail industry composition and growth trends. Pricing is more affordable compared to Nielsen, starting at a few hundred dollars.

Store Layouts: This shopper behavior startup has crowdsourced floor maps and layouts of hundreds of major retail stores globally. Their open datasets contain anonymized store maps with metadata on departments, aisles, fixtures which researchers and retailers use for understanding consumer journeys and spatial analyses. Maps can be freely downloaded as image files with attribution given to the source.

IRI: Formerly known as Information Resources Inc, IRI is another leading market data provider collecting point-of-sale and survey-based information. Their retail datasets focus more on consumer-packaged goods (CPG) like grocery, tobacco, and OTC healthcare products. Dataset access requires commercial licensing but provides competitive sales, pricing, promotion, and household panel data for CPG categories.

US Census Bureau: The Bureau collects and publishes government economic reports providing insights like total retail sales by industry, inventory levels, and e-commerce trends. Much of this macro-level retail data is publicly available for free download as CSV files on their website without needing an account. Key datasets include the Monthly and Annual Retail Trade reports along with the quinquennial Economic Census results detailing sales by store type.

Individual Retail Chains: Some prominent big box and specialty retailers like Target, Walmart, Lowe’s, Home Depot also publicly share limited data subsets focusing on things like sales of particular product categories nationally or by region over time. These datasets have narrower scopes than Nielsen/IRI but provide a view of sales directly from major chains. They are freely available on the chains’ open data or “About Us” pages without registration.

There are also private retailers, marketplaces, and e-commerce platforms where researchers can potentially gain access to transaction and user behavioral datasets for a fee by contacting their business development or partnerships teams. Getting approved typically requires a clear use case and agreeing to restrictive non-disclosure terms due to the sensitive commercial nature of the raw data.

While some of the most complete retail datasets require payment, there are also many free public datasets that can be used without a commercial license. Which provider fits best depends on the specific analytical need and the research budget, so it helps to weigh the pros and cons of each before committing. With the variety available, researchers should be able to find suitable options for retail sector analysis and model building.
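
Several of the sources above, including Kaggle and the US Census Bureau, offer CSV downloads. A quick first step after downloading is to profile the file: list its columns, count rows, and count missing values. The sketch below uses only the Python standard library; the column names and values are invented for illustration and do not come from any particular dataset.

```python
import csv
import io

# Stand-in for a downloaded file; columns and values are invented.
# With a real download, pass open("your_file.csv", newline="") instead.
sample_csv = io.StringIO(
    "store,date,weekly_sales\n"
    "1,2012-01-06,1500.50\n"
    "1,2012-01-13,\n"
    "2,2012-01-06,980.00\n"
)

def profile_csv(handle):
    """Return (column names, row count, missing-value count per column)."""
    reader = csv.DictReader(handle)
    missing = {name: 0 for name in reader.fieldnames}
    rows = 0
    for row in reader:
        rows += 1
        for name in reader.fieldnames:
            value = row.get(name)
            # Treat absent or blank cells as missing.
            if value is None or value.strip() == "":
                missing[name] += 1
    return reader.fieldnames, rows, missing

columns, n_rows, missing = profile_csv(sample_csv)
print(columns, n_rows, missing)
```

A profile like this makes it easier to judge whether a free dataset is complete enough before investing time in it.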


Kaggle Retail Dataset: This dataset contains over 10 years of daily sales data for 45,000 food products across 10 stores. It includes fields like store, department, date, weekly sales, markup, and more. With over 500,000 rows, it provides rich data for analyzing retail sales patterns, forecasting, exploring department performance, and assessing pricing and promotion effectiveness. Potential capstone projects include building predictive sales models, optimizing inventory levels, detecting anomalies or outliers, and comparing store or department performance.

Online Retail II Dataset: This dataset from the UCI Machine Learning Repository contains transactions made by a UK-based online retailer between 01/12/2009 and 09/12/2011. It includes fields like InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country. With over 5,000 unique products and around 8,000 customers, it allows examining customer purchasing behaviors, product categories, and sales trends over time. Capstone ideas include customer segmentation, recommendation engines, predictive churn analysis, promotion targeting, and assortment optimization.
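
One common starting point for the customer segmentation idea is a recency, frequency, monetary (RFM) table. The sketch below is a minimal illustration using the field names listed above; the rows are invented toy data, and the date format and the `rfm_table` helper are assumptions rather than anything prescribed by the dataset.

```python
from collections import defaultdict
from datetime import datetime

# Toy rows using the field names listed above; values are invented.
transactions = [
    {"InvoiceNo": "1001", "CustomerID": "C1", "InvoiceDate": "2011-11-01", "Quantity": 2, "UnitPrice": 5.0},
    {"InvoiceNo": "1002", "CustomerID": "C1", "InvoiceDate": "2011-12-05", "Quantity": 1, "UnitPrice": 20.0},
    {"InvoiceNo": "1003", "CustomerID": "C2", "InvoiceDate": "2010-03-15", "Quantity": 10, "UnitPrice": 1.5},
    {"InvoiceNo": "1003", "CustomerID": "C2", "InvoiceDate": "2010-03-15", "Quantity": 1, "UnitPrice": 4.0},
]

def rfm_table(rows, as_of):
    """Return {customer: (recency_days, frequency, monetary)}."""
    last_purchase = {}
    invoices = defaultdict(set)   # distinct invoices per customer
    spend = defaultdict(float)    # total spend per customer
    for row in rows:
        cust = row["CustomerID"]
        when = datetime.strptime(row["InvoiceDate"], "%Y-%m-%d")
        last_purchase[cust] = max(when, last_purchase.get(cust, when))
        invoices[cust].add(row["InvoiceNo"])
        spend[cust] += row["Quantity"] * row["UnitPrice"]
    return {
        cust: ((as_of - last_purchase[cust]).days, len(invoices[cust]), round(spend[cust], 2))
        for cust in last_purchase
    }

rfm = rfm_table(transactions, as_of=datetime(2011, 12, 10))
for customer, (recency, frequency, monetary) in sorted(rfm.items()):
    print(customer, recency, frequency, monetary)
```

Customers can then be bucketed on each of the three values (for example into quantiles) to form segments such as recent high spenders or lapsed one-time buyers.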

European Retail Study Dataset: This dataset was collected between 2013-2015 across 24 countries in Europe to study omni-channel retail. It provides information on over 42,000 customers, their purchase transactions, demographic details, online/offline shopping behaviors, returns etc. Some dimensions covered are age, gender, income-level, product categories purchased, channels used, spend amounts. This rich dataset opens up opportunities for multi-channel analytics, personalized experiences, loyalty program design, understanding cross-border trends at a continental scale.

Instacart Market Basket Analysis Dataset: This dataset contains over 3 million grocery orders from real Instacart customers. It includes anonymized order data with product names, quantities, added or removed from basket, purchase or cancellation. This provides scope for advanced market basket or transactional analysis to determine complementary or frequently-bought-together products, factors influencing abandoned cart recovery, incremental sales from personalized recommendations, effects of out-of-stock items, etc.
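
The simplest form of the frequently-bought-together analysis is counting how often each pair of products appears in the same order. The sketch below shows that idea on invented baskets; the product names and the `pair_support` helper are hypothetical, and a real project would likely move on to association rule metrics such as confidence and lift.

```python
from collections import Counter
from itertools import combinations

# Hypothetical orders; product names are invented for illustration.
orders = [
    ["bananas", "milk", "bread"],
    ["bananas", "milk"],
    ["bread", "eggs"],
    ["bananas", "milk", "eggs"],
]

def pair_support(baskets):
    """Fraction of baskets that contain each unordered product pair."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

support = pair_support(orders)
top_pair = max(support, key=support.get)
print(top_pair, support[top_pair])
```

Pair counting grows quadratically with basket size, so on millions of orders it is usually restricted to products above a minimum frequency first.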

Walmart Sales Forecasting Dataset: This dataset contains weekly sales data for 45 Walmart stores located in different regions, collected over about 3 years. Features include Store, Dept, Date, Weekly_Sales, MarkDown1 through MarkDown5, etc. It can be leveraged to build statistical or deep learning models for short- and long-term demand forecasting across departments, develop automatic outlier detection capabilities, run scenario analysis during special events, etc.
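
Before reaching for deep learning, it helps to have a naive baseline and a simple outlier check to compare against. The sketch below applies a moving-average forecast and a z-score outlier flag to one invented Weekly_Sales series; the numbers, the window size, and the threshold are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

# Invented weekly sales for one store/department; not from the real data.
weekly_sales = [100.0, 104.0, 250.0, 98.0, 102.0, 101.0, 99.0, 103.0]

def moving_average_forecast(series, window=4):
    """Forecast the next value as the mean of the last `window` values."""
    return mean(series[-window:])

def flag_outliers(series, threshold=2.0):
    """Return indices more than `threshold` sample std devs from the mean."""
    mu, sigma = mean(series), stdev(series)
    return [i for i, x in enumerate(series) if abs(x - mu) > threshold * sigma]

forecast = moving_average_forecast(weekly_sales)
outliers = flag_outliers(weekly_sales)
print(forecast, outliers)
```

Any more elaborate model should beat this baseline on held-out weeks to justify its complexity, and flagged weeks are worth checking against holidays or promotions before being treated as errors.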

Target Customer Dataset: This dataset contains purchasing profiles for over 5,000 anonymous Target customers, encompassing their transactions over a 6-month period. It includes features like age, gender, marital status, home ownership, number of dependents, income, and spend categories within Target such as grocery, personal care, and electronics. This could enable identifying high-lifetime-value segments, developing micro-segmentation strategies, and testing personalization and targeted promotion approaches.

Kroger Customer Analytics Dataset: This dataset contains anonymous profiles of over 30,000 Kroger customers, including their demographics, surveyed household and lifestyle characteristics, shopping behaviors, and purchasing basket details. Variables provided are age, ethnicity, family status, income level, ZIP code, and preferences such as organic or wellness-focused products, along with purchases across departments. Potential projects include customer churn analysis, propensity modeling for private label brands, and targeted loyalty program personalization at scale.

These datasets span several dimensions of retail data, from transactions, customers, and banners to omni-channel behavior. They support work on forecasting, recommendations, segmentation, promotions analysis, and supply chain optimization, which makes them suitable for many capstone projects exploring insights for retailers. The datasets are publicly available and offer enough volume and variety to power meaningful analytical modeling and drive actionable business recommendations.


One of the most common types of datasets used is health/medical data, as it allows students to analyze topics that can have real-world impact. For example, one group of students obtained de-identified medical claim records from a large insurance provider covering several years. They analyzed the data to identify predictors of high medical costs and develop risk profiles that could help the insurance company better manage patient care. Some features they examined included diagnoses, procedures, prescriptions, demographics, and lifestyle factors. They built machine learning models to predict which patients were most at risk of future high costs based on their histories.

Another popular source of data is urban/transportation planning datasets. One project looked at public transit ridership patterns in a major city using anonymized tap-in/tap-out records from the city’s subway and bus systems. Students analyzed rider origins and destinations to identify the most traveled routes and times of day. They also examined how ridership changed on different days of the week and during major events. Their findings helped the city transportation authority understand demand and make recommendations on where to focus service improvements.
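
The kind of aggregation described in that transit project can be sketched with simple counters over tap records. The example below is a hypothetical illustration: the station names, timestamps, and record layout are invented, and real fare-card data would need to be cleaned and matched into trips first.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical trips: (origin station, destination station, tap-in time).
taps = [
    ("Central", "Harbor", "2024-03-04 08:05"),
    ("Central", "Harbor", "2024-03-04 08:40"),
    ("Harbor", "Central", "2024-03-04 17:15"),
    ("Uptown", "Central", "2024-03-04 08:20"),
    ("Central", "Harbor", "2024-03-09 11:00"),
]

def summarize(records):
    """Count trips per origin-destination pair, hour of day, and weekday."""
    od_counts, hour_counts, day_counts = Counter(), Counter(), Counter()
    for origin, dest, stamp in records:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        od_counts[(origin, dest)] += 1
        hour_counts[when.hour] += 1
        day_counts[DAY_NAMES[when.weekday()]] += 1
    return od_counts, hour_counts, day_counts

od_counts, hour_counts, day_counts = summarize(taps)
print(od_counts.most_common(1), hour_counts.most_common(1), dict(day_counts))
```

The same three tallies, computed over months of records, are what reveal the busiest routes, peak hours, and weekday versus weekend differences.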

Education data is another rich area for capstone work. A group worked with a large statewide standardized test scores database containing student performance dating back over 10 years. They performed longitudinal analysis to determine what factors most strongly correlated with improvements or declines in test scores over time. Features they considered included school characteristics, class sizes, teacher experience levels, as well as student demographics. Their statistical models provided insight into what policies had the biggest impacts on student outcomes.

Some students obtain datasets directly from private companies or non-profits. For example, a retail company provided anonymized customer transaction records from its loyalty program. Students analyzed purchasing patterns and grouped customers into segments with similar behaviors. They also built predictive models to identify good prospects for targeted marketing campaigns. Another project partnered with a medical research non-profit. Students analyzed its database of published clinical trials to determine which therapies were most promising based on completed studies. They also examined factors correlated with trials receiving funding or being terminated early. Their analyses could help guide the non-profit’s future research investment strategies.

While it isn’t always possible to work with restricted real-world datasets, many students supplement private data projects with publicly available benchmark datasets. For example, the Iris flowers dataset, Wine quality dataset, and Breast cancer dataset from the UCI Machine Learning Repository have all been used in student capstones. Projects analyze these datasets by applying modern techniques like deep learning or by making comparisons to historical analyses. Students then discuss potential applications and limitations if the models were used in similar real problem domains.

Some larger capstone projects involve collecting original datasets. For instance, education students designed questionnaires and conducted surveys of K-12 teachers and administrators in their state. They gathered input on professional development needs and challenges in teaching certain subjects. After analyzing the survey results, students presented strategic recommendations to the state department of education. In another example, engineering students gathered sensor readings from their own Internet-of-Things devices deployed on a university campus, collecting data on factors like noise levels, foot traffic and weather over several months. They used this to develop predictive maintenance models for campus facilities.

Real-world datasets enable capstone students to gain experience analyzing significant problems and generating potentially impactful insights, while also demonstrating technical and analytical skills. The ability to link those findings back to an applied context or decision-making scenario adds relevance and value for the organizations involved. While privacy and consent challenges exist, appropriate partnerships and data access have allowed many successful student projects.