Published April 28, 2025 | Version v1
Dataset Open

Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis

  • 1. TU Wien

Description

Context and Methodology

Research Domain:
The dataset is part of a project on retail sales forecasting. It supports predicting daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning for business optimization. The goal is to forecast future sales from historical data, taking into account factors such as promotions, competition, holidays, and seasonal trends.

Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales up to six weeks in advance. Accurate sales predictions allow Rossmann to improve inventory management, staffing decisions, and promotional strategies. The dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making across the company’s large network of stores.

How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and external factors such as competition. Its features include store type and assortment, promotion details, and whether a store was open or closed on a given day. The data was originally created for the Kaggle Rossmann Store Sales competition and remains publicly available on Kaggle. It is also accessible through the DBRepo API for further analysis and modeling, and it is structured so that machine learning models can predict future sales from the input variables described below.

Technical Details

Dataset Structure:

The dataset consists of three main files, each with its specific role:

  1. Train:
    This file contains the historical sales data used to train machine learning models. It includes daily sales information for each store, together with features that could influence sales, such as promotions and holidays.

    https://handle.test.datacite.org/10.82556/yb6j-jw41
    PID: b1c59499-9c6e-42c2-af8f-840181e809db
  2. Test2:
    The test file mirrors the structure of the Train file but does not include the actual sales values (i.e., the target variable). Trained machine learning models generate predictions for its rows, for which the true sales are withheld.

    https://handle.test.datacite.org/10.82556/jerg-4b84
    PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
  3. Store:
    This file provides metadata about each store, such as its type, assortment level, nearby competition, and Promo2 participation. This context is essential for interpreting the sales data.

    https://handle.test.datacite.org/10.82556/nqeg-gy34
    PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
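As a rough illustration of how the Train and Store files fit together, the sketch below joins per-day sales rows with per-store metadata on the shared Store column using pandas. The inline CSV snippets and their values are made-up stand-ins, not rows from the actual files, and the column subset is chosen only for brevity.

```python
import io

import pandas as pd

# Tiny inline stand-ins for the Train and Store files (values are invented).
train_csv = io.StringIO(
    "Store,Date,Sales,Customers,Open,Promo,StateHoliday,SchoolHoliday\n"
    "1,2015-07-30,5020,546,1,1,0,1\n"
    "1,2015-07-31,5263,555,1,1,0,1\n"
    "2,2015-07-31,6064,625,1,1,0,1\n"
)
store_csv = io.StringIO(
    "Store,StoreType,Assortment,CompetitionDistance,Promo2\n"
    "1,c,a,1270,0\n"
    "2,a,a,570,1\n"
)

# Read StateHoliday as text so '0', 'a', 'b', 'c' share one dtype.
train = pd.read_csv(train_csv, parse_dates=["Date"], dtype={"StateHoliday": str})
store = pd.read_csv(store_csv)

# Left join: every sales row keeps its place and gains the store metadata.
# validate="many_to_one" raises if a Store id appears twice in the store table.
merged = train.merge(store, on="Store", how="left", validate="many_to_one")

print(merged[["Store", "Date", "Sales", "StoreType", "Assortment"]])
```

A left join keeps every sales row even if a store were missing from the metadata table, which makes such gaps visible as missing values instead of silently dropping rows.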

Data Fields Description:

  • Id: A unique identifier for each (Store, Date) combination within the test set.

  • Store: A unique identifier for each store.

  • Sales: The daily turnover for each store on a specific day. This is the target variable to be predicted.

  • Customers: The number of customers visiting the store on a given day.

  • Open: An indicator of whether the store was open (1 = open, 0 = closed).

  • StateHoliday: Indicates if the day is a state holiday, with values like:

    • 'a' = public holiday,

    • 'b' = Easter holiday,

    • 'c' = Christmas,

    • '0' = no holiday.

  • SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).

  • StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.

  • Assortment: Describes the level of product assortment in the store:

    • 'a' = basic,

    • 'b' = extra,

    • 'c' = extended.

  • CompetitionDistance: Distance (in meters) to the nearest competitor store.

  • CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.

  • Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).

  • Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).

  • Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.

  • PromoInterval: Lists the months in which a new round of Promo2 starts, e.g., "Feb,May,Aug,Nov" means a round starts in February, May, August, and November of each year.
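The Promo2 and PromoInterval fields are often combined into a single derived flag. The following is a minimal sketch, assuming PromoInterval is a comma-separated string of month abbreviations as in the example above. The function name and the month table are illustrative, and the sketch ignores Promo2Since[Year/Week], which a fuller version would also check.

```python
from datetime import date

# Assumed three-letter month abbreviations. Longer spellings such as "Sept"
# are handled by truncating each token to its first three letters.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def promo2_starts_in_month(day, promo2, promo_interval):
    """Return True if a new Promo2 round starts in the month of `day`.

    `promo2` is the 0/1 participation flag; `promo_interval` is a string
    such as "Feb,May,Aug,Nov" (or None/empty for non-participating stores).
    """
    if promo2 != 1 or not promo_interval:
        return False
    months = {token.strip()[:3] for token in promo_interval.split(",")}
    return MONTH_ABBR[day.month - 1] in months


print(promo2_starts_in_month(date(2015, 5, 10), 1, "Feb,May,Aug,Nov"))
print(promo2_starts_in_month(date(2015, 6, 1), 1, "Feb,May,Aug,Nov"))
```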

Software Requirements

Working with this dataset requires the following access credentials and software:

  • DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.

  • Python Libraries: Key libraries for working with the dataset include:

    • pandas for data manipulation,

    • numpy for numerical operations,

    • matplotlib and seaborn for data visualization,

    • scikit-learn for machine learning algorithms.

Additional Resources

Several additional resources are available for working with the dataset:

  1. Presentation:
    A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.

  2. Jupyter Notebook:
    A Jupyter notebook, Retail_Sales_Prediction_Capstone_Project.ipynb, documents the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.

  3. Model Evaluation Results:
    The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.

  4. Trained Models (.pkl files):
    The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest and Linear Regression), which can be loaded and used to make predictions without retraining from scratch.

  5. sample_submission.csv:
    This file demonstrates the expected format of predictions for submission. It contains predictions made on the test dataset using the trained Random Forest model.

These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.
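For readers who want to reproduce the two error metrics named above, the sketch below computes RMSE and MAPE in plain Python. It is not taken from the project notebook. Skipping rows whose actual value is zero in MAPE (for example, days on which a store was closed) is an assumption made here to avoid division by zero, and the sample numbers are invented.

```python
import math


def rmse(y_true, y_pred):
    """Root Mean Squared Error: sqrt of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error, in percent.

    Assumption: rows with an actual value of 0 are skipped, because the
    percentage error is undefined there.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    if not pairs:
        raise ValueError("MAPE is undefined when all actual values are zero")
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# Invented example values: actual vs. predicted daily sales.
actual = [100, 200, 0, 400]
predicted = [110, 180, 5, 400]

print(f"RMSE: {rmse(actual, predicted):.2f}")
print(f"MAPE: {mape(actual, predicted):.2f}%")
```

RMSE is expressed in the same units as Sales and penalizes large errors more heavily, while MAPE is scale-free, which makes it easier to compare stores with very different turnover.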

Files

Random Forest Feature Importance.png

Files (4.0 GiB)

Checksum  Size
md5:f52f10f33c0f9d50597076138a88b306  288.1 KiB
md5:d61a403db3f521ead1b642ea94af474b  196.2 KiB
md5:08ee26164bd4fe05324380f8916f98ca  78.4 MiB
md5:903cd7e184ce950e2869d7a1cf476f67  78.6 MiB
md5:3c8d95ac6b0ca779bc4a41a4b6a7726a  301.1 KiB
md5:12aa921c84acb04afd8590d2b721aed6  216.6 KiB
md5:7e62c757eafd74d7a402dd6e24ddca73  304.1 KiB
md5:6c0d69c89b697dd3ff898d1007b6fb63  207.2 KiB
md5:78d1483e004afe7e5f08c945fa292337  4.1 KiB
md5:28f1ddbfd8683f3af44ba919926a718b  301.8 KiB
md5:852ef4e3c9d6c02af3cc53566f86b02b  205.7 KiB
md5:327c45106fcfe56e6214fdc8e9d62bf3  817 Bytes
md5:9a1a53b6f5808c9332f7d855796301c1  1.8 KiB
md5:00839da74fd2d07f39ba5e9f9945e552  244.6 KiB
md5:3e74743d5f427b578b8aa5239d87de81  299.2 KiB
md5:5ba7a46e64560d69b1962ba92748f0f5  3.9 GiB
md5:9ca48264979e484b4c1be41607b0c576  199.8 KiB
md5:c050dc8e24242192793de20f5fe9472c  3.4 KiB
md5:54c61171f9d28ee42cb76c266c131699  757.3 KiB
md5:cd81fed6a56d1942a205339ff2cc0974  2.5 MiB
md5:44da16ced7e649d473a2dde3e454f3f3  422.2 KiB
md5:27d2c2529b29a07472f62fd8202804db  258.8 KiB
md5:75a91131d3ba697c7e6a9e2eb5f51c57  263.0 KiB
md5:ed509bc99c1993f4de721ed6bcc15770  259.8 KiB
md5:144c7bff499315c61b72e31606806d7b  303.2 KiB
md5:8db6ec18e668c11aa8f0a659016f3fd6  7.7 MiB
md5:050789b6f302cfb989e6ebf6929a6097  206.7 KiB

Additional details

Related works

Is part of
Software: 10.5281/zenodo.15295800 (DOI)
Dataset: 10.82556/yb6j-jw41 (DOI)
Dataset: 10.82556/jerg-4b84 (DOI)
Dataset: 10.82556/nqeg-gy34 (DOI)

Dates

Submitted
2025-04
