Evaluating FAIR Models for Rossmann Store Sales Prediction: Insights and Performance Analysis

doi:10.70124/f5t2d-xt904

Published April 28, 2025 | Version v1

Dataset Open

Çakmak, Dilara¹

1. TU Wien

Context and Methodology

Research Domain:
The dataset is part of a project focused on retail sales forecasting. Specifically, it is designed to predict daily sales for Rossmann, a chain of over 3,000 drug stores operating across seven European countries. The project falls under the broader domain of time series analysis and machine learning applications for business optimization. The goal is to apply machine learning techniques to forecast future sales based on historical data, which includes factors like promotions, competition, holidays, and seasonal trends.

Purpose:
The primary purpose of this dataset is to help Rossmann store managers predict daily sales for up to six weeks in advance. By making accurate sales predictions, Rossmann can improve inventory management, staffing decisions, and promotional strategies. This dataset serves as a training set for machine learning models aimed at reducing forecasting errors and supporting decision-making processes across the company’s large network of stores.

How the Dataset Was Created:
The dataset was compiled from several sources, including historical sales data from Rossmann stores, promotional calendars, holiday schedules, and information about nearby competitors. It is organized into features such as store type, promotion details, holiday indicators, and whether the store was open or closed. The data was originally created for the Kaggle Rossmann Store Sales competition and remains publicly available there. For this project it is also made accessible via the DBRepo API, and it is structured so that machine learning models can predict future sales from these input variables.

Technical Details

Dataset Structure:

The dataset consists of three main files, each with its specific role:

Train:
This file contains the historical sales data, which is used to train machine learning models. It includes daily sales information for each store, as well as various features that could influence the sales (e.g., promotions, holidays, store type, etc.).
```
https://handle.test.datacite.org/10.82556/yb6j-jw41
PID: b1c59499-9c6e-42c2-af8f-840181e809db
```
Test2:
The test dataset mirrors the structure of the training data but does not include the actual sales values (i.e., the target variable). The trained machine learning models are applied to this file to produce sales predictions for days on which the true sales are withheld.
```
https://handle.test.datacite.org/10.82556/jerg-4b84
PID: 7cbb845c-21dd-4b60-b990-afa8754a0dd9
```
Store:
This file provides metadata about each store, such as its store type, assortment level, nearest-competitor details, and Promo2 participation. This metadata supplies the context needed to interpret each store's sales figures.
```
https://handle.test.datacite.org/10.82556/nqeg-gy34
PID: 9627ec46-4ee6-4969-b14a-bda555fe34db
```
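Because the sales rows and the store metadata live in separate files, they are usually combined on the shared Store identifier before modeling. The following is a minimal sketch of that join with pandas; the rows are made-up stand-ins for the real tables, and the helper name `join_store_metadata` is hypothetical rather than taken from the project notebook.

```python
import pandas as pd

# Made-up rows standing in for the Train and Store tables (illustration only).
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2015-07-30", "2015-07-31", "2015-07-31"]),
    "Sales": [5020, 5263, 6064],
    "Open": [1, 1, 1],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "Assortment": ["a", "a"],
    "CompetitionDistance": [1270.0, 570.0],
})


def join_store_metadata(sales: pd.DataFrame, stores: pd.DataFrame) -> pd.DataFrame:
    """Attach per-store metadata to every daily sales row.

    validate="many_to_one" makes pandas raise an error if a store
    appears more than once in the metadata table.
    """
    return sales.merge(stores, on="Store", how="left", validate="many_to_one")


merged = join_store_metadata(train, store)
print(merged.head())
```

A left join keeps every sales row even if a store were missing from the metadata, which makes such gaps visible as missing values instead of silently dropping data.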

Data Fields Description:

Id: A unique identifier for each (Store, Date) combination within the test set.
Store: A unique identifier for each store.
Sales: The daily turnover (target variable) for each store on a specific day (this is what you are predicting).
Customers: The number of customers visiting the store on a given day.
Open: An indicator of whether the store was open (1 = open, 0 = closed).
StateHoliday: Indicates if the day is a state holiday, with values like:
- 'a' = public holiday,
- 'b' = Easter holiday,
- 'c' = Christmas,
- '0' = no holiday.
SchoolHoliday: Indicates whether the store is affected by school closures (1 = yes, 0 = no).
StoreType: Differentiates between four types of stores: 'a', 'b', 'c', 'd'.
Assortment: Describes the level of product assortment in the store:
- 'a' = basic,
- 'b' = extra,
- 'c' = extended.
CompetitionDistance: Distance (in meters) to the nearest competitor store.
CompetitionOpenSince[Month/Year]: The month and year when the nearest competitor store opened.
Promo: Indicates whether the store is running a promotion on a particular day (1 = yes, 0 = no).
Promo2: Indicates whether the store is participating in Promo2, a continuing promotion for some stores (1 = participating, 0 = not participating).
Promo2Since[Year/Week]: The year and calendar week when the store started participating in Promo2.
PromoInterval: Lists the months in which a new round of Promo2 starts, e.g., "Feb,May,Aug,Nov" means a new round starts in February, May, August, and November of each year.
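The coded fields above can be decoded with a few lines of Python. The sketch below is illustrative only: the helper names are hypothetical, month names are matched on their first three letters as an assumption about how they are abbreviated, and Promo2Since[Year/Week] is deliberately ignored to keep the example short.

```python
import pandas as pd

# Labels taken from the StateHoliday description above.
STATE_HOLIDAY_LABELS = {
    "a": "public holiday",
    "b": "Easter holiday",
    "c": "Christmas",
    "0": "no holiday",
}

# Fixed English abbreviations, so the result does not depend on the locale.
MONTH_ABBR = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def describe_state_holiday(code) -> str:
    """Map a StateHoliday code to its label; str() also accepts a numeric 0."""
    return STATE_HOLIDAY_LABELS[str(code)]


def promo2_round_starts(promo2: int, promo_interval, date: pd.Timestamp) -> bool:
    """Return True if a Promo2 round starts in the month of `date`.

    Assumption: only the first three letters of each month name matter.
    Promo2Since[Year/Week] is not checked in this sketch.
    """
    if promo2 != 1 or not isinstance(promo_interval, str):
        return False
    start_months = {name.strip()[:3] for name in promo_interval.split(",")}
    return MONTH_ABBR[date.month - 1] in start_months


may_start = promo2_round_starts(1, "Feb,May,Aug,Nov", pd.Timestamp("2015-05-10"))
june_start = promo2_round_starts(1, "Feb,May,Aug,Nov", pd.Timestamp("2015-06-10"))
```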

Software Requirements

To work with this dataset, you will need access credentials and the following software:

DBRepo Authorization: This is required to access the datasets via the DBRepo API. You may need to authenticate with an API key or login credentials to retrieve the datasets.
Python Libraries: Key libraries for working with the dataset include:
- pandas for data manipulation,
- numpy for numerical operations,
- matplotlib and seaborn for data visualization,
- scikit-learn for machine learning algorithms.

Additional Resources

Several additional resources are available for working with the dataset:

Presentation:
A presentation summarizing the exploratory data analysis (EDA), feature engineering process, and key insights from the analysis is provided. This presentation also includes visualizations that help in understanding the dataset’s trends and relationships.
Jupyter Notebook:
A Jupyter notebook, titled Retail_Sales_Prediction_Capstone_Project.ipynb, is provided, which details the entire machine learning pipeline, from data loading and cleaning to model training and evaluation.
Model Evaluation Results:
The project includes a detailed evaluation of various machine learning models, including their performance metrics like training and testing scores, Mean Absolute Percentage Error (MAPE), and Root Mean Squared Error (RMSE). This allows for a comparison of model effectiveness in forecasting sales.
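For reference, the two error metrics named above can be computed as follows. This is a minimal NumPy sketch with made-up numbers, not the evaluation code from the notebook. Skipping days with zero actual sales in MAPE is an assumption made here because the percentage error is undefined for them (for example, on days when a store is closed).

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root Mean Squared Error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred) -> float:
    """Mean Absolute Percentage Error in percent.

    Assumption: rows with zero actual sales are excluded, since the
    percentage error is undefined when the true value is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    ratios = np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])
    return float(np.mean(ratios) * 100)


# Made-up values for illustration only.
actual = [100, 200, 0, 400]
predicted = [110, 180, 5, 400]

rmse_value = rmse(actual, predicted)
mape_value = mape(actual, predicted)
print(f"RMSE: {rmse_value:.3f}, MAPE: {mape_value:.3f}%")
```

RMSE is expressed in the same units as Sales and penalizes large misses more heavily, while MAPE is scale-free, which is why the two are often reported together.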
Trained Models (.pkl files):
The models trained during the project are saved as .pkl files. These files contain the trained machine learning models (e.g., Random Forest, Linear Regression, etc.) that can be loaded and used to make predictions without retraining the models from scratch.
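Reusing a saved model typically follows a save, load, predict round trip. The sketch below shows that pattern on a tiny stand-in model and a temporary file named `demo_model.pkl` (a hypothetical name). It assumes the published .pkl files were written with Python's `pickle` module, which the record does not state; if they were written with joblib, `joblib.load` would be the counterpart. Pickle files should only be loaded from trusted sources, and a scikit-learn version close to the one used for training may be needed.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (y = 2x + 1); the real models were trained on the sales data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_model.pkl"  # hypothetical file name

    # Save the fitted model to disk.
    with path.open("wb") as fh:
        pickle.dump(model, fh)

    # Load it back without retraining.
    with path.open("rb") as fh:
        restored = pickle.load(fh)

# The restored model predicts exactly like the original one.
prediction = float(restored.predict(np.array([[4.0]]))[0])
print(prediction)
```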
sample_submission.csv:
This sample submission file demonstrates the expected format of predictions. It contains predictions made on the test dataset with the trained Random Forest model and shows how the output should be structured for submission.
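A submission file of this kind can be assembled from the test identifiers and the model's predictions. The sketch below assumes the two-column layout (Id, Sales) used by the Kaggle competition; the rows and the helper name `build_submission` are made up, and setting Sales to zero for closed stores is an assumption of this example, not a documented step of the project.

```python
import io

import numpy as np
import pandas as pd

# Made-up test rows and predictions (illustration only).
test = pd.DataFrame({"Id": [1, 2, 3], "Store": [1, 3, 7], "Open": [1, 0, 1]})
predicted = np.array([5263.4, 4100.0, 8314.9])


def build_submission(test_df: pd.DataFrame, preds: np.ndarray) -> pd.DataFrame:
    """Pair each test Id with its predicted Sales value.

    Assumption: a closed store (Open == 0) sells nothing, so its
    prediction is overwritten with 0.
    """
    sales = np.where(test_df["Open"] == 0, 0.0, preds)
    return pd.DataFrame({"Id": test_df["Id"].to_numpy(), "Sales": sales})


submission = build_submission(test, predicted)

# Write to an in-memory buffer here; pass a file path to to_csv to save to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```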

These resources provide a comprehensive guide to implementing and analyzing the sales forecasting model, helping you understand the data, methods, and results in greater detail.

Files

Files (4.0 GiB)

Name	MD5	Size
Decision Tree Actual Predicted Histogram.png	md5:f52f10f33c0f9d50597076138a88b306	288.1 KiB
Decision Tree Real Predicted Scatter Plot.png	md5:d61a403db3f521ead1b642ea94af474b	196.2 KiB
Decision Tree Regression_model.pkl	md5:08ee26164bd4fe05324380f8916f98ca	78.4 MiB
k-Nearest Neighbors Regression_model.pkl	md5:903cd7e184ce950e2869d7a1cf476f67	78.6 MiB
KNN Histogram.png	md5:3c8d95ac6b0ca779bc4a41a4b6a7726a	301.1 KiB
KNN scatter plot.png	md5:12aa921c84acb04afd8590d2b721aed6	216.6 KiB
Lasso Regression Actual Predicted.png	md5:7e62c757eafd74d7a402dd6e24ddca73	304.1 KiB
Lasso Regression Real Predicted Scatter Plot.png	md5:6c0d69c89b697dd3ff898d1007b6fb63	207.2 KiB
Lasso Regression_model.pkl	md5:78d1483e004afe7e5f08c945fa292337	4.1 KiB
Linear Regression Actual - Predicted.png	md5:28f1ddbfd8683f3af44ba919926a718b	301.8 KiB
Linear Regression Real Predicted Scatter Plot.png	md5:852ef4e3c9d6c02af3cc53566f86b02b	205.7 KiB
Linear Regression_model.pkl	md5:327c45106fcfe56e6214fdc8e9d62bf3	817 Bytes
model_evaluation_results.json	md5:9a1a53b6f5808c9332f7d855796301c1	1.8 KiB
Random Forest Feature Importance.png	md5:00839da74fd2d07f39ba5e9f9945e552	244.6 KiB
Random Forest Histogram.png	md5:3e74743d5f427b578b8aa5239d87de81	299.2 KiB
Random Forest Regression_model.pkl	md5:5ba7a46e64560d69b1962ba92748f0f5	3.9 GiB
Random Forest Scatter Plot.png	md5:9ca48264979e484b4c1be41607b0c576	199.8 KiB
README.md	md5:c050dc8e24242192793de20f5fe9472c	3.4 KiB
Retail Sales Prediction.pdf	md5:54c61171f9d28ee42cb76c266c131699	757.3 KiB
Retail_Sales_Prediction_Capstone_Project.ipynb	md5:cd81fed6a56d1942a205339ff2cc0974	2.5 MiB
sample_submission.csv	md5:44da16ced7e649d473a2dde3e454f3f3	422.2 KiB
Train Test MAPEs.png	md5:27d2c2529b29a07472f62fd8202804db	258.8 KiB
Train Test RMSEs.png	md5:75a91131d3ba697c7e6a9e2eb5f51c57	263.0 KiB
Train Test Scores.png	md5:ed509bc99c1993f4de721ed6bcc15770	259.8 KiB
Tuned Decision Tree Actual Predicted Histograms.png	md5:144c7bff499315c61b72e31606806d7b	303.2 KiB
Tuned Decision Tree Regression_model.pkl	md5:8db6ec18e668c11aa8f0a659016f3fd6	7.7 MiB
Tuned Decision Tree Scatter Plot.png	md5:050789b6f302cfb989e6ebf6929a6097	206.7 KiB
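The md5 values in the file listing can be used to confirm that a download completed without corruption, which is especially useful for the multi-gigabyte Random Forest model. The following is a minimal sketch using Python's standard `hashlib` module; the helper names are hypothetical, and the demonstration hashes a small temporary file rather than one of the listed files.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks so that
    very large files do not have to fit into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, listed_checksum: str) -> bool:
    """Compare a file against a checksum written as 'md5:<hex>' or plain hex."""
    expected = listed_checksum.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected


# Demonstration on a temporary file (stand-in for a downloaded file).
payload = b"example bytes standing in for a downloaded file"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.bin"
    demo_path.write_bytes(payload)
    good = matches_listing(demo_path, "md5:" + hashlib.md5(payload).hexdigest())
    bad = matches_listing(demo_path, "md5:" + "0" * 32)

print(good, bad)
```

For a real check, pass the path of the downloaded file together with the corresponding value from the listing, for example the `md5:...` string shown next to README.md.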

Additional details

Is part of: Software: 10.5281/zenodo.15295800 (DOI); Dataset: 10.82556/yb6j-jw41 (DOI); Dataset: 10.82556/jerg-4b84 (DOI); Dataset: 10.82556/nqeg-gy34 (DOI)

Submitted: 2025-04

Siddiqui, A. (2021, October 2). Retail Sales Prediction [Computer software]. Retrieved from GitHub: https://github.com/asim5800/Retail-Sales-Prediction
Cukierski, W. (2016). Rossmann Store Sales. Retrieved from Kaggle: https://www.kaggle.com/competitions/rossmann-store-sales/data?select=sample_submission.csv
