Model: Random Forest Regressor Trained on Merged NFL Stadium Attendance Dataset

Paskovski, Emil

doi:10.70124/9bnb9-n0c65

Published April 25, 2025 | Version v1

Model Open

This Random Forest Regressor was trained on a custom-preprocessed dataset of NFL stadium attendance, built by merging three publicly available Kaggle datasets uploaded by Sujay Kapadnis. The original datasets contain game-level statistics (games.csv), season standings (standings.csv), and attendance records (attendance.csv).

The datasets were merged and cleaned to form a single modeling-ready table. Categorical variables (such as teams, location, and outcome) were transformed using one-hot encoding, while numerical variables (such as scores and rankings) were standardized using StandardScaler. The target variable is weekly stadium attendance for NFL games.
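The encoding and scaling step described above can be sketched with scikit-learn's ColumnTransformer. This is a minimal illustration, not the author's actual preprocessing code: the toy table, the column names (team, outcome, points_scored, weekly_attendance) and the use of a ColumnTransformer are all assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the merged table; real column names may differ.
merged = pd.DataFrame({
    "team": ["Cardinals", "Falcons", "Ravens", "Bills"],
    "outcome": ["win", "loss", "win", "loss"],
    "points_scored": [24, 17, 31, 10],
    "weekly_attendance": [65000, 70000, 71000, 68000],
})

categorical_cols = ["team", "outcome"]
numerical_cols = ["points_scored"]

# One-hot encode the categorical columns and standardize the numerical ones.
# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("numerical", StandardScaler(), numerical_cols),
    ],
    sparse_threshold=0,
)

X = preprocessor.fit_transform(merged[categorical_cols + numerical_cols])
y = merged["weekly_attendance"]

print(X.shape)  # 4 team columns + 2 outcome columns + 1 scaled numeric column
```

For brevity the transformers here are fitted on the whole toy table; in a real pipeline they would normally be fitted on the training split only and then reused for validation, test, and inference data.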

The full dataset was split into:

Training set: 70%
Validation set: 15%
Test set: 15%
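A 70/15/15 split like the one listed above is commonly produced with two calls to train_test_split. The sketch below assumes that approach; the synthetic arrays and the random_state value are placeholders, and the record does not state how the split was actually implemented.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First cut: 70% training, 30% held out.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Second cut: split the held-out 30% in half, giving 15% validation and 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```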

Hyperparameters were tuned with GridSearchCV on a RandomForestRegressor from sklearn.ensemble, using 3-fold cross-validation on the training set. The search selected the configuration with the lowest mean absolute error (MAE).
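A search of this kind might look like the sketch below. The parameter grid and the synthetic data are assumptions chosen to keep the example fast; the record reports only the winning values, not the grid that was searched. scikit-learn maximizes scores, so MAE is passed as "neg_mean_absolute_error" and negated afterwards.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training set.
X_train, y_train = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

# Hypothetical grid, kept small on purpose.
param_grid = {
    "n_estimators": [10, 25],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,                               # 3-fold cross-validation
)
search.fit(X_train, y_train)

best_mae = -search.best_score_  # convert back to a positive MAE
print(search.best_params_, best_mae)
```

With the default refit=True, GridSearchCV refits the best configuration on the entire training set and exposes it as search.best_estimator_.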

Best hyperparameters found:

n_estimators: 250
max_depth: None
min_samples_split: 2
min_samples_leaf: 3
max_features: None

After tuning, the final model was retrained on the full training set with the best hyperparameters and evaluated on the validation set as a proxy for generalization performance.
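As an illustration of this last step, the sketch below builds a RandomForestRegressor with the hyperparameters reported above, fits it on a training split, and measures MAE on a held-out validation split. The data is synthetic and the random_state values are assumptions, so the resulting MAE says nothing about the published model's performance.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real features come from the merged NFL table.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# Hyperparameters as reported in this record.
model = RandomForestRegressor(
    n_estimators=250,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=3,
    max_features=None,
    random_state=0,  # assumption: added here only for reproducibility
)
model.fit(X_train, y_train)

# Validation MAE as a proxy for generalization performance.
val_mae = mean_absolute_error(y_val, model.predict(X_val))
print(f"Validation MAE: {val_mae:.2f}")
```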

Technical environment:

Python version: 3.13.1
Libraries used:
- scikit-learn (sklearn)
- pandas, numpy
- matplotlib (for evaluation visuals)

The model is saved as a .pkl file and is intended for inference on similarly preprocessed data. Numerical features must be scaled using the same StandardScaler, and categorical encoding must follow the same one-hot schema used during training. The target scaler (target_scaler.pkl) is also provided to inverse-transform predictions back to the original scale.
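The inference flow described above (load the model, predict on preprocessed features, inverse-transform with the target scaler) can be sketched as follows. To stay runnable without the published files, the example trains a small stand-in model and scaler and writes them to a temporary directory. The helper predict_attendance is hypothetical, and loading with the pickle module is an assumption: the record does not say whether the .pkl files were written with pickle or joblib.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# Stand-ins for the published artifacts, trained on random data.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
attendance = rng.uniform(50_000, 80_000, size=50)

target_scaler = StandardScaler()
y_scaled = target_scaler.fit_transform(attendance.reshape(-1, 1)).ravel()
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_train, y_scaled)


def predict_attendance(model_path, scaler_path, X):
    """Hypothetical helper: predict on preprocessed X, return original-scale values."""
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(scaler_path, "rb") as f:
        loaded_scaler = pickle.load(f)
    scaled_predictions = loaded_model.predict(X)
    # The scaler expects a 2-D array, so reshape before inverting.
    return loaded_scaler.inverse_transform(scaled_predictions.reshape(-1, 1)).ravel()


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    scaler_path = Path(tmp) / "target_scaler.pkl"
    model_path.write_bytes(pickle.dumps(model))
    scaler_path.write_bytes(pickle.dumps(target_scaler))

    predictions = predict_attendance(model_path, scaler_path, X_train[:5])

print(predictions)
```

With the real artifacts, the two paths would point at the downloaded model file and target_scaler.pkl, and X would be the output of the same feature preprocessing used during training. Pickle files can execute code when loaded, so they should only be loaded from a trusted source.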

Model Risks, Biases, and Limitations:

Risks: Misuse for speculative dynamic ticket pricing strategies without ethical oversight.
Biases: Dataset may under-represent regional differences in reporting or non-ticketed attendance.
Limitations: Model does not account for special promotional events, weather conditions, or real-time ticket resale dynamics.

fair4ml:mlTask | Regression – NFL stadium attendance forecasting
fair4ml:modelCategory | Supervised → Ensemble → Random Forest
fair4ml:intendedUse | Stadium attendance forecasting for operational, marketing, and planning decisions in the NFL.
fair4ml:modelRisksBiasLimitations | Risk: Model predictions could be misused for discriminatory pricing. Bias: Original datasets may underreport regional attendance discrepancies. Limitations: Model does not include external factors like weather, ticket resale, or last-minute promotions.
fair4ml:trainedOn | https://doi.org/10.82556/4fst-m890
fair4ml:validatedOn | https://doi.org/10.82556/djvg-rb67
fair4ml:testedOn | https://doi.org/10.82556/zv9b-6h09
schema:license + fair4ml:legal / ethicalSocial | License: CC-BY-4.0
schema:codeRepository | https://github.com/emilp-tuwien/nfl-attendance-prediction/tree/main

Other

{
  /* --- core identification --- */
  "name": "Random Forest Regressor for NFL Stadium Attendance",
  "version": "1.0",
  "dateCreated": "2025-04-26",
  "dateModified": "2025-04-27",

  /* --- authorship --- */
  "author": {
    "@type": "Person",
    "name": "Emil Paskovski",
    "identifier": "https://orcid.org/0009-0004-4299-0636"
  },

  /* --- licensing & code --- */
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "codeRepository": "https://github.com/emilp-tuwien/nfl-attendance-prediction",

  /* --- FAIR4ML task info --- */
  "mlTask": "regression",
  "modelCategory": "Supervised > Ensemble > Random Forest",
  "intendedUse": "Stadium attendance forecasting for operational, marketing and planning decisions in the NFL.",
  "modelRisksBiasLimitations": "Risk: potential misuse for discriminatory pricing. Bias: regional under-reporting of attendance. Limitation: does not consider weather, resale, or promotions.",

  /* --- training, validation, test sources --- */
  "trainedOn": "https://doi.org/10.82556/4fst-m890",
  "validatedOn": "https://doi.org/10.82556/djvg-rb67",
  "testedOn": "https://doi.org/10.82556/zv9b-6h09",

  /* --- environment & hyper-params (optional but useful) --- */
  "softwareRequirements": [
    "Python 3.13.1",
    "scikit-learn",
    "pandas",
    "numpy",
    "matplotlib"
  ],
  "hyperparameters": {
    "n_estimators": 250,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 3,
    "max_features": null
  },

  /* --- CO2 footprint (put N/A if unknown) --- */
  "hasCO2eEmissions": "N/A"
}

Files

Files (19.7 MiB)

Name	Checksum	Size
hyp_tuning_best_model.pkl	md5:eff911bd69b98ad85b02632fee735b60	19.7 MiB
