Model: Random Forest Regressor Trained on Merged NFL Stadium Attendance Dataset
Creators
Description
This Random Forest Regressor was trained on a custom-preprocessed dataset of NFL stadium attendance, built by merging three publicly available Kaggle datasets uploaded by Sujay Kapadnis: game-level statistics (games.csv), season standings (standings.csv), and attendance records (attendance.csv).
The datasets were merged and cleaned to form a single modeling-ready table. Categorical variables (such as teams, location, and outcome) were transformed using one-hot encoding, while numerical variables (such as scores and rankings) were standardized using StandardScaler. The target variable is weekly stadium attendance for NFL games.
The full dataset was split into:
Training set: 70%
Validation set: 15%
Test set: 15%
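A 70/15/15 split like the one listed above is commonly produced with two calls to train_test_split. The sketch below assumes a random split with a fixed seed on placeholder data; whether the original split was random, seeded, or stratified is not stated here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 30% of the rows.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Step 2: split the held-out rows in half, giving validation and test sets
# of roughly 15% each.
X_val, X_test, y_val, y_test = train_test_split(
    X_hold, y_hold, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```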
GridSearchCV was applied to a RandomForestRegressor from sklearn.ensemble, using 3-fold cross-validation on the training set. Hyperparameter combinations were ranked by mean absolute error (MAE), and the combination with the lowest cross-validated MAE was selected.
Best hyperparameters found:
n_estimators: 250
max_depth: None
min_samples_split: 2
min_samples_leaf: 3
max_features: None
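The search can be sketched as follows. The parameter grid shown is a deliberately small, hypothetical one (the grid actually searched is not documented here), and the training data is synthetic. In scikit-learn, MAE is selected with scoring="neg_mean_absolute_error", because GridSearchCV maximizes its score.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training set.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(120, 4))
y_train = 3.0 * X_train[:, 0] + rng.normal(scale=0.1, size=120)

# Hypothetical grid for illustration only.
param_grid = {
    "n_estimators": [50, 250],
    "min_samples_leaf": [1, 3],
    "max_features": [None, "sqrt"],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",  # scores are negated MAE values
    cv=3,                               # 3-fold cross-validation
    refit=True,                         # refit the best setting on all training rows
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_mae = -search.best_score_  # flip the sign back to a plain MAE
print(best_params, best_cv_mae)
```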
After tuning, the final model was retrained on the full training set, and its performance was evaluated on the validation set as a proxy for generalization performance.
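As a rough illustration of this step, the sketch below fits a RandomForestRegressor with the reported hyperparameters and scores it on a held-out validation slice. The data is synthetic and random_state is an assumed value; comparing against a predict-the-mean baseline is an added sanity check, not something the original evaluation is stated to include.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data, sliced into training and validation parts.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, y_train = X[:140], y[:140]
X_val, y_val = X[140:170], y[140:170]

# Hyperparameters as reported above; random_state is an assumption.
model = RandomForestRegressor(
    n_estimators=250,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=3,
    max_features=None,
    random_state=0,
)
model.fit(X_train, y_train)

val_mae = mean_absolute_error(y_val, model.predict(X_val))

# Baseline: always predict the training mean.
baseline_pred = np.full_like(y_val, y_train.mean())
baseline_mae = mean_absolute_error(y_val, baseline_pred)
print(val_mae, baseline_mae)
```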
Technical environment:
Python version: 3.13.1
Libraries used:
scikit-learn (sklearn)
pandas, numpy
matplotlib (for evaluation visuals)
The model is saved as a .pkl file and is intended for inference on similarly preprocessed data. Numerical features must be scaled using the same StandardScaler, and categorical encoding must follow the same one-hot schema used during training. The target scaler (target_scaler.pkl) is also provided to inverse-transform predictions back to the original scale.
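The inference flow can be sketched as below. To stay runnable on its own, the example first creates stand-in artifacts in a temporary directory. Only the name target_scaler.pkl comes from this record; model.pkl and feature_scaler.pkl are hypothetical file names, and treating the target scaler as a StandardScaler is an assumption. One-hot columns are omitted for brevity.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# --- Stand-in artifacts so the sketch runs on its own (synthetic data) ---
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_raw = 60000.0 + 500.0 * X_raw[:, 0] + rng.normal(scale=1000.0, size=100)

feature_scaler = StandardScaler().fit(X_raw)
target_scaler = StandardScaler().fit(y_raw.reshape(-1, 1))  # assumed scaler type
model = RandomForestRegressor(n_estimators=50, min_samples_leaf=3, random_state=0)
model.fit(
    feature_scaler.transform(X_raw),
    target_scaler.transform(y_raw.reshape(-1, 1)).ravel(),
)

artifact_dir = Path(tempfile.mkdtemp())
artifacts = {
    "model.pkl": model,                    # hypothetical file name
    "feature_scaler.pkl": feature_scaler,  # hypothetical file name
    "target_scaler.pkl": target_scaler,    # name given in this record
}
for name, obj in artifacts.items():
    with open(artifact_dir / name, "wb") as f:
        pickle.dump(obj, f)


# --- Inference: load, scale features, predict, undo target scaling ---
def load_artifact(name):
    with open(artifact_dir / name, "rb") as f:
        return pickle.load(f)


loaded_model = load_artifact("model.pkl")
loaded_feature_scaler = load_artifact("feature_scaler.pkl")
loaded_target_scaler = load_artifact("target_scaler.pkl")

X_new = X_raw[:5]
scaled_pred = loaded_model.predict(loaded_feature_scaler.transform(X_new))
attendance_pred = loaded_target_scaler.inverse_transform(
    scaled_pred.reshape(-1, 1)
).ravel()
print(attendance_pred)
```

Pickle files can execute arbitrary code when loaded, so they should only be unpickled from a trusted source, ideally with the same scikit-learn version that produced them.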
Model Risks, Biases, and Limitations:
Risks: Misuse for speculative dynamic ticket pricing strategies without ethical oversight.
Biases: Dataset may under-represent regional differences in reporting or non-ticketed attendance.
Limitations: Model does not account for special promotional events, weather conditions, or real-time ticket resale dynamics.
fair4ml:mlTask | Regression – NFL stadium attendance forecasting
fair4ml:modelCategory | Supervised → Ensemble → Random Forest
fair4ml:intendedUse | Stadium attendance forecasting for operational, marketing, and planning decisions in the NFL.
fair4ml:modelRisksBiasLimitations | - Risk: Model predictions could be misused for discriminatory pricing. - Bias: Original datasets may underreport regional attendance discrepancies. - Limitations: Model does not include external factors like weather, ticket resale, or last-minute promotions.
fair4ml:trainedOn | - https://doi.org/10.82556/4fst-m890
fair4ml:validatedOn | - https://doi.org/10.82556/djvg-rb67
fair4ml:testedOn | - https://doi.org/10.82556/zv9b-6h09
schema:license + fair4ml:legal / ethicalSocial | License: CC-BY-4.0
schema:codeRepository | https://github.com/emilp-tuwien/nfl-attendance-prediction/tree/main
Other
<script type="application/ld+json">
{
  "@context": [
    "https://schema.org",
    "https://w3id.org/fair4ml/context"
  ],
  "@type": "fair4ml:MLModel",
  "name": "Random Forest Regressor for NFL Stadium Attendance",
  "version": "1.0",
  "dateCreated": "2025-04-26",
  "dateModified": "2025-04-27",
  "author": {
    "@type": "Person",
    "name": "Emil Paskovski",
    "identifier": "https://orcid.org/0009-0004-4299-0636"
  },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "codeRepository": "https://github.com/emilp-tuwien/nfl-attendance-prediction",
  "mlTask": "regression",
  "modelCategory": "Supervised > Ensemble > Random Forest",
  "intendedUse": "Stadium attendance forecasting for operational, marketing and planning decisions in the NFL.",
  "modelRisksBiasLimitations": "Risk: potential misuse for discriminatory pricing. Bias: regional under-reporting of attendance. Limitation: does not consider weather, resale, or promotions.",
  "trainedOn": "https://doi.org/10.82556/4fst-m890",
  "validatedOn": "https://doi.org/10.82556/djvg-rb67",
  "testedOn": "https://doi.org/10.82556/zv9b-6h09",
  "softwareRequirements": [
    "Python 3.13.1",
    "scikit-learn",
    "pandas",
    "numpy",
    "matplotlib"
  ],
  "hyperparameters": {
    "n_estimators": 250,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 3,
    "max_features": null
  },
  "hasCO2eEmissions": "N/A"
}
</script>
Files
| Name | Size | |
|---|---|---|
| md5:eff911bd69b98ad85b02632fee735b60 | 19.7 MiB | Download |