Published April 25, 2025 | Version v1
Model Open

Model: Random Forest Regressor Trained on Merged NFL Stadium Attendance Dataset

Description

This Random Forest Regressor model was trained on a custom-preprocessed dataset of NFL stadium attendance, constructed by merging three publicly available datasets from Kaggle, uploaded by Sujay Kapadnis. The original datasets include game-level statistics (games.csv), season standings (standings.csv), and attendance records (attendance.csv).
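As a rough illustration of the merge step, the sketch below joins three small in-memory tables with pandas. The column names (team_name, year, week, weekly_attendance, wins, pts_win, home_team_name) and the join keys are assumptions made for this example; the actual schema of games.csv, standings.csv, and attendance.csv may differ.

```python
import pandas as pd

# Tiny stand-ins for attendance.csv, standings.csv, and games.csv.
# All column names and values are illustrative assumptions.
attendance = pd.DataFrame({
    "team_name": ["Cardinals", "Cardinals", "Falcons"],
    "year": [2000, 2000, 2000],
    "week": [1, 2, 1],
    "weekly_attendance": [77434.0, 66009.0, None],  # None: no home game that week
})
standings = pd.DataFrame({
    "team_name": ["Cardinals", "Falcons"],
    "year": [2000, 2000],
    "wins": [3, 4],
})
games = pd.DataFrame({
    "home_team_name": ["Cardinals", "Falcons"],
    "year": [2000, 2000],
    "week": [2, 1],
    "pts_win": [32, 36],
})

# Season-level standings attach to every weekly row of the same team and year.
merged = attendance.merge(standings, on=["team_name", "year"], how="left")

# Game-level statistics attach on team, year, and week.
merged = merged.merge(
    games.rename(columns={"home_team_name": "team_name"}),
    on=["team_name", "year", "week"],
    how="left",
)

# Rows without a target value cannot be used for training.
merged = merged.dropna(subset=["weekly_attendance"]).reset_index(drop=True)
print(merged)
```

Left joins keep every attendance row, so weeks without a matching game record show up as missing values that the cleaning step has to handle explicitly.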

The datasets were merged and cleaned to form a single modeling-ready table. Categorical variables (such as teams, location, and outcome) were transformed using one-hot encoding, while numerical variables (such as scores and rankings) were standardized using StandardScaler. The target variable is weekly stadium attendance for NFL games.
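The encoding and scaling described above could be sketched with a scikit-learn ColumnTransformer, as below. The column names and sample values are hypothetical, and the use of ColumnTransformer itself is an assumption; the record only states that one-hot encoding and StandardScaler were applied.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature table; the real column names are not given in the record.
df = pd.DataFrame({
    "home_team": ["Cardinals", "Falcons", "Cardinals", "Ravens"],
    "outcome": ["win", "loss", "loss", "win"],
    "points_scored": [21.0, 14.0, 30.0, 7.0],
    "rank": [3.0, 10.0, 5.0, 1.0],
})
categorical_cols = ["home_team", "outcome"]
numerical_cols = ["points_scored", "rank"]

preprocessor = ColumnTransformer(
    [
        # One binary column per category; unseen categories encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        # Zero mean and unit variance per numerical column.
        ("num", StandardScaler(), numerical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)  # 3 team columns + 2 outcome columns + 2 scaled columns
```

Fitting the transformer once and reusing it with transform() at inference time keeps the one-hot schema and the scaling statistics identical to those seen during training.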

The full dataset was split into:

  • Training set: 70%

  • Validation set: 15%

  • Test set: 15%
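One common way to obtain a 70/15/15 split is to call train_test_split twice, as in the sketch below. This is an assumption about how the split might be done; the random seed and the synthetic data are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Step 1: hold out 30% of the data.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42
)

# Step 2: split the held-out 30% in half, giving 15% validation and 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```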

Hyperparameters were tuned with GridSearchCV over a RandomForestRegressor from sklearn.ensemble, using 3-fold cross-validation on the training set. Candidate configurations were ranked by mean absolute error (MAE), with the lowest MAE selected.
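A minimal sketch of such a search is shown below. The parameter grid, the synthetic data, and the random seeds are assumptions chosen to keep the example small; the grid actually searched is not stated in this record. scikit-learn maximizes scores, so MAE is requested as "neg_mean_absolute_error" and the sign is flipped afterwards.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training set.
X_train, y_train = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=0
)

# Illustrative grid only; deliberately small so the example runs quickly.
param_grid = {
    "n_estimators": [10, 20],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_absolute_error",   # higher is better, so MAE is negated
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_mae = -search.best_score_           # back to a positive MAE
print(best_params, best_mae)
```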

Best hyperparameters found:

  • n_estimators: 250

  • max_depth: None

  • min_samples_split: 2

  • min_samples_leaf: 3

  • max_features: None

After tuning, the final model was retrained on the full training set, and its performance was evaluated on the validation set as a proxy for generalization performance.
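The retraining and validation step could look roughly like the sketch below, which plugs the reported best hyperparameters into a RandomForestRegressor. The synthetic data, the single 85/15 split, and random_state are assumptions for illustration, so the printed MAE says nothing about the published model.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training and validation sets.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=0
)

# Best hyperparameters as reported in this record; random_state is an assumption.
model = RandomForestRegressor(
    n_estimators=250,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=3,
    max_features=None,
    random_state=0,
)
model.fit(X_train, y_train)

# Validation MAE as a proxy for generalization performance.
val_mae = mean_absolute_error(y_val, model.predict(X_val))
print(f"validation MAE: {val_mae:.2f}")
```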

Technical environment:

  • Python version: 3.13.1

  • Libraries used:

    • scikit-learn (sklearn)

    • pandas, numpy

    • matplotlib (for evaluation visuals)

The model is saved as a .pkl file and is intended for inference on similarly preprocessed data. Numerical features must be scaled using the same StandardScaler, and categorical encoding must follow the same one-hot schema used during training. The target scaler (target_scaler.pkl) is also provided to inverse-transform predictions back to the original scale.
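The inference flow described above might be sketched as follows. To keep the example runnable, it first writes stand-in artifacts to a temporary directory; these are not the published files. The file names model.pkl and feature_scaler.pkl are hypothetical (only target_scaler.pkl is named in this record), the target scaler is assumed to be a StandardScaler, and one-hot encoding of categorical features is omitted for brevity.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

# --- Stand-in artifacts so the sketch runs end to end ---
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(50, 3))
y_raw = rng.uniform(40_000, 80_000, size=50)   # synthetic attendance figures

fitted = {
    "feature_scaler.pkl": StandardScaler().fit(X_raw),
    "target_scaler.pkl": StandardScaler().fit(y_raw.reshape(-1, 1)),
}
fitted["model.pkl"] = RandomForestRegressor(n_estimators=10, random_state=0).fit(
    fitted["feature_scaler.pkl"].transform(X_raw),
    fitted["target_scaler.pkl"].transform(y_raw.reshape(-1, 1)).ravel(),
)

workdir = Path(tempfile.mkdtemp())
for name, obj in fitted.items():
    with open(workdir / name, "wb") as fh:
        pickle.dump(obj, fh)


def load_pickle(path):
    """Load one pickled artifact from disk."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# --- Inference on new, similarly structured data ---
model = load_pickle(workdir / "model.pkl")
feature_scaler = load_pickle(workdir / "feature_scaler.pkl")
target_scaler = load_pickle(workdir / "target_scaler.pkl")

X_new = rng.normal(size=(4, 3))
scaled_pred = model.predict(feature_scaler.transform(X_new))

# Map predictions from the scaled target space back to attendance counts.
attendance_pred = target_scaler.inverse_transform(
    scaled_pred.reshape(-1, 1)
).ravel()
print(attendance_pred)
```

The key point is the order of operations: new inputs pass through the feature scaler fitted on the training data, and the model output passes through the inverse of the target scaler before it is read as an attendance figure.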

 

Model Risks, Biases, and Limitations:

  • Risks: Misuse for speculative dynamic ticket pricing strategies without ethical oversight.

  • Biases: Dataset may under-represent regional differences in reporting or non-ticketed attendance.

  • Limitations: Model does not account for special promotional events, weather conditions, or real-time ticket resale dynamics.


fair4ml:mlTask | Regression – NFL stadium attendance forecasting
fair4ml:modelCategory | Supervised → Ensemble → Random Forest
fair4ml:intendedUse | Stadium attendance forecasting for operational, marketing, and planning decisions in the NFL.
fair4ml:modelRisksBiasLimitations | - Risk: Model predictions could be misused for discriminatory pricing. - Bias: Original datasets may underreport regional attendance discrepancies. - Limitations: Model does not include external factors like weather, ticket resale, or last-minute promotions.
fair4ml:trainedOn | - https://doi.org/10.82556/4fst-m890
fair4ml:validatedOn | - https://doi.org/10.82556/djvg-rb67
fair4ml:testedOn | - https://doi.org/10.82556/zv9b-6h09
schema:license + fair4ml:legal / ethicalSocial | License: CC-BY-4.0
schema:codeRepository | https://github.com/emilp-tuwien/nfl-attendance-prediction/tree/main

Other

<script type="application/ld+json">
{
  "@context": [
    "https://schema.org",
    "https://w3id.org/fair4ml/context"
  ],
  "@type": "fair4ml:MLModel",

  "name": "Random Forest Regressor for NFL Stadium Attendance",
  "version": "1.0",
  "dateCreated": "2025-04-26",
  "dateModified": "2025-04-27",

  "author": {
    "@type": "Person",
    "name": "Emil Paskovski",
    "identifier": "https://orcid.org/0009-0004-4299-0636"
  },

  "license": "https://creativecommons.org/licenses/by/4.0/",
  "codeRepository": "https://github.com/emilp-tuwien/nfl-attendance-prediction",

  "mlTask": "regression",
  "modelCategory": "Supervised > Ensemble > Random Forest",
  "intendedUse": "Stadium attendance forecasting for operational, marketing and planning decisions in the NFL.",
  "modelRisksBiasLimitations": "Risk: potential misuse for discriminatory pricing. Bias: regional under-reporting of attendance. Limitation: does not consider weather, resale, or promotions.",

  "trainedOn": "https://doi.org/10.82556/4fst-m890",
  "validatedOn": "https://doi.org/10.82556/djvg-rb67",
  "testedOn": "https://doi.org/10.82556/zv9b-6h09",

  "softwareRequirements": [
    "Python 3.13.1",
    "scikit-learn",
    "pandas",
    "numpy",
    "matplotlib"
  ],
  "hyperparameters": {
    "n_estimators": 250,
    "max_depth": null,
    "min_samples_split": 2,
    "min_samples_leaf": 3,
    "max_features": null
  },

  "hasCO2eEmissions": "N/A"
}
</script>

Files

Files (19.7 MiB)

Name | Size
md5:eff911bd69b98ad85b02632fee735b60 | 19.7 MiB