DMP Poodles vs. Pugs as Income Indicators
Description
The datasets used in this machine learning experiment, Poodles vs. Pugs as Income Indicators, are publicly available through the Open Government Data portal of the City of Vienna (data.gv.at). The project uses two datasets: Hunderassen Wien (dog breeds in Vienna) and Durchschnittliches Nettoeinkommen seit 2002 - Bezirke Wien (average net income since 2002, by Vienna district). Each row of the processed dataset represents one of Vienna's 23 municipal districts.
Context and Methodology
The purpose of this dataset is to train and evaluate a regression model that predicts the average annual income of a geographic area based on the prevalence of specific dog breeds.
The data are not newly collected; they are reused from open data sources (data.gv.at). The project workflow involves:
Acquiring and downloading the raw datasets
Cleaning and preprocessing the data
Merging the dog breed and income data by municipal district
Training a machine learning model (Random Forest regressor)
Evaluating model accuracy and generating visual distributions
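The merge step in the workflow above can be sketched as follows. This is a minimal illustration with pandas, not the notebook's actual code: the column names (district_code, breed, dog_count, avg_net_income) and all values are made-up placeholders, since the real CSV headers are not listed here.

```python
import pandas as pd

# Made-up rows standing in for the two Stadt Wien CSV files.
# All column names and numbers here are assumptions for illustration only.
dogs = pd.DataFrame({
    "district_code": [1, 1, 2, 2, 3],
    "breed": ["Pudel", "Mops", "Pudel", "Mops", "Pudel"],
    "dog_count": [12, 7, 30, 22, 18],
})
income = pd.DataFrame({
    "district_code": [1, 2, 3],
    "avg_net_income": [35000.0, 24000.0, 26000.0],
})

# One row per district, one column per breed; a breed that does not
# occur in a district gets a count of 0.
breed_counts = dogs.pivot_table(
    index="district_code",
    columns="breed",
    values="dog_count",
    aggfunc="sum",
    fill_value=0,
).reset_index()
breed_counts.columns.name = None

# validate="one_to_one" raises an error if a district appears more than
# once on either side, which would silently duplicate rows otherwise.
merged = breed_counts.merge(
    income, on="district_code", how="inner", validate="one_to_one"
)
print(merged)
```

Reshaping the breed table to one row per district before merging keeps the result at one row per district, which matches the structure of the processed dataset described earlier.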
Additional datasets and artifacts are created during the project, including a merged/cleaned dataset, model evaluation metrics, and geographic distribution plots.
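As a rough sketch of how evaluation metrics for a Random Forest regressor might be produced with scikit-learn: the data below are synthetic, and the leave-one-out validation scheme is an assumption chosen because only 23 rows are available. The validation scheme, features, and hyperparameters actually used in the project are not stated here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-in for the merged table: 23 districts, two hypothetical
# breed-share features, and an income target loosely tied to the first one.
n_districts = 23
X = rng.uniform(0.0, 0.2, size=(n_districts, 2))
y = 20000 + 60000 * X[:, 0] + rng.normal(0, 500, size=n_districts)

# Hyperparameters are placeholders, not the project's settings.
model = RandomForestRegressor(n_estimators=200, random_state=0)

# Leave-one-out: each district is predicted by a model fit on the other 22,
# so no district is scored by a model that saw it during training.
predictions = cross_val_predict(model, X, y, cv=LeaveOneOut())
mae = mean_absolute_error(y, predictions)
print(f"Leave-one-out MAE: {mae:.0f}")
```

With so few rows, a single train/test split would leave only a handful of districts for testing, so the resulting score would depend heavily on which districts happened to be held out.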
Technical Details
The dataset and project environment are structured in a clear folder hierarchy:
data
results
Files include:
datasets (CSV): Original downloaded files from Stadt Wien.
plots (PNG): Geographic distribution of dog breeds (avg_dogs_per_district.png), income distribution (avg_income_per_district.png), and model error rates (final_evaluation_plots.png).
source code (IPYNB): Jupyter Notebook containing the data orchestration and modeling pipeline.
Software requirements: To open and work with this dataset and its accompanying code, the following open-source tools are required:
Python 3
pandas (for data orchestration)
scikit-learn (for predictive modeling)
matplotlib & seaborn (for visualization)
Jupyter Notebook environment
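One way to check that the listed tools are installed is sketched below, using only the Python standard library. The package names are the PyPI distribution names; "notebook" is an assumption for the Jupyter environment, and no minimum versions are implied because none are specified here.

```python
from importlib.metadata import PackageNotFoundError, version

# PyPI distribution names. "notebook" is an assumption for the Jupyter
# environment; JupyterLab or another front end would serve equally well.
REQUIRED = ["pandas", "scikit-learn", "matplotlib", "seaborn", "notebook"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'not installed'}")
```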
Further Details
The dataset contains no personal or sensitive data. All records are aggregated at the district level, ensuring complete anonymity and compliance with open-data standards.
Important points for reuse:
The Random Forest model predicted income most reliably for districts in the lower-to-middle income brackets.
The model's accuracy plateaus at the highest income tiers, indicating that "luxury" wealth is influenced by factors beyond simple pet ownership trends. Anyone reusing the model should account for this limitation when predicting incomes for high-income districts.
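Before reusing the model, the limitation at the top of the income range can be checked by breaking the prediction error down by income tier. The sketch below uses made-up actual and predicted values for nine hypothetical districts; the three-tier split and the column names are assumptions, not the project's evaluation code.

```python
import pandas as pd

# Made-up actual and predicted incomes for nine hypothetical districts.
results = pd.DataFrame({
    "actual":    [20000, 21000, 22000, 25000, 26000, 27000, 33000, 36000, 40000],
    "predicted": [20500, 20800, 22300, 24600, 26500, 26800, 30000, 31000, 32000],
})
results["abs_error"] = (results["actual"] - results["predicted"]).abs()

# Split the districts into three equally sized tiers by actual income.
results["tier"] = pd.qcut(
    results["actual"], q=3, labels=["low", "middle", "high"]
)

# Mean absolute error within each tier.
mae_by_tier = results.groupby("tier", observed=True)["abs_error"].mean()
print(mae_by_tier)
```

A markedly larger error in the top tier than in the other two would be consistent with the limitation described above.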