Telco_Customer_churn_Data
Description
Context and Methodology
The dataset originates from the research domain of Customer Churn Prediction in the Telecom Industry. It was created as part of the project "Data-Driven Churn Prediction: ML Solutions for the Telecom Industry," completed within the Data Stewardship course (Master programme Data Science, TU Wien).
The primary purpose of this dataset is to support machine learning model development for predicting customer churn based on customer demographics, service usage, and account information.
The dataset enables the training, testing, and evaluation of classification algorithms, allowing researchers and practitioners to explore techniques for customer retention optimization.
The dataset was originally obtained from the IBM Accelerator Catalog and adapted for academic use. It was uploaded to TU Wien’s DBRepo test system and accessed via SQLAlchemy connections to the MariaDB environment.
Technical Details
The dataset has a tabular structure and was initially stored in CSV format. It contains:
- Rows: 7,043 customer records
- Columns: 21, covering customer attributes (gender, senior citizen status, partner status), account information (tenure, contract type, payment method), service usage (internet service, streaming TV, tech support), and the target variable (Churn: Yes/No).
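The structure described above can be checked after loading. The following is a minimal sketch, assuming the data is already in a pandas DataFrame and that the target column is named `Churn`; the function name `summarize_churn_frame` is hypothetical and not part of the project's code.

```python
import pandas as pd

# Figures taken from the description above.
EXPECTED_ROWS = 7043
EXPECTED_COLUMNS = 21


def summarize_churn_frame(df, target="Churn"):
    """Return basic structural facts to compare against the description.

    `target` is an assumed column name; adjust it if the table differs.
    """
    return {
        "rows": len(df),
        "columns": df.shape[1],
        "shape_matches": df.shape == (EXPECTED_ROWS, EXPECTED_COLUMNS),
        # Distinct non-missing target labels, expected to be "No" and "Yes".
        "target_values": sorted(df[target].dropna().unique().tolist()),
    }
```

A result with `shape_matches` set to `False`, or with target values other than `No` and `Yes`, suggests that the loaded table differs from the version described here.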
Naming Convention:
- The table in the database is named telco_customer_churn_data.
Software Requirements:
- Any standard database client or programming language that supports MariaDB connections (e.g., Python) can be used to open and work with the dataset.
- For machine learning applications, libraries such as pandas, scikit-learn, and joblib are typically used.
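As an illustration of the access path described above, the table could be read into pandas through SQLAlchemy roughly as follows. This is a sketch under assumptions, not the project's actual loading code: the function name `load_churn_table` is hypothetical, and the connection URL in the comment is a placeholder whose user, password, host, port, and database name must come from your own DBRepo credentials.

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Table name as given in the naming convention above.
TABLE_NAME = "telco_customer_churn_data"


def load_churn_table(engine, table_name=TABLE_NAME):
    """Read the whole table into a pandas DataFrame.

    `engine` is any SQLAlchemy engine pointing at a database that
    contains the table.
    """
    query = text(f"SELECT * FROM {table_name}")
    with engine.connect() as connection:
        return pd.read_sql(query, connection)


# Hypothetical usage (placeholder URL; requires a MariaDB driver such as PyMySQL):
# engine = create_engine("mariadb+pymysql://USER:PASSWORD@HOST:PORT/DATABASE")
# df = load_churn_table(engine)
```

Passing the engine in as an argument keeps credentials out of the function and allows the same code to be tried against a local stand-in database.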
Additional Resources:
- Source code for data loading, preprocessing, model training, and evaluation is available at the associated GitHub repository: https://github.com/nazerum/fair-ml-customer-churn
Further Details
When reusing the dataset, users should be aware:
- Licensing: The dataset is shared under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
- Use Case Suitability: The dataset is best suited for classification tasks, particularly binary classification (churn vs. no churn).
- Metadata Standards: Metadata describing the dataset adheres to FAIR principles and is supplemented by CodeMeta and Croissant standards for improved interoperability.
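To make the binary classification use case concrete, the sketch below trains a simple scikit-learn pipeline and returns a confusion matrix. It is an illustration under stated assumptions, not the method used in the associated repository: logistic regression is chosen here only as a simple baseline, the target column is assumed to be named `Churn` with values `Yes`/`No`, and the input is assumed to have no missing values and no per-customer identifier column. The function name `train_churn_classifier` is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def train_churn_classifier(df, target="Churn", test_size=0.25, seed=0):
    """Fit a baseline churn classifier and evaluate it on a held-out split.

    Returns the fitted pipeline and a 2x2 confusion matrix with rows as
    true labels and columns as predictions, in the order [no churn, churn].
    """
    # Encode the Yes/No target as 1/0 (assumed label values).
    y = df[target].map({"No": 0, "Yes": 1})
    X = df.drop(columns=[target])

    # Numeric columns are scaled; every other column is one-hot encoded.
    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in X.columns if c not in numeric_cols]

    preprocess = ColumnTransformer([
        ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ("numeric", StandardScaler(), numeric_cols),
    ])
    model = Pipeline([
        ("preprocess", preprocess),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])

    # A stratified split keeps the churn ratio similar in both parts.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    model.fit(X_train, y_train)
    matrix = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
    return model, matrix
```

The fitted pipeline can then be persisted with `joblib.dump(model, "churn_model.joblib")`, where the file name is only an example.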
Files
confusion_matrix.png
Additional details
Dates
- Submitted: 2025-04-28