FAIR Dataset for Disease Prediction in Healthcare Applications

Published April 14, 2025 | Version v1

Dataset Open

Research Domain/Project:
This dataset was created for a machine learning experiment aimed at developing a classification model that predicts outcomes from a set of patient features. The primary research domain is disease prediction in patients. The dataset was used to train, validate, and test the model.
Purpose of the Dataset:
The purpose of this dataset is to provide training, validation, and testing data for the development of machine learning models. It includes labeled examples that help train classifiers to recognize patterns in the data and make predictions.
Dataset Creation:
Preprocessing involved cleaning, normalization, and splitting the data into training, validation, and test sets. The data was curated to ensure its quality and relevance to the prediction task. Missing values and outliers were handled with techniques such as imputation or removal.
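The record does not say which imputation or normalization methods were used, so the following is only a minimal sketch of one common approach: median imputation followed by standardization, fitted on the training split only and then applied unchanged to the other splits. The choice of `SimpleImputer` and `StandardScaler`, the helper name `build_preprocessor`, and the toy arrays are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_preprocessor():
    """Hypothetical helper: median imputation, then zero-mean/unit-variance scaling."""
    return make_pipeline(SimpleImputer(strategy="median"), StandardScaler())


# Toy numeric data (NOT from the dataset), with one missing value per split.
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])
X_test = np.array([[np.nan, 20.0]])

pre = build_preprocessor()
X_train_p = pre.fit_transform(X_train)  # statistics are learned from training data only
X_test_p = pre.transform(X_test)        # the same statistics are reused for other splits
```

Fitting the preprocessor on the training split alone keeps information from the validation and test sets out of the learned statistics.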

Structure of the Dataset:
The dataset's data files are organized by split:
- Training Data: Contains the training dataset used to train the machine learning model.
- Validation Data: Used for hyperparameter tuning and model selection.
- Test Data: Reserved for final model evaluation.
The files follow a consistent naming convention for easy navigation; in this record's file listing the splits appear as train_patient_data.csv, val_patient_data.csv, and test_patient_data.csv. Each file follows a tabular format with columns representing features and rows representing individual data points.
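As a minimal sketch, the three splits could be loaded with pandas as shown here. The file names are the ones that appear in this record's Files listing. The `data_dir` location and the `load_splits` helper are assumptions, and the column check only verifies that the splits share the same header.

```python
from pathlib import Path

import pandas as pd

# File names as they appear in the record's Files listing.
SPLIT_FILES = {
    "train": "train_patient_data.csv",
    "val": "val_patient_data.csv",
    "test": "test_patient_data.csv",
}


def load_splits(data_dir=".", split_files=SPLIT_FILES):
    """Hypothetical helper: read each split and check that all headers match."""
    data_dir = Path(data_dir)
    splits = {name: pd.read_csv(data_dir / fname) for name, fname in split_files.items()}
    reference = list(splits["train"].columns)
    for name, frame in splits.items():
        if list(frame.columns) != reference:
            raise ValueError(f"'{name}' split has different columns than 'train'")
    return splits
```

A call such as `splits = load_splits("path/to/download")` would return a dictionary of DataFrames keyed by `"train"`, `"val"`, and `"test"`.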
Software Requirements:
To open and work with this dataset, use an environment such as VS Code or Jupyter, with tools like:
- Python (with libraries such as pandas, numpy, scikit-learn, and matplotlib)

Reusability:
Users of this dataset should be aware that it is designed for machine learning experiments involving classification tasks. The dataset is already split into training, validation, and test subsets. Any model trained with this dataset should be evaluated on the test set, which is reserved for final evaluation, so that reported performance reflects data not seen during training or tuning.
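The intended use of the three subsets can be sketched as follows: fit candidate models on the training split, choose among them on the validation split, and touch the test split only once at the end. This is an illustration under stated assumptions, not the author's pipeline. The record does not name the target column, so `label_col` is a hypothetical parameter, the decision tree and its `depths` grid are arbitrary choices, and all feature columns are assumed to be numeric.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def select_and_evaluate(train, val, test, label_col, depths=(2, 4, 8)):
    """Hypothetical helper: tune on validation data, report once on test data."""

    def split_xy(frame):
        return frame.drop(columns=[label_col]), frame[label_col]

    X_train, y_train = split_xy(train)
    X_val, y_val = split_xy(val)
    X_test, y_test = split_xy(test)

    best_model, best_val_score = None, -1.0
    for depth in depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=0)
        model.fit(X_train, y_train)
        val_score = accuracy_score(y_val, model.predict(X_val))
        if val_score > best_val_score:  # model selection uses validation data only
            best_model, best_val_score = model, val_score

    # The test split is used a single time, after the model has been chosen.
    test_score = accuracy_score(y_test, best_model.predict(X_test))
    return best_model, best_val_score, test_score
```

Accuracy is used here only for brevity; for disease prediction, where classes are often imbalanced, other metrics from `sklearn.metrics` may be more informative.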
Limitations:
The dataset may not cover all edge cases, and it might have biases depending on the selection of data sources. It's important to consider these limitations when generalizing model results to real-world applications.

Files

Name	MD5	Size
confusion_matrix.png	3b546649700ef50b99144f372d867402	13.9 KiB
evaluation_metrics.json	5a6e5020fac1ec6e778bffa530c3274d	705 Bytes
feature_importance_chart.png	1914e43c77edccff9e34a8e6c25295ab	20.9 KiB
patient_health_model.pkl	b3295a0bb3d59fd785ab8ca742517a8d	2.1 MiB
recommendations.csv	6be2b37816b42783775526b75d42746e	13.4 KiB
test_patient_data.csv	eeafee42eb3162810ec3200a44f7630c	6.4 KiB
train_patient_data.csv	b5eba532f2d05009738bb9609578436d	19.0 KiB
val_patient_data.csv	869c92c45423b36dce3239bc481c983d	6.4 KiB
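The MD5 checksums in the listing can be used to confirm that downloaded files are intact. A minimal sketch using only the standard library follows. The checksum values are copied from the listing above, while the helper names `md5_of` and `verify_files` and the `data_dir` location are assumptions.

```python
import hashlib
from pathlib import Path

# Checksums copied from the record's file listing.
EXPECTED_MD5 = {
    "confusion_matrix.png": "3b546649700ef50b99144f372d867402",
    "evaluation_metrics.json": "5a6e5020fac1ec6e778bffa530c3274d",
    "feature_importance_chart.png": "1914e43c77edccff9e34a8e6c25295ab",
    "patient_health_model.pkl": "b3295a0bb3d59fd785ab8ca742517a8d",
    "recommendations.csv": "6be2b37816b42783775526b75d42746e",
    "test_patient_data.csv": "eeafee42eb3162810ec3200a44f7630c",
    "train_patient_data.csv": "b5eba532f2d05009738bb9609578436d",
    "val_patient_data.csv": "869c92c45423b36dce3239bc481c983d",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(data_dir=".", expected=EXPECTED_MD5):
    """Return {name: True | False | None}; None means the file was not found."""
    results = {}
    for name, checksum in expected.items():
        path = Path(data_dir) / name
        results[name] = (md5_of(path) == checksum) if path.exists() else None
    return results
```

Any entry reported as `False` by `verify_files` indicates a file whose contents differ from the published version and that should be downloaded again.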