This repository contains the outputs generated during the experiment for spam email classification using neural networks.
The files include:
README.md: Project overview, dataset source, structure, and dependency information.
confusion_matrix.png: A confusion matrix showing the model's performance on the test set.
evaluation_metrics.txt: A text summary of the evaluation metrics: accuracy, precision, recall, and F1-score.
test_predictions.csv: A CSV file containing the model's predictions on the test set.
top_spam_words.png: A bar chart showing the 10 most frequent words in correctly predicted spam emails.
spam_classifier.h5: The trained model file, which can be used to make predictions on new emails.
The dataset used for this experiment is available as the Lemmatized Spam Ham Dataset on DBREPO. It contains a collection of emails labeled as spam or not spam.
The dataset is split into three subsets:
Training Data: 3619 emails (70% of the total data) used to train the model. Entries in this set have experiment_id values ranging from 1 to 3618.
Validation Data: 512 emails (10% of the total data) used to tune the model's hyperparameters and evaluate its performance during training. Entries in this set have experiment_id values ranging from 3619 to 4131.
Test Data: 1038 emails (20% of the total data) used to evaluate the model's performance after training. Entries in this set have experiment_id values ranging from 4132 to 5170.
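The split described above can be reproduced from the experiment_id values alone. The following is a minimal sketch, not the code used in the experiment: the helper names (SPLIT_RANGES, subset_for, split_records) are hypothetical, records are assumed to be dictionaries with an experiment_id key, and the ID ranges are treated as inclusive. With inclusive ranges the subset sizes differ by one from the email counts stated above, so the exact boundaries should be checked against the actual data.

```python
# Hypothetical helpers: assign records to subsets by experiment_id.
# Ranges are copied from the description above and assumed to be inclusive.
SPLIT_RANGES = {
    "train": (1, 3618),
    "validation": (3619, 4131),
    "test": (4132, 5170),
}


def subset_for(experiment_id: int) -> str:
    """Return the name of the subset whose ID range contains experiment_id."""
    for name, (low, high) in SPLIT_RANGES.items():
        if low <= experiment_id <= high:
            return name
    raise ValueError(f"experiment_id {experiment_id} is outside all known ranges")


def split_records(records):
    """Group records (dicts with an 'experiment_id' key) by subset name."""
    subsets = {name: [] for name in SPLIT_RANGES}
    for record in records:
        subsets[subset_for(int(record["experiment_id"]))].append(record)
    return subsets


if __name__ == "__main__":
    # Tiny in-memory example; real rows would come from the dataset itself.
    demo = [{"experiment_id": i} for i in (1, 3618, 3619, 4131, 4132, 5170)]
    for name, rows in split_records(demo).items():
        print(name, [row["experiment_id"] for row in rows])
```

Rows read with csv.DictReader can be passed to split_records directly, provided the file has an experiment_id column; the int() conversion handles the string values that the csv module returns.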
The code used to train the model and generate the outputs is available on GitHub: Spam-Mail Classification-Data-Stewardship-FAIR-Experiment
The DMP (Data Management Plan) for this project is available at the following link: