Published April 19, 2025 | Version v2
Dataset
Open
Spam Mails Dataset - FAIR experiment
Description
Context
The Spam Mail dataset is a collection of 5,171 emails labeled as spam or ham (non-spam). It was originally created in 2006 for research on spam detection and filtering with machine learning, specifically the Naive Bayes classifiers described in the paper "Spam Filtering with Naive Bayes - Which Naive Bayes?" by Metsis, Androutsopoulos, and Paliouras.
The "ham" emails come mainly from the inboxes of 6 users of the company "Enron". The "spam" emails were collected from various sources, including the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by the authors themselves.
The emails were preprocessed to remove HTML tags. Emails containing non-Latin characters were also removed to avoid possible bias, since all "ham" emails are written in Latin characters.
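The cleanup described above can be approximated as follows. This is a minimal Python sketch, not the original authors' preprocessing code: the helper names `strip_html` and `is_latin_only` are hypothetical, and treating an email as Latin-only when every alphabetic character has "LATIN" in its Unicode name is an assumption about what the filter checked.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Drop HTML tags and collapse runs of whitespace into single spaces."""
    parser = _TagStripper()
    parser.feed(text)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def is_latin_only(text: str) -> bool:
    """Assumed rule: every alphabetic character must be a Latin letter.

    Digits, punctuation, and whitespace are ignored by this check.
    """
    return all(
        "LATIN" in unicodedata.name(ch, "")
        for ch in text
        if ch.isalpha()
    )


if __name__ == "__main__":
    raw_emails = ["<p>Hello <b>world</b></p>", "<div>Привет</div>"]
    cleaned = [strip_html(e) for e in raw_emails]
    kept = [e for e in cleaned if is_latin_only(e)]
    print(kept)
```

Joining text nodes with a space keeps words from adjacent block elements apart, at the cost of occasionally splitting a word that is interrupted by an inline tag.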
The original data can be found in CSV format on Kaggle at: https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data
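As a rough illustration of working with the CSV, the sketch below reads rows and counts the two classes using only the Python standard library. The column names `text` and `label` are assumptions about the file layout and may need adjusting after inspecting the header; the function names are hypothetical.

```python
import csv
import io
from collections import Counter


def load_emails(fileobj, text_col="text", label_col="label"):
    """Read (text, label) pairs from an open CSV file.

    text_col and label_col are assumed column names, not confirmed ones.
    """
    reader = csv.DictReader(fileobj)
    return [(row[text_col], row[label_col]) for row in reader]


def class_counts(rows):
    """Count how many emails carry each label."""
    return Counter(label for _, label in rows)


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real file, for demonstration only.
    sample = io.StringIO("label,text\nham,see you at noon\nspam,win money now\n")
    rows = load_emails(sample)
    print(class_counts(rows))
```

For the real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to `load_emails`.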
Project description
In this project we will use the Spam Mail dataset to train a neural network model to classify emails as spam or ham. The dataset will be further preprocessed to remove uninformative tokens such as stopwords, as well as punctuation.
The emails will also be tokenized and converted into a format suitable for training the model, but this last step will be performed in the code itself so it is not included in the dataset.
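The preprocessing and tokenization steps mentioned above could look roughly like this. It is a minimal sketch under stated assumptions, not the project's actual code: the stopword list is a small illustrative set, the function names are hypothetical, and reserving index 0 for padding and index 1 for unknown words is one common convention rather than something the project confirms.

```python
import string
from collections import Counter

# Illustrative stopword list (assumption); a real run would use a fuller list.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "for", "on"}

# Map every ASCII punctuation character to a space.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tokens(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    words = text.lower().translate(_PUNCT_TABLE).split()
    return [w for w in words if w not in STOPWORDS]


def build_vocab(token_lists, max_words=10000):
    """Assign integer ids to the most frequent words, starting at 2."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(tokens, vocab, max_len=8):
    """Convert tokens to ids, truncating or zero-padding to max_len."""
    ids = [vocab.get(tok, 1) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    emails = ["Win a FREE prize, now!", "Meeting moved to the afternoon."]
    tokenized = [clean_tokens(e) for e in emails]
    vocab = build_vocab(tokenized)
    print([encode(t, vocab) for t in tokenized])
```

Fixed-length integer sequences like these are a typical input format for an embedding layer in a neural network classifier.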
Files
In this repository you will find the following files:
- README.md: Project overview, dataset source, structure, and dependency information.
- confusion_matrix.png: A confusion matrix that shows the performance of the model on the test set.
- evaluation_metrics.txt: Text summary of evaluation metrics: accuracy, precision, recall, and F1-score.
- test_predictions.csv: A CSV file that contains the predictions of the model on the test set.
- top_spam_words.png: A bar chart showing the top 10 most frequent words in correctly predicted spam emails.
- spam_classifier.h5: The trained model file, which can be used to make predictions on new emails.
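For readers who want to check the four metrics listed for evaluation_metrics.txt, the sketch below computes them from true and predicted labels. The column layout of test_predictions.csv is not documented here, so the example works on plain lists of labels; treating "spam" as the positive class and the function names are assumptions.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def evaluation_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1-score.

    Any ratio with a zero denominator is reported as 0.0.
    """
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = ["spam", "spam", "ham", "ham"]
    preds = ["spam", "ham", "ham", "spam"]
    print(evaluation_metrics(truth, preds))
```

The four counts returned by `confusion_counts` are the same cells that a two-class confusion matrix displays.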
Files (2.2 MiB)
Checksum | Size
---|---
md5:e56c616635de23a7597d179a4740fb6c | 21.0 KiB
md5:be89b082b87c78a7918cfa714ede3eca | 519 Bytes
md5:6632a4186a09a836bfb062e2e41415d1 | 2.2 MiB
md5:b4264d086071d73396b8a91c50ee4033 | 9.2 KiB
md5:b7b41130a0ba6753e16206355f31b71f | 18.7 KiB
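The MD5 checksums listed above can be used to confirm that a downloaded file is intact. The following is a minimal sketch using Python's standard `hashlib` module; the function names are hypothetical, and the local path in the usage example is an assumption about where the file was saved.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or bare hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex


if __name__ == "__main__":
    import sys

    # Usage: python verify.py <local file> <md5:...>
    # Both arguments come from the user; nothing is hard-coded here.
    if len(sys.argv) == 3:
        ok = matches_checksum(sys.argv[1], sys.argv[2])
        print("checksum OK" if ok else "checksum MISMATCH")
```

Reading in chunks keeps memory use flat, which matters little for files of this size but costs nothing.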
Additional details
References
- Metsis, V., Androutsopoulos, I., & Paliouras, G. (2006). Spam filtering with Naive Bayes – Which Naive Bayes? Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA.
- https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data