Spam Mails Dataset - FAIR experiment

doi:10.70124/0e1sf-saz86

Published April 19, 2025 | Version v2

Dataset Open

Bernal, Nicolas¹

1. TU Wien

Context

The Spam Mail dataset is a collection of 5,171 emails that have been classified as spam or ham (non-spam). It was originally created in 2006 for research on spam detection and filtering with machine learning techniques, specifically the Naive Bayes classifiers described in the paper "Spam Filtering with Naive Bayes - Which Naive Bayes?" by Metsis, Androutsopoulos, and Paliouras.

The "ham" emails come mainly from the inboxes of 6 users of the company "Enron". The "spam" emails were collected from various sources, including the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by the authors themselves.

The emails were preprocessed to remove any HTML tags. Emails containing non-Latin characters were removed to avoid possible bias, since all "ham" emails are written in Latin characters.
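
As an illustration of this kind of preprocessing, the following is a minimal sketch, not the procedure used by the original authors. The function names `strip_html_tags` and `is_latin_only` are hypothetical, the tag pattern is a simple regular expression rather than a full HTML parser, and "Latin" is approximated by checking the Unicode name of each alphabetic character.

```python
import re
import unicodedata

# Simple pattern for anything that looks like an HTML tag (an approximation).
TAG_RE = re.compile(r"<[^>]+>")


def strip_html_tags(text: str) -> str:
    """Replace HTML-like tags with spaces and collapse repeated whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def is_latin_only(text: str) -> bool:
    """Return True if every alphabetic character belongs to a Latin script."""
    for ch in text:
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return False
    return True


emails = ["<b>Hi</b> there", "привет", "café menu"]
# Keep only Latin-script emails, then strip their tags.
kept = [strip_html_tags(e) for e in emails if is_latin_only(e)]
```

Here `kept` retains the first and third example strings, with the tags of the first one removed.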

The original data can be found in CSV format on Kaggle at: https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data
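
A hedged sketch of how the CSV could be inspected with the Python standard library is shown below. It assumes the file has a `label` column holding `ham` or `spam` and a `text` column holding the email body; these column names are an assumption and should be checked against the downloaded file. The in-memory sample stands in for the real file.

```python
import csv
import io
from collections import Counter


def label_counts(csv_file) -> Counter:
    """Count how many rows carry each value of the assumed `label` column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)


# Small in-memory stand-in for the real CSV (column names are assumed).
sample = io.StringIO(
    "label,text\n"
    'ham,"Subject: meeting\nsee you at 10"\n'
    'spam,"Subject: win a prize now"\n'
    'ham,"Subject: re: report"\n'
)
counts = label_counts(sample)
```

For the real file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where `path` points to the downloaded CSV.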

Project description

In this project we use the Spam Mail dataset to train a neural network model that classifies emails as spam or ham. The dataset is further preprocessed to remove unnecessary elements such as stopwords and punctuation.
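
The stopword and punctuation removal could look roughly like the sketch below. The tiny `STOPWORDS` set and the function name `clean_email` are illustrative assumptions; the project's actual stopword list and cleaning rules are not given in this record.

```python
import string

# Illustrative stopword list only; a real project would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "on", "you", "your"}

# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_email(text: str) -> str:
    """Lowercase the text, drop punctuation, and remove stopwords."""
    lowered = text.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in lowered.split() if word not in STOPWORDS)


cleaned = clean_email("Subject: Claim your FREE prize, now!")
```

With the list above, `cleaned` is `"subject claim free prize now"`.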

The emails are also tokenized and converted into a format suitable for training the model. This last step is performed in the code itself, so its output is not included in the dataset.
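
One common way to turn cleaned text into model input is to map words to integer ids and pad every sequence to a fixed length. The sketch below illustrates that idea in plain Python; it is an assumption about the general approach, not the project's actual tokenizer, and the names `build_vocab`, `encode`, `PAD`, and `UNK` are hypothetical.

```python
from collections import Counter

PAD, UNK = 0, 1  # reserved ids for padding and out-of-vocabulary words


def build_vocab(texts, max_words=10000):
    """Assign integer ids (starting at 2) to the most frequent words."""
    counts = Counter(word for text in texts for word in text.split())
    most_common = counts.most_common(max_words - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}


def encode(text, vocab, max_len=8):
    """Convert a text to a fixed-length list of ids, truncating or padding."""
    ids = [vocab.get(word, UNK) for word in text.split()][:max_len]
    return ids + [PAD] * (max_len - len(ids))


train_texts = ["free prize now", "meeting at noon", "free meeting"]
vocab = build_vocab(train_texts)
encoded = encode("free unseenword", vocab, max_len=4)
```

In `encoded`, the first id belongs to "free", the second is `UNK` because the word is not in the vocabulary, and the remaining positions are filled with `PAD`.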

Files

In this repository you will find the following files:

- README.md: Project overview, dataset source, structure, and dependency information.

- confusion_matrix.png: A confusion matrix that shows the performance of the model on the test set.

- evaluation_metrics.txt: Text summary of evaluation metrics: accuracy, precision, recall, and F1-score.

- test_predictions.csv: A CSV file that contains the predictions of the model on the test set.

- top10_spam_tokens.png: A bar chart showing the top 10 most frequent words in correctly predicted spam emails.

- spam_nn_model_v1.h5: The trained model file, which can be used to make predictions on new emails.
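
For readers who want to see how the reported metrics and the top-word ranking relate to the predictions, here is a minimal sketch. It assumes true and predicted labels are available as lists of `"spam"` and `"ham"` strings alongside the cleaned texts; the function names and the toy data are illustrative, and the actual layout of test_predictions.csv is not specified here.

```python
from collections import Counter


def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def top_words_in_true_positives(texts, y_true, y_pred, positive="spam", n=10):
    """Return the n most frequent words in correctly predicted positive texts."""
    counts = Counter()
    for text, t, p in zip(texts, y_true, y_pred):
        if t == positive and p == positive:
            counts.update(text.split())
    return counts.most_common(n)


# Toy data for illustration only.
texts = ["free prize free", "cheap meds", "see you", "lunch free", "free offer"]
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
metrics = classification_metrics(y_true, y_pred)
top_word = top_words_in_true_positives(texts, y_true, y_pred, n=1)
```

On this toy data there are 2 true positives, 1 false positive, 1 false negative, and 1 true negative, so accuracy is 0.6 and precision, recall, and F1 are all 2/3.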

Files (2.2 MiB)

Name	MD5 checksum	Size
confusion_matrix.png	e56c616635de23a7597d179a4740fb6c	21.0 KiB
evaluation_metrics.txt	be89b082b87c78a7918cfa714ede3eca	519 Bytes
spam_nn_model_v1.h5	6632a4186a09a836bfb062e2e41415d1	2.2 MiB
test_predictions.csv	b4264d086071d73396b8a91c50ee4033	9.2 KiB
top10_spam_tokens.png	b7b41130a0ba6753e16206355f31b71f	18.7 KiB

Additional details

References

- Metsis, V., Androutsopoulos, I., & Paliouras, G. (2006). Spam filtering with Naive Bayes – Which Naive Bayes? Proceedings of the 3rd Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA.

- https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data
