There is a newer version of the record available.

Published April 19, 2025 | Version v2
Dataset Open

Spam Mails Dataset - FAIR experiment

  • 1. TU Wien

Description

Context

The Spam Mail dataset is a collection of 5,171 emails that have been classified as spam or ham (non-spam). It was originally created in 2006 for research on spam detection and filtering with machine learning techniques, specifically the Naive Bayes classifiers described in the paper "Spam Filtering with Naive Bayes - Which Naive Bayes?" by Metsis, Androutsopoulos, and Paliouras.
 
The "ham" emails were taken mainly from the inboxes of six users at the company Enron. The "spam" emails were collected from various sources, including the SpamAssassin corpus, the Honeypot project, the spam collection of Bruce Guenter, and spam collected by the authors themselves.
The emails were preprocessed to remove HTML tags, and emails containing non-Latin characters were removed to avoid possible bias, since all "ham" emails are written in Latin characters.
 
The original data can be found in CSV format on Kaggle at: https://www.kaggle.com/datasets/venky73/spam-mails-dataset/data
 

Project description

In this project we will use the Spam Mail dataset to train a neural network model to classify emails as spam or ham. The dataset will be further preprocessed to remove uninformative content such as stopwords and punctuation.
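The cleaning step described above could look roughly like the following sketch. It is not the project's actual code: the stopword list is a tiny hypothetical sample (the record does not say which list is used), and the function name `clean_email` is made up for illustration.

```python
import string

# Tiny illustrative stopword list. This is an assumption: the record does not
# state which stopword list the project uses.
STOPWORDS = {"a", "an", "the", "and", "or", "to", "of", "in", "is", "for", "on", "you", "your"}

# Map every ASCII punctuation character to a space so adjacent words stay separated.
_PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_email(text: str) -> str:
    """Lowercase the text, replace punctuation with spaces, and drop stopwords."""
    no_punct = text.lower().translate(_PUNCT_TABLE)
    words = [word for word in no_punct.split() if word not in STOPWORDS]
    return " ".join(words)


print(clean_email("Subject: Claim your FREE prize, now!"))
# prints: subject claim free prize now
```

Replacing punctuation with spaces rather than deleting it avoids gluing words together in strings such as "prize,now".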
 
The emails will also be tokenized and converted into a format suitable for training the model. This last step is performed in the code itself, so its output is not included in the dataset.
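As a rough illustration of what tokenizing and converting emails into a model-ready format can mean, the sketch below maps words to integer ids and pads each email to a fixed length. The helper names (`build_vocab`, `encode`), the reserved ids (0 for padding, 1 for unknown words), and the length limit are all assumptions; the project's own code may use a library tokenizer instead.

```python
from collections import Counter


def build_vocab(texts, max_words=10000):
    """Map the most frequent words to integer ids (0 = padding, 1 = unknown)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i + 2 for i, (word, _) in enumerate(counts.most_common(max_words))}


def encode(text, vocab, max_len=8):
    """Convert text to a fixed-length list of ids, truncating or right-padding with 0."""
    ids = [vocab.get(word, 1) for word in text.split()][:max_len]
    return ids + [0] * (max_len - len(ids))


# Hypothetical, already-cleaned example emails.
corpus = ["free prize claim now", "meeting schedule attached", "claim free money"]
vocab = build_vocab(corpus)
encoded = encode("claim free gift", vocab, max_len=5)
print(encoded)  # "gift" is not in the vocabulary, so it is encoded as 1
```

Fixed-length integer sequences like these are a common input for an embedding layer, which is one reason this step usually lives in the training code rather than in the stored dataset.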

 

Files

In this repository you will find the following files:
README.md: Project overview, dataset source, structure, and dependency information.
confusion_matrix.png: A confusion matrix that shows the performance of the model on the test set.
evaluation_metrics.txt: Text summary of evaluation metrics: accuracy, precision, recall, and F1-score.
test_predictions.csv: A CSV file that contains the predictions of the model on the test set.
top_spam_words.png: A bar chart showing the top 10 most frequent words in correctly predicted spam emails.
spam_classifier.h5: The trained model file, which can be used to make predictions on new emails.
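
The four metrics named for evaluation_metrics.txt can all be derived from the counts in a binary confusion matrix. The sketch below shows the standard formulas, treating spam as the positive class; the function name and the example counts are invented for illustration and are not results from this record.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    tp: spam predicted as spam      fp: ham predicted as spam
    fn: spam predicted as ham       tn: ham predicted as ham
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration only; NOT taken from this record's files.
metrics = classification_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For spam filtering, precision is often watched closely, because a false positive means a legitimate email is flagged as spam.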

Files (2.2 MiB)

Checksum  Size
md5:e56c616635de23a7597d179a4740fb6c  21.0 KiB
md5:be89b082b87c78a7918cfa714ede3eca  519 Bytes
md5:6632a4186a09a836bfb062e2e41415d1  2.2 MiB
md5:b4264d086071d73396b8a91c50ee4033  9.2 KiB
md5:b7b41130a0ba6753e16206355f31b71f  18.7 KiB
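
The MD5 checksums listed above can be used to confirm that a downloaded file arrived intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `verify_md5` are hypothetical. The listing here does not pair checksums with file names, so the expected value for a given file has to be read from the record page itself.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare a file's digest with a listed value, with or without the 'md5:' prefix."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of_file(path) == expected_hex
```

A call such as `verify_md5(local_path, listed_checksum)` returns `True` only when the local file matches the published digest. MD5 is adequate for detecting accidental corruption, though it is not considered secure against deliberate tampering.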

Additional details

References