Ex-06: Classifying Spam Emails

Objective:

Apply and evaluate classification models to predict whether an e-mail is spam based on word- and character-frequency features, focusing on understanding model performance through various evaluation metrics.

Prerequisites: Ensure you have Python, Jupyter Notebook, and the required libraries (pandas, numpy, scikit-learn, matplotlib, seaborn, mord) installed. The dataset spam.csv should be available in the data directory.

Find the template GitHub repo HERE.

Dataset:

The data this week comes from Vincent Arel-Bundock’s Rdatasets package (https://vincentarelbundock.github.io/Rdatasets/index.html).

Rdatasets is a collection of 2246 datasets which were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.

We’re working with the spam email dataset, a subset of the full spam e-mail database.

This is a dataset collected at Hewlett-Packard Labs by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt and shared with the UCI Machine Learning Repository. The dataset classifies 4601 e-mails as spam or non-spam, with additional variables indicating the frequency of certain words and characters in the e-mail.

Metadata

| Variable | Class     | Description                                                              |
|----------|-----------|--------------------------------------------------------------------------|
| crl.tot  | double    | Total length of uninterrupted sequences of capitals                      |
| dollar   | double    | Occurrences of the dollar sign, as percent of total number of characters |
| bang     | double    | Occurrences of !, as percent of total number of characters               |
| money    | double    | Occurrences of "money", as percent of total number of characters         |
| n000     | double    | Occurrences of the string "000", as percent of total number of words     |
| make     | double    | Occurrences of "make", as percent of total number of words               |
| yesno    | character | Outcome variable, a factor with levels n (not spam) and y (spam)         |

(Source: TidyTuesday)

Question:

Can we predict whether an email is spam or not using decision tree classification?

Step 1: Setup and Data Preprocessing

  • Start by importing the necessary libraries and loading the spam.csv dataset.

  • Preprocess the data by encoding categorical variables, defining the features and target, and splitting the data into training and testing sets. Finally, apply PCA to reduce dimensionality to two components; an optional note on feature scaling follows the code block.

    # Import libraries
    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from sklearn.decomposition import PCA
    from sklearn.model_selection import train_test_split
    
    # Load the dataset
    spam = pd.read_csv("data/spam.csv")
    
    # Encode categorical variables
    categorical_columns = spam.select_dtypes(include = ['object', 'category']).columns.tolist()
    label_encoders = {col: LabelEncoder() for col in categorical_columns}
    for col in categorical_columns:
        spam[col] = label_encoders[col].fit_transform(spam[col])
    
    # Define features and target
    X = spam.drop('yesno', axis = 1)
    y = spam['yesno']
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
    
    # Reduce dimensionality
    pca = PCA(n_components = 2)
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)
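
A caveat worth knowing: PCA is driven by variance, so unscaled features with large ranges (such as crl.tot here) can dominate the components. If you want to experiment, here is an optional sketch, not part of the template, that standardizes the features before PCA:

    # Optional variant (an assumption, not part of the template): standardize
    # features before PCA so no single high-variance feature dominates the
    # components. This replaces the unscaled projection from above.
    from sklearn.preprocessing import StandardScaler
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    pca = PCA(n_components = 2)
    X_train_pca = pca.fit_transform(X_train_scaled)
    X_test_pca = pca.transform(X_test_scaled)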

Step 2: Model Training and Decision Boundary Visualization

  • Train a Decision Tree classifier on the PCA-transformed training data.

  • Implement and use the decisionplot function to visualize the decision boundary of your trained model.

    from sklearn.tree import DecisionTreeClassifier
    import matplotlib.pyplot as plt
    
    # Train Decision Tree (random_state set for reproducibility)
    dtree = DecisionTreeClassifier(random_state = 42)
    dtree.fit(X_train_pca, y_train)
    
    # Implement the decisionplot function (as provided in the lecture content)
    # Add the decisionplot function here
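    
    # One minimal sketch of decisionplot (an assumption; the lecture's version
    # may differ): predict over a grid spanning the two features, shade the
    # plane by predicted class, then overlay the training points.
    def decisionplot(model, X, y, resolution = 200):
        x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
        y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                             np.linspace(y_min, y_max, resolution))
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
        plt.contourf(xx, yy, Z, alpha = 0.3)
        plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c = y, edgecolor = 'k', s = 20)
        plt.xlabel(X.columns[0])
        plt.ylabel(X.columns[1])
        plt.show()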

    # Visualize decision boundary
    decisionplot(dtree, pd.DataFrame(X_train_pca, columns = ['PC1', 'PC2']), y_train)

Step 3: Model Evaluation

  • Evaluate your model using accuracy, precision, recall, F1 score, and AUC-ROC metrics.

    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc
    from sklearn.preprocessing import label_binarize
    
    # Predictions
    predictions = dtree.predict(X_test_pca)
    
    # Evaluate metrics
    accuracy = accuracy_score(y_test, predictions)
    precision = precision_score(y_test, predictions, average = 'weighted')
    recall = recall_score(y_test, predictions, average = 'weighted')
    f1 = f1_score(y_test, predictions, average = 'weighted')
    
    # Display results
    print(f"Accuracy: {accuracy:.2f}")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 Score: {f1:.2f}")
    
    # For AUC-ROC, binarize the output and calculate AUC-ROC for each class
    # Add the necessary code for AUC-ROC calculation here (refer to lecture content)
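
Since yesno is binary here, one possible sketch (an assumption; the lecture's approach may differ) scores the predicted probability of the positive class rather than the hard predictions:

    # AUC-ROC sketch for the binary case (an assumption; adapt to the lecture):
    # use predicted probabilities, not hard class labels.
    probs = dtree.predict_proba(X_test_pca)[:, 1]
    auc_roc = roc_auc_score(y_test, probs)
    print(f"AUC-ROC: {auc_roc:.2f}")
    
    # Plot the ROC curve against the chance diagonal
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label = f"Decision Tree (AUC = {auc_roc:.2f})")
    plt.plot([0, 1], [0, 1], linestyle = "--", color = "grey")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.show()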

Assignment:

  • Implement the missing parts of the code: the decisionplot function and AUC-ROC calculation.

  • Discuss the results among your peers. Consider the following:

    • Which metric is most informative for this problem and why?

    • How does the decision boundary visualization help in understanding the model’s performance?

    • Reflect on the impact of PCA on model performance and the decision boundary.

Submission:

  • Submit your Jupyter Notebook via GitHub with implemented code and a brief summary of your discussion findings regarding model evaluation and the impact of PCA.