Ex-06: Classifying Spam Emails
Objective:
Apply and evaluate different classification models to predict whether an email is spam based on word and character frequencies, focusing on understanding model performance through various evaluation metrics.
Prerequisites: Ensure you have Python, Jupyter Notebook, and the required libraries (`pandas`, `numpy`, `scikit-learn`, `matplotlib`, `seaborn`, `mord`) installed. The dataset `spam.csv` should be available in the `data` directory.
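If any of these are missing, they can be installed from within a notebook cell; a minimal example, assuming `pip` manages your environment:

```python
# Install the prerequisites from within Jupyter (assumes pip is available)
%pip install pandas numpy scikit-learn matplotlib seaborn mord
```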
Find the template GitHub repo HERE.
Dataset:
The data this week comes from Vincent Arel-Bundock’s Rdatasets package (https://vincentarelbundock.github.io/Rdatasets/index.html).
Rdatasets is a collection of 2246 datasets which were originally distributed alongside the statistical software environment R and some of its add-on packages. The goal is to make these data more broadly accessible for teaching and statistical software development.
We’re working with the spam email dataset, a subset of the spam e-mail database collected at Hewlett-Packard Labs by Mark Hopkins, Erik Reeber, George Forman, and Jaap Suermondt and shared with the UCI Machine Learning Repository. The dataset classifies 4601 e-mails as spam or non-spam, with additional variables indicating the frequency of certain words and characters in the e-mail.
Metadata
| Variable | Class | Description |
|---|---|---|
| `crl.tot` | double | Total length of uninterrupted sequences of capitals |
| `dollar` | double | Occurrences of the dollar sign, as percent of total number of characters |
| `bang` | double | Occurrences of `!`, as percent of total number of characters |
| `money` | double | Occurrences of `money`, as percent of total number of characters |
| `n000` | double | Occurrences of the string `000`, as percent of total number of words |
| `make` | double | Occurrences of `make`, as a percent of total number of words |
| `yesno` | character | Outcome variable, a factor with levels `n` (not spam) and `y` (spam) |
(Source: TidyTuesday)
Question:
Can we predict whether an email is spam or not using decision tree classification?
Step 1: Setup and Data Preprocessing
Start by importing the necessary libraries and loading the `spam.csv` dataset. Preprocess the data by encoding categorical variables, defining the features and target, and splitting the data into training and testing sets. Finally, apply PCA to reduce dimensionality.
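```python
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Load the dataset
spam = pd.read_csv("data/spam.csv")

# Encode categorical variables
categorical_columns = spam.select_dtypes(include = ['object', 'category']).columns.tolist()
label_encoders = {col: LabelEncoder() for col in categorical_columns}
for col in categorical_columns:
    spam[col] = label_encoders[col].fit_transform(spam[col])

# Define features and target
X = spam.drop('yesno', axis = 1)
y = spam['yesno']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Reduce dimensionality
pca = PCA(n_components = 2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
```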
Step 2: Model Training and Decision Boundary Visualization
Train a Decision Tree classifier on the PCA-transformed training data. Implement and use the `decisionplot` function to visualize the decision boundary of your trained model.
```python
from sklearn.tree import DecisionTreeClassifier
import matplotlib.pyplot as plt

# Train Decision Tree
dtree = DecisionTreeClassifier()
dtree.fit(X_train_pca, y_train)

# Implement the decisionplot function (as provided in the lecture content)
# Add the decisionplot function here

# Visualize decision boundary
decisionplot(dtree, pd.DataFrame(X_train_pca, columns = ['PC1', 'PC2']), y_train)
```
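The `decisionplot` implementation is intentionally left for you to fill in from the lecture content. As a reference point only, here is a minimal sketch of one common approach (the signature, grid resolution, and styling below are assumptions, not the lecture's version): predict the class at every point of a dense grid over the two principal components and shade the resulting regions.

```python
import numpy as np
import matplotlib.pyplot as plt

def decisionplot(model, X, y, resolution = 200):
    # Span a dense grid over the two plotted features (with a small margin)
    x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
    y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                         np.linspace(y_min, y_max, resolution))
    # Colour each grid point by the model's predicted class
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha = 0.3)
    # Overlay the actual data points on the shaded regions
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c = y, edgecolors = 'k', s = 20)
    plt.xlabel(X.columns[0])
    plt.ylabel(X.columns[1])
    plt.show()
```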
Step 3: Model Evaluation
- Evaluate your model using accuracy, precision, recall, F1 score, and AUC-ROC metrics.
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, auc
from sklearn.preprocessing import label_binarize

# Predictions
predictions = dtree.predict(X_test_pca)

# Evaluate metrics
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average = 'weighted')
recall = recall_score(y_test, predictions, average = 'weighted')
f1 = f1_score(y_test, predictions, average = 'weighted')

# Display results
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")

# For AUC-ROC, binarize the output and calculate AUC-ROC for each class
# Add the necessary code for AUC-ROC calculation here (refer to lecture content)
```
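The AUC-ROC code is likewise left to you. As one possible sketch for the binary target produced in Step 1 (where the `LabelEncoder` maps `n` to 0 and `y` to 1; the lecture's `label_binarize`-based version may generalize this to multiple classes):

```python
# Predicted probability of the positive (spam) class
probs = dtree.predict_proba(X_test_pca)[:, 1]

# ROC curve and the area under it
fpr, tpr, _ = roc_curve(y_test, probs)
roc_auc = auc(fpr, tpr)
print(f"AUC-ROC: {roc_auc:.2f}")

# Plot the ROC curve against the chance diagonal
plt.plot(fpr, tpr, label = f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle = '--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```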
Assignment:
- Implement the missing parts of the code: the `decisionplot` function and the AUC-ROC calculation.
- Discuss the results among your peers. Consider the following:
  - Which metric is most informative for this problem and why?
  - How does the decision boundary visualization help in understanding the model’s performance?
  - Reflect on the impact of PCA on model performance and the decision boundary (the variance check sketched below can help ground this).
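For that last point, one concrete starting point is to check how much of the original variance the two retained components actually keep, using the fitted `pca` object from Step 1:

```python
# Proportion of variance captured by each retained principal component
print(pca.explained_variance_ratio_)
print(f"Total variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```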
Submission:
- Submit your Jupyter Notebook via GitHub with implemented code and a brief summary of your discussion findings regarding model evaluation and the impact of PCA.