Homework 3
For any exercise where you’re writing code, insert a code chunk and make sure to label the chunk. Use a short and informative label. For any exercise where you’re creating a plot, make sure to label all axes, legends, etc. and give it an informative title. For any exercise where you’re including a description and/or interpretation, use full sentences. Make a commit at least after finishing each exercise, or better yet, more frequently. Push your work regularly to GitHub, and make sure all checks pass.
For this homework, you can find an overview of the types of classification methods, their pros and cons, and implementation notes here.
Classification Methods
Welcome to this week’s data mining challenge! Your mission is to dive into the world of machine learning with a focus on classification algorithms. Using a dataset from #TidyTuesday, a project that presents a new dataset to the R for Data Science community every week, you will explore how to categorize data into distinct classes.
Objective: You will apply a variety of classification methods to the dataset, fine-tune models, and interpret the results to gain insights. This exercise aims to enhance your understanding of classification techniques and prepare you to tackle real-world data science problems.
Datasets & question
Based on a county’s demographic, employment, and household data, can we predict whether the median price for center-based childcare for preschoolers (mc_preschool) will be above or below the state’s average?
Part 1: Data Preparation and Exploration
Task 1: Data Retrieval
- Download the dataset provided for the week from the TidyTuesday repository.
- Load the dataset into a Python environment using pandas.
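If you are unsure where to start, a minimal loading sketch is below. The URL assumes the 2023-05-09 TidyTuesday release (the childcare costs data that contains mc_preschool); adjust the path and file names if your assigned week differs.

```python
import pandas as pd

# Assumed paths: the 2023-05-09 TidyTuesday release holds the childcare costs data.
# Swap in the files for your assigned week if they differ.
BASE = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-09"

childcare_costs = pd.read_csv(f"{BASE}/childcare_costs.csv")
counties = pd.read_csv(f"{BASE}/counties.csv")

print(childcare_costs.shape)
print(childcare_costs.head())
```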
Task 2: Exploratory Data Analysis (EDA)
- Perform EDA to understand the features of the dataset.
- Visualize the distribution of the classes (target variable).
- Identify and visualize relationships between features and the target variable.
Hint
- Use matplotlib and seaborn for visualizations to understand data distributions. pandas.DataFrame.describe() can give you descriptive statistics.
- Use seaborn.pairplot() or pandas.plotting.scatter_matrix() to visualize possible correlations.
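A rough sketch of this step is below. It assumes a merged DataFrame df that contains mc_preschool and a binary target above_state_avg (derived in Task 3); mhi_2018, total_pop, and unr_16 are assumed column names (median household income, population, unemployment rate), so check the data dictionary.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes `df` is the merged childcare DataFrame; column names other than
# mc_preschool are assumptions taken from the data dictionary.
print(df.describe())

# Class balance of the (derived) target variable
fig, ax = plt.subplots()
sns.countplot(data=df, x="above_state_avg", ax=ax)
ax.set_xlabel("Median preschool price above state average (1 = yes)")
ax.set_ylabel("Number of county-year observations")
ax.set_title("Class distribution of the target variable")
plt.show()

# Relationships between a few candidate predictors and the price
sns.pairplot(df[["mc_preschool", "mhi_2018", "total_pop", "unr_16"]].dropna())
plt.show()
```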
Task 3: Data Cleaning
- Handle missing values and remove or impute where appropriate.
- Convert categorical data to numerical data using one-hot encoding or label encoding.
- Merge related tables (e.g., the cost and county data) if appropriate.
Hint
- Check for missing values using pandas.DataFrame.isnull().sum().
- Consider using pandas.DataFrame.dropna() to remove rows with missing values or pandas.DataFrame.fillna() to impute them.
- For encoding categorical variables, look into pandas.get_dummies() for one-hot encoding or sklearn.preprocessing.LabelEncoder for label encoding.
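One possible cleaning sketch, assuming the two tables from Task 1; the join key (county_fips_code) and the state/county column names are assumptions drawn from the data dictionary and may need adjusting.

```python
import pandas as pd

# Merge county metadata onto the cost table (join key is an assumption; verify it)
df = childcare_costs.merge(counties, on="county_fips_code", how="left")

# Inspect missingness, then drop rows where the outcome itself is missing
print(df.isnull().sum().sort_values(ascending=False).head(10))
df = df.dropna(subset=["mc_preschool"])

# Impute remaining numeric gaps with column medians (one simple, defensible choice)
num_cols = df.select_dtypes("number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Binary target: is the county's median preschool price above its state's average?
state_avg = df.groupby("state_name")["mc_preschool"].transform("mean")
df["above_state_avg"] = (df["mc_preschool"] > state_avg).astype(int)

# One-hot encode categorical predictors you decide to keep; drop high-cardinality names
df = df.drop(columns=["county_name"])
df = pd.get_dummies(df, columns=["state_name"], drop_first=True)
```

Whether you impute with medians, use group-wise means, or drop rows entirely is a judgment call; explain your choice in the report.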
Part 2: Feature Engineering and Selection
In this section, you will perform feature engineering and selection on your chosen dataset. You will create or transform features where useful and select a subset of features to carry forward into modeling.
Task 1: Feature Engineering
- Create new features if necessary from existing data.
- Normalize or standardize features if needed.
Hint
- Think about what additional information might be useful that’s not directly provided in the features.
- Standardize features with sklearn.preprocessing.StandardScaler or normalize them with sklearn.preprocessing.MinMaxScaler.
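A small scaling sketch, assuming a numeric feature DataFrame X (with mc_preschool and the target dropped). Note that in practice you would fit the scaler on the training split only (Part 3) and reuse it on the test split to avoid leakage.

```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Assumes X is a numeric feature DataFrame from the cleaning step.
X_standardized = StandardScaler().fit_transform(X)   # zero mean, unit variance
X_normalized = MinMaxScaler().fit_transform(X)        # rescaled to the [0, 1] range
```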
Task 2: Feature Selection
- Use statistical tests, selection methods, or model-based methods to select a subset of features for the classification.
Hint
- Use methods like SelectKBest from sklearn.feature_selection to select features based on univariate statistical tests.
- Consider using feature_importances_ from tree-based models like RandomForest to evaluate feature importance.
- Consider removing highly correlated feature pairs, as we did in class.
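A sketch combining the three approaches from the hint, assuming a numeric feature DataFrame X and binary target y from the earlier steps; the number of features kept (k=10) and the correlation threshold (0.9) are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

# Univariate selection: keep the 10 features with the largest ANOVA F-statistics
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(X, y)
print("SelectKBest kept:", list(X.columns[selector.get_support()]))

# Model-based view: impurity-based importances from a random forest
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))

# Drop one member of each highly correlated pair (the 0.9 threshold is a judgment call)
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
X_reduced = X.drop(columns=to_drop)
```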
Part 3: Model Implementation
This section focuses on implementing classification models. You will split the data, train several classifiers, and validate their performance.
Task 1: Data Splitting
- Split the dataset into a training set and a test set.
Hint
- Use sklearn.model_selection.train_test_split to separate your data into training and testing sets.
- Consider using PCA for dimensionality reduction, picking the optimal number of components as we did in class.
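A splitting sketch, assuming the feature matrix X and binary target y from Part 2. The 80/20 split, the stratification, and the 95% explained-variance target for PCA are illustrative defaults, not requirements.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Optional: PCA for dimensionality reduction. Fit the scaler and PCA on the
# training split only, then apply the same transformation to the test split.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=0.95).fit(scaler.transform(X_train))  # keep ~95% of variance
X_train_pca = pca.transform(scaler.transform(X_train))
X_test_pca = pca.transform(scaler.transform(X_test))
print(pca.n_components_, "components retained")
```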
Task 2: Model Training
- Train at least three of the following classification models:
- Logistic Regression (3 points)
- Decision Tree Classifier (3 points)
- Random Forest Classifier (3 points)
- K-Nearest Neighbors (3 points)
- For each model, briefly explain the working principle.
Hint
- For each algorithm, ensure you understand the basic principle; logistic regression is about probabilities, decision trees make hierarchical decisions, etc.
- Consult the scikit-learn documentation for details on parameters and usage of different classifiers.
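A training sketch covering all four classifiers, assuming the (scaled or PCA-transformed) training split from Task 1; the hyperparameters shown are placeholders you should tune or justify.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Scaling matters most for logistic regression and KNN; tree-based models are
# insensitive to feature scale.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=15),
}

for name, model in models.items():
    model.fit(X_train_pca, y_train)
```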
Task 3: Model Validation
- Use cross-validation to estimate the effectiveness of each model.
- Report the accuracy, precision, recall, F1-score, and ROC-AUC for each model.
Hint
- Implement cross-validation using sklearn.model_selection.cross_val_score or sklearn.model_selection.cross_validate.
- For metrics, explore sklearn.metrics for functions like accuracy_score, precision_score, recall_score, f1_score, and roc_auc_score.
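One way to collect all five metrics in a single table, assuming the models dict and the training split from the previous sketches (5-fold CV is an illustrative default):

```python
import pandas as pd
from sklearn.model_selection import cross_validate

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]

rows = []
for name, model in models.items():
    cv = cross_validate(model, X_train_pca, y_train, cv=5, scoring=scoring)
    rows.append({"model": name, **{m: cv[f"test_{m}"].mean() for m in scoring}})

print(pd.DataFrame(rows).set_index("model").round(3))
```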
Part 4: Model Evaluation and Interpretation
Task 1: Model Comparison + Result Interpretation
- Interpret the results of the best performing model.
- Discuss the importance of feature contributions to the model’s predictions.
- Reflect on the potential real-world implications of the model’s performance (e.g., overfitting, misclassification costs).
Hint
- When interpreting the ROC curve, recall that a curve closer to the top-left corner indicates better performance.
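A plotting sketch for the ROC discussion, assuming best_model is your top performer from Part 3 and the transformed train/test splits from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

# Refit the chosen model on the training split, then evaluate on the held-out test set.
best_model.fit(X_train_pca, y_train)

ax = RocCurveDisplay.from_estimator(best_model, X_test_pca, y_test).ax_
ax.plot([0, 1], [0, 1], linestyle="--", label="chance level")
ax.set_title("ROC curve for the best model (held-out test set)")
ax.legend()
plt.show()
```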
Deliverables:
- A Quarto Notebook containing all code and visualizations.
- A written report summarizing your findings from the EDA, the decisions you made during preprocessing, and the rationale behind your choices.
Submission Guidelines:
- Push your Quarto Notebook to your GitHub repository.
- Ensure your commit messages are descriptive.
Grading Rubric: Your work will be evaluated based on the following criteria:
- Correctness and completeness of the code.
- Quality and clarity of the visualizations and summary report.
- Proper use of comments and documentation in the code.
- Adherence to the submission guidelines.
Points Distribution: Each task is allocated a specific number of points. Points will be awarded based on the completeness and correctness of the work submitted. Be sure to follow best practices in data analysis and provide interpretations for your findings and decisions during preprocessing.
Good luck, and may your insights be profound!