```python
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

# Encode categorical variables
categorical_columns = loans_class.select_dtypes(include=['object', 'category']).columns.tolist()
label_encoders = {col: LabelEncoder() for col in categorical_columns}
for col in categorical_columns:
    loans_class[col] = label_encoders[col].fit_transform(loans_class[col])

# Define features and target
X = loans_class.drop('risk', axis=1)
y = loans_class['risk']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Reduce dimensionality to prevent overfitting
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
```
Reminder: decision boundary
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score

def decisionplot(model, X, y, resolution=216):
    # Build a grid spanning the range of the two features
    x_min, x_max = X.iloc[:, 0].min() - 1, X.iloc[:, 0].max() + 1
    y_min, y_max = X.iloc[:, 1].min() - 1, X.iloc[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                         np.linspace(y_min, y_max, resolution))

    # Predict outcomes for each point on the grid
    if isinstance(model, LinearDiscriminantAnalysis):
        # For (binary) LDA, contour the signed distance from decision_function
        Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
    else:
        Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the actual data points
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=y, edgecolors='k', s=20)

    # Overlay the decision boundary
    plt.contourf(xx, yy, Z, alpha=0.5)

    # Calculate the accuracy and report it in the title
    predictions = model.predict(X)
    acc = accuracy_score(y, predictions)
    plt.title(f"Accuracy: {acc:.2f}")

    # Set labels for axes
    plt.xlabel(X.columns[0])
    plt.ylabel(X.columns[1])
    plt.show()
```
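As a minimal usage sketch, assuming the PCA features and split from the preprocessing step above and LDA as an illustrative classifier (any fitted two-feature model would do), the helper can be called like this:

```python
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Wrap the PCA scores in a DataFrame so decisionplot can use .iloc and .columns
X_train_pca_df = pd.DataFrame(X_train_pca, columns=['PC1', 'PC2'])

# Fit an illustrative LDA classifier on the two components and plot its boundary
lda = LinearDiscriminantAnalysis().fit(X_train_pca_df, y_train)
decisionplot(lda, X_train_pca_df, y_train)
```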
Model selection is the task of choosing the best model from among a set of candidates on the basis of a performance criterion. In machine learning, and more generally in statistical analysis, this means selecting a statistical model from a set of candidate models, given the data. Occam's Razor is typically a good guiding principle: prefer the simplest model that performs adequately.
Broadly, we will focus on two categories:
Evaluating performance
The process of assessing the performance of a machine learning model using various metrics, such as accuracy, precision, recall, and F1 score, to determine how effectively it makes predictions on new, unseen data (a short sketch of these metrics follows below).
Validation methods
The technique of verifying a machine learning model’s performance and reliability on a separate dataset (validation set) that was not used during the model’s training, to ensure that it generalizes well to new data.
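A minimal sketch of computing the metrics mentioned above on the held-out test set, assuming the preprocessed PCA features and split from earlier and a DecisionTreeClassifier as an illustrative (not prescribed) model:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative classifier fit on the PCA features from the preprocessing step above
clf = DecisionTreeClassifier(random_state=42).fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)

print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred, average='macro'):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred, average='macro'):.2f}")
print(f"F1 Score:  {f1_score(y_test, y_pred, average='macro'):.2f}")
```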
Binarization: label_binarize is used to binarize the labels in a one-vs-all fashion, which is necessary for multiclass ROC calculation.
Macro-average: computes the metric independently for each class and then takes the average (treating all classes equally).
Micro-average: aggregates the contributions of all classes to compute the average metric (a sketch of both averages follows below).
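A short sketch of the multiclass ROC-AUC calculation, assuming more than two risk classes and reusing the clf fitted in the metrics sketch above:

```python
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score

# Binarize the true labels in a one-vs-all fashion
classes = sorted(y.unique())
y_test_bin = label_binarize(y_test, classes=classes)

# Predicted class probabilities from the fitted classifier
y_score = clf.predict_proba(X_test_pca)

print(f"Macro-average ROC-AUC: {roc_auc_score(y_test_bin, y_score, average='macro'):.2f}")
print(f"Micro-average ROC-AUC: {roc_auc_score(y_test_bin, y_score, average='micro'):.2f}")
```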
Conclusions
All metrics come out roughly the same (~50%)
Our risk column is imbalanced, so precision and recall are especially useful
The F1 score is the best single metric for our data because it balances precision and recall
ROC-AUC is effective at measuring how well the model distinguishes between classes
Cross validation
A resampling method that evaluates machine learning models on a limited data sample. It involves partitioning a dataset into complementary subsets, performing the analysis on one subset (training set), and validating the analysis on the other subset (validation set).
Use Case: Widely used for assessing the effectiveness of predictive models, helping to safeguard against overfitting (see the sketch after this list).
Pros:
Provides a more accurate measure of a model’s predictive performance compared to a simple train/test split.
Utilizes the data efficiently as every observation is used for both training and validation.
Cons:
Computationally intensive, especially for large datasets.
Results can vary depending on how the data is divided.
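A minimal cross-validation sketch using cross_val_score, assuming the PCA training features from the preprocessing step above and a DecisionTreeClassifier as a placeholder estimator:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder estimator; any scikit-learn classifier with fit/predict works here
clf = DecisionTreeClassifier(random_state=42)

# Score the model on five complementary train/validation splits
scores = cross_val_score(clf, X_train_pca, y_train, cv=5, scoring='f1_macro')
print(f"Per-split F1: {np.round(scores, 2)}")
print(f"Mean F1: {scores.mean():.2f} (+/- {scores.std():.2f})")
```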
Leave-One-Out Cross-Validation (LOOCV)
Definition: The number of folds equals the number of instances in the dataset. Each model is trained on all data points except one, which is used as the test set.
Use Case: Useful for small datasets where maximizing the training data is important (see the sketch after this list).
Pros:
Utilizes the data to its maximum extent.
Reduces bias as each data point gets to be in the test set exactly once.
Cons:
Highly computationally expensive with large datasets.
High variance in the estimate of model performance as the evaluation can be highly dependent on the data points chosen as the test set.
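A sketch of LOOCV using LeaveOneOut, assuming the same placeholder classifier and PCA training features as above; accuracy is used as the scoring metric because per-fold F1 is undefined when each test fold contains a single observation:

```python
import time
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)

# One fold per observation: n models, each trained on n-1 points
start = time.time()
loo_scores = cross_val_score(clf, X_train_pca, y_train, cv=LeaveOneOut(), scoring='accuracy')
loocv_time = time.time() - start

# Each fold scores a single point (0 or 1), so the mean is an accuracy estimate
print(f"LOOCV accuracy: {loo_scores.mean():.2f} in {loocv_time:.2f} seconds")
```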
k-Fold Cross-Validation
Definition: The dataset is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the remaining k-1 subsets are combined to form the training set.
Use Case: Ideal for small and medium-sized datasets, and when the balance between bias and variance is crucial (see the sketch after this list).
Pros:
Reduces the variance of a single trial of train/test split.
More reliable estimate of out-of-sample performance than LOOCV.
Cons:
Still computationally intensive, especially with large k.
Results can be dependent on the random division of the data into folds.
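A k-fold sketch with k = 10 stratified folds (an assumed choice) to preserve the class balance of risk; the timing variable mirrors the kfold_time used in the comparison further below:

```python
import time
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

start = time.time()
kfold_scores = cross_val_score(clf, X_train_pca, y_train, cv=skf, scoring='f1_macro')
kfold_time = time.time() - start

print(f"k-Fold F1-Score: {kfold_scores.mean():.2f}")
print(f"k-Fold Method took {kfold_time:.2f} seconds")
```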
Bias-Variance Trade-off
Definition: Refers to managing the trade-off between the bias of the model (error due to overly simplistic assumptions) and its variance (error due to sensitivity to small fluctuations in the training set).
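One way to see the trade-off concretely is a validation curve. As an assumed illustration, varying max_depth of a decision tree shows training scores rising (lower bias) while validation scores eventually fall off (higher variance):

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Shallow trees underfit (high bias); very deep trees overfit (high variance)
depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X_train_pca, y_train,
    param_name='max_depth', param_range=depths, cv=5, scoring='f1_macro')

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train F1={tr:.2f}  validation F1={va:.2f}")
```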
Bootstrap F1-Score: 0.48
Bootstrap Method took 1.09 seconds
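The bootstrap numbers above come from the original run; below is a minimal sketch of how such an estimate might be produced, assuming resample from sklearn.utils, 100 bootstrap iterations, and the placeholder classifier used earlier:

```python
import time
import numpy as np
from sklearn.utils import resample
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)

start = time.time()
boot_scores = []
for i in range(100):  # assumed number of bootstrap iterations
    # Sample the training data with replacement and refit the model
    X_boot, y_boot = resample(X_train_pca, y_train, random_state=i)
    clf.fit(X_boot, y_boot)
    boot_scores.append(f1_score(y_test, clf.predict(X_test_pca), average='macro'))
bootstrap_time = time.time() - start

print(f"Bootstrap F1-Score: {np.mean(boot_scores):.2f}")
print(f"Bootstrap Method took {bootstrap_time:.2f} seconds")
```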
```python
ratio = kfold_time / loocv_time
print(f"Ratio of k-Fold time to LOOCV time: {ratio:.4f}")
ratio = bootstrap_time / loocv_time
print(f"Ratio of Bootstrap time to LOOCV time: {ratio:.4f}")
ratio = kfold_time / bootstrap_time
print(f"Ratio of k-Fold time to Bootstrap time: {ratio:.4f}")
```
Ratio of k-Fold time to LOOCV time: 0.0005
Ratio of Bootstrap time to LOOCV time: 0.0052
Ratio of k-Fold time to Bootstrap time: 0.1050