Regressions II

Lecture 8

Dr. Greg Chism

University of Arizona
INFO 523 - Spring 2024

Warm up

Announcements

  • HW 04 is due Apr 10, 11:59pm

  • Final project proposals are due Wed Apr 03, 1pm for peer-review

Setup

# Import all required libraries
# Data handling and manipulation
import pandas as pd
import numpy as np

# Implementing and selecting models
import statsmodels.api as sm
from itertools import combinations
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import Ridge, RidgeCV, Lasso, LassoCV, ElasticNet, ElasticNetCV, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# For advanced visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# Show computation time
import time

# Set Seaborn theme and increase font size of all plot elements
# (passing font_scale here so set_theme does not reset it)
sns.set_theme(style = "white", palette = "colorblind", font_scale = 1.25)

Regressions II

Our data: Indoor air pollution

  • Read + head
  • Metadata
  • Info
  • Describe
  • Missing values
from skimpy import clean_columns

pollution = pd.read_csv("data/merged_pollution.csv", encoding = 'iso-8859-1')
pollution = clean_columns(pollution)

pollution.head()
   entity    year  access_clean_perc       gdp      popn  death_rate_asp
0       0  2000.0          -1.371886  0.000000 -0.104197      371.951345
1       0  2001.0          -1.353313  0.000000 -0.102565      368.490253
2       0  2002.0          -1.330292 -0.877632 -0.100603      355.870851
3       0  2003.0          -1.302300 -0.875238 -0.098471      350.188748
4       0  2004.0          -1.276925 -0.877087 -0.096407      341.858106
variable            class      description
entity              character  Country, unique number identifier
year                double     Year
access_clean_perc   double     % of population with access to clean cooking fuels
gdp                 double     GDP per capita, PPP (constant 2017 international $)
popn                character  Country population
death_rate_asp      double     Cause of death related to air pollution from solid fuels, standardized
pollution.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3264 entries, 0 to 3263
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   entity             3264 non-null   int64  
 1   year               3264 non-null   float64
 2   access_clean_perc  3264 non-null   float64
 3   gdp                3264 non-null   float64
 4   popn               3264 non-null   float64
 5   death_rate_asp     3264 non-null   float64
dtypes: float64(5), int64(1)
memory usage: 153.1 KB
pollution.describe()
            entity        year  access_clean_perc          gdp          popn  death_rate_asp
count  3264.000000  3264.00000       3.264000e+03  3264.000000  3.264000e+03     3264.000000
mean     95.500000  2008.00000      -1.349683e-16     0.000000  4.353816e-17       70.587846
std      55.433366     4.89973       1.000153e+00     1.000153  1.000153e+00       87.057969
min       0.000000  2000.00000      -1.598171e+00    -0.906714 -1.451953e-01        0.005738
25%      47.750000  2004.00000      -1.064244e+00    -0.742517 -1.417803e-01        1.090309
50%      95.500000  2008.00000       4.424448e-01    -0.337723 -1.295730e-01       23.828597
75%     143.250000  2012.00000       9.444564e-01     0.311943 -9.508658e-02      135.902705
max     191.000000  2016.00000       1.013911e+00     5.036683  1.458832e+01      474.973060
pollution.isnull().sum()
entity               0
year                 0
access_clean_perc    0
gdp                  0
popn                 0
death_rate_asp       0
dtype: int64

Multiple regression

Code
# Assuming pollution DataFrame is predefined
X = pollution[['year', 'access_clean_perc', 'gdp', 'popn']]
y = pollution['death_rate_asp']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Adding a constant for OLS
X_train_with_const = sm.add_constant(X_train)
X_test_with_const = sm.add_constant(X_test)

# Fitting the OLS model
model = sm.OLS(y_train, X_train_with_const).fit()

# Making predictions
y_pred = model.predict(X_test_with_const)

# Calculating MSE
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')

# Calculating R-squared on the test set
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2:.4f}')
Mean Squared Error: 1526.64
R-squared: 0.8002
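
Because the model was fit with statsmodels, the full coefficient table (estimates, standard errors, p-values, adjusted \(R^{2}\)) is also available when needed; a quick sketch, not part of the slide output:

# Inspect the full OLS fit: coefficients, standard errors, p-values, adjusted R-squared
print(model.summary())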

Regularization

A technique used in regression to avoid overfitting by shrinking the coefficient estimates toward 0.

Two main methods:

  1. Ridge Regression

  2. Lasso Regression

  3. …but also note cross-validation for hyperparameter tuning

Ridge regression

  • Visual
  • Variance inflation factor
  • Formula
  • Key points


  • VIF quantifies multicollinearity in OLS regressions

  • Measures how much the variance of a coefficient estimate is inflated by multicollinearity

  • A high VIF indicates that a predictor can be linearly predicted from the other predictors

Formula

\(VIF_j = \frac{1}{1-R^{2}_j}\)

  • \(VIF_j\) is the Variance Inflation Factor for the \(j^{th}\) predictor variable.

  • \(R^{2}_j\) is the coefficient of determination obtained by regressing the \(j^{th}\) predictor variable against all other predictor variables.

  • Starts at 1 (no collinearity); values above 5 (or 10, by a looser rule of thumb) indicate problematic multicollinearity

\(RSS + \lambda \sum_{j=1}^{p} \beta_j^2\)

  • Where \(j\) ranges from 1 to \(p\) and \(\lambda \geq 0\)

  • \(\sum_{j=1}^{p} \beta_j^2\) is the L2 penalty (regularization) term

  • Penalized Regression: Adds a penalty to OLS to regularize coefficients, aiding in handling multicollinearity and reducing complexity.

  • Coefficient Shrinkage: Coefficients shrink towards zero, enhancing stability and accuracy.

  • L2 Regularization: Employs squared coefficient sum as a penalty, regulated by \(\lambda\).

  • Bias-Variance Trade-off: Slightly increases bias to reduce variance, preventing overfitting.

  • Efficient Computation: Features a closed-form solution, ensuring computational efficiency.

  • No Feature Elimination: Maintains all features due to non-zero coefficients, unlike Lasso.

  • Effective in \(p > n\): Remains effective when predictors outnumber observations.

  • Interpretability: Less interpretable because all predictors are included.
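
The "closed-form solution" point in the key points above can be made concrete with a small sketch on toy data (illustrative only, not part of the lecture code): with no intercept and a fixed \(\lambda\), the ridge estimate \(\hat{\beta} = (X^{T}X + \lambda I)^{-1}X^{T}y\) matches scikit-learn's Ridge.

# Minimal sketch: ridge closed-form solution vs. scikit-learn on toy data
# (fit_intercept = False so the two parameterizations line up exactly)
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size = (100, 4))                       # toy design matrix
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale = 0.5, size = 100)

lam = 1.0
beta_closed = np.linalg.solve(X_demo.T @ X_demo + lam * np.eye(4), X_demo.T @ y_demo)
beta_sklearn = Ridge(alpha = lam, fit_intercept = False).fit(X_demo, y_demo).coef_

print(np.allclose(beta_closed, beta_sklearn))              # True: identical estimates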

Investigate VIF

VIFs = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
for idx, vif in enumerate(VIFs):
    print(f"VIF for column {X.columns[idx]}: {vif.round(3)}")
VIF for column year: 1.0
VIF for column access_clean_perc: 1.678
VIF for column gdp: 1.679
VIF for column popn: 1.001
  • access_clean_perc and gdp have the highest VIFs, though the values are still modest

  • All VIFs are well below the usual thresholds of 5–10, so multicollinearity is not a major concern here
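
To connect these numbers to the formula \(VIF_j = 1/(1-R^{2}_j)\), a minimal sketch that computes each VIF by hand, assuming the predictor frame X defined in the multiple regression section; like variance_inflation_factor, it regresses each column on the remaining columns as given (no added intercept):

# Minimal sketch: VIF by hand via 1 / (1 - R^2_j)
import statsmodels.api as sm

def vif_by_hand(X, j):
    target = X.iloc[:, j]                       # j-th predictor
    others = X.drop(columns = X.columns[j])     # all remaining predictors
    r2_j = sm.OLS(target, others).fit().rsquared
    return 1 / (1 - r2_j)

for j, col in enumerate(X.columns):
    print(f"VIF by hand for {col}: {vif_by_hand(X, j):.3f}")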

Ridge regression: applied

  • Model summary
  • Residual plot
Code
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Initialize the Ridge Regression model
ridge_reg = Ridge(alpha = 1)  # Alpha is the regularization strength; adjust accordingly

# Fit the model
ridge_reg.fit(X_train, y_train)

# Predict on the testing set
y_pred = ridge_reg.predict(X_test)

# Calculate and print the MSE
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse.round(2)}\n')

# Since Ridge doesn't provide AIC, BIC directly, we focus on what's available
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2.round(2)}')
Mean Squared Error: 1526.71

R-squared: 0.8
Code
residuals = y_test - y_pred
sns.residplot(x = y_pred, y = residuals, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

Model tuning

Optimizing the hyperparameters of a machine learning model to enhance its performance. The process aims to find the best combination of hyperparameters that results in the most accurate predictions for a given dataset.

Key points:

  • Hyperparameters: Pre-set parameters influencing model behavior, not derived from data.

  • Search Methods: Techniques like Grid Search and Random Search to explore hyperparameter spaces.

  • Cross-Validation: Essential for assessing model generalizability during tuning.

  • Performance Metrics: Criteria like accuracy or MSE to evaluate hyperparameter efficacy.

  • Computational Cost: Potentially high, depending on hyperparameter space complexity.
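
As a generic illustration of a search method combined with cross-validation (the sections below use the estimator-specific RidgeCV / LassoCV / ElasticNetCV helpers instead), a minimal grid-search sketch, assuming the X_train / y_train split defined above:

# Minimal sketch: grid search over a hyperparameter space with 5-fold CV,
# scored by (negative) MSE. Assumes X_train, y_train from the split above.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': np.logspace(-6, 6, 13)}            # hyperparameter space
grid = GridSearchCV(Ridge(), param_grid,
                    scoring = 'neg_mean_squared_error',   # performance metric
                    cv = 5)                               # cross-validation folds
grid.fit(X_train, y_train)

print(f'Best alpha: {grid.best_params_["alpha"]}')
print(f'Cross-validated MSE: {-grid.best_score_:.2f}')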

Model tuning: Ridge regression

  • Applied
  • Re-evaluate model
  • Compare with previous model
  • Residuals
# Define a set of alpha values
alphas = np.logspace(-6, 6, 13)

# Initialize RidgeCV
ridge_cv = RidgeCV(alphas = alphas, store_cv_values = True)

# Fit the model
ridge_cv.fit(X_train, y_train)

# Best alpha value
print(f'Best alpha: {ridge_cv.alpha_}')

# Re-initialize and fit the model with the best alpha
best_ridge = Ridge(alpha = ridge_cv.alpha_)
best_ridge.fit(X_train, y_train)

# Make new predictions
y_pred_best = best_ridge.predict(X_test)
Best alpha: 0.1
# Calculate R-squared
r2_best = r2_score(y_test, y_pred_best)
print(f'R-squared with best alpha: {r2_best.round(4)}')

# Calculate Mean Squared Error (MSE)
mse_best = mean_squared_error(y_test, y_pred_best)
print(f'Mean Squared Error with best alpha: {mse_best.round(3)}')
R-squared with best alpha: 0.8002
Mean Squared Error with best alpha: 1526.646
# Assuming `y_pred` are the predictions from the initial Ridge model
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)

# Print comparison
print(f'Initial MSE: {mse_initial.round(3)}, Best Alpha MSE: {mse_best.round(3)}')
print(f'Initial R-squared: {r2_initial.round(4)}, Best Alpha R-squared: {r2_best.round(5)}')
Initial MSE: 1526.708, Best Alpha MSE: 1526.646
Initial R-squared: 0.8002, Best Alpha R-squared: 0.80022
Code
residuals_best = y_test - y_pred_best
plt.scatter(y_pred_best, residuals_best, alpha = 0.5)
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals Plot with Best Alpha')
plt.show()

Lasso regression

  • Visual
  • Formula
  • Key points


\(RSS + \lambda \sum_{j=1}^{p} |\beta_j|\)

  • Where \(j\) ranges from 1 to \(p\) and \(\lambda \geq 0\)

  • \(\sum_{j=1}^{p} |\beta_j|\) is the L1 penalty (regularization) term

  • Penalized Regression: Implements OLS with an added L1 penalty on coefficients’ absolute values to reduce complexity and tackle multicollinearity.

  • Feature Selection: Effectively zeroes out less significant coefficients, offering built-in feature selection for model simplicity.

  • L1 Regularization: Uses \(\sum_{j=1}^{p} |\beta_j|\) as the penalty, with \(\lambda\) tuning the penalty's strength.

  • Bias-Variance: Increases bias to lower variance, aiding in overfitting prevention.

  • Computation: May require iterative optimization, lacking a closed-form solution, especially in high-dimensional datasets.

  • Sparse Solutions: Ideal for models expecting many non-influential features, providing sparsity.

  • Interpretability: Enhances model interpretability by retaining only relevant features.
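
The feature-selection behaviour described in the list above is easy to see on toy data (illustrative only, not the pollution dataset): with a sufficiently large penalty, coefficients of uninformative features are driven to exactly zero.

# Minimal sketch: Lasso zeroes out coefficients of uninformative features
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X_demo = rng.normal(size = (200, 8))
# Only the first two features influence the response
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale = 0.5, size = 200)

lasso_demo = Lasso(alpha = 0.5).fit(X_demo, y_demo)
print(lasso_demo.coef_.round(3))                          # irrelevant features are typically exactly 0
print(f'Non-zero coefficients: {np.sum(lasso_demo.coef_ != 0)}')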

Lasso regression: applied

  • Model summary
  • Residual plot
Code
# Prepare the data (assuming X and y are already defined)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Initialize the Lasso Regression model
lasso_reg = Lasso(alpha = 1.0)  # Alpha is the regularization strength; adjust accordingly

# Fit the model
lasso_reg.fit(X_train, y_train)

# Predict on the testing set
y_pred = lasso_reg.predict(X_test)

# Calculate and print the MSE
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse.round(2)}')

# Since Lasso doesn't provide AIC, BIC directly, we focus on what's available
r2 = r2_score(y_test, y_pred)
print(f'R-squared: {r2.round(2)}')
Mean Squared Error: 1535.4
R-squared: 0.8
Code
residuals = y_test - y_pred
sns.residplot(x = y_pred, y = residuals, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Lasso Regression')
plt.show()

Model tuning: Lasso regression

  • Applied
  • Re-evaluate model
  • Compare with previous model
  • Residuals
# Define a range of alpha values for Lasso
alphas = np.logspace(-6, 6, 13)

# Initialize LassoCV
lasso_cv = LassoCV(alphas = alphas, cv = 5, random_state = 42)

# Fit the model
lasso_cv.fit(X_train, y_train)

# Optimal alpha value
print(f'Optimal alpha: {lasso_cv.alpha_}')

# Re-initialize and fit the model with the optimal alpha
best_lasso = Lasso(alpha = lasso_cv.alpha_)
best_lasso.fit(X_train, y_train)

# Make new predictions
y_pred_best = best_lasso.predict(X_test)
Optimal alpha: 1e-06
# Calculate R-squared with the best alpha
r2_best = r2_score(y_test, y_pred_best)
print(f'R-squared with optimal alpha: {r2_best.round(4)}')

# Calculate Mean Squared Error (MSE) with the best alpha
mse_best = mean_squared_error(y_test, y_pred_best)
print(f'Mean Squared Error with optimal alpha: {mse_best.round(3)}')
R-squared with optimal alpha: 0.8002
Mean Squared Error with optimal alpha: 1526.64
# Assuming `y_pred` are the predictions from the initial Lasso model
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)

# Print comparison
print(f'Initial MSE: {mse_initial.round(3)}, Optimal Alpha MSE: {mse_best.round(3)}')
print(f'Initial R-squared: {r2_initial.round(4)}, Optimal Alpha R-squared: {r2_best.round(4)}')
Initial MSE: 1535.4, Optimal Alpha MSE: 1526.64
Initial R-squared: 0.7991, Optimal Alpha R-squared: 0.8002
Code
residuals_best = y_test - y_pred_best
plt.scatter(y_pred_best, residuals_best, alpha = 0.5)
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals Plot with Optimal Alpha')
plt.show()

Elastic net regression

  • Visual
  • Formula
  • Key points


\(RSS + \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2\)

  • Where \(j\) ranges from 1 to \(p\) and \(\lambda_1, \lambda_2 \geq 0\)

  • \(\sum_{j=1}^{p} |\beta_j|\) is the L1 penalty term, which encourages sparsity in the coefficients

  • \(\sum_{j=1}^{p} \beta_j^2\) is the L2 penalty term, which encourages smoothness in the coefficients by penalizing large values.

  • Combines L1 and L2 Penalties: Merges Ridge and Lasso advantages for multicollinearity and feature selection.

  • Optimizes Feature Selection: L1 part zeroes out insignificant coefficients; L2 part shrinks coefficients to manage multicollinearity.

  • Requires Parameter Tuning: Optimal \(\lambda_1\) and \(\lambda_2\) balance feature elimination and coefficient reduction.

  • Mitigates Overfitting: Adjusts bias-variance trade-off, reducing overfitting risk.

  • Iterative Optimization: No closed-form solution due to L1 penalty; relies on optimization methods.

  • Effective in High Dimensions: Suitable for datasets with more features than observations.

  • Balances Sparsity and Stability: Ensures model relevance and stability through L1 and L2 penalties.

  • Enhances Interpretability: Simplifies the model by keeping only relevant predictors, improving model interpretability.
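
One practical note before the applied example: scikit-learn expresses the two penalties in the formula above through a single alpha and an l1_ratio mixing weight, and it also scales the RSS term by \(1/(2n)\), so the correspondence to \(\lambda_1\) and \(\lambda_2\) is proportional rather than exact. A minimal sketch of the mapping:

# Minimal sketch: relating the slide's lambda_1 (L1) and lambda_2 (L2) penalties
# to scikit-learn's ElasticNet(alpha, l1_ratio). scikit-learn minimizes
#   (1 / (2n)) * RSS + alpha * l1_ratio * sum(|beta_j|)
#                    + 0.5 * alpha * (1 - l1_ratio) * sum(beta_j ** 2)
# so, up to the 1/(2n) scaling of the RSS term,
#   lambda_1 ~ alpha * l1_ratio      and      lambda_2 ~ alpha * (1 - l1_ratio) / 2
from sklearn.linear_model import ElasticNet

enet_demo = ElasticNet(alpha = 1.0, l1_ratio = 0.5)   # equal mix of L1 and L2
# l1_ratio = 1.0 recovers Lasso; l1_ratio close to 0 approaches Ridge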

Elastic net regression: applied

  • Model summary
  • Residual plot
Code
# Assuming X and y are already defined and preprocessed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Initialize the Elastic Net Regression model with a mix of L1 and L2 regularization
elastic_net = ElasticNet(alpha = 1.0, l1_ratio = 0.5)  # Adjust alpha and l1_ratio accordingly

# Fit the model
elastic_net.fit(X_train, y_train)

# Predict on the testing set
y_pred = elastic_net.predict(X_test)

# Calculate and print the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse.round(2)}')
print(f'R-squared: {r2.round(2)}')
Mean Squared Error: 2258.21
R-squared: 0.7
Code
residuals = y_test - y_pred
sns.residplot(x = y_pred, y = residuals, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Elastic Net Regression')
plt.show()

Model tuning: Elastic net regression

  • Applied
  • Re-evaluate model
  • Compare with previous model
  • Residuals
# Define a range of alpha values and l1_ratios for Elastic Net
alphas = np.logspace(-6, 6, 13)
l1_ratios = np.linspace(0.1, 0.9, 9)

# Initialize ElasticNetCV
elastic_net_cv = ElasticNetCV(alphas = alphas, l1_ratio = l1_ratios, cv = 5, random_state = 42)

# Fit the model to find the optimal alpha and l1_ratio
elastic_net_cv.fit(X_train, y_train)

# Optimal alpha and l1_ratio
print(f'Optimal alpha: {elastic_net_cv.alpha_}')
print(f'Optimal l1_ratio: {elastic_net_cv.l1_ratio_}')

# Re-initialize and fit the model with the optimal parameters
best_elastic_net = ElasticNet(alpha=elastic_net_cv.alpha_, l1_ratio=elastic_net_cv.l1_ratio_)
best_elastic_net.fit(X_train, y_train)

# Make new predictions
y_pred_best = best_elastic_net.predict(X_test)
Optimal alpha: 0.0001
Optimal l1_ratio: 0.1
# Calculate R-squared with the best parameters
r2_best = r2_score(y_test, y_pred_best)
print(f'R-squared with optimal parameters: {r2_best.round(4)}')

# Calculate Mean Squared Error (MSE) with the best parameters
mse_best = mean_squared_error(y_test, y_pred_best)
print(f'Mean Squared Error with optimal parameters: {mse_best.round(3)}')
R-squared with optimal parameters: 0.8002
Mean Squared Error with optimal parameters: 1526.655
# Print comparison of MSE and R-squared before and after tuning
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)
print(f'Initial MSE: {mse_initial.round(3)}, Optimal Parameters MSE: {mse_best.round(3)}')
print(f'Initial R-squared: {r2_initial.round(4)}, Optimal Parameters R-squared: {r2_best.round(4)}')
Initial MSE: 2258.214, Optimal Parameters MSE: 1526.655
Initial R-squared: 0.7045, Optimal Parameters R-squared: 0.8002
Code
residuals_best = y_test - y_pred_best
plt.scatter(y_pred_best, residuals_best, alpha = 0.5)
plt.axhline(y = 0, color = 'r', linestyle = '--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residuals Plot with Optimal Parameters')
plt.show()

Summary: regularization

  1. Reduces Overfitting: Regularization (Ridge, Lasso, Elastic Net) constrains model complexity, preventing overfitting by penalizing large coefficients.

  2. Parameter Optimization via CV: Cross-validation determines the optimal regularization strength (λ), ensuring model generalizability.

  3. Feature Selection with Lasso: Lasso (L1 penalty) zeroes out less significant coefficients, enabling automatic feature selection.

  4. Ridge Tackles Multicollinearity: Ridge regression (L2 penalty) minimizes collinearity effects among predictors by evenly shrinking coefficients.

  5. Elastic Net Merges Penalties: Elastic Net combines L1 and L2 penalties, ideal for correlated features and high-dimensional data, with CV tuning for optimal balance.

  6. CV Enhances Model Accuracy: Cross-validation across multiple folds improves accuracy and reliability of the chosen regularization parameter.

  7. Supports Sparse Solutions: Lasso and Elastic Net support sparse solutions, effectively reducing model complexity by selecting only relevant features.

  8. Bias-Variance Trade-off: Regularization adjusts the bias-variance trade-off, slightly increasing bias but significantly reducing variance.

High dimensionality

Dimensional reduction: revisited

Dimension reduction techniques reduce the number of input variables in a dataset. In regression, these methods can help mitigate issues related to multicollinearity, overfitting, and the curse of dimensionality, particularly in high-dimensional data.

Two main methods:

  1. Principal components regression (PCR)
  2. Partial least squares (PLS)

Principal components regression (PCR)

  • Visual
  • Formula
  • Key points

PCR involves two major steps—Principal Component Analysis (PCA) for dimensionality reduction, followed by regression on these principal components.

  1. PCA Transformation

    \(Z=XW\)

    Where \(X\) is the original dataset, \(W\) represents the weight matrix for PCA, and \(Z\) contains the principal components (PCs).

  2. Regression on PCs

    \(\hat{Y} = ZB + \epsilon\)

    Where \(\hat{Y}\) is the predicted outcome, \(Z\) are the PC predictors, \(B\) is the coefficient matrix, \(\epsilon\) is the error term

  1. Dimensionality Reduction: Transforms predictors into fewer uncorrelated principal components.

  2. Overcomes Multicollinearity: Uses orthogonal components, mitigating multicollinearity in the original predictors.

  3. Extracts Informative Features: Captures most data variance through principal components, enhancing model efficiency.

  4. Challenging Interpretation: Coefficients relate to principal components, complicating interpretation relative to original predictors.

  5. Critical Component Selection: Involves choosing the optimal number of components based on variance explained or cross-validation.

  6. Requires Standardization: PCA step necessitates variable scaling to ensure consistent variance across features.

  7. No Direct Feature Elimination: While reducing dimensions, PCR retains linear combinations of all features, unlike methods that select individual variables.
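
The pipeline used in the applied section below bundles these steps; a minimal sketch that performs them explicitly (standardize, compute \(Z = XW\) with PCA, then regress the response on \(Z\)), assuming X_train and y_train from the split above:

# Minimal sketch of PCR step by step: scale, project onto principal
# components (Z = XW), then fit a linear regression on the scores Z.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

X_train_scaled = StandardScaler().fit_transform(X_train)

pca_demo = PCA(n_components = 2)
Z_train = pca_demo.fit_transform(X_train_scaled)                  # Z = XW
print(pca_demo.explained_variance_ratio_.round(3))                # variance captured per PC

pcr_demo = LinearRegression().fit(Z_train, y_train)               # regress y on Z
print(f'Train R-squared with 2 components: {pcr_demo.score(Z_train, y_train):.3f}')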

Principal components regression: applied

  • Model summary
  • Residual plot
Code
# Assuming X and y are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Standardizing the features
scaler = StandardScaler()

# PCA for dimensionality reduction
pca = PCA()

# Linear Regression
lin_reg = LinearRegression()

# Creating a pipeline for PCR
pipeline = Pipeline([('scaler', scaler), ('pca', pca), ('lin_reg', lin_reg)])

# Fit the PCR model
pipeline.fit(X_train, y_train)

# Predict on the testing set
y_pred = pipeline.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse.round(2)}')
print(f'R-squared: {r2.round(4)}')
Mean Squared Error: 1526.64
R-squared: 0.8002
Code
residuals = y_test - y_pred
sns.residplot(x = y_pred, y = residuals, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for PCR')
plt.show()

Model tuning: PCR

  • Applied
  • Compare with previous model
  • Residuals
Code
# Function to calculate MSE with cross-validation
def cv_mse(n_components):
    pipeline = Pipeline([
        ('scaler', StandardScaler()), 
        ('pca', PCA(n_components = n_components)), 
        ('lin_reg', LinearRegression())
    ])
    mse = -cross_val_score(pipeline, X_train, y_train, scoring = 'neg_mean_squared_error', cv = 5).mean()
    return mse

# Testing different numbers of components
components_range = range(1, min(len(X_train.columns), len(X_train)) + 1)
mse_scores = [cv_mse(n) for n in components_range]

# Find the optimal number of components
optimal_components = components_range[np.argmin(mse_scores)]
print(f'Optimal number of components: {optimal_components}')

# Re-fit the PCR model with the optimal number of components
pipeline.set_params(pca__n_components = optimal_components)
pipeline.fit(X_train, y_train)

# New predictions and evaluation
y_pred_opt = pipeline.predict(X_test)
mse_opt = mean_squared_error(y_test, y_pred_opt)
r2_opt = r2_score(y_test, y_pred_opt)

print(f'Optimized Mean Squared Error: {mse_opt.round(2)}')
print(f'Optimized R-squared: {r2_opt.round(4)}')
Optimal number of components: 4
Optimized Mean Squared Error: 1526.64
Optimized R-squared: 0.8002
# Comparing initial and optimized model performances
print(f'Initial vs. Optimized MSE: {mse.round(2)} vs. {mse_opt.round(2)}')
print(f'Initial vs. Optimized R-squared: {r2.round(4)} vs. {r2_opt.round(4)}')
Initial vs. Optimized MSE: 1526.64 vs. 1526.64
Initial vs. Optimized R-squared: 0.8002 vs. 0.8002
Code
residuals_opt = y_test - y_pred_opt
sns.residplot(x = y_pred_opt, y = residuals_opt, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Optimized Residual Plot for PCR')
plt.show()

Partial least squares regression (PLS)

  • Visual
  • Formula
  • Key points

PLS focuses on predicting response variables by finding the linear regression model in a transformed space. It differs from PCR by considering the response variable \(Y\) when determining the new feature space.

  1. PLS Transformation

    \(T=XW^{*}\)

    Where \(X\) is the original dataset, \(W^{*}\) represents the weight matrix for PLS, and \(T\) represents the scores (latent variables).

  2. Regression on latent variables

    \(\hat{Y} = TB + \epsilon\)

    Where \(T\) are the PLS latent variables (predictors), \(B\) is the coefficient matrix, \(\epsilon\) is the error term

  1. Dimensionality Reduction and Prediction: Reduces predictor dimensions while considering response variable \(Y\), optimizing for predictive power.

  2. Handles Multicollinearity: By generating latent variables, PLS mitigates multicollinearity, similar to PCR but with a focus on prediction.

  3. Incorporates Response in Feature Extraction: Unlike PCR, PLS extracts features based on their covariance with \(Y\), enhancing relevant feature capture.

  4. Component Selection is Crucial: The number of latent variables (components) is key to model performance, typically determined via cross-validation.

  5. Standardization May Be Necessary: Just like PCR, variable scaling can be important depending on the data characteristics.

  6. Suitable for High-Dimensional Data: Effective in scenarios where predictors far exceed observations.

  7. Direct Interpretation Challenging: Coefficients apply to latent variables, complicating direct interpretation relative to original predictors.
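
Point 3 in the list above (features are extracted for their covariance with \(Y\)) can be illustrated by comparing the first PLS score with the first principal component, assuming X_train and y_train from the split above:

# Minimal sketch: the first PLS score is built to covary with y, whereas the
# first principal component only tracks variance in X.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

X_train_scaled = StandardScaler().fit_transform(X_train)

pc1 = PCA(n_components = 1).fit_transform(X_train_scaled).ravel()
pls1 = PLSRegression(n_components = 1).fit(X_train_scaled, y_train).transform(X_train_scaled).ravel()

# The PLS score is typically at least as strongly correlated with the response
print(f'|corr(PC1, y)|  = {abs(np.corrcoef(pc1, y_train)[0, 1]):.3f}')
print(f'|corr(PLS1, y)| = {abs(np.corrcoef(pls1, y_train)[0, 1]):.3f}')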

Partial least squares regression: applied

  • Model summary
  • Residual plot
Code
# Assuming X and y are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

# Initialize PLS Regression model
pls = PLSRegression(n_components = 2)  # Adjust n_components as needed

# Fit the model
pls.fit(X_train, y_train)

# Predict on the testing set
y_pred = pls.predict(X_test)

# Calculate and print the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred.flatten())  # Flatten if y_pred is 2D

print(f'Mean Squared Error: {mse.round(2)}')
print(f'R-squared: {r2.round(4)}')
Mean Squared Error: 1534.9
R-squared: 0.7991
Code
residuals = y_test - y_pred.flatten()
sns.residplot(x = y_pred.flatten(), y = residuals, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for PLS Regression')
plt.show()

Model tuning: PLS

  • Applied
  • Compare with previous model
  • Residuals
Code
# Function to perform cross-validation and calculate MSE
def cv_mse_pls(n_components):
    pls = PLSRegression(n_components = n_components)
    mse = -cross_val_score(pls, X_train, y_train, scoring='neg_mean_squared_error', cv=5).mean()
    return mse

# Determining the optimal number of components
components_range = range(1, min(len(X_train.columns), len(X_train)) + 1)
mse_scores = [cv_mse_pls(n) for n in components_range]

# Optimal number of components
optimal_components = components_range[np.argmin(mse_scores)]
print(f'Optimal number of components: {optimal_components}')

# Re-fit PLS model with the optimal number of components
pls.set_params(n_components = optimal_components)
pls.fit(X_train, y_train)

# New predictions and evaluation
y_pred_opt = pls.predict(X_test)
mse_opt = mean_squared_error(y_test, y_pred_opt)
r2_opt = r2_score(y_test, y_pred_opt.flatten())

print(f'Optimized Mean Squared Error: {mse_opt.round(2)}')
print(f'Optimized R-squared: {r2_opt.round(4)}')
Optimal number of components: 3
Optimized Mean Squared Error: 1526.57
Optimized R-squared: 0.8002
# Comparing initial and optimized model performances
print(f'Initial vs. Optimized MSE: {mse.round(2)} vs. {mse_opt.round(2)}')
print(f'Initial vs. Optimized R-squared: {r2.round(4)} vs. {r2_opt.round(4)}')
Initial vs. Optimized MSE: 1534.9 vs. 1526.57
Initial vs. Optimized R-squared: 0.7991 vs. 0.8002
Code
residuals_opt = y_test - y_pred_opt.flatten()
sns.residplot(x = y_pred_opt.flatten(), y = residuals_opt, lowess = True, line_kws = {'color': 'red', 'lw': 1})
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Optimized Residual Plot for PLS Regression')
plt.show()

Considerations in high dimensions

  • Overfitting in High-Dimensional Data: Models may capture noise as signal, leading to overfitting.

  • Curse of Dimensionality: Increased dimensionality exponentially expands data space, causing data sparsity and learning difficulties.

  • Multicollinearity Risks: More features increase the likelihood of correlated predictors, destabilizing models.

  • Use of Regularization: Techniques like Lasso and Ridge effectively address high-dimensional challenges by shrinking coefficients and simplifying models.

  • Model Simplification: PCR and PLS reduce variable count, aiding interpretability.

  • Feature Importance Analysis: Identifying influential features becomes harder in high dimensions; Lasso’s feature selection is beneficial.

  • Need for Dimensionality Reduction: Essential for crafting robust, interpretable models in high-dimensional scenarios.
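
A toy sketch of the overfitting risk when features outnumber observations (illustrative only, not the pollution data): unregularized least squares can fit the training set perfectly, a hallmark of fitting noise, while a regularized fit trades that perfect fit for stabler coefficients.

# Minimal sketch of the p > n setting on toy data: unregularized least squares
# interpolates the training data (train R-squared = 1), while Ridge does not.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(2)
n, p = 50, 200                                       # more features than observations
X_demo = rng.normal(size = (n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0                                  # only 5 features matter
y_demo = X_demo @ beta_true + rng.normal(size = n)
X_new = rng.normal(size = (n, p))                    # fresh data from the same process
y_new = X_new @ beta_true + rng.normal(size = n)

for name, est in [('OLS  ', LinearRegression()), ('Ridge', Ridge(alpha = 10.0))]:
    est.fit(X_demo, y_demo)
    print(f'{name} train R2 = {est.score(X_demo, y_demo):.2f}, '
          f'test R2 = {est.score(X_new, y_new):.2f}')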

Conclusions

  • Table
  • Note: why might the model with the lowest MSE not also have the highest \(R^{2}\)?
Model                                        MSE       \(R^{2}\)
Multiple regression (OLS, best fit)          1526.64   0.8002
Ridge regression (best \(\lambda\))          1526.646  0.8002
Lasso regression (best \(\alpha\))           1526.64   0.8002
Elastic net (best \(\lambda, \alpha\))       1526.655  0.8002
Principal components (best # components)     1526.64   0.8002
Partial least squares (best # components)    1526.57   0.8002

Occam’s Razor loses!

  • Simpler Models: A model with fewer predictors may achieve lower MSE without significantly improving explanation, resulting in lower Adjusted R-squared.

  • Scale & Variance: Models on low-variance data can show lower MSE but not necessarily higher Adjusted R-squared, reflecting prediction efficiency over variance explanation.

  • Balance: Overfit models might have lower MSE but lower Adjusted R-squared, indicating poor generalization versus models that balance complexity and predictive power.

  • Split Variability: Train-test split differences can affect MSE more than Adjusted R-squared, especially with test set anomalies.

  • Error Patterns: Models with evenly distributed errors tend to have higher Adjusted R-squared even if MSE is slightly higher, compared to those with uneven error distribution.

In-class Exercise

Go to ex-08 and perform the tasks

🔗 datamineaz.org
