\(X^n\): the \(n\)th degree polynomial term of \(X\).
\(\beta_0, \beta_1, \beta_2, \dots, \beta_n\): the coefficients that the model will estimate
\(\epsilon_i\): the random error term, the deviation of the observed value from the predicted value
Non-linear Relationship Modeling: Incorporates polynomial terms of predictors, enabling the modeling of complex, non-linear relationships between variables.
Flexibility: The degree of the polynomial (\(n\)) determines the model’s flexibility, with higher degrees allowing for more complex curves.
Overfitting Risk: Higher-degree polynomials can lead to overfitting, capturing noise rather than the underlying data pattern.
Interpretation: While polynomial regression can model non-linear relationships, interpreting the coefficients becomes more challenging as the degree of the polynomial increases.
Use Cases: Suitable for modeling phenomena where the relationship between predictors and response variable is known to be non-linear, such as in biological, agricultural, or environmental data.
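To make the polynomial expansion concrete before it is applied to the Tucson data below, here is a minimal sketch (using a tiny made-up predictor, not the temperature data) of the design-matrix columns that scikit-learn's PolynomialFeatures produces:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical predictor with three observations, purely for illustration
X_toy = np.array([[1.0], [2.0], [3.0]])

# degree=2 expands each x into the columns [1, x, x^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_toy))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]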
X = tucsonTemp[['date_numeric']].values
y = tucsonTemp['tavg'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transforming data for polynomial regression
poly_features = PolynomialFeatures(degree=2)  # Adjust degree as necessary
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Initialize the Linear Regression model
linear_reg = LinearRegression()

# Fit the model with polynomial features
linear_reg.fit(X_train_poly, y_train)

# Predict on the testing set
y_pred = linear_reg.predict(X_test_poly)

# Calculate and print the MSE and R-squared
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse_initial.round(2)}')
print(f'R-squared: {r2_initial.round(2)}')
Mean Squared Error: 55.6
R-squared: 0.76
# Scatter plot of the actual data points
plt.scatter(X_test, y_test, color='lightgray', label='Actual data')

# To plot the polynomial curve, we need to sort the values because the line plot needs to follow the order of X
# Create a sequence of values from the minimum to the maximum X values for plotting the curve
X_plot = np.linspace(np.min(X), np.max(X), 100).reshape(-1, 1)

# Transform the plot data for the polynomial model
X_plot_poly = poly_features.transform(X_plot)

# Predict y values for the plot data
y_plot = linear_reg.predict(X_plot_poly)

# Plot the polynomial curve
plt.plot(X_plot, y_plot, color='red', label='Polynomial fit')

# Labeling the plot
plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('2nd Degree Polynomial Fit to Tucson Temperature Data')
plt.legend()

# Show the plot
plt.show()
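The next code block selects the best degree using the arrays degrees and mse_scores, which come from a degree-comparison step not shown here. A minimal sketch of how that comparison might be computed (assuming candidate degrees 1 through 5, the train/test split from above, and test-set MSE as the selection criterion) is:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

degrees = [1, 2, 3, 4, 5]  # assumed candidate degrees
mse_scores = []

for degree in degrees:
    # Expand the predictor to the current degree, fit, and score on the test set
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    mse_scores.append(mean_squared_error(y_test, model.predict(poly.transform(X_test))))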
# Selecting the best degree based on previous step
best_degree = degrees[np.argmin(mse_scores)]

# Transforming data with the best degree
poly_features_best = PolynomialFeatures(degree=best_degree)
X_train_poly_best = poly_features_best.fit_transform(X_train)
X_test_poly_best = poly_features_best.transform(X_test)

# Fitting the model again
best_model = LinearRegression()
best_model.fit(X_train_poly_best, y_train)

# New predictions
y_pred_best = best_model.predict(X_test_poly_best)

# Calculate new MSE and R-squared
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print(f'Best Degree: {best_degree}, MSE: {mse_best.round(3)}, R-squared: {r2_best.round(4)}')
Best Degree: 5, MSE: 26.067, R-squared: 0.8881
# Generate a sequence of X values for plotting
X_range = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)

# Plot the actual data
plt.scatter(X_test, y_test, color='gray', alpha=0.5, label='Actual data')

colors = ['blue', 'green', 'red', 'purple', 'orange']
labels = ['1st degree', '2nd degree', '3rd degree', '4th degree', '5th degree']

for i, degree in enumerate(degrees):
    # Create polynomial features for the current degree
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)
    X_range_poly = poly_features.transform(X_range)

    # Fit the Linear Regression model
    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    # Predict over the generated range of X values
    y_range_pred = model.predict(X_range_poly)

    # Plot
    plt.plot(X_range, y_range_pred, color=colors[i], label=f'Polynomial fit degree {degree}')

# Enhancing the plot
plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Comparing Polynomial Fits of Different Degrees')
plt.legend()
plt.show()
residuals_best = y_test - y_pred_best

plt.scatter(y_pred_best, residuals_best, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot with Best Degree')
plt.show()
\(I\): indicator function that returns 1 if \(x\) is within the interval \(C_i\) and 0 otherwise
\(\beta_0, \beta_1, \beta_2, \dots, \beta_k\): coefficients representing the constant level of the response variable \(Y\) associated with each interval \(C_i\)
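Putting these pieces together, the step-function regression model takes the standard form

\[ Y_i = \beta_0 + \beta_1 I(X_i \in C_1) + \beta_2 I(X_i \in C_2) + \dots + \beta_k I(X_i \in C_k) + \epsilon_i \]

so that the predicted value is constant within each interval \(C_i\).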
Model Non-linearity: Efficiently captures non-linear relationships by assigning constant values within specific intervals of the predictor variable.
Simple Interpretation: Each step’s effect is straightforward, with a constant response value within each interval.
Adjustable Steps: Flexibility in setting the number and boundaries of steps, although optimal placement may require exploratory data analysis or domain knowledge.
Discontinuity Handling: Can model sudden jumps in the response variable, a feature not readily handled by smooth models like polynomials or splines.
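To see how the intervals translate into model inputs before fitting the temperature data below, here is a minimal sketch (with a few made-up date_numeric values) of how pd.cut and pd.get_dummies build the interval indicators:

import pandas as pd

# Hypothetical predictor values, one per interval, purely for illustration
x = pd.Series([10, 150, 250, 350], name='date_numeric')

# Bin into the intervals (0, 100], (100, 200], (200, 300], (300, 400]
bins = pd.cut(x, bins=[0, 100, 200, 300, 400], labels=False)

# One indicator column per interval, dropping the first as the baseline
print(pd.get_dummies(bins, drop_first=True))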
# Define the step function intervals
step_intervals = [0, 100, 200, 300, 400]
tucsonTemp['step_bins'] = pd.cut(tucsonTemp['date_numeric'], bins=step_intervals, labels=False)

# Prepare the data for step function fitting
X = pd.get_dummies(tucsonTemp['step_bins'], drop_first=True).values
y = tucsonTemp['tavg'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the step function model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate and print the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Mean Squared Error: 53.23
R-squared: 0.77
# Plotting the step function fit
sns.lineplot(x=tucsonTemp['date_numeric'], y=tucsonTemp['tavg'], label='Actual Data', alpha=0.6)

# Generate the step function values for the plot
for i, (lower, upper) in enumerate(zip(step_intervals[:-1], step_intervals[1:])):
    mask = (tucsonTemp['date_numeric'] >= lower) & (tucsonTemp['date_numeric'] < upper)
    plt.hlines(y[mask].mean(), lower, upper, colors='red', label=f'Step {i+1}')

plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Step Function Fit to Tucson Temperature Data')
plt.legend()
plt.show()
\(w_j\): weight (or coefficient) for the \(j\)th basis function
\(M\): number of basis functions used in the model
For Gaussian basis functions, each \(\phi_j(x)\) could be defined as: \(\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)\)
Where:
\(\mu_j\) is the center of the \(j\)th basis function
\(s\) is the width
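With this notation, the basis-expansion model itself can be written as

\[ y(x) = \sum_{j=1}^{M} w_j \, \phi_j(x) \]

(often with an additional constant basis function included so that its weight acts as an intercept); once the \(\phi_j(x)\) values are computed, the weights are estimated with ordinary linear regression.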
Flexibility: Basis functions, such as polynomials, splines, or Gaussians, allow modeling of complex non-linear relationships.
Transformations: They transform the input data into a new space where linear regression techniques can be applied.
Customizability: The choice and parameters of the basis functions can be adapted to the problem, often requiring domain knowledge or model selection techniques.
Overfitting Potential: More basis functions can lead to greater model complexity, which increases the risk of overfitting, hence the need for regularization.
Computational Efficiency: While basis functions can provide powerful models, they may introduce computational complexity, especially when the number of basis functions is large.
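The code below uses scikit-learn's RBFSampler, which approximates an RBF (Gaussian) kernel feature map with random Fourier features rather than placing explicit Gaussian bumps. For intuition, a direct construction of Gaussian basis features that mirrors the \(\phi_j(x)\) formula above might look like this (the centers and width are made-up values, and this sketch is not part of the model fit that follows):

import numpy as np

# Direct construction of Gaussian basis features matching phi_j(x) above
# (hypothetical centers mu_j and shared width s, for illustration only)
def gaussian_basis(x, centers, width):
    # x: (n, 1) array; returns an (n, M) design matrix of basis-function values
    return np.exp(-(x - centers.reshape(1, -1)) ** 2 / (2 * width ** 2))

centers = np.linspace(0, 365, 10)  # assumed M = 10 centers across one year of dates
width = 30.0                       # assumed common width s
Phi = gaussian_basis(np.array([[15.0], [200.0]]), centers, width)
print(Phi.shape)  # (2, 10): one row per observation, one column per basis function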
# Splitting the dataset
X = tucsonTemp[['date_numeric']].values
y = tucsonTemp['tavg'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the basis function model
model = make_pipeline(RBFSampler(gamma=1.0, n_components=50, random_state=42), LinearRegression())

# Fit the model
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Calculate and print the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Mean Squared Error: 93.76
R-squared: 0.60
# Predict on a grid for a smooth line
np.random.seed(42)
num_samples = 365
x_grid = np.linspace(X.min(), X.max(), num_samples).reshape(-1, 1)
y_grid_pred = model.predict(x_grid)

# Plotting the actual data points and the basis function fit
sns.lineplot(x=X_train.squeeze(), y=y_train, color='gray', label='Training data', alpha=0.5)
sns.lineplot(x=X_test.squeeze(), y=y_test, color='blue', label='Testing data', alpha=0.5)
sns.lineplot(x=x_grid.squeeze(), y=y_grid_pred, color='red', label='Basis Function Fit')

# Labeling the plot
plt.xlabel('Normalized Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Basis Function Fit to Tucson Temperature Data')
plt.legend()
plt.show()
# Calculating residuals
residuals = y_test - y_pred

# Plotting residuals
sns.residplot(x=y_pred, y=residuals, lowess=True, line_kws={'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Basis Function Regression')
plt.show()
\(k\) is the number of folds in the cross-validation.
\(score_i\) is the score obtained from the \(i\)-th fold.
\(CV_{score}\) is the cross-validation score, which is the average of all the individual fold scores.
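In other words, the cross-validation score is simply the mean of the per-fold scores:

\[ CV_{score} = \frac{1}{k} \sum_{i=1}^{k} score_i \]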
Pros
Systematic Exploration: It ensures that every combination in the specified parameter range is evaluated.
Reproducibility: The search is deterministic, meaning it will produce the same result each time for the same dataset and parameters.
Cons
Computational Cost: It can be very computationally expensive, especially with a large number of hyperparameters or when the range of values for each hyperparameter is large.
Dimensionality: The time required increases exponentially with the addition of more parameters (known as the curse of dimensionality).
Here, we first run a randomized search to identify a region of the hyperparameter space where the model performs well, then launch a second, finer grid search centered on that region:
# First search: random search to narrow down the range for hyperparameters
random_param_grid = {
    'rbfsampler__gamma': np.logspace(-3, 0, 4),                       # Wider range for gamma
    'rbfsampler__n_components': np.linspace(50, 500, 10).astype(int)  # Wider range for n_components
}

# Create a custom scorer for cross-validation
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    model,
    param_distributions=random_param_grid,
    n_iter=10,  # Number of parameter settings that are sampled
    scoring=mse_scorer,
    cv=5,
    random_state=42
)

# Fit the model
random_search.fit(X_train, y_train)

# Second search: grid search to fine-tune the hyperparameters
# Use best parameters from random search as the center point for the grid search
best_gamma = random_search.best_params_['rbfsampler__gamma']
best_n_components = random_search.best_params_['rbfsampler__n_components']

refined_param_grid = {
    'rbfsampler__gamma': np.linspace(best_gamma / 2, best_gamma * 2, 5),
    'rbfsampler__n_components': [best_n_components - 50, best_n_components, best_n_components + 50]
}

# Initialize GridSearchCV with the refined grid
grid_search = GridSearchCV(
    model,
    param_grid=refined_param_grid,
    scoring=mse_scorer,
    cv=5
)

# Fit the model using GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters from grid search
print(f'Best parameters after Grid Search: {grid_search.best_params_}')

# Re-initialize and fit the model with the best parameters from grid search
best_basis_model = make_pipeline(
    RBFSampler(
        gamma=grid_search.best_params_['rbfsampler__gamma'],
        n_components=grid_search.best_params_['rbfsampler__n_components'],
        random_state=42
    ),
    LinearRegression()
)
best_basis_model.fit(X_train, y_train)

# Make new predictions with the best model
y_pred_best = best_basis_model.predict(X_test)

# Calculate R-squared and Mean Squared Error (MSE) with the best model
r2_best = r2_score(y_test, y_pred_best)
mse_best = mean_squared_error(y_test, y_pred_best)
print(f'Mean Squared Error: {mse_best.round(3)}')
print(f'R-squared: {r2_best.round(4)}')
Best parameters after Grid Search: {'rbfsampler__gamma': 0.002, 'rbfsampler__n_components': 250}
Mean Squared Error: 20.303
R-squared: 0.9129
# Calculate the original MSE and R-squared scores
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)

# Print comparison
print(f'Initial MSE: {mse_initial.round(3)}, Best Parameters MSE: {mse_best.round(3)}')
print(f'Initial R-squared: {r2_initial.round(4)}, Best Parameters R-squared: {r2_best.round(5)}')
Initial MSE: 93.762, Best Parameters MSE: 20.303
Initial R-squared: 0.5976, Best Parameters R-squared: 0.91287
residuals_best = y_test - y_pred_best

sns.residplot(x=y_pred_best, y=residuals_best, lowess=True, line_kws={'color': 'red', 'lw': 1})
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot with Best Parameters')
plt.show()
\(y(x)\): the output of the piecewise polynomial function at input \(x\)
\(a_0, a_1, \dots, a_n, b_0, b_1, \dots, b_n, \dots, z_0, z_1, \dots, z_n\): coefficients specific to each polynomial piece within its respective interval
\([x_0, x_1), [x_1, x_2), \dots, [x_{k-1}, x_k]\): intervals that divide the range of \(x\) into segments, each with its own polynomial
These intervals are defined by the knots \(x_0, x_1, \dots, x_k\), the points where the polynomial changes
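Written out, the piecewise polynomial takes the form

\[ y(x) = \begin{cases} a_0 + a_1 x + \dots + a_n x^n & \text{if } x_0 \le x < x_1 \\ b_0 + b_1 x + \dots + b_n x^n & \text{if } x_1 \le x < x_2 \\ \quad \vdots & \\ z_0 + z_1 x + \dots + z_n x^n & \text{if } x_{k-1} \le x \le x_k \end{cases} \]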
Flexibility: Allows for modeling different behaviors of the response variable across the range of the predictor variable.
Customizable: The number and location of knots (points where the polynomial changes) can be adjusted based on the data or domain knowledge.
Continuity: While the model can change form at each knot, continuity can be enforced to ensure the function is smooth at the transitions.
Complexity Control: The degree of the polynomial for each segment can be chosen to balance the model’s flexibility with the risk of overfitting.
Interpretation: While more complex than a single polynomial model, piecewise polynomials can offer intuitive insights into changes in trends or behaviors within different regions of the data.
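The implementation below fits a separate natural cubic spline to each segment using SciPy's CubicSpline. As a quick illustration of that building block on its own (with made-up points unrelated to the temperature data):

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical strictly increasing x values and responses
x_pts = np.array([0.0, 1.0, 2.0, 3.0])
y_pts = np.array([1.0, 2.0, 0.0, 3.0])

# 'natural' boundary conditions set the second derivative to zero at the endpoints
spline = CubicSpline(x_pts, y_pts, bc_type='natural')
print(spline(1.5))  # evaluate the fitted piecewise cubic at a new point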
# Assign variables
X = tucsonTemp[['date_numeric']].values
y = tucsonTemp['tavg'].values

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ensure X_train and y_train are sorted by X_train
sorted_indices = np.argsort(X_train.squeeze())
X_train_sorted = X_train[sorted_indices].squeeze()  # Convert to 1D
y_train_sorted = y_train[sorted_indices]

# Define initial knots based on domain knowledge or quantiles
initial_knots = np.quantile(X_train_sorted, [0.25, 0.5, 0.75])

# Adjust knots to include the full range of X_train_sorted
knots = np.concatenate(([X_train_sorted.min()], initial_knots, [X_train_sorted.max()]))

# Function to handle duplicates by averaging y values for duplicate x values
def unique_with_average(X, Y):
    unique_X, indices = np.unique(X, return_inverse=True)
    avg_Y = np.array([Y[indices == i].mean() for i in range(len(unique_X))])
    return unique_X, avg_Y

# Fit a natural cubic spline to each segment between consecutive knots
splines = []
for i in range(len(knots) - 1):
    mask = (X_train_sorted >= knots[i]) & (X_train_sorted < knots[i+1])
    X_segment = X_train_sorted[mask]
    y_segment = y_train_sorted[mask]

    # Ensure X_segment is strictly increasing by handling duplicates
    X_segment_unique, y_segment_avg = unique_with_average(X_segment, y_segment)

    if len(X_segment_unique) >= 4:  # Ensuring there are enough points
        spline = CubicSpline(X_segment_unique, y_segment_avg, bc_type='natural')
        splines.append((knots[i], knots[i+1], spline))

# Prediction function: use the spline whose interval contains each x value
def predict_with_splines(X, splines, knots):
    y_pred = np.zeros_like(X)  # values outside every fitted segment are left at 0
    for i, x_val in enumerate(X):
        for start, end, spline in splines:
            if start <= x_val < end:
                y_pred[i] = spline(x_val)
                break
    return y_pred

# Predictions and evaluations
y_pred = predict_with_splines(X_test.squeeze(), splines, knots)  # Ensure X_test is 1D for the function
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Mean Squared Error: 16.80
R-squared: 0.93
# Visualization of the training and testing data
sns.scatterplot(x=X_train.squeeze(), y=y_train, color='gray', label='Training data', alpha=0.5)
sns.scatterplot(x=X_test.squeeze(), y=y_test, color='blue', label='Testing data', alpha=0.5)

# Assuming splines and knots are correctly calculated as per previous steps
x_range = np.linspace(X.min(), X.max(), 400).squeeze()
y_range_pred = predict_with_splines(x_range, splines, knots)
sns.lineplot(x=x_range, y=y_range_pred, color='red', label='Piecewise Polynomial Fit')

plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Piecewise Polynomial Fit to Temperature Data')
plt.legend()
plt.show()