\(X^n\): the \(n\)th degree polynomial term of \(X\).
\(\beta_0, \beta_1, \beta_2, \dots, \beta_n\): the coefficients that the model will estimate
\(\epsilon_i\): the random error term, the deviation of the observed value from the predicted value
Non-linear Relationship Modeling: Incorporates polynomial terms of predictors, enabling the modeling of complex, non-linear relationships between variables.
Flexibility: The degree of the polynomial (\(n\)) determines the model’s flexibility, with higher degrees allowing for more complex curves.
Overfitting Risk: Higher-degree polynomials can lead to overfitting, capturing noise rather than the underlying data pattern.
Interpretation: While polynomial regression can model non-linear relationships, interpreting the coefficients becomes more challenging as the degree of the polynomial increases.
Use Cases: Suitable for modeling phenomena where the relationship between predictors and response variable is known to be non-linear, such as in biological, agricultural, or environmental data.
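To make the polynomial expansion concrete before it is applied to the Tucson data below, here is a minimal sketch (using a tiny made-up predictor, not the temperature data) of the design-matrix columns that scikit-learn's PolynomialFeatures produces:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical predictor with three observations, purely for illustration
X_toy = np.array([[1.0], [2.0], [3.0]])

# degree=2 expands each x into the columns [1, x, x^2]
poly = PolynomialFeatures(degree=2)
print(poly.fit_transform(X_toy))
# [[1. 1. 1.]
#  [1. 2. 4.]
#  [1. 3. 9.]]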
X = tucsonTemp[['date_numeric']].values
y = tucsonTemp['tavg'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Transforming data for polynomial regression
poly_features = PolynomialFeatures(degree=2)  # Adjust degree as necessary
X_train_poly = poly_features.fit_transform(X_train)
X_test_poly = poly_features.transform(X_test)

# Initialize the Linear Regression model
linear_reg = LinearRegression()

# Fit the model with polynomial features
linear_reg.fit(X_train_poly, y_train)

# Predict on the testing set
y_pred = linear_reg.predict(X_test_poly)

# Calculate and print the MSE and R-squared
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse_initial.round(2)}')
print(f'R-squared: {r2_initial.round(2)}')
Mean Squared Error: 55.6
R-squared: 0.76
# Scatter plot of the actual data points
plt.scatter(X_test, y_test, color='lightgray', label='Actual data')

# To plot the polynomial curve, we need to sort the values because the line plot needs to follow the order of X
# Create a sequence of values from the minimum to the maximum X values for plotting the curve
X_plot = np.linspace(np.min(X), np.max(X), 100).reshape(-1, 1)

# Transform the plot data for the polynomial model
X_plot_poly = poly_features.transform(X_plot)

# Predict y values for the plot data
y_plot = linear_reg.predict(X_plot_poly)

# Plot the polynomial curve
plt.plot(X_plot, y_plot, color='red', label='Polynomial fit')

# Labeling the plot
plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('2nd Degree Polynomial Fit to Tucson Temperature Data')
plt.legend()

# Show the plot
plt.show()
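The next code block selects the best degree using the arrays degrees and mse_scores, which come from a degree-comparison step not shown here. A minimal sketch of how that comparison might be computed (assuming candidate degrees 1 through 5, the train/test split from above, and test-set MSE as the selection criterion) is:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

degrees = [1, 2, 3, 4, 5]  # assumed candidate degrees
mse_scores = []

for degree in degrees:
    # Expand the predictor to the current degree, fit, and score on the test set
    poly = PolynomialFeatures(degree=degree)
    model = LinearRegression().fit(poly.fit_transform(X_train), y_train)
    mse_scores.append(mean_squared_error(y_test, model.predict(poly.transform(X_test))))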
# Selecting the best degree based on previous step
best_degree = degrees[np.argmin(mse_scores)]

# Transforming data with the best degree
poly_features_best = PolynomialFeatures(degree=best_degree)
X_train_poly_best = poly_features_best.fit_transform(X_train)
X_test_poly_best = poly_features_best.transform(X_test)

# Fitting the model again
best_model = LinearRegression()
best_model.fit(X_train_poly_best, y_train)

# New predictions
y_pred_best = best_model.predict(X_test_poly_best)

# Calculate new MSE and R-squared
mse_best = mean_squared_error(y_test, y_pred_best)
r2_best = r2_score(y_test, y_pred_best)
print(f'Best Degree: {best_degree}, MSE: {mse_best.round(3)}, R-squared: {r2_best.round(4)}')
Best Degree: 5, MSE: 26.067, R-squared: 0.8881
# Generate a sequence of X values for plotting
X_range = np.linspace(X.min(), X.max(), 500).reshape(-1, 1)

# Plot the actual data
plt.scatter(X_test, y_test, color='gray', alpha=0.5, label='Actual data')

colors = ['blue', 'green', 'red', 'purple', 'orange']
labels = ['1st degree', '2nd degree', '3rd degree', '4th degree', '5th degree']

for i, degree in enumerate(degrees):
    # Create polynomial features for the current degree
    poly_features = PolynomialFeatures(degree=degree)
    X_train_poly = poly_features.fit_transform(X_train)
    X_test_poly = poly_features.transform(X_test)
    X_range_poly = poly_features.transform(X_range)

    # Fit the Linear Regression model
    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    # Predict over the generated range of X values
    y_range_pred = model.predict(X_range_poly)

    # Plot
    plt.plot(X_range, y_range_pred, color=colors[i], label=f'Polynomial fit degree {degree}')

# Enhancing the plot
plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Comparing Polynomial Fits of Different Degrees')
plt.legend()
plt.show()
residuals_best = y_test - y_pred_best

plt.scatter(y_pred_best, residuals_best, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot with Best Degree')
plt.show()
\(I\): indicator function that returns 1 if \(x\) is within the interval \(C_i\) and 0 otherwise
\(\beta_0, \beta_1, \beta_2, \dots, \beta_k\): coefficients representing the constant level of the response variable \(Y\) associated with each interval \(C_i\)
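Putting these pieces together, the step-function regression model takes the standard form

\[ Y_i = \beta_0 + \beta_1 I(X_i \in C_1) + \beta_2 I(X_i \in C_2) + \dots + \beta_k I(X_i \in C_k) + \epsilon_i \]

so that the predicted value is constant within each interval \(C_i\).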
Model Non-linearity: Efficiently captures non-linear relationships by assigning constant values within specific intervals of the predictor variable.
Simple Interpretation: Each step’s effect is straightforward, with a constant response value within each interval.
Adjustable Steps: Flexibility in setting the number and boundaries of steps, although optimal placement may require exploratory data analysis or domain knowledge.
Discontinuity Handling: Can model sudden jumps in the response variable, a feature not readily handled by smooth models like polynomials or splines.
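To see how the intervals translate into model inputs before fitting the temperature data below, here is a minimal sketch (with a few made-up date_numeric values) of how pd.cut and pd.get_dummies build the interval indicators:

import pandas as pd

# Hypothetical predictor values, one per interval, purely for illustration
x = pd.Series([10, 150, 250, 350], name='date_numeric')

# Bin into the intervals (0, 100], (100, 200], (200, 300], (300, 400]
bins = pd.cut(x, bins=[0, 100, 200, 300, 400], labels=False)

# One indicator column per interval, dropping the first as the baseline
print(pd.get_dummies(bins, drop_first=True))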
# Define the step function intervals
step_intervals = [0, 100, 200, 300, 400]
tucsonTemp['step_bins'] = pd.cut(tucsonTemp['date_numeric'], bins=step_intervals, labels=False)

# Prepare the data for step function fitting
X = pd.get_dummies(tucsonTemp['step_bins'], drop_first=True).values
y = tucsonTemp['tavg'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the step function model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Calculate and print the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Mean Squared Error: 53.23
R-squared: 0.77
# Plotting the step function fit
sns.lineplot(x=tucsonTemp['date_numeric'], y=tucsonTemp['tavg'], label='Actual Data', alpha=0.6)

# Generate the step function values for the plot
for i, (lower, upper) in enumerate(zip(step_intervals[:-1], step_intervals[1:])):
    mask = (tucsonTemp['date_numeric'] >= lower) & (tucsonTemp['date_numeric'] < upper)
    plt.hlines(y[mask].mean(), lower, upper, colors='red', label=f'Step {i+1}')

plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Step Function Fit to Tucson Temperature Data')
plt.legend()
plt.show()
\(w_j\): weight (or coefficient) for the \(j\)th basis function
\(M\): number of basis functions used in the model
For Gaussian basis functions, each \(\phi_j(x)\) could be defined as: \(\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2s^2}\right)\)
Where:
\(\mu_j\) is the center of the \(j\)th basis function
\(s\) is the width
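With this notation, the basis-expansion model itself can be written as

\[ y(x) = \sum_{j=1}^{M} w_j \, \phi_j(x) \]

(often with an additional constant basis function included so that its weight acts as an intercept); once the \(\phi_j(x)\) values are computed, the weights are estimated with ordinary linear regression.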
Flexibility: Basis functions, such as polynomials, splines, or Gaussians, allow modeling of complex non-linear relationships.
Transformations: They transform the input data into a new space where linear regression techniques can be applied.
Customizability: The choice and parameters of the basis functions can be adapted to the problem, often requiring domain knowledge or model selection techniques.
Overfitting Potential: More basis functions can lead to greater model complexity, which increases the risk of overfitting, hence the need for regularization.
Computational Efficiency: While basis functions can provide powerful models, they may introduce computational complexity, especially when the number of basis functions is large.
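The code below uses scikit-learn's RBFSampler, which approximates an RBF (Gaussian) kernel feature map with random Fourier features rather than placing explicit Gaussian bumps. For intuition, a direct construction of Gaussian basis features that mirrors the \(\phi_j(x)\) formula above might look like this (the centers and width are made-up values, and this sketch is not part of the model fit that follows):

import numpy as np

# Direct construction of Gaussian basis features matching phi_j(x) above
# (hypothetical centers mu_j and shared width s, for illustration only)
def gaussian_basis(x, centers, width):
    # x: (n, 1) array; returns an (n, M) design matrix of basis-function values
    return np.exp(-(x - centers.reshape(1, -1)) ** 2 / (2 * width ** 2))

centers = np.linspace(0, 365, 10)  # assumed M = 10 centers across one year of dates
width = 30.0                       # assumed common width s
Phi = gaussian_basis(np.array([[15.0], [200.0]]), centers, width)
print(Phi.shape)  # (2, 10): one row per observation, one column per basis function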
# Splitting the dataset
X = tucsonTemp[['date_numeric']].values
y = tucsonTemp['tavg'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the basis function model
model = make_pipeline(RBFSampler(gamma=1.0, n_components=50, random_state=42), LinearRegression())

# Fit the model
model.fit(X_train, y_train)

# Predict on the testing set
y_pred = model.predict(X_test)

# Calculate and print the MSE and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Mean Squared Error: 93.76
R-squared: 0.60
# Predict on a grid for a smooth line
np.random.seed(42)
num_samples = 365
x_grid = np.linspace(X.min(), X.max(), num_samples).reshape(-1, 1)
y_grid_pred = model.predict(x_grid)

# Plotting the actual data points and the basis function fit
sns.lineplot(x=X_train.squeeze(), y=y_train, color='gray', label='Training data', alpha=0.5)
sns.lineplot(x=X_test.squeeze(), y=y_test, color='blue', label='Testing data', alpha=0.5)
sns.lineplot(x=x_grid.squeeze(), y=y_grid_pred, color='red', label='Basis Function Fit')

# Labeling the plot
plt.xlabel('Normalized Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Basis Function Fit to Tucson Temperature Data')
plt.legend()
plt.show()
# Calculating residuals
residuals = y_test - y_pred

# Plotting residuals
sns.residplot(x=y_pred, y=residuals, lowess=True, line_kws={'color': 'red', 'lw': 1})
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot for Basis Function Regression')
plt.show()
\(k\) is the number of folds in the cross-validation.
\(score_i\) is the score obtained from the \(i\)-th fold.
\(CV_{score}\) is the cross-validation score, which is the average of all the individual fold scores.
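In other words, the cross-validation score is simply the mean of the per-fold scores:

\[ CV_{score} = \frac{1}{k} \sum_{i=1}^{k} score_i \]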
Pros
Systematic Exploration: It ensures that every combination in the specified parameter range is evaluated.
Reproducibility: The search is deterministic, meaning it will produce the same result each time for the same dataset and parameters.
Cons
Computational Cost: It can be very computationally expensive, especially with a large number of hyperparameters or when the range of values for each hyperparameter is large.
Dimensionality: The time required increases exponentially with the addition of more parameters (known as the curse of dimensionality).
Here, we first run a randomized search to identify a region of the hyperparameter space where the model performs well, then launch a second, finer grid search centered on that region:
# First search: random search to narrow down the range for hyperparameters
random_param_grid = {
    'rbfsampler__gamma': np.logspace(-3, 0, 4),                       # Wider range for gamma
    'rbfsampler__n_components': np.linspace(50, 500, 10).astype(int)  # Wider range for n_components
}

# Create a custom scorer for cross-validation
mse_scorer = make_scorer(mean_squared_error, greater_is_better=False)

# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    model,
    param_distributions=random_param_grid,
    n_iter=10,  # Number of parameter settings that are sampled
    scoring=mse_scorer,
    cv=5,
    random_state=42
)

# Fit the model
random_search.fit(X_train, y_train)

# Second search: grid search to fine-tune the hyperparameters
# Use best parameters from random search as the center point for the grid search
best_gamma = random_search.best_params_['rbfsampler__gamma']
best_n_components = random_search.best_params_['rbfsampler__n_components']

refined_param_grid = {
    'rbfsampler__gamma': np.linspace(best_gamma / 2, best_gamma * 2, 5),
    'rbfsampler__n_components': [best_n_components - 50, best_n_components, best_n_components + 50]
}

# Initialize GridSearchCV with the refined grid
grid_search = GridSearchCV(
    model,
    param_grid=refined_param_grid,
    scoring=mse_scorer,
    cv=5
)

# Fit the model using GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters from grid search
print(f'Best parameters after Grid Search: {grid_search.best_params_}')

# Re-initialize and fit the model with the best parameters from grid search
best_basis_model = make_pipeline(
    RBFSampler(
        gamma=grid_search.best_params_['rbfsampler__gamma'],
        n_components=grid_search.best_params_['rbfsampler__n_components'],
        random_state=42
    ),
    LinearRegression()
)
best_basis_model.fit(X_train, y_train)

# Make new predictions with the best model
y_pred_best = best_basis_model.predict(X_test)

# Calculate R-squared and Mean Squared Error (MSE) with the best model
r2_best = r2_score(y_test, y_pred_best)
mse_best = mean_squared_error(y_test, y_pred_best)
print(f'Mean Squared Error: {mse_best.round(3)}')
print(f'R-squared: {r2_best.round(4)}')
Best parameters after Grid Search: {'rbfsampler__gamma': 0.002, 'rbfsampler__n_components': 250}
Mean Squared Error: 20.303
R-squared: 0.9129
# Calculate the original MSE and R-squared scores
mse_initial = mean_squared_error(y_test, y_pred)
r2_initial = r2_score(y_test, y_pred)

# Print comparison
print(f'Initial MSE: {mse_initial.round(3)}, Best Parameters MSE: {mse_best.round(3)}')
print(f'Initial R-squared: {r2_initial.round(4)}, Best Parameters R-squared: {r2_best.round(5)}')
Initial MSE: 93.762, Best Parameters MSE: 20.303
Initial R-squared: 0.5976, Best Parameters R-squared: 0.91287
residuals_best = y_test - y_pred_best

sns.residplot(x=y_pred_best, y=residuals_best, lowess=True, line_kws={'color': 'red', 'lw': 1})
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot with Best Parameters')
plt.show()
\(y(x)\): the output of the piecewise polynomial function at input \(x\)
\(a_0, a_1, \dots, a_n, b_0, b_1, \dots, b_n, \dots, z_0, z_1, \dots, z_n\): coefficients specific to each polynomial piece within its respective interval
\([x_0, x_1), [x_1, x_2), \dots, [x_{k-1}, x_k]\): intervals that divide the range of \(x\) into segments, each with its own polynomial
These intervals are defined by the knots \(x_0, x_1, \dots, x_k\), the points where the polynomial changes
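Written out, the piecewise polynomial takes the form

\[ y(x) = \begin{cases} a_0 + a_1 x + \dots + a_n x^n & \text{if } x_0 \le x < x_1 \\ b_0 + b_1 x + \dots + b_n x^n & \text{if } x_1 \le x < x_2 \\ \quad \vdots & \\ z_0 + z_1 x + \dots + z_n x^n & \text{if } x_{k-1} \le x \le x_k \end{cases} \]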
Flexibility: Allows for modeling different behaviors of the response variable across the range of the predictor variable.
Customizable: The number and location of knots (points where the polynomial changes) can be adjusted based on the data or domain knowledge.
Continuity: While the model can change form at each knot, continuity can be enforced to ensure the function is smooth at the transitions.
Complexity Control: The degree of the polynomial for each segment can be chosen to balance the model’s flexibility with the risk of overfitting.
Interpretation: While more complex than a single polynomial model, piecewise polynomials can offer intuitive insights into changes in trends or behaviors within different regions of the data.
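The implementation below fits a separate natural cubic spline to each segment using SciPy's CubicSpline. As a quick illustration of that building block on its own (with made-up points unrelated to the temperature data):

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical strictly increasing x values and responses
x_pts = np.array([0.0, 1.0, 2.0, 3.0])
y_pts = np.array([1.0, 2.0, 0.0, 3.0])

# 'natural' boundary conditions set the second derivative to zero at the endpoints
spline = CubicSpline(x_pts, y_pts, bc_type='natural')
print(spline(1.5))  # evaluate the fitted piecewise cubic at a new point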
# Assign variables
X = tucsonTemp[['date_numeric']].values
y = tucsonTemp['tavg'].values

# Splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Ensure X_train and y_train are sorted by X_train
sorted_indices = np.argsort(X_train.squeeze())
X_train_sorted = X_train[sorted_indices].squeeze()  # Convert to 1D
y_train_sorted = y_train[sorted_indices]

# Define initial knots based on domain knowledge or quantiles
initial_knots = np.quantile(X_train_sorted, [0.25, 0.5, 0.75])

# Adjust knots to include the full range of X_train_sorted
knots = np.concatenate(([X_train_sorted.min()], initial_knots, [X_train_sorted.max()]))

# Function to handle duplicates by averaging y values for duplicate x values
def unique_with_average(X, Y):
    unique_X, indices = np.unique(X, return_inverse=True)
    avg_Y = np.array([Y[indices == i].mean() for i in range(len(unique_X))])
    return unique_X, avg_Y

# Fit a natural cubic spline to each segment between consecutive knots
splines = []
for i in range(len(knots) - 1):
    mask = (X_train_sorted >= knots[i]) & (X_train_sorted < knots[i+1])
    X_segment = X_train_sorted[mask]
    y_segment = y_train_sorted[mask]

    # Ensure X_segment is strictly increasing by handling duplicates
    X_segment_unique, y_segment_avg = unique_with_average(X_segment, y_segment)

    if len(X_segment_unique) >= 4:  # Ensuring there are enough points
        spline = CubicSpline(X_segment_unique, y_segment_avg, bc_type='natural')
        splines.append((knots[i], knots[i+1], spline))

# Prediction function: use the spline whose interval contains each x value
def predict_with_splines(X, splines, knots):
    y_pred = np.zeros_like(X)  # values outside every fitted segment are left at 0
    for i, x_val in enumerate(X):
        for start, end, spline in splines:
            if start <= x_val < end:
                y_pred[i] = spline(x_val)
                break
    return y_pred

# Predictions and evaluations
y_pred = predict_with_splines(X_test.squeeze(), splines, knots)  # Ensure X_test is 1D for the function
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'R-squared: {r2:.2f}')
Mean Squared Error: 16.80
R-squared: 0.93
# Visualization of the training and testing data
sns.scatterplot(x=X_train.squeeze(), y=y_train, color='gray', label='Training data', alpha=0.5)
sns.scatterplot(x=X_test.squeeze(), y=y_test, color='blue', label='Testing data', alpha=0.5)

# Assuming splines and knots are correctly calculated as per previous steps
x_range = np.linspace(X.min(), X.max(), 400).squeeze()
y_range_pred = predict_with_splines(x_range, splines, knots)
sns.lineplot(x=x_range, y=y_range_pred, color='red', label='Piecewise Polynomial Fit')

plt.xlabel('Date Numeric')
plt.ylabel('Daily Average Temperature')
plt.title('Piecewise Polynomial Fit to Temperature Data')
plt.legend()
plt.show()