# Import all required libraries# Data handling and manipulationimport pandas as pdimport numpy as np# Implementing and selecting modelsimport statsmodels.api as smfrom sklearn.model_selection import train_test_split, cross_val_score, KFoldfrom itertools import combinationsfrom mlxtend.feature_selection import SequentialFeatureSelector as SFSfrom sklearn.linear_model import LinearRegression# For advanced visualizationsimport matplotlib.pyplot as pltimport seaborn as sns# Show computation timeimport time# Increase font size of all Seaborn plot elementssns.set(font_scale =1.25)# Set Seaborn themesns.set_theme(style ="white")
Mean square error (MSE): MSE of a predictor is calculated as the average of the squares of the errors, where the error is the difference between the actual value and the predicted value.
R-squared (\(R^2\)): Proportion of variance in the dependent variable explained by the model; closer to 1 indicates a better fit.
Adjusted R-squared (\(R^2_{adj}\)): Modified R² that accounts for the number of predictors; useful for comparing models with different numbers of independent variables.
Residual plots: Visual check for randomness in residuals; patterns may indicate model issues like non-linearity or heteroscedasticity.
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator measures the average of the squares of the errors—that is, the average squared difference between the estimated values and the actual value. The MSE either assesses the quality of a predictoror of an estimator.
plt.scatter(X_test, y_test, label ='Data')line_x = np.linspace(X_test.min(), X_test.max(), 100)line_y = model.predict(sm.add_constant(line_x))plt.plot(line_x, line_y, color ='red', label ='OLS Regression Line')plt.xlabel("Family income ($)")plt.ylabel("Gift aid from university ($)")plt.title('OLS Regression Fit')plt.legend()plt.show()
Multiple Predictors: more than one predictor variable to predict a response variable.
Model Form: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon\) , where \(Y\) is the response variable, \(X_1, X_2,...,X_n\) are predictor variables, \(\beta\)s are coefficients, and \(\epsilon\) is the error term.
Coefficient Interpretation: Each coefficient represents the change in the response variable for one unit change in the predictor, holding other predictors constant.
Assumptions: Includes linearity, no perfect multicollinearity, homoscedasticity, independence of errors, and normality of residuals.
Adjusted R-squared: Used to determine the model’s explanatory power, adjusting for the number of predictors.
Multicollinearity Concerns: High correlation between predictor variables can distort the model and make coefficient estimates unreliable.
Interaction Effects: Can be included to see if the effect of one predictor on the response variable depends on another predictor.
How do factors like debt-to-income ratio, bankruptcy history, and loan term affect the interest rate of a loan?
loans = pd.read_csv("data/loans.csv")loans.head()
emp_title
emp_length
state
homeownership
annual_income
verified_income
debt_to_income
annual_income_joint
verification_income_joint
debt_to_income_joint
...
sub_grade
issue_month
loan_status
initial_listing_status
disbursement_method
balance
paid_total
paid_principal
paid_interest
paid_late_fees
0
global config engineer
3.0
NJ
MORTGAGE
90000.0
Verified
18.01
NaN
NaN
NaN
...
C3
Mar-2018
Current
whole
Cash
27015.86
1999.33
984.14
1015.19
0.0
1
warehouse office clerk
10.0
HI
RENT
40000.0
Not Verified
5.04
NaN
NaN
NaN
...
C1
Feb-2018
Current
whole
Cash
4651.37
499.12
348.63
150.49
0.0
2
assembly
3.0
WI
RENT
40000.0
Source Verified
21.15
NaN
NaN
NaN
...
D1
Feb-2018
Current
fractional
Cash
1824.63
281.80
175.37
106.43
0.0
3
customer service
1.0
PA
RENT
30000.0
Not Verified
10.16
NaN
NaN
NaN
...
A3
Jan-2018
Current
whole
Cash
18853.26
3312.89
2746.74
566.15
0.0
4
security supervisor
10.0
CA
RENT
35000.0
Verified
57.96
57000.0
Verified
37.66
...
C3
Mar-2018
Current
whole
Cash
21430.15
2324.65
1569.85
754.80
0.0
5 rows × 55 columns
Variable
Description
interest_rate
Interest rate on the loan, in an annual percentage.
verified_income
Borrower’s income verification: Verified, Source Verified, and Not Verified.
debt_to_income
Debt-to-income ratio, which is the percentage of total debt of the borrower divided by their total income.
credit_util
The fraction of available credit utilized.
bankruptcy
An indicator variable for whether the borrower has a past bankruptcy in their record. This variable takes a value of 1 if the answer is yes and 0 if the answer is no.
AIC (Akaike Information Criterion): a measure of the relative quality of a statistical model for a given set of data. Lower and fewer parameters = better.
\(AIC=2k−2\ln(L)\), where \(k\) is the number of parameters, and \(L\) is the likelihood of the model.
BIC (Bayesian Information Criterion): a criterion for model selection based on the likelihood of the model. Lower and fewer parameters= better.
\(BIC = k \ln(n) - 2 \ln(L)\), where \(k\) is the number of parameters, \(n\) is the sample size, and \(L\) is the likelihood of the model
Cross-Validation: a statistical method used for assessing how the results of a predictive model will generalize to an independent data set.
Most common method is k-Fold Cross Validation
Pros:
Balanced Approach: AIC balances model fit with complexity, reducing overfitting.
Comparative Utility: Suitable for comparing models on the same dataset.
Widespread Acceptance: Commonly used and recognized in statistical analysis.
Versatile: Applicable across various types of statistical models.
Cons:
Relative Measurement: Only provides a comparative metric and doesn’t indicate absolute model quality.
Risk of Underfitting: Penalizing complex models may lead to underfitting.
Sample Size Sensitivity: Its effectiveness can vary with the size of the dataset.
Assumption of Large Sample: Infinite sample size assumption is impractical.
No Probabilistic Interpretation: Lacks a direct probabilistic meaning, unlike some other statistical measures.
Best Subset Selection is a statistical method used in regression analysis to select a subset of predictors that provides the best fit for the response variable.
It involves:
Considering All Possible Predictor Subsets: For \(p\) predictors, all possible combinations (totaling \(2_p\)) of these predictors are considered.
Fitting a Model for Each Subset: A regression model is fitted for each subset of predictors.
Selecting the Best Model: The best model is selected based on a criterion that balances fit and complexity, such as the lowest Residual Sum of Squares (\(RSS\)) or the highest \(R^{2}_{adj}\) (or \(AIC\), \(BIC\)).
def best_subset_selection(X, y): results = []for k inrange(1, len(X.columns) +1):for combo in combinations(X.columns, k): X_subset = sm.add_constant(X[list(combo)].astype(int)) model = sm.OLS(y, X_subset).fit() results.append({'model': model, 'predictors': combo})return resultssubset_results = best_subset_selection(X_train, y_train)
best_aic = np.infbest_model_1 =Nonefor result in subset_results:if result['model'].aic < best_aic: best_aic = result['model'].aic best_model_1 = resultprint("Best Model Predictors:", best_model_1['predictors'])print("Best Model AIC:", best_aic.round(2))
Best Model Predictors: ('debt_to_income', 'credit_util', 'bankruptcy', 'term', 'credit_checks', 'verified_income_Source Verified', 'verified_income_Verified')
Best Model AIC: 46650.23
best_bic = np.infbest_model_2 =Nonefor result in subset_results:if result['model'].bic < best_bic: best_bic = result['model'].bic best_model_2 = resultprint("Best Model Predictors:", best_model_2['predictors'])print("Best Model BIC:", best_bic.round(2))
Best Model Predictors: ('debt_to_income', 'credit_util', 'bankruptcy', 'term', 'credit_checks', 'verified_income_Source Verified', 'verified_income_Verified')
Best Model BIC: 46706.13
best_r_sq_adj =-np.inf best_model_3 =Nonefor result in subset_results:if result['model'].rsquared_adj > best_r_sq_adj: best_r_sq_adj = result['model'].rsquared_adj best_model_3 = resultprint("Best Model Predictors:", best_model_3['predictors'])print("Best Model Adjusted R-squared:", best_r_sq_adj.round(2))
Best Model Predictors: ('debt_to_income', 'credit_util', 'bankruptcy', 'term', 'credit_checks', 'verified_income_Source Verified', 'verified_income_Verified')
Best Model Adjusted R-squared: 0.19
lr = LinearRegression()# Initialize SequentialFeatureSelector (SFS) for forward selectionsfs = SFS(lr, k_features ='best', # Select the best number of features based on criterion forward =True, # Forward selection floating =False, scoring ='neg_mean_squared_error', # Using negative MSE as scoring criterion cv =0) # No cross-validation# Fit SFS on the training datasfs.fit(X_train, y_train)# Get the names of the selected featuresselected_features =list(sfs.k_feature_names_)# Convert the DataFrame to numeric to ensure all features are numericX_train = X_train.apply(pd.to_numeric, errors='coerce').fillna(0)X_test = X_test.apply(pd.to_numeric, errors='coerce').fillna(0)# Select the features identified by forward selectionX_train_selected = X_train[selected_features]# Fit the final Linear Regression model using selected featuresfinal_model = lr.fit(X_train_selected, y_train)# For AIC, BIC, and adjusted R-squared, use statsmodelsX_train_selected_with_const = sm.add_constant(X_train_selected)# Ensure data type consistencyX_train_selected_with_const = X_train_selected_with_const.astype(float)# Fit the OLS modelmodel_sm = sm.OLS(y_train, X_train_selected_with_const).fit()# Output the resultsprint("Forward Selection Results")print("Selected predictors:", selected_features)print("AIC:", model_sm.aic.round(3))print("BIC:", model_sm.bic.round(3))print("Adjusted R-squared:", model_sm.rsquared_adj.round(3))
# Initialize the linear regression modellr = LinearRegression()# Initialize SequentialFeatureSelector (SFS) for forward selectionsfs = SFS(lr, k_features ='best', # Select the best number of features based on criterion forward =False, # Forward selection floating =False, scoring ='neg_mean_squared_error', # Using negative MSE as scoring criterion cv =0) # No cross-validation# Fit SFS on the training datasfs.fit(X_train, y_train)# Get the names of the selected featuresselected_features =list(sfs.k_feature_names_)# Convert the DataFrame to numeric to ensure all features are numericX_train = X_train.apply(pd.to_numeric, errors='coerce').fillna(0)X_test = X_test.apply(pd.to_numeric, errors='coerce').fillna(0)# Select the features identified by forward selectionX_train_selected = X_train[selected_features]# Fit the final Linear Regression model using selected featuresfinal_model = lr.fit(X_train_selected, y_train)# For AIC, BIC, and adjusted R-squared, use statsmodelsX_train_selected_with_const = sm.add_constant(X_train_selected)# Ensure data type consistencyX_train_selected_with_const = X_train_selected_with_const.astype(float)# Fit the OLS modelmodel_sm = sm.OLS(y_train, X_train_selected_with_const).fit()# Output the resultsprint("Backward Selection Results")print("Selected predictors:", selected_features)print("AIC:", model_sm.aic.round(3))print("BIC:", model_sm.bic.round(3))print("Adjusted R-squared:", model_sm.rsquared_adj.round(3))
def best_subset_cv(X, y, max_features=5): best_score = np.inf best_subset =None# Limiting the number of features for computational feasibilityfor k inrange(1, max_features +1):for subset in combinations(X.columns, k):# Define the model model = LinearRegression()# Perform k-fold cross-validation kf = KFold(n_splits =5, shuffle =True, random_state =42) cv_scores = cross_val_score(model, X[list(subset)], y, cv = kf, scoring ='neg_mean_squared_error')# Compute the average score score =-np.mean(cv_scores) # Convert back to positive MSE# Update the best score and subsetif score < best_score: best_score = score best_subset = subsetreturn best_subset, best_score# Start timingstart_time = time.time()# Assuming X and y are already defined and preprocessedbest_subset, best_score = best_subset_cv(X, y)# End timingend_time = time.time()# Calculate durationduration = end_time - start_timeprint("Best Subset:", best_subset)print("Best CV Score (MSE):", best_score)print("Computation Time: {:.2f} seconds".format(duration))
Best Subset: ('credit_util', 'term', 'credit_checks', 'verified_income_Source Verified', 'verified_income_Verified')
Best CV Score (MSE): 18.664618422927752
Computation Time: 3.48 seconds
Code
# Initialize the linear regression modellr = LinearRegression()# Initialize SequentialFeatureSelector for forward selection with cross-validationsfs = SFS(lr, k_features ='best', # 'best' or specify a number with ('1, 10') for range forward=True, floating=False, scoring ='neg_mean_squared_error', cv = KFold(n_splits =5, shuffle =True, random_state =42))# Start timingstart_time = time.time()# Fit SFS on the training datasfs.fit(X, y)# End timingend_time = time.time()# Calculate durationduration = end_time - start_time# Get the names of the selected featuresselected_features =list(sfs.k_feature_names_)print("Selected predictors (forward selection):", selected_features)print("Computation Time: {:.2f} seconds".format(duration))
# Convert the DataFrame to numeric to ensure all features are numericX_train = X_train.apply(pd.to_numeric, errors ='coerce').fillna(0)X_test = X_test.apply(pd.to_numeric, errors ='coerce').fillna(0)# Select the features identified by forward selectionX_train_selected = X_train[selected_features]# Fit the final Linear Regression model using selected featuresfinal_model = lr.fit(X_train_selected, y_train)# For AIC, BIC, and adjusted R-squared, use statsmodelsX_train_selected_with_const = sm.add_constant(X_train_selected)# Ensure data type consistencyX_train_selected_with_const = X_train_selected_with_const.astype(float)# Fit the OLS modelmodel_sm = sm.OLS(y_train, X_train_selected_with_const).fit()# Model resultsmodel_summary2 = model_sm.summary2()print("Forward selection best fit:\n")print(model_summary2)
Access to clean fuels and technologies for cooking (% of population)
variable
class
description
Entity
character
The country
Code
character
Country code
Year
double
Year
access_clean_perc
double
% of population with access to clean cooking fuels
GDP
double
GDP per capita, PPP (constant 2017 international $)
popn
double
Country population
Continent
character
Continent the country resides on
variable
class
description
Entity
character
The country
Code
character
Country code
Year
double
Year
Death_Rate_ASP
double
Cause of death related to air pollution from solid fuels, standardized
Live coding
Code
from sklearn.impute import SimpleImputerfrom feature_engine.imputation import CategoricalImputer# Read in datafuel_access = pd.read_csv('data/fuel_access.csv')fuel_gdp = pd.read_csv('data/fuel_gdp.csv')death_source = pd.read_csv('data/death_source.csv')# Select relevant columns and rename for consistency if neededfuel_access = fuel_access[['Entity', 'Year', 'access_clean_perc']]fuel_gdp = fuel_gdp[['Entity', 'Year', 'GDP', 'popn']]death_source = death_source[['Entity', 'Year', 'Death_Rate_ASP']]# Ensure 'Year' is an integerfuel_access['Year'] = fuel_access['Year'].astype(int)fuel_gdp['Year'] = fuel_gdp['Year'].astype(int)death_source['Year'] = death_source['Year'].astype(int)# Merge datasets on 'Entity' and 'Year'merged_data = pd.merge(fuel_access, fuel_gdp, on = ['Entity', 'Year'], how ='inner')merged_data = pd.merge(merged_data, death_source, on = ['Entity', 'Year'], how ='inner')# Check for missing valuesprint(merged_data.isnull().sum())# Handle missing values if necessary (e.g., fill with mean or median, or drop)num_imputer = SimpleImputer(strategy ='mean') # or 'median'numerical_cols = merged_data.select_dtypes(include = ['int64', 'float64']).columns.tolist()merged_data[numerical_cols] = num_imputer.fit_transform(merged_data[numerical_cols])merged_data.dropna()merged_data['Entity'] = pd.factorize(merged_data['Entity'])[0]# Final check for data types and missing valuesprint(merged_data.info())print(merged_data.isnull().sum())# Optional: Standardize/normalize numerical columns if necessaryfrom sklearn.preprocessing import StandardScalerscaler = StandardScaler()numerical_columns = ['access_clean_perc', 'GDP', 'popn']merged_data[numerical_columns] = scaler.fit_transform(merged_data[numerical_columns])merged_data.to_csv()# Define the predictor variables and the response variableX = merged_data.drop(['Death_Rate_ASP'], axis =1) # Assuming 'Death_Rate_ASP' is the response variabley = merged_data['Death_Rate_ASP']# Split the data into training and testing setsX_train, X_test, y_train, y_test = train_test_split(X, y, test_size =0.2, random_state =42)# Initialize LinearRegression modellr = LinearRegression()# Initialize SequentialFeatureSelector for forward selection with cross-validationsfs = SFS(lr, k_features ='best', # 'best' or specify a number with ('1, 10') for range forward =True, # Forward selection floating =False, # Set to True for stepwise selection scoring ='neg_mean_squared_error', # Using negative MSE as scoring criterion cv = KFold(n_splits =5, shuffle =True, random_state =42)) # 5-fold cross-validation# Fit SFS on the training datasfs.fit(X_train, y_train)# Get the names of the selected featuresselected_features =list(sfs.k_feature_names_)# Fit the final Linear Regression model using selected featuresfinal_model = lr.fit(X_train[selected_features], y_train)# Evaluate the model performance on the test setfrom sklearn.metrics import mean_squared_errory_pred = final_model.predict(X_test[selected_features])mse = mean_squared_error(y_test, y_pred)print("Selected predictors (forward selection):", selected_features)print("Test MSE:", mse)# Extracting predictor variables (X) and the target variable (y)X = merged_data[selected_features]Y = merged_data['Death_Rate_ASP']# Adding a constant for the intercept termX = sm.add_constant(X)# Fit the regression modelmodel = sm.OLS(y, X).fit()model_summary = model.summary2()print(model_summary)