Regressions I

Lecture 7

Dr. Greg Chism

University of Arizona
INFO 523 - Spring 2024

Warm up

Regressions

Assessing accuracy of model

Four major methods:

Mean squared error (MSE): The average of the squared errors, where each error is the difference between the actual value and the predicted value; lower is better.
R-squared (R²): Proportion of variance in the dependent variable explained by the model; closer to 1 indicates a better fit.
Adjusted R-squared (R²_adj): Modified R² that penalizes additional predictors; useful for comparing models with different numbers of independent variables.
Residual plots: Visual check for randomness in residuals; patterns may indicate model issues like non-linearity or heteroscedasticity.
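
As a minimal sketch of the first three measures, the plain-Python functions below compute MSE, R², and adjusted R² from lists of actual and predicted values. The function names and the toy numbers are illustrative assumptions, not part of the lecture code; `n_predictors` counts predictors excluding the intercept.

```python
def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Shrink R-squared according to how many predictors the model uses."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Toy values for illustration only (not the elmhurst data)
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 6.5, 9.5]

r2 = r_squared(actual, predicted)
print("MSE:", mse(actual, predicted))
print("R-squared:", r2)
print("Adjusted R-squared:", adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1))
```

With one predictor the adjustment is small, but it grows as predictors are added without a matching gain in R².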

	family_income	gift_aid	price_paid
count	50.000000	50.000000	50.000000
mean	101.778520	19.935560	19.544440
std	63.206451	5.460581	5.979759
min	0.000000	7.000000	8.530000
25%	64.079000	16.250000	15.000000
50%	88.061500	20.470000	19.500000
75%	137.174000	23.515000	23.630000
max	271.974000	32.720000	35.000000

OLS regression: applied

Code

import statsmodels.api as sm
from sklearn.model_selection import train_test_split

# `elmhurst` is the DataFrame summarized above
X = elmhurst['family_income']  # Independent variable
y = elmhurst['gift_aid']  # Dependent variable

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
# statsmodels does not add an intercept by default
X_train_with_const = sm.add_constant(X_train)

model = sm.OLS(y_train, X_train_with_const).fit()

model_summary2 = model.summary2()
print(model_summary2)

                 Results: Ordinary least squares
=================================================================
Model:              OLS              Adj. R-squared:     0.205   
Dependent Variable: gift_aid         AIC:                241.3545
Date:               2024-04-25 15:09 BIC:                244.7323
No. Observations:   40               Log-Likelihood:     -118.68 
Df Model:           1                F-statistic:        11.05   
Df Residuals:       38               Prob (F-statistic): 0.00197 
R-squared:          0.225            Scale:              23.273  
-----------------------------------------------------------------
                   Coef.  Std.Err.    t    P>|t|   [0.025  0.975]
-----------------------------------------------------------------
const             24.5170   1.4489 16.9216 0.0000 21.5839 27.4500
family_income     -0.0398   0.0120 -3.3238 0.0020 -0.0640 -0.0156
-----------------------------------------------------------------
Omnibus:              0.057        Durbin-Watson:           2.277
Prob(Omnibus):        0.972        Jarque-Bera (JB):        0.249
Skew:                 0.047        Prob(JB):                0.883
Kurtosis:             2.625        Condition No.:           230  
=================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the
errors is correctly specified.
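
The summary values can be cross-checked by hand. The sketch below recomputes adjusted R² from the reported R², the number of observations, and the number of predictors, and uses the reported coefficients to predict gift aid for one family income. The income value of 100 (in the data's own units) is an arbitrary illustration, and all inputs are the rounded figures printed in the table.

```python
# Values copied from the OLS summary above (rounded as printed)
r2 = 0.225
n_obs = 40
n_predictors = 1
intercept = 24.5170
slope = -0.0398
slope_se = 0.0120

# Adjusted R-squared from R-squared, sample size, and predictor count
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# t statistic is the coefficient divided by its standard error
t_stat = slope / slope_se

# Predicted gift aid for a hypothetical family income of 100
family_income = 100
predicted_gift_aid = intercept + slope * family_income

print(f"Adjusted R-squared: {adj_r2:.3f}")
print(f"t statistic for family_income: {t_stat:.2f}")
print(f"Predicted gift aid: {predicted_gift_aid:.2f}")
```

The recomputed adjusted R² rounds to 0.205, matching the table. The t statistic comes out near -3.32 rather than the reported -3.3238 because the printed coefficient and standard error are themselves rounded.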

Code

import matplotlib.pyplot as plt
import seaborn as sns

# Predict on the held-out test set and compute residuals
X_test_with_const = sm.add_constant(X_test)
predictions = model.predict(X_test_with_const)
residuals = y_test - predictions

# Plot the residuals directly; sns.residplot would fit its own regression
# of y on x and plot the residuals of that second fit instead
sns.scatterplot(x = predictions, y = residuals)
plt.axhline(0, color = 'gray', linestyle = '--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

Code

import numpy as np

plt.scatter(X_test, y_test, label = 'Data')

# Evaluate the fitted line on an evenly spaced grid of incomes
line_x = np.linspace(X_test.min(), X_test.max(), 100)
line_y = model.predict(sm.add_constant(line_x))
plt.plot(line_x, line_y, color = 'red', label = 'OLS Regression Line')

# Both variables are recorded in thousands of dollars
plt.xlabel("Family income ($1000s)")
plt.ylabel("Gift aid from university ($1000s)")
plt.title('OLS Regression Fit')
plt.legend()
plt.show()
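
For a single predictor, the least-squares line has a closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below shows that computation in plain Python. The function name `fit_simple_ols` and the toy data are illustrative assumptions and do not reproduce the `elmhurst` fit.

```python
def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

intercept, slope = fit_simple_ols(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

On the same training data, these two numbers should agree with the `const` and slope coefficients that `sm.OLS` reports, up to floating-point error.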

Variable	Description
`interest_rate`	Interest rate on the loan, in an annual percentage.
`verified_income`	Borrower’s income verification: `Verified`, `Source Verified`, and `Not Verified`.
`debt_to_income`	Debt-to-income ratio: the borrower's total debt divided by their total income, expressed as a percentage.
`credit_util`	The fraction of available credit utilized.
`bankruptcy`	An indicator variable for whether the borrower has a past bankruptcy in their record. This variable takes a value of `1` if the answer is yes and `0` if the answer is no.
`term`	The length of the loan, in months.
`issue_month`	The month and year the loan was issued.
`credit_checks`	Number of credit checks in the last 12 months.

variable	class	description
Entity	character	Country name
Code	character	Country code
Year	double	Year
access_clean_perc	double	Access to clean fuels and technologies for cooking (% of population)

variable	class	description
Entity	character	The country
Code	character	Country code
Year	double	Year
access_clean_perc	double	% of population with access to clean cooking fuels
GDP	double	GDP per capita, PPP (constant 2017 international $)
popn	double	Country population
Continent	character	Continent the country resides on

variable	class	description
Entity	character	The country
Code	character	Country code
Year	double	Year
Death_Rate_ASP	double	Cause of death related to air pollution from solid fuels, standardized

	emp_title	emp_length	state	homeownership	annual_income	verified_income	debt_to_income	annual_income_joint	verification_income_joint	debt_to_income_joint	...	sub_grade	issue_month	loan_status	initial_listing_status	disbursement_method	balance	paid_total	paid_principal	paid_interest
0	global config engineer	3.0	NJ	MORTGAGE	90000.0	Verified	18.01	NaN	NaN	NaN	...	C3	Mar-2018	Current	whole	Cash	27015.86	1999.33	984.14	1015.19
1	warehouse office clerk	10.0	HI	RENT	40000.0	Not Verified	5.04	NaN	NaN	NaN	...	C1	Feb-2018	Current	whole	Cash	4651.37	499.12	348.63	150.49
2	assembly	3.0	WI	RENT	40000.0	Source Verified	21.15	NaN	NaN	NaN	...	D1	Feb-2018	Current	fractional	Cash	1824.63	281.80	175.37	106.43
3	customer service	1.0	PA	RENT	30000.0	Not Verified	10.16	NaN	NaN	NaN	...	A3	Jan-2018	Current	whole	Cash	18853.26	3312.89	2746.74	566.15
4	security supervisor	10.0	CA	RENT	35000.0	Verified	57.96	57000.0	Verified	37.66	...	C3	Mar-2018	Current	whole	Cash	21430.15	2324.65	1569.85	754.80

	emp_length	annual_income	debt_to_income	annual_income_joint	debt_to_income_joint	delinq_2y	months_since_last_delinq	earliest_credit_line	inquiries_last_12m	total_credit_lines	...	public_record_bankrupt	loan_amount	term	interest_rate	installment	balance	paid_total	paid_principal	paid_interest	paid_late_fees
count	9183.000000	1.000000e+04	9976.000000	1.495000e+03	1495.000000	10000.00000	4342.000000	10000.00000	10000.00000	10000.000000	...	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000	10000.000000
mean	5.930306	7.922215e+04	19.308192	1.279146e+05	19.979304	0.21600	36.760709	2001.29000	1.95820	22.679600	...	0.123800	16361.922500	43.272000	12.427524	476.205323	14458.916610	2494.234773	1894.448466	599.666781	0.119516
std	3.703734	6.473429e+04	15.004851	7.016838e+04	8.054781	0.68366	21.634939	7.79551	2.38013	11.885439	...	0.337172	10301.956759	11.029877	5.001105	294.851627	9964.561865	3958.230365	3884.407175	517.328062	1.813468
min	0.000000	0.000000e+00	0.000000	1.920000e+04	0.320000	0.00000	1.000000	1963.00000	0.00000	2.000000	...	0.000000	1000.000000	36.000000	5.310000	30.750000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	2.000000	4.500000e+04	11.057500	8.683350e+04	14.160000	0.00000	19.000000	1997.00000	0.00000	14.000000	...	0.000000	8000.000000	36.000000	9.430000	256.040000	6679.065000	928.700000	587.100000	221.757500	0.000000
50%	6.000000	6.500000e+04	17.570000	1.130000e+05	19.720000	0.00000	34.000000	2003.00000	1.00000	21.000000	...	0.000000	14500.000000	36.000000	11.980000	398.420000	12379.495000	1563.300000	984.990000	446.140000	0.000000
75%	10.000000	9.500000e+04	25.002500	1.515455e+05	25.500000	0.00000	53.000000	2006.00000	3.00000	29.000000	...	0.000000	24000.000000	60.000000	15.050000	644.690000	20690.182500	2616.005000	1694.555000	825.420000	0.000000
max	10.000000	2.300000e+06	469.090000	1.100000e+06	39.980000	13.00000	118.000000	2015.00000	29.00000	87.000000	...	3.000000	40000.000000	60.000000	30.940000	1566.590000	40000.000000	41630.443684	40000.000000	4216.440000	52.980000