Project 1 proposal reviews will be returned to you by Friday
HW 1 lessons learned
Review HW 1 issues, and show us you reviewed them by closing the issue.
DO NOT hard code local paths!
Instead, you can use {pyprojroot}:
python -m pip install pyprojroot
from pyprojroot.here import here
e.g., pd.read_csv(here('data/data.csv'))
Related: data should go into a data folder
Narrative/code description text should be plain text under a ### or #### header under the appropriate code chunk.
HW 1 lessons learned cont…
Start early. No late work exceptions will be made in the future
Ask your peers. I have been monopolized by a few individuals, which is not fair to the rest.
Peers will likely have the answer
Peers will likely get to the question before I will.
Ask descriptive questions. See this page on asking effective questions.
Please respect my work hours. I will no longer reply to messages after 5pm on work days, or at all on weekends.
Setup
# Import all required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks, argrelextrema
from scipy.stats import pearsonr

# Set the Seaborn theme and increase the font size of all plot elements in one call
# (a separate, later sns.set_theme() call would reset font_scale to its default)
sns.set_theme(style="whitegrid", font_scale=1.25)
Data Preprocessing
Data preprocessing can refer to manipulation, filtration or augmentation of data before it is analyzed, and is often an important step in the data mining process.
Datasets
Human Freedom Index
The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through variables for many countries around the globe.
Environmental Sustainability
Countries are given an overall sustainability score as well as scores in each of several different environmental areas.
Question
How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?
Can be better for data with outliers or heavy tails
KDE: our data
Code
values = hfi_clean['pf_score_square']

# Fit a Gaussian KDE and evaluate it on an evenly spaced grid
kde = gaussian_kde(values, bw_method='scott')
x_eval = np.linspace(values.min(), values.max(), num=500)
kde_values = kde(x_eval)

# Local minima of the density curve are the "valleys"
minima_indices = argrelextrema(kde_values, np.less)[0]
valleys = x_eval[minima_indices]

plt.figure(figsize=(7, 5))
plt.title('KDE and Valleys')
sns.lineplot(x=x_eval, y=kde_values, label='KDE')
plt.scatter(x=valleys, y=kde(valleys), color='r', zorder=5, label='Valleys')
plt.legend()
plt.show()

print("Valley x-values:", valleys)
Valley x-values: [68.39968248]
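To make the valley-finding step easier to try without the HFI data, here is a minimal sketch on a synthetic bimodal sample. The sample, the seed, and the names `sample`, `grid`, and `density` are assumptions made for illustration; only the `gaussian_kde` and `argrelextrema` calls mirror the code above.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

# Synthetic bimodal sample (illustrative assumption, not the HFI data)
rng = np.random.default_rng(42)
sample = np.concatenate([
    rng.normal(loc=30, scale=5, size=500),
    rng.normal(loc=80, scale=5, size=500),
])

# Fit the KDE and evaluate it on an evenly spaced grid
kde = gaussian_kde(sample, bw_method="scott")
grid = np.linspace(sample.min(), sample.max(), num=500)
density = kde(grid)

# Indices where the density is strictly lower than both neighbours
valley_idx = argrelextrema(density, np.less)[0]
valleys = grid[valley_idx]

print("Valley x-values:", valleys)
```

With two well-separated modes, a valley should appear somewhere between them. On real data the number and position of valleys depend on the bandwidth, so it is worth checking the plot before using a valley as a cut point.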
Split the data
valley = 68.39968248

# Label each row by which side of the valley it falls on
hfi_clean['group'] = np.where(hfi_clean['pf_score_square'] < valley, 'group1', 'group2')

data = hfi_clean[['group', 'pf_score_square']].sort_values(by='pf_score_square')
data.head()
      group  pf_score_square
159  group1         4.693962
321  group1         5.461029
141  group1         6.308405
483  group1         6.345709
303  group1         8.189057
data.tail()
       group  pf_score_square
435   group2        90.755418
1407  group2        91.284351
597   group2        91.396839
1083  group2        91.428990
759   group2        91.549575
Plot the grouped data
Code
sns.histplot(data=hfi_clean, x="pf_score_square", hue="group",
             kde=True, stat="density", common_norm=False)
plt.show()
Dimensionality reduction
Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
Principal component analysis (PCA) - Unsupervised
Maximizes variance in the dataset.
Finds orthogonal principal components.
Useful for feature extraction and data visualization.
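The three points above can be checked on a small example. This is a sketch on synthetic data, not the course dataset: the feature matrix `X`, the seed, and the choice of two components are assumptions. It standardizes the features first, since PCA is sensitive to scale.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): two features driven by one
# latent variable on very different scales, plus one unrelated feature
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent * 10 + rng.normal(scale=1.0, size=(200, 1)),
    latent * 0.5 + rng.normal(scale=0.05, size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Standardize so that no feature dominates only because of its units
X_scaled = StandardScaler().fit_transform(X)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_)

# The component vectors are orthonormal, so this is close to the identity
gram = pca.components_ @ pca.components_.T
print(np.round(gram, 6))
```

The first component should capture most of the variance shared by the two correlated features, and `scores` can be passed directly to a scatter plot for visualization.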
sns.lmplot(data=esi_hfi_red, x="pf_score_square", y="esi_log", height=5, aspect=7/5)
# Axis labels match the transformations: pf_score is squared, esi is log-transformed
plt.xlabel("Personal Freedom Squared-Normal")
plt.ylabel("Environmental Stability Log-Normal")
plt.title("Human Freedom Index vs. Environmental Stability")
plt.show()
Conclusions: question
How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?
We can’t make inferences about recent years…
Moderate positive correlation between human freedom index and environmental stability
We cannot find a relationship between countries either
We need a linear regression next (later)
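As a preview of how a correlation and a simple regression relate, here is a hedged sketch on synthetic data. The variables `x` and `y`, the seed, and the use of `scipy.stats.linregress` are assumptions for illustration; the lecture's own regression may use a different tool (for example `statsmodels`).

```python
import numpy as np
from scipy.stats import pearsonr, linregress

# Synthetic data with a moderate positive relationship (illustrative assumption)
rng = np.random.default_rng(1)
x = rng.normal(loc=50, scale=10, size=300)
y = 0.5 * x + rng.normal(loc=0, scale=10, size=300)

# Correlation: strength and direction of the linear association
r, p_value = pearsonr(x, y)
print(f"Pearson r = {r:.2f}, p = {p_value:.3g}")

# Simple linear regression: also estimates the slope and intercept
fit = linregress(x, y)
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}, "
      f"R^2 = {fit.rvalue ** 2:.2f}")
```

For a single predictor, the regression's `rvalue` equals Pearson's r. What the regression adds is an estimated slope (how much `y` changes per unit of `x`) and its uncertainty.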
Conclusions: data preprocessing
There are multiple steps:
Check the distribution for normality
Likely will need a transformation based on the severity and direction of skew
Normalize variables that are measured in different units
Correlations are a good start, but regressions are more definitive
It’s “as needed”, ergo we didn’t cover everything…
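The steps listed above can be sketched end to end. Everything here is an assumption for illustration: the synthetic columns, the seed, and the specific transformations (log for right skew, square for left skew, matching the `esi_log` and `pf_score_square` naming used earlier). The left-skewed column is deliberately constructed so that squaring restores symmetry.

```python
import numpy as np
import pandas as pd
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler

# Synthetic columns (illustrative assumption, not the course data)
rng = np.random.default_rng(7)
n = 5000
df = pd.DataFrame({
    "right_skewed": rng.lognormal(mean=0.0, sigma=0.8, size=n),
    "left_skewed": np.sqrt(np.clip(rng.normal(50, 12, size=n), 0, None)),
})

# 1. Check the distribution: positive skew = long right tail, negative = left
skew_before = df.apply(skew)
print("Skew before:\n", skew_before)

# 2. Transform based on the direction of skew
transformed = pd.DataFrame({
    "right_skewed_log": np.log(df["right_skewed"]),
    "left_skewed_square": df["left_skewed"] ** 2,
})
skew_after = transformed.apply(skew)
print("Skew after:\n", skew_after)

# 3. Put the variables on a common scale (mean 0, standard deviation 1)
scaled = StandardScaler().fit_transform(transformed)
print("Column means:", scaled.mean(axis=0).round(6))
print("Column stds:", scaled.std(axis=0).round(6))
```

On real data the right transformation is a judgment call: compare the skew and a histogram before and after, and try a milder or stronger transformation if the first choice overcorrects.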