University of Arizona INFO 523 - Data Mining & Discovery
Setup
# Import all required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks, argrelextrema
from scipy.stats import pearsonr

# Set the Seaborn theme and increase the font size of all plot elements in one call.
# (Calling sns.set(font_scale=1.25) first and sns.set_theme(style="whitegrid") afterwards
# resets the font scale to its default.)
sns.set_theme(style="whitegrid", font_scale=1.25)
Data Preprocessing
Data preprocessing can refer to manipulation, filtration or augmentation of data before it is analyzed, and is often an important step in the data mining process.
Datasets
Human Freedom Index
The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through variables for many countries around the globe.
Environmental Sustainability
Countries are given an overall sustainability score as well as scores in each of several different environmental areas.
Question
How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?
Can be better for data with outliers or heavy tails
KDE: our data
# Estimate the density of the transformed personal freedom score
values = hfi_clean['pf_score_square']
kde = gaussian_kde(values, bw_method='scott')
x_eval = np.linspace(values.min(), values.max(), num=500)
kde_values = kde(x_eval)

# Local minima of the density curve are the "valleys" between modes
minima_indices = argrelextrema(kde_values, np.less)[0]
valleys = x_eval[minima_indices]

# Plot the density and mark the valleys
plt.figure(figsize=(7, 5))
plt.title('KDE and Valleys')
sns.lineplot(x=x_eval, y=kde_values, label='KDE')
plt.scatter(x=valleys, y=kde(valleys), color='r', zorder=5, label='Valleys')
plt.legend()
plt.show()

print("Valley x-values:", valleys)
Valley x-values: [68.39968248]
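To make the valley search concrete, here is a minimal NumPy-only sketch of the same idea. It is an illustration, not the code used for the slide: the data are synthetic, and `gaussian_kde_scott` and `local_minima` are hypothetical helpers that mimic what `gaussian_kde(..., bw_method='scott')` and `argrelextrema(..., np.less)` do for 1-D data.

```python
import numpy as np

def gaussian_kde_scott(samples, grid):
    """Evaluate a 1-D Gaussian KDE on `grid`, with the bandwidth from Scott's rule."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott's rule for 1-D data: h = sample std * n^(-1/5)
    h = samples.std(ddof=1) * n ** (-1 / 5)
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

def local_minima(y):
    """Indices of interior points that are strictly lower than both neighbours."""
    interior = (y[1:-1] < y[:-2]) & (y[1:-1] < y[2:])
    return np.where(interior)[0] + 1

# Synthetic bimodal sample (assumed for illustration; not the HFI data)
rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(30, 5, 300), rng.normal(80, 5, 300)])

grid = np.linspace(samples.min(), samples.max(), 500)
density = gaussian_kde_scott(samples, grid)
valleys = grid[local_minima(density)]
print("Valley x-values:", valleys)
```

A valley between the two modes is a natural cut point for splitting the data into groups.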
Split the data
valley = 68.39968248
hfi_clean['group'] = np.where(hfi_clean['pf_score_square'] < valley, 'group1', 'group2')
data = hfi_clean[['group', 'pf_score_square']].sort_values(by='pf_score_square')
data.head()
      group  pf_score_square
159  group1         4.693962
321  group1         5.461029
141  group1         6.308405
483  group1         6.345709
303  group1         8.189057
data.tail()
       group  pf_score_square
435   group2        90.755418
1407  group2        91.284351
597   group2        91.396839
1083  group2        91.428990
759   group2        91.549575
Plot the grouped data
sns.histplot(data=hfi_clean, x="pf_score_square", hue="group", kde=True, stat="density", common_norm=False)
plt.show()
Dimensional reduction
Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.
Principal component analysis (PCA) - Unsupervised
Projects the data onto the directions of maximum variance.
Finds orthogonal principal components.
Useful for feature extraction and data visualization.
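The three properties above can be seen in a small sketch. This is a NumPy-only illustration via the singular value decomposition, not the `sklearn.decomposition.PCA` call from the course material; the `pca` helper and the synthetic matrix `X` are assumptions made for the example, and the features are assumed to be on comparable scales (otherwise standardize them first).

```python
import numpy as np

def pca(X, n_components):
    """Project X (rows = observations) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, sorted by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance captured along each direction
    explained_variance = S**2 / (X.shape[0] - 1)
    ratio = explained_variance[:n_components] / explained_variance.sum()
    scores = X_centered @ components.T
    return scores, components, ratio

# Synthetic data: two strongly correlated features plus one independent feature
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + rng.normal(scale=0.1, size=200),
    rng.normal(scale=0.5, size=200),
])

scores, components, ratio = pca(X, n_components=2)
print("Explained variance ratio:", ratio)
print("Components are orthonormal:", np.allclose(components @ components.T, np.eye(2)))
```

Because the first two features move together, most of the variance falls on the first component, which is why two components are enough to plot this three-feature dataset.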
sns.lmplot(data=esi_hfi_red, x="pf_score_square", y="esi_log", height=5, aspect=7/5)
# Axis labels match the transformation applied to each variable
plt.xlabel("Personal Freedom Squared-Normal")
plt.ylabel("Environmental Stability Log-Normal")
plt.title("Human Freedom Index vs. Environmental Stability")
plt.show()
Conclusions: question
How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?
We can’t make inferences about recent years…
Moderate positive correlation between human freedom index and environmental stability
We cannot find a relationship between countries either
We need a linear regression next (later)
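As a rough illustration of how a correlation coefficient relates to the regression mentioned above, the sketch below computes Pearson's r by hand and fits a least-squares line. The data are synthetic and `pearson_r` is a hypothetical helper; neither reproduces the HFI/ESI results.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum()))

# Synthetic pair of variables with a moderate positive relationship
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(scale=1.0, size=500)

r = pearson_r(x, y)
# Simple linear regression of y on x
slope, intercept = np.polyfit(x, y, deg=1)

print("Pearson r:", round(r, 3))
print("Regression slope:", round(slope, 3))
# For simple linear regression, slope = r * std(y) / std(x)
print("r * std(y) / std(x):", round(r * y.std() / x.std(), 3))
```

The correlation only measures the strength of the linear association; the regression adds the slope and intercept, in the original units, that a prediction would need.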
Conclusions: data preprocessing
There are multiple steps:
Check the distribution for normality
Likely will need a transformation based on the severity and direction of skew
Normalize variables that are measured in different units
Correlations are a good start, but regressions are more definitive
It’s “as needed”, ergo we didn’t cover everything…
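The steps listed above can be sketched on a single synthetic variable. This is a minimal illustration under assumptions: the data are simulated as right-skewed, so a log transform is used (a left-skewed variable would call for a power transform such as squaring instead), and `skewness` and `zscore` are hypothetical helpers.

```python
import numpy as np

def skewness(x):
    """Biased sample skewness: third central moment over variance^(3/2)."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return float((d**3).mean() / (d**2).mean() ** 1.5)

def zscore(x):
    """Standardize to mean 0 and standard deviation 1."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Synthetic right-skewed variable (assumed for illustration)
rng = np.random.default_rng(2)
right_skewed = rng.lognormal(mean=0.0, sigma=0.8, size=1000)

# 1. Check the distribution: a large positive value signals right skew
print("Skewness before:", round(skewness(right_skewed), 2))

# 2. Transform based on the direction of skew (log for right skew)
logged = np.log(right_skewed)
print("Skewness after log:", round(skewness(logged), 2))

# 3. Normalize so variables in different units become comparable
scaled = zscore(logged)
print("Mean and std after scaling:", round(scaled.mean(), 2), round(scaled.std(), 2))
```

Which of these steps is needed depends on the variable, which is the sense in which preprocessing is "as needed".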