Data Preprocessing

Lecture 4

Dr. Greg Chism

University of Arizona
INFO 523 - Spring 2024

Warm up

Announcements

  • RQ 02 is due Wed, Feb 14, 11:59pm

  • HW 02 is due Wed, Feb 21, 11:59pm

  • Project 1 proposal reviews will be returned to you by Friday

HW 1 lessons learned

  • Review HW 1 issues, and show us you reviewed them by closing the issue.

  • DO NOT hard code local paths!

    • Instead, you can use {pyprojroot}:

      • python -m pip install pyprojroot

      • from pyprojroot.here import here

      • e.g., pd.read_csv(here('data/data.csv'))

    • Related: data should go into a data folder

  • Narrative/code description text should be plain text under a ### or #### header under the appropriate code chunk.

HW 1 lessons learned cont…

  1. Start early. No late-work exceptions will be made in the future.
  2. Ask your peers. I have been monopolized by a few individuals, which is not fair to the rest.
    1. Peers will likely have the answer.
    2. Peers will likely get to the question before I will.
  3. Ask descriptive questions. See this page on asking effective questions.
  4. Please respect my work hours. I will no longer reply to messages after 5pm on workdays, or at all on weekends.

Setup

# Import all required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
import statsmodels.api as sm
import scipy.stats as stats
from scipy.stats import gaussian_kde
from scipy.signal import find_peaks, argrelextrema
from scipy.stats import pearsonr



# Set the Seaborn theme and increase font size of all plot elements
# (single call: calling set_theme after sns.set would reset font_scale)
sns.set_theme(style = "whitegrid", font_scale = 1.25)

Data Preprocessing

Data preprocessing refers to the manipulation, filtering, or augmentation of data before it is analyzed, and it is often an essential step in the data mining process.

Datasets

Human Freedom Index

The Human Freedom Index is a report that attempts to summarize the idea of “freedom” through variables for many countries around the globe.

Environmental Sustainability

Countries are given an overall sustainability score as well as scores in each of several different environmental areas.

Question

How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?

Dataset #1: Human Freedom Index

hfi = pd.read_csv("data/hfi.csv")
hfi.head()
year ISO_code countries region pf_rol_procedural pf_rol_civil pf_rol_criminal pf_rol pf_ss_homicide pf_ss_disappearances_disap ... ef_regulation_business_bribes ef_regulation_business_licensing ef_regulation_business_compliance ef_regulation_business ef_regulation ef_score ef_rank hf_score hf_rank hf_quartile
0 2016 ALB Albania Eastern Europe 6.661503 4.547244 4.666508 5.291752 8.920429 10.0 ... 4.050196 7.324582 7.074366 6.705863 6.906901 7.54 34.0 7.568140 48.0 2.0
1 2016 DZA Algeria Middle East & North Africa NaN NaN NaN 3.819566 9.456254 10.0 ... 3.765515 8.523503 7.029528 5.676956 5.268992 4.99 159.0 5.135886 155.0 4.0
2 2016 AGO Angola Sub-Saharan Africa NaN NaN NaN 3.451814 8.060260 5.0 ... 1.945540 8.096776 6.782923 4.930271 5.518500 5.17 155.0 5.640662 142.0 4.0
3 2016 ARG Argentina Latin America & the Caribbean 7.098483 5.791960 4.343930 5.744791 7.622974 10.0 ... 3.260044 5.253411 6.508295 5.535831 5.369019 4.84 160.0 6.469848 107.0 3.0
4 2016 ARM Armenia Caucasus & Central Asia NaN NaN NaN 5.003205 8.808750 10.0 ... 4.575152 9.319612 6.491481 6.797530 7.378069 7.57 29.0 7.241402 57.0 2.0

5 rows × 123 columns

Understand the data

hfi.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Data columns (total 123 columns):
 #    Column                              Dtype  
---   ------                              -----  
 0    year                                int64  
 1    ISO_code                            object 
 2    countries                           object 
 3    region                              object 
 4    pf_rol_procedural                   float64
 5    pf_rol_civil                        float64
 6    pf_rol_criminal                     float64
 7    pf_rol                              float64
 8    pf_ss_homicide                      float64
 9    pf_ss_disappearances_disap          float64
 10   pf_ss_disappearances_violent        float64
 11   pf_ss_disappearances_organized      float64
 12   pf_ss_disappearances_fatalities     float64
 13   pf_ss_disappearances_injuries       float64
 14   pf_ss_disappearances                float64
 15   pf_ss_women_fgm                     float64
 16   pf_ss_women_missing                 float64
 17   pf_ss_women_inheritance_widows      float64
 18   pf_ss_women_inheritance_daughters   float64
 19   pf_ss_women_inheritance             float64
 20   pf_ss_women                         float64
 21   pf_ss                               float64
 22   pf_movement_domestic                float64
 23   pf_movement_foreign                 float64
 24   pf_movement_women                   float64
 25   pf_movement                         float64
 26   pf_religion_estop_establish         float64
 27   pf_religion_estop_operate           float64
 28   pf_religion_estop                   float64
 29   pf_religion_harassment              float64
 30   pf_religion_restrictions            float64
 31   pf_religion                         float64
 32   pf_association_association          float64
 33   pf_association_assembly             float64
 34   pf_association_political_establish  float64
 35   pf_association_political_operate    float64
 36   pf_association_political            float64
 37   pf_association_prof_establish       float64
 38   pf_association_prof_operate         float64
 39   pf_association_prof                 float64
 40   pf_association_sport_establish      float64
 41   pf_association_sport_operate        float64
 42   pf_association_sport                float64
 43   pf_association                      float64
 44   pf_expression_killed                float64
 45   pf_expression_jailed                float64
 46   pf_expression_influence             float64
 47   pf_expression_control               float64
 48   pf_expression_cable                 float64
 49   pf_expression_newspapers            float64
 50   pf_expression_internet              float64
 51   pf_expression                       float64
 52   pf_identity_legal                   float64
 53   pf_identity_parental_marriage       float64
 54   pf_identity_parental_divorce        float64
 55   pf_identity_parental                float64
 56   pf_identity_sex_male                float64
 57   pf_identity_sex_female              float64
 58   pf_identity_sex                     float64
 59   pf_identity_divorce                 float64
 60   pf_identity                         float64
 61   pf_score                            float64
 62   pf_rank                             float64
 63   ef_government_consumption           float64
 64   ef_government_transfers             float64
 65   ef_government_enterprises           float64
 66   ef_government_tax_income            float64
 67   ef_government_tax_payroll           float64
 68   ef_government_tax                   float64
 69   ef_government                       float64
 70   ef_legal_judicial                   float64
 71   ef_legal_courts                     float64
 72   ef_legal_protection                 float64
 73   ef_legal_military                   float64
 74   ef_legal_integrity                  float64
 75   ef_legal_enforcement                float64
 76   ef_legal_restrictions               float64
 77   ef_legal_police                     float64
 78   ef_legal_crime                      float64
 79   ef_legal_gender                     float64
 80   ef_legal                            float64
 81   ef_money_growth                     float64
 82   ef_money_sd                         float64
 83   ef_money_inflation                  float64
 84   ef_money_currency                   float64
 85   ef_money                            float64
 86   ef_trade_tariffs_revenue            float64
 87   ef_trade_tariffs_mean               float64
 88   ef_trade_tariffs_sd                 float64
 89   ef_trade_tariffs                    float64
 90   ef_trade_regulatory_nontariff       float64
 91   ef_trade_regulatory_compliance      float64
 92   ef_trade_regulatory                 float64
 93   ef_trade_black                      float64
 94   ef_trade_movement_foreign           float64
 95   ef_trade_movement_capital           float64
 96   ef_trade_movement_visit             float64
 97   ef_trade_movement                   float64
 98   ef_trade                            float64
 99   ef_regulation_credit_ownership      float64
 100  ef_regulation_credit_private        float64
 101  ef_regulation_credit_interest       float64
 102  ef_regulation_credit                float64
 103  ef_regulation_labor_minwage         float64
 104  ef_regulation_labor_firing          float64
 105  ef_regulation_labor_bargain         float64
 106  ef_regulation_labor_hours           float64
 107  ef_regulation_labor_dismissal       float64
 108  ef_regulation_labor_conscription    float64
 109  ef_regulation_labor                 float64
 110  ef_regulation_business_adm          float64
 111  ef_regulation_business_bureaucracy  float64
 112  ef_regulation_business_start        float64
 113  ef_regulation_business_bribes       float64
 114  ef_regulation_business_licensing    float64
 115  ef_regulation_business_compliance   float64
 116  ef_regulation_business              float64
 117  ef_regulation                       float64
 118  ef_score                            float64
 119  ef_rank                             float64
 120  hf_score                            float64
 121  hf_rank                             float64
 122  hf_quartile                         float64
dtypes: float64(119), int64(1), object(3)
memory usage: 1.4+ MB
hfi.describe()
year pf_rol_procedural pf_rol_civil pf_rol_criminal pf_rol pf_ss_homicide pf_ss_disappearances_disap pf_ss_disappearances_violent pf_ss_disappearances_organized pf_ss_disappearances_fatalities ... ef_regulation_business_bribes ef_regulation_business_licensing ef_regulation_business_compliance ef_regulation_business ef_regulation ef_score ef_rank hf_score hf_rank hf_quartile
count 1458.000000 880.000000 880.000000 880.000000 1378.000000 1378.000000 1369.000000 1378.000000 1279.000000 1378.000000 ... 1283.000000 1357.000000 1368.000000 1374.000000 1378.000000 1378.000000 1378.000000 1378.000000 1378.000000 1378.000000
mean 2012.000000 5.589355 5.474770 5.044070 5.309641 7.412980 8.341855 9.519458 6.772869 9.584972 ... 4.886192 7.698494 6.981858 6.317668 7.019782 6.785610 76.973149 6.993444 77.007983 2.490566
std 2.582875 2.080957 1.428494 1.724886 1.529310 2.832947 3.225902 1.744673 2.768983 1.559826 ... 1.889168 1.728507 1.979200 1.230988 1.027625 0.883601 44.540142 1.025811 44.506549 1.119698
min 2008.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 2.009841 2.483540 2.880000 1.000000 3.765827 1.000000 1.000000
25% 2010.000000 4.133333 4.549550 3.789724 4.131746 6.386978 10.000000 10.000000 5.000000 9.942607 ... 3.433786 6.874687 6.368178 5.591851 6.429498 6.250000 38.000000 6.336685 39.000000 1.000000
50% 2012.000000 5.300000 5.300000 4.575189 4.910797 8.638278 10.000000 10.000000 7.500000 10.000000 ... 4.418371 8.074161 7.466692 6.265234 7.082075 6.900000 77.000000 6.923840 76.000000 2.000000
75% 2014.000000 7.389499 6.410975 6.400000 6.513178 9.454402 10.000000 10.000000 10.000000 10.000000 ... 6.227978 8.991882 8.209310 7.139718 7.720955 7.410000 115.000000 7.894660 115.000000 3.000000
max 2016.000000 9.700000 8.773533 8.719848 8.723094 9.926568 10.000000 10.000000 10.000000 10.000000 ... 9.623811 9.999638 9.865488 9.272600 9.439828 9.190000 162.000000 9.126313 162.000000 4.000000

8 rows × 120 columns

Identifying missing values

hfi.isna().sum()
year                   0
ISO_code               0
countries              0
region                 0
pf_rol_procedural    578
                    ... 
ef_score              80
ef_rank               80
hf_score              80
hf_rank               80
hf_quartile           80
Length: 123, dtype: int64

A lot of missing values 🙃

Data Cleaning

Handling missing data

Options

  • Do nothing…
  • Remove them
  • Impute

We will be using pf_score from hfi, which has 80 missing values.
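As a minimal sketch of the first two options (assuming the hfi DataFrame loaded above), removal is a one-liner but discards information:

# Option: remove rows or columns containing missing values
hfi_rows_dropped = hfi.dropna(subset = ['pf_score'])   # drop the 80 rows missing pf_score
hfi_cols_dropped = hfi.dropna(axis = 1)                # drop every column with any missing value

print(hfi.shape, hfi_rows_dropped.shape, hfi_cols_dropped.shape)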

Imputation

In statistics, imputation is the process of replacing missing data with substituted values.

Considerations

  • Data distribution
  • Impact on analysis
  • Missing data mechanism
  • Multiple imputation
  • Can also be used on outliers

Mean imputation

How it Works: Replace missing values with the arithmetic mean of the non-missing values in the same variable.

Pros:

  • Easy and fast.
  • Works well with small numerical datasets

Cons:

  • It only works on the column level.
  • Will give poor results on encoded categorical features.
  • Not very accurate.
  • Doesn’t account for the uncertainty in the imputations.

hfi_copy = hfi  # note: this is a reference, not an independent copy -- columns added below also appear in hfi

mean_imputer = SimpleImputer(strategy = 'mean')
hfi_copy['mean_pf_score'] = mean_imputer.fit_transform(hfi_copy[['pf_score']])

mean_plot = sns.kdeplot(data = hfi_copy, x = 'pf_score', linewidth = 2, label = "Original")

mean_plot = sns.kdeplot(data = hfi_copy, x = 'mean_pf_score', linewidth = 2, label = "Mean Imputed")

plt.legend()

plt.show()

Median imputation

How it Works: Replace missing values with the median of the non-missing values in the same variable.

Pros (same as mean):

  • Easy and fast.
  • Works well with small numerical datasets

Cons (same as mean):

  • It only works on the column level.
  • Will give poor results on encoded categorical features.
  • Not very accurate.
  • Doesn’t account for the uncertainty in the imputations.

median_imputer = SimpleImputer(strategy = 'median')
hfi_copy['median_pf_score'] = median_imputer.fit_transform(hfi_copy[['pf_score']])

median_plot = sns.kdeplot(data = hfi_copy, x = 'pf_score', linewidth = 2, label = "Original")

median_plot = sns.kdeplot(data = hfi_copy, x = 'median_pf_score', linewidth = 2, label = "Median Imputed")

plt.legend()

plt.show()

Mode imputation

How it Works: Replace missing values with the mode of the non-missing values in the same variable.

Pros:

  • Easy and fast.
  • Works well with categorical features.

Cons:

  • It also doesn’t factor the correlations between features.
  • It can introduce bias in the data.

mode_imputer = SimpleImputer(strategy = 'most_frequent')
hfi_copy['mode_pf_score'] = mode_imputer.fit_transform(hfi_copy[['pf_score']])

mode_plot = sns.kdeplot(data = hfi_copy, x = 'pf_score', linewidth = 2, label = "Original")

mode_plot = sns.kdeplot(data = hfi_copy, x = 'mode_pf_score', linewidth = 2, label = "Mode Imputed")

plt.legend()

plt.show()

Capping (Winsorizing) imputation

How it Works: Replace extreme values (outliers) beyond chosen percentile cut-offs with the cut-off values themselves.

Pros:

  • Not influenced by extreme values

Cons:

  • Capping only modifies the smallest and largest values slightly.
  • If no extreme outliers are present, Winsorization may be unnecessary.

upper_limit = np.percentile(hfi_copy['pf_score'].dropna(), 95)
lower_limit = np.percentile(hfi_copy['pf_score'].dropna(), 5)

hfi_copy['capped_pf_score'] = np.clip(hfi_copy['pf_score'], lower_limit, upper_limit)

cap_plot = sns.kdeplot(data = hfi_copy, x = 'pf_score', linewidth = 2, label = "Original")

cap_plot = sns.kdeplot(data = hfi_copy, x = 'capped_pf_score', linewidth = 2, label = "Capped (Winsorized)")

plt.legend()

plt.show()

Other Imputation Methods
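One example from this family is K-nearest-neighbors imputation, which fills a missing value from the most similar rows rather than from a single column summary. A minimal sketch with scikit-learn's KNNImputer (the column subset below is illustrative, not prescriptive):

from sklearn.impute import KNNImputer

# Impute pf_score from a few related numeric columns, using the 5 most similar rows
knn_cols = ['pf_score', 'ef_score', 'pf_rol', 'pf_ss']   # illustrative subset
knn_imputer = KNNImputer(n_neighbors = 5)
knn_imputed = pd.DataFrame(knn_imputer.fit_transform(hfi[knn_cols]), columns = knn_cols)

print(knn_imputed['pf_score'].isna().sum())   # expect 0 remaining missing values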

Data type conversion

hfi['year'] = pd.to_datetime(hfi['year'], format='%Y')

hfi.head(1)
year ISO_code countries region pf_rol_procedural pf_rol_civil pf_rol_criminal pf_rol pf_ss_homicide pf_ss_disappearances_disap ... ef_regulation ef_score ef_rank hf_score hf_rank hf_quartile mean_pf_score median_pf_score mode_pf_score capped_pf_score
0 2016-01-01 ALB Albania Eastern Europe 6.661503 4.547244 4.666508 5.291752 8.920429 10.0 ... 6.906901 7.54 34.0 7.56814 48.0 2.0 7.596281 7.596281 7.596281 7.596281

1 rows × 127 columns

hfi.dtypes
year                 datetime64[ns]
ISO_code                     object
countries                    object
region                       object
pf_rol_procedural           float64
                          ...      
hf_quartile                 float64
mean_pf_score               float64
median_pf_score             float64
mode_pf_score               float64
capped_pf_score             float64
Length: 127, dtype: object

Removing duplicates

hfi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Columns: 127 entries, year to capped_pf_score
dtypes: datetime64[ns](1), float64(123), object(3)
memory usage: 1.4+ MB
hfi.drop_duplicates(inplace = True)
hfi.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458 entries, 0 to 1457
Columns: 127 entries, year to capped_pf_score
dtypes: datetime64[ns](1), float64(123), object(3)
memory usage: 1.4+ MB

No duplicates! 😊

Filtering data

Let’s look at USA, India, Canada, China

options = ['United States', 'India', 'Canada', 'China']

filtered_hfi = hfi[hfi['countries'].isin(options)]

unique_countries = filtered_hfi['countries'].unique()
print(unique_countries)
['Canada' 'China' 'India' 'United States']

Let’s look at Personal Freedom scores > 8

filtered_hfi = hfi[hfi['pf_score'] > 8]
sns.boxplot(filtered_hfi, x = "pf_score", y = "countries", palette = "colorblind")
plt.show()

Transformations

Normalizing

Example: a variable with mean 5 and standard deviation 2 is rescaled to mean 0 and standard deviation 1.

hfi_copy = hfi

scaler = StandardScaler()
hfi_copy[['ef_score_scale', 'pf_score_scale']] = scaler.fit_transform(hfi_copy[['ef_score', 'pf_score']])

hfi_copy[['ef_score_scale', 'pf_score_scale']].describe()
ef_score_scale pf_score_scale
count 1.378000e+03 1.378000e+03
mean 4.524683e-16 2.062533e-17
std 1.000363e+00 1.000363e+00
min -4.421711e+00 -3.663087e+00
25% -6.063870e-01 -7.303950e-01
50% 1.295064e-01 -8.926277e-03
75% 7.068997e-01 9.081441e-01
max 2.722116e+00 1.722056e+00
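Under the hood this is just a z-score: subtract the column mean and divide by the (population) standard deviation. A minimal check against StandardScaler, assuming hfi_copy from above:

# z = (x - mean) / std; StandardScaler ignores NaNs when fitting and passes them through
manual_scale = (hfi_copy['pf_score'] - hfi_copy['pf_score'].mean()) / hfi_copy['pf_score'].std(ddof = 0)

print(np.allclose(manual_scale.dropna(), hfi_copy['pf_score_scale'].dropna()))   # should print True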

Normality test: Q-Q plot

Code
hfi_clean = hfi_copy.dropna(subset = ['pf_score'])

sns.set_style("white")

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(data = hfi_clean, x = "pf_score", linewidth = 5, ax = ax1)
ax1.set_title('Personal Freedom Score')

sm.qqplot(hfi_clean['pf_score'], line = 's', ax = ax2, dist = stats.norm, fit = True)
ax2.set_title('Personal Freedom Score Q-Q plot')

plt.tight_layout()
plt.show()

There were some issues in our plots:

  • Left Tail: Points deviate downwards from the line, indicating more extreme low values than a normal distribution (negative skewness).

  • Central Section: Points align closely with the line, suggesting the central data is similar to a normal distribution.

  • Right Tail: Points curve upwards, showing potential for extreme high values (positive skewness).

Correcting skew

Square-root transformation. \(\sqrt x\) Used for moderate right skew (positive skew)

  • Cannot handle negative values (but can handle zeros)

Log transformation. \(\log(x + 1)\) Used for substantial right skew (positive skew)

  • Cannot handle negative values (the +1 shift accommodates zeros)

Inverse transformation. \(\frac{1}{x}\) Used for severe right skew (positive skew)

  • Cannot handle negative or zero values

Squared transformation. \(x^2\) Used for moderate left skew (negative skew)

  • Effective when lower values are densely packed together

Cubed transformation. \(x^3\) Used for severe left skew (negative skew)

  • Further stretches the tail of the distribution

Comparing transformations

Moderate negative skew, no zeros or negative values

Code
hfi_clean['pf_score_sqrt'] = np.sqrt(hfi_clean['pf_score'])

col = hfi_clean['pf_score_sqrt']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Square-root Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Square-root Q-Q plot')    
plt.tight_layout()
plt.show()

Code
hfi_clean['pf_score_log'] = np.log(hfi_clean['pf_score'] + 1)

col = hfi_clean['pf_score_log']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Log Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Log Q-Q plot')    
plt.tight_layout()
plt.show()

Code
hfi_clean['pf_score_inv'] = 1/hfi_clean.pf_score

col = hfi_clean['pf_score_inv']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Inverse Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Inverse Q-Q plot')    
plt.tight_layout()
plt.show()

Code
hfi_clean['pf_score_square'] = pow(hfi_clean.pf_score, 2)

col = hfi_clean['pf_score_square']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Squared Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Squared Q-Q plot')    
plt.tight_layout()
plt.show()

Code
hfi_clean['pf_score_cube'] = pow(hfi_clean.pf_score, 3)

col = hfi_clean['pf_score_cube']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Cubed Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Cubed Q-Q plot')    
plt.tight_layout()
plt.show()

What did we learn?

  • Negative skew excluded all but Squared and Cubed transformations

  • Squared transformation was the best

  • The data is bimodal, so no transformation is perfect

Dealing with multimodality

  • K-Means Clustering

    • We will learn this later
  • Gaussian Mixture Models

    • Also later
  • Thresholding

    • No obvious valley
  • Domain knowledge

    • None that is applicable
  • Kernel Density Estimation (KDE)

Kernel Density Estimation (KDE)

Finding valleys in multimodal data, then splitting

\[\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)\]

  • \(\hat{f}(x)\) is the estimated probability density function at point \(x\).

  • \(n\) is the number of data points.

  • \(x_i\) are the observed data points.

  • \(h\) is the bandwidth.

  • \(K\) is the kernel function, which is a non-negative function that integrates to one and is symmetric around zero.

The choice of \(h\) and \(K\) can significantly affect the resulting estimate.

Common choices for the kernel function \(K\) include the Gaussian kernel and Epanechnikov kernel
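To make the formula concrete, here is a minimal sketch of a Gaussian-kernel KDE written directly from the definition (synthetic data and a hand-picked bandwidth, purely for illustration):

def gaussian_kernel(u):
    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)

def kde_estimate(x_grid, data, h):
    # f_hat(x) = (1 / (n * h)) * sum_i K((x - x_i) / h)
    u = (x_grid - data[:, None]) / h            # shape: (n, len(x_grid))
    return gaussian_kernel(u).sum(axis = 0) / (len(data) * h)

rng = np.random.default_rng(42)
sample = rng.normal(loc = 0, scale = 1, size = 200)
x_grid = np.linspace(-4, 4, 200)
density = kde_estimate(x_grid, sample, h = 0.4)   # evaluate the estimate on a grid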

KDE: bandwidth method

In density estimations, there is a smoothing parameter

Scott’s Rule

  • Rule of thumb for choosing the kernel bandwidth
  • Proportional to the standard deviation of the data and inversely proportional to the fifth root of the sample size (n), for one-dimensional data
  • Formula: \(h = \sigma \cdot n^{-\frac{1}{5}}\)
  • Tends to produce a smoother density estimation
  • Suitable for data that is roughly normally distributed

Silverman’s Rule

  • Another popular rule of thumb
  • Similar to Scott’s rule but potentially leading to a smaller bandwidth.
  • Formula: \(h = \left( \frac{4\hat{\sigma}^5}{3n} \right)^{\frac{1}{5}}\)
  • Can be better for data with outliers or heavy tails
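A minimal sketch computing both rule-of-thumb bandwidths for the squared scores (assuming hfi_clean['pf_score_square'] from the earlier slides); scipy's gaussian_kde also accepts either rule by name:

values = hfi_clean['pf_score_square']
n, sigma = len(values), values.std()

h_scott = sigma * n ** (-1 / 5)                       # Scott's rule
h_silverman = (4 * sigma ** 5 / (3 * n)) ** (1 / 5)   # Silverman's rule
print("Scott:", round(h_scott, 3), " Silverman:", round(h_silverman, 3))

# scipy exposes the same choices as named bandwidth methods (it prints unitless factors)
kde_scott = gaussian_kde(values, bw_method = 'scott')
kde_silverman = gaussian_kde(values, bw_method = 'silverman')
print("scipy factors:", round(kde_scott.factor, 3), round(kde_silverman.factor, 3))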

KDE: our data

Code
values = hfi_clean['pf_score_square']
kde = gaussian_kde(values, bw_method = 'scott')
x_eval = np.linspace(values.min(), values.max(), num = 500) 
kde_values = kde(x_eval)

minima_indices = argrelextrema(kde_values, np.less)[0]
valleys = x_eval[minima_indices]

plt.figure(figsize = (7, 5))
plt.title('KDE and Valleys')
sns.lineplot(x = x_eval, y = kde_values, label = 'KDE')
plt.scatter(x = valleys, y = kde(valleys), color = 'r', zorder = 5, label = 'Valleys')
plt.legend()
plt.show()

print("Valley x-values:", valleys)

Valley x-values: [68.39968248]

Split the data

valley = 68.39968248
hfi_clean['group'] = np.where(hfi_clean['pf_score_square'] < valley, 'group1', 'group2')

data = hfi_clean[['group', 'pf_score_square']].sort_values(by = 'pf_score_square')
data.head()
group pf_score_square
159 group1 4.693962
321 group1 5.461029
141 group1 6.308405
483 group1 6.345709
303 group1 8.189057
data.tail()
group pf_score_square
435 group2 90.755418
1407 group2 91.284351
597 group2 91.396839
1083 group2 91.428990
759 group2 91.549575

Plot the grouped data

Code
sns.histplot(data = hfi_clean, x = "pf_score_square", 
            hue = "group", kde = True, stat = "density", common_norm = False)

plt.show()

Dimensional reduction

Dimension reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains meaningful properties of the original data, ideally close to its intrinsic dimension.

Principal component analysis (PCA) - Unsupervised

  • Maximizes variance in the dataset.

  • Finds orthogonal principal components.

  • Useful for feature extraction and data visualization.

t-Distributed Stochastic Neighbor Embedding (t-SNE) - Unsupervised

  • Preserves local structures and relationships

  • Non-linear: converts pairwise distances into similarity probabilities in both the high- and low-dimensional spaces

  • Stochastic and iterative (probabilistic): minimizes the divergence (KL) between those two similarity distributions

  • Ideal for complex data visualization

Dimensional reduction: applied

numeric_cols = hfi.select_dtypes(include = [np.number]).columns

# Applying mean imputation only to numeric columns
hfi[numeric_cols] = hfi[numeric_cols].fillna(hfi[numeric_cols].mean())

features = ['pf_rol_procedural', 'pf_rol_civil', 'pf_rol_criminal', 'pf_rol', 'hf_score', 'hf_rank', 'hf_quartile']

x = hfi.loc[:, features].values
y = hfi.loc[:, 'region'].values
x = StandardScaler().fit_transform(x)
Code
pca = PCA(n_components = 2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns = ['principal component 1', 'principal component 2'])
pca_variance_explained = pca.explained_variance_ratio_
print("Variance explained:", pca_variance_explained, "\n", principalDf)
Variance explained: [0.76138995 0.15849799] 
       principal component 1  principal component 2
0             -5.164625e-01           9.665680e-01
1              2.366765e+00          -1.957381e+00
2              2.147729e+00          -1.664483e+00
3              2.784437e-01          -8.066415e-01
4             -3.716205e-01           4.294282e-01
...                     ...                    ...
1453           4.181375e+00           4.496988e-01
1454           5.213024e-01          -6.010449e-01
1455          -1.374342e-16           2.907121e-16
1456           1.545577e+00           5.422255e-01
1457           3.669011e+00          -4.294948e-01

[1458 rows x 2 columns]
Code
# Combining the scatterplot of principal components with the scree plot using the correct column names
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (12, 5))

# Scatterplot of Principal Components
axes[0].scatter(principalDf['principal component 1'], principalDf['principal component 2'])
for i in range(len(pca.components_)):
    axes[0].arrow(0, 0, pca.components_[i, 0], pca.components_[i, 1], head_width = 0.1, head_length = 0.15, fc = 'r', ec = 'r', linewidth = 2)
    axes[0].text(pca.components_[i, 0] * 1.2, pca.components_[i, 1] * 1.2, f'Eigenvector {i+1}', color = 'r', fontsize = 12)
axes[0].set_xlabel('Principal Component 1')
axes[0].set_ylabel('Principal Component 2')
axes[0].set_title('Scatterplot of Principal Components with Eigenvectors')
axes[0].grid()

# Scree Plot for PCA
axes[1].bar(range(1, len(pca_variance_explained) + 1), pca_variance_explained, alpha = 0.6, color = 'g', label = 'Individual Explained Variance')
axes[1].set_ylabel('Explained variance ratio')
axes[1].set_xlabel('Principal components')
axes[1].set_title('Scree Plot for PCA')
axes[1].legend(loc='best')

plt.tight_layout()
plt.show()

Code
tsne = TSNE(n_components = 2, random_state = 42)
tsne_results = tsne.fit_transform(x)

tsne_df = pd.DataFrame(data = tsne_results, columns = ['tsne-2d-one', 'tsne-2d-two'])
tsne_df
tsne-2d-one tsne-2d-two
0 -8.372954 30.640753
1 27.053766 -35.499779
2 33.169220 -34.696602
3 6.443269 -15.997910
4 -6.526068 19.279558
... ... ...
1453 53.171978 8.308328
1454 12.148342 -12.818124
1455 -10.965682 -5.448216
1456 18.900705 8.233622
1457 52.294846 2.491910

1458 rows × 2 columns

Code
tsne_df['region'] = y

plt.figure(figsize = (7, 5))
plt.scatter(tsne_df['tsne-2d-one'], tsne_df['tsne-2d-two'], c = pd.factorize(tsne_df['region'])[0], alpha=0.5)
plt.colorbar(ticks = range(len(np.unique(y))))
plt.xlabel('TSNE Dimension 1')
plt.ylabel('TSNE Dimension 2')
plt.title('t-SNE Results with Region Color Coding')
plt.show()

Dimensional reduction: what now?

  1. Feature Selection: Choose the most informative components.

  2. Visualization: Graph the reduced dimensions to identify patterns.

  3. Clustering: Group similar data points using clustering algorithms.

  4. Classification: Predict categories using classifiers on reduced features.

  5. Model Evaluation: Assess model performance with metrics like accuracy.

  6. Cross-Validation: Validate model stability with cross-validation.

  7. Hyperparameter Tuning: Optimize model settings for better performance.

  8. Model Interpretation: Understand feature influence in the models.

  9. Ensemble Methods: Improve predictions by combining multiple models.

  10. Deployment: Deploy the model for real-world predictions.

  11. Iterative Refinement: Refine analysis based on initial results.

  12. Reporting: Summarize findings for stakeholders.
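As a minimal sketch of steps 4-6 (classification, evaluation, cross-validation), assuming principalDf and y from the PCA slide; the classifier choice here is arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Predict region from the two retained principal components, with 5-fold cross-validation
clf = LogisticRegression(max_iter = 1000)
scores = cross_val_score(clf, principalDf.values, y, cv = 5)
print("Mean CV accuracy:", scores.mean().round(3))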

Back to our question

Question

How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?

  • We can use the pf_score from the hfi dataset that we’ve been using.

  • …but we need an environmental stability index score.

Dataset #2: Environmental Stability

esi = pd.read_csv("data/esi.csv")
esi.head()
code country esi system stress vulner cap global sys_air sys_bio ... vul_hea vul_sus vul_dis cap_gov cap_eff cap_pri cap_st glo_col glo_ghg glo_tbp
0 ALB Albania 58.8 52.4 65.4 72.3 46.2 57.9 0.45 0.17 ... 0.32 0.79 0.66 -0.32 0.79 -0.65 -0.20 -0.45 0.21 0.84
1 DZA Algeria 46.0 43.1 66.3 57.5 31.8 21.1 -0.02 -0.08 ... -0.33 0.45 0.45 -0.69 -0.28 -0.66 -0.27 -0.51 -0.56 -1.33
2 AGO Angola 42.9 67.9 59.1 11.8 22.1 39.1 -0.77 0.77 ... -1.75 -1.91 0.11 -0.96 0.12 -1.08 -1.16 -0.88 0.31 -0.26
3 ARG Argentina 62.7 67.6 54.9 69.9 65.4 58.5 0.40 0.10 ... 0.85 0.69 0.03 -0.34 0.18 1.23 0.51 0.45 0.09 0.11
4 ARM Armenia 53.2 54.4 62.2 50.8 34.9 60.3 1.21 -0.02 ... 0.29 -0.79 0.56 -0.38 -0.66 -0.55 0.03 -0.29 -0.29 1.37

5 rows × 29 columns

  • Looks like the esi column will work!

  • But there’s a problem…

  • We only have one year in this dataset

  • P.S. there are no missing values

Grouping and aggregating

grouped_hfi = hfi.groupby('countries').agg({'region': 'first', 
                                            'pf_score': 'mean'
                                           }).reset_index()
grouped_hfi.head()
countries region pf_score
0 Albania Eastern Europe 7.696934
1 Algeria Middle East & North Africa 5.249383
2 Angola Sub-Saharan Africa 5.856932
3 Argentina Latin America & the Caribbean 8.120779
4 Armenia Caucasus & Central Asia 7.192095

Joining the data

grouped_hfi['country'] = grouped_hfi['countries']
merged_data = esi.merge(grouped_hfi, how = 'left', on = 'country')

esi_hfi = merged_data[['esi', 'pf_score', 'region', 'country']]
esi_hfi.head()
esi pf_score region country
0 58.8 7.696934 Eastern Europe Albania
1 46.0 5.249383 Middle East & North Africa Algeria
2 42.9 5.856932 Sub-Saharan Africa Angola
3 62.7 8.120779 Latin America & the Caribbean Argentina
4 53.2 7.192095 Caucasus & Central Asia Armenia
  • …but what’s the new problem?

  • We need to standardize the data.

  • Lucky for us this will also help control outliers!

Back to missing values

We are going to drop them, since missing values are also present in region (from countries that did not match in the join).

esi_hfi_red = esi_hfi.dropna()
esi_hfi_red.isna().sum()
esi         0
pf_score    0
region      0
country     0
dtype: int64

Transformations

Normality test: Q-Q plot

Code
sns.set_style("white")
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(data = esi_hfi_red, x = "pf_score", linewidth = 5, ax = ax1)
ax1.set_title('Personal Freedom Score')

sm.qqplot(esi_hfi_red['pf_score'], line = 's', ax = ax2, dist = stats.norm, fit = True)
ax2.set_title('Personal Freedom Score Q-Q plot')

plt.tight_layout()
plt.show()

Code
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(data = esi_hfi_red, x = "esi", linewidth = 5, ax = ax1)
ax1.set_title('Environmental Stability Score')

sm.qqplot(esi_hfi_red['esi'], line = 's', ax = ax2, dist = stats.norm, fit = True)
ax2.set_title('Environmental Stability Score Q-Q plot')

plt.tight_layout()
plt.show()

Correcting skew

Code
esi_hfi_red['pf_score_square'] = pow(esi_hfi_red.pf_score, 2)

col = esi_hfi_red['pf_score_square']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Squared Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Squared Q-Q plot')    
plt.tight_layout()
plt.show()

Code
esi_hfi_red['esi_log'] = np.log(esi_hfi_red.esi + 1)

col = esi_hfi_red['esi_log']

fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1)

sns.kdeplot(col, linewidth = 5, ax = ax1)
ax1.set_title('Log Density plot')    

sm.qqplot(col, line = 's', ax = ax2)
ax2.set_title('Log Q-Q plot')    
plt.tight_layout()
plt.show()

Normalizing

scaler = StandardScaler()
esi_hfi_red[['esi_log', 'pf_score_square']] = scaler.fit_transform(esi_hfi_red[['esi_log', 'pf_score_square']])

esi_hfi_red.describe().round(3)
esi pf_score pf_score_square esi_log
count 129.000 129.000 129.000 129.000
mean 50.599 7.210 0.000 0.000
std 8.304 1.291 1.004 1.004
min 32.700 4.203 -1.916 -2.604
25% 44.800 6.207 -0.805 -0.671
50% 50.000 7.074 -0.191 0.006
75% 56.100 8.415 0.915 0.718
max 75.100 9.476 1.927 2.527

Correlations

esi_hfi_num = esi_hfi_red.select_dtypes(include = 'number')

corr = esi_hfi_num.corr()
corr
esi pf_score pf_score_square esi_log
esi 1.000000 0.574756 0.583615 0.993689
pf_score 0.574756 1.000000 0.995631 0.560831
pf_score_square 0.583615 0.995631 1.000000 0.566744
esi_log 0.993689 0.560831 0.566744 1.000000
Code
plt.figure(figsize = (7, 5))
ax = sns.scatterplot(data = esi_hfi_red, x = "pf_score_square", y = "esi_log",
                hue = "region", palette = "colorblind")
ax.legend(title = "Region",
          bbox_to_anchor = (1.02, 1), loc = 'upper left', borderaxespad = 0)
ax.set(xlabel = "Personal Freedom Squared-Normal")
ax.set(ylabel = "Environmental Stability Log-Normal")
ax.set(title = "Human Freedom Index vs. Environmental Stability")
plt.show()

Correlations: p-value

x = esi_hfi_red['pf_score_square']
y = esi_hfi_red['esi_log']
corr_coefficient, p_value = pearsonr(x, y)

print("Pearson correlation coefficient:", corr_coefficient.round(3))
print("P-value:", p_value.round(5))
Pearson correlation coefficient: 0.567
P-value: 0.0
Code
sns.lmplot(data = esi_hfi_red, x = "pf_score_square", y = "esi_log", height = 5, aspect = 7/5)

plt.xlabel("Personal Freedom Squared-Normal")
plt.ylabel("Environmental Stability Log-Normal")
plt.title("Human Freedom Index vs. Environmental Stability")

plt.show()

Conclusions: question

How does environmental stability correlate with human freedom indices in different countries, and what trends can be observed over recent years?

  1. We can’t make inferences about recent years (the ESI data covers only one year)
  2. There is a moderate positive correlation between the human freedom index and environmental stability
  3. We also can’t say much about differences between individual countries
  4. A linear regression is the natural next step (covered later)

Conclusions: data preprocessing

There are multiple steps:

  1. Check the distribution for normality

  2. Likely will need a transformation based on the severity and direction of skew

  3. Normalize (standardize) data measured in different units

  4. Correlations are a good start, but regressions are more definitive

  5. Preprocessing is done “as needed”, so we didn’t cover everything…