Unsupervised
Learning I

Lecture 11

Dr. Greg Chism

University of Arizona
INFO 523 - Spring 2024

Warm up

Announcements

Remainder of the semester:

  • Lectures (“lab” time):

    • This and next week: Work on projects
    • In two weeks: Project peer code review + fill out peer evals
  • Project: Presentations Mon, May 06, 1:00pm - 3:00pm (all team members must be there!)

    • You can opt into presenting May 01
  • HW 05 is due Fri Apr 29, 11:59pm

Setup

# Data Handling and Manipulation
import pandas as pd
import numpy as np

# Data Preprocessing
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Model Selection and Evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.mixture import GaussianMixture

# Machine Learning Models
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set the default theme for visualization
# (pass font_scale here rather than calling sns.set() afterwards,
#  which would reset the style and palette back to their defaults)
sns.set_theme(style = "white", palette = "colorblind", font_scale = 1.25)

Unsupervised Learning


Credit: Recro

Clustering

Clustering

Some use cases for clustering include:

  • Recommender systems:

    • Grouping together users with similar viewing patterns on Netflix, in order to recommend similar content
  • Anomaly detection:

    • Fraud detection, detecting defective mechanical parts
  • Genetics:

    • Clustering DNA patterns to analyze evolutionary biology
  • Customer segmentation:

    • Understanding different customer segments to devise marketing strategies
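
Before working with real data, here is a minimal sketch (not from the slides) of what clustering produces: k-means run on synthetic 2-D data from scikit-learn's `make_blobs`, recovering three groups without any labels.

```python
# Illustrative sketch: k-means on synthetic 2-D data (no labels used to fit)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate three well-separated groups of points
X, _ = make_blobs(n_samples = 300, centers = 3, cluster_std = 0.8, random_state = 42)

# Fit k-means with k = 3 and read off the discovered cluster labels
kmeans = KMeans(n_clusters = 3, n_init = 10, random_state = 42).fit(X)
labels = kmeans.labels_

print(np.bincount(labels))  # roughly 100 points per cluster
```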

Question:

Can we identify distinct groupings of baseball players based on their 2018 player stats?

Our data: MLB player stats

mlb_players_18 = pd.read_csv("data/mlb_players_18.csv", encoding = 'iso-8859-1')

mlb_players_18.head()
name team position games AB R H doubles triples HR RBI walks strike_outs stolen_bases caught_stealing_base AVG OBP SLG OPS
0 Allard, K ATL P 3 1 1 1 0 0 0 0 0 0 0 0 1.0 1.0 1.0 2.0
1 Gibson, K MIN P 1 2 2 2 0 0 0 0 0 0 0 0 1.0 1.0 1.0 2.0
2 Law, D SF P 7 1 1 1 0 0 0 0 0 0 0 0 1.0 1.0 1.0 2.0
3 Nuno, V TB P 1 2 0 2 0 0 0 1 0 0 0 0 1.0 1.0 1.0 2.0
4 Romero, E KC P 4 1 1 1 1 0 0 0 0 0 0 0 1.0 1.0 2.0 3.0
mlb_players_18.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1270 entries, 0 to 1269
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  1270 non-null   object 
 1   team                  1270 non-null   object 
 2   position              1270 non-null   object 
 3   games                 1270 non-null   int64  
 4   AB                    1270 non-null   int64  
 5   R                     1270 non-null   int64  
 6   H                     1270 non-null   int64  
 7   doubles               1270 non-null   int64  
 8   triples               1270 non-null   int64  
 9   HR                    1270 non-null   int64  
 10  RBI                   1270 non-null   int64  
 11  walks                 1270 non-null   int64  
 12  strike_outs           1270 non-null   int64  
 13  stolen_bases          1270 non-null   int64  
 14  caught_stealing_base  1270 non-null   int64  
 15  AVG                   1270 non-null   float64
 16  OBP                   1270 non-null   float64
 17  SLG                   1270 non-null   float64
 18  OPS                   1270 non-null   float64
dtypes: float64(4), int64(12), object(3)
memory usage: 188.6+ KB
Code
# Assign data
df = mlb_players_18

# Summarize every column (note: df.columns returns all columns,
# numeric ones included, so this is a full frequency overview,
# not just the categorical features)
cols_to_analyze = df.columns

# Initialize a dictionary to store results
category_analysis = {}

# Loop through each column
for col in cols_to_analyze:
    counts = df[col].value_counts()
    proportions = df[col].value_counts(normalize=True)
    unique_levels = df[col].unique()
    
    # Store results in dictionary
    category_analysis[col] = {
        'Unique Levels': unique_levels,
        'Counts': counts,
        'Proportions': proportions
    }

# Print results
for col, data in category_analysis.items():
    print(f"Analysis for {col}:\n")
    print("Unique Levels:", data['Unique Levels'])
    print("\nCounts:\n", data['Counts'])
    print("\nProportions:\n", data['Proportions'])
    print("\n" + "-"*50 + "\n")
Analysis for name:

Unique Levels: [' Allard, K' ' Gibson, K' ' Law, D' ... ' Zamora, D' ' Zastryzny, R'
 ' Ziegler, B']

Counts:
 name
 Anderson, T    3
 Garcia, J      3
 Guerra, J      3
 Sanchez, A     3
 Santana, D     3
               ..
 Taylor, M      1
 Bour, J        1
 Flowers, T     1
 Davidson, M    1
 Ziegler, B     1
Name: count, Length: 1224, dtype: int64

Proportions:
 name
 Anderson, T    0.002362
 Garcia, J      0.002362
 Guerra, J      0.002362
 Sanchez, A     0.002362
 Santana, D     0.002362
                  ...   
 Taylor, M      0.000787
 Bour, J        0.000787
 Flowers, T     0.000787
 Davidson, M    0.000787
 Ziegler, B     0.000787
Name: proportion, Length: 1224, dtype: float64

--------------------------------------------------

Analysis for team:

Unique Levels: ['ATL' 'MIN' 'SF' 'TB' 'KC' 'CHC' 'MIL' 'PIT' 'SEA' 'NYM' 'CWS' 'COL'
 'LAD' 'BOS' 'MIA' 'NYY' 'TOR' 'WSH' 'LAA' 'OAK' 'TEX' 'HOU' 'CLE' 'CIN'
 'PHI' 'STL' 'DET' 'ARI' 'BAL' 'SD']

Counts:
 team
ATL    53
NYM    51
LAD    50
MIL    49
CHC    48
ARI    47
CIN    46
SF     45
STL    45
WSH    45
MIA    44
DET    44
PHI    43
TOR    43
SD     43
LAA    42
COL    41
OAK    41
BAL    41
BOS    40
SEA    40
PIT    40
TB     39
TEX    38
CLE    38
KC     38
NYY    38
HOU    34
MIN    32
CWS    32
Name: count, dtype: int64

Proportions:
 team
ATL    0.041732
NYM    0.040157
LAD    0.039370
MIL    0.038583
CHC    0.037795
ARI    0.037008
CIN    0.036220
SF     0.035433
STL    0.035433
WSH    0.035433
MIA    0.034646
DET    0.034646
PHI    0.033858
TOR    0.033858
SD     0.033858
LAA    0.033071
COL    0.032283
OAK    0.032283
BAL    0.032283
BOS    0.031496
SEA    0.031496
PIT    0.031496
TB     0.030709
TEX    0.029921
CLE    0.029921
KC     0.029921
NYY    0.029921
HOU    0.026772
MIN    0.025197
CWS    0.025197
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for position:

Unique Levels: ['P' 'C' 'SS' 'RF' '3B' 'CF' 'LF' '2B' '1B' 'DH']

Counts:
 position
P     642
C     115
2B     83
CF     79
RF     76
3B     73
LF     70
SS     67
1B     59
DH      6
Name: count, dtype: int64

Proportions:
 position
P     0.505512
C     0.090551
2B    0.065354
CF    0.062205
RF    0.059843
3B    0.057480
LF    0.055118
SS    0.052756
1B    0.046457
DH    0.004724
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for games:

Unique Levels: [  3   1   7   4  43  14  65   6  56  60  52  10  45   2   5  29 136  30
  54   9 150  63 147  47 137  23 103 140  39 154 162 143 141 111 139 152
 158 144  80  40  95  31  13 135  89  91  38 157 149 156 113  55 132  18
 146  51 116 160 148  48  21  77 138  59  82  19 104 145 142 105  24  76
  25 153 127  78  41  87 112  85 109 125 108  75  66  28 128  97 129  22
 119 102 101 151 134 123 131  68  16  26  36 126  17  15  81  61  20 130
  44  35  90  32  37  50 117  49  73 110 107 155 159  84  42  57  83 161
  98  74 114 133  11 122  92  86  94  58 124  79  33  96 115 121  46  67
  93  69 120  34  88   8 100 118  72  62  27 106  12  71  70  53  64]

Counts:
 games
1      96
2      76
3      70
4      54
5      31
       ..
150     1
120     1
93      1
135     1
98      1
Name: count, Length: 161, dtype: int64

Proportions:
 games
1      0.075591
2      0.059843
3      0.055118
4      0.042520
5      0.024409
         ...   
150    0.000787
120    0.000787
93     0.000787
135    0.000787
98     0.000787
Name: proportion, Length: 161, dtype: float64

--------------------------------------------------

Analysis for AB:

Unique Levels: [  1   2   3   6   4   7   5   8  93 520  59   9 569 225 574 143  84 534
  70 365 471 109 584 618 570 529 539 382 455 632 586 310 152 319  10  60
 487 504 281 328 134 620 623 573 590 280 480 503 433  58  99 140 560 123
 554 171 414 598 626 537 606  31 579 593 462 513 156 302 559 178 594 444
 252  63 326 596 401 398  74 215  53 527 566 486 249 580 582 250 100 261
 165 413 547 288 321 415 661 318 477 463 293 275 210 533 494  40 284 492
  11 396  66 389 257 464 151 431 258 664 347 440 296 578  26 432 556 546
 101 116 176 386  15 405 169 617 403 211 283  34 136 544 597 223 499 174
 106 379 395  19 190 422 255 141 282 351 149 344  65 501 459 639 356 437
 330 557  50  77 466 131  27 332 170 236 298 489 564 265 519 402 407 380
 550 110 531 177 358 130 524 461 536  75 312 474 316 399 605 613 218 119
 488 512 294 306 394 199 335 467 516  12 420 353 438 342 460 230 206 202
 602  81 576 126 500 187  61 558 424 465 506 404 213 242 197 543  33 343
 600 505 498 436  25 450 567 417 476 117 184  46 159 478 428 235 349 139
  76 203 292 555 530 348 429  68 493 510 387 128  64 278  30 133  73 146
 129  86 473  43 357 181 125 371 303 485  13 308 408 369 244 362 192 166
 434 251 423 468 221 115 452  71 373 120 196 299 532 360  18 207  36 212
 583 113 427 241 191 224  96  32 247  23 161  69 189 194  37 334 553  14
 164  94  57 238  67  91 182  48 491  29  87 214 200  44 103 325 208 323
 195  20 266 272  56 384  41 289 147 163 228  16  54  92  55 160 243 111
  78  28 118  45 135  79  17 102 470 179  72  90  24 150  49  38 122  42
  21  22  97 112  39  47  52  35  51  62   0]

Counts:
 AB
0      286
1       92
2       64
3       35
4       28
      ... 
335      1
467      1
420      1
438      1
605      1
Name: count, Length: 389, dtype: int64

Proportions:
 AB
0      0.225197
1      0.072441
2      0.050394
3      0.027559
4      0.022047
         ...   
335    0.000787
467    0.000787
420    0.000787
438    0.000787
605    0.000787
Name: proportion, Length: 389, dtype: float64

--------------------------------------------------

Analysis for R:

Unique Levels: [  1   2   0   9 129   4 111  35 118  30  11  84  10  62 101  15  86  94
  89  88  90  39  67  64  91  44  17  55   3  65  38  40  78  83 104  26
  59  75  68  21  77 119  70   5 103  95  72  27   8  87  85  19 105  33
  52  47  79  74  54  69  28  36  76 100  43  81  41  49  71 102  80  31
  50  37  63  23  45  48 110  13  60   7  56  22  61  25  20  46  53  16
  29  82  14  66  42  32  12  34  98  57  51  73  24   6  18]

Counts:
 R
0      562
1       83
2       52
3       28
4       22
      ... 
98       1
57       1
118      1
100      1
104      1
Name: count, Length: 105, dtype: int64

Proportions:
 R
0      0.442520
1      0.065354
2      0.040945
3      0.022047
4      0.017323
         ...   
98     0.000787
57     0.000787
118    0.000787
100    0.000787
104    0.000787
Name: proportion, Length: 105, dtype: float64

--------------------------------------------------

Analysis for H:

Unique Levels: [  1   2   4   3  33 180  20 188  74 187  46  27 169  22 114 147  34 181
 191 176 163 166 117 139 192 178  94  96  18 146 151  84  98  40 185 170
 175  83  45 142 148 127  17  29  41 164  36 162  50 121 174 182 156   9
 168 172 134  87 161  51  72  93 143 113  21  61  15 149 160 137  70  28
  73 115 165 152  80 155  89 183  88 132 128  81  76  58  16 136  11  77
  78 135  68 108 118 106 154 126 119  42   7 116 141  31  47 103 159 107
  56  75 144 158  59  14 100 104   5 111  67  37  92  39  90 131 120 167
  86 145  13  44 133 101  97 140  91  19  79 153  55  30 123 129  99 105
  10 109  85  57 124  52  48   8 122   6  38 102  69 125  82  65 110 112
  71  23  24  26  53  43  32  49  54  25  35  12  66  60   0]

Counts:
 H
0      501
1       78
3       41
2       39
4       30
      ... 
151      1
191      1
140      1
153      1
187      1
Name: count, Length: 177, dtype: int64

Proportions:
 H
0      0.394488
1      0.061417
3      0.032283
2      0.030709
4      0.023622
         ...   
151    0.000787
191    0.000787
140    0.000787
153    0.000787
187    0.000787
Name: proportion, Length: 177, dtype: float64

--------------------------------------------------

Analysis for doubles:

Unique Levels: [ 0  1  2  4 47 37 11 34  5  6 29  9 31 24 30 44 36 25 22 28 43 14 18 33
 35 16 15 46 38 12 27 26 10 42 40  3 41 45 51 21 17  8  7 32 19 23 13 48
 20]

Counts:
 doubles
0     653
1      94
2      46
3      30
4      26
5      24
6      22
14     21
18     20
9      18
13     17
8      16
23     16
10     16
16     15
11     15
12     15
22     15
25     14
7      14
15     13
28     12
17     12
26     10
21      9
19      9
27      8
20      8
31      8
35      7
30      7
32      7
33      6
34      6
29      6
24      6
38      4
42      4
36      4
40      3
41      2
44      2
43      2
37      2
47      2
45      1
46      1
48      1
51      1
Name: count, dtype: int64

Proportions:
 doubles
0     0.514173
1     0.074016
2     0.036220
3     0.023622
4     0.020472
5     0.018898
6     0.017323
14    0.016535
18    0.015748
9     0.014173
13    0.013386
8     0.012598
23    0.012598
10    0.012598
16    0.011811
11    0.011811
12    0.011811
22    0.011811
25    0.011024
7     0.011024
15    0.010236
28    0.009449
17    0.009449
26    0.007874
21    0.007087
19    0.007087
27    0.006299
20    0.006299
31    0.006299
35    0.005512
30    0.005512
32    0.005512
33    0.004724
34    0.004724
29    0.004724
24    0.004724
38    0.003150
42    0.003150
36    0.003150
40    0.002362
41    0.001575
44    0.001575
43    0.001575
37    0.001575
47    0.001575
45    0.000787
46    0.000787
48    0.000787
51    0.000787
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for triples:

Unique Levels: [ 0  1  5  2  6  7  4  3 10  9  8 12]

Counts:
 triples
0     941
1     128
2      81
3      48
4      24
5      15
6      12
7       9
8       6
9       3
10      2
12      1
Name: count, dtype: int64

Proportions:
 triples
0     0.740945
1     0.100787
2     0.063780
3     0.037795
4     0.018898
5     0.011811
6     0.009449
7     0.007087
8     0.004724
9     0.002362
10    0.001575
12    0.000787
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for HR:

Unique Levels: [ 0  2  3 32  1 43 36 15 13  4 14 39 23 17 24 10  9 12  5  7 37 27 38 11
 26 30 22 29 34 16 33 21  6 31 25 20  8 35 19 18 28 48 40]

Counts:
 HR
0     734
1      92
2      42
3      35
4      31
6      28
5      22
9      22
11     21
8      18
10     16
7      16
13     15
15     14
12     13
16     13
17     12
20     12
14     12
21     11
23     11
24      9
22      9
19      7
18      7
27      7
25      6
34      4
26      4
37      3
32      3
30      3
38      3
29      2
39      2
36      2
35      2
28      2
31      1
43      1
33      1
48      1
40      1
Name: count, dtype: int64

Proportions:
 HR
0     0.577953
1     0.072441
2     0.033071
3     0.027559
4     0.024409
6     0.022047
5     0.017323
9     0.017323
11    0.016535
8     0.014173
10    0.012598
7     0.012598
13    0.011811
15    0.011024
12    0.010236
16    0.010236
17    0.009449
20    0.009449
14    0.009449
21    0.008661
23    0.008661
24    0.007087
22    0.007087
19    0.005512
18    0.005512
27    0.005512
25    0.004724
34    0.003150
26    0.003150
37    0.002362
32    0.002362
30    0.002362
38    0.002362
29    0.001575
39    0.001575
36    0.001575
35    0.001575
28    0.001575
31    0.000787
43    0.000787
33    0.000787
48    0.000787
40    0.000787
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for RBI:

Unique Levels: [  0   1   2   3  21  80   6 130  19 110  36   9  61  14  52  79  15  92
  98  76  38  70  58  83  60  63  50  12  33   5  55  43  42  22  89 107
  93  51  40  44  64   7  16  87  75 108  85 111  10   4 103  77  17  41
  34  67  53  11 104 101  48  35  88  39  31  18  54  68  72  46  74  37
  62  30  25  65  84  32  73  57  45 105  86  13  99 100  20  71  28  78
  23  47  29   8  59  27  81  97  49  24  69  95  56 123  96  82  66  26]

Counts:
 RBI
0      596
1       73
2       38
4       29
3       24
      ... 
101      1
104      1
111      1
89       1
26       1
Name: count, Length: 108, dtype: int64

Proportions:
 RBI
0      0.469291
1      0.057480
2      0.029921
4      0.022835
3      0.018898
         ...   
101    0.000787
104    0.000787
111    0.000787
89     0.000787
26     0.000787
Name: proportion, Length: 108, dtype: float64

--------------------------------------------------

Analysis for walks:

Unique Levels: [  0   1   2  81   3  69  14  68  17   8  55  47 122  11  42  76  48  71
  32  49  61   5  38  37  21  20  22  70  72  25  73  24  45   7  12  35
  10  79  59  23  29  90  36  16   4  30  96 108  64  51   9  58  31  39
  67  15  19  62  60  34  52  26 106  33   6  78  18  28  92  80  13  43
  44  41  77  54  27  40 102  50  95 130  53  63  87  84  83  46  65 110
  66  74  56]

Counts:
 walks
0      591
1       88
2       51
3       29
4       25
      ... 
108      1
122      1
68       1
81       1
56       1
Name: count, Length: 93, dtype: int64

Proportions:
 walks
0      0.465354
1      0.069291
2      0.040157
3      0.022835
4      0.019685
         ...   
108    0.000787
122    0.000787
68     0.000787
81     0.000787
56     0.000787
Name: proportion, Length: 93, dtype: float64

--------------------------------------------------

Analysis for strike_outs:

Unique Levels: [  0   1   3   5  91   9   2 146  24 135  43  12  79  21  54 124  19 125
 132  60  82  94  80 104 114  69  47  29  64  11  96  46  40  27 151  97
 122  72  36  98  93 123  32  44  18  99 168 134  63 167 106 173 115 102
  50  62  75  85  83  31 148 101  53  41 142  86  59 110  38  16 152  95
 109  73 107 113  71  77  49   8 140  65  34 143  68 129  22  61 126 108
  37  13  17  15  45  42 128   4  87 211 119 156  20  55 131 138  66 116
  84  14  35  88 158 176 100  58  78 111 145  25  30  57 117 178  23 127
 155  33 103   6 169 121  89  52  67 147 163 175 150 133 159 153 149 217
 137  10  28  90  74 112 165  76  39 160  48   7  26 207 120  51 192  70]

Counts:
 strike_outs
0      351
1      122
2       52
3       31
4       23
      ... 
155      1
169      1
89       1
147      1
70       1
Name: count, Length: 162, dtype: int64

Proportions:
 strike_outs
0      0.276378
1      0.096063
2      0.040945
3      0.024409
4      0.018110
         ...   
155    0.000787
169    0.000787
89     0.000787
147    0.000787
70     0.000787
Name: proportion, Length: 162, dtype: float64

--------------------------------------------------

Analysis for stolen_bases:

Unique Levels: [ 0 30  3  2  6  7 22  1 17 24  4 10 12 45 20  9 16  8 14 40  5 27 21 23
 11 33 25 32 15 43 34 28 35 13 19 26]

Counts:
 stolen_bases
0     862
1     104
2      70
3      45
4      26
5      24
6      23
7      17
12     12
10     12
8       9
9       7
14      7
11      6
16      6
13      5
15      4
24      4
21      4
30      3
20      3
17      2
34      2
19      1
35      1
28      1
43      1
45      1
32      1
25      1
33      1
23      1
27      1
22      1
40      1
26      1
Name: count, dtype: int64

Proportions:
 stolen_bases
0     0.678740
1     0.081890
2     0.055118
3     0.035433
4     0.020472
5     0.018898
6     0.018110
7     0.013386
12    0.009449
10    0.009449
8     0.007087
9     0.005512
14    0.005512
11    0.004724
16    0.004724
13    0.003937
15    0.003150
24    0.003150
21    0.003150
30    0.002362
20    0.002362
17    0.001575
34    0.001575
19    0.000787
35    0.000787
28    0.000787
43    0.000787
45    0.000787
32    0.000787
25    0.000787
33    0.000787
23    0.000787
27    0.000787
22    0.000787
40    0.000787
26    0.000787
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for caught_stealing_base:

Unique Levels: [ 0  6  1  4  2  3  7 10 11 12  5  9 14  8]

Counts:
 caught_stealing_base
0     942
1     117
2      70
3      46
4      35
6      18
5      18
7       8
10      4
12      3
9       3
11      2
14      2
8       2
Name: count, dtype: int64

Proportions:
 caught_stealing_base
0     0.741732
1     0.092126
2     0.055118
3     0.036220
4     0.027559
6     0.014173
5     0.014173
7     0.006299
10    0.003150
12    0.002362
9     0.002362
11    0.001575
14    0.001575
8     0.001575
Name: proportion, dtype: float64

--------------------------------------------------

Analysis for AVG:

Unique Levels: [1.    0.667 0.5   0.429 0.4   0.375 0.355 0.346 0.339 0.333 0.33  0.329
 0.326 0.322 0.321 0.316 0.314 0.312 0.31  0.309 0.308 0.306 0.305 0.304
 0.303 0.301 0.3   0.299 0.298 0.297 0.296 0.294 0.293 0.292 0.291 0.29
 0.288 0.287 0.286 0.285 0.284 0.283 0.282 0.281 0.28  0.279 0.278 0.277
 0.276 0.275 0.274 0.273 0.272 0.271 0.27  0.269 0.268 0.267 0.266 0.265
 0.264 0.263 0.262 0.261 0.26  0.259 0.258 0.257 0.256 0.255 0.254 0.253
 0.252 0.251 0.25  0.249 0.248 0.247 0.246 0.245 0.244 0.243 0.242 0.241
 0.24  0.239 0.238 0.237 0.236 0.235 0.234 0.233 0.232 0.231 0.23  0.229
 0.228 0.227 0.226 0.225 0.224 0.223 0.222 0.221 0.22  0.219 0.217 0.216
 0.215 0.214 0.213 0.212 0.211 0.21  0.209 0.208 0.207 0.206 0.205 0.204
 0.203 0.202 0.201 0.2   0.199 0.198 0.197 0.196 0.195 0.194 0.192 0.19
 0.189 0.188 0.187 0.186 0.185 0.184 0.183 0.182 0.181 0.18  0.179 0.178
 0.177 0.176 0.175 0.174 0.173 0.17  0.169 0.168 0.167 0.165 0.164 0.163
 0.162 0.161 0.16  0.159 0.158 0.156 0.154 0.152 0.15  0.148 0.147 0.143
 0.141 0.14  0.138 0.136 0.135 0.134 0.133 0.13  0.128 0.125 0.122 0.12
 0.119 0.118 0.117 0.116 0.115 0.114 0.111 0.109 0.108 0.105 0.103 0.102
 0.1   0.095 0.094 0.093 0.092 0.091 0.089 0.088 0.087 0.083 0.08  0.078
 0.077 0.075 0.07  0.067 0.065 0.064 0.063 0.059 0.057 0.056 0.052 0.048
 0.045 0.044 0.042 0.038 0.034 0.024 0.019 0.   ]

Counts:
 AVG
0.000    501
0.167     20
0.200     18
0.250     17
0.333     14
        ... 
0.375      1
0.169      1
0.170      1
0.173      1
0.197      1
Name: count, Length: 224, dtype: int64

Proportions:
 AVG
0.000    0.394488
0.167    0.015748
0.200    0.014173
0.250    0.013386
0.333    0.011024
           ...   
0.375    0.000787
0.169    0.000787
0.170    0.000787
0.173    0.000787
0.197    0.000787
Name: proportion, Length: 224, dtype: float64

--------------------------------------------------

Analysis for OBP:

Unique Levels: [1.    0.667 0.5   0.429 0.333 0.375 0.371 0.438 0.4   0.402 0.381 0.398
 0.394 0.386 0.329 0.406 0.46  0.357 0.388 0.364 0.374 0.395 0.358 0.378
 0.367 0.341 0.331 0.3   0.417 0.354 0.33  0.405 0.336 0.366 0.328 0.359
 0.397 0.361 0.349 0.34  0.355 0.352 0.337 0.348 0.323 0.326 0.389 0.338
 0.36  0.345 0.325 0.339 0.308 0.286 0.342 0.304 0.415 0.376 0.351 0.309
 0.313 0.335 0.294 0.322 0.392 0.356 0.327 0.362 0.35  0.306 0.288 0.321
 0.396 0.316 0.273 0.385 0.334 0.303 0.344 0.332 0.314 0.387 0.31  0.29
 0.319 0.353 0.267 0.343 0.39  0.301 0.346 0.289 0.295 0.377 0.391 0.404
 0.382 0.324 0.347 0.305 0.312 0.317 0.368 0.318 0.287 0.307 0.297 0.282
 0.274 0.315 0.25  0.293 0.222 0.393 0.302 0.311 0.299 0.279 0.292 0.281
 0.235 0.298 0.291 0.28  0.268 0.277 0.269 0.285 0.253 0.275 0.266 0.283
 0.284 0.278 0.252 0.258 0.276 0.262 0.271 0.225 0.256 0.255 0.263 0.254
 0.239 0.296 0.242 0.236 0.214 0.211 0.26  0.247 0.264 0.257 0.265 0.272
 0.261 0.226 0.244 0.248 0.259 0.2   0.24  0.193 0.209 0.221 0.217 0.188
 0.241 0.207 0.218 0.245 0.232 0.227 0.19  0.191 0.224 0.243 0.143 0.167
 0.194 0.184 0.161 0.203 0.192 0.174 0.179 0.183 0.154 0.15  0.237 0.246
 0.178 0.208 0.136 0.175 0.21  0.204 0.171 0.125 0.176 0.163 0.119 0.118
 0.162 0.156 0.148 0.233 0.115 0.114 0.205 0.111 0.172 0.103 0.131 0.1
 0.17  0.094 0.093 0.091 0.231 0.089 0.139 0.16  0.083 0.096 0.127 0.077
 0.075 0.109 0.086 0.067 0.097 0.065 0.102 0.063 0.158 0.123 0.074 0.105
 0.056 0.052 0.045 0.061 0.038 0.048 0.019 0.    0.095]

Counts:
 OBP
0.000    474
0.333     28
0.250     20
0.200     19
0.500     18
        ... 
0.227      1
0.232      1
0.245      1
0.218      1
0.095      1
Name: count, Length: 249, dtype: int64

Proportions:
 OBP
0.000    0.373228
0.333    0.022047
0.250    0.015748
0.200    0.014961
0.500    0.014173
           ...   
0.227    0.000787
0.232    0.000787
0.245    0.000787
0.218    0.000787
0.095    0.000787
Name: proportion, Length: 249, dtype: float64

--------------------------------------------------

Analysis for SLG:

Unique Levels: [1.    2.    0.833 0.5   0.714 0.4   0.6   1.125 0.516 0.64  0.39  0.333
 1.333 0.667 0.629 0.471 0.598 0.671 0.452 0.451 0.614 0.518 0.628 0.422
 0.49  0.505 0.468 0.535 0.417 0.487 0.44  0.457 0.438 0.415 0.474 0.411
 0.3   0.35  0.435 0.431 0.454 0.448 0.538 0.527 0.561 0.414 0.406 0.366
 0.552 0.483 0.364 0.45  0.398 0.38  0.517 0.567 0.502 0.428 0.554 0.71
 0.581 0.465 0.533 0.481 0.522 0.525 0.427 0.479 0.416 0.461 0.532 0.378
 0.492 0.286 0.564 0.493 0.419 0.372 0.382 0.392 0.512 0.434 0.526 0.47
 0.379 0.446 0.433 0.42  0.37  0.498 0.528 0.508 0.46  0.407 0.519 0.456
 0.484 0.467 0.413 0.345 0.464 0.429 0.363 0.539 0.534 0.273 0.384 0.53
 0.545 0.489 0.344 0.48  0.376 0.444 0.395 0.466 0.346 0.308 0.389 0.349
 0.494 0.347 0.491 0.396 0.388 0.386 0.267 0.331 0.509 0.449 0.412 0.473
 0.353 0.375 0.421 0.356 0.292 0.486 0.582 0.453 0.408 0.496 0.458 0.597
 0.477 0.436 0.357 0.403 0.52  0.377 0.437 0.523 0.463 0.343 0.381 0.367
 0.425 0.439 0.424 0.469 0.394 0.332 0.499 0.32  0.362 0.404 0.462 0.426
 0.459 0.307 0.25  0.34  0.55  0.625 0.336 0.504 0.476 0.409 0.432 0.549
 0.397 0.279 0.326 0.305 0.271 0.325 0.418 0.385 0.283 0.405 0.374 0.373
 0.355 0.342 0.327 0.368 0.324 0.359 0.289 0.297 0.266 0.391 0.317 0.315
 0.304 0.291 0.313 0.262 0.328 0.33  0.387 0.338 0.423 0.351 0.36  0.393
 0.281 0.354 0.278 0.348 0.288 0.314 0.369 0.335 0.321 0.478 0.217 0.371
 0.216 0.312 0.214 0.309 0.383 0.242 0.339 0.399 0.263 0.244 0.239 0.231
 0.253 0.264 0.241 0.302 0.29  0.258 0.205 0.303 0.41  0.2   0.24  0.272
 0.8   0.316 0.213 0.232 0.28  0.365 0.268 0.301 0.211 0.259 0.257 0.188
 0.306 0.185 0.294 0.269 0.298 0.284 0.176 0.223 0.261 0.26  0.247 0.296
 0.167 0.319 0.238 0.274 0.204 0.179 0.163 0.27  0.194 0.16  0.158 0.246
 0.154 0.152 0.175 0.148 0.235 0.143 0.224 0.254 0.203 0.138 0.136 0.135
 0.165 0.222 0.17  0.125 0.122 0.169 0.116 0.115 0.114 0.111 0.105 0.172
 0.128 0.119 0.1   0.19  0.094 0.093 0.123 0.159 0.091 0.089 0.088 0.13
 0.083 0.08  0.098 0.077 0.103 0.075 0.067 0.097 0.065 0.064 0.063 0.059
 0.057 0.056 0.069 0.048 0.045 0.038 0.034 0.024 0.019 0.   ]

Counts:
 SLG
0.000    501
0.333     19
0.250     17
0.200     15
0.500     14
        ... 
0.549      1
0.397      1
0.418      1
0.405      1
0.625      1
Name: count, Length: 346, dtype: int64

Proportions:
 SLG
0.000    0.394488
0.333    0.014961
0.250    0.013386
0.200    0.011811
0.500    0.011024
           ...   
0.549    0.000787
0.397    0.000787
0.418    0.000787
0.405    0.000787
0.625    0.000787
Name: proportion, Length: 346, dtype: float64

--------------------------------------------------

Analysis for OPS:

Unique Levels: [2.    3.    1.667 1.5   1.    1.167 1.143 0.733 1.1   0.887 1.078 0.765
 0.667 0.762 1.067 1.031 0.852 1.069 0.846 0.837 0.943 0.924 1.088 0.797
 0.847 0.892 0.832 0.909 0.813 0.845 0.817 0.821 0.806 0.755 0.805 0.6
 0.65  0.789 0.804 0.836 0.79  0.843 0.854 0.905 0.855 0.935 0.83  0.811
 0.773 0.727 0.917 0.703 0.868 0.729 0.754 0.728 0.923 0.914 0.86  0.751
 0.881 1.043 0.922 0.818 0.883 0.874 0.803 0.742 0.785 0.926 0.701 0.792
 0.8   0.571 0.925 0.859 0.714 0.741 0.696 0.89  0.849 0.732 0.758 0.664
 0.838 0.704 0.919 0.864 0.787 0.774 0.747 0.871 0.798 0.825 0.796 0.763
 0.888 0.633 0.749 0.794 0.678 0.757 0.545 0.93  0.834 0.824 0.722 0.82
 0.679 0.76  0.776 0.731 0.78  0.939 0.668 0.629 0.699 0.637 0.829 0.744
 0.697 0.676 0.736 0.801 0.533 0.672 0.74  0.786 0.743 0.705 0.677 0.651
 0.81  0.735 0.623 0.756 0.973 0.886 0.863 0.723 0.95  0.768 0.75  0.682
 0.764 0.715 0.814 0.882 0.706 0.709 0.619 0.809 0.71  0.769 0.753 0.897
 0.808 0.654 0.85  0.675 0.693 0.777 0.73  0.669 0.782 0.719 0.708 0.639
 0.793 0.839 0.775 0.816 0.626 0.718 0.72  0.643 0.702 0.64  0.68  0.738
 0.5   0.9   0.583 0.656 0.472 0.657 0.958 0.778 0.889 0.645 0.662 0.833
 0.724 0.7   0.779 0.624 0.788 0.592 0.694 0.713 0.617 0.815 0.746 0.681
 0.865 0.687 0.771 0.684 0.622 0.823 0.634 0.661 0.69  0.576 0.614 0.635
 0.688 0.671 0.613 0.566 0.567 0.717 0.602 0.581 0.557 0.761 0.608 0.597
 0.683 0.642 0.604 0.711 0.593 0.653 0.649 0.546 0.767 0.673 0.766 0.644
 0.655 0.648 0.611 0.565 0.605 0.59  0.627 0.555 0.475 0.559 0.61  0.541
 0.652 0.526 0.591 0.575 0.658 0.609 0.783 0.467 0.584 0.58  0.485 0.618
 0.589 0.429 0.712 0.558 0.674 0.621 0.474 0.685 0.504 0.523 0.478 0.63
 0.54  0.529 0.499 0.691 0.574 0.552 0.553 0.519 0.603 0.46  0.599 0.692
 0.51  0.551 0.4   0.544 0.473 1.229 0.435 0.739 0.588 0.577 0.547 0.425
 0.663 0.486 0.433 0.564 0.539 0.528 0.423 0.438 0.532 0.501 0.399 0.631
 0.582 0.455 0.516 0.628 0.698 0.421 0.578 0.511 0.562 0.549 0.477 0.44
 0.471 0.452 0.464 0.615 0.31  0.333 0.56  0.417 0.394 0.492 0.42  0.487
 0.39  0.363 0.454 0.355 0.382 0.352 0.378 0.337 0.383 0.462 0.368 0.34
 0.325 0.327 0.495 0.357 0.36  0.286 0.47  0.392 0.393 0.432 0.439 0.299
 0.403 0.273 0.406 0.451 0.388 0.426 0.52  0.376 0.313 0.25  0.301 0.285
 0.288 0.294 0.373 0.272 0.264 0.269 0.314 0.319 0.311 0.222 0.395 0.444
 0.424 0.341 0.296 0.276 0.328 0.267 0.268 0.2   0.289 0.346 0.188 0.219
 0.186 0.242 0.53  0.258 0.209 0.322 0.182 0.178 0.227 0.29  0.35  0.208
 0.195 0.194 0.243 0.154 0.185 0.174 0.133 0.18  0.191 0.323 0.158 0.166
 0.156 0.259 0.213 0.179 0.15  0.161 0.139 0.121 0.128 0.248 0.091 0.132
 0.124 0.077 0.131 0.072 0.038 0.    0.095 0.1   0.167 0.125 0.083]

Counts:
 OPS
0.000    474
0.500     17
0.333     11
0.667     11
1.000     10
        ... 
0.683      1
0.604      1
0.711      1
0.593      1
0.083      1
Name: count, Length: 443, dtype: int64

Proportions:
 OPS
0.000    0.373228
0.500    0.013386
0.333    0.008661
0.667    0.008661
1.000    0.007874
           ...   
0.683    0.000787
0.604    0.000787
0.711    0.000787
0.593    0.000787
0.083    0.000787
Name: proportion, Length: 443, dtype: float64

--------------------------------------------------
mlb_players_18.describe()
games AB R H doubles triples HR RBI walks strike_outs stolen_bases caught_stealing_base AVG OBP SLG OPS
count 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000
mean 48.171654 130.261417 17.031496 32.297638 6.507087 0.666929 4.397638 16.225197 12.351181 32.446457 1.948031 0.754331 0.140191 0.181824 0.217412 0.399239
std 49.957749 185.855484 26.896304 49.396815 10.487391 1.517461 8.036863 26.085535 20.680606 44.687302 5.018058 1.769933 0.140268 0.165976 0.218611 0.374984
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 5.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 29.000000 23.500000 1.000000 3.000000 0.000000 0.000000 0.000000 1.000000 1.000000 8.000000 0.000000 0.000000 0.166000 0.217500 0.214000 0.436500
75% 79.750000 213.750000 27.000000 50.000000 10.000000 1.000000 5.000000 24.000000 18.000000 54.000000 1.000000 1.000000 0.247000 0.316000 0.395000 0.703000
max 162.000000 664.000000 129.000000 192.000000 51.000000 12.000000 48.000000 130.000000 130.000000 217.000000 45.000000 14.000000 1.000000 1.000000 2.000000 3.000000
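
The summary above shows features on very different scales: AB tops out near 664 while AVG stays in [0, 1], so a distance-based clustering would be dominated by the large-scale columns unless we standardize first. A small sketch (with made-up toy rows, not MLB data) of what `StandardScaler` does:

```python
# Toy (AB, AVG) rows to show standardization; values are illustrative only
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[600.0, 0.300],
              [ 10.0, 0.250],
              [350.0, 0.280]])

# After scaling, each column has mean 0 and standard deviation 1,
# so both features contribute comparably to Euclidean distances
Xs = StandardScaler().fit_transform(X)
print(Xs.mean(axis = 0).round(6), Xs.std(axis = 0).round(6))
```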

Preprocessing

# Define the columns based on their type for preprocessing
categorical_features = ['team', 'position']
numerical_features = ['games', 'AB', 'R', 'H', 'doubles', 'triples', 'HR', 'RBI', 'walks', 'strike_outs', 'stolen_bases', 'caught_stealing_base', 'AVG', 'OBP', 'SLG', 'OPS']
# Handling missing values: Impute missing values if any
# For numerical features, replace missing values with the median of the column
# For categorical features, replace missing values with the most frequent value of the column
numerical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps = [
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))])

preprocessor = ColumnTransformer(transformers = [
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)])
# Apply the transformations to the dataset
mlb_preprocessed = preprocessor.fit_transform(mlb_players_18)

# The result is a NumPy array. To convert it back to a DataFrame:
# Update the method to get_feature_names_out for compatibility with newer versions of scikit-learn
feature_names = list(preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features))
new_columns = numerical_features + feature_names

mlb_preprocessed_df = pd.DataFrame(mlb_preprocessed, columns = new_columns)
mlb_preprocessed_df.head()
games AB R H doubles triples HR RBI walks strike_outs ... position_1B position_2B position_3B position_C position_CF position_DH position_LF position_P position_RF position_SS
0 -0.904553 -0.695768 -0.596283 -0.633846 -0.620712 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 -0.944603 -0.690386 -0.559089 -0.613594 -0.620712 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 -0.824454 -0.695768 -0.596283 -0.633846 -0.620712 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 -0.944603 -0.690386 -0.633478 -0.613594 -0.620712 -0.439676 -0.547399 -0.583894 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 -0.884529 -0.695768 -0.596283 -0.633846 -0.525322 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 56 columns

Before moving on:
Similarity / Dissimilarity

Similarity + Dissimilarity

Similarity

Cosine similarity

\(\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}\)

  • Best for text data or any high-dimensional data.

  • Useful when the magnitude of the data vector is not important.

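The cosine formula above can be sketched directly in NumPy (the vectors here are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    # dot product divided by the product of the vector magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a, so cosine similarity is 1
sim = cosine_similarity(a, b)
```

Because only the angle matters, scaling either vector leaves the similarity unchanged.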

Jaccard similarity

\(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\)

  • Suitable for sets or binary data.

  • Ideal for comparing the similarity between two sample sets.

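A pure-Python sketch of the Jaccard formula above, using built-in sets (the example sets are illustrative):

```python
def jaccard_similarity(A, B):
    # |A ∩ B| / |A ∪ B| for two collections treated as sets
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

sim = jaccard_similarity({1, 2, 3}, {2, 3, 4})  # 2 shared of 4 distinct items -> 0.5
```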

Pearson correlation coefficient

\(r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}\)

  • Use when measuring the linear relationship between two continuous variables.

  • Appropriate for data with a normal distribution.

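Pearson's \(r\) is available directly from NumPy's correlation matrix; a quick sketch with made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2 * x + 1  # perfectly linear relationship
r = np.corrcoef(x, y)[0, 1]  # Pearson's r: 1.0 for an exact linear fit
```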

Spearman's rank correlation

\(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\)

  • Ordinal data or when data do not meet the assumptions of Pearson’s correlation.

  • Monotonic relationships between two continuous or ordinal variables.

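A small sketch implementing the formula above by ranking both variables. This assumes no ties; `scipy.stats.spearmanr` handles the general case:

```python
import numpy as np

def spearman_rho(x, y):
    # convert values to ranks 1..n (assumes no ties), then apply the formula
    rx = np.argsort(np.argsort(x)) + 1
    ry = np.argsort(np.argsort(y)) + 1
    d = rx - ry
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

# monotonic but nonlinear: Spearman's rho is 1 even though Pearson's r would be < 1
rho = spearman_rho(np.array([1, 2, 3, 4, 5]), np.array([1, 4, 9, 16, 25]))
```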

Dissimilarity

Euclidean distance

\(d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}\)

  • Use for continuous data to measure the “straight line” distance between points in Euclidean space.
  • Most common in clustering and classification where simple distance measurement is required.
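A one-line NumPy sketch of the Euclidean distance above (points are illustrative):

```python
import numpy as np

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
d = np.sqrt(np.sum((p - q) ** 2))  # the 3-4-5 right triangle: distance 5.0
```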

Manhattan distance

\(d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i - q_i|\)

  • Suitable for continuous or ordinal data where you want to measure the distance as if navigating a grid-like path (like city blocks).
  • Useful when the difference across dimensions is important regardless of the path taken.
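The same pair of points under the Manhattan (city-block) distance above:

```python
import numpy as np

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
d = np.sum(np.abs(p - q))  # 3 blocks one way + 4 blocks the other = 7.0
```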

Hamming distance

\(d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} \delta(p_i, q_i) \quad \text{where} \quad \delta(a, b) = \begin{cases} 1 & \text{if } a \neq b \\ 0 & \text{otherwise} \end{cases}\)

  • Use for categorical or binary data.
  • Ideal for comparing two strings of equal length or binary feature vectors.
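A pure-Python sketch of the Hamming distance above, which works for strings or binary vectors of equal length:

```python
def hamming_distance(p, q):
    # count the positions at which two equal-length sequences differ
    if len(p) != len(q):
        raise ValueError("sequences must have equal length")
    return sum(a != b for a, b in zip(p, q))

d = hamming_distance("karolin", "kathrin")  # differs at 3 positions
```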

Minkowski distance

\(d(\mathbf{p}, \mathbf{q}) = \left( \sum_{i=1}^{n} |p_i - q_i|^p \right)^{\frac{1}{p}}\)

  • A generalization of Euclidean and Manhattan distances. Use when you need to fine-tune the distance calculation by emphasizing different dimensions.
  • Parameterizable for different applications; adjust the parameter to control the impact of different dimensions.
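A sketch of the Minkowski distance above; the order parameter (written `r` here to avoid clashing with the point `p`) interpolates between the previous two metrics:

```python
import numpy as np

def minkowski_distance(u, v, r):
    # r = 1 recovers Manhattan distance, r = 2 recovers Euclidean distance
    return np.sum(np.abs(u - v) ** r) ** (1 / r)

u, v = np.array([0.0, 0.0]), np.array([3.0, 4.0])
d1 = minkowski_distance(u, v, 1)  # 7.0, Manhattan
d2 = minkowski_distance(u, v, 2)  # 5.0, Euclidean
```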

Mahalanobis distance

\[d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T S^{-1} (\mathbf{x} - \mathbf{y})}\]

  • Best for multivariate data where variables are correlated or scales differ.

  • Useful in identifying outliers or in clustering when data is not isotropic.

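A sketch of the Mahalanobis distance on synthetic correlated data; the data, mixing matrix, and seed are all illustrative:

```python
import numpy as np

def mahalanobis_distance(x, y, cov):
    # whitened distance: the inverse covariance corrects for correlation and scale
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.5], [0.5, 1.0]])  # correlated features
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
d = mahalanobis_distance(X[0], mu, cov)  # distance of one point from the center
```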

Clustering

Clustering methods

K-Means Clustering

The goal of K-Means is to minimize the variance within each cluster. The variance is measured as the sum of squared distances between each point and its corresponding cluster centroid. The objective function, which K-Means aims to minimize, can be defined as:

\(J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2\)

Where:

  • \(J\) is the objective function

  • \(k\) is the number of clusters

  • \(C_i\) is the set of points belonging to a cluster \(i\).

  • \(x\) is a point in the cluster \(C_i\)

  • \(||x - \mu_i||^2\) is the squared Euclidean distance between a point \(x\) and the centroid \(\mu_i\)​, which measures the dissimilarity between them.

  • Initialization: Randomly selects \(k\) initial centroids.

  • Assignment Step: Assigns each data point to the closest centroid based on Euclidean distance.

  • Update Step: Recalculates centroids as the mean of assigned points in each cluster.

  • Convergence: Iterates until the centroids stabilize (minimal change from one iteration to the next).

  • Objective: Minimizes the within-cluster sum of squares (WCSS), the sum of squared distances between points and their corresponding centroid.

  • Optimal \(k\): Determined experimentally, often using methods like the Elbow Method.

  • Sensitivity: Results can vary based on initial centroid selection; techniques like “k-means++” improve initial centroid choices.

  • Efficiency: Generally good, but worsens with increasing \(k\) and data dimensionality; sensitive to outliers.

K-Medians Clustering

\(\min \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||_1\)

  • \(k\) is the number of clusters.

  • \(C_i\)​ represents the data points in cluster \(i\).

  • \(x\) is a point within cluster \(C_i\)​.

  • \(m_i\)​ is the median of the data points in cluster \(i\), replacing the mean from K-Means.

  • \(||x - m_i||_1\) denotes the Manhattan distance (L1 norm) between point \(x\) and median \(m_i\).

  • Initialization: Randomly selects \(k\) initial medians.

  • Assignment Step: Assigns each data point to the closest median based on some distance metric, typically Manhattan distance.

  • Update Step: Recalculates medians as the median of assigned points in each cluster.

  • Convergence: Iterates until the medians stabilize (minimal change from one iteration to the next).

  • Objective: Minimizes the within-cluster sum of absolute deviations (WSAD), the sum of absolute differences between points and their corresponding median.

  • Optimal \(k\): Determined experimentally, often using methods like the Elbow Method.

  • Sensitivity: Results can vary based on initial median selection; techniques like “k-medians++” may improve initial choices.

  • Efficiency: Generally good, but can worsen with increasing \(k\) and data dimensionality; often more robust to outliers compared to k-means.

K-Means vs. K-Medians clustering

K-Means Clustering:

  • Groups data by minimizing the variance within clusters.

  • Adopts the mean as the cluster center.

  • Prone to the impact of outliers.

  • Effective in high-dimensional spaces and for “spherical” cluster shapes.

K-Medians Clustering:

  • Prioritizes the minimization of the sum of absolute deviations.

  • Adopts the median as the cluster center.

  • More robust to outliers than K-Means.

  • Best for non-spherical data, and more effective at handling skewed or distorted distributions.

Choosing the right number of clusters

Four main methods:

  • Elbow Method

    • Identifies the \(k\) at which the within-cluster sum of squares (WCSS) starts to diminish more slowly.
  • Silhouette Score

    • Measures how similar an object is to its own cluster compared to other clusters.
  • Davies-Bouldin Index

    • Evaluates intra-cluster similarity and inter-cluster differences.
  • Calinski-Harabasz Index (Variance Ratio Criterion)

    • Measures the ratio of between-cluster dispersion to within-cluster dispersion across all clusters.
  • BIC

    • Identifies the optimal number of clusters by penalizing models for excessive parameters, striking a balance between simplicity and accuracy.

Elbow Method

Pros:

  • Simple and easy to understand: Requires minimal statistical knowledge.

  • Clear graphical representation: Helps intuitively identify the optimal number of clusters.

  • Versatile: Applicable to various clustering algorithms.

Cons:

  • Subjective: The “elbow” point can be ambiguous, leading to different interpretations.

  • Not ideal for all datasets: Difficulty in identifying a clear elbow in datasets with gradual variance reduction.

  • Computationally expensive: For large datasets, calculating WCSS for many values of \(k\) can be resource-intensive.

  • Sensitive to initialization: The initial placement of centroids can influence the identification of the elbow point.
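The Elbow Method can be sketched as follows; `make_blobs` stands in for a real feature matrix, and the \(k\) range and seed are arbitrary:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with 4 underlying clusters, as a stand-in for real features
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

wcss = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # within-cluster sum of squares for this k

# plotting k against wcss reveals the "elbow" where the decrease levels off
```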

Silhouette Score

For 3 clusters, Silhouette Score: 0.553

\(s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\)

  • \(a\) is the mean distance between a sample and all other points in the same cluster.

  • \(b\) is the mean distance between a sample and all other points in the next nearest cluster.

Pros:

  • The score provides insight into the distance between the resulting clusters.

  • Values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

Cons:

  • Computationally expensive for large datasets.

  • Does not perform well with clusters of varying densities.

Davies-Bouldin Index

\(DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)\)

Where:

  • \(k\) is the number of clusters

  • \(\sigma_i\) is the average distance of all points in cluster \(i\) to the centroid of cluster \(i\) (intra-cluster distance)

  • \(d(c_i, c_j)\) is the distance between centroids \(i\) and \(j\)

  • The ratio \(\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\) reflects the similarity between clusters \(i\) and \(j\), with lower values indicating clusters are well-separated and compact.

Pros:

  • Intuitive: Easy to understand and interpret. A lower DBI value means better clustering.

  • Versatile: Applicable to any distance metric used within the clustering algorithm.

  • Useful for Comparing Models: Effective for comparing the performance of different clustering models on the same dataset.

Cons:

  • Sensitivity to Cluster Density: May not perform well with clusters of varying densities, as it relies on the mean distances within clusters.

  • Does Not Scale Well: Computationally expensive for large datasets due to the calculation of distances between all pairs of clusters.

  • Ambiguity in Interpretation: While lower values are better, there’s no clear threshold below which clusters are considered ‘good’ or ‘optimal’.
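A minimal sketch of computing the index with scikit-learn's `davies_bouldin_score`, again on illustrative `make_blobs` data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
dbi = davies_bouldin_score(X, labels)  # lower values indicate better-separated clusters
```

Computing this for several candidate \(k\) values and picking the minimum is a common model-selection loop.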

Calinski-Harabasz Index

\(CH = \frac{SS_B / (k - 1)}{SS_W / (n - k)}\)

where:

  • \(CH\) is the Calinski-Harabasz score.

  • \(SS_B\)​ is the between-cluster variance.

  • \(SS_W\)​ is the within-cluster variance.

  • \(k\) is the number of clusters.

  • \(n\) is the number of data points.

Pros:

  • Clear Interpretation: High values indicate better-defined clusters.

  • Computationally Efficient: Less resource-intensive than many alternatives.

  • Scale-Invariant: Effective across datasets of varying sizes.

  • No Labeled Data Required: Useful for unsupervised learning scenarios.

Cons:

  • Cluster Structure Bias: Prefers convex clusters of similar sizes.

  • Sample Size Sensitivity: Can favor more clusters in larger datasets.

  • Not Ideal for Overlapping Clusters: Assumes distinct, non-overlapping clusters.
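scikit-learn exposes this index as `calinski_harabasz_score`; a sketch on illustrative synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
ch = calinski_harabasz_score(X, labels)  # higher values indicate better-defined clusters
```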

BIC

\(\text{BIC} = -2 \ln(\hat{L}) + k \ln(n)\)

where:

  • \(\hat{L}\) is the maximized value of the likelihood function of the model,

  • \(k\) is the number of parameters in the model,

  • \(n\) is the number of observations.

Pros:

  • Penalizes Complexity: Helps avoid overfitting by penalizing models with more parameters.

  • Objective Selection: Facilitates choosing the model with the best balance between fit and simplicity.

  • Applicability: Useful across various model types, including clustering and regression.

Cons:

  • Computationally Intensive: Requires fitting multiple models to calculate, which can be resource-heavy.

  • Sensitivity to Model Assumptions: Performance depends on the underlying assumptions of the model being correct.

  • Not Always Intuitive: Determining the absolute best model may still require domain knowledge and additional diagnostics.
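For clustering, BIC is typically computed by fitting a Gaussian mixture for each candidate \(k\) and keeping the lowest value. A sketch using `GaussianMixture` (imported in the setup) on illustrative synthetic data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

bics = []
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, random_state=42).fit(X)
    bics.append(gmm.bic(X))  # lower BIC balances fit against model complexity

best_k = int(np.argmin(bics)) + 1  # candidate with the lowest BIC
```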

Systematic comparison: Equal clusters

Systematic comparison: Unequal clusters

Systematic comparison - accuracy

K-Means Clustering: applied

Code
# K-Means Clustering
kmeans = KMeans(n_clusters = 5, random_state = 0)  # Adjust n_clusters as needed
kmeans.fit(mlb_preprocessed_df)
clusters = kmeans.predict(mlb_preprocessed_df)

# Adding cluster labels to the DataFrame
mlb_preprocessed_df['Cluster'] = clusters

# Evaluate clustering performance
silhouette_avg = silhouette_score(mlb_preprocessed, clusters)
print("For n_clusters =", 5, f"The average silhouette_score is : {silhouette_avg:.3f}")
print("")

# Model Summary
print("Cluster Centers:\n", kmeans.cluster_centers_)
For n_clusters = 5 The average silhouette_score is : 0.361

Cluster Centers:
 [[ 9.80447852e-01  8.68398274e-01  6.85686523e-01  7.80094891e-01
   7.12100496e-01  3.88054976e-01  5.27932579e-01  6.95521803e-01
   6.66153895e-01  9.08848347e-01  3.11613648e-01  4.56004838e-01
   7.26773039e-01  7.81705387e-01  7.87473189e-01  8.05195721e-01
   3.33333333e-02  3.33333333e-02  5.00000000e-02  3.33333333e-02
   2.77777778e-02  1.66666667e-02  3.88888889e-02  2.77777778e-02
   2.77777778e-02  3.33333333e-02  2.77777778e-02  3.88888889e-02
   2.77777778e-02  1.11111111e-02  1.66666667e-02  3.88888889e-02
   5.55555556e-02  3.88888889e-02  3.33333333e-02  3.33333333e-02
   4.44444444e-02  3.33333333e-02  3.88888889e-02  2.77777778e-02
   5.55555556e-02  3.88888889e-02  3.33333333e-02  2.77777778e-02
   3.33333333e-02  2.22222222e-02  1.33333333e-01  1.11111111e-01
   1.16666667e-01  2.00000000e-01  1.16666667e-01  1.11111111e-02
   9.44444444e-02 -6.66133815e-16  1.33333333e-01  8.33333333e-02]
 [-6.43923908e-01 -6.68405672e-01 -6.24289850e-01 -6.44989752e-01
  -6.14049361e-01 -4.39676439e-01 -5.45490433e-01 -6.14666152e-01
  -5.88322957e-01 -6.56038799e-01 -3.86998693e-01 -4.25396567e-01
  -8.91273096e-01 -9.35527561e-01 -9.09541581e-01 -9.44343098e-01
   4.08858603e-02  4.94037479e-02  2.55536627e-02  3.23679727e-02
   4.25894378e-02  4.42930153e-02  2.55536627e-02  3.40715503e-02
   2.38500852e-02  4.25894378e-02  2.55536627e-02  2.55536627e-02
   2.89608177e-02  4.42930153e-02  3.06643952e-02  3.57751278e-02
   2.04429302e-02  4.25894378e-02  2.55536627e-02  3.23679727e-02
   3.06643952e-02  3.23679727e-02  3.57751278e-02  2.72572402e-02
   3.57751278e-02  3.91822828e-02  2.72572402e-02  2.72572402e-02
   3.23679727e-02  3.91822828e-02  1.70357751e-03  8.51788756e-03
   5.11073254e-03  1.19250426e-02  1.02214651e-02  8.67361738e-19
   1.70357751e-03  9.48892675e-01  6.81431005e-03  5.11073254e-03]
 [-3.91402474e-01 -3.79744257e-01 -3.95022196e-01 -3.90508911e-01
  -3.87210975e-01 -2.73851534e-01 -3.87795143e-01 -3.94376804e-01
  -3.70441099e-01 -3.43732560e-01 -2.64216595e-01 -2.85922460e-01
   7.27521486e-01  7.47585532e-01  6.23908840e-01  6.94635443e-01
   2.76073620e-02  3.37423313e-02  4.29447853e-02  2.14723926e-02
   3.06748466e-02  3.37423313e-02  3.06748466e-02  3.06748466e-02
   1.84049080e-02  2.45398773e-02  1.84049080e-02  3.68098160e-02
   4.60122699e-02  3.37423313e-02  5.82822086e-02  4.29447853e-02
   2.14723926e-02  4.60122699e-02  2.76073620e-02  3.06748466e-02
   3.06748466e-02  3.37423313e-02  3.06748466e-02  3.68098160e-02
   3.68098160e-02  3.06748466e-02  3.37423313e-02  3.37423313e-02
   3.98773006e-02  3.68098160e-02  3.68098160e-02  1.04294479e-01
   7.66871166e-02  1.99386503e-01  8.89570552e-02 -9.54097912e-18
   7.66871166e-02  2.57668712e-01  7.66871166e-02  8.28220859e-02]
 [ 1.84336601e+00  2.00093650e+00  2.01422924e+00  2.01298538e+00
   2.03251916e+00  9.00100073e-01  2.18403901e+00  2.20827125e+00
   2.01821145e+00  1.90441535e+00  5.34481046e-01  6.63047958e-01
   8.63620542e-01  9.34621852e-01  1.09476387e+00  1.05182158e+00
   5.64516129e-02  3.22580645e-02  1.61290323e-02  4.03225806e-02
   5.64516129e-02  3.22580645e-02  3.22580645e-02  2.41935484e-02
   2.41935484e-02  2.41935484e-02  5.64516129e-02  1.61290323e-02
   3.22580645e-02  8.06451613e-02  3.22580645e-02  4.03225806e-02
   2.41935484e-02  1.61290323e-02  4.03225806e-02  4.03225806e-02
   4.83870968e-02  1.61290323e-02  1.61290323e-02  3.22580645e-02
   1.61290323e-02  4.03225806e-02  1.61290323e-02  4.03225806e-02
   3.22580645e-02  2.41935484e-02  1.69354839e-01  8.87096774e-02
   1.77419355e-01  5.64516129e-02  8.06451613e-02  3.22580645e-02
   1.53225806e-01  8.06451613e-03  1.61290323e-01  7.25806452e-02]
 [ 1.89665173e+00  2.10798003e+00  2.30278073e+00  2.18655890e+00
   2.00881676e+00  3.13025217e+00  1.52406415e+00  1.70483602e+00
   1.81025384e+00  1.83795064e+00  3.60257994e+00  3.37018281e+00
   9.07494502e-01  9.21541805e-01  1.00020250e+00  9.90889402e-01
   1.88679245e-02  5.66037736e-02  1.88679245e-02  5.66037736e-02
   1.88679245e-02  3.77358491e-02  3.77358491e-02  5.66037736e-02
   7.54716981e-02  3.77358491e-02  1.88679245e-02  3.77358491e-02
   1.88679245e-02  1.88679245e-02 -2.08166817e-17  3.77358491e-02
   0.00000000e+00  3.77358491e-02  5.66037736e-02  1.88679245e-02
   1.88679245e-02  3.77358491e-02  5.66037736e-02  5.66037736e-02
   0.00000000e+00  0.00000000e+00  7.54716981e-02  1.88679245e-02
   1.88679245e-02  5.66037736e-02  1.88679245e-02  2.45283019e-01
   3.77358491e-02  1.38777878e-17  2.45283019e-01 -8.67361738e-19
   1.50943396e-01  1.11022302e-16  5.66037736e-02  2.45283019e-01]]
Code
pca = PCA(n_components = 2)
mlb_pca = pca.fit_transform(mlb_preprocessed)
sns.scatterplot(x = mlb_pca[:, 0], y = mlb_pca[:, 1], hue = clusters, alpha = 0.75, palette = "colorblind")
plt.title('MLB Players Clustered (PCA-reduced Features)')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend(title = 'Cluster')
plt.show()

Apply K-Medians Clustering (approximated with K-Medoids)

Code
from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

# Assuming 'mlb_preprocessed_df' is your DataFrame after preprocessing;
# drop any cluster labels left over from the K-Means run so they don't leak into the features
data = mlb_preprocessed_df.drop(columns = 'Cluster', errors = 'ignore').to_numpy()

# Create KMedoids instance with 5 clusters and Manhattan distance (L1)
kmedoids_instance = KMedoids(n_clusters = 5, metric = 'manhattan', random_state = 42)

# Fit the model and predict cluster labels
cluster_labels = kmedoids_instance.fit_predict(data)

# Assign cluster labels to each record in DataFrame
mlb_preprocessed_df['Cluster'] = cluster_labels

# Evaluate clustering performance using silhouette score
silhouette_avg = silhouette_score(data, cluster_labels)
print(f"For n_clusters = 5, The average silhouette_score is : {silhouette_avg:.3f}")

# Displaying the medoids (cluster centers)
# Note: KMedoids uses actual data points as centers, not the mean or median of the cluster.
print("Cluster Medians (Centers):\n", kmedoids_instance.cluster_centers_)
For n_clusters = 5, The average silhouette_score is : 0.231
Cluster Medians (Centers):
 [[-0.88452853 -0.70115086 -0.63347758 -0.65409806 -0.62071205 -0.43967644
  -0.54739892 -0.62224479 -0.59747025 -0.7263638  -0.38835719 -0.42635946
  -0.99984475 -1.09591463 -0.99490558 -1.0650998   0.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          1.        ]
 [ 0.95775308  0.81675479  0.63113462  0.72305121  0.61936006  0.87883378
   0.6973578   0.56662143  0.46674745  1.17649183  0.01036038  0.13885611
   0.71896746  0.7002357   0.87215709  0.81838665  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          1.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.
   0.          0.          0.        ]
 [-0.58415653 -0.56120211 -0.48469968 -0.55283709 -0.52532188 -0.43967644
  -0.42292325 -0.50719322 -0.40397612 -0.59204458 -0.38835719 -0.42635946
   0.36949942  0.85091946  0.58843678  0.71967702  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          2.        ]
 [ 1.93896828  1.93096213  1.63538549  1.79641755  1.76404201  1.53808889
   1.44421183  1.9855908   1.96632693  2.09433984  0.01036038  1.26928723
   0.76175947  0.85694681  0.87673323  0.890418    0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          1.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          1.          0.
   0.          0.          3.        ]
 [-0.58415653 -0.56120211 -0.63347758 -0.59334148 -0.62071205 -0.43967644
  -0.54739892 -0.62224479 -0.54909672 -0.43533882 -0.38835719 -0.42635946
  -0.17966465 -0.20386681 -0.46865017 -0.36079325  0.          0.
   0.          0.          0.          1.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          1.        ]]
Code
# Visualize using PCA for dimensionality reduction
pca = PCA(n_components = 2)
mlb_pca = pca.fit_transform(data)

# Plotting the clusters
sns.scatterplot(x = mlb_pca[:, 0], y = mlb_pca[:, 1], hue = cluster_labels, alpha = 0.75, palette = "colorblind")
plt.title('MLB Players Clustered (PCA-reduced Features)')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend(title = 'Cluster')
plt.show()

Conclusions

  • Unsupervised Learning: Explores data to find structure in the form of clusters without predefined labels or outcomes.

  • Clustering Use Cases: Includes recommender systems, anomaly detection, genetics, and customer segmentation.

  • Clustering Algorithms: K-Means and K-Medians are highlighted for their utility in grouping data based on similarity measures.

  • Choosing the Right Number of Clusters: Techniques like the Elbow Method, Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, and BIC are critical for determining the optimal cluster count.

  • Similarity and Dissimilarity Measures: Essential in clustering, with methods including Euclidean, Manhattan, Cosine, Jaccard, and Mahalanobis distances.

  • Evaluation Metrics: Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index help assess clustering quality, focusing on intra-cluster cohesion and inter-cluster separation.