Unsupervised Learning

Credit: Recro



Some use cases for clustering include:

  • Recommender systems:

    • Grouping together users with similar viewing patterns on Netflix, in order to recommend similar content
  • Anomaly detection:

    • Fraud detection, detecting defective mechanical parts
  • Genetics:

    • Clustering DNA patterns to analyze evolutionary biology
  • Customer segmentation:

    • Understanding different customer segments to devise marketing strategies


Can we identify distinct baseball player groupings based on their player stats in 2018?

Our data: MLB player stats

import pandas as pd

mlb_players_18 = pd.read_csv("data/mlb_players_18.csv", encoding = 'iso-8859-1')

name team position games AB R H doubles triples HR RBI walks strike_outs stolen_bases caught_stealing_base AVG OBP SLG OPS
0 Allard, K ATL P 3 1 1 1 0 0 0 0 0 0 0 0 1.0 1.0 1.0 2.0
1 Gibson, K MIN P 1 2 2 2 0 0 0 0 0 0 0 0 1.0 1.0 1.0 2.0
2 Law, D SF P 7 1 1 1 0 0 0 0 0 0 0 0 1.0 1.0 1.0 2.0
3 Nuno, V TB P 1 2 0 2 0 0 0 1 0 0 0 0 1.0 1.0 1.0 2.0
4 Romero, E KC P 4 1 1 1 1 0 0 0 0 0 0 0 1.0 1.0 2.0 3.0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1270 entries, 0 to 1269
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  1270 non-null   object 
 1   team                  1270 non-null   object 
 2   position              1270 non-null   object 
 3   games                 1270 non-null   int64  
 4   AB                    1270 non-null   int64  
 5   R                     1270 non-null   int64  
 6   H                     1270 non-null   int64  
 7   doubles               1270 non-null   int64  
 8   triples               1270 non-null   int64  
 9   HR                    1270 non-null   int64  
 10  RBI                   1270 non-null   int64  
 11  walks                 1270 non-null   int64  
 12  strike_outs           1270 non-null   int64  
 13  stolen_bases          1270 non-null   int64  
 14  caught_stealing_base  1270 non-null   int64  
 15  AVG                   1270 non-null   float64
 16  OBP                   1270 non-null   float64
 17  SLG                   1270 non-null   float64
 18  OPS                   1270 non-null   float64
dtypes: float64(4), int64(12), object(3)
memory usage: 188.6+ KB
games AB R H doubles triples HR RBI walks strike_outs stolen_bases caught_stealing_base AVG OBP SLG OPS
count 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000 1270.000000
mean 48.171654 130.261417 17.031496 32.297638 6.507087 0.666929 4.397638 16.225197 12.351181 32.446457 1.948031 0.754331 0.140191 0.181824 0.217412 0.399239
std 49.957749 185.855484 26.896304 49.396815 10.487391 1.517461 8.036863 26.085535 20.680606 44.687302 5.018058 1.769933 0.140268 0.165976 0.218611 0.374984
min 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 5.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 29.000000 23.500000 1.000000 3.000000 0.000000 0.000000 0.000000 1.000000 1.000000 8.000000 0.000000 0.000000 0.166000 0.217500 0.214000 0.436500
75% 79.750000 213.750000 27.000000 50.000000 10.000000 1.000000 5.000000 24.000000 18.000000 54.000000 1.000000 1.000000 0.247000 0.316000 0.395000 0.703000
max 162.000000 664.000000 129.000000 192.000000 51.000000 12.000000 48.000000 130.000000 130.000000 217.000000 45.000000 14.000000 1.000000 1.000000 2.000000 3.000000


from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define the columns based on their type for preprocessing
categorical_features = ['team', 'position']
numerical_features = ['games', 'AB', 'R', 'H', 'doubles', 'triples', 'HR', 'RBI', 'walks', 'strike_outs', 'stolen_bases', 'caught_stealing_base', 'AVG', 'OBP', 'SLG', 'OPS']
# Handling missing values: Impute missing values if any
# For numerical features, replace missing values with the median of the column
# For categorical features, replace missing values with the most frequent value of the column
numerical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'median')),
    ('scaler', StandardScaler())])

categorical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))])

preprocessor = ColumnTransformer(transformers = [
    ('num', numerical_transformer, numerical_features),
    ('cat', categorical_transformer, categorical_features)])
# Apply the transformations to the dataset
mlb_preprocessed = preprocessor.fit_transform(mlb_players_18)

# The result is a NumPy array. To convert it back to a DataFrame, recover the one-hot
# column names with get_feature_names_out (available in scikit-learn 1.0 and later)
feature_names = list(preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features))
new_columns = numerical_features + feature_names

mlb_preprocessed_df = pd.DataFrame(mlb_preprocessed, columns = new_columns)
games AB R H doubles triples HR RBI walks strike_outs ... position_1B position_2B position_3B position_C position_CF position_DH position_LF position_P position_RF position_SS
0 -0.904553 -0.695768 -0.596283 -0.633846 -0.620712 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 -0.944603 -0.690386 -0.559089 -0.613594 -0.620712 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 -0.824454 -0.695768 -0.596283 -0.633846 -0.620712 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
3 -0.944603 -0.690386 -0.633478 -0.613594 -0.620712 -0.439676 -0.547399 -0.583894 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 -0.884529 -0.695768 -0.596283 -0.633846 -0.525322 -0.439676 -0.547399 -0.622245 -0.59747 -0.726364 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 56 columns

Before moving on:
Similarity / Dissimilarity

Similarity + Dissimilarity


Cosine Similarity

\(\cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}}\)

  • Best for text data or any high-dimensional data.

  • Useful when the magnitude of the data vector is not important.

Jaccard Similarity

\(J(A, B) = \frac{|A \cap B|}{|A \cup B|}\)

  • Suitable for sets or binary data.

  • Ideal for comparing the similarity between two sample sets.

Pearson Correlation Coefficient

\(r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2}}\)

  • Use when measuring the linear relationship between two continuous variables.

  • Appropriate for data with a normal distribution.

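Spearman’s Rank Correlation Coefficient

\(\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}\)

  • Use for ordinal data or when data do not meet the assumptions of Pearson’s correlation.

  • Captures monotonic relationships between two continuous or ordinal variables; \(d_i\) is the difference between the two ranks of observation \(i\).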
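The four similarity measures above can be computed with NumPy and SciPy. The following is a minimal sketch on made-up toy vectors (the arrays `a`, `b`, `u`, `v` are illustrative assumptions, not the MLB data). Note that SciPy's `cosine` and `jaccard` functions return dissimilarities, so the similarity is one minus the returned value.

```python
import numpy as np
from scipy.spatial.distance import cosine, jaccard
from scipy.stats import pearsonr, spearmanr

# Toy continuous vectors (illustrative only)
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 9.0])

# Cosine similarity: SciPy returns the cosine *distance*, so subtract from 1
cos_sim = 1.0 - cosine(a, b)

# Jaccard similarity on boolean vectors: |A ∩ B| / |A ∪ B|
u = np.array([1, 1, 0, 1, 0], dtype=bool)
v = np.array([1, 0, 0, 1, 1], dtype=bool)
jaccard_sim = 1.0 - jaccard(u, v)

# Pearson (linear) and Spearman (rank-based) correlation coefficients
pearson_r, _ = pearsonr(a, b)
spearman_rho, _ = spearmanr(a, b)

print(f"cosine={cos_sim:.4f}  jaccard={jaccard_sim:.2f}  "
      f"pearson={pearson_r:.4f}  spearman={spearman_rho:.2f}")
```

Because `b` increases whenever `a` does, the Spearman coefficient is exactly 1 even though the relationship is not perfectly linear, which is the distinction between the two correlation measures.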


Euclidean Distance

\(d(\mathbf{p}, \mathbf{q}) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}\)

  • Use for continuous data to measure the “straight line” distance between points in Euclidean space.
  • Most common in clustering and classification where simple distance measurement is required.

Manhattan Distance

\(d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} |p_i - q_i|\)

  • Suitable for continuous or ordinal data where you want to measure the distance as if navigating a grid-like path (like city blocks).
  • Useful when the difference across dimensions is important regardless of the path taken.

Hamming Distance

\(d(\mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} \delta(p_i, q_i) \quad \text{where} \quad \delta(a, b) = \begin{cases} 1 & \text{if } a \neq b \\ 0 & \text{otherwise} \end{cases}\)

  • Use for categorical or binary data.
  • Ideal for comparing two strings of equal length or binary feature vectors.

Minkowski Distance

\(d(\mathbf{p}, \mathbf{q}) = \left( \sum_{i=1}^{n} |p_i - q_i|^r \right)^{\frac{1}{r}}\)

  • A generalization of the Manhattan (\(r = 1\)) and Euclidean (\(r = 2\)) distances. Use when you need to tune how strongly large per-dimension differences dominate the total distance.
  • Parameterizable for different applications; a larger order \(r\) gives more weight to the largest coordinate differences.

Mahalanobis Distance

\[d(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T S^{-1} (\mathbf{x} - \mathbf{y})}\]

  • \(S\) is the covariance matrix of the data.

  • Best for multivariate data where variables are correlated or scales differ.

  • Useful in identifying outliers or in clustering when data is not isotropic.
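The distance measures above are all available in `scipy.spatial.distance`. The sketch below uses small made-up vectors (`p_vec`, `q_vec`, `s`, `t`, and the random matrix `X` are illustrative assumptions). SciPy's `hamming` returns the fraction of differing positions, so it is multiplied by the vector length to match the count in the formula, and `mahalanobis` expects the inverse covariance matrix.

```python
import numpy as np
from scipy.spatial import distance

# Toy continuous vectors (illustrative only)
p_vec = np.array([1.0, 2.0, 3.0])
q_vec = np.array([4.0, 6.0, 3.0])

d_euclid = distance.euclidean(p_vec, q_vec)         # sqrt(9 + 16 + 0)
d_manhattan = distance.cityblock(p_vec, q_vec)      # 3 + 4 + 0
d_minkowski = distance.minkowski(p_vec, q_vec, p=3) # order-3 Minkowski distance

# Hamming: SciPy returns the *proportion* of mismatches, so scale by the length
s = np.array([1, 0, 1, 1, 0])
t = np.array([1, 1, 1, 0, 0])
d_hamming = distance.hamming(s, t) * len(s)

# Mahalanobis: needs the inverse covariance matrix estimated from data
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ mix                 # correlated toy data
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
d_mahal = distance.mahalanobis(X[0], X[1], S_inv)

print(d_euclid, d_manhattan, round(d_minkowski, 4), d_hamming, round(d_mahal, 4))
```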


Clustering methods

K-Means Clustering

The goal of K-Means is to minimize the variance within each cluster. The variance is measured as the sum of squared distances between each point and its corresponding cluster centroid. The objective function, which K-Means aims to minimize, can be defined as:

\(J = \sum_{i=1}^{k} \sum_{x \in C_i} ||x - \mu_i||^2\)


  • \(J\) is the objective function

  • \(k\) is the number of clusters

  • \(C_i\) is the set of points belonging to a cluster \(i\).

  • \(x\) is a point in the cluster \(C_i\)

  • \(||x - \mu_i||^2\) is the squared Euclidean distance between a point \(x\) and the centroid \(\mu_i\), which measures the dissimilarity between them.

  • Initialization: Randomly selects \(k\) initial centroids.

  • Assignment Step: Assigns each data point to the closest centroid based on Euclidean distance.

  • Update Step: Recalculates centroids as the mean of assigned points in each cluster.

  • Convergence: Iterates until the centroids stabilize (minimal change from one iteration to the next).

  • Objective: Minimizes the within-cluster sum of squares (WCSS), the sum of squared distances between points and their corresponding centroid.

  • Optimal \(k\): Determined experimentally, often using methods like the Elbow Method.

  • Sensitivity: Results can vary based on initial centroid selection; techniques like “k-means++” improve initial centroid choices.

  • Efficiency: Generally good, but worsens with increasing \(k\) and data dimensionality; sensitive to outliers.
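The initialization, assignment, update, and convergence steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans_sketch`, its parameters, and the two synthetic blobs are assumptions made for the example, and initialization is plain random selection rather than k-means++.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, tol=1e-8, seed=0):
    """Plain Lloyd-style K-Means; returns (labels, centroids, wcss)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of assigned points (an empty cluster keeps its centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence: stop once the centroids barely move
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    # Final assignment and objective J (within-cluster sum of squares)
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Two well-separated synthetic blobs (illustrative data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
labels, centroids, wcss = kmeans_sketch(X, k=2)
print(centroids.round(2), round(wcss, 2))
```

Running the function with several values of `seed` shows the sensitivity to initialization mentioned above: on harder data, different starting centroids can end in different local minima of \(J\).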

K-Medians Clustering

\(\min \sum_{i=1}^{k} \sum_{x \in C_i} ||x - m_i||_1\)

  • \(k\) is the number of clusters.

  • \(C_i\) represents the data points in cluster \(i\).

  • \(x\) is a point within cluster \(C_i\).

  • \(m_i\) is the median of the data points in cluster \(i\), replacing the mean from K-Means.

  • \(||x - m_i||_1\) denotes the Manhattan distance (L1 norm) between point \(x\) and median \(m_i\).

  • Initialization: Randomly selects \(k\) initial medians.

  • Assignment Step: Assigns each data point to the closest median based on some distance metric, typically Manhattan distance.

  • Update Step: Recalculates medians as the median of assigned points in each cluster.

  • Convergence: Iterates until the medians stabilize (minimal change from one iteration to the next).

  • Objective: Minimizes the within-cluster sum of absolute deviations (WSAD), the sum of absolute differences between points and their corresponding median.

  • Optimal \(k\): Determined experimentally, often using methods like the Elbow Method.

  • Sensitivity: Results can vary based on initial median selection; techniques like “k-medians++” may improve initial choices.

  • Efficiency: Generally good, but can worsen with increasing \(k\) and data dimensionality; often more robust to outliers compared to k-means.
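K-Medians follows the same loop as K-Means with two substitutions: Manhattan distance in the assignment step and the coordinate-wise median in the update step. The sketch below is a minimal NumPy illustration under those assumptions; the function name `kmedians_sketch` and the synthetic data are hypothetical and not taken from any library.

```python
import numpy as np

def kmedians_sketch(X, k, n_iter=100, tol=1e-8, seed=0):
    """Minimal K-Medians; returns (labels, medians, wsad)."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as starting medians
    medians = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest median by Manhattan (L1) distance
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: coordinate-wise median (an empty cluster keeps its center)
        new_medians = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j) else medians[j]
            for j in range(k)
        ])
        # Convergence: stop once the medians barely move
        converged = np.abs(new_medians - medians).sum() < tol
        medians = new_medians
        if converged:
            break
    # Final assignment and objective (within-cluster sum of absolute deviations)
    dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    wsad = float(np.abs(X - medians[labels]).sum())
    return labels, medians, wsad

# Two well-separated synthetic blobs (illustrative data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(10.0, 1.0, size=(50, 2))])
labels, medians, wsad = kmedians_sketch(X, k=2)
print(medians.round(2), round(wsad, 2))
```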

K-Means vs. K-Medians clustering

K-Means Clustering:

  • Groups data by minimizing the variance within clusters.

  • Adopts the mean as the cluster center.

  • Prone to the impact of outliers.

  • Effective for locations in high-dimensional spaces and for “spherical” cluster shapes.

K-Medians Clustering:

  • Prioritizes the minimization of the sum of absolute deviations.

  • Adopts the median as the cluster center.

  • More robust to outliers than K-Means.

  • Best for non-spherical data, and copes better with skewed distributions.
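The difference in outlier sensitivity comes directly from the choice of center. As a small worked example on made-up numbers, a single extreme point drags the mean far from the bulk of a cluster while the coordinate-wise median barely notices it:

```python
import numpy as np

# Four tightly grouped points plus one extreme outlier (illustrative values)
cluster = np.array([
    [1.0, 1.0],
    [1.2, 0.9],
    [0.8, 1.1],
    [1.1, 1.0],
    [25.0, 30.0],   # outlier
])

mean_center = cluster.mean(axis=0)          # center used by K-Means
median_center = np.median(cluster, axis=0)  # center used by K-Medians

print("mean:  ", mean_center)    # pulled toward the outlier: [5.82, 6.8]
print("median:", median_center)  # stays with the bulk: [1.1, 1.0]
```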

Choosing the right number of clusters

Five main methods:

  • Elbow Method

    • Identifies the \(k\) at which the within-cluster sum of squares (WCSS) starts to diminish more slowly.
  • Silhouette Score

    • Measures how similar an object is to its own cluster compared to other clusters.
  • Davies-Bouldin Index

    • Evaluates intra-cluster similarity and inter-cluster differences.
  • Calinski-Harabasz Index (Variance Ratio Criterion)

    • Measures the ratio of the sum of between-clusters dispersion and of intra-cluster dispersion for all clusters.
  • BIC

    • Identifies the optimal number of clusters by penalizing models for excessive parameters, striking a balance between simplicity and accuracy.

Elbow Method


  • Simple and easy to understand: Requires minimal statistical knowledge.

  • Clear graphical representation: Helps intuitively identify the optimal number of clusters.

  • Versatile: Applicable to various clustering algorithms.


  • Subjective: The “elbow” point can be ambiguous, leading to different interpretations.

  • Not ideal for all datasets: Difficulty in identifying a clear elbow in datasets with gradual variance reduction.

  • Computationally expensive: For large datasets, calculating WCSS for many values of \(k\) can be resource-intensive.

  • Sensitive to initialization: The initial placement of centroids can influence the identification of the elbow point.
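A typical way to apply the Elbow Method is to fit K-Means for a range of \(k\) and record the WCSS, which scikit-learn exposes as the `inertia_` attribute. The sketch below uses synthetic data from `make_blobs` as a stand-in (an assumption for illustration; the same loop could be run on the preprocessed MLB features).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four generating centers (illustrative only)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)

ks = list(range(1, 9))
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)   # within-cluster sum of squares for this k

# Inspect where the decrease in WCSS starts to flatten out
for k, w in zip(ks, wcss):
    print(f"k={k}: WCSS={w:.1f}")
```

Plotting `wcss` against `ks` gives the usual elbow chart; the "elbow" is the \(k\) after which additional clusters reduce WCSS only marginally.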

Silhouette Score

For 3 clusters, Silhouette Score: 0.553

\(s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}\)

  • \(a(i)\) is the mean distance between sample \(i\) and all other points in the same cluster.

  • \(b(i)\) is the mean distance between sample \(i\) and all points in the nearest neighboring cluster.


  • The score provides insight into the distance between the resulting clusters.

  • Values range from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.


  • Computationally expensive for large datasets.

  • Does not perform well with clusters of varying densities.
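To make the formula concrete, the sketch below computes \(s(i)\) by hand for a single sample and compares it with scikit-learn's `silhouette_samples`. The data are synthetic blobs and the choice of sample `i = 0` is arbitrary; both are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_samples, silhouette_score

# Synthetic data and a K-Means labeling (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.7, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

D = pairwise_distances(X)   # Euclidean distance between every pair of samples
i = 0                       # sample to inspect

# a(i): mean distance to the *other* points in the same cluster
own = labels == labels[i]
a_i = D[i, own].sum() / (own.sum() - 1)   # the zero self-distance is excluded

# b(i): smallest mean distance to the points of any other cluster
b_i = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])

s_i = (b_i - a_i) / max(a_i, b_i)

print(f"manual s(i) = {s_i:.4f}")
print(f"sklearn s(i) = {silhouette_samples(X, labels)[i]:.4f}")
print(f"average silhouette = {silhouette_score(X, labels):.4f}")
```

The overall silhouette score is simply the mean of \(s(i)\) over all samples, which is why it requires all pairwise distances and becomes expensive on large datasets.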

Davies-Bouldin Index

\(DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)\)


  • \(k\) is the number of clusters

  • \(\sigma_i\) is the average distance of all points in cluster \(i\) to the centroid of cluster \(i\) (intra-cluster distance)

  • \(d(c_i, c_j)\) is the distance between centroids \(i\) and \(j\)

  • The ratio \(\frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\) reflects the similarity between clusters \(i\) and \(j\), with lower values indicating clusters are well-separated and compact.


  • Intuitive: Easy to understand and interpret. A lower DBI value means better clustering.

  • Versatile: Applicable to any distance metric used within the clustering algorithm.

  • Useful for Comparing Models: Effective for comparing the performance of different clustering models on the same dataset.


  • Sensitivity to Cluster Density: May not perform well with clusters of varying densities, as it relies on the mean distances within clusters.

  • Does Not Scale Well: Computationally expensive for large datasets due to the calculation of distances between all pairs of clusters.

  • Ambiguity in Interpretation: While lower values are better, there’s no clear threshold below which clusters are considered ‘good’ or ‘optimal’.
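The index can be computed directly from the definition and checked against scikit-learn's `davies_bouldin_score`. The following is a minimal sketch on synthetic blobs (the data and variable names are assumptions for illustration), using Euclidean distance for both \(\sigma_i\) and \(d(c_i, c_j)\).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data and a K-Means labeling (illustrative only)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=2)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

clusters = np.unique(labels)
k = len(clusters)

# Centroid of each cluster and sigma_i = mean distance of its points to that centroid
centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
sigma = np.array([
    np.linalg.norm(X[labels == c] - centroids[n], axis=1).mean()
    for n, c in enumerate(clusters)
])

# For each cluster take the worst (largest) similarity ratio, then average
dbi = np.mean([
    max((sigma[i] + sigma[j]) / np.linalg.norm(centroids[i] - centroids[j])
        for j in range(k) if j != i)
    for i in range(k)
])

print(f"manual DBI  = {dbi:.4f}")
print(f"sklearn DBI = {davies_bouldin_score(X, labels):.4f}")
```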

Calinski-Harabasz Index

\(CH = \frac{SS_B / (k - 1)}{SS_W / (n - k)}\)


  • \(CH\) is the Calinski-Harabasz score.

  • \(SS_B\) is the between-cluster sum of squares (dispersion).

  • \(SS_W\) is the within-cluster sum of squares (dispersion).

  • \(k\) is the number of clusters.

  • \(n\) is the number of data points.


  • Clear Interpretation: High values indicate better-defined clusters.

  • Computationally Efficient: Less resource-intensive than many alternatives.

  • Scale-Invariant: Effective across datasets of varying sizes.

  • No Labeled Data Required: Useful for unsupervised learning scenarios.


  • Cluster Structure Bias: Prefers convex clusters of similar sizes.

  • Sample Size Sensitivity: Can favor more clusters in larger datasets.

  • Not Ideal for Overlapping Clusters: Assumes distinct, non-overlapping clusters.
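The sketch below evaluates the formula term by term and compares the result with scikit-learn's `calinski_harabasz_score`. It assumes synthetic blob data and takes \(SS_B\) as the size-weighted squared distance of each centroid to the overall mean and \(SS_W\) as the squared distance of each point to its own centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data and a K-Means labeling (illustrative only)
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.8, random_state=3)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

n = len(X)
clusters = np.unique(labels)
k = len(clusters)
overall_mean = X.mean(axis=0)

ss_b = 0.0  # between-cluster sum of squares
ss_w = 0.0  # within-cluster sum of squares
for c in clusters:
    members = X[labels == c]
    centroid = members.mean(axis=0)
    ss_b += len(members) * np.sum((centroid - overall_mean) ** 2)
    ss_w += np.sum((members - centroid) ** 2)

ch = (ss_b / (k - 1)) / (ss_w / (n - k))

print(f"manual CH  = {ch:.2f}")
print(f"sklearn CH = {calinski_harabasz_score(X, labels):.2f}")
```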


BIC

\(\text{BIC} = -2 \ln(\hat{L}) + k \ln(n)\)


  • \(\hat{L}\) is the maximized value of the likelihood function of the model,

  • \(k\) is the number of parameters in the model,

  • \(n\) is the number of observations.


  • Penalizes Complexity: Helps avoid overfitting by penalizing models with more parameters.

  • Objective Selection: Facilitates choosing the model with the best balance between fit and simplicity.

  • Applicability: Useful across various model types, including clustering and regression.


  • Computationally Intensive: Requires fitting multiple models to calculate, which can be resource-heavy.

  • Sensitivity to Model Assumptions: Performance depends on the underlying assumptions of the model being correct.

  • Not Always Intuitive: Determining the absolute best model may still require domain knowledge and additional diagnostics.
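BIC needs a likelihood, so in a clustering setting it is usually paired with a probabilistic model. The sketch below assumes a Gaussian mixture model (scikit-learn's `GaussianMixture`) on synthetic blobs, which is a choice made for this illustration and not something K-Means itself provides. It rebuilds the BIC from the formula above and checks it against the model's own `bic` method; the parameter count is written as `n_params` to avoid confusion with the number of clusters.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three generating centers (illustrative only)
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=42)
n, d = X.shape

bics = {}
for n_comp in range(1, 7):
    gmm = GaussianMixture(n_components=n_comp, covariance_type="full",
                          random_state=0).fit(X)
    # Free parameters of a full-covariance mixture: covariances + means + weights
    n_params = n_comp * d * (d + 1) // 2 + n_comp * d + (n_comp - 1)
    log_lik = gmm.score(X) * n   # score() is the mean log-likelihood per sample
    manual_bic = -2.0 * log_lik + n_params * np.log(n)
    assert np.isclose(manual_bic, gmm.bic(X))
    bics[n_comp] = manual_bic

# Lower BIC is better: pick the component count with the smallest value
best_k = min(bics, key=bics.get)
for n_comp, value in bics.items():
    print(f"components={n_comp}: BIC={value:.1f}")
print("lowest BIC at components =", best_k)
```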

Systematic comparison: Equal clusters

Systematic comparison: Unequal clusters

Systematic comparison: Accuracy

K-Means Clustering: applied

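# K-Means Clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters = 5, random_state = 0)  # Adjust n_clusters as needed
# Fit the model and assign each player to a cluster
clusters = kmeans.fit_predict(mlb_preprocessed_df)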

# Adding cluster labels to the DataFrame
mlb_preprocessed_df['Cluster'] = clusters

# Evaluate clustering performance
silhouette_avg = silhouette_score(mlb_preprocessed, clusters)
print("For n_clusters =", 5, f"The average silhouette_score is : {silhouette_avg:.3f}")

# Model Summary
print("Cluster Centers:\n", kmeans.cluster_centers_)
For n_clusters = 5 The average silhouette_score is : 0.361

Cluster Centers:
 [[ 9.80447852e-01  8.68398274e-01  6.85686523e-01  7.80094891e-01
   7.12100496e-01  3.88054976e-01  5.27932579e-01  6.95521803e-01
   6.66153895e-01  9.08848347e-01  3.11613648e-01  4.56004838e-01
   7.26773039e-01  7.81705387e-01  7.87473189e-01  8.05195721e-01
   3.33333333e-02  3.33333333e-02  5.00000000e-02  3.33333333e-02
   2.77777778e-02  1.66666667e-02  3.88888889e-02  2.77777778e-02
   2.77777778e-02  3.33333333e-02  2.77777778e-02  3.88888889e-02
   2.77777778e-02  1.11111111e-02  1.66666667e-02  3.88888889e-02
   5.55555556e-02  3.88888889e-02  3.33333333e-02  3.33333333e-02
   4.44444444e-02  3.33333333e-02  3.88888889e-02  2.77777778e-02
   5.55555556e-02  3.88888889e-02  3.33333333e-02  2.77777778e-02
   3.33333333e-02  2.22222222e-02  1.33333333e-01  1.11111111e-01
   1.16666667e-01  2.00000000e-01  1.16666667e-01  1.11111111e-02
   9.44444444e-02 -6.66133815e-16  1.33333333e-01  8.33333333e-02]
 [-6.43923908e-01 -6.68405672e-01 -6.24289850e-01 -6.44989752e-01
  -6.14049361e-01 -4.39676439e-01 -5.45490433e-01 -6.14666152e-01
  -5.88322957e-01 -6.56038799e-01 -3.86998693e-01 -4.25396567e-01
  -8.91273096e-01 -9.35527561e-01 -9.09541581e-01 -9.44343098e-01
   4.08858603e-02  4.94037479e-02  2.55536627e-02  3.23679727e-02
   4.25894378e-02  4.42930153e-02  2.55536627e-02  3.40715503e-02
   2.38500852e-02  4.25894378e-02  2.55536627e-02  2.55536627e-02
   2.89608177e-02  4.42930153e-02  3.06643952e-02  3.57751278e-02
   2.04429302e-02  4.25894378e-02  2.55536627e-02  3.23679727e-02
   3.06643952e-02  3.23679727e-02  3.57751278e-02  2.72572402e-02
   3.57751278e-02  3.91822828e-02  2.72572402e-02  2.72572402e-02
   3.23679727e-02  3.91822828e-02  1.70357751e-03  8.51788756e-03
   5.11073254e-03  1.19250426e-02  1.02214651e-02  8.67361738e-19
   1.70357751e-03  9.48892675e-01  6.81431005e-03  5.11073254e-03]
 [-3.91402474e-01 -3.79744257e-01 -3.95022196e-01 -3.90508911e-01
  -3.87210975e-01 -2.73851534e-01 -3.87795143e-01 -3.94376804e-01
  -3.70441099e-01 -3.43732560e-01 -2.64216595e-01 -2.85922460e-01
   7.27521486e-01  7.47585532e-01  6.23908840e-01  6.94635443e-01
   2.76073620e-02  3.37423313e-02  4.29447853e-02  2.14723926e-02
   3.06748466e-02  3.37423313e-02  3.06748466e-02  3.06748466e-02
   1.84049080e-02  2.45398773e-02  1.84049080e-02  3.68098160e-02
   4.60122699e-02  3.37423313e-02  5.82822086e-02  4.29447853e-02
   2.14723926e-02  4.60122699e-02  2.76073620e-02  3.06748466e-02
   3.06748466e-02  3.37423313e-02  3.06748466e-02  3.68098160e-02
   3.68098160e-02  3.06748466e-02  3.37423313e-02  3.37423313e-02
   3.98773006e-02  3.68098160e-02  3.68098160e-02  1.04294479e-01
   7.66871166e-02  1.99386503e-01  8.89570552e-02 -1.04083409e-17
   7.66871166e-02  2.57668712e-01  7.66871166e-02  8.28220859e-02]
 [ 1.84336601e+00  2.00093650e+00  2.01422924e+00  2.01298538e+00
   2.03251916e+00  9.00100073e-01  2.18403901e+00  2.20827125e+00
   2.01821145e+00  1.90441535e+00  5.34481046e-01  6.63047958e-01
   8.63620542e-01  9.34621852e-01  1.09476387e+00  1.05182158e+00
   5.64516129e-02  3.22580645e-02  1.61290323e-02  4.03225806e-02
   5.64516129e-02  3.22580645e-02  3.22580645e-02  2.41935484e-02
   2.41935484e-02  2.41935484e-02  5.64516129e-02  1.61290323e-02
   3.22580645e-02  8.06451613e-02  3.22580645e-02  4.03225806e-02
   2.41935484e-02  1.61290323e-02  4.03225806e-02  4.03225806e-02
   4.83870968e-02  1.61290323e-02  1.61290323e-02  3.22580645e-02
   1.61290323e-02  4.03225806e-02  1.61290323e-02  4.03225806e-02
   3.22580645e-02  2.41935484e-02  1.69354839e-01  8.87096774e-02
   1.77419355e-01  5.64516129e-02  8.06451613e-02  3.22580645e-02
   1.53225806e-01  8.06451613e-03  1.61290323e-01  7.25806452e-02]
 [ 1.89665173e+00  2.10798003e+00  2.30278073e+00  2.18655890e+00
   2.00881676e+00  3.13025217e+00  1.52406415e+00  1.70483602e+00
   1.81025384e+00  1.83795064e+00  3.60257994e+00  3.37018281e+00
   9.07494502e-01  9.21541805e-01  1.00020250e+00  9.90889402e-01
   1.88679245e-02  5.66037736e-02  1.88679245e-02  5.66037736e-02
   1.88679245e-02  3.77358491e-02  3.77358491e-02  5.66037736e-02
   7.54716981e-02  3.77358491e-02  1.88679245e-02  3.77358491e-02
   1.88679245e-02  1.88679245e-02 -2.08166817e-17  3.77358491e-02
   0.00000000e+00  3.77358491e-02  5.66037736e-02  1.88679245e-02
   1.88679245e-02  3.77358491e-02  5.66037736e-02  5.66037736e-02
   0.00000000e+00  0.00000000e+00  7.54716981e-02  1.88679245e-02
   1.88679245e-02  5.66037736e-02  1.88679245e-02  2.45283019e-01
   3.77358491e-02  1.38777878e-17  2.45283019e-01 -8.67361738e-19
   1.50943396e-01  1.11022302e-16  5.66037736e-02  2.45283019e-01]]
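# Project the preprocessed features onto two principal components for plotting
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns

pca = PCA(n_components = 2)
mlb_pca = pca.fit_transform(mlb_preprocessed)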
sns.scatterplot(x = mlb_pca[:, 0], y = mlb_pca[:, 1], hue = clusters, alpha = 0.75, palette = "colorblind")
plt.title('MLB Players Clustered (PCA-reduced Features)')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend(title = 'Cluster')

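Apply K-Medians Clustering

The code below uses KMedoids from scikit-learn-extra with Manhattan distance as a stand-in for K-Medians: its cluster centers are actual data points (medoids) rather than coordinate-wise medians.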

from sklearn_extra.cluster import KMedoids
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

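# Assuming 'mlb_preprocessed_df' is your DataFrame after preprocessing.
# Note: it still carries the 'Cluster' column added in the K-Means step, so that column is
# included as a feature here; use mlb_preprocessed_df.drop(columns = 'Cluster') to exclude it.
data = mlb_preprocessed_df.to_numpy()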

# Create KMedoids instance with 5 clusters and Manhattan distance (L1)
kmedoids_instance = KMedoids(n_clusters = 5, metric = 'manhattan', random_state = 42)

# Fit the model and predict cluster labels
cluster_labels = kmedoids_instance.fit_predict(data)

# Assign cluster labels to each record in DataFrame
mlb_preprocessed_df['Cluster'] = cluster_labels

# Evaluate clustering performance using silhouette score
silhouette_avg = silhouette_score(data, cluster_labels)
print(f"For n_clusters = 5, The average silhouette_score is : {silhouette_avg:.3f}")

# Display the cluster centers
# Note: KMedoids uses actual data points (medoids) as centers, not the mean or median of the cluster.
print("Cluster Medians (Centers):\n", kmedoids_instance.cluster_centers_)
For n_clusters = 5, The average silhouette_score is : 0.231
Cluster Medians (Centers):
 [[-0.88452853 -0.70115086 -0.63347758 -0.65409806 -0.62071205 -0.43967644
  -0.54739892 -0.62224479 -0.59747025 -0.7263638  -0.38835719 -0.42635946
  -0.99984475 -1.09591463 -0.99490558 -1.0650998   0.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          1.        ]
 [ 0.95775308  0.81675479  0.63113462  0.72305121  0.61936006  0.87883378
   0.6973578   0.56662143  0.46674745  1.17649183  0.01036038  0.13885611
   0.71896746  0.7002357   0.87215709  0.81838665  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          1.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.
   0.          0.          0.        ]
 [-0.58415653 -0.56120211 -0.48469968 -0.55283709 -0.52532188 -0.43967644
  -0.42292325 -0.50719322 -0.40397612 -0.59204458 -0.38835719 -0.42635946
   0.36949942  0.85091946  0.58843678  0.71967702  0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          1.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          2.        ]
 [ 1.93896828  1.93096213  1.63538549  1.79641755  1.76404201  1.53808889
   1.44421183  1.9855908   1.96632693  2.09433984  0.01036038  1.26928723
   0.76175947  0.85694681  0.87673323  0.890418    0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          1.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          1.          0.
   0.          0.          3.        ]
 [-0.58415653 -0.56120211 -0.63347758 -0.59334148 -0.62071205 -0.43967644
  -0.54739892 -0.62224479 -0.54909672 -0.43533882 -0.38835719 -0.42635946
  -0.17966465 -0.20386681 -0.46865017 -0.36079325  0.          0.
   0.          0.          0.          1.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.
   0.          0.          1.        ]]
# Visualize using PCA for dimensionality reduction
pca = PCA(n_components = 2)
mlb_pca = pca.fit_transform(data)

# Plotting the clusters
sns.scatterplot(x = mlb_pca[:, 0], y = mlb_pca[:, 1], hue = cluster_labels, alpha = 0.75, palette = "colorblind")
plt.title('MLB Players Clustered (PCA-reduced Features)')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
plt.legend(title = 'Cluster')


  • Unsupervised Learning: Explores data to find structure in the form of clusters without predefined labels or outcomes.

  • Clustering Use Cases: Includes recommender systems, anomaly detection, genetics, and customer segmentation.

  • Clustering Algorithms: K-Means and K-Medians are highlighted for their utility in grouping data based on similarity measures.

  • Choosing the Right Number of Clusters: Techniques like the Elbow Method, Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Index, and BIC are critical for determining the optimal cluster count.

  • Similarity and Dissimilarity Measures: Essential in clustering, with methods including Euclidean, Manhattan, Cosine, Jaccard, and Mahalanobis distances.

  • Evaluation Metrics: Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Index help assess clustering quality, focusing on intra-cluster cohesion and inter-cluster separation.