Explaining unsupervised models
1. Explaining unsupervised models
Time to explore explainability in unsupervised models, focusing on clustering.
2. Clustering
Clustering algorithms like K-means group similar data points into clusters without predefined labels, each defined by a centroid. When explaining clustering, two key questions arise: which features have the greatest impact on clustering quality, and which ones influence cluster assignments the most?
3. Silhouette score
The silhouette score evaluates clustering quality by measuring the separation between clusters, ranging from -1 to 1. A score close to 1 signifies well-separated clusters,
4. Silhouette score
while a score close to -1 suggests potential misassignments.
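As a minimal sketch that is not part of the lesson, silhouette_score from scikit-learn computes this directly from the data and the predicted labels; the toy blobs below are made up purely for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Two well-separated toy blobs (illustrative data only)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_toy)
print(silhouette_score(X_toy, labels))  # close to 1 for well-separated clusters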
5. Feature impact on cluster quality
To assess feature impact on clustering quality, we initially train a clustering model on a dataset,
6. Feature impact on cluster quality
then remove features one at a time and retrain the model. With two features, for example, removing one leaves the samples on a single axis, which we then cluster.
7. Feature impact on cluster quality
Each feature's impact is determined by the change in silhouette score: if removing a feature lowers the score, that feature has a positive impact, meaning it was improving cluster quality. Conversely, if the score increases, the feature has a negative impact, suggesting it was adding noise and might be worth removing.
8. Student Performance dataset
Let’s apply this to a dataset where we want to cluster students based on features like age, health status, absences, and three grades. These features are in array X.
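As a minimal setup sketch (not shown in the lesson), X might be assembled like this; the file name student_performance.csv and the column names age, health, absences, G1, G2, and G3 are assumptions about how the data is stored.

import pandas as pd

# Assumed file and column names for the Student Performance data
df = pd.read_csv("student_performance.csv")
features = ["age", "health", "absences", "G1", "G2", "G3"]
X = df[features].values  # feature matrix used for clustering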
9. Computing feature impact on cluster quality
We import KMeans from sklearn.cluster and silhouette_score from sklearn.metrics. We fit a k-means model with two clusters, as specified by the n_clusters parameter, and compute the silhouette score using the silhouette_score function, which takes the feature matrix X and the predicted cluster labels as inputs. Next, we iterate over each feature in the dataset. For each iteration, we remove one feature with np.delete, which removes from X the feature at index i along the specified axis. We refit the model, recalculate the silhouette score, and compute the impact as the difference between the original_score and the new_score after feature removal.
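A sketch of what that loop might look like, assuming the X and features defined above, two clusters as in the lesson, and a fixed random_state added here for reproducibility:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Baseline: cluster on all features and score the result
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
original_score = silhouette_score(X, kmeans.fit_predict(X))

for i, name in enumerate(features):
    # Remove feature i (column i of X) and re-cluster
    X_reduced = np.delete(X, i, axis=1)
    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_reduced)
    new_score = silhouette_score(X_reduced, labels)

    # Positive impact: removing the feature hurts cluster quality
    impact = original_score - new_score
    print(f"{name}: {impact:.3f}")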
10. Computing feature impact on cluster quality
We find that age, health status, and absences improve cluster quality, while grades degrade it, suggesting they could be candidates for removal.
11. Adjusted Rand index (ARI)
While the silhouette score focuses on overall clustering quality, the adjusted Rand index, or ARI, compares how similar cluster assignments are between two clustering scenarios, assessing how well they match. The maximum ARI value is 1, indicating perfect alignment between clusterings, where all data points are assigned to the same clusters.
12. Adjusted Rand index (ARI)
The lower the ARI, the greater the difference between the clusterings, suggesting that cluster assignment for each point is changing.
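As a quick illustration that is not part of the lesson, ARI only cares about which points are grouped together, not about the label values themselves; the toy labelings below are made up:

from sklearn.metrics import adjusted_rand_score

a = [0, 0, 1, 1, 2, 2]
b = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
c = [0, 1, 0, 1, 2, 2]  # two points regrouped

print(adjusted_rand_score(a, b))  # 1.0: identical clusterings
print(adjusted_rand_score(a, c))  # below 1: assignments differ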
13. Feature importance for cluster assignments
To evaluate feature importance for cluster assignments, we may remove features one at a time and measure the change in clustering using 1 minus the ARI. A lower ARI, and thus a higher 1 - ARI, indicates the feature is important for cluster assignments.
14. Feature importance for cluster assignment
We import adjusted_rand_score from sklearn.metrics. We fit the k-means model and compute the original_clusters using kmeans.predict. Next, we iterate over the features, removing each one with np.delete, and generate reduced_clusters using .fit_predict, which both fits the model and predicts the clusters. Finally, we calculate the feature importance as 1 - adjusted_rand_score of the original and reduced clusterings. We find that features like health status and grades G2 and G3 are key for cluster assignments, while age and absences have little impact.
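A sketch of this second loop, again assuming the X, features, and KMeans settings from the earlier snippets:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Baseline cluster assignments using all features
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
original_clusters = kmeans.predict(X)

for i, name in enumerate(features):
    # Re-cluster without feature i
    X_reduced = np.delete(X, i, axis=1)
    reduced_clusters = KMeans(n_clusters=2, n_init=10,
                              random_state=42).fit_predict(X_reduced)

    # High 1 - ARI means assignments changed a lot, so the feature matters
    importance = 1 - adjusted_rand_score(original_clusters, reduced_clusters)
    print(f"{name}: {importance:.3f}")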
15. Let's practice!
Time to practice!