Clustering and cluster models
1. Clustering and cluster models
In this video, we will learn how to cluster discrete-event model results to better understand model behavior.

2. Histograms of model results
Exploring model results helps identify tipping points and bottlenecks that can support process optimization. Histograms are one way to reveal clusters in model results. A histogram is a graph that shows the frequency distribution of the model results, giving the number of observations that fall within given intervals. Histograms are available through the Pyplot module from Matplotlib. The example shows how to use the plt-dot-hist method to plot a variable named "data" within 50 data intervals, or bins. The plot shows two Gaussian distributions with averages around 40 and 60 hours, which suggests two clusters of data. Let's explore these clusters further.
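As a minimal sketch of that call, assuming an illustrative "data" array with two Gaussian groups of durations (not the course's dataset):

import numpy as np
import matplotlib.pyplot as plt

# Illustrative stand-in for the model results: two Gaussian groups of
# durations centered near 40 and 60 hours (assumed values)
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(40, 3, 500), rng.normal(60, 3, 500)])

# Frequency distribution of the results, split into 50 bins
plt.hist(data, bins=50)
plt.xlabel("Duration (hours)")
plt.ylabel("Frequency")
plt.show()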
3. Cluster analysis and application to models

Cluster analysis consists of grouping objects so that those in the same group (or cluster) are more similar to each other than to those in other groups. Cluster analysis is used across many fields, including pattern recognition, image analysis, data compression, computer graphics, and machine learning. In discrete-event models, it can help identify patterns in model outputs, providing more actionable information.
4. k-means clustering

There are several types of clustering models. We will focus on k-means clustering, which is a centroid model. k-means clustering partitions observations into a pre-defined number of clusters, where each data point belongs to the cluster with the nearest mean, called the cluster centroid.
5. k-means clustering with SciPy

k-means clustering is available in the SciPy package through the scipy-dot-cluster-dot-vq-dot-kmeans method. The code below shows how to use this method, where the variable "obs" is a NumPy array. It returns the cluster centroids and the distortion, which is the mean distance between the data points and their nearest centroids.
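A sketch of the call with synthetic observations (the data here is an illustrative assumption):

import numpy as np
from scipy.cluster.vq import kmeans

# Illustrative observations: one row per model run, one column per feature
# (whitening, covered next, is recommended before this step)
rng = np.random.default_rng(0)
obs = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Partition the observations into 2 clusters
centroids, distortion = kmeans(obs, 2)
print(centroids)   # one centroid per row
print(distortion)  # mean distance between the points and their nearest centroid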
6. Data whitening: Decorrelation and rescaling

Before running the k-means method, it is important to decorrelate and rescale the data features so that the model converges faster. The plots show this process, with the left panel showing the correlated raw data, the center panel showing the decorrelated data, and the right panel showing the whitened data. As can be seen, the clusters become more apparent after the data has been whitened. Data whitening is available in the SciPy package through the scipy-dot-cluster-dot-vq-dot-whiten method. The code shows how to use this method, where the variable "obs" is a NumPy array.
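A minimal sketch of the call (the raw observations and their scales are assumptions):

import numpy as np
from scipy.cluster.vq import whiten

# Illustrative raw observations with very different feature scales
rng = np.random.default_rng(0)
obs = np.column_stack([rng.normal(50, 10, 200),     # e.g., duration in hours
                       rng.normal(5000, 800, 200)]) # e.g., cost

# whiten divides each feature (column) by its standard deviation
obs_whitened = whiten(obs)
print(obs_whitened.std(axis=0))  # each column now has unit standard deviation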
7. Example of whitening and k-means

Let's look at an example. Consider a manufacturing activity involving several processes, where we want to study the impact of one process (Process 1) on the overall performance. The left panel shows the raw data, with the duration of Process 1 on the x-axis and the total duration on the y-axis. The middle panel shows the whitened data, generated by first importing the scipy-dot-cluster-dot-vq module and then running SciPy's "whiten" method. The effect of whitening can be clearly seen, with all points now scaled between zero and ten. Finally, the right panel shows the whitened data zoomed in, along with the cluster centroids calculated with SciPy's "kmeans" method.
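A sketch of that workflow, with synthetic stand-ins for the two durations (the data and the choice of 2 clusters are assumptions):

import numpy as np
from scipy.cluster.vq import whiten, kmeans

# Synthetic stand-ins for the Process 1 duration and the total duration (hours)
rng = np.random.default_rng(1)
process1 = np.concatenate([rng.normal(5, 0.4, 200), rng.normal(8, 0.4, 200)])
total = 4 * process1 + rng.normal(0, 1, 400)

# One row per simulation run, one column per feature
obs = np.column_stack([process1, total])

# Rescale the features, then compute the centroids for 2 clusters
obs_w = whiten(obs)
centroids, distortion = kmeans(obs_w, 2)
print(centroids)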
8. Optimum number of clusters

There are several methods to identify the optimal number of clusters, including the simple, elbow, silhouette-score, and gap-statistic methods. For example, the simple method estimates the maximum number of clusters from a simple calculation, commonly the square root of half the number of observations. For our manufacturing problem, the method estimates a maximum of 22 clusters.
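As a sketch of that calculation, assuming the common rule of thumb k_max = sqrt(n/2) and a hypothetical n = 968 simulation runs (chosen to match the 22 clusters quoted):

import numpy as np

n = 968                      # hypothetical number of observations
k_max = int(np.sqrt(n / 2))  # simple-method estimate of the maximum clusters
print(k_max)                 # 22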
9. Optimum number of clusters: Silhouette-score method

The silhouette-score method is used to estimate the optimal number of clusters and can be performed using the silhouette_score function from the sklearn package. The code below shows how to calculate the silhouette score for each candidate number of clusters k using a for-loop. The silhouette scores calculated for our manufacturing problem are shown in the console outputs on the right. The best scores are those closest to 1, which in this case was obtained for k equal to 2, meaning 2 clusters.
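A sketch of such a loop, pairing SciPy's kmeans and vq with scikit-learn's silhouette_score (the synthetic data and the range of k are assumptions):

import numpy as np
from scipy.cluster.vq import whiten, kmeans, vq
from sklearn.metrics import silhouette_score

# Synthetic two-cluster data standing in for the whitened model results
rng = np.random.default_rng(2)
obs = np.concatenate([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
obs_w = whiten(obs)

for k in range(2, 6):
    centroids, _ = kmeans(obs_w, k)
    labels, _ = vq(obs_w, centroids)           # assign each point to its nearest centroid
    print(k, silhouette_score(obs_w, labels))  # scores closer to 1 are better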
10. Let's practice!

Let's now practice the use of these clustering techniques in discrete-event models.