Limitations of hierarchical clustering

1. Limitations of hierarchical clustering

Hi everyone, welcome to the final video in the chapter! Now that you are familiar with hierarchical clustering let us look at the challenges when performing this type of clustering.

2. Measuring speed in hierarchical clustering

Let us design a small task to measure the speed of various iterations of hierarchical clustering in order to check how long it takes for iterations. We will use the timeit module to check runtime of functions. As the most time-consuming step in the process of hierarchical clustering is constructing the distance matrix through the linkage method, we will time the amount of time it takes to form the matrix. For the purpose of this exercise, we will use randomly generated data points on the XY plane. To test the limits of the algorithm, we will use an increasing number of data points.

3. Use of timeit module

To demonstrate the use of timeit to further check how much time it may take for a large number of points, let us look at how long it takes to run the linkage method for 100 points with randomly generated coordinates. We first import the random and timeit modules to generate the points and time the runtime of a function respectively. Next, we create a DataFrame with 100 points, and randomly generate 100 points within coordinates in the range of zero to a hundred. To check the time of a function in the interpreter, we use the percent symbol before the timeit keyword followed by the statement that we were about to run. The timeit module runs the function multiple times and reports the mean and standard deviations of the runtimes. When I run this code on a 2017 Macbook Air running Jupyter notebooks, the mean time to execute the statement is about 1-point-02 milliseconds. Let us now perform iterations for an increasing number of points, check how long it takes to run the linkage method and then plot a graph to compare the performace.

4. Comparison of runtime of linkage method

If you plot the runtime of the linkage method with the number of points, you can see that the runtime increases with the increase in number of data points. In addition to it, you would notice that the increase in run time is not linear with respect to the increase in data points, but quadratic. This makes the technique of hierarchical clustering infeasible for huge number of data points, for instance, the shopping habits of all customers in Walmart in a year.

5. Next up - exercises

Let us now try some exercises to measure the time of functions through timeit.