Distance between two observations

1. Distance between two observations

Let's begin by focusing on the question that is fundamental to all clustering analyses: How similar are two observations?

2. Distance vs. Similarity

Or from another perspective, how dissimilar are they?

3. Distance vs. Similarity

You see, most clustering methods measure similarity between observations using a dissimilarity metric, often referred to as the distance. These two concepts are just two sides of the same coin. If two observations have a large distance, then they are less similar to one another. Likewise, if their distance is small, then they are more similar. Naturally, we should first develop a keen intuition for what is meant by distance.

4. Distance between two players

So, let's work with the scenario of players on a soccer field.

5. Distance between two players

In this image you see the positions of two players. How far apart are they? To answer this question we first need their coordinates.

6. Distance between two players

Here the blue player is positioned in the center of the field, which we will refer to as (0, 0), while the red player has a position of (12, 9), or twelve feet to the right of center and nine feet up.

7. Distance between two players

The players in this case are our observations and their X and Y coordinates are the features of these observations. We can use these features to calculate the distance between these two players. In this case we will use a distance measurement you're likely familiar with.

8. Distance between two players

Euclidean distance.

9. Distance between two players

Which is simply the hypotenuse of the triangle that is formed by the differences in the x and y coordinates of these players.

10. Distance between two players

The familiar formula to calculate this is shown here.

11. Distance between two players

If we plug in the x and y values for both players, we arrive at the Euclidean distance between them.

12. Distance between two players

Which in this case is 15. This is the fundamental idea for calculating a measure of dissimilarity between the blue and red players.
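
For reference, the calculation described above can be written out step by step. This is a sketch of the standard two-dimensional Euclidean distance formula applied to the coordinates given earlier; the subscript labels are notation chosen here, not taken from the slides.

```latex
d(\text{blue}, \text{red})
  = \sqrt{(x_{\text{red}} - x_{\text{blue}})^2 + (y_{\text{red}} - y_{\text{blue}})^2}
  = \sqrt{(12 - 0)^2 + (9 - 0)^2}
  = \sqrt{144 + 81}
  = \sqrt{225}
  = 15
```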

13. dist() function

To do this in R, we use the dist() function to calculate the Euclidean distance between our observations. The function simply requires a data frame or matrix containing your observations and features. In this case, we are working with the data frame two_players. How the distance is calculated is controlled by the method parameter. In this case we are using Euclidean distance and specify it accordingly. As in our manual calculation, we see that the distance between the red and blue players is 15.
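
A minimal sketch of that call is shown below. The transcript does not show how the data frame is built, so the name two_players, the column names x and y, and the row names are assumptions made for illustration; only the coordinates and the use of dist() with the method parameter come from the text.

```r
# Hypothetical reconstruction of the two-player data frame.
# Column names (x, y) and row names are assumed, not taken from the slides.
two_players <- data.frame(
  x = c(0, 12),
  y = c(0, 9),
  row.names = c("blue", "red")
)

# Euclidean distance between every pair of rows (here, a single pair)
d <- dist(two_players, method = "euclidean")
print(d)

# Extract the single distance as a plain number: 15
as.numeric(d)
```

Note that "euclidean" is the default value of method in dist(), so spelling it out mainly serves to make the choice explicit.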

14. More than 2 observations

This function becomes indispensable if we have more than 2 observations. In this case, if we wanted to know the distance between 3 players, we would measure the distance between the players two at a time. Running this through the dist() function, we see that the distance between the red and blue players is 15 as before, but we also have measurements between green and blue as well as green and red. In this case, green and red have the smallest distance and hence are closest to one another. The dist() function would work just as well if we had more features to use for calculating the distance.
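
The sketch below illustrates the three-player case. The transcript does not give the green player's coordinates, so the position (10, 20) is a hypothetical value chosen only so that green and red end up closest, as described above. The data frame name three_players, the column names, and the extra z feature are likewise assumptions.

```r
# Blue and red positions come from the example above; the green player's
# position (10, 20) is a hypothetical value chosen for illustration.
three_players <- data.frame(
  x = c(0, 12, 10),
  y = c(0, 9, 20),
  row.names = c("blue", "red", "green")
)

# dist() computes the distance for every pair of rows
d3 <- dist(three_players, method = "euclidean")
print(d3)

# Convert to a full matrix to look up a specific pair by name
m3 <- as.matrix(d3)
m3["red", "blue"]    # 15, as before
m3["green", "red"]   # sqrt(2^2 + 11^2) = sqrt(125), about 11.18
m3["green", "blue"]  # sqrt(10^2 + 20^2) = sqrt(500), about 22.36

# Adding another feature (a hypothetical z column) needs no change to the call
three_players$z <- c(0, 20, 0)
m3z <- as.matrix(dist(three_players, method = "euclidean"))
m3z["red", "blue"]   # sqrt(12^2 + 9^2 + 20^2) = sqrt(625) = 25
```

Converting the result with as.matrix() is optional; it is used here only because named lookups are easier to read than positions in the lower-triangle output.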

15. Let's practice!

Now, let's put what you've just learned into practice in the upcoming exercises.