1. Measuring distance for categorical data
So far you have worked exclusively with one type of distance metric, the Euclidean distance.
This is a commonly used metric and is a great starting point when working with data that is continuous. But what happens if the data you have isn't continuous but is categorical?
2. Binary data
Let's start with the most basic case of categorical features: those that are binary, meaning that the values can only be one of two possibilities.
Here you are presented with survey data. Let's call it survey a.
The participants of this survey were asked whether they enjoy drinking various types of alcoholic beverages.
Since they can only answer yes or no, we can code this binary response as TRUE or FALSE.
We would be interested to learn which participants are similar to one another based on their responses.
To calculate this we will use a similarity score called the Jaccard index.
3. Jaccard index
This measure of similarity captures the ratio of the size of the intersection of A and B to the size of the union of A and B.
Or, more intuitively, the ratio of the number of features that are TRUE for both observations to the number of features that are TRUE for at least one of them.
Let's go back to the previous example.
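Written as a formula, the definition above looks like this. The notation is an assumption made here for illustration: A and B stand for the sets of features that are TRUE for each of the two observations.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```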
4. Calculating Jaccard distance
Let us calculate the Jaccard similarity for two observations, 1 and 2.
They only agree in one category, beer, so for the intersection we get the value of one.
The number of categories that are TRUE for at least one of these observations, the union, is four.
Dividing the intersection by the union, we get a Jaccard similarity of 0.25.
But what about the distance? Well, remember that distance is 1 - similarity, so in this case the distance is just 0.75.
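The same arithmetic can be sketched in R. The transcript does not list the responses themselves, so the two vectors below are hypothetical: they are chosen only so that beer is the single shared TRUE and four beverages are TRUE for at least one participant.

```r
# Hypothetical responses (not taken from the lesson's data):
# beer is the only beverage both participants enjoy.
obs_1 <- c(wine = TRUE,  beer = TRUE, whiskey = FALSE, vodka = FALSE)
obs_2 <- c(wine = FALSE, beer = TRUE, whiskey = TRUE,  vodka = TRUE)

intersection <- sum(obs_1 & obs_2)  # features TRUE in both observations
union_size   <- sum(obs_1 | obs_2)  # features TRUE in at least one of them

jaccard_similarity <- intersection / union_size
jaccard_distance   <- 1 - jaccard_similarity

jaccard_similarity  # 0.25
jaccard_distance    # 0.75
```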
5. Calculating Jaccard distance in R
To learn how to do this in R, let's start with a subset of our data containing three observations, called survey a.
In order to calculate the Jaccard distance between all three observations, you just need to set the method argument of the dist() function to "binary".
You can see that, just like our manual calculation earlier, observations 1 and 2 have a distance of 0.75.
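A minimal sketch of that call follows. The data frame name survey_a and its values are assumptions for illustration, picked so that observations 1 and 2 share only beer, as in the manual calculation.

```r
# Hypothetical three-observation subset (values are illustrative).
survey_a <- data.frame(
  wine    = c(TRUE,  FALSE, TRUE),
  beer    = c(TRUE,  TRUE,  FALSE),
  whiskey = c(FALSE, TRUE,  TRUE),
  vodka   = c(FALSE, TRUE,  FALSE)
)

# method = "binary" treats each value as on/off and returns the
# proportion of mismatches among features that are on in at least
# one of the two rows, which is the Jaccard distance.
dist_survey_a <- dist(survey_a, method = "binary")
dist_survey_a

# Distance between observations 1 and 2
as.matrix(dist_survey_a)[2, 1]  # 0.75
```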
Now let's expand this idea to a broader case of categorical data where we have features represented by more than two categories.
6. More than two categories
For survey b, we have gathered the favorite color and sport of our participants. For color, their choices were red, blue and green, and for sport the decision was between soccer and hockey.
To calculate the distance between these observations we need to represent the presence or absence of each category in a process known as dummification. Essentially we consider each feature-value pair and encode its presence or absence as a 1 or 0, which is equivalent to a TRUE or FALSE.
Take a look at observation one, whose favorite color is red and favorite sport is soccer. After we dummify our data, shown in the table on the right, this observation now has a value of zero for every dummified feature except for the color red and the sport soccer, where the value is one.
Once our data is dummified, it's just a matter of calculating the Jaccard distance between the observations.
7. Dummification in R
To perform this preliminary step in R, we would use the dummy.data.frame() function from the dummies package.
So long as your categorical values are encoded as factors, this function will convert them into binary feature-value representations.
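If that package is not available in your R installation, a similar encoding can be sketched with base R. Note that this swaps in model.matrix(), which is not the function the lesson uses, and the survey_b values below are illustrative assumptions: only observation one (red, soccer) is stated in the lesson.

```r
# Hypothetical survey b; both columns are stored as factors.
survey_b <- data.frame(
  color = factor(c("red", "green", "blue", "blue")),
  sport = factor(c("soccer", "hockey", "hockey", "soccer"))
)

# Ask for full indicator coding so that no level is dropped
# as a reference category.
full_contrasts <- lapply(survey_b, contrasts, contrasts = FALSE)

# "- 1" removes the intercept column, leaving one 0/1 column
# per feature-value pair.
dummy_survey_b <- model.matrix(~ . - 1, data = survey_b,
                               contrasts.arg = full_contrasts)
dummy_survey_b
```

Each row should contain exactly two ones: one for the participant's color and one for their sport.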
8. Generalizing categorical distance in R
We can leverage this to calculate the distance for our data. In this case we can see that the pairs of observations 2 and 3, 1 and 4, and 3 and 4 all have a comparable distance to one another, which makes sense if you look back at the original data.
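The whole pipeline can be sketched end to end as follows. The survey_b values are assumptions chosen to be consistent with the pattern described above, and the dummification uses base R's model.matrix() as a stand-in rather than the lesson's own function.

```r
# Hypothetical survey b (only observation one is given in the lesson).
survey_b <- data.frame(
  color = factor(c("red", "green", "blue", "blue")),
  sport = factor(c("soccer", "hockey", "hockey", "soccer"))
)

# Dummify: one 0/1 column per feature-value pair.
dummy_survey_b <- model.matrix(
  ~ . - 1, data = survey_b,
  contrasts.arg = lapply(survey_b, contrasts, contrasts = FALSE)
)

# Jaccard distance between every pair of observations.
dist_survey_b <- dist(dummy_survey_b, method = "binary")
dist_survey_b
```

With these values, pairs that share either a color or a sport (2 and 3, 1 and 4, 3 and 4) come out at a distance of about 0.67, while pairs that share nothing are at the maximum distance of 1.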
9. Let's practice!
Now you have the tools to handle both continuous and categorical data types. Let's practice what you've learned.