Working with categorical features

1. Working with categorical features

In this lesson, we'll consider how to identify categorical features in the data, and how these should be handled when using isolation forests and LOF.

2. Checking column classes

The first step is to identify any features that are non-numeric. This becomes harder to check by eye when the data have many columns. Categorical features are usually stored in dataframes as columns with character or factor class. The class of a column can be found by passing the column to the class function. The example here shows that column V1 has numeric class. To find the classes of all the columns in a dataframe, we can use the sapply function. sapply takes a function through its FUN argument and applies it to every column of the dataframe supplied to the X argument. In the case shown here, the class function has been passed to FUN, and sapply has returned the class of every column in the sat data. The result is a character vector with length equal to the number of columns in the dataframe. Notice that sat contains a new column called high underscore low that has character class, indicating a categorical variable.
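The slide's code is not part of this transcript, so here is a minimal sketch of the class check. The sat dataframe below is a small made-up stand-in for the lesson's data, and the names col_classes and non_numeric are illustrative only.

```r
# Toy stand-in for the lesson's sat data (values are invented for illustration).
sat <- data.frame(
  label    = c(0, 0, 1, 0),
  V1       = c(46, 47, 80, 53),
  V2       = c(40, 37, 102, 31),
  high_low = c("low", "low", "high", "low"),
  stringsAsFactors = FALSE
)

# Class of a single column
class(sat$V1)

# Class of every column: sapply applies FUN to each column of X
col_classes <- sapply(X = sat, FUN = class)
col_classes

# Names of the columns that are not numeric (hypothetical helper step)
non_numeric <- names(col_classes)[!col_classes %in% c("numeric", "integer")]
non_numeric
```

Filtering the class vector this way gives a quick list of candidate categorical columns without scanning the output by eye.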

3. Isolation forest

The iForest function can accept data containing both categorical and numeric features. However, categorical features must first be transformed into factor variables. In the last slide, we noted that the feature high underscore low had character class. The as dot factor function can be used to transform a character column into a factor simply by passing the column as an argument. Here, we've used as dot factor to transform high underscore low, replacing the original character column in the sat data with the new factor version. As a check, we can use the class function again to confirm that the class of the new column is now factor. The isolation forest can then be trained using the iForest function in the same way as before. Note that iForest is limited to categorical features with at most 32 unique values.
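A minimal sketch of the conversion step follows, again on a made-up stand-in for sat. The level-count check at the end is an added convenience, not something shown in the lesson; it simply applies the 32-value limit mentioned above. The call to iForest itself is left out because it depends on the package used earlier in the course.

```r
# Toy stand-in for the lesson's sat data (values are invented for illustration).
sat <- data.frame(
  V1       = c(46, 47, 80, 53),
  high_low = c("low", "low", "high", "low"),
  stringsAsFactors = FALSE
)

# Replace the character column with a factor version
sat$high_low <- as.factor(sat$high_low)

# Check that the conversion worked
class(sat$high_low)

# Count the levels of each factor column and flag any above the 32-value limit
is_fac   <- sapply(sat, is.factor)
n_levels <- sapply(sat[is_fac], nlevels)
too_many <- names(n_levels)[n_levels > 32]
too_many

# sat can now be passed to iForest in the same way as in the earlier lessons.
```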

4. LOF with factors

LOF requires the distances between each point and its nearest neighbors to generate anomaly scores. The distance calculation becomes more complex when the data contain a mixture of numeric and categorical features. Gower distance is a more general way to measure the distance between pairs of points that have both categorical and numeric features, and it can be used with LOF. The Gower distance matrix must first be calculated using the daisy function from the cluster package. daisy's first argument is a dataframe of points, in this case, the sat data excluding the first column containing labels. The second argument specifies the distance metric to use. Since we'd like to use the Gower distance matrix, we must specify metric equals gower here. The LOF score can then be calculated using the lof function with the sat underscore dist distance matrix as the first argument.
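The two steps can be sketched as below. The data are randomly generated stand-ins for sat, and the sketch assumes that lof comes from the dbscan package, which the transcript does not state. The neighborhood size of 5 is an arbitrary choice; it is passed by position because the name of that argument has differed between dbscan versions.

```r
library(cluster)  # provides daisy()

# Randomly generated stand-in for the lesson's sat data (illustration only).
set.seed(1)
n <- 30
sat <- data.frame(
  label    = rep(0, n),
  V1       = rnorm(n, mean = 50, sd = 10),
  V2       = rnorm(n, mean = 40, sd = 10),
  high_low = factor(sample(c("high", "low"), n, replace = TRUE))
)

# Gower distance on all columns except the label in column 1.
# daisy() expects categorical columns to be factors, not character vectors.
sat_dist <- daisy(sat[, -1], metric = "gower")

# LOF scores from the precomputed distances.
# Assumption: lof() is dbscan::lof, which accepts a dist object.
if (requireNamespace("dbscan", quietly = TRUE)) {
  sat_lof <- dbscan::lof(sat_dist, 5)
  print(head(sat_lof))
}
```

Passing a precomputed distance object means LOF never sees the raw columns, so the mixed numeric and categorical types are handled entirely by daisy.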

5. Exploring Gower distance matrix

The Gower distance matrix sat underscore dist contains distances between all pairs of points. To explore the object a little bit more, we must first convert it to a matrix using as dot matrix. Here the range function has been used to find the maximum and minimum interpoint distances. Gower distance has the convenient property that all distances are standardized to lie between 0 and 1. Notice here that the largest distance is 0 point 868.
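A small sketch of this inspection, using a four-row made-up stand-in for sat, so the numbers will not match the 0.868 quoted in the lesson. The names sat_mat and off_diag are illustrative.

```r
library(cluster)  # provides daisy()

# Toy stand-in for the lesson's sat data (values are invented for illustration).
sat <- data.frame(
  V1       = c(46, 47, 80, 53),
  high_low = factor(c("low", "low", "high", "low"))
)

sat_dist <- daisy(sat, metric = "gower")

# Convert the distance object to a full square matrix
sat_mat <- as.matrix(sat_dist)
dim(sat_mat)

# Smallest and largest entries; the minimum is 0 because the diagonal
# holds each point's distance to itself
range(sat_mat)

# Range over distinct pairs only (upper triangle, diagonal excluded)
off_diag <- sat_mat[upper.tri(sat_mat)]
range(off_diag)
```

Excluding the diagonal is worth doing when the smallest distance between two different points is of interest, since the full matrix always reports a minimum of 0.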

6. Let's practice!

Let's practice anomaly detection with categorical features!
