
Categorical data

1. Categorical data

Hi there! So far, we have reviewed concepts that apply to numerical data. However, the data that companies collect and analyze can be of any type and take any shape. In this video, we will go over categorical data.

2. Categorical data

A categorical variable can take on one of a fixed number of possible values. Commonly, each of the possible values is referred to as a level.

3. Categorical data

Categorical variables can be divided into two types: nominal

4. Categorical data

and ordinal. Nominal data do not have any particular order, whereas ordinal data do have some order.

5. Categorical data

An example of nominal data is a blood type.

6. Categorical data

An example of ordinal data is a clothing size.

7. Factors in R

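To define a categorical variable in R, you supply a vector of values and, optionally, a vector of levels. The factor function encodes the vector as a factor; if you don't provide the levels, it takes them from the unique values in the data.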
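
As a minimal sketch of a nominal factor, the snippet below uses a made-up vector of blood types; the variable names and sample values are assumptions for illustration only.

```r
# Hypothetical sample: blood types of six donors
blood <- c("A", "O", "B", "O", "AB", "A")

# Encode the character vector as a factor; the levels vector lists
# every value the variable is allowed to take
blood_factor <- factor(blood, levels = c("A", "B", "AB", "O"))

print(blood_factor)    # shows the values followed by the levels
levels(blood_factor)   # the possible values of the variable
nlevels(blood_factor)  # how many levels there are
```

Because no order is requested, R treats the levels as unordered labels.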

8. Factors in R

If your data are ordinal, set the ordered argument to TRUE. Note the less-than signs between the levels in the printout.
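
A short sketch of an ordered factor, assuming a made-up vector of clothing sizes (the names and values are illustrative, not from the course data):

```r
# Hypothetical sample: clothing sizes of five items
sizes <- c("M", "S", "L", "M", "S")

# List the levels from smallest to largest and mark the factor as ordered
size_factor <- factor(sizes, levels = c("S", "M", "L"), ordered = TRUE)

print(size_factor)     # the Levels line shows the order of the levels

# Comparisons are meaningful for ordered factors
smaller_than_large <- size_factor < "L"
print(smaller_than_large)
```

The order comes from the levels argument, not from the alphabet, so the levels should be listed from lowest to highest.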

9. Analysis - table

Let's review a few functions that can help you analyze categorical data. We are often interested in the number of occurrences in each category.

10. Analysis - table

The table function builds a contingency table of the counts in each category.

11. Analysis - table

A contingency table helps us to understand which category has the most observations and which category has only a few observations.
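
For instance, counting occurrences with table might look like the sketch below; the blood type sample is invented for illustration.

```r
# Hypothetical sample: blood types of seven donors
blood <- factor(c("A", "O", "B", "O", "AB", "A", "O"),
                levels = c("A", "B", "AB", "O"))

# Count how many observations fall into each level
counts <- table(blood)
print(counts)

# Name of the level with the most observations
most_common <- names(which.max(counts))
print(most_common)
```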

12. Analysis - barplot

If an interviewer asks you to visualize the number of occurrences in each of the categories, you can apply the barplot function to a contingency table.
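
A minimal sketch of that plot, again on an invented blood type sample; the title and axis labels are arbitrary choices.

```r
# Hypothetical sample: blood types of seven donors
blood <- factor(c("A", "O", "B", "O", "AB", "A", "O"),
                levels = c("A", "B", "AB", "O"))
counts <- table(blood)

# Draw one bar per level; barplot returns the bar midpoints on the x axis
bar_midpoints <- barplot(counts,
                         main = "Blood types",
                         xlab = "Type",
                         ylab = "Number of observations")
```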

13. Analysis - tapply

An analysis of data within categories

14. Analysis - tapply

can be carried out with the tapply function.

15. Analysis - tapply

The tapply function applies a function to each group of values, where the groups are defined by the levels of a factor.

16. Analysis - tapply

The first argument of the tapply function is the vector of numerical values, the second is the vector of categories, and the third argument is the name of the function to be applied to the numerical values.
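
The three arguments can be seen together in a small sketch. The prices and sizes below are made up for illustration.

```r
# Hypothetical sample: price of six items and the size of each item
price <- c(20, 25, 18, 30, 22, 27)
size  <- factor(c("S", "M", "S", "L", "M", "M"), levels = c("S", "M", "L"))

# 1st argument: numerical values, 2nd: categories, 3rd: function to apply
mean_price <- tapply(price, size, mean)
print(mean_price)
```

The result has one entry per level, here the mean price within each size.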

17. Categorical data encoding

Many machine learning algorithms cannot handle categorical data unless they are encoded as numbers. R's modeling functions create dummy variables for factors internally, but you can prepare the data yourself for better control. For that, we'll review label encoding and one hot encoding, because they are relatively easy. Bear in mind that there are other approaches too.

18. Label encoding

In label encoding,

19. Label encoding

a value from 1 through n

20. Label encoding

, where n is the number of categories,

21. Label encoding

is assigned to each of the categories.

22. Label encoding

This method is easy and quick, but it implies an order between the classes, which can mislead a model when the data are nominal.
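
One simple way to sketch label encoding in base R is to convert a factor to its integer codes. This is one possible approach rather than the only one, and the sample sizes are invented.

```r
# Hypothetical sample: clothing sizes of five items
sizes <- c("M", "S", "L", "M", "S")

# The position of each level in the levels vector becomes its label
size_factor <- factor(sizes, levels = c("S", "M", "L"))

# as.integer returns the internal codes 1 through n (here n = 3)
size_label <- as.integer(size_factor)
print(size_label)
```

With these levels, S maps to 1, M to 2, and L to 3, so the numeric order matches the real order of the sizes. For nominal data, the numbers would suggest an order that isn't there.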

23. One hot encoding

In one hot encoding,

24. One hot encoding

each new column holds 0 in every row

25. One hot encoding

except the rows whose value matches that column's level;

26. One hot encoding

those rows hold 1.

27. One hot encoding

A new column is created for each level. If there are many categories, one hot encoding produces many columns, which is a downside of the method. An advantage of this method is that it doesn't assume any order between the categories.
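
As a sketch, one hot encoding can be produced in base R with model.matrix, removing the intercept so that every level gets its own column. The sample data and variable names are assumptions, and this is only one of several ways to do it.

```r
# Hypothetical sample: blood types of four donors
blood <- factor(c("A", "O", "B", "AB"), levels = c("A", "B", "AB", "O"))

# "- 1" drops the intercept, so each level gets its own 0/1 column
one_hot <- model.matrix(~ blood - 1)
print(one_hot)

# Every row contains exactly one 1
rowSums(one_hot)
```

The column names are the variable name followed by the level, for example bloodA and bloodO.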

28. Summary

To summarize, we've covered types of categorical data, how to define factors in R, and three functions to analyze categorical data: table, barplot, and tapply. We've also reviewed two methods for data encoding: label encoding and one hot encoding.

29. Let's practice!

Let's practice analyzing and encoding categorical data!