1. Categorical data
Hi there! So far, we have reviewed concepts that apply to numerical data. However, the data that companies collect and analyze can be of any type and take any shape.
In this video, we will go over categorical data.
2. Categorical data
A categorical variable can take on one of a fixed number of possible values.
Commonly, each of the possible values is referred to as a level.
3. Categorical data
Categorical variables can be divided into two types: nominal
4. Categorical data
and ordinal.
Nominal data have no inherent order, whereas ordinal data have a natural ranking.
5. Categorical data
An example of nominal data is blood type: A, B, AB, and O have no inherent order.
6. Categorical data
An example of ordinal data is clothing size: small, medium, and large follow a natural order.
7. Factors in R
To define a categorical variable in R, you need a vector of values and, optionally, a vector of levels. If you omit the levels, R uses the sorted unique values.
The factor function encodes a vector as a factor.
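As a minimal sketch of that step, here is how a factor could be defined. The blood type values and the variable names are made up for illustration.

```r
# Hypothetical sample of blood types (illustrative values only)
blood_values <- c("A", "O", "B", "A", "AB", "O", "O")
blood_levels <- c("A", "B", "AB", "O")

# factor() encodes the character vector as a factor with the given levels
blood <- factor(blood_values, levels = blood_levels)

print(blood)
levels(blood)   # the possible values of the variable
nlevels(blood)  # how many levels there are
```

Passing the levels explicitly fixes their order, which otherwise defaults to the sorted unique values.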
8. Factors in R
If your variable is ordinal, set the ordered argument to TRUE.
Note the less than signs between the levels in the printout.
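A short sketch of an ordered factor, assuming a hypothetical vector of clothing sizes (the values and names are illustrative):

```r
# Hypothetical clothing sizes (illustrative values only)
size_values <- c("M", "S", "L", "M", "XL", "S")
size_levels <- c("S", "M", "L", "XL")  # listed from smallest to largest

# ordered = TRUE tells R that the levels have a ranking
sizes <- factor(size_values, levels = size_levels, ordered = TRUE)

print(sizes)
# The levels line of the printout reads: Levels: S < M < L < XL

# Ordered factors support comparisons against a level
smaller_than_large <- sizes < "L"
print(smaller_than_large)
```

The order of the levels vector is what defines the ranking, so it is worth listing the levels explicitly rather than relying on alphabetical sorting.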
9. Analysis - table
Let's review a few functions that can help you analyze categorical data. We are often interested in the number of occurrences in each category.
10. Analysis - table
The table function builds a contingency table of the counts for each level.
11. Analysis - table
A contingency table helps us to understand which category has the most observations and which category has only a few observations.
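For instance, counting occurrences could look like the sketch below. The blood type sample is hypothetical.

```r
# Hypothetical sample of blood types (illustrative values only)
blood <- factor(c("A", "O", "B", "A", "AB", "O", "O"),
                levels = c("A", "B", "AB", "O"))

# Count the observations in each level
counts <- table(blood)
print(counts)

# Level with the most observations
most_common <- names(counts)[which.max(counts)]
print(most_common)

# Counts sorted from most to least frequent
sort(counts, decreasing = TRUE)
```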
12. Analysis - barplot
If an interviewer asks you to visualize the number of occurrences in each of the categories, you can apply the barplot function on a contingency table.
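A minimal sketch of that plot, using the same hypothetical blood type sample; the title and axis labels are assumptions added for readability.

```r
# Hypothetical sample of blood types (illustrative values only)
blood <- factor(c("A", "O", "B", "A", "AB", "O", "O"),
                levels = c("A", "B", "AB", "O"))
counts <- table(blood)

# barplot() draws one bar per level; it returns the bar midpoints
bar_midpoints <- barplot(counts,
                         main = "Observations per blood type",
                         xlab = "Blood type",
                         ylab = "Count")
```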
13. Analysis - tapply
An analysis of data within categories
14. Analysis - tapply
can be carried out with the tapply function.
15. Analysis - tapply
The tapply function applies a function to each group of values, where the groups are defined by the levels of a factor.
16. Analysis - tapply
The first argument of the tapply function is the vector of numerical values, the second is the vector of categories, and the third argument is the name of the function to be applied to the numerical values.
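To make the three arguments concrete, here is a small sketch with made-up heights and clothing sizes (all values and names are hypothetical):

```r
# Hypothetical data: height in cm and clothing size for six people
height <- c(160, 172, 181, 168, 190, 158)
size <- factor(c("S", "M", "L", "M", "XL", "S"),
               levels = c("S", "M", "L", "XL"))

# tapply(numerical values, categories, function to apply)
mean_height <- tapply(height, size, mean)
print(mean_height)
```

Any function that takes a numeric vector, such as median or sd, could be passed as the third argument instead of mean.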
17. Categorical data encoding
Most machine learning algorithms cannot handle categorical data unless it is encoded as numbers. R's modeling functions create dummy variables from factors internally, but you can prepare the data yourself for better control.
For that, we'll review label encoding and one hot encoding, because they are relatively easy. Bear in mind that there are other approaches too.
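As a rough illustration of the dummy variables R builds internally, the contrasts function shows the coding a factor would receive in a model. This sketch assumes R's default treatment contrasts for unordered factors and a hypothetical blood type sample.

```r
# Hypothetical sample of blood types (illustrative values only)
blood <- factor(c("A", "O", "B", "A", "AB", "O", "O"),
                levels = c("A", "B", "AB", "O"))

# One row per level, one dummy column per non-reference level;
# the first level ("A") is the reference and is coded as all zeros
dummy_coding <- contrasts(blood)
print(dummy_coding)
```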
18. Label encoding
In label encoding,
19. Label encoding
an integer from 1 through n,
20. Label encoding
where n is the number of categories,
21. Label encoding
is assigned to each of the categories.
22. Label encoding
This method is easy and quick, but it implies an order among the categories, which may not exist.
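One possible way to label encode in R is to convert a factor to its integer codes. This is a sketch with a hypothetical blood type sample, not the only approach.

```r
# Hypothetical sample of blood types (illustrative values only)
blood_values <- c("A", "O", "B", "A", "AB", "O", "O")
blood <- factor(blood_values, levels = c("A", "B", "AB", "O"))

# Each level maps to its position in the levels vector: A=1, B=2, AB=3, O=4
blood_label <- as.integer(blood)

encoded <- data.frame(blood = blood_values, label = blood_label)
print(encoded)
```

The codes follow the order of the levels vector, so an algorithm may read O (4) as "greater than" A (1) even though blood types have no such ranking.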
23. One hot encoding
In one hot encoding,
24. One hot encoding
each new column has zeros in all rows
25. One hot encoding
except those where the value matches the column's level;
26. One hot encoding
in those rows, it is 1.
27. One hot encoding
A new column is created for each level. If there are many categories, one hot encoding produces many columns, which is a downside of the method. An advantage is that it doesn't assume any order among the categories.
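One common way to one hot encode in base R is model.matrix with the intercept removed. This is a sketch under that assumption, with a hypothetical blood type sample.

```r
# Hypothetical sample of blood types (illustrative values only)
blood <- factor(c("A", "O", "B", "A", "AB", "O", "O"),
                levels = c("A", "B", "AB", "O"))

# "- 1" drops the intercept, so every level gets its own 0/1 column
one_hot <- model.matrix(~ blood - 1)

# Use the plain level names as column names
colnames(one_hot) <- levels(blood)
print(one_hot)
```

Each row contains exactly one 1, in the column that matches that observation's level.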
28. Summary
To summarize, we've covered types of categorical data, how to define factors in R, and three functions to analyze categorical data: table, barplot, and tapply. We've also reviewed two methods for data encoding: label encoding and one hot encoding.
29. Let's practice!
Let's practice analyzing and encoding categorical data!