EDA with categorical variables

1. EDA with categorical variables

Hi again! My name is Maarten, and I'll be the second instructor for this course. We went through the first three steps of the EDA process and focused primarily on understanding continuous variables. Now we will introduce more categorical variables into the analysis, which will also move us into the fourth step: examining and quantifying relationships between variables.

2. Categorical variables and frequency

The first place to start is with frequencies, or counts, of observations per value of a categorical variable. If you remember, this is similar to the histogram described in the previous video about distributions. Here we have a bar chart showing the frequency, or number of participants in the study, for three age groups. Viewing the data this way helps us understand the characteristics of the dataset (for example, that there are more older participants).
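As a minimal sketch of counting observations per category, the snippet below uses pandas. The `participants` table and its values are made up for illustration and are not the study data shown on the slide.

```python
import pandas as pd

# Hypothetical survey data: one row per participant.
participants = pd.DataFrame({
    "age_group": ["18-29", "30-39", "40-49", "40-49", "30-39", "40-49"],
})

# Count how many observations fall into each category.
# value_counts() sorts the result from most to least frequent.
age_counts = participants["age_group"].value_counts()
print(age_counts)
```

Counts like these are what a bar chart of frequencies would plot, one bar per category.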

3. Categorical variables and percentages

Often more insightful is the percentage, or proportion, that each value accounts for within the dataset. This gives a relative sense of each group's significance within the data. For example, 250 participants aged 18 to 29 sounds like a lot until you learn that the total sample size is 10,000.
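To make the arithmetic concrete, here is a small sketch that turns counts into percentages. The 250 participants and the total of 10,000 come from the example; how the other 9,750 participants split across the remaining groups is an assumption.

```python
import pandas as pd

# Counts per age group. Only the 250 and the 10,000 total come from the
# example; the split of the other 9,750 participants is made up.
age_counts = pd.Series({"18-29": 250, "30-39": 3750, "40-49": 6000})

# Share of the whole sample that each group represents, in percent.
age_pct = age_counts / age_counts.sum() * 100
print(age_pct.round(1))
```

If you start from one row per participant rather than from counts, `value_counts(normalize=True)` on the column returns the proportions directly.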

4. Proportions across multiple categorical variables

Looking at how proportions vary across another categorical variable can reveal notable differences within the dataset. Here you see that TikTok is used more by participants aged 18 to 29 and by few participants aged 40 to 49. Analyzing the data this way can start building insights about how one variable relates to another.
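One way to compute proportions across two categorical variables is a cross-tabulation. The sketch below assumes a hypothetical `survey` table with an age group and a yes/no TikTok column; the responses are invented and only mirror the direction of the pattern described above.

```python
import pandas as pd

# Hypothetical responses: age group and whether the participant uses TikTok.
survey = pd.DataFrame({
    "age_group": ["18-29"] * 4 + ["40-49"] * 4,
    "uses_tiktok": ["yes", "yes", "yes", "no", "no", "no", "no", "yes"],
})

# normalize="index" divides each row by its row total, so every age group
# sums to 1 and groups of different sizes can be compared fairly.
usage_share = pd.crosstab(
    survey["age_group"], survey["uses_tiktok"], normalize="index"
)
print(usage_share)
```

Normalizing by row answers "what share of each age group uses the app?". Normalizing by column would answer a different question: "what share of the app's users falls in each age group?".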

5. Categorical variables with descriptive statistics

You can also identify trends in a continuous variable by comparing its descriptive statistics across the values of a categorical variable.
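A rough illustration of this idea is a grouped summary. The continuous variable here, `screen_hours`, is a hypothetical stand-in chosen for the example, as are all of the values.

```python
import pandas as pd

# Hypothetical data: a categorical variable and a continuous one.
people = pd.DataFrame({
    "age_group": ["18-29", "18-29", "18-29", "40-49", "40-49", "40-49"],
    "screen_hours": [5.0, 6.0, 7.0, 2.0, 3.0, 4.0],
})

# Descriptive statistics of the continuous variable, one row per category.
summary = people.groupby("age_group")["screen_hours"].agg(
    ["count", "mean", "median", "std"]
)
print(summary)
```

Reading down a column of this table (for instance, the means) shows how the continuous variable shifts from one category to the next.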

6. What are boxplots?

Box plots are a great way to visualize the distribution of a continuous variable. They are similar to the histograms from the previous chapter.

7. What are boxplots?

The median is the line within the box. Remember, this indicates 50% of values fall below the line and 50% above.

8. What are boxplots?

The “box” is the rectangle, constructed using the interquartile range: it spans from the first quartile to the third quartile.

9. What are boxplots?

The lines, or whiskers, extend from each edge of the box to the most extreme value that lies within 1.5 times the interquartile range of that edge.

10. What are boxplots?

Therefore, outliers, typically visualized as points, are the values that lie beyond the whiskers.
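The pieces of a box plot described above can be computed by hand. This sketch assumes the common 1.5 times IQR convention and NumPy's default (linear interpolation) quartiles; other tools may compute quartiles slightly differently. The values are hypothetical.

```python
import numpy as np

# Hypothetical measurements, including one unusually large value.
values = np.array([52, 55, 57, 58, 60, 61, 63, 64, 66, 95])

# The median line and the two edges of the box.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Limits that sit 1.5 * IQR beyond each edge of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside those limits.
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the limits would be drawn as an individual point.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(f"median={median}, box=({q1}, {q3}), "
      f"whiskers=({whisker_low}, {whisker_high}), outliers={outliers}")
```

Note that the whiskers end at actual data points (52 and 66 here), not at the computed limits themselves, and the value 95 is flagged as an outlier.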

11. Comparing distributions with categorical variables

Box plots are fantastic for comparing distributions across a categorical variable. Placing the boxes side by side lets you draw quick conclusions about differences. Typically, you will see the categorical variable on the x-axis and the continuous variable on the y-axis. Here we see the distributions of height for males and females. The box plot for males indicates generally larger heights than the one for females.
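A side-by-side comparison like this one could be drawn as follows. This is only a sketch using matplotlib; the heights are invented, and the column names `sex` and `height_cm` as well as the output file name are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical heights in centimetres.
heights = pd.DataFrame({
    "sex": ["female"] * 5 + ["male"] * 5,
    "height_cm": [158, 162, 165, 168, 171, 170, 175, 178, 181, 186],
})

# One array of heights per category, in a fixed order.
groups = ["female", "male"]
data = [heights.loc[heights["sex"] == g, "height_cm"].to_numpy() for g in groups]

fig, ax = plt.subplots()
box = ax.boxplot(data)        # one box per category, at x = 1, 2, ...
ax.set_xticklabels(groups)    # categorical variable on the x-axis
ax.set_xlabel("sex")
ax.set_ylabel("height (cm)")  # continuous variable on the y-axis
fig.savefig("heights_by_sex.png")
```

Because both boxes share one y-axis, a higher box or median line for one group is visible at a glance.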

12. Creating new variables

An important part, or result, of EDA is creating new variables, both continuous and categorical. Doing so can help further refine an analysis or visualization. This process is often referred to as data transformation or mutation.

13. Creating new variables

For example, you might convert a person's age into four "Age Group" categories (Teen, Early Adult, Adult, and Middle Age), or mine a text string variable to create a new categorical variable (e.g., DataCamp Course Type).
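Both kinds of new variable can be sketched with pandas. The four labels come from the example; the age cut-offs, the course titles, and the rule for deriving a course type from a title are all assumptions made for illustration.

```python
import pandas as pd

# --- Continuous to categorical: bin ages into groups ---
people = pd.DataFrame({"age": [15, 22, 35, 52]})

# The bin edges are assumed; right=False makes each bin [lower, upper).
people["age_group"] = pd.cut(
    people["age"],
    bins=[13, 20, 30, 45, 65],
    labels=["Teen", "Early Adult", "Adult", "Middle Age"],
    right=False,
)
print(people)

# --- Text to categorical: pull a keyword out of a string ---
# Hypothetical course titles and a hypothetical keyword rule.
courses = pd.DataFrame({
    "title": [
        "Introduction to Python",
        "Intermediate SQL",
        "Data Visualization in Tableau",
    ],
})
courses["course_type"] = (
    courses["title"]
    .str.extract(r"\b(Python|SQL)\b", expand=False)  # first matching keyword
    .fillna("Other")                                 # titles with no match
)
print(courses)
```

Ages that fall outside the bin edges would come back as missing values, so it is worth checking the range of the data before choosing the edges.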

14. Let's practice!

Now it's your turn to explore categorical variables and build box plots to compare distributions.