
Categorical data

1. Categorical data

So far in this course, we've spent a lot of our time reviewing concepts and then testing our understanding of those concepts with coding exercises. In this lesson, we'll take a more practical approach as we work through some exploratory data analysis with categorical data, similar to something you might see in a coding or take-home assessment.

2. Types of variables

Categorical features can only take on a limited, and usually fixed, number of possible values. As we see here, there are different types of categorical data. The first type is ordinal. Ordinal data, as the name suggests, has a natural order. An example is the number of stars given in a movie review, where 1 star represents a poor review and 5 stars represents an excellent review. Nominal variables are different: their values are just labels, with no inherent order. Examples of this include gender or eye color.
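To make the ordinal versus nominal distinction concrete, here is a minimal sketch using pandas `Categorical`. The star ratings and eye colors are made-up values for illustration only, not data from the lesson.

```python
import pandas as pd

# Ordinal: star ratings have a meaningful order (1 < 2 < ... < 5).
# Values are invented for illustration.
stars = pd.Categorical([5, 1, 3, 5, 2], categories=[1, 2, 3, 4, 5], ordered=True)

# Nominal: eye colors are plain labels with no inherent order.
eye_color = pd.Categorical(["brown", "blue", "green", "brown"])

print(stars.ordered)      # whether pandas treats the categories as ordered
print(eye_color.ordered)

# Order-based operations such as min/max only make sense for ordinal data.
print(stars.min(), stars.max())
```

Marking a variable as ordered is a modeling choice: it tells pandas that comparisons and sorting by category are meaningful, which would be misleading for something like eye color.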

3. Encoding categorical data

It won't always come up during exploratory data analysis, but if you're performing any type of machine learning you may have to encode your categorical variables as something else. Let's start simple with label encoding. Label encoding involves mapping each value to a number as you can see here; note that these numbers are arbitrary identifiers and imply no order or magnitude. Another popular technique is one-hot encoding. One-hot encoding maps each category to its own column containing a 0 or 1 to indicate whether the observation belongs to that category. The preprocessing package in scikit-learn, with its fit_transform method, is helpful here, along with the pandas get_dummies function.
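As a rough sketch of both techniques, the example below uses scikit-learn's `LabelEncoder` for label encoding and pandas `get_dummies` for one-hot encoding. The `company` values are a toy sample chosen for illustration, and `LabelEncoder` is just one of several encoders in the preprocessing package that would work here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy sample, invented for illustration.
df = pd.DataFrame({"company": ["Lenovo", "Apple", "Dell", "Lenovo"]})

# Label encoding: each distinct value becomes an integer code.
# LabelEncoder assigns codes in sorted order of the class names.
encoder = LabelEncoder()
labels = encoder.fit_transform(df["company"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))
print(labels)

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["company"], dtype=int)
print(one_hot)
```

The integer codes from label encoding can mislead models that read numbers as magnitudes, which is one reason one-hot encoding is often preferred for nominal variables.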

4. Example: laptop models

Let's introduce a real dataset and work through some initial exploratory data analysis. You can see from the head of this DataFrame, the dataset includes the company, model, and the accompanying price for popular laptops.

5. Example: laptop models

Once we see what we're working with, let's home in on the company column and see how Apple, Lenovo, and Dell stack up against each other in terms of the number of observations for each value. From the bar plot, it's clear that there are far more Lenovo and Dell models in our dataset than Apple models.
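Counting observations per category could look something like the sketch below. The rows are made up to stand in for the real dataset, and the column name `company` is an assumption based on the description above.

```python
import pandas as pd

# Made-up rows standing in for the real laptop dataset (illustrative only).
laptops = pd.DataFrame({
    "company": ["Lenovo", "Dell", "Lenovo", "Apple",
                "Dell", "Lenovo", "Dell", "Lenovo"],
})

# value_counts tallies the rows per category, largest first.
counts = laptops["company"].value_counts()
print(counts)

# A bar plot of the counts (requires matplotlib to be installed).
ax = counts.plot(kind="bar")
ax.set_xlabel("company")
ax.set_ylabel("number of models")
```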

6. Box plots

Quick side note here: box plots are a great tool when you want to pack a lot of information into one visualization. The vertical marks on the lines represent the minimum, 25th percentile, median, 75th percentile, and maximum of the data, excluding outliers, while any circles beyond the end marks represent outliers.
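The numbers behind a box plot can be computed by hand, as in this small sketch. It assumes the common convention (matplotlib's default) that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers; the lesson's plot may use a different rule. The prices are invented for illustration.

```python
import numpy as np

# Invented prices, for illustration only.
prices = np.array([300, 450, 500, 520, 610, 700, 2400], dtype=float)

# Quartiles define the box; the median is the line inside it.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Assumed outlier rule: beyond 1.5 * IQR from the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
is_outlier = (prices < lower_fence) | (prices > upper_fence)
outliers = prices[is_outlier]

# Whiskers end at the most extreme points that are not outliers.
whisker_low = prices[~is_outlier].min()
whisker_high = prices[~is_outlier].max()

print(q1, median, q3)
print(whisker_low, whisker_high)
print(outliers)
```

Here the single very expensive value is flagged as an outlier, so the upper whisker stops at the largest remaining price rather than at the overall maximum.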

7. Example: laptop models

Let's see how the laptop prices for each brand look using box plots. You can see each company on the y-axis and the price in euros on the x-axis. The plot tells us that Apple laptops are typically more expensive than Dell or Lenovo laptops. However, we can see from the outliers that Dell and Lenovo have some high-end models that are far more expensive.
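A comparison like this one might be sketched as follows. The prices are made up so that the pattern resembles the one described above, and the column names `company` and `price` are assumptions; none of the figures come from the real dataset.

```python
import pandas as pd

# Made-up prices in euros, chosen only to mimic the pattern described.
laptops = pd.DataFrame({
    "company": ["Apple"] * 3 + ["Dell"] * 4 + ["Lenovo"] * 4,
    "price": [1200, 1500, 1800,
              500, 700, 900, 3000,
              400, 600, 800, 3500],
})

# Numeric view of what the box plot shows: the typical and the extreme price.
medians = laptops.groupby("company")["price"].median()
maxima = laptops.groupby("company")["price"].max()
print(medians)
print(maxima)

# Horizontal box plots, one per company (requires matplotlib).
ax = laptops.boxplot(column="price", by="company", vert=False)
ax.set_xlabel("price (euros)")
```

Looking at medians alongside maxima separates the two observations in the plot: which brand is typically pricier, and which brands have the most expensive individual models.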

8. Summary

Let's summarize what we learned here. We covered variable types and different encoding techniques, and walked through a surface-level exploration of a dataset.

9. Let's prepare for the interview!

Let's get to it and practice these concepts!