Course introduction

1. Course introduction

Hello and welcome to another course from DataCamp. Our primary objective for this course is to learn how to work with categorical data in Python. My name is Kasey Jones and I am thrilled that you have decided to join me. Let's jump right in.

2. What does it mean to be "categorical"?

A variable is usually considered categorical if it contains a finite number of distinct groups, or categories. Generally, the number of categories and the corresponding names are already known. In research fields, this type of data is also known as qualitative data. On the other hand, numerical data, also known as quantitative data, is expressed using a value and is usually in the form of a measurement. This course is completely focused on working with categorical data.

3. Ordinal vs. nominal variables

Categorical data can be further broken down into two different types, ordinal and nominal. You can think of ordinal variables as having an order. When categories within a variable have a natural rank order, the variable is considered ordinal. We have all filled out surveys that have ordinal options, such as choices ranging from strongly disagree up through strongly agree. On the flip side, nominal variables are those that cannot be placed into a natural order. A good example here is when a survey asks what your favorite color is from a list of options.
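
To make the distinction concrete, here is a minimal sketch of how the two types could be represented in pandas. The survey responses and color choices are made-up illustrative values, not data from the course.

```python
import pandas as pd

# Hypothetical survey answers (illustrative only).
agreement_levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
responses = ["agree", "neutral", "strongly agree", "disagree"]

# Ordinal: the categories have a natural rank order, so ordered=True.
ordinal = pd.Series(pd.Categorical(responses, categories=agreement_levels, ordered=True))

# Nominal: favorite colors have no natural order, so the default (unordered) is kept.
nominal = pd.Series(pd.Categorical(["blue", "green", "blue", "red"]))

# Ordered categories support rank-based operations such as min and max.
print(ordinal.min(), "|", ordinal.max())
print(nominal.cat.ordered)
```

Because the ordinal Series knows its rank order, min and max follow the order of the agreement scale rather than alphabetical order.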

4. Our first dataset

To begin working with categorical variables, let's take a look at our first dataset. The adult census income dataset contains information on US adults and whether or not an adult makes over $50,000 annually. We have used the pandas method info to take a look at the dataset's variables and their data types, or dtypes for short. We quickly see that there are over 32,000 entries and 15 total columns. Take a look at the marital status column, which has the dtype object. A dtype of object is how pandas stores strings and is a good indicator that a variable might be categorical.
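
As a rough sketch of this step, the snippet below calls info on a tiny stand-in DataFrame. The column names and rows are assumptions made for illustration; the real dataset has over 32,000 entries and 15 columns. Depending on the pandas version, text columns may be reported as object or as a dedicated string dtype.

```python
import pandas as pd

# Hypothetical stand-in for the adult census income data; the column
# names and values are illustrative assumptions.
adult = pd.DataFrame({
    "Age": [39, 50, 38, 53],
    "Marital Status": ["Never-married", "Married-civ-spouse", "Divorced", "Married-civ-spouse"],
    "Above/Below 50k": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Prints the number of entries, column names, non-null counts, and dtypes.
adult.info()

# Columns that hold text are candidates for categorical variables.
text_columns = [col for col in adult.columns if pd.api.types.is_string_dtype(adult[col])]
print(text_columns)
```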

5. Using describe

We can explore the marital status column in more detail by using the describe method on the pandas Series Marital Status. We see that there are seven unique values, with the married-civ-spouse option being the most common entry with almost 15,000 occurrences.
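
A small sketch of the same call follows, using a handful of made-up marital status values rather than the real column, so the numbers differ from those quoted above.

```python
import pandas as pd

# Hypothetical sample of marital status values (illustrative only).
marital_status = pd.Series(
    ["Married-civ-spouse", "Never-married", "Married-civ-spouse",
     "Divorced", "Married-civ-spouse", "Never-married"],
    name="Marital Status",
)

# For a text column, describe reports count, unique, top (most common
# value), and freq (how often the top value occurs).
summary = marital_status.describe()
print(summary)
```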

6. Using value counts

Another way to explore the marital status column is to use the value counts method. This method returns a frequency table of the values found in a pandas Series. We still see the nearly 15,000 entries for married-civ-spouse, but we also see the remaining counts for each unique value in the marital status column.
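
Here is a minimal sketch of value_counts on a few made-up marital status values; the counts are for this toy sample only.

```python
import pandas as pd

# Hypothetical sample of marital status values (illustrative only).
marital_status = pd.Series(
    ["Married-civ-spouse", "Never-married", "Married-civ-spouse",
     "Divorced", "Married-civ-spouse", "Never-married"],
    name="Marital Status",
)

# One row per unique value, sorted from most to least frequent.
counts = marital_status.value_counts()
print(counts)
```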

7. Using value counts with normalize

Although the value counts method has a few parameters, the most commonly used one is normalize. By setting normalize equal to True, the output will contain relative frequencies instead of the counts of the unique values. The values shown are the proportions of all responses in a pandas Series that are equal to a specific value. In this example, married-civ-spouse makes up 46% of all responses.
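
The same idea, sketched on a toy sample. The values are made up, so the proportions will not match the 46% from the real dataset.

```python
import pandas as pd

# Hypothetical sample of marital status values (illustrative only).
marital_status = pd.Series(
    ["Married-civ-spouse", "Never-married", "Married-civ-spouse",
     "Divorced", "Married-civ-spouse", "Never-married"],
    name="Marital Status",
)

# normalize=True divides each count by the total number of responses,
# so the results are proportions that sum to 1.
proportions = marital_status.value_counts(normalize=True)
print(proportions)
```

Multiplying the result by 100 gives percentages, if that reads better in a report.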

8. Knowledge check

Let's recap the different types of data we have discussed by working through a couple of exercises.
