1. Modifying imports: true/false data
In this course, you've mostly handled string and numeric data. This lesson focuses on another data type, Booleans, and special considerations for working with them.
2. Boolean Data
A Boolean variable has only two possible values, True or False, which makes Booleans convenient for tasks like filtering. Despite this simplicity, they can be tricky.
We'll use a subset of the New Developer Survey data to focus on them. This version only has an ID column,
3. Boolean Data
and columns for whether the respondent attended a programming bootcamp
4. Boolean Data
and if they took out a loan for it.
5. Boolean Data
True and false are represented in a few ways for demonstration purposes:
6. Boolean Data
zeros and ones, which are common among people with coding experience,
7. Boolean Data
Trues and Falses,
8. Boolean Data
and yeses and nos, which tend to appear in surveys and forms.
9. pandas and Booleans
Let's load this data with no additional arguments and check dtypes.
pandas interpreted no columns as Boolean! Even True/False columns were loaded as floats. Let's investigate.
10. pandas and Booleans
First, let's sum the dataframe's columns to see how many Trues each float column has. Recall that these columns code True as 1 and False as 0.
In our data subset, 38 attended a bootcamp and 14 took out a loan for it.
Let's also check how many values in each column are missing by summing the results of isna().
Every record has a value for bootcamp attendance, but most of the loan values are blank, even for some students who attended a bootcamp.
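The two checks just described can be sketched on a small stand-in dataframe. The column names and values below are hypothetical, not the survey's actual data; the point is only that summing a 0/1 float column counts its Trues, and summing isna() counts its blanks.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey subset:
# 1.0 means True, 0.0 means False, NaN means the cell was blank.
survey = pd.DataFrame({
    "AttendedBootcamp": [1.0, 0.0, 1.0, 1.0, 0.0],
    "BootcampLoan": [1.0, np.nan, 0.0, np.nan, np.nan],
})

# Summing a 0/1 column counts its Trues; NaN is skipped by default.
true_counts = survey.sum()

# isna() marks missing cells; summing those marks counts blanks per column.
missing_counts = survey.isna().sum()

print(true_counts)
print(missing_counts)
```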
11. pandas and Booleans
Now let's cast these columns as Booleans with the dtype argument. read_excel()'s dtype works exactly like read_csv()'s, so we pass a dictionary specifying which columns should be Boolean. Checking dtypes, it looks like it worked.
12. pandas and Booleans
Checking counts of True values reveals issues.
The loan columns have too many Trues, and the yes/no ones are all True.
Checking NA values by column,
we see there aren't any.
13. pandas and Booleans
What happened? pandas automatically loads True/False columns as floats,
but that can be changed with dtype.
Boolean values must be either True or False,
so NAs were re-coded as True.
While pandas recognized that zeros and ones are False and True, respectively,
it did not know what to do with Yes and No, so they were all coded as True.
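Both effects can be reproduced in miniature. This sketch uses astype(bool) on hand-made Series rather than read_excel()'s dtype argument, so it is an approximation of the cast under that assumption, with made-up values.

```python
import numpy as np
import pandas as pd

# Made-up values: a 0/1 float column with one blank, and a yes/no text column.
as_float = pd.Series([1.0, 0.0, np.nan])
as_text = pd.Series(["Yes", "No", "No"], dtype=object)

# A plain bool has no slot for "missing", so NaN becomes True.
float_cast = as_float.astype(bool)

# Any non-empty string is truthy, so "No" also becomes True.
text_cast = as_text.astype(bool)

print(float_cast.tolist())  # [True, False, True]
print(text_cast.tolist())   # [True, True, True]
```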
14. Setting Custom True/False Values
We can solve the issue of the yes/no columns being misinterpreted with read_excel()'s true_values
and false_values arguments.
Each takes a list of values that pandas should treat as True or False when they appear in Boolean columns.
Values in non-Boolean columns are not converted.
15. Setting Custom True/False Values
Let's pass "Yes" and "No" as single-item lists to true_values and false_values.
16. Setting Custom True/False Values
Then, we check True counts with the sum method.
Now the yes/no columns match their counterparts. But there is still the issue of NAs being coded as True.
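Here is a rough sketch of that call. The helper name, file path, and column names are hypothetical. So the example runs without a spreadsheet, the check at the bottom uses read_csv() on in-memory text, which accepts the same true_values and false_values arguments; the stand-in data has no blanks, so it says nothing about how NAs are handled.

```python
from io import StringIO

import pandas as pd


def load_survey_booleans(path):
    """Sketch of the read_excel() call; the column names are hypothetical."""
    return pd.read_excel(
        path,
        dtype={"AttendedBootCampYesNo": bool, "LoanYesNo": bool},
        true_values=["Yes"],   # text to read as True in Boolean columns
        false_values=["No"],   # text to read as False in Boolean columns
    )


# Self-contained check on CSV text, assuming a yes/no column with no blanks.
csv_text = "ID,AttendedBootCampYesNo\n1,Yes\n2,No\n3,Yes\n"
demo = pd.read_csv(
    StringIO(csv_text),
    dtype={"AttendedBootCampYesNo": bool},
    true_values=["Yes"],
    false_values=["No"],
)

# Count Trues with sum(), as in the lesson.
print(demo["AttendedBootCampYesNo"].sum())
```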
17. Boolean Considerations
What to do depends on the data. In our case, we don't want fake Trues, so we might decide to keep the loan columns as floats.
Things to consider when casting a column as Boolean include the presence of NA values,
how the column will be used in the analysis,
the consequences of fake True values,
and whether alternative representations like floats would do.
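One way to act on the first and last of these points is to check for NAs before casting. The helper below is a hypothetical sketch, not part of pandas: it casts a column to bool only when nothing is missing and otherwise leaves it as floats.

```python
import numpy as np
import pandas as pd


def cast_bool_if_complete(column):
    """Cast to bool only when no values are missing; otherwise return unchanged."""
    if column.isna().any():
        return column  # keep floats so blanks stay visible as NaN
    return column.astype(bool)


# Made-up columns: one complete, one with a blank.
attended = pd.Series([1.0, 0.0, 1.0])
loan = pd.Series([1.0, np.nan, 0.0])

attended_checked = cast_bool_if_complete(attended)
loan_checked = cast_bool_if_complete(loan)

print(attended_checked.dtype)  # bool
print(loan_checked.dtype)      # float64
```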
18. Let's practice!
You just learned how to cast columns as Boolean and set custom True and False values. Importantly, you also learned what to consider before doing so. Now, it's time to practice. Good luck!