
Check-in 1

1. Check-in 1

Let's review what you learned in the last several exercises.

2. Review

The box plots show the association between whether or not an email is spam and the length of the email, as measured by the log number of characters. In this dataset, the typical spam message is considerably shorter than the non-spam message, but there is still a reasonable amount of overlap in the two distributions of length.
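A plot of this kind could be sketched as follows. This is a minimal sketch, not the course's own code: the data frame toy_email and its values are invented for illustration, and the column names spam and num_char are assumptions about how the data is laid out.

```r
library(ggplot2)

# Hypothetical toy data standing in for the email dataset (values invented).
toy_email <- data.frame(
  spam     = factor(c("not-spam", "not-spam", "not-spam", "spam", "spam", "spam")),
  num_char = c(12.5, 3.2, 40.1, 0.9, 2.4, 6.8)
)

# Side-by-side box plots of log length, one box per spam category.
# Printing p_box (or calling print(p_box)) draws the plot.
p_box <- ggplot(toy_email, aes(x = spam, y = log(num_char))) +
  geom_boxplot()
```

Putting the categorical variable on the x-axis and the logged length on the y-axis gives one box per group, which makes the overlap between the two distributions easy to judge.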

3. Review

When you looked at the association between spam and the number of exclamation marks used, you found that both distributions are heavily right skewed: there are only a few instances of many exclamation marks being used and many more of 0 or 1 being used. The plot also bucks the expectation that spam messages will be filled with emphatic exclamation marks to entice you to click on their links. If anything, here it's actually not-spam that typically has more exclamation marks. The dominant feature of the exclaim_mess variable, though, is the large proportion of cases that report zero or, on this log scale, -4.6 exclamation marks. This is a common occurrence in data analysis that is often termed "zero-inflation", and there are several common ways to think about those zeros.
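To see where a value like -4.6 can come from, here is a small sketch. It assumes a small offset of 0.01 was added before taking the log; the transcript does not state the offset, so treat that as an assumption. The counts are invented for illustration.

```r
# Hypothetical counts of exclamation marks per message (values invented).
exclaim_mess <- c(0, 0, 0, 1, 0, 2, 0, 15, 0, 1)

# log(0) is -Inf, so a small offset is added before taking the log.
# The 0.01 offset is an assumption, chosen because log(0.01) is about -4.6.
log_exclaim_mess <- log(exclaim_mess + 0.01)

# Every zero count lands on the same value on the log scale.
zero_on_log_scale <- round(log(0.01), 1)

# Share of messages with no exclamation marks at all.
prop_zero <- mean(exclaim_mess == 0)
```

With an offset like this, all the zeros pile up at a single point on the log scale, which is what produces the tall spike in the plot.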

4. Zero inflation strategies

One approach says that there are two mechanisms going on: one generating the zeros and the other generating the non-zeros, so we analyze these two groups separately. A simpler approach thinks of the variable as taking only two values, zero or not-zero, and treats it like a categorical variable. If you want to take the latter approach, the first step is to create this new variable with mutate. Here, our condition is simply that the exclaim_mess variable is zero. Then we can pipe the updated data into a bar chart and facet it based on spam. In the resulting plot, yes, we've lost a lot of information. But it's become very clear

5. Zero inflation strategies

that spam is more likely to contain no exclamation marks, while in not-spam, the opposite is true. Speaking of bar charts, let's review their layout.
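The zero versus not-zero approach described above can be sketched like this. It is a minimal sketch on an invented data frame, toy_email, so the counts do not reflect the real email data; the new column name zero is also just a label chosen for illustration.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the email dataset (values invented).
toy_email <- data.frame(
  spam         = factor(c(rep("not-spam", 6), rep("spam", 4))),
  exclaim_mess = c(0, 3, 1, 0, 5, 2, 0, 0, 0, 1)
)

# Step 1: collapse the count into a two-valued variable, TRUE when it is zero.
toy_email <- toy_email %>%
  mutate(zero = exclaim_mess == 0)

# Step 2: bar chart of the new variable, one panel per spam category.
# Printing p_facet draws the plot.
p_facet <- ggplot(toy_email, aes(x = zero)) +
  geom_bar() +
  facet_wrap(~spam)
```

Each panel then shows two bars, FALSE and TRUE, so the comparison between spam and not-spam comes down to which bar is taller in each panel.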

6. Zero inflation strategies

One way to view associations between multiple categorical variables is like this, with faceting.

7. Bar chart options

Another way that we've seen is using a stacked bar chart. For that plot, you move the second variable from the facet layer to the fill argument inside the aesthetics function. The other consideration is whether you're more interested in counts or proportions. If the latter, you'll want to normalize the plot,

8. Bar chart options

which you can do by adding the position = "fill" argument to the bar geom. The result is a series of conditional proportions, where you're conditioning on whichever variable is on the x-axis.
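The two bar chart options can be sketched side by side as follows. This is a minimal sketch on invented data; which variable goes on the x-axis and which goes to fill is an assumption made here for illustration. The last step computes the same conditional proportions as a table, using base R's prop.table.

```r
library(ggplot2)

# Hypothetical toy data (values invented): zero is TRUE when a message
# has no exclamation marks.
toy_email <- data.frame(
  spam = factor(c(rep("not-spam", 6), rep("spam", 4))),
  zero = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Stacked bar chart of counts: the second variable is mapped to fill.
p_stack <- ggplot(toy_email, aes(x = zero, fill = spam)) +
  geom_bar()

# Normalized version: each bar is rescaled to height 1, so the segments
# show proportions of spam conditional on the x-axis variable (zero).
p_fill <- ggplot(toy_email, aes(x = zero, fill = spam)) +
  geom_bar(position = "fill")

# The same conditional proportions as a table: rows are spam, columns are
# zero, and margin = 2 makes each column sum to 1.
cond_props <- prop.table(table(toy_email$spam, toy_email$zero), margin = 2)
```

Swapping the x and fill variables changes which variable you condition on, so the two versions answer different questions even though they are built from the same counts.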

9. Let's practice!

OK, let's get back to exploring the email dataset.