1. Check-in 2
Let's revisit the exercise where you explored the association between
2. Spam and images
whether an email has an image and whether or not it's spam. The plot you created was this bar chart of proportions. I want to emphasize an important but subtle distinction when discussing proportions like this. This plot shows the proportions of spam or not-spam within the subsets of emails that either have an image or do not. Said another way, they are conditioned on the has image variable. We get a slightly different story if we exchange the variables so that we condition on spam. Among emails that are spam, almost none of them have an image, while the proportion within non-spam is larger, but still less than 5%. If you're building a spam filter, a situation where you don't actually get to see the value of spam, it'd make more sense to think about conditioning on the has image variable.
3. Spam and images
In this case, we can tell that this variable would be an awful spam filter by itself.
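To make the two kinds of conditioning concrete, here is a minimal sketch in base R. The data frame below is a made-up toy example, not the course's email dataset, and the names email, image, and spam are chosen only for illustration.

```r
# Hypothetical toy data (made-up values, not the course's email dataset).
email <- data.frame(
  spam  = c("not-spam", "not-spam", "not-spam", "spam",
            "not-spam", "spam", "not-spam", "not-spam"),
  image = c("no", "no", "yes", "no", "no", "no", "no", "yes")
)

# Two-way table of counts: rows are image, columns are spam.
tab <- table(image = email$image, spam = email$spam)

# Condition on image: proportions of spam within each image group,
# so each row sums to 1.
by_image <- prop.table(tab, margin = 1)

# Condition on spam: proportions of image within each spam group,
# so each column sums to 1.
by_spam <- prop.table(tab, margin = 2)

by_image
by_spam
```

The counts are the same in both tables; only the margin argument changes which variable the proportions are taken within.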
4. Ordering bars
When we're working with bar charts, you can often make them more readily interpretable if you give them a sensible ordering. Recall how in the last video, we collapsed all emails with at least one exclamation mark into a single level of a new two-level categorical variable.
5. Ordering bars
That led to this bar chart, which was informative, but it might have caused you to do a double-take when you first saw it. The plot on the left gets us used to seeing the bar for the zeroes on the left, while in the plot on the right, that bar is on the right side.
6. Ordering bars
Let's go through how we would flip the ordering of those bars so that they agree with the plot on the left.
7. Ordering bars
The first step is to save the mutated categorical variable back into the dataset. The ordering of the bars isn't determined within the code for the plot, but in the way that R represents that variable. If we call levels on the new variable, it returns NULL. This is because this variable is actually a logical variable, not a factor. To set the ordering of the levels, let's convert it to a factor with the factor function and provide the levels in the order we want. We can then save this variable back into the dataset.
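As a rough sketch of those steps in base R: the data frame and the column names exclaim_mess and zero below are assumptions made for illustration, with made-up values, and the slide's own code may look different.

```r
# Hypothetical counts of exclamation marks per email (made-up values).
email <- data.frame(exclaim_mess = c(0, 3, 0, 1, 0, 12))

# A comparison produces a logical vector, which has no levels.
email$zero <- email$exclaim_mess == 0
levels_before <- levels(email$zero)  # NULL, because zero is logical

# Convert to a factor, listing the levels in the order the bars
# should appear, and save the result back into the dataset.
email$zero <- factor(email$zero, levels = c("TRUE", "FALSE"))
levels_after <- levels(email$zero)   # "TRUE" "FALSE"
```

Any plot built from email$zero after this point picks up the new level order without changes to the plotting code.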
8. Ordering bars
Now, when we go to make the plot with the same code,
9. Ordering bars
it exchanges the order of the bars for us.
Here, we decided to order the bars so that they cohered with the structure of another plot. In other circumstances, you might use other criteria to choose the order, including a natural ordering of the levels, arranging the bars in increasing or decreasing order of height, or alphabetical order, which is the default. In making this decision, you're thinking about emphasizing a particular interpretation of the plot and transitioning from purely exploratory graphics to expository graphics, where you seek to communicate a particular idea. This is a natural development as you continue along the process of your data analysis.
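Those ordering criteria can be sketched in base R as follows. The vector number and its values are hypothetical, chosen only to show the three orderings side by side.

```r
# Hypothetical categorical variable (made-up values).
number <- c("small", "none", "small", "big", "small", "none")

# Default: levels are sorted alphabetically.
f_alpha <- factor(number)
levels(f_alpha)

# Natural ordering of the levels, stated explicitly.
f_natural <- factor(number, levels = c("none", "small", "big"))
levels(f_natural)

# Decreasing order of bar height: sort the counts, then reuse
# their names as the level order.
counts <- sort(table(number), decreasing = TRUE)
f_by_count <- factor(number, levels = names(counts))
levels(f_by_count)
```

Which of these to use depends on the idea the plot is meant to communicate; the data are identical in all three cases.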
10. Let's practice!
OK, let's return to the case study.