Session Ready
Exercise

Spam and !!!

Let's look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.

Experiment with different types of plots until you find one that is the most informative. Recall that you've seen:

  • Side-by-side box plots
  • Faceted histograms
  • Overlaid density plots
Instructions
100 XP

The email dataset is still available in your workspace.

  • Calculate appropriate measures of the center and spread of exclaim_mess for both spam and not-spam using group_by() and summarize().
  • Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
  • If you decide to use a log transformation, remember that log(0) is -Inf in R, which isn't a very useful value! You can get around this by adding a small number (like 0.01) to the quantity inside the log() function. This way, your value is never zero. This small shift to the right won't affect your results.