Spam and !!!
Let's look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.
Experiment with different types of plots until you find the most informative one. Recall that you've seen:
- Side-by-side box plots
- Faceted histograms
- Overlaid density plots
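As a reminder of what these three plot types look like in ggplot2, here is a minimal sketch. It assumes the ggplot2 package is installed and uses a small made-up data frame, toy, as a hypothetical stand-in for the course data, so the values carry no meaning beyond illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the course data; the values are made up
toy <- data.frame(
  spam = factor(rep(c("not-spam", "spam"), each = 6)),
  exclaim_mess = c(0, 1, 1, 2, 4, 9, 0, 0, 1, 3, 8, 40)
)

# Side-by-side box plots: one box per spam category
p_box <- ggplot(toy, aes(x = spam, y = exclaim_mess)) +
  geom_boxplot()

# Faceted histograms: one panel per spam category
p_hist <- ggplot(toy, aes(x = exclaim_mess)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ spam)

# Overlaid density plots: both categories share one set of axes
p_dens <- ggplot(toy, aes(x = exclaim_mess, fill = spam)) +
  geom_density(alpha = 0.3)
```

Printing any of p_box, p_hist, or p_dens draws the corresponding plot, which makes it easy to compare how well each one separates the two groups.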
This exercise is part of the course Exploratory Data Analysis in R
Exercise instructions
The email dataset is still available in your workspace.
- Calculate appropriate measures of the center and spread of exclaim_mess for both spam and not-spam using group_by() and summarize().
- Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
- If you decide to use a log transformation, remember that log(0) is -Inf in R, which isn't a very useful value! You can get around this by adding a small number (like 0.01) to the quantity inside the log() function. This way, the value passed to log() is never zero. This small shift to the right won't meaningfully affect your results.
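The log(0) issue is easy to see on a few numbers. The sketch below uses only base R; the counts vector is a hypothetical example, not taken from the email data.

```r
# Hypothetical exclamation-mark counts, including a zero
counts <- c(0, 1, 5)

# Taking the log directly turns the zero into -Inf
raw_log <- log(counts)

# Adding a small constant first keeps every value finite
shifted_log <- log(counts + 0.01)

print(raw_log)
print(shifted_log)
```

Infinite values are typically dropped from a plot with a warning, so messages with zero exclamation marks would disappear from the picture without the shift.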
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Compute center and spread for exclaim_mess by spam
# Create plot for spam and exclaim_mess
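One possible way to fill in the two steps is sketched below. It is not the official solution. It assumes dplyr and ggplot2 are installed and runs on email_toy, a made-up stand-in for the email dataset, so that the code is self-contained. The choice of median and IQR is also an assumption, made because counts like these tend to be right-skewed, and a faceted histogram is just one of the plot types listed earlier.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for `email`; the values are made up
email_toy <- data.frame(
  spam = factor(rep(c("not-spam", "spam"), each = 6)),
  exclaim_mess = c(0, 1, 1, 2, 4, 9, 0, 0, 1, 3, 8, 40)
)

# Compute center and spread for exclaim_mess by spam
exclaim_summary <- email_toy %>%
  group_by(spam) %>%
  summarize(
    median_exclaim = median(exclaim_mess),
    iqr_exclaim = IQR(exclaim_mess)
  )

# Create plot for spam and exclaim_mess,
# shifting by 0.01 so that log() never receives zero
exclaim_plot <- email_toy %>%
  mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
  ggplot(aes(x = log_exclaim_mess)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ spam)

print(exclaim_summary)
```

To try this on the course data, replace email_toy with email in both pipelines and print exclaim_plot to draw the histograms.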