Spam and num_char
Is there an association between spam and the length of an email? You could imagine a story either way:
- Spam is more likely to be a short message tempting me to click on a link, or
- My normal email is likely shorter since I exchange brief emails with my friends all the time.
Here, you'll use the email dataset to settle that question. Begin by bringing up the help file and learning about all the variables with ?email.
As you explore the association between spam and the length of an email, use this opportunity to try out linking a dplyr chain with the layers in a ggplot2 object.
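As a rough illustration of that pattern, here is a minimal sketch on the built-in mtcars data rather than the email dataset. The choice of mtcars and the object name p are assumptions made for illustration only. The dplyr verbs are joined with %>%, and once the data frame is piped into ggplot(), further layers are joined with +.

```r
library(dplyr)
library(ggplot2)

# dplyr steps are chained with %>%; ggplot2 layers are added with +
p <- mtcars %>%
  mutate(cyl = factor(cyl)) %>%        # treat cylinder count as a category
  ggplot(aes(x = cyl, y = mpg)) +      # the piped data frame becomes the plot data
  geom_boxplot()                       # one box per level of cyl

p  # printing the object draws the plot
```

Switching from %>% to + at the ggplot() call is the part that is easy to get wrong: everything before ggplot() transforms data, everything after it adds plot layers.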
This exercise is part of the course
Exploratory Data Analysis in R
Exercise instructions
Using the email dataset
- Load the packages ggplot2, dplyr, and openintro.
- Compute appropriate measures of the center and spread of num_char for both spam and not-spam using group_by() and summarize(). No need to name the new columns created by summarize().
- Construct side-by-side box plots to visualize the association between the same two variables. It will be useful to mutate() a new column containing a log-transformed version of num_char.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load packages
# Compute summary statistics
email %>%
  ___ %>%
  ___
# Create plot
email %>%
  mutate(log_num_char = ___) %>%
  ggplot(aes(x = ___, y = log_num_char)) +
  ___
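One possible way to fill in the blanks is sketched below; it is not necessarily the exercise's reference solution. It assumes that median and IQR are the "appropriate" measures, on the reasoning that num_char is likely right-skewed, and that the natural log is an acceptable transformation. The names spam_summary and spam_plot, and wrapping spam in factor(), are additions for illustration so that the grouping variable is treated as categorical whether it is stored as a number or a factor.

```r
# Load packages
library(ggplot2)
library(dplyr)
library(openintro)

# Compute summary statistics
# Assumption: median and IQR are chosen because they are robust to skew
spam_summary <- email %>%
  group_by(spam) %>%
  summarize(median(num_char), IQR(num_char))

spam_summary

# Create plot
# Assumption: natural log; factor(spam) forces a categorical x-axis
spam_plot <- email %>%
  mutate(log_num_char = log(num_char)) %>%
  ggplot(aes(x = factor(spam), y = log_num_char)) +
  geom_boxplot()

spam_plot
```

Comparing the two medians in the summary with the positions of the two boxes in the plot is one way to check which of the two stories from the introduction the data support.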