Spam and num_char
Is there an association between spam and the length of an email? You could imagine a story either way:
- Spam is more likely to be a short message tempting me to click on a link, or
- My normal email is likely shorter since I exchange brief emails with my friends all the time.
Here, you'll use the email dataset to settle that question. Begin by bringing up the help file and learning about all the variables with ?email.
As you explore the association between spam and the length of an email, use this opportunity to try out linking a dplyr chain with the layers in a ggplot2 object.
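To make the idea of linking a dplyr chain to ggplot2 layers concrete, here is a minimal sketch. It deliberately uses R's built-in mtcars data as a stand-in, not the email dataset, and the column name log_hp and the object name p are chosen only for illustration. The point it shows is that the data frame flows through %>% into ggplot(), after which layers are added with +.

```r
library(dplyr)
library(ggplot2)

# Stand-in data: the built-in mtcars data frame, not the email dataset.
# The dplyr step (mutate) is piped with %>%; ggplot layers are added with +.
p <- mtcars %>%
  mutate(log_hp = log(hp)) %>%                 # hypothetical column name
  ggplot(aes(x = factor(am), y = log_hp)) +    # one box per transmission type
  geom_boxplot()

p
```

Note the switch in operators: %>% passes the data frame from one step to the next, while + stacks layers onto the plot once ggplot() has been called.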
This exercise is part of the course
Exploratory Data Analysis in R
Exercise instructions
Using the email dataset:
- Load the packages ggplot2, dplyr, and openintro.
- Compute appropriate measures of the center and spread of num_char for both spam and not-spam using group_by() and summarize(). No need to name the new columns created by summarize().
- Construct side-by-side box plots to visualize the association between the same two variables. It will be useful to mutate() a new column containing a log-transformed version of num_char.
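As a hedged illustration of the grouped-summary step, the sketch below uses a small made-up data frame called toy, not the real email data. It assumes the variable is right-skewed, as character counts often are, and so picks the median and IQR as the center and spread; if your own look at num_char suggests a roughly symmetric distribution, the mean and standard deviation would be the usual alternative.

```r
library(dplyr)

# Made-up values for illustration only; this is not the email dataset.
toy <- tibble(
  spam = factor(c("not-spam", "not-spam", "not-spam", "spam", "spam", "spam")),
  num_char = c(2, 10, 60, 1, 3, 5)
)

# One row per group; the summary columns are left unnamed, so dplyr
# labels them with the expressions themselves.
toy_summary <- toy %>%
  group_by(spam) %>%
  summarize(median(num_char), IQR(num_char))

toy_summary
```

The single large value in the not-spam group shows why a robust pair like median and IQR can be the safer choice: one long message pulls the mean upward but leaves the median unchanged.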
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load packages

# Compute summary statistics
email %>%
  ___ %>%
  ___

# Create plot
email %>%
  mutate(log_num_char = ___) %>%
  ggplot(aes(x = ___, y = log_num_char)) +
  ___