Spam and !!!

Let's look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.

Experiment with different types of plots until you find the most informative one. Recall that you've seen:

  • Side-by-side box plots
  • Faceted histograms
  • Overlaid density plots
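As a rough sketch of the three plot types listed above, the following uses ggplot2 on a small made-up data frame. `toy_email` and its values are hypothetical stand-ins for the course's `email` dataset, and the level names "not-spam" and "spam" are assumed; only the column names `spam` and `exclaim_mess` come from the exercise.

```r
library(ggplot2)

# Hypothetical toy data standing in for `email`; the values are made up
toy_email <- data.frame(
  spam = rep(c("not-spam", "spam"), each = 6),
  exclaim_mess = c(0, 1, 1, 3, 8, 40, 0, 0, 0, 1, 2, 9)
)

# Side-by-side box plots: one box per group on a shared y-axis
p_box <- ggplot(toy_email, aes(x = spam, y = exclaim_mess)) +
  geom_boxplot()

# Faceted histograms: one panel per group
p_hist <- ggplot(toy_email, aes(x = exclaim_mess)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ spam)

# Overlaid density plots: semi-transparent fills on shared axes
p_dens <- ggplot(toy_email, aes(x = exclaim_mess, fill = spam)) +
  geom_density(alpha = 0.3)
```

Printing any of the three objects (for example `p_box`) draws the plot. Which one is most informative depends on the shape of the data, which is what the exercise asks you to judge.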

This exercise is part of the course Exploratory Data Analysis in R.

Exercise instructions

The email dataset is still available in your workspace.

  • Calculate appropriate measures of the center and spread of exclaim_mess for both spam and not-spam using group_by() and summarize().
  • Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
  • If you decide to use a log transformation, remember that log(0) is -Inf in R, which isn't a very useful value! You can get around this by adding a small number (like 0.01) to the quantity inside the log() function, so that the argument to log() is never zero. A shift this small has a negligible effect on the nonzero counts, so it won't meaningfully change your comparison of the two groups.
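The grouped summary and the log(0) issue described above can be sketched on toy data as follows. This is an illustration rather than the exercise solution: `toy_email` and its values are hypothetical, the level names "not-spam" and "spam" are assumed, and the choice of median and IQR reflects the assumption that the counts are right-skewed.

```r
library(dplyr)

# Hypothetical stand-in for the course's `email` data; values are made up
toy_email <- data.frame(
  spam = factor(c("not-spam", "not-spam", "not-spam", "not-spam",
                  "spam", "spam", "spam", "spam")),
  exclaim_mess = c(0, 1, 2, 40, 0, 0, 1, 9)
)

# Median and IQR are robust to a long right tail, unlike mean and sd
toy_summary <- toy_email %>%
  group_by(spam) %>%
  summarize(
    median_exclaim = median(exclaim_mess),
    iqr_exclaim = IQR(exclaim_mess)
  )

toy_summary

# log(0) is -Inf, so zero counts need a small offset before transforming
log_raw <- log(toy_email$exclaim_mess)
log_shifted <- log(toy_email$exclaim_mess + 0.01)

any(is.infinite(log_raw))      # zeros produce -Inf
any(is.infinite(log_shifted))  # the offset keeps every value finite
```

When plotting, the same shifted expression can be placed directly inside `aes()`, for example `aes(x = log(exclaim_mess + 0.01))`.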

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Compute center and spread for exclaim_mess by spam




# Create plot for spam and exclaim_mess
