Spam and !!!
Let's look at a more obvious indicator of spam: exclamation marks. exclaim_mess contains the number of exclamation marks in each message. Using summary statistics and visualization, see if there is a relationship between this variable and whether or not a message is spam.
Experiment with different types of plots until you find one that is the most informative. Recall that you've seen:
- Side-by-side box plots
- Faceted histograms
- Overlaid density plots
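
As a reminder of what those three plot types look like in code, here is a minimal sketch. It assumes the email data comes from the openintro package and that ggplot2 is installed; the object names email_plot, box_p, hist_p, and dens_p are made up for illustration. The plots use the raw counts, so expect them to be heavily skewed.

```r
library(ggplot2)
library(openintro)  # assumption: this package supplies the email data frame

# Work on a copy and make sure spam is a factor so it can group the plots
email_plot <- email
email_plot$spam <- factor(email_plot$spam)

# Side-by-side box plots: one box per level of spam
box_p <- ggplot(email_plot, aes(x = spam, y = exclaim_mess)) +
  geom_boxplot()

# Faceted histograms: one panel per level of spam
hist_p <- ggplot(email_plot, aes(x = exclaim_mess)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ spam)

# Overlaid density plots: semi-transparent fills so both curves stay visible
dens_p <- ggplot(email_plot, aes(x = exclaim_mess, fill = spam)) +
  geom_density(alpha = 0.3)

# Print whichever plot you want to inspect, for example:
box_p
```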
This exercise is part of the course
Exploratory Data Analysis in R
Exercise instructions
The email dataset is still available in your workspace.
- Calculate appropriate measures of the center and spread of exclaim_mess for both spam and not-spam using group_by() and summarize().
- Construct an appropriate plot to visualize the association between the same two variables, adding in a log-transformation step if necessary.
- If you decide to use a log transformation, remember that log(0) is -Inf in R, which isn't a very useful value! You can get around this by adding a small number (like 0.01) to the quantity inside the log() function. This way, your value is never zero. This small shift to the right won't affect your results.
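
The log(0) issue in the last bullet can be seen with a few numbers in base R. The counts vector below is hypothetical and stands in for a handful of exclaim_mess values; it is not taken from the email data.

```r
# Hypothetical exclamation-mark counts, for illustration only
counts <- c(0, 1, 4, 20)

# Without a shift, the zero count becomes -Inf
raw_log <- log(counts)
raw_log[1]   # -Inf

# Adding a small constant keeps every value finite
shifted_log <- log(counts + 0.01)
is.finite(shifted_log)
```

Note that the zero counts still end up well to the left of the rest after the shift, since log(0.01) is a large negative number, so they usually show up as a separate spike in a histogram.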
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Compute center and spread for exclaim_mess by spam
# Create plot for spam and exclaim_mess
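
One possible way to fill in the two steps is sketched below. It is not the only reasonable answer. It assumes dplyr and ggplot2 are installed and that email comes from the openintro package. The median and IQR are used because counts like these tend to be strongly right-skewed. The names exclaim_summary, log_exclaim_mess, and log_hist are illustrative.

```r
library(dplyr)
library(ggplot2)
library(openintro)  # assumption: this package supplies the email data frame

# Compute center and spread for exclaim_mess by spam
# (median and IQR are robust choices for skewed counts)
exclaim_summary <- email %>%
  group_by(spam) %>%
  summarize(
    median_exclaim = median(exclaim_mess),
    iqr_exclaim = IQR(exclaim_mess)
  )
exclaim_summary

# Create plot for spam and exclaim_mess
# (0.01 is added before taking the log so that zero counts stay finite)
log_hist <- email %>%
  mutate(log_exclaim_mess = log(exclaim_mess + 0.01)) %>%
  ggplot(aes(x = log_exclaim_mess)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ spam)
log_hist
```

Faceted histograms are just one choice here. Swapping geom_histogram() and facet_wrap() for a box plot or density layer lets you compare which view is most informative.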