Introducing the data

1. Introducing the data

In this chapter, you'll get a chance to put to use what you know about EDA in exploring a new dataset.

2. Email dataset

The email dataset contains information on all of the emails received by a single email account in a single month. Each email is a case, so we can see that this account received 3,921 emails. Twenty-one variables were recorded on each email, some numerical and others categorical.

Of primary interest is the first column, a categorical variable indicating whether or not the email is spam. It was created by reading through each email individually and deciding whether it looked like spam. The subsequent columns were easier to create: to_multiple is TRUE if the email was addressed to more than one recipient and FALSE otherwise, and image is the number of images attached to the email. It's important that you have a full sense of what each of the variables means, so you'll want to start your exercises by reading about them in the help file.

One of your guiding questions throughout this chapter is: what characteristics of an email are associated with it being spam? Numerical and graphical summaries are a great way of starting to answer that question. Let's review the main graphical tools that we have for numerical data.
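Before turning to the plots, a first look at the data described above might go something like the sketch below. It assumes the dataset is the email data frame shipped with the openintro package; the transcript itself does not name a package, so treat that as an assumption.

```r
# Assumption: the data described above are available as `email` in the
# openintro package. The transcript does not name the package.
if (requireNamespace("openintro", quietly = TRUE)) {
  email <- openintro::email

  # One row per email, one column per recorded variable.
  print(dim(email))

  # Variable types: some numerical, others categorical.
  str(email)

  # How many emails were labelled spam versus not spam.
  print(table(email$spam))

  # In an interactive session, ?email opens the help file that
  # describes each variable.
} else {
  message("Install openintro first, e.g. install.packages('openintro').")
}
```

Reading the help file alongside the str() output is a quick way to match each column name to its meaning.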

3. Histograms

Histograms take continuous data and aggregate it into bins, then draw a bar to a height that corresponds to the number of cases in that bin. They have a tuning parameter that you should play with, the binwidth, to explore the shape of the distribution at different scales.
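To see the effect of the binwidth, here is a minimal sketch that draws the same variable at two scales. The data frame demo and its column num_char are simulated stand-ins invented for illustration, not the real email data.

```r
library(ggplot2)

# Simulated stand-in for a right-skewed numerical variable.
# `demo` and `num_char` are illustrative names, not the real email data.
set.seed(42)
demo <- data.frame(num_char = rlnorm(500, meanlog = 1.5, sdlog = 1))

# Wide bins give a coarse view of the overall shape.
p_wide <- ggplot(demo, aes(x = num_char)) +
  geom_histogram(binwidth = 5)

# Narrow bins show finer detail in the same data.
p_narrow <- ggplot(demo, aes(x = num_char)) +
  geom_histogram(binwidth = 0.5)

print(p_wide)
print(p_narrow)
```

Trying a few values in between is usually the quickest way to find a binwidth that shows the shape without too much noise.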

4. Histograms

If you're interested in building histograms broken down by a categorical variable, they're a good candidate for faceting, which can be done with the facet_wrap() layer.
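A faceted histogram could be sketched as follows. Again, demo is a simulated stand-in with a two-level spam column and a numerical num_char column; both names and all values are assumptions made for illustration.

```r
library(ggplot2)

# Simulated stand-in: a two-level categorical variable and a numerical one.
# Names and values are illustrative, not the real email data.
set.seed(42)
demo <- data.frame(
  spam     = rep(c("not spam", "spam"), times = c(400, 100)),
  num_char = c(rlnorm(400, meanlog = 2, sdlog = 1),
               rlnorm(100, meanlog = 1, sdlog = 1))
)

# One histogram panel per level of `spam`.
p_facet <- ggplot(demo, aes(x = num_char)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ spam)

print(p_facet)
```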

5. Boxplots

Box plots excel at comparing multiple distributions, and this is reflected in the ggplot syntax, which requires you to put something on the x axis. If that variable has two levels, you'll get two side-by-side box plots. The box plot uses robust measures, the median and the IQR, to draw the box, and it also flags potential outliers for you. A downside, however, is that it can hide more complicated shapes, such as a bimodal distribution.
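As a rough sketch, side-by-side box plots and the robust summaries behind them might look like this. The demo data frame is simulated and its column names are assumptions, not the real email variables.

```r
library(ggplot2)

# Simulated stand-in data; names and values are illustrative only.
set.seed(42)
demo <- data.frame(
  spam     = rep(c("not spam", "spam"), times = c(400, 100)),
  num_char = c(rlnorm(400, meanlog = 2, sdlog = 1),
               rlnorm(100, meanlog = 1, sdlog = 1))
)

# The categorical variable goes on x, the numerical one on y:
# one box per level of `spam`.
p_box <- ggplot(demo, aes(x = spam, y = num_char)) +
  geom_boxplot()

print(p_box)

# The robust measures the boxes are built from.
group_medians <- tapply(demo$num_char, demo$spam, median)
group_iqrs    <- tapply(demo$num_char, demo$spam, IQR)
print(group_medians)
print(group_iqrs)
```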

6. Boxplots

To get a single box plot, just map x to the number 1.
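A single box plot for one variable could be drawn like this, using a simulated num_char column as a stand-in for any numerical variable (the name and values are illustrative assumptions).

```r
library(ggplot2)

# Simulated stand-in for one numerical variable; not the real email data.
set.seed(42)
demo <- data.frame(num_char = rlnorm(500, meanlog = 1.5, sdlog = 1))

# Mapping x to the constant 1 satisfies the required x aesthetic
# and yields exactly one box.
p_single <- ggplot(demo, aes(x = 1, y = num_char)) +
  geom_boxplot()

print(p_single)
```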

7. Density plots

Density plots summarize the data by drawing a smooth line to represent its shape. As with histograms, you can tune the display: change the smoothness of a density plot with the bandwidth parameter, bw, which you pass to the geom_density() function. Density plots can be faceted just like histograms, or they can be overlaid on one another,

8. Density plots

by mapping the fill color of the distribution to a second variable. If you want the colors to be somewhat transparent, specify an alpha parameter between 0 and 1.
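Putting the density plot options together, overlaid and semi-transparent densities might be sketched as below. The demo data frame is simulated, and the particular bw and alpha values are arbitrary choices for illustration rather than recommendations.

```r
library(ggplot2)

# Simulated stand-in data; names and values are illustrative only.
set.seed(42)
demo <- data.frame(
  spam     = rep(c("not spam", "spam"), times = c(400, 100)),
  num_char = c(rlnorm(400, meanlog = 2, sdlog = 1),
               rlnorm(100, meanlog = 1, sdlog = 1))
)

# fill = spam overlays one density per level;
# bw sets the smoothing bandwidth, alpha the transparency (0 to 1).
p_dens <- ggplot(demo, aes(x = num_char, fill = spam)) +
  geom_density(bw = 0.5, alpha = 0.3)

print(p_dens)
```

A larger bw gives a smoother curve and a smaller one a wigglier curve, much as a wider or narrower binwidth does for a histogram.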

9. Let's practice!

With that brief introduction, let's dive into this new dataset.