Stats outside geoms

1. Stats outside geoms

Let's take a look at some statistics that we call directly.

2. Basic plot

In this plot of the iris dataset, sepal length is shown for each species. What can we do with this data?

3. Calculating statistics

A typical way to summarize this data would be to take the mean and standard deviation or the 95% confidence interval. We can calculate these values manually, or we can do it directly in ggplot2. Let's see how it works.
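As a rough sketch of the manual route, we could compute the per-species mean and standard deviation in base R. The name manual_summary and the column names y, ymin and ymax are our own choices here, picked to mirror the aesthetics ggplot2 expects.

```r
# Per-species mean and standard deviation of sepal length (base R only)
means <- tapply(iris$Sepal.Length, iris$Species, mean)
sds   <- tapply(iris$Sepal.Length, iris$Species, sd)

# Assemble the summary by hand: mean plus or minus one standard deviation.
# The column names are illustrative, chosen to match ggplot2 aesthetics.
manual_summary <- data.frame(
  Species = names(means),
  y       = as.numeric(means),
  ymin    = as.numeric(means - sds),
  ymax    = as.numeric(means + sds)
)
manual_summary
```

This works, but it leaves us with a second data frame to keep in sync with the plot, which is what the ggplot2 approach avoids.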

4. Calculating statistics

The function smean.sdl from the Hmisc package returns the mean plus or minus a multiple of the standard deviation as a named vector. By setting the mult argument to 1, we specify 1 standard deviation. In ggplot2, the function mean_sdl wraps smean.sdl and converts this vector to a data frame, renaming the variables to match ggplot2 aesthetics.
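To make the difference concrete, here is a small sketch that calls both functions on the sepal lengths. It assumes the Hmisc package is installed, since mean_sdl relies on it. The variable names x and sdl are ours.

```r
library(ggplot2)  # provides mean_sdl(), which needs Hmisc to be installed

x <- iris$Sepal.Length

# Named vector with elements Mean, Lower and Upper
Hmisc::smean.sdl(x, mult = 1)

# One-row data frame with columns y, ymin and ymax
sdl <- mean_sdl(x, mult = 1)
sdl
```

Note that mult defaults to 2 in both functions, so leaving it out gives the mean plus or minus two standard deviations.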

5. stat_summary()

We call mean_sdl using the fun.data argument of the stat_summary function. By default we get geom_pointrange, which requires y, ymin and ymax, exactly what is returned by mean_sdl.
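A minimal sketch of this call might look as follows. We assume Species on the x axis and Sepal.Length on the y axis, as in the basic plot, and pass mult through fun.args.

```r
library(ggplot2)

# stat_summary() applies mean_sdl to each species and draws the
# result with the default geom, pointrange
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1))
p
```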

6. stat_summary()

For error bars, we can calculate just the mean and use "point" as the geom. Then we call mean_sdl with the "errorbar" geom, where we can also set the width of the error bars.
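The two layers could be sketched like this. The width of 0.1 is an arbitrary choice for illustration, and the fun argument assumes ggplot2 3.3.0 or later (older versions used fun.y instead).

```r
library(ggplot2)

p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  # Layer 1: the mean of each species as a point
  stat_summary(fun = mean, geom = "point") +
  # Layer 2: mean plus or minus one standard deviation as error bars
  stat_summary(fun.data = mean_sdl, fun.args = list(mult = 1),
               geom = "errorbar", width = 0.1)
p
```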

7. stat_summary()

But notice that we could have also made a typical bar plot with error bars, by simply calling the bar geom - but this is NOT RECOMMENDED! We'll learn why when we get to data viz best practices later on!

8. 95% confidence interval

The 95% CI is also straightforward. mean_cl_normal returns the mean and the upper and lower bounds of the 95% confidence interval, calculated using the t-distribution.
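As a sketch, we can compare mean_cl_normal with the same interval computed by hand from the t-distribution, and then use it in stat_summary. The names ci and half_width are ours, and Hmisc again needs to be installed.

```r
library(ggplot2)

x <- iris$Sepal.Length

# Data frame with y (mean), ymin and ymax (95% CI bounds)
ci <- mean_cl_normal(x)
ci

# The same interval by hand: mean plus or minus t quantile times standard error
half_width <- qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
c(lower = mean(x) - half_width, upper = mean(x) + half_width)

# Used per species inside a plot
p <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun.data = mean_cl_normal, geom = "errorbar", width = 0.1)
p
```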

9. Other stat_ functions

Two other useful stat_ functions are stat_function and stat_qq. These are particularly useful if we want to look at distributions. Statisticians typically use visual cues to get an idea of the distribution of their data instead of relying only on numbers.

10. MASS::mammals

To see this in action, let's return to the first example we used in the first course: the mammalian body and brain weights stored in the mammals data frame. We mentioned that our linear model fitted the log10-transformed data reasonably well. What we mean is that the log-transformed data appears to be normally distributed, so let's take a look at that in detail.

11. Normal distribution

For stat_function, we can specify any function and produce the theoretical probability distribution as a line. Here, we call a normal distribution, that's dnorm, and add arguments as a list to center it on our distribution, that's the mean and the sd. This allows us to see how closely our data follows a normal distribution. The log10 mammalian body weight is described very well by a normal curve. Notice that we have another geom here, geom_rug, which adds those little tick marks on the bottom of the plot. This is a handy way of seeing the actual values in combination with a summary distribution. An empirical density plot would be a nice alternative to the histogram, but we'll get to that in the next course.
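A plot along these lines could be sketched as follows. The data frame name mam, the number of bins and the line colour are our assumptions, and after_stat() requires ggplot2 3.3.0 or later.

```r
library(ggplot2)

# log10 body weights from MASS::mammals (MASS must be installed)
mam <- data.frame(body = log10(MASS::mammals$body))

p <- ggplot(mam, aes(x = body)) +
  # Histogram on the density scale so it is comparable to dnorm
  geom_histogram(aes(y = after_stat(density)), bins = 20) +
  # Tick marks showing the individual observations
  geom_rug() +
  # Theoretical normal curve centered on the sample mean and sd
  stat_function(fun = dnorm, colour = "red",
                args = list(mean = mean(mam$body), sd = sd(mam$body)))
p
```

Putting the histogram on the density scale matters here: with raw counts on the y axis, the normal curve would sit almost flat along the bottom of the plot.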

12. QQ plot

QQ plots also allow us to compare our data to a distribution. In this case, we plot our sample against the theoretical distribution, like the normal, and draw a line intersecting the scatter plot at the first and third quartiles. The closer our data aligns to this line, the more closely it matches the theoretical distribution in question.
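One way to sketch this in ggplot2 is with stat_qq for the points and geom_qq_line for the reference line, which by default passes through the first and third quartiles. This assumes ggplot2 3.0.0 or later for geom_qq_line. The data frame name mam is ours.

```r
library(ggplot2)

# log10 body weights from MASS::mammals (MASS must be installed)
mam <- data.frame(body = log10(MASS::mammals$body))

p <- ggplot(mam, aes(sample = body)) +
  # Sample quantiles against theoretical normal quantiles
  stat_qq() +
  # Reference line through the first and third quartiles
  geom_qq_line(colour = "red")
p
```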

13. Your turn!

There are more stat_ functions which are available for you to explore during the exercises, so let's take a look.
