Exploratory data analysis

1. Exploratory data analysis

In this video, you will learn several ways to compute multiple descriptive statistics for one or many variables at once, both for your dataset overall and by group.

2. Summary statistics

You have already seen that the summary function provides general summary statistics, much like PROC UNIVARIATE in SAS. Two useful R packages, Hmisc and psych, each have a describe function that provides more detailed descriptive statistics.

3. Summary statistics

Calling the summary R function is similar to running PROC UNIVARIATE in SAS.

4. Summary statistics

For these examples you will continue to work with the daviskeep dataset you cleaned in Chapter 2.

5. Summary statistics

The select function in R is similar to the VAR statement in SAS procedures.

6. Summary statistics

When you run this code, the summary function gives you the minimum, maximum, mean, median, and first and third quartiles for weight, height and bmi.
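The code shown in the video is not part of this transcript, so here is a minimal sketch of the select-then-summary pattern. The davis_demo data frame is a small hypothetical stand-in for daviskeep, with made-up values and assumed column names sex, weight, height and bmi.

```r
library(dplyr)

# Hypothetical stand-in for the daviskeep dataset (values are made up)
davis_demo <- data.frame(
  sex    = factor(c("F", "M", "F", "M", "F", "M")),
  weight = c(58, 77, 53, 68, 59, 76),       # assumed to be in kg
  height = c(161, 182, 158, 177, 157, 170)  # assumed to be in cm
)
davis_demo$bmi <- davis_demo$weight / (davis_demo$height / 100)^2

# select() picks the variables, much like a VAR statement in SAS;
# summary() then reports min, quartiles, median, mean and max for each
davis_summary <- davis_demo %>%
  select(weight, height, bmi) %>%
  summary()

print(davis_summary)
```

Each selected variable becomes one column of the printed summary table.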

7. Descriptive statistics with Hmisc

But what if you want more statistics than the summary function gives you? The describe function from the Hmisc package provides many more statistics including frequency tables for the sex variable as well as the amount of missing data, several other percentiles, and the five highest and lowest values.
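As a rough sketch of this step, the describe function from Hmisc can be called on a whole data frame. The davis_demo data frame below is a hypothetical stand-in for daviskeep with made-up values, and the Hmisc package is assumed to be installed.

```r
# Hypothetical stand-in for the daviskeep dataset (values are made up)
davis_demo <- data.frame(
  sex    = factor(c("F", "M", "F", "M", "F", "M")),
  weight = c(58, 77, 53, 68, 59, 76),
  height = c(161, 182, 158, 177, 157, 170)
)
davis_demo$bmi <- davis_demo$weight / (davis_demo$height / 100)^2

# Hmisc::describe handles both the factor (sex) and the numeric variables;
# the package::function form avoids a clash with psych::describe
hmisc_out <- Hmisc::describe(davis_demo)

print(hmisc_out)
```

With a dataset this small, the printed layout will differ from what you see for the full data, but the call is the same.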

8. Descriptive statistics with psych

The psych package also has a useful describe function, but the result is completely different from Hmisc. The psych describe function is intended for numeric data, so the sex variable was left out. The psych describe output includes statistics like the median absolute deviation, labeled mad, plus the range, skewness, kurtosis and standard error. Because both the Hmisc and psych packages have a function called describe, the package double colon function name notation, as in Hmisc::describe and psych::describe, was used to avoid confusion.
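A minimal sketch of the psych version, again using a hypothetical davis_demo data frame with made-up values and assuming the psych package is installed. Only the numeric columns are passed in.

```r
# Hypothetical stand-in for the daviskeep dataset (values are made up)
davis_demo <- data.frame(
  sex    = factor(c("F", "M", "F", "M", "F", "M")),
  weight = c(58, 77, 53, 68, 59, 76),
  height = c(161, 182, 158, 177, 157, 170)
)
davis_demo$bmi <- davis_demo$weight / (davis_demo$height / 100)^2

# Keep only the numeric variables, then call describe with the
# package prefix so it is clear which describe function is meant
psych_out <- psych::describe(davis_demo[, c("weight", "height", "bmi")])

print(psych_out)
```

The result has one row per variable, with columns such as mad, skew, kurtosis and se.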

9. Specific statistic summaries

The summarise function from the dplyr package can be used to get specific custom statistics on one or more variables similar to PROC MEANS in SAS.

10. Specific statistic summaries - one variable

For one variable, such as height, the summarise function can be used. Inside the parentheses, a name has to be supplied for each custom statistic in the output. Besides functions like median, min and max, the quantile function can be used to find specific percentiles, like the 5th and 95th, using probs = 0.05 and probs = 0.95. The dplyr package also provides the n function to get sample sizes.
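One way this summarise call might look is sketched below. The davis_demo data frame is a hypothetical stand-in for daviskeep with made-up values, and the output column names (n_obs, median_ht and so on) are illustrative choices rather than the names used in the video.

```r
library(dplyr)

# Hypothetical stand-in for the daviskeep dataset (values are made up)
davis_demo <- data.frame(
  sex    = factor(c("F", "M", "F", "M", "F", "M")),
  weight = c(58, 77, 53, 68, 59, 76),
  height = c(161, 182, 158, 177, 157, 170)
)

# Each argument is name = expression; the name becomes an output column
height_stats <- davis_demo %>%
  summarise(
    n_obs     = n(),                             # sample size
    median_ht = median(height),
    p05_ht    = quantile(height, probs = 0.05),  # 5th percentile
    p95_ht    = quantile(height, probs = 0.95),  # 95th percentile
    min_ht    = min(height),
    max_ht    = max(height)
  )

print(height_stats)
```

The result is a single row with one column per named statistic.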

11. Specific statistic summaries - one variable

As you can see when you run this code using the summarise function for height, the output includes the sample size, median, 5th and 95th percentiles, minimum and maximum.

12. Specific statistic summaries - multiple variables

To get custom statistics for more than one variable or column, the summarise function can be used with a list of statistical functions after the select step, similar to adding more statistics for PROC MEANS in SAS. The across function, combined with the everything selection helper, is used inside summarise to apply a list of statistical functions to all the variables selected. This code will give you the mean and standard deviation for weight, height and bmi all at the same time.
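A minimal sketch of the across pattern described here, using a hypothetical davis_demo data frame with made-up values in place of daviskeep:

```r
library(dplyr)

# Hypothetical stand-in for the daviskeep dataset (values are made up)
davis_demo <- data.frame(
  sex    = factor(c("F", "M", "F", "M", "F", "M")),
  weight = c(58, 77, 53, 68, 59, 76),
  height = c(161, 182, 158, 177, 157, 170)
)
davis_demo$bmi <- davis_demo$weight / (davis_demo$height / 100)^2

# across(everything(), ...) applies every function in the named list
# to every column that survived the select() step
multi_stats <- davis_demo %>%
  select(weight, height, bmi) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

print(multi_stats)
```

The names in the list (mean and sd) are appended to each variable name to form the output column names.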

13. Specific statistic summaries - multiple variables

When you run this code, the column names in the output combine each variable name with each custom statistic. For example, weight_mean is the mean weight and bmi_sd is the standard deviation for bmi.

14. Summary statistics - by group

Finally, it can be really useful to also get descriptive statistics for one or more variables by a separate grouping variable. Here you are grouping the output by sex. The dplyr group_by function plays a role similar to the CLASS statement in SAS procedures.

15. Summary statistics - by group

By simply adding the group_by function to the code, the output now has two rows, one for each sex. The sex variable was also added to the select step to be sure sex is included in the output, which is a tibble data frame created by the dplyr functions.
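The grouped version might look like the following sketch, which again uses a hypothetical davis_demo data frame with made-up values in place of daviskeep:

```r
library(dplyr)

# Hypothetical stand-in for the daviskeep dataset (values are made up)
davis_demo <- data.frame(
  sex    = factor(c("F", "M", "F", "M", "F", "M")),
  weight = c(58, 77, 53, 68, 59, 76),
  height = c(161, 182, 158, 177, 157, 170)
)
davis_demo$bmi <- davis_demo$weight / (davis_demo$height / 100)^2

# sex is kept in select() so it is available to group_by();
# the grouping column is not summarised by across(everything(), ...)
by_sex <- davis_demo %>%
  select(sex, weight, height, bmi) %>%
  group_by(sex) %>%
  summarise(across(everything(), list(mean = mean, sd = sd)))

print(by_sex)
```

The result is a tibble with one row per level of sex and the same statistic columns as the ungrouped version.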

16. Let's summarise abalones!

Now put these skills to use summarizing the abalones.