Exercise 1. Proportions
Histograms and density plots provide excellent summaries of a distribution. But can we summarize even further? We often see the average and standard deviation used as summary statistics: a two-number summary! To understand what these summaries are and why they are so widely used, we need to understand the normal distribution.
The normal distribution, also known as the bell curve and as the Gaussian distribution, is one of the most famous mathematical concepts in history. One reason is that approximately normal distributions occur in many situations. Examples include gambling winnings, heights, weights, blood pressure, standardized test scores, and experimental measurement errors. Data visualization is often needed to confirm that our data approximately follow a normal distribution.
Here we focus on how the normal distribution helps us summarize data and can be useful in practice.
One way the normal distribution is useful is that it can be used to approximate the distribution of a list of numbers without having access to the entire list. We will demonstrate this with the heights dataset.
Load the heights dataset and create a vector x with just the male heights:
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]
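The idea of approximating a list of numbers without access to the whole list can be sketched as follows. This is only an illustration under the assumption that the male heights are roughly normal; the names avg, stdev and approx_prop are made up for this sketch and are not part of the exercise.

```r
library(dslabs)
data(heights)
x <- heights$height[heights$sex == "Male"]

# Two-number summary of the list: average and standard deviation
avg <- mean(x)
stdev <- sd(x)

# Normal approximation to the proportion of heights in (69, 72],
# computed from avg and stdev alone (assumes x is roughly normal)
approx_prop <- pnorm(72, mean = avg, sd = stdev) - pnorm(69, mean = avg, sd = stdev)
approx_prop
```

Only avg and stdev enter the calculation, which is the sense in which the normal distribution lets us work without the entire list. How close the result is to the true proportion depends on how well the normal assumption holds.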
This exercise is part of the course Data Science Visualization - Module 2.
Exercise instructions
- What proportion of the data is between 69 and 72 inches (taller than 69 but shorter than or equal to 72)? A proportion is between 0 and 1.
- Use the mean function in your code. Remember that you can use mean to compute the proportion of entries of a logical vector that are TRUE.
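The hint about mean can be illustrated on a small made-up vector. The values in toy are hypothetical and are not taken from the heights dataset; the names toy, in_range and prop exist only for this sketch.

```r
# Hypothetical toy heights in inches (not from the heights dataset)
toy <- c(68, 70, 71.5, 72, 74)

# Logical vector: TRUE where a height is above 69 and at most 72
in_range <- toy > 69 & toy <= 72

# mean() treats TRUE as 1 and FALSE as 0, so the mean of a logical
# vector is the proportion of its entries that are TRUE
prop <- mean(in_range)
prop  # 0.6, since 3 of the 5 toy values fall in (69, 72]
```

The same comparison applied to x instead of toy should give the proportion the exercise asks for.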