1. Measures of center
Now let's discuss summary statistics, starting with measures of center.
2. Why are measures of center useful?
So, why are measures of center useful?
We might be asked at work what the average number of sales orders per month is, hear about the typical cost of a house, or wonder what the most common hair color is.
Terms like average, most common, and typical value are all examples of how measures of center are expressed in day-to-day life!
3. Crime data
We will use the crime dataset introduced in the previous video to review measures of center. Each row contains a London Borough and the count values for each type of crime over the past two years.
Here is a preview of the first five rows.
4. Crime data
We can see there were 5067 burglaries in the London Borough of Barnet.
5. Histograms
Let's explore another way to visualize numeric data. A histogram takes data points and separates them into bins, or ranges of values.
Here is a histogram of vehicle offenses with eight bins, each with a separate height corresponding to the number of London boroughs with a vehicle crime count that fits inside the respective bin.
The peak in the middle shows nine London Boroughs had between 6000 and 7300 vehicle crimes in the past two years.
Histograms are a great way to summarize numeric data, but we can also use descriptive statistics.
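To make the idea of bins concrete, here is a minimal sketch of how a histogram counts values per bin. The helper name bin_counts and the vehicle crime counts are hypothetical, made up for illustration, and are not taken from the crime dataset.

```python
def bin_counts(values, n_bins):
    """Split the range of values into n_bins equal-width bins and count the values in each.

    Assumes the values are not all identical (otherwise the bin width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The largest value is placed in the last bin rather than in a bin of its own.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical vehicle crime counts for eight boroughs (illustrative only).
vehicle_crimes = [1000, 1500, 2200, 2500, 2700, 3100, 3900, 5000]
edges, counts = bin_counts(vehicle_crimes, 4)
print(edges)   # bin boundaries
print(counts)  # number of boroughs falling in each bin
```

Each entry in counts is the height of one bar. In practice a plotting library would usually do this binning for us.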
6. What is the typical amount of vehicle crime in London?
We could summarize the data by finding the typical amount of vehicle crime in a London Borough. To answer this, we need to determine the typical, or central, value of the data.
Unfortunately, this can be hard to read off a visualization such as our histogram.
We'll discuss three ways to calculate the center: mean, median, and mode.
7. Measures of center: mean
The mean, often called the average, is one of the most common ways of describing the center of the data.
We calculate the mean by adding up all values and dividing by the number of values.
For example, to calculate the mean number of burglaries per London Borough we add up all values and divide by the number of boroughs, which is 32. This gives us approximately 3463 burglaries.
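The calculation just described can be sketched in a few lines of Python. The burglary counts below are hypothetical values for six boroughs, chosen to keep the arithmetic easy to follow. They are not the real figures for the 32 boroughs.

```python
def mean(values):
    """Add up all values and divide by the number of values."""
    return sum(values) / len(values)


# Hypothetical burglary counts for six boroughs (illustrative only).
burglaries = [2000, 2500, 3000, 3500, 4000, 6000]
print(mean(burglaries))  # 21000 / 6 = 3500.0
```

Python's built-in statistics module also provides a mean function that gives the same result.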
8. Measures of center: mean
Here we can see the mean value for each type of crime and overall, with theft having the largest mean.
9. Measures of center: median
Another measure of center is the median. This is the middle value for our data.
Therefore, if we sort our data from smallest to largest, as shown here for burglaries per London Borough, 50% of values should be lower than the median, and 50% should be higher.
10. Measures of center: median
There are 32 boroughs, so we have an even number of values and no single middle value. Instead, we take the two values closest to the middle.
11. Measures of center: median
We add the two values then divide by two.
The result is 3416.5, slightly lower than the mean.
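Here is a minimal sketch of the median steps: sort the values, then either take the middle value or average the two values closest to the middle. The burglary counts are hypothetical and only illustrate the even-count case.

```python
def median(values):
    """Return the middle value of the sorted data.

    With an even number of values, average the two values closest to the middle.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical burglary counts for six boroughs (illustrative only).
burglaries = [2000, 2500, 3000, 3500, 4000, 6000]
print(median(burglaries))  # (3000 + 3500) / 2 = 3250.0
```

With these made-up values the median, 3250, is lower than the mean, 3500, because the single large value of 6000 raises the mean but not the median.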
12. Measures of center: mode
The third measure of center is the mode, or most frequent value.
If we count occurrences of each crime across all London Boroughs, we can see the most frequent value is theft.
If we are looking for the typical value of categorical data, the mode is generally the most suitable measure, since categories may not have an inherent numerical representation.
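Counting occurrences to find the mode might look like the sketch below. The list of crime types is hypothetical, standing in for one recorded crime per entry. It is not drawn from the crime dataset.

```python
from collections import Counter

# Hypothetical list of recorded crime types (illustrative only).
crime_types = ["theft", "burglary", "theft", "robbery", "theft", "burglary"]

# Count how often each category occurs, then take the most frequent one.
type_counts = Counter(crime_types)
mode_value, mode_count = type_counts.most_common(1)[0]
print(mode_value, mode_count)  # theft 3
```

Notice that no arithmetic on the categories themselves is needed, which is why the mode works for categorical data where the mean and median do not.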
13. Which measure to use?
Which measure to use depends on the situation.
Take this histogram of vehicle offenses for example. The shape of the plot is fairly symmetrical, with counts peaking in the middle and getting lower towards each side.
When data is symmetrical, the mean and median both work well. Notice how they overlap on the plot.
14. Which measure to use?
Compare this to robberies, where the data is not symmetrical: it piles up on the left and tails off to the right, with one borough having a high number of robberies. When one value is substantially different from the others, we call it an outlier. This outlier pulls the mean towards it, while the median is less affected.
This is because the mean calculation involves adding up all values, so larger values affect the result, whereas the median just looks at the middle value.
Therefore, when data is not symmetrical it is best to use the median to describe the data's typical value.
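To see how an outlier pulls the mean but barely moves the median, here is a small sketch using Python's statistics module. The robbery counts are hypothetical, with one deliberately extreme value added.

```python
import statistics

# Hypothetical robbery counts for five boroughs (illustrative only).
robberies = [300, 350, 400, 450, 500]

# The same data plus one borough with a much higher count: an outlier.
with_outlier = robberies + [3000]

mean_before = statistics.mean(robberies)       # 400
median_before = statistics.median(robberies)   # 400

mean_after = statistics.mean(with_outlier)     # 5000 / 6, roughly 833.3
median_after = statistics.median(with_outlier) # (400 + 450) / 2 = 425.0

print(mean_before, median_before)
print(mean_after, median_after)
```

In this made-up example the single outlier more than doubles the mean, while the median only moves from 400 to 425.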
15. Let's practice!
Now let's center ourselves with some exercises!