1. Distributions and outliers
Now we will dive a bit deeper into these concepts: common visualizations of distributions, and how to use them to identify and address outliers within a dataset.
2. What are distributions?
The distribution of a variable refers to the set of all possible values of the variable and their associated frequencies.
Said another way, it is the number of times each value of the variable occurs within the observations.
3. What are distributions?
For example, among 100 people, how many are 18 years old, 19 years old, 20 years old, and so on.
This would be a distribution for a continuous variable.
4. What are distributions?
But they can also be used with categorical variables.
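As a minimal sketch of this idea, a distribution can be built by counting how often each value occurs. The ages and room types below are made-up illustrative values, not data from the course.

```python
from collections import Counter

# Hypothetical ages for a handful of people (illustrative data only).
ages = [18, 19, 19, 20, 18, 19, 21, 20, 19, 18]

# Counter maps each distinct value to the number of times it occurs.
age_distribution = Counter(ages)
for age in sorted(age_distribution):
    print(age, age_distribution[age])

# The same counting works for a categorical variable (hypothetical labels).
room_types = ["private", "shared", "private", "entire home", "private"]
room_distribution = Counter(room_types)
print(room_distribution.most_common())
```

The frequencies always add up to the number of observations, which is a quick sanity check when building a distribution by hand.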
5. What are histograms?
Histograms are used for visualizing a distribution. They look like bar charts and display the variable values on the x-axis and frequencies, or counts, of observations with those values on the y-axis.
6. What are histograms? - bins
The "bins" of a histogram, represented as the bars, are intervals of possible values. They can determine the detail shown by a histogram.
Here you see one with 100 bins and a second with 20 bins which shows less detail about the same distribution.
There are formulas for calculating the optimal number of bins; it largely depends on the number of observations in your dataset. For datasets with more than 100 observations, 25 bins is a good place to start.
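To make the effect of the bin count concrete, here is a rough sketch that counts observations per equal-width bin, which is the calculation behind a histogram's bars. The `bin_counts` helper and the prices are hypothetical, written only for this illustration.

```python
def bin_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in the first bin.
        return [len(values)] + [0] * (n_bins - 1)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical nightly prices (illustrative data only).
prices = [40, 45, 50, 52, 55, 60, 61, 65, 70, 72, 80, 85, 90, 120, 200]

coarse = bin_counts(prices, 4)   # fewer, wider bins: less detail
fine = bin_counts(prices, 16)    # more, narrower bins: more detail
print(coarse)
print(fine)
```

Both results describe the same observations; only the width of the intervals, and therefore the level of detail, changes.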
7. Reading histograms - centrality and skewness
Histograms carry few explicit labels or annotations, but their shape conveys a lot of information.
First, they give a sense of the distribution's central tendency - median and mean - as well as its skewness.
The first histogram is symmetrical, often referred to as "normal", while the second is right-skewed, a concept covered in the previous video.
In most real-world datasets, including those used in the exercises, distributions are likely to be skewed.
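One rough way to check skewness numerically is to compare the mean with the median: in a right-skewed distribution, the long tail of large values typically pulls the mean above the median. The two small samples below are invented for illustration.

```python
from statistics import mean, median

# Hypothetical samples (illustrative data only).
symmetric = [1, 2, 3, 4, 5, 6, 7]
right_skewed = [1, 2, 2, 3, 3, 4, 30]  # one large value forms the right tail

for name, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(name, "mean:", round(mean(data), 2), "median:", median(data))
```

This comparison is a rule of thumb rather than a formal test, so it is best read alongside the histogram itself.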
8. Reading histograms - spread
How wide the distribution is indicates dispersion of the data around these central points.
If the distribution has a wide spread, then there is a larger variability from the average, i.e. large standard deviation.
If it is narrow, the variability, and therefore the standard deviation, is smaller.
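As a small illustration, the two made-up samples below share the same mean but differ in spread, and the sample standard deviation reflects that difference.

```python
from statistics import mean, stdev

# Hypothetical samples with the same center but different spread.
narrow = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

# stdev() computes the sample standard deviation.
print("narrow:", mean(narrow), round(stdev(narrow), 2))
print("wide:  ", mean(wide), round(stdev(wide), 2))
```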
9. Reading histograms - percentiles
A percentile is the value at or below which a given percentage of observations fall. The median is the 50th percentile, i.e. about 50% of observations are at or below the median value and about 50% are above it.
10. Reading histograms - 25th & 75th percentiles
It is common to look at distributions in terms of quartiles, specifically the 1st, 2nd, and 3rd quartiles or 25th, 50th, and 75th percentiles.
11. Reading histograms - interquartile range
The difference between the 75th and 25th percentile values is the interquartile range, shown as the shaded part of the chart here.
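The quartiles and interquartile range can be sketched with Python's standard library. The values are hypothetical, and `method="inclusive"` is one of several interpolation conventions; other tools may return slightly different quartiles for the same data.

```python
from statistics import median, quantiles

# Hypothetical observations (illustrative data only).
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th, 50th, and 75th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# The interquartile range is the distance between the 75th and 25th percentiles.
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

The second quartile is the median, so `q2` matches `median(data)` here.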
12. What is an outlier?
Outliers are data points lying outside the overall pattern in a distribution. Here, the shaded regions are highlighting a couple of observations which may be outliers.
Not all values flagged as outliers are errors. Some will be genuine values, while others will be simple copying or measurement errors. It is best practice to investigate them further.
13. Finding outliers
There are two common methods for quantitatively finding outliers in a dataset.
The first multiplies the standard deviation by -3 and 3 and adds each result to the mean to get the lower and upper thresholds. This method is best suited for symmetrical, or normal, distributions.
The more robust and common way is to multiply the interquartile range by 1.5. This result is then subtracted from the 25th percentile and added to the 75th percentile to get the lower and upper thresholds.
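Both methods can be sketched in a few lines. The prices below are hypothetical, the standard deviation thresholds are assumed to be centered on the mean, and the quartiles use the `inclusive` interpolation convention, so other tools may give slightly different cutoffs.

```python
from statistics import mean, quantiles, stdev

# Hypothetical nightly prices with one suspiciously large value.
prices = [40, 45, 50, 52, 55, 60, 61, 65, 70, 72, 80, 85, 90, 120, 500]

# Method 1: mean plus or minus 3 standard deviations.
avg, sd = mean(prices), stdev(prices)
sd_lower, sd_upper = avg - 3 * sd, avg + 3 * sd
sd_outliers = [p for p in prices if p < sd_lower or p > sd_upper]

# Method 2: 1.5 times the interquartile range beyond the quartiles.
q1, _, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1
iqr_lower, iqr_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [p for p in prices if p < iqr_lower or p > iqr_upper]

print("SD thresholds: ", round(sd_lower, 1), round(sd_upper, 1), sd_outliers)
print("IQR thresholds:", iqr_lower, iqr_upper, iqr_outliers)
```

In this sample the extreme value inflates the mean and standard deviation, which makes the first set of thresholds much wider than the second. That sensitivity is one reason the interquartile range approach is considered more robust.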
14. Addressing outliers
There are multiple routes to addressing outliers. The first two are similar to handling missing values - removal or imputation.
Winsorizing, similar to imputation, replaces outliers with another number. A common standard is 90% winsorization: if the variable value is below the 5th percentile, change it to the 5th percentile value, and if it is above the 95th percentile, set it to the 95th percentile value.
Again, not all values flagged as outliers are errors. It's important to investigate each one to determine whether it should indeed be addressed.
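A minimal sketch of 90% winsorization, as described above, might look like the following. The `winsorize` helper and the prices are hypothetical, and the percentiles use the `inclusive` interpolation convention, so other implementations may pick slightly different cutoffs.

```python
from statistics import quantiles

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clamp values below/above the given percentiles to those percentile values."""
    # n=100 returns 99 cut points; index k - 1 holds the k-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical nightly prices with one very low and one very high value.
prices = [5, 40, 45, 50, 52, 55, 60, 61, 65, 70, 72,
          75, 80, 85, 90, 95, 100, 110, 120, 130, 500]

winsorized = winsorize(prices)
print(winsorized[0], winsorized[-1])
```

Only the two extreme values change; every observation already between the 5th and 95th percentiles is left as it was, and no rows are dropped.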
15. Let's practice!
Now it's your turn to build distributions and find outliers within the AirBnB dataset.