
Box plots and IQR

1. Box plots and IQR

In the last video, we got acquainted with the 5-number summary. In this video, we will build on the concepts using boxplots.

2. Boxplots recap

Boxplots are enhanced, visual versions of the 5-number summary that display additional information about a distribution. A boxplot can tell us about a distribution's location, spread, and skewness and, most importantly for us, indicate the presence of outliers.

3. Boxplot components

The box itself shows where the bulk of the data lies, with the lower and upper edges denoting the 25th and 75th percentiles, also known as Q1 and Q3. The line inside the box is the 50th percentile, commonly known as the median.

4. Whiskers

Boxplots also have a pair of whiskers below and above the box that indicate the lower and upper outlier limits of the distribution. Data points that lie beyond the whiskers are marked as outliers.

5. Interquartile range (IQR)

The lengths of these whiskers depend on the IQR, which stands for interquartile range. IQR is calculated by subtracting the 25th percentile from the 75th percentile. Whisker lengths can then be determined using the IQR, along with a multiplying factor.

6. Calculating whisker lengths

The most popular value for the multiplying factor is 1.5. So, the lower limit (the lowest point the lower whisker can reach) is Q1 minus 1.5 times the IQR, while the upper limit (the highest point the upper whisker can reach) is Q3 plus 1.5 times the IQR. Values beyond these limits are considered outliers, marked individually here as circles.
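As a worked illustration of these formulas, here is the arithmetic for made-up quartiles. The values Q1 = 20 and Q3 = 30 are assumptions chosen for easy numbers and are not taken from the data in the video.

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 30 - 20 = 10 \\
\text{lower limit} &= Q_1 - 1.5 \times \mathrm{IQR} = 20 - 15 = 5 \\
\text{upper limit} &= Q_3 + 1.5 \times \mathrm{IQR} = 30 + 15 = 45
\end{aligned}
```

With these numbers, any value below 5 or above 45 would be drawn as an individual outlier point.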

7. Drawing boxplots

To draw boxplots, we can use matplotlib's boxplot function. This boxplot shows us many more outliers than we saw using a histogram and a scatterplot. Let's see what happens if we change the multiplying factor, which defaults to 1.5 in matplotlib.
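A minimal sketch of drawing such a boxplot might look like the following. The sales array is randomly generated stand-in data, not the dataset used in the video, and the Agg backend is selected only so that the script also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical, right-skewed stand-in for the sales data from the video
rng = np.random.default_rng(seed=0)
sales = rng.lognormal(mean=3.0, sigma=0.6, size=500)

fig, ax = plt.subplots()
result = ax.boxplot(sales)  # whis is left at its default of 1.5
ax.set_ylabel("sales")

# The points drawn individually beyond the whiskers are stored under "fliers"
n_fliers = len(result["fliers"][0].get_ydata())
print(f"Points drawn as outliers: {n_fliers}")
```

The dictionary returned by boxplot gives access to the drawn artists, which is a convenient way to count the displayed outliers instead of reading them off the figure.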

8. Controlling whisker lengths

We can change the factor by using the whis parameter of the boxplot function. Setting whis to 2.5 lengthens the whiskers, reducing the number of displayed outliers.
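To see the effect of the whis parameter, the sketch below draws the same data twice and counts the displayed outliers in each plot. The data and the helper count_fliers are hypothetical and serve only as an illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for the sales data from the video
rng = np.random.default_rng(seed=0)
sales = rng.lognormal(mean=3.0, sigma=0.6, size=500)

fig, (ax_default, ax_wide) = plt.subplots(1, 2, sharey=True)
default_plot = ax_default.boxplot(sales)       # default factor of 1.5
wide_plot = ax_wide.boxplot(sales, whis=2.5)   # wider outlier limits
ax_default.set_title("whis=1.5")
ax_wide.set_title("whis=2.5")


def count_fliers(boxplot_result):
    """Count the points drawn individually beyond the whiskers."""
    return len(boxplot_result["fliers"][0].get_ydata())


n_default = count_fliers(default_plot)
n_wide = count_fliers(wide_plot)
print(f"whis=1.5: {n_default} outliers, whis=2.5: {n_wide} outliers")
```

Because a larger factor moves both limits further from the box, the second count can never exceed the first.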

9. IQR in code

Boxplots display outliers only visually. To isolate them in code, let's implement what's going on under the hood. First, we calculate the first and third quartiles using the quantile function of pandas Series and store them as q1 and q3. Then, we calculate the IQR and use the same multiplying factor of 2.5.
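This step could be sketched as follows. The names q1, q3, IQR, and factor follow the narration, while the sales Series is randomly generated stand-in data rather than the data from the video.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sales Series from the video
rng = np.random.default_rng(seed=0)
sales = pd.Series(rng.lognormal(mean=3.0, sigma=0.6, size=500), name="sales")

# First and third quartiles (25th and 75th percentiles)
q1 = sales.quantile(0.25)
q3 = sales.quantile(0.75)

# Interquartile range and the multiplying factor used for the whiskers
IQR = q3 - q1
factor = 2.5

print(f"q1={q1:.2f}, q3={q3:.2f}, IQR={IQR:.2f}")
```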

10. Finding outliers with IQR

Next, we calculate the lower and upper limits as q1 minus IQR multiplied by factor and q3 plus IQR multiplied by factor, respectively. Finally, we create two boolean masks, is_lower and is_upper, which check whether values in sales are lower or higher than our limits. Then, we combine the masks using the pipe operator to select the outliers. For boolean masks, the pipe operator is the element-wise equivalent of Python's or operator. Using the IQR method, we find 29 outliers.
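Put together, the whole procedure might look like the sketch below. It uses a small, made-up sales Series so the result is easy to check by hand, which means it finds two outliers here rather than the 29 found in the video's data.

```python
import pandas as pd

# Small hypothetical dataset with two obvious extremes (95 and -40)
sales = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 20, 95, -40], name="sales")

q1 = sales.quantile(0.25)   # 12.5
q3 = sales.quantile(0.75)   # 17.5
IQR = q3 - q1               # 5.0
factor = 2.5

# Outlier limits: values beyond these are flagged
lower_limit = q1 - IQR * factor   # 0.0
upper_limit = q3 + IQR * factor   # 30.0

# Boolean masks for values below and above the limits
is_lower = sales < lower_limit
is_upper = sales > upper_limit

# | combines the masks element-wise; plain `or` would raise an error on a Series
outliers = sales[is_lower | is_upper]
print(outliers.tolist())  # [95, -40]
```

To drop the outliers instead of selecting them, the combined mask can be inverted with the tilde operator, as in sales[~(is_lower | is_upper)].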

11. The flexibility of the method

This method of detecting outliers with boxplots provides the end user with flexibility. They can choose custom rules and requirements for what makes an outlier by tweaking the multiplying factor. For example, a person analyzing survey data can set a custom age range that suits the purposes of their survey, and mark any respondent's answer outside this range as an outlier.

12. Let's practice!

Now, let's practice what we have learned.
