Data quality and cleaning

1. Data quality and cleaning

Now that you've added new variables to the abalone dataset, you need to investigate data quality by looking for outliers, data entry errors, and illogical values so you can clean up the dataset further before analysis. This video will show you statistical functions and visualizations that help you find errors and illogical data values.

2. Check distributions

For this video, you will continue to work with the modified dataset davismod.

3. Check distributions

A great place to begin any data analysis is with summary statistics for the variables in your dataset. Looking at the minimum and maximum helps you spot values outside the expected numeric range. This code pulls BMI out of the davismod dataset and then runs summary statistics. The resulting output shows a maximum BMI over 500, so at least one person is an outlier. The mean is also higher than the median, which suggests that the BMI distribution is skewed right.
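As a minimal sketch of this step, the code below builds a small stand-in for davismod and summarizes bmi. The data frame is hypothetical: only the record with weight 166 and height 57 comes from the lesson, the other rows are made up for illustration, and bmi is assumed to be weight in kilograms divided by height in meters squared.

```r
library(dplyr)

# Hypothetical stand-in for davismod (weight in kg, height in cm).
# Only the 166 / 57 record is taken from the lesson; the rest is illustrative.
davismod <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 69, 166),
  height = c(182, 161, 161, 177, 157, 170, 167, 57)
) %>%
  mutate(bmi = weight / (height / 100)^2)  # assumed BMI definition

# Pull bmi out as a plain vector, then compute summary statistics
bmi_summary <- davismod %>%
  pull(bmi) %>%
  summary()

bmi_summary
```

In the printed summary, compare Max. against the range you expect and Mean against Median to get a first impression of outliers and skew.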

4. Visualize distributions

You can also spot these outliers graphically by making a histogram or a dotplot. Here is an example of a dotplot created using the geom_dotplot function from the ggplot2 package. The plot clearly shows at least one outlier with a BMI greater than 500.
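A dotplot along these lines might look like the sketch below. It again uses a hypothetical stand-in for davismod, and the binwidth value is an arbitrary choice for this toy data rather than the setting used in the video.

```r
library(ggplot2)

# Hypothetical stand-in for davismod; only the 166 / 57 record is from the lesson
davismod <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 69, 166),
  height = c(182, 161, 161, 177, 157, 170, 167, 57)
)
davismod$bmi <- davismod$weight / (davismod$height / 100)^2

# One dot per person; the outlier sits far to the right of the main cluster
bmi_dotplot <- ggplot(davismod, aes(x = bmi)) +
  geom_dotplot(binwidth = 5)

bmi_dotplot
```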

5. Find the outliers

You can also see a list of the smallest or largest values in a dataset by sorting the variable of interest and then looking at the top or bottom rows using the head or tail function. If you arrange davismod by BMI and look at the bottom 6 rows using the tail function, you can easily see the individual with BMI greater than 500. Notice that the measured weight of 166 and height of 57 appear to have been reversed, possibly due to a data entry error.
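The sort-then-inspect approach can be sketched as follows, assuming dplyr's arrange function does the sorting. The data frame is a hypothetical stand-in for davismod in which only the 166 / 57 record comes from the lesson.

```r
library(dplyr)

# Hypothetical stand-in for davismod; only the 166 / 57 record is from the lesson
davismod <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 69, 166),
  height = c(182, 161, 161, 177, 157, 170, 167, 57)
) %>%
  mutate(bmi = weight / (height / 100)^2)

# Sort ascending by bmi, then keep the bottom 6 rows (the largest values)
largest_bmi <- davismod %>%
  arrange(bmi) %>%
  tail()

largest_bmi
```

Because arrange sorts in ascending order, the suspicious record ends up in the last row of the output, where weight and height can be compared side by side.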

6. Visualize assumption that weight <= height

It is reasonable to expect that a person's weight in kilograms should be smaller than their height in centimeters. You can visualize this assumption directly by making a scatterplot using geom_point from ggplot2. Next, add a reference line for Y=X using the geom_abline function from ggplot2, setting the intercept option to 0 and the slope option to 1. In the resulting plot, you can easily spot the outlier below the line in the lower right corner.
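The scatterplot with its reference line could be sketched like this. Putting weight on the x-axis and height on the y-axis is an assumption, chosen because it places the swapped record below the line in the lower right; the data frame is again a hypothetical stand-in for davismod.

```r
library(ggplot2)

# Hypothetical stand-in for davismod; only the 166 / 57 record is from the lesson
davismod <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 69, 166),
  height = c(182, 161, 161, 177, 157, 170, 167, 57)
)

# Points above the Y = X line satisfy weight < height; points below violate it
weight_height_plot <- ggplot(davismod, aes(x = weight, y = height)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)

weight_height_plot
```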

7. Filter out cases with errors

Now that you've seen several ways to spot the individual whose weight was greater than their height, let's remove that case from the dataset. You can create a new dataset called daviskeep that keeps only the cases where bmi is less than 100 by using the filter function from the dplyr package. Inside the filter function, supply a logical expression that evaluates to TRUE for the cases you want to keep. This removes the one case where bmi was greater than 500, leaving 199 rows.
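A minimal sketch of the filtering step is shown below. It runs on a hypothetical 8-row stand-in for davismod, so the row counts differ from the 199 rows mentioned in the lesson.

```r
library(dplyr)

# Hypothetical stand-in for davismod; only the 166 / 57 record is from the lesson
davismod <- data.frame(
  weight = c(77, 58, 53, 68, 59, 76, 69, 166),
  height = c(182, 161, 161, 177, 157, 170, 167, 57)
) %>%
  mutate(bmi = weight / (height / 100)^2)

# Keep only the rows where the condition is TRUE
daviskeep <- davismod %>%
  filter(bmi < 100)

# Compare row counts before and after to confirm how many cases were dropped
nrow(davismod)
nrow(daviskeep)
```

Note that filter also drops rows where the condition evaluates to NA, so it is worth checking the row counts whenever the filtered variable may contain missing values.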

8. Visualize corrected bmi

Now that the bmi outlier has been removed, this dotplot for BMI in the daviskeep dataset looks much better.

9. Final cleanup of abalone dataset

In the last set of exercises for chapter 2, you will use statistical summaries and visualizations to check the assumptions of the abalone dataset and then remove cases that violate these assumptions. This process will finalize the abalone dataset used in chapters 3 and 4 for your statistical analyses and models. For the abalone dataset, all measurements should be positive. For the shell measurements, length should be the longest dimension, larger than both the height and the diameter. Similarly, for the weight measurements, the wholeWeight should be the largest weight, with all other weights smaller.
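These three assumptions can be expressed as filter conditions, as in the sketch below. The data frame is a hypothetical toy example, and apart from wholeWeight the column names (length, diameter, height, shuckedWeight, visceraWeight, shellWeight) are assumptions about how the abalone dataset is named in the exercises.

```r
library(dplyr)

# Hypothetical toy data; column names other than wholeWeight are assumed
abalone <- data.frame(
  length        = c(0.50, 0.43, 0.45, 0.40, 0.60),
  diameter      = c(0.40, 0.34, 0.35, 0.30, 0.47),
  height        = c(0.12, 0.00, 1.10, 0.10, 0.15),  # row 2: zero, row 3: > length
  wholeWeight   = c(0.60, 0.43, 0.59, 0.40, 1.00),
  shuckedWeight = c(0.25, 0.20, 0.30, 0.50, 0.45),  # row 4: > wholeWeight
  visceraWeight = c(0.12, 0.09, 0.12, 0.08, 0.20),
  shellWeight   = c(0.18, 0.12, 0.13, 0.10, 0.30)
)

# Conditions separated by commas are combined with AND
abaloneKeep <- abalone %>%
  filter(
    # all measurements positive
    length > 0, diameter > 0, height > 0,
    wholeWeight > 0, shuckedWeight > 0, visceraWeight > 0, shellWeight > 0,
    # length is the longest shell dimension
    length > height, length > diameter,
    # wholeWeight is the largest weight
    wholeWeight > shuckedWeight,
    wholeWeight > visceraWeight,
    wholeWeight > shellWeight
  )

nrow(abaloneKeep)
```

Running each group of conditions in a separate filter call first can help show which assumption removes which cases.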

10. Let's explore and clean up the abalone dataset

Let's put your skills to work cleaning up the abalone dataset for your final analyses and models.
