
Statistical outlier removal

Removing the top N% of your data is useful for ensuring that very spurious points are dropped, but it has the disadvantage of always removing the same proportion of points, even when all of the data is valid. A commonly used alternative is to remove data that lies more than three standard deviations from the mean. You can implement this by first calculating the mean and standard deviation of the relevant column, using them to derive lower and upper bounds, and then applying these bounds as a mask to the DataFrame. This method removes only points that differ markedly from the rest of the data, and it removes fewer points when the data is tightly clustered.
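To make the contrast concrete, the sketch below compares the two approaches on synthetic data. The `salary` column, the sample size, and the 5% cutoff are assumptions chosen for illustration and are not part of the exercise data. On well-behaved data the percentile rule still discards its fixed share of rows, while the three-standard-deviation rule discards far fewer.

```python
import numpy as np
import pandas as pd

# Hypothetical, well-behaved data: 1,000 normally distributed salaries
rng = np.random.default_rng(0)
df = pd.DataFrame({"salary": rng.normal(loc=50_000, scale=5_000, size=1_000)})

# Percentile approach: always drops the top 5%, whether or not the rows are outliers
q95 = df["salary"].quantile(0.95)
removed_by_quantile = int((df["salary"] >= q95).sum())

# Statistical approach: drops only rows more than 3 standard deviations from the mean
mean = df["salary"].mean()
std = df["salary"].std()
lower, upper = mean - 3 * std, mean + 3 * std
removed_by_sigma = int(((df["salary"] < lower) | (df["salary"] > upper)).sum())

print(f"Removed by 95th percentile rule: {removed_by_quantile}")
print(f"Removed by 3-sigma rule:         {removed_by_sigma}")
```

Note that `Series.std()` in pandas computes the sample standard deviation (`ddof=1`) by default, which is what this sketch relies on.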

This exercise is part of the course

Feature Engineering for Machine Learning in Python


Exercise instructions

  • Calculate the standard deviation and mean of the ConvertedSalary column of so_numeric_df.
  • Calculate the lower and upper bounds as three standard deviations away from the mean in both directions.
  • Trim the so_numeric_df DataFrame to retain all rows where ConvertedSalary is within the lower and upper bounds.
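The three steps above can also be packaged as a small reusable function. The following is a minimal sketch under stated assumptions: `trim_outliers`, `toy_df`, and the salary values are hypothetical names and data invented for illustration, not part of the course material.

```python
import pandas as pd


def trim_outliers(df, column, n_std=3):
    """Return the rows of df whose `column` value lies strictly within
    n_std standard deviations of the column mean (hypothetical helper)."""
    mean = df[column].mean()
    std = df[column].std()
    cut_off = n_std * std
    lower, upper = mean - cut_off, mean + cut_off
    # Boolean mask: keep only values strictly between the two bounds
    mask = (df[column] > lower) & (df[column] < upper)
    return df[mask]


# Hypothetical data: 20 ordinary salaries plus one extreme value
salaries = list(range(40_000, 60_000, 1_000)) + [2_000_000]
toy_df = pd.DataFrame({"ConvertedSalary": salaries})

trimmed_toy_df = trim_outliers(toy_df, "ConvertedSalary")
print(len(toy_df), len(trimmed_toy_df))
```

One design point worth keeping in mind: the mean and standard deviation are themselves pulled toward extreme values, so in very small samples a single outlier may inflate the bounds enough to survive the trim. The toy data uses 21 rows so that the extreme value falls outside the upper bound.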

Interactive exercise

Complete the sample code to finish this exercise successfully.

# Find the mean and standard deviation
std = so_numeric_df['ConvertedSalary'].____
mean = so_numeric_df['ConvertedSalary'].____

# Calculate the cutoff
cut_off = std * 3
lower, upper = mean - cut_off, ____

# Trim the outliers
trimmed_df = so_numeric_df[(so_numeric_df['ConvertedSalary'] < ____)
                           & (so_numeric_df['ConvertedSalary'] > ____)]

# The trimmed box plot
trimmed_df[['ConvertedSalary']].boxplot()
plt.show()