Statistical outlier removal

Removing the top N% of your dataset is useful for making sure that very spurious points are excluded, but it has the disadvantage of always removing the same proportion of points, even when the data is correct. A common alternative is to remove data that lies more than three standard deviations from the mean. You can implement this by first calculating the mean and standard deviation of the relevant column to find the upper and lower bounds, and then applying those bounds as a mask on the DataFrame. This method ensures that only data that is genuinely different from the rest is removed, and it removes fewer points when the data is tightly clustered.
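The contrast between the two approaches can be sketched on synthetic data. Everything below is an assumption made for illustration: the salary values are randomly generated, and the helper names `removed_by_percent` and `removed_by_std` are hypothetical, not part of the course material.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical clean data: 1,000 salaries drawn from a normal distribution.
clean = pd.Series(rng.normal(loc=60_000, scale=5_000, size=1_000))

# The same data with five implausibly large values appended.
spurious = pd.Series([1e6, 2e6, 3e6, 4e6, 5e6])
dirty = pd.concat([clean, spurious], ignore_index=True)


def removed_by_percent(s, pct=0.05):
    """Count the points dropped when cutting everything above the (1 - pct) quantile."""
    return int((s > s.quantile(1 - pct)).sum())


def removed_by_std(s, n_std=3):
    """Count the points lying n_std or more standard deviations from the mean."""
    mean, std = s.mean(), s.std()
    lower, upper = mean - n_std * std, mean + n_std * std
    return int(((s <= lower) | (s >= upper)).sum())


for name, s in [("clean", clean), ("dirty", dirty)]:
    print(name, removed_by_percent(s), removed_by_std(s))
```

On the clean series the percentage rule still discards about 5% of perfectly valid points, while the standard deviation rule discards far fewer. On the contaminated series the standard deviation rule removes the five appended values.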

This exercise is part of the course

Feature Engineering for Machine Learning in Python

Exercise instructions

Calculate the standard deviation and the mean of the ConvertedSalary column of so_numeric_df.
Calculate the upper and lower bounds as three standard deviations away from the mean in both directions.
Trim the so_numeric_df DataFrame to keep all rows where ConvertedSalary is within the lower and upper bounds.

Hands-on interactive exercise

Try this exercise by completing this sample code.

import matplotlib.pyplot as plt  # needed for plt.show() below

# Find the mean and standard deviation
std = so_numeric_df['ConvertedSalary'].____
mean = so_numeric_df['ConvertedSalary'].____

# Calculate the cutoff
cut_off = std * 3
lower, upper = mean - cut_off, ____

# Trim the outliers
trimmed_df = so_numeric_df[(so_numeric_df['ConvertedSalary'] < ____)
                           & (so_numeric_df['ConvertedSalary'] > ____)]

# The trimmed box plot
trimmed_df[['ConvertedSalary']].boxplot()
plt.show()
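Once the blanks are filled in, the same steps can be wrapped in a reusable helper. The following is a minimal sketch, not the course's reference solution: the function name `trim_by_std` and the small `example_df` frame are hypothetical stand-ins, since `so_numeric_df` itself is only available in the exercise environment.

```python
import pandas as pd


def trim_by_std(df, column, n_std=3):
    """Keep the rows of df whose `column` value lies strictly within
    n_std standard deviations of the column mean."""
    mean = df[column].mean()
    std = df[column].std()

    # Bounds sit n_std standard deviations either side of the mean.
    cut_off = std * n_std
    lower, upper = mean - cut_off, mean + cut_off

    # Boolean mask: True where the value falls between the bounds.
    mask = (df[column] > lower) & (df[column] < upper)
    return df[mask]


# Hypothetical stand-in for so_numeric_df: 40 ordinary salaries and one extreme value.
example_df = pd.DataFrame(
    {"ConvertedSalary": [50_000.0] * 20 + [55_000.0] * 20 + [5_000_000.0]}
)

trimmed_df = trim_by_std(example_df, "ConvertedSalary")
print(len(example_df), len(trimmed_df))
```

Because comparisons against missing values evaluate to False in pandas, rows where the column is NaN are also dropped by this mask. If those rows should be kept, the mask would need an extra `| df[column].isna()` term.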


In this chapter, you will explore what feature engineering is and how to get started with applying it to real-world data. You will load, explore and visualize a survey response dataset, and in doing so you will learn about its underlying data types and why they have an influence on how you should engineer your features. Using the pandas package you will create new features from both categorical and continuous columns.

Exercise 1: Why generate features?
Exercise 2: Getting to know your data
Exercise 3: Selecting specific data types
Exercise 4: Dealing with categorical features
Exercise 5: One-hot encoding and dummy variables
Exercise 6: Dealing with uncommon categories
Exercise 7: Numeric variables
Exercise 8: Binarizing columns
Exercise 9: Binning values

This chapter introduces you to the reality of messy and incomplete data. You will learn how to find where your data has missing values and explore multiple approaches on how to deal with them. You will also use string manipulation techniques to deal with unwanted characters in your dataset.

Exercise 1: Why do missing values exist?
Exercise 2: How sparse is my data?
Exercise 3: Finding the missing values
Exercise 4: Dealing with missing values (I)
Exercise 5: Listwise deletion
Exercise 6: Replacing missing values with constants
Exercise 7: Dealing with missing values (II)
Exercise 8: Filling continuous missing values
Exercise 9: Imputing values in predictive models
Exercise 10: Dealing with other data issues
Exercise 11: Dealing with stray characters (I)
Exercise 12: Dealing with stray characters (II)
Exercise 13: Method chaining

In this chapter, you will focus on analyzing the underlying distribution of your data and whether it will impact your machine learning pipeline. You will learn how to deal with skewed data and situations where outliers may be negatively impacting your analysis.

Exercise 1: Data distributions
Exercise 2: How does your data look? (I)
Exercise 3: How does your data look? (II)
Exercise 4: When don't you need to transform your data?
Exercise 5: Scaling and transformations
Exercise 6: Normalization
Exercise 7: Standardization
Exercise 8: Log transformation
Exercise 9: When can you use normalization?
Exercise 10: Removing outliers
Exercise 11: Percentage-based outlier removal
Exercise 12: Statistical outlier removal (current exercise)
Exercise 13: Scaling and transforming new data
Exercise 14: Train and test transformations (I)
Exercise 15: Train and test transformations (II)

Finally, in this chapter, you will work with unstructured text data, understanding ways in which you can engineer columnar features out of a text corpus. You will compare how different approaches may impact how much context is being extracted from a text, and how to balance the need for context, without too many features being created.

Exercise 1: Encoding text
Exercise 2: Cleaning up your text
Exercise 3: High level text features
Exercise 4: Word counts
Exercise 5: Counting words (I)
Exercise 6: Counting words (II)
Exercise 7: Limiting your features
Exercise 8: Text to DataFrame
Exercise 9: Term frequency-inverse document frequency
Exercise 10: Tf-idf
Exercise 11: Inspecting Tf-idf values
Exercise 12: Transforming unseen data
Exercise 13: N-grams
Exercise 14: Using longer n-grams
Exercise 15: Finding the most common words
Exercise 16: Wrap-up