Filtering numeric fields conditionally
Again, understanding the context of your data is extremely important. We want to understand the normal range of prices that houses sell for. Let's make sure we exclude any outlier homes that have sold for significantly more or less than the average. Here we will calculate the mean and standard deviation and use them to filter the near-normal field log_SalesClosePrice.
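To make the three-standard-deviation rule concrete, here is a minimal plain-Python sketch of the same arithmetic, independent of Spark. The list log_prices and its values are hypothetical illustration data, not taken from the course dataset. The sketch uses statistics.stdev, which is the sample standard deviation; Spark's stddev() is, to my knowledge, also the sample version, but treat that correspondence as an assumption.

```python
import statistics

# Hypothetical log sale prices; the last value is an artificial outlier.
log_prices = [
    12.1, 12.3, 12.0, 12.4, 12.2, 11.9, 12.5, 12.3, 12.1, 12.2,
    12.0, 12.4, 12.3, 12.1, 12.2, 11.8, 12.6, 12.2, 12.3, 12.1,
    18.0,
]

# Mean and sample standard deviation of the whole list, outlier included.
mean_val = statistics.mean(log_prices)
stddev_val = statistics.stdev(log_prices)

# Lower and upper bounds at three standard deviations from the mean.
low_bound = mean_val - (3 * stddev_val)
hi_bound = mean_val + (3 * stddev_val)

# Keep only values strictly between the two bounds.
filtered = [x for x in log_prices if low_bound < x < hi_bound]

print(f"bounds: ({low_bound:.2f}, {hi_bound:.2f})")
print(f"kept {len(filtered)} of {len(log_prices)} values")
```

Note that the outlier itself inflates both the mean and the standard deviation, so on a very small sample an extreme value can fall inside the bounds and escape the filter. The rule is more dependable on larger datasets.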
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Import mean() and stddev() from pyspark.sql.functions.
- Use agg() to calculate the mean and standard deviation for 'log_SalesClosePrice' with the imported functions.
- Create the upper and lower bounds by taking mean_val +/- 3 times stddev_val.
- Create a where() filter for 'log_SalesClosePrice' using both low_bound and hi_bound.
Hands-on interactive exercise
Try this exercise by completing this sample code.
from ____ import ____, ____
# Calculate values used for outlier filtering
mean_val = df.____({____: ____}).collect()[0][0]
stddev_val = df.____({____: ____}).collect()[0][0]
# Create three standard deviation (μ ± 3σ) lower and upper bounds for data
low_bound = ____ - (3 * ____)
hi_bound = ____ + (3 * ____)
# Filter the data to fit between the lower and upper bounds
df = df.____((df[____] < ____) ____ (df[____] > ____))