Using Visualizations: distplot
Understanding the distribution of our dependent variable is very important and can impact the type of model or preprocessing we do. A great way to do this is to plot it. However, plotting is not a built-in function in PySpark, so we will need to take some intermediary steps to make sure it works correctly. In this exercise you will visualize the 'LISTPRICE' variable, and you will gain more insight into its distribution by computing the skewness.
The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.
This exercise is part of the course
Feature Engineering with PySpark
Exercise instructions
- Sample 50% of the dataframe df with sample(), making sure to not use replacement and setting the random seed to 42.
- Convert the Spark DataFrame to a pandas.DataFrame() with toPandas().
- Plot a distribution plot using seaborn's distplot() method.
- Import the skewness() function from pyspark.sql.functions and compute it on the aggregate of the 'LISTPRICE' column with the agg() method. Remember to collect() your result to evaluate the computation.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Select a single column and sample and convert to pandas
sample_df = df.select(['LISTPRICE']).____(____, ____, 42)
pandas_df = sample_df.____()
# Plot distribution of pandas_df and display plot
sns.____(____)
plt.show()
# Import skewness function
from pyspark.sql.functions import skewness
# Compute and print skewness of LISTPRICE
print(df.____({____: ____}).collect())
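To help interpret the number printed in the last step, the sketch below computes skewness by hand on two made-up price lists. It assumes the population definition (third central moment divided by the second central moment raised to the power 1.5); whether a particular library applies a bias correction is worth checking in its documentation. The function name and the values are hypothetical.

```python
def population_skewness(values):
    """Population skewness: m3 / m2**1.5, using central moments."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        return float("nan")  # all values identical: skewness is undefined
    return m3 / m2 ** 1.5


# Made-up list prices, for illustration only
symmetric = [100, 200, 300, 400, 500]
right_skewed = [100, 120, 130, 150, 2000]

print(population_skewness(symmetric))     # zero for a symmetric list
print(population_skewness(right_skewed))  # positive: long right tail
```

A positive value indicates a long right tail, which is common for price data; a value near zero indicates a roughly symmetric distribution.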