Using Visualizations: distplot

Understanding the distribution of the dependent variable is important because it can influence the choice of model and preprocessing. A good way to do this is to plot it. However, plotting is not built into PySpark, so a few intermediate steps are needed to make it work. In this exercise you will visualize the 'LISTPRICE' variable and gain more insight into its distribution by computing its skewness.

The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.
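
As a rough illustration of what skewness measures, the sketch below computes the standard moment-based (population) skewness in plain Python. The prices are made-up values, and whether PySpark's skewness() uses exactly this variant of the formula is an assumption here, not something stated in the exercise.

```python
def skewness(values):
    """Population (moment-based) skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments, each divided by n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical list prices: mostly modest, with one expensive outlier
prices = [100_000, 120_000, 130_000, 150_000, 900_000]
symmetric = [1, 2, 3, 4, 5]

print(skewness(symmetric))  # symmetric data: no skew
print(skewness(prices))     # long right tail: positive skew
```

A positive value indicates a long right tail, which is common for price data and is one reason to look at the distribution before modeling.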

This exercise is part of the course

Feature Engineering with PySpark

Instructions

Sample 50% of the DataFrame df with sample(), without replacement, and set the random seed to 42.
Convert the Spark DataFrame to a pandas DataFrame with toPandas().
Plot a distribution plot using seaborn's distplot() function.
Import the skewness() function from pyspark.sql.functions and compute it for the 'LISTPRICE' column with the agg() method. Remember to collect() the result to trigger the computation.
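
To make the first step more concrete, here is a minimal pure-Python sketch of seeded sampling without replacement, where each row is kept independently with a given probability. It is not Spark's implementation; bernoulli_sample and the stand-in rows are hypothetical names used only for illustration. It shows why the sample size is only approximately the requested fraction and why fixing the seed makes the result reproducible.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))  # stand-in for 1000 DataFrame rows

first = bernoulli_sample(rows, 0.5, 42)
second = bernoulli_sample(rows, 0.5, 42)

print(first == second)                # same seed, same sample
print(len(first))                     # near 500, not necessarily exactly 500
print(len(set(first)) == len(first))  # no row appears twice
```

Sampling before converting to pandas keeps the amount of data pulled onto a single machine smaller, which matters because the pandas DataFrame lives entirely in local memory.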

Hands-on interactive exercise

Try this exercise by completing this sample code.

# Select a single column and sample and convert to pandas
sample_df = df.select(['LISTPRICE']).____(____, ____, 42)
pandas_df = sample_df.____()

# Plot distribution of pandas_df and display plot
sns.____(____)
plt.show()

# Import skewness function
from pyspark.sql.functions import skewness

# Compute and print skewness of LISTPRICE
print(df.____({____: ____}).collect())
