
Using Visualizations: distplot

Understanding the distribution of the dependent variable is important because it can affect the choice of model and preprocessing. Plotting is a good way to inspect it, but PySpark has no built-in plotting function, so a few intermediate steps are needed: sample the data, then convert the sample to pandas. In this exercise you will visualize the 'LISTPRICE' variable and gain further insight into its distribution by computing its skewness.

The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.

This exercise is part of the course

Feature Engineering with PySpark


Exercise instructions

  • Sample 50% of the dataframe df with sample(), without replacement and with the random seed set to 42.
  • Convert the sampled Spark DataFrame to a pandas DataFrame with toPandas().
  • Plot the distribution using seaborn's distplot() function.
  • Import the skewness() function from pyspark.sql.functions and compute the skewness of the 'LISTPRICE' column with the agg() method. Remember to collect() the result so that the computation is actually evaluated.
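
To build some intuition for the first step, here is a minimal pure-Python sketch of what sampling a fraction without replacement with a fixed seed means. It assumes Bernoulli-style sampling, where each row is kept independently with the given probability, so the result is roughly (not exactly) 50% of the rows. The helper bernoulli_sample and the prices list are hypothetical stand-ins, not part of PySpark, and Spark's own random number generator will select different rows.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; no row can appear twice,
    which is what "without replacement" means here.
    """
    rng = random.Random(seed)  # fixed seed makes the selection repeatable
    return [row for row in rows if rng.random() < fraction]


# Hypothetical stand-in for the LISTPRICE values
prices = list(range(1000))

sample_a = bernoulli_sample(prices, 0.5, 42)
sample_b = bernoulli_sample(prices, 0.5, 42)

print(len(sample_a))         # close to 500, but not necessarily exactly 500
print(sample_a == sample_b)  # same seed, same rows
```

The practical point is that the seed makes the sample reproducible, while the fraction is a target rather than an exact row count.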

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Select a single column and sample and convert to pandas
sample_df = df.select(['LISTPRICE']).____(____, ____, 42)
pandas_df = sample_df.____()

# Plot distribution of pandas_df and display plot
sns.____(____)
plt.show()

# Import skewness function
from pyspark.sql.functions import skewness

# Compute and print skewness of LISTPRICE
print(df.____({____: ____}).collect())
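
To help interpret the number printed by the last line, the following sketch computes skewness by hand in plain Python. It assumes the population (biased) definition, the third central moment divided by the second central moment raised to the power 1.5, which is my understanding of what the Spark aggregate computes; treat that as an assumption. The function name population_skewness and the sample values are made up for illustration.

```python
def population_skewness(values):
    """Third central moment divided by the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data: one symmetric set, one with a single very high price
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [100.0, 120.0, 130.0, 150.0, 900.0]

print(population_skewness(symmetric))     # zero for symmetric data
print(population_skewness(right_skewed))  # positive: long tail on the right
```

A positive value indicates a long right tail, which is common for price data; a negative value indicates a long left tail.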