Using Visualizations: distplot

Understanding the distribution of the dependent variable is important because it can influence both the choice of model and the preprocessing we apply. A good way to inspect it is to plot it. However, plotting is not built into PySpark, so we need a few intermediate steps to make it work. In this exercise you will visualize the 'LISTPRICE' variable and gain further insight into its distribution by computing its skewness.
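
As a rough illustration of what a skewness value conveys, here is a minimal pure-Python sketch. It assumes the population definition of skewness (third central moment divided by the cubed standard deviation); the helper name population_skewness and the sample numbers are made up for this example and are not part of the exercise.

```python
def population_skewness(values):
    """Third central moment divided by the standard deviation cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up data: one symmetric list, one with a long right tail
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 2.0, 3.0, 4.0, 50.0]

sym_skew = population_skewness(symmetric)
right_skew = population_skewness(right_skewed)
print(sym_skew)    # 0.0 for perfectly symmetric data
print(right_skew)  # positive, because of the long right tail
```

A value near zero suggests a roughly symmetric distribution, while a clearly positive value, as is common for prices, points to a long right tail.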

The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.

Sample 50% of the DataFrame df with sample(), without replacement and with the random seed set to 42.
Convert the sampled Spark DataFrame to a pandas DataFrame with toPandas().
Plot the distribution using seaborn's distplot() function.
Import the skewness() function from pyspark.sql.functions and apply it to the 'LISTPRICE' column with the agg() method. Remember to collect() the result to trigger the computation.
