Using Visualizations: distplot
Understanding the distribution of the dependent variable is important because it can influence the choice of model and preprocessing. A good way to do this is to plot it. However, plotting is not built into PySpark, so we need to take some intermediary steps to make it work correctly. In this exercise you will visualize the 'LISTPRICE' variable and gain more insight into its distribution by computing its skewness.
The matplotlib.pyplot and seaborn packages have been imported for you with aliases plt and sns.
This exercise is part of the course
Feature Engineering with PySpark
Instructions
- Sample 50% of the dataframe df with sample(), making sure not to use replacement and setting the random seed to 42.
- Convert the Spark DataFrame to a pandas.DataFrame() with toPandas().
- Plot a distribution plot using seaborn's distplot() function.
- Import the skewness() function from pyspark.sql.functions and compute it on the aggregate of the 'LISTPRICE' column with the agg() method. Remember to collect() your result to evaluate the computation.
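To build intuition for the first instruction, here is a minimal pure-Python sketch of what sampling a fraction without replacement means. It is not Spark's implementation: the helper bernoulli_sample and the stand-in rows are hypothetical. It only illustrates that each row is kept independently with the given probability, so the sample size is close to, but not guaranteed to equal, 50% of the rows.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`.

    Hypothetical helper for illustration only; a row can be kept
    at most once, so the sample has no replacement.
    """
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(1000))  # stand-in for the rows of a DataFrame
sample = bernoulli_sample(rows, 0.5, 42)

print(len(sample))                      # near 500, but not necessarily exactly 500
print(len(sample) == len(set(sample)))  # True: no row appears twice
```

Fixing the seed makes the sample reproducible: calling the helper again with the same arguments returns the same rows.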
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Select a single column and sample and convert to pandas
sample_df = df.select(['LISTPRICE']).____(____, ____, 42)
pandas_df = sample_df.____()
# Plot distribution of pandas_df and display plot
sns.____(____)
plt.show()
# Import skewness function
from pyspark.sql.functions import skewness
# Compute and print skewness of LISTPRICE
print(df.____({____: ____}).collect())
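To help interpret the skewness value, the following sketch computes skewness by hand on made-up numbers. It assumes the population form of the statistic (third central moment divided by the second central moment raised to the power 1.5); check the PySpark documentation for the exact definition skewness() uses. The function name and the price values are hypothetical.

```python
def population_skewness(values):
    """Return m3 / m2**1.5, where m2 and m3 are central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical list prices: mostly modest, with one very expensive home
prices = [100_000, 120_000, 130_000, 150_000, 900_000]
# Perfectly symmetric values for comparison
symmetric = [1, 2, 3, 4, 5]

print(population_skewness(prices))     # positive: long right tail
print(population_skewness(symmetric))  # 0.0
```

A positive value indicates a long right tail, which is common for prices; a value near zero indicates a roughly symmetric distribution.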