PySpark DataFrame subsetting and cleaning
After inspecting the data, it is often necessary to clean it. Cleaning mainly involves subsetting, renaming columns, removing duplicate rows, and so on. The PySpark DataFrame API provides several operations for this. In this exercise, your job is to subset the 'name', 'sex', and 'date of birth' columns from the people_df DataFrame, remove any duplicate rows from that dataset, and count the number of rows before and after the duplicate removal step.
Remember, you already have a SparkSession spark and a DataFrame people_df available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Select the 'name', 'sex', and 'date of birth' columns from people_df and create a people_df_sub DataFrame.
- Print the first 10 observations in the people_df_sub DataFrame.
- Remove duplicate entries from people_df_sub and create a people_df_sub_nodup DataFrame.
- How many rows are there before and after duplicates are removed?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Select name, sex and date of birth columns
people_df_sub = people_df.____('name', ____, ____)
# Print the first 10 observations from people_df_sub
people_df_sub.____(____)
# Remove duplicate entries from people_df_sub
people_df_sub_nodup = people_df_sub.____()
# Count the number of rows
print("There were {} rows before removing duplicates, and {} rows after removing duplicates".format(people_df_sub.____(), people_df_sub_nodup.____()))