PySpark DataFrame subsetting and cleaning
After inspecting the data, it is often necessary to clean it. Cleaning mainly involves subsetting, renaming columns, removing duplicate rows, and so on. The PySpark DataFrame API provides several operations for this. In this exercise, your job is to subset the 'name', 'sex', and 'date of birth' columns from the people_df DataFrame, remove any duplicate rows from that dataset, and count the number of rows before and after the duplicate removal step.
Remember, you already have a SparkSession spark and a DataFrame people_df available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Select the 'name', 'sex', and 'date of birth' columns from people_df and create a people_df_sub DataFrame.
- Print the first 10 observations in the people_df_sub DataFrame.
- Remove duplicate entries from people_df_sub and create a people_df_sub_nodup DataFrame.
- How many rows are there before and after duplicates are removed?
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Select name, sex and date of birth columns
people_df_sub = people_df.____('name', ____, ____)
# Print the first 10 observations from people_df_sub
people_df_sub.____(____)
# Remove duplicate entries from people_df_sub
people_df_sub_nodup = people_df_sub.____()
# Count the number of rows
print("There were {} rows before removing duplicates, and {} rows after removing duplicates".format(people_df_sub.____(), people_df_sub_nodup.____()))