Filtering your DataFrame
In the previous exercise, you subset the data using the select() operator, which is mainly used to subset a DataFrame column-wise. What if you want to subset the DataFrame based on a condition (for example, select all rows where the sex is female)? In this exercise, you will filter the rows of the people_df DataFrame twice, once where 'sex' is female and once where it is male, to create two separate datasets. Finally, you'll count the number of rows in each of those datasets.
Remember, you already have a SparkSession spark and a DataFrame people_df available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Filter the people_df DataFrame to select all rows where sex is female into the people_df_female DataFrame.
- Filter the people_df DataFrame to select all rows where sex is male into the people_df_male DataFrame.
- Count the number of rows in the people_df_female and people_df_male DataFrames.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Filter people_df to select females
people_df_female = people_df.____(people_df.____ == "female")
# Filter people_df to select males
people_df_male = people_df.____(____ == "____")
# Count the number of rows
print("There are {} rows in the people_df_female DataFrame and {} rows in the people_df_male DataFrame".format(people_df_female.____(), people_df_male.____()))