Exercise

Filtering your DataFrame

In the previous exercise, you subset the data with the select() operator, which subsets a DataFrame column-wise. What if you want to subset the DataFrame based on a condition (for example, select all rows where the sex is female)? In this exercise, you will filter the rows of the people_df DataFrame into two separate datasets: one where 'sex' is female and one where 'sex' is male. Finally, you'll count the number of rows in each of those datasets.

Remember, you already have a SparkSession spark and a DataFrame people_df available in your workspace.

Instructions

  • Filter the people_df DataFrame to select all rows where sex is female into the people_df_female DataFrame.
  • Filter the people_df DataFrame to select all rows where sex is male into the people_df_male DataFrame.
  • Count the number of rows in the people_df_female and people_df_male DataFrames.