SQL queries for filtering Table
In the previous exercise, you have run a simple SQL query on a DataFrame. There are more sophisticated queries you can construct to obtain the result that you want and use it for downstream analysis such as data visualization and Machine Learning. In this exercise, we will use the temporary table people that you have created previously, filter out the rows where the "sex" is male and female and create two DataFrames.
Please note the "solution" is case sensitive for the SQL commands (For example, it only accepts FROM and not from). The "solution" only accepts "==" and not "=".
Remember, you already have a SparkSession spark and a temporary table people available in your workspace.
Deze oefening maakt deel uit van de cursus
Big Data Fundamentals with PySpark
Oefeninstructies
- Filter the
peopletable to select all rows where sex is female intopeople_female_dfDataFrame. - Filter the
peopletable to select all rows where sex is male intopeople_male_dfDataFrame. - Count the number of rows in both
people_femaleandpeople_maleDataFrames.
Praktische interactieve oefening
Probeer deze oefening eens door deze voorbeeldcode in te vullen.
# Filter the people table to select female sex
people_female_df = spark.____('SELECT * FROM ____ WHERE sex=="____"')
# Filter the people table DataFrame to select male sex
people_male_df = spark.____('SELECT * ____ people ____ ____=="____"')
# Count the number of rows in both people_df_female and people_male_df DataFrames
print("There are {} rows in the people_female_df and {} rows in the people_male_df DataFrames".format(people_female_df.____(), people_male_df.____()))