Part 2: SQL Queries on DataFrame
The fifa_df
DataFrame that we created has additional information about datatypes and names of columns associated with it. This additional information allows PySpark SQL to run SQL queries on DataFrame. SQL queries are concise and easy to run compared to DataFrame operations. But in order to apply SQL queries on DataFrame first, you need to create a temporary view of DataFrame as a table and then apply SQL queries on the created table (Running SQL Queries Programmatically).
In the second part, you'll create a temporary table of fifa_df
DataFrame and run SQL queries to extract the 'Age' column of players from Germany.
You already have a SparkContext spark
and fifa_df
available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Create temporary table
fifa_df_table
fromfifa_df
DataFrame. - Construct a "query" to extract the "Age" column from Germany players in
fifa_df_table
. - Apply the SQL "query" and create a new DataFrame
fifa_df_germany_age
. - Computes basic statistics of the created DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a temporary view of fifa_df
fifa_df.____('fifa_df_table')
# Construct the "query"
query = '''SELECT ____ FROM ____ WHERE Nationality == "Germany"'''
# Apply the SQL "query"
fifa_df_germany_age = spark.____(____)
# Generate basic statistics
fifa_df_germany_age.____().show()