Running SQL Queries Programmatically
DataFrames can be easily manipulated using SQL queries in PySpark. The sql() function on a SparkSession enables applications to run SQL queries programmatically and returns the result as another DataFrame. In this exercise, you'll create a temporary table from the DataFrame you created previously, then construct a query to select the names of the people from the temporary table and assign the result to a new DataFrame.
Remember, you already have a SparkSession spark and a DataFrame people_df available in your workspace.
This exercise is part of the course Big Data Fundamentals with PySpark.
Exercise instructions
- Create a temporary table people.
- Construct a query to select the names of the people from the temporary table people.
- Assign the result of Spark's query to a new DataFrame, people_df_names.
- Print the top 10 names of the people from the people_df_names DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create a temporary table "people"
people_df.____("people")
# Construct a query to select the names of the people from the temporary table "people"
query = '''SELECT name FROM ____'''
# Assign the result of Spark's query to people_df_names
people_df_names = spark.sql(____)
# Print the top 10 names of the people
people_df_names.____(____)