Running SQL on DataFrames
DataFrames can be easily manipulated using SQL queries in PySpark. The .sql() method of a SparkSession lets applications run SQL queries programmatically and returns the result as another DataFrame. In this exercise, you'll create a temporary table from a DataFrame that you created previously, then construct a query to select the names of the people from the temporary table and assign the result to a new DataFrame.
Remember, you already have a SparkSession spark and a DataFrame df available in your workspace.
This exercise is part of the course
Introduction to PySpark
Instructions
- Create a temporary table named "people" from the df DataFrame.
- Construct a query to select the names of the people from the temporary table people.
- Assign the result of Spark's query to a new DataFrame called people_df_names.
- Print the top 10 names of the people from the people_df_names DataFrame.
Hands-on interactive exercise
Try this exercise by completing this sample code.
# Create a temporary table "people"
df.____("people")
# Select the names from the temporary table people
query = """SELECT name FROM ____"""
# Assign the result of Spark's query to people_df_names
people_df_names = spark.sql(____)
# Print the top 10 names of the people
people_df_names.____(____)