Selecting II
Similar to SQL, you can also use the .select()
method to perform column-wise operations. When you're selecting a column using the df.colName
notation, you can perform any column operation and the .select()
method will return the transformed column. For example,
flights.select(flights.air_time/60)
returns a column of flight durations in hours instead of minutes. You can also use the .alias()
method to rename a column you're selecting. So if you wanted to .select()
the column duration_hrs
(which isn't in your DataFrame) you could do
flights.select((flights.air_time/60).alias("duration_hrs"))
The equivalent Spark DataFrame method .selectExpr()
takes SQL expressions as a string:
flights.selectExpr("air_time/60 as duration_hrs")
with the SQL as
keyword being equivalent to the .alias()
method. To select multiple columns, you can pass multiple strings.
Remember, a SparkSession
called spark
is already in your workspace, along with the Spark DataFrame flights
.
This exercise is part of the course
Foundations of PySpark
Exercise instructions
Create a table of the average speed of each flight both ways.
- Calculate average speed by dividing the
distance
by theair_time
(converted to hours). Use the.alias()
method name this column"avg_speed"
. Save the output as the variableavg_speed
. - Select the columns
"origin", "dest", "tailnum"
, andavg_speed
(without quotes!). Save this asspeed1
. - Create the same table using
.selectExpr()
and a string containing a SQL expression. Save this asspeed2
.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Define avg_speed
avg_speed = (flights.____/(flights.____/60)).alias("____")
# Select the correct columns
speed1 = flights.select("origin", "dest", "tailnum", avg_speed)
# Create the same table using a SQL expression
speed2 = flights.selectExpr("____", "____", "____", "distance/(air_time/60) as ____")