Selecting
The Spark variant of SQL's SELECT is the .select() method. This method takes multiple arguments, one for each column you want to select. These arguments can either be the column name as a string or a column object (using the df.colName syntax). When you pass a column object, you can perform operations like addition or subtraction on the column to change the data contained in it, much like inside .withColumn().
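For example, here is a minimal sketch of both argument styles (the "air_time" column used below is illustrative and assumed to hold flight duration in minutes):

# Pass column names as strings
by_name = flights.select("origin", "dest")

# Pass column objects, which lets you transform the data as you select it
with_hours = flights.select(flights.origin, flights.dest, flights.air_time / 60)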
The difference between the .select() and .withColumn() methods is that .select() returns only the columns you specify, while .withColumn() returns all the columns of the DataFrame in addition to the one you defined. It's often a good idea to drop columns you don't need at the beginning of an operation so that you're not dragging around extra data as you're wrangling. In this case, you would use .select() and not .withColumn().
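To see the contrast, here is a short sketch (again assuming an illustrative "air_time" column measured in minutes):

# .withColumn() returns every existing column plus the new one
all_cols = flights.withColumn("duration_hrs", flights.air_time / 60)

# .select() returns only the columns you ask for
some_cols = flights.select("origin", "dest", (flights.air_time / 60).alias("duration_hrs"))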
Remember, a SparkSession called spark is already in your workspace, along with the Spark DataFrame flights.
Exercise instructions
- Select the columns "tailnum", "origin", and "dest" from flights by passing the column names as strings. Save this as selected1.
- Select the columns "origin", "dest", and "carrier" using the df.colName syntax, and then filter the result using both of the filters already defined for you (filterA and filterB) to only keep flights from SEA to PDX. Save this as selected2.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Select the first set of columns
selected1 = flights.select("____", "____", "____")
# Select the second set of columns
temp = flights.select(____.____, ____.____, ____.____)
# Define first filter
filterA = flights.origin == "SEA"
# Define second filter
filterB = flights.dest == "PDX"
# Filter the data, first by filterA then by filterB
selected2 = temp.filter(____).filter(____)
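For reference, one possible completion consistent with the instructions above (try the exercise yourself first, and check this against your own workspace):

# Select the first set of columns
selected1 = flights.select("tailnum", "origin", "dest")

# Select the second set of columns
temp = flights.select(flights.origin, flights.dest, flights.carrier)

# Define first filter
filterA = flights.origin == "SEA"

# Define second filter
filterB = flights.dest == "PDX"

# Filter the data, first by filterA then by filterB
selected2 = temp.filter(filterA).filter(filterB)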