Creating columns
In this chapter, you'll learn how to use the methods defined by Spark's DataFrame class to perform common data operations.
Let's look at performing column-wise operations. In Spark you can do this using the .withColumn() method, which takes two arguments. First, a string with the name of your new column, and second the new column itself.
The new column must be an object of class Column. Creating one of these is as easy as extracting a column from your DataFrame using df.colName.
Updating a Spark DataFrame is somewhat different than working in pandas because the Spark DataFrame is immutable. This means that it can't be changed, and so columns can't be updated in place.
Thus, all these methods return a new DataFrame. To overwrite the original DataFrame you must reassign the returned DataFrame using the method like so:
df = df.withColumn("newCol", df.oldCol + 1)
The above code creates a DataFrame with the same columns as df plus a new column, newCol, where every entry is equal to the corresponding entry from oldCol, plus one.
To overwrite an existing column, just pass the name of the column as the first argument!
Remember, a SparkSession called spark is already in your workspace.
This exercise is part of the course
Foundations of PySpark
Exercise instructions
- Use the
spark.table()method with the argument"flights"to create a DataFrame containing the values of theflightstable in the.catalog. Save it asflights. - Show the head of
flightsusingflights.show(). Check the output: the columnair_timecontains the duration of the flight in minutes. - Update
flightsto include a new column calledduration_hrs, that contains the duration of each flight in hours (you'll need to divideair_timeby the number of minutes in an hour).
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Create the DataFrame flights
flights = spark.table(____)
# Show the head
____.____()
# Add duration_hrs
flights = flights.withColumn(____)