Are you query-ious?
One of the advantages of the DataFrame interface is that you can run SQL queries on the tables in your Spark cluster. If you don't have any experience with SQL, don't worry, we'll provide you with queries! (To learn more SQL, start with our Introduction to SQL course.)
As you saw in the last exercise, one of the tables in your cluster is the flights table. This table contains a row for every flight that left Portland International Airport (PDX) or Seattle-Tacoma International Airport (SEA) in 2014 and 2015.
Running a query on this table is as easy as using the .sql() method on your SparkSession. This method takes a string containing the query and returns a DataFrame with the results!
If you look closely, you'll notice that the table flights is only mentioned in the query, not as an argument to any of the methods. This is because there isn't a local object in your environment that holds that data, so it wouldn't make sense to pass the table as an argument.
Remember, we've already created a SparkSession called spark in your workspace. (It's no longer called my_spark because we created it for you!)
This exercise is part of the course
Foundations of PySpark
Exercise instructions
- Use the .sql() method to get the first 10 rows of the flights table and save the result to flights10. The variable query contains the appropriate SQL query.
- Use the DataFrame method .show() to print flights10.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Don't change this query
query = "FROM flights SELECT * LIMIT 10"
# Get the first 10 rows of flights
flights10 = ____
# Show the results
flights10.____