Dropping the middleman
Now you know how to put data into Spark via pandas, but you're probably wondering: why deal with pandas at all? Wouldn't it be easier to just read a text file straight into Spark? Of course it would!

Luckily, your SparkSession has a .read attribute, which has several methods for reading different data sources into Spark DataFrames. Using these, you can create a DataFrame from a .csv file just like with regular pandas DataFrames!
The variable file_path is a string with the path to the file airports.csv. This file contains information about different airports all over the world.

A SparkSession named spark is available in your workspace.
This exercise is part of the course Foundations of PySpark.
Exercise instructions
- Use the .read.csv() method to create a Spark DataFrame called airports.
  - The first argument is file_path.
  - Pass the argument header=True so that Spark knows to take the column names from the first line of the file.
- Print out this DataFrame by calling .show().
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Don't change this file path
file_path = "/usr/local/share/datasets/airports.csv"
# Read in the airports data
airports = ____.____.____(____, ____=____)
# Show the data
____.____()