Dropping the middle man
Now you know how to put data into Spark via pandas, but you're probably wondering why you should bother with pandas at all. Wouldn't it be easier to read a text file straight into Spark? Of course it would!
Luckily, your SparkSession has a .read attribute which has several methods for reading different data sources into Spark DataFrames. Using these you can create a DataFrame from a .csv file just like with regular pandas DataFrames!
The variable file_path is a string with the path to the file airports.csv. This file contains information about different airports all over the world.
A SparkSession named spark is available in your workspace.
This exercise is part of the course Foundations of PySpark.
Exercise instructions
- Use the `.read.csv()` method to create a Spark DataFrame called `airports`.
  - The first argument is `file_path`.
  - Pass the argument `header=True` so that Spark knows to take the column names from the first line of the file.
- Print out this DataFrame by calling `.show()`.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Don't change this file path
file_path = "/usr/local/share/datasets/airports.csv"
# Read in the airports data
airports = ____.____.____(____, ____=____)
# Show the data
____.____()