Loading data in PySpark shell
In PySpark, computation is expressed through operations on distributed collections that are automatically parallelized across the cluster. In the previous exercise, you saw an example of loading a list as a parallelized collection; in this exercise, you'll load data from a local file in the PySpark shell.
Remember, you already have a SparkContext sc and a file_path variable (the path to the README.md file) available in your workspace.
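As a quick recap of the previous exercise, a Python list becomes a parallelized collection (an RDD) via SparkContext.parallelize. A minimal sketch, assuming the shell's sc is available:

# Create an RDD from an in-memory Python list
numbers = sc.parallelize([1, 2, 3, 4, 5])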
Exercise instructions
- Load the local text file README.md in the PySpark shell.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load a local file into PySpark shell
lines = sc.____(file_path)
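If you want to check your answer: the blank is textFile, the SparkContext method that reads a text file and returns an RDD with one element per line. Note that textFile is lazy; the file is only read when an action such as lines.count() or lines.first() is executed.

# Load a local file into PySpark shell
lines = sc.textFile(file_path)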