RDDs from External Datasets
PySpark can easily create RDDs from files that are stored in external storage devices, such as HDFS (Hadoop Distributed File System), Amazon S3 buckets, etc. However, the most common method of creating RDD's is from files stored in your local file system. This method takes a file path and reads it as a collection of lines. In this exercise, you'll create an RDD from the file path (file_path
) with the file name README.md
which is already available in your workspace.
Remember, you already have a SparkContext sc
available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Print the
file_path
in the PySpark shell. - Create a RDD named
fileRDD
from afile_path
. - Print the type of the
fileRDD
created.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Print the file_path
print("The file_path is", ____)
# Create a fileRDD from file_path
fileRDD = sc.____(file_path)
# Check the type of fileRDD
print("The file type of fileRDD is", type(____))