Exercise

File import performance

You've been given a large set of data to import into a Spark DataFrame. You'd like to test whether splitting the file into smaller pieces changes the import speed.

Two sets of files are available: the single file departures_full.txt.gz, and the split files departures_xxx.txt.gz, where xxx runs from 000 to 013. The split files contain the same rows as the full file, divided evenly among them.

Instructions

  • Import the departures_full.txt.gz file and the departures_xxx.txt.gz files into separate DataFrames.
  • Run a count on each DataFrame and compare the run times.
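
One way to approach these steps is sketched below. It is not the official solution. It assumes an existing SparkSession is passed in as spark, that the files sit in the current working directory, and that the glob departures_0*.txt.gz matches exactly the split files 000 to 013. The helper names timed_count and compare_import_speed are made up for this illustration.

```python
import time


def timed_count(df):
    """Run df.count() and return (row_count, elapsed_seconds)."""
    start = time.perf_counter()
    rows = df.count()  # count() is an action, so this is where the read actually happens
    return rows, time.perf_counter() - start


def compare_import_speed(spark,
                         full_path="departures_full.txt.gz",
                         split_glob="departures_0*.txt.gz"):
    """Load the full file and the split files, then time a count on each.

    `spark` is assumed to be an already-created SparkSession.
    The default paths are assumptions based on the file names in the exercise.
    """
    # spark.read.csv accepts a single path or a glob pattern and
    # decompresses .gz files transparently.
    full_df = spark.read.csv(full_path)
    split_df = spark.read.csv(split_glob)

    results = {}
    for name, df in (("full", full_df), ("split", split_df)):
        rows, seconds = timed_count(df)
        results[name] = (rows, seconds)
        print(f"{name}: {rows} rows counted in {seconds:.3f} s")
    return results
```

Calling compare_import_speed(spark) prints one line per DataFrame, so the two run times can be compared directly. The timing wraps count() rather than the read call because Spark evaluates lazily. A single gzip file generally cannot be split across tasks, so the split files may load faster on a cluster, but the actual numbers depend on the environment.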