File import performance
You've been given a large set of data to import into a Spark DataFrame. You'd like to test the difference in import speed by splitting up the file.
You have two types of files available: departures_full.txt.gz, and departures_xxx.txt.gz where xxx runs from 000 to 013. The split files together hold the same number of rows as the full file.
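To read all of the split files in one call, a wildcard path is a common approach. The sketch below only checks, in plain Python, that a candidate pattern matches the 14 split files and skips the full file. The pattern `departures_0*.txt.gz` is an assumption chosen for illustration, and `fnmatch` is used here as a stand-in for Spark's own path globbing, which should behave the same for a simple `*`.

```python
from fnmatch import fnmatch

# File names as described above: one full file plus 14 split files (000 to 013).
full_file = "departures_full.txt.gz"
split_files = ["departures_%03d.txt.gz" % i for i in range(14)]

# Hypothetical wildcard pattern, chosen so that it matches the split files
# but not the full file.
split_pattern = "departures_0*.txt.gz"

# Keep only the names that the pattern matches.
matched = [name for name in [full_file] + split_files if fnmatch(name, split_pattern)]

print("Files matched:", len(matched))
print("Full file matched:", full_file in matched)
```

A broader pattern such as `departures_*.txt.gz` would also pick up the full file and double the row count, which is why the pattern here starts with `departures_0`.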
This exercise is part of the course Cleaning Data with PySpark.
Exercise instructions
- Import the departures_full.txt.gz file and the departures_xxx.txt.gz files into separate DataFrames.
- Run a count on each DataFrame and compare the run times.
Hands-on interactive exercise
Complete this exercise by finishing the sample code.
import time

# Import the full and split files into DataFrames
full_df = spark.read.csv('____')
split_df = ____(____)
# Print the count and run time for each DataFrame
start_time_a = time.time()
print("Total rows in full DataFrame:\t%d" % ____)
print("Time to run: %f" % (time.time() - start_time_a))
start_time_b = time.time()
print("Total rows in split DataFrame:\t%d" % ____)
print("Time to run: %f" % (time.time() - start_time_b))
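The timing pattern in the template can also be wrapped in a small helper so that both DataFrames are measured the same way. This is a minimal sketch, not the exercise's official solution: `timed_count` and `compare_import_speed` are hypothetical names, the `spark` argument is assumed to be an existing SparkSession, and the paths are passed in by the caller rather than hard-coded.

```python
import time


def timed_count(df):
    """Return (row_count, elapsed_seconds) for a single df.count() call."""
    start = time.perf_counter()
    rows = df.count()  # count() is the action that forces the data to be read
    return rows, time.perf_counter() - start


def compare_import_speed(spark, full_path, split_pattern):
    """Read both inputs with spark.read.csv and time a count on each.

    `spark` is assumed to be an existing SparkSession; `split_pattern`
    may be a wildcard path that covers all of the split files.
    """
    full_df = spark.read.csv(full_path)
    split_df = spark.read.csv(split_pattern)

    results = {}
    for label, df in (("full", full_df), ("split", split_df)):
        rows, elapsed = timed_count(df)
        print("Total rows in %s DataFrame:\t%d" % (label, rows))
        print("Time to run: %f" % elapsed)
        results[label] = (rows, elapsed)
    return results
```

In a session where `spark` already exists, a call might look like `compare_import_speed(spark, 'departures_full.txt.gz', 'departures_0*.txt.gz')`, where the wildcard path is again an assumption. Because gzip files cannot be split, a single large .gz file is read by one task, while several smaller files can be read in parallel, so the split read is the one expected to finish faster on a multi-core setup.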