Part 1: Create a DataFrame from CSV file
Every four years, soccer fans throughout the world celebrate the FIFA World Cup, and with it, everything seems to change in many countries. In this three-part exercise, you'll do some exploratory data analysis (EDA) on the "FIFA 2018 World Cup Player" dataset using PySpark SQL, which involves DataFrame operations, SQL queries, and visualization.
In the first part, you'll load the FIFA 2018 World Cup Players dataset (Fifa2018_dataset.csv), which is in CSV format, into a PySpark DataFrame and inspect the data using basic DataFrame operations.
Remember, you already have a SparkSession spark and a variable file_path available in your workspace.
This exercise is part of the course
Big Data Fundamentals with PySpark
Exercise instructions
- Create a PySpark DataFrame from file_path (which is the path to the Fifa2018_dataset.csv file).
- Print the schema of the DataFrame.
- Print the first 10 observations.
- Count the total number of rows in the DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Load the DataFrame
fifa_df = spark.____(____, header=True, inferSchema=True)
# Check the schema of columns
fifa_df.____()
# Show the first 10 observations
fifa_df.____(____)
# Print the total number of rows
print("There are {} rows in the fifa_df DataFrame".format(fifa_df.____()))