Defining a schema
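Defining a schema explicitly helps with data quality and import performance: columns are read using the types you declare rather than guessed ones, and Spark does not need an extra pass over the data to infer them. As mentioned during the lesson, we'll create a simple schema to read in the following columns: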
- Name
- Age
- City
The Name and City columns are StringType() and the Age column is an IntegerType().
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Import * from the pyspark.sql.types library.
- Define a new schema using the StructType method.
- Define a StructField for name, age, and city. Each field should correspond to the correct datatype and not be nullable.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the pyspark.sql.types library
____
# Define a new schema using the StructType method
people_schema = ____([
  # Define a StructField for each field
  StructField('name', ____, False),
  ____('____', IntegerType(), ____),
  ____
])