Defining a schema
Creating a defined schema helps with data quality and import performance. As mentioned during the lesson, we'll create a simple schema to read in the following columns:
- Name
- Age
- City
The Name and City columns are StringType() and the Age column is an IntegerType().
This exercise is part of the course
Cleaning Data with PySpark
Exercise instructions
- Import * from the pyspark.sql.types library.
- Define a new schema using the StructType method.
- Define a StructField for name, age, and city. Each field should correspond to the correct datatype and not be nullable.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import the pyspark.sql.types library
____

# Define a new schema using the StructType method
people_schema = ____([
  # Define a StructField for each field
  StructField('name', ____, False),
  ____('____', IntegerType(), ____),
  ____
])