Schema writeout
We've loaded schemas in multiple ways now, so let's define a schema directly. We'll use a data dictionary:
| Variable | Description |
|---|---|
| age | Individual age |
| education_num | Education by degree |
| marital_status | Marital status |
| occupation | Occupation |
| income | Categorical income |
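Each row of the dictionary becomes one `StructField` in the schema: the variable name supplies the field name, and the description suggests a Spark type. A minimal sketch for a single field, assuming `age` holds whole numbers and so maps to `IntegerType`:

```python
from pyspark.sql.types import StructField, IntegerType

# "age | Individual age" -> a field named "age", assumed to hold integers
age_field = StructField("age", IntegerType())
```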
This exercise is part of the course Introduction to PySpark.
Exercise instructions
- Specify the data schema, giving column names (`age`, `education_num`, `marital_status`, `occupation`, and `income`) and column types.
- Read data from a comma-delimited file called `adult_reduced_100.csv`, setting a comma for the `sep=` argument.
- Print the schema for the resulting DataFrame.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Fill in the schema with the columns you need from the exercise instructions
schema = StructType([____("____", ____()),
                     ____("____", ____()),
                     ____("marital_status", StringType()),
                     StructField("____", ____()),
                     ____("____", ____()),
                     ])
# Read in the CSV, using the schema you defined above
census_adult = spark.read.csv("adult_reduced_100.csv", sep='____', header=False, schema=schema)
# Print out the schema
census_adult.____
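One possible way to complete the blanks, shown as a sketch rather than the official solution: `IntegerType` for `age` and `education_num` and `StringType` for the remaining columns are assumptions based on the data dictionary above, and `spark` is assumed to be an already-created `SparkSession`.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# One StructField per column in the data dictionary.
# age and education_num are assumed to be whole numbers (IntegerType);
# the remaining columns are read as strings (StringType).
schema = StructType([StructField("age", IntegerType()),
                     StructField("education_num", IntegerType()),
                     StructField("marital_status", StringType()),
                     StructField("occupation", StringType()),
                     StructField("income", StringType()),
                     ])

# Read the comma-delimited file with the explicit schema (no header row).
census_adult = spark.read.csv("adult_reduced_100.csv", sep=',', header=False, schema=schema)

# Print out the schema of the resulting DataFrame
census_adult.printSchema()
```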