
Schema writeout

We've loaded schemas in several ways now, so let's define a schema directly. We'll use a data dictionary:

Variable         Description
age              Individual age
education_num    Education by degree
marital_status   Marital status
occupation       Occupation
income           Categorical income
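
Each row of this data dictionary corresponds to one StructField in a PySpark StructType: a column name plus a Spark data type. As a minimal sketch (assuming age is stored as an integer and that nulls are allowed), a single field could be defined like this:

from pyspark.sql.types import StructField, IntegerType

# One data-dictionary row as a schema field: name, type, nullable
age_field = StructField("age", IntegerType(), True)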

This exercise is part of the course Introduction to PySpark.

Exercise instructions

  • Specify the data schema, giving column names (age, education_num, marital_status, occupation, and income) and column types.
  • Read data from a comma-delimited file called adult_reduced_100.csv, passing a comma to the sep= argument.
  • Print the schema for the resulting DataFrame.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Fill in the schema with the columns you need from the exercise instructions
schema = StructType([____("____",____()),
                     ____("____",____()),
                     ____("marital_status",StringType()),
                     StructField("____",____()),
                     ____("____",____()),
                    ])

# Read in the CSV, using the schema you defined above
census_adult = spark.read.csv("adult_reduced_100.csv", sep='____', header=False, schema=schema)

# Print out the schema
census_adult.____
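
For reference, one possible completed version is sketched below. It assumes age and education_num are stored as integers and the remaining columns as strings, following the data dictionary above; the getOrCreate() line is only there so the sketch runs outside the exercise environment, which already provides spark.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# The exercise environment provides `spark`; create one here so the sketch is self-contained
spark = SparkSession.builder.getOrCreate()

# Integer columns for age and education_num, string columns for the rest
schema = StructType([StructField("age", IntegerType()),
                     StructField("education_num", IntegerType()),
                     StructField("marital_status", StringType()),
                     StructField("occupation", StringType()),
                     StructField("income", StringType()),
                    ])

# Read the comma-delimited file with no header row, applying the schema
census_adult = spark.read.csv("adult_reduced_100.csv", sep=',', header=False, schema=schema)

# Print the resulting schema
census_adult.printSchema()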