Examining the structure of categorical inputs
For this exercise, you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling.
The dataset flowers (derived from the Sleuth3 package) has been loaded for you. It has the following columns:
- Flowers: the average number of flowers on a meadowfoam plant
- Intensity: the intensity of a light treatment applied to the plant
- Time: a categorical variable indicating when (Late or Early) in the lifecycle the light treatment occurred
The ultimate goal is to predict Flowers as a function of Time and Intensity.
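As a rough sketch of what model.matrix() does with a mixed numeric and categorical input, the snippet below builds a small stand-in data frame. The name toy and all of its values are hypothetical and only mirror the column names of flowers; they are not taken from the real dataset.

```r
# Hypothetical stand-in for flowers: same column names, made-up values.
toy <- data.frame(
  Flowers   = c(60, 75, 50, 70, 40, 65),
  Intensity = c(150, 150, 300, 300, 450, 450),
  Time      = c("Late", "Early", "Late", "Early", "Late", "Early"),
  stringsAsFactors = FALSE
)

# Build the model matrix for Flowers as a function of Intensity and Time.
toy_mmat <- model.matrix(Flowers ~ Intensity + Time, data = toy)

# Inspect which columns the modeling code will actually see.
print(colnames(toy_mmat))
print(toy_mmat)
```

Under R's default treatment contrasts, the two-level Time variable should show up as a single indicator column alongside an intercept column, while Intensity is carried over as a numeric column.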
This exercise is part of the course Supervised Learning in R: Regression.
Exercise instructions
- Call the str() function on flowers to see the types of each column.
- Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?
- Create a formula to express Flowers as a function of Intensity and Time. Assign it to the variable fmla and print it.
- Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.
- Use head() to examine the first 20 lines of flowers.
- Now examine the first 20 lines of mmat.
  - Is the numeric column Intensity different?
  - What happened to the categorical column Time from flowers?
  - How is Time == 'Early' represented? And Time == 'Late'?
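One way to check the questions in the last step programmatically is sketched below. It assumes a hypothetical data frame demo with made-up values (not the real flowers data), and the names demo, demo_fmla, and demo_mmat are illustrative only. Note that head() returns 6 rows by default, so the n argument is needed to ask for 20.

```r
# Hypothetical data with the same column names as flowers; values are made up.
demo <- data.frame(
  Flowers   = c(58, 72, 49, 66, 41, 63, 37, 55),
  Intensity = c(150, 150, 300, 300, 450, 450, 600, 600),
  Time      = c("Late", "Early", "Late", "Early", "Late", "Early", "Late", "Early"),
  stringsAsFactors = FALSE
)

# as.formula() turns a string into a formula object.
demo_fmla <- as.formula("Flowers ~ Intensity + Time")
demo_mmat <- model.matrix(demo_fmla, data = demo)

# head() shows 6 rows by default; n requests up to 20 here.
print(head(demo_mmat, n = 20))

# Is the numeric column carried over unchanged?
intensity_same <- all(demo_mmat[, "Intensity"] == demo$Intensity)

# Does the indicator column equal 1 for "Late" and 0 for "Early"?
late_is_one <- all(demo_mmat[, "TimeLate"] == as.numeric(demo$Time == "Late"))

print(c(intensity_same = intensity_same, late_is_one = late_is_one))
```

The column name TimeLate follows from R sorting the levels alphabetically and treating the first level (Early) as the baseline; a different level order or contrast setting would change the encoding.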
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Call str on flowers to see the types of each column
___
# Use unique() to see how many possible values Time takes
___
# Build and print a formula to express Flowers as a function of Intensity and Time: fmla
(fmla <- ___("Flowers ~ Intensity + Time"))
# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- ___
# Examine the first 20 lines of flowers
___
# Examine the first 20 lines of mmat
___