Examining the structure of categorical inputs

In this exercise, you will call model.matrix() to examine how R represents data with both categorical and numerical inputs for modeling. The dataset flowers (derived from the Sleuth3 package) has been loaded for you. It has the following columns:

  • Flowers: the average number of flowers on a meadowfoam plant
  • Intensity: the intensity of a light treatment applied to the plant
  • Time: a categorical variable indicating when in the lifecycle (Late or Early) the light treatment occurred

The ultimate goal is to predict Flowers as a function of Time and Intensity.
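
To make the idea of a model matrix concrete before working with flowers, here is a minimal sketch on a small made-up data frame. The data frame toy and its columns y, dose, and group are hypothetical and are not part of the exercise; they only illustrate how model.matrix() handles one numeric and one two-level categorical input under R's default treatment contrasts.

```r
# Hypothetical toy data (NOT the flowers dataset):
# one numeric input (dose) and one two-level categorical input (group)
toy <- data.frame(
  y     = c(10, 12, 15, 11),
  dose  = c(1, 2, 3, 4),
  group = c("a", "b", "a", "b")
)

# Build the model matrix for y as a function of dose and group
mm <- model.matrix(y ~ dose + group, data = toy)

# Columns: an intercept, dose passed through unchanged,
# and a single 0/1 indicator column for group == "b"
print(colnames(mm))
print(mm)
```

In this sketch the first level of group ("a") becomes the reference level and gets no column of its own; rows with group == "b" have a 1 in the indicator column and all other rows have a 0.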

This exercise is part of the course

Supervised Learning in R: Regression

Exercise instructions

  • Call the str() function on flowers to see the types of each column.
  • Use the unique() function on the column flowers$Time to see the possible values that Time takes. How many unique values are there?
  • Create a formula to express Flowers as a function of Intensity and Time. Assign it to the variable fmla and print it.
  • Use fmla and model.matrix() to create the model matrix for the data frame flowers. Assign it to the variable mmat.
  • Use head() with n = 20 to examine the first 20 rows of flowers (by default, head() shows only 6).
  • Now examine the first 20 rows of mmat in the same way.
    • Is the numeric column Intensity different?
    • What happened to the categorical column Time from flowers?
    • How is Time == 'Early' represented? And Time == 'Late'?
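
The same sequence of steps can be sketched on a stand-in dataset. The example below uses ToothGrowth, which ships with base R, purely as an assumption for illustration; its columns len, dose, and supp play the roles of an outcome, a numeric input, and a two-level categorical input. It is not the flowers data, and the exercise answers will differ.

```r
# Stand-in data: ToothGrowth from base R (len, supp, dose), NOT flowers
data(ToothGrowth)

# Inspect the column types; supp is a factor
str(ToothGrowth)

# See which values the categorical column takes
unique(ToothGrowth$supp)

# Build a formula from a string and print it
fmla <- as.formula("len ~ dose + supp")
print(fmla)

# Create the model matrix for the data frame
mmat <- model.matrix(fmla, data = ToothGrowth)

# Compare the first 20 rows of the data frame and of the model matrix
head(ToothGrowth, n = 20)
head(mmat, n = 20)
```

Comparing the two head() outputs side by side shows the numeric column carried over unchanged and the factor replaced by an indicator column.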

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Call str on flowers to see the types of each column
___

# Use unique() to see how many possible values Time takes
___

# Build and print a formula to express Flowers as a function of Intensity and Time: fmla
(fmla <- ___("Flowers ~ Intensity + Time"))

# Use fmla and model.matrix to see how the data is represented for modeling
mmat <- ___

# Examine the first 20 lines of flowers
___

# Examine the first 20 lines of mmat
___