Looking at data

The dataset salesData is loaded in the workspace. It contains information on customers for months one to three; for month four, only the sales are included. The following table describes some of the variables whose meaning is less obvious.

Variable        Description
id              identification number of the customer
mostFreqStore   store the customer bought from most often
mostFreqCat     category the customer purchased most often
nCats           number of different categories purchased
preferredBrand  brand the customer purchased most often
nBrands         number of different brands purchased

The packages readr, dplyr, corrplot, and ggplot2 have been installed and loaded.

This exercise is part of the course

Machine Learning for Marketing Analytics in R


Exercise instructions

  • Use the structure command str() to get an overview of the data.
  • Now visualize the correlation of the continuous explanatory variables for the past three months with the sales variable of this month. Use the functions cor() and corrplot() and the pipe operator. Note that the relevant variables have already been selected for you.
  • Additionally, make a boxplot displaying the distribution of salesThisMon depending on the levels of the categorical variable preferredBrand. The same has already been done for the categorical explanatory variable mostFreqStore.
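
The first two steps can be tried out on a small made-up data frame before working on salesData. The sketch below is only an illustration: the data frame toy, its values, and the column nItems are assumptions and not part of the course data. It uses base R to select the numeric columns and calls corrplot() only if the package is available.

```r
# Toy data standing in for salesData (all values are made up;
# nItems is a hypothetical column, not taken from the course data)
set.seed(1)
toy <- data.frame(
  id = 1:50,
  nItems = rpois(50, 20),
  nCats = sample(1:10, 50, replace = TRUE),
  preferredBrand = factor(sample(c("A", "B", "C"), 50, replace = TRUE))
)
toy$salesThisMon <- 5 * toy$nItems + rnorm(50, sd = 10)

# Overview of column types and first values
str(toy, give.attr = FALSE)

# Keep the numeric columns, drop the identifier,
# and compute the pairwise correlation matrix
numericCols <- toy[sapply(toy, is.numeric)]
numericCols$id <- NULL
corMat <- cor(numericCols)

# Draw the correlation plot only if corrplot is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corMat)
}
```

The identifier is dropped before calling cor() because its correlation with sales carries no meaning; the factor column is excluded automatically because it is not numeric.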

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Structure of dataset
str(___, give.attr = FALSE)

# Visualization of correlations
salesData %>% select_if(is.numeric) %>%
  select(-id) %>%
  ___
  ___

# Frequent stores
ggplot(salesData) +
    geom_boxplot(aes(x = mostFreqStore, y = salesThisMon))

# Preferred brand
ggplot(___) +
    geom_boxplot(aes(x = ___, y = ___))
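
To read a boxplot of this kind, it can help to look at the numbers behind it. The following sketch computes a five-number summary of a sales variable per brand on made-up data; the data frame toy and its values are assumptions for illustration only. Note that fivenum() uses Tukey's hinges, which can differ slightly from the quartiles that geom_boxplot() draws, and that boxplot whiskers follow the 1.5 × IQR rule rather than always reaching the minimum and maximum.

```r
# Toy data (made-up values): one categorical and one continuous variable
set.seed(2)
toy <- data.frame(
  preferredBrand = factor(sample(c("A", "B", "C"), 60, replace = TRUE)),
  salesThisMon = rgamma(60, shape = 2, scale = 50)
)

# Minimum, lower hinge, median, upper hinge, and maximum of sales per brand
brandSummary <- tapply(toy$salesThisMon, toy$preferredBrand, fivenum)
print(brandSummary)
```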