1. PCA for CRM data
As you saw in the chapter on linear regression, CRM data can get very extensive. There, for example, we ran into a multicollinearity problem because several variables carried similar information. In this chapter you will learn how to reduce the number of variables in your dataset using principal component analysis, or PCA for short.
I will show you how to compute a PCA in `R`, interpret the results, find the right number of components, and use them for further analysis.
2. Introduction
PCA reduces a large number of correlated variables (shown in the plot with red arrows) to fewer uncorrelated components. The first component is determined such that it covers as much of the observations' variance as possible. It is called PC1 and is plotted on the x-axis. The second component is then determined such that it covers as much of the remaining variance as possible (the y-axis), and so on for the third component, the fourth, et cetera. In the end, there are as many components as variables, but you can choose a subset of them.
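As a minimal sketch of this idea (with simulated data, so the variable names are purely illustrative), `prcomp()` computes all components at once, ordered by how much variance they explain:

```r
set.seed(1)

# Two correlated variables and one independent one
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.3)  # carries similar information as x1
x3 <- rnorm(100)
toyData <- data.frame(x1, x2, x3)

# prcomp() computes the PCA; scale. = TRUE standardizes the variables first
pcaToy <- prcomp(toyData, scale. = TRUE)

# There are as many components as variables,
# with PC1 explaining the largest share of the variance
summary(pcaToy)
```

The standard deviations reported by `summary()` decrease from PC1 onward, which is exactly the "as much variance as possible first" ordering described above.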
3. PCA helps to...
High-dimensional data, which means data with many variables, is not easy to handle.
Remember that we checked for multicollinearity in the chapter about linear regression. With PCA, we can first reduce the variables to fewer components, and then use them for further analysis without any multicollinearity problems.
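To illustrate that idea (again with simulated data; in practice you would use your own predictors), the component scores from `prcomp()` can replace the correlated predictors in a regression, since the scores are uncorrelated by construction:

```r
set.seed(2)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.2)  # nearly collinear with x1
y  <- 1 + 0.5 * x1 + rnorm(200)

# Compute the components from the predictors only
pca <- prcomp(data.frame(x1, x2), scale. = TRUE)

# The scores in pca$x are uncorrelated, so no multicollinearity remains
regData <- data.frame(y = y, pca$x)
model <- lm(y ~ PC1 + PC2, data = regData)
summary(model)
```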
Another application of PCA is building an index. Instead of just averaging several variables, you can create a weighted average where the weights are derived from your data. Just use the first component of a PCA as your index. In the figure you see an index composed of features about customer activity.
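A sketch of the index idea, assuming three hypothetical activity variables (`logins`, `clicks`, `orders`) that I simulate here:

```r
set.seed(3)
n <- 150
logins <- rnorm(n, mean = 10)
clicks <- 5 * logins + rnorm(n, sd = 2)      # related to logins
orders <- 0.2 * logins + rnorm(n, sd = 0.5)  # related to logins
activity <- data.frame(logins, clicks, orders)

# The scores of the first component serve as a data-driven weighted index:
# the weights come from the data, not from an arbitrary equal weighting
activityIndex <- prcomp(activity, scale. = TRUE)$x[, 1]

head(activityIndex)
```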
PCA can also help you to get to know your data. It's easier to visualize two or three meaningful components than 20 or 30 variables.
In all these applications, PCA serves as an exploratory tool. It is not meant to test hypotheses about the structure of your data.
4. Data for PCA
In this chapter, we are going to work with data about customers of an online shop. All variables must be either continuous or binary.
The data is stored in the object `dataCustomers`. We print the structure of the data using the `str()` function. By specifying the argument `give.attr = FALSE`, we reduce the output's complexity. The data contains 16 variables about
- the number of orders and items,
- information on sales and orders,
- figures about returns,
- the composition of orders,
- the duration of customer relationship,
- and further metrics describing customer activity.
Some of these variables have similar content, which is perfect for a PCA!
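The call looks like this (sketched here on a small stand-in data frame with made-up column names, since the real `dataCustomers` object is not shown in full):

```r
# Stand-in for dataCustomers; the real object has 16 variables
dataCustomers <- data.frame(
  nOrders     = c(3, 7, 1),
  salesTotal  = c(120.5, 310.0, 45.9),
  returnRatio = c(0.0, 0.1, 0.0)
)

# give.attr = FALSE suppresses the attributes and keeps the output compact
str(dataCustomers, give.attr = FALSE)
```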
5. Correlation structure
To get a first overview of the correlation structure of the data, we'll compute and visualize the correlations of all variables with `cor()` and `corrplot()`. Note that the `cor()` function takes the whole dataset as input. The `corrplot()` function takes the estimated correlations as input. Positive correlations are blue and negative correlations are red or orange.
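A self-contained sketch of these two steps (using a small simulated data frame in place of the customer data):

```r
library(corrplot)  # install.packages("corrplot") if needed

set.seed(4)
x1 <- rnorm(100)
df <- data.frame(
  a = x1,
  b = x1 + rnorm(100, sd = 0.2),   # positively correlated with a
  c = -x1 + rnorm(100, sd = 0.2)   # negatively correlated with a
)

# cor() takes the whole (numeric) dataset as input ...
correlations <- cor(df)

# ... and corrplot() takes the estimated correlation matrix as input
corrplot(correlations)
```

In the resulting plot, the `a`/`b` cell shows up blue (positive) and the `a`/`c` cell red (negative), matching the color coding described above.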
You see that there are strong correlations between many of the variables. I can already tell you that the variables in the upper left corner will stand out in the results of the PCA!
But first it's your turn!
6. Let's practice!
Let's try some examples.