1. PCA computation
In this video we're going to compute a PCA with the online shop customer data and take a first look at the results.
2. Status quo
3. Standardization
With a PCA, the focus lies on the variances of the variables. Consequently, variables with high variances would be overrepresented in the resulting principal components. Additionally, differences in the measurement units would introduce an artificial weighting of the variables. We avoid this by standardizing the variables with the `scale()` function, transforming them so that each has a mean of zero and a variance of one.
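A minimal sketch of this step, assuming the customer data sit in a data frame of numeric variables called `onlineShop` (that name is just a placeholder):

```r
# Standardize all variables: mean 0, variance 1
# (onlineShop is an assumed name for the customer data frame)
customers_std <- scale(onlineShop)

# Quick check: means are (numerically) zero, variances are one
round(colMeans(customers_std), 2)
round(apply(customers_std, 2, var), 2)
```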
4. PCA computation
OK, we are all set to compute a PCA. We do this with the `prcomp()` function from the `stats` package, which takes the whole dataset as input. We save the result to an object called `pcaCust`. The result is a list with five elements.
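Sticking with the standardized data object assumed above, the call could look roughly like this:

```r
# Compute the PCA on the standardized customer data
pcaCust <- prcomp(customers_std)

# The result is a list with five elements:
# sdev, rotation, center, scale, and x
names(pcaCust)
```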
5. Standard deviations of the components
The first element, `sdev`, of the `prcomp` object holds the standard deviations of the extracted components. All in all, we extracted 16 components, the same as the number of variables. As expected, the first component has the highest standard deviation. The standard deviations of the remaining components become smaller and smaller, as each component covers less of the original variance of the data.
The variances of the components, that is the squared standard deviations, are called eigenvalues. They serve as a nice measure of the importance of the respective component. The higher the eigenvalue, the more important a component is.
Additionally, because the variables were standardized, the total variance equals the number of components. So if we divide an eigenvalue by the number of components, we get the proportion of variance that this component explains.
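Here is a short sketch of these quantities, again using the assumed object names from above:

```r
# Standard deviations of the 16 components
pcaCust$sdev

# Eigenvalues are the squared standard deviations
eigenvalues <- pcaCust$sdev^2

# With standardized variables, the total variance equals the number of
# components, so this gives the proportion of variance explained
round(eigenvalues / length(eigenvalues), 3)

# summary() reports the same proportions
summary(pcaCust)
```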
6. Loadings and interpretation
The element `rotation` of the `prcomp` object holds the loadings of the PCA. They describe the relationship between the original variables and the components and help you to interpret the components.
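Pulling them out of the `pcaCust` object could look like this:

```r
# Loadings of the variables on the first six components
pcaCust$rotation[, 1:6]
```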
The table shows the loadings for the first six components, which I selected using subsetting brackets. We will start by interpreting the first three of them.
- For the first component, the number of orders, the number of ordered items, the number of sold items, and the sales of ordered as well as sold items carry the highest negative loadings. All these variables reflect customer activity, hence we name this component "low activity".
- For the second component, the sales per order and the items sold per order have high positive loadings, while both the monetary and the normal return ratio have high negative loadings. This reflects the tendency of customers to actually keep the ordered items instead of returning them. So we call this component "low returns".
- And finally the third component. It is positively correlated with the price per item and per order, and negatively correlated with the share of own brand. Consequently we will call this component "brand awareness".
We will stop here to move on to the so-called values.
7. Values of the observations
The components of a PCA can be considered as something like a weighted sum. Customer-specific characteristics are weighted according to the loadings they have on the respective component and summed up.
Let me calculate the value for the first component for the first customer in the dataset by hand.
From the standardized dataset, we select the vector of the first customer's characteristics. Then, we multiply it element-wise by the vector of loadings on the first component from the `rotation` element. In the end, we sum up the elements of the resulting vector.
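In code, and under the same naming assumptions as before, this by-hand check could look like this:

```r
# Standardized characteristics of the first customer
cust1 <- customers_std[1, ]

# Loadings of the variables on the first component
load1 <- pcaCust$rotation[, 1]

# Weight the characteristics by the loadings and sum up
sum(cust1 * load1)

# This matches the value that prcomp() stored for this customer
pcaCust$x[1, 1]
```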
The values for each customer and each component are stored in the element named `x`.
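A quick look at this element, again with the assumed object name:

```r
# Scores of all customers on the first three components
head(pcaCust$x[, 1:3])

# One row per customer, one column per component
dim(pcaCust$x)
```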
We will need the values later on to compute a regression analysis based on the PCA.
8. It's your turn!
But let's practice first.