
Principal Component Analysis (PCA)

1. Principal Component Analysis (PCA)

Let's take a deeper dive into the details of a principal component analysis. We'll continue to use the employee attrition data.

2. Performing a PCA

To perform PCA in R, we use prcomp(). We pass it the continuous predictor features, set the scale. argument to TRUE, and store the results in pca_res. Let's look at a summary of pca_res. There were only five predictors in attrition_df, so prcomp() returns five principal components. The summary() displays the standard deviation of each PC, the proportion of the original data's variance it explains, and the cumulative proportion of variance. Notice that in the cumulative proportion row, moving left to right, all five PCs together explain one hundred percent of the variation. Also notice that PC1 and PC2 together explain approximately seventy-six percent of the total variation. We use the cumulative proportion to decide how many PCs to keep. We'll use this later in the model fitting process.
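A minimal sketch of this step, assuming attrition_df holds the data and that we keep only its numeric predictor columns (that column selection is an assumption, not shown in the transcript):

```r
library(dplyr)

# Keep only the continuous predictor columns (selection shown is an assumption)
numeric_predictors <- attrition_df %>% select(where(is.numeric))

# Run PCA, standardizing each feature with scale. = TRUE
pca_res <- prcomp(numeric_predictors, scale. = TRUE)

# Standard deviation, proportion of variance, and cumulative proportion for each PC
summary(pca_res)
```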

3. PC loadings

If we display the contents of pca_res, without summarizing it, we see how the original five features load on the resulting five PCs. The numbers under each PC are its loadings: they indicate how strongly, and in which direction, each original feature is associated with that PC, and they help us interpret each PC.
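For reference, a quick sketch of how to view those loadings from the prcomp object:

```r
# Printing pca_res shows the standard deviations and the loading (rotation) matrix
pca_res

# Or inspect the loading matrix directly
pca_res$rotation
```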

4. PC loadings

Let's focus on the first two PCs to see how this works.

5. PC loadings

We can see that the first three features (monthly income, total working years, and years since last promotion) load positively on PC1, so they are the major contributors to PC1. This is why we could label PC1 as duration: all three relate to the amount of time the employee has worked. Note that a feature's loading does not have to be positive for it to contribute strongly to a PC.

6. PC loadings

Percent salary hike and performance rating load positively on PC2, so we could label PC2 as performance. What we just reviewed illustrates the process we would go through to interpret each principal component.

7. PCA with tidymodels

Now, let's demonstrate how to use PCA in the tidymodels model building process. We start by creating a recipe. We add step_normalize() to scale all the numeric predictors, then add step_pca() to perform PCA on them. We set num_comp to two to keep the first two PCs for our model; remember, they accounted for seventy-six percent of the data's variation. We then create a workflow and pass it the PCA recipe and a new logistic regression model spec. We use logistic regression because the target variable, Attrition, is categorical. Then we fit the workflow to the training data. We bind the predictions on the test set to the actual values of the Attrition target variable and store the result in attrition_pred_df. Lastly, we pass attrition_pred_df to f_meas(), specifying the columns containing the actual and predicted values, to evaluate the performance of the model.
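Here is a sketch of that pipeline, assuming attrition_train and attrition_test are the existing training and test splits (those object names and the default glm engine are assumptions; the recipe, workflow, and metric calls mirror the steps described above):

```r
library(tidymodels)

# Recipe: normalize the numeric predictors, then replace them with the first two PCs
pca_rec <- recipe(Attrition ~ ., data = attrition_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

# Workflow: PCA recipe plus a logistic regression model spec
attrition_wkfl <- workflow() %>%
  add_recipe(pca_rec) %>%
  add_model(logistic_reg())

# Fit the workflow to the training data
attrition_fit <- fit(attrition_wkfl, data = attrition_train)

# Bind test-set predictions to the actual Attrition values
attrition_pred_df <- predict(attrition_fit, new_data = attrition_test) %>%
  bind_cols(attrition_test %>% select(Attrition))

# Evaluate performance with the F measure
f_meas(attrition_pred_df, truth = Attrition, estimate = .pred_class)
```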

8. See the PCs in the model details

Before we conclude, let's add some transparency and confirm that the model fit in the workflow was built on the principal components extracted in the recipe. Let's look at the model section of attrition_fit, the fitted workflow object. Notice how it provides coefficients for PC1 and PC2. That is because we requested two PCs in the step_pca() recipe step; those were the features passed on to fit the logistic regression model.
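One way to see this, sketched with tidymodels helpers (using extract_fit_parsnip() and broom's tidy() is an assumption about how you might pull the coefficients out):

```r
# Printing the fitted workflow shows the recipe steps and the model coefficients
attrition_fit

# Or extract the underlying parsnip model and tidy its coefficients
attrition_fit %>%
  extract_fit_parsnip() %>%
  tidy()
```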

9. Let's practice!

Let's practice.