1. Foundations of feature extraction - principal components
Welcome back. Let's now turn our attention to feature extraction.
2. Feature extraction review
We'll start with a quick review. Instead of eliminating features like feature selection does, feature extraction combines parts of two or more features to create a new feature — called a principal component.
3. Feature extraction review
It's much like picking vegetables from a garden to make a salad. Feature selection is like pulling weeds, roots and all. Feature extraction is like picking the best parts of the plants and combining them into a tasty salad.
We might have a recipe for the salad: one head of lettuce, three carrots, two tomatoes, and one cucumber. We don't use the whole plants; we just use the best parts of each vegetable.
4. PCA plot
Let's use a subset of the employee attrition data to better understand principal components. Here we have a principal component analysis plot. Each axis represents a principal component the analysis extracted.
5. Principal component 1
The first principal component lies along the x-axis. Notice how it captures 40.66 percent of the variation in the data.
6. PC1: feature vectors
PC1 is composed of YearsSinceLastPromotion, TotalWorkingYears, and MonthlyIncome. The three arrows are the feature vectors. They point in the same general direction as PC1, so they were combined into PC1.
To illustrate how feature extraction can be difficult to interpret, let's attempt to give PC1 a descriptive label. Conceptually, YearsSinceLastPromotion and TotalWorkingYears both convey a sense of elapsed time. At first glance, MonthlyIncome does not seem related to time, but employees who have worked longer probably have higher incomes.
7. PC1: name
So, perhaps we could label PC1 as "duration".
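To see which features a component combines, we can inspect the loadings directly. Below is a minimal sketch; the employee attrition data isn't loaded here, so a small synthetic data frame with the same column names stands in for it.

```r
# Sketch: inspecting which features load onto each principal component.
# The attrition data isn't available here, so we mimic the three PC1
# features with synthetic, deliberately correlated values.
set.seed(42)
n <- 100
TotalWorkingYears       <- round(runif(n, 1, 40))
YearsSinceLastPromotion <- pmin(TotalWorkingYears, rpois(n, 3))
MonthlyIncome           <- 2000 + 250 * TotalWorkingYears + rnorm(n, 0, 1000)
df <- data.frame(YearsSinceLastPromotion, TotalWorkingYears, MonthlyIncome)

pca_res <- prcomp(df, scale. = TRUE)

# The rotation matrix holds the loadings: each column is a principal
# component, and each row is a feature's weight in that component.
round(pca_res$rotation, 2)
```

Features with large-magnitude weights in the PC1 column are the ones the first component "picks" from, which is what the arrows on the plot visualize.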
8. Principal component 2
Now let's look at PC2, which captures an additional 35.35 percent of the variation in the data.
9. PC2: feature vectors
Both PerformanceRating and PercentSalaryHike correlate with performance.
10. PC2: name
So, performance could be a fitting label for PC2.
The idea behind principal components is the foundation of feature extraction. We will dive deeper into the details of principal component analysis as well as t-SNE and UMAP.
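The per-component percentages quoted above come straight out of the PCA results. A short sketch, again on synthetic stand-in data; with the real attrition subset the first two components would report roughly 40.66 and 35.35 percent.

```r
# Sketch: computing the proportion of variance explained per component.
# Synthetic stand-in data; the real attrition subset isn't loaded here.
set.seed(1)
df <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
pca_res <- prcomp(df, scale. = TRUE)

# Proportion of variance = each component's squared standard deviation
# divided by the total variance across all components.
prop_var <- pca_res$sdev^2 / sum(pca_res$sdev^2)
round(prop_var, 4)

# summary() reports the same proportions, plus the cumulative proportion.
summary(pca_res)
```

The proportions always sum to one, so the cumulative row of summary() tells you how much variation a given number of components retains.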
11. Code for a PCA plot
To conclude, let's look at the code required to generate PCA plots. We will use a specific implementation of the autoplot function which comes from the ggfortify package, so we start by loading ggfortify.
Then we perform a principal component analysis with prcomp(). PCA only works with continuous variables, so we use select() to remove the categorical target variable, Attrition. We also set the scale. argument to TRUE; scaling the data keeps variables with larger ranges from dominating the principal components. We store the PCA results in pca_res.
Then, we pass pca_res to autoplot(). We set data to attrition_df, our data frame.
The rest of the code sets parameters that control the plot's appearance. We set colour to Attrition, the target variable, to color-code the data points. We set alpha to 0.7 to make the plotted points slightly transparent, which makes overlapping points easier to see.
Setting loadings to TRUE draws the feature vector arrows, and loadings.label = TRUE prints the feature names. We change the color of the arrows and labels to black with loadings.colour and loadings.label.colour, respectively. Lastly, to keep the labels from overlapping, we set loadings.label.repel to TRUE.
This code will create the PCA plot we've been working with.
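Putting the pieces together, the full call looks roughly like this. It requires the ggfortify package, and the data frame below is a small stand-in for the employee attrition subset.

```r
# Sketch: the complete PCA plot described above. Requires ggfortify;
# attrition_df is a small stand-in for the real attrition subset.
library(ggfortify)

attrition_df <- data.frame(
  Attrition               = factor(c("Yes", "No", "No", "Yes", "No", "No")),
  MonthlyIncome           = c(2500, 6100, 4800, 3000, 7200, 5400),
  TotalWorkingYears       = c(2, 15, 9, 4, 21, 12),
  YearsSinceLastPromotion = c(1, 4, 2, 1, 7, 3)
)
pca_res <- prcomp(attrition_df[, -1], scale. = TRUE)  # drop Attrition

autoplot(
  pca_res,
  data = attrition_df,
  colour = 'Attrition',            # color-code points by the target
  alpha = 0.7,                     # slight transparency for overlaps
  loadings = TRUE,                 # draw the feature vector arrows
  loadings.colour = 'black',
  loadings.label = TRUE,           # label arrows with feature names
  loadings.label.colour = 'black',
  loadings.label.repel = TRUE      # keep labels from overlapping
)
```

Passing the original data frame via data is what lets autoplot() map the Attrition column to point color even though Attrition was excluded from the PCA itself.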
12. Let's practice!
Now it's your turn to practice.