
Reducing dimensionality

1. Reducing dimensionality

In this video, we will explore the concept of dimensionality reduction: reducing the number of features in our data, thereby decreasing complexity and computational time and, for some models, improving performance and stability.

2. Zero variance features

It is common to work with datasets that contain columns with constant data, as illustrated by the orange values in the figure. This lack of variability is not informative, can slow processing, and can even cause errors with certain models, like linear regression. We can fix this by filtering out zero-variance columns with step_zv() in our recipe.
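As a minimal sketch, the toy data frame below (made up for illustration, not the course's loans data) has a constant column that step_zv() filters out:

```r
# Minimal sketch of step_zv(); `toy` is a made-up data frame, not the course data.
library(recipes)

toy <- data.frame(
  outcome  = c(1, 0, 1, 0),
  income   = c(50, 62, 48, 75),
  constant = c(7, 7, 7, 7)        # zero-variance column
)

rec <- recipe(outcome ~ ., data = toy) %>%
  step_zv(all_predictors()) %>%   # drop predictors with a single constant value
  prep()

bake(rec, new_data = NULL)        # `constant` is no longer present
```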

3. Near-zero variance features

We can run into predictors that take only a handful of distinct values and are otherwise mostly constant. We call these near-zero variance features. Dealing with them is less clear-cut than in the zero-variance case. For example, it is common to observe a small percentage of unique values in dummy variables, which is not an issue. A rule of thumb is to consider removing a feature only if it exhibits both of the following characteristics: there are very few unique values relative to the number of samples, and the ratio of the frequency of the most common value to the frequency of the second most common value is large. The step_nzv() function identifies and removes predictors that meet these criteria.
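Here is a sketch of step_nzv() on simulated data; the freq_cut and unique_cut arguments (shown with their default values) correspond to the two rule-of-thumb criteria above.

```r
# Sketch of step_nzv() on simulated data; the thresholds shown are the defaults.
library(recipes)

set.seed(123)
toy <- data.frame(
  outcome         = rnorm(100),
  mostly_constant = c(rep(0, 98), 1, 2),  # near-zero variance predictor
  informative     = rnorm(100)
)

rec <- recipe(outcome ~ ., data = toy) %>%
  step_nzv(all_predictors(), freq_cut = 95/5, unique_cut = 10) %>%
  prep()

bake(rec, new_data = NULL)   # `mostly_constant` has been removed
```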

4. Principal Component Analysis (PCA)

The goal of PCA is to reduce the dimension of a dataset by deriving a lower-dimensional set of features while preserving as much information as possible, to reduce complexity and computation time and, in some cases, improve model performance. The details of PCA are out of scope for this course, but we do have a dedicated DataCamp course on the topic. In the figure, we see how a three-dimensional dataset is represented in two dimensions by the first two principal components while preserving most of the class separation conveyed by the original data. PCA is an unsupervised method, as it does not consider the target variable when deriving the components.
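As a stand-alone illustration of the idea in the figure (using base R's prcomp() rather than the recipes workflow covered next), simulated three-dimensional data is projected onto its first two principal components:

```r
# Illustration only: simulated 3-D data projected onto two principal components.
set.seed(1)
x1  <- rnorm(100)
x2  <- x1 + rnorm(100, sd = 0.2)   # strongly correlated with x1
x3  <- rnorm(100)
dat <- data.frame(x1, x2, x3)

pca <- prcomp(dat, center = TRUE, scale. = TRUE)
summary(pca)               # proportion of variance explained per component
scores_2d <- pca$x[, 1:2]  # the two-dimensional representation
```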

5. Let's prep a recipe

We will work with the numeric subset of the loans dataset to explore PCA. Since PCA is an unsupervised method, we do not define a target variable in the recipe. Instead, we only indicate which variables to consider, in our case, all five. Let's include a near-zero variance step in the processing pipeline; this step will also take care of any zero-variance features. Before adding the PCA step, it is good practice to normalize our variables to prevent large-scale values from over-influencing the process. The prep() function fits a recipe to the data; it is to a recipe what the fit() function is to a model. We can see the type of output generated by prepping our PCA recipe by extracting the names of pca_output. There is a lot of detail under each of these names. Next, we will dig into the "steps" element to extract the standard deviation of each component.
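A sketch of this recipe is below; since the course's loans data is not shown here, loans_numeric and its column names are placeholders, simulated so the example runs on its own.

```r
# Sketch of the PCA recipe; `loans_numeric` and its columns are placeholders.
library(recipes)

set.seed(42)
loans_numeric <- data.frame(
  loan_amount = rnorm(200, 15000, 5000),
  income      = rnorm(200, 60000, 15000),
  rate        = rnorm(200, 0.12, 0.03),
  term        = rnorm(200, 48, 12),
  age         = rnorm(200, 40, 10)
)

pca_rec <- recipe(~ ., data = loans_numeric) %>%  # no outcome: PCA is unsupervised
  step_nzv(all_predictors()) %>%                  # also removes zero-variance columns
  step_normalize(all_predictors()) %>%            # center and scale before PCA
  step_pca(all_predictors())

pca_output <- prep(pca_rec)  # prep() is to a recipe what fit() is to a model
names(pca_output)            # includes "var_info", "term_info", "steps", ...
```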

6. Unearthing variance explained

The standard deviation of each principal component is stored deep inside pca_output, within the third list of the steps element. The structure generated by prep() is full of valuable information. We can use it to retrieve the standard deviation vector and compute the variance explained by squaring each element and dividing by the sum of squares. With this data, we can assemble a tibble with the variance and cumulative variance explained by each principal component.
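Continuing from the prepped recipe sketched above, where the PCA step is the third step, one way to build that tibble is shown below. The slot steps[[3]]$res$sdev reflects how recipes stores the fitted prcomp object internally, so treat it as an implementation detail that may differ across versions.

```r
# Variance explained, built from the prepped recipe in the previous sketch.
library(tibble)

sdev         <- pca_output$steps[[3]]$res$sdev  # sd of each principal component
variance_pct <- sdev^2 / sum(sdev^2)            # proportion of variance explained

variance_tbl <- tibble(
  component  = paste0("PC", seq_along(sdev)),
  variance   = variance_pct,
  cumulative = cumsum(variance_pct)
)
variance_tbl
```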

7. Visualizing variance explained

We can visualize our output as a column chart. We will need at least three components to preserve a decent amount of variation, and four is quite good, keeping almost 93% of the variance.
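One way to draw such a column chart, using ggplot2 and the variance_tbl assembled in the previous sketch:

```r
# Column chart of variance explained per principal component.
library(ggplot2)

ggplot(variance_tbl, aes(x = component, y = variance)) +
  geom_col() +
  labs(x = "Principal component", y = "Proportion of variance explained")
```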

8. Let's practice!

Time to take it to the coding board.
