Uniform Manifold Approximation and Projection (UMAP)

1. Uniform Manifold Approximation and Projection (UMAP)

Now, let's look at one last form of feature extraction — Uniform Manifold Approximation and Projection, or UMAP.

2. PCA, t-SNE, and UMAP

As we'll see, UMAP is very similar to t-SNE in many ways. For instance, it is a non-linear algorithm like t-SNE, whereas PCA is a linear algorithm.

3. PCA, t-SNE, and UMAP

PCA is deterministic, meaning we'll get the same results every time, while t-SNE and UMAP are non-deterministic, or stochastic.

4. PCA, t-SNE, and UMAP

UMAP is also more computationally efficient than t-SNE. This can be helpful with very large datasets.

5. PCA, t-SNE, and UMAP

One important difference in the three algorithms is whether they preserve the local or global structure of the data. Preserving local structure means that the distances between neighboring data points in the lower-dimensional output can be interpreted. Preserving the global structure means that the distances between clusters of data points in the lower-dimensional output can be interpreted. PCA preserves the global data structure, while t-SNE preserves the local structure. UMAP tries to preserve both.

6. PCA, t-SNE, and UMAP

t-SNE and UMAP both have hyperparameters that can and should be tuned.
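
As a minimal sketch of what that tuning can look like in R (assuming the embed package's step_umap(), which we'll use below, and a placeholder data frame train_df), the key UMAP hyperparameters neighbors and min_dist can be marked with tune() placeholders:

```r
# Illustrative sketch only: exposing UMAP hyperparameters for tuning.
# train_df and the outcome column attrition are assumed names.
library(tidymodels)
library(embed)

umap_tune_rec <- recipe(attrition ~ ., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_umap(
    all_numeric_predictors(),
    neighbors = tune(),  # size of the local neighborhood UMAP considers
    min_dist  = tune(),  # minimum spacing between points in the embedding
    num_comp  = 2
  )
```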

7. UMAP plot

Let's look at how to implement and plot UMAP in R. We'll continue to use the employee attrition data to illustrate this. The embed R package contains a recipe step to implement UMAP. So we first load the embed library. Then we set the seed for reproducibility. We create a recipe object and apply step_normalize() to scale all the numeric predictors before we apply UMAP. We add step_umap() and set num_comp to two to embed the data into two dimensions. We accept all the defaults for the hyperparameters. It should be noted, however, that just like with t-SNE, tuning the hyperparameters can improve the results of UMAP. We then prep the recipe, extract the transformed data with juice(), and store it in umap_df. To plot the results, we pass umap_df to ggplot(). Notice that step_umap(), by default, stores the transformed dimensions in columns that it names UMAP1, UMAP2, and so forth. We set the color argument to attrition so we can observe how well the new UMAP coordinates differentiate between employees who have left and those who have stayed with the company.
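
Based on that description, the code on this slide would look roughly like the sketch below; the data frame name df and the outcome column attrition are assumptions, and the UMAP hyperparameters are left at their defaults as stated above.

```r
# Sketch of the UMAP recipe and plot described above (object names are assumed)
library(tidymodels)
library(embed)

set.seed(1234)  # UMAP is stochastic, so set a seed for reproducibility

umap_rec <- recipe(attrition ~ ., data = df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_umap(all_numeric_predictors(), num_comp = 2)

# Prep the recipe and extract the transformed training data
umap_df <- umap_rec %>%
  prep() %>%
  juice()

# step_umap() names the new columns UMAP1, UMAP2, and so forth
umap_df %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = attrition)) +
  geom_point(alpha = 0.5)
```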

8. UMAP: employee attrition

The resulting plot looks like this. UMAP, with its default parameters, doesn't perform much better than t-SNE, but it was more computationally efficient. As with t-SNE, the x and y axes of the plot are the extracted features.

9. UMAP in tidymodels

step_umap() makes it easy to incorporate UMAP into the model building process. Let's review that quickly. First, we create a recipe where we normalize the data and apply the UMAP reduction. Notice here that we set the number of components to four. Then we create a linear regression model spec.
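
A sketch of what that recipe and model spec might look like; the formula, outcome name, and training data (train_df) are placeholders rather than the course's actual objects.

```r
# Recipe: normalize numeric predictors, then reduce them to four UMAP components
umap_rec <- recipe(outcome ~ ., data = train_df) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_umap(all_numeric_predictors(), num_comp = 4)

# Linear regression model specification
lm_spec <- linear_reg() %>%
  set_engine("lm")
```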

10. UMAP in tidymodels

Next, we create a workflow and add the UMAP recipe and the linear regression model spec to it. We then fit the workflow object on the training data.
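
Continuing the sketch above (umap_rec, lm_spec, and train_df are the assumed names from the previous block):

```r
# Bundle the recipe and model spec into a workflow, then fit on the training data
umap_wf <- workflow() %>%
  add_recipe(umap_rec) %>%
  add_model(lm_spec)

umap_fit <- umap_wf %>%
  fit(data = train_df)
```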

11. UMAP in tidymodels

Lastly, we create a data frame with the testing data predictions and evaluate the model performance on the RMSE metric.
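
And the final step, again with assumed object names (umap_fit, test_df, and the outcome column):

```r
# Predict on the test set and evaluate performance with the RMSE metric
umap_results <- test_df %>%
  bind_cols(predict(umap_fit, new_data = test_df))

umap_results %>%
  rmse(truth = outcome, estimate = .pred)
```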

12. Let's practice!

Now it's your turn to practice with UMAP.