Exercise

Computing and visualising the t-SNE embedding

In this exercise, we will generate a t-SNE embedding using only the balanced training set creditcard_train. The idea is to train a random forest on the two coordinates of the embedding instead of the original 30 dimensions. Because of computational constraints, we will compute the embedding of the training data only. Note, however, that standard t-SNE does not learn a mapping that can be applied to new points, so to generate predictions for the test set we would need to compute the embedding of the test set together with the training set.

Then, we will visualize the resulting embedding, highlighting the two classes, to see whether fraud and non-fraud transactions can be told apart.

The creditcard_train data, as well as the Rtsne and ggplot2 packages, have been loaded.

Instructions
  • Fix the seed to 1234.
  • Compute a t-SNE embedding named tsne_output from creditcard_train (with the Class column removed), using the default number of iterations and the default perplexity, without an initial PCA step and without checking for duplicates.
  • Generate a data frame to visualize the embedding.
  • Visualize the embedding using ggplot().
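
The steps above can be sketched roughly as follows. This is a minimal sketch, not the official solution. In the exercise, creditcard_train is already loaded, so the synthetic stand-in built at the top is only an assumption that makes the snippet run on its own. The names feature_cols, tsne_plot, tsne_x, tsne_y and p are likewise illustrative choices, not names taken from the exercise.

```r
library(Rtsne)
library(ggplot2)

# Hypothetical stand-in for creditcard_train (already loaded in the exercise):
# 200 rows, 30 numeric features and a binary Class column.
set.seed(1)
n <- 200
features <- matrix(rnorm(n * 30), nrow = n)
creditcard_train <- data.frame(features, Class = rep(c(0, 1), each = n / 2))

# Fix the seed so that the embedding is reproducible
set.seed(1234)

# Drop the Class column and run t-SNE with the default perplexity and
# number of iterations, skipping the initial PCA step and the duplicate check
feature_cols <- setdiff(names(creditcard_train), "Class")
tsne_output <- Rtsne(as.matrix(creditcard_train[, feature_cols]),
                     pca = FALSE, check_duplicates = FALSE)

# Data frame holding the two embedding coordinates and the class label
tsne_plot <- data.frame(tsne_x = tsne_output$Y[, 1],
                        tsne_y = tsne_output$Y[, 2],
                        Class = as.factor(creditcard_train$Class))

# Scatter plot of the embedding, coloured by class
p <- ggplot(tsne_plot, aes(x = tsne_x, y = tsne_y, color = Class)) +
  geom_point() +
  ggtitle("t-SNE embedding of creditcard_train")
print(p)
```

The embedding coordinates are stored in tsne_output$Y, one row per transaction, so the Class labels can be attached by position. Converting Class to a factor makes ggplot() use a discrete colour scale with one colour per class.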