1. Dimensionality reduction
A less manual way of reducing the size of our feature set is through dimensionality reduction.
2. Dimensionality reduction and PCA
Dimensionality reduction is a form of unsupervised learning that transforms our data in a way that shrinks the number of features in the feature space. This transformation can be done in a linear or nonlinear fashion. Strictly speaking, dimensionality reduction is a feature extraction method, since the data is being transformed into new, different features. However, since we're treating it here as a reduction of our feature space, we'll cover it in this chapter.
The method of dimensionality reduction we'll cover is principal component analysis, or PCA. PCA uses a linear transformation to project the features into a space where they are completely uncorrelated. While the feature space is reduced, the variance is preserved in a meaningful way by combining features into components, with each successive component capturing as much of the remaining variance in the dataset as possible. In terms of feature selection, PCA can be a useful method when we have a large number of features and no strong candidates for elimination.
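Here is a minimal sketch, not from the course itself, that illustrates the decorrelation property described above: after transforming two deliberately correlated features with PCA, the correlation matrix of the resulting components is essentially diagonal. The synthetic data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two correlated features: the second is a noisy copy of the first
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + rng.normal(scale=0.1, size=100)])

# Project the features into uncorrelated components
components = PCA().fit_transform(X)

# Off-diagonal correlations are effectively zero after the transform
print(np.round(np.corrcoef(components, rowvar=False), 3))
```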
3. PCA in scikit-learn
Transforming a dataset through PCA is relatively straightforward in scikit-learn. As with other machine learning methods in scikit-learn, we import the PCA class and create a PCA object. And just as we did to create tf-idf vectors, we call PCA's fit_transform method on the dataset whose dimensionality we want to reduce. If we print out the PCA-transformed array, we can see that the data has been transformed. By default, PCA in scikit-learn keeps the number of components equal to the number of input features. If we print out the explained variance ratio, we can see, component by component, the percentage of variance explained by that component. Here, much of the variance is explained by the first component, so it's likely that we could drop the components that explain little variance.
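The following is a sketch of the steps just described. The course uses its own dataset; here a built-in scikit-learn dataset stands in, and the variable names (wine_X, transformed_X) are illustrative rather than taken from the course.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

# Stand-in numeric feature matrix
wine_X = load_wine().data

# Create the PCA object and fit/transform the data in one step
pca = PCA()
transformed_X = pca.fit_transform(wine_X)

# The transformed data, with as many components as input features
print(transformed_X)

# Percentage of variance explained by each component
print(pca.explained_variance_ratio_)
```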
4. PCA caveats
There are a couple of things to note regarding PCA. The first is that it can be very difficult to interpret PCA components beyond identifying which ones explain the most variance; PCA is more of a black-box method than other dimensionality reduction techniques. The other thing to note is that PCA is a good step to take at the end of the preprocessing journey, because of the way the data gets transformed and reshaped. It's difficult to do much feature work post-PCA, other than eliminating components that don't help explain variance.
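One way to act on that last point is to refit PCA with a smaller number of components, or with a variance threshold, so that low-variance components are dropped. This is a sketch with illustrative values (the 0.95 threshold and dataset are assumptions, not from the course).

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

wine_X = load_wine().data

# Keep just enough components to explain roughly 95% of the variance
pca = PCA(n_components=0.95)
reduced_X = pca.fit_transform(wine_X)

# The reduced data has fewer columns than the original feature matrix
print(reduced_X.shape)
print(pca.explained_variance_ratio_)
```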
5. Let's practice!
Time for you to practice transforming data using PCA.