1. Introduction to dimensionality reduction
Hello! My name is Matt Pickard, and I'll be your instructor for this course. I am an associate professor of data and analytics at Northern Illinois University and do data analytics consulting on the side.
2. Dimensions
In this course, we'll be learning about dimensionality reduction; so, let's start by defining a dimension.
A dimension is a single column in a tidy data set. This table has three dimensions. In this course, we'll use dimension, column, and feature interchangeably; so the number of dimensions is the number of columns, which we can find using ncol().
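As a minimal sketch (the data frame below is made up for illustration, not from the course):

df <- data.frame(
  name      = c("Ana", "Ben"),
  height_cm = c(165, 180),
  weight_kg = c(60, 75)
)
ncol(df)  # 3: three columns, so three dimensions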
3. What is dimensionality reduction?
So, what is dimensionality reduction? It is the process of eliminating or combining features that carry little or no information, while retaining as much information as possible.
4. What is dimensionality reduction?
In this example there are two columns for weight — one in kilograms and one in pounds. The information in those two columns is completely redundant. So we could remove one of them.
5. What is dimensionality reduction?
Notice also that the values in the role column are all the same. Thus, role does not help us distinguish between the observations. In other words, it has no useful information.
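To make both ideas concrete, here is a small hypothetical sketch (the employees data frame and its columns are invented for illustration):

library(dplyr)

employees <- data.frame(
  weight_kg = c(60, 75, 82),
  weight_lb = c(60, 75, 82) * 2.20462,  # perfectly redundant with weight_kg
  role      = c("developer", "developer", "developer")  # no variation
)

cor(employees$weight_kg, employees$weight_lb)  # 1: complete redundancy
n_distinct(employees$role)                     # 1: no useful information

# Drop the redundant and the uninformative column
employees %>% select(-weight_lb, -role)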
6. Dimensionality reduction visually
From a visual perspective, we can think of dimensionality reduction as projecting a higher dimensional space onto a lower dimensional space. In this 3D plot, notice the three gray two-dimensional projections of the blue, three-dimensional observations.
Dimensions higher than 3D are harder to visualize. Try visualizing four dimensions. What about a hundred dimensions? Hard to do, isn't it? That's one reason dimensionality reduction is important. It helps us get our minds around the data.
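One way to build intuition: projecting onto one of those gray planes amounts to dropping a coordinate. A hypothetical sketch:

# Five random points in 3D
points_3d <- data.frame(x = rnorm(5), y = rnorm(5), z = rnorm(5))

# Projecting onto the xy-plane means dropping the z column
points_xy <- points_3d[, c("x", "y")]
plot(points_xy)  # one of the gray 2D projections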
7. Finding numeric columns with no variance
Remember the role column? It was a categorical variable that had no variance, meaning the value for every row was developer. With continuous variables, we can quantify the variance of each column using dplyr's summarize() function. We pass the var() function to summarize(), with na.rm set to TRUE to remove NAs from the variance calculation, and we apply var() to all columns with across() and everything(). Then we use pivot_longer() to orient the output vertically, which makes it easier to read. We see that num_garages and num_hvac_units have zero variance.
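Put together, the code looks roughly like this (houses stands in for the course's data frame; that name is an assumption):

library(dplyr)
library(tidyr)

houses %>%
  # Compute the variance of every column, ignoring NAs
  summarize(across(everything(), ~ var(.x, na.rm = TRUE))) %>%
  # Pivot to one row per feature for easier reading
  pivot_longer(everything(), names_to = "feature", values_to = "variance")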
8. Mutual information
Remember the two redundant weight columns — one in kilograms and one in pounds? Those two columns have perfect mutual information. Mutual information is best understood using a Venn diagram.
9. Mutual information
Assume we have data about houses and the circle on the left represents the amount of information the square footage of the house provides.
10. Mutual information
The circle on the right represents the amount of information the number of bedrooms provides.
11. Mutual information
Intuitively, we know that square footage and number of bedrooms both contain information about the size of the house. This information is represented by the intersection of the two circles and is called mutual information. Mutual information is redundant information.
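In a moment we'll quantify this with correlation, but as a rough numeric illustration, mutual information can also be estimated directly, for instance with the infotheo package (not used in this course; this sketch and its simulated data are assumptions):

library(infotheo)

set.seed(42)
sqft     <- rnorm(500, mean = 2000, sd = 400)
bedrooms <- round(sqft / 600 + rnorm(500, sd = 0.5))  # related to sqft
noise    <- rnorm(500)                                # unrelated

# Estimated mutual information is high for related features...
mutinformation(discretize(sqft), discretize(bedrooms))
# ...and near zero for unrelated ones
mutinformation(discretize(sqft), discretize(noise))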
12. Create a correlation plot
For continuous variables, we can measure mutual information with correlation, so let's create a correlation plot with the corrr package, using a data set about house sales in King County, Washington.
We pass the house sales data frame to select() and keep only the continuous features, which we pipe to correlate() to create the correlation matrix. The shave() function removes the upper triangle of the correlation matrix. Then we call rplot() to create the correlation plot, setting print_cor to TRUE to overlay the numeric correlations.
This last line rotates the x-axis labels ninety degrees.
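Assembled, the pipeline looks roughly like this (house_sales is a stand-in name for the course's data frame, and where(is.numeric) is one way to keep the continuous features):

library(dplyr)
library(corrr)
library(ggplot2)

house_sales %>%
  select(where(is.numeric)) %>%  # keep only the continuous features
  correlate() %>%                # compute the correlation matrix
  shave() %>%                    # remove the redundant upper triangle
  rplot(print_cor = TRUE) +      # plot, overlaying numeric correlations
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  # rotate x-axis labels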
13. Correlation plot
That code generates this correlation plot. Features that are strongly correlated — like sqft_living and sqft_above — have high mutual information.
Later, we'll dig deeper into correlation and variance as dimensionality reduction tools.
14. Let's practice!
For now, let's put these concepts into practice.