1. Multidimensional Scaling
Multidimensional scaling is another technique for performing dimension reduction. The primary goal of multidimensional scaling is to provide a visual representation of the pattern of proximities, meaning the similarities or distances, among a set of objects. This is accomplished by assigning objects to specific locations in a conceptual space, usually in two or three dimensions, so the distances between points in the space match the given similarities as closely as possible.
2. What is Multidimensional Scaling?
Classical multidimensional scaling, known as MDS, takes the distance matrix between n observations as input and produces a set of points in a lower-dimensional space, typically with 2 or 3 dimensions, so that the distances between those points closely match the input distances. If we have high-dimensional data, we first compute a distance matrix from the data, and there may be many choices of distance measure. We can also use a distance matrix directly if one is available.
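To make the idea concrete, here is a minimal sketch of the standard classical MDS computation (double-center the squared distances, then take the leading eigenvectors). The helper name classical_mds and the toy points are assumptions made for illustration; they are not part of the course material.

```r
# Hypothetical helper (not a base R function): classical MDS written out by hand.
classical_mds <- function(d, k = 2) {
  d2 <- as.matrix(d)^2                  # squared distances
  n  <- nrow(d2)
  J  <- diag(n) - matrix(1 / n, n, n)   # centering matrix
  B  <- -0.5 * J %*% d2 %*% J           # double-centered matrix
  e  <- eigen(B, symmetric = TRUE)      # eigenvalues come back in decreasing order
  # Coordinates: top-k eigenvectors scaled by the square roots of their eigenvalues
  e$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(pmax(e$values[1:k], 0)), k)
}

# Toy data (made up): four points that already live in two dimensions
pts    <- rbind(c(0, 0), c(3, 0), c(3, 4), c(0, 4))
d_in   <- dist(pts)
coords <- classical_mds(d_in, k = 2)

# Largest difference between input and reproduced distances (close to zero here)
max(abs(dist(coords) - d_in))
```

Because the toy points are genuinely two-dimensional, the two recovered dimensions reproduce the distances essentially exactly; with real high-dimensional data the match is only approximate.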
The simplest way to implement classical MDS is to use the cmdscale() function with the distance matrix, d, and the maximum dimension of the space which the data are to be represented in, k. The default value of k is 2.
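As a small sketch of that call, the example below builds a made-up distance matrix for three objects (the labels A, B, C and the values are illustrative assumptions) and passes it to cmdscale().

```r
# Made-up distances between three objects (a 3-4-5 triangle, for illustration only)
m <- matrix(c(0, 3, 4,
              3, 0, 5,
              4, 5, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))
d <- as.dist(m)

coords <- cmdscale(d, k = 2)  # k = 2 is also the default
coords                        # one row per object, one column per dimension
dist(coords)                  # distances between the new points
```

Since a triangle fits in a plane, the distances between the returned points should agree with the input distances up to rounding.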
There are other options for non-metric and non-linear scaling, such as the isoMDS() and sammon() functions from the MASS package, which we will not discuss in this course.
3. US City distance example
The starting point for the first example is a distance matrix. The object UScitiesD gives straight line distances between 10 cities in the US.
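Before running MDS it can help to look at what the object contains. A quick, non-authoritative sketch using only base R:

```r
class(UScitiesD)                # a "dist" object
attr(UScitiesD, "Size")         # number of cities
labels(UScitiesD)               # city names
as.matrix(UScitiesD)[1:3, 1:3]  # top-left corner viewed as an ordinary matrix
```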
4. MDS on US city distance dataset
To perform MDS, we call the cmdscale() function on UScitiesD and store the result in the object usloc.
We can see that the usloc object gives two-dimensional coordinates for all 10 cities, chosen to preserve the original distances as closely as possible.
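A minimal sketch of this step is shown below. The correlation check at the end is an addition for illustration, not something from the slides; it is one rough way to see how well the two-dimensional points reproduce the input distances.

```r
usloc <- cmdscale(UScitiesD)  # default k = 2
usloc                         # 10 rows (cities) by 2 columns (coordinates)

# Rough check (illustrative): compare input distances with distances between the new points
cor(as.vector(UScitiesD), as.vector(dist(usloc)))
```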
Let's check whether the resulting coordinates coincide with the physical locations of the cities. In the next slide, we will see a labeled scatterplot of the two-dimensional representation of the data.
5. US cities MDS output
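As expected, Washington DC and New York are closer together than New York and Los Angeles. However, the cities are in the wrong corners of the plot.
After flipping both the east-west and north-south axes, which amounts to a 180-degree rotation, the plot closely resembles the cities' locations on a standard map. This is legitimate because an MDS solution is determined only up to rotation and reflection: those transformations leave all pairwise distances unchanged.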
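The reorientation can be sketched with base graphics as follows. Negating both columns follows the description above; which axes actually need flipping may depend on the platform, since the sign of each MDS axis is arbitrary, so treat the minus signs as an assumption to adjust.

```r
usloc <- cmdscale(UScitiesD)

# Flip both axes (assumed orientation; the sign of each MDS axis is arbitrary)
x <- -usloc[, 1]
y <- -usloc[, 2]

# Labeled scatterplot; asp = 1 keeps distances comparable in both directions
plot(x, y, type = "n", asp = 1, xlab = "", ylab = "",
     main = "MDS of UScitiesD (axes flipped)")
text(x, y, labels = rownames(usloc), cex = 0.8)

# Flipping does not change any pairwise distance
all.equal(as.vector(dist(cbind(x, y))), as.vector(dist(usloc)))
```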
6. Multidimensional scaling on mtcars dataset
Now we will apply cmdscale() to the original mtcars dataset. First, we calculate the distance matrix using the dist() function, which by default calculates the Euclidean distance. Other distance measures could be used instead, especially for categorical and binary variables.
The distance object is named cars.dist, and we now apply the cmdscale() function to this object to get the two-dimensional coordinates for the 32 cars.
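These two steps might look like the sketch below. The name cars.dist comes from the narration; cars.mds is an assumed name for the result.

```r
cars.dist <- dist(mtcars)               # Euclidean distance by default
cars.mds  <- cmdscale(cars.dist, k = 2) # cars.mds is an assumed object name
head(cars.mds)                          # first few cars with their two coordinates
```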
Using ggplot(), the 32 cars of the original dataset are now scattered in a two-dimensional space. MDS placed the points so that the distances between them approximate the original distances, which were computed from all variables.
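One possible way to draw such a plot is sketched here, assuming the ggplot2 package is installed. The column names dim1, dim2, and car, and the choice of geom_text() for labels, are assumptions rather than the course's exact code.

```r
library(ggplot2)

cars.dist <- dist(mtcars)
cars.mds  <- as.data.frame(cmdscale(cars.dist, k = 2))
names(cars.mds) <- c("dim1", "dim2")   # assumed column names
cars.mds$car <- rownames(mtcars)       # car names for the labels

p <- ggplot(cars.mds, aes(x = dim1, y = dim2, label = car)) +
  geom_point() +
  geom_text(size = 3, vjust = -0.6)    # label each point just above it
print(p)
```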
Note that for the MDS approach, two dimensions might not always be enough to represent the distances accurately. It is possible to use more dimensions by setting the k argument of the cmdscale() function to a higher value.
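One way to judge whether a given k is enough is the goodness-of-fit value that cmdscale() reports when called with eig = TRUE. This check is an addition for illustration and is not covered in the slides.

```r
cars.dist <- dist(mtcars)

# Goodness of fit for k = 1, 2, 3 (values nearer 1 mean the distances are better preserved)
gof <- sapply(1:3, function(k) cmdscale(cars.dist, k = k, eig = TRUE)$GOF[1])
names(gof) <- paste0("k=", 1:3)
round(gof, 3)
```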
7. Multidimensional scaling in more than two dimensions
For example, if we set the k argument to 3, we get a three-dimensional representation, which can be plotted using the scatterplot3d() function we discussed earlier in the course. In this function, we specify the type and pch arguments, and use the lty.hplot argument to draw dashed lines from the points.
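A sketch of the three-dimensional version, assuming the scatterplot3d package is installed. The object name cars.mds3 and the specific argument values (type = "h", pch = 19, lty.hplot = 2) are assumptions chosen to match the description.

```r
library(scatterplot3d)

cars.mds3 <- cmdscale(dist(mtcars), k = 3)  # three coordinates per car

scatterplot3d(cars.mds3,
              type = "h",       # draw a vertical line from each point to the base plane
              pch = 19,         # solid circles
              lty.hplot = 2,    # dashed line type for those vertical lines
              xlab = "Dim 1", ylab = "Dim 2", zlab = "Dim 3")
```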
8. Multidimensional scaling in more than two dimensions
We can also modify the color argument to color the points and lines using a categorical column.
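For instance, the points could be colored by number of cylinders. Using the cyl column and these particular colors is an assumption for illustration; any categorical column would work the same way.

```r
library(scatterplot3d)

cars.mds3 <- cmdscale(dist(mtcars), k = 3)

# Map each cyl value (4, 6, 8) to a color; the palette is an arbitrary choice
palette.cyl <- c("4" = "steelblue", "6" = "darkorange", "8" = "firebrick")
cyl.col <- unname(palette.cyl[as.character(mtcars$cyl)])

scatterplot3d(cars.mds3, color = cyl.col,
              type = "h", pch = 19, lty.hplot = 2,
              xlab = "Dim 1", ylab = "Dim 2", zlab = "Dim 3")
```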
9. Now let's try using MDS!
Now let's try using MDS.