Clustering Wikipedia part I
You saw in the video that TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. In this exercise, build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.
Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer).
The Wikipedia dataset you will be working with was obtained from here.
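As a quick illustration of the point above, TruncatedSVD can be fitted directly on a scipy csr_matrix, which is what makes it suitable for word-frequency arrays. The toy data here is hypothetical, not the Wikipedia dataset:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy word-frequency array: 4 documents x 6 terms, mostly zeros
dense = np.array([
    [1, 0, 0, 2, 0, 0],
    [0, 3, 0, 0, 0, 1],
    [0, 0, 1, 0, 2, 0],
    [2, 0, 0, 1, 0, 0],
], dtype=float)
sparse = csr_matrix(dense)

# TruncatedSVD accepts the sparse matrix directly (ordinary PCA would not)
svd = TruncatedSVD(n_components=2)
reduced = svd.fit_transform(sparse)
print(reduced.shape)  # one row per document, one column per component
```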
This exercise is part of the course Unsupervised Learning in Python.
Exercise instructions
- Import TruncatedSVD from sklearn.decomposition, KMeans from sklearn.cluster, and make_pipeline from sklearn.pipeline.
- Create a TruncatedSVD instance called svd with n_components=50.
- Create a KMeans instance called kmeans with n_clusters=6.
- Create a pipeline called pipeline consisting of svd and kmeans.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Perform the necessary imports
from ____ import ____
from ____ import ____
from ____ import ____
# Create a TruncatedSVD instance: svd
svd = ____
# Create a KMeans instance: kmeans
kmeans = ____
# Create a pipeline: pipeline
pipeline = ____
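One way the blanks might be filled in, following the instructions above (a sketch, not the official course solution):

```python
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)
```

make_pipeline names each step after its class, so fitting this pipeline will run the SVD reduction first and then cluster the reduced data with k-means.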