Clustering Wikipedia part I
You saw in the video that TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. In this exercise, build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.
Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer).
The Wikipedia dataset you will be working with was obtained from here.
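As a quick aside, the point about sparse input can be seen directly: TruncatedSVD accepts a csr_matrix, whereas PCA requires a dense array. Here is a minimal sketch with a small made-up word-frequency array (the sizes and values are just for illustration, not the exercise data):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical tiny word-frequency array: 6 documents x 10 terms
rng = np.random.default_rng(0)
dense = rng.integers(0, 3, size=(6, 10)).astype(float)
sparse = csr_matrix(dense)  # sparse format, as word-frequency arrays usually are

# TruncatedSVD works directly on the sparse matrix
svd = TruncatedSVD(n_components=3)
reduced = svd.fit_transform(sparse)
print(reduced.shape)  # (6, 3): each document is now a 3-dimensional vector
```

Passing the same csr_matrix to sklearn's PCA would raise an error, which is why TruncatedSVD is the right choice here.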
This exercise is part of the course
Unsupervised Learning in Python
Exercise instructions
- Import TruncatedSVD from sklearn.decomposition, KMeans from sklearn.cluster, and make_pipeline from sklearn.pipeline.
- Create a TruncatedSVD instance called svd with n_components=50.
- Create a KMeans instance called kmeans with n_clusters=6.
- Create a pipeline called pipeline consisting of svd and kmeans.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Perform the necessary imports
from ____ import ____
from ____ import ____
from ____ import ____
# Create a TruncatedSVD instance: svd
svd = ____
# Create a KMeans instance: kmeans
kmeans = ____
# Create a pipeline: pipeline
pipeline = ____
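For reference, filling in the blanks according to the instructions above gives something like the following (a sketch of one possible solution):

```python
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance with 50 components: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance with 6 clusters: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline chaining svd and kmeans: pipeline
pipeline = make_pipeline(svd, kmeans)
```

make_pipeline names each step after its class, so fitting this pipeline first reduces the word-frequency array to 50 dimensions and then clusters the reduced vectors into 6 groups.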