
Clustering Wikipedia part I

You saw in the video that TruncatedSVD can perform a PCA-style dimension reduction on sparse arrays in csr_matrix format, such as word-frequency arrays. (Unlike PCA, it does not center the data first, which is what lets it work on sparse input directly.) Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. In this exercise, you'll build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.
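As a rough illustration of the point about sparse input, the sketch below runs TruncatedSVD on a tiny csr_matrix. The toy array, the names word_freq and svd_demo, and the choice of n_components=2 are made up for this example and are not part of the exercise.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Hypothetical toy word-frequency array: 4 documents (rows), 5 words (columns)
dense = np.array([
    [1.0, 0.0, 2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0, 0.0, 1.0],
    [2.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 4.0, 0.0],
])
word_freq = csr_matrix(dense)  # store it in sparse csr_matrix format

# Reduce the 5 word columns to 2 components, working on the sparse input directly
svd_demo = TruncatedSVD(n_components=2, random_state=0)
reduced = svd_demo.fit_transform(word_freq)

print(reduced.shape)  # one row per document, one column per component
```

The output of fit_transform is an ordinary dense NumPy array, so it can be passed straight to a clustering step such as KMeans.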

Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer.)
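For context, here is a minimal sketch of how a word-frequency matrix of this kind could be produced with TfidfVectorizer. The three short documents are invented for illustration; the exercise's actual matrix is precomputed from Wikipedia articles and is not reproduced here.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in documents, not the Wikipedia articles from the exercise
documents = [
    "cats say meow",
    "dogs say woof",
    "dogs chase cats",
]

# Turn each document into a row of tf-idf word frequencies
tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)

print(csr_mat.shape[0])   # number of rows equals the number of documents
print(issparse(csr_mat))  # the result is stored as a sparse matrix
```

Because the result is sparse, it is the kind of input that TruncatedSVD is suited to.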

The Wikipedia dataset you will be working with was obtained from here.

This exercise is part of the course

Unsupervised Learning in Python


Exercise instructions

  • Import:
    • TruncatedSVD from sklearn.decomposition.
    • KMeans from sklearn.cluster.
    • make_pipeline from sklearn.pipeline.
  • Create a TruncatedSVD instance called svd with n_components=50.
  • Create a KMeans instance called kmeans with n_clusters=6.
  • Create a pipeline called pipeline consisting of svd and kmeans.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Perform the necessary imports
from ____ import ____
from ____ import ____
from ____ import ____

# Create a TruncatedSVD instance: svd
svd = ____

# Create a KMeans instance: kmeans
kmeans = ____

# Create a pipeline: pipeline
pipeline = ____
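One possible way to fill in the blanks, following the exercise instructions (n_components=50, n_clusters=6, and make_pipeline), is sketched below. Treat it as an illustration rather than the official solution.

```python
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline: pipeline (svd runs first, then kmeans on its output)
pipeline = make_pipeline(svd, kmeans)
```

Nothing is fitted at this point. The pipeline only records the two steps in order, and the next exercise applies it to the word-frequency array.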