
Exercise

Clustering Wikipedia part I

You saw in the video that TruncatedSVD is able to perform PCA on sparse arrays in csr_matrix format, such as word-frequency arrays. Combine your knowledge of TruncatedSVD and k-means to cluster some popular pages from Wikipedia. In this exercise, build the pipeline. In the next exercise, you'll apply it to the word-frequency array of some Wikipedia articles.
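As a quick reminder of why TruncatedSVD suits this kind of data, here is a minimal sketch that applies it directly to a csr_matrix without converting to a dense array first. The small random matrix below is made up for illustration only; it is not the Wikipedia word-frequency array, and the sizes and random_state are arbitrary choices.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for a word-frequency array (hypothetical data):
# 20 "documents" x 30 "words", with most entries set to zero.
rng = np.random.default_rng(0)
dense = rng.random((20, 30))
dense[dense < 0.8] = 0.0
X = csr_matrix(dense)

# TruncatedSVD accepts the sparse matrix as-is.
toy_svd = TruncatedSVD(n_components=5, random_state=0)
reduced = toy_svd.fit_transform(X)

print(reduced.shape)  # (20, 5)
```

The result is an ordinary dense NumPy array with one row per document and one column per component, which is the form KMeans will receive inside the pipeline.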

Create a Pipeline object consisting of a TruncatedSVD followed by KMeans. (This time, we've precomputed the word-frequency matrix for you, so there's no need for a TfidfVectorizer.)

The Wikipedia dataset you will be working with was obtained from here.

Instructions

  • Import:
    • TruncatedSVD from sklearn.decomposition.
    • KMeans from sklearn.cluster.
    • make_pipeline from sklearn.pipeline.
  • Create a TruncatedSVD instance called svd with n_components=50.
  • Create a KMeans instance called kmeans with n_clusters=6.
  • Create a pipeline called pipeline consisting of svd and kmeans.
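
One way the steps above might look is sketched below. It uses only the names and parameter values given in the instructions and leaves every other setting at its scikit-learn default; nothing is fitted here, since the word-frequency array is applied in the next exercise.

```python
# Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)

# Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)

# Create a pipeline: pipeline
# make_pipeline chains the steps in order, so svd runs before kmeans.
pipeline = make_pipeline(svd, kmeans)
```

make_pipeline names each step after its lowercased class name, so the two stages can later be looked up as pipeline.named_steps['truncatedsvd'] and pipeline.named_steps['kmeans'].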