Clustering Wikipedia part II
It is now time to put your pipeline from the previous exercise to work! You are given an array articles of tf-idf word-frequencies of some popular Wikipedia articles, and a list titles of their titles. Use your pipeline to cluster the Wikipedia articles.
A solution to the previous exercise has been pre-loaded for you, so a Pipeline object named pipeline, chaining TruncatedSVD with KMeans, is available.
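The pre-loaded pipeline itself is not shown in this exercise. As a rough sketch, a pipeline of this shape could be built as follows. The values n_components=50 and n_clusters=6 are assumptions made for illustration, since the text above does not state them.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

# Assumed hyperparameters: the exercise text does not specify them.
svd = TruncatedSVD(n_components=50)   # dimension reduction for sparse tf-idf data
kmeans = KMeans(n_clusters=6)         # clustering on the reduced features

# Chain the two steps so that fit/predict run them in order.
pipeline = make_pipeline(svd, kmeans)

# make_pipeline names each step after its lowercased class name.
print([name for name, _ in pipeline.steps])
```

TruncatedSVD is a natural first step here because, unlike PCA, it works directly on sparse matrices such as tf-idf arrays.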
This exercise is part of the course
Unsupervised Learning in Python
Exercise instructions
- Import pandas as pd.
- Fit the pipeline to the word-frequency array articles.
- Predict the cluster labels.
- Align the cluster labels with the list titles of article titles by creating a DataFrame df with labels and titles as columns. This has been done for you.
- Use the .sort_values() method of df to sort the DataFrame by the 'label' column, and print the result.
- Hit submit and take a moment to investigate your amazing clustering of Wikipedia pages!
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Import pandas
____
# Fit the pipeline to articles
____
# Calculate the cluster labels: labels
labels = ____
# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})
# Display df sorted by cluster label
print(____)
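
One possible way to complete the template is sketched below. Because the course's articles array, titles list, and pre-loaded pipeline are not available outside the exercise, this sketch substitutes a tiny hypothetical corpus and a small stand-in pipeline (two SVD components, two clusters). Only the last four statements correspond to the blanks in the template.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical stand-ins for the exercise's `titles` and `articles`.
titles = ['Cat', 'Dog', 'Python', 'Java']
texts = [
    'cat pet animal fur whiskers',
    'dog pet animal fur bark',
    'python programming language code software',
    'java programming language code software',
]
articles = TfidfVectorizer().fit_transform(texts)  # sparse tf-idf matrix

# Stand-in for the pre-loaded pipeline (sizes chosen for the toy data).
pipeline = make_pipeline(
    TruncatedSVD(n_components=2, random_state=0),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)

# Fit the pipeline to articles
pipeline.fit(articles)

# Calculate the cluster labels: labels
labels = pipeline.predict(articles)

# Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})

# Display df sorted by cluster label
print(df.sort_values('label'))
```

The numeric value of each label is arbitrary. What matters is which titles share a label, which is why sorting by the label column makes the clusters easy to read.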