Top terms in movie clusters
Now that you have created a sparse matrix, generate cluster centers and print the top three terms in each cluster. Use the .todense() method to convert the sparse matrix, tfidf_matrix, to a dense matrix that the kmeans() function can process. Then, use the .get_feature_names() method of the tfidf_vectorizer object to get a list of terms (in scikit-learn 1.2 and later, this method is named .get_feature_names_out()). Python's zip() function pairs up the elements of two lists, which lets you match each term with its weight in a cluster center.
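As a small illustration of the zip() step, the sketch below ranks terms by weight using only built-in Python. The terms and weights are invented for this example and do not come from the course data.

```python
# Hypothetical terms and cluster-center weights, invented for illustration.
terms = ["alien", "space", "love", "ship"]
weights = [0.62, 0.48, 0.05, 0.31]

# zip() pairs each term with its weight; dict() turns the pairs into a lookup.
center_terms = dict(zip(terms, weights))

# Sorting a dict iterates over its keys; key=center_terms.get ranks them by weight.
sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)

# Keep the three highest-weighted terms.
top_three = sorted_terms[:3]
print(top_three)
```

Passing center_terms.get as the sort key avoids building a separate list of (weight, term) tuples.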
The tfidf_vectorizer object and the sparse matrix, tfidf_matrix, from the previous exercise are available in this exercise. kmeans has been imported from SciPy.
With a higher number of data points, the clusters formed would be defined more clearly. However, processing more data points requires more computation than is practical in this exercise.
This exercise is part of the course
Cluster Analysis in Python
Exercise instructions
- Generate cluster centers through the kmeans() function.
- Generate a list of terms from the tfidf_vectorizer object.
- Print top 3 terms of each cluster.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
num_clusters = 2
# Generate cluster centers through the kmeans function
cluster_centers, distortion = ____
# Generate terms from the tfidf_vectorizer object
terms = tfidf_vectorizer.____()
for i in range(num_clusters):
    # Sort the terms and print top 3 terms
    center_terms = dict(zip(____, ____))
    sorted_terms = sorted(____, key=center_terms.get, reverse=True)
    print(____)
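
For comparison, here is a minimal sketch of the same three steps on a toy stand-in for tfidf_matrix. The term list, the weights, and the top_terms variable are invented for illustration and are not the course data. The sketch also swaps in .toarray() for .todense() so that kmeans() receives a plain NumPy array. Because kmeans() starts from randomly chosen centroids, the order in which the two clusters are printed can differ between runs.

```python
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.sparse import csr_matrix

# Hypothetical vocabulary and TF-IDF-like weights (one row per document).
terms = ["alien", "space", "ship", "love", "wedding", "kiss"]
tfidf_matrix = csr_matrix(np.array([
    [0.8, 0.5, 0.3, 0.0, 0.0, 0.0],
    [0.7, 0.6, 0.2, 0.0, 0.0, 0.0],
    [0.9, 0.4, 0.1, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.5, 0.3],
    [0.0, 0.0, 0.0, 0.7, 0.6, 0.2],
    [0.0, 0.0, 0.0, 0.9, 0.4, 0.1],
]))

num_clusters = 2

# Convert the sparse matrix to a dense array and generate cluster centers.
cluster_centers, distortion = kmeans(tfidf_matrix.toarray(), num_clusters)

top_terms = []
for i in range(cluster_centers.shape[0]):
    # Pair every term with its weight in this cluster center.
    center_terms = dict(zip(terms, cluster_centers[i]))
    # Rank the terms by weight, highest first, and keep the top 3.
    sorted_terms = sorted(center_terms, key=center_terms.get, reverse=True)
    top_terms.append(sorted_terms[:3])
    print(sorted_terms[:3])
```

Each row of cluster_centers has one weight per term, so zipping it with terms lines up every term with its weight in that cluster.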