Document clustering

1. Document clustering

In the introductory lesson, we discussed the use of unsupervised learning techniques to group news items together, as done by a service such as Google News. This technique, known as document clustering, is explored in this video.

2. Document clustering: concepts

Document clustering uses some concepts from natural language processing, or NLP. Although NLP is a huge subject, let us understand the basics needed for this use case. First, we clean the data of anything that does not add value to our analysis; items to remove include punctuation, emoticons, and stop words such as "the", "is", and "are". Next, we compute the TF-IDF of the terms, a weighted statistic that describes the importance of a term in a document. Finally, we cluster the TF-IDF matrix and display the top terms in each cluster.

3. Clean and tokenize data

The text cannot be analyzed as-is; we first convert it into smaller parts called tokens, which we achieve using NLTK's word_tokenize method. Next, we remove all special characters from each token and check whether it is a stop word. Finally, we return the cleaned tokens. Here's the output for a sample quote from the movie Pink Panther.
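The course uses a custom remove_noise function whose exact body is not shown in this transcript; here is a minimal sketch of the cleaning step it describes, assuming NLTK's punkt and stopwords data have been downloaded:

```python
import re

from nltk import word_tokenize
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_noise(text):
    tokens = word_tokenize(text)
    cleaned_tokens = []
    for token in tokens:
        # Strip anything that is not a letter or digit
        token = re.sub(r'[^A-Za-z0-9]+', '', token).lower()
        # Keep the token only if it is non-empty and not a stop word
        if token and token not in stop_words:
            cleaned_tokens.append(token)
    return cleaned_tokens

print(remove_noise("It is not often that we see a Pink Panther."))
# expected: ['often', 'see', 'pink', 'panther']
```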

4. Document term matrix and sparse matrices

Once relevant terms have been extracted, a matrix is formed with the terms and documents as dimensions. Each element of the matrix signifies how many times a term occurs in a given document. Most elements are zeros; hence, sparse matrices are used to store these matrices more efficiently. A sparse matrix stores only the non-zero elements.
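To illustrate, here is a small document-term matrix built with scikit-learn's CountVectorizer, which returns a SciPy sparse matrix; the three toy documents are made up for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents, made up for illustration
documents = ["the room was clean",
             "the staff was friendly",
             "the room was small"]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(documents)  # a SciPy sparse matrix

print(vectorizer.get_feature_names_out())  # the extracted terms
print(dtm.toarray())  # dense view: rows are documents, columns are terms
```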

5. TF-IDF (Term Frequency - Inverse Document Frequency)

To find the TF-IDF of terms in a group of documents, we use the TfidfVectorizer class of sklearn. We initialize it with the following arguments: max_df and min_df signify the maximum and minimum fraction of documents a word should occur in; here we keep terms that appear in more than 20% but fewer than 80% of the documents. We keep the top 50 terms. Finally, we pass our custom function as the tokenizer. The fit_transform method creates the TF-IDF matrix for the data, which is a sparse matrix.
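A sketch of this initialization, assuming remove_noise is the custom tokenizer defined earlier and documents holds the corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep terms that occur in 20-80% of documents, capped at the top 50,
# tokenized with the custom remove_noise function defined earlier
tfidf_vectorizer = TfidfVectorizer(min_df=0.2, max_df=0.8,
                                   max_features=50,
                                   tokenizer=remove_noise)

tfidf_matrix = tfidf_vectorizer.fit_transform(documents)  # sparse TF-IDF matrix
```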

6. Clustering with sparse matrix

The kmeans function in SciPy does not work with sparse matrices, so we convert the TF-IDF matrix to its expanded form using the todense method. kmeans can then be applied to get the cluster centers. We do not use the elbow plot here, as it takes an erratic form due to the high number of variables.
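A sketch of this step; the number of clusters is assumed to be two, matching the hotel-review example discussed next:

```python
from scipy.cluster.vq import kmeans

num_clusters = 2  # assumed; pick a value suited to your data

# kmeans expects a dense array, so expand the sparse TF-IDF matrix first
cluster_centers, distortion = kmeans(tfidf_matrix.todense(), num_clusters)
```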

7. Top terms per cluster

Each cluster center is a list of TF-IDF weights, which signify the importance of each term in the matrix. To find the top terms, we first create a list of all terms. Then, using Python's zip function, which pairs up two lists, we create a dictionary with the terms as keys and TF-IDF weights as values. We then sort the dictionary by its values in descending order and display the top terms. Analyzing a list of 1,000 hotel reviews, we find that the top terms in one cluster are room, hotel, and staff, whereas the other cluster has bad, location, and breakfast as its top terms.
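A sketch of this lookup, assuming tfidf_vectorizer and cluster_centers come from the previous steps (get_feature_names_out requires scikit-learn 1.0+; older versions use get_feature_names):

```python
# List of all terms in the TF-IDF matrix
terms = tfidf_vectorizer.get_feature_names_out()

for i, center in enumerate(cluster_centers):
    # Pair each term with its weight in this cluster center
    term_weights = dict(zip(terms, center))
    # Sort terms by weight in descending order and take the top three
    top_terms = sorted(term_weights, key=term_weights.get, reverse=True)[:3]
    print(f"Cluster {i}:", top_terms)
```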

8. More considerations

Given the scope of the course, we have seen only a simple form of document clustering; there are many more considerations when it comes to NLP. For instance, you can modify the remove_noise function to filter hyperlinks, or replace emoticons with text. You can normalize every word to its base form: for instance, run, ran, and running are all forms of the same verb, run. Further, the todense method may not work with large datasets, and you may need to consider an implementation of k-means that works with sparse matrices.
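As one option for normalization (not shown in the course itself), NLTK's WordNetLemmatizer reduces words to a base form; this sketch assumes the wordnet data has been downloaded:

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# pos='v' tells the lemmatizer to treat each word as a verb
print(lemmatizer.lemmatize('running', pos='v'))  # run
print(lemmatizer.lemmatize('ran', pos='v'))      # run
```

For large datasets, one alternative worth noting is scikit-learn's KMeans class, which accepts sparse input directly and so avoids the todense conversion.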

9. Next up: exercises!

Let us now try to cluster movies based on their synopses.