
Text-based similarities

1. Text-based similarities

You can now generate content-based recommendations when descriptive attributes are available.

2. Working without clear attributes

Unfortunately, in the real world this is often not the case, as attribute labels such as book genres might not be available. Thankfully, if there is text tied to an item, we may still be in luck. This could be a plot summary, an item description, or even the contents of the book itself. For this kind of data, we use "Term Frequency-Inverse Document Frequency", or TF-IDF, to transform the text into something usable.

3. Term frequency inverse document frequency

TF-IDF divides the number of times a word occurs in a document by a measure of the proportion of all the documents in which that word occurs. This has the effect of reducing the value of common words while increasing the weight of words that do not occur in many documents. For example, if you were comparing the script of this course against the scripts of all the courses on DataCamp, the term "DataFrame" might get a low score: although it occurs a lot, it is present in many DataCamp courses. The term "recommendation", on the other hand, would get a high score, as it is not as common in other courses' scripts.
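As a rough reference, the classic formulation is (scikit-learn's implementation adds smoothing and L2 normalization on top of this):

tf-idf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the number of times term t occurs in document d, N is the total number of documents, and df(t) is the number of documents containing t.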

4. Our data

In this video, we will be working with a dataset of books and their descriptions as seen here.

5. Instantiate the vectorizer

To transform our data, we import TfidfVectorizer() from sklearn. We instantiate it and assign it to a variable, tfidfvec in this case. By default, the vectorizer generates a feature for every word in every document, which is a lot of features. Thankfully, we can specify constraints on the features being generated.
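A minimal sketch of this step; the books_df DataFrame and its contents here are hypothetical stand-ins for the course's dataset:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the books dataset
books_df = pd.DataFrame({
    'title': ['A Game of Thrones', 'Book B', 'Book C'],
    'description': ['A battle for the throne of a kingdom',
                    'A knight rides to battle for the kingdom',
                    'A detective solves a murder in the city']
})

# Instantiate the vectorizer with its default settings
tfidfvec = TfidfVectorizer()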

6. Filtering the data

First, we set the min_df argument to two. This limits our features to only those that occur in at least two documents. This is useful, as terms occurring only once are not valuable for finding similarities.

7. Filtering the data

We should also remove words that are too common using the max_df argument. By setting this to 0.7, words that occur in more than 70% of the descriptions will be excluded.
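Continuing the sketch above, both constraints can be passed at instantiation (the values are taken from the slides):

from sklearn.feature_extraction.text import TfidfVectorizer

# Keep terms that occur in at least 2 documents (min_df),
# but drop terms that occur in more than 70% of them (max_df)
tfidfvec = TfidfVectorizer(min_df=2, max_df=0.7)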

8. Vectorizing the data

Once the vectorizer is instantiated, we call its fit_transform method on the text column. The vectorizer's get_feature_names method (get_feature_names_out in newer versions of scikit-learn) shows the features that were generated. vectorized_data, when converted to an array, has a row for each book and a column for each feature. Success! We have transformed unstructured text into usable features for our models.
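Continuing the sketch, with the hypothetical books_df and tfidfvec from above:

# Fit to the descriptions and transform them into a sparse TF-IDF matrix
vectorized_data = tfidfvec.fit_transform(books_df['description'])

# The terms that survived the min_df / max_df filters
# (older scikit-learn versions use get_feature_names() instead)
print(tfidfvec.get_feature_names_out())

# One row per book, one column per feature
print(vectorized_data.toarray())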

9. Formatting the data

Let's wrap the array in a DataFrame, using the output of the get_feature_names method as the columns, and assign the titles from the original DataFrame as the index. The resulting DataFrame will look familiar from the previous exercises, with a row per item and a column per feature. The scores represent how prominent a word is in a text compared to the other texts, which makes them useful attributes. For example, the score for the term 'battle' is much higher for A Game of Thrones, which is understandable given its theme.
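A minimal sketch of the wrapping step, continuing from above (books_df['title'] is the assumed title column):

import pandas as pd

# Wrap the TF-IDF array in a DataFrame, one column per term
tfidf_df = pd.DataFrame(vectorized_data.toarray(),
                        columns=tfidfvec.get_feature_names_out())

# Use the book titles as the row index
tfidf_df.index = books_df['title']
print(tfidf_df)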

10. Cosine similarity

As we advance from Boolean features to continuous TF-IDF values, we will use a metric that is better at measuring similarity between items with more variation in their data: cosine similarity. We won't go into it in depth here, but mathematically, it is the cosine of the angle between two document vectors in the high-dimensional feature space, as seen in this two-dimensional example. Because TF-IDF values are non-negative, all similarities fall between 0 and 1, where 1 is an exact match.
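A rough illustration of the math with NumPy (the two small vectors are made-up examples, not course data):

import numpy as np

# Two made-up vectors standing in for TF-IDF rows
a = np.array([0.2, 0.0, 0.5])
b = np.array([0.1, 0.3, 0.4])

# cos(theta) = (a . b) / (||a|| * ||b||)
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)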

11. Cosine similarity

Thankfully, sklearn has a premade cosine_similarity function. We can find the similarity between all rows by calling it on the DataFrame, or between two rows by reshaping their values as seen here.
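A minimal sketch of both calls, continuing with the hypothetical tfidf_df from above (the second title, 'Book B', is a placeholder):

from sklearn.metrics.pairwise import cosine_similarity

# Similarity between every pair of books: a square matrix
similarity_matrix = cosine_similarity(tfidf_df)

# Similarity between two specific books: reshape each row to 2-D
book_a = tfidf_df.loc['A Game of Thrones'].values.reshape(1, -1)
book_b = tfidf_df.loc['Book B'].values.reshape(1, -1)
print(cosine_similarity(book_a, book_b))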

12. Let's practice!

Now it's your turn to use these similarities to generate recommendations!
