
Embeddings for classification tasks

1. Embeddings for classification tasks

The final common embedding use case we'll discuss is classification tasks.

2. Classification tasks

Classification tasks can take different forms, but generally, they involve assigning labels to items. Common tasks include categorization, such as sorting headlines into different topics, and sentiment analysis,

3. Classification tasks

such as classifying reviews as positive or negative. Embeddings can be used for both of these cases, utilizing the model's ability to capture semantic meaning.

4. Classification with embeddings

We'll be using a type of classification called zero-shot classification, which means that the classifications won't be based on labeled examples. Let's look at how that works with an example on classifying news articles by topic. First, we begin by embedding a description of each class label. Here, we'll be using four classes: tech, science, sport, and business. We'll be embedding these labels and using them as reference points to base the classification on.

5. Classification with embeddings

Next, we embed the article to classify, and calculate cosine distances to each embedded label.

6. Classification with embeddings

Finally, we assign the article the label with the smallest cosine distance. Let's do this in Python!

7. Embedding class descriptions

Here are the topic classes we'll be categorizing with. In this example, we'll categorize using the label itself, so the first step is to extract the labels as a single list and use these as the class descriptions. Then, we embed each topic label using the create_embeddings custom function, which makes a call to the OpenAI embedding model.
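A minimal sketch of this step might look like the following; the create_embeddings helper is an assumed implementation that wraps the OpenAI embeddings endpoint, and the model name and topics list are illustrative rather than taken from the course.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def create_embeddings(texts):
    # Request embeddings for one or more texts and return a list of embedding vectors
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in response.data]

# Illustrative topic classes
topics = [{"label": "Tech"}, {"label": "Science"}, {"label": "Sport"}, {"label": "Business"}]

# Extract the labels as the class descriptions and embed them
class_descriptions = [topic["label"] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)
```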

8. Embedding item to classify

Here's the article we want to classify. The first step here is to combine the headline and keyword information into a single string that we can embed. We do this by defining a custom function that uses an f-string to concatenate the headline and keywords into a nicely formatted string. Finally, we can embed the text by calling create_embeddings again, remembering to zero-index the returned list so we're left with a single list of numbers.
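Continuing the sketch, the formatting function and the article below are hypothetical stand-ins for the ones used in the video.

```python
def create_article_text(article):
    # Combine the headline and keywords into a single formatted string
    return f"""Headline: {article['headline']}
Keywords: {', '.join(article['keywords'])}"""

# Hypothetical article; the one in the video has a tech-focused headline and a "business" keyword
article = {"headline": "Startup unveils new AI chip for data centers",
           "keywords": ["ai", "hardware", "business"]}

article_text = create_article_text(article)

# create_embeddings returns a list, so zero-index it to get a single embedding
article_embedding = create_embeddings(article_text)[0]
```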

9. Compute cosine distances

Now that we have the embeddings, it's time for the cosine distance calculations. This is a modified version of the find_n_closest custom function from earlier in the course: instead of returning n results, we only want one, the nearest label. This means that instead of sorting by distance, we can find the minimum using the min function. Calling this function returns the distance and index of the nearest label.
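A possible version of that modified function, using SciPy's cosine distance, is sketched below; the name find_closest is assumed based on the description in the transcript.

```python
from scipy.spatial import distance

def find_closest(query_vector, embeddings):
    # Compute the cosine distance from the query to each embedding,
    # then return the entry with the smallest distance
    distances = []
    for index, embedding in enumerate(embeddings):
        dist = distance.cosine(query_vector, embedding)
        distances.append({"distance": dist, "index": index})
    return min(distances, key=lambda d: d["distance"])

closest = find_closest(article_embedding, class_embeddings)
```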

10. Extract the most similar label

Finally, we can use this index to subset the topics dictionary and extract the label. Printing the result returns the Business label. Wait, this doesn't seem right. If we take another look at the article we're classifying, we can see that the headline indicates that the focus of the article is on tech; it's likely that the model latched onto the business keyword, which resulted in the mislabeling. The limitation in our approach that led to this was that the class descriptions lacked detail. The word "Business" or "Tech" doesn't carry much meaning on its own for the model to capture, so a better approach would be to use more detailed class descriptions.
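Sticking with the same sketch, extracting the label might look like this:

```python
# Use the index of the nearest class embedding to look up the original label
label = topics[closest["index"]]["label"]
print(label)  # in the video, this step prints "Business", which is a mislabel
```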

11. More detailed descriptions

Let's try this again, but instead of using the labels as the descriptions, we use short descriptions to represent each class. The steps here are almost identical. We extract the descriptions into a single list, this time using the description key, and embed them. The rest of the code is the same! This time, when we print the result, we can see that the model classified the article correctly.
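A sketch of the revised step is shown below; the description texts are invented for illustration, since the video's exact wording isn't reproduced here.

```python
# Richer, invented class descriptions
topics = [
    {"label": "Tech", "description": "Articles about technology, software, gadgets, and AI"},
    {"label": "Science", "description": "Articles about scientific research and discoveries"},
    {"label": "Sport", "description": "Articles about sports, athletes, and competitions"},
    {"label": "Business", "description": "Articles about companies, markets, and the economy"},
]

# Embed the descriptions instead of the bare labels; the rest of the pipeline is unchanged
class_descriptions = [topic["description"] for topic in topics]
class_embeddings = create_embeddings(class_descriptions)

closest = find_closest(article_embedding, class_embeddings)
print(topics[closest["index"]]["label"])  # expected to print "Tech" with the richer descriptions
```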

12. Let's practice!

Now it's your turn to give this a try!