
Content moderation

1. Content moderation

Welcome back! We've already seen how to work with audio through speech-to-text and text-to-speech models. Now, let's dive into moderation - a powerful feature for keeping our AI apps safe.

2. Moderation

Content moderation helps identify inappropriate text based on its context. This is commonly used to prevent harmful or offensive content in online spaces like social networks or chatbots. Traditionally, this moderation was done by hand, where a team of moderators flagged content that breached usage rules. More recently, moderation has been handled by algorithms that detect and flag content containing particular keywords. Manual moderation is time-consuming and often inconsistent. Keyword filters are much faster but can miss harmful content or wrongly flag benign messages because they lack nuance and don't fully understand the context of the discussion. To prevent the misuse of its own models, OpenAI has developed moderation models to flag content that breaches its usage policies.

3. Violation categories

The OpenAI moderation models can not only detect violations of OpenAI's usage policies, but also differentiate the type of violation across different categories, including violence and hate speech.

4. Creating a moderations request

To create a request to the Moderations endpoint, we call the .create() method on client.moderations and specify the moderation model we wish to use. Next is the input, which is the content the model will consider. This statement could easily be classed as violent by traditional moderation systems that work by flagging particular keywords. Let's see what OpenAI's moderation model makes of it.
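As a rough sketch, the request might look like the following; the model name and the example sentence are assumptions, chosen to illustrate a benign statement that a keyword filter might misclassify:

```python
from openai import OpenAI

client = OpenAI(api_key="ENTER YOUR KEY HERE")

# Hypothetical input: a harmless sentence containing the word "Kill"
response = client.moderations.create(
    model="text-moderation-latest",
    input="My favorite book is To Kill a Mockingbird."
)
```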

5. Interpreting the results

We'll dump the response to a dictionary using model_dump function for easier readability. The output is similar other endpoints. There are three useful indicators that can be used for moderation: categories, representing whether the model believed that the statement violated a particular category, category_scores, an indicator of the model's confidence of a violation, and finally, flagged, whether it believes the terms of use have been violated in any way. Let's extract the category_scores from the response for a closer look.
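A minimal sketch of this step, assuming the response object from the previous request:

```python
# Convert the response to a dictionary for easier readability
print(response.model_dump())

# The result exposes three useful indicators:
#   categories      -> True/False per violation category
#   category_scores -> the model's confidence per category
#   flagged         -> overall violation of the usage policies
result = response.results[0]
print(result.categories)
print(result.flagged)
```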

6. Interpreting the category scores

The category_scores are float values for each category, indicating the model's confidence that the input violates that category. They can be extracted from the results attribute, and through that, the category_scores attribute. The scores range between 0 and 1, and although higher values represent higher confidence, they should not be interpreted as probabilities. In our example, the model recognized

7. Interpreting the category scores

the statement wasn't violent, thanks to its understanding of the surrounding context.
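A minimal sketch of extracting the scores, assuming the same response object as before:

```python
# Extract the category scores from the first (and only) result
scores = response.results[0].category_scores

# Inspect the confidence score for the violence category
print(scores.violence)      # expected to be very low for a harmless sentence
print(scores.model_dump())  # all categories and their scores as a dictionary
```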

8. Considerations for implementing moderation

The beauty of having access to these category scores is that we can tune thresholds based on our use case instead of relying only on the final flagged result. For some use cases, such as student communications in a school, stricter thresholds may be chosen that flag more content, even if it means accidentally flagging some non-violations. The goal here would be to minimize the number of missed violations, so-called false negatives. Other use cases, such as communications in law enforcement, may use more lenient thresholds so reports on crimes aren't accidentally flagged. Incorrectly flagging a crime report here would be an example of a false positive.
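As an illustration of tuning a threshold ourselves, a stricter cutoff could be applied to the violence score; the 0.2 value below is purely hypothetical and would need to be chosen for the specific use case:

```python
# Hypothetical strict threshold for a school setting: flag anything
# with a violence score above 0.2, accepting some false positives
VIOLENCE_THRESHOLD = 0.2

scores = response.results[0].category_scores
if scores.violence > VIOLENCE_THRESHOLD:
    print("Content flagged for review.")
else:
    print("Content passed moderation.")
```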

9. Let's practice!

Time for some practice!