Moderation

1. Moderation

Welcome back! We have now covered various techniques for developing advanced systems with the OpenAI API.

2. Understanding moderation in the OpenAI API

To ensure that these systems are robust and reliable, we should follow best practices in moderating, validating, and securing content. Moderation in the context of the OpenAI API refers to the process of analyzing input to determine if it contains any content that violates predefined policies or guidelines. It is a critical aspect of managing user-generated content. OpenAI provides a moderations endpoint as part of its API to help developers automatically flag and filter content that violates community guidelines.

3. Understanding moderation in the OpenAI API

The moderation endpoint uses OpenAI's models to evaluate text and assign a probability for each category of content violation. These categories have been selected by OpenAI and are hate, harassment, self-harm, sexually explicit content, and violent content.

4. Moderating content

In the following code examples, we'll test how the OpenAI API moderates some text excerpts. Here, we have extracted part of the instructions for the game "Exploding Kittens" and are passing it to the moderations endpoint as an input using client.moderations.create. Let's have a look at how the endpoint classifies it for the 'violence' category. Without much context, this part of the game's instructions is classified as violent, so the output for the 'violence' category is 'True'.
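A minimal sketch of what this call might look like, assuming the OpenAI Python library (v1.x); the excerpt below is an illustrative stand-in for the actual rules text used in the course:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative excerpt standing in for part of the game's instructions
excerpt = "If you draw an Exploding Kitten, you explode and you are out of the game."

response = client.moderations.create(input=excerpt)
result = response.results[0]

# Boolean flag and probability-like score for the 'violence' category
print(result.categories.violence)        # True: flagged as violent without context
print(result.category_scores.violence)
```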

5. Moderation in context

Let's now test the same instructions, but this time with the full text, which includes more context and mentions that it is a game. In this case, the model recognizes the context and no longer classifies the content as violent, so the output for the 'violence' category is 'False'.
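Rerunning the check with the full context might look like the following; full_instructions is a placeholder for the complete rules text, and the exact classification will depend on the moderation model used:

```python
# Placeholder for the complete instructions, which mention it is a card game
full_instructions = (
    "Exploding Kittens is a card game. Players take turns drawing cards. "
    "If you draw an Exploding Kitten, you explode and you are out of the game."
)

response = client.moderations.create(input=full_instructions)
result = response.results[0]

print(result.categories.violence)  # False: the game context changes the verdict
```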

6. Prompt injection

As we integrate AI more deeply into our systems, the complexity increases, and we might end up with larger volumes of text, making it more challenging to identify malicious content. This gap in moderation capabilities opens the door to prompt injection attacks, where malicious actors manipulate AI models to produce undesirable outcomes.

7. Prompt injection

Several strategies can be used to mitigate the risks associated with prompt injections. First, limiting the amount of text a user can input is a straightforward yet effective measure. Similarly, constraining the number of output tokens generated can significantly decrease the chances of misuse. Finally, narrowing the range of acceptable topics by drawing inputs or outputs from trusted sources can also ensure a higher degree of reliability in the system's responses. For instance, configuring the system to return outputs from a validated set of materials can be much safer than letting the model generate completely novel content.
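The first two mitigations translate directly into code. Here is a sketch, where the character limit, model name, and helper function are all illustrative assumptions rather than fixed recommendations:

```python
MAX_INPUT_CHARS = 500  # illustrative limit; tune for your application

def bounded_completion(client, user_input):
    # Reject overly long inputs before they ever reach the model
    if len(user_input) > MAX_INPUT_CHARS:
        raise ValueError("Input exceeds the allowed length.")
    # Cap the number of output tokens to reduce the scope for misuse
    return client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=100,
        messages=[{"role": "user", "content": user_input}]
    )
```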

8. Adding guardrails

In certain applications, we might want to avoid topics that are out of scope, even though they don't fall under any of the API's moderated categories. In this case, we present the model with the same instructions for the game of Exploding Kittens, but this time we only want to allow prompts about the game of chess. We need a way to instruct the model to avoid other topics, so we provide a system message to specify this. These instructions, given to steer the model away from going off-topic, are called guardrails.
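The guardrail itself is just a system message passed alongside the user input; the exact wording below is an assumption, not the course's verbatim prompt:

```python
messages = [
    {"role": "system",
     "content": "You are an assistant that only discusses the game of chess. "
                "If the user asks about any other topic, reply that the topic "
                "is not allowed."},
    # The out-of-scope input: the Exploding Kittens instructions from earlier
    {"role": "user", "content": full_instructions}
]
```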

9. Adding guardrails

These are not implemented through the moderations API, but by passing a system message to the chat completions endpoint. Let's go back to the code example, where the messages from the previous slide are now passed to the chat completions endpoint. Since we only want to allow topics related to the game of chess, when we pass the instructions for Exploding Kittens, the model returns a response saying the topic is not allowed.
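Passing those messages to the chat completions endpoint might look like this; the model name is an assumption, and the exact refusal wording will vary:

```python
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)

print(response.choices[0].message.content)
# Expected to be a refusal along the lines of:
# "Apologies, but this topic is not allowed. I can only discuss chess."
```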

10. Let's practice!

Moderation is a critical aspect of managing user interactions, ensuring that conversations remain relevant and appropriate. Let's practice these concepts in the coming exercises!