Input Flexibility and Multimodality

1. Input Flexibility and Multimodality

Welcome back! In this video, we'll look at how the Responses API allows us to provides inputs however works best for us, including how images can be sent as inputs. Let's dive in!

2. Input Flexibility

So far, we've primarily been sending prompts to the model in the form of strings to the input and instructions arguments. Recall that inputs are typically the user's input, and the instructions are used to set the model's behavior and any requirements. Although this is convenient in a lot of cases, a role-based input can sometimes make managing conversation histories easier. You may recall that we briefly switched to it when using function-calling tools, so we could store the function call's output.

3. Role-Based Messages

For role-based inputs, each message in the messages list is a dictionary with role and content keys, and the list is passed to the input argument. The role determines how they are interpreted by the model. The system role is equivalent to the instructions argument, and sets clear requirements on the model. It can be used to create guardrails to restrict what users can can ask. The "user" role is for user inputs, equivalent to the input argument before, and the "assistant" role is used for marking messages as generated by the model. Here, we use a user-assistant message pair to give the model examples on how to respond to the final user message. This list of messages can be easily managed and appended to rather than relying on OpenAI's ID-based caching. Let's take a look at that.

4. A String and ID-Based Conversation

Here's the code we wrote earlier to have a back-and-forth conversation with the model. Recall that a conversation requires a control flow to start the conversation and keep it going until a condition is met, and a memory for storing the conversation history. The control flow remains the same when using role-based messages, but instead of using IDs to reload conversations at different points, we'll create a message history list to add to as the conversation flows.

5. A Role-Based Conversation

We start by defining a list of dictionaries containing only a system message. The user input is given the "user" role in its own dictionary, and appended to the messages list. This is sent to the model. The response from the model is added to the messages list with the assistant role, and the loop starts again. This is about as simple as a conversation history can get. This message history can be written to a file system or database, and we could even call an LLM after a certain number of messages to summarize it to control the length.

6. Images in Prompts!

Even though we've been working solely with text so far, the models we have worked with are multi-modal, which means they can interfaced with through more than one data modality. We'll do this using the role-based messaging system, which makes things easier. We'll use this to interpret a stock performance plot, but as well as image question-answering, we could classify images into different groups, and much more. We have the choice of inputting images from URLs or local files, and we'll cover both, starting with URLs.

7. Images URLs in Prompts

We start our responses request by opening a list of messages, then define two messages coming from the user role by assigning the "content" key to a list. Nesting messages in this way prevent us having to write the user role twice. The input text is given a type of "input_text", and the text is assigned to the "text" key. For the image, it's given the "input_image" type and the URL is assigned to "image_url". We can extract the response text as before,

8. Images URLs in Prompts

and view the result alongside the plot. This interpretation is pretty good, but as with other AI models, always be sure to verify any conclusions before sharing.

9. Images from Local Files

To load images from local files, we need to convert them to base64, which encodes binary data like images into text characters. Then, all that changes in our code is that the image URL must specify that the data is a JPEG image, represented as base64, and then insert the base64 encoding using an f-string.

10. Let's practice!

Time to give role-based prompting and images inputs a go!

Create Your Free Account

By continuing, you accept our Terms of Use, our Privacy Policy and that your data is stored in the USA.