Streaming with Semantic Events
1. Streaming with Semantic Events
Welcome back!
2. Why Stream?
Normally, when we send a request to a model, we have to wait for it to finish generating the full response before we receive anything. For some applications, this experience is fine, but when large numbers of tokens are being generated, sitting and waiting to see any output isn't fun.
3. Why Stream?
With streaming, you get partial updates delivered to you - almost like the model is typing in real time. The term "streaming" means the same thing for LLMs as it does for movies or TV: rather than waiting to download an entire film, you can watch it while it downloads. Streaming makes your interface feel alive, and it's the same technique that powers chat applications like ChatGPT, Claude, and Copilot.
4. Semantic Events
Instead of sending raw text chunks, the Responses API streams semantic events, which are structured updates that describe what's happening.
5. Semantic Events
The Responses API supports many different semantic event types, which is another example of how it's built with production applications in mind. There are events for when generation has started, when text blocks are updated and completed, when tool arguments are specified, and when the response has completed. This lets your app react intelligently to each stage. It's structured, predictable, and ideal for building rich interfaces.
6. Example: Basic Text Streaming
We start by defining a prompt, then we open a context manager with the Python with statement. This opens a connection for streaming from the API, then closes that connection cleanly once it's completed. To enable streaming, we pass stream=True to the .create() method. To start, we'll loop over the events returned by the API and check whether each event has type "response.output_text.delta" - these are partial text outputs. For each of these events, we'll print the updated string as it's built up.
7. Example: Basic Text Streaming
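A minimal sketch of that loop, assuming the openai Python SDK. The `stream_text` helper name, model name, and prompt are illustrative, and the `client` argument is assumed to be an `OpenAI()` client (which requires an API key in the environment):

```python
def stream_text(client, prompt: str, model: str = "gpt-4o-mini") -> str:
    """Stream a response, printing partial text as it arrives.

    Passing stream=True makes .create() return an iterable of semantic
    events; the with block closes the connection once streaming ends.
    """
    collected = []
    with client.responses.create(model=model, input=prompt, stream=True) as stream:
        for event in stream:
            if event.type == "response.output_text.delta":
                # Each delta event carries the next fragment of output text.
                print(event.delta, end="", flush=True)
                collected.append(event.delta)
    return "".join(collected)
```

With the openai package installed, this could be called as `stream_text(OpenAI(), "Write a haiku about streaming.")`.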
Here's what that looks like. Each event returns the next token in the output text.
8. Example: Handling Multiple Events
Let's try handling multiple events, which is what a production application would do. We'll check for the "response.created", "response.output_text.done", and "response.completed" events, printing a message as each is produced.
9. Example: Handling Multiple Events
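One way this might look, assuming the openai Python SDK; the helper names and log messages are illustrative, and the event attribute names follow the Responses API event types named above:

```python
def describe_event(event):
    """Translate the semantic events we track into log-friendly messages."""
    if event.type == "response.created":
        return "Response started"
    if event.type == "response.output_text.done":
        # The .done event carries the full completed text block.
        return f"Text block finished: {event.text!r}"
    if event.type == "response.completed":
        return "Response complete"
    return None  # ignore event types we don't track

def log_events(client, prompt: str, model: str = "gpt-4o-mini"):
    """Stream a response, printing a message for each tracked event."""
    logged = []
    with client.responses.create(model=model, input=prompt, stream=True) as stream:
        for event in stream:
            message = describe_event(event)
            if message is not None:
                print(message)
                logged.append(message)
    return logged
```

Keeping the event-to-message mapping in its own function makes it easy to extend with more event types later.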
Being able to capture these events is great for logging and debugging, as we can see exactly which events introduce issues. The Responses API can also stream tool calls.
10. Example: Streaming Tool Events
The convert_currency() function we defined earlier is available to use, and we've created a tool definition suitable for the OpenAI Responses API, as we covered in an earlier lesson.
11. Example: Streaming Tool Events
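The tool definition itself isn't shown in this transcript; a hypothetical sketch in the Responses API format (where name and parameters sit at the top level of the dict) might look like this - the parameter names are assumptions:

```python
# Hypothetical tool definition for the convert_currency() function,
# in the Responses API function-tool format.
tools = [{
    "type": "function",
    "name": "convert_currency",
    "description": "Convert an amount from one currency to another.",
    "parameters": {
        "type": "object",
        "properties": {
            "amount": {"type": "number", "description": "Amount to convert."},
            "from_currency": {"type": "string", "description": "ISO code, e.g. 'USD'."},
            "to_currency": {"type": "string", "description": "ISO code, e.g. 'EUR'."},
        },
        "required": ["amount", "from_currency", "to_currency"],
    },
}]
```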
Here, we'll define a prompt that requires a tool call, then open our context manager with the same code as before, but this time also providing the tools. We'll handle three events here: the function call arguments delta, so we can see the arguments being streamed; the function call arguments done event; and the response completed event. We're not actually performing the tool call - recall, that's a two-step process - we're just looking at the events associated with the first step.
12. Example: Streaming Tool Events
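A sketch of that first step, assuming the openai Python SDK; the `stream_tool_call` helper name and model are illustrative, and `client` is assumed to be an `OpenAI()` client:

```python
def stream_tool_call(client, prompt: str, tools, model: str = "gpt-4o-mini") -> str:
    """Stream step one of a tool call: watch the model assemble the
    function-call arguments string. The tool itself is not executed."""
    fragments = []
    with client.responses.create(
        model=model, input=prompt, tools=tools, stream=True
    ) as stream:
        for event in stream:
            if event.type == "response.function_call_arguments.delta":
                # Fragments of the arguments JSON string, token by token.
                print(event.delta, end="", flush=True)
                fragments.append(event.delta)
            elif event.type == "response.function_call_arguments.done":
                # The complete arguments string for the call.
                print("\nArguments done:", event.arguments)
            elif event.type == "response.completed":
                print("Response complete")
    return "".join(fragments)
```

A prompt like "Convert 100 USD to EUR" would trigger the tool call; executing the tool and sending its result back would be the second step.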
Here, we can see the model building the function call arguments string token-by-token!
13. Summary
Semantic events make it easy to progressively update your interface. You can display messages as they arrive, animate typing indicators, or show when a tool is being called. This gives users immediate feedback and makes your AI app feel much more natural and conversational. Here's an example from ChatGPT, where end-users are continually updated with the model's reasoning summaries as it progresses.
14. Let's practice!
Time to put this into practice!