Speech-to-text

1. Speech-to-text

Welcome to the course! I'm James, and together, we'll explore the OpenAI API's capabilities beyond text generation.

2. Coming up...

In this course, you'll learn how to use OpenAI's audio models for speech-to-text and text-to-speech, perform text moderation, and even combine these skills in a case study to build a customer support chatbot. Let's start with a quick recap!

3. Recap...

We import the OpenAI class and instantiate a client to communicate with the API. Using OpenAI's models usually incurs a cost, but we've pre-filled the API key so you don't need to create one or incur any cost in the exercises. Next, we send a request to the chat completions endpoint, specifying the model and the messages as a list of dictionaries with "role" and "content" keys.

4. Recap...

We dig into the API response using the .message and .content attributes to extract the final answer. Text generation is very cool, but OpenAI provides models that go far beyond this.
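
Here is a minimal sketch of this recap, assuming the openai v1 Python client; the model name and prompt are illustrative, as neither is specified here:

```python
from openai import OpenAI

# In the exercises the API key is pre-filled; this placeholder is illustrative
client = OpenAI(api_key="<OPENAI_API_TOKEN>")

# Send a request to the chat completions endpoint, specifying the model and
# the messages as a list of dictionaries with "role" and "content" keys
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

# Dig into the response: the final answer sits under .message and .content
print(response.choices[0].message.content)
```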

5. OpenAI's audio models

OpenAI provides models for converting audio to text, so-called speech-to-text. These models can transcribe audio into text and translate non-English audio into an English transcript. They support most common audio formats, including MP3, but there is a file size limit, so larger audio files may need to be split. Common use cases include automating business meeting transcripts, generating captions, and processing customer calls.

6. Loading audio files

Let's start with transcribing audio. We'll use the OpenAI API to transcribe a meeting recording, which is saved locally as an MP3 file. To begin, we'll load it into Python by passing its file name to the open() function. The "rb" here stands for read binary - this simply means we're opening a file stored in a binary format, which is typical for non-text files like audio, video, and images. If the audio file is in a different directory, we'll also need to prepend the file name with its path. This audio file can now be used like any other Python variable.
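
A minimal sketch of this step; the file name is illustrative:

```python
# Open the MP3 in "rb" (read binary) mode, as is typical for non-text files
audio_file = open("meeting_recording.mp3", "rb")

# If the file lives in a different directory, prepend its path, e.g.:
# audio_file = open("recordings/meeting_recording.mp3", "rb")
```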

7. Creating the transcription

Transcription requests are sent to OpenAI's Audio endpoint. To create a transcription request to this endpoint, we call the .create() method on client.audio.transcriptions. Inside, we specify the audio model to use and the file to transcribe. Let's print the response. Like the other endpoints, we receive an object with attributes, and the text transcript is stored under the .text attribute.
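
Putting the pieces together, a minimal sketch of the request, assuming the whisper-1 audio model (the lesson doesn't name a specific model) and an illustrative file name:

```python
from openai import OpenAI

client = OpenAI(api_key="<OPENAI_API_TOKEN>")

# Load the recording in binary mode
audio_file = open("meeting_recording.mp3", "rb")

# Call .create() on client.audio.transcriptions, specifying the model and file
response = client.audio.transcriptions.create(
    model="whisper-1",  # assumed model; not named in the lesson
    file=audio_file,
)

# The text transcript is stored under the .text attribute
print(response.text)
```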

8. The transcript

Accessing it, we can see the model performs well in this case.

9. Transcribing non-English audio

Since these models are also trained on non-English audio, we can transcribe other languages. The process remains the same: open the file, send a transcription request, and the model will output text in the same language as the file.

10. Creating translations

What if we want to create an English transcript from some non-English audio? Let's load an audio file containing a conversation in another language. Our request to the audio endpoint requires only one change: using audio.translations.create() instead of audio.transcriptions.create(); the model and file arguments remain the same. We extract the transcript with the .text attribute, as before, and we can see that it wasn't perfect, making two errors on "AI" and "ChatGPT". We'll perfect this in the case study!
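
A sketch of the translation request, again assuming the whisper-1 model and an illustrative file name; only the endpoint changes:

```python
from openai import OpenAI

client = OpenAI(api_key="<OPENAI_API_TOKEN>")

# Load a recording of a non-English conversation (file name illustrative)
audio_file = open("non_english_meeting.mp3", "rb")

# Use audio.translations.create() instead of audio.transcriptions.create();
# the model and file arguments remain the same
response = client.audio.translations.create(
    model="whisper-1",  # assumed model
    file=audio_file,
)

# Extract the English transcript with .text, as before
print(response.text)
```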

11. Transcription performance

Transcription performance can vary wildly depending on audio quality, the audio language, and the model's knowledge of the subject matter. Make sure to robustly test the system before rolling it out.

12. Let's practice!

Time to get transcribing!