Dealing with different kinds of audio
1. Dealing with different kinds of audio
The SpeechRecognition library is very powerful out of the box. However, since speech recognition is still an active area of research, the library has some limitations. In this lesson, we'll start exploring some of the challenges you might run into and what you can do about them.
2. What language?
Although SpeechRecognition is capable of transcribing audio, it doesn't necessarily know what kind of audio it's transcribing. For example, if you pass the recognizer an audio file of Japanese speech with the language tag set to "en-US", you'd get back the Japanese audio transcribed into English characters. That makes sense.
3. What language?
And if you passed the same audio file with the language tag set to "ja" for Japanese, you'd get the audio transcribed into Japanese characters. The thing to note is that the SpeechRecognition library doesn't automatically detect languages, so you'll have to set this parameter manually and make sure the API you're using is capable of transcribing the language your audio files are in. We've seen the language tag in previous lessons. But what happens with non-speech audio?
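To make this concrete, here's a minimal sketch of how the language tag changes the output; the filename japanese.wav is a placeholder:

import speech_recognition as sr

recognizer = sr.Recognizer()

# japanese.wav is a hypothetical file containing Japanese speech
with sr.AudioFile("japanese.wav") as source:
    japanese_audio = recognizer.record(source)

# Tagged as US English: the Japanese speech comes back in English characters
print(recognizer.recognize_google(japanese_audio, language="en-US"))

# Tagged as Japanese: the transcription uses Japanese characters
print(recognizer.recognize_google(japanese_audio, language="ja"))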
4. Non-speech audio
If we pass SpeechRecognition an audio file of a leopard roaring, it will raise an UnknownValueError because no speech was detected. Which also makes sense, because a leopard roar, although very cool, isn't human speech.
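As a sketch of what this looks like (leopard.wav is a placeholder filename), the error can be caught like any other Python exception:

import speech_recognition as sr

recognizer = sr.Recognizer()

# leopard.wav is a hypothetical file containing no human speech
with sr.AudioFile("leopard.wav") as source:
    leopard_audio = recognizer.record(source)

try:
    print(recognizer.recognize_google(leopard_audio))
except sr.UnknownValueError:
    print("No speech was detected in the audio file.")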
5. Non-speech audio
We can prevent these errors by using the show_all parameter. The show_all parameter returns a list of all the potential transcriptions the recognize_google() function came up with. In the case of our leopard roar, the list comes back empty, but we avoid raising an error.
6. Showing all
Or in the case of our Japanese file, you can see the different potential transcriptions.
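A short sketch of show_all in action, assuming the leopard_audio and japanese_audio variables recorded in the earlier snippets:

# Leopard roar: returns an empty list instead of raising UnknownValueError
print(recognizer.recognize_google(leopard_audio, show_all=True))

# Japanese speech: returns the raw response with the alternative transcriptions
print(recognizer.recognize_google(japanese_audio, language="ja", show_all=True))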
7. Multiple speakers
Next comes multiple speakers. The free Google Web API transcribes speech and returns it as a single block of text, no matter how many speakers there are. A single returned text block can still be useful. However, if your problem requires knowing who said what, you may want to treat the free API we're using as a proof of concept, and then use one of the paid versions for more complex tasks. The process of splitting more than one speaker out of a single audio file is called speaker diarization; however, it is beyond the scope of this course.
8. Multiple speakers
To get around the multiple speakers problem, you could ensure your audio files are recorded separately for each speaker, then transcribe each speaker's audio individually.
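A sketch of that workaround, with one placeholder file per speaker:

import speech_recognition as sr

recognizer = sr.Recognizer()

# Hypothetical filenames, one recording per speaker
speaker_files = ["speaker_0.wav", "speaker_1.wav", "speaker_2.wav"]

for i, filename in enumerate(speaker_files):
    with sr.AudioFile(filename) as source:
        audio = recognizer.record(source)
    print(f"Speaker {i}: {recognizer.recognize_google(audio)}")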
9. Noisy audio
Finally, there's the problem of background noise. A rule of thumb to remember is that if you have trouble understanding what is being said in an audio file due to background noise, chances are a speech recognition system will too. To try to accommodate for background noise, the Recognizer class has a built-in function, adjust_for_ambient_noise(), which takes a parameter, duration. The Recognizer class then listens for duration seconds at the start of the audio file and adjusts the energy threshold, a measure of how sensitive the recognizer is to sound, to a level suitable for the background noise. How much space you have at the start of your audio file will dictate what you can set the duration value to. The SpeechRecognition documentation recommends somewhere between 0.5 and 1 second as a good starting point.
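As a sketch, with noisy.wav as a placeholder filename:

import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("noisy.wav") as source:
    # Listen to the first 0.5 seconds and adjust the energy
    # threshold to suit the background noise
    recognizer.adjust_for_ambient_noise(source, duration=0.5)
    # Record the remaining audio, after the calibration window
    audio = recognizer.record(source)

print(recognizer.recognize_google(audio))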
10. Let's practice!
As you can see, speech has a whole lot of variability, which makes transcribing it a tough challenge. But now that we've talked about some of the ways to deal with different kinds of audio, let's head over to the console and see it all in action!