1. Introduction to audio data in Python
Hello and welcome to the course! My name is Daniel Bourke and I'll be your instructor. To get started, we're first going to see how speech and audio processing is different to other kinds of data processing.
2. Dealing with audio files in Python
Much like other data types, audio files come in many different formats, such as mp3, wav, m4a and flac. But whatever the format, every audio file has a sampling rate, which is a measure of frequency.
The sampling rate is measured in kilohertz, abbreviated kHz. Much like how a movie shows 30 pictures per second which our brains register as moving pictures, the sampling rate of an audio file is the number of data chunks per second used to represent a digital sound.
One kilohertz equals one thousand pieces of information per second.
3. Frequency examples
For example, a song you stream might have a 32 kHz sampling rate, although 44.1 kHz is also common for music. A 32 kHz sampling rate means 32,000 pieces of information per second. Speech and audio books are usually between 8 and 16 kHz. We'll look at some of these later.
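To make those numbers concrete, here's a minimal sketch of the kilohertz arithmetic. The function name samples_in_clip is made up for illustration, and it assumes a single channel of audio.

```python
def samples_in_clip(sampling_rate_khz, duration_seconds):
    """Number of pieces of information in a single-channel clip."""
    # One kilohertz is one thousand samples per second.
    samples_per_second = sampling_rate_khz * 1000
    return round(samples_per_second * duration_seconds)


# Sampling rates mentioned in this lesson, for one second of audio.
for label, rate_khz in [("speech, low end", 8), ("speech, high end", 16), ("song", 32)]:
    print(f"{label}: {samples_in_clip(rate_khz, 1):,} samples per second")
```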
And as you might've guessed, audio files are different to tabular or text data because you can't immediately see the data you're working with.
To get spoken language audio files into something we can see and manipulate, we first have to open them. For wav files, we can do this with Python's built-in wave module.
We can get started with the wave module by running the command import wave.
4. Opening an audio file in Python
Now, we have an audio file, good_morning.wav, ready to go. It contains a person saying the words good morning.
To import it, we'll use the wave module's open function.
Now we've saved the good_morning.wav audio file to the variable good_morning in the format of a wave_object. However, in this state it's not very useful to us.
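Here's a rough sketch of what that opening step could look like. The course's good_morning.wav isn't available here, so as an assumption the sketch first writes two seconds of silence under the same file name in a temporary folder, then opens it. The mono, 16-bit settings are also assumptions.

```python
import os
import tempfile
import wave

# Assumption: a silent stand-in for the course's good_morning.wav,
# written to a temporary folder so the example runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "good_morning.wav")
with wave.open(path, "wb") as placeholder:
    placeholder.setnchannels(1)       # mono (assumed)
    placeholder.setsampwidth(2)       # 2 bytes, i.e. 16 bits, per sample (assumed)
    placeholder.setframerate(48000)   # 48 kHz
    placeholder.writeframes(b"\x00\x00" * 48000 * 2)  # 2 seconds of silence

# Open the file for reading and look at a few of its properties.
good_morning = wave.open(path, "rb")
class_name = type(good_morning).__name__
frame_rate = good_morning.getframerate()
n_frames = good_morning.getnframes()
good_morning.close()

print(class_name)   # Wave_read
print(frame_rate)   # 48000
print(n_frames)     # 96000
```

With your own file, only the wave.open(path, "rb") call and the lines after it are needed.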
To manipulate it further, we'll use the readframes method to convert the wave_object to bytes. The -1 means we want to read in all of the pieces of information within the wave_object.
Now that we've converted the audio file to bytes, what do they look like?
Okay, we can see a snippet of the entire soundwave in byte form.
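As a sketch of the readframes step, the example below builds a stand-in recording in memory, reads all of its frames and prints the first few bytes. The 440 Hz tone, the mono 16-bit settings and the variable name soundwave_gm are assumptions for illustration, not the contents of the real good_morning.wav.

```python
import io
import math
import struct
import wave

# Assumption: a 2-second, 48 kHz, 16-bit mono tone stands in for the recording.
frame_rate = 48000
samples = [int(3000 * math.sin(2 * math.pi * 440 * n / frame_rate))
           for n in range(frame_rate * 2)]

buffer = io.BytesIO()
with wave.open(buffer, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(frame_rate)
    # Pack each sample as a little-endian signed 16-bit integer.
    writer.writeframes(struct.pack(f"<{len(samples)}h", *samples))
buffer.seek(0)

# Open the stand-in and read every frame; -1 means "read them all".
good_morning = wave.open(buffer, "rb")
soundwave_gm = good_morning.readframes(-1)
good_morning.close()

print(type(soundwave_gm))   # <class 'bytes'>
print(soundwave_gm[:10])    # a snippet of the first 10 bytes
print(len(soundwave_gm))    # 192000, since each of the 96,000 frames takes 2 bytes here
```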
But remember how kilohertz means thousands of pieces of information per second? The good_morning.wav audio file is 48 kilohertz and 2 seconds long. 48,000 pieces of information per second multiplied by 2 seconds equals 96,000 chunks of data, all for only two words.
So if we printed out the entire soundwave in byte form, we'd see these combinations of letters and numbers for all 96,000 pieces of information.
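The arithmetic can be sketched in a few lines. The 96,000 figure counts frames, and the number of bytes also depends on the sample width and channel count. The source doesn't state those for good_morning.wav, so the defaults below (2 bytes per sample, one channel) are assumptions, and frames_and_bytes is a made-up helper name.

```python
def frames_and_bytes(sampling_rate_hz, duration_seconds,
                     bytes_per_sample=2, n_channels=1):
    """Return (number of frames, number of bytes) for an uncompressed clip."""
    n_frames = sampling_rate_hz * duration_seconds
    # Each frame holds one sample per channel.
    n_bytes = n_frames * bytes_per_sample * n_channels
    return n_frames, n_bytes


n_frames, n_bytes = frames_and_bytes(48_000, 2)
print(n_frames)  # 96000
print(n_bytes)   # 192000 under the assumed 16-bit mono format
```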
Don't worry if the output looks confusing for now. We'll learn how to convert these bytes into something more useful shortly.
5. Working with audio is different
Now you can start to see how working with audio and spoken language files is different to other kinds of data.
First of all, unlike text or tabular data, you can't immediately see what you're working with. So audio files often require a conversion step before you can begin working with them.
And because of the sampling rate, even a few seconds of audio can contain large amounts of data. Add in background noise, other sounds and more speakers, and the number of pieces of information grows even more. We'll look into this later on.
6. Let's practice!
Alright, it's time to get hands-on and practice importing your first audio file!