
Classifying transcribed speech with Sklearn

1. Classifying transcribed speech with Sklearn

Acme are impressed with your work so far and have sent over two folders full of phone call audio snippets. They've manually labelled each snippet as pre-purchase if the customer was calling before a purchase, or post-purchase if the customer was calling after making one. They said the process of labeling audio files was labor-intensive and want to know if machine learning can help. You immediately start to think of building an sklearn text classifier, and that's what we'll be doing in this lesson.

2. Inspecting the data

You inspect the folders by importing os and using the listdir function on each folder path. You notice there are about 50 files in each, but they're in the mp3 format. Luckily, you built a function to handle this earlier.
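As a rough sketch of that inspection step, the snippet below counts the files in a folder by extension using os.listdir. The helper name count_files_by_extension and the folder names pre_purchase and post_purchase are assumptions made for illustration; the lesson does not give the actual paths.

```python
import os


def count_files_by_extension(folder):
    """Return a dict mapping each file extension to how many files have it."""
    counts = {}
    for name in os.listdir(folder):
        # splitext returns (root, ext); ext includes the leading dot
        ext = os.path.splitext(name)[1].lower()
        counts[ext] = counts.get(ext, 0) + 1
    return counts


# Hypothetical folder names; substitute the real paths from Acme.
for folder in ["pre_purchase", "post_purchase"]:
    if os.path.isdir(folder):
        print(folder, count_files_by_extension(folder))
```

Counting by extension makes it easy to spot that a folder holds mp3 files rather than wav files before any transcription is attempted.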

3. Converting to wav

Using the convert to wav function you built earlier, you convert all the files in both folders from mp3 to wav.
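The batch conversion could look something like the sketch below. The converter built earlier is not shown in this lesson, so the sketch takes it as an argument (convert_fn); the helper name convert_folder_to_wav is also an assumption, not part of the original code.

```python
import os


def convert_folder_to_wav(folder, convert_fn):
    """Call convert_fn on every .mp3 file in folder.

    convert_fn stands in for the converter built earlier in the course;
    it is assumed to take a single file path. Returns the paths it was
    called with, in sorted order.
    """
    converted = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(".mp3"):
            path = os.path.join(folder, name)
            convert_fn(path)
            converted.append(path)
    return converted
```

Passing the converter in keeps the loop independent of whichever audio library does the actual conversion.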

4. Transcribing all phone call excerpts

Excellent. Now that they're all in wav format, you decide to create a function, create text list, to transcribe all of the files in a folder to text. You start with an empty list. Then, looping through the folder, if a file ends with a wav extension, you pass the filepath to your transcribe audio function, which returns the text. Once you have the text, you append it to the list, and at the end you return the list full of transcribed text.
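A minimal sketch of that function follows. The transcribe audio function from earlier in the course is not shown here, so this version accepts it as the transcribe_fn argument; that parameter is an assumption added to keep the example self-contained.

```python
import os


def create_text_list(folder, transcribe_fn):
    """Transcribe every .wav file in folder and return the texts as a list.

    transcribe_fn stands in for the transcribe audio function built
    earlier; it is assumed to take a file path and return a string.
    """
    text_list = []  # start with an empty list
    for name in sorted(os.listdir(folder)):
        # Only transcribe files with a wav extension
        if name.endswith(".wav"):
            filepath = os.path.join(folder, name)
            text_list.append(transcribe_fn(filepath))
    return text_list
```

Sorting the filenames is a small addition so that the order of the returned texts is the same on every run.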

5. Transcribing all phone call excerpts

Running the function on the post purchase folder returns a list of text. Let's see what the first five look like.

6. Organizing transcribed text

Okay, we're making progress. Those helper functions came in handy. To make building your text classifier easier, you decide to put all the text into a pandas dataframe. You start by importing pandas as pd. Then you create a post purchase dataframe by passing pd DataFrame a dictionary with a key named label, which has a value of post purchase, and a text key whose value is the text list. You do the same for the pre purchase text. And to have everything in one place, you combine the two dataframes with pd dot concat. Let's see it.
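The dataframe construction might be sketched as follows. The short text lists are made-up stand-ins for the real transcriptions, and the exact label strings (post_purchase, pre_purchase) and variable names are assumptions.

```python
import pandas as pd

# Made-up stand-ins for the lists returned by the transcription step.
post_purchase_text = ["my order arrived damaged", "how do I return my shoes"]
pre_purchase_text = ["do you ship to canada", "what sizes do you have"]

# A scalar label is broadcast to every row of the text list.
post_purchase_df = pd.DataFrame({"label": "post_purchase",
                                 "text": post_purchase_text})
pre_purchase_df = pd.DataFrame({"label": "pre_purchase",
                                "text": pre_purchase_text})

# Combine both dataframes; ignore_index gives one continuous row index.
df = pd.concat([post_purchase_df, pre_purchase_df], ignore_index=True)
print(df.head())
```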

7. Organizing transcribed text

Beautiful! Now that you've got your data in a dataframe, you can use it to build a text classifier with sklearn.

8. Building a text classifier

We'll start by importing the necessary packages: NumPy as np, Pipeline from sklearn's pipeline module, MultinomialNB from sklearn's naive bayes module for our classifier, CountVectorizer and TfidfTransformer from sklearn's text feature extraction module to transform our text into numbers, and train test split from sklearn's model selection module to split our data into training and test sets. To start, we'll use train test split to split the data using a test size of 30%, where our X value is the text column and our y value is the label column of the dataframe we created earlier.
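For illustration, the split could look like the sketch below. The ten-row dataframe is made-up toy data, and random_state is an added assumption so that the split is repeatable; the lesson only specifies the 30% test size.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data standing in for the combined dataframe.
df = pd.DataFrame({
    "label": ["pre_purchase"] * 5 + ["post_purchase"] * 5,
    "text": [
        "do you ship to canada", "what sizes do you have",
        "is this jacket in stock", "how much is delivery",
        "can I pay with a gift card",
        "my order arrived damaged", "how do I return my shoes",
        "my package never arrived", "I want a refund for my order",
        "the shirt I bought is too small",
    ],
})

# X is the text column, y is the label column, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))
```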

9. Naive Bayes Pipeline

Next, you set up a classifier pipeline called text classifier, which uses CountVectorizer and TfidfTransformer to transform each of the text samples into numeric features based on the words they contain. Then MultinomialNB builds a naive Bayes model to classify each sample. To train the model, you call the fit function on your text classifier and pass it the training data.
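A pipeline along those lines might be sketched like this. The training sentences are made-up toy data, and the step names (vectorizer, tfidf, classifier) are arbitrary choices for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy training data standing in for the transcribed calls.
train_text = [
    "do you ship to canada", "what sizes do you have",
    "is this jacket in stock",
    "my order arrived damaged", "how do I return my shoes",
    "I want a refund for my order",
]
train_labels = ["pre_purchase"] * 3 + ["post_purchase"] * 3

text_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),   # text -> word counts
    ("tfidf", TfidfTransformer()),       # word counts -> TF-IDF weights
    ("classifier", MultinomialNB()),     # naive Bayes on the weighted features
])

# Fitting the pipeline fits each step in order on the training data.
text_classifier.fit(train_text, train_labels)
```

Wrapping the steps in a Pipeline means the same transformations are applied automatically when predict is later called on raw text.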

10. Not so Naive

Once you've got a trained model, you can evaluate it by calling the predict function on your classifier and passing it the test set text. Then you can use NumPy to compare the predictions to the test labels and compute the fraction that match, which is the accuracy. That's not a bad model! Not so Naive after all.
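The evaluation step can be sketched as below. The training and test sentences are made-up toy data, so the printed accuracy says nothing about the result shown in the lesson; the point is only the predict call and the NumPy comparison.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data standing in for the real train and test splits.
train_text = [
    "do you ship to canada", "what sizes do you have",
    "is this jacket in stock",
    "my order arrived damaged", "how do I return my shoes",
    "I want a refund for my order",
]
train_labels = ["pre_purchase"] * 3 + ["post_purchase"] * 3
test_text = ["do you have this in stock", "my shoes arrived damaged"]
test_labels = ["pre_purchase", "post_purchase"]

text_classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])
text_classifier.fit(train_text, train_labels)

# Predict labels for the held-out text.
predicted = text_classifier.predict(test_text)

# Element-wise comparison gives booleans; their mean is the accuracy.
accuracy = np.mean(predicted == np.array(test_labels))
print(f"Accuracy: {accuracy:.2f}")
```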

11. Let's practice!

Alright, you've seen enough, time to get this model into Acme's hands! Let's code.