Building word count vectors with scikit-learn

1. Building word count vectors with scikit-learn

In this video, we'll build our first scikit-learn vectors from the movie plot and genre dataset.

2. Predicting movie genre

We have a dataset full of movie plots along with each movie's genre -- either Action or Sci-Fi. We want to create bag-of-words vectors for these movie plots to see if we can predict the genre based on the words used in the plot summary.

3. CountVectorizer with Python

To do so, we first need to import some necessary tools from scikit-learn. Once the data is loaded, we can create y, which traditionally refers to the labels or outcome we want the model to learn. We can use the Sci-Fi column, which holds 1 if the movie is Sci-Fi and 0 if it is Action. Then scikit-learn's train_test_split function can be used to split the DataFrame into training and testing data. It splits the features -- here the plot summaries in the plot column -- and the labels (y) according to a given test_size, such as 0.33 for 33 percent. I have also set random_state so we get a repeatable result; it works much like setting a random seed and ensures I get the same results when I run the code again. The function takes 33% of the rows, marks them as test data, and removes them from the training data. The test data is later used to see what the model has learned. train_test_split returns the training data (X_train), training labels (y_train), testing data (X_test), and testing labels (y_test).

Next, we create a CountVectorizer, which turns the text into bag-of-words vectors, similar to a Gensim corpus. It also removes English stop words from the movie plot summaries as a preprocessing step. Each token now acts as a feature for the machine learning classification problem, just like the flower measurements in the iris dataset.

We can then call fit_transform on the training data to create the bag-of-words vectors. fit_transform is a handy shortcut that calls the model's fit method and then its transform method; here that generates a mapping of words to IDs and vectors recording how many times each word appears in each plot. fit_transform operates differently for each model, but generally fit finds parameters or norms in the data and transform applies the model's underlying algorithm or approximation -- similar to preprocessing, but with a specific use case in mind. For the CountVectorizer class, fit_transform builds the bag-of-words dictionary and the vectors for each document using the training data.

After calling fit_transform on the training data, we call transform on the test data to create bag-of-words vectors with that same dictionary. The training and test vectors need to share a consistent vocabulary so the trained model can understand the test input. If we don't have much data, some words in the test set may never appear in the training data; transform simply ignores those unknown words, so their information is lost, and the remedy is usually to gather more training data. In only a few lines of Python -- sketched below -- we have transformed text into bag-of-words vectors and generated training and test datasets. scikit-learn is a great aid in making NLP machine learning simple and accessible.
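Here is a minimal sketch of the labeling and splitting steps described above. The CSV filename and the column names plot and Sci-Fi are assumptions for illustration, as is the random_state value; adjust them to match your data.

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Load the movie dataset (hypothetical filename)
    df = pd.read_csv('movie_data.csv')

    # y holds the labels: 1 for Sci-Fi, 0 for Action (assumed column name)
    y = df['Sci-Fi']

    # Hold out 33% of the rows as test data; a fixed random_state
    # makes the split repeatable across runs
    X_train, X_test, y_train, y_test = train_test_split(
        df['plot'], y, test_size=0.33, random_state=53)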
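And a sketch of the vectorization step, continuing from the split above. CountVectorizer with stop_words='english' and the fit_transform/transform pattern are standard scikit-learn API; the variable names are just for illustration.

    from sklearn.feature_extraction.text import CountVectorizer

    # Build bag-of-words vectors, dropping English stop words as preprocessing
    count_vectorizer = CountVectorizer(stop_words='english')

    # fit_transform learns the vocabulary from the training plots
    # and turns them into count vectors in one step
    count_train = count_vectorizer.fit_transform(X_train.values)

    # transform reuses that same vocabulary on the test plots;
    # tokens never seen during training are simply ignored
    count_test = count_vectorizer.transform(X_test.values)

count_train and count_test are SciPy sparse matrices with one row per plot and one column per word in the learned vocabulary.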

4. Let's practice!

Now it's your turn to create some scikit-learn vectors!
