1. Classifying fake news using supervised learning with NLP
In this video, we'll be learning about supervised machine learning with NLP. Throughout this chapter, you will apply these skills and ideas to classifying fake news.
2. What is supervised learning?
Supervised learning is a form of machine learning where you are given or create training data. This data has a label or outcome which you want the model or algorithm to learn.
One common problem used as a good example of introductory machine learning is Fisher's iris data; we have a few example rows of it here. The data has several features: sepal length and width, and petal length and width.
The label we want to learn and predict is the species. This is a classification problem, so you want to be able to classify or categorize some data based on what you already know or have learned.
Our goal is to use the dataset to make intelligent hypotheses about the species based on the geometric features.
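To make the iris example concrete, here is a minimal sketch of that classification task. It assumes scikit-learn's bundled copy of the iris data and a k-nearest-neighbors classifier; the choice of classifier and the split settings are illustrative assumptions, not something this video prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features: sepal length/width and petal length/width. Label: species.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out part of the data so the model is scored on rows it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Assumed classifier for illustration; any scikit-learn classifier would fit here.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Fraction of held-out flowers whose species was predicted correctly.
accuracy = model.score(X_test, y_test)

# Species names predicted for the first three held-out flowers.
predicted_species = iris.target_names[model.predict(X_test[:3])]
```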
3. Supervised learning with NLP
But instead of using geometric features like those in the iris dataset, we need to use language. To help create features and train a model, we will use scikit-learn, a powerful open-source library.
One of the ways you can create supervised learning data from text is by using bag-of-words models or TF-IDF as features.
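As a rough illustration of what those two kinds of features look like, the sketch below builds them by hand with the standard library. The two sentences, the whitespace tokenizer, and the `tf_idf` helper are made up for this example. It uses the plain textbook TF-IDF formula; scikit-learn's TF-IDF vectorizer applies a smoothed variant, so its numbers would differ.

```python
import math
from collections import Counter

# Made-up example documents, for illustration only.
docs = [
    "the spaceship lands on a distant planet",
    "the detective chases the thief",
]

def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

# Bag of words: one token-count mapping per document, word order discarded.
bags = [Counter(tokenize(doc)) for doc in docs]

def tf_idf(term, bag, all_bags):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    docs_with_term = sum(1 for b in all_bags if term in b)
    if docs_with_term == 0:
        return 0.0
    tf = bag[term] / sum(bag.values())
    idf = math.log(len(all_bags) / docs_with_term)
    return tf * idf

# A word found in every document gets weight 0; a distinctive word scores higher.
weight_the = tf_idf("the", bags[0], bags)
weight_spaceship = tf_idf("spaceship", bags[0], bags)
```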
4. IMDB Movie Dataset
Let's say I have a dataset full of movie plots and genres from the IMDB database, as shown in this chart. I've separated the action and sci-fi movies, removing any movies labeled both action and sci-fi. I want to predict whether a movie is action or sci-fi based on the plot summary.
The dataset we've extracted has categorical features generated using some preprocessing. We can see the plot summary, along with the Sci-Fi and Action columns. The Sci-Fi column is 1 for movies that are sci-fi and 0 for movies that are action. The Action column is the inverse of the Sci-Fi column.
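A small sketch of that layout, assuming pandas and a hypothetical column name `Plot` for the summary text. The rows are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; these are not from the real dataset.
movies = pd.DataFrame({
    "Plot": [
        "A crew travels through a wormhole to find a new home.",
        "A retired agent fights his way through a heist gone wrong.",
        "An android questions whether its memories are real.",
    ],
    "Sci-Fi": [1, 0, 1],
})

# Every movie is exactly one of the two genres, so Action is the inverse of Sci-Fi.
movies["Action"] = 1 - movies["Sci-Fi"]

# Either column can serve as the label; here 1 means sci-fi and 0 means action.
y = movies["Sci-Fi"]
```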
5. Supervised learning steps
In the next video, we'll use scikit-learn to predict a movie's genre from its plot. But first, let's review the supervised learning process as a whole. To begin, we collect and preprocess our data. Then, we determine a label - this is what we want the model to learn, in our case, the genre of the movie.
We then split our data into training and testing datasets, keeping them separate so we can build our model using only the training data. The test data remains unseen so we can test how well our model performs after it is trained. This is an essential part of supervised learning!
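One way to do that split is scikit-learn's `train_test_split`. This is a sketch with invented plot summaries and labels; the test size and random seed are arbitrary choices for illustration.

```python
from sklearn.model_selection import train_test_split

# Made-up plot summaries and labels (1 = sci-fi, 0 = action), for illustration only.
plots = [
    "aliens land in the desert",
    "a cop hunts a bomber downtown",
    "a robot learns to dream",
    "a soldier escapes an ambush",
    "colonists terraform a frozen moon",
    "a driver outruns a cartel",
    "a scientist opens a time portal",
    "a boxer fights for revenge",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Keep 25% of the rows aside as test data; stratify keeps both genres in each part.
X_train, X_test, y_train, y_test = train_test_split(
    plots, labels, test_size=0.25, random_state=53, stratify=labels
)
```

The model should only ever be fit on `X_train` and `y_train`; `X_test` and `y_test` are held back for the final check.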
We also need to extract features from the text to predict the label. We will use a bag-of-words vectorizer built into scikit-learn to do so.
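In scikit-learn, the bag-of-words vectorizer is `CountVectorizer`. Here is a minimal sketch with invented sentences and default settings. Note that the vocabulary is learned from the training text only, and the test text is then mapped onto that same vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up training and test text, for illustration only.
train_plots = [
    "aliens invade the city",
    "the hero chases the villain through the city",
]
test_plots = ["aliens chase the hero"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training text and count words per document.
count_train = vectorizer.fit_transform(train_plots)

# Reuse the learned vocabulary; words never seen in training are ignored.
count_test = vectorizer.transform(test_plots)

# The learned terms, one per feature column.
vocabulary = sorted(vectorizer.vocabulary_)
```

Because "chase" never appears in the training text, it is dropped from the test features, while "chases" stays in the vocabulary as a separate term.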
After the model is trained, we can then test it using the test dataset. There are also other methods to evaluate model performance, such as k-fold cross-validation. You can check out DataCamp's Machine Learning curriculum to learn that and more!
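Putting training, testing, and k-fold cross-validation together might look like the sketch below. The Naive Bayes classifier, the pipeline, and the toy data are assumptions made for illustration; with this little data the scores themselves mean nothing, so only the mechanics matter here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up plot summaries and labels (1 = sci-fi, 0 = action), for illustration only.
plots = [
    "aliens land in the desert",
    "a cop hunts a bomber downtown",
    "a robot learns to dream",
    "a soldier escapes an ambush",
    "colonists terraform a frozen moon",
    "a driver outruns a cartel",
    "a scientist opens a time portal",
    "a boxer fights for revenge",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    plots, labels, test_size=0.25, random_state=53, stratify=labels
)

# Assumed model: bag-of-words counts fed into a multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train on the training data only.
model.fit(X_train, y_train)

# Accuracy on the held-out test data.
test_accuracy = model.score(X_test, y_test)

# 3-fold cross-validation on the training data: one accuracy score per fold.
cv_scores = cross_val_score(model, X_train, y_train, cv=3)
```

Wrapping the vectorizer and classifier in one pipeline means each cross-validation fold learns its vocabulary from its own training portion, so no text from the validation fold leaks into the features.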
6. Let's practice!
Let's review some of the supervised learning steps, like splitting data into training and testing sets, before applying them to our movie plot data.