1. Bringing it all together
Welcome back! In this video we will put together all the steps we have applied in this course on sentiment analysis. I find myself applying these same steps in my work as a data scientist.
2. The Sentiment Analysis problem
We defined sentiment analysis as the process of understanding the opinion of an author about a subject.
Throughout the course we worked with examples of movie and Amazon product reviews,
Twitter airline sentiment data,
and different emotionally charged literary examples.
We went through various steps to transform the text column, which contained the review, into numeric features. We finished our analysis by training a logistic regression model and predicting the sentiment of a new review based on the words in its text. Let's go through these steps in more detail.
3. Exploration of the reviews
We started by exploring the review column in the movie reviews dataset. We found the shortest and longest reviews.
We also plotted word clouds from the movie reviews, which allowed us to quickly see the most frequently mentioned words in positive and negative reviews.
Furthermore, we created features for the length of a review in terms of number of words and number of sentences,
and we learned how to detect the language of a document.
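Putting these exploration steps together, here is a minimal sketch. The `movies` DataFrame and its 'review' column are illustrative stand-ins for the course data, and the sketch assumes the wordcloud and langdetect packages, plus nltk's punkt tokenizer data, are installed.

```python
# A minimal sketch of the exploration steps; the `movies` DataFrame
# below is an illustrative stand-in for the course dataset.
import pandas as pd
from nltk import word_tokenize, sent_tokenize  # requires nltk's punkt data
from wordcloud import WordCloud
from langdetect import detect

movies = pd.DataFrame({'review': ['I loved this film. A true masterpiece.',
                                  'Terrible plot and wooden acting.']})

# Shortest and longest reviews by character count
lengths = movies['review'].str.len()
print(movies.loc[lengths.idxmin(), 'review'])
print(movies.loc[lengths.idxmax(), 'review'])

# Length features: number of words and number of sentences
movies['n_words'] = movies['review'].apply(lambda r: len(word_tokenize(r)))
movies['n_sentences'] = movies['review'].apply(lambda r: len(sent_tokenize(r)))

# Word cloud of the most frequent words across all reviews
cloud = WordCloud(background_color='white').generate(' '.join(movies['review']))
cloud.to_file('reviews_cloud.png')

# Detect the language of each review
movies['language'] = movies['review'].apply(detect)
print(movies)
```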
4. Numeric transformations of sentiment-carrying columns
We continued with numeric transformations of the review features. We transformed the text using a bag-of-words approach and a Tfidf vectorizer.
The bag-of-words approach created features corresponding to the frequency count of each word in a given review or tweet (also called a document in NLP problems).
The term frequency-inverse document frequency (tf-idf) approach is similar to the bag-of-words, but it accounts for how frequently a word occurs in a document relative to the rest of the documents. This way we can capture 'important' words, whereas words that occur frequently across all documents receive a lower tf-idf score.
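For reference, with its default settings (smooth_idf=True and l2 normalization), scikit-learn computes the score as

tfidf(t, d) = tf(t, d) × (ln((1 + n) / (1 + df(t))) + 1)

where tf(t, d) is the count of term t in document d, n is the number of documents, and df(t) is the number of documents containing t; each document's feature vector is then scaled to unit length. A word that appears in every document receives the minimum idf weight of 1, while rarer words are weighted up.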
We used the CountVectorizer and TfidfVectorizer from sklearn.feature_extraction.text to construct each of the vectors.
As a reminder of the syntax, we instantiated the vectorizer and then fit and transformed it on the text column in our data.
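Here is a quick sketch of that syntax, using a couple of made-up documents in place of the real review column:

```python
# A minimal sketch of the vectorizer syntax on made-up documents.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ['I loved this movie', 'I hated this movie']

# Bag-of-words: raw counts of each word per document
count_vect = CountVectorizer()
X_counts = count_vect.fit_transform(docs)

# Tf-idf: counts reweighted by inverse document frequency
tfidf_vect = TfidfVectorizer()
X_tfidf = tfidf_vect.fit_transform(docs)

# Wrap the result in a DataFrame to inspect the features
# (get_feature_names_out requires scikit-learn 1.0+)
X_df = pd.DataFrame(X_tfidf.toarray(),
                    columns=tfidf_vect.get_feature_names_out())
print(X_df)
```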
5. Arguments of the vectorizers
There are many arguments we specified in the vectorizers.
We dealt with stop words: those frequently occurring and non-informative words.
We had a video on n-grams, which allowed us to use different lengths of phrases instead of a single word.
We learned how to limit the size of the vocabulary by setting any of several parameters: max_features (the maximum number of features), and max_df and min_df (which tell the vectorizer to ignore terms with a document frequency higher or lower than the specified threshold, respectively).
We could capture only certain patterns of characters using the token_pattern argument, which takes a regular expression.
Last but not least, we learned about lemmas and stems and practiced lemmatizing and stemming tokens and strings.
We could adjust all these arguments - with the exception of lemmatization and stemming - in both the CountVectorizer and the TfidfVectorizer, as the sketch below shows.
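The following sketch shows how these arguments fit together in one vectorizer call; the specific values are illustrative, not recommendations, and the lemmatizer example assumes nltk's WordNet data has been downloaded.

```python
# Illustrative settings for the main vectorizer arguments discussed above.
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

vect = TfidfVectorizer(
    stop_words='english',            # drop frequent, non-informative words
    ngram_range=(1, 2),              # single words and two-word phrases
    max_features=1000,               # cap the size of the vocabulary
    max_df=0.9,                      # ignore terms in over 90% of documents
    min_df=5,                        # ignore terms in fewer than 5 documents
    token_pattern=r'\b[a-zA-Z]+\b',  # keep only alphabetic tokens
)

# Lemmatizing and stemming happen outside the vectorizers, e.g. with nltk
# (the lemmatizer requires nltk's WordNet data to be downloaded)
print(PorterStemmer().stem('wonderful'))         # 'wonder'
print(WordNetLemmatizer().lemmatize('studies'))  # 'study'
```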
6. Supervised learning model
In the final step, we used a logistic regression to train a classifier predicting the sentiment.
We evaluated the performance of the model using metrics such as accuracy score and a confusion matrix.
Since the goal is for our model to perform well on unseen data, we randomly split the data into a training and testing set; we used the training set to build the model and the test set to evaluate its performance.
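Here is a minimal, self-contained sketch of this final step; the tiny dataset is made up purely for illustration.

```python
# End-to-end modeling step on a tiny, made-up dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

reviews = ['loved it', 'great movie', 'awful plot', 'terrible acting'] * 10
labels = [1, 1, 0, 0] * 10  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(reviews)

# Random split: build the model on the training set, evaluate on the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42)

log_reg = LogisticRegression().fit(X_train, y_train)
y_pred = log_reg.predict(X_test)

print('Accuracy:', accuracy_score(y_test, y_pred))
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
```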
7. Let's practice!
These are all very valuable skills, essential for performing a sentiment analysis task. Let's perform some of these steps in the exercises.