Normalizing reviews
It's time to extract some important words present in your movie review dataset. First, you need to normalize them and then, count their frequency. Part of the normalization implies converting all the words to lowercase, removing special characters and extracting the root of a word so you count the variants as one.
So imagine you have the following reviews: The movie surprises me very much
and Marvel movies always surprise their audience
. If you count the word frequency, you will count surprises
one time and surprise
one time. However, the verb surprise
appears in both and its frequency should be two.
The text of a movie review for only one example has been already saved in the variable movie
. You can use print(movie)
to view the variable in the IPython Shell.
This exercise is part of the course
Regular Expressions in Python
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Convert to lowercase and print the result
movie_lower = ____
print(____)