Match and split
Some of the tweets in your dataset were downloaded incorrectly. Instead of having spaces to separate words, they have strange characters. You decide to use regular expressions to handle this situation. You print some of these tweets to understand which pattern you need to match.
You notice that the sentences are always separated by a special character, followed by a number, the word break, and after that, another special character, e.g. &4break!. The words are always separated by a special character, the word new, and a normal random character, e.g. #newH.
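As a quick check of the two separators described above, the sketch below tests one candidate pattern for each against the examples from the text. It assumes that "special character" means any non-word character (\W), "number" means a single digit (\d), and "normal random character" means a single word character (\w). These mappings are one possible reading, not necessarily the exercise's intended answer.

```python
import re

# Assumed reading of the description (not confirmed by the exercise):
#   special character -> \W, number -> \d, normal character -> \w
regex_sentence = r"\W\dbreak\W"   # candidate sentence separator
regex_words = r"\Wnew\w"          # candidate word separator

# fullmatch succeeds only if the whole example string fits the pattern
sentence_ok = re.fullmatch(regex_sentence, "&4break!") is not None
word_ok = re.fullmatch(regex_words, "#newH") is not None

print(sentence_ok, word_ok)
```

If either check printed False, the character classes would need adjusting before using the patterns on real tweets.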
The variable sentiment_analysis, containing the text of one tweet, as well as the re module were already loaded in your session. You can use print(sentiment_analysis) to view it in the IPython Shell.
This exercise is part of the course Regular Expressions in Python.
Have a go at this exercise by completing this sample code.
# Write a regex to match pattern separating sentences
regex_sentence = ____"____"
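One way the full match-and-split idea could look is sketched below: split the text into sentences on the sentence separator, then replace each word separator with a space. The string sample_tweet is a made-up stand-in for sentiment_analysis, and the two patterns are assumptions based on the description (\W for a special character, \d for a number, \w for a normal character), not the exercise's official solution.

```python
import re

# Hypothetical stand-in for sentiment_analysis; the real tweet is not shown here
sample_tweet = "I#newHlove#newRcoding&4break!Regex#newXis#newQfun"

# Assumed patterns: special char + digit + "break" + special char,
# and special char + "new" + one word character
regex_sentence = r"\W\dbreak\W"
regex_words = r"\Wnew\w"

# Split the tweet into sentences at every sentence separator
sentences = re.split(regex_sentence, sample_tweet)

# Replace every word separator inside each sentence with a single space
cleaned = [re.sub(regex_words, " ", sentence) for sentence in sentences]

print(cleaned)  # ['I love coding', 'Regex is fun']
```

Splitting first and substituting second keeps the sentence boundaries as list elements, while the words inside each sentence end up separated by ordinary spaces.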