Reeepeated characters
Back to your sentiment analysis! Your next task is to replace elongated words that appear in the tweets. We define an elongated word as a word that contains the same character repeated two or more times in a row, e.g. "Awesoooome".
Replacing those words is important because a classifier will treat them as different terms from the source words, which lowers the frequency of the source words.
To find them, you will use capturing groups and reference them back using numbers, e.g. \4.
If you want to find a match for "Awesoooome", you first need to capture "Awes". Then, match "o" and reference the same character back, and then match "me".
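The steps above can be sketched roughly as follows. This is a minimal illustration, not the exercise solution: the sample text and the variable names are hypothetical, and the patterns only show how a capturing group and the backreference \1 work together.

```python
import re

# Hypothetical sample text, not taken from the exercise data
text = "That movie was Awesoooome!"

# Literal version of the steps above: "Awes", a captured "o",
# one or more repeats of that captured character (\1+), then "me"
literal_match = re.search(r"Awes(o)\1+me", text)
print(literal_match.group(0))  # the whole matched word
print(literal_match.group(1))  # the captured character

# A more general idea: capture any word character, then require
# the same character again at least once right after it
run_match = re.search(r"(\w)\1+", text)
print(run_match.group(0))  # the run of repeated characters
```

Note that \1 refers to the text matched by the first group, not to the group's pattern, so (\w)\1 only matches when the same character appears twice in a row.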
The list sentiment_analysis, containing the text of three tweets, and the re module are loaded in your session. You can use print() to view the data in the IPython Shell.
This exercise is part of the course Regular Expressions in Python.
Exercise instructions
- Complete the regular expression to match an elongated word as described.
- Search the elements in the sentiment_analysis list to find out if they contain elongated words. Assign the result to match_elongated.
- Assign the captured group number zero to the variable elongated_word.
- Print the result contained in the variable elongated_word.
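As a rough sketch of the search, group, and print workflow in these instructions, here is the same flow with an unrelated pattern. The tweets, the hashtag regex, and all variable names are assumptions made up for illustration; they are not part of the exercise.

```python
import re

# Hypothetical tweets, not the course's sentiment_analysis list
sample_tweets = ["Loving the #weekend vibes", "Nothing tagged here"]

# Unrelated illustrative pattern: a hashtag followed by word characters
regex_hashtag = r"#\w+"

results = []
for tweet in sample_tweets:
    # re.search returns a match object, or None when nothing matches
    match = re.search(regex_hashtag, tweet)
    if match:
        # Group zero is the entire text matched by the pattern
        found = match.group(0)
        # The named placeholder {word} is filled by the keyword argument
        results.append("Hashtag found: {word}".format(word=found))
    else:
        results.append("No hashtag found")

print(results)
```

Because re.search returns None when there is no match, the if check is needed before calling .group() on the result.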
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Complete the regex to match an elongated word
regex_elongated = r"____(____)____\w*"
for tweet in sentiment_analysis:
    # Find if there is a match in each tweet
    match_elongated = re.____(____, ____)

    if match_elongated:
        # Assign the captured group zero
        elongated_word = match_elongated.____(____)

        # Complete the format method to print the word
        print("Elongated word found: {____}".format(word=____))
    else:
        print("No elongated word found")