Get startedGet started for free

Finding files

You are not satisfied with your tweets dataset cleaning. There are still extra strings that do not provide any sentiment. Among them are strings that refer to text file names.

You also find a way to detect them:

  • They appear at the start of the string.
  • They always start with a sequence of 2 or 3 upper or lowercase vowels (a e i o u).
  • They always finish with the txt ending.

You are not sure if you should remove them directly. So you write a script to find and store them in a separate dataset.

You write down some metacharacters to help you: ^ anchor to beginning, . any character.

The variable sentiment_analysis containing the text of two tweets as well as the re module are already loaded in your session. You can use print() to view it in the IPython Shell.

This exercise is part of the course

Regular Expressions in Python

View Course

Exercise instructions

  • Write a regex that matches the pattern of the text file names, e.g. aemyfile.txt.
  • Find all matches of the regex in the elements of sentiment_analysis. Print out the result.
  • Replace all matches of the regex with an empty string "". Print out the result.

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Write a regex to match text file name
regex = ____"____[____]{____}____txt"

for text in sentiment_analysis:
	# Find all matches of the regex
	print(re.____(____, ____))
    
	# Replace all matches with empty string
	print(re.____(____, ____, ____))
Edit and Run Code