Practicing regular expressions: re.split() and re.findall()
Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at my_string first by printing it in the IPython Shell, to determine how you might best match the different steps.
Note: It's important to prefix your regex patterns with r so that they are interpreted the way you intend. Otherwise, you may run into problems with escape sequences in strings. For example, "\n" in Python denotes a new line, but with the r prefix, r"\n" is the raw string made up of the character "\" followed by the character "n", not a new line.
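To make the note above concrete, here is a minimal sketch of how the r prefix changes what Python stores and what the regex engine sees. The sample text and the variable names are assumptions chosen for illustration; they do not come from the exercise.

```python
import re

# Without the r prefix, Python turns "\n" into a single newline character.
normal = "\n"
# With the r prefix, the string keeps both characters: a backslash and an "n".
raw = r"\n"
print(len(normal), len(raw))  # 1 2

# The same issue affects regex escapes. In a normal string, "\b" is a
# backspace character; in a raw string, r"\b" reaches the regex engine
# as the word-boundary escape.
text = "a cat sat"  # hypothetical sample text
without_prefix = re.findall("\bcat\b", text)
with_prefix = re.findall(r"\bcat\b", text)
print(without_prefix)  # []
print(with_prefix)     # ['cat']
```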
The regular expression module re has already been imported for you.
Remember from the video that the syntax for the regex library is to always pass the pattern first and the string second.
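As a quick illustration of that argument order, the sketch below calls re.split() and re.findall() with the pattern first and the string second. The text and both patterns are made up for this example and are not part of the exercise.

```python
import re

text = "one, two, three"  # hypothetical sample text

# Pattern first, string second: split on a comma followed by optional whitespace.
parts = re.split(r",\s*", text)
print(parts)  # ['one', 'two', 'three']

# Same argument order for findall: collect every run of word characters.
words = re.findall(r"\w+", text)
print(words)  # ['one', 'two', 'three']
```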
This exercise is part of the course
Introduction to Natural Language Processing in Python
Exercise instructions
- Split my_string on each sentence ending. To do this:
  - Write a pattern called sentence_endings to match sentence endings (.?!).
  - Use re.split() to split my_string on the pattern and print the result.
- Find and print all capitalized words in my_string by writing a pattern called capitalized_words and using re.findall().
  - Remember the [a-z] pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
- Write a pattern called spaces to match one or more spaces ("\s+") and then use re.split() to split my_string on this pattern, keeping all punctuation intact. Print the result.
- Find all digits in my_string by writing a pattern called digits ("\d+") and using re.findall(). Print the result.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[___]"
# Split my_string on sentence endings and print the result
print(re.____(____, ____))
# Find all capitalized words in my_string and print the result
capitalized_words = r"[___]\w+"
print(re.____(____, ____))
# Split my_string on spaces and print the result
spaces = r"___"
print(re.____(____, ____))
# Find all digits in my_string and print the result
digits = r"___"
print(re.____(____, ____))
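For comparison, here is one way the four patterns might behave on a short string. This is only a sketch: the sample string is hypothetical (it is not the course's my_string), and the character classes are one reasonable choice consistent with the instructions, not necessarily the expected answer.

```python
import re

# Hypothetical stand-in for my_string; the real exercise string is not shown here.
sample = "Hello there! Is it 4 pm? I have 12 Apples."

# Split on sentence endings; note the empty string after the final period.
sentence_endings = r"[.?!]"
sentences = re.split(sentence_endings, sample)
print(sentences)  # ['Hello there', ' Is it 4 pm', ' I have 12 Apples', '']

# An uppercase letter followed by one or more word characters.
# The lone "I" is skipped because \w+ needs at least one more character.
capitalized_words = r"[A-Z]\w+"
capitals = re.findall(capitalized_words, sample)
print(capitals)  # ['Hello', 'Is', 'Apples']

# Split on runs of whitespace; punctuation stays attached to the words.
spaces = r"\s+"
tokens = re.split(spaces, sample)
print(tokens)  # ['Hello', 'there!', 'Is', 'it', '4', 'pm?', 'I', 'have', '12', 'Apples.']

# Find runs of one or more digits.
digits = r"\d+"
numbers = re.findall(digits, sample)
print(numbers)  # ['4', '12']
```

The trailing empty string from the first split and the skipped "I" in the second result are worth noticing, since both follow directly from how the patterns are written.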