Practicing regular expressions: re.split() and re.findall()
Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at my_string first by printing it in the IPython Shell, to determine how you might best match the different steps.
Note: It's important to prefix your regex patterns with r to ensure that they are interpreted the way you intend. Otherwise, you may run into problems with escape sequences in strings. For example, "\n" in Python indicates a new line, but with the r prefix, r"\n" is interpreted as a raw string - that is, the character "\" followed by the character "n" - and not as a new line.
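To make the difference concrete, here is a minimal sketch comparing a regular string with a raw string, and showing one case where the prefix changes what a pattern matches. The sample text and variable names are made up for illustration and are not part of the exercise.

```python
import re

regular = "\n"   # one character: a newline
raw = r"\n"      # two characters: a backslash, then "n"

print(len(regular))  # 1
print(len(raw))      # 2

# Why it matters for regex: in a regular string "\b" is a backspace
# character, while in a raw string it reaches the regex engine as the
# word-boundary token.
text = "a cat sat"  # hypothetical sample text
with_raw = re.findall(r"\bcat\b", text)
without_raw = re.findall("\bcat\b", text)

print(with_raw)     # ['cat']
print(without_raw)  # []
```

The second pattern silently matches nothing because it looks for literal backspace characters around "cat".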
The regular expression module re has already been imported for you.
Remember from the video that the syntax for the regex library is to always pass the pattern first, and then the string second.
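As a quick illustration of that argument order, the sketch below calls re.split() and re.findall() with the pattern first and the string second. The text variable and both patterns are hypothetical examples, not the ones the exercise asks for.

```python
import re

text = "one, two, three"  # hypothetical sample text

# Pattern first, string second.
split_result = re.split(r",\s*", text)
found_result = re.findall(r"\w+", text)

print(split_result)  # ['one', 'two', 'three']
print(found_result)  # ['one', 'two', 'three']
```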
This is a part of the course “Introduction to Natural Language Processing in Python”
Exercise instructions
- Split my_string on each sentence ending. To do this:
  - Write a pattern called sentence_endings to match sentence endings (.?!).
  - Use re.split() to split my_string on the pattern and print the result.
- Find and print all capitalized words in my_string by writing a pattern called capitalized_words and using re.findall().
  - Remember the [a-z] pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
- Write a pattern called spaces to match one or more spaces ("\s+") and then use re.split() to split my_string on this pattern, keeping all punctuation intact. Print the result.
- Find all digits in my_string by writing a pattern called digits ("\d+") and using re.findall(). Print the result.
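The four steps above can be sketched on a made-up string. This is one possible approach, not necessarily the reference solution: sample_text is a hypothetical stand-in for my_string, whose actual contents are not shown here, so the output for the real exercise will differ.

```python
import re

# Hypothetical stand-in for my_string.
sample_text = "Hello there! Is it 4 pm? I have 12 apples."

# Split on sentence endings: any one of ".", "?" or "!".
sentence_endings = r"[.?!]"
sentences = re.split(sentence_endings, sample_text)
print(sentences)
# ['Hello there', ' Is it 4 pm', ' I have 12 apples', '']

# Find capitalized words: an uppercase letter followed by word characters.
capitalized_words = r"[A-Z]\w+"
capitalized = re.findall(capitalized_words, sample_text)
print(capitalized)  # ['Hello', 'Is']

# Split on runs of whitespace, which leaves punctuation attached to words.
spaces = r"\s+"
tokens = re.split(spaces, sample_text)
print(tokens)
# ['Hello', 'there!', 'Is', 'it', '4', 'pm?', 'I', 'have', '12', 'apples.']

# Find runs of one or more digits.
digits = r"\d+"
numbers = re.findall(digits, sample_text)
print(numbers)  # ['4', '12']
```

Two details are worth noticing. re.split() returns a trailing empty string when the text ends with a delimiter, and [A-Z]\w+ skips the one-letter word "I" because \w+ requires at least one more character after the capital.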
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[___]"
# Split my_string on sentence endings and print the result
print(re.____(____, ____))
# Find all capitalized words in my_string and print the result
capitalized_words = r"[___]\w+"
print(re.____(____, ____))
# Split my_string on spaces and print the result
spaces = r"___"
print(re.____(____, ____))
# Find all digits in my_string and print the result
digits = r"___"
print(re.____(____, ____))