Practicing regular expressions: re.split() and re.findall()
Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. Take a look at my_string first by printing it in the IPython Shell, to determine how you might best match the different steps.
Note: It's important to prefix your regex patterns with r to ensure that your patterns are interpreted in the way you want them to. Otherwise, you may encounter problems to do with escape sequences in strings. For example, "\n" in Python is used to indicate a new line, but if you use the r prefix, it will be interpreted as the raw string r"\n" - that is, the character "\" followed by the character "n" - and not as a new line.
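As a small illustration of the note above, the sketch below compares a normal string with a raw string. The sample text "the cat sat" and the variable names are made up for this example and are not part of the exercise. The "\b" case is included because that is one escape where leaving out the r prefix silently changes what the pattern matches.

```python
import re

normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by "n"
print(len(normal), len(raw))

# In a normal string, "\b" is a backspace character, not a regex word boundary,
# so the first pattern looks for literal backspaces around "cat".
without_prefix = re.findall("\bcat\b", "the cat sat")
with_prefix = re.findall(r"\bcat\b", "the cat sat")
print(without_prefix)
print(with_prefix)
```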
The regular expression module re has already been imported for you.
Remember from the video that the syntax for the regex library is to always pass the pattern first, and then the string second.
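A minimal sketch of that argument order, using a made-up string named text (an assumption for illustration, not the exercise's my_string):

```python
import re

text = "Let's write RegEx!"  # hypothetical example string

# Pattern first, string second.
words = re.findall(r"\w+", text)
chunks = re.split(r"\s+", text)

print(words)
print(chunks)
```

Both re.findall() and re.split() take the pattern as the first argument and the string to search as the second.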
This is a part of the course “Introduction to Natural Language Processing in Python”
Exercise instructions
- Split my_string on each sentence ending. To do this:
  - Write a pattern called sentence_endings to match sentence endings (.?!).
  - Use re.split() to split my_string on the pattern and print the result.
- Find and print all capitalized words in my_string by writing a pattern called capitalized_words and using re.findall().
  - Remember the [a-z] pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
- Write a pattern called spaces to match one or more spaces ("\s+") and then use re.split() to split my_string on this pattern, keeping all punctuation intact. Print the result.
- Find all digits in my_string by writing a pattern called digits ("\d+") and using re.findall(). Print the result.
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[___]"
# Split my_string on sentence endings and print the result
print(re.____(____, ____))
# Find all capitalized words in my_string and print the result
capitalized_words = r"[___]\w+"
print(re.____(____, ____))
# Split my_string on spaces and print the result
spaces = r"___"
print(re.____(____, ____))
# Find all digits in my_string and print the result
digits = r"___"
print(re.____(____, ____))
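For reference, here is one possible way to fill in the blanks, run against a stand-in string. The variable sample and its contents are assumptions made for this sketch; the actual my_string in the exercise environment may differ, so the printed results will differ too.

```python
import re

# Hypothetical stand-in for my_string (not the exercise's actual data)
sample = "Let's write RegEx! Won't that be fun? I found 19 words in 4 sentences."

# Character class matching any one sentence-ending mark
sentence_endings = r"[.?!]"
sentences = re.split(sentence_endings, sample)
print(sentences)

# An uppercase letter followed by one or more word characters
capitalized_words = r"[A-Z]\w+"
capitals = re.findall(capitalized_words, sample)
print(capitals)

# One or more whitespace characters
spaces = r"\s+"
tokens = re.split(spaces, sample)
print(tokens)

# One or more consecutive digits
digits = r"\d+"
numbers = re.findall(digits, sample)
print(numbers)
```

Two details are worth noticing in this sketch: re.split() leaves an empty string at the end of the list when the text ends with a sentence-ending mark, and the capitalized_words pattern skips the single-letter word "I" because \w+ requires at least one character after the capital.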
This exercise is part of the course
Introduction to Natural Language Processing in Python
Learn fundamental natural language processing techniques using Python and how to apply them to extract insights from real-world text data.
This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and more difficult tokenization you might find.