Practicing regular expressions: re.split() and re.findall()

Now you'll get a chance to write some regular expressions to match digits, strings and non-alphanumeric characters. First, print my_string in the IPython Shell and look at its contents to decide how best to write a pattern for each step.

Note: Prefix your regex patterns with r so that they are interpreted the way you intend. Otherwise, you may run into problems with escape sequences in strings. For example, "\n" in Python denotes a newline, but with the r prefix, r"\n" is a raw string made up of the character "\" followed by the character "n", not a newline.
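
To see the difference concretely, the short sketch below compares a regular string literal with a raw one. The variable names and sample strings are illustrative assumptions, not part of the exercise.

```python
import re

normal = "\n"   # one character: a newline
raw = r"\n"     # two characters: a backslash followed by "n"

print(len(normal))  # 1
print(len(raw))     # 2

# The regex engine itself understands the two-character sequence \n,
# so the raw pattern still matches a real newline in the text.
newline_hits = re.findall(r"\n", "line one\nline two")
print(newline_hits)  # ['\n']

# The r prefix matters most for sequences such as \b. Without it, Python
# turns \b into a backspace character before re ever sees the pattern.
with_prefix = re.findall(r"\bcat\b", "cat concat cat")
without_prefix = re.findall("\bcat\b", "cat concat cat")
print(with_prefix)     # ['cat', 'cat']
print(without_prefix)  # []
```

The last two lines show why the habit is worth keeping: the unprefixed pattern fails silently rather than raising an error.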

The regular expression module re has already been imported for you.

Remember from the video that the functions in the re library always take the pattern first and the string second.
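
As a quick illustration of the argument order, here is a minimal sketch using a made-up string (text is a hypothetical stand-in, not the exercise's my_string).

```python
import re

text = "Call 555 now, or 911 later."  # hypothetical sample string

# Pattern first, string second.
numbers = re.findall(r"\d+", text)
parts = re.split(r",\s*", text)
print(numbers)  # ['555', '911']
print(parts)    # ['Call 555 now', 'or 911 later.']

# Swapping the arguments does not raise an error. The text is simply
# treated as the pattern and searched for inside the string "\d+".
swapped = re.findall(text, r"\d+")
print(swapped)  # []
```

Because a swapped call runs without complaint, an unexpectedly empty result is a good cue to check the argument order.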

This is a part of the course

“Introduction to Natural Language Processing in Python”


Exercise instructions

  • Split my_string on each sentence ending. To do this:
    • Write a pattern called sentence_endings to match sentence endings (.?!).
    • Use re.split() to split my_string on the pattern and print the result.
  • Find and print all capitalized words in my_string by writing a pattern called capitalized_words and using re.findall().
    • Remember the [a-z] pattern shown in the video to match lowercase groups? Modify that pattern appropriately in order to match uppercase groups.
  • Write a pattern called spaces to match one or more whitespace characters (r"\s+") and then use re.split() to split my_string on this pattern, keeping all punctuation intact. Print the result.
  • Find all digits in my_string by writing a pattern called digits (r"\d+") and using re.findall(). Print the result.
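
The character classes mentioned in the instructions can be explored on their own before filling in the exercise. The sketch below assumes a small made-up string (sample is hypothetical) and shows how characters behave inside square brackets.

```python
import re

sample = "Wait... what? Yes!"  # hypothetical sample string

# Inside [...], the characters '.', '?' and '!' are literal, so no
# backslash escaping is needed.
endings = re.findall(r"[.?!]", sample)
print(endings)  # ['.', '.', '.', '?', '!']

# Outside a class, an unescaped '.' matches any character except a newline.
any_chars = re.findall(r".", sample)
print(len(any_chars) == len(sample))  # True

# [A-Z] is the uppercase counterpart of [a-z]. Followed by \w+, it needs
# at least one more word character, so a lone capital letter is skipped.
capitals = re.findall(r"[A-Z]\w+", sample)
print(capitals)  # ['Wait', 'Yes']
```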

Hands-on interactive exercise

Have a go at this exercise by completing this sample code.

# Write a pattern to match sentence endings: sentence_endings
sentence_endings = r"[___]"

# Split my_string on sentence endings and print the result
print(re.____(____, ____))

# Find all capitalized words in my_string and print the result
capitalized_words = r"[___]\w+"
print(re.____(____, ____))

# Split my_string on spaces and print the result
spaces = r"___"
print(re.____(____, ____))

# Find all digits in my_string and print the result
digits = r"___"
print(re.____(____, ____))
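
For reference, here is one possible way the blanks could be filled in. It is a sketch, not the official solution: the my_string defined here is a hypothetical stand-in, since the exercise preloads its own my_string with different contents.

```python
import re

# Hypothetical stand-in for the preloaded my_string.
my_string = "Regex is fun! Is it hard? Not with 2 or 3 examples."

# Match any single sentence-ending character.
sentence_endings = r"[.?!]"
sentences = re.split(sentence_endings, my_string)
print(sentences)
# ['Regex is fun', ' Is it hard', ' Not with 2 or 3 examples', '']

# An uppercase letter followed by one or more word characters.
capitalized_words = r"[A-Z]\w+"
capitalized = re.findall(capitalized_words, my_string)
print(capitalized)  # ['Regex', 'Is', 'Not']

# Split on runs of whitespace; punctuation stays attached to the words.
spaces = r"\s+"
tokens = re.split(spaces, my_string)
print(tokens)

# One or more consecutive digits.
digits = r"\d+"
numbers = re.findall(digits, my_string)
print(numbers)  # ['2', '3']
```

Note the empty string at the end of the first result: re.split() returns whatever follows the last match, even when nothing is left.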


Learn fundamental natural language processing techniques using Python and how to apply them to extract insights from real-world text data.

This chapter will introduce some basic NLP concepts, such as word tokenization and regular expressions to help parse text. You'll also learn how to handle non-English text and more difficult tokenization you might find.
