Non-ASCII tokenization
In this exercise, you'll practice advanced tokenization by tokenizing some non-ASCII text. You'll be using German with emoji!
Here, you have access to a string called `german_text`, which has been printed for you in the Shell. Notice the emoji and the German characters!
The following functions have been pre-imported from `nltk.tokenize`: `regexp_tokenize` and `word_tokenize`.
Unicode ranges for emoji are:
('\U0001F300'-'\U0001F5FF'), ('\U0001F600'-'\U0001F64F'), ('\U0001F680'-'\U0001F6FF'), ('\u2600'-'\u26FF'), and ('\u2700'-'\u27BF').
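To see how these ranges combine in practice, here is a minimal sketch (not part of the exercise) that folds them into a single regular-expression character class; the test sentence is our own, not the exercise's `german_text`:

import re

# Sketch: one character class covering all the emoji ranges listed above.
emoji_pattern = re.compile(
    "[\U0001F300-\U0001F5FF"   # miscellaneous symbols & pictographs
    "\U0001F600-\U0001F64F"    # emoticons
    "\U0001F680-\U0001F6FF"    # transport & map symbols
    "\u2600-\u26FF"            # miscellaneous symbols
    "\u2700-\u27BF]"           # dingbats
)
print(emoji_pattern.findall("Pizza essen? 🍕 Oder mit dem Taxi fahren? 🚕"))
# ['🍕', '🚕']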
This exercise is part of the course Introduction to Natural Language Processing in Python.
Exercise instructions
- Tokenize all the words in `german_text` using `word_tokenize()`, and print the result.
- Tokenize only the capital words in `german_text`.
  - First, write a pattern called `capital_words` to match only capital words. Make sure to check for the German `Ü`! To use this character in the exercise, copy and paste it from these instructions.
  - Then, tokenize it using `regexp_tokenize()`.
- Tokenize only the emoji in `german_text`. The pattern using the unicode ranges for emoji given in the assignment text has been written for you. Your job is to use `regexp_tokenize()` to tokenize the emoji. A short sketch of the capital-words step follows this list.
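The capital-words step is the one that trips people up: a plain `[A-Z]` character class misses German capitals such as `Ü`. Here is a minimal sketch of the idea, using a sample sentence of our own rather than the actual `german_text`:

from nltk.tokenize import regexp_tokenize

# Sketch: match words that start with an ASCII capital or the German Ü.
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize("Wann gehen wir Pizza essen? Und fährst du mit Über?", capital_words))
# ['Wann', 'Pizza', 'Und', 'Über']

Note that `\w` is Unicode-aware in Python 3, so the lowercase umlauts in the rest of each word match without any extra work.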
Hands-on interactive exercise
Have a go at this exercise by completing this sample code.
# Tokenize and print all words in german_text
all_words = ____(____)
print(all_words)
# Tokenize and print only capital words
capital_words = r"[____]\w+"
print(____(____, ____))
# Tokenize and print only emoji
emoji = r"[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(____(____, ____))
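For reference, here is one possible completion of the scaffold. The `german_text` string below is an illustrative stand-in, since the exercise pre-loads its own:

from nltk.tokenize import regexp_tokenize, word_tokenize

# Illustrative stand-in; the exercise pre-loads its own german_text.
german_text = "Wann gehen wir Pizza essen? 🍕 Und fährst du mit Über? 🚕"

# Tokenize and print all words in german_text
all_words = word_tokenize(german_text)
print(all_words)

# Tokenize and print only capital words
capital_words = r"[A-ZÜ]\w+"
print(regexp_tokenize(german_text, capital_words))

# Tokenize and print only emoji
emoji = r"[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF\u2600-\u26FF\u2700-\u27BF]"
print(regexp_tokenize(german_text, emoji))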