Choosing a tokenizer

Given the following string, which of the below patterns is the best tokenizer? If possible, you want to retain sentence punctuation as separate tokens, but have '#1' remain a single token.

my_string = "SOLDIER #1: Found them? In Mercea? The coconut's tropical!"

The string is available in your workspace as my_string, and the patterns have been pre-loaded as pattern1, pattern2, pattern3, and pattern4, respectively.

Additionally, regexp_tokenize has been imported from nltk.tokenize. You can use regexp_tokenize(string, pattern) with my_string and one of the patterns as arguments to experiment for yourself and see which is the best tokenizer.

Bu egzersiz

Introduction to Natural Language Processing in Python

kursunun bir parçasıdır

Kursu Görüntüle

Uygulamalı interaktif egzersiz

İnteraktif egzersizlerimizden biriyle teoriyi pratiğe dökün

Egzersizi başlat