As we talked about in the video, tokenization is the process of chopping up a character sequence into pieces called tokens.
How do we determine what constitutes a token? Often, tokens are separated by whitespace, but we can specify other delimiters as well. For example, if we decide to tokenize on punctuation, then any punctuation mark is treated like whitespace. How we tokenize the text in our DataFrame can affect the statistics we use in our model.
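To make the difference concrete, here is a minimal sketch using only Python's standard library. The example string is made up for illustration (it is not taken from the budget DataFrame), and the regular expression is just one reasonable way to treat punctuation as a delimiter, not necessarily the pattern used elsewhere in the course.

```python
import re

# Hypothetical cell content, invented for this illustration.
text = "Supplies/Materials - Part-Time Staff"

# Tokenize on whitespace only: punctuation stays attached to, or stands as, a token.
whitespace_tokens = text.split()

# Tokenize on whitespace and punctuation: keep only runs of letters and digits,
# so every other character acts as a delimiter.
alnum_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(whitespace_tokens)  # ['Supplies/Materials', '-', 'Part-Time', 'Staff']
print(alnum_tokens)       # ['Supplies', 'Materials', 'Part', 'Time', 'Staff']
```

The same string yields four tokens under the first scheme and five under the second, and the lone hyphen disappears as a token once punctuation is treated as a delimiter.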
A particular cell in our budget DataFrame may contain the string
Title I - Disadvantaged Children/Targeted Assistance
The number of n-grams generated from this text depends on whether or not we tokenize on punctuation, as you'll see in the following exercise.
How many tokens (1-grams) are in the string
Title I - Disadvantaged Children/Targeted Assistance
if we tokenize on whitespace and punctuation?