1. Introduction to Text Encoding
So far in this course you have dealt with data that, while sometimes messy, has been generally columnar in nature. When you are faced with text data this is often not going to be the case.
2. Standardizing your text
Data that is not in a predefined form is called unstructured data, and free text data is a good example of this. Before you can leverage text data in a machine learning model you must first transform it into a series of columns of numbers, or vectors. There are many different ways to do this, and we will go through the most common ones. In this chapter, you will be working with the United States inaugural address dataset, which contains the text of each President's inaugural speech. George Washington's is shown here. It is clear that free text like this is not in tabular form.
3. Dataset
Before any text analytics can be performed, you must ensure that the text data is in a format that can be used. The speeches have been loaded as a pandas DataFrame called speech_df, with the body of the text in the 'text' column. You can see this by looking at the top five rows using the head() method as shown.
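As a rough sketch of what this looks like, the snippet below builds a tiny stand-in for speech_df and prints its first rows. Only the 'text' column name comes from the course data; the 'Name' column and the second row's text are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for speech_df: the real dataset has one row per
# inaugural address. The "Name" column and the second text are placeholders.
speech_df = pd.DataFrame({
    "Name": ["George Washington", "Placeholder President"],
    "text": [
        "Fellow-Citizens of the Senate and of the House of Representatives:",
        "A short, made-up speech: it exists only for this example.",
    ],
})

# head() returns the first five rows by default (here there are only two).
print(speech_df.head())
```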
4. Removing unwanted characters
Most bodies of text will have non-letter characters, such as punctuation, that will need to be removed before analysis.
This can be achieved by using the replace() method along with the str accessor. We have used this in an earlier chapter, but instead of specifying the exact characters you wish to replace, this time you will use patterns called regular expressions.
Now unless you go through the text of all speeches, it is difficult to determine which non-letter characters are present in the data. So the easiest way to deal with this is to specify a pattern which matches all non-letter characters as shown here.
The pattern [a-zA-Z], that is, lowercase a to lowercase z followed by uppercase A to uppercase Z inside square brackets, matches any single letter. Placing a caret at the start, inside the square brackets, negates this, so [^a-zA-Z] matches any non-letter character.
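To see what the two patterns match, here is a minimal sketch using Python's built-in re module on a short sample string. The sample string and variable names are chosen only for illustration.

```python
import re

sample = "Fellow-Citizens:"

# [a-zA-Z] matches each letter, one character at a time.
letters = re.findall("[a-zA-Z]", sample)

# [^a-zA-Z] is the negated set: every character that is not a letter.
non_letters = re.findall("[^a-zA-Z]", sample)

print(len(letters))   # number of letters in the sample
print(non_letters)    # the hyphen and the colon
```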
So we use the replace() method with this pattern to replace every non-letter character with a space, as shown here.
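A minimal sketch of this step, assuming a one-row stand-in for speech_df and a hypothetical output column named text_clean. Note that regex=True is passed explicitly: in recent pandas versions str.replace() treats the pattern as a literal string unless you ask for a regular expression.

```python
import pandas as pd

# One-row stand-in for speech_df (the real DataFrame has one row per speech).
speech_df = pd.DataFrame({
    "text": ["Fellow-Citizens of the Senate and of the House of Representatives:"]
})

# Replace every non-letter character with a single space.
# "text_clean" is a hypothetical column name used for this example.
speech_df["text_clean"] = speech_df["text"].str.replace(
    "[^a-zA-Z]", " ", regex=True
)

print(speech_df["text"].iloc[0])
print(speech_df["text_clean"].iloc[0])
```

Because each unwanted character becomes a space rather than being deleted, words that were joined by punctuation, such as "Fellow-Citizens", stay separate.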
5. Removing unwanted characters
Here you can see the text of the first speech before and after processing. Notice that the hyphen and the colon are missing.
6. Standardize the case
Once all unwanted characters have been removed you will want to standardize the remaining characters in your text so that they are all lower case. This will ensure that the same word with and without capitalization will not be counted as separate words. You can use the lower() method to achieve this as shown here.
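A short sketch of the lowercasing step, again on a made-up one-row DataFrame with a hypothetical text_clean column. On a pandas column, lower() is reached through the str accessor.

```python
import pandas as pd

# Stand-in for the cleaned text produced by the previous step.
speech_df = pd.DataFrame({"text_clean": ["Fellow Citizens of the Senate "]})

# Convert every character to lowercase so "Senate" and "senate" count as one word.
speech_df["text_clean"] = speech_df["text_clean"].str.lower()

print(speech_df["text_clean"].iloc[0])
```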
7. Length of text
Later in this chapter you will work through the creation of features based on the content of different texts, but often there is value in the fundamental characteristics of a passage, such as its length. Using the len() method along with the str accessor, you can calculate the number of characters in each speech.
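As a sketch, assuming two short made-up texts and a hypothetical char_cnt column, the character count can be computed like this:

```python
import pandas as pd

# Two short placeholder texts standing in for cleaned speeches.
speech_df = pd.DataFrame({
    "text_clean": ["fellow citizens of the senate", "a short made up speech"]
})

# str.len() counts every character in each string, spaces included.
# "char_cnt" is a hypothetical column name.
speech_df["char_cnt"] = speech_df["text_clean"].str.len()

print(speech_df)
```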
8. Word counts
Along with the pure character length of the speech, you may want to know how many words are contained in it. The most straightforward way to do this is to split the speech on any white-space, and then count how many words there are after the split.
First, you will need to split the text with the split() method as shown here and
9. Word counts
then chain the len() method to count the total number of words in each speech.
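The two steps can be sketched as follows, on made-up text and with hypothetical names words and word_cnt. With no arguments, str.split() splits on any run of white-space and returns a list per row; chaining str.len() then counts the items in each list.

```python
import pandas as pd

# Two short placeholder texts standing in for cleaned speeches.
speech_df = pd.DataFrame({
    "text_clean": ["fellow citizens of the senate", "a short made up speech"]
})

# Step 1: split each text into a list of words.
words = speech_df["text_clean"].str.split()

# Step 2: str.len() on a column of lists gives the number of items in each list.
speech_df["word_cnt"] = words.str.len()

print(words.iloc[0])
print(speech_df["word_cnt"].tolist())
```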
10. Average length of word
Finally, one other stat you can calculate is the average word length. Since you already have the total number of characters and the word count, you can simply divide them to obtain the average word length.
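A sketch of that division, using hypothetical column names char_cnt, word_cnt and avg_word_length on made-up text. Keep in mind that the character count here includes spaces, so the result slightly overstates the true average word length.

```python
import pandas as pd

# Two short placeholder texts standing in for cleaned speeches.
speech_df = pd.DataFrame({
    "text_clean": ["fellow citizens of the senate", "a short made up speech"]
})

# Character and word counts, computed as in the earlier steps.
speech_df["char_cnt"] = speech_df["text_clean"].str.len()
speech_df["word_cnt"] = speech_df["text_clean"].str.split().str.len()

# Element-wise division of the two columns gives characters per word.
speech_df["avg_word_length"] = speech_df["char_cnt"] / speech_df["word_cnt"]

print(speech_df[["char_cnt", "word_cnt", "avg_word_length"]])
```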
11. Let's practice!
Now it's time for you to practice what you have learned about how to manipulate text.