
Engineering text features

1. Engineering text features

Though text data is a little more complicated to work with, there's a lot of useful feature engineering we can do with it.

2. Extraction

One method is to extract the pieces of information that you need: maybe part of a string, or a number, and transform it into a feature. We can also transform the text itself into features, for use with natural language processing methods or prediction tasks. Let's learn how to extract data from text fields. We're going to use regular expressions, which are patterns that can be used to extract information from text data. You should already be familiar with regular expressions, but for the purposes of this course, we're going to focus only on extracting numbers from strings. To use Python's rich regular expression functionality, we'll first need to import the re module. Here we have a string, and we want to extract the temperature value from it, so we can model using the numerical data. We'll need to use a pattern to extract this float, so let's break down the pattern in re-dot-search. "backslash d" means that we want to grab digits, and the "plus" means we want to grab as many as possible, so if there are two next to each other, we want both (like the 75). "backslash period" means we want to grab the decimal point, and then there's another "backslash d plus" at the end to grab the digits on the right-hand side of the decimal. re-dot-search then searches for a string matching the pattern, which we can extract with the group method.
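The pattern described above can be sketched as follows; the example string is a hypothetical one, not taken from the course data:

```python
import re

# Hypothetical string containing a temperature reading
temp_string = "temperature today: 75.6 degrees"

# \d+ grabs one or more digits, \. grabs the literal decimal point,
# and the trailing \d+ grabs the digits after the decimal
match = re.search(r"\d+\.\d+", temp_string)

# group() returns the matched text, which we convert to a float for modeling
temperature = float(match.group())
print(temperature)
```

Note that the backslashes escape `d` and `.`: a bare `d` would match the letter d, and a bare `.` would match any character.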

3. Vectorizing text

If we're working with text, we might want to model it in some way. Maybe we want to use document text in a classification task, such as classifying emails as spam or not. In order to do that, we'll need to vectorize the text and transform it into a numerical input that scikit-learn can use. We're going to create a tf/idf vector. tf/idf is a way of vectorizing text that reflects how important a word is in a document beyond how frequently it occurs. It stands for term frequency inverse document frequency and places more weight on words that are ultimately more significant in the entire corpus.
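As a minimal sketch of the idea, here is the textbook tf/idf formula computed by hand on a tiny made-up corpus; scikit-learn's actual implementation uses a smoothed variant of idf and normalizes each document vector, so its numbers will differ slightly:

```python
import math

# A tiny hypothetical corpus of tokenized documents
docs = [
    ["the", "weather", "is", "hot"],
    ["the", "weather", "is", "cold"],
    ["hot", "soup"],
]

def tf_idf(term, doc, corpus):
    # Term frequency: how often the term appears in this document
    tf = doc.count(term) / len(doc)
    # Document frequency: how many documents contain the term at all
    df = sum(1 for d in corpus if term in d)
    # Inverse document frequency down-weights terms common to many documents
    idf = math.log(len(corpus) / df)
    return tf * idf

# "soup" appears in only one document, so it scores higher than
# "hot", which appears in two
print(tf_idf("soup", docs[2], docs))
print(tf_idf("hot", docs[2], docs))
```

A word that occurs in every document gets an idf of log(1) = 0, so its tf/idf weight is zero no matter how often it occurs, which is exactly the "beyond how frequently it occurs" behavior described above.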

4. Vectorizing text

We can create tf/idf vectors in scikit-learn using TfidfVectorizer. Here we have a collection of text. In order to vectorize it, we can simply pass the column of text we want to vectorize into the fit_transform method, which is called on the TfidfVectorizer.
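A minimal sketch of that workflow, using a few invented example documents in place of the course's text column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical column of documents (this could be a pandas Series of text)
documents = [
    "free money offer now",
    "meeting scheduled for monday",
    "claim your free prize now",
]

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and returns a sparse
# document-term matrix of tf/idf weights
tfidf_matrix = vectorizer.fit_transform(documents)

# One row per document, one column per vocabulary term
print(tfidf_matrix.shape)
```

The result is a sparse matrix, which scikit-learn estimators accept directly, so there's usually no need to densify it.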

5. Text classification

Now that we have a vectorized version of the text, we can use it for classification. We'll use a Naive Bayes classifier, which is based on Bayes' theorem of conditional probability, seen here, and performs well on text classification tasks. Naive Bayes treats each feature as independent from the others, which can be a naive assumption, but works out quite well on text data. Because each feature is treated independently, this classifier works well on high-dimensional data and is very efficient.
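Putting the two steps together, here is a minimal sketch using scikit-learn's MultinomialNB, a Naive Bayes variant commonly used for text; the tiny labeled dataset is invented for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical labeled training data: 1 = spam, 0 = not spam
texts = [
    "win free money now",
    "claim your free prize",
    "meeting at noon tomorrow",
    "project update attached",
]
labels = [1, 1, 0, 0]

# Vectorize the training text
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Fit the Naive Bayes classifier on the tf/idf features
clf = MultinomialNB()
clf.fit(X, labels)

# New text must be vectorized with transform (not fit_transform)
# so it uses the vocabulary learned during training
new_X = vectorizer.transform(["free prize money"])
prediction = clf.predict(new_X)
```

Because the model only multiplies per-feature probabilities, fitting and prediction stay fast even when the vocabulary produces thousands of columns.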

6. Let's practice!

Now it's your turn to extract features from text.